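DSPy Best Practices for Automatic Prompt Generation
Contents
- What is DSPy
- Data preparation
- Loading the data
- Defining the evaluation metric
- Configuring the LLM and optimizer
- Running the optimization
- The optimized prompt
- Summary
What is DSPy
The introduction below is quoted from the article "DSPy 入门: 再见提示,你好编程" (Intro to DSPy: Goodbye Prompting, Hello Programming); see the original for details.
DSPy ("Declarative Self-improving Language Programs (in Python)", pronounced "dee-es-pie") is a framework for "programming with foundation models" developed by NLP researchers at Stanford University. It emphasizes programming over prompting, and moves the construction of LM-based pipelines away from manipulating prompts and toward programming. In doing so, it aims to address the fragility of LM-based applications.
DSPy GitHub
Install DSPy first by following the instructions in the official GitHub repository.
Data preparation
I used Doubao to generate a small dataset: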
[{"query": "腾讯2021年和2022年分别盈利多少?","answer": ["腾讯2021年盈利多少?","腾讯2022年盈利多少?"]},{"query": "苹果公司和微软公司哪个市值更高?","answer": ["苹果公司的市值是多少?","微软公司的市值是多少?"]},{"query": "为什么电动汽车越来越受欢迎?","answer": ["电动汽车的市场份额如何变化?","电动汽车有哪些优势?","政策和基础设施如何支持电动汽车发展?"]},{"query": "什么是机器学习?","answer": ["机器学习的定义是什么?","机器学习有哪些主要类型?","机器学习的应用领域有哪些?","机器学习与人工智能的关系是什么?"]},{"query": "如何制定一个有效的健身计划?","answer": ["制定健身计划前需要评估哪些身体指标?","健身目标有哪些类型?","如何根据目标选择适合的训练方式?","如何安排训练频率和强度?","如何制定合理的饮食计划?"]},{"query": "华为、苹果和三星在智能手机市场的份额分别是多少?","answer": ["华为在智能手机市场的份额是多少?","苹果在智能手机市场的份额是多少?","三星在智能手机市场的份额是多少?"]},{"query": "2023年全球票房最高的电影是哪部,它的导演是谁,票房收入是多少?","answer": ["2023年全球票房排名如何?","2023年全球票房最高的电影是哪部?","这部电影的导演是谁?","这部电影的全球票房收入是多少?"]},{"query": "世界上最高的三座山峰分别是什么,它们的海拔高度是多少,位于哪个国家或地区?","answer": ["世界上海拔最高的山峰有哪些?","世界上最高的三座山峰分别是什么?","这三座山峰的海拔高度分别是多少?","这三座山峰分别位于哪个国家或地区?"]},{"query": "如果一个正方形的边长增加20%,那么它的面积会增加多少百分比?","answer": ["正方形的面积公式是什么?","边长增加20%后新的边长是多少?","新的面积是多少?","面积增加的百分比如何计算?"]},{"query": "请列出所有位于北半球、人口超过1000万且属于发达国家的城市。","answer": ["如何确定城市是否位于北半球?","哪些城市的人口超过1000万?","如何定义发达国家?","如何筛选同时满足这三个条件的城市?"]},{"query": "从经济学和环境科学的角度分析,推广电动汽车对减少碳排放和促进经济发展有哪些影响?","answer": ["电动汽车如何减少碳排放?","推广电动汽车的经济成本和效益是什么?","电动汽车产业对经济发展有哪些促进作用?","如何平衡环保目标和经济发展需求?"]},{"query": "企业在决定是否推出新产品时,应该考虑哪些因素?","answer": ["市场需求和竞争情况如何?","新产品的研发成本和生产难度如何?","新产品的营销策略和渠道有哪些?","新产品的潜在风险和回报如何?","企业的资源和能力是否支持新产品的推出?"]},{"query": "分析某城市过去十年的人口变化情况,并预测未来五年的人口趋势。","answer": ["如何获取该城市过去十年的人口数据?","人口变化的主要原因是什么?","如何分析人口增长或减少的趋势?","有哪些因素可能影响未来人口趋势?","如何建立人口预测模型?"]},{"query": "解释为什么植物需要阳光进行光合作用。","answer": ["什么是光合作用?","阳光在光合作用中的作用是什么?","光合作用的化学过程是什么?","植物如何捕获和利用光能?","光合作用对植物生长和生存的重要性是什么?"]},{"query": "为什么越来越多的年轻人选择独居生活?","answer": ["独居生活的定义是什么?","独居生活在年轻人中的流行趋势如何?","经济因素如何影响年轻人的居住选择?","社会观念和价值观的变化如何影响独居现象?","独居生活对个人和社会有哪些影响?"]},{"query": "如何设计一个吸引人的用户界面?","answer": ["用户界面设计的基本原则有哪些?","如何了解目标用户的需求和偏好?","如何选择合适的颜色、字体和布局?","如何设计直观的导航和交互元素?","如何测试和优化用户界面设计?"]}
]
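Because the records were generated by an LLM, it can be worth sanity-checking their shape before building examples from them. The following is a minimal sketch, assuming each record should have a non-empty string `query` and a non-empty list of strings `answer`, as in the data above. `validate_records` is a hypothetical helper of my own, not part of DSPy.

```python
import json


def validate_records(records):
    """Return a list of human-readable problems; an empty list means the data looks fine."""
    problems = []
    for idx, rec in enumerate(records):
        query = rec.get("query")
        answer = rec.get("answer")
        if not isinstance(query, str) or not query.strip():
            problems.append(f"record {idx}: 'query' must be a non-empty string")
        if not isinstance(answer, list) or not answer:
            problems.append(f"record {idx}: 'answer' must be a non-empty list")
        elif not all(isinstance(a, str) and a.strip() for a in answer):
            problems.append(f"record {idx}: every item in 'answer' must be a non-empty string")
    return problems


# Two illustrative records: the first is well-formed, the second is not.
sample = json.loads(
    '[{"query": "什么是机器学习?", "answer": ["机器学习的定义是什么?"]},'
    ' {"query": "", "answer": []}]'
)
problems = validate_records(sample)
for line in problems:
    print(line)
```

Running the same check on the full file before training would catch empty or malformed entries early.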
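Loading the data
import json

import dspy
import numpy as np

# Load the generated dataset from test.json.
with open("test.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Wrap each record in a dspy.Example and mark "question" as the only input field.
all_data = []
for item in data:
    all_data.append(
        dspy.Example(question=item["query"], sub_questions=item["answer"]).with_inputs("question")
    )

np.random.shuffle(all_data)
train = all_data
val = train
test = train
The dataset is too small to split, so train, val, and test all point to the same (training) set here.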
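With more data, a held-out split would be preferable to reusing one set three times. Below is a minimal sketch, assuming a 60/20/20 split. `split_examples`, its fractions, and the seed are illustrative choices of mine, not from the setup above.

```python
import random


def split_examples(examples, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle deterministically, then cut into train / val / test slices."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains
    return train, val, test


# Stand-in data: integers instead of dspy.Example objects.
train_part, val_part, test_part = split_examples(range(100))
print(len(train_part), len(val_part), len(test_part))
```

With the `all_data` list built earlier, the call would be `train, val, test = split_examples(all_data)`.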
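Defining the evaluation metric
# Define the metric used for evaluation and optimization
def TaskMetric(example: dspy.Example, prediction: dspy.Prediction, trace=None):
    # evaluate_decomposition expects (gold, predicted), in that order
    result = evaluate_decomposition(example.sub_questions, prediction.sub_questions)
    return result["accuracy"] > 0.8
In text generation, quality cannot simply be measured by exact-match accuracy. For complex question decomposition, this means that a generated sub-question whose wording differs from the reference answer but whose meaning is the same should still count as correct. The converse also holds: similar wording with a different meaning should count as wrong.
For simplicity, I use edit distance to measure the similarity between the generated sub-questions and the reference answers. (A more rigorous approach would use semantic similarity.)
That is, when the normalized edit distance between sentences A and B is at or below a threshold (for example 0.2), the generated sub-question is counted as correct.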
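The thresholding rule can be illustrated on two pairs of sub-questions. This is a minimal sketch using a plain dynamic-programming Levenshtein distance. The names `levenshtein` and `is_match` are my own for illustration, and dividing by the longer length is an assumed normalization.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic row-by-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def is_match(pred: str, gold: str, threshold: float = 0.2) -> bool:
    """True when the length-normalized edit distance is at or below the threshold."""
    longest = max(len(pred), len(gold))
    if longest == 0:
        return True
    return levenshtein(pred, gold) / longest <= threshold


# Different meaning, one character apart: counted as a match.
year_pair = is_match("腾讯2021年盈利多少?", "腾讯2022年盈利多少?")
# Same meaning, different wording: not counted as a match.
type_pair = is_match("机器学习的主要分类是什么?", "机器学习有哪些主要类型?")
print(year_pair, type_pair)
```

The two pairs show the limitation mentioned above: edit distance rewards surface overlap, so it can accept a sub-question about the wrong year and reject a valid paraphrase.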
The full implementation:
import numpy as np
from typing import List

# Try to import optional third-party libraries
try:
    from Levenshtein import distance as levenshtein_distance
    LEVENSHTEIN_AVAILABLE = True
except ImportError:
    LEVENSHTEIN_AVAILABLE = False
    print("Warning: python-Levenshtein is not installed; falling back to a slower pure-Python implementation")

    def levenshtein_distance(s1: str, s2: str) -> int:
        """Pure-Python Levenshtein distance."""
        if len(s1) < len(s2):
            return levenshtein_distance(s2, s1)
        if len(s2) == 0:
            return len(s1)
        previous_row = range(len(s2) + 1)
        for i, c1 in enumerate(s1):
            current_row = [i + 1]
            for j, c2 in enumerate(s2):
                insertions = previous_row[j + 1] + 1
                deletions = current_row[j] + 1
                substitutions = previous_row[j] + (c1 != c2)
                current_row.append(min(insertions, deletions, substitutions))
            previous_row = current_row
        return previous_row[-1]

try:
    from rapidfuzz.distance import Levenshtein
    RAPIDFUZZ_AVAILABLE = True
except ImportError:
    RAPIDFUZZ_AVAILABLE = False
    print("Warning: rapidfuzz is not installed; using python-Levenshtein or the pure-Python implementation")


def normalized_levenshtein(s1: str, s2: str) -> float:
    """Normalized Levenshtein distance in [0, 1]; 0 means identical."""
    if not s1 and not s2:
        return 0.0
    if RAPIDFUZZ_AVAILABLE:
        # Use rapidfuzz's normalized distance
        return Levenshtein.normalized_distance(s1, s2)
    else:
        # Use python-Levenshtein or the pure-Python implementation
        max_len = max(len(s1), len(s2))
        return levenshtein_distance(s1, s2) / max_len


def edit_distance_accuracy(
    gold_subquestions: List[str],
    predicted_subquestions: List[str],
    threshold: float = 0.2,
    allow_mapping: bool = True,
) -> float:
    """Accuracy of a question decomposition, based on edit distance.

    Args:
        gold_subquestions: reference sub-questions
        predicted_subquestions: sub-questions predicted by the system
        threshold: normalized edit-distance threshold in [0, 1]; pairs at or below it count as a match
        allow_mapping: whether to allow order-independent matching

    Returns:
        Accuracy in [0, 1]
    """
    if not gold_subquestions and not predicted_subquestions:
        return 1.0
    if not gold_subquestions or not predicted_subquestions:
        return 0.0

    # Normalize the question text
    gold_subquestions = [q.strip().lower() for q in gold_subquestions]
    predicted_subquestions = [q.strip().lower() for q in predicted_subquestions]

    # Method 1: order-independent matching.
    # Note: this is a greedy closest-pair-first match, not the Hungarian algorithm.
    if allow_mapping and len(gold_subquestions) > 0 and len(predicted_subquestions) > 0:
        # Build the distance matrix
        distance_matrix = np.zeros((len(gold_subquestions), len(predicted_subquestions)))
        for i, gold_q in enumerate(gold_subquestions):
            for j, pred_q in enumerate(predicted_subquestions):
                distance_matrix[i, j] = normalized_levenshtein(gold_q, pred_q)

        matched_pairs = []
        used_gold = set()
        used_pred = set()

        # Sort all candidate pairs by distance, ascending
        all_pairs = []
        for i in range(len(gold_subquestions)):
            for j in range(len(predicted_subquestions)):
                all_pairs.append((distance_matrix[i, j], i, j))
        all_pairs.sort()

        # Greedily accept pairs
        for dist, i, j in all_pairs:
            if i not in used_gold and j not in used_pred:
                if dist <= threshold:
                    matched_pairs.append((i, j, dist))
                    used_gold.add(i)
                    used_pred.add(j)

        return len(matched_pairs) / max(len(gold_subquestions), len(predicted_subquestions))

    # Method 2: position-by-position matching
    else:
        # Only compare up to the shorter list's length
        min_len = min(len(gold_subquestions), len(predicted_subquestions))
        match_count = 0
        for i in range(min_len):
            dist = normalized_levenshtein(gold_subquestions[i], predicted_subquestions[i])
            if dist <= threshold:
                match_count += 1
        return match_count / max(len(gold_subquestions), len(predicted_subquestions))


def calculate_edit_distance_f1(
    gold_subquestions: List[str],
    predicted_subquestions: List[str],
    threshold: float = 0.2,
    allow_mapping: bool = True,
) -> float:
    """F1 score of a question decomposition, based on edit distance.

    Args:
        gold_subquestions: reference sub-questions
        predicted_subquestions: sub-questions predicted by the system
        threshold: normalized edit-distance threshold in [0, 1]; pairs at or below it count as a match
        allow_mapping: whether to allow order-independent matching

    Returns:
        F1 score in [0, 1]
    """
    if not gold_subquestions and not predicted_subquestions:
        return 1.0
    if not gold_subquestions or not predicted_subquestions:
        return 0.0

    # Normalize the question text
    gold_subquestions = [q.strip().lower() for q in gold_subquestions]
    predicted_subquestions = [q.strip().lower() for q in predicted_subquestions]

    # Method 1: order-independent (greedy) matching
    if allow_mapping:
        distance_matrix = np.zeros((len(gold_subquestions), len(predicted_subquestions)))
        for i, gold_q in enumerate(gold_subquestions):
            for j, pred_q in enumerate(predicted_subquestions):
                distance_matrix[i, j] = normalized_levenshtein(gold_q, pred_q)

        matched_pairs = []
        used_gold = set()
        used_pred = set()
        all_pairs = []
        for i in range(len(gold_subquestions)):
            for j in range(len(predicted_subquestions)):
                all_pairs.append((distance_matrix[i, j], i, j))
        all_pairs.sort()  # ascending by distance

        for dist, i, j in all_pairs:
            if i not in used_gold and j not in used_pred:
                if dist <= threshold:
                    matched_pairs.append((i, j, dist))
                    used_gold.add(i)
                    used_pred.add(j)

        # Precision and recall
        true_positives = len(matched_pairs)
        precision = true_positives / len(predicted_subquestions) if predicted_subquestions else 0
        recall = true_positives / len(gold_subquestions) if gold_subquestions else 0

        # F1 score
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        return f1

    # Method 2: position-by-position matching
    else:
        match_count = 0
        for i in range(min(len(gold_subquestions), len(predicted_subquestions))):
            dist = normalized_levenshtein(gold_subquestions[i], predicted_subquestions[i])
            if dist <= threshold:
                match_count += 1

        # Precision and recall
        precision = match_count / len(predicted_subquestions) if predicted_subquestions else 0
        recall = match_count / len(gold_subquestions) if gold_subquestions else 0

        # F1 score
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        return f1


def evaluate_decomposition(
    gold_subquestions: List[str],
    predicted_subquestions: List[str],
    threshold: float = 0.2,
    allow_mapping: bool = True,
) -> dict:
    """Combined evaluation metrics for a question decomposition.

    Args:
        gold_subquestions: reference sub-questions
        predicted_subquestions: sub-questions predicted by the system
        threshold: normalized edit-distance threshold in [0, 1]; pairs at or below it count as a match
        allow_mapping: whether to allow order-independent matching

    Returns:
        Dictionary with the individual metrics
    """
    accuracy = edit_distance_accuracy(
        gold_subquestions, predicted_subquestions, threshold=threshold, allow_mapping=allow_mapping
    )
    f1 = calculate_edit_distance_f1(
        gold_subquestions, predicted_subquestions, threshold=threshold, allow_mapping=allow_mapping
    )

    # Precision and recall
    if allow_mapping:
        # Same matching logic as in the F1 computation
        gold_processed = [q.strip().lower() for q in gold_subquestions]
        pred_processed = [q.strip().lower() for q in predicted_subquestions]

        distance_matrix = np.zeros((len(gold_processed), len(pred_processed)))
        for i, gold_q in enumerate(gold_processed):
            for j, pred_q in enumerate(pred_processed):
                distance_matrix[i, j] = normalized_levenshtein(gold_q, pred_q)

        matched_pairs = []
        used_gold = set()
        used_pred = set()
        all_pairs = []
        for i in range(len(gold_processed)):
            for j in range(len(pred_processed)):
                all_pairs.append((distance_matrix[i, j], i, j))
        all_pairs.sort()  # ascending by distance

        for dist, i, j in all_pairs:
            if i not in used_gold and j not in used_pred:
                if dist <= threshold:
                    matched_pairs.append((i, j, dist))
                    used_gold.add(i)
                    used_pred.add(j)

        true_positives = len(matched_pairs)
        precision = true_positives / len(pred_processed) if pred_processed else 0
        recall = true_positives / len(gold_processed) if gold_processed else 0
    else:
        # Position-by-position matching
        match_count = 0
        min_len = min(len(gold_subquestions), len(predicted_subquestions))
        for i in range(min_len):
            dist = normalized_levenshtein(gold_subquestions[i], predicted_subquestions[i])
            if dist <= threshold:
                match_count += 1
        precision = match_count / len(predicted_subquestions) if predicted_subquestions else 0
        recall = match_count / len(gold_subquestions) if gold_subquestions else 0

    # Difference in the number of sub-questions
    count_diff = abs(len(gold_subquestions) - len(predicted_subquestions))

    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "subquestion_count_diff": count_diff,
        "gold_count": len(gold_subquestions),
        "predicted_count": len(predicted_subquestions),
    }


# Example usage
if __name__ == "__main__":
    # Sample data
    gold_subquestions = [
        "什么是机器学习?",
        "机器学习有哪些主要类型?",
        "如何评估机器学习模型的性能?",
    ]
    predicted_subquestions = [
        "什么是机器学习?",
        "机器学习的主要分类是什么?",  # differs slightly from the reference
        "如何评估一个机器学习模型的好坏?",  # differs slightly from the reference
        "机器学习有哪些应用场景?",  # extra sub-question
    ]

    # Evaluate
    results = evaluate_decomposition(
        gold_subquestions,
        predicted_subquestions,
        threshold=0.2,  # adjustable threshold
        allow_mapping=True,  # allow order-independent matching
    )

    # Print the results
    print("Question decomposition evaluation:")
    print(f"Accuracy: {results['accuracy']:.4f}")
    print(f"F1 score: {results['f1']:.4f}")
    print(f"Precision: {results['precision']:.4f}")
    print(f"Recall: {results['recall']:.4f}")
    print(f"Sub-question count difference: {results['subquestion_count_diff']}")
    print(f"Gold sub-question count: {results['gold_count']}")
    print(f"Predicted sub-question count: {results['predicted_count']}")
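The matching above accepts pairs greedily, closest first. Greedy selection does not always maximize the number of pairs that fall within the threshold. The sketch below contrasts it with a maximum bipartite matching found by augmenting paths (Kuhn's algorithm). It is an alternative I am swapping in for illustration, not the method used in this article, and `greedy_matches` and `max_matches` are hypothetical names.

```python
from typing import List


def greedy_matches(dist: List[List[float]], threshold: float) -> int:
    """Count matches by accepting the closest unused pair first."""
    pairs = sorted((d, i, j) for i, row in enumerate(dist) for j, d in enumerate(row))
    used_gold, used_pred, count = set(), set(), 0
    for d, i, j in pairs:
        if d <= threshold and i not in used_gold and j not in used_pred:
            used_gold.add(i)
            used_pred.add(j)
            count += 1
    return count


def max_matches(dist: List[List[float]], threshold: float) -> int:
    """Maximum number of one-to-one (gold, predicted) pairs within the threshold."""
    n_gold = len(dist)
    n_pred = len(dist[0]) if n_gold else 0
    pred_owner = [-1] * n_pred  # pred_owner[j] = gold index currently matched to prediction j

    def try_assign(i: int, visited: List[bool]) -> bool:
        for j in range(n_pred):
            if dist[i][j] <= threshold and not visited[j]:
                visited[j] = True
                # Take j if it is free, or if its current owner can move elsewhere.
                if pred_owner[j] == -1 or try_assign(pred_owner[j], visited):
                    pred_owner[j] = i
                    return True
        return False

    return sum(try_assign(i, [False] * n_pred) for i in range(n_gold))


# Rows are gold sub-questions, columns are predictions (made-up distances).
example_dist = [
    [0.05, 0.10],
    [0.15, 0.90],
]
print(greedy_matches(example_dist, 0.2), max_matches(example_dist, 0.2))
```

In this made-up matrix, greedy takes the 0.05 pair first and leaves the second gold question with no prediction within the threshold, while the maximum matching pairs both. On lists of three to five sub-questions the two approaches usually agree, so whether the extra code is worth it is a judgment call.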
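Configuring the LLM and optimizer
# Configure the task LM
model = "teacher model (usually larger and stronger)"
api_key = "your API key"
api_base = "your model's base URL"

local_model = "the model you actually use"
local_api_key = "your API key"
local_api_base = "your model's base URL"

lm = dspy.LM(
    f"openai/{local_model}",
    api_key=local_api_key,
    api_base=local_api_base,
    temperature=0,
    cache=False,
)
dspy.configure(lm=lm)


class Task(dspy.Signature):
    """将给定的问题拆分成子问题。"""
    # Instruction above: "Split the given question into sub-questions."

    question: str = dspy.InputField(description="待分解的原始问题")  # the original question to decompose
    sub_questions: list[str] = dspy.OutputField(description="分解后的子问题列表")  # the resulting sub-questions


task_cot = dspy.ChainOfThought(Task)

# NOTE: evaluate_correctness is not defined in the original post. It is assumed
# here to be a dspy.Evaluate instance built from the metric defined earlier.
evaluate_correctness = dspy.Evaluate(devset=test, metric=TaskMetric, num_threads=4, display_progress=True)

# Baseline score on the test set
metric = evaluate_correctness(task_cot, devset=test)
print("before optimize", metric)
dspy.inspect_history(n=1)

# Optimize the program
# prompt/teacher LLM
big_lm = dspy.LM(
    f"openai/{model}",
    api_key=api_key,
    api_base=api_base,
    temperature=0.8,
    cache=False,
)

optimizer = dspy.MIPROv2(
    metric=TaskMetric,
    auto="heavy",  # optimization budget
    num_threads=4,
    prompt_model=big_lm,  # LLM that writes the candidate instructions
    init_temperature=0.8,  # temperature for the prompt LLM
    teacher_settings=dict(lm=big_lm),  # LLM that generates the bootstrapped examples
    seed=42,
    verbose=False,  # whether to print the optimization process
)
The more clearly the descriptions in Task are written, the better the generated prompt tends to be.
The optimizer parameters are described in detail in the MIPROv2 documentation.
Paper: Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs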
Running the optimization
optimized_classifier = optimizer.compile(
    task_cot,
    trainset=train,
    valset=val,
    max_bootstrapped_demos=3,
    max_labeled_demos=3,
    requires_permission_to_run=False,
    minibatch=False,
    minibatch_size=len(val),
)

# Save the optimized program
optimized_classifier.save("optimized.json")

# Re-run the evaluation on the test set
metric = evaluate_correctness(optimized_classifier, devset=test)
print("after optimize", metric)
dspy.inspect_history(n=1)
The optimized prompt
If everything goes well, the final prompt is printed in the terminal:
System message:

Your input fields are:
1. `question` (str): 待分解的原始问题

Your output fields are:
1. `reasoning` (str)
2. `sub_questions` (list[str]): 分解后的子问题列表

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## sub_questions ## ]]
{sub_questions} # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "items": {"type": "string"}}

[[ ## completed ## ]]

In adhering to this structure, your objective is: To effectively analyze and address complex questions, decompose the given question into a set of sub-questions that isolate the distinct components required for a comprehensive response. Begin by understanding the main question through a systematic step-by-step reasoning process, identifying each element that necessitates further exploration. Generate a list of sub-questions that precisely target these individual aspects, ensuring clarity and focus in the information retrieval process. This approach aids in gathering detailed and well-informed answers, facilitating thorough investigation across diverse domains such as business, science, and more. Provide a structured and clear breakdown of the question, encouraging critical thinking and detailed analysis, potentially in multilingual contexts.

User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## question ## ]]
如果一个正方形的边长增加20%,那么它的面积会增加多少百分比?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## sub_questions ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.

Assistant message:

[[ ## reasoning ## ]]
Not supplied for this particular example.

[[ ## sub_questions ## ]]
["正方形的面积公式是什么?", "边长增加20%后新的边长是多少?", "新的面积是多少?", "面积增加的百分比如何计算?"]

[[ ## completed ## ]]

User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## question ## ]]
华为、苹果和三星在智能手机市场的份额分别是多少?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## sub_questions ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.

Assistant message:

[[ ## reasoning ## ]]
Not supplied for this particular example.

[[ ## sub_questions ## ]]
["华为在智能手机市场的份额是多少?", "苹果在智能手机市场的份额是多少?", "三星在智能手机市场的份额是多少?"]

[[ ## completed ## ]]

User message:

[[ ## question ## ]]
2023年全球票房最高的电影是哪部,它的导演是谁,票房收入是多少?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## sub_questions ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.

Assistant message:

[[ ## reasoning ## ]]
To determine the highest-grossing film of 2023 globally, we need to identify the film's title and then find additional details such as the director and the specific box office revenue. This involves examining global box office reports and industry publications that track movie earnings.

[[ ## sub_questions ## ]]
["2023年全球票房最高的电影是哪部?", "这部电影的导演是谁?", "这部电影的票房收入是多少?"]

[[ ## completed ## ]]

User message:

[[ ## question ## ]]
如何制定一个有效的健身计划?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## sub_questions ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.

Response:

[[ ## reasoning ## ]]
To create an effective fitness plan, we need to consider various factors such as individual goals, current fitness level, available resources, and time constraints. This involves understanding the components of a well-rounded fitness plan, including exercise types, frequency, intensity, and recovery. Additionally, it is important to address nutritional needs and consistency in the plan.

[[ ## sub_questions ## ]]
["制定健身计划前需要明确哪些个人目标?", "如何评估当前的健身水平?", "有效的健身计划应包含哪些类型的锻炼?", "如何确定锻炼的频率和强度?", "如何确保健身计划的可持续性和一致性?", "健身计划中是否需要考虑营养和饮食?"]

[[ ## completed ## ]]
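Summary
Overall, working with DSPy feels a bit like writing model training code in Torch, with the DSPy metric playing a role similar to the loss function in deep learning. As a result, how the metric is written has a considerable effect on the final result.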
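One way to make a metric behave more like a loss is to return a graded score instead of a hard pass/fail. DSPy metrics take a `trace` argument, and a common convention is to return a strict boolean when `trace` is not None (while demonstrations are being bootstrapped) and a numeric score otherwise. The sketch below follows that convention under stated assumptions: `overlap_score` is a hypothetical stand-in scorer based on exact matches, and `SimpleNamespace` objects stand in for `dspy.Example` and `dspy.Prediction`.

```python
from types import SimpleNamespace


def overlap_score(gold, predicted):
    """Hypothetical stand-in scorer: share of exactly matching sub-questions (0 to 1)."""
    if not gold and not predicted:
        return 1.0
    hits = len(set(gold) & set(predicted))
    return hits / max(len(gold), len(predicted))


def graded_metric(example, prediction, trace=None):
    score = overlap_score(example.sub_questions, prediction.sub_questions)
    if trace is not None:
        # Bootstrapping demos: act as a strict pass/fail filter.
        return score > 0.8
    # Plain evaluation: return the graded score.
    return score


# Stand-in objects; real code would receive dspy.Example / dspy.Prediction.
example = SimpleNamespace(sub_questions=["a", "b", "c", "d"])
prediction = SimpleNamespace(sub_questions=["a", "b", "c", "x"])

eval_score = graded_metric(example, prediction)
bootstrap_ok = graded_metric(example, prediction, trace=[])
print(eval_score, bootstrap_ok)
```

A graded score lets an optimizer distinguish a prediction that gets three of four sub-questions right from one that gets none, which a single 0.8 cutoff cannot do.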
DSPy can do far more than what is shown here. If you are interested, keep exploring it in more depth.