Evaluating agents with LangSmith
Previous articles:
Evaluating multimodal large models: https://blog.csdn.net/qq_43814415/article/details/140994190?fromshare=blogdetail&sharetype=blogdetail&sharerId=140994190&sharerefer=PC&sharesource=qq_43814415&sharefrom=from_link
Evaluating RAG systems: https://blog.csdn.net/qq_43814415/article/details/141229613?fromshare=blogdetail&sharetype=blogdetail&sharerId=141229613&sharerefer=PC&sharesource=qq_43814415&sharefrom=from_link
Evaluating LLMs: https://blog.csdn.net/qq_43814415/article/details/140823902?fromshare=blogdetail&sharetype=blogdetail&sharerId=140823902&sharerefer=PC&sharesource=qq_43814415&sharefrom=from_link
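This article covers agent evaluation. Agents are very popular right now, and LangSmith is one of the most widely used frameworks for evaluating them.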
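I. What to evaluate
An agent is built from many heterogeneous components, so it is usually evaluated at the following three levels.
1. Final response
Treat the agent as a black box.
Input: the user message and, optionally, a list of tools.
Output: the agent's final response.
Evaluator: an LLM judge or a human reviewer.
Metrics: strongly task-dependent.
Pros: simple and direct; tied directly to business outcomes.
Cons: slow to run; cannot pinpoint where a failure occurred; metrics are hard to define.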
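2. Single step
Evaluate one key step of the agent in isolation.
Input: the arguments passed to the step under evaluation.
Output: the output of that step.
Evaluator: a regex or another parsing routine.
Metric: accuracy.
Pros: can pinpoint the node that is likely failing; fast and cheap to run.
Cons: does not reflect business impact; datasets are hard to build.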
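A single-step check of this kind can be sketched in plain Python. This is only an illustration under stated assumptions, not LangSmith's API: the step under test (`route_step`), the regex, and the four labeled examples are hypothetical stand-ins for one node of a real agent.

```python
import re

# Hypothetical step under test: a router that maps a user query to a
# tool-call string. In a real agent this would be one node or one LLM call.
def route_step(query: str) -> str:
    if "weather" in query.lower():
        return 'search(query="' + query + '")'
    return "respond_directly()"

# Regex-based evaluator: pull the tool name out of the step output.
TOOL_CALL = re.compile(r"^(\w+)\(")

def step_is_correct(step_output: str, expected_tool: str) -> bool:
    match = TOOL_CALL.match(step_output)
    return bool(match and match.group(1) == expected_tool)

def accuracy(examples: list[tuple[str, str]]) -> float:
    # Fraction of examples where the step picked the expected tool.
    hits = sum(step_is_correct(route_step(q), tool) for q, tool in examples)
    return hits / len(examples)

# Hypothetical labeled data: (step input, expected tool name).
examples = [
    ("what's the weather in sf", "search"),
    ("hello there", "respond_directly"),
    ("weather in tangier?", "search"),
    ("will it rain tomorrow", "search"),  # the toy router misses this one
]

print(accuracy(examples))
```

Because only one step runs, the check is fast and a failure points directly at that step, but the score says nothing about whether the final answer would satisfy the user.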
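3. Agent trajectory
Evaluate every step the agent takes.
Input: the end-to-end agent input.
Output: the list of tool calls, i.e. one complete agent trajectory.
Evaluator: a custom function written for trajectories.
Metrics: accuracy, number of errors.
Pros: monitors the agent's full behavior.
Cons: time-consuming; evaluation criteria are hard to define.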
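One possible custom trajectory function is sketched below. It assumes a trajectory is represented as a list of tool names and that "errors" means differing positions plus missing or extra steps; both are illustrative choices, and the function name `trajectory_report` is hypothetical.

```python
def trajectory_report(actual: list[str], expected: list[str]) -> dict:
    # Exact match: same tools in the same order.
    exact = actual == expected
    # Error count: positions that differ, plus any missing or extra steps.
    mismatches = sum(a != e for a, e in zip(actual, expected))
    errors = mismatches + abs(len(actual) - len(expected))
    return {"exact_match": exact, "errors": errors}

# The agent searched twice before summarizing; the reference searches once.
report = trajectory_report(
    actual=["search", "search", "summarize"],
    expected=["search", "summarize"],
)
print(report)
```

Strict positional comparison is the simplest criterion. Looser ones (same set of tools in any order, or an expected subsequence) are equally valid, which is part of why trajectory criteria are hard to settle on.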
II. Evaluating a graph
This part evaluates a graph built with LangGraph.
1. End-to-end evaluation
① Create the dataset
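from langsmith import Client

questions = [
    "what's the weather in sf",
    "whats the weather in san fran",
    "whats the weather in tangier",
]
answers = [
    "It's 60 degrees and foggy.",
    "It's 60 degrees and foggy.",
    "It's 90 degrees and sunny.",
]

ls_client = Client()
# Create the dataset first, then attach the examples to it.
# The reference key is "answer" so that it matches what the evaluator reads.
dataset = ls_client.create_dataset("weather agent")
ls_client.create_examples(
    dataset_id=dataset.id,
    inputs=[{"question": q} for q in questions],
    outputs=[{"answer": a} for a in answers],
)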
② Create the evaluator
The evaluator receives the agent's output and the reference answer.
from langchain.chat_models import init_chat_model

judge_llm = init_chat_model("gpt-4o")

async def correct(outputs: dict, reference_outputs: dict) -> bool:
    instructions = (
        "Given an actual answer and an expected answer, determine whether"
        " the actual answer contains all of the information in the"
        " expected answer. Respond with 'CORRECT' if the actual answer"
        " does contain all of the expected information and 'INCORRECT'"
        " otherwise. Do not include anything else in your response."
    )
    # Our graph outputs a State dictionary, which in this case means
    # we'll have a 'messages' key and the final message should
    # be our actual answer.
    actual_answer = outputs["messages"][-1].content
    expected_answer = reference_outputs["answer"]
    user_msg = (
        f"ACTUAL ANSWER: {actual_answer}"
        f"\n\nEXPECTED ANSWER: {expected_answer}"
    )
    response = await judge_llm.ainvoke(
        [
            {"role": "system", "content": instructions},
            {"role": "user", "content": user_msg},
        ]
    )
    # Strip whitespace so a trailing newline does not turn a pass into a fail.
    return response.content.strip().upper() == "CORRECT"
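③ Run the evaluation
from langsmith import aevaluate

def example_to_state(inputs: dict) -> dict:
    # Convert a dataset example into the graph's input state.
    return {"messages": [{"role": "user", "content": inputs["question"]}]}

# We use LCEL declarative syntax here.
# Remember that langgraph graphs are also langchain runnables.
# `app` is the compiled LangGraph graph under evaluation, defined elsewhere.
target = example_to_state | app

# Top-level `await` works in a notebook; in a script, call this from an async function.
experiment_results = await aevaluate(
    target,
    data="weather agent",
    evaluators=[correct],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-baseline",  # optional
)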
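2. Evaluating intermediate steps
A convenient property of LangGraph is that a graph's output is a state object, which usually already contains information about the intermediate steps taken.
For example, to check whether the agent calls the search tool as its first step:
def right_tool(outputs: dict) -> bool:
    # messages[0] is the user message, so messages[1] is the agent's first response.
    tool_calls = outputs["messages"][1].tool_calls
    return bool(tool_calls and tool_calls[0]["name"] == "search")

experiment_results = await aevaluate(
    target,
    data="weather agent",
    evaluators=[correct, right_tool],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-baseline",  # optional
)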
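The same state object can be used to recover the whole tool-call trajectory, not just the first step. The sketch below is an assumption-laden illustration: `tool_trajectory` and `trajectory_matches` are hypothetical helpers, the `"trajectory"` reference key would have to be added to the dataset, and `SimpleNamespace` objects stand in for real message objects so the snippet runs without LangGraph.

```python
from types import SimpleNamespace

def tool_trajectory(outputs: dict) -> list[str]:
    # Collect tool names, in order, from every message that carries tool calls.
    names = []
    for message in outputs["messages"]:
        # Messages without tool calls either lack the attribute or have it empty.
        for call in getattr(message, "tool_calls", None) or []:
            names.append(call["name"])
    return names

def trajectory_matches(outputs: dict, reference_outputs: dict) -> bool:
    # Assumes each dataset example stores its expected tool order
    # under a hypothetical "trajectory" key.
    return tool_trajectory(outputs) == reference_outputs["trajectory"]

# Stand-in for a graph's final state: user, AI tool call, tool result, AI answer.
fake_outputs = {
    "messages": [
        SimpleNamespace(content="what's the weather in sf"),
        SimpleNamespace(content="", tool_calls=[{"name": "search", "args": {}}]),
        SimpleNamespace(content="60 degrees and foggy"),
        SimpleNamespace(content="It's 60 degrees and foggy.", tool_calls=[]),
    ]
}

print(tool_trajectory(fake_outputs))
print(trajectory_matches(fake_outputs, {"trajectory": ["search"]}))
```

An evaluator written with the `(outputs, reference_outputs)` signature, like `correct` above, could be added to the `evaluators` list alongside `right_tool`.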
References:
https://docs.langchain.com/langsmith/evaluate-graph
https://docs.langchain.com/langsmith/evaluation-approaches
https://deephub.blog.csdn.net/article/details/149863739?fromshare=blogdetail&sharetype=blogdetail&sharerId=149863739&sharerefer=PC&sharesource=qq_43814415&sharefrom=from_link