7.8 Evaluating the finetuned LLM
In this section, we use another, larger LLM to automatically evaluate the responses of the fine-tuned LLM.
```python
# This cell is optional; it allows you to restart the notebook
# and only run section 7.7 without rerunning any of the previous code
import json
from tqdm import tqdm

file_path = "instruction-data-with-response.json"

with open(file_path, "r") as file:
    test_data = json.load(file)


def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text
```
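As a quick sanity check, the snippet below prints what `format_input` produces for a made-up entry (the instruction and input text here are hypothetical, not taken from the dataset):

```python
# Hypothetical sample entry, only for inspecting the prompt format
sample_entry = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The chef cooked the meal.",
}
print(format_input(sample_entry))
```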
The book uses Ollama to run Llama 3 8B as the judge model; here I call the Qwen2.5-32B API directly instead.
```python
import urllib.request


def query_model(
    # prompt,
    # model="llama3",
    # url="http://localhost:11434/api/chat"
    prompt,
    model="qwen/qwen2.5-32b-instruct",
    url="https://api.ppinfra.com/v3/openai/chat/completions"
):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "options": {  # Settings below are required for deterministic responses
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }

    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding necessary headers
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")
    request.add_header("Authorization", "Bearer {api_key}")  # Note: replace {api_key} with your own API key

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the response
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            # print(response_json)
            # response_data += response_json["message"]["content"]
            response_data += response_json['choices'][0]["message"]["content"]

    return response_data


model = "qwen/qwen2.5-32b-instruct"
result = query_model("What do Llamas eat?", model)
print(result)
```

```
"""Output"""
Llamas are herbivores and primarily graze on a variety of plant materials. In their natural habitat, their diet consists mainly of:

1. Grass: Llamas prefer to graze on a variety of grasses, including native grasses, hay, and pasture grass.
2. Leaves and Shrubs: They also consume leaves, twigs, and shrubs.
3. Weeds: Llamas can eat a variety of weeds and other plants that may be considered invasive or undesirable in pastures.

In domesticated settings, llamas are often fed:

1. Hay: Timothy hay, alfalfa hay, or a mix of both are commonly used.
2. Pasture: Llamas can graze on well-managed pastures.
3. Pellets: Commercial llama feed or pellets formulated for their nutritional needs can be provided as a supplement.
4. Minerals: Salt and mineral supplements are often provided to ensure they get the necessary nutrients.

It's important to note that llamas are sensitive to sudden changes in their diet, and care should be taken when transitioning them between different feed types.
```
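Note that the nested "options" field (seed, temperature, num_ctx) follows the Ollama request format used in the book. If the ppinfra endpoint follows the usual OpenAI chat-completions conventions (an assumption worth checking against the provider's documentation), sampling settings are normally passed as top-level fields, and an unrecognized "options" key may simply be ignored. A minimal payload sketch under that assumption:

```python
# Sketch of an OpenAI-compatible payload (assumption: the endpoint accepts
# top-level "temperature" and "seed"; the Ollama-style "options"/"num_ctx"
# fields from the book's code are likely ignored by this kind of API).
data = {
    "model": "qwen/qwen2.5-32b-instruct",
    "messages": [{"role": "user", "content": "What do Llamas eat?"}],
    "temperature": 0,
    "seed": 123,
}
```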
Now, using the `query_model` function defined above, we can evaluate the responses of our fine-tuned model. Let's try it on the first three test-set responses we looked at in the previous section.
```python
for entry in test_data[:3]:
    prompt = (
        f"Given the input `{format_input(entry)}` "
        f"and correct output `{entry['output']}`, "
        f"score the model response `{entry['model_response']}`"
        f" on a scale from 0 to 100, where 100 is the best score. "
    )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model_response"])
    print("\nScore:")
    print(">>", query_model(prompt))
    print("\n-------------------------")
```

```
"""Output"""
Dataset response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a cheetah.

Score:
>> To score the response `The car is as fast as a cheetah.` on a scale from 0 to 100, we should consider the criteria for a good simile:

1. **Relevance**: The comparison should be relevant to the subject (the car).
2. **Clarity**: The simile should be clear and understandable.
3. **Descriptiveness**: The simile should effectively convey the intended quality (speed).

### Analysis:
- **Relevance**: A cheetah is known for its speed, making it a relevant comparison to a fast car.
- **Clarity**: The simile is clear and easy to understand.
- **Descriptiveness**: The simile effectively conveys the idea of speed.

Given these criteria, the response `The car is as fast as a cheetah.` is a good simile and meets the requirements well.

### Score:
On a scale from 0 to 100, the response would likely score around **85**. It effectively conveys the idea of speed and is relevant, but it is not as vivid or immediately striking as "as fast as lightning," which is a very common and powerful simile.

If the model output had been `The car is as fast as lightning`, it would have scored 100.

-------------------------

Dataset response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud associated with thunderstorms is a cumulus cloud.

Score:
>> The response given, "The type of cloud associated with thunderstorms is a cumulus cloud," is not entirely accurate. While cumulus clouds can develop into cumulonimbus clouds, which are the ones typically associated with thunderstorms, the response does not correctly identify the specific type of cloud that is actually linked to thunderstorms.

Therefore, on a scale from 0 to 100, where 100 is the best score, the response would score around **30**. The response shows some understanding of cloud types but contains a critical inaccuracy regarding the cloud type associated with thunderstorms.

-------------------------

Dataset response:
>> Jane Austen.

Model response:
>> The author of 'Pride and Prejudice' is Jane Austen.

Score:
>> The model response "The author of 'Pride and Prejudice' is Jane Austen." conveys the correct information and is factually accurate. However, the instruction requests a name, and the model provided a full sentence rather than just the name.

Given that the response is correct but not in the exact format requested, I would score it a **90**. The information is accurate and complete, which is very important, but it could have been more precisely formatted to match the instruction exactly.

-------------------------
```
As we can see, the judge scored the first response 85 (noting that an exact match with "as fast as lightning" would have scored 100), the second 30, and the third 90.
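As the replies above show, the judge sometimes returns a full explanation rather than a bare number. Before scoring the whole test set, it can help to have a fallback that pulls the score out of a verbose reply; the `extract_score` helper below is a hypothetical sketch (not part of the book's code) that simply takes the last integer in the 0-100 range it finds:

```python
import re


def extract_score(text):
    """Return the last integer in the 0-100 range found in `text`, or None.

    Hypothetical helper: a crude fallback for replies that wrap the score
    in an explanation instead of answering with a number alone.
    """
    candidates = [int(m) for m in re.findall(r"\b(\d{1,3})\b", text) if int(m) <= 100]
    return candidates[-1] if candidates else None


print(extract_score("On a scale from 0 to 100, the response would likely score around 85."))
# 85
```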
```python
# def generate_model_scores(json_data, json_key, model="llama3"):
def generate_model_scores(json_data, json_key, model="qwen/qwen2.5-32b-instruct"):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = query_model(prompt, model)
        try:
            scores.append(int(score))
        except ValueError:
            print(f"Could not convert score: {score}")
            continue

    return scores


scores = generate_model_scores(test_data, "model_response")
print(f"Number of scores: {len(scores)} of {len(test_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")
```

```
"""Output"""
Scoring entries: 100%|████████████████████████| 110/110 [01:10<00:00,  1.57it/s]

Number of scores: 110 of 110
Average score: 50.32
```
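Beyond the average, it can be useful to look at the spread of the scores and to keep the raw scores around for later comparisons. The snippet below is a small sketch using only the standard library; the output file name is an arbitrary choice:

```python
import statistics

print(f"Median score: {statistics.median(scores):.2f}")
print(f"Min/Max score: {min(scores)} / {max(scores)}")

# Persist the raw scores so later runs (e.g. with different training settings)
# can be compared against this baseline; the file name is arbitrary.
with open("instruction-eval-scores.json", "w") as f:
    json.dump(scores, f)
```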
Our model achieves an average score above 50, which we can use as a reference point for comparing it against other models or against other training setups that might improve it.
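For example, if responses from another model or training run were saved in the same JSON file under a different key, the same function would produce a directly comparable average; the key name below is hypothetical:

```python
# Hypothetical: assumes each entry also has a "base_model_response" key
# holding responses from another model or training run.
other_scores = generate_model_scores(test_data, "base_model_response")
print(f"Average score (other model): {sum(other_scores)/len(other_scores):.2f}")
```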
The evaluation results reported in the book:
- For reference, the original
  - Llama 3 8B base model achieves a score of 58.51
  - Llama 3 8B instruct model achieves a score of 82.65