7.8 Evaluating the finetuned LLM
In this section, we use another, larger LLM to automatically evaluate the responses of the fine-tuned LLM.
```python
# This cell is optional; it allows you to restart the notebook
# and only run section 7.7 without rerunning any of the previous code
import json
from tqdm import tqdm

file_path = "instruction-data-with-response.json"

with open(file_path, "r") as file:
    test_data = json.load(file)


def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text
```
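As a quick sanity check, the snippet below prints what `format_input` produces for a made-up entry (the instruction and input text here are hypothetical, not taken from the dataset):

```python
# Hypothetical sample entry, only for inspecting the prompt format
sample_entry = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The chef cooked the meal.",
}
print(format_input(sample_entry))
```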
The book uses Ollama to run Llama 3 8B as the judge model; here I call the Qwen2.5-32B API directly instead.
```python
import urllib.request


def query_model(
    # prompt,
    # model="llama3",
    # url="http://localhost:11434/api/chat"
    prompt,
    model="qwen/qwen2.5-32b-instruct",
    url="https://api.ppinfra.com/v3/openai/chat/completions"
):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "options": {  # Settings below are required for deterministic responses
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }

    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding necessary headers
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")
    request.add_header("Authorization", "Bearer {api_key}")  # Note: replace {api_key} with your own API key

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the response
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            # print(response_json)
            # response_data += response_json["message"]["content"]
            response_data += response_json['choices'][0]["message"]["content"]

    return response_data


model = "qwen/qwen2.5-32b-instruct"
result = query_model("What do Llamas eat?", model)
print(result)
```

```
"""Output"""
Llamas are herbivores and primarily graze on a variety of plant materials. In their natural habitat, their diet consists mainly of:

1. Grass: Llamas prefer to graze on a variety of grasses, including native grasses, hay, and pasture grass.
2. Leaves and Shrubs: They also consume leaves, twigs, and shrubs.
3. Weeds: Llamas can eat a variety of weeds and other plants that may be considered invasive or undesirable in pastures.

In domesticated settings, llamas are often fed:

1. Hay: Timothy hay, alfalfa hay, or a mix of both are commonly used.
2. Pasture: Llamas can graze on well-managed pastures.
3. Pellets: Commercial llama feed or pellets formulated for their nutritional needs can be provided as a supplement.
4. Minerals: Salt and mineral supplements are often provided to ensure they get the necessary nutrients.

It's important to note that llamas are sensitive to sudden changes in their diet, and care should be taken when transitioning them between different feed types.
```
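Note that the nested "options" field (seed, temperature, num_ctx) follows the Ollama request format used in the book. If the ppinfra endpoint follows the usual OpenAI chat-completions conventions (an assumption worth checking against the provider's documentation), sampling settings are normally passed as top-level fields, and an unrecognized "options" key may simply be ignored. A minimal payload sketch under that assumption:

```python
# Sketch of an OpenAI-compatible payload (assumption: the endpoint accepts
# top-level "temperature" and "seed"; the Ollama-style "options"/"num_ctx"
# fields from the book's code are likely ignored by this kind of API).
data = {
    "model": "qwen/qwen2.5-32b-instruct",
    "messages": [{"role": "user", "content": "What do Llamas eat?"}],
    "temperature": 0,
    "seed": 123,
}
```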
Now, using the `query_model` function defined above, we can evaluate the responses of our fine-tuned model. Let's try it on the first three test-set responses we looked at in the previous section.
```python
for entry in test_data[:3]:
    prompt = (
        f"Given the input `{format_input(entry)}` "
        f"and correct output `{entry['output']}`, "
        f"score the model response `{entry['model_response']}`"
        f" on a scale from 0 to 100, where 100 is the best score. "
    )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model_response"])
    print("\nScore:")
    print(">>", query_model(prompt))
    print("\n-------------------------")
```

```
"""Output"""
Dataset response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a cheetah.

Score:
>> To score the response `The car is as fast as a cheetah.` on a scale from 0 to 100, we should consider the criteria for a good simile:

1. **Relevance**: The comparison should be relevant to the subject (the car).
2. **Clarity**: The simile should be clear and understandable.
3. **Descriptiveness**: The simile should effectively convey the intended quality (speed).

### Analysis:
- **Relevance**: A cheetah is known for its speed, making it a relevant comparison to a fast car.
- **Clarity**: The simile is clear and easy to understand.
- **Descriptiveness**: The simile effectively conveys the idea of speed.

Given these criteria, the response `The car is as fast as a cheetah.` is a good simile and meets the requirements well.

### Score:
On a scale from 0 to 100, the response would likely score around **85**. It effectively conveys the idea of speed and is relevant, but it is not as vivid or immediately striking as "as fast as lightning," which is a very common and powerful simile.

If the model output had been `The car is as fast as lightning`, it would have scored 100.

-------------------------

Dataset response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud associated with thunderstorms is a cumulus cloud.

Score:
>> The response given, "The type of cloud associated with thunderstorms is a cumulus cloud," is not entirely accurate. While cumulus clouds can develop into cumulonimbus clouds, which are the ones typically associated with thunderstorms, the response does not correctly identify the specific type of cloud that is actually linked to thunderstorms.

Therefore, on a scale from 0 to 100, where 100 is the best score, the response would score around **30**. The response shows some understanding of cloud types but contains a critical inaccuracy regarding the cloud type associated with thunderstorms.

-------------------------

Dataset response:
>> Jane Austen.

Model response:
>> The author of 'Pride and Prejudice' is Jane Austen.

Score:
>> The model response "The author of 'Pride and Prejudice' is Jane Austen." conveys the correct information and is factually accurate. However, the instruction requests a name, and the model provided a full sentence rather than just the name.

Given that the response is correct but not in the exact format requested, I would score it a **90**. The information is accurate and complete, which is very important, but it could have been more precisely formatted to match the instruction exactly.

-------------------------
```
As we can see, the judge scored the first response 85 (noting that an exact match with "as fast as lightning" would have scored 100), the second 30, and the third 90.
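As the replies above show, the judge sometimes returns a full explanation rather than a bare number. Before scoring the whole test set, it can help to have a fallback that pulls the score out of a verbose reply; the `extract_score` helper below is a hypothetical sketch (not part of the book's code) that simply takes the last integer in the 0-100 range it finds:

```python
import re


def extract_score(text):
    """Return the last integer in the 0-100 range found in `text`, or None.

    Hypothetical helper: a crude fallback for replies that wrap the score
    in an explanation instead of answering with a number alone.
    """
    candidates = [int(m) for m in re.findall(r"\b(\d{1,3})\b", text) if int(m) <= 100]
    return candidates[-1] if candidates else None


print(extract_score("On a scale from 0 to 100, the response would likely score around 85."))
# 85
```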
```python
# def generate_model_scores(json_data, json_key, model="llama3"):
def generate_model_scores(json_data, json_key, model="qwen/qwen2.5-32b-instruct"):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = query_model(prompt, model)
        try:
            scores.append(int(score))
        except ValueError:
            print(f"Could not convert score: {score}")
            continue

    return scores


scores = generate_model_scores(test_data, "model_response")
print(f"Number of scores: {len(scores)} of {len(test_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")
```

```
"""Output"""
Scoring entries: 100%|████████████████████████| 110/110 [01:10<00:00,  1.57it/s]

Number of scores: 110 of 110
Average score: 50.32
```
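Beyond the average, it can be useful to look at the spread of the scores and to keep the raw scores around for later comparisons. The snippet below is a small sketch using only the standard library; the output file name is an arbitrary choice:

```python
import statistics

print(f"Median score: {statistics.median(scores):.2f}")
print(f"Min/Max score: {min(scores)} / {max(scores)}")

# Persist the raw scores so later runs (e.g. with different training settings)
# can be compared against this baseline; the file name is arbitrary.
with open("instruction-eval-scores.json", "w") as f:
    json.dump(scores, f)
```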
Our model achieves an average score above 50, which we can use as a reference point for comparing it against other models or against other training setups that might improve it.
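For example, if responses from another model or training run were saved in the same JSON file under a different key, the same function would produce a directly comparable average; the key name below is hypothetical:

```python
# Hypothetical: assumes each entry also has a "base_model_response" key
# holding responses from another model or training run.
other_scores = generate_model_scores(test_data, "base_model_response")
print(f"Average score (other model): {sum(other_scores)/len(other_scores):.2f}")
```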
The evaluation results reported in the book:
- For reference, the original
  - Llama 3 8B base model achieves a score of 58.51
  - Llama 3 8B instruct model achieves a score of 82.65