
[AI] Evaluating Qwen3-4B-FP8 with evalscope on Ubuntu 22.04

Install evalscope

mkdir evalscope
cd evalscope/
python3 -m venv venv
source venv/bin/activate
pip install 'evalscope[app,perf]' -U -i https://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com
pip install tiktoken omegaconf -i https://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com
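
A quick import check confirms the environment is usable before anything heavy runs (a minimal sanity check; an ImportError here means the install above did not complete):

python -c "import evalscope, tiktoken, omegaconf; print('evalscope environment OK')"
pip show evalscope | head -n 2   # prints the installed package name and version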

Download general evaluation datasets (not used in this run)

pip install -U modelscope
modelscope download --dataset modelscope/mmlu --local_dir /data/ai/evalscope_data/mmlu
modelscope download --dataset modelscope/gsm8k --local_dir /data/ai/evalscope_data/gsm8k
modelscope download --dataset modelscope/human_eval --local_dir /data/ai/evalscope_data/human_eval
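
A quick listing confirms the datasets landed in the expected directories:

ls /data/ai/evalscope_data/mmlu /data/ai/evalscope_data/gsm8k /data/ai/evalscope_data/human_eval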

Deploy Qwen3-4B-FP8 locally

modelscope download --model Qwen/Qwen3-4B-FP8
vllm serve /home/yeqiang/.cache/modelscope/hub/models/Qwen/Qwen3-4B-FP8 --served-model-name Qwen3-4B-FP8 --port 8000 --dtype auto --gpu-memory-utilization 0.8 --max-model-len 40960 --tensor-parallel-size 1
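
Before pointing evalscope at the service, it is worth smoke-testing the OpenAI-compatible endpoint that vLLM exposes (a minimal check; the prompt is arbitrary, and any JSON response with a "choices" field means the service is up):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-4B-FP8",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'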

Write the evaluation script (using the EvalScope-Qwen3-Test dataset)

eval_qwen3_mmlu.py

from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen3-4B-FP8',
    api_url='http://localhost:8000/v1/chat/completions',
    eval_type='service',
    datasets=['data_collection'],
    dataset_args={
        'data_collection': {
            'dataset_id': 'modelscope/EvalScope-Qwen3-Test',
            # strip the <think>...</think> reasoning block so only the final answer is scored
            'filters': {'remove_until': '</think>'}
        }
    },
    eval_batch_size=128,
    generation_config={
        'max_tokens': 30000,   # max tokens to generate; keep this large to avoid truncated output
        'temperature': 0.6,    # sampling temperature (recommended in the Qwen report)
        'top_p': 0.95,         # top-p sampling (recommended in the Qwen report)
        'top_k': 20,           # top-k sampling (recommended in the Qwen report)
        'n': 1,                # number of completions per request
    },
    timeout=60000,  # request timeout
    stream=True,    # use streaming output
    limit=100,      # cap the run at 100 samples for a quick test
)
run_task(task_cfg=task_cfg)

Run the evaluation

(venv) yeqiang@yeqiang-Default-string:/data/ai/evalscope$ python eval_qwen3_mmlu.py 
2025-05-06 22:44:04,363 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
2025-05-06 22:44:06,473 - evalscope - INFO - Loading dataset from modelscope: > dataset_name: modelscope/EvalScope-Qwen3-Test
Downloading Dataset to directory: /home/yeqiang/.cache/modelscope/hub/datasets/modelscope/EvalScope-Qwen3-Test
2025-05-06 22:44:08,753 - evalscope - INFO - Dump task config to ./outputs/20250506_224404/configs/task_config_7d0e13.yaml
2025-05-06 22:44:08,755 - evalscope - INFO - {
  "model": "Qwen3-4B-FP8",
  "model_id": "Qwen3-4B-FP8",
  "model_args": {"revision": "master", "precision": "torch.float16"},
  "model_task": "text_generation",
  "template_type": null,
  "chat_template": null,
  "datasets": ["data_collection"],
  "dataset_args": {"data_collection": {"dataset_id": "modelscope/EvalScope-Qwen3-Test", "filters": {"remove_until": "</think>"}}},
  "dataset_dir": "/home/yeqiang/.cache/modelscope/hub/datasets",
  "dataset_hub": "modelscope",
  "generation_config": {"max_tokens": 30000, "temperature": 0.6, "top_p": 0.95, "top_k": 20, "n": 1},
  "eval_type": "service",
  "eval_backend": "Native",
  "eval_config": null,
  "stage": "all",
  "limit": 100,
  "eval_batch_size": 128,
  "mem_cache": false,
  "use_cache": null,
  "work_dir": "./outputs/20250506_224404",
  "outputs": null,
  "debug": false,
  "dry_run": false,
  "seed": 42,
  "api_url": "http://localhost:8000/v1/chat/completions",
  "api_key": "EMPTY",
  "timeout": 60000,
  "stream": true,
  "judge_strategy": "auto",
  "judge_worker_num": 1,
  "judge_model_args": {}
}
Getting answers:  15%|████████████████████████████████▊                      [00:50<02:06,  1.49s/it]
Getting answers:  33%|█████████████████████████████████████████████████████| 33/100 [02:28<07:44,  6.93s/it]

nvidia-smi

vLLM service status

Report

2025-05-06 23:13:37,099 - evalscope - INFO - subset_level Report:
+-------------+-------------------------+-----------------+----------------------------+---------------+-------+
|  task_type  |         metric          |  dataset_name   |        subset_name         | average_score | count |
+-------------+-------------------------+-----------------+----------------------------+---------------+-------+
|    exam     |     AverageAccuracy     |    mmlu_pro     |           health           |    0.6667     |   9   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |            math            |      1.0      |   7   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |        engineering         |    0.6667     |   6   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         chemistry          |      0.5      |   6   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         psychology         |    0.6667     |   6   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          biology           |      0.8      |   5   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |            law             |      0.2      |   5   |
| instruction | prompt_level_strict_acc |     ifeval      |          default           |     0.75      |   4   |
| instruction |  inst_level_strict_acc  |     ifeval      |          default           |     0.75      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          physics           |     0.75      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |           other            |      0.5      |   4   |
| instruction | prompt_level_loose_acc  |     ifeval      |          default           |      1.0      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |      computer science      |      1.0      |   4   |
| instruction |  inst_level_loose_acc   |     ifeval      |          default           |      1.0      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          business          |    0.6667     |   3   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          history           |    0.6667     |   3   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         philosophy         |    0.6667     |   3   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |         prehistory         |      1.0      |   2   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         economics          |      0.5      |   2   |
|    exam     |     AverageAccuracy     |      ceval      |     education_science      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |            law             |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |       tax_accountant       |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      iquiz      |             EQ             |      1.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |    high_school_biology     |      1.0      |   1   |
|    code     |         Pass@1          | live_code_bench |           v5_v6            |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |       basic_medicine       |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |          anatomy           |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |  college_computer_science  |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |    college_mathematics     |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      abstract_algebra      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |  high_school_mathematics   |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    | high_school_macroeconomics |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |   high_school_chemistry    |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |    high_school_biology     |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |     conceptual_physics     |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    | high_school_world_history  |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |       miscellaneous        |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      medical_genetics      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |          virology          |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      security_studies      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |   professional_medicine    |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      moral_scenarios       |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      world_religions       |      0.0      |   1   |
|  knowledge  |      AveragePass@1      |      gpqa       |        gpqa_diamond        |      1.0      |   1   |
|    math     |      AveragePass@1      |    math_500     |          Level 3           |      1.0      |   1   |
|    math     |      AveragePass@1      |    math_500     |          Level 5           |      1.0      |   1   |
+-------------+-------------------------+-----------------+----------------------------+---------------+-------+
2025-05-06 23:13:37,099 - evalscope - INFO - dataset_level Report:
+-------------+-------------------------+-----------------+---------------+-------+
|  task_type  |         metric          |  dataset_name   | average_score | count |
+-------------+-------------------------+-----------------+---------------+-------+
|    exam     |     AverageAccuracy     |    mmlu_pro     |    0.6716     |  67   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |    0.6842     |  19   |
|    exam     |     AverageAccuracy     |      ceval      |      0.6      |   5   |
| instruction | prompt_level_loose_acc  |     ifeval      |      1.0      |   4   |
| instruction | prompt_level_strict_acc |     ifeval      |     0.75      |   4   |
| instruction |  inst_level_loose_acc   |     ifeval      |      1.0      |   4   |
| instruction |  inst_level_strict_acc  |     ifeval      |     0.75      |   4   |
|    math     |      AveragePass@1      |    math_500     |      1.0      |   2   |
|    code     |         Pass@1          | live_code_bench |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      iquiz      |      1.0      |   1   |
|  knowledge  |      AveragePass@1      |      gpqa       |      1.0      |   1   |
+-------------+-------------------------+-----------------+---------------+-------+
2025-05-06 23:13:37,099 - evalscope - INFO - task_level Report:
+-------------+-------------------------+---------------+-------+
|  task_type  |         metric          | average_score | count |
+-------------+-------------------------+---------------+-------+
|    exam     |     AverageAccuracy     |    0.6739     |  92   |
| instruction |  inst_level_loose_acc   |      1.0      |   4   |
| instruction |  inst_level_strict_acc  |     0.75      |   4   |
| instruction | prompt_level_loose_acc  |      1.0      |   4   |
| instruction | prompt_level_strict_acc |     0.75      |   4   |
|    math     |      AveragePass@1      |      1.0      |   2   |
|    code     |         Pass@1          |      0.0      |   1   |
|  knowledge  |      AveragePass@1      |      1.0      |   1   |
+-------------+-------------------------+---------------+-------+
2025-05-06 23:13:37,100 - evalscope - INFO - tag_level Report:
+------+-------------------------+---------------+-------+
| tags |         metric          | average_score | count |
+------+-------------------------+---------------+-------+
|  en  |     AverageAccuracy     |    0.6744     |  86   |
|  zh  |     AverageAccuracy     |    0.6667     |   6   |
|  en  |  inst_level_strict_acc  |     0.75      |   4   |
|  en  |  inst_level_loose_acc   |      1.0      |   4   |
|  en  | prompt_level_loose_acc  |      1.0      |   4   |
|  en  | prompt_level_strict_acc |     0.75      |   4   |
|  en  |      AveragePass@1      |      1.0      |   3   |
|  en  |         Pass@1          |      0.0      |   1   |
+------+-------------------------+---------------+-------+
2025-05-06 23:13:37,100 - evalscope - INFO - category_level Report:
+-----------+--------------+-------------------------+---------------+-------+
| category0 |  category1   |         metric          | average_score | count |
+-----------+--------------+-------------------------+---------------+-------+
|   Qwen3   |   English    |     AverageAccuracy     |    0.6744     |  86   |
|   Qwen3   |   Chinese    |     AverageAccuracy     |    0.6667     |   6   |
|   Qwen3   |   English    |  inst_level_loose_acc   |      1.0      |   4   |
|   Qwen3   |   English    |  inst_level_strict_acc  |     0.75      |   4   |
|   Qwen3   |   English    | prompt_level_strict_acc |     0.75      |   4   |
|   Qwen3   |   English    | prompt_level_loose_acc  |      1.0      |   4   |
|   Qwen3   | Math&Science |      AveragePass@1      |      1.0      |   3   |
|   Qwen3   |     Code     |         Pass@1          |      0.0      |   1   |
+-----------+--------------+-------------------------+---------------+-------+
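
Since the [app] extra was installed earlier, the raw reports under ./outputs/20250506_224404/ can also be browsed interactively (a sketch based on the EvalScope docs; flags and the default port may differ by version):

# launch EvalScope's Gradio-based visualization over the local outputs
evalscope app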

References

https://evalscope.readthedocs.io/zh-cn/latest/best_practice/qwen3.html
