
LLM Evaluation Survey Report

news 2025/11/1 5:54:41

1. LLM Evaluation Overview
●Evaluation Guidebook:
https://github.com/huggingface/evaluation-guidebook
●A one-article overview of LLM performance evaluation data, metrics, and frameworks:
https://zhuanlan.zhihu.com/p/25471631745
1.1 LLM Evaluation Benchmark
Well-known open-source benchmarks:
CMMLU, MMLU, C-Eval, AGIEval, JEC-QA, MedMCQA, MedQA-MCMLE, MedQA-USMLE, GAOKAO-Bench
In-vehicle (automotive):
●InCA (InCA: Rethinking In-Car Conversational System Assessment Leveraging Large Language Models)
●LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs
https://github.com/PurdueDigitalTwin/LaMPilot
●SuperCLUE-Auto
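A Chinese-language LLM evaluation benchmark for the automotive industry, providing fine-grained evaluation based on multi-turn open-ended questions.
Leaderboard: https://www.superclueai.com/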
https://github.com/CLUEbenchmark/SuperCLUE-Auto
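The benchmarks above are primarily papers; their open-source repositories are mainly for demonstration and should be treated as reference only.
Benchmarks can also be custom-defined.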
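1.2 LLM Evaluation Datasets
●Open source:
Open-source benchmarks ship with their own evaluation datasets; "open-source evaluation dataset" here generally means the dataset released together with a benchmark.
●Commercial: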
https://hub.opencompass.org.cn/home
●Self-built:
An evaluation dataset can be built in-house by following the dataset format rules of the chosen LLM evaluation framework.
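As a minimal sketch of what a self-built dataset might look like, the snippet below writes a few multiple-choice items to a JSONL file and validates them on load. The field names (`question`, `choices`, `answer`), the helper names, and the sample items are all hypothetical; the actual schema depends on the framework chosen.

```python
import json
import os
import tempfile

# Hypothetical schema: each framework defines its own required fields.
REQUIRED_FIELDS = ("question", "choices", "answer")


def validate_item(item):
    """Raise ValueError if an item does not match the assumed schema."""
    for field in REQUIRED_FIELDS:
        if field not in item:
            raise ValueError(f"missing field: {field}")
    # Options are labelled A, B, C, ... in the order they appear.
    labels = [chr(ord("A") + i) for i in range(len(item["choices"]))]
    if item["answer"] not in labels:
        raise ValueError(f"answer {item['answer']!r} is not one of {labels}")


def write_jsonl(items, path):
    """Validate each item and write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            validate_item(item)
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load a JSONL dataset and re-validate every item."""
    with open(path, encoding="utf-8") as f:
        items = [json.loads(line) for line in f if line.strip()]
    for item in items:
        validate_item(item)
    return items


# Illustrative items only; not taken from any published benchmark.
SAMPLE_ITEMS = [
    {"question": "Which unit measures tire pressure?",
     "choices": ["kPa", "km/h", "kWh", "rpm"], "answer": "A"},
    {"question": "What does ABS stand for in a car?",
     "choices": ["Auto Brake Sensor", "Anti-lock Braking System"], "answer": "B"},
]

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "custom_eval.jsonl")
    write_jsonl(SAMPLE_ITEMS, path)
    print(len(read_jsonl(path)), "items written to", path)
```

Validating on both write and read keeps malformed items from silently skewing scores later; a real framework would additionally require its own task configuration.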

1.3 Model Loading Methods
Evaluate either by loading the model weights locally or by calling the model through an API.
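One way to keep the rest of the evaluation code independent of how the model is reached is to hide both modes behind the same interface. The sketch below is illustrative only: `LocalWeightsBackend`, `ApiBackend`, and their constructor arguments are hypothetical names, and the actual inference and HTTP calls are injected as plain functions instead of using any specific library.

```python
from typing import Callable, List, Protocol


class ModelBackend(Protocol):
    """Anything that maps a prompt string to a completion string."""

    def generate(self, prompt: str) -> str: ...


class LocalWeightsBackend:
    """Hypothetical wrapper around a model loaded from local weights."""

    def __init__(self, weights_path: str, infer_fn: Callable[[str], str]):
        self.weights_path = weights_path  # where the checkpoint would live
        self._infer_fn = infer_fn         # stands in for the real forward pass

    def generate(self, prompt: str) -> str:
        return self._infer_fn(prompt)


class ApiBackend:
    """Hypothetical wrapper around a remote model served over an API."""

    def __init__(self, endpoint: str, send_fn: Callable[[str, str], str]):
        self.endpoint = endpoint  # placeholder URL, not a real service
        self._send_fn = send_fn   # stands in for the real HTTP request

    def generate(self, prompt: str) -> str:
        return self._send_fn(self.endpoint, prompt)


def run_prompts(backend: ModelBackend, prompts: List[str]) -> List[str]:
    """Evaluation code depends only on the shared generate() method."""
    return [backend.generate(p) for p in prompts]


if __name__ == "__main__":
    # Toy stand-ins: upper-casing and reversing replace real model calls.
    local = LocalWeightsBackend("/path/to/weights", lambda p: p.upper())
    remote = ApiBackend("https://example.invalid/v1", lambda url, p: p[::-1])
    print(run_prompts(local, ["abc"]))
    print(run_prompts(remote, ["abc"]))
```

With this split, the same scoring loop can be pointed at a local checkpoint or a hosted endpoint by swapping the backend object.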
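1.4 Evaluation Methods
Objective evaluation
The model answers fill-in-the-blank, single-choice, and multiple-choice questions.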
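Objective questions can be scored automatically by extracting the option letters from the model output and comparing them with the reference answer. The sketch below assumes options are labelled A to D, that the model is prompted to finish with `Answer: <letters>`, and that a multiple-choice answer counts as correct only on an exact set match. The regex and function names are illustrative and not taken from any particular framework.

```python
import re

# Assumed output convention: the model ends with "Answer: A" or "Answer: A, C".
ANSWER_PATTERN = re.compile(r"Answer:\s*([A-D](?:[,\s]*[A-D])*)\b")


def extract_choices(output):
    """Return the set of option letters in the last 'Answer:' match, if any."""
    matches = ANSWER_PATTERN.findall(output)
    if not matches:
        return frozenset()
    return frozenset(re.findall(r"[A-D]", matches[-1]))


def accuracy(outputs, references):
    """Fraction of outputs whose extracted letters exactly match the reference."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not outputs:
        return 0.0
    correct = sum(
        extract_choices(out) == frozenset(ref)
        for out, ref in zip(outputs, references)
    )
    return correct / len(outputs)


if __name__ == "__main__":
    outputs = [
        "Tire pressure is measured in kPa. Answer: A",  # single choice, correct
        "Both apply here. Answer: A, C",                # multiple choice, correct
        "I think it is the second one. Answer: B",      # wrong, reference is D
        "I am not sure.",                               # no parsable answer
    ]
    references = ["A", "AC", "D", "B"]
    print(accuracy(outputs, references))
```

Fill-in-the-blank items could be scored in the same loop by replacing the letter extraction with a normalized string comparison.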
Subjective evaluation
Open-ended subjective questions
A human or an LLM scores the model's answers.
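When an LLM acts as the grader, a common pattern is to send it the question, the candidate answer, and a scoring instruction, then parse a numeric score from its reply. The sketch below assumes a 1 to 10 scale and a reply containing `Score: <n>`; the prompt wording, `judge_fn`, and `stub_judge` are hypothetical and stand in for a real judge model call.

```python
import re
from statistics import mean
from typing import Callable, Optional

# Illustrative judge prompt; real rubrics are usually far more detailed.
JUDGE_TEMPLATE = (
    "You are grading an answer to an open-ended question.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer from 1 (poor) to 10 (excellent) and reply as 'Score: <n>'."
)


def parse_score(reply: str) -> Optional[int]:
    """Extract an integer score in [1, 10] from the judge reply, else None."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


def judge_answers(pairs, judge_fn: Callable[[str], str]):
    """Score each (question, answer) pair; unparsable replies become None."""
    scores = []
    for question, answer in pairs:
        prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
        scores.append(parse_score(judge_fn(prompt)))
    return scores


def average_score(scores):
    """Mean over the replies that could be parsed; None if there are none."""
    valid = [s for s in scores if s is not None]
    return mean(valid) if valid else None


def stub_judge(prompt: str) -> str:
    """Placeholder judge that rewards longer answers. Replace with a real model."""
    answer = prompt.split("Answer: ", 1)[1].split("\n", 1)[0]
    return f"Score: {min(10, max(1, len(answer.split())))}"


if __name__ == "__main__":
    pairs = [
        ("How do I defog the windshield?",
         "Turn on the front defroster and set the fan to high."),
        ("How do I defog the windshield?", "No idea"),
    ]
    scores = judge_answers(pairs, stub_judge)
    print(scores, average_score(scores))
```

Returning `None` for unparsable replies, instead of a default score, keeps judge formatting failures visible in the final statistics.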
Long-context Needle In A Haystack test
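The Needle In A Haystack test inserts a short fact (the needle) at a chosen depth in a long filler context and asks the model to retrieve it. What follows is a minimal sketch, assuming sentence-level insertion and a substring check for correctness; `model_fn`, `stub_model`, the filler text, and the needle are all hypothetical placeholders.

```python
from typing import Callable


def build_haystack(filler_sentences, needle, depth):
    """Insert the needle at a relative depth in [0, 1] of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def run_needle_test(model_fn: Callable[[str], str], filler_sentences, needle,
                    question, expected, depths):
    """Return {depth: bool} for whether the model answer contains the fact."""
    results = {}
    for depth in depths:
        context = build_haystack(filler_sentences, needle, depth)
        prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
        results[depth] = expected.lower() in model_fn(prompt).lower()
    return results


def stub_model(prompt: str) -> str:
    """Placeholder model that only 'reads' the first 200 characters."""
    visible = prompt[:200]
    return "The code is 7421." if "7421" in visible else "I do not know."


if __name__ == "__main__":
    filler = [f"Filler sentence number {i}." for i in range(50)]
    needle = "The secret door code is 7421."
    results = run_needle_test(
        stub_model, filler, needle,
        question="What is the secret door code?",
        expected="7421",
        depths=[0.0, 0.5, 1.0],
    )
    print(results)
```

Sweeping both the depth and the total context length produces the usual retrieval heat map; the stub model here only succeeds when the needle sits at the very start.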
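2. LLM Evaluation Leaderboards
International: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/
China: https://rank.opencompass.org.cn/home
In-vehicle: https://www.superclueai.com/
3. LLM Evaluation Frameworks
Tips:
The first three (OpenAI/Eval, lm-evaluation-harness, OpenCompass) are mature engineering projects with large user bases. The remaining frameworks lean toward research novelty; their ideas can be a useful reference when drafting patents, but their practicality and runnability are unknown. OpenCompass has a strong Chinese-language community and plenty of reference material, and its authors, the Shanghai AI Laboratory team, can be contacted for discussion. VLMEvalKit is the multimodal LLM evaluation framework in the OpenCompass evaluation family.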

Idea:
Start with small-scale experiments using OpenAI/Eval, then choose lm-evaluation-harness or OpenCompass for local deployment.

3.1 OpenAI/Eval
Project URL: https://github.com/openai/evals
Tutorial:
1. https://www.aidoczh.com/docs/openai_cookbook/examples/evaluation/Getting_Started_with_OpenAI_Evals/
2. https://cookbook.openai.com/examples/evaluation/getting_started_with_openai_evals
Note: API access must be purchased on the OpenAI Platform with an overseas bank card or through Apple top-up.
3.2 lm-evaluation-harness
Project URL: https://github.com/EleutherAI/lm-evaluation-harness
Tutorial:
https://zhuanlan.zhihu.com/p/671235487
https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md
https://blog.csdn.net/qq_41185868/article/details/139787790
3.3 OpenCompass

Official website: https://opencompass.org.cn/home
Project URL: https://github.com/open-compass/opencompass
Tutorial: https://opencompass.readthedocs.io/zh-cn/latest/get_started/installation.html
3.4 VLMEvalKit
Project URL: https://github.com/open-compass/VLMEvalKit
Tutorial: https://vlmevalkit.readthedocs.io/zh-cn/latest/

3.5 FreeEval

Project URL:
https://github.com/WisdomShell/FreeEval

3.6 UltraEval

Project URL: https://github.com/OpenBMB/UltraEval
3.7 Auto-Arena-LLMs
Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
A novel automated evaluation tool that assesses LLM capabilities through peer battles among multiple LLM agents and committee discussions.
Project homepage:
https://auto-arena.github.io/
Project URL:
https://github.com/DAMO-NLP-SG/Auto-Arena-LLMs

4. LLM Evaluation Papers
LLM evaluation survey paper:
https://arxiv.org/abs/2307.03109
Selected conference papers:
https://mp.weixin.qq.com/s/wHqVVJToP18zgLzEizd3Tg
InCA:
●https://arxiv.org/abs/2311.07469
FreeEval:
●https://aclanthology.org/2024.emnlp-demo.1.pdf
●Paper walkthrough: https://zhuanlan.zhihu.com/p/13035659633
UltraEval: https://arxiv.org/abs/2404.07584
