ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
1. Introduction
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code
Generation Evaluation
A benchmark open-sourced by Tencent Hunyuan: a new benchmark and paradigm for automated, multimodal evaluation of visual code generation.
The main motivation: front-end code is shifting from static toward dynamic, yet most current benchmarks focus on algorithmic correctness while ignoring visual fidelity and the completeness of interactions.
- Homepage: https://artifactsbenchmark.github.io/
- Paper: https://arxiv.org/pdf/2507.04952
- Repo: https://github.com/Tencent-Hunyuan/ArtifactsBenchmark
2. Dataset Categories
The tasks fall into nine major topics: "Game Development", "SVG Generation", "Web Applications", "Simulation", "Data Science", "Management Systems", "Multimedia Editing", "Quick Tools", and "Other".
3. Data Pipeline
The pipeline consists of: extraction and filtering; manual and LLM-based rewriting and polishing; classification and difficulty filtering; few-shot annotation; checklist generation; model generation; manual QA checks and quality control; and final data consolidation.
3.1 Extraction & Filtering
Sources:
- Manually collected high-quality data
- Open-source datasets (Svgen-500k and Instruct-SVG)
- Cases collected from the web
- Screenshots of complex web pages fed to a large model to generate high-fidelity, descriptive prompts for front-end code generation
Filtering:
- Remove items without permissive licenses
- Remove items that are not visual
Filtering prompt:
You are a professional Query Evaluation Expert with advanced analytical reasoning abilities.
Your task is to conduct a comprehensive, step-by-step evaluation of a given question using a detailed Chain of Thought
(CoT) approach.
Evaluation Framework:
Stage 1: Comprehensive Problem Understanding
- Carefully analyze the original problem statement
- Identify explicit and implicit requirements
- Assess technical complexity and contextual constraints
Stage 2: Please rate the given question based on the following five dimensions and provide the rating result:
- Quality: Does the question have a clear logical structure, is it expressed accurately, and does it avoid ambiguity?
- Creativity: Does the question offer novelty, providing new perspectives or problem-solving opportunities compared
to existing ones?
- Relevance: Does the question have practical value, such as applicability to specific use cases or its ability to assess
valuable knowledge points or skills?
- Completeness: Is the question description clear and comprehensive, with no critical information missing? Are the
given conditions sufficient and reasonable to support the derivation of a correct answer?
- Privacy: Does the question avoid requesting or involving any sensitive personal information, such as phone numbers,
addresses, or other identifiable details?
Output Specifications:
- Rating Range:
- Each evaluation dimension and the overall score range from 1 to 10, with 1 being the worst and 10 being the best.
- Reasoning: All scores must have clear, specific reasoning to support them.
- Structure: The final output must be clear, concise, and professional.
- Objectivity: The rating must be neutral and fair.
- The final output should be a JSON object with the five scores and an overall score (average of the five dimensions),
as shown below:
```json
{
    "Quality": "8",
    "Creativity": "7",
    "Relevance": "9",
    "Completeness": "9",
    "Privacy": "7",
    "Total Score": "8.0"
}
```
Please rate the following question according to the above standards:
——–question-start——–
Question
——–question-end——–
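A downstream consumer of this prompt has to parse the judge's reply and can recompute the overall score as the mean of the five dimensions. A minimal sketch, assuming the reply wraps its JSON in a fenced block as the prompt requests (the function name and sample reply are mine, not from the repo):

```python
import json
import re


def parse_rating(raw_reply: str) -> dict:
    """Extract the JSON object from a judge reply and recompute the overall score (sketch)."""
    match = re.search(r"\{.*\}", raw_reply, re.DOTALL)
    scores = json.loads(match.group(0))
    dims = ["Quality", "Creativity", "Relevance", "Completeness", "Privacy"]
    # Recompute the average rather than trusting the judge's own arithmetic.
    scores["Total Score"] = f"{sum(int(scores[d]) for d in dims) / len(dims):.1f}"
    return scores


reply = ('```json\n{"Quality": "8", "Creativity": "7", "Relevance": "9", '
         '"Completeness": "9", "Privacy": "7", "Total Score": "0"}\n```')
scores = parse_rating(reply)  # "Total Score" becomes "8.0"
```

Recomputing the average locally guards against the judge returning an inconsistent total.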
3.2 Classification & Difficulty Filtering
Each task is tagged with a class label and a difficulty level.
Two kinds of tasks are filtered out:
- Trivially simple tasks
- Tasks that are overly complex or ambiguous
This ensures a balanced difficulty distribution (30% easy, 40% medium, 30% hard) while maintaining coverage across categories.
3.3 Few-Shot Annotation and Checklist Generation
A structured checklist is generated for every sample, and 10% are spot-checked to ensure consistency with human evaluation.
The checklist covers: functionality, robustness, engineering practice, functional redundancy, creativity, aesthetic quality, user experience, extensibility, and runtime performance.
Each dimension is scored on a 10-point scale; the checklists for the 1k+ tasks were carefully reviewed and refined by humans.
Checklist-generation prompt:
You are a senior and meticulous code review expert, proficient in multiple programming languages, front-end technologies, and interaction design. Your task is to generate a check-list for the received [Query]. The responses to the [Query] mainly include source code (in multiple programming languages), algorithm implementation, data structure design, system architecture diagrams, front-end visualization code (such as HTML/SVG/JavaScript), descriptions of interaction logic, and related technical explanations, with a primary focus on front-end visualization. Please use your code knowledge and aesthetic experience to modify the following check-list, and the full score should be 100 points.
Role Positioning
• Responsibility: Like an authoritative technical review committee member in the industry, you must be objective, comprehensive, and unbiased.
• Attitude: Meticulous, professional, and uncompromising, good at identifying various details and potential risks.
• Others: Possess high aesthetic talent, with excellent aesthetics and high requirements for user experience.
Example:
Query:
You are a code expert. Please use your professional knowledge to generate accurate and professional responses. Note to ensure that the generated code can be executed and displayed as much as possible.
Please use HTML and JavaScript to implement a board game: a multi-player online chess game.
Task: Design a multi-player online chess game system that allows players to play against each other over the network and save the game progress.
Hint: You can use server-side synchronization to manage the game state and design a reconnection mechanism.
Checklist:
1. Is the chess game combat system fully implemented?
• Review whether the code accurately implements the chessboard coordinate system through HTML/JavaScript, and whether it includes collision detection for piece movement and validation of legal moves (including special rules such as castling/en passant). Score 0 if the core interaction logic is not implemented, 5 if only basic movement is implemented, and 10 if all international chess rules are fully included.
2. Is the player online combat function implemented?
• Check whether the WebSocket implementation includes a heartbeat mechanism, a packet verification sequence, and automatic degradation on disconnection (transfer to local temporary storage). Two-way state verification between the front-end and back-end is required. Deduct 5 points if the retransmission mechanism is missing, and 3 points if network latency compensation is not handled. The full score is 10 points.
3. Is the server-side synchronization mechanism designed and a reconnection function provided?
• Evaluate whether the server synchronization strategy uses differential incremental synchronization instead of full-scale updates, and whether an operation prediction mechanism is adopted. Two-way verification of client prediction and server correction is required. Deduct 5 points if the state drift exceeds 200ms. Check whether a disconnection reconnection mechanism is designed to ensure that players can resume the game after being disconnected. The full score is 10 points.
4. Is the complete game lifecycle management constructed?
• Check whether the code includes complete game lifecycle management, including state management such as game pause/resume, multi-game history backtracking, and spectator mode. Deduct 5 points if game serialization storage is not implemented, and 3 points if the crash recovery mechanism is missing. Give 10 points if fully implemented.
5. Is the code robust?
• Evaluate whether the code can handle common abnormal situations (such as out-of-bounds input, network interruption, user operation errors, etc.) and provide friendly error prompts or recovery mechanisms. Code with strong robustness should be able to effectively handle these edge cases, giving 10 points. If the robustness is average, give 5 points, and if no exceptions are handled, give 0 points.
6. Are there any innovative features that are eye-catching?
• Check whether the code includes surprise features that enhance the experience (e.g., 1. Real-time AI move scoring 2. Exporting game recordings with commentary 3. Interactive bullet screens for friends watching). Add 3 points for each practical innovative feature implemented (maximum 10 points).
7. Are there any redundant features?
• Strictly check three types of redundancy: 1. Redundant implementation of similar functions (e.g., multiple undo logics coexisting) 2. Function modules unrelated to chess (e.g., a built-in music player) 3. Fancy effects that affect performance (e.g., particle explosion animations). Deduct 3 points for each redundancy found, and directly deduct 10 points if the core functions are interfered with by redundant code.
8. Does the code have engineering quality?
• Review modular design (such as separating game logic/view/network layers), unit test coverage, and build process automation. Deduct 5 points if global state pollution is found or design patterns are not used; deduct 5 points if the code duplication rate is too high (over 30%); deduct 5 points if the build process is not automated. The full score is 10 points.
9. Does the interface vision meet professional design standards?
• Evaluate whether the overall design follows modern design principles: 1) Harmonious color matching (no more than 3 primary colors) 2) Proper layout spacing (element spacing follows the 8px multiple principle) 3) Professional font system (body font size ≥ 14px, line height over 1.5 times). Deduct 3 points for each crowded visual element, 5 points for a glaring color combination, and 5 points for chaotic text-image layout. The full score is 10 points.
10. Is the dynamic interaction smooth and seamless?
• Judge whether the dynamic effects conform to human perception characteristics: 1) Click feedback delay ≤ 100ms 2) Transition animation duration controlled between 300-500ms 3) Clear visual focus guidance. Deduct 5 points for each operation without feedback, 3 points for visual after-images during fast sliding, and 5 points for hard-to-find key function buttons. The full score is 10 points.
• I hope you can modify items 1-4 according to the [Query] I give you. The other items can be fine-tuned but try to be consistent with the example I provided. Note that each item should be judged in combination with screenshots as much as possible. There must be 10 items. Ensure that the detection difficulty of the check-list is high and the requirements are relatively strict. The final output should be a complete check-list wrapped in a JSON block, without including any other content. Refer to the following example:
```json
{
    "checklist": <specific checklist>
}
```
Please generate the checklist for the following query according to the above standards:
——–Query starts——–
Query
——–Query ends——–
3.4 Model Generation
Multiple LLMs are run on each task to obtain multiple responses, which are used to:
- verify that the task is solvable and that its difficulty label is correct
- identify tasks where model failures stem from ambiguity or under-specification
Tasks that consistently cause failures are flagged for revision or removal.
3.5 Manual QA Checks and Quality Control
Finally, a team of human experts comprehensively reviews all tasks, checklists, and sample model solutions to ensure their quality and consistency.
3.6 Benchmark Composition and Analysis
Difficulty stratification. To enable a nuanced assessment of model capability, ArtifactsBench is split into three difficulty tiers based on the aggregate performance of 30+ state-of-the-art (SOTA) LLMs on the benchmark. Roughly one third of the dataset has an average score below 33 and is labeled "hard"; another third averages above 40 and is labeled "easy"; the remaining queries, with averages between 33 and 40, are labeled "medium".
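The tiering rule described above can be written down directly; the function name is mine, not from the repo:

```python
def difficulty_label(avg_score: float) -> str:
    """Map a query's average SOTA-model score to its ArtifactsBench difficulty tier."""
    if avg_score < 33:
        return "hard"
    if avg_score > 40:
        return "easy"
    return "medium"  # 33 <= avg_score <= 40
```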
For a finer-grained evaluation, a more detailed scenario taxonomy is needed; the classification prompt is as follows:
You are a master at categorizing queries into specific classes. You will receive a series of queries, and your task is to accurately assign them to predefined major and minor categories based on their content. The output format for each query should be "MajorCategory-MinorCategory".
Evaluation Framework:
Phase 1: Comprehensive Understanding of the Problem
- Carefully analyze the original problem statement
- Identify explicit and implicit requirements
- Assess technical complexity and contextual constraints
Phase 2: Based on the query content, refer to the following classification structure for judgment. If the result falls under
”Other,” specify the minor category and output it in the required format:
1. **Game Development**:
- Puzzle | Sports | Shooting | Casual | Strategy
- Simulation/Management | Role-Playing | Adventure | Action/Rhythm
2. **Web Applications**:
- Communication | Online Shopping | Education/Learning | Blogs/Forums | Web Visuals
3. **Management Systems**:
- Frontend/Backend Platforms | File Management | Hardware Management
4. **Multimedia Editing**:
- Image Editing | Audio Editing | Video Production
5. **Data Science**:
- Data Visualization Dashboards | Statistical Analysis | Predictive Modeling | Machine Learning
6. **Simulation & Modeling**:
- Physics Simulation | Mathematical Abstraction | 3D Simulation
7. **SVG Generation**:
- SVG Icons/Logos | SVG Images | SVG Posters
8. **Mermaid Flowcharts**:
- Code Flowcharts | Logic Flowcharts | Mind Maps
9. **Other**
The final output should be a JSON object containing the category, as shown below:
```json
{
    "Class": "Game Development-Strategy"
}
```
Please categorize the following question based on the above criteria:
——–question-start——–
Question
——–question-end——–
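The classifier's "MajorCategory-MinorCategory" string can then be split back into its two levels; a small sketch (the sample reply is illustrative only):

```python
import json
import re

reply = 'Reasoning...\n```json\n{"Class": "Game Development-Strategy"}\n```'
match = re.search(r"\{.*\}", reply, re.DOTALL)
# Split on the first hyphen only, since minor categories may contain hyphens too.
major, minor = json.loads(match.group(0))["Class"].split("-", 1)
```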
4. Evaluation Process
4.1 Code Extraction
Executable code snippets are reliably extracted from the model's raw text output using regular expressions.
Code: https://github.com/Tencent-Hunyuan/ArtifactsBenchmark/blob/main/src/code_parser.py
The code is extracted together with its filepath via ```filepath fences; the filepath's extension is then used to infer the code language.
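The repo's code_parser.py (linked above) does the real work; as a rough illustration of the idea only, a ```filepath fence can be parsed like this, with the language inferred from the file extension (the extension map and function name are my own assumptions, not the repo's actual implementation):

```python
import re
from pathlib import PurePosixPath

# Hypothetical extension-to-language map; the real parser may cover more types.
EXT_TO_LANG = {".html": "html", ".css": "css", ".js": "javascript", ".svg": "svg"}


def extract_code_blocks(text: str) -> list:
    """Pull ```filepath fenced blocks out of raw model output and tag their language."""
    blocks = []
    for info, body in re.findall(r"```([^\n]*)\n(.*?)```", text, re.DOTALL):
        name = info.strip()
        lang = EXT_TO_LANG.get(PurePosixPath(name).suffix, "unknown")
        blocks.append({"file_name": name, "language": lang, "content": body})
    return blocks
```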
4.2 Dynamic Rendering and Capture
The extracted code is then executed with Playwright in a sandboxed environment.
To capture key dynamics and interactivity, the system takes three consecutive screenshots at fixed intervals during execution.
This discrete temporal sampling aims to capture the key states of an interaction:
- before the event occurs
- during the event
- after the event's outcome
thereby providing a reliable proxy for evaluating common dynamic behaviors such as animations, state transitions, and user feedback.
- Code assembly: if the response is a single HTML file, it is used as-is; if it spans multiple HTML/JS/CSS files, the JS scripts and CSS styles are inlined into one complete HTML file:
```python
import re
# Helper functions replace_link_tag, replace_script_tag, and insert_unmatched_files
# are defined alongside this function in the repo.


def replace_references_with_code(result):
    """Replaces <link> and <script> tags in the HTML content with the corresponding file content.

    If there are unmatched CSS or JS files, they are directly inserted into the HTML.

    Parameters:
    - result: List of file content and file names returned by `parse_code`.

    Returns:
    - The modified HTML content with inline CSS and JS code.
    """
    html_content = next(item["content"] for item in result if item["file_name"] == "index.html")
    file_content_map = {
        item["file_name"]: item["content"]
        for item in result
        if item["language"] in ["css", "javascript"]
    }
    css_to_insert = [item["file_name"] for item in result if item["language"] == "css"]
    js_to_insert = [item["file_name"] for item in result if item["language"] == "javascript"]
    html_content = re.sub(
        r'<link[^>]*href=["\'](.*?)["\'][^>]*>',
        lambda match: replace_link_tag(match, file_content_map, css_to_insert),
        html_content,
    )
    html_content = re.sub(
        r'<script[^>]*src=["\'](.*?)["\'][^>]*>',
        lambda match: replace_script_tag(match, file_content_map, js_to_insert),
        html_content,
    )
    html_content = insert_unmatched_files(html_content, css_to_insert, "style", "</head>")
    html_content = insert_unmatched_files(html_content, js_to_insert, "script", "</body>")
    return html_content
```
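A self-contained toy run of the same inlining idea (the file contents below are made up, and the real helpers replace_link_tag / replace_script_tag / insert_unmatched_files live in the repo):

```python
import re

# Made-up miniature bundle standing in for parse_code output.
files = {
    "index.html": '<html><head><link href="a.css"></head>'
                  '<body><script src="b.js"></script></body></html>',
    "a.css": "body { margin: 0; }",
    "b.js": "console.log('hi');",
}

html = files["index.html"]
# Inline the stylesheet in place of its <link> tag.
html = re.sub(r'<link[^>]*href=["\'](.*?)["\'][^>]*>',
              lambda m: "<style>" + files[m.group(1)] + "</style>", html)
# Inline the script body in place of its <script src> open tag;
# the original closing </script> tag is left where it was.
html = re.sub(r'<script[^>]*src=["\'](.*?)["\'][^>]*>',
              lambda m: "<script>" + files[m.group(1)], html)
```

The result is a single HTML string with no external references, ready to be rendered from a local file path.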
- Save the assembled HTML file locally
- Take screenshots from the HTML file's path
Screenshot function:
```python
import time
from pathlib import Path

from playwright.sync_api import sync_playwright


def capture_html_screenshots(html_path, img_path, num_screenshots=3,
                             interval=1, max_retries=2, timeout=600000):
    """Captures screenshots of the given HTML content using Playwright.

    Parameters:
    - html_path: Path to the HTML file to capture screenshots from.
    - img_path: List of file paths where the screenshots will be saved.
    - num_screenshots: The number of screenshots to capture (default is 3).
    - interval: Time interval between screenshots (default is 1 second).
    - max_retries: Maximum number of retry attempts in case of failure (default is 2).
    - timeout: Timeout duration for page loading and screenshot capture (default is 600000 milliseconds).

    Returns:
    - None: This function doesn't return anything but captures screenshots at the specified paths.
    """
    html_path = Path(html_path) if not isinstance(html_path, Path) else html_path
    for attempt in range(1, max_retries + 1):
        try:
            # Launch the browser using Playwright
            with sync_playwright() as pw:
                browser = pw.chromium.launch(headless=True)
                context = None  # initialized so the finally block is safe if new_context() fails
                try:
                    context = browser.new_context()
                    page = context.new_page()
                    page.set_default_timeout(timeout)
                    page.goto(f"file://{html_path.resolve()}", timeout=timeout)
                    page.wait_for_load_state("networkidle", timeout=timeout)
                    # Capture screenshots at fixed intervals
                    for i in range(num_screenshots):
                        page.screenshot(path=img_path[i], full_page=True, timeout=timeout)
                        if i < num_screenshots - 1:
                            time.sleep(interval)
                    break  # Exit after successful screenshot capture
                finally:
                    if context:
                        context.close()
                    if browser:
                        browser.close()
        except Exception as e:
            if attempt == max_retries:
                print(f"Attempt {attempt} failed, Error: {str(e)}")
                return None
            else:
                print(f"Attempt {attempt} failed, retrying... Error: {str(e)}")
```
4.3 MLLM-as-Judge
- Open-source MLLM judge: Qwen2.5-VL-72B, a leading open-source MLLM. This keeps the evaluation path transparent, reproducible, and accessible to the community.
- Closed-source MLLM-as-Judge: Gemini-2.5-pro-0506, representing the peak of proprietary MLLM capability. This provides a high-fidelity evaluation standard that can serve as a reliable proxy for human expert judgment.
4.3.1 Scoring prompt
You are a seasoned and meticulous code review expert, proficient in multiple programming languages, front-end
technologies, and interaction design. Your task is to conduct an in-depth analysis and scoring of the received [question]
and [answer]. The [answer] may include source code (in various programming languages), algorithm implementations,
data structure designs, system architecture diagrams, front-end visualization code (such as HTML/SVG/CSS/JavaScript),
interaction logic descriptions, and related technical explanations. Please leverage your coding expertise and aesthetic
experience to thoroughly examine the [answer] content from the following dimensions and provide scores along with
detailed review comments. You should be very strict and cautious when giving full marks for each dimension.
Role Definition
Responsibilities: Act as an authoritative technical review committee member, ensuring objectivity, comprehensiveness,
and impartiality. Attitude: Rigorous, professional, and unsparing, adept at identifying details and potential risks.
Additional Traits: Possess exceptional aesthetic talent, with high standards for visual appeal and user experience.
I have only extracted the last segment of HTML or SVG code from the provided answer for visualization. The content
is adaptively scrolled to capture the entire page.
**Scoring Criteria:**
$Checklist
- The final output should be a JSON object containing the dimensions above, following this example:
```json
{
    "Overall Score": "35"
}
```
Reason: ...
Please score the following question according to the standards above:
——–Problem starts——–
$Question
——–Problem ends——–
——–Answer starts——–
$Answer
——–Answer ends——–
5. Results
6. Usage
Let's walk through the eval flow using the public dataset as an example:
https://huggingface.co/datasets/tencent/ArtifactsBenchmark/viewer/default/train?row=0&views%5B%5D=train
- Inference step: run your model on each question in the dataset to produce the answer field, then save the results locally in the following format:
```json
{
    "index": "unique identifier in the dataset that corresponds one-to-one with 'question'",
    "question": "each 'question' in ArtifactsBench",
    "answer": "The answer inferred by your model based on the 'question'"
}
```
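The inference step therefore just needs to emit one such JSON object per line (JSONL). A minimal sketch with a dummy stand-in for the actual model call (dummy_model and the sample question are hypothetical):

```python
import json


def dummy_model(question: str) -> str:
    # Stand-in for your model's inference call.
    return "```index.html\n<h1>demo</h1>\n```"


questions = [(0, "Build a small demo page.")]
lines = [
    json.dumps({"index": idx, "question": q, "answer": dummy_model(q)})
    for idx, q in questions
]
# Each entry in `lines` is one record; write "\n".join(lines) to the save path.
```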
- Eval step:
```shell
python3 src/infer_gemini.py \
    $path_with_index \
    $save_path \
    $screenshots_dir \
    $screenshots_count \
    $api_key \
    $model_marker \
    $api_url \
    $tokenizer_dir \
    --num_processes $num_processes
```
This step runs the whole pipeline described above and produces the final score.
7. Reflections & Summary
- The handling of questions is debatable: reading the overall eval code, it mainly extracts HTML and SVG code from the response; if none is found, the case simply fails.
The function is here: https://github.com/Tencent-Hunyuan/ArtifactsBenchmark/blob/main/src/extract_ans.py#L88
So by design this benchmark expects the model's inference response to contain HTML or SVG code. Yet in my quick tests on several cases, the models essentially never emitted any code (the prompts contain nothing nudging the model to output code, or to generate any particular kind of code), let alone HTML.
- The screenshot logic only renders the HTML and then, with no interaction at all, takes three consecutive screenshots over a fixed window. This can only validate animations; interactivity cannot be verified this way.
Corresponding code: https://github.com/Tencent-Hunyuan/ArtifactsBenchmark/blob/main/src/utils.py#L130