
[LLM] Dataset Construction Formats

news 2025/7/16 3:27:32

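1. Alpaca Data Format

The Alpaca data format was introduced by the Stanford Alpaca project for fine-tuning large language models (LLMs), specifically for instruction tuning. It builds on the Self-Instruct method: a stronger model (from OpenAI's GPT-3 family; Stanford Alpaca used text-davinci-003) automatically generates instruction data, so that a smaller model can learn to understand and follow instructions.

Data format example

An Alpaca dataset is usually stored as JSON, and each record contains the following fields: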

{
    "instruction": "Describe the benefits of exercise.",
    "input": "",
    "output": "Regular exercise improves cardiovascular health, strengthens muscles, boosts mental health, and helps with weight management."
}

Or, when the task comes with input data:

{
    "instruction": "Summarize the following paragraph.",
    "input": "Artificial intelligence is transforming various industries, including healthcare, finance, and education...",
    "output": "AI is revolutionizing multiple industries like healthcare, finance, and education."
}

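Field reference

instruction: the instruction, i.e. the task the user wants the model to perform (summarization, translation, programming, and so on).
input (optional): additional input, used for tasks that need context. It is an empty string when the instruction is self-contained.
output: the expected output, i.e. the answer the model should generate.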
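
To make the role of the three fields concrete, here is a minimal sketch of how one record might be turned into a training prompt. The helper name `build_prompt` is hypothetical, and the template wording follows the one commonly associated with the Stanford Alpaca project; treat it as an assumption and match whatever your training framework actually expects.

```python
# Templates modeled on the Stanford Alpaca prompt (an assumption; adjust as needed).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Build the prompt text for one Alpaca-style record (hypothetical helper)."""
    # `input` is optional: a missing, null, or blank value selects the shorter template.
    extra = (record.get("input") or "").strip()
    if extra:
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=extra
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])
```

During supervised fine-tuning, the `output` field would typically be appended after `### Response:` as the target text, with the loss often computed only on that part.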

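Characteristics

Suited to instruction tuning, making the model better at task-oriented exchanges.
Clear structure that fits supervised fine-tuning (SFT) directly.
Data is generated automatically, which reduces the cost of manual annotation.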

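2. ShareGPT Data Format

ShareGPT is used mainly for fine-tuning on conversational data. It is a collection of conversations shared by users of OpenAI's ChatGPT, and it suits training chat-oriented LLMs in the vein of Vicuna (which was fine-tuned on ShareGPT conversations) and LLaMA-2-Chat.

Data format example

ShareGPT data is usually stored as JSON with the following structure: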

{
    "conversations": [
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "The capital of France is Paris."},
        {"from": "human", "value": "Can you tell me more about Paris?"},
        {"from": "gpt", "value": "Paris, known as the 'City of Light', is famous for its rich history, art, fashion, and gastronomy."}
    ]
}

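Field reference

conversations: the full list of turns in the conversation. Each turn contains:
- from: the source of the message ("human" for the user, "gpt" for the AI).
- value: the text of the message.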
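
As an illustration of how the `from` and `value` fields are consumed, the sketch below converts a ShareGPT-style sample into a list of role/content messages. The function name `sharegpt_to_messages` and the target role names `user` and `assistant` are assumptions chosen for illustration; they follow a common chat convention rather than anything specified by the format itself.

```python
from typing import Dict, List

# "human" and "gpt" come from the format above; the target names are assumed.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def sharegpt_to_messages(sample: dict) -> List[Dict[str, str]]:
    """Convert one ShareGPT-style sample into role/content messages."""
    messages = []
    for turn in sample["conversations"]:
        source = turn["from"]
        if source not in ROLE_MAP:
            # Real exports may contain other sources; fail loudly rather than guess.
            raise ValueError(f"unexpected 'from' value: {source!r}")
        messages.append({"role": ROLE_MAP[source], "content": turn["value"]})
    return messages
```

Raising on an unknown `from` value is a deliberate choice here: user-shared data can be noisy, so surfacing odd records early is usually safer than silently mapping them.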

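Characteristics

Suited to fine-tuning chat models, making the model better at multi-turn conversation.
Simple structure that is easy to use for supervised fine-tuning (SFT) or as input to an RLHF (reinforcement learning from human feedback) pipeline.
Data quality depends on the conversations users chose to share, so the data may contain noise.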

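3. Comparison Summary

Format	Use case	Data structure	Characteristics
Alpaca	Instruction tuning	Independent instruction-input-output records	Suited to task-oriented exchanges, clear structure
ShareGPT	Chat fine-tuning	Multi-turn conversation (human & gpt)	Suited to chat models, can be used for RLHF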
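
The structural difference in the table can be seen by wrapping an Alpaca record as a single-turn ShareGPT conversation. This is only a sketch: `alpaca_to_sharegpt` is a hypothetical helper, and joining `instruction` and `input` with a blank line is an assumption, not a rule defined by either format.

```python
def alpaca_to_sharegpt(record: dict) -> dict:
    """Wrap one Alpaca-style record as a single-turn ShareGPT-style sample."""
    instruction = record["instruction"]
    extra = (record.get("input") or "").strip()
    # Assumption: the optional input is appended to the instruction after a blank line.
    human_text = f"{instruction}\n\n{extra}" if extra else instruction
    return {
        "conversations": [
            {"from": "human", "value": human_text},
            {"from": "gpt", "value": record["output"]},
        ]
    }
```

The reverse direction is lossy in general, because a multi-turn conversation does not fit into a single instruction-input-output record.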
