当前位置：首页 > news >正文

LLM系列笔记之微调数据集格式

news 2025/10/24 0:15:45

LLM系列笔记之数据集构建

参考：https://llamafactory.readthedocs.io/zh-cn/latest/getting_started/data_preparation.html
https://zhuanlan.zhihu.com/p/29621436915

这里写目录标题

LLM系列笔记之数据集构建
- Alpaca格式
- - **预训练**
  - **指令监督微调**
  - **人类偏好学习**
  - **KTO**
  - **多模态**
  - - 图像数据集
    - 视频数据集
    - 音频数据集
- ShareGPT格式
- - 指令监督微调
  - 用户偏好训练
- OpenAI格式
- 总结

Alpaca格式

预训练

通过学习未被标记的文本进行预训练，从而学习语言的表征。通常，预训练数据集从互联网上获得，因为互联网上提供了大量的不同领域的文本信息，有助于提升模型的泛化能力。

[
  {"text": "document"},
  {"text": "document"}
]

只有 text 列中的内容（即document）会用于模型学习。

实例：c4_demo.json

[
  {
    "text": "Don’t think you need all the bells and whistles? No problem. McKinley Heating Service Experts Heating & Air Conditioning offers basic air cleaners that work to improve the quality of the air in your home without breaking the bank. It is a low-cost solution that will ensure you and your family are living comfortably.\nIt’s a good idea to understand the efficiency rate of the filters, which measures what size of molecules can get through the filter. Basic air cleaners can filter some of the dust, dander and pollen that need to be removed. They are 85% efficient, and usually have a 6-inch cleaning surface.\nBasic air cleaners are not too expensive and do the job well. If you do want to hear more about upgrading from a basic air cleaner, let the NATE-certified experts at McKinley Heating Service Experts in Edmonton talk to you about their selection.\nEither way, now’s a perfect time to enhance and protect the indoor air quality in your home, for you and your loved ones.\nIf you want expert advice and quality service in Edmonton, give McKinley Heating Service Experts a call at 780-800-7092 to get your questions or concerns related to your HVAC system addressed."
  },
  ...
]

指令监督微调

通过让模型学习详细的指令以及对应的回答来优化模型在特定指令下的表现。

[
  {
    "instruction": "人类指令（必填）",
    "input": "人类输入（选填）",
    "output": "模型回答（必填）",
    "system": "系统提示词（选填）",
    "history": [
      ["第一轮指令（选填）", "第一轮回答（选填）"],
      ["第二轮指令（选填）", "第二轮回答（选填）"]
    ]
  }
]

instruction 列对应的内容为人类指令， input 列对应的内容为人类输入， output 列对应的内容为模型回答。system 列对应的内容将被作为系统提示词。

history 列是由多个字符串二元组构成的列表，分别代表历史消息中每轮对话的指令和回答。

注1：在进行指令监督微调时， instruction 列对应的内容会与 input 列对应的内容拼接后作为最终的人类输入，即人类输入为 instruction\ninput

注2：在指令监督微调时，历史消息中的回答内容也会被用于模型学习。

举例：

最简的样本

{
  "instruction": "计算这些物品的总费用。 ",
  "input": "输入：汽车 - $3000，衣服 - $100，书 - $20。",
  "output": "汽车、衣服和书的总费用为 $3000 + $100 + $20 = $3120。"
},

多轮对话的样本

[
  {
    "instruction": "今天的天气怎么样？",
    "input": "",
    "output": "今天的天气不错，是晴天。",
    "history": [
      [
        "今天会下雨吗？",
        "今天不会下雨，是个好天气。"
      ],
      [
        "今天适合出去玩吗？",
        "非常适合，空气质量很好。"
      ]
    ]
  }
]

实例：alpaca_zh_demo.json

[
  {
    "instruction": "Describe a process of making crepes.",
    "input": "",
    "output": "Making crepes is an easy and delicious process! Here are step-by-step instructions on how to make them:\n\n1. Assemble your ingredients. For basic crepes, you'll need: 1 cup all-purpose flour, 2 eggs, 1/2 cup milk, 1/2 cup water, 1/4 teaspoon salt, and 2 tablespoons melted butter.\n\n2. Mix the batter: In a large mixing bowl, whisk together the flour and the eggs. Gradually add the milk and water, stirring constantly to ensure that there are no lumps. Add salt and melted butter, and mix well.\n\n3. Let the batter rest: If you can, let the batter sit for an hour or so. This will help the flour to absorb the liquid and make the crepes more tender.\n\n4. Heat your pan: Preheat a non-stick pan over medium heat. Lightly butter the pan or use cooking spray to prevent the crepes from sticking.\n\n5. Pour the batter: Using a ladle or a measuring cup, pour a small amount of batter (about 1/4 cup) onto the center of the pan. Immediately tilt the pan in a circular motion to spread the batter evenly and thinly over the bottom of the pan.\n\n6. Cook the crepe: Cook the crepe for 1-2 minutes until the bottom is lightly golden. Carefully loosen the edges with a spatula and flip the crepe over to cook the other side for another minute.\n\n7. Remove and repeat: Gently slide the crepe onto a plate, and then repeat the process with the remaining batter. Remember to re-butter the pan between each crepe if necessary.\n\n8. Fill and serve: Fill your cooked crepes with your desired filling, such as fresh fruit, whipped cream, Nutella, or ham and cheese. Roll or fold, and serve immediately. Enjoy!"
  },

人类偏好学习

偏好数据集用于奖励模型训练、DPO 训练和 ORPO 训练。对于系统指令和人类输入，偏好数据集给出了一个更优的回答和一个更差的回答。

一些研究表明通过让模型学习“什么更好”可以使得模型更加迎合人类的需求。甚至可以使得参数相对较少的模型的表现优于参数更多的模型。

在 chosen 列中提供更优的回答，并在 rejected 列中提供更差的回答，在一轮问答中其格式如下：

[
  {
    "instruction": "人类指令（必填）",
    "input": "人类输入（选填）",
    "chosen": "优质回答（必填）",
    "rejected": "劣质回答（必填）"
  }
]

实例：dpo_en_demo.json

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "Hi! I'd like to create a new language game simulating the first person perspective of a character named Angela."
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "That sounds like a fun and engaging idea! Here are some tips to help you create the game:\n1. Start with the character's name and background: Before creating the game, it's important to have a clear idea of who Angela is and where she comes from. What is her age, gender, occupation, and interests? Knowing these details will help you create a relatable and realistic character that players can connect with.\n2. Decide on the objective of the game: What is the main objective of the game? Is it to help Angela navigate through a challenging environment, solve puzzles, or interact with other characters? Once you have a clear objective, you can start designing the game mechanics.\n3. Design the game mechanics: The game mechanics are the rules and controls that players use to interact with the game environment. Consider the camera angle, movement, interaction with objects, and any other features that will enhance the player's experience.\n4. Create a storyline: A storyline can add depth to the game and make it more engaging. Consider the obstacles that Angela will face, the challenges she will encounter, and the rewards she will receive.\n5. Add sound effects and music: Sound effects and music can help create atmosphere and enhance the player's experience. Consider using appropriate sound effects for each environment and adding music that fits the storyline.\n6. Test and refine: Once you have created the game, it's important to test it and get feedback from players. Use this feedback to refine the game and make it better.\nWith these tips, you should be well on your way to creating a fun and engaging language game simulating the first person perspective of Angela. Good luck!"
    },
    "rejected": {
      "from": "gpt",
      "value": "Hello! I'd be happy to help you create a language game simulating the first-person perspective of a character named Angela! What kind of game are you imagining? Do you have any specific details or themes in mind for the game, such as a setting, plot, or style? The more information you can provide, the better I can assist you. Additionally, please let me know if you have any specific preferences for the language or tone of the game. I'll do my best to suggest engaging and accurate responses as your helpful, respectful, and honest assistant! 😊"
    }
  },

KTO

KTO数据集与偏好数据集类似，但不同于给出一个更优的回答和一个更差的回答，KTO数据集对每一轮问答只给出一个 true/false 的 label。除了 instruction 以及 input 组成的人类最终输入和模型回答 output ，KTO 数据集还需要额外添加一个 kto_tag 列（true/false）来表示人类的反馈。

[
  {
    "instruction": "人类指令（必填）",
    "input": "人类输入（选填）",
    "output": "模型回答（必填）",
    "kto_tag": "人类反馈 [true/false]（必填）"
  }
]

实例：kto_en_demo.json

[
  {
    "messages": [
      {
        "content": "The Federal Trade Commission is going after spyware, bits of computer code that install themselves on the computers of Internet users to track or disrupt their online activities.\nQ: Which is the best summary of this article?\nPick your answer from:\n(A). World\n(B). Sports\n(C). Business\n(D). Science/Tech\nI think the answer is",
        "role": "user"
      },
      {
        "content": "Ooh ooh ah ah! *giggle* Let me think... Ooh ooh ah ah! *giggle* The best summary of this article is... *drumroll*... D! Science/Tech! Yay! *confetti* The Federal Trade Commission is talking about spyware, which is like a little computer bug that can watch what you do on your computer without you knowing. It's like a secret agent, but not a nice one! *gasp* The FTC wants to stop the spyware from doing its thing, so they're going after it! *cheer* Woohoo!",
        "role": "assistant"
      }
    ],
    "label": true
  },

多模态

Llama-factory支持多模态图像数据集、视频数据集以及音频数据集的输入。

图像数据集

多模态图像数据集需要额外添加一个 images 列，包含输入图像的路径。注意图片的数量必须与文本中所有标记的数量严格一致。

[
  {
    "instruction": "人类指令（必填）",
    "input": "人类输入（选填）",
    "output": "模型回答（必填）",
    "images": [
      "图像路径（必填）"
    ]
  }
]

示例：mllm_demo.json

[
  {
    "messages": [
      {
        "content": "<image>Who are they?",
        "role": "user"
      },
      {
        "content": "They're Kane and Gretzka from Bayern Munich.",
        "role": "assistant"
      },
      {
        "content": "What are they doing?",
        "role": "user"
      },
      {
        "content": "They are celebrating on the soccer field.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/1.jpg"
    ]
  },

外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传

视频数据集

多模态视频数据集需要额外添加一个 videos 列，包含输入视频的路径。注意视频的数量必须与文本中所有标记的数量严格一致。

[
  {
    "instruction": "人类指令（必填）",
    "input": "人类输入（选填）",
    "output": "模型回答（必填）",
    "videos": [
      "视频路径（必填）"
    ]
  }
]

示例：mllm_video_demo.json

[
  {
    "messages": [
      {
        "content": "<video>Why is this video funny?",
        "role": "user"
      },
      {
        "content": "Because a baby is reading, and he is so cute!",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/1.mp4"
    ]
  },

音频数据集

多模态音频数据集需要额外添加一个 audio 列，包含输入图像的路径。注意音频的数量必须与文本中所有标记的数量严格一致。

[
  {
    "instruction": "人类指令（必填）",
    "input": "人类输入（选填）",
    "output": "模型回答（必填）",
    "audios": [
      "音频路径（必填）"
    ]
  }
]

实例：mllm_audio_demo.json

[
  {
    "messages": [
      {
        "content": "<audio>What's that sound?",
        "role": "user"
      },
      {
        "content": "It is the sound of glass shattering.",
        "role": "assistant"
      }
    ],
    "audios": [
      "mllm_demo_data/1.mp3"
    ]
  },

ShareGPT格式

ShareGPT 格式中的 KTO数据集和多模态数据集与 Alpaca 格式的类似。
预训练数据集不支持 ShareGPT 格式。

指令监督微调

相比 alpaca 格式的数据集， sharegpt 格式支持更多的角色种类，例如 human、gpt、observation、function 等等。它们构成一个对象列表呈现在 conversations 列中。

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "人类指令"
      },
      {
        "from": "function_call",
        "value": "工具参数"
      },
      {
        "from": "observation",
        "value": "工具结果"
      },
      {
        "from": "gpt",
        "value": "模型回答"
      }
    ],
    "system": "系统提示词（选填）",
    "tools": "工具描述（选填）"
  }
]

其中 human 和 observation 必须出现在奇数位置，gpt 和 function 必须出现在偶数位置。

举例：

{
  "conversations": [
    {
      "from": "human",
      "value": "你好，我出生于1990年5月15日。你能告诉我我今天几岁了吗？"
    },
    {
      "from": "function_call",
      "value": "{\"name\": \"calculate_age\", \"arguments\": {\"birthdate\": \"1990-05-15\"}}"
    },
    {
      "from": "observation",
      "value": "{\"age\": 31}"
    },
    {
      "from": "gpt",
      "value": "根据我的计算，你今天31岁了。"
    }
  ],
  "tools": "[{\"name\": \"calculate_age\", \"description\": \"根据出生日期计算年龄\", \"parameters\": {\"type\": \"object\", \"properties\": {\"birthdate\": {\"type\": \"string\", \"description\": \"出生日期以YYYY-MM-DD格式表示\"}}, \"required\": [\"birthdate\"]}}]"
}

用户偏好训练

Sharegpt 格式的偏好数据集同样需要在 chosen 列中提供更优的消息，并在 rejected 列中提供更差的消息。

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "人类指令"
      },
      {
        "from": "gpt",
        "value": "模型回答"
      },
      {
        "from": "human",
        "value": "人类指令"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "优质回答"
    },
    "rejected": {
      "from": "gpt",
      "value": "劣质回答"
    }
  }
]

示例：

{
  "conversations": [
    {
      "from": "human",
      "value": "老虎会吃哪一个，草还是树叶？"
    }
  ],
  "chosen": {
    "from": "gpt",
    "value": "我很高兴能帮助你解答问题！不过，我必须指出，这个问题本身可能没有意义。老虎是食肉动物，通常不吃草或树叶。它们的饮食主要包括大型有蹄动物、小型哺乳动物和鸟类。\n\n如果你提出这个问题是出于好奇或有特定原因，我很乐意帮助你找到更多关于老虎饮食习惯的信息，或者建议资源让你更多了解这些壮丽的生物。请告诉我如何进一步协助你！"
  },
  "rejected": {
    "from": "gpt",
    "value": "一只老虎是食肉动物，这意味着它的饮食主要是肉类。老虎主要猎捕和吃其他动物，如鹿和野猪。它们不会选择吃草或树叶，因为这些不是它们自然饮食的一部分。"
  }
}

OpenAI格式

OpenAI 格式仅仅是 sharegpt 格式的一种特殊情况，其中第一条消息可能是系统提示词。

[
  {
    "messages": [
      {
        "role": "system",
        "content": "系统提示词（选填）"
      },
      {
        "role": "user",
        "content": "人类指令"
      },
      {
        "role": "assistant",
        "content": "模型回答"
      }
    ]
  }
]

总结

下面是两种数据集格式的详细对比，大家可以根据自己的实际需求场景选择合适的格式：

对比维度	Alpaca 格式	ShareGPT 格式
核心设计目标	单轮指令驱动任务（如问答、翻译、摘要）	多轮对话与工具调用（如聊天机器人、API 交互）
数据结构	以 instruction、input、output 为主体的 JSON 对象	以 conversations 列表为核心的多角色对话链（human/gpt/function_call/observation）
对话历史处理	通过 history 字段记录历史对话（格式：[[“指令”, “回答”], …]）	通过 conversations 列表顺序自然体现多轮对话（角色交替出现）
角色与交互逻辑	仅区分用户指令和模型输出，无显式角色标签	支持多种角色标签（如 human、gpt、function_call），强制奇偶位置规则
工具调用支持	不原生支持工具调用，需通过 input 或指令隐式描述	通过 function_call 和 observation 显式实现工具调用，支持外部 API 集成
典型应用场景	- 指令响应（如 Alpaca-7B） - 领域知识问答 - 文本结构化生成	- 多轮对话（如 Vicuna） - 客服系统 - 需实时数据查询的交互（如天气、计算）
优势	- 结构简洁，任务导向清晰 - 适合快速构建单轮任务数据集	- 支持复杂对话流与外部工具扩展 - 更贴近真实人机交互场景
局限	- 多轮对话需手动拼接 history - 缺乏动态工具交互能力	- 数据格式更复杂 - 需严格遵循角色位置规则