A bombshell for the open-source community: Moonshot AI has open-sourced its latest model, Kimi K2
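1. Model Introduction
Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, it delivers strong performance on frontier knowledge, reasoning, and coding tasks, and has been carefully optimized for agentic capabilities.
Key Features
- Large-scale training: a 1T-parameter MoE model pre-trained on 15.5T tokens, with training remaining stable throughout.
- MuonClip optimizer: the Muon optimizer is applied at an unprecedented scale, with new optimization techniques developed to resolve the instabilities that arise while scaling up.
- Agentic intelligence: designed specifically for tool use, reasoning, and autonomous problem solving.
Model Variants
- Kimi-K2-Base: the foundation model, a solid starting point for researchers and developers who want full control over fine-tuning and custom solutions.
- Kimi-K2-Instruct: the post-trained model, best suited for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.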
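2. Model Summary
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers (dense layer included) | 61 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension (per expert) | 2048 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 128K |
| Attention Mechanism | MLA (Multi-head Latent Attention) |
| Activation Function | SwiGLU |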
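To make the sparsity figures in the table concrete: each token is routed to 8 of the 384 experts, alongside 1 shared expert. The sketch below shows generic top-k gating in plain Python. The softmax over the selected logits and the helper name `route_token` are illustrative assumptions; the source does not describe the gating function Kimi K2 actually uses.

```python
import math
import random

NUM_EXPERTS = 384        # total routed experts (from the table)
EXPERTS_PER_TOKEN = 8    # experts selected per token (from the table)


def route_token(router_logits, k=EXPERTS_PER_TOKEN):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Generic top-k gating with a softmax over the selected logits.
    This is an assumption for illustration, not Kimi K2's documented gate.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    max_logit = max(router_logits[i] for i in top)
    # Subtract the max before exponentiating for numerical stability.
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Random logits stand in for the output of a learned router.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

print("experts chosen:", len(selected))
print(f"share of routed experts used per token: {EXPERTS_PER_TOKEN / NUM_EXPERTS:.2%}")
```

Only the selected experts run for a given token, which is why the activated parameter count (32B) is a small fraction of the total (1T).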
3. Evaluation Results
Instruction Model Evaluation Results
| Benchmark | Metric | Kimi K2 Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B (non-thinking) | Claude Sonnet 4 (w/o extended thinking) | Claude Opus 4 (w/o extended thinking) | GPT-4.1 | Gemini 2.5 Flash Preview (05-20) | 
|---|---|---|---|---|---|---|---|---|
| Coding Tasks | ||||||||
| LiveCodeBench v6 (Aug 24 - May 25) | Pass@1 | 53.7 | 46.9 | 37.0 | 48.5 | 47.4 | 44.7 | 44.7 | 
| OJBench | Pass@1 | 27.1 | 24.0 | 11.3 | 15.3 | 19.6 | 19.5 | 19.5 | 
| MultiPL-E | Pass@1 | 85.7 | 83.1 | 78.2 | 88.6 | 89.6 | 86.7 | 85.6 | 
| SWE-bench Verified (Agentless Coding) | Single Patch | 51.8 | 36.6 | 39.4 | 50.2 | 53.0 | 40.8 | 32.6 | 
| SWE-bench Verified (Agentic Coding) | Single Attempt (Acc) | 65.8 | 38.8 | 34.4 | 72.7* | 72.5* | 54.6 | — | 
| SWE-bench Verified (Agentic Coding) | Multiple Attempts (Acc) | 71.6 | — | — | 80.2 | 79.4* | — | — |
| SWE-bench Multilingual (Agentic Coding) | Single Attempt (Acc) | 47.3 | 25.8 | 20.9 | 51.0 | — | 31.5 | — | 
| TerminalBench | Inhouse Framework (Acc) | 30.0 | — | — | 35.5 | 43.2 | 8.3 | — | 
| TerminalBench | Acc | 25.0 | 16.3 | 6.6 | — | — | 30.3 | 16.8 |
| Aider-Polyglot | Acc | 60.0 | 55.1 | 61.8 | 56.4 | 70.7 | 52.4 | 44.0 | 
| Tool Use Tasks | ||||||||
| Tau2 retail | Avg@4 | 70.6 | 69.1 | 57.0 | 75.0 | 81.8 | 74.8 | 64.3 | 
| Tau2 airline | Avg@4 | 56.5 | 39.0 | 26.5 | 55.5 | 60.0 | 54.5 | 42.5 | 
| Tau2 telecom | Avg@4 | 65.8 | 32.5 | 22.1 | 45.2 | 57.0 | 38.6 | 16.9 | 
| AceBench | Acc | 76.5 | 72.7 | 70.5 | 76.2 | 75.6 | 80.1 | 74.5 | 
| Math & STEM Tasks | ||||||||
| AIME 2024 | Avg@64 | 69.6 | 59.4* | 40.1* | 43.4 | 48.2 | 46.5 | 61.3 | 
| AIME 2025 | Avg@64 | 49.5 | 46.7 | 24.7* | 33.1* | 33.9* | 37.0 | 46.6 | 
| MATH-500 | Acc | 97.4 | 94.0* | 91.2* | 94.0 | 94.4 | 92.4 | 95.4 | 
| HMMT 2025 | Avg@32 | 38.8 | 27.5 | 11.9 | 15.9 | 15.9 | 19.4 | 34.7 | 
| CNMO 2024 | Avg@16 | 74.3 | 74.7 | 48.6 | 60.4 | 57.6 | 56.6 | 75.0 | 
| PolyMath-en | Avg@4 | 65.1 | 59.5 | 51.9 | 52.8 | 49.8 | 54.0 | 49.9 | 
| ZebraLogic | Acc | 89.0 | 84.0 | 37.7* | 73.7 | 59.3 | 58.5 | 57.9 | 
| AutoLogi | Acc | 89.5 | 88.9 | 83.3 | 89.8 | 86.1 | 88.2 | 84.1 | 
| GPQA-Diamond | Avg@8 | 75.1 | 68.4* | 62.9* | 70.0* | 74.9* | 66.3 | 68.2 | 
| SuperGPQA | Acc | 57.2 | 53.7 | 50.2 | 55.7 | 56.5 | 50.8 | 49.6 | 
| Humanity's Last Exam (Text Only) | - | 4.7 | 5.2 | 5.7 | 5.8 | 7.1 | 3.7 | 5.6 | 
| General Tasks | ||||||||
| MMLU | EM | 89.5 | 89.4 | 87.0 | 91.5 | 92.9 | 90.4 | 90.1 | 
| MMLU-Redux | EM | 92.7 | 90.5 | 89.2 | 93.6 | 94.2 | 92.4 | 90.6 | 
| MMLU-Pro | EM | 81.1 | 81.2* | 77.3 | 83.7 | 86.6 | 81.8 | 79.4 | 
| IFEval | Prompt Strict | 89.8 | 81.1 | 83.2* | 87.6 | 87.4 | 88.0 | 84.3 | 
| Multi-Challenge | Acc | 54.1 | 31.4 | 34.0 | 46.8 | 49.0 | 36.4 | 39.5 | 
| SimpleQA | Correct | 31.0 | 27.7 | 13.2 | 15.9 | 22.8 | 42.3 | 23.3 | 
| Livebench | Pass@1 | 76.4 | 72.4 | 67.6 | 74.8 | 74.6 | 69.8 | 67.8 | 
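• Data points marked with * are taken directly from the model's technical report or blog.
• All metrics except SWE-bench Verified (Agentless) are evaluated with an 8k output token length. SWE-bench Verified (Agentless) is limited to a 16k output token length.
• Kimi K2 achieves 65.8% pass@1 on SWE-bench Verified with bash/editor tools (single-attempt patches, no test-time compute). Under the same conditions, it achieves 47.3% pass@1 on SWE-bench Multilingual. We additionally report SWE-bench Verified results (71.6%) that use parallel test-time compute: multiple sequences are sampled and the best one is selected by an internal scoring model.
• To ensure evaluation stability, we use avg@k on AIME, HMMT, CNMO, PolyMath-en, GPQA-Diamond, EvalPlus, and Tau2.
• Some data points have been omitted because evaluation was prohibitively expensive.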
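The notes above mention two sampling-based procedures: avg@k scoring and picking the best of several sampled patches with a scoring model. A minimal sketch of both follows, assuming avg@k means the per-problem score averaged over k independent samples and then averaged over problems. The function names and the toy scorer are hypothetical; the internal scoring model is not described in the source.

```python
from statistics import mean
from typing import Callable, Sequence


def avg_at_k(scores_per_problem: Sequence[Sequence[float]]) -> float:
    """Mean over problems of the mean score across each problem's k samples."""
    return mean(mean(samples) for samples in scores_per_problem)


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the scoring function rates highest."""
    return max(candidates, key=score)


# Three problems, k = 4 samples each; 1.0 = correct, 0.0 = incorrect.
runs = [
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
print(f"avg@4 = {avg_at_k(runs):.3f}")

# Toy scorer (string length) standing in for a real scoring model.
patches = ["patch-a", "patch-bb", "patch-c"]
print("selected:", best_of_n(patches, score=len))
```

Averaging over k samples reduces the run-to-run variance of a single-sample score, which matters most on small benchmarks such as AIME.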
Base Model Evaluation Results
| Benchmark | Metric | Shot | Kimi K2 Base | Deepseek-V3-Base | Qwen2.5-72B | Llama 4 Maverick | 
|---|---|---|---|---|---|---|
| General Tasks | ||||||
| MMLU | EM | 5-shot | 87.8 | 87.1 | 86.1 | 84.9 | 
| MMLU-pro | EM | 5-shot | 69.2 | 60.6 | 62.8 | 63.5 | 
| MMLU-redux-2.0 | EM | 5-shot | 90.2 | 89.5 | 87.8 | 88.2 | 
| SimpleQA | Correct | 5-shot | 35.3 | 26.5 | 10.3 | 23.7 | 
| TriviaQA | EM | 5-shot | 85.1 | 84.1 | 76.0 | 79.3 | 
| GPQA-Diamond | Avg@8 | 5-shot | 48.1 | 50.5 | 40.8 | 49.4 | 
| SuperGPQA | EM | 5-shot | 44.7 | 39.2 | 34.2 | 38.8 | 
| Code Tasks | ||||||
| LiveCodeBench v6 | Pass@1 | 1-shot | 26.3 | 22.9 | 21.1 | 25.1 | 
| EvalPlus | Pass@1 | - | 80.3 | 65.6 | 66.0 | 65.5 | 
| Mathematics Tasks | ||||||
| MATH | EM | 4-shot | 70.2 | 60.1 | 61.0 | 63.0 | 
| GSM8k | EM | 8-shot | 92.1 | 91.7 | 90.4 | 86.3 | 
| Chinese Tasks | ||||||
| C-Eval | EM | 5-shot | 92.5 | 90.0 | 90.9 | 80.9 | 
| CSimpleQA | Correct | 5-shot | 77.6 | 72.1 | 50.5 | 53.5 | 
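• All models are evaluated using the same evaluation protocol.
4. Deployment
[!NOTE]
You can access Kimi K2's API at https://platform.moonshot.ai, which provides OpenAI/Anthropic-compatible APIs. The Anthropic-compatible API maps the temperature as real_temperature = request_temperature * 0.6 for better compatibility with existing applications.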
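The temperature mapping stated above is plain arithmetic, sketched here in both directions. The function names are hypothetical, and the snippet assumes the service applies nothing beyond the stated 0.6 factor (for example, no clamping), which the source does not confirm.

```python
ANTHROPIC_TEMPERATURE_SCALE = 0.6  # factor stated for the Anthropic-compatible API


def effective_temperature(request_temperature: float) -> float:
    """Temperature applied by the service for a given request value."""
    return request_temperature * ANTHROPIC_TEMPERATURE_SCALE


def request_temperature_for(target_temperature: float) -> float:
    """Inverse mapping: the value to send so that the target is applied."""
    return target_temperature / ANTHROPIC_TEMPERATURE_SCALE


# A client that sends 1.0 ends up sampling at 0.6.
print(effective_temperature(1.0))
# To sample at 0.6, send the value returned here.
print(round(request_temperature_for(0.6), 6))
```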
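Our model checkpoints are stored in block-fp8 format and are available on Hugging Face.
We currently recommend running Kimi-K2 on the following inference engines:
- vLLM
- SGLang
- KTransformers
- TensorRT-LLM
For deployment examples with vLLM and SGLang, see the Model Deployment Guide.
5. Model Usage
Chat Completion
Once the local inference service is up, you can interact with it through the chat endpoint: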
from openai import OpenAI

def simple_chat(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": [{"type": "text", "text": "Please give a brief self-introduction."}]},
    ]
    # Non-streaming request at the recommended temperature.
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        temperature=0.6,
        max_tokens=256,
    )
    print(response.choices[0].message.content)
[!NOTE]
The recommended temperature for Kimi-K2-Instruct is temperature = 0.6.
Unless you have special requirements, the system prompt above is a good default.
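For readers who want to see what goes over the wire, here is a standard-library sketch of the same chat request against an OpenAI-compatible `/chat/completions` endpoint. `BASE_URL` and `MODEL_NAME` are placeholder assumptions, as are the helper names; substitute the address and model name your own server reports.

```python
import json
import urllib.request

# Assumptions: placeholder address and model name, not values from the source.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "kimi-k2-instruct"


def build_chat_request(user_text: str,
                       temperature: float = 0.6,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system",
             "content": "You are Kimi, an AI assistant created by Moonshot AI."},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature,  # recommended value for Kimi-K2-Instruct
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Please give a brief self-introduction.")
print(request.get_method(), request.full_url)
```

Calling `send(request)` needs a running server, so the snippet stops after building the request.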
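Tool Calling
Kimi-K2-Instruct has strong tool-calling capabilities.
To enable them, pass the list of available tools in every request; the model then decides on its own when and how to call them.
The following example walks through an end-to-end weather tool call: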
import json
from openai import OpenAI

# Your tool implementation
def get_weather(city: str) -> dict:
    return {"weather": "Sunny"}

# Tool schema definition
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information. Call this when the user asks about the weather.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "Name of the city"
                }
            }
        }
    }
}]

# Map tool names to their implementations
tool_map = {
    "get_weather": get_weather
}

def tool_call_with_client(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "What's the weather like in Beijing today? Use the tool to check."}
    ]
    finish_reason = None
    # Keep querying until the model stops requesting tool calls.
    while finish_reason is None or finish_reason == "tool_calls":
        completion = client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=0.6,
            tools=tools,          # tool list defined above
            tool_choice="auto"
        )
        choice = completion.choices[0]
        finish_reason = choice.finish_reason
        if finish_reason == "tool_calls":
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                tool_call_name = tool_call.function.name
                tool_call_arguments = json.loads(tool_call.function.arguments)
                tool_function = tool_map[tool_call_name]
                tool_result = tool_function(**tool_call_arguments)
                print("tool_result:", tool_result)
                # Return the tool output to the model as a "tool" message.
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "name": tool_call_name,
                    "content": json.dumps(tool_result)
                })
    print("-" * 100)
    print(choice.message.content)
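The tool_call_with_client function implements the full pipeline from user query to tool execution.
This pipeline requires the inference engine to support Kimi-K2's native tool-parsing logic.
For streaming output and manual tool parsing, see the Tool Calling Guide.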
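The dispatch step inside the loop can be exercised without a model or server. The sketch below isolates it as a hypothetical helper, `run_tool_call`, and adds handling for unknown tool names and malformed arguments. That error handling is an addition for illustration and not part of the original example; reporting errors back as tool content is one possible design, not a documented requirement.

```python
import json


def get_weather(city: str) -> dict:
    """Stub tool, mirroring the weather example."""
    return {"weather": "Sunny"}


TOOL_MAP = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments_json: str) -> dict:
    """Execute one tool call and build the `tool` message for the conversation."""
    tool = TOOL_MAP.get(name)
    if tool is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            # Arguments arrive as a JSON string and are unpacked as keyword arguments.
            result = tool(**json.loads(arguments_json))
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON, or arguments that do not match the tool's signature.
            result = {"error": f"bad arguments: {exc}"}
    return {
        "role": "tool",
        "tool_call_id": call_id,
        "name": name,
        "content": json.dumps(result),
    }


ok = run_tool_call("call_1", "get_weather", '{"city": "Beijing"}')
unknown = run_tool_call("call_2", "get_time", "{}")
print(ok["content"])
print(unknown["content"])
```

Returning the error as tool content, rather than raising, lets the model see the failure and retry or explain it to the user.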
6. License
Both the code repository and the model weights are released under the Modified MIT License.
