AI Large Model Development Notes: A Technical Comparison of MoE Models (Mixtral, Qwen2-MoE, DeepSeek-V3)
MoE (Mixture-of-Experts) large language models are booming. In this post we compare three open MoE LLMs: Mixtral, Qwen2-MoE, and the recently released DeepSeek-V3. From Mixtral kicking off the trend in 2023 to DeepSeek-V3 drawing worldwide attention in 2024, how have MoE models evolved, and why has DeepSeek unsettled Silicon Valley? Let's take a look.
1 The Qwen2.5-MoE Model
Technical reports:
Qwen2:https://arxiv.org/pdf/2407.10671
Qwen2.5:https://arxiv.org/pdf/2412.15115
1.1 The Qwen Model Family
Figure from: Qwen2.5 Technical Report
According to the Qwen2.5 technical report, the dense models in Alibaba Cloud's Qwen2.5 family are open-weight, while the MoE models, Qwen2.5-Turbo and Qwen2.5-Plus, are hosted only on Alibaba Cloud. Their reported performance is comparable to gpt-4o-mini and gpt-4o respectively.
Both the Qwen2 and Qwen1.5 families do include open-weight MoE models, for example:
https://huggingface.co/Qwen/Qwen2-57B-A14B
https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B
So we can use the Qwen2-MoE code together with the Qwen2.5 technical report to understand Qwen2.5-MoE.
1.2 Qwen MoE Architecture and Techniques
Figure from: Qwen2.5 Technical Report
Qwen2.5-MoE highlights:
- MoE layers replace the FFN layers of the dense model
- Each MoE layer contains shared experts and routed experts (the routing mechanism assigns each token to its top-k experts)
- Fine-grained expert segmentation is used
Fine-grained expert segmentation
This technique is borrowed from the DeepSeekMoE model. Its goal is to increase the expressive power of an MoE model while keeping the parameter count and compute unchanged, enabling more accurate and targeted knowledge acquisition.
The implementation is as follows:
- Reduce the intermediate hidden dimension of each expert FFN to 1/m of the original and split each expert into m smaller experts, so the number of experts grows from N to mN.
- Increase the number of non-zero gates from K to mK (top-K becomes top-mK).
- Initialize the new experts from the dense model (shuffle and select parameters, then randomly re-initialize 50% of them; see Section 3).
Its characteristics:
- Parameter count and compute stay the same, while expressiveness increases substantially
- Combination flexibility grows dramatically: with N=16 and K=2, the number of possible expert combinations grows from 120 to 4,426,165,368 (see the quick check below)
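As a quick sanity check of those combination counts, they can be reproduced with plain Python (N, K and m below are the example values from the text, not Qwen2-MoE config values):
from math import comb

N, K, m = 16, 2, 4            # original expert count, top-K, and segmentation factor from the example above
print(comb(N, K))             # 120: possible expert combinations before fine-grained segmentation
print(comb(m * N, m * K))     # 4426165368: combinations with m*N = 64 experts and top-(m*K) = top-8 routing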
Shared experts routing:
One or more shared experts are added alongside the routed experts to handle common knowledge or information (a conceptual sketch follows the figure below).
Its characteristics:
- There are usually one or more shared experts
- They are dedicated to capturing and consolidating knowledge that is common across contexts, which in turn reduces parameter redundancy among the routed experts
- The shared-expert isolation idea was prototyped by Rajbhandari et al. at Microsoft in 2022 (DeepSpeed-MoE: https://arxiv.org/pdf/2201.05596)
Figure from: the DeepSeekMoE paper
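Conceptually, the output of an MoE layer with a shared expert is the shared expert's output plus the weighted outputs of the selected routed experts. A minimal per-token sketch (illustrative only; the real, batched implementations appear in the code sections below):
import torch

def moe_forward(x, shared_expert, routed_experts, gate, top_k):
    # x: a single token vector of shape (hidden_size,); gate: nn.Linear(hidden_size, num_routed_experts)
    scores = torch.softmax(gate(x), dim=-1)               # routing scores over all routed experts
    weights, idx = torch.topk(scores, top_k)               # keep only the top-k routed experts
    routed = sum(w * routed_experts[i](x) for w, i in zip(weights, idx.tolist()))
    return routed + shared_expert(x)                        # the shared expert processes every token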
1.3 Qwen2-MoE Code
Judging from the Qwen2-MoE code, its implementation borrows from the Mixtral code, while its improvements follow DeepSeekMoE.
On top of Mixtral it adds:
- One shared expert (equivalent in width to 4 routed experts)
- Fine-grained expert segmentation
The main code is as follows:
- Code 1: Model main class
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L891
class Qwen2MoeModel(Qwen2MoePreTrainedModel):
    def __init__(self, config: Qwen2MoeConfig):
        ...
        # Stack of decoder layers; the MoE block lives inside each layer's MLP
        self.layers = nn.ModuleList(
            [Qwen2MoeDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )

    def forward():
        ...
        # Collect each layer's router logits so the auxiliary balancing loss can be computed later
        if output_router_logits and layer_outputs[-1] is not None:
            all_router_logits += (layer_outputs[-1],)
- Code 2: Decoder Layer
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L668
class Qwen2MoeDecoderLayer(nn.Module):
    def __init__(self, config: Qwen2MoeConfig, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
        # Use the sparse MoE block as this layer's MLP only on every `decoder_sparse_step`-th layer;
        # otherwise fall back to a regular dense MLP
        if (layer_idx not in config.mlp_only_layers) and (
            config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0
        ):
            self.mlp = Qwen2MoeSparseMoeBlock(config)
        else:
            self.mlp = Qwen2MoeMLP(config, intermediate_size=config.intermediate_size)

    def forward():
        hidden_states = self.input_layernorm(hidden_states)
        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn()
        # Fully Connected
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        # The sparse MoE block returns (hidden_states, router_logits); the dense MLP returns a plain tensor
        if isinstance(hidden_states, tuple):
            hidden_states, router_logits = hidden_states
        else:
            router_logits = None
        outputs = (hidden_states,)
        if output_router_logits:
            outputs += (router_logits,)
- Code 3: Qwen2MoeSparseMoeBlock (the core MoE code)
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L606
class Qwen2MoeSparseMoeBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_experts = config.num_experts
        self.top_k = config.num_experts_per_tok
        self.norm_topk_prob = config.norm_topk_prob

        # gating
        self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False)
        self.experts = nn.ModuleList(
            [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)]
        )

        # Compared with Mixtral, Qwen2-MoE adds a shared_expert and a shared_expert_gate
        self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size)
        self.shared_expert_gate = torch.nn.Linear(config.hidden_size, 1, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """ """
        batch_size, sequence_length, hidden_dim = hidden_states.shape
        hidden_states = hidden_states.view(-1, hidden_dim)
        # router_logits: (batch * sequence_length, n_experts)
        router_logits = self.gate(hidden_states)

        routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
        routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
        if self.norm_topk_prob:
            routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
        # we cast back to the input dtype
        routing_weights = routing_weights.to(hidden_states.dtype)

        final_hidden_states = torch.zeros(
            (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype, device=hidden_states.device
        )

        # One hot encode the selected experts to create an expert mask
        # this will be used to easily index which expert is going to be solicited
        expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0)

        # Loop over all available experts in the model and perform the computation on each expert
        for expert_idx in range(self.num_experts):
            expert_layer = self.experts[expert_idx]
            idx, top_x = torch.where(expert_mask[expert_idx])

            # Index the correct hidden states and compute the expert hidden state for
            # the current expert. We need to make sure to multiply the output hidden
            # states by `routing_weights` on the corresponding tokens (top-1 and top-2)
            current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
            current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None]

            # However `index_add_` only support torch tensors for indexing so we'll use
            # the `top_x` tensor here.
            final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))

        # Merge in the shared expert's output (this is where Qwen2-MoE differs from DeepSeek)
        shared_expert_output = self.shared_expert(hidden_states)
        shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output

        final_hidden_states = final_hidden_states + shared_expert_output

        final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)
        return final_hidden_states, router_logits
Q&A: what is the difference between shared_expert_gate and the router gate?
The router gate is the core component of the whole MoE layer: it dynamically selects experts for each token and weights the experts' outputs with the routing weights.
The shared_expert_gate serves only the shared expert: it dynamically scales the shared expert's output so the shared expert adapts better to each token.
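The shapes make the difference clear; a minimal sketch (the hidden size and expert count below are illustrative values, not taken from a specific config):
import torch
import torch.nn as nn

hidden_size, num_experts, top_k = 2048, 60, 4     # illustrative values
x = torch.randn(10, hidden_size)                  # 10 tokens

router_gate = nn.Linear(hidden_size, num_experts, bias=False)
shared_expert_gate = nn.Linear(hidden_size, 1, bias=False)

# Router gate: a distribution over all routed experts per token, used to pick and weight the top-k experts
router_probs = torch.softmax(router_gate(x), dim=-1)              # (10, 60)
topk_weights, topk_idx = torch.topk(router_probs, top_k, dim=-1)

# Shared expert gate: a single sigmoid scalar per token that merely scales the shared expert's output
shared_scale = torch.sigmoid(shared_expert_gate(x))               # (10, 1)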
- Code 4: Definition of the expert layer (same as Mixtral)
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L267
# Modified from transformers.models.mistral.modeling_mistral.MistralMLP with Mistral->Qwen2Moe
class Qwen2MoeMLP(nn.Module):
    def __init__(self, config, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        # SwiGLU-style FFN: down_proj(act(gate_proj(x)) * up_proj(x))
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
- Code 5: The loss computation (same as Mixtral)
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L1328C1-L1341C119
class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin):
    _tied_weights_keys = ["lm_head.weight"]
    _tp_plan = {"lm_head": "colwise_rep"}

    def __init__(self, config):
        super().__init__(config)
        self.model = Qwen2MoeModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.router_aux_loss_coef = config.router_aux_loss_coef
        self.num_experts = config.num_experts
        self.num_experts_per_tok = config.num_experts_per_tok

    def forward():
        loss = None
        if labels is not None:
            loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)

        aux_loss = None
        if output_router_logits:
            # Load-balancing auxiliary loss computed from the router logits of all MoE layers
            aux_loss = load_balancing_loss_func(
                outputs.router_logits if return_dict else outputs[-1],
                self.num_experts,
                self.num_experts_per_tok,
                attention_mask,
            )
            if labels is not None:
                loss += self.router_aux_loss_coef * aux_loss.to(loss.device)  # make sure to reside in the same device
- Code 6: Balancing-loss implementation (same as Mixtral)
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L65
# Copied from transformers.models.mixtral.modeling_mixtral.load_balancing_loss_func
def load_balancing_loss_func()
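The body of the function is omitted above. A simplified sketch of the Switch-Transformer-style balancing loss it computes (ignoring the attention-mask handling in the real transformers implementation) could look like this; treat it as an approximation, not a verbatim copy of the library code:
import torch
import torch.nn.functional as F

def simple_load_balancing_loss(router_logits, num_experts, top_k):
    # router_logits: (num_tokens, num_experts), typically concatenated over all MoE layers
    probs = F.softmax(router_logits, dim=-1)
    _, selected = torch.topk(probs, top_k, dim=-1)
    expert_mask = F.one_hot(selected, num_experts)                 # (num_tokens, top_k, num_experts)
    tokens_per_expert = expert_mask.float().mean(dim=(0, 1))       # fraction of routing slots each expert receives
    router_prob_per_expert = probs.mean(dim=0)                     # average router probability per expert
    # The loss is minimized when both distributions are uniform, i.e. the load is balanced
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)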
2 DeepSeek V3 (MoE)
2.1 DeepSeek-V3 Model Architecture
Figure: the DeepSeek-V3 architecture
Highlights:
- Structure: 1 shared expert + 256 routed experts
- Structure: fine-grained expert segmentation
- Loss: auxiliary-loss-free load balancing
- Loss: complementary sequence-wise auxiliary loss
- The gate scores are computed with sigmoid, whereas Mixtral and Qwen2-MoE use softmax
- Unlike Qwen2-MoE, the shared expert's output is added to the routed output directly, without any gating; Qwen2-MoE scales it with a shared_expert_gate
# DeepSeek-v3 config
"n_group": 8,
"n_routed_experts": 256,
"n_shared_experts": 1,
"norm_topk_prob": true,
"num_attention_heads": 128,
"num_experts_per_tok": 8,
Figure: the DeepSeekMoE MoE processing scheme
2.2 DeepSeek-V3 Model Code
A few caveats about the DeepSeek-V3 code:
- It contains only inference code; the training code is missing
- The DeepSeek-V2.5 code is more complete and can be used as a reference
DeepSeek-v3:
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/modeling_deepseek.py
DeepSeek-v2.5:
https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210/blob/main/modeling_deepseek.py
The code below combines V3 and V2.5.
- Code 1: Model main class (similar to V2.5)
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L1347
class DeepseekV3Model(DeepseekV3PreTrainedModel):
    def __init__(self, config: DeepseekV3Config):
        ...
        self.layers = nn.ModuleList(
            [
                DeepseekV3DecoderLayer(config, layer_idx)
                for layer_idx in range(config.num_hidden_layers)
            ]
        )
- Code 2: Decoder Layer (similar to V2.5)
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L1143
class DeepseekV3DecoderLayer(nn.Module):
    def __init__(self, config: DeepseekV3Config, layer_idx: int):
        ...
        # The first `first_k_dense_replace` layers keep a dense MLP; after that,
        # every `moe_layer_freq`-th layer uses an MoE block
        self.mlp = (
            DeepseekV3MoE(config)
            if (
                config.n_routed_experts is not None
                and layer_idx >= config.first_k_dense_replace
                and layer_idx % config.moe_layer_freq == 0
            )
            else DeepseekV3MLP(config)
        )

    def forward():
        ...
        hidden_states = self.mlp(hidden_states)
- Code 3: Core MoE code
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L476
class DeepseekV3MoE(nn.Module):
    def __init__(self, config):
        ...
        self.experts = nn.ModuleList(
            [
                DeepseekV3MLP(
                    config, intermediate_size=config.moe_intermediate_size
                )
                for i in range(config.n_routed_experts)
            ]
        )
        self.gate = MoEGate(config)
        if config.n_shared_experts is not None:
            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
            self.shared_experts = DeepseekV3MLP(
                config=config, intermediate_size=intermediate_size
            )

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        topk_idx, topk_weight = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
        if not self.training:  # the released code lacks the training branch
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.config.n_shared_experts is not None:
            y = y + self.shared_experts(identity)  # the shared experts' output is added to y directly
        return y
DeepSeek-V2.5:
https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210/blob/6f134cbe88cb9284a8ce696e8ac8eefd0bc24ede/modeling_deepseek.py#L521
class DeepseekV2MoE(nn.Module):
    """
    A mixed expert module containing shared experts.
    """

    def __init__(self, config):
        ...
        self.experts = nn.ModuleList(
            [
                DeepseekV2MLP(
                    config, intermediate_size=config.moe_intermediate_size
                )
                for i in range(config.n_routed_experts)
            ]
        )
        self.gate = MoEGate(config)
        if config.n_shared_experts is not None:
            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
            self.shared_experts = DeepseekV2MLP(
                config=config, intermediate_size=intermediate_size
            )

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        # In V2.5 the gate also returns the auxiliary loss
        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
        if self.training:
            # Training path: duplicate each token top_k times, run the matching expert on each copy,
            # then combine the copies with the routing weights
            hidden_states = hidden_states.repeat_interleave(
                self.num_experts_per_tok, dim=0
            )
            y = torch.empty_like(hidden_states)
            for i, expert in enumerate(self.experts):
                y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i])
            y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
            y = y.to(hidden_states.dtype).view(*orig_shape)
            # Attach the auxiliary loss to the graph without changing the value of y
            y = AddAuxiliaryLoss.apply(y, aux_loss)
        else:
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.config.n_shared_experts is not None:
            y = y + self.shared_experts(identity)
        return y
- Code 4: Definition of the expert layer
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L374
class DeepseekV3MLP(nn.Module):
    def __init__(self, config, hidden_size=None, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
        self.intermediate_size = (
            config.intermediate_size if intermediate_size is None else intermediate_size
        )
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj
- Code 5: Definition of the gate layer and the loss computation
Since DeepSeek-V3 does not provide the loss computation, we can refer to the DeepSeek-V2.5 implementation.
In DeepSeek-V2.5 the auxiliary loss is computed inside the Gate class.
Gate layer (DeepSeek-V2.5):
https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210/blob/6f134cbe88cb9284a8ce696e8ac8eefd0bc24ede/modeling_deepseek.py#L393
class MoEGate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.top_k = config.num_experts_per_tok
        self.n_routed_experts = config.n_routed_experts
        self.routed_scaling_factor = config.routed_scaling_factor
        self.scoring_func = config.scoring_func
        self.alpha = config.aux_loss_alpha
        self.seq_aux = config.seq_aux
        self.topk_method = config.topk_method
        self.n_group = config.n_group
        self.topk_group = config.topk_group

        # topk selection algorithm
        self.norm_topk_prob = config.norm_topk_prob
        self.gating_dim = config.hidden_size
        self.weight = nn.Parameter(
            torch.empty((self.n_routed_experts, self.gating_dim))
        )
        self.reset_parameters()

    def reset_parameters(self) -> None:
        import torch.nn.init as init

        init.kaiming_uniform_(self.weight, a=math.sqrt(5))

    def forward(self, hidden_states):
        bsz, seq_len, h = hidden_states.shape
        ### compute gating score
        hidden_states = hidden_states.view(-1, h)
        logits = F.linear(
            hidden_states.type(torch.float32), self.weight.type(torch.float32), None
        )
        if self.scoring_func == "softmax":
            scores = logits.softmax(dim=-1, dtype=torch.float32)
        else:
            raise NotImplementedError(
                f"insupportable scoring function for MoE gating: {self.scoring_func}"
            )

        ### select top-k experts
        if self.topk_method == "greedy":
            topk_weight, topk_idx = torch.topk(
                scores, k=self.top_k, dim=-1, sorted=False
            )
        elif self.topk_method == "group_limited_greedy":
            group_scores = (
                scores.view(bsz * seq_len, self.n_group, -1).max(dim=-1).values
            )  # [n, n_group]
            group_idx = torch.topk(
                group_scores, k=self.topk_group, dim=-1, sorted=False
            )[
                1
            ]  # [n, top_k_group]
            group_mask = torch.zeros_like(group_scores)  # [n, n_group]
            group_mask.scatter_(1, group_idx, 1)  # [n, n_group]
            score_mask = (
                group_mask.unsqueeze(-1)
                .expand(
                    bsz * seq_len, self.n_group, self.n_routed_experts // self.n_group
                )
                .reshape(bsz * seq_len, -1)
            )  # [n, e]
            tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)  # [n, e]
            topk_weight, topk_idx = torch.topk(
                tmp_scores, k=self.top_k, dim=-1, sorted=False
            )

        ### norm gate to sum 1
        if self.top_k > 1 and self.norm_topk_prob:
            denominator = topk_weight.sum(dim=-1, keepdim=True) + 1e-20
            topk_weight = topk_weight / denominator
        else:
            topk_weight = topk_weight * self.routed_scaling_factor

        ### expert-level computation auxiliary loss
        if self.training and self.alpha > 0.0:
            scores_for_aux = scores
            aux_topk = self.top_k
            # always compute aux loss based on the naive greedy topk method
            topk_idx_for_aux_loss = topk_idx.view(bsz, -1)
            if self.seq_aux:
                scores_for_seq_aux = scores_for_aux.view(bsz, seq_len, -1)
                ce = torch.zeros(
                    bsz, self.n_routed_experts, device=hidden_states.device
                )
                ce.scatter_add_(
                    1,
                    topk_idx_for_aux_loss,
                    torch.ones(bsz, seq_len * aux_topk, device=hidden_states.device),
                ).div_(seq_len * aux_topk / self.n_routed_experts)
                aux_loss = (ce * scores_for_seq_aux.mean(dim=1)).sum(
                    dim=1
                ).mean() * self.alpha
            else:
                mask_ce = F.one_hot(
                    topk_idx_for_aux_loss.view(-1), num_classes=self.n_routed_experts
                )
                ce = mask_ce.float().mean(0)
                Pi = scores_for_aux.mean(0)
                fi = ce * self.n_routed_experts
                aux_loss = (Pi * fi).sum() * self.alpha
        else:
            aux_loss = None
        return topk_idx, topk_weight, aux_loss
The DeepSeek-V3 Gate class code:
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L393
3 How the Experts Are Initialized
3.1 Qwen2 Model Initialization
- Standard MoE case: all experts are initialized from the parameters of the dense model.
- With fine-grained expert segmentation:
  - Parameter copying: the initial parameters are taken directly from the existing dense model; these copies form the basis for initializing the fine-grained experts.
  - Parameter shuffling and selection: because the fine-grained experts are narrower than the original dense FFN, the original parameters are shuffled along the intermediate dimension before slicing. This adapts them to the new dimensions and adds diversity across experts.
  - Random re-initialization of part of the parameters: after shuffling and selection, 50% of the selected parameters are randomly re-initialized, further increasing parameter diversity across the fine-grained experts and improving robustness and generalization on diverse inputs.
This initialization lets the fine-grained experts inherit the knowledge of the original dense model, while the added randomness and tailored adjustments help them adapt to specific tasks or datasets (see the sketch below).
Source: Qwen2 Technical Report: https://arxiv.org/pdf/2407.10671
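A hedged sketch of this upcycling recipe for a single fine-grained expert (the report only describes the procedure in prose, so the slicing, shuffling and 50% re-initialization below are one possible reading of it, with illustrative names):
import torch

def init_fine_grained_expert(dense_ffn_weight, m, slice_id, reinit_ratio=0.5, std=0.02):
    # dense_ffn_weight: (intermediate_size, hidden_size) weight copied from the dense model's FFN
    intermediate_size, _ = dense_ffn_weight.shape
    # Shuffle along the intermediate dimension so each small expert ends up with a different slice
    perm = torch.randperm(intermediate_size)
    shuffled = dense_ffn_weight[perm]
    # Keep a 1/m slice of the intermediate dimension for this fine-grained expert
    slice_size = intermediate_size // m
    w = shuffled[slice_id * slice_size:(slice_id + 1) * slice_size].clone()
    # Randomly re-initialize 50% of the copied parameters to increase diversity across experts
    mask = torch.rand_like(w) < reinit_ratio
    w[mask] = torch.randn(int(mask.sum()), device=w.device) * std
    return w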
3.2 DeepSeekMoE Model Initialization
All model parameters are randomly initialized ("For initialization, all learnable parameters are randomly initialized with a standard deviation of 0.006.")
https://arxiv.org/pdf/2401.06066v1
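In code this is just normal initialization with std = 0.006, roughly what a typical HF-style _init_weights hook does (sketch, not the released code):
import torch.nn as nn

def init_weights(module, std=0.006):
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()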
4 Comparing the Three Models
Mixtral:
- Experts: 8 experts, top-K = 2
- Loss: load-balancing loss
Qwen2-MoE:
- Experts: 60 routed experts plus 1 shared expert (equivalent in width to 4 routed experts), top-K = 4
- Loss: load-balancing loss
- Combination: sigmoid-gated shared expert + softmax-weighted routed experts
Basic idea: 64 experts in total, i.e. 16 original experts with a fine-grained segmentation factor of 4. To keep the parameter count and compute unchanged, only 60 routed experts are used and the shared expert accounts for the remaining 4. The "4 experts" is purely a parameter-count notion: there is actually a single shared-expert MLP whose intermediate size is four times moe_intermediate_size (see the check below).
Config (Qwen1.5-MoE-A2.7B): "moe_intermediate_size": 1408, "shared_expert_intermediate_size": 5632, "num_experts_per_tok": 4, "num_experts": 60
https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat/blob/main/config.json
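The arithmetic checks out against the config values quoted above:
moe_intermediate_size = 1408
shared_expert_intermediate_size = 5632
num_experts = 60

# The single shared expert is as wide as 4 routed experts ...
assert shared_expert_intermediate_size == 4 * moe_intermediate_size
# ... so in parameter terms the layer holds 60 + 4 = 64 fine-grained experts,
# i.e. 16 original-width experts each split into 4.
print(num_experts + shared_expert_intermediate_size // moe_intermediate_size)   # 64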
DeepSeek-V3:
- Experts: 256 routed experts (with fine-grained expert segmentation) plus 1 shared expert, top-K = 8
- Loss 1: load-balancing loss (the improved, auxiliary-loss-free version)
- Loss 2: complementary sequence-wise auxiliary loss
- Combination: ungated shared expert + sigmoid-weighted routed experts
5 Effect of the Main Improvements (Experimental Comparison)
Adding a shared expert, plus fine-grained expert segmentation (results from the DeepSeekMoE technical report)
From these results we can see:
- Adding a shared expert brings a modest improvement overall, with large gains on some datasets
- Increasing the number of experts via fine-grained expert segmentation (splitting one expert into 2 or 4 smaller experts) improves performance more noticeably, with the parameter count unchanged
- On the ratio of shared to routed experts: the overall impact is small, and 1:3 is recommended (although judging from the results, 1:7 seems slightly better)
For detailed results, see: https://arxiv.org/pdf/2401.06066
Loss: the impact of the aux-loss-free strategy
Results from: https://arxiv.org/pdf/2412.19437v1
Loss: batch-wise vs. sequence-wise load balancing
Batch-wise load balancing shows better results; the sequence-wise constraint is stricter and limits expert specialization. Batch-wise balancing has its own challenges, though:
- (1) load imbalance within particular sequences or small batches
- (2) load imbalance at inference time caused by domain shift