AI Large Model Development Notes: A Technical Comparison of MoE Models (Mixtral, Qwen2-MoE, DeepSeek-V3)
MoE (Mixture-of-Experts) large language models are booming. In this post we compare three open MoE LLMs: Mixtral, Qwen2-MoE, and the recently released DeepSeek-V3. From Mixtral kicking off the trend in 2023 to DeepSeek-V3 drawing worldwide attention in 2024, how have MoE models evolved, and why has DeepSeek unsettled Silicon Valley? Let's take a look.
1 The Qwen2.5-MoE Model
Technical reports:
Qwen2:https://arxiv.org/pdf/2407.10671
Qwen2.5:https://arxiv.org/pdf/2412.15115
1.1 The Qwen Model Family
Figure from: Qwen2.5 Technical Report
According to the Qwen2.5 technical report, the dense models in Alibaba Cloud's Qwen2.5 family are open-weight, while the MoE models, Qwen2.5-Turbo and Qwen2.5-Plus, are hosted only on Alibaba Cloud. Their reported performance is comparable to gpt-4o-mini and gpt-4o respectively.
Both the Qwen2 and Qwen1.5 families do include open-weight MoE models, for example:
https://huggingface.co/Qwen/Qwen2-57B-A14B
https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B
So we can use the Qwen2-MoE code together with the Qwen2.5 technical report to understand Qwen2.5-MoE.
1.2 Qwen MoE Architecture and Techniques
Figure from: Qwen2.5 Technical Report
Qwen2.5-MoE highlights:
- MoE layers replace the FFN layers of the dense model
- Each MoE layer contains shared experts and routed experts (the routing mechanism assigns each token to its top-k experts)
- Fine-grained expert segmentation is used
Fine-grained expert segmentation
This technique is borrowed from the DeepSeekMoE model. Its goal is to increase the expressive power of an MoE model while keeping the parameter count and compute unchanged, enabling more accurate and targeted knowledge acquisition.
The implementation is as follows:
- Reduce the intermediate hidden dimension of each expert FFN to 1/m of the original and split each expert into m smaller experts, so the number of experts grows from N to mN.
- Increase the number of non-zero gates from K to mK (top-K becomes top-mK).
- Initialize the new experts from the dense model (shuffle and select parameters, then randomly re-initialize 50% of them; see Section 3).
Its characteristics:
- Parameter count and compute stay the same, while expressiveness increases substantially
- Combination flexibility grows dramatically: with N=16 and K=2, the number of possible expert combinations grows from 120 to 4,426,165,368 (see the quick check below)
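As a quick sanity check of those combination counts, they can be reproduced with plain Python (N, K and m below are the example values from the text, not Qwen2-MoE config values):
from math import comb

N, K, m = 16, 2, 4            # original expert count, top-K, and segmentation factor from the example above
print(comb(N, K))             # 120: possible expert combinations before fine-grained segmentation
print(comb(m * N, m * K))     # 4426165368: combinations with m*N = 64 experts and top-(m*K) = top-8 routing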
Shared experts routing:
One or more shared experts are added alongside the routed experts to handle common knowledge or information (a conceptual sketch follows the figure below).
Its characteristics:
- There are usually one or more shared experts
- They are dedicated to capturing and consolidating knowledge that is common across contexts, which in turn reduces parameter redundancy among the routed experts
- The shared-expert isolation idea was prototyped by Rajbhandari et al. at Microsoft in 2022 (DeepSpeed-MoE: https://arxiv.org/pdf/2201.05596)
Figure from: the DeepSeekMoE paper
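Conceptually, the output of an MoE layer with a shared expert is the shared expert's output plus the weighted outputs of the selected routed experts. A minimal per-token sketch (illustrative only; the real, batched implementations appear in the code sections below):
import torch

def moe_forward(x, shared_expert, routed_experts, gate, top_k):
    # x: a single token vector of shape (hidden_size,); gate: nn.Linear(hidden_size, num_routed_experts)
    scores = torch.softmax(gate(x), dim=-1)               # routing scores over all routed experts
    weights, idx = torch.topk(scores, top_k)               # keep only the top-k routed experts
    routed = sum(w * routed_experts[i](x) for w, i in zip(weights, idx.tolist()))
    return routed + shared_expert(x)                        # the shared expert processes every token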
1.3 Qwen2-MoE Code
Judging from the Qwen2-MoE code, its implementation borrows from the Mixtral code, while its improvements follow DeepSeekMoE.
On top of Mixtral it adds:
- One shared expert (equivalent in width to 4 routed experts)
- Fine-grained expert segmentation
The main code is as follows:
- Code 1: Model main class
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L891
class Qwen2MoeModel(Qwen2MoePreTrainedModel):
    def __init__(self, config: Qwen2MoeConfig):
        ...
        # Stack of decoder layers; the MoE block lives inside each layer's MLP
        self.layers = nn.ModuleList(
            [Qwen2MoeDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )

    def forward():
        ...
        # Collect each layer's router logits so the auxiliary balancing loss can be computed later
        if output_router_logits and layer_outputs[-1] is not None:
            all_router_logits += (layer_outputs[-1],)
- Code 2: Decoder Layer
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L668
class Qwen2MoeDecoderLayer(nn.Module):
    def __init__(self, config: Qwen2MoeConfig, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
        # Use the sparse MoE block as this layer's MLP only on every `decoder_sparse_step`-th layer;
        # otherwise fall back to a regular dense MLP
        if (layer_idx not in config.mlp_only_layers) and (
            config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0
        ):
            self.mlp = Qwen2MoeSparseMoeBlock(config)
        else:
            self.mlp = Qwen2MoeMLP(config, intermediate_size=config.intermediate_size)

    def forward():
        hidden_states = self.input_layernorm(hidden_states)
        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn()
        # Fully Connected
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        # The sparse MoE block returns (hidden_states, router_logits); the dense MLP returns a plain tensor
        if isinstance(hidden_states, tuple):
            hidden_states, router_logits = hidden_states
        else:
            router_logits = None
        outputs = (hidden_states,)
        if output_router_logits:
            outputs += (router_logits,)
- Code 3: Qwen2MoeSparseMoeBlock (the core MoE code)
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L606
class Qwen2MoeSparseMoeBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_experts = config.num_experts
        self.top_k = config.num_experts_per_tok
        self.norm_topk_prob = config.norm_topk_prob

        # gating
        self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False)
        self.experts = nn.ModuleList(
            [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)]
        )

        # Compared with Mixtral, Qwen2-MoE adds a shared_expert and a shared_expert_gate
        self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size)
        self.shared_expert_gate = torch.nn.Linear(config.hidden_size, 1, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """ """
        batch_size, sequence_length, hidden_dim = hidden_states.shape
        hidden_states = hidden_states.view(-1, hidden_dim)
        # router_logits: (batch * sequence_length, n_experts)
        router_logits = self.gate(hidden_states)

        routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
        routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
        if self.norm_topk_prob:
            routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
        # we cast back to the input dtype
        routing_weights = routing_weights.to(hidden_states.dtype)

        final_hidden_states = torch.zeros(
            (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype, device=hidden_states.device
        )

        # One hot encode the selected experts to create an expert mask
        # this will be used to easily index which expert is going to be solicited
        expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0)

        # Loop over all available experts in the model and perform the computation on each expert
        for expert_idx in range(self.num_experts):
            expert_layer = self.experts[expert_idx]
            idx, top_x = torch.where(expert_mask[expert_idx])

            # Index the correct hidden states and compute the expert hidden state for
            # the current expert. We need to make sure to multiply the output hidden
            # states by `routing_weights` on the corresponding tokens (top-1 and top-2)
            current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
            current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None]

            # However `index_add_` only support torch tensors for indexing so we'll use
            # the `top_x` tensor here.
            final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))

        # Merge in the shared expert's output (this is where Qwen2-MoE differs from DeepSeek)
        shared_expert_output = self.shared_expert(hidden_states)
        shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output

        final_hidden_states = final_hidden_states + shared_expert_output

        final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)
        return final_hidden_states, router_logits
Q&A: what is the difference between shared_expert_gate and the router gate?
The router gate is the core component of the whole MoE layer: it dynamically selects experts for each token and weights the experts' outputs with the routing weights.
The shared_expert_gate serves only the shared expert: it dynamically scales the shared expert's output so the shared expert adapts better to each token.
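The shapes make the difference clear; a minimal sketch (the hidden size and expert count below are illustrative values, not taken from a specific config):
import torch
import torch.nn as nn

hidden_size, num_experts, top_k = 2048, 60, 4     # illustrative values
x = torch.randn(10, hidden_size)                  # 10 tokens

router_gate = nn.Linear(hidden_size, num_experts, bias=False)
shared_expert_gate = nn.Linear(hidden_size, 1, bias=False)

# Router gate: a distribution over all routed experts per token, used to pick and weight the top-k experts
router_probs = torch.softmax(router_gate(x), dim=-1)              # (10, 60)
topk_weights, topk_idx = torch.topk(router_probs, top_k, dim=-1)

# Shared expert gate: a single sigmoid scalar per token that merely scales the shared expert's output
shared_scale = torch.sigmoid(shared_expert_gate(x))               # (10, 1)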
- Code 4: Definition of the expert layer (same as Mixtral)
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L267
# Modified from transformers.models.mistral.modeling_mistral.MistralMLP with Mistral->Qwen2Moe
class Qwen2MoeMLP(nn.Module):
    def __init__(self, config, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        # SwiGLU-style FFN: down_proj(act(gate_proj(x)) * up_proj(x))
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
- Code 5: The loss computation (same as Mixtral)
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L1328C1-L1341C119
class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin):
    _tied_weights_keys = ["lm_head.weight"]
    _tp_plan = {"lm_head": "colwise_rep"}

    def __init__(self, config):
        super().__init__(config)
        self.model = Qwen2MoeModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.router_aux_loss_coef = config.router_aux_loss_coef
        self.num_experts = config.num_experts
        self.num_experts_per_tok = config.num_experts_per_tok

    def forward():
        loss = None
        if labels is not None:
            loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)

        aux_loss = None
        if output_router_logits:
            # Load-balancing auxiliary loss computed from the router logits of all MoE layers
            aux_loss = load_balancing_loss_func(
                outputs.router_logits if return_dict else outputs[-1],
                self.num_experts,
                self.num_experts_per_tok,
                attention_mask,
            )
            if labels is not None:
                loss += self.router_aux_loss_coef * aux_loss.to(loss.device)  # make sure to reside in the same device
- Code 6: Balancing-loss implementation (same as Mixtral)
https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L65
# Copied from transformers.models.mixtral.modeling_mixtral.load_balancing_loss_func
def load_balancing_loss_func()
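The body of the function is omitted above. A simplified sketch of the Switch-Transformer-style balancing loss it computes (ignoring the attention-mask handling in the real transformers implementation) could look like this; treat it as an approximation, not a verbatim copy of the library code:
import torch
import torch.nn.functional as F

def simple_load_balancing_loss(router_logits, num_experts, top_k):
    # router_logits: (num_tokens, num_experts), typically concatenated over all MoE layers
    probs = F.softmax(router_logits, dim=-1)
    _, selected = torch.topk(probs, top_k, dim=-1)
    expert_mask = F.one_hot(selected, num_experts)                 # (num_tokens, top_k, num_experts)
    tokens_per_expert = expert_mask.float().mean(dim=(0, 1))       # fraction of routing slots each expert receives
    router_prob_per_expert = probs.mean(dim=0)                     # average router probability per expert
    # The loss is minimized when both distributions are uniform, i.e. the load is balanced
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)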
2 DeepSeek V3 (MoE)
2.1 DeepSeek-V3 Model Architecture
Figure: the DeepSeek-V3 architecture
Highlights:
- Structure: 1 shared expert + 256 routed experts
- Structure: fine-grained expert segmentation
- Loss: auxiliary-loss-free load balancing
- Loss: complementary sequence-wise auxiliary loss
- The gate scores are computed with sigmoid, whereas Mixtral and Qwen2-MoE use softmax
- Unlike Qwen2-MoE, the shared expert's output is added to the routed output directly, without any gating; Qwen2-MoE scales it with a shared_expert_gate
# DeepSeek-v3 config
"n_group": 8,
"n_routed_experts": 256,
"n_shared_experts": 1,
"norm_topk_prob": true,
"num_attention_heads": 128,
"num_experts_per_tok": 8,
Figure: the DeepSeekMoE MoE processing scheme
2.2 DeepSeek-V3 Model Code
A few caveats about the DeepSeek-V3 code:
- It contains only inference code; the training code is missing
- The DeepSeek-V2.5 code is more complete and can be used as a reference
DeepSeek-v3:
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/modeling_deepseek.py
DeepSeek-v2.5:
https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210/blob/main/modeling_deepseek.py
The code below combines V3 and V2.5.
- Code 1: Model main class (similar to V2.5)
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L1347
class DeepseekV3Model(DeepseekV3PreTrainedModel):
    def __init__(self, config: DeepseekV3Config):
        ...
        self.layers = nn.ModuleList(
            [
                DeepseekV3DecoderLayer(config, layer_idx)
                for layer_idx in range(config.num_hidden_layers)
            ]
        )
- Code 2: Decoder Layer (similar to V2.5)
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L1143
class DeepseekV3DecoderLayer(nn.Module):
    def __init__(self, config: DeepseekV3Config, layer_idx: int):
        ...
        # The first `first_k_dense_replace` layers keep a dense MLP; after that,
        # every `moe_layer_freq`-th layer uses an MoE block
        self.mlp = (
            DeepseekV3MoE(config)
            if (
                config.n_routed_experts is not None
                and layer_idx >= config.first_k_dense_replace
                and layer_idx % config.moe_layer_freq == 0
            )
            else DeepseekV3MLP(config)
        )

    def forward():
        ...
        hidden_states = self.mlp(hidden_states)
- Code 3: Core MoE code
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L476
class DeepseekV3MoE(nn.Module):
    def __init__(self, config):
        ...
        self.experts = nn.ModuleList(
            [
                DeepseekV3MLP(
                    config, intermediate_size=config.moe_intermediate_size
                )
                for i in range(config.n_routed_experts)
            ]
        )
        self.gate = MoEGate(config)
        if config.n_shared_experts is not None:
            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
            self.shared_experts = DeepseekV3MLP(
                config=config, intermediate_size=intermediate_size
            )

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        topk_idx, topk_weight = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
        if not self.training:  # the released code lacks the training branch
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.config.n_shared_experts is not None:
            y = y + self.shared_experts(identity)  # the shared experts' output is added to y directly
        return y
DeepSeek-V2.5:
https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210/blob/6f134cbe88cb9284a8ce696e8ac8eefd0bc24ede/modeling_deepseek.py#L521
class DeepseekV2MoE(nn.Module):
    """
    A mixed expert module containing shared experts.
    """

    def __init__(self, config):
        ...
        self.experts = nn.ModuleList(
            [
                DeepseekV2MLP(
                    config, intermediate_size=config.moe_intermediate_size
                )
                for i in range(config.n_routed_experts)
            ]
        )
        self.gate = MoEGate(config)
        if config.n_shared_experts is not None:
            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
            self.shared_experts = DeepseekV2MLP(
                config=config, intermediate_size=intermediate_size
            )

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        # In V2.5 the gate also returns the auxiliary loss
        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
        if self.training:
            # Training path: duplicate each token top_k times, run the matching expert on each copy,
            # then combine the copies with the routing weights
            hidden_states = hidden_states.repeat_interleave(
                self.num_experts_per_tok, dim=0
            )
            y = torch.empty_like(hidden_states)
            for i, expert in enumerate(self.experts):
                y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i])
            y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
            y = y.to(hidden_states.dtype).view(*orig_shape)
            # Attach the auxiliary loss to the graph without changing the value of y
            y = AddAuxiliaryLoss.apply(y, aux_loss)
        else:
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.config.n_shared_experts is not None:
            y = y + self.shared_experts(identity)
        return y
- Code 4: Definition of the expert layer
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L374
class DeepseekV3MLP(nn.Module):
    def __init__(self, config, hidden_size=None, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
        self.intermediate_size = (
            config.intermediate_size if intermediate_size is None else intermediate_size
        )
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj
- Code 5: Definition of the gate layer and the loss computation
Since DeepSeek-V3 does not provide the loss computation, we can refer to the DeepSeek-V2.5 implementation.
In DeepSeek-V2.5 the auxiliary loss is computed inside the Gate class.
Gate layer (DeepSeek-V2.5):
https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210/blob/6f134cbe88cb9284a8ce696e8ac8eefd0bc24ede/modeling_deepseek.py#L393
class MoEGate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.top_k = config.num_experts_per_tok
        self.n_routed_experts = config.n_routed_experts
        self.routed_scaling_factor = config.routed_scaling_factor
        self.scoring_func = config.scoring_func
        self.alpha = config.aux_loss_alpha
        self.seq_aux = config.seq_aux
        self.topk_method = config.topk_method
        self.n_group = config.n_group
        self.topk_group = config.topk_group

        # topk selection algorithm
        self.norm_topk_prob = config.norm_topk_prob
        self.gating_dim = config.hidden_size
        self.weight = nn.Parameter(
            torch.empty((self.n_routed_experts, self.gating_dim))
        )
        self.reset_parameters()

    def reset_parameters(self) -> None:
        import torch.nn.init as init

        init.kaiming_uniform_(self.weight, a=math.sqrt(5))

    def forward(self, hidden_states):
        bsz, seq_len, h = hidden_states.shape
        ### compute gating score
        hidden_states = hidden_states.view(-1, h)
        logits = F.linear(
            hidden_states.type(torch.float32), self.weight.type(torch.float32), None
        )
        if self.scoring_func == "softmax":
            scores = logits.softmax(dim=-1, dtype=torch.float32)
        else:
            raise NotImplementedError(
                f"insupportable scoring function for MoE gating: {self.scoring_func}"
            )

        ### select top-k experts
        if self.topk_method == "greedy":
            topk_weight, topk_idx = torch.topk(
                scores, k=self.top_k, dim=-1, sorted=False
            )
        elif self.topk_method == "group_limited_greedy":
            group_scores = (
                scores.view(bsz * seq_len, self.n_group, -1).max(dim=-1).values
            )  # [n, n_group]
            group_idx = torch.topk(
                group_scores, k=self.topk_group, dim=-1, sorted=False
            )[
                1
            ]  # [n, top_k_group]
            group_mask = torch.zeros_like(group_scores)  # [n, n_group]
            group_mask.scatter_(1, group_idx, 1)  # [n, n_group]
            score_mask = (
                group_mask.unsqueeze(-1)
                .expand(
                    bsz * seq_len, self.n_group, self.n_routed_experts // self.n_group
                )
                .reshape(bsz * seq_len, -1)
            )  # [n, e]
            tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)  # [n, e]
            topk_weight, topk_idx = torch.topk(
                tmp_scores, k=self.top_k, dim=-1, sorted=False
            )

        ### norm gate to sum 1
        if self.top_k > 1 and self.norm_topk_prob:
            denominator = topk_weight.sum(dim=-1, keepdim=True) + 1e-20
            topk_weight = topk_weight / denominator
        else:
            topk_weight = topk_weight * self.routed_scaling_factor

        ### expert-level computation auxiliary loss
        if self.training and self.alpha > 0.0:
            scores_for_aux = scores
            aux_topk = self.top_k
            # always compute aux loss based on the naive greedy topk method
            topk_idx_for_aux_loss = topk_idx.view(bsz, -1)
            if self.seq_aux:
                scores_for_seq_aux = scores_for_aux.view(bsz, seq_len, -1)
                ce = torch.zeros(
                    bsz, self.n_routed_experts, device=hidden_states.device
                )
                ce.scatter_add_(
                    1,
                    topk_idx_for_aux_loss,
                    torch.ones(bsz, seq_len * aux_topk, device=hidden_states.device),
                ).div_(seq_len * aux_topk / self.n_routed_experts)
                aux_loss = (ce * scores_for_seq_aux.mean(dim=1)).sum(
                    dim=1
                ).mean() * self.alpha
            else:
                mask_ce = F.one_hot(
                    topk_idx_for_aux_loss.view(-1), num_classes=self.n_routed_experts
                )
                ce = mask_ce.float().mean(0)
                Pi = scores_for_aux.mean(0)
                fi = ce * self.n_routed_experts
                aux_loss = (Pi * fi).sum() * self.alpha
        else:
            aux_loss = None
        return topk_idx, topk_weight, aux_loss
The DeepSeek-V3 Gate class code:
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L393
3 How the Experts Are Initialized
3.1 Qwen2 Model Initialization
- Standard MoE case: all experts are initialized from the parameters of the dense model.
- With fine-grained expert segmentation:
  - Parameter copying: the initial parameters are taken directly from the existing dense model; these copies form the basis for initializing the fine-grained experts.
  - Parameter shuffling and selection: because the fine-grained experts are narrower than the original dense FFN, the original parameters are shuffled along the intermediate dimension before slicing. This adapts them to the new dimensions and adds diversity across experts.
  - Random re-initialization of part of the parameters: after shuffling and selection, 50% of the selected parameters are randomly re-initialized, further increasing parameter diversity across the fine-grained experts and improving robustness and generalization on diverse inputs.
This initialization lets the fine-grained experts inherit the knowledge of the original dense model, while the added randomness and tailored adjustments help them adapt to specific tasks or datasets (see the sketch below).
Source: Qwen2 Technical Report: https://arxiv.org/pdf/2407.10671
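A hedged sketch of this upcycling recipe for a single fine-grained expert (the report only describes the procedure in prose, so the slicing, shuffling and 50% re-initialization below are one possible reading of it, with illustrative names):
import torch

def init_fine_grained_expert(dense_ffn_weight, m, slice_id, reinit_ratio=0.5, std=0.02):
    # dense_ffn_weight: (intermediate_size, hidden_size) weight copied from the dense model's FFN
    intermediate_size, _ = dense_ffn_weight.shape
    # Shuffle along the intermediate dimension so each small expert ends up with a different slice
    perm = torch.randperm(intermediate_size)
    shuffled = dense_ffn_weight[perm]
    # Keep a 1/m slice of the intermediate dimension for this fine-grained expert
    slice_size = intermediate_size // m
    w = shuffled[slice_id * slice_size:(slice_id + 1) * slice_size].clone()
    # Randomly re-initialize 50% of the copied parameters to increase diversity across experts
    mask = torch.rand_like(w) < reinit_ratio
    w[mask] = torch.randn(int(mask.sum()), device=w.device) * std
    return w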
3.2 DeepSeekMoE Model Initialization
All model parameters are randomly initialized ("For initialization, all learnable parameters are randomly initialized with a standard deviation of 0.006.")
https://arxiv.org/pdf/2401.06066v1
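In code this is just normal initialization with std = 0.006, roughly what a typical HF-style _init_weights hook does (sketch, not the released code):
import torch.nn as nn

def init_weights(module, std=0.006):
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()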
4 Comparing the Three Models
Mixtral:
- Experts: 8 experts, top-K = 2
- Loss: load-balancing loss
Qwen2-MoE:
- Experts: 60 routed experts plus 1 shared expert (equivalent in width to 4 routed experts), top-K = 4
- Loss: load-balancing loss
- Combination: sigmoid-gated shared expert + softmax-weighted routed experts
Basic idea: 64 experts in total, i.e. 16 original experts with a fine-grained segmentation factor of 4. To keep the parameter count and compute unchanged, only 60 routed experts are used and the shared expert accounts for the remaining 4. The "4 experts" is purely a parameter-count notion: there is actually a single shared-expert MLP whose intermediate size is four times moe_intermediate_size (see the check below).
Config (Qwen1.5-MoE-A2.7B): "moe_intermediate_size": 1408, "shared_expert_intermediate_size": 5632, "num_experts_per_tok": 4, "num_experts": 60
https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat/blob/main/config.json
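The arithmetic checks out against the config values quoted above:
moe_intermediate_size = 1408
shared_expert_intermediate_size = 5632
num_experts = 60

# The single shared expert is as wide as 4 routed experts ...
assert shared_expert_intermediate_size == 4 * moe_intermediate_size
# ... so in parameter terms the layer holds 60 + 4 = 64 fine-grained experts,
# i.e. 16 original-width experts each split into 4.
print(num_experts + shared_expert_intermediate_size // moe_intermediate_size)   # 64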
DeepSeek-V3:
- Experts: 256 routed experts (with fine-grained expert segmentation) plus 1 shared expert, top-K = 8
- Loss 1: load-balancing loss (the improved, auxiliary-loss-free version)
- Loss 2: complementary sequence-wise auxiliary loss
- Combination: ungated shared expert + sigmoid-weighted routed experts
5 Effect of the Main Improvements (Experimental Comparison)
Adding a shared expert, plus fine-grained expert segmentation (results from the DeepSeekMoE technical report)
From these results we can see:
- Adding a shared expert brings a modest improvement overall, with large gains on some datasets
- Increasing the number of experts via fine-grained expert segmentation (splitting one expert into 2 or 4 smaller experts) improves performance more noticeably, with the parameter count unchanged
- On the ratio of shared to routed experts: the overall impact is small, and 1:3 is recommended (although judging from the results, 1:7 seems slightly better)
For detailed results, see: https://arxiv.org/pdf/2401.06066
Loss: the impact of the aux-loss-free strategy
Results from: https://arxiv.org/pdf/2412.19437v1
Loss: batch-wise vs. sequence-wise load balancing
Batch-wise load balancing shows better results; the sequence-wise constraint is stricter and limits expert specialization. Batch-wise balancing has its own challenges, though:
- (1) load imbalance within particular sequences or small batches
- (2) load imbalance at inference time caused by domain shift