
AI LLM Development Notes: A Technical Comparison of MoE Models (Mixtral, Qwen2-MoE, DeepSeek-V3)

MoE (Mixture-of-Experts) LLMs have entered a boom period. In this installment we compare three open MoE LLMs: Mixtral, Qwen2-MoE, and the recently much-discussed DeepSeek-V3. From Mixtral kicking off the trend in 2023 to DeepSeek-V3 drawing worldwide attention in 2024, how exactly have MoE models evolved, and what is it about DeepSeek that has Silicon Valley worried? Let's take a look.

1 The Qwen2.5-MoE Model

Technical reports:
Qwen2:https://arxiv.org/pdf/2407.10671
Qwen2.5:https://arxiv.org/pdf/2412.15115

1.1 The Qwen Model Family


Image source: Qwen2.5 Technical Report

According to the Qwen2.5 technical report, the dense models in Alibaba Cloud's Qwen2.5 series are open-weight, while the MoE models, namely Qwen2.5-Turbo and Qwen2.5-Plus, are hosted on Alibaba Cloud. These two models are reported to be competitive with gpt-4o-mini and gpt-4o, respectively.

Both the Qwen2 and Qwen1.5 series include open-weight MoE models, for example:
https://huggingface.co/Qwen/Qwen2-57B-A14B
https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B

So we can use the Qwen2-MoE code together with the Qwen2.5 technical report to understand Qwen2.5 MoE.

1.2 Qwen MoE Architecture and Techniques


Image source: Qwen2.5 Technical Report

  • Qwen2.5-MoE characteristics:

    • The FFN layers of the dense model are replaced with MoE layers
    • Each MoE layer contains shared experts and routed experts (the routing mechanism assigns each token to its top-k experts)
    • Fine-grained expert segmentation is used
  • Fine-grained expert segmentation
    This technique comes from the DeepSeekMoE model. Its goal is to increase the expressiveness of the MoE model, and thus its potential for more accurate and targeted knowledge acquisition, while keeping the parameter count and compute unchanged.
    It works as follows:

    1. Reduce the intermediate hidden dimension of the FFN to 1/m of the original, splitting each expert FFN into m smaller experts, so the number of experts grows to mN.
    2. Increase the number of non-zero gates to mK (routing changes from top-K to top-mK).
    3. Model initialization (shuffle and select parameters, then randomly re-initialize 50% of them; see Section 3)

    Its properties:

    1. Parameter count and compute stay the same, while expressiveness increases substantially
    2. Combinatorial flexibility grows dramatically: with N=16 and K=2, the number of possible expert combinations grows from 120 to 4,426,165,368 (a quick numeric check follows after this list)
  • Shared experts routing:
    One or more shared experts are added alongside the routed experts to handle knowledge or information that is common across inputs.
    Its properties:

    1. There are typically one or more shared experts
    2. They are dedicated to capturing and consolidating common knowledge across different contexts, which alleviates parameter redundancy among the routed experts
    3. The prototype of shared-expert isolation was proposed by Rajbhandari et al. at Microsoft (2022) (DeepSpeed-MoE: https://arxiv.org/pdf/2201.05596)


    Image source: the DeepSeekMoE paper
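
To make the combinatorics concrete, here is a quick standard-library check of the numbers quoted above (N=16 experts with top-2 routing vs. a fine-grained split factor of m=4, i.e. 64 experts with top-8 routing):

import math

# original setting: choose 2 experts out of 16
print(math.comb(16, 2))   # 120
# fine-grained setting: each expert is split into 4, so choose 8 out of 64
print(math.comb(64, 8))   # 4426165368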

1.3 Qwen2-MoE Code

Judging from the Qwen2-MoE code, the implementation borrows from Mixtral's code, while the improvements follow DeepSeekMoE.
On top of Mixtral it adds:

  • a shared expert (equivalent in parameter count to 4 routed experts)
  • the fine-grained expert segmentation approach

The main code is shown below:

  • Code 1: Main model class
    https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L891
class Qwen2MoeModel(Qwen2MoePreTrainedModel):
    def __init__(self, config: Qwen2MoeConfig):
        # (abridged) a stack of decoder layers
        self.layers = nn.ModuleList(
            [Qwen2MoeDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )

    def forward():
        # (abridged) collect the router logits from every MoE layer for the auxiliary loss
        if output_router_logits and layer_outputs[-1] is not None:
            all_router_logits += (layer_outputs[-1],)
  • Code 2: Decoder Layer
    https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L668
class Qwen2MoeDecoderLayer(nn.Module):
    def __init__(self, config: Qwen2MoeConfig, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

        # use a sparse MoE block on MoE layers (controlled by decoder_sparse_step and mlp_only_layers), a plain dense MLP otherwise
        if (layer_idx not in config.mlp_only_layers) and (
            config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0
        ):
            self.mlp = Qwen2MoeSparseMoeBlock(config)
        else:
            self.mlp = Qwen2MoeMLP(config, intermediate_size=config.intermediate_size)

    def forward():
        hidden_states = self.input_layernorm(hidden_states)
        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn()

        # Fully Connected
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        # the sparse MoE block returns (hidden_states, router_logits); the dense MLP returns a plain tensor
        if isinstance(hidden_states, tuple):
            hidden_states, router_logits = hidden_states
        else:
            router_logits = None
        outputs = (hidden_states,)
        if output_router_logits:
            outputs += (router_logits,)
  • Code 3: Qwen2MoeSparseMoeBlock (core MoE routing code)
    https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L606
class Qwen2MoeSparseMoeBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_experts = config.num_experts
        self.top_k = config.num_experts_per_tok
        self.norm_topk_prob = config.norm_topk_prob

        # gating
        self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False)
        self.experts = nn.ModuleList(
            [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)]
        )

        # compared with Mixtral, Qwen2-MoE adds a shared_expert and a shared_expert_gate
        self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size)
        self.shared_expert_gate = torch.nn.Linear(config.hidden_size, 1, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """ """
        batch_size, sequence_length, hidden_dim = hidden_states.shape
        hidden_states = hidden_states.view(-1, hidden_dim)
        # router_logits: (batch * sequence_length, n_experts)
        router_logits = self.gate(hidden_states)

        routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
        routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
        if self.norm_topk_prob:
            routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
        # we cast back to the input dtype
        routing_weights = routing_weights.to(hidden_states.dtype)

        final_hidden_states = torch.zeros(
            (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype, device=hidden_states.device
        )

        # One hot encode the selected experts to create an expert mask
        # this will be used to easily index which expert is going to be solicited
        expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0)

        # Loop over all available experts in the model and perform the computation on each expert
        for expert_idx in range(self.num_experts):
            expert_layer = self.experts[expert_idx]
            idx, top_x = torch.where(expert_mask[expert_idx])

            # Index the correct hidden states and compute the expert hidden state for
            # the current expert. We need to make sure to multiply the output hidden
            # states by `routing_weights` on the corresponding tokens (top-1 and top-2)
            current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
            current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None]

            # However `index_add_` only support torch tensors for indexing so we'll use
            # the `top_x` tensor here.
            final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))

        # merge in the shared expert output (this is where Qwen2-MoE differs from DeepSeek)
        shared_expert_output = self.shared_expert(hidden_states)
        shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output

        final_hidden_states = final_hidden_states + shared_expert_output

        final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)
        return final_hidden_states, router_logits
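
As a quick sanity check, the block can be run standalone on random inputs. This is a small usage sketch, assuming a recent transformers release that ships the Qwen2-MoE classes shown above (config defaults may differ across versions):

import torch
from transformers.models.qwen2_moe.configuration_qwen2_moe import Qwen2MoeConfig
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock

# a deliberately tiny config, just to exercise the routing logic
config = Qwen2MoeConfig(
    hidden_size=64,
    moe_intermediate_size=32,
    shared_expert_intermediate_size=128,
    num_experts=8,
    num_experts_per_tok=2,
)
block = Qwen2MoeSparseMoeBlock(config)
x = torch.randn(2, 5, config.hidden_size)            # (batch, seq_len, hidden)
out, router_logits = block(x)
print(out.shape)             # torch.Size([2, 5, 64])
print(router_logits.shape)   # torch.Size([10, 8]) -> (batch * seq_len, num_experts)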

Q&A: what is the difference between shared_expert_gate and the router gate?
The router gate is the core component of the MoE layer: it dynamically selects the experts for each token and weights each expert's output with the routing weights.
The shared expert gate only serves the shared expert: it dynamically scales the shared expert's output, making the shared expert more adaptive to the input.

  • Code 4: Expert layer definition (same as Mixtral)
    https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L267
# Modified from transformers.models.mistral.modeling_mistral.MistralMLP with Mistral->Qwen2Moe
class Qwen2MoeMLP(nn.Module):
    def __init__(self, config, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  • Code 5: Loss computation (same as Mixtral)
    https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L1328C1-L1341C119
class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin):
    _tied_weights_keys = ["lm_head.weight"]
    _tp_plan = {"lm_head": "colwise_rep"}

    def __init__(self, config):
        super().__init__(config)
        self.model = Qwen2MoeModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        self.router_aux_loss_coef = config.router_aux_loss_coef
        self.num_experts = config.num_experts
        self.num_experts_per_tok = config.num_experts_per_tok

    def forward():
        loss = None
        if labels is not None:
            loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)

        aux_loss = None
        if output_router_logits:
            aux_loss = load_balancing_loss_func(
                outputs.router_logits if return_dict else outputs[-1],
                self.num_experts,
                self.num_experts_per_tok,
                attention_mask,
            )
            if labels is not None:
                loss += self.router_aux_loss_coef * aux_loss.to(loss.device)  # make sure to reside in the same device
  • Code 6: Balancing loss implementation (same as Mixtral)
    https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L65
# Copied from transformers.models.mixtral.modeling_mixtral.load_balancing_loss_func
def load_balancing_loss_func()
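
For reference, the core of that function can be summarized as follows. This is a simplified sketch of the Switch-Transformer-style loss (the real load_balancing_loss_func also handles the attention mask and concatenates router logits across layers); the function name here is my own:

import torch
import torch.nn.functional as F

def simple_load_balancing_loss(router_logits, num_experts, top_k):
    # router_logits: (num_tokens, num_experts), gathered from all MoE layers
    probs = F.softmax(router_logits, dim=-1)
    _, selected = torch.topk(probs, top_k, dim=-1)
    expert_mask = F.one_hot(selected, num_experts)           # (tokens, top_k, experts)
    tokens_per_expert = expert_mask.float().mean(dim=0)      # fraction of tokens sent to each expert, per top-k slot
    prob_per_expert = probs.mean(dim=0)                      # mean router probability per expert
    # the loss is minimized when both token load and router probability are uniform across experts
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert.unsqueeze(0))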

2 DeepSeek V3 (MoE)

2.1 DeepSeek-V3 Model Architecture


DeepSeek-V3 architecture

Key characteristics:

  • Structure: 1 shared expert + 256 routed experts
  • Structure: fine-grained expert segmentation
  • Loss: auxiliary-loss-free load balancing
  • Loss: complementary sequence-wise auxiliary loss
  • The gate scores are computed with sigmoid, whereas Mixtral and Qwen2-MoE use softmax
  • Unlike Qwen2-MoE, the shared expert's output is not further processed but is added to the routed output directly; Qwen2-MoE runs it through a shared expert gate first (a small sketch follows the config excerpt below)
# DeepSeek-V3 config (excerpt)
  "n_group": 8,
  "n_routed_experts": 256,
  "n_shared_experts": 1,
  "norm_topk_prob": true,
  "num_attention_heads": 128,
  "num_experts_per_tok": 8,


DeepSeekMoE MoE processing scheme

2.2 DeepSeek-V3 Model Code

A few notes on the DeepSeek-V3 code:

  • Only inference code is released; the training code is missing
  • The DeepSeek-V2.5 code is more complete and can be used as a reference

DeepSeek-v3:
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/modeling_deepseek.py
DeepSeek-v2.5:
https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210/blob/main/modeling_deepseek.py

The code below draws on both V3 and V2.5.

  • Code 1: Main model class (similar to V2.5)
    https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L1347
class DeepseekV3Model(DeepseekV3PreTrainedModel):
    def __init__(self, config: DeepseekV3Config):
        # (abridged) a stack of decoder layers, as in Qwen2-MoE / Mixtral
        self.layers = nn.ModuleList(
            [
                DeepseekV3DecoderLayer(config, layer_idx)
                for layer_idx in range(config.num_hidden_layers)
            ]
        )

  • Code 2: Decoder Layer (similar to V2.5)
    https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L1143
class DeepseekV3DecoderLayer(nn.Module):
    def __init__(self, config: DeepseekV3Config, layer_idx: int):
        # the first `first_k_dense_replace` layers keep a dense MLP; after that, every `moe_layer_freq`-th layer uses an MoE block
        self.mlp = (
            DeepseekV3MoE(config)
            if (
                config.n_routed_experts is not None
                and layer_idx >= config.first_k_dense_replace
                and layer_idx % config.moe_layer_freq == 0
            )
            else DeepseekV3MLP(config)
        )

    def forward():
        # (abridged) attention, then the MLP / MoE block
        hidden_states = self.mlp(hidden_states)

  • Code 3: Core MoE code

https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L476

class DeepseekV3MoE(nn.Module):
    def __init__(self, config):
        self.experts = nn.ModuleList(
            [
                DeepseekV3MLP(
                    config, intermediate_size=config.moe_intermediate_size
                )
                for i in range(config.n_routed_experts)
            ]
        )
        self.gate = MoEGate(config)

        if config.n_shared_experts is not None:
            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
            self.shared_experts = DeepseekV3MLP(
                config=config, intermediate_size=intermediate_size
            )

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        topk_idx, topk_weight = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
        if not self.training: # only the inference path is shown; the training branch is missing from the released V3 code
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.config.n_shared_experts is not None:
            y = y + self.shared_experts(identity) # the shared experts' output is added to y directly (no extra gate)
        return y

DeepSeek V2.5 (more complete; includes the training branch):
https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210/blob/6f134cbe88cb9284a8ce696e8ac8eefd0bc24ede/modeling_deepseek.py#L521

class DeepseekV2MoE(nn.Module):
    """
    A mixed expert module containing shared experts.
    """

    def __init__(self, config):

        self.experts = nn.ModuleList(
            [
                DeepseekV2MLP(
                    config, intermediate_size=config.moe_intermediate_size
                )
                for i in range(config.n_routed_experts)
            ]
        )

        self.gate = MoEGate(config)
        if config.n_shared_experts is not None:
            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
            self.shared_experts = DeepseekV2MLP(
                config=config, intermediate_size=intermediate_size
            )

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
        if self.training:
            # training path: each token is repeated top_k times, dispatched to its selected experts,
            # recombined with the routing weights, and the auxiliary loss is attached below
            hidden_states = hidden_states.repeat_interleave(
                self.num_experts_per_tok, dim=0
            )
            y = torch.empty_like(hidden_states)
            for i, expert in enumerate(self.experts):
                y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i])
            y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
            y = y.to(hidden_states.dtype).view(*orig_shape)
            y = AddAuxiliaryLoss.apply(y, aux_loss)  # custom autograd function: attaches aux_loss to the backward pass without changing y
        else:
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.config.n_shared_experts is not None:
            y = y + self.shared_experts(identity)
        return y
  • Code 4: Expert layer definition

https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L374

class DeepseekV3MLP(nn.Module):
    def __init__(self, config, hidden_size=None, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
        self.intermediate_size = (
            config.intermediate_size if intermediate_size is None else intermediate_size
        )

        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj
  • Code 5: Gate layer definition and loss computation

DeepSeek-V3 does not ship the loss computation, so we refer to the DeepSeek-V2.5 implementation here.
In DeepSeek-V2.5 the auxiliary loss is computed inside the Gate class.

Gate layer (deepseek v2.5)

https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210/blob/6f134cbe88cb9284a8ce696e8ac8eefd0bc24ede/modeling_deepseek.py#L393

class MoEGate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.top_k = config.num_experts_per_tok
        self.n_routed_experts = config.n_routed_experts
        self.routed_scaling_factor = config.routed_scaling_factor
        self.scoring_func = config.scoring_func
        self.alpha = config.aux_loss_alpha
        self.seq_aux = config.seq_aux
        self.topk_method = config.topk_method
        self.n_group = config.n_group
        self.topk_group = config.topk_group

        # topk selection algorithm
        self.norm_topk_prob = config.norm_topk_prob
        self.gating_dim = config.hidden_size
        self.weight = nn.Parameter(
            torch.empty((self.n_routed_experts, self.gating_dim))
        )
        self.reset_parameters()

    def reset_parameters(self) -> None:
        import torch.nn.init as init

        init.kaiming_uniform_(self.weight, a=math.sqrt(5))

    def forward(self, hidden_states):
        bsz, seq_len, h = hidden_states.shape
        ### compute gating score
        hidden_states = hidden_states.view(-1, h)
        logits = F.linear(
            hidden_states.type(torch.float32), self.weight.type(torch.float32), None
        )
        if self.scoring_func == "softmax":
            scores = logits.softmax(dim=-1, dtype=torch.float32)
        else:
            raise NotImplementedError(
                f"insupportable scoring function for MoE gating: {self.scoring_func}"
            )

        ### select top-k experts
        if self.topk_method == "greedy":
            topk_weight, topk_idx = torch.topk(
                scores, k=self.top_k, dim=-1, sorted=False
            )
        elif self.topk_method == "group_limited_greedy":
            group_scores = (
                scores.view(bsz * seq_len, self.n_group, -1).max(dim=-1).values
            )  # [n, n_group]
            group_idx = torch.topk(
                group_scores, k=self.topk_group, dim=-1, sorted=False
            )[
                1
            ]  # [n, top_k_group]
            group_mask = torch.zeros_like(group_scores)  # [n, n_group]
            group_mask.scatter_(1, group_idx, 1)  # [n, n_group]
            score_mask = (
                group_mask.unsqueeze(-1)
                .expand(
                    bsz * seq_len, self.n_group, self.n_routed_experts // self.n_group
                )
                .reshape(bsz * seq_len, -1)
            )  # [n, e]
            tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)  # [n, e]
            topk_weight, topk_idx = torch.topk(
                tmp_scores, k=self.top_k, dim=-1, sorted=False
            )

        ### norm gate to sum 1
        if self.top_k > 1 and self.norm_topk_prob:
            denominator = topk_weight.sum(dim=-1, keepdim=True) + 1e-20
            topk_weight = topk_weight / denominator
        else:
            topk_weight = topk_weight * self.routed_scaling_factor
        ### expert-level computation auxiliary loss
        if self.training and self.alpha > 0.0:
            scores_for_aux = scores
            aux_topk = self.top_k
            # always compute aux loss based on the naive greedy topk method
            topk_idx_for_aux_loss = topk_idx.view(bsz, -1)
            if self.seq_aux:
                scores_for_seq_aux = scores_for_aux.view(bsz, seq_len, -1)
                ce = torch.zeros(
                    bsz, self.n_routed_experts, device=hidden_states.device
                )
                ce.scatter_add_(
                    1,
                    topk_idx_for_aux_loss,
                    torch.ones(bsz, seq_len * aux_topk, device=hidden_states.device),
                ).div_(seq_len * aux_topk / self.n_routed_experts)
                aux_loss = (ce * scores_for_seq_aux.mean(dim=1)).sum(
                    dim=1
                ).mean() * self.alpha
            else:
                mask_ce = F.one_hot(
                    topk_idx_for_aux_loss.view(-1), num_classes=self.n_routed_experts
                )
                ce = mask_ce.float().mean(0)
                Pi = scores_for_aux.mean(0)
                fi = ce * self.n_routed_experts
                aux_loss = (Pi * fi).sum() * self.alpha
        else:
            aux_loss = None
        return topk_idx, topk_weight, aux_loss
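
To see what the group_limited_greedy branch above does, here is a tiny worked example with made-up numbers: 8 experts split into n_group=4 groups of 2, keep the topk_group=2 best groups, then pick top_k=2 experts among them:

import torch

scores = torch.tensor([[0.05, 0.30, 0.10, 0.02, 0.25, 0.01, 0.20, 0.07]])  # (1 token, 8 experts)
n_group, topk_group, top_k = 4, 2, 2

group_scores = scores.view(1, n_group, -1).max(dim=-1).values            # best expert in each group
group_idx = torch.topk(group_scores, k=topk_group, dim=-1)[1]            # keeps groups 0 and 2 here
group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
score_mask = (
    group_mask.unsqueeze(-1)
    .expand(1, n_group, scores.shape[-1] // n_group)
    .reshape(1, -1)
)
masked = scores.masked_fill(~score_mask.bool(), 0.0)                     # zero out experts in dropped groups
topk_weight, topk_idx = torch.topk(masked, k=top_k, dim=-1)
print(topk_idx)      # tensor([[1, 4]])
print(topk_weight)   # tensor([[0.3000, 0.2500]])

Restricting each token's candidates to a few groups is what later lets DeepSeek bound how many devices a token's experts are spread across.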

The DeepSeek-V3 Gate class code:
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/4c1f24cc10a2a1894304c7ab52edd9710c047571/modeling_deepseek.py#L393

3 How Experts Are Initialized

3.1 Qwen2 Model Initialization
  • Standard MoE case: all experts are initialized from the parameters of the dense model.

  • With fine-grained expert segmentation:

    1. Parameter copying

      • The initial parameters are copied directly from an existing dense model and serve as the basis for the fine-grained experts.
    2. Parameter shuffling and selection

      • Because the fine-grained experts have dimensions different from the original dense model, the original parameters are shuffled along the intermediate dimension. This adapts them to the new dimensions and adds diversity across experts.
    3. Random re-initialization of part of the parameters

      • After shuffling and selection, 50% of the selected parameters are randomly re-initialized. This increases parameter diversity among the fine-grained experts and improves robustness and generalization on diverse inputs.

    This initialization lets the fine-grained experts reuse the knowledge of the original model, while the added randomness and tailored adjustments help them adapt to specific tasks or datasets. A rough sketch of this procedure is given after the source note below.


Content from the Qwen2 Technical Report: https://arxiv.org/pdf/2407.10671
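
A rough sketch of that procedure (my own reading of the report, not released code; the function name and the wrap-around slicing are assumptions):

import torch

def upcycle_experts(dense_weight, num_experts, expert_dim, reinit_ratio=0.5, init_std=0.02):
    # dense_weight: (intermediate_size, hidden_size), e.g. the up_proj of the dense FFN
    intermediate_size, _ = dense_weight.shape
    perm = torch.randperm(intermediate_size)                 # shuffle along the intermediate dimension
    shuffled = dense_weight[perm]
    experts = []
    for e in range(num_experts):
        # slice out expert_dim rows per expert, wrapping around when num_experts * expert_dim > intermediate_size
        rows = torch.arange(e * expert_dim, (e + 1) * expert_dim) % intermediate_size
        w = shuffled[rows].clone()
        mask = torch.rand_like(w) < reinit_ratio             # randomly re-initialize ~50% of the entries
        w[mask] = torch.empty_like(w[mask]).normal_(std=init_std)
        experts.append(w)
    return experts

# e.g. 60 fine-grained experts of intermediate size 1408 from a dense FFN of intermediate size 5632
expert_up_projs = upcycle_experts(torch.randn(5632, 2048), num_experts=60, expert_dim=1408)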

3.2 DeepSeekMoE Model Initialization

All model parameters are randomly initialized ("For initialization, all learnable parameters are randomly initialized with a standard deviation of 0.006.")
https://arxiv.org/pdf/2401.06066v1

4 Comparison of the Three Models

  • Mixtral:

    • Experts: 8 routed experts, top-k = 2
    • Loss: load balancing loss
  • Qwen2-MoE:

    • Experts: 60 routed experts, 1 shared expert (equivalent in parameter count to 4 routed experts), top-k = 4
    • Loss: load balancing loss
    • sigmoid gate for the shared expert + softmax routing for the routed experts
    Basic idea: 64 experts in total, i.e. 16 experts with a fine-grained split factor of 4.
    To keep the parameter count and compute unchanged, only 60 routed experts are used, and the shared expert accounts for the remaining 4.
    
    The "4 experts" for the shared expert is a parameter-count notion: there is actually a single shared expert whose intermediate size is four times the MoE intermediate size.
    
    "moe_intermediate_size": 1408,
    "shared_expert_intermediate_size": 5632,
    "num_experts_per_tok": 4,
    "num_experts": 60,
    
     https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat/blob/main/config.json
    
  • DeepSeek-V3:

    • Experts: 256 routed experts (fine-grained expert segmentation), 1 shared expert, top-k = 8
    • Loss 1: load balancing loss (auxiliary-loss-free variant)
    • Loss 2: complementary sequence-wise auxiliary loss
    • shared expert added directly + sigmoid-scored routed experts

5 Impact of the Main Improvements (Experimental Comparison)

  • Adding a shared expert, plus fine-grained expert segmentation (results from the DeepSeekMoE technical report)

    From those results:

    • Adding a shared expert brings a modest overall improvement, with large gains on some datasets
    • Increasing the number of experts further via fine-grained expert segmentation (splitting each expert into 2 or 4 smaller experts) improves performance more clearly (parameter count unchanged, performance improves substantially)
    • On the ratio of shared to routed experts: the overall effect is small, and 1:3 works well (although, judging from the reported results, 1:7 seems even better?)
      For detailed results, see: https://arxiv.org/pdf/2401.06066
  • Loss: effect of auxiliary-loss-free balancing (a conceptual sketch follows at the end of this section)


Results from: https://arxiv.org/pdf/2412.19437v1

  • Loss: batch-wise vs. sequence-wise load balancing
    Batch-wise load balancing gives better results; the sequence-wise constraint is stricter and limits expert specialization. Batch-wise balancing has its own challenges, though:

    • (1) load imbalance within particular sequences or small batches,
    • (2) load imbalance at inference time caused by domain shift.
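
For completeness, here is a conceptual sketch of the auxiliary-loss-free balancing idea (DeepSeek's training code is not released, so the function names and update-rule details here are assumptions based on the V3 report): a per-expert bias is added to the affinity scores only when selecting the top-k experts, and after each step the bias is nudged up for under-loaded experts and down for over-loaded ones.

import torch

def route_with_bias(affinity, bias, top_k):
    # affinity: (tokens, n_experts) sigmoid affinity scores; bias: (n_experts,)
    _, topk_idx = torch.topk(affinity + bias, k=top_k, dim=-1)   # bias influences selection only
    topk_w = torch.gather(affinity, -1, topk_idx)                # gate weights still use the raw scores
    topk_w = topk_w / (topk_w.sum(dim=-1, keepdim=True) + 1e-20)
    return topk_idx, topk_w

def update_bias(bias, topk_idx, n_experts, gamma=1e-3):
    # push the bias up for under-loaded experts and down for over-loaded ones
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    return bias + gamma * torch.sign(load.mean() - load)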
