[Deep Learning with PyTorch] 2. Activation, Multilayer Perceptrons, and the MLP Implementation in LLaMA
1. A Basic Multilayer Perceptron (MLP) Implementation
The multilayer perceptron (Multilayer Perceptron, MLP) is the basic form of a feed-forward neural network, built from a stack of fully connected layers. Each layer consists of a weight matrix, a bias vector, and a nonlinear activation function.
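Concretely, writing the input as $h^{(0)} = x$ and treating each sample as a row vector (the same layout as `np.dot(input, self.W)` in the code below), layer $l$ computes:

$$h^{(l)} = f\left(h^{(l-1)} W^{(l)} + b^{(l)}\right)$$

where $W^{(l)}$ has shape $n_{l-1} \times n_l$ and $f$ is the layer's activation function.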
1.1 The Activation Module
Activation functions introduce nonlinearity. Two common choices are tanh (the hyperbolic tangent) and sigmoid (the logistic function); their derivatives are needed for the gradient computations during backpropagation.
```python
import numpy as np


class Activation:
    """Pairs an activation function with its derivative (written in terms of the activation value)."""

    def __tanh(self, x):
        return np.tanh(x)

    def __tanh_deriv(self, a):
        # a = tanh(x), so d(tanh)/dx = 1 - a^2
        return 1.0 - a ** 2

    def __logistic(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def __logistic_deriv(self, a):
        # a = sigmoid(x), so d(sigmoid)/dx = a * (1 - a)
        return a * (1 - a)

    def __init__(self, activation='tanh'):
        if activation == 'logistic':
            self.f = self.__logistic
            self.f_deriv = self.__logistic_deriv
        elif activation == 'tanh':
            self.f = self.__tanh
            self.f_deriv = self.__tanh_deriv
```
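As a quick sanity check of the class above (a minimal sketch; the variable names are only for illustration), we can evaluate tanh and compare its analytic derivative with a finite-difference estimate:

```python
import numpy as np

# Minimal usage sketch for the Activation class defined above.
act = Activation('tanh')
x = np.array([-1.0, 0.0, 1.0])
a = act.f(x)             # activation values tanh(x)
da = act.f_deriv(a)      # analytic derivative, expressed via a: 1 - a**2

# Finite-difference check of the derivative.
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.allclose(da, numeric, atol=1e-6))  # expected: True
```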
1.2 The Hidden Layer Class: HiddenLayer
The class below defines a hidden layer of the MLP: it initializes the weights and bias and implements the forward and backward passes.
```python
class HiddenLayer:
    def __init__(self, n_in, n_out, activation_last_layer='tanh', activation='tanh'):
        self.input = None
        self.activation = Activation(activation).f
        # Derivative of the *previous* layer's activation, used when propagating delta backward.
        self.activation_deriv = Activation(activation_last_layer).f_deriv if activation_last_layer else None

        # Xavier/Glorot uniform initialization.
        limit = np.sqrt(6. / (n_in + n_out))
        self.W = np.random.uniform(-limit, limit, (n_in, n_out))
        self.b = np.zeros(n_out)

        self.grad_W = np.zeros_like(self.W)
        self.grad_b = np.zeros_like(self.b)

    def forward(self, input):
        # Linear transform followed by the (optional) nonlinearity.
        lin_output = np.dot(input, self.W) + self.b
        self.output = lin_output if self.activation is None else self.activation(lin_output)
        self.input = input
        return self.output

    def backward(self, delta, output_layer=False):
        # Gradients for this layer's parameters: grad_W = input^T delta, grad_b = delta.
        self.grad_W = np.atleast_2d(self.input).T @ np.atleast_2d(delta)
        self.grad_b = delta
        # Propagate the error to the previous layer: delta_prev = (delta W^T) * f'(previous activation).
        if self.activation_deriv:
            delta = delta @ self.W.T * self.activation_deriv(self.input)
        return delta
```
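To see the shapes involved, here is a minimal sketch (made-up numbers) of one forward and one backward pass through a single `HiddenLayer`:

```python
import numpy as np

np.random.seed(0)
# First layer of a network: no previous activation, so activation_last_layer=None.
layer = HiddenLayer(n_in=3, n_out=2, activation_last_layer=None, activation='tanh')

x = np.array([0.5, -0.2, 0.1])   # one input sample
out = layer.forward(x)           # shape (2,)

delta = np.array([0.1, -0.3])    # pretend error signal coming from the layer above
_ = layer.backward(delta)        # fills grad_W (3, 2) and grad_b (2,)
print(out.shape, layer.grad_W.shape, layer.grad_b.shape)
```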
1.3 The Multilayer Perceptron Class: MLP
The full MLP stacks several hidden layers and provides methods for the forward pass, the loss criterion, backpropagation, parameter updates, and prediction.
```python
class MLP:
    def __init__(self, layers, activation=[None, 'tanh', 'tanh']):
        # layers: list of layer sizes, e.g. [n_in, n_hidden, n_out]
        self.layers = []
        self.activation = activation
        for i in range(len(layers) - 1):
            self.layers.append(HiddenLayer(layers[i], layers[i + 1], activation[i], activation[i + 1]))

    def forward(self, input):
        for layer in self.layers:
            input = layer.forward(input)
        return input

    def criterion_MSE(self, y, y_hat):
        # Squared error plus the delta for the output layer.
        activation_deriv = Activation(self.activation[-1]).f_deriv
        error = y - y_hat
        loss = error ** 2
        delta = -2 * error * activation_deriv(y_hat)
        return loss, delta

    def backward(self, delta):
        delta = self.layers[-1].backward(delta, output_layer=True)
        for layer in reversed(self.layers[:-1]):
            delta = layer.backward(delta)

    def update(self, lr):
        for layer in self.layers:
            layer.W -= lr * layer.grad_W
            layer.b -= lr * layer.grad_b

    def fit(self, X, y, learning_rate=0.1, epochs=100):
        # Stochastic gradient descent: one randomly chosen sample per update.
        X = np.array(X)
        y = np.array(y)
        to_return = np.zeros(epochs)
        for k in range(epochs):
            loss = np.zeros(X.shape[0])
            for it in range(X.shape[0]):
                i = np.random.randint(X.shape[0])
                y_hat = self.forward(X[i])
                loss[it], delta = self.criterion_MSE(y[i], y_hat)
                self.backward(delta)
                self.update(learning_rate)
            to_return[k] = np.mean(loss)
        return to_return

    def predict(self, x):
        x = np.array(x)
        output = np.zeros(x.shape[0])
        for i in range(x.shape[0]):
            output[i] = self.forward(x[i, :])
        return output
```
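Putting it together, a small end-to-end sketch (toy data, hyperparameters chosen arbitrarily) that fits a noisy sine curve with a 1-8-1 network:

```python
import numpy as np

np.random.seed(0)
X = np.linspace(-1, 1, 100).reshape(-1, 1)
y = 0.5 * np.sin(2 * np.pi * X[:, 0]) + np.random.normal(0, 0.02, 100)

model = MLP([1, 8, 1], activation=[None, 'tanh', 'tanh'])
losses = model.fit(X, y, learning_rate=0.05, epochs=200)
print('mean loss, first vs last epoch:', losses[0], losses[-1])

preds = model.predict(X)   # shape (100,)
```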
2. The MLP Module in LLaMA
The MLP used in LLaMA has a more elaborate structure: it introduces a gating mechanism and an optional sliced, tensor-parallel-style computation path, aimed at improving training and inference efficiency.
```python
import torch
import torch.nn.functional as F
from torch import nn
from transformers.activations import ACT2FN


class LlamaMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        if self.config.pretraining_tp > 1:
            # Split the projection weights into pretraining_tp slices and compute them piecewise,
            # reproducing the tensor-parallel layout used during pretraining.
            slice = self.intermediate_size // self.config.pretraining_tp
            gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
            up_proj_slices = self.up_proj.weight.split(slice, dim=0)
            down_proj_slices = self.down_proj.weight.split(slice, dim=1)

            gate_proj = torch.cat([F.linear(x, w) for w in gate_proj_slices], dim=-1)
            up_proj = torch.cat([F.linear(x, w) for w in up_proj_slices], dim=-1)

            intermediate = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
            down_proj = sum(F.linear(i, w) for i, w in zip(intermediate, down_proj_slices))
        else:
            # Gated (SwiGLU-style) feed-forward: down( act(gate(x)) * up(x) )
            down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj
```
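A minimal way to exercise this module outside the full model (a sketch: in the real codebase `config` is a `LlamaConfig` from `transformers`; here a `SimpleNamespace` carrying only the fields read above stands in, and the sizes are toy values):

```python
from types import SimpleNamespace

import torch

config = SimpleNamespace(
    hidden_size=64,        # toy size; real LLaMA models are much larger
    intermediate_size=172,
    hidden_act='silu',     # LLaMA uses the SiLU/Swish activation
    pretraining_tp=1,      # > 1 would take the sliced tensor-parallel branch
)

mlp = LlamaMLP(config)
x = torch.randn(2, 10, config.hidden_size)  # (batch, seq_len, hidden_size)
out = mlp(x)
print(out.shape)  # torch.Size([2, 10, 64])
```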
3. Deriving the Backpropagation Formulas
Backpropagation is the core method for training neural networks. Via the chain rule, it efficiently propagates the gradient of the loss with respect to the parameters back through every layer, which is what makes gradient-descent training possible.
3.1 Gradient of the Loss at the Output Layer
Take the mean squared error (MSE) as an example.

Given the target output $t$ and the predicted output $z$, the loss is:

$$L = \frac{1}{2}(z - t)^2$$

Differentiating with respect to $z$:

$$\frac{\partial L}{\partial z} = z - t$$

The prediction is itself produced by an activation, $z = f(z_{in})$, for example $f = \tanh$ or $f = \sigma$. Then:

$$\delta = \frac{\partial L}{\partial z_{in}} = (z - t) \cdot f'(z_{in})$$
where:
- $f'(z_{in})$ is the derivative of the activation function
- $\nabla_W = x^T \delta$ is the gradient of this layer's weights
- $\nabla_b = \delta$ is the gradient of the bias
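As a concrete check of these formulas (a small numeric sketch with made-up values, using the half-MSE convention from above), the analytic gradient can be compared against a finite-difference estimate:

```python
import numpy as np

x = np.array([0.2, -0.4])        # input to the output layer
W = np.array([[0.5], [-0.3]])    # weights, shape (2, 1)
b = np.array([0.1])
t = np.array([0.7])              # target

z_in = x @ W + b                 # pre-activation
z = np.tanh(z_in)                # prediction

# delta = (z - t) * f'(z_in); for tanh, f'(z_in) = 1 - tanh(z_in)**2
delta = (z - t) * (1 - np.tanh(z_in) ** 2)
grad_W = np.outer(x, delta)      # ∇W = x^T delta
grad_b = delta                   # ∇b = delta

# Finite-difference check on W[0, 0] with L = 0.5 * (z - t)^2.
eps = 1e-6
dW = np.zeros_like(W); dW[0, 0] = eps
L_plus  = 0.5 * np.sum((np.tanh(x @ (W + dW) + b) - t) ** 2)
L_minus = 0.5 * np.sum((np.tanh(x @ (W - dW) + b) - t) ** 2)
print(grad_W[0, 0], (L_plus - L_minus) / (2 * eps))  # the two numbers should match
```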
3.2 Gradient Propagation Through Hidden Layers

For a hidden layer with output $h = f(z)$, we have:

$$\delta^{(l)} = \left( \delta^{(l+1)} \, (W^{(l+1)})^{T} \right) \odot f'(z^{(l)})$$
where:
- $\delta^{(l+1)}$ is the error signal of the next layer
- $W^{(l+1)}$ is the weight matrix connecting this layer to the next; with the $n_{l} \times n_{l+1}$ shape used in the code above, the transpose appears because the deltas are row vectors
- $\odot$ denotes the element-wise (Hadamard) product
- $f'(z^{(l)})$ is the derivative of the current layer's activation
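In code this is a single line, mirroring `HiddenLayer.backward` above (a sketch with random numbers, using tanh so that $f'(z^{(l)}) = 1 - (h^{(l)})^2$):

```python
import numpy as np

np.random.seed(1)
h_l = np.tanh(np.random.randn(1, 4))   # activations of layer l (row vector)
W_next = np.random.randn(4, 3)         # W^(l+1), connecting layer l to layer l+1
delta_next = np.random.randn(1, 3)     # error signal from layer l+1

# delta_l = (delta_{l+1} W_{l+1}^T) ⊙ f'(z_l); for tanh, f'(z_l) = 1 - h_l**2
delta_l = (delta_next @ W_next.T) * (1 - h_l ** 2)
print(delta_l.shape)  # (1, 4)
```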
3.3 The Gradient Update Rule

Under gradient descent, the weights and biases are updated as:

$$W := W - \eta \cdot \nabla_W$$

$$b := b - \eta \cdot \nabla_b$$
where:
- $\eta$ is the learning rate
- $\nabla_W = x^T \delta$ is the outer product of the input $x$ and the error $\delta$
- $\nabla_b = \delta$
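A single update step then looks like the following (a tiny sketch with made-up gradients):

```python
import numpy as np

eta = 0.1                            # learning rate
W = np.array([[0.5, -0.3]])
b = np.array([0.2])
grad_W = np.array([[0.04, -0.02]])   # pretend gradients from backpropagation
grad_b = np.array([0.01])

W -= eta * grad_W                    # W := W - eta * ∇W  ->  [[0.496, -0.298]]
b -= eta * grad_b                    # b := b - eta * ∇b  ->  [0.199]
print(W, b)
```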
This derivation is the mathematical backbone of the MLP: each layer computes its own error signal (delta), uses it to update its weights, and passes it backward to the layer before.