[Deep Learning with PyTorch] 2. Activation, Multilayer Perceptrons, and the MLP Implementation in LLaMA
1. A Basic Multilayer Perceptron (MLP) Implementation
The multilayer perceptron (Multilayer Perceptron, MLP) is the basic form of a feed-forward neural network, built from a stack of fully connected layers. Each layer consists of a weight matrix, a bias vector, and a nonlinear activation function.
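Concretely, writing the input as $h^{(0)} = x$ and treating each sample as a row vector (the same layout as `np.dot(input, self.W)` in the code below), layer $l$ computes:

$$h^{(l)} = f\left(h^{(l-1)} W^{(l)} + b^{(l)}\right)$$

where $W^{(l)}$ has shape $n_{l-1} \times n_l$ and $f$ is the layer's activation function.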
1.1 The Activation Module
Activation functions introduce nonlinearity. Two common choices are tanh (the hyperbolic tangent) and sigmoid (the logistic function); their derivatives are needed for the gradient computations during backpropagation.
```python
import numpy as np


class Activation:
    """Pairs an activation function with its derivative (written in terms of the activation value)."""

    def __tanh(self, x):
        return np.tanh(x)

    def __tanh_deriv(self, a):
        # a = tanh(x), so d(tanh)/dx = 1 - a^2
        return 1.0 - a ** 2

    def __logistic(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def __logistic_deriv(self, a):
        # a = sigmoid(x), so d(sigmoid)/dx = a * (1 - a)
        return a * (1 - a)

    def __init__(self, activation='tanh'):
        if activation == 'logistic':
            self.f = self.__logistic
            self.f_deriv = self.__logistic_deriv
        elif activation == 'tanh':
            self.f = self.__tanh
            self.f_deriv = self.__tanh_deriv
```
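As a quick sanity check of the class above (a minimal sketch; the variable names are only for illustration), we can evaluate tanh and compare its analytic derivative with a finite-difference estimate:

```python
import numpy as np

# Minimal usage sketch for the Activation class defined above.
act = Activation('tanh')
x = np.array([-1.0, 0.0, 1.0])
a = act.f(x)             # activation values tanh(x)
da = act.f_deriv(a)      # analytic derivative, expressed via a: 1 - a**2

# Finite-difference check of the derivative.
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.allclose(da, numeric, atol=1e-6))  # expected: True
```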
1.2 The Hidden Layer Class: HiddenLayer
The class below defines a hidden layer of the MLP: it initializes the weights and bias and implements the forward and backward passes.
```python
class HiddenLayer:
    def __init__(self, n_in, n_out, activation_last_layer='tanh', activation='tanh'):
        self.input = None
        self.activation = Activation(activation).f
        # Derivative of the *previous* layer's activation, used when propagating delta backward.
        self.activation_deriv = Activation(activation_last_layer).f_deriv if activation_last_layer else None

        # Xavier/Glorot uniform initialization.
        limit = np.sqrt(6. / (n_in + n_out))
        self.W = np.random.uniform(-limit, limit, (n_in, n_out))
        self.b = np.zeros(n_out)

        self.grad_W = np.zeros_like(self.W)
        self.grad_b = np.zeros_like(self.b)

    def forward(self, input):
        # Linear transform followed by the (optional) nonlinearity.
        lin_output = np.dot(input, self.W) + self.b
        self.output = lin_output if self.activation is None else self.activation(lin_output)
        self.input = input
        return self.output

    def backward(self, delta, output_layer=False):
        # Gradients for this layer's parameters: grad_W = input^T delta, grad_b = delta.
        self.grad_W = np.atleast_2d(self.input).T @ np.atleast_2d(delta)
        self.grad_b = delta
        # Propagate the error to the previous layer: delta_prev = (delta W^T) * f'(previous activation).
        if self.activation_deriv:
            delta = delta @ self.W.T * self.activation_deriv(self.input)
        return delta
```
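To see the shapes involved, here is a minimal sketch (made-up numbers) of one forward and one backward pass through a single `HiddenLayer`:

```python
import numpy as np

np.random.seed(0)
# First layer of a network: no previous activation, so activation_last_layer=None.
layer = HiddenLayer(n_in=3, n_out=2, activation_last_layer=None, activation='tanh')

x = np.array([0.5, -0.2, 0.1])   # one input sample
out = layer.forward(x)           # shape (2,)

delta = np.array([0.1, -0.3])    # pretend error signal coming from the layer above
_ = layer.backward(delta)        # fills grad_W (3, 2) and grad_b (2,)
print(out.shape, layer.grad_W.shape, layer.grad_b.shape)
```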
1.3 The Multilayer Perceptron Class: MLP
The full MLP stacks several hidden layers and provides methods for the forward pass, the loss criterion, backpropagation, parameter updates, and prediction.
```python
class MLP:
    def __init__(self, layers, activation=[None, 'tanh', 'tanh']):
        # layers: list of layer sizes, e.g. [n_in, n_hidden, n_out]
        self.layers = []
        self.activation = activation
        for i in range(len(layers) - 1):
            self.layers.append(HiddenLayer(layers[i], layers[i + 1], activation[i], activation[i + 1]))

    def forward(self, input):
        for layer in self.layers:
            input = layer.forward(input)
        return input

    def criterion_MSE(self, y, y_hat):
        # Squared error plus the delta for the output layer.
        activation_deriv = Activation(self.activation[-1]).f_deriv
        error = y - y_hat
        loss = error ** 2
        delta = -2 * error * activation_deriv(y_hat)
        return loss, delta

    def backward(self, delta):
        delta = self.layers[-1].backward(delta, output_layer=True)
        for layer in reversed(self.layers[:-1]):
            delta = layer.backward(delta)

    def update(self, lr):
        for layer in self.layers:
            layer.W -= lr * layer.grad_W
            layer.b -= lr * layer.grad_b

    def fit(self, X, y, learning_rate=0.1, epochs=100):
        # Stochastic gradient descent: one randomly chosen sample per update.
        X = np.array(X)
        y = np.array(y)
        to_return = np.zeros(epochs)
        for k in range(epochs):
            loss = np.zeros(X.shape[0])
            for it in range(X.shape[0]):
                i = np.random.randint(X.shape[0])
                y_hat = self.forward(X[i])
                loss[it], delta = self.criterion_MSE(y[i], y_hat)
                self.backward(delta)
                self.update(learning_rate)
            to_return[k] = np.mean(loss)
        return to_return

    def predict(self, x):
        x = np.array(x)
        output = np.zeros(x.shape[0])
        for i in range(x.shape[0]):
            output[i] = self.forward(x[i, :])
        return output
```
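Putting it together, a small end-to-end sketch (toy data, hyperparameters chosen arbitrarily) that fits a noisy sine curve with a 1-8-1 network:

```python
import numpy as np

np.random.seed(0)
X = np.linspace(-1, 1, 100).reshape(-1, 1)
y = 0.5 * np.sin(2 * np.pi * X[:, 0]) + np.random.normal(0, 0.02, 100)

model = MLP([1, 8, 1], activation=[None, 'tanh', 'tanh'])
losses = model.fit(X, y, learning_rate=0.05, epochs=200)
print('mean loss, first vs last epoch:', losses[0], losses[-1])

preds = model.predict(X)   # shape (100,)
```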
2. The MLP Module in LLaMA
The MLP used in LLaMA has a more elaborate structure: it introduces a gating mechanism and an optional sliced, tensor-parallel-style computation path, aimed at improving training and inference efficiency.
```python
import torch
import torch.nn.functional as F
from torch import nn
from transformers.activations import ACT2FN


class LlamaMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        if self.config.pretraining_tp > 1:
            # Split the projection weights into pretraining_tp slices and compute them piecewise,
            # reproducing the tensor-parallel layout used during pretraining.
            slice = self.intermediate_size // self.config.pretraining_tp
            gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
            up_proj_slices = self.up_proj.weight.split(slice, dim=0)
            down_proj_slices = self.down_proj.weight.split(slice, dim=1)

            gate_proj = torch.cat([F.linear(x, w) for w in gate_proj_slices], dim=-1)
            up_proj = torch.cat([F.linear(x, w) for w in up_proj_slices], dim=-1)

            intermediate = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
            down_proj = sum(F.linear(i, w) for i, w in zip(intermediate, down_proj_slices))
        else:
            # Gated (SwiGLU-style) feed-forward: down( act(gate(x)) * up(x) )
            down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj
```
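A minimal way to exercise this module outside the full model (a sketch: in the real codebase `config` is a `LlamaConfig` from `transformers`; here a `SimpleNamespace` carrying only the fields read above stands in, and the sizes are toy values):

```python
from types import SimpleNamespace

import torch

config = SimpleNamespace(
    hidden_size=64,        # toy size; real LLaMA models are much larger
    intermediate_size=172,
    hidden_act='silu',     # LLaMA uses the SiLU/Swish activation
    pretraining_tp=1,      # > 1 would take the sliced tensor-parallel branch
)

mlp = LlamaMLP(config)
x = torch.randn(2, 10, config.hidden_size)  # (batch, seq_len, hidden_size)
out = mlp(x)
print(out.shape)  # torch.Size([2, 10, 64])
```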
3. Deriving the Backpropagation Formulas
Backpropagation is the core method for training neural networks. Via the chain rule, it efficiently propagates the gradient of the loss with respect to the parameters back through every layer, which is what makes gradient-descent training possible.
3.1 Gradient of the Loss at the Output Layer
Take the mean squared error (MSE) as an example.

Given the target output $t$ and the predicted output $z$, the loss is:

$$L = \frac{1}{2}(z - t)^2$$

Differentiating with respect to $z$:

$$\frac{\partial L}{\partial z} = z - t$$

The prediction is itself produced by an activation, $z = f(z_{in})$, for example $f = \tanh$ or $f = \sigma$. Then:

$$\delta = \frac{\partial L}{\partial z_{in}} = (z - t) \cdot f'(z_{in})$$
where:
- $f'(z_{in})$ is the derivative of the activation function
- $\nabla_W = x^T \delta$ is the gradient of this layer's weights
- $\nabla_b = \delta$ is the gradient of the bias
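As a concrete check of these formulas (a small numeric sketch with made-up values, using the half-MSE convention from above), the analytic gradient can be compared against a finite-difference estimate:

```python
import numpy as np

x = np.array([0.2, -0.4])        # input to the output layer
W = np.array([[0.5], [-0.3]])    # weights, shape (2, 1)
b = np.array([0.1])
t = np.array([0.7])              # target

z_in = x @ W + b                 # pre-activation
z = np.tanh(z_in)                # prediction

# delta = (z - t) * f'(z_in); for tanh, f'(z_in) = 1 - tanh(z_in)**2
delta = (z - t) * (1 - np.tanh(z_in) ** 2)
grad_W = np.outer(x, delta)      # ∇W = x^T delta
grad_b = delta                   # ∇b = delta

# Finite-difference check on W[0, 0] with L = 0.5 * (z - t)^2.
eps = 1e-6
dW = np.zeros_like(W); dW[0, 0] = eps
L_plus  = 0.5 * np.sum((np.tanh(x @ (W + dW) + b) - t) ** 2)
L_minus = 0.5 * np.sum((np.tanh(x @ (W - dW) + b) - t) ** 2)
print(grad_W[0, 0], (L_plus - L_minus) / (2 * eps))  # the two numbers should match
```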
3.2 Gradient Propagation Through Hidden Layers

For a hidden layer with output $h = f(z)$, we have:

$$\delta^{(l)} = \left( \delta^{(l+1)} \, (W^{(l+1)})^{T} \right) \odot f'(z^{(l)})$$
where:
- $\delta^{(l+1)}$ is the error signal of the next layer
- $W^{(l+1)}$ is the weight matrix connecting this layer to the next; with the $n_{l} \times n_{l+1}$ shape used in the code above, the transpose appears because the deltas are row vectors
- $\odot$ denotes the element-wise (Hadamard) product
- $f'(z^{(l)})$ is the derivative of the current layer's activation
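In code this is a single line, mirroring `HiddenLayer.backward` above (a sketch with random numbers, using tanh so that $f'(z^{(l)}) = 1 - (h^{(l)})^2$):

```python
import numpy as np

np.random.seed(1)
h_l = np.tanh(np.random.randn(1, 4))   # activations of layer l (row vector)
W_next = np.random.randn(4, 3)         # W^(l+1), connecting layer l to layer l+1
delta_next = np.random.randn(1, 3)     # error signal from layer l+1

# delta_l = (delta_{l+1} W_{l+1}^T) ⊙ f'(z_l); for tanh, f'(z_l) = 1 - h_l**2
delta_l = (delta_next @ W_next.T) * (1 - h_l ** 2)
print(delta_l.shape)  # (1, 4)
```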
3.3 The Gradient Update Rule

Under gradient descent, the weights and biases are updated as:

$$W := W - \eta \cdot \nabla_W$$

$$b := b - \eta \cdot \nabla_b$$
where:
- $\eta$ is the learning rate
- $\nabla_W = x^T \delta$ is the outer product of the input $x$ and the error $\delta$
- $\nabla_b = \delta$
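A single update step then looks like the following (a tiny sketch with made-up gradients):

```python
import numpy as np

eta = 0.1                            # learning rate
W = np.array([[0.5, -0.3]])
b = np.array([0.2])
grad_W = np.array([[0.04, -0.02]])   # pretend gradients from backpropagation
grad_b = np.array([0.01])

W -= eta * grad_W                    # W := W - eta * ∇W  ->  [[0.496, -0.298]]
b -= eta * grad_b                    # b := b - eta * ∇b  ->  [0.199]
print(W, b)
```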
This derivation is the mathematical backbone of the MLP: each layer computes its own error signal (delta), uses it to update its weights, and passes it backward to the layer before.