
Common Optimizers: Principles and Working Mechanisms Explained

news 2025/10/17 7:17:25

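1. Gradient Descent (GD)

Principle:
Each update uses the average gradient over the entire dataset:
$\theta_{t+1} = \theta_t - \eta \cdot \nabla J(\theta_t)$
where $\eta$ is the learning rate. Because every update uses all the data, the update direction is the exact gradient of the training loss. Each step is computationally expensive, however, so the method is best suited to small datasets.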
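
As a worked case, take the one-parameter linear model $\hat{y} = \theta x$ with mean squared error, which is the setting the Python code in this section uses. Under that assumption (it is an illustration, not a general formula), the loss and its gradient can be written out as:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \theta x^{(i)}\right)^2, \qquad \nabla J(\theta) = -\frac{2}{m}\sum_{i=1}^{m} x^{(i)}\left(y^{(i)} - \theta x^{(i)}\right)$$

The sum divided by $m$ is a mean over all samples, which is why the code computes the gradient as `-2 * np.mean(x * (y - y_pred))`.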

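Characteristics:

High computational cost (every update requires a pass over all the data)
May get stuck in local minima
Convergence is guaranteed for convex functions, given a suitably small learning rate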
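
The convergence property can be checked on a toy problem. The sketch below is a minimal illustration: the function $J(\theta) = (\theta - 3)^2$, the helper name `run_gd`, and both learning rates are chosen here for the example. For this function each step multiplies the error $\theta - 3$ by $(1 - 2\eta)$, so the iterates shrink toward the minimum only when $0 < \eta < 1$.

```python
def run_gd(learning_rate, steps=100, theta0=0.0):
    """Run gradient descent on the convex function J(theta) = (theta - 3)**2."""
    theta = theta0
    for _ in range(steps):
        gradient = 2.0 * (theta - 3.0)  # dJ/dtheta
        theta -= learning_rate * gradient
    return theta


# Hypothetical learning rates chosen to show both regimes
theta_small = run_gd(learning_rate=0.1)  # error shrinks by a factor of 0.8 per step
theta_large = run_gd(learning_rate=1.1)  # error is multiplied by -1.2 per step

print("distance from the minimum, eta = 0.1:", abs(theta_small - 3.0))
print("distance from the minimum, eta = 1.1:", abs(theta_large - 3.0))
```

With the smaller step the distance to the minimum becomes negligible, while with the larger step it grows without bound. The convexity of the function alone is therefore not enough; the learning rate has to be small enough as well.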

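Python code:

import numpy as np


def gradient_descent(x, y, learning_rate=0.01, epochs=1000):
    """
    Fit the single parameter of the linear model y = theta * x with
    full-batch gradient descent on the mean squared error.

    Args:
        x (numpy.ndarray): Input features, shape (m,), where m is the number of samples.
        y (numpy.ndarray): Target values, shape (m,).
        learning_rate (float): Step size of each parameter update.
        epochs (int): Number of iterations (passes over the data).

    Returns:
        float: The optimized parameter theta.
    """
    theta = np.random.randn()  # random initial value for theta

    for epoch in range(epochs):
        y_pred = theta * x  # predictions with the current parameter
        # Gradient of the mean squared error with respect to theta
        gradient = -2 * np.mean(x * (y - y_pred))
        # Step against the gradient
        theta -= learning_rate * gradient

    return theta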

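# Using PyTorch. model, criterion, entire_dataset (an iterable of batches
# that supports len()) and lr are assumed to be defined elsewhere.
import torch

# Zero the gradients once, so that they accumulate over the whole dataset
model.zero_grad()
num_batches = len(entire_dataset)
for inputs, labels in entire_dataset:
    # Forward pass: compute the model outputs
    outputs = model(inputs)
    # Scale the loss so the accumulated gradient is the dataset average
    # (exact when criterion averages over the batch and batches are equal in size)
    loss = criterion(outputs, labels) / num_batches
    # Backward pass: gradients are added to param.grad
    loss.backward()

# One manual full-batch update after the whole dataset has been seen
with torch.no_grad():
    for param in model.parameters():
        if param.grad is not None:  # skip parameters that received no gradient
            param -= lr * param.grad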

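2. Stochastic Gradient Descent (SGD)

Principle:
Each step picks one sample at random, computes the gradient on that sample alone, and moves the parameters against that gradient, scaled by the learning rate:
$\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t; x^{(i)}, y^{(i)})$

Characteristics: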

The gradient noise can help the parameters escape local minima
Updates fluctuate heavily
Well suited to online learning
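
The noise mentioned above can be made concrete. The following is a minimal sketch on hypothetical toy data (the data, the seed, the value of `theta` and the helper name `sample_gradient` are all assumptions made for the illustration). It computes the gradient of the squared error for every single sample and compares those values with their average, which is the gradient that full-batch gradient descent would use.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: y = 2x plus a little Gaussian noise
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]

theta = 0.5  # an arbitrary current parameter value


def sample_gradient(x, y, theta):
    """Gradient of the squared error (y - theta * x)**2 for one sample."""
    return -2.0 * x * (y - theta * x)


per_sample = [sample_gradient(x, y, theta) for x, y in zip(xs, ys)]
full_gradient = sum(per_sample) / len(per_sample)  # the full-batch gradient

# Spread of the single-sample gradients around the full gradient
variance = sum((g - full_gradient) ** 2 for g in per_sample) / len(per_sample)

print("full gradient:", full_gradient)
print("single-sample gradients range from", min(per_sample), "to", max(per_sample))
print("variance of single-sample gradients:", variance)
```

Averaged over all samples, the single-sample gradients equal the full gradient, so each SGD step points in the right direction on average. Any individual step can still differ considerably from it, which is the source of both the fluctuation and the ability to leave shallow minima.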

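Python code:

import numpy as np


def sgd(x, y, learning_rate=0.01, epochs=1000):
    m = len(y)  # number of samples
    theta = np.random.randn()  # random initial value for theta
    for _ in range(epochs):
        # Visit the samples in a fresh random order every epoch
        for i in np.random.permutation(m):
            y_pred = theta * x[i]
            # Gradient of the squared error on this single sample
            gradient = -2 * x[i] * (y[i] - y_pred)
            theta -= learning_rate * gradient
    return theta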

# Set batch_size=1 on the DataLoader so that each step uses a single sample
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

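3. Mini-Batch Stochastic Gradient Descent (Mini-Batch SGD, MB-SGD)

Principle:
A compromise between the two methods above: each update uses n samples (typically n = 32 to 512):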
$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{n}\sum_{i=1}^n \nabla_\theta J(\theta_t; x^{(i)}, y^{(i)})$

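Characteristics:

Lower gradient variance than single-sample SGD, with better computational efficiency
The default choice in modern deep learning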
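
The variance reduction can be estimated empirically. The sketch below is an illustration under stated assumptions: the toy data, the seed, the batch sizes and the helper names `batch_gradient` and `gradient_variance` are all chosen here. It draws many random mini-batches (sampled without replacement) at a fixed parameter value and measures how much the mini-batch gradient varies for each batch size.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: y = 2x plus a little Gaussian noise
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]
theta = 0.5  # an arbitrary current parameter value


def batch_gradient(indices, theta):
    """Mean squared-error gradient over the samples at the given indices."""
    return sum(-2.0 * xs[i] * (ys[i] - theta * xs[i]) for i in indices) / len(indices)


def gradient_variance(batch_size, trials=500):
    """Empirical variance of the mini-batch gradient over random batches."""
    grads = [
        batch_gradient(random.sample(range(len(xs)), batch_size), theta)
        for _ in range(trials)
    ]
    mean = sum(grads) / trials
    return sum((g - mean) ** 2 for g in grads) / trials


variances = {n: gradient_variance(n) for n in (1, 32, 256)}
for n, v in variances.items():
    print(f"batch size {n:>3}: gradient variance {v:.5f}")
```

On this data the measured variance drops as the batch grows. Larger batches give steadier updates, at the price of more computation per step.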

Python code:

def mini_batch_sgd(x, y, batch_size=32, learning_rate=0.01, epochs=1000):
    m = len(y)
    theta = np.random.randn()
    for _ in range(epochs):
        indices = np.random.permutation(m)
        x_shuffled, y_shuffled = x[indices]