Logistic Loss and Gradient Descent Implementation
Logistic Loss
**Squared Error Cost Function**
Linear regression uses the squared error cost function. The equation for the squared error cost with one variable is:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2 \tag{1}$$
where
$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{2}$$
The squared error cost has the property that following the derivative leads to the minimum. It works well for linear regression.
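As a quick reference, here is a minimal NumPy sketch of equation (1) for the single-feature linear model of equation (2); the function and variable names are illustrative, not taken from any particular lab.

```python
import numpy as np

def compute_cost_linear(x, y, w, b):
    """Squared error cost, equation (1), for f_{w,b}(x) = w*x + b."""
    m = x.shape[0]
    f_wb = w * x + b                      # predictions for all m examples
    return np.sum((f_wb - y) ** 2) / (2 * m)

# Tiny usage example: the cost is 0 when the model fits the data exactly.
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.0, 6.0])
print(compute_cost_linear(x_train, y_train, w=2.0, b=0.0))  # 0.0
```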
Challenges for Logistic Regression and Limitations of the Squared Error Cost
However, if the squared error cost function is applied directly to logistic regression (which uses the sigmoid function):
$$f_{w,b}(x^{(i)}) = \text{sigmoid}(wx^{(i)} + b)$$
the resulting cost surface is no longer well behaved (it can contain plateaus, local minima, and discontinuities), which makes it difficult for gradient descent to converge. A cost function better suited to the nonlinear nature of logistic regression is therefore needed.
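To see the problem concretely, one can evaluate the squared error cost of a sigmoid model over a range of $w$ values, as in the rough sketch below (the data and names are made up for illustration); unlike the linear case, the resulting curve is not a simple bowl and flattens out for large $|w|$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error_cost_sigmoid(x, y, w, b):
    """Squared error cost applied to a sigmoid model (shown only to illustrate the issue)."""
    f_wb = sigmoid(w * x + b)
    return np.sum((f_wb - y) ** 2) / (2 * x.shape[0])

# Scan the cost over a range of w values: the curve flattens out for large |w|
# instead of forming a single convex bowl.
x = np.array([0.5, 1.5, 2.5, 3.5])
y = np.array([0.0, 0.0, 1.0, 1.0])
for w in np.linspace(-10.0, 10.0, 9):
    print(f"w = {w:6.2f}, cost = {squared_error_cost_sigmoid(x, y, w, b=0.0):.4f}")
```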
Loss Function and Cost Function for Logistic Regression
Logistic regression uses a loss function designed for classification tasks, where the target value is 0 or 1.
- Loss is the error for a single example; cost is the average of the losses over all examples in the training set.
The loss function is defined as:
$$
\text{loss}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}\right) =
\begin{cases}
- \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) & \text{if } y^{(i)} = 1, \\
- \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) & \text{if } y^{(i)} = 0.
\end{cases}
$$
where $ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) $ and $g$ is the sigmoid function:
$$g(z) = \frac{1}{1 + e^{-z}}.$$
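For completeness, a direct NumPy translation of $g(z)$ is shown below (the output values in the comments are approximate).

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}); works on scalars and NumPy arrays alike."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                       # 0.5
print(sigmoid(np.array([-5.0, 5.0])))     # approximately [0.0067, 0.9933]
```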
Combining the two cases, the loss can be written in a single simplified form:
$$\text{loss}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}\right) = -y^{(i)} \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right). \tag{3}$$
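A minimal per-example implementation of equation (3) might look like the following sketch; the function name `logistic_loss` and the small clipping constant are my own additions (the clipping guards against `log(0)` and is not part of the formula).

```python
import numpy as np

def logistic_loss(f_wb_i, y_i, eps=1e-15):
    """Equation (3): logistic loss for a single example.

    f_wb_i is the model's prediction in (0, 1); y_i is the label, 0 or 1.
    """
    f = np.clip(f_wb_i, eps, 1.0 - eps)   # numerical safeguard, not part of eq. (3)
    return -y_i * np.log(f) - (1.0 - y_i) * np.log(1.0 - f)

print(logistic_loss(0.9, 1))   # small loss: prediction agrees with the target
print(logistic_loss(0.9, 0))   # large loss: prediction disagrees with the target
```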
Properties of the Loss Function
- The loss is 0 when the prediction matches the target:
  - When $ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = y^{(i)} $, the loss is 0.
- The loss grows rapidly as the prediction deviates from the target:
  - If the prediction is far from the target (e.g., $ y^{(i)} = 1 $ but $ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) \approx 0 $), the loss approaches $+\infty$.
The cost function is the average of the losses over all examples in the training set:
$$J(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \text{loss}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}\right). \tag{4}$$
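Putting the pieces together, a possible NumPy implementation of equation (4) is sketched below; `compute_cost_logistic` is an illustrative name, and `X` is assumed to be an `(m, n)` feature matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost_logistic(X, y, w, b):
    """Equation (4): average logistic loss over all m training examples.

    X: (m, n) feature matrix, y: (m,) labels in {0, 1}, w: (n,) weights, b: scalar.
    """
    m = X.shape[0]
    f_wb = sigmoid(X @ w + b)                                 # predictions, shape (m,)
    loss = -y * np.log(f_wb) - (1 - y) * np.log(1 - f_wb)     # equation (3) per example
    return np.sum(loss) / m

# Tiny usage example with two features.
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
print(compute_cost_logistic(X, y, w=np.array([1.0, 1.0]), b=-3.0))
```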
| Property | Squared Error Cost | Logistic Loss Cost |
|---|---|---|
| Use case | Linear regression (continuous-value prediction) | Classification (0/1 labels) |
| Surface shape | Smooth, bowl-shaped (no local minima) | Smooth, no plateaus or discontinuities (well suited to gradient descent) |
| Mathematical form | Based on the squared difference: $(f - y)^2$ | Based on the log loss: $-y\log f - (1-y)\log(1-f)$ |
Gradient Descent Algorithm & Logistic Regression Implementation
1. Gradient Descent Algorithm
Formula 1: Parameter Update Rule
$$
\begin{align*}
&\text{repeat until convergence:} \; \lbrace \\
&\quad w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \quad \text{for } j := 0..n-1 \tag{1} \\
&\quad b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\
&\rbrace
\end{align*}
$$
Explanation:
- In each iteration, all weights $w_j$ and the bias $b$ are updated simultaneously.
- $\alpha$ is the learning rate, which controls the step size of each update.
- Iterate until convergence (the change in the cost function falls below a threshold, or the maximum number of iterations is reached).
2. Gradient Calculation Formulas
Formula 2: Gradient with respect to $w_j$
$$\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) x_{j}^{(i)} \tag{2}$$
Formula 3: Gradient with respect to $b$
$$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) \tag{3}$$
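The function `compute_gradient_logistic` referenced in section 4 computes Formulas 2 and 3. A loop-based sketch that mirrors the formulas directly is shown below; the exact signature and array shapes are assumptions, not a prescribed interface.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradient_logistic(X, y, w, b):
    """Gradients of J(w, b) from Formulas 2 and 3, computed with explicit loops.

    X: (m, n) features, y: (m,) labels, w: (n,) weights, b: scalar bias.
    Returns dj_dw with shape (n,) and the scalar dj_db.
    """
    m, n = X.shape
    dj_dw = np.zeros(n)
    dj_db = 0.0
    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i], w) + b)   # prediction for example i
        err_i = f_wb_i - y[i]                   # prediction error
        for j in range(n):
            dj_dw[j] += err_i * X[i, j]         # accumulate Formula 2
        dj_db += err_i                          # accumulate Formula 3
    return dj_dw / m, dj_db / m
```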
3. Parameters & Model Definition
Key Parameters
- $m$: Number of training examples in the dataset.
- $n$: Number of features.
- $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$: The model's prediction for the $i$-th example (computed via the sigmoid function).
- $y^{(i)}$: Target value for the $i$-th example (0 or 1).
Logistic Regression Model Structure
$$
\begin{align*}
z &= \mathbf{w} \cdot \mathbf{x} + b, \\
f_{\mathbf{w},b}(\mathbf{x}) &= g(z), \\
g(z) &= \frac{1}{1 + e^{-z}} \quad \text{(sigmoid function)}.
\end{align*}
$$
Notes:
- $z = \mathbf{w} \cdot \mathbf{x} + b$ is the linear combination of the features.
- $f_{\mathbf{w},b}(\mathbf{x})$ is the output of the sigmoid function, with values in $(0,1)$.
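A direct translation of this model structure into NumPy might look like the sketch below; `predict_logistic` is an illustrative name, not part of any library used here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_logistic(x, w, b):
    """Model structure above: z = w . x + b, then f_{w,b}(x) = g(z)."""
    z = np.dot(w, x) + b          # linear combination of the features
    return sigmoid(z)             # output strictly between 0 and 1

x_i = np.array([1.0, 2.5])
w = np.array([0.5, -0.3])
print(predict_logistic(x_i, w, b=0.1))   # approximately 0.46
```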
4. Algorithm Implementation
Core Components
The gradient descent algorithm has two parts:
- Main Loop (`gradient_descent`):
  - Implements Equation (1) for the parameter updates (a possible sketch follows this list).
  - Typically provided by frameworks (e.g., scikit-learn).
- Gradient Calculation (`compute_gradient_logistic`):
  - Computes the gradients of Equations (2) and (3).
  - Requires manual implementation (e.g., in practice labs).
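One possible shape for the `gradient_descent` main loop is sketched below; the fixed iteration count, the parameter names, and passing the gradient function as an argument are assumptions rather than a prescribed interface. A real implementation would usually also track the cost and stop early once it stops decreasing.

```python
import numpy as np

def gradient_descent(X, y, w_init, b_init, alpha, num_iters, compute_gradient):
    """Repeats the update rule of Equation (1) for a fixed number of iterations."""
    w = np.array(w_init, dtype=float)
    b = float(b_init)
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)   # Formulas 2 and 3
        w = w - alpha * dj_dw                         # simultaneous update of all w_j
        b = b - alpha * dj_db
    return w, b

# Usage (assuming compute_gradient_logistic from the earlier sketch is in scope):
# w_out, b_out = gradient_descent(X, y, np.zeros(X.shape[1]), 0.0,
#                                 alpha=0.1, num_iters=1000,
#                                 compute_gradient=compute_gradient_logistic)
```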
5. Notes
- Vectorization: The nested loops (over examples and features) are written for clarity. In practice, use vectorization (e.g., NumPy matrix operations) for efficiency; see the vectorized sketch after this list.
- Sigmoid function: The prediction $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is bounded between 0 and 1 by the sigmoid.
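As an example of the vectorization note above, here is a sketch of the same gradient computation without explicit Python loops; it assumes the same `(m, n)` and `(m,)` array shapes as the earlier sketches.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradient_logistic_vectorized(X, y, w, b):
    """Formulas 2 and 3 using matrix operations instead of nested loops."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y      # (m,) vector of prediction errors
    dj_dw = (X.T @ err) / m           # (n,) gradient with respect to w
    dj_db = np.sum(err) / m           # scalar gradient with respect to b
    return dj_dw, dj_db
```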