Logistic Loss and Gradient Descent Implementation
Logistic Loss
**Squared Error Cost Function**
Linear regression uses the squared error cost function. The equation for the squared error cost with one variable is:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2 \tag{1}$$
where
$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{2}$$
The squared error cost has the property that following the derivative leads to the minimum. It works well for linear regression.
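As a quick reference, here is a minimal NumPy sketch of equation (1) for the single-feature linear model of equation (2); the function and variable names are illustrative, not taken from any particular lab.

```python
import numpy as np

def compute_cost_linear(x, y, w, b):
    """Squared error cost, equation (1), for f_{w,b}(x) = w*x + b."""
    m = x.shape[0]
    f_wb = w * x + b                      # predictions for all m examples
    return np.sum((f_wb - y) ** 2) / (2 * m)

# Tiny usage example: the cost is 0 when the model fits the data exactly.
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.0, 6.0])
print(compute_cost_linear(x_train, y_train, w=2.0, b=0.0))  # 0.0
```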
Challenges for Logistic Regression and Limitations of the Squared Error Cost
However, if the squared error cost function is applied directly to logistic regression (which uses the sigmoid function):
$$f_{w,b}(x^{(i)}) = \text{sigmoid}(wx^{(i)} + b)$$
the resulting cost surface is no longer well behaved (it can contain plateaus, local minima, and discontinuities), which makes it difficult for gradient descent to converge. A cost function better suited to the nonlinear nature of logistic regression is therefore needed.
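To see the problem concretely, one can evaluate the squared error cost of a sigmoid model over a range of $w$ values, as in the rough sketch below (the data and names are made up for illustration); unlike the linear case, the resulting curve is not a simple bowl and flattens out for large $|w|$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error_cost_sigmoid(x, y, w, b):
    """Squared error cost applied to a sigmoid model (shown only to illustrate the issue)."""
    f_wb = sigmoid(w * x + b)
    return np.sum((f_wb - y) ** 2) / (2 * x.shape[0])

# Scan the cost over a range of w values: the curve flattens out for large |w|
# instead of forming a single convex bowl.
x = np.array([0.5, 1.5, 2.5, 3.5])
y = np.array([0.0, 0.0, 1.0, 1.0])
for w in np.linspace(-10.0, 10.0, 9):
    print(f"w = {w:6.2f}, cost = {squared_error_cost_sigmoid(x, y, w, b=0.0):.4f}")
```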
Loss Function and Cost Function for Logistic Regression
Logistic regression uses a loss function designed for classification tasks, where the target value is 0 or 1.
- Loss is the error for a single example; cost is the average of the losses over all examples in the training set.
The loss function is defined as:
$$
\text{loss}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}\right) =
\begin{cases}
- \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) & \text{if } y^{(i)} = 1, \\
- \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) & \text{if } y^{(i)} = 0.
\end{cases}
$$
where $ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) $ and $g$ is the sigmoid function:
$$g(z) = \frac{1}{1 + e^{-z}}.$$
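For completeness, a direct NumPy translation of $g(z)$ is shown below (the output values in the comments are approximate).

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}); works on scalars and NumPy arrays alike."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                       # 0.5
print(sigmoid(np.array([-5.0, 5.0])))     # approximately [0.0067, 0.9933]
```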
Combining the two cases, the loss can be written in a single simplified form:
$$\text{loss}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}\right) = -y^{(i)} \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right). \tag{3}$$
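A minimal per-example implementation of equation (3) might look like the following sketch; the function name `logistic_loss` and the small clipping constant are my own additions (the clipping guards against `log(0)` and is not part of the formula).

```python
import numpy as np

def logistic_loss(f_wb_i, y_i, eps=1e-15):
    """Equation (3): logistic loss for a single example.

    f_wb_i is the model's prediction in (0, 1); y_i is the label, 0 or 1.
    """
    f = np.clip(f_wb_i, eps, 1.0 - eps)   # numerical safeguard, not part of eq. (3)
    return -y_i * np.log(f) - (1.0 - y_i) * np.log(1.0 - f)

print(logistic_loss(0.9, 1))   # small loss: prediction agrees with the target
print(logistic_loss(0.9, 0))   # large loss: prediction disagrees with the target
```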
Properties of the Loss Function
- The loss is 0 when the prediction matches the target:
  - When $ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = y^{(i)} $, the loss is 0.
- The loss grows rapidly as the prediction deviates from the target:
  - If the prediction is far from the target (e.g., $ y^{(i)} = 1 $ but $ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) \approx 0 $), the loss approaches $+\infty$.
The cost function is the average of the losses over all examples in the training set:
$$J(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \text{loss}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}\right). \tag{4}$$
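Putting the pieces together, a possible NumPy implementation of equation (4) is sketched below; `compute_cost_logistic` is an illustrative name, and `X` is assumed to be an `(m, n)` feature matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost_logistic(X, y, w, b):
    """Equation (4): average logistic loss over all m training examples.

    X: (m, n) feature matrix, y: (m,) labels in {0, 1}, w: (n,) weights, b: scalar.
    """
    m = X.shape[0]
    f_wb = sigmoid(X @ w + b)                                 # predictions, shape (m,)
    loss = -y * np.log(f_wb) - (1 - y) * np.log(1 - f_wb)     # equation (3) per example
    return np.sum(loss) / m

# Tiny usage example with two features.
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
print(compute_cost_logistic(X, y, w=np.array([1.0, 1.0]), b=-3.0))
```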
| Property | Squared Error Cost | Logistic Loss Cost |
|---|---|---|
| Use case | Linear regression (continuous-value prediction) | Classification (0/1 labels) |
| Surface shape | Smooth, bowl-shaped (no local minima) | Smooth, no plateaus or discontinuities (well suited to gradient descent) |
| Mathematical form | Based on the squared difference: $(f - y)^2$ | Based on the log loss: $-y\log f - (1-y)\log(1-f)$ |
Gradient Descent Algorithm & Logistic Regression Implementation
1. Gradient Descent Algorithm
Formula 1: Parameter Update Rule
$$
\begin{align*}
&\text{repeat until convergence:} \; \lbrace \\
&\quad w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \quad \text{for } j := 0..n-1 \tag{1} \\
&\quad b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\
&\rbrace
\end{align*}
$$
Explanation:
- In each iteration, all weights $w_j$ and the bias $b$ are updated simultaneously.
- $\alpha$ is the learning rate, which controls the step size of each update.
- Iterate until convergence (the change in the cost function falls below a threshold, or the maximum number of iterations is reached).
2. Gradient Calculation Formulas
Formula 2: Gradient with respect to $w_j$
$$\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) x_{j}^{(i)} \tag{2}$$
Formula 3: Gradient with respect to $b$
$$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) \tag{3}$$
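The function `compute_gradient_logistic` referenced in section 4 computes Formulas 2 and 3. A loop-based sketch that mirrors the formulas directly is shown below; the exact signature and array shapes are assumptions, not a prescribed interface.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradient_logistic(X, y, w, b):
    """Gradients of J(w, b) from Formulas 2 and 3, computed with explicit loops.

    X: (m, n) features, y: (m,) labels, w: (n,) weights, b: scalar bias.
    Returns dj_dw with shape (n,) and the scalar dj_db.
    """
    m, n = X.shape
    dj_dw = np.zeros(n)
    dj_db = 0.0
    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i], w) + b)   # prediction for example i
        err_i = f_wb_i - y[i]                   # prediction error
        for j in range(n):
            dj_dw[j] += err_i * X[i, j]         # accumulate Formula 2
        dj_db += err_i                          # accumulate Formula 3
    return dj_dw / m, dj_db / m
```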
3. Parameters & Model Definition
Key Parameters
- $m$: Number of training examples in the dataset.
- $n$: Number of features.
- $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$: The model's prediction for the $i$-th example (computed via the sigmoid function).
- $y^{(i)}$: Target value for the $i$-th example (0 or 1).
Logistic Regression Model Structure
$$
\begin{align*}
z &= \mathbf{w} \cdot \mathbf{x} + b, \\
f_{\mathbf{w},b}(\mathbf{x}) &= g(z), \\
g(z) &= \frac{1}{1 + e^{-z}} \quad \text{(sigmoid function)}.
\end{align*}
$$
Notes:
- $z = \mathbf{w} \cdot \mathbf{x} + b$ is the linear combination of the features.
- $f_{\mathbf{w},b}(\mathbf{x})$ is the output of the sigmoid function, with values in $(0,1)$.
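A direct translation of this model structure into NumPy might look like the sketch below; `predict_logistic` is an illustrative name, not part of any library used here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_logistic(x, w, b):
    """Model structure above: z = w . x + b, then f_{w,b}(x) = g(z)."""
    z = np.dot(w, x) + b          # linear combination of the features
    return sigmoid(z)             # output strictly between 0 and 1

x_i = np.array([1.0, 2.5])
w = np.array([0.5, -0.3])
print(predict_logistic(x_i, w, b=0.1))   # approximately 0.46
```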
4. Algorithm Implementation
Core Components
The gradient descent algorithm has two parts:
- Main Loop (`gradient_descent`):
  - Implements Equation (1) for the parameter updates (a possible sketch follows this list).
  - Typically provided by frameworks (e.g., scikit-learn).
- Gradient Calculation (`compute_gradient_logistic`):
  - Computes the gradients of Equations (2) and (3).
  - Requires manual implementation (e.g., in practice labs).
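One possible shape for the `gradient_descent` main loop is sketched below; the fixed iteration count, the parameter names, and passing the gradient function as an argument are assumptions rather than a prescribed interface. A real implementation would usually also track the cost and stop early once it stops decreasing.

```python
import numpy as np

def gradient_descent(X, y, w_init, b_init, alpha, num_iters, compute_gradient):
    """Repeats the update rule of Equation (1) for a fixed number of iterations."""
    w = np.array(w_init, dtype=float)
    b = float(b_init)
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)   # Formulas 2 and 3
        w = w - alpha * dj_dw                         # simultaneous update of all w_j
        b = b - alpha * dj_db
    return w, b

# Usage (assuming compute_gradient_logistic from the earlier sketch is in scope):
# w_out, b_out = gradient_descent(X, y, np.zeros(X.shape[1]), 0.0,
#                                 alpha=0.1, num_iters=1000,
#                                 compute_gradient=compute_gradient_logistic)
```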
5. Notes
- Vectorization: The nested loops (over examples and features) are written for clarity. In practice, use vectorization (e.g., NumPy matrix operations) for efficiency; see the vectorized sketch after this list.
- Sigmoid function: The prediction $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is bounded between 0 and 1 by the sigmoid.
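As an example of the vectorization note above, here is a sketch of the same gradient computation without explicit Python loops; it assumes the same `(m, n)` and `(m,)` array shapes as the earlier sketches.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradient_logistic_vectorized(X, y, w, b):
    """Formulas 2 and 3 using matrix operations instead of nested loops."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y      # (m,) vector of prediction errors
    dj_dw = (X.T @ err) / m           # (n,) gradient with respect to w
    dj_db = np.sum(err) / m           # scalar gradient with respect to b
    return dj_dw, dj_db
```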