🌟 Deep Learning: A Worked Example of Propagation in a Three-Layer Network ($L_0 \to L_1 \to L_2$)
To keep the arithmetic manageable, we reduce the network to: input layer (2 neurons) → hidden layer 1 (2 neurons) → output layer (1 neuron).
Conventions:
- Activation function: Sigmoid, $\sigma(z) = \frac{1}{1+e^{-z}}$, with derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$.
- Loss function: mean squared error (MSE), $C = \frac{1}{2}(\hat{y} - y)^2$.
- Learning rate: $\eta = 0.1$.
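These conventions translate directly into code. A minimal sketch in plain Python (the function names are mine, not from the text):

```python
import math

# Sigmoid and its derivative, sigma'(z) = sigma(z) * (1 - sigma(z)).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# MSE loss C = 1/2 * (y_hat - y)^2; its derivative w.r.t. y_hat is (y_hat - y).
def mse(y_hat, y):
    return 0.5 * (y_hat - y) ** 2
```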
Initial parameters and inputs:

| Parameter | Value |
|---|---|
| Input $x$ (i.e. $a^{(0)}$) | $[0.05, 0.10]$ |
| True label $y$ | $[0.01]$ |
| $L_0 \to L_1$ weights $W^{(1)}$ | $\begin{pmatrix} 0.15 & 0.20 \\ 0.25 & 0.30 \end{pmatrix}$ |
| $L_1$ biases $b^{(1)}$ | $[0.35, 0.35]$ |
| $L_1 \to L_2$ weights $W^{(2)}$ | $\begin{pmatrix} 0.40 \\ 0.45 \end{pmatrix}$ (a $1 \times 2$ row vector after transposing) |
| $L_2$ bias $b^{(2)}$ | $[0.60]$ |
🚀 Stage 1: Forward Pass
1. From $L_0$ to $L_1$ (hidden layer)

- Compute the weighted inputs $z^{(1)}$ of $L_1$:

$$\begin{split} z^{(1)}_1 &= w^{(1)}_{11}x_1 + w^{(1)}_{21}x_2 + b^{(1)}_1 \\ &= (0.15)(0.05) + (0.25)(0.10) + 0.35 \\ &= 0.0075 + 0.025 + 0.35 = \mathbf{0.3825} \end{split}$$

$$\begin{split} z^{(1)}_2 &= w^{(1)}_{12}x_1 + w^{(1)}_{22}x_2 + b^{(1)}_2 \\ &= (0.20)(0.05) + (0.30)(0.10) + 0.35 \\ &= 0.01 + 0.03 + 0.35 = \mathbf{0.39} \end{split}$$

- Compute the activations $a^{(1)}$ of $L_1$:

$$a^{(1)}_1 = \sigma(0.3825) = \frac{1}{1+e^{-0.3825}} \approx \mathbf{0.594}$$

$$a^{(1)}_2 = \sigma(0.39) = \frac{1}{1+e^{-0.39}} \approx \mathbf{0.596}$$
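The hidden-layer step above can be checked numerically; a plain-Python sketch with the values from the table:

```python
import math

# Weighted inputs of L1: z_j = w_1j * x1 + w_2j * x2 + b_j
x1, x2 = 0.05, 0.10
z1 = 0.15 * x1 + 0.25 * x2 + 0.35   # 0.3825
z2 = 0.20 * x1 + 0.30 * x2 + 0.35   # 0.39

# Sigmoid activations of L1
a1 = 1.0 / (1.0 + math.exp(-z1))    # ~0.594
a2 = 1.0 / (1.0 + math.exp(-z2))    # ~0.596
```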
2. From $L_1$ to $L_2$ (output layer)

- Compute the weighted input $z^{(2)}$ of $L_2$:

$$\begin{split} z^{(2)}_1 &= w^{(2)}_{11}a^{(1)}_1 + w^{(2)}_{21}a^{(1)}_2 + b^{(2)}_1 \\ &= (0.40)(0.594) + (0.45)(0.596) + 0.60 \\ &= 0.2376 + 0.2682 + 0.60 = \mathbf{1.1058} \end{split}$$

- Compute the final output $\hat{y}$ of $L_2$ (i.e. $a^{(2)}$):

$$\hat{y} = a^{(2)}_1 = \sigma(1.1058) \approx \mathbf{0.751}$$
3. Compute the total loss $C$

- Mean squared error:

$$C = \frac{1}{2}(\hat{y} - y)^2 = \frac{1}{2}(0.751 - 0.01)^2 = \frac{1}{2}(0.741)^2 \approx \mathbf{0.2745}$$
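Continuing the sketch from the (rounded) hidden activations, the output-layer step and the loss can be checked the same way:

```python
import math

# Output-layer step, using the rounded hidden activations from above.
a1, a2 = 0.594, 0.596
z_out = 0.40 * a1 + 0.45 * a2 + 0.60      # 1.1058
y_hat = 1.0 / (1.0 + math.exp(-z_out))    # ~0.751

# MSE loss against the true label y = 0.01.
C = 0.5 * (y_hat - 0.01) ** 2             # ~0.275
```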
📉 Stage 2: Backward Pass
1. Compute the output-layer error term $\delta^{(2)}$ for $L_2$

$$\delta^{(2)} = \frac{\partial C}{\partial a^{(2)}} \odot \sigma'(z^{(2)})$$

- Derivative of the loss with respect to the output, $\frac{\partial C}{\partial a^{(2)}}$: $\hat{y} - y = 0.751 - 0.01 = 0.741$
- Sigmoid derivative $\sigma'(z^{(2)})$: $\hat{y}(1-\hat{y}) = 0.751(1-0.751) \approx 0.187$
- Error term $\delta^{(2)}$ (rounding only at the end):

$$\delta^{(2)} = 0.741 \times 0.187 \approx \mathbf{0.1384}$$
2. Compute the $L_2$ gradients and update $W^{(2)}, b^{(2)}$

- Bias gradient $\frac{\partial C}{\partial b^{(2)}}$: $\delta^{(2)} = \mathbf{0.1384}$
- Weight gradients $\frac{\partial C}{\partial W^{(2)}} = \delta^{(2)} \cdot (a^{(1)})^T$:

$$\begin{split} \frac{\partial C}{\partial w^{(2)}_{11}} &= \delta^{(2)} a^{(1)}_1 = 0.1384 \times 0.594 \approx \mathbf{0.0822} \\ \frac{\partial C}{\partial w^{(2)}_{21}} &= \delta^{(2)} a^{(1)}_2 = 0.1384 \times 0.596 \approx \mathbf{0.0825} \end{split}$$

- Gradient-descent update of $W^{(2)}$ and $b^{(2)}$ ($\eta = 0.1$):

$$\begin{split} w^{(2)}_{11,\text{new}} &= 0.40 - 0.1 \times 0.0822 \approx \mathbf{0.3918} \\ w^{(2)}_{21,\text{new}} &= 0.45 - 0.1 \times 0.0825 \approx \mathbf{0.4418} \\ b^{(2)}_{\text{new}} &= 0.60 - 0.1 \times 0.1384 \approx \mathbf{0.5862} \end{split}$$
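The output-layer backward step can be sketched end to end in plain Python. The forward pass is kept at full precision here, so last digits can differ slightly from the hand-rounded values above; variable names are mine:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

eta = 0.1
x1, x2, y = 0.05, 0.10, 0.01

# Forward pass (Stage 1), kept at full precision.
a1 = sigmoid(0.15 * x1 + 0.25 * x2 + 0.35)
a2 = sigmoid(0.20 * x1 + 0.30 * x2 + 0.35)
y_hat = sigmoid(0.40 * a1 + 0.45 * a2 + 0.60)

# delta^(2) = (y_hat - y) * sigma'(z^(2)), with sigma'(z^(2)) = y_hat * (1 - y_hat).
delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)   # ~0.1385 at full precision

# Gradients and one gradient-descent step on W^(2) and b^(2).
w11_new = 0.40 - eta * delta2 * a1             # ~0.3918
w21_new = 0.45 - eta * delta2 * a2             # ~0.4418
b2_new = 0.60 - eta * delta2                   # ~0.5862
```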
3. Compute the hidden-layer error term $\delta^{(1)}$ for $L_1$

$$\delta^{(1)} = \left( (W^{(2)})^T \delta^{(2)} \right) \odot \sigma'(z^{(1)})$$

- Propagated error $E_{\text{prop}}$ (using the old weights $W^{(2)}$):

$$E_{\text{prop},1} = w^{(2)}_{11}\delta^{(2)} = 0.40 \times 0.1384 = 0.05536$$

$$E_{\text{prop},2} = w^{(2)}_{21}\delta^{(2)} = 0.45 \times 0.1384 = 0.06228$$

- Sigmoid derivatives $\sigma'(z^{(1)})$:

$$\sigma'(z^{(1)}_1) = a^{(1)}_1(1-a^{(1)}_1) = 0.594(1-0.594) \approx 0.2412$$

$$\sigma'(z^{(1)}_2) = a^{(1)}_2(1-a^{(1)}_2) = 0.596(1-0.596) \approx 0.2408$$

- Error terms $\delta^{(1)}$:

$$\delta^{(1)}_1 = E_{\text{prop},1} \times 0.2412 = 0.05536 \times 0.2412 \approx \mathbf{0.01335}$$

$$\delta^{(1)}_2 = E_{\text{prop},2} \times 0.2408 = 0.06228 \times 0.2408 \approx \mathbf{0.01500}$$
4. Compute the $L_1$ gradients and update $W^{(1)}, b^{(1)}$

- Bias gradients $\frac{\partial C}{\partial b^{(1)}}$:

$$\frac{\partial C}{\partial b^{(1)}_1} = \delta^{(1)}_1 = \mathbf{0.01335}, \qquad \frac{\partial C}{\partial b^{(1)}_2} = \delta^{(1)}_2 = \mathbf{0.01500}$$

- Weight gradients $\frac{\partial C}{\partial W^{(1)}} = \delta^{(1)} \cdot (a^{(0)})^T$:

$$\begin{split} \frac{\partial C}{\partial w^{(1)}_{11}} &= \delta^{(1)}_1 x_1 = 0.01335 \times 0.05 \approx \mathbf{0.00067} \\ \frac{\partial C}{\partial w^{(1)}_{21}} &= \delta^{(1)}_1 x_2 = 0.01335 \times 0.10 \approx \mathbf{0.00134} \\ \frac{\partial C}{\partial w^{(1)}_{12}} &= \delta^{(1)}_2 x_1 = 0.01500 \times 0.05 = \mathbf{0.00075} \\ \frac{\partial C}{\partial w^{(1)}_{22}} &= \delta^{(1)}_2 x_2 = 0.01500 \times 0.10 = \mathbf{0.00150} \end{split}$$

- Gradient-descent update of $W^{(1)}$ and $b^{(1)}$ ($\eta = 0.1$):

$$\begin{split} w^{(1)}_{11,\text{new}} &= 0.15 - 0.1 \times 0.00067 \approx \mathbf{0.1499} \\ w^{(1)}_{21,\text{new}} &= 0.25 - 0.1 \times 0.00134 \approx \mathbf{0.2499} \\ w^{(1)}_{12,\text{new}} &= 0.20 - 0.1 \times 0.00075 \approx \mathbf{0.1999} \\ w^{(1)}_{22,\text{new}} &= 0.30 - 0.1 \times 0.00150 \approx \mathbf{0.2998} \\ b^{(1)}_{1,\text{new}} &= 0.35 - 0.1 \times 0.01335 \approx \mathbf{0.3487} \\ b^{(1)}_{2,\text{new}} &= 0.35 - 0.1 \times 0.01500 = \mathbf{0.3485} \end{split}$$
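The hidden-layer step admits the same sketch. As before, everything is computed at full precision, so last digits can differ slightly from the hand rounding; names are mine:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

eta = 0.1
x1, x2, y = 0.05, 0.10, 0.01

# Forward pass, then the output-layer error term as before.
a1 = sigmoid(0.15 * x1 + 0.25 * x2 + 0.35)
a2 = sigmoid(0.20 * x1 + 0.30 * x2 + 0.35)
y_hat = sigmoid(0.40 * a1 + 0.45 * a2 + 0.60)
delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)

# delta^(1): propagate delta^(2) back through the old W^(2), then
# multiply by sigma'(z^(1)) = a * (1 - a) elementwise.
delta1_1 = 0.40 * delta2 * a1 * (1.0 - a1)     # ~0.01335
delta1_2 = 0.45 * delta2 * a2 * (1.0 - a2)     # ~0.01500

# Gradients dC/dw_ij = delta1_j * x_i and one update step.
w11_new = 0.15 - eta * delta1_1 * x1           # ~0.1499
w22_new = 0.30 - eta * delta1_2 * x2           # ~0.2998
b1_1_new = 0.35 - eta * delta1_1               # ~0.3487
```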
Summary

After one complete training iteration (forward pass plus backward pass), every weight $W$ and bias $b$ has taken a small step in the direction that reduces the total loss $C$. On the next iteration the network computes with these new parameters, and the output $\hat{y}$ should move closer to the target $0.01$.
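Putting the two stages together, one full iteration can be sketched to confirm that the update really lowers the loss. This is a minimal plain-Python sketch under the conventions above, not code from the text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.05, 0.10]
y = 0.01
W1 = [[0.15, 0.20], [0.25, 0.30]]   # W1[i][j] connects input i to hidden neuron j
b1 = [0.35, 0.35]
W2 = [0.40, 0.45]
b2 = 0.60
eta = 0.1

def forward():
    a1 = [sigmoid(W1[0][j] * x[0] + W1[1][j] * x[1] + b1[j]) for j in range(2)]
    y_hat = sigmoid(W2[0] * a1[0] + W2[1] * a1[1] + b2)
    return a1, y_hat

a1, y_hat = forward()
C_before = 0.5 * (y_hat - y) ** 2    # ~0.2748 at full precision

# One training step: error terms, then simultaneous parameter updates.
delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)
delta1 = [W2[j] * delta2 * a1[j] * (1.0 - a1[j]) for j in range(2)]
W2 = [W2[j] - eta * delta2 * a1[j] for j in range(2)]
b2 = b2 - eta * delta2
W1 = [[W1[i][j] - eta * delta1[j] * x[i] for j in range(2)] for i in range(2)]
b1 = [b1[j] - eta * delta1[j] for j in range(2)]

a1, y_hat_new = forward()
C_after = 0.5 * (y_hat_new - y) ** 2   # lower than C_before
```

Because every update moves the parameters against the gradient with a small $\eta$, the recomputed output drops toward the target and the loss decreases.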

