Revisiting Probabilistic Generative Models 1 -- Basic Concepts
1 KL Divergence
The role of KL divergence is to describe the difference between two distributions. Before comparing two distributions, we first need a way to measure a single one, and that measure is entropy.
1.1 Entropy
Before introducing entropy, we first quantify the information content of a single event:
$$
I(x) = -\log P(x)
$$
The expected information content over the whole distribution is the entropy:
$$
\begin{aligned}
H(P) &= E_{x \sim P}[-\log P(x)] \\
&= -\sum_x P(x)\log P(x)
\end{aligned}
$$
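The entropy formula above is straightforward to compute directly. A minimal NumPy sketch (the example distributions are made up for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) = -sum_x P(x) log P(x), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with P(x) = 0 contribute 0 by convention
    return float(-np.sum(p * np.log(p)))

# A fair coin maximizes entropy over two outcomes: log 2 ≈ 0.693 nats.
print(entropy([0.5, 0.5]))
# A biased coin is more predictable, so its entropy is lower.
print(entropy([0.9, 0.1]))
```

A deterministic outcome (`entropy([1.0])`) gives 0: an event that is certain carries no information.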
1.2 KL Divergence
Suppose the data's true distribution is P(x), but we mistakenly model it as Q(x). An event that should be encoded with $-\log P(x)$ is now encoded with $-\log Q(x)$; the expected extra cost, averaged under the true distribution P, is the KL divergence:
$$
\begin{aligned}
D_{KL}(P||Q) = E_{x \sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_x P(x)\log\frac{P(x)}{Q(x)}
\end{aligned}
$$
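The discrete sum translates directly into code. A minimal sketch (again with made-up distributions):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_x P(x) log(P(x)/Q(x)), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # 0 * log(0/q) = 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # > 0: extra cost of encoding P with a code built for Q
print(kl_divergence(p, p))  # 0: no mismatch, no extra cost
print(kl_divergence(q, p))  # note: differs from kl_divergence(p, q)
```

The last two calls illustrate two basic properties: $D_{KL}(P||P) = 0$, and KL divergence is not symmetric, so it is not a distance metric.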
1.3 Applications
- KL divergence in softmax classification
For each sample, the true distribution puts probability 1 on the correct class $k$:
$$
\begin{aligned}
P(x_k) &= 1, \quad Q(x_k) = \frac{e^{x_k}}{e^{x_1}+e^{x_2}+\dots+e^{x_n}} \\
D_{KL}(P||Q) &= -\log Q(x_k) = -\log\frac{e^{x_k}}{e^{x_1}+e^{x_2}+\dots+e^{x_n}}
\end{aligned}
$$

This is exactly the cross-entropy loss for the correct class.

- KL divergence between Gaussian distributions

Take P to be the standard normal $N(0, 1)$ and $Q = N(\mu, \sigma^2)$:

$$
\begin{aligned}
P(x) &= \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} \\
\log P(x) &= -\frac{1}{2}\log(2\pi) - \frac{x^2}{2} \\
Q(x) &= \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \\
\log Q(x) &= -\frac{1}{2}\log(2\pi) - \frac{(x-\mu)^2}{2\sigma^2} - \log\sigma \\
D_{KL}(P||Q) &= E_p[\log P(x) - \log Q(x)] = E_p\left[\log\sigma + \frac{(x-\mu)^2}{2\sigma^2} - \frac{x^2}{2}\right] \\
&= \log\sigma + \frac{1}{2\sigma^2}E_p[(x-\mu)^2] - \frac{1}{2}E_p[x^2] \\
&= \log\sigma + \frac{1+\mu^2}{2\sigma^2} - \frac{1}{2}
\end{aligned}
$$
Here, the intuitive reading of $E_p[(x-\mu)^2]$ is: total squared distance = squared jitter (variance) + squared offset.
$$
\begin{aligned}
E_p[(x-\mu)^2] &= E_p[(x - E(x) + E(x) - \mu)^2] \\
&= E_p[(x - E(x))^2] + 2E_p[x - E(x)]\,[E(x) - \mu] + E_p[(E(x) - \mu)^2] \\
&= \mathrm{var}(x) + \mu^2
\end{aligned}
$$

The cross term vanishes because $E_p[x - E(x)] = 0$, and since $E(x) = 0$ under the standard normal P, the last term reduces to $\mu^2$; with $\mathrm{var}(x) = 1$ this gives the $1 + \mu^2$ used above.
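Both closed forms above can be sanity-checked numerically. A sketch with NumPy, where the logits, the chosen class $k$, and the values of $\mu$ and $\sigma$ are arbitrary example inputs:

```python
import numpy as np

# 1) Softmax case: with a one-hot P on class k, D_KL(P||Q) = -log Q(x_k).
logits = np.array([2.0, 0.5, -1.0])
k = 0  # assumed correct class for this example
q = np.exp(logits) / np.exp(logits).sum()
p = np.zeros_like(q)
p[k] = 1.0
kl_softmax = np.sum(p[p > 0] * np.log(p[p > 0] / q[p > 0]))
print(kl_softmax, -np.log(q[k]))  # the two values coincide

# 2) Gaussian case: P = N(0, 1), Q = N(mu, sigma^2).
mu, sigma = 0.7, 1.3
closed_form = np.log(sigma) + (1 + mu**2) / (2 * sigma**2) - 0.5

# Monte Carlo estimate of E_p[log P(x) - log Q(x)] with samples from P.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
log_p = -0.5 * np.log(2 * np.pi) - x**2 / 2
log_q = -0.5 * np.log(2 * np.pi) - (x - mu)**2 / (2 * sigma**2) - np.log(sigma)
mc_estimate = np.mean(log_p - log_q)
print(closed_form, mc_estimate)  # should closely match
```

The Monte Carlo average converges to the closed form as the sample count grows, confirming the derivation.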