Complete Mathematical Derivation of the DDPM Diffusion Model (Part 2)
Training Objective (the Loss Function)
We define $p_\theta(x_0)$ as the probability density that the model with parameters $\theta$ assigns to a data point $x_0$, where $x_0$ is a real image; that is, $p_\theta(x_0)$ describes how likely the sample $x_0$ is to occur.
Our training objective is then
$$\max_{\theta}\,p_\theta(x_0)$$
We usually rewrite this in negative-log form:
$$\min_{\theta}-\log p_\theta(x_0)$$
There is a key obstacle, however: $p_\theta(x_0)$ cannot be computed efficiently, because it requires marginalizing over all of the intermediate latent states $x_{1:T}$ that lead to $x_0$.
As a workaround, we introduce a variational bound on this objective:
$$-\log p_\theta(x_0)\leq-\log p_\theta(x_0)+D_{KL}(q(x_{1:T}\mid x_0)\,\|\,p_\theta(x_{1:T}\mid x_0))$$
which holds because the KL divergence is always non-negative.
The KL divergence here measures how close the true forward diffusion posterior is to the reverse denoising process learned by the model. If the two distributions coincide exactly, then $D_{KL}=0$, meaning the model has learned a perfect reversal of the diffusion process.
Expanding the KL divergence:
$$\begin{aligned}
D_{KL}&=\mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{1:T}\mid x_0)}\right] \\
&=\mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_0,x_{1:T})/p_\theta(x_0)}\right] \\
&=\mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})/p_\theta(x_0)}\right] \\
&=\mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}+\log p_\theta(x_0)\right] \\
&=\mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right]+\log p_\theta(x_0)
\end{aligned}$$
where the expectation is taken over $q(x_{1:T}\mid x_0)$, so $\log p_\theta(x_0)$ can be pulled out as a constant.
Substituting this expansion into $-\log p_\theta(x_0)\leq-\log p_\theta(x_0)+D_{KL}(q(x_{1:T}\mid x_0)\,\|\,p_\theta(x_{1:T}\mid x_0))$, the two $\log p_\theta(x_0)$ terms cancel:
$$\begin{aligned}
-\log p_\theta(x_0)&\leq-\log p_\theta(x_0)+D_{KL}(q(x_{1:T}\mid x_0)\,\|\,p_\theta(x_{1:T}\mid x_0)) \\
&=-\log p_\theta(x_0)+\mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right]+\log p_\theta(x_0) \\
&=\mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right]
\end{aligned}$$
Recall the Markov factorizations of the two processes:

- Forward process $x_0\to x_1\to\cdots\to x_T$:
$$q(x_{1:T}\mid x_0)=q(x_1\mid x_0)\,q(x_2\mid x_1)\,q(x_3\mid x_2)\cdots q(x_T\mid x_{T-1})=\prod_{t=1}^T q(x_t\mid x_{t-1})$$
- Reverse process $x_T\to x_{T-1}\to\cdots\to x_0$:
$$p_\theta(x_{0:T})=p(x_T)\,p_\theta(x_{T-1}\mid x_T)\,p_\theta(x_{T-2}\mid x_{T-1})\cdots p_\theta(x_0\mid x_1)=p(x_T)\prod_{t=1}^T p_\theta(x_{t-1}\mid x_t)$$
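These two factorizations are what make the bound computable term by term. As a concrete illustration, the forward factorization can be sampled step by step, or (using the closed form $q(x_t\mid x_0)=\mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0,(1-\bar{\alpha}_t)\mathbf{I})$ from Part 1) in a single jump. A minimal NumPy sketch; the linear schedule endpoints follow the DDPM paper, and the array shapes are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule: beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # \bar{alpha}_t = prod_{s<=t} alpha_s

def forward_stepwise(x0, t):
    """Sample x_t by chaining q(x_s | x_{s-1}) = N(sqrt(alpha_s) x_{s-1}, beta_s I) for s = 1..t."""
    x = x0
    for s in range(t):
        x = np.sqrt(alphas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)
    return x

def forward_closed_form(x0, t):
    """Sample x_t in one shot via q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

x0 = rng.standard_normal((8, 8))        # a toy "image"
xt = forward_closed_form(x0, 500)
```

Both routes produce samples with the same marginal distribution; the closed form is what makes training efficient, since any $x_t$ can be drawn without simulating the whole chain.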
With these factorizations, the bound can be simplified further:
$$\begin{aligned}
-\log p_\theta(x_0)&\leq\mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right] \\
&=\mathbb{E}\left[-\log p(x_T)+\log\frac{\prod_{t=1}^T q(x_t\mid x_{t-1})}{\prod_{t=1}^T p_\theta(x_{t-1}\mid x_t)}\right] \\
&=\mathbb{E}\left[-\log p(x_T)+\sum_{t=1}^T\log\frac{q(x_t\mid x_{t-1})}{p_\theta(x_{t-1}\mid x_t)}\right] \\
&=\mathbb{E}\left[-\log p(x_T)+\sum_{t=2}^T\log\frac{q(x_t\mid x_{t-1})}{p_\theta(x_{t-1}\mid x_t)}+\log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right]
\end{aligned}$$
where the last step splits the $t=1$ term out of the sum.
Because the model must fit the true reverse denoising distribution, we use Bayes' rule to convert $q(x_t\mid x_{t-1})$ and $q(x_1\mid x_0)$ into the form of $q(x_{t-1}\mid x_t)$ and $q(x_0\mid x_1)$; moreover, we want $q(x_t\mid x_{t-1})$ to be conditioned on the known $x_0$, so that the resulting posterior is tractable and learning is more effective.
Since the forward process is Markov, $q(x_t\mid x_{t-1})=q(x_t\mid x_{t-1},x_0)$, and Bayes' rule (everywhere conditioned on $x_0$) gives:
$$q(x_t\mid x_{t-1})=q(x_t\mid x_{t-1},x_0)=\frac{q(x_{t-1}\mid x_t,x_0)\,q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)}$$
Substituting this into the bound, we can simplify further:
$$\begin{aligned}
-\log p_\theta(x_0)&\leq\mathbb{E}\left[-\log p(x_T)+\sum_{t=2}^T\log\frac{q(x_t\mid x_{t-1})}{p_\theta(x_{t-1}\mid x_t)}+\log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\
&=\mathbb{E}\left[-\log p(x_T)+\sum_{t=2}^T\log\frac{q(x_{t-1}\mid x_t,x_0)\,q(x_t\mid x_0)}{p_\theta(x_{t-1}\mid x_t)\,q(x_{t-1}\mid x_0)}+\log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\
&=\mathbb{E}\left[-\log p(x_T)+\sum_{t=2}^T\log\frac{q(x_{t-1}\mid x_t,x_0)}{p_\theta(x_{t-1}\mid x_t)}+\sum_{t=2}^T\log\frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)}+\log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right]
\end{aligned}$$
Now take the term $\sum_{t=2}^T\log\frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)}$ and simplify it; the product telescopes:
$$\begin{aligned}
\sum_{t=2}^T\log\frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)}&=\log\prod_{t=2}^T\frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)} \\
&=\log\frac{q(x_2\mid x_0)\,q(x_3\mid x_0)\,q(x_4\mid x_0)\cdots q(x_T\mid x_0)}{q(x_1\mid x_0)\,q(x_2\mid x_0)\,q(x_3\mid x_0)\cdots q(x_{T-1}\mid x_0)} \\
&=\log\frac{q(x_T\mid x_0)}{q(x_1\mid x_0)}
\end{aligned}$$
Substituting this back into the bound:
$$-\log p_\theta(x_0)\leq\mathbb{E}\left[-\log p(x_T)+\sum_{t=2}^T\log\frac{q(x_{t-1}\mid x_t,x_0)}{p_\theta(x_{t-1}\mid x_t)}+\log\frac{q(x_T\mid x_0)}{q(x_1\mid x_0)}+\log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right]$$
Now look at the last two terms:
$$\begin{aligned}
&\log\frac{q(x_T\mid x_0)}{q(x_1\mid x_0)}+\log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)} \\
&=\log q(x_T\mid x_0)-\log q(x_1\mid x_0)+\log q(x_1\mid x_0)-\log p_\theta(x_0\mid x_1) \\
&=\log q(x_T\mid x_0)-\log p_\theta(x_0\mid x_1)
\end{aligned}$$
Substituting back once more:
$$\begin{aligned}
-\log p_\theta(x_0)&\leq\mathbb{E}\left[-\log p(x_T)+\sum_{t=2}^T\log\frac{q(x_{t-1}\mid x_t,x_0)}{p_\theta(x_{t-1}\mid x_t)}+\log q(x_T\mid x_0)-\log p_\theta(x_0\mid x_1)\right] \\
&=\mathbb{E}\left[\log\frac{q(x_T\mid x_0)}{p(x_T)}+\sum_{t=2}^T\log\frac{q(x_{t-1}\mid x_t,x_0)}{p_\theta(x_{t-1}\mid x_t)}-\log p_\theta(x_0\mid x_1)\right] \\
&=D_{KL}(q(x_T\mid x_0)\,\|\,p(x_T))+\sum_{t=2}^T D_{KL}(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t))-\mathbb{E}\left[\log p_\theta(x_0\mid x_1)\right]
\end{aligned}$$
Finally, we can write the bound as the variational lower-bound loss:
$$L_{VLB}=D_{KL}(q(x_T\mid x_0)\,\|\,p(x_T))+\sum_{t=2}^T D_{KL}(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t))-\mathbb{E}\left[\log p_\theta(x_0\mid x_1)\right]$$
The first term $D_{KL}(q(x_T\mid x_0)\,\|\,p(x_T))$ does not depend on the learnable parameters $\theta$, so it can be dropped from the optimization.
The optimization objective therefore consists of two parts:

- Minimize $D_{KL}(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t))$: make each learned denoising step match the true posterior of the corresponding noising step as closely as possible.
- Minimize $-\log p_\theta(x_0\mid x_1)$: given $x_1$, make the probability of reconstructing the original image $x_0$ as high as possible.
Minimizing $D_{KL}(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t))$
For one-dimensional Gaussians $P\sim\mathcal{N}(\mu_1,\sigma_1^2)$ and $Q\sim\mathcal{N}(\mu_2,\sigma_2^2)$, the KL divergence between $P$ and $Q$ has the closed form:
$$D_{KL}(P\,\|\,Q)=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac{1}{2}$$
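This closed form is easy to sanity-check numerically. A small sketch that compares it against a direct trapezoidal integration of $\int p(x)\log\frac{p(x)}{q(x)}\,dx$; the grid limits and test parameters are arbitrary choices of mine:

```python
import numpy as np

def kl_gauss(mu1, sig1, mu2, sig2):
    """Closed-form KL( N(mu1, sig1^2) || N(mu2, sig2^2) )."""
    return np.log(sig2 / sig1) + (sig1**2 + (mu1 - mu2)**2) / (2 * sig2**2) - 0.5

def kl_numeric(mu1, sig1, mu2, sig2):
    """Trapezoidal approximation of the defining integral of the KL divergence."""
    x = np.linspace(-15.0, 15.0, 200001)
    p = np.exp(-((x - mu1) ** 2) / (2 * sig1**2)) / (sig1 * np.sqrt(2 * np.pi))
    q = np.exp(-((x - mu2) ** 2) / (2 * sig2**2)) / (sig2 * np.sqrt(2 * np.pi))
    f = p * np.log(p / q)
    return float(np.sum((f[1:] + f[:-1]) / 2) * (x[1] - x[0]))

closed = kl_gauss(0.5, 1.2, -0.3, 0.8)
numeric = kl_numeric(0.5, 1.2, -0.3, 0.8)
```

The two values agree to several decimal places, and the divergence is exactly zero when the two Gaussians coincide.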
For $D_{KL}(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t))$, we know that
$$q(x_{t-1}\mid x_t,x_0)=\mathcal{N}(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta}_t\mathbf{I}),\qquad p_\theta(x_{t-1}\mid x_t)=\mathcal{N}(x_{t-1};\mu_\theta(x_t,t),\Sigma_\theta(x_t,t)),$$
where DDPM fixes $\Sigma_\theta(x_t,t)=\tilde{\beta}_t\mathbf{I}$.
Recall that $\tilde{\beta}_t$ comes from the fixed noise schedule; in the original DDPM paper the $\beta_t$ increase linearly with $t$.
$$\Sigma_\theta(x_t,t)=\tilde{\beta}_t\mathbf{I}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t\,\mathbf{I}$$
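As a concrete illustration, $\tilde{\beta}_t$ can be tabulated directly from the linear schedule; a sketch assuming the DDPM endpoints $\beta_1=10^{-4}$, $\beta_T=0.02$ and the convention $\bar{\alpha}_0=1$:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                  # beta_1 .. beta_T
alpha_bars = np.cumprod(1.0 - betas)                # \bar{alpha}_t
# shift so that index t-1 holds \bar{alpha}_{t-1}, with \bar{alpha}_0 = 1
alpha_bars_prev = np.concatenate(([1.0], alpha_bars[:-1]))

# \tilde{beta}_t = (1 - \bar{alpha}_{t-1}) / (1 - \bar{alpha}_t) * beta_t
beta_tilde = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas
```

Note that $\tilde{\beta}_1=0$ under this convention (implementations typically clamp it away from zero), $\tilde{\beta}_t\leq\beta_t$ everywhere, and $\tilde{\beta}_t\to\beta_t$ for large $t$.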
Plugging these means and variances into the Gaussian KL formula gives
$$D_{KL}(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t))=\log\frac{\tilde{\beta}_t}{\tilde{\beta}_t}+\frac{\tilde{\beta}_t^2+\|\mu_\theta(x_t,t)-\tilde{\mu}(x_t,x_0)\|^2}{2\tilde{\beta}_t^2}-\frac{1}{2}=\frac{1}{2\tilde{\beta}_t^2}\|\mu_\theta(x_t,t)-\tilde{\mu}(x_t,x_0)\|^2$$
From "Complete Mathematical Derivation of the DDPM Diffusion Model (Part 1)" we have:

- the true posterior mean of the reverse step, used at sampling time:
$$\tilde{\mu}(x_t,x_0)=\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_t\right)$$
- the model's reverse mean:
$$\mu_\theta(x_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t,t)\right)$$
Hence
$$D_{KL}(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t))=\frac{1}{2\tilde{\beta}_t^2}\|\mu_\theta(x_t,t)-\tilde{\mu}(x_t,x_0)\|^2=\frac{(1-\alpha_t)^2}{2\alpha_t(1-\bar{\alpha}_t)\tilde{\beta}_t^2}\|\epsilon_t-\epsilon_\theta(x_t,t)\|^2$$
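The second equality above is a purely algebraic identity between the two mean expressions, so it can be verified numerically. A quick sketch in which $\alpha_t$, $\bar{\alpha}_t$, and the noise vectors are arbitrary values of my own choosing (the common $1/(2\tilde{\beta}_t^2)$ factor is omitted since it appears on both sides):

```python
import numpy as np

rng = np.random.default_rng(1)

alpha_t, alpha_bar_t = 0.98, 0.5        # arbitrary illustrative values
xt = rng.standard_normal(16)
eps_true = rng.standard_normal(16)      # epsilon_t
eps_pred = rng.standard_normal(16)      # epsilon_theta(x_t, t)

def reverse_mean(xt, eps):
    """mu = (x_t - (1 - alpha_t) / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)."""
    return (xt - (1 - alpha_t) / np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_t)

lhs = np.sum((reverse_mean(xt, eps_pred) - reverse_mean(xt, eps_true)) ** 2)
coef = (1 - alpha_t) ** 2 / (alpha_t * (1 - alpha_bar_t))
rhs = coef * np.sum((eps_true - eps_pred) ** 2)
```

The squared distance between the two means equals the scaled squared distance between the two noises, up to floating-point rounding.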
The DDPM paper reports that simply dropping the weighting coefficient in front works better in practice, so the first optimization target $D_{KL}(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t))$ reduces to $\|\epsilon_t-\epsilon_\theta(x_t,t)\|^2$.
Minimizing $-\log p_\theta(x_0\mid x_1)$
For $-\log p_\theta(x_0\mid x_1)$, we know $p_\theta(x_0\mid x_1)=\mathcal{N}(x_0;\mu_\theta(x_1,1),\tilde{\beta}_1\mathbf{I})$, which means:
$$p_\theta(x_0\mid x_1)=\frac{1}{\sqrt{2\pi\tilde{\beta}_1^2}}\exp\left[-\frac{\|x_0-\mu_\theta(x_1,1)\|^2}{2\tilde{\beta}_1^2}\right]$$
Hence
$$-\log p_\theta(x_0\mid x_1)=-\log\frac{1}{\sqrt{2\pi\tilde{\beta}_1^2}}+\frac{\|x_0-\mu_\theta(x_1,1)\|^2}{2\tilde{\beta}_1^2}$$
Observe that the first term $-\log\frac{1}{\sqrt{2\pi\tilde{\beta}_1^2}}$ is a constant independent of $\theta$, while the second term $\frac{\|x_0-\mu_\theta(x_1,1)\|^2}{2\tilde{\beta}_1^2}$ does depend on $\theta$. It therefore suffices to optimize the second term (using $\alpha_1=\bar{\alpha}_1=1-\beta_1$):
$$\begin{aligned}
\frac{\|x_0-\mu_\theta(x_1,1)\|^2}{2\tilde{\beta}_1^2}&=\frac{1}{2\tilde{\beta}_1^2}\left\|x_0-\frac{1}{\sqrt{\alpha_1}}\left(x_1-\frac{1-\alpha_1}{\sqrt{1-\bar{\alpha}_1}}\epsilon_\theta(x_1,1)\right)\right\|^2 \\
&=\frac{1}{2\tilde{\beta}_1^2}\left\|x_0-\frac{1}{\sqrt{\alpha_1}}\left(\sqrt{\bar{\alpha}_1}x_0+\sqrt{1-\bar{\alpha}_1}\epsilon_1-\frac{1-\alpha_1}{\sqrt{1-\bar{\alpha}_1}}\epsilon_\theta(x_1,1)\right)\right\|^2 \\
&=\frac{1}{2\tilde{\beta}_1^2\alpha_1}\left\|\sqrt{1-\bar{\alpha}_1}\epsilon_1-\frac{1-\alpha_1}{\sqrt{1-\bar{\alpha}_1}}\epsilon_\theta(x_1,1)\right\|^2 \\
&=\frac{1-\bar{\alpha}_1}{2\tilde{\beta}_1^2\alpha_1}\left\|\epsilon_1-\epsilon_\theta(x_1,1)\right\|^2
\end{aligned}$$
Dropping the coefficient as before, the second optimization target $-\log p_\theta(x_0\mid x_1)$ likewise becomes $\|\epsilon_1-\epsilon_\theta(x_1,1)\|^2$.
Combining the derivations above, we arrive at two equivalent optimization targets:

- minimize $\|\epsilon_t-\epsilon_\theta(x_t,t)\|^2$;
- minimize $\|\epsilon_1-\epsilon_\theta(x_1,1)\|^2$.

Observe that target (2) is simply target (1) evaluated at $t=1$. We can therefore unify the final optimization objective over all timesteps $t$:
$$\min_{\theta}\|\epsilon_t-\epsilon_\theta(x_t,t)\|^2$$
The final loss function of the diffusion model can thus be written as
$$\mathcal{L}(\theta)=\|\epsilon_t-\epsilon_\theta(x_t,t)\|^2$$
that is, the network learns to approximate the true reverse diffusion process by minimizing the mean squared error (MSE) between its predicted noise and the true noise.
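Putting everything together, one evaluation of this loss can be sketched end to end. The toy noise predictor below is a fixed random linear map standing in for the time-conditioned U-Net $\epsilon_\theta$ used in the DDPM paper, and the data shapes are my own placeholder choices:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# placeholder for eps_theta(x_t, t): a fixed random linear map over [x_t, t/T]
W = rng.standard_normal((17, 16)) * 0.1

def eps_theta(xt, t):
    feat = np.concatenate([xt, (t / T)[:, None]], axis=1)   # crude timestep conditioning
    return feat @ W

def ddpm_loss(x0):
    b = x0.shape[0]
    t = rng.integers(1, T + 1, size=b)                  # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)                 # the true noise epsilon_t
    abar = alpha_bars[t - 1][:, None]
    xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps   # sample x_t from q(x_t | x_0)
    return np.mean((eps - eps_theta(xt, t)) ** 2)       # ||eps - eps_theta(x_t, t)||^2

loss = ddpm_loss(rng.standard_normal((32, 16)))
```

In a real trainer this scalar would be backpropagated through $\epsilon_\theta$; the point here is only that each step needs nothing beyond $x_0$, a random timestep $t$, and a fresh noise draw.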
References
[1] 扩散模型(Diffusion Model)详解:直观理解、数学原理、PyTorch 实现 (Diffusion Model explained: intuition, mathematics, and a PyTorch implementation; Chinese blog post)
[2] diffusion model 原理讲解 公式推导 (Diffusion model: principles and derivations; Chinese blog post)
[3] Ho J., Jain A., Abbeel P. Denoising Diffusion Probabilistic Models. NeurIPS 2020.