
Complete Mathematical Derivation of the DDPM Diffusion Model (Part 2)

Training Objective (Loss Function)

We define $p_\theta(x_0)$ as the probability density that a model with parameters $\theta$ assigns to a data point $x_0$, where $x_0$ denotes a real image. In other words, $p_\theta(x_0)$ measures how likely the sample $x_0$ is under the model.

Our training objective is

$$\max_{\theta}\; p_\theta(x_0)$$

which is usually written in negative-log form:

$$\min_{\theta}\; -\log p_\theta(x_0)$$

There is a key problem, however: $p_\theta(x_0)$ cannot be computed efficiently, because evaluating it requires marginalizing over all the latent variables $x_{1:T}$ that precede $x_0$ in the reverse process.

As a workaround, we introduce the variational bound of this objective (since a KL divergence is always non-negative, adding one gives an upper bound on the negative log-likelihood):

$$-\log p_\theta(x_0) \leq -\log p_\theta(x_0) + D_{KL}\big(q(x_{1:T}\mid x_0)\,\|\,p_\theta(x_{1:T}\mid x_0)\big)$$

The KL term here measures how close the true forward diffusion process is to the reverse denoising process learned by the model. If the two distributions are identical, then $D_{KL}=0$, meaning the model has learned a "perfect reversal of the diffusion process".

Expanding the KL divergence:

$$\begin{aligned} D_{KL} &= \mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{1:T}\mid x_0)}\right] \\ &= \mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_0, x_{1:T})/p_\theta(x_0)}\right] \\ &= \mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})/p_\theta(x_0)}\right] \\ &= \mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})} + \log p_\theta(x_0)\right] \\ &= \mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right] + \log p_\theta(x_0) \end{aligned}$$

Substituting this back into the bound $-\log p_\theta(x_0) \leq -\log p_\theta(x_0) + D_{KL}\big(q(x_{1:T}\mid x_0)\,\|\,p_\theta(x_{1:T}\mid x_0)\big)$, we can transform it as follows:

$$\begin{aligned} -\log p_\theta(x_0) &\leq -\log p_\theta(x_0) + D_{KL}\big(q(x_{1:T}\mid x_0)\,\|\,p_\theta(x_{1:T}\mid x_0)\big) \\ -\log p_\theta(x_0) &\leq -\log p_\theta(x_0) + \mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right] + \log p_\theta(x_0) \\ -\log p_\theta(x_0) &\leq \mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right] \end{aligned}$$

  • Forward process: $x_0 \to x_1 \to \cdots \to x_T$
    $$q(x_{1:T}\mid x_0) = q(x_1\mid x_0)\,q(x_2\mid x_1)\,q(x_3\mid x_2)\cdots q(x_T\mid x_{T-1}) = \prod_{t=1}^T q(x_t\mid x_{t-1})$$

  • Reverse process: $x_T \to x_{T-1} \to \cdots \to x_0$
    $$p_\theta(x_{0:T}) = p(x_T)\,p_\theta(x_{T-1}\mid x_T)\,p_\theta(x_{T-2}\mid x_{T-1})\cdots p_\theta(x_0\mid x_1) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1}\mid x_t)$$
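As a concrete reference point, the forward process above admits the standard DDPM closed form $q(x_t\mid x_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_t}\,x_0,\,(1-\bar{\alpha}_t)\mathbf{I}\big)$, so $x_t$ can be sampled in one shot. A minimal sketch (the toy dimension and the schedule endpoints are the DDPM paper's linear schedule; everything else is illustrative):

```python
import numpy as np

# Linear beta schedule from the DDPM paper: beta_1 = 1e-4 ... beta_T = 0.02.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)              # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot (t is 0-indexed here)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)                 # a toy "image" of 16 pixels
xT = q_sample(x0, T - 1, rng)                # near t = T this is almost pure noise
```

Since $\bar{\alpha}_T \approx 0$ under this schedule, $x_T$ is essentially a standard Gaussian, which is why the prior $p(x_T)$ can be chosen as $\mathcal{N}(0, \mathbf{I})$.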

With these factorizations, the bound can be simplified further:

$$\begin{aligned} -\log p_\theta(x_0) &\leq \mathbb{E}\left[\log\frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right] \\ &\leq \mathbb{E}\left[-\log p(x_T) + \log\frac{\prod_{t=1}^T q(x_t\mid x_{t-1})}{\prod_{t=1}^T p_\theta(x_{t-1}\mid x_t)}\right] \\ &\leq \mathbb{E}\left[-\log p(x_T) + \sum_{t=1}^T \log\frac{q(x_t\mid x_{t-1})}{p_\theta(x_{t-1}\mid x_t)}\right] \\ &\leq \mathbb{E}\left[-\log p(x_T) + \sum_{t=2}^T \log\frac{q(x_t\mid x_{t-1})}{p_\theta(x_{t-1}\mid x_t)} + \log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \end{aligned}$$

Since what the model has to fit is the true "reverse denoising distribution", we use Bayes' rule to convert $q(x_t\mid x_{t-1})$ and $q(x_1\mid x_0)$ into the forms $q(x_{t-1}\mid x_t)$ and $q(x_0\mid x_1)$. We also want to condition $q(x_t\mid x_{t-1})$ on the known $x_0$ so that each factor becomes tractable and the model can learn more effectively. Using the Markov property, $q(x_t\mid x_{t-1}) = q(x_t\mid x_{t-1}, x_0)$, and applying Bayes' rule conditioned on $x_0$:

$$\begin{aligned} q(x_t\mid x_{t-1}) &= \frac{q(x_{t-1}\mid x_t)\,q(x_t)}{q(x_{t-1})} \\ &= \frac{q(x_{t-1}\mid x_t, x_0)\,q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)} \end{aligned}$$

This lets us rewrite the bound as:

$$\begin{aligned} -\log p_\theta(x_0) &\leq \mathbb{E}\left[-\log p(x_T) + \sum_{t=2}^T \log\frac{q(x_t\mid x_{t-1})}{p_\theta(x_{t-1}\mid x_t)} + \log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\ &\leq \mathbb{E}\left[-\log p(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}\mid x_t, x_0)\,q(x_t\mid x_0)}{p_\theta(x_{t-1}\mid x_t)\,q(x_{t-1}\mid x_0)} + \log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\ &\leq \mathbb{E}\left[-\log p(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} + \sum_{t=2}^T \log\frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)} + \log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \end{aligned}$$

Now take the term $\sum_{t=2}^T \log\frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)}$ and simplify it; the product telescopes:

$$\begin{aligned} \sum_{t=2}^T \log\frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)} &= \log\prod_{t=2}^T \frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)} \\ &= \log\frac{q(x_2\mid x_0)\,q(x_3\mid x_0)\,q(x_4\mid x_0)\cdots q(x_T\mid x_0)}{q(x_1\mid x_0)\,q(x_2\mid x_0)\,q(x_3\mid x_0)\cdots q(x_{T-1}\mid x_0)} \\ &= \log\frac{q(x_T\mid x_0)}{q(x_1\mid x_0)} \end{aligned}$$
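The telescoping step can be sanity-checked numerically; the values below are arbitrary positive stand-ins for the densities $q(x_t\mid x_0)$:

```python
import math

# Arbitrary positive stand-ins for q(x_t | x_0), t = 1..4 (index 0 is unused).
q = [None, 0.9, 0.7, 0.5, 0.3]
T = 4

# Left side: the sum of log-ratios; right side: the telescoped log-ratio.
lhs = sum(math.log(q[t] / q[t - 1]) for t in range(2, T + 1))
rhs = math.log(q[T] / q[1])
print(abs(lhs - rhs) < 1e-12)  # True
```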

Substituting back:

$$-\log p_\theta(x_0) \leq \mathbb{E}\left[-\log p(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} + \log\frac{q(x_T\mid x_0)}{q(x_1\mid x_0)} + \log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right]$$

Next, look at the last two terms:

$$\begin{aligned} &\log\frac{q(x_T\mid x_0)}{q(x_1\mid x_0)} + \log\frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)} \\ &= \log q(x_T\mid x_0) - \log q(x_1\mid x_0) + \log q(x_1\mid x_0) - \log p_\theta(x_0\mid x_1) \\ &= \log q(x_T\mid x_0) - \log p_\theta(x_0\mid x_1) \end{aligned}$$

Substituting this back in:

$$\begin{aligned} -\log p_\theta(x_0) &\leq \mathbb{E}\left[-\log p(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} + \log q(x_T\mid x_0) - \log p_\theta(x_0\mid x_1)\right] \\ &\leq \mathbb{E}\left[\log\frac{q(x_T\mid x_0)}{p(x_T)} + \sum_{t=2}^T \log\frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} - \log p_\theta(x_0\mid x_1)\right] \\ &\leq D_{KL}\big(q(x_T\mid x_0)\,\|\,p(x_T)\big) + \sum_{t=2}^T D_{KL}\big(q(x_{t-1}\mid x_t, x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big) - \log p_\theta(x_0\mid x_1) \end{aligned}$$

Finally, we can write the bound in this form:

$$L_{VLB} = D_{KL}\big(q(x_T\mid x_0)\,\|\,p(x_T)\big) + \sum_{t=2}^T D_{KL}\big(q(x_{t-1}\mid x_t, x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big) - \log p_\theta(x_0\mid x_1)$$

The first term, $D_{KL}\big(q(x_T\mid x_0)\,\|\,p(x_T)\big)$, does not depend on the learnable parameters $\theta$ (both $q$ and the prior $p(x_T)$ are fixed), so it can be dropped from the optimization.

The optimization objective therefore consists of two parts:

  • Minimize $D_{KL}\big(q(x_{t-1}\mid x_t, x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big)$: make each learned denoising step as close as possible to the true reversal of the corresponding noising step.
  • Minimize $-\log p_\theta(x_0\mid x_1)$: given $x_1$, maximize the probability of reconstructing the original image $x_0$.

Minimizing $D_{KL}\big(q(x_{t-1}\mid x_t, x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big)$

For two univariate Gaussians $P \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q \sim \mathcal{N}(\mu_2, \sigma_2^2)$, the KL divergence between $P$ and $Q$ is:

$$D_{KL}(P\,\|\,Q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$
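This closed form can be checked against a direct numerical integration of $\int p(x)\log\frac{p(x)}{q(x)}\,dx$; the parameter values below are arbitrary:

```python
import numpy as np

# Arbitrary means and standard deviations for P and Q.
mu1, s1 = 0.5, 1.2
mu2, s2 = -0.3, 0.8

# Densities of P and Q on a fine grid (tails beyond +-10 are negligible here).
x = np.linspace(-10.0, 10.0, 200001)
p = np.exp(-(x - mu1) ** 2 / (2 * s1 ** 2)) / (s1 * np.sqrt(2 * np.pi))
q = np.exp(-(x - mu2) ** 2 / (2 * s2 ** 2)) / (s2 * np.sqrt(2 * np.pi))

# Riemann-sum approximation of the KL integral vs. the closed form.
kl_numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
print(abs(kl_numeric - kl_closed) < 1e-6)  # True
```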

For $D_{KL}\big(q(x_{t-1}\mid x_t, x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big)$, we know that $q(x_{t-1}\mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}(x_t, x_0),\ \tilde{\beta}_t\mathbf{I}\big)$ and $p_\theta(x_{t-1}\mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$,

where DDPM fixes $\Sigma_\theta(x_t, t) = \tilde{\beta}_t\mathbf{I}$.

Recall: $\tilde{\beta}_t$ comes from the fixed noise schedule; in the original DDPM paper, $\beta_t$ increases linearly. Explicitly:

$$\Sigma_\theta(x_t, t) = \tilde{\beta}_t\mathbf{I} = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t\,\mathbf{I}$$
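A minimal sketch of these schedule quantities, using the linear $\beta_t$ range and $T$ from the DDPM paper (everything else follows the formulas above):

```python
import numpy as np

# Linear beta schedule as in the DDPM paper: beta_1 = 1e-4 ... beta_T = 0.02.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# beta_tilde_t = (1 - abar_{t-1}) / (1 - abar_t) * beta_t, for t = 2..T
# (0-based arrays, so array position i corresponds to timestep t = i + 1).
beta_tildes = (1.0 - alpha_bars[:-1]) / (1.0 - alpha_bars[1:]) * betas[1:]

# Since abar is decreasing, the ratio is < 1: beta_tilde_t < beta_t everywhere,
# and the two converge as t grows.
print(bool(np.all(beta_tildes < betas[1:])))  # True
```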

Plugging these into the Gaussian KL formula (with variances $\sigma_1^2 = \sigma_2^2 = \tilde{\beta}_t$, so the log term vanishes):

$$D_{KL}\big(q(x_{t-1}\mid x_t, x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big) = \log\frac{\sqrt{\tilde{\beta}_t}}{\sqrt{\tilde{\beta}_t}} + \frac{\tilde{\beta}_t + \|\mu_\theta(x_t, t) - \tilde{\mu}(x_t, x_0)\|^2}{2\tilde{\beta}_t} - \frac{1}{2} = \frac{1}{2\tilde{\beta}_t}\,\|\mu_\theta(x_t, t) - \tilde{\mu}(x_t, x_0)\|^2$$

From Part 1 of this derivation (扩散模型DDPM数学推导过程完整版(上)) we have:

  • The theoretical reverse mean used at sampling time:
    $$\tilde{\mu}(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_t\right)$$

  • The model's reverse mean used at sampling time:
    $$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

$$D_{KL}\big(q(x_{t-1}\mid x_t, x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big) = \frac{1}{2\tilde{\beta}_t}\,\|\mu_\theta(x_t, t) - \tilde{\mu}(x_t, x_0)\|^2 = \frac{(1 - \alpha_t)^2}{2\alpha_t(1 - \bar{\alpha}_t)\tilde{\beta}_t}\,\|\epsilon_t - \epsilon_\theta(x_t, t)\|^2$$
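This identity between the mean-difference form and the noise-residual form can be verified numerically; the step $t$, the toy vectors, and the stand-in for the network output $\epsilon_\theta$ below are all arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule; array position i corresponds to timestep t = i + 1.
T = 10
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

i = 5                                # arbitrary step with i >= 1, i.e. t >= 2
x0 = rng.standard_normal(4)          # toy "image"
eps_true = rng.standard_normal(4)    # the actual noise in x_t
eps_pred = rng.standard_normal(4)    # arbitrary stand-in for eps_theta(x_t, t)

xt = np.sqrt(abar[i]) * x0 + np.sqrt(1.0 - abar[i]) * eps_true

def reverse_mean(eps):
    """mu(x_t, eps) = (x_t - (1 - alpha_t)/sqrt(1 - abar_t) * eps) / sqrt(alpha_t)."""
    return (xt - (1.0 - alphas[i]) / np.sqrt(1.0 - abar[i]) * eps) / np.sqrt(alphas[i])

beta_tilde = (1.0 - abar[i - 1]) / (1.0 - abar[i]) * betas[i]

# KL written via the means vs. via the noise residual.
kl_mu  = np.sum((reverse_mean(eps_pred) - reverse_mean(eps_true)) ** 2) / (2.0 * beta_tilde)
kl_eps = (1.0 - alphas[i]) ** 2 / (2.0 * alphas[i] * (1.0 - abar[i]) * beta_tilde) \
         * np.sum((eps_true - eps_pred) ** 2)
print(bool(np.isclose(kl_mu, kl_eps)))  # True
```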

The DDPM paper found that dropping the weighting coefficient in front entirely works better in practice, so the first optimization target $D_{KL}\big(q(x_{t-1}\mid x_t, x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big)$ reduces to $\|\epsilon_t - \epsilon_\theta(x_t, t)\|^2$.


Minimizing $-\log p_\theta(x_0\mid x_1)$

For $-\log p_\theta(x_0\mid x_1)$, we know $p_\theta(x_0\mid x_1) = \mathcal{N}\big(x_0;\ \mu_\theta(x_1, 1),\ \tilde{\beta}_1\mathbf{I}\big)$, which means:

$$p_\theta(x_0\mid x_1) = \frac{1}{\sqrt{2\pi\tilde{\beta}_1}}\exp\left[-\frac{\|x_0 - \mu_\theta(x_1, 1)\|^2}{2\tilde{\beta}_1}\right]$$

$$-\log p_\theta(x_0\mid x_1) = -\log\frac{1}{\sqrt{2\pi\tilde{\beta}_1}} + \frac{\|x_0 - \mu_\theta(x_1, 1)\|^2}{2\tilde{\beta}_1}$$

The first term, $-\log\frac{1}{\sqrt{2\pi\tilde{\beta}_1}}$, is a constant independent of $\theta$; only the second term, $\frac{\|x_0 - \mu_\theta(x_1, 1)\|^2}{2\tilde{\beta}_1}$, depends on $\theta$. So we only need to optimize the second term (note that $\alpha_1 = \bar{\alpha}_1 = 1 - \beta_1$):

$$\begin{aligned} \frac{\|x_0 - \mu_\theta(x_1, 1)\|^2}{2\tilde{\beta}_1} &= \frac{1}{2\tilde{\beta}_1}\left\|x_0 - \frac{1}{\sqrt{\alpha_1}}\left(x_1 - \frac{1 - \alpha_1}{\sqrt{1 - \bar{\alpha}_1}}\,\epsilon_\theta(x_1, 1)\right)\right\|^2 \\ &= \frac{1}{2\tilde{\beta}_1}\left\|x_0 - \frac{1}{\sqrt{\alpha_1}}\left(\sqrt{\bar{\alpha}_1}\,x_0 + \sqrt{1 - \bar{\alpha}_1}\,\epsilon_1 - \frac{1 - \alpha_1}{\sqrt{1 - \bar{\alpha}_1}}\,\epsilon_\theta(x_1, 1)\right)\right\|^2 \\ &= \frac{1}{2\tilde{\beta}_1\alpha_1}\left\|\sqrt{1 - \bar{\alpha}_1}\,\epsilon_1 - \frac{1 - \alpha_1}{\sqrt{1 - \bar{\alpha}_1}}\,\epsilon_\theta(x_1, 1)\right\|^2 \\ &= \frac{1 - \bar{\alpha}_1}{2\tilde{\beta}_1\alpha_1}\,\|\epsilon_1 - \epsilon_\theta(x_1, 1)\|^2 \end{aligned}$$

So the second optimization target, $-\log p_\theta(x_0\mid x_1)$, likewise reduces to $\|\epsilon_1 - \epsilon_\theta(x_1, 1)\|^2$.


Combining the derivations above, we obtain two optimization targets of the same form:

  • Minimize $\|\epsilon_t - \epsilon_\theta(x_t, t)\|^2$ (for $t = 2, \dots, T$)
  • Minimize $\|\epsilon_1 - \epsilon_\theta(x_1, 1)\|^2$

Notice that target (2) is simply target (1) with $t = 1$.
We can therefore unify the final optimization objective over all timesteps $t$:

$$\min_\theta\; \|\epsilon_t - \epsilon_\theta(x_t, t)\|^2$$

The final loss function of the diffusion model can thus be written as:

$$\mathcal{L}(\theta) = \|\epsilon_t - \epsilon_\theta(x_t, t)\|^2$$

That is, the network learns to approximate the true reverse diffusion process by minimizing the mean squared error (MSE) between the predicted noise and the true noise.
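Putting the pieces together, one training step looks like the sketch below. The tiny linear "network" `eps_model` is a hypothetical stand-in for the U-Net used in practice; the noise schedule follows the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise schedule (linear, as in the DDPM paper).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

D = 16                                    # toy data dimension
W = 0.01 * rng.standard_normal((D, D))    # hypothetical stand-in for a U-Net

def eps_model(xt, t):
    # A real model would also condition on t (e.g. via a timestep embedding);
    # this linear map is purely illustrative.
    return W @ xt

def ddpm_loss(x0):
    """One training step: sample t and eps, form x_t, regress the noise."""
    t = rng.integers(1, T + 1)                        # uniform t in {1, ..., T}
    eps = rng.standard_normal(x0.shape)               # true noise
    xt = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps
    return np.sum((eps - eps_model(xt, t)) ** 2)      # ||eps - eps_theta||^2

x0 = rng.standard_normal(D)
loss = ddpm_loss(x0)                                  # a non-negative scalar
```

In an actual implementation this loss would be averaged over a minibatch and backpropagated through the network parameters; the structure of the step (sample $t$, sample $\epsilon$, form $x_t$ in closed form, regress $\epsilon$) is exactly Algorithm 1 of the DDPM paper.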


References

[1] 扩散模型(Diffusion Model)详解:直观理解、数学原理、PyTorch 实现
[2] diffusion model 原理讲解 公式推导
[3] Ho J., Jain A., Abbeel P. Denoising Diffusion Probabilistic Models. NeurIPS 2020.

