Derivation of the DDPM Optimization Objective
DDPM (Denoising Diffusion Probabilistic Models) derives its training objective from the variational lower bound (VLB), also called the evidence lower bound (ELBO). The detailed derivation follows.
1. Problem Setup
- Goal: learn a model $p_\theta(\mathbf{x}_0)$ that approximates the true data distribution $q(\mathbf{x}_0)$.
- Forward process (diffusion): fix a variance schedule $\beta_1, \dots, \beta_T$ and define the Markov chain
$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \quad q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\right)$$
- Reverse process (generation): learn the parameterized Markov chain
$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \quad p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\right)$$
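As a concrete illustration (a minimal NumPy sketch, not from the original text; the helper name `forward_step` is an assumption), one step of the forward process samples $\mathbf{x}_t \sim \mathcal{N}(\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$ by reparameterization:

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # toy "data" vector
x_noisy = forward_step(x, 0.02, rng)
```

With $\beta_t = 0$ the step is the identity, which makes a convenient sanity check.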
2. Optimization Objective: Maximize the Log-Likelihood
The goal is to maximize $\log p_\theta(\mathbf{x}_0)$. Since this is intractable directly, we instead maximize its variational lower bound:
$$\log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \right] \triangleq \text{VLB}$$
3. Decomposing the Variational Lower Bound
Expanding the VLB:
$$\begin{aligned} \text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \right] \\ &= \mathbb{E}_{q} \left[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})} \right] \end{aligned}$$
Using the Markov property (and conditioning each forward transition on $\mathbf{x}_0$), this can be rewritten as
$$\text{VLB} = \mathbb{E}_{q} \left[ \log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T \mid \mathbf{x}_0)} + \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} + \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) \right]$$
which finally simplifies to
$$\boxed{\text{VLB} = \mathbb{E}_{q} \left[ \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) \right] - \sum_{t=2}^T \mathbb{E}_{q} \left[ D_\text{KL}\!\left( q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \right) \right] - D_\text{KL}\!\left( q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T) \right)}$$
4. Key Step: Simplifying the KL Terms
(a) Closed form of the posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$
By Bayes' rule:
$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I}\right)$$
where
$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,\mathbf{x}_t, \quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t$$
(with the shorthand $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$)
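These closed-form posterior parameters are easy to compute directly. A minimal NumPy sketch (the function name `posterior_params` and the linear schedule are illustrative assumptions, not from the text):

```python
import numpy as np

def posterior_params(x0, xt, t, betas):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for a schedule betas[0..T-1]."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t - 1] if t > 0 else 1.0   # convention: alpha_bar_0 = 1
    coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_x0 * x0 + coef_xt * xt
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]   # tilde beta_t
    return mean, var

betas = np.linspace(1e-4, 0.02, 100)
mean, var = posterior_params(np.zeros(3), np.ones(3), 50, betas)
```

Note $\tilde{\beta}_t < \beta_t$ always, since $1 - \bar{\alpha}_{t-1} < 1 - \bar{\alpha}_t$: conditioning on $\mathbf{x}_0$ shrinks the reverse-step uncertainty.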
(b) Parameterized mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$
Let $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$.
To match the posterior, choose
$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \tilde{\boldsymbol{\mu}}_t\!\left( \mathbf{x}_t, \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta}{\sqrt{\bar{\alpha}_t}} \right)$$
Substituting into the closed form gives
$$\boldsymbol{\mu}_\theta = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)$$
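This substitution can be checked numerically (a sketch with illustrative helper names and a linear schedule): when the network's prediction equals the true noise, the parameterized mean reproduces the posterior mean exactly.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 100)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def mu_tilde(x0, xt, t):
    """Posterior mean: a linear combination of x_0 and x_t."""
    ab_prev = alpha_bar[t - 1]
    return (np.sqrt(ab_prev) * betas[t] * x0
            + np.sqrt(alphas[t]) * (1.0 - ab_prev) * xt) / (1.0 - alpha_bar[t])

def mu_theta(xt, eps_pred, t):
    """Parameterized mean written in terms of the predicted noise."""
    return (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])

rng = np.random.default_rng(0)
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
t = 60
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
# With the true noise, the two expressions agree
assert np.allclose(mu_tilde(x0, xt, t), mu_theta(xt, eps, t))
```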
(c) Closed form of the KL divergence
The KL divergence between two Gaussians is
$$D_\text{KL}\!\left(\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \,\|\, \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)\right) = \frac{1}{2} \left[ \log \frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} - d + \operatorname{tr}(\boldsymbol{\Sigma}_2^{-1} \boldsymbol{\Sigma}_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^\top \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \right]$$
Assuming $\boldsymbol{\Sigma}_\theta = \sigma_t^2 \mathbf{I}$ (commonly $\sigma_t^2 = \beta_t$ or $\tilde{\beta}_t$), this reduces to
$$D_\text{KL} = \frac{1}{2\sigma_t^2} \left\| \tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta \right\|^2 + C$$
Substituting the expressions for $\boldsymbol{\mu}_\theta$ and $\tilde{\boldsymbol{\mu}}_t$:
$$\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta = -\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}} \left( \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)$$
where $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$. Hence
$$\boxed{D_\text{KL} \propto \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right\|^2 \right]}$$
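The isotropic-Gaussian KL formula above collapses to scalars (a sketch; `gaussian_kl_iso` is a hypothetical helper): for $\boldsymbol{\Sigma}_i = v_i \mathbf{I}$, the log-determinant, trace, and quadratic terms each become a single number.

```python
import numpy as np

def gaussian_kl_iso(mu1, mu2, v1, v2):
    """KL( N(mu1, v1*I) || N(mu2, v2*I) ) in d dimensions."""
    d = mu1.size
    return 0.5 * (d * np.log(v2 / v1)                   # log |S2| / |S1|
                  - d                                   # -d
                  + d * v1 / v2                         # tr(S2^{-1} S1)
                  + np.sum((mu2 - mu1) ** 2) / v2)      # Mahalanobis term

mu = np.array([0.5, -1.0, 2.0])
assert np.isclose(gaussian_kl_iso(mu, mu, 0.3, 0.3), 0.0)   # KL(p || p) = 0
assert gaussian_kl_iso(mu, mu + 1.0, 0.3, 0.3) > 0.0        # KL is nonnegative
```

With equal variances $v_1 = v_2 = \sigma_t^2$, only the quadratic term survives, recovering $\frac{1}{2\sigma_t^2}\|\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta\|^2$.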
5. Final Optimization Objective
Dropping the constant terms and per-step weights, the simplified DDPM objective is
$$\mathcal{L}_\text{simple}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right\|^2 \right]$$
where:
- $t \sim \text{Uniform}(1, T)$
- $\mathbf{x}_0 \sim q(\mathbf{x}_0)$
- $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
- $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$
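Putting the sampling steps together, one training iteration can be sketched as follows (a minimal NumPy version; the stand-in `eps_model` and the linear schedule are illustrative assumptions — a real implementation would use a neural network and autodiff):

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

def eps_model(xt, t):
    """Stand-in for the noise-prediction network eps_theta(x_t, t)."""
    return np.zeros_like(xt)

def simple_loss(x0):
    t = rng.integers(0, T)                        # t ~ Uniform over steps
    eps = rng.standard_normal(x0.shape)           # eps ~ N(0, I)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2) # || eps - eps_theta ||^2

loss = simple_loss(rng.standard_normal(16))
```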
Key Conclusion
DDPM trains a network $\boldsymbol{\epsilon}_\theta$ to predict the noise added to a sample and minimizes the mean-squared error of that prediction; this is how it learns to generate data. The objective is equivalent to matching the gradient of the data log-density (the score), a deep connection to score-based generative models.
Supplementary Material (Detailed Derivations)
Detailed Derivation: the Final Simplified Form of the Variational Lower Bound (VLB)
The following derivation starts from the initial VLB expression and simplifies step by step to the final form:
$$\boxed{\text{VLB} = \mathbb{E}_{q} \left[ \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) \right] - \sum_{t=2}^T \mathbb{E}_{q} \left[ D_\text{KL}\!\left( q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \right) \right] - D_\text{KL}\!\left( q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T) \right)}$$
Step 1: Initial VLB Expression
From variational inference, the evidence lower bound (ELBO) is
$$\log p_\theta(\mathbf{x}_0) \geq \underbrace{\mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \right]}_{\text{VLB}}$$
Substituting the joint factorizations
$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \quad q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$
gives
$$\text{VLB} = \mathbb{E}_{q} \left[ \log \frac{p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{\prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1})} \right]$$
Step 2: Expanding and Regrouping the Logarithm
$$\begin{aligned} \text{VLB} &= \mathbb{E}_{q} \left[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) - \sum_{t=1}^T \log q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \right] \\ &= \mathbb{E}_{q} \left[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})} \right] \end{aligned}$$
Step 3: Introducing the Key Variable $\mathbf{x}_0$
Rewrite $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ via Bayes' theorem (conditioning additionally on $\mathbf{x}_0$, which is harmless by the Markov property):
$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\, q(\mathbf{x}_t \mid \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}$$
Taking logarithms:
$$\log q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) + \log q(\mathbf{x}_t \mid \mathbf{x}_0) - \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)$$
Step 4: Substituting and Splitting the Sums
Substituting the expression above into the VLB:
$$\begin{aligned} \text{VLB} &= \mathbb{E}_{q} \Bigg[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \\ &\quad - \sum_{t=1}^T \Big( \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) + \log q(\mathbf{x}_t \mid \mathbf{x}_0) - \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_0) \Big) \Bigg] \end{aligned}$$
Regrouping the three sums:
$$\begin{aligned} \text{VLB} &= \mathbb{E}_{q} \Bigg[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) - \sum_{t=1}^T \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \\ &\quad - \sum_{t=1}^T \log q(\mathbf{x}_t \mid \mathbf{x}_0) + \sum_{t=1}^T \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_0) \Bigg] \end{aligned}$$
Step 5: The Telescoping Sum
Define the intermediate term
$$S = -\sum_{t=1}^T \log q(\mathbf{x}_t \mid \mathbf{x}_0) + \sum_{t=1}^T \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)$$
Writing out both sums:
$$\begin{aligned} S &= \Big[ -\log q(\mathbf{x}_1 \mid \mathbf{x}_0) - \log q(\mathbf{x}_2 \mid \mathbf{x}_0) - \cdots - \log q(\mathbf{x}_T \mid \mathbf{x}_0) \Big] \\ &\quad + \Big[ \log q(\mathbf{x}_0 \mid \mathbf{x}_0) + \log q(\mathbf{x}_1 \mid \mathbf{x}_0) + \cdots + \log q(\mathbf{x}_{T-1} \mid \mathbf{x}_0) \Big] \end{aligned}$$
The matching terms (e.g. $\log q(\mathbf{x}_1 \mid \mathbf{x}_0)$) telescope away, leaving
$$S = \log q(\mathbf{x}_0 \mid \mathbf{x}_0) - \log q(\mathbf{x}_T \mid \mathbf{x}_0)$$
Since $q(\mathbf{x}_0 \mid \mathbf{x}_0) = 1$ (a deterministic distribution), $\log q(\mathbf{x}_0 \mid \mathbf{x}_0) = 0$, so
$$S = -\log q(\mathbf{x}_T \mid \mathbf{x}_0)$$
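The cancellation can be verified mechanically with arbitrary numbers standing in for the log-probabilities (a sketch; `logq[k]` plays the role of $\log q(\mathbf{x}_k \mid \mathbf{x}_0)$, with `logq[0] = 0`):

```python
import numpy as np

T = 10
rng = np.random.default_rng(0)
logq = rng.standard_normal(T + 1)   # stand-ins for log q(x_k | x_0), k = 0..T
logq[0] = 0.0                       # log q(x_0 | x_0) = 0

# S = -sum_{t=1}^T log q(x_t | x_0) + sum_{t=1}^T log q(x_{t-1} | x_0)
S = -np.sum(logq[1:T + 1]) + np.sum(logq[0:T])
assert np.isclose(S, -logq[T])      # everything telescopes to -log q(x_T | x_0)
```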
Step 6: Separating the $t=1$ and $t \geq 2$ Terms
The remaining terms regroup as
$$\text{VLB} = \mathbb{E}_{q} \Bigg[ \log p(\mathbf{x}_T) - \log q(\mathbf{x}_T \mid \mathbf{x}_0) + \sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} \Bigg]$$
Separating out the $t=1$ term explicitly:
$$\sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} = \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) + \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} \quad \big(\text{since } q(\mathbf{x}_0 \mid \mathbf{x}_1, \mathbf{x}_0) = 1\big)$$
Step 7: Converting to KL Divergences
Combining all terms:
$$\text{VLB} = \mathbb{E}_{q} \Bigg[ \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) + \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} + \log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T \mid \mathbf{x}_0)} \Bigg]$$
By the definition of KL divergence,
$$\mathbb{E}_{q}\left[\log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T \mid \mathbf{x}_0)}\right] = -D_\text{KL}\!\left( q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T) \right)$$
and
$$\mathbb{E}_{q} \left[ \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} \right] = -\mathbb{E}_{q} \left[ D_\text{KL}\!\left( q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \right) \right]$$
Step 8: Final Simplified Form
Substituting back:
$$\boxed{\begin{aligned} \text{VLB} = {} & \mathbb{E}_{q(\mathbf{x}_1 \mid \mathbf{x}_0)} \Big[ \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) \Big] \\ & - \sum_{t=2}^T \mathbb{E}_{q(\mathbf{x}_t \mid \mathbf{x}_0)} \left[ D_\text{KL}\!\left( q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \right) \right] \\ & - D_\text{KL}\!\left( q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T) \right) \end{aligned}}$$
Key Notes
1. Simplifying the expectations:
   - The first term depends only on $\mathbf{x}_1$: $\mathbb{E}_{q(\mathbf{x}_1 \mid \mathbf{x}_0)}[\cdot]$
   - The second term depends on $\mathbf{x}_t$: $\mathbb{E}_{q(\mathbf{x}_t \mid \mathbf{x}_0)}[\cdot]$ (the KL term needs only $\mathbf{x}_t$ and $\mathbf{x}_0$)
   - The third term is an analytically computable scalar
2. Interpretation:
   - Reconstruction term: $\mathbb{E}[\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)]$ measures the ability to generate the data
   - Denoising matching terms: the KL divergences force the generative process to match the reverse posterior of the diffusion process
   - Prior matching term: ensures the final distribution of $\mathbf{x}_T$ is close to the standard Gaussian prior
3. Connection to noise prediction: by the closed forms derived earlier,
$$D_\text{KL}\!\left( q(\cdot) \,\|\, p_\theta(\cdot) \right) \propto \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right\|^2$$
so the final objective reduces to a mean-squared error on noise prediction.
Detailed Derivation: the Posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ and the Parameterized Mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$
1. Closed form of the posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$
By Bayes' rule and the Markov property of the diffusion process:
$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0)\, q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)} = \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)}$$
where:
- $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{\alpha_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$
- $q(\mathbf{x}_{t-1} \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0, (1 - \bar{\alpha}_{t-1}) \mathbf{I})$
- $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})$
with the definitions
$$\alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{i=1}^t \alpha_i$$
Deriving the mean $\tilde{\boldsymbol{\mu}}_t$:
Match the exponents of the Gaussians (dropping terms constant in $\mathbf{x}_{t-1}$):
$$-\frac{1}{2} \left[ \frac{\|\mathbf{x}_t - \sqrt{\alpha_t}\,\mathbf{x}_{t-1}\|^2}{\beta_t} + \frac{\|\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0\|^2}{1 - \bar{\alpha}_{t-1}} \right] + C$$
Collect the quadratic and linear coefficients in $\mathbf{x}_{t-1}$:
- Quadratic coefficient:
$$A = \frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}} = \frac{\alpha_t(1 - \bar{\alpha}_{t-1}) + \beta_t}{\beta_t (1 - \bar{\alpha}_{t-1})} = \frac{1 - \bar{\alpha}_t}{\beta_t (1 - \bar{\alpha}_{t-1})}$$
  (using $\alpha_t \bar{\alpha}_{t-1} = \bar{\alpha}_t$ and $\beta_t = 1 - \alpha_t$)
- Linear coefficient:
$$\mathbf{b} = \frac{\sqrt{\alpha_t}}{\beta_t}\,\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}\,\mathbf{x}_0$$
The Gaussian mean satisfies $\tilde{\boldsymbol{\mu}}_t = A^{-1}\mathbf{b}$, so
$$\tilde{\boldsymbol{\mu}}_t = \frac{\beta_t (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \left( \frac{\sqrt{\alpha_t}}{\beta_t}\,\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}\,\mathbf{x}_0 \right) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,\mathbf{x}_0$$
Deriving the variance $\tilde{\beta}_t$:
The covariance is $A^{-1}\mathbf{I}$:
$$\tilde{\beta}_t = \frac{\beta_t (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}$$
Final closed form:
$$\boxed{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left( \mathbf{x}_{t-1};\; \underbrace{\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,\mathbf{x}_t}_{\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)},\; \underbrace{\frac{\beta_t (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}}_{\tilde{\beta}_t}\, \mathbf{I} \right)}$$
2. Deriving the parameterized mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$
From the forward process:
$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
Solving for $\mathbf{x}_0$:
$$\mathbf{x}_0 = \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}}{\sqrt{\bar{\alpha}_t}}$$
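This inversion is a one-liner to check numerically (a sketch assuming a linear schedule; not from the original text):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 100)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
t = 40
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps        # forward
x0_rec = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])  # invert
assert np.allclose(x0_rec, x0)   # x_0 recovered exactly given the true noise
```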
Substituting into the posterior mean $\tilde{\boldsymbol{\mu}}_t$:
$$\tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t} \left( \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}}{\sqrt{\bar{\alpha}_t}} \right) + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,\mathbf{x}_t$$
Expanding:
$$\tilde{\boldsymbol{\mu}}_t = \frac{\beta_t \sqrt{\bar{\alpha}_{t-1}}}{\sqrt{\bar{\alpha}_t}(1 - \bar{\alpha}_t)}\,\mathbf{x}_t - \frac{\beta_t \sqrt{\bar{\alpha}_{t-1}}\sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}(1 - \bar{\alpha}_t)}\,\boldsymbol{\epsilon} + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,\mathbf{x}_t$$
Using $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$, i.e. $\sqrt{\bar{\alpha}_t} = \sqrt{\alpha_t}\sqrt{\bar{\alpha}_{t-1}}$, this simplifies:
- Coefficient of $\mathbf{x}_t$:
$$\frac{\beta_t}{\sqrt{\alpha_t}(1 - \bar{\alpha}_t)} + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} = \frac{1}{\sqrt{\alpha_t}} \quad \text{(full simplification in the remarks below)}$$
- Coefficient of $\boldsymbol{\epsilon}$:
$$-\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}}$$
Finally:
$$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\,\mathbf{x}_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon} \right)$$
Parameterization strategy:
A neural network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ predicts the noise $\boldsymbol{\epsilon}$:
$$\boxed{\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)}$$
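At sampling time this parameterized mean drives one reverse step $\mathbf{x}_{t-1} \sim p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$. A minimal sketch (the helper name `reverse_step`, the choice $\sigma_t^2 = \beta_t$, and the zero-prediction stand-in for $\boldsymbol{\epsilon}_\theta$ are illustrative assumptions, not from the text):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 100)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def reverse_step(xt, eps_pred, t, rng):
    """One ancestral sampling step with sigma_t^2 = beta_t; no noise at t = 0."""
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # pretend this is x_T ~ N(0, I)
for t in range(99, -1, -1):
    x = reverse_step(x, np.zeros_like(x), t, rng)   # zero stand-in for eps_theta
```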
Key Remarks
1. Simplifying the coefficient of $\mathbf{x}_t$:
$$\begin{aligned} &\frac{\beta_t}{\sqrt{\alpha_t}(1 - \bar{\alpha}_t)} + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \\ ={} & \frac{\beta_t + \alpha_t(1 - \bar{\alpha}_{t-1})}{\sqrt{\alpha_t}(1 - \bar{\alpha}_t)} \quad \text{(common denominator)} \\ ={} & \frac{(1 - \alpha_t) + \alpha_t - \bar{\alpha}_t}{\sqrt{\alpha_t}(1 - \bar{\alpha}_t)} \quad \big(\beta_t = 1 - \alpha_t,\; \alpha_t \bar{\alpha}_{t-1} = \bar{\alpha}_t\big) \\ ={} & \frac{1 - \bar{\alpha}_t}{\sqrt{\alpha_t}(1 - \bar{\alpha}_t)} = \frac{1}{\sqrt{\alpha_t}} \end{aligned}$$
2. Interpretation:
   - The posterior mean $\tilde{\boldsymbol{\mu}}_t$ is a linear combination of $\mathbf{x}_0$ and $\mathbf{x}_t$, depending explicitly on the initial data $\mathbf{x}_0$.
   - The parameterized mean $\boldsymbol{\mu}_\theta$ uses the noise-prediction network $\boldsymbol{\epsilon}_\theta$ to remove the explicit dependence on $\mathbf{x}_0$, yielding a computable reverse process.