
Derivation of the DDPM Optimization Objective


DDPM (Denoising Diffusion Probabilistic Models) derives its optimization objective from the variational lower bound (VLB), also known as the evidence lower bound (ELBO). The detailed derivation is as follows:


1. Problem Setup

  • Goal: learn a model $p_\theta(\mathbf{x}_0)$ that approximates the true data distribution $q(\mathbf{x}_0)$.
  • Forward process (diffusion)
    Fix a variance schedule $\beta_1, \dots, \beta_T$ and define the Markov chain:
    $$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \quad q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1},\, \beta_t \mathbf{I}\big)$$
  • Reverse process (generation)
    Learn a parameterized Markov chain:
    $$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \quad p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big)$$
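The forward chain above admits the closed-form marginal $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ (used throughout this article). A minimal NumPy sketch of sampling it; the linear $\beta$ schedule, $T$, and the toy data are illustrative choices, not prescribed by the text:

```python
import numpy as np

# Sketch of the forward (diffusion) process via its closed form; the linear
# beta schedule, T, and the toy x0 below are illustrative assumptions.
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # variance schedule beta_1..beta_T
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # bar(alpha)_t = prod_{i<=t} alpha_i

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) using
    x_t = sqrt(bar(alpha)_t) x_0 + sqrt(1 - bar(alpha)_t) eps."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)            # toy "data" vector
xT, _ = q_sample(x0, T - 1, rng)        # near t = T the signal is almost gone
print(alpha_bars[-1])                   # tiny, so x_T is close to N(0, I)
```

Because $\bar{\alpha}_T \approx 0$ under this schedule, $\mathbf{x}_T$ is nearly a standard Gaussian, which is what makes $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ a sensible prior for the reverse process.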

2. Optimization Objective: Maximizing the Log-Likelihood

The goal is to maximize $\log p_\theta(\mathbf{x}_0)$. Since this is intractable to compute directly, we instead maximize its variational lower bound:
$$\log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \right] \triangleq \text{VLB}$$


3. Decomposing the Variational Lower Bound

Expanding the VLB (note the prior $p(\mathbf{x}_T)$ carries no parameters):
$$\begin{aligned} \text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \right] \\ &= \mathbb{E}_{q} \left[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})} \right] \end{aligned}$$
Using the Markov property and Bayes' rule to rewrite $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ in terms of the posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ (the telescoping details are worked out in the appendix), this becomes:
$$\text{VLB} = \mathbb{E}_{q} \left[ \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) + \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} + \log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T \mid \mathbf{x}_0)} \right]$$
which simplifies to the final form:
$$\boxed{\text{VLB} = \mathbb{E}_{q} \left[ \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) \right] - \sum_{t=2}^T \mathbb{E}_{q} \left[ D_\text{KL}\big( q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \big) \right] - D_\text{KL}\big( q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T) \big)}$$


4. Key Step: Simplifying the KL Terms

(a) Closed form of the posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$

By Bayes' rule:
$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\, \tilde{\beta}_t \mathbf{I}\big)$$
where:
$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t, \quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$$
(with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$)
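This closed form can be sanity-checked numerically in one dimension: form the product $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)$ on a grid of $\mathbf{x}_{t-1}$ values, normalize, and compare its mean and variance with $\tilde{\mu}_t$ and $\tilde{\beta}_t$. The schedule, step $t$, and scalar values below are arbitrary illustrative choices:

```python
import numpy as np

# Numerical check of the posterior closed form in 1-D: build the (unnormalized)
# posterior density on a grid and compare its moments with mu_tilde, beta_tilde.
T = 50
betas = np.linspace(1e-4, 0.2, T)
alphas, alpha_bars = 1 - betas, np.cumprod(1 - betas)

t = 30                                  # arbitrary step (0-indexed arrays)
x0, xt = 0.7, -0.3                      # arbitrary scalar x_0, x_t
b, a, ab, ab_prev = betas[t], alphas[t], alpha_bars[t], alpha_bars[t - 1]

grid = np.linspace(-5, 5, 20001)
# log of q(x_t | x_{t-1}) * q(x_{t-1} | x_0), dropping x_{t-1}-independent terms
logp = (-(xt - np.sqrt(a) * grid) ** 2 / (2 * b)
        - (grid - np.sqrt(ab_prev) * x0) ** 2 / (2 * (1 - ab_prev)))
p = np.exp(logp - logp.max())
p /= p.sum()
mean_num = (grid * p).sum()
var_num = ((grid - mean_num) ** 2 * p).sum()

mu_tilde = (np.sqrt(ab_prev) * b / (1 - ab) * x0
            + np.sqrt(a) * (1 - ab_prev) / (1 - ab) * xt)
beta_tilde = (1 - ab_prev) / (1 - ab) * b
print(mean_num - mu_tilde, var_num - beta_tilde)   # both should be ~0
```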

(b) Parameterizing the mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big)$$
To match the posterior, choose:
$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \tilde{\boldsymbol{\mu}}_t \left( \mathbf{x}_t, \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}_\theta}{\sqrt{\bar{\alpha}_t}} \right)$$
Substituting into the closed form gives:
$$\boldsymbol{\mu}_\theta = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)$$
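The substitution can be verified numerically: when $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$, the two expressions for the mean must agree. A small check with an illustrative schedule and random vectors:

```python
import numpy as np

# Check that mu_tilde(x_t, x_0) equals (x_t - beta_t/sqrt(1-ab_t) * eps)/sqrt(alpha_t)
# when x_t is generated with the true noise eps. Schedule/values are illustrative.
T = 100
betas = np.linspace(1e-4, 0.05, T)
alphas, alpha_bars = 1 - betas, np.cumprod(1 - betas)

rng = np.random.default_rng(1)
t = 60                                   # arbitrary step (0-indexed arrays)
x0, eps = rng.standard_normal(8), rng.standard_normal(8)
b, a, ab, ab_prev = betas[t], alphas[t], alpha_bars[t], alpha_bars[t - 1]
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps

mu_tilde = (np.sqrt(ab_prev) * b / (1 - ab) * x0
            + np.sqrt(a) * (1 - ab_prev) / (1 - ab) * xt)
mu_from_eps = (xt - b / np.sqrt(1 - ab) * eps) / np.sqrt(a)
print(np.max(np.abs(mu_tilde - mu_from_eps)))   # ~0 (floating-point noise)
```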

(c) Closed form of the KL divergence

The KL divergence between two Gaussians is:
$$D_\text{KL}\big(\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \,\|\, \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)\big) = \frac{1}{2} \left[ \log \frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} - d + \operatorname{tr}(\boldsymbol{\Sigma}_2^{-1} \boldsymbol{\Sigma}_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^\top \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \right]$$
Assuming $\boldsymbol{\Sigma}_\theta = \sigma_t^2 \mathbf{I}$ (commonly $\sigma_t^2 = \beta_t$ or $\tilde{\beta}_t$), this reduces to:
$$D_\text{KL} = \frac{1}{2\sigma_t^2} \| \tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta \|^2 + C$$
Substituting the expressions for $\boldsymbol{\mu}_\theta$ and $\tilde{\boldsymbol{\mu}}_t$:
$$\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta = \frac{\beta_t}{\sqrt{\alpha_t} \sqrt{1 - \bar{\alpha}_t}} \big( \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \boldsymbol{\epsilon} \big)$$
where $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}$ (the sign of the difference is irrelevant under the squared norm). Finally:
$$\boxed{D_\text{KL} \propto \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2 \right]}$$
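The Gaussian KL formula above can be cross-checked by Monte Carlo: estimate $\mathbb{E}_{x \sim \mathcal{N}_1}[\log \mathcal{N}_1(x) - \log \mathcal{N}_2(x)]$ by sampling and compare with the closed form specialized to isotropic covariances. The dimensions and scales are illustrative:

```python
import numpy as np

# Monte Carlo check of the Gaussian KL closed form for isotropic covariances:
# D_KL(N(mu1, s1^2 I) || N(mu2, s2^2 I)). All values below are illustrative.
rng = np.random.default_rng(2)
d = 4
mu1, mu2 = rng.standard_normal(d), rng.standard_normal(d)
s1, s2 = 0.8, 1.1

# closed form: 1/2 [log|S2|/|S1| - d + tr(S2^{-1} S1) + dmu^T S2^{-1} dmu]
kl_closed = 0.5 * (2 * d * np.log(s2 / s1) - d + d * s1**2 / s2**2
                   + np.sum((mu2 - mu1) ** 2) / s2**2)

# Monte Carlo estimate; the common -d/2 log(2*pi) terms cancel in the ratio
x = mu1 + s1 * rng.standard_normal((200_000, d))
log_n1 = -0.5 * np.sum((x - mu1) ** 2, axis=1) / s1**2 - d * np.log(s1)
log_n2 = -0.5 * np.sum((x - mu2) ** 2, axis=1) / s2**2 - d * np.log(s2)
kl_mc = np.mean(log_n1 - log_n2)
print(kl_closed, kl_mc)   # should agree to within Monte Carlo error
```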


5. Final Optimization Objective

Dropping constants and per-step weights, the simplified DDPM objective is:
$$\mathcal{L}_\text{simple}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2 \right]$$
where:

  • $t \sim \text{Uniform}(1, T)$
  • $\mathbf{x}_0 \sim q(\mathbf{x}_0)$
  • $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
  • $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}$
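The sampling recipe above maps directly to a training step. A sketch of computing $\mathcal{L}_\text{simple}$ on one batch; the "network" here is a stand-in (a fixed linear map) so the example stays self-contained, whereas in practice $\boldsymbol{\epsilon}_\theta$ is a time-conditioned U-Net:

```python
import numpy as np

# Sketch of one L_simple training step. eps_theta is a placeholder linear map
# (an illustrative assumption), standing in for a trained noise-prediction net.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1 - betas)

def eps_theta(xt, t, W):
    # hypothetical noise predictor; a real model would also condition on t
    return xt @ W

rng = np.random.default_rng(3)
W = 0.1 * rng.standard_normal((8, 8))
x0 = rng.standard_normal((32, 8))                 # a batch of "data"

t = rng.integers(0, T, size=32)                   # t ~ Uniform{1..T} (0-indexed)
eps = rng.standard_normal(x0.shape)               # eps ~ N(0, I)
ab = alpha_bars[t][:, None]
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps     # closed-form x_t

loss = np.mean((eps - eps_theta(xt, t, W)) ** 2)  # L_simple on this batch
print(loss)
```

In a real implementation this loss would be backpropagated through the network; everything else (schedule, sampling of $t$, $\boldsymbol{\epsilon}$, and $\mathbf{x}_t$) is exactly the recipe listed above.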

Key Takeaway

DDPM trains a network $\boldsymbol{\epsilon}_\theta$ to predict the noise added to a sample, minimizing the mean squared error of that prediction, and thereby learns to generate data. This objective is equivalent to matching the gradient of the log data density (the score), giving a deep connection to score-based generative models.

Appendix (Detailed Derivations)

Detailed Derivation: Final Simplified Form of the Variational Lower Bound (VLB)

The following derivation starts from the initial VLB expression and simplifies it step by step into the final form:

$$\boxed{\text{VLB} = \mathbb{E}_{q} \left[ \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) \right] - \sum_{t=2}^T \mathbb{E}_{q} \left[ D_\text{KL}\big( q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \big) \right] - D_\text{KL}\big( q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T) \big)}$$


Step 1: Initial VLB expression

From variational inference, the evidence lower bound (ELBO) is:
$$\log p_\theta(\mathbf{x}_0) \geq \underbrace{\mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \right]}_{\text{VLB}}$$
Substituting the joint factorizations:
$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \quad q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$
gives:
$$\text{VLB} = \mathbb{E}_{q} \left[ \log \frac{p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{\prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1})} \right]$$


Step 2: Expanding and regrouping the logarithm

Expanding the log:
$$\begin{aligned} \text{VLB} &= \mathbb{E}_{q} \left[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) - \sum_{t=1}^T \log q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \right] \\ &= \mathbb{E}_{q} \left[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})} \right] \end{aligned}$$


Step 3: Introducing the key variable $\mathbf{x}_0$

Using the Markov property ($q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0)$) and Bayes' rule, rewrite $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$:
$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \cdot q(\mathbf{x}_t \mid \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}$$
Taking logs:
$$\log q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) + \log q(\mathbf{x}_t \mid \mathbf{x}_0) - \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)$$


Step 4: Substituting and splitting the sums

Substituting this expression into the VLB:
$$\begin{aligned} \text{VLB} &= \mathbb{E}_{q} \Bigg[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \\ &\quad - \sum_{t=1}^T \Big( \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) + \log q(\mathbf{x}_t \mid \mathbf{x}_0) - \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_0) \Big) \Bigg] \end{aligned}$$
Regrouping into separate sums:
$$\begin{aligned} \text{VLB} &= \mathbb{E}_{q} \Bigg[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) - \sum_{t=1}^T \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \\ &\quad - \sum_{t=1}^T \log q(\mathbf{x}_t \mid \mathbf{x}_0) + \sum_{t=1}^T \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_0) \Bigg] \end{aligned}$$


Step 5: Handling the telescoping sums

Define the intermediate quantity:
$$S = -\sum_{t=1}^T \log q(\mathbf{x}_t \mid \mathbf{x}_0) + \sum_{t=1}^T \log q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)$$
Writing out both sums:
$$\begin{aligned} S &= \Big[ -\log q(\mathbf{x}_1 \mid \mathbf{x}_0) - \log q(\mathbf{x}_2 \mid \mathbf{x}_0) - \cdots - \log q(\mathbf{x}_T \mid \mathbf{x}_0) \Big] \\ &\quad + \Big[ \log q(\mathbf{x}_0 \mid \mathbf{x}_0) + \log q(\mathbf{x}_1 \mid \mathbf{x}_0) + \cdots + \log q(\mathbf{x}_{T-1} \mid \mathbf{x}_0) \Big] \end{aligned}$$
The matching terms ($\log q(\mathbf{x}_1 \mid \mathbf{x}_0)$, etc.) telescope away:
$$S = \log q(\mathbf{x}_0 \mid \mathbf{x}_0) - \log q(\mathbf{x}_T \mid \mathbf{x}_0)$$
Since $q(\mathbf{x}_0 \mid \mathbf{x}_0) = 1$ (a deterministic distribution), $\log q(\mathbf{x}_0 \mid \mathbf{x}_0) = 0$, so:
$$S = -\log q(\mathbf{x}_T \mid \mathbf{x}_0)$$


Step 6: Separating the $t=1$ and $t \geq 2$ terms

Regrouping the remaining terms:
$$\text{VLB} = \mathbb{E}_{q} \left[ \log p(\mathbf{x}_T) - \log q(\mathbf{x}_T \mid \mathbf{x}_0) + \sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} \right]$$
Explicitly separating the $t=1$ term (using $q(\mathbf{x}_0 \mid \mathbf{x}_1, \mathbf{x}_0) = 1$):
$$\sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} = \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) + \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}$$


Step 7: Converting to KL divergences

Collecting all terms:
$$\text{VLB} = \mathbb{E}_{q} \left[ \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) + \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} + \log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T \mid \mathbf{x}_0)} \right]$$
By the definition of KL divergence:
$$\mathbb{E}_{q} \left[ \log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T \mid \mathbf{x}_0)} \right] = -\mathbb{E}_{q} \left[ \log \frac{q(\mathbf{x}_T \mid \mathbf{x}_0)}{p(\mathbf{x}_T)} \right] = -D_{\text{KL}}\big( q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T) \big)$$
and:
$$\mathbb{E}_{q} \left[ \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} \right] = -\mathbb{E}_{q} \left[ D_{\text{KL}}\big( q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \big) \right]$$


Step 8: Final simplified form

Substituting gives:
$$\boxed{\begin{aligned} \text{VLB} = &\; \mathbb{E}_{q(\mathbf{x}_1 \mid \mathbf{x}_0)} \Big[ \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) \Big] \\ &- \sum_{t=2}^T \mathbb{E}_{q(\mathbf{x}_t \mid \mathbf{x}_0)} \left[ D_{\text{KL}}\big( q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \big) \right] \\ &- D_{\text{KL}}\big( q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T) \big) \end{aligned}}$$


Key Remarks

  1. Simplifying the expectations

    • The first term depends only on $\mathbf{x}_1$: $\mathbb{E}_{q(\mathbf{x}_1 \mid \mathbf{x}_0)}[\cdot]$
    • The second term depends on $\mathbf{x}_t$: $\mathbb{E}_{q(\mathbf{x}_t \mid \mathbf{x}_0)}[\cdot]$ (the KL divergence needs only $\mathbf{x}_t$ and $\mathbf{x}_0$)
    • The third term is an analytically computable scalar
  2. Interpretation

    • Reconstruction term: $\mathbb{E}[\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)]$ measures the ability to generate the data
    • Denoising matching terms: the KL divergences force the generative process to match the reverse-time posterior of the diffusion process
    • Prior matching term: ensures the final distribution of $\mathbf{x}_T$ is close to the standard Gaussian prior
  3. Connection to noise prediction
    Via the closed form derived earlier:
    $$D_{\text{KL}}\big( q(\cdot) \,\|\, p_\theta(\cdot) \big) \propto \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2$$
    so the final objective reduces to the mean squared error of noise prediction.

Detailed Derivation: the Posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ and the Parameterized Mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$


1. Closed form of the posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$

By Bayes' rule and the Markov property of the diffusion process:
$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0) \cdot q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)} = \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \cdot q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)}$$
where:

  • $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{\alpha_t}\, \mathbf{x}_{t-1},\, \beta_t \mathbf{I}\big)$
  • $q(\mathbf{x}_{t-1} \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \sqrt{\bar{\alpha}_{t-1}}\, \mathbf{x}_0,\, (1 - \bar{\alpha}_{t-1}) \mathbf{I}\big)$
  • $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0,\, (1 - \bar{\alpha}_t) \mathbf{I}\big)$

with the definitions:
$$\alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{i=1}^t \alpha_i$$
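The marginal $q(\mathbf{x}_t \mid \mathbf{x}_0)$ listed above is itself a consequence of composing the per-step kernels. A quick simulation check with an illustrative schedule: run the chain $\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\mathbf{z}_t$ for many independent trajectories and compare the empirical mean and variance at $t = T$ with the closed form:

```python
import numpy as np

# Simulation check that composing q(x_t|x_{t-1}) = N(sqrt(alpha_t) x_{t-1}, beta_t)
# yields the marginal q(x_T|x_0) = N(sqrt(ab_T) x_0, 1 - ab_T). Scalar case,
# illustrative schedule and x_0.
T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas, alpha_bars = 1 - betas, np.cumprod(1 - betas)

rng = np.random.default_rng(4)
n, x0 = 100_000, 1.5
x = np.full(n, x0)
for t in range(T):                       # run the chain step by step
    x = np.sqrt(alphas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(n)

print(x.mean(), np.sqrt(alpha_bars[-1]) * x0)   # empirical vs closed-form mean
print(x.var(), 1 - alpha_bars[-1])              # empirical vs closed-form variance
```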

Deriving the mean $\tilde{\boldsymbol{\mu}}_t$
Matching the exponents of the Gaussians (ignoring terms that do not depend on $\mathbf{x}_{t-1}$):
$$-\frac{1}{2} \left[ \frac{\|\mathbf{x}_t - \sqrt{\alpha_t}\, \mathbf{x}_{t-1}\|^2}{\beta_t} + \frac{\|\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\, \mathbf{x}_0\|^2}{1 - \bar{\alpha}_{t-1}} \right] + C$$
Extracting the quadratic and linear coefficients in $\mathbf{x}_{t-1}$:

  1. Quadratic coefficient:
    $$A = \frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}} = \frac{\alpha_t(1 - \bar{\alpha}_{t-1}) + \beta_t}{\beta_t (1 - \bar{\alpha}_{t-1})} = \frac{1 - \bar{\alpha}_t}{\beta_t (1 - \bar{\alpha}_{t-1})}$$
    (using $\alpha_t \bar{\alpha}_{t-1} = \bar{\alpha}_t$ and $\beta_t = 1 - \alpha_t$)
  2. Linear coefficient:
    $$\mathbf{b} = \frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0$$
    The Gaussian mean satisfies $\tilde{\boldsymbol{\mu}}_t = A^{-1} \mathbf{b}$, which gives:
    $$\tilde{\boldsymbol{\mu}}_t = \frac{\beta_t (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \left( \frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0 \right) = \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0$$

Deriving the variance $\tilde{\beta}_t$
The covariance is $A^{-1} \mathbf{I}$:
$$\tilde{\beta}_t = \frac{\beta_t (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}$$

Final closed form
$$\boxed{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\left( \mathbf{x}_{t-1};\, \underbrace{\frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t}_{\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)},\, \underbrace{\frac{\beta_t (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{I}}_{\tilde{\beta}_t \mathbf{I}} \right)}$$


2. Deriving the parameterized mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$

From the forward process:
$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
Solving for $\mathbf{x}_0$:
$$\mathbf{x}_0 = \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}}{\sqrt{\bar{\alpha}_t}}$$
Substituting into the posterior mean $\tilde{\boldsymbol{\mu}}_t$:
$$\tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \left( \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}}{\sqrt{\bar{\alpha}_t}} \right) + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t$$
Collecting terms:
$$\tilde{\boldsymbol{\mu}}_t = \frac{\beta_t \sqrt{\bar{\alpha}_{t-1}}}{\sqrt{\bar{\alpha}_t}\, (1 - \bar{\alpha}_t)} \mathbf{x}_t - \frac{\beta_t \sqrt{\bar{\alpha}_{t-1}}\, \sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}\, (1 - \bar{\alpha}_t)} \boldsymbol{\epsilon} + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t$$
Simplify using $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$, i.e. $\sqrt{\bar{\alpha}_t} = \sqrt{\alpha_t} \sqrt{\bar{\alpha}_{t-1}}$:

  1. Coefficient of $\mathbf{x}_t$:
    $$\frac{\beta_t}{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_t)} + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} = \frac{1}{\sqrt{\alpha_t}} \quad \text{(simplification detailed in the remarks)}$$
  2. Coefficient of $\boldsymbol{\epsilon}$:
    $$-\frac{\beta_t}{\sqrt{\alpha_t}\, \sqrt{1 - \bar{\alpha}_t}}$$
    Hence:
    $$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t - \frac{\beta_t}{\sqrt{\alpha_t}\, \sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon} \right)$$
    Parameterization strategy
    Use a neural network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ to predict the noise $\boldsymbol{\epsilon}$:
    $$\boxed{\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)}$$

Remarks

  1. Simplifying the coefficient of $\mathbf{x}_t$:
    $$\begin{aligned} &\frac{\beta_t}{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_t)} + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \\ =&\; \frac{\beta_t + \alpha_t (1 - \bar{\alpha}_{t-1})}{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_t)} \quad \text{(common denominator)} \\ =&\; \frac{(1 - \alpha_t) + \alpha_t - \bar{\alpha}_t}{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_t)} \quad \text{(using } \beta_t = 1 - \alpha_t \text{ and } \alpha_t \bar{\alpha}_{t-1} = \bar{\alpha}_t\text{)} \\ =&\; \frac{1 - \bar{\alpha}_t}{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_t)} = \frac{1}{\sqrt{\alpha_t}} \end{aligned}$$

  2. Interpretation

    • The posterior mean $\tilde{\boldsymbol{\mu}}_t$ is a linear combination of $\mathbf{x}_0$ and $\mathbf{x}_t$, and depends explicitly on the initial data $\mathbf{x}_0$.
    • The parameterized mean $\boldsymbol{\mu}_\theta$ uses the noise-prediction network $\boldsymbol{\epsilon}_\theta$ to remove the explicit dependence on $\mathbf{x}_0$, making the reverse process computable.
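With $\boldsymbol{\mu}_\theta$ in hand, ancestral sampling runs the reverse chain from $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. A minimal sketch; the noise predictor is a stand-in that returns zeros (chosen only so the loop is runnable), and $\sigma_t^2 = \beta_t$ is one of the two common choices mentioned above:

```python
import numpy as np

# Sketch of DDPM ancestral sampling using the parameterized mean derived above.
# eps_theta is a placeholder (predicts zero noise); a trained network would
# replace it. Schedule and dimensions are illustrative.
T = 100
betas = np.linspace(1e-4, 0.05, T)
alphas, alpha_bars = 1 - betas, np.cumprod(1 - betas)

def eps_theta(xt, t):
    return np.zeros_like(xt)   # placeholder network output

rng = np.random.default_rng(5)
x = rng.standard_normal(8)                     # x_T ~ N(0, I)
for t in range(T - 1, -1, -1):                 # t = T..1 (0-indexed arrays)
    mu = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_theta(x, t)) / np.sqrt(alphas[t])
    sigma = np.sqrt(betas[t]) if t > 0 else 0.0   # sigma_t^2 = beta_t; no noise at the last step
    x = mu + sigma * rng.standard_normal(8)
print(x.shape)                                 # the generated sample x_0
```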
