
RL【9】：Policy Gradient

news 2025/9/13 5:45:46

Series index

Fundamental Tools

RL【1】：Basic Concepts
RL【2】：Bellman Equation
RL【3】：Bellman Optimality Equation

Algorithm

RL【4】：Value Iteration and Policy Iteration
RL【5】：Monte Carlo Learning
RL【6】：Stochastic Approximation and Stochastic Gradient Descent

Method

RL【7-1】：Temporal-difference Learning
RL【7-2】：Temporal-difference Learning

Contents

Series index
- Fundamental Tools
- Algorithm
- Method
Preface
Basic idea of policy gradient
Metrics to define optimal policies
- Average value
- Average reward
- Remarks
Gradients of the metrics
Gradient-ascent algorithm (REINFORCE)
Summary

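Preface

This series records my study notes for Prof. Shiyu Zhao's Bilibili course "Mathematical Foundations of Reinforcement Learning" (【强化学习的数学原理】). For the course content itself, see:
Bilibili video: 【【强化学习的数学原理】课程：从零开始到透彻理解（完结）】
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning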

Basic idea of policy gradient

Policy Parameterization

Policies can now be represented by parameterized functions:

$\pi(a|s,\theta)$
- where $\theta \in \mathbb{R}^m$ is a parameter vector.
The function can be, for example, a neural network whose input is $s$, whose output is the probability of taking each action, and whose parameter is $\theta$.
Advantage: when the state space is large, a tabular representation is inefficient in terms of storage and generalization, whereas a parameterized function is not.
The function representation is also sometimes written as $\pi(a,s,\theta)$, $\pi_\theta(a|s)$, or $\pi_\theta(a,s)$.

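Policy Parameterization

Formula:

$\pi(a|s,\theta)$

is the probability of choosing action $a$ in state $s$, where $\theta$ is the parameter vector.

Intuition:

In tabular methods, a table stores the probabilities $\pi(a|s)$ directly;
in the function representation, parameters (for example the weights $\theta$ of a neural network) "generate" these probabilities.

Advantages:

When the state space is large (image inputs, continuous states), a table scales poorly in storage and cannot generalize across states;

a function (a neural network in particular) can represent complex policies compactly and generalizes to states it has not seen.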

Differences between tabular and function representations

First, how to define optimal policies?
- When represented as a table, a policy $\pi$ is optimal if it maximizes every state value.
- When represented by a function, a policy $\pi$ is optimal if it maximizes a certain scalar metric.
Second, how to access the probability of an action?
- In the tabular case, the probability of taking $a$ at $s$ can be read directly from the table.
- In the case of function representation, we need to calculate the value of $\pi(a|s,\theta)$ given the function structure and the parameter.
Third, how to update policies?
- When represented by a table, a policy $\pi$ can be updated by directly changing the entries in the table.
- When represented by a parameterized function, a policy $\pi$ can no longer be updated in this way. Instead, it can only be updated by changing the parameter $\theta$.

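Three differences between tabular and function representations

How to define an optimal policy
Tabular: the policy must be optimal in every state (that is, maximize all state values).
Function: only a global scalar metric is optimized, for example the expected return $J(\theta)$.
This is why deep reinforcement learning usually defines an objective function and optimizes it with gradient methods (gradient ascent on $J$, or equivalently gradient descent on $-J$).

How to obtain the probability of an action
Tabular: look up $\pi(a|s)$ in the table.
Function: evaluate the function (for example a forward pass of a neural network) to get $\pi(a|s,\theta)$.
This costs more computation but provides generalization.

How to update the policy
Tabular: modify the table entries directly, for example the value of $\pi(a|s)$.
Function: the policy cannot be modified directly; it changes only indirectly, through updates to the parameter $\theta$.
In a neural network, for example, backpropagation updates the weights $\theta$, which in turn changes the output probability distribution.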

Policy Gradient: Basic Idea

The basic idea of the policy gradient is simple:
- First, define a metric (objective function) $J(\theta)$ that specifies which policies are optimal.
- Second, use a gradient-based optimization algorithm to search for optimal policies:
  
  $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$
Although the idea is simple, complications emerge when we try to answer the following questions:
- What metric should be used?
- How can the gradient of the metric be calculated?

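The basic idea of policy gradient

Goal: define a performance metric $J(\theta)$, for example the expected cumulative return.

Update rule: use gradient ascent (because $J(\theta)$ is to be maximized):

$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$

Difficulties:

Which metric $J(\theta)$ should define the optimal policy? (For example, the expected return.)
How is $\nabla_\theta J(\theta)$ computed? (This is what the policy gradient theorem answers.)

These two questions are the foundation of the policy gradient methods that follow (such as REINFORCE and Actor-Critic).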

Policy Gradient Theorem

Core result (for stochastic policies):

$\boxed{\;\nabla_\theta J(\theta) =\mathbb{E}_{s\sim d_{\pi_\theta},\,a\sim\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a|s)\,Q^{\pi_\theta}(s,a)\big]\;}$

Intuitive derivation (likelihood ratio / score function):

The trajectory probability is $P_\theta(\tau)=\rho(s_0)\prod_t \pi_\theta(a_t|s_t)P(s_{t+1}|s_t,a_t)$;
$\nabla_\theta J=\sum_\tau R(\tau)\nabla_\theta P_\theta(\tau) =\mathbb{E}_{\tau}\!\big[R(\tau)\nabla_\theta\log P_\theta(\tau)\big]$;
$\log P_\theta(\tau)=\sum_t \log \pi_\theta(a_t|s_t)+\text{(terms independent of }\theta\text{)}$, so $\nabla_\theta\log P_\theta(\tau)=\sum_t \nabla_\theta\log \pi_\theta(a_t|s_t)$;
Rewards received before time $t$ do not depend on $a_t$, so each term $\nabla_\theta\log\pi_\theta(a_t|s_t)$ only needs to be weighted by the return that follows it; replacing that return with $Q^{\pi}(s_t,a_t)$ gives the result above.

The theorem turns "how should the parameters change to increase the return" into "increase the log-probability of actions that lead to high return".

Metrics to define optimal policies

Average value

Definition

$\bar v_\pi = \sum_{s \in \mathcal S} d(s) v_\pi(s)$

$\bar v_\pi$ is a weighted average of the state values.
$d(s) \geq 0$ is the weight for state $s$.
Since $\sum_{s \in \mathcal S} d(s) = 1$, we can interpret $d(s)$ as a probability distribution. Then the metric can be written as

$\bar v_\pi = \mathbb{E}[v_\pi(S)]$
- where $S \sim d$.

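What the average state value means

Formula:

$\bar v_\pi = \sum_{s \in \mathcal S} d(s) v_\pi(s)$

Here $\bar v_\pi$ is a scalar: the weighted average of the state values under a state distribution $d$.
$v_\pi(s)$: the value of state $s$ under policy $\pi$.
$d(s)$: the weight, which can be read as the probability that state $s$ occurs.

In other words, $\bar v_\pi$ summarizes the overall long-run performance of policy $\pi$ in a single number.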

Vector-product form

$\bar v_\pi = \sum_{s \in \mathcal S} d(s) v_\pi(s) = d^T v_\pi$
- where
  
  $v_\pi = [\ldots, v_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal S|}$
  
  $d = [\ldots, d(s), \ldots]^T \in \mathbb{R}^{|\mathcal S|}$

How to select the distribution $d$ ?

The first case is that $d$ is independent of the policy $\pi$.
- This case is relatively simple because the gradient of the metric is easier to calculate.
- In this case, we specifically denote $d$ as $d_0$ and $\bar v_\pi$ as $\bar v^0_\pi$.
- How to select $d_0$?
  - One trivial way is to treat all states as equally important and hence select
    
    $d_0(s) = \frac{1}{|\mathcal S|}$
  - Another important case is that we are only interested in a specific state $s_0$. For example, the episodes in some tasks always start from the same state $s_0$. Then we only care about the long-term return starting from $s_0$. In this case,
    
    $d_0(s_0) = 1, \quad d_0(s \neq s_0) = 0$
The second case is that $d$ depends on the policy $\pi$.
- A common choice is $d = d_\pi$, the stationary distribution under $\pi$.
- One basic property of $d_\pi$ is that it satisfies
  
  $d_\pi^T P_\pi = d_\pi^T$
  - where $P_\pi$ is the state transition probability matrix.
- The interpretation of selecting $d_\pi$ is as follows:
  - If a state is visited frequently in the long run, it is more important and deserves more weight.
  - If a state is hardly visited, then we give it less weight.
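
The defining property $d_\pi^T P_\pi = d_\pi^T$ can be checked numerically. The following is a minimal sketch in plain Python, not part of the course material: the 3-state matrix `P_pi` is a made-up example, and power iteration is just one common way to approximate the stationary distribution (it assumes the chain is irreducible and aperiodic).

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Approximate d with d^T P = d^T by power iteration.

    P[i][j] is the probability of moving from state i to state j under pi.
    """
    n = len(P)
    d = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        new_d = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(new_d, d)) < tol:
            return new_d
        d = new_d
    return d


# Hypothetical transition matrix P_pi induced by some policy (each row sums to 1).
P_pi = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.0, 0.8],
]

d_pi = stationary_distribution(P_pi)
# One more step of the chain should leave d_pi (almost) unchanged.
d_next = [sum(d_pi[i] * P_pi[i][j] for i in range(3)) for j in range(3)]
print(d_pi, d_next)
```

States with larger entries in `d_pi` are the ones the policy visits more often, which is exactly why they receive more weight in $\bar v_\pi$.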

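Two ways to choose the distribution

Case 1: $d$ is independent of the policy $\pi$
A simple setting, for example when all states should be treated equally.
Choices:
Uniform distribution: $d_0(s) = 1/|\mathcal S|$, so all states are equally important.
Single-point distribution: if only the starting state $s_0$ matters, then $d_0(s_0)=1$ and all other entries are $0$.

When to use: the task has a specific starting state, or we want a policy that is good everywhere regardless of how often each state is visited.

Case 2: $d$ depends on the policy $\pi$

A common choice is the stationary distribution $d_\pi(s)$.

It gives the long-run visitation frequency of each state under policy $\pi$.

Property:

$d_\pi^T P_\pi = d_\pi^T$

That is, $d_\pi$ is invariant under the state transition matrix $P_\pi$.

Intuition:

Frequently visited states get large weights (learning focuses on them).
Rarely visited states get small weights (their values may be approximated poorly, but this barely affects the metric).

When to use: we want the objective to match the distribution the policy actually encounters, so that it reflects the real performance of the policy.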

Average reward

Definition

In particular, the metric is

$\bar{r}_\pi \doteq \sum_{s \in \mathcal{S}} d_\pi(s) r_\pi(s) = \mathbb{E}[r_\pi(S)],$
where $S \sim d_\pi$. Here,

$r_\pi(s) \doteq \sum_{a \in \mathcal{A}} \pi(a|s) r(s,a)$
is the average of the one-step immediate reward that can be obtained starting from state $s$, and

$r(s,a) = \mathbb{E}[R|s,a] = \sum_r r\, p(r|s,a)$
- The weight $d_\pi$ is the stationary distribution.
- As its name suggests, $\bar{r}_\pi$ is simply a weighted average of the one-step immediate rewards.

Equivalent definition

Suppose an agent follows a given policy and generates a trajectory with rewards $(R_{t+1}, R_{t+2}, \cdots)$.
The average single-step reward along this trajectory is

$\lim_{n \to \infty} \frac{1}{n} \mathbb{E}\Big[ R_{t+1} + R_{t+2} + \cdots + R_{t+n} \,\big|\, S_t = s_0 \Big]$

$= \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\Big[ \sum_{k=1}^n R_{t+k} \,\big|\, S_t = s_0 \Big]$
- where $s_0$ is the starting state of the trajectory.

Important property

An important property is that

$\lim_{n \to \infty} \frac{1}{n} \mathbb{E}\Big[ \sum_{k=1}^n R_{t+k} \,\big|\, S_t = s_0 \Big] = \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\Big[ \sum_{k=1}^n R_{t+k} \Big]$

$= \sum_s d_\pi(s) r_\pi(s)$

$= \bar{r}_\pi$
Note that
- The starting state $s_0$ does not matter.
- The two definitions of $\bar{r}_\pi$ are equivalent.
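
To make the equivalence concrete, here is a small sketch that evaluates both definitions on a toy chain. Everything in it is an illustrative assumption rather than course material: the transition matrix `P_pi`, the reward vector `r_pi`, and the finite horizon `n = 20000` that stands in for the limit.

```python
def left_multiply(dist, P):
    """Return dist^T P: the state distribution one step later."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical chain induced by a fixed policy pi, and its expected one-step rewards r_pi(s).
P_pi = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.0, 0.8],
]
r_pi = [1.0, 0.0, 2.0]

# Definition 1: weight r_pi(s) by the stationary distribution d_pi (power iteration).
d_pi = [1.0 / 3] * 3
for _ in range(5000):
    d_pi = left_multiply(d_pi, P_pi)
r_bar_weighted = sum(d * r for d, r in zip(d_pi, r_pi))


# Definition 2: (1/n) * E[R_{t+1} + ... + R_{t+n} | S_t = s0], computed exactly
# by propagating the state distribution forward from a point mass at s0.
def average_reward_from(s0, n):
    dist = [1.0 if s == s0 else 0.0 for s in range(3)]
    total = 0.0
    for _ in range(n):
        total += sum(p * r for p, r in zip(dist, r_pi))  # expected reward of this step
        dist = left_multiply(dist, P_pi)
    return total / n


r_bar_by_start = [average_reward_from(s0, 20000) for s0 in range(3)]
print(r_bar_weighted, r_bar_by_start)
```

For a finite `n` the values in `r_bar_by_start` differ slightly from each other and from `r_bar_weighted`; the gap shrinks roughly like $1/n$ because the influence of the starting state is a transient.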

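Average reward (average one-step reward)

Definition recap

The average one-step reward, written $\bar r_\pi$, is

$\bar r_\pi = \sum_{s \in \mathcal{S}} d_\pi(s) r_\pi(s) = \mathbb{E}[r_\pi(S)] .$

Here:

$d_\pi(s)$: the stationary distribution under policy $\pi$ (the long-run probability of being in state $s$);

$r_\pi(s)$: the expected one-step reward in state $s$, defined as

$r_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\, r(s,a),$

where $r(s,a) = \mathbb{E}[R | s,a]$ is the expected one-step reward obtained at $(s, a)$.

In other words, $\bar r_\pi$ measures the average reward per step that the agent collects in the long run.

Intuition
A distribution-weighted average
A plain mean over all states is not appropriate, because some states are rarely visited and should carry little weight.
Hence $d_\pi(s)$ is used as the weight, so that frequently visited states count more in the expectation.
This makes $\bar r_\pi$ the per-step reward in the long-run average sense.

Equivalent definition

Another definition looks directly at a trajectory:

$\bar r_\pi = \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\Big[ R_{t+1} + R_{t+2} + \cdots + R_{t+n} \,\Big|\, S_t = s_0 \Big].$

That is, whatever the initial state $s_0$, following policy $\pi$ indefinitely makes the average reward per step converge to the same value $\bar r_\pi$ (assuming the chain under $\pi$ has a unique stationary distribution).

The starting state does not matter: the long-run average is the same.

Important property

It can be shown that

$\lim_{n \to \infty} \frac{1}{n} \mathbb{E}\Big[ \sum_{k=1}^n R_{t+k} \,\big|\, S_t = s_0 \Big] = \sum_s d_\pi(s) r_\pi(s) = \bar r_\pi .$

So the trajectory average and the weighted expectation are equivalent.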

Remarks

Remark 1 about the metrics:
- All these metrics are functions of $\pi$.
- Since $\pi$ is parameterized by $\theta$, these metrics are functions of $\theta$.
- In other words, different values of $\theta$ can generate different metric values.
- Therefore, we can search for the optimal values of $\theta$ to maximize these metrics.
Remark 2 about the metrics:
- One complication is that the metrics can be defined in either the discounted case where $\gamma \in (0,1)$ or the undiscounted case where $\gamma = 1$.
Remark 3 about the metrics:
- Intuitively, $\bar r_\pi$ is more short-sighted because it merely considers the immediate rewards, whereas $\bar v_\pi$ considers the total reward over all steps.
- However, the two metrics are equivalent to each other. In the discounted case where $\gamma < 1$, and with $\bar v_\pi$ weighted by the stationary distribution $d_\pi$, it holds that
  
  $\bar r_\pi = (1-\gamma)\bar v_\pi.$
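
The identity follows in two lines from the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$ and the stationarity of $d_\pi$: multiplying by $d_\pi^T$ gives $\bar v_\pi = \bar r_\pi + \gamma d_\pi^T P_\pi v_\pi = \bar r_\pi + \gamma \bar v_\pi$. The sketch below checks this numerically on a toy example; the chain `P_pi`, the rewards `r_pi`, and `GAMMA = 0.9` are assumptions made up for illustration.

```python
GAMMA = 0.9

# Hypothetical chain and expected one-step rewards under a fixed policy pi.
P_pi = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.0, 0.8],
]
r_pi = [1.0, 0.0, 2.0]
n = len(r_pi)

# State values: iterate the Bellman equation v = r_pi + gamma * P_pi v to its fixed point.
v_pi = [0.0] * n
for _ in range(2000):
    v_pi = [
        r_pi[s] + GAMMA * sum(P_pi[s][s2] * v_pi[s2] for s2 in range(n))
        for s in range(n)
    ]

# Stationary distribution: iterate d^T <- d^T P_pi.
d_pi = [1.0 / n] * n
for _ in range(5000):
    d_pi = [sum(d_pi[i] * P_pi[i][j] for i in range(n)) for j in range(n)]

v_bar = sum(d * v for d, v in zip(d_pi, v_pi))  # average value, weighted by d_pi
r_bar = sum(d * r for d, r in zip(d_pi, r_pi))  # average reward
gap = abs(r_bar - (1 - GAMMA) * v_bar)
print(r_bar, (1 - GAMMA) * v_bar, gap)
```

If `v_bar` were computed with a different weight (for example the uniform $d_0$), the two printed numbers would in general no longer agree.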

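Summary of Remarks about the Metrics

Remark 1: Metrics are functions of parameters

All metrics (the average state value $\bar v_\pi$, the average reward $\bar r_\pi$) are functions of the policy $\pi$.

The policy $\pi$ is in turn controlled by the parameter $\theta$, so the metrics are functions of $\theta$.

Changing $\theta$ changes the metric value.

The goal is to find the optimal parameter $\theta^*$ that maximizes the metric.

→ This is the basic idea of policy gradient methods: improve the metric by gradient-based optimization of the parameters.

Remark 2: Discounted vs. Undiscounted cases

The metrics can be defined in the discounted case ($\gamma \in (0,1)$) or the undiscounted case ($\gamma = 1$).
Discounted case: more common; future rewards are discounted, which keeps the return finite when rewards are bounded.
Undiscounted case: averages the long-run reward directly, but the theory and computation are more involved.

Remark 3: Relation between average reward and average state value

Intuitive difference:

Average reward $\bar r_\pi$: more "short-sighted", looks only at the one-step immediate reward.
Average state value $\bar v_\pi$: more "far-sighted", looks at the cumulative return.

Mathematical relation: in the discounted case $\gamma < 1$ the two are equivalent and satisfy

$\bar r_\pi = (1-\gamma)\bar v_\pi$

So optimizing $\bar v_\pi$ and optimizing $\bar r_\pi$ are essentially the same; only the form differs.

In one sentence:

The three remarks make three points:

The metrics depend on the parameter $\theta$, so they can be optimized with gradient methods (the basis of policy gradient).
The metrics can be defined with or without discounting, and the discounted version is the common one.
The average state value and the average reward are equivalent measures (in the discounted case); one emphasizes the long term, the other the immediate reward.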

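Average state value vs. average reward: similarities, differences, and connections

Definitions

Average state value

$\bar v_\pi = \sum_{s \in \mathcal{S}} d(s) v_\pi(s) = \mathbb{E}_{S \sim d}[v_\pi(S)]$

is a weighted average of the state value function $v_\pi(s)$;
the weights $d(s)$ come from a state distribution (for example the uniform distribution or the stationary distribution $d_\pi$).

Average reward

$\bar r_\pi = \sum_{s \in \mathcal{S}} d_\pi(s) r_\pi(s) = \mathbb{E}_{S \sim d_\pi}[r_\pi(S)]$

is a weighted average of the one-step rewards;
the weights must be the stationary distribution $d_\pi$ under policy $\pi$.

Similarities

Both are weighted averages
Each is a weighted average over the state space of some function ($v_\pi(s)$ or $r_\pi(s)$).
In both cases the weighting is determined by a state distribution.

Both describe long-run behavior
$\bar v_\pi$: the overall level of state values when the agent runs for a long time.
$\bar r_\pi$: the average reward per step when the agent runs for a long time.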

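Differences

| Aspect | Average state value $\bar v_\pi$ | Average reward $\bar r_\pi$ |
| --- | --- | --- |
| Object | State value $v_\pi(s)$: long-term discounted return | One-step reward $r_\pi(s)$: immediate reward |
| Formula | $\bar v_\pi = \sum_s d(s) v_\pi(s)$ | $\bar r_\pi = \sum_s d_\pi(s) r_\pi(s)$ |
| Weight distribution | Uniform distribution $d_0$ or stationary distribution $d_\pi$ | Must be the stationary distribution $d_\pi$ |
| Typical use | Mostly used to analyze value error under function approximation | Mostly used in average-reward RL (continuing tasks) |
| Time scale | Long-term return (discounted sum) | Average immediate reward per step |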

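Connections

Relation between the value function and the reward function

The state value function is itself defined from rewards:

$v_\pi(s) = \mathbb{E}_\pi\Big[\sum_{t=0}^\infty \gamma^t R_{t+1} \,\Big|\, S_0 = s\Big].$

Therefore:

$\bar v_\pi$ can be seen as a weighted average of long-term discounted rewards;
$\bar r_\pi$ can be seen as a weighted average of immediate rewards.

Complementary roles
In the discounted-reward RL setting, $\bar v_\pi$ is used more often.
In the average-reward RL setting, $\bar r_\pi$ is the core objective.
Both measure long-run performance, from different angles.

In one sentence:

The average state value $\bar v_\pi$ is a global weighted average of long-term discounted returns;
the average reward $\bar r_\pi$ is the long-run average of immediate rewards;
the former emphasizes long-term value and the latter long-run average gain. Both depend on a state distribution, but they suit different settings and objectives.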


Gradients of the metrics

Summary of the results about the gradients

$\nabla_\theta J(\theta) = \sum_{s \in \mathcal S} \eta(s) \sum_{a \in \mathcal A} \nabla_\theta \pi(a|s,\theta) q_\pi(s,a)$

where:
- $J(\theta)$ can be $\bar v_\pi$, $\bar r_\pi$, or $\bar v^0_\pi$
- "$=$" may denote strict equality, approximation, or proportionality
- $\eta$ is a distribution or weight of the states

Some specific results

$\nabla_\theta \bar r_\pi \simeq \sum_s d_\pi(s) \sum_a \nabla_\theta \pi(a|s,\theta) q_\pi(s,a)$

$\nabla_\theta \bar v_\pi = \frac{1}{1-\gamma} \nabla_\theta \bar r_\pi$

$\nabla_\theta \bar v_\pi^0 = \sum_{s \in \mathcal S} \rho_\pi(s) \sum_{a \in \mathcal A} \nabla_\theta \pi(a|s,\theta) q_\pi(s,a)$

A compact and useful form of the gradient

$\nabla_\theta J(\theta) = \sum_{s \in \mathcal S} \eta(s) \sum_{a \in \mathcal A} \nabla_\theta \pi(a|s,\theta) q_\pi(s,a) = \mathbb{E}[\nabla_\theta \ln \pi(A|S,\theta) q_\pi(S,A)]$

where $S \sim \eta$ and $A \sim \pi(A|S,\theta)$.
Why is this expression useful?
- Because we can use samples to approximate the gradient!
  
  $\nabla_\theta J \approx \nabla_\theta \ln \pi(a|s,\theta) q_\pi(s,a)$
How to prove the above equation?
- Consider the function $\ln \pi$, where $\ln$ is the natural logarithm. It is easy to see that
  
  $\nabla_\theta \ln \pi(a|s,\theta) = \frac{\nabla_\theta \pi(a|s,\theta)}{\pi(a|s,\theta)}$
- and hence
  
  $\nabla_\theta \pi(a|s,\theta) = \pi(a|s,\theta) \nabla_\theta \ln \pi(a|s,\theta).$
- Then, writing $d$ for the state weight $\eta$, we have
  
  $\nabla_\theta J = \sum_s d(s) \sum_a \nabla_\theta \pi(a|s,\theta) q_\pi(s,a)$
  
  $= \sum_s d(s) \sum_a \pi(a|s,\theta) \nabla_\theta \ln \pi(a|s,\theta) q_\pi(s,a)$
  
  $= \mathbb{E}_{S \sim d} \Bigg[ \sum_a \pi(a|S,\theta) \nabla_\theta \ln \pi(a|S,\theta) q_\pi(S,a) \Bigg]$
  
  $= \mathbb{E}_{S \sim d, A \sim \pi} \big[ \nabla_\theta \ln \pi(A|S,\theta) q_\pi(S,A) \big]$
  
  $\doteq \mathbb{E}[\nabla_\theta \ln \pi(A|S,\theta) q_\pi(S,A)]$
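
The key step, $\nabla_\theta \pi = \pi \nabla_\theta \ln \pi$, can be sanity-checked numerically for a single state. The sketch below assumes a tabular softmax policy whose preferences are the parameters themselves ($h(s,a,\theta) = \theta_a$), for which $\nabla_\theta \ln \pi(a) = e_a - \pi$; the numbers in `theta` and `q` are arbitrary illustrative values.

```python
import math


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(theta, a):
    """Analytic gradient of ln pi(a) for softmax preferences: e_a - pi."""
    pi = softmax(theta)
    return [(1.0 if i == a else 0.0) - pi[i] for i in range(len(theta))]


def grad_pi_fd(theta, a, eps=1e-6):
    """Central finite-difference gradient of pi(a) with respect to theta."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((softmax(up)[a] - softmax(down)[a]) / (2 * eps))
    return grad


theta = [0.2, -0.5, 1.0]  # hypothetical preferences for 3 actions in one state
q = [1.0, 3.0, -2.0]      # hypothetical action values q_pi(s, a)
pi = softmax(theta)
n = len(theta)

# Left-hand side: sum_a grad pi(a) * q(a), with grad pi taken numerically.
fd = [grad_pi_fd(theta, a) for a in range(n)]
lhs = [sum(fd[a][i] * q[a] for a in range(n)) for i in range(n)]

# Right-hand side: E_{A ~ pi}[grad ln pi(A) * q(A)], written as an explicit sum.
gl = [grad_log_pi(theta, a) for a in range(n)]
rhs = [sum(pi[a] * gl[a][i] * q[a] for a in range(n)) for i in range(n)]
print(lhs, rhs)
```

In practice the right-hand side is the useful one: the sum over actions weighted by `pi[a]` can be replaced by a single sampled action.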

Some remarks

Because we need to calculate $\ln \pi(a|s,\theta)$, we must ensure that for all $s, a, \theta$

$\pi(a|s,\theta) > 0$
This can be achieved by using softmax functions, which normalize the entries of a vector from $(-\infty,+\infty)$ to $(0, 1)$.
For example, for any vector $x = [x_1, \ldots, x_n]^T$,

$z_i = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$
- where $z_i \in (0,1)$ and $\sum_{i=1}^n z_i = 1$.
Then, the policy function has the form of

$\pi(a|s,\theta) = \frac{e^{h(s,a,\theta)}}{\sum_{a' \in \mathcal{A}} e^{h(s,a',\theta)}},$
- where $h(s,a,\theta)$ is another function.
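
As a concrete instance, the sketch below builds a softmax policy from a linear preference function $h(s,a,\theta) = \theta^T \phi(s,a)$. The linear form, the feature vectors in `features_s`, and the helper name `policy` are illustrative assumptions; the notes only require that $h$ be some function of $(s, a, \theta)$, such as a neural network.

```python
import math


def softmax(x):
    m = max(x)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - m) for v in x]
    z = sum(exps)
    return [e / z for e in exps]


def policy(theta, features_s):
    """Return pi(.|s, theta) for one state.

    features_s[a] is the (hypothetical) feature vector phi(s, a), and the
    preference is the linear function h(s, a, theta) = theta . phi(s, a).
    """
    h = [sum(t * f for t, f in zip(theta, phi)) for phi in features_s]
    return softmax(h)


theta = [0.5, -1.0]
features_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three actions in state s
probs = policy(theta, features_s)
print(probs)
```

Here the preferences are $h = [0.5, -1.0, -0.5]$, so the first action gets the highest probability while the other two remain strictly positive, which is what keeps $\ln \pi$ well defined.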

Further explanation of the policy gradient derivation

General form: the gradient expression

Core formula:

$\nabla_\theta J(\theta) = \sum_{s \in \mathcal S} \eta(s) \sum_{a \in \mathcal A} \nabla_\theta \pi(a|s,\theta) q_\pi(s,a)$

$J(\theta)$ is the optimization objective, which can be:

$\bar v_\pi$ (average state value)
$\bar r_\pi$ (average reward)
$\bar v_\pi^0$ (value of the starting state)

$\eta(s)$ is a state distribution or weight.

$q_\pi(s,a)$ is the action value (Q-value): the expected return after taking action $a$ in state $s$.

The formula says that the update direction of the policy parameters depends on how much each action contributes to future return.

Three common specific results

Average reward:

$\nabla_\theta \bar r_\pi \simeq \sum_s d_\pi(s) \sum_a \nabla_\theta \pi(a|s,\theta) q_\pi(s,a)$

where $d_\pi(s)$ is the stationary distribution under policy $\pi$.

Discounted cumulative return:

$\nabla_\theta \bar v_\pi = \frac{1}{1-\gamma} \nabla_\theta \bar r_\pi$

so the gradients of the average value and the average reward are proportional.

Starting-state value:

$\nabla_\theta \bar v_\pi^0 = \sum_{s \in \mathcal S} \rho_\pi(s) \sum_{a \in \mathcal A} \nabla_\theta \pi(a|s,\theta) q_\pi(s,a)$

Here $\rho_\pi(s)$ reflects how likely state $s$ is to be reached, over any number of steps under policy $\pi$, when starting from the initial state distribution.

All three have the same structure and differ only in the weighting distribution.

Compact form (likelihood-ratio trick)

Using the log-derivative identity

$\nabla_\theta \pi(a|s,\theta) = \pi(a|s,\theta) \nabla_\theta \ln \pi(a|s,\theta)$

the formula becomes

$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta, A \sim \pi} \big[ \nabla_\theta \ln \pi(A|S,\theta) q_\pi(S,A) \big]$

Benefit: we work with the gradient of $\ln \pi$ rather than of $\pi$ directly, which is usually more convenient and numerically better behaved.

More importantly, the expectation can be approximated by sampling:

$\nabla_\theta J \approx \nabla_\theta \ln \pi(a|s,\theta) q_\pi(s,a)$

This is what lets policy gradient methods work from sampled experience.

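Why softmax is needed

Because $\ln \pi(a|s,\theta)$ is used, we must guarantee $\pi(a|s,\theta) > 0$.

Softmax guarantees this naturally:

any real-valued inputs $h(s,a,\theta)$ are mapped to values in $(0, 1)$ that sum to $1$.

$\pi(a|s,\theta) = \frac{e^{h(s,a,\theta)}}{\sum_{a' \in \mathcal A} e^{h(s,a',\theta)}}$

Properties of softmax:

Every action probability lies in $(0,1)$
The action probabilities sum to $1$

So the policy is a valid distribution and assigns nonzero probability to every action (which supports exploration).

Intuition
Goal: maximize the expected return $J(\theta)$.
Method: update the parameter $\theta$ by gradient ascent.
Key techniques:
Likelihood-ratio trick ($\nabla_\theta \pi = \pi \nabla_\theta \ln \pi$), which moves the derivative onto the log.
Softmax, which keeps the probabilities valid and makes the gradient of $\ln \pi$ easy to compute.

In essence, the update direction is given by $\nabla_\theta \ln \pi(a|s,\theta)$, and the update is scaled by $q_\pi(s,a)$ (how good the action is).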

Gradient-ascent algorithm (REINFORCE)

Introduction

The gradient-ascent algorithm for maximizing $J(\theta)$ is

$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$

$= \theta_t + \alpha \mathbb{E}\Big[ \nabla_\theta \ln \pi(A|S, \theta_t) q_\pi(S, A) \Big]$
The true gradient can be replaced by a stochastic one:

$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) q_\pi(s_t, a_t)$
Furthermore, since $q_\pi$ is unknown, it can be approximated:

$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) q_t(s_t, a_t)$
There are different methods to approximate $q_\pi(s_t, a_t)$, for example Monte Carlo estimation, which gives REINFORCE.

Intuitive explanation

Basic idea of the gradient-ascent algorithm

The goal is to maximize a performance metric (objective function) $J(\theta)$, such as the expected return or the average reward.

Parameters are updated by gradient ascent:

$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t),$

where $\alpha$ is the learning rate.

The policy gradient theorem states that

$\nabla_\theta J(\theta) = \mathbb{E}\Big[\nabla_\theta \ln \pi(A|S,\theta) q_\pi(S,A)\Big],$

which means the gradient can be approximated from samples together with an estimate of $q_\pi$.

Stochastic approximation & REINFORCE

Because the expectation $\mathbb{E}[\cdot]$ is hard to compute exactly, it is replaced by a random sample:

$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t) q_\pi(s_t,a_t).$

The problem is that $q_\pi(s_t,a_t)$ is itself unknown, so an approximation $q_t(s_t,a_t)$ is introduced.

If $q_t(s_t,a_t)$ is a Monte Carlo estimate of the future return (the actual return accumulated from that state-action pair until the episode ends), the update is the classic REINFORCE algorithm.

Remark

Remark 1: How to do sampling?

$\mathbb{E}_{S \sim d, A \sim \pi} \Big[ \nabla_\theta \ln \pi(A|S, \theta_t) q_\pi(S,A) \Big] \;\;\longrightarrow\;\; \nabla_\theta \ln \pi(a|s,\theta_t) q_\pi(s,a)$
- How to sample $S$?
  - $S \sim d$, where the distribution $d$ is the long-run behavior under $\pi$.
- How to sample $A$?
  - $A \sim \pi(A|S,\theta)$. Hence, $a_t$ should be sampled following $\pi(\theta_t)$ at $s_t$.
  - Therefore, the policy gradient method is on-policy.
Remark 2: How to interpret this algorithm?
- Since
  
  $\nabla_\theta \ln \pi(a_t|s_t, \theta_t) = \frac{\nabla_\theta \pi(a_t|s_t,\theta_t)}{\pi(a_t|s_t,\theta_t)},$
- the algorithm can be rewritten as
  
  $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) q_t(s_t,a_t)$
  
  $= \theta_t + \alpha \Bigg( \frac{q_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)} \Bigg) \nabla_\theta \pi(a_t|s_t,\theta_t).$
- Therefore, we have the important expression of the algorithm:
  
  $\theta_{t+1} = \theta_t + \alpha \beta_t \nabla_\theta \pi(a_t|s_t,\theta_t),$
  - where $\beta_t = \dfrac{q_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)}$.

Remark

Remark 1: What sampling means

$S \sim d$: states come from the long-run distribution under policy $\pi$.
$A \sim \pi(\cdot|S,\theta)$: actions are sampled from the current policy $\pi$.

Policy gradient is therefore an on-policy method: the data must be generated by the current policy.

Remark 2: Interpreting the update rule

A key identity is

$\nabla_\theta \ln \pi(a_t|s_t,\theta_t) = \frac{\nabla_\theta \pi(a_t|s_t,\theta_t)}{\pi(a_t|s_t,\theta_t)}.$

so the update can be written as

$\theta_{t+1} = \theta_t + \alpha \beta_t \nabla_\theta \pi(a_t|s_t,\theta_t),$

where

$\beta_t = \frac{q_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)}.$

Gradient-ascent interpretation

It is a gradient-ascent algorithm for maximizing $\pi(a_t|s_t,\theta)$:

$\theta_{t+1} = \theta_t + \alpha \beta_t \nabla_\theta \pi(a_t|s_t,\theta_t).$
Intuition: When $|\alpha \beta_t|$ is sufficiently small:
- If $\beta_t > 0$, the probability of choosing $a_t$ at $s_t$ is enhanced:
  
  $\pi(a_t|s_t,\theta_{t+1}) > \pi(a_t|s_t,\theta_t).$
- If $\beta_t < 0$, then
  
  $\pi(a_t|s_t,\theta_{t+1}) < \pi(a_t|s_t,\theta_t).$
Math: When $\theta_{t+1} - \theta_t$ is sufficiently small, we have

$\pi(a_t|s_t,\theta_{t+1}) \approx \pi(a_t|s_t,\theta_t) + \big( \nabla_\theta \pi(a_t|s_t,\theta_t) \big)^T (\theta_{t+1} - \theta_t)$

$= \pi(a_t|s_t,\theta_t) + \alpha \beta_t \big( \nabla_\theta \pi(a_t|s_t,\theta_t) \big)^T \big( \nabla_\theta \pi(a_t|s_t,\theta_t) \big)$

$= \pi(a_t|s_t,\theta_t) + \alpha \beta_t \big\| \nabla_\theta \pi(a_t|s_t,\theta_t) \big\|^2.$
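
The sign argument can be illustrated with one update step. The sketch assumes a tabular softmax policy over three actions in a single state, with the preferences themselves as parameters, so that $\nabla_\theta \pi(a) = \pi(a)(e_a - \pi)$; the step size `alpha = 0.1` and the values `q_t = +1` and `q_t = -1` are arbitrary illustrative choices.

```python
import math


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    z = sum(exps)
    return [e / z for e in exps]


def grad_pi(theta, a):
    """Gradient of pi(a) with respect to softmax preferences: pi(a) * (e_a - pi)."""
    pi = softmax(theta)
    return [pi[a] * ((1.0 if i == a else 0.0) - pi[i]) for i in range(len(theta))]


def step(theta, a, q_t, alpha=0.1):
    """One update theta + alpha * beta_t * grad pi(a), with beta_t = q_t / pi(a)."""
    beta_t = q_t / softmax(theta)[a]
    return [th + alpha * beta_t * g for th, g in zip(theta, grad_pi(theta, a))]


theta = [0.0, 0.0, 0.0]  # uniform policy over three actions
a_t = 1
before = softmax(theta)[a_t]
after_pos = softmax(step(theta, a_t, q_t=+1.0))[a_t]  # beta_t > 0
after_neg = softmax(step(theta, a_t, q_t=-1.0))[a_t]  # beta_t < 0
print(before, after_pos, after_neg)
```

With a positive `q_t` the probability of `a_t` rises above its initial value of $1/3$, and with a negative `q_t` it falls below it, matching the first-order expansion.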

Coefficient $\beta_t$

The coefficient $\beta_t$ balances exploration and exploitation.

$\theta_{t+1} = \theta_t + \alpha \left( \frac{q_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)} \right) \nabla_\theta \pi(a_t|s_t,\theta_t)$

First, $\beta_t$ is proportional to $q_t(s_t,a_t)$.
- If $q_t(s_t,a_t)$ is large, then $\beta_t$ is large.
- Therefore, the algorithm tends to enhance actions with greater values.
Second, $\beta_t$ is inversely proportional to $\pi(a_t|s_t,\theta_t)$.
- If $\pi(a_t|s_t,\theta_t)$ is small, then $\beta_t$ is large.
- Therefore, the algorithm tends to explore actions that have low probabilities.

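The role of $\beta_t$: balancing exploration and exploitation

Proportional to the Q-value: if $q_t(s_t,a_t)$ is large, the action has a high return, so $\beta_t$ is large and the update is stronger. → Exploitation.
Inversely proportional to the action probability: if $\pi(a_t|s_t,\theta_t)$ is small, $\beta_t$ is large, so the update tends to raise low-probability actions that may still be valuable. → Exploration.

Hence $\beta_t$ both promotes high-value actions and encourages trying low-probability actions, which gives a natural exploration-exploitation balance.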

REINFORCE

Recall that

$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t) q_\pi(s_t,a_t)$
is replaced by

$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t) q_t(s_t,a_t)$
- where $q_t(s_t,a_t)$ is an approximation of $q_\pi(s_t,a_t)$.
If $q_\pi(s_t,a_t)$ is approximated by Monte Carlo estimation, the algorithm has a specific name: REINFORCE.
REINFORCE is one of the earliest and simplest policy gradient algorithms.

Pseudocode: Policy Gradient by Monte Carlo (REINFORCE)

Initialization: A parameterized function $\pi(a|s,\theta)$, $\gamma \in (0,1)$, and $\alpha > 0$.
Aim: Search for an optimal policy maximizing $J(\theta)$.
For the $k$-th iteration, do
- Select $s_0$ and generate an episode following $\pi(\theta_k)$.
- Suppose the episode is $\{s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T\}$.
- For $t = 0,1,\ldots,T-1$, do
  - Value update:
    
    $q_t(s_t,a_t) = \sum_{k=t+1}^T \gamma^{k-t-1} r_k$
  - Policy update:
    
    $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t) q_t(s_t,a_t)$
- $\theta_k = \theta_T$
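
The pseudocode can be turned into a short runnable sketch. Everything environment-specific here is an assumption for illustration and not part of the course: a 3-state corridor with a goal at the right end, a reward of 1 on reaching the goal, a cap of 20 steps per episode, a tabular softmax policy with $h(s,a,\theta) = \theta_{s,a}$, and the values `GAMMA = 0.5` and `ALPHA = 0.1`.

```python
import math
import random

GAMMA, ALPHA = 0.5, 0.1
N_STATES, GOAL, MAX_STEPS = 3, 2, 20  # states 0 and 1 are non-terminal, 2 is the goal
LEFT, RIGHT = 0, 1


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    z = sum(exps)
    return [e / z for e in exps]


def generate_episode(theta, rng):
    """Roll out pi(theta) from s_0 = 0; rewards[t] holds r_{t+1}."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(MAX_STEPS):
        a = RIGHT if rng.random() < softmax(theta[s])[RIGHT] else LEFT
        s_next = min(s + 1, GOAL) if a == RIGHT else max(s - 1, 0)
        states.append(s)
        actions.append(a)
        rewards.append(1.0 if s_next == GOAL else 0.0)
        s = s_next
        if s == GOAL:
            break
    return states, actions, rewards


def discounted_returns(rewards, gamma):
    """q_t = sum_{k=t+1}^{T} gamma^(k-t-1) r_k, accumulated backwards in O(T)."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def reinforce(num_episodes=2000, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES - 1)]  # preferences theta[s][a]
    for _ in range(num_episodes):
        states, actions, rewards = generate_episode(theta, rng)
        q = discounted_returns(rewards, GAMMA)  # value update (Monte Carlo return)
        for s, a, q_t in zip(states, actions, q):  # policy update, step by step
            pi = softmax(theta[s])
            for b in (LEFT, RIGHT):
                grad_log = (1.0 if b == a else 0.0) - pi[b]  # d ln pi(a|s) / d theta[s][b]
                theta[s][b] += ALPHA * grad_log * q_t
    return theta


theta = reinforce()
pi_right = [softmax(th)[RIGHT] for th in theta]
print(pi_right)  # probability of moving right in states 0 and 1
```

As in the pseudocode, the parameters are updated after every step of the episode, and the last parameter of one episode is the starting parameter of the next. In this toy corridor, moving right has the larger action value in both states, so one would expect the entries of `pi_right` to grow toward 1 over training.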

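Line-by-line explanation of the pseudocode

Initialization

$\pi(a|s,\theta)$
A parameterized policy function: given state $s$, it outputs a probability distribution over actions $a$.
The parameter $\theta$ holds neural-network weights or other learnable parameters.

$\gamma \in (0,1)$
The discount factor, which balances short-term and long-term rewards.

$\alpha > 0$
The learning rate, which controls the step size of each parameter update.

Aim: find the optimal policy parameter $\theta^*$ that maximizes the performance objective $J(\theta)$.

For the $k$-th iteration

In the $k$-th iteration, repeat the following steps:

Generate a trajectory (episode)

Starting from the initial state $s_0$, use the current policy $\pi(\theta_k)$ to generate a complete sequence:

$\{s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T\}$

where $T$ is the length of the episode.

Purpose: collect experience under the current policy.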

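For $t=0,1,\ldots,T-1$

For each step $(s_t,a_t)$ of the trajectory, perform the following updates:

Value update

$q_t(s_t,a_t) = \sum_{k=t+1}^T \gamma^{k-t-1} r_k$

This is the Monte Carlo return: the discounted cumulative reward obtained after taking action $a_t$ at time $t$.
That is:
$r_{t+1}$ is the immediate reward;
$\gamma r_{t+2}$ is the discounted reward one step later;
and so on.

Purpose: give each state-action pair $(s_t,a_t)$ an estimated value.

Policy update

$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t) q_t(s_t,a_t)$

$\nabla_\theta \ln \pi(a_t|s_t,\theta_t)$

is the log-likelihood gradient; it gives the direction in which to change $\theta$ so that the probability of action $a_t$ in $s_t$ increases.

$q_t(s_t,a_t)$

is the weight that tells the algorithm how good the action was:

If $q_t(s_t,a_t)$ is large and positive → the action led to a high return → its probability is increased strongly;
if $q_t(s_t,a_t)$ is positive but small → its probability is increased only slightly;
if $q_t(s_t,a_t)$ is negative → the action was poor → its probability is decreased.

Purpose: update the policy parameters so that high-return actions become more likely.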

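Parameter update complete

After all updates for $t$ within one episode, the new parameter $\theta_T$ is obtained.

Final assignment:

$\theta_k = \theta_T$

which denotes the final parameter of the $k$-th iteration.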

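Summary

The REINFORCE algorithm described by this pseudocode essentially:

Samples a complete trajectory with the current policy.
Computes the discounted cumulative return at every step.
Uses the return as a weight to adjust the policy parameters, so that:
the probability of high-return actions ↑
the probability of low-return actions ↓ (relative to the others)

This combines Monte Carlo estimation with the policy gradient, and it is the most basic policy gradient method.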