RL【7-1】:Temporal-difference Learning
Series Overview
Fundamental Tools
RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation
Algorithm
RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent
Method
RL【7-1】:Temporal-difference Learning
RL【7-2】:Temporal-difference Learning
Contents
- Series Overview
- Fundamental Tools
- Algorithm
- Method
- Preface
- Stochastic Algorithms
- TD Learning of State Values
- Sarsa
  - Base Sarsa
  - Expected Sarsa
  - n-step Sarsa
- Summary
Preface
This series records my study notes for Prof. Shiyu Zhao's (赵世钰) course "Mathematical Foundations of Reinforcement Learning" on Bilibili. For the course itself, see:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Stochastic Algorithms
First: a simple mean estimation problem
Calculate $w = \mathbb{E}[X]$ based on some i.i.d. samples $\{x\}$ of $X$.
- By writing $g(w) = w - \mathbb{E}[X]$, we can reformulate the problem as a root-finding problem: $g(w) = 0$.
- Since we can only obtain samples $\{x\}$ of $X$, the noisy observation is
$$\tilde{g}(w,\eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta.$$
- Then, according to the RM algorithm, solving $g(w) = 0$ gives
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k (w_k - x_k).$$
Problem setup
We want to compute the expectation of a random variable $X$:
$$w = \mathbb{E}[X],$$
but we cannot obtain $\mathbb{E}[X]$ directly; we only have a set of i.i.d. samples $\{x\}$ of $X$.
Reformulation as root finding
We rewrite the problem as
$$g(w) = w - \mathbb{E}[X] = 0.$$
That is, if we can find the $w$ satisfying $g(w) = 0$, that solution is exactly $\mathbb{E}[X]$.
Noisy observation
Since we can only observe samples $x$, what we actually obtain is not $g(w)$ but a noisy observation:
$$\tilde{g}(w,\eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta,$$
where $\eta = \mathbb{E}[X] - x$ is the noise, whose expectation is $0$.
RM update rule
The Robbins–Monro algorithm updates $w$ iteratively so that it converges to $\mathbb{E}[X]$. The update rule is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k),$$
and substituting the noisy observation above gives
$$w_{k+1} = w_k - \alpha_k (w_k - x_k).$$
Intuition (a numerical sketch follows this list)
- The current estimate is $w_k$.
- We receive a new sample $x_k$.
- The difference $w_k - x_k$ corrects the estimate: if $w_k$ is larger than the sample $x_k$, the update pushes it down; otherwise it pushes it up.
- $\alpha_k$ is the step size, usually decreasing with the iteration count (e.g., $\alpha_k = 1/k$) to guarantee convergence.
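To make the iteration concrete, here is a minimal sketch in Python/NumPy. The distribution of $X$, the initial guess, and the step size $\alpha_k = 1/k$ are illustrative assumptions, not part of the notes:

```python
import numpy as np

# Robbins-Monro iteration w_{k+1} = w_k - alpha_k * (w_k - x_k) for w = E[X].
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=10_000)  # i.i.d. samples of X with E[X] = 2

w = 0.0                          # initial guess
for k, x in enumerate(samples, start=1):
    alpha = 1.0 / k              # step size satisfying the RM conditions
    w -= alpha * (w - x)         # with alpha_k = 1/k this is exactly the running sample mean

print(w)                         # close to E[X] = 2
```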
Second: a more complex problem
Estimate the mean of a function $v(X)$: $w = \mathbb{E}[v(X)]$, based on some i.i.d. random samples $\{x\}$ of $X$.
- To solve this problem, we define
$$g(w) = w - \mathbb{E}[v(X)],$$
$$\tilde{g}(w,\eta) = w - v(x) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta.$$
- Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k [w_k - v(x_k)].$$
Problem setup
Instead of estimating the mean of the random variable $X$ itself, we now want to estimate the expectation of a function $v(X)$:
$$w = \mathbb{E}[v(X)],$$
where $v(\cdot)$ is a known function and $X$ is a random variable. We can obtain i.i.d. samples $\{x\}$ of $X$, but not $\mathbb{E}[v(X)]$ directly.
Reformulation as root finding
As in the mean estimation problem, we rewrite the goal as a root-finding problem:
$$g(w) = w - \mathbb{E}[v(X)] = 0.$$
Noisy observation
We cannot compute $\mathbb{E}[v(X)]$ directly, but we can observe it through samples $x$. Hence, define a noisy observation:
$$\tilde{g}(w, \eta) = w - v(x).$$
Expanding it,
$$\tilde{g}(w, \eta) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta,$$
where $\eta = \mathbb{E}[v(X)] - v(x)$ is zero-mean noise.
RM update rule
The Robbins–Monro algorithm updates $w$ iteratively so that it converges to $\mathbb{E}[v(X)]$:
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k).$$
Substituting $\tilde{g}(w,\eta)$ gives
$$w_{k+1} = w_k - \alpha_k [w_k - v(x_k)].$$
Intuition
- The current estimate is $w_k$.
- We draw a new sample $x_k$ and evaluate $v(x_k)$.
- The update gradually pulls $w_k$ toward $v(x_k)$.
- After many iterations, $w_k$ converges to the average of the sampled values, i.e., $\mathbb{E}[v(X)]$.
Third: an even more complex problem
Calculate $w = \mathbb{E}[R + \gamma v(X)]$, where $R, X$ are random variables, $\gamma$ is a constant, and $v(\cdot)$ is a function.
- Suppose we can obtain samples $\{x\}$ and $\{r\}$ of $X$ and $R$. We define
$$g(w) = w - \mathbb{E}[R + \gamma v(X)],$$
$$\tilde{g}(w,\eta) = w - [r + \gamma v(x)] = (w - \mathbb{E}[R + \gamma v(X)]) + (\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]) \doteq g(w) + \eta.$$
- Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k \big[w_k - \big(r_k + \gamma v(x_k)\big)\big].$$
Problem setup
We want to estimate the expectation
$$w = \mathbb{E}[R + \gamma v(X)],$$
where:
- $R, X$ are random variables;
- $\gamma$ is a constant;
- $v(\cdot)$ is a function.
In other words, the target jointly accounts for the random reward $R$ and the function value $v(X)$, weighted by $\gamma$.
Reformulation as root finding
Rewrite the problem as solving $g(w) = 0$:
$$g(w) = w - \mathbb{E}[R + \gamma v(X)].$$
Clearly, the solution is $w^\star = \mathbb{E}[R + \gamma v(X)]$.
Noisy observation
Since we cannot obtain $\mathbb{E}[R + \gamma v(X)]$ directly, we observe it through samples $(r, x)$:
$$\tilde{g}(w, \eta) = w - [r + \gamma v(x)].$$
Expanding it,
$$\tilde{g}(w, \eta) = (w - \mathbb{E}[R + \gamma v(X)]) + \big(\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]\big),$$
which can be written as
$$\tilde{g}(w, \eta) \doteq g(w) + \eta,$$
where $\eta$ is a zero-mean noise term.
RM iteration
The Robbins–Monro algorithm updates $w$ recursively, gradually approaching the solution:
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k).$$
Substituting the noisy observation gives
$$w_{k+1} = w_k - \alpha_k \Big(w_k - \big(r_k + \gamma v(x_k)\big)\Big).$$
Intuition (a sketch follows this list)
- The current estimate is $w_k$;
- the sample $(r_k, x_k)$ gives an approximate target value $r_k + \gamma v(x_k)$;
- the update moves $w_k$ toward this approximate target;
- as the number of iterations grows, $w_k$ converges to $\mathbb{E}[R + \gamma v(X)]$.
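A minimal sketch of the same iteration with the target $\mathbb{E}[R + \gamma v(X)]$; the distributions of $R$ and $X$, the choice $v(x) = x^2$, and $\gamma = 0.9$ are all illustrative assumptions:

```python
import numpy as np

# RM iteration toward w = E[R + gamma * v(X)], using one sample (r, x) per step.
rng = np.random.default_rng(0)
gamma = 0.9
v = lambda x: x ** 2

w = 0.0
for k in range(1, 10_001):
    x = rng.uniform(0.0, 1.0)                 # sample of X
    r = rng.normal(1.0, 0.5)                  # sample of R
    alpha = 1.0 / k
    w -= alpha * (w - (r + gamma * v(x)))     # pull w toward the sampled target r + gamma * v(x)

print(w)   # approx. E[R] + gamma * E[X^2] = 1 + 0.9 / 3 ≈ 1.3
```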
TD Learning of State Values
Algorithm description
- The data/experience required by the algorithm:
  - $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$ or $\{(s_t, r_{t+1}, s_{t+1})\}_t$, generated by following the given policy $\pi$.
- The TD learning algorithm is
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \Big] \quad (1)$$
$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \quad (2)$$
- where $t = 0,1,2,\ldots$. Here, $v_t(s_t)$ is the estimate of $v_\pi(s_t)$; $\alpha_t(s_t)$ is the learning rate of $s_t$ at time $t$.
- At time $t$, only the value of the visited state $s_t$ is updated, whereas the values of the unvisited states $s \neq s_t$ remain unchanged.
Background
In reinforcement learning, we want to estimate the state-value function of a policy $\pi$:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s, \pi],$$
- where $G_t$ is the discounted return obtained when starting from state $s$.
However, we usually do not have a model of the environment (transition probabilities / reward distribution), so we cannot solve the Bellman equation directly; we can only update the estimates from sampled trajectories.
Required data
- The algorithm only needs trajectories sampled under the policy $\pi$:
  - a full trajectory $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$, or
  - a set of transition triples $\{(s_t, r_{t+1}, s_{t+1})\}_t$.
- This means TD learning can learn online: a single step of experience (state, reward, next state) is enough to update.
TD update rule
Update the visited state:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \Big] \quad (1)$$
Leave the unvisited states unchanged:
$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \quad (2)$$
Interpretation (a code sketch follows this block)
The update rule says that the value estimate of the current state $s_t$ is pulled toward the TD target:
$$\text{TD target} = r_{t+1} + \gamma v_t(s_{t+1}),$$
- i.e., "one-step reward + discounted value of the next state".
The size of the update is governed by the TD error:
$$\delta_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big).$$
- It measures the mismatch between the prediction and the one-step observation.
The learning rate $\alpha_t(s_t)$ determines the update magnitude:
- a larger $\alpha$ gives faster but less stable updates;
- a smaller $\alpha$ gives slower but more stable updates.
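A minimal tabular TD(0) sketch of updates (1)-(2). The toy environment (5 states, random transitions, reward only on termination) and the constant learning rate are illustrative assumptions; any stream of samples $(s_t, r_{t+1}, s_{t+1})$ generated under $\pi$ would work:

```python
import numpy as np

# Tabular TD(0): only the visited state is updated, all others keep their values.
n_states, gamma, alpha = 5, 0.9, 0.1
TERMINAL = n_states                        # dummy terminal state with value fixed at 0
V = np.zeros(n_states + 1)
rng = np.random.default_rng(0)

for episode in range(1_000):
    s = int(rng.integers(n_states))        # start from a random non-terminal state
    while s != TERMINAL:
        s_next = int(rng.integers(n_states + 1))   # sampled next state under pi (toy dynamics)
        r = 1.0 if s_next == TERMINAL else 0.0     # reward only on termination
        td_target = r + gamma * V[s_next]          # TD target  r_{t+1} + gamma * v_t(s_{t+1})
        td_error = V[s] - td_target                # TD error   delta_t = v_t(s_t) - TD target
        V[s] -= alpha * td_error                   # update (1) for s_t; all other states keep (2)
        s = s_next
```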
Algorithm properties
- The TD algorithm can be annotated as
$$v_{t+1}(s_t) = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t)\Big[\underbrace{v_t(s_t) - \big[\underbrace{r_{t+1} + \gamma v_t(s_{t+1})}_{\text{TD target } \bar v_t}\big]}_{\text{TD error } \delta_t}\Big]. \quad (3)$$
- Here,
$$\bar v_t \doteq r_{t+1} + \gamma v(s_{t+1})$$
  - is called the TD target.
$$\delta_t \doteq v(s_t) - [r_{t+1} + \gamma v(s_{t+1})] = v(s_t) - \bar v_t$$
  - is called the TD error.
- It is clear that the new estimate $v_{t+1}(s_t)$ is a combination of the current estimate $v_t(s_t)$ and the TD error.
- First, why is $\bar v_t$ called the TD target?
  - That is because the algorithm drives $v(s_t)$ towards $\bar v_t$.
  - To see that,
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)[v_t(s_t) - \bar v_t]$$
$$\implies v_{t+1}(s_t) - \bar v_t = v_t(s_t) - \bar v_t - \alpha_t(s_t)[v_t(s_t) - \bar v_t]$$
$$\implies v_{t+1}(s_t) - \bar v_t = [1 - \alpha_t(s_t)][v_t(s_t) - \bar v_t]$$
$$\implies |v_{t+1}(s_t) - \bar v_t| = |1 - \alpha_t(s_t)|\,|v_t(s_t) - \bar v_t|$$
  - Since $\alpha_t(s_t)$ is a small positive number, we have
$$0 < 1 - \alpha_t(s_t) < 1.$$
  - Therefore,
$$|v_{t+1}(s_t) - \bar v_t| \le |v_t(s_t) - \bar v_t|,$$
  - which means $v(s_t)$ is driven towards $\bar v_t$!
- Second, what is the interpretation of the TD error?
$$\delta_t = v(s_t) - [r_{t+1} + \gamma v(s_{t+1})]$$
  - It is a difference between two consecutive time steps.
  - It reflects the discrepancy between $v_t$ and $v_\pi$. To see that, denote
$$\delta_{\pi,t} \doteq v_\pi(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})].$$
  - Note that
$$\mathbb{E}[\delta_{\pi,t} \mid S_t = s_t] = v_\pi(s_t) - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t] = 0.$$
  - If $v_t = v_\pi$, then $\delta_t$ should be zero (in the expectation sense).
  - Hence, if $\delta_t$ is not zero, then $v_t$ is not equal to $v_\pi$.
  - The TD error can be interpreted as an innovation, i.e., new information obtained from the experience $(s_t, r_{t+1}, s_{t+1})$.
- Other properties:
  - The TD algorithm in (3) only estimates the state value of a given policy.
    - It does not estimate the action values.
    - It does not search for optimal policies.
  - Later, we will see how to estimate action values and then search for optimal policies.
  - Nonetheless, the TD algorithm in (3) is fundamental for understanding the core idea.
Explanation of TD Algorithm Properties
- Recap of the TD update rule
  The core TD update is
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big].$$
  - It consists of three parts:
    - the current estimate: $v_t(s_t)$
    - the TD target: $\bar v_t = r_{t+1} + \gamma v_t(s_{t+1})$
    - the TD error: $\delta_t = v_t(s_t) - \bar v_t$
  The update therefore boils down to
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\delta_t,$$
  i.e., subtracting from the current estimate a fraction of its gap to the TD target.
- Why is $\bar v_t$ called the TD target?
  Intuition
  - The algorithm aims to make $v(s_t)$ approach $\bar v_t$.
  - Every update shrinks the gap between the two.
  Derivation
$$v_{t+1}(s_t) - \bar v_t = [1 - \alpha_t(s_t)]\,[v_t(s_t) - \bar v_t].$$
  Since the learning rate satisfies $0 < \alpha_t(s_t) < 1$, each iteration gives
$$|v_{t+1}(s_t) - \bar v_t| \le |v_t(s_t) - \bar v_t|.$$
  This shows that the estimate moves step by step toward the TD target.
- Interpretation of the TD error
  Definition:
$$\delta_t = v(s_t) - \big(r_{t+1} + \gamma v(s_{t+1})\big).$$
  Meaning:
  - it is the difference between two consecutive time steps;
  - it reflects the gap between the current estimate $v_t$ and the true value function $v_\pi$.
  Key facts:
  - $\mathbb{E}[\delta_t \mid S_t = s_t] = 0$ for every state exactly when $v_t = v_\pi$;
  - if the TD errors are zero (in expectation, for every state), the estimate is consistent with $v_\pi$;
  - if $\delta_t \neq 0$, the estimate deviates from the true value.
  Intuition:
  - the TD error can be viewed as an innovation, i.e., the new information obtained from each piece of experience;
  - the update uses the TD error to correct the estimate step by step.
Further Explanation: TD Target & TD Error
TD target $\bar v_t$
$$\bar v_t \doteq r_{t+1} + \gamma v(s_{t+1})$$
- Intuitive meaning:
  - It is an estimate of the return from the next moment on, built from the current reward $r_{t+1}$ and the value of the next state $v(s_{t+1})$.
  - It can be viewed as a one-step lookahead: starting from $s_t$, take one step, collect the immediate reward, and add the estimated value of the next state.
- Analogy:
  - If $v(s_t)$ is our forecast of a house price, then $\bar v_t$ is the corrected target built from "the latest transaction price plus the expected future trend".
  - Each update nudges the forecast toward this corrected target.
- Why is it a "target"?
  In the Bellman equation,
$$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s],$$
  - the TD target $\bar v_t$ is exactly a sample approximation of the right-hand side.
  So the TD target is a local, sample-based realization of the Bellman equation.
TD error $\delta_t$
$$\delta_t = v(s_t) - \big(r_{t+1} + \gamma v(s_{t+1})\big) = v(s_t) - \bar v_t$$
- Intuitive meaning:
  - It is the gap between the current estimate and the TD target.
  - If the gap is zero, the prediction matches the target; otherwise, its magnitude and sign tell us how much and in which direction to update.
- Interpretation 1: prediction error
  - $\delta_t$ asks: "How far is my prediction $v(s_t)$ from the prediction $\bar v_t$ corrected by the actually observed reward?"
- Interpretation 2: learning signal
  $\delta_t$ is what drives the update:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\,\delta_t$$
  - If $\delta_t > 0$: the prediction is too high and should decrease;
  - if $\delta_t < 0$: the prediction is too low and should increase.
- Interpretation 3: innovation
  - In statistics, an innovation is the difference between new information and the existing prediction.
  - In TD learning, $\delta_t$ is exactly the new information obtained from experience; it measures the difference between "what we learned" and "what we believed".
Relationship between TD target and TD error (a numerical sketch follows this list)
- The TD target provides the learning target;
- the TD error measures how far the current prediction is from that target;
- the update rule is
$$\text{New Estimate} = \text{Old Estimate} - \text{Learning Rate} \times \text{TD Error}.$$
- In short:
  - TD target = "where I should move toward"
  - TD error = "how far I currently am from the target"
  - TD update = "take a small step toward the target"
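A one-update numerical sketch with made-up values, showing the three quantities and the gap to the TD target shrinking by the factor $1 - \alpha$ derived above:

```python
# One TD update with illustrative numbers (alpha, gamma, values, reward are all made up).
alpha, gamma = 0.1, 0.9
v_s, v_s_next, r = 5.0, 4.0, 1.0            # current estimates and an observed reward

td_target = r + gamma * v_s_next            # "where I should move toward"          -> 4.6
td_error = v_s - td_target                  # "how far I am from the target"        -> 0.4
v_s_new = v_s - alpha * td_error            # "take a small step toward the target" -> 4.96

print(abs(v_s_new - td_target))             # 0.36 = (1 - alpha) * |v_s - td_target|
```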
The idea of the algorithm
- First, a new expression of the Bellman equation
  - The definition of the state value of $\pi$ is
$$v_\pi(s) = \mathbb{E}[R + \gamma G \mid S = s], \quad s \in \mathcal{S}, \quad (4)$$
  - where $G$ is the discounted return. Since
$$\mathbb{E}[G \mid S = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s,a)\, v_\pi(s') = \mathbb{E}[v_\pi(S') \mid S = s],$$
  - where $S'$ is the next state, we can rewrite (4) as
$$v_\pi(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid S = s], \quad s \in \mathcal{S}. \quad (5)$$
  - Equation (5) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation, an important tool for designing and analyzing TD algorithms.
  This shows that the value of the current state can be expressed as "one-step reward + the expected value of the next state".
- Second, solve the Bellman equation in (5) using the RM algorithm
  - In particular, by defining
$$g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s],$$
  - we can rewrite (5) as
$$g(v(s)) = 0.$$
  - Since we can only obtain the samples $r$ and $s'$ of $R$ and $S'$, the noisy observation we have is
$$\tilde g(v(s)) = v(s) - [r + \gamma v_\pi(s')] = \underbrace{\big(v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s]\big)}_{g(v(s))} + \underbrace{\big(\mathbb{E}[R + \gamma v_\pi(S') \mid s] - [r + \gamma v_\pi(s')]\big)}_{\eta}.$$
Solving the Bellman equation with the RM algorithm
- The Bellman equation is, formally, a fixed-point equation:
$$v(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid s].$$
- We can rewrite it as a root-finding problem:
$$g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s] = 0.$$
- Why rewrite it as $g(v(s)) = 0$?
  - So that we can apply stochastic approximation (the Robbins–Monro algorithm).
  - RM is designed precisely for solving root-finding problems under noisy observations.
- The RM approach
  - We cannot compute $\mathbb{E}[R + \gamma v_\pi(S') \mid s]$ directly; we can only approximate it from samples $(r, s')$. We therefore define
$$\tilde g(v(s)) = v(s) - [r + \gamma v_\pi(s')].$$
  - The RM update is
$$v_{k+1}(s) = v_k(s) - \alpha_k \tilde g(v_k(s)) = v_k(s) - \alpha_k \Big( v_k(s) - [r_k + \gamma v_\pi(s'_k)] \Big).$$
  - What this step does:
    - use the sample $(r_k, s'_k)$ to construct the approximate "gradient" $\tilde g$;
    - keep updating $v_k(s)$ so that it gradually approaches the solution of the Bellman equation.
- Therefore, the RM algorithm for solving $g(v(s)) = 0$ is
$$v_{k+1}(s) = v_k(s) - \alpha_k \tilde g(v_k(s)) = v_k(s) - \alpha_k \Big( v_k(s) - [r_k + \gamma v_\pi(s'_k)] \Big), \quad k = 1,2,3,\ldots \quad (6)$$
- where $v_k(s)$ is the estimate of $v_\pi(s)$ at the $k$th step; $r_k, s'_k$ are the samples of $R, S'$ obtained at the $k$th step.
- The RM algorithm in (6) has two assumptions that deserve special attention:
  - We must have the experience set $\{(s, r, s')\}$ for $k = 1,2,3,\ldots$.
  - We assume that $v_\pi(s')$ is already known for any $s'$.
- To remove the two assumptions in the RM algorithm, we can modify it:
  - One modification is that $\{(s,r,s')\}$ is changed to $\{(s_t, r_{t+1}, s_{t+1})\}$ so that the algorithm can utilize the sequential samples in an episode.
  - Another modification is that $v_\pi(s')$ is replaced by an estimate of it, because we don't know it in advance.
Connection to TD learning
- The RM algorithm above has two limitations:
  - it requires knowing $v_\pi(s')$, which we do not actually know;
  - it requires a set of experience samples $\{(s, r, s')\}$ for each state $s$, whereas what we actually collect is the sequential data of an episode.
- To remove these two limitations:
  - we replace the true value $v_\pi(s')$ with its current estimate $v(s')$;
  - we use the sequential samples $(s_t, r_{t+1}, s_{t+1})$ along an episode to update step by step.
- This yields exactly the TD update rule:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big) \Big],$$
- where
  - the TD target is $\bar v_t = r_{t+1} + \gamma v_t(s_{t+1})$,
  - and the TD error is $\delta_t = v_t(s_t) - \bar v_t$.
Intuition
- For every state $s$ along the trajectory, apply the TD update repeatedly until convergence.
- All the states "help each other" and converge together gradually.
Algorithm convergence
- By the TD algorithm (1), $v_t(s)$ converges with probability 1 to $v_\pi(s)$ for all $s \in \mathcal{S}$ as $t \to \infty$, if $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$.
- Remarks:
  - This theorem says that the state values of a given policy $\pi$ can be found by the TD algorithm.
  - $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ must hold for all $s \in \mathcal{S}$. At time step $t$, if $s = s_t$ (i.e., $s$ is visited at time $t$), then $\alpha_t(s) > 0$; otherwise $\alpha_t(s) = 0$ for all other $s \ne s_t$. This requires every state to be visited an infinite (or sufficiently large) number of times.
  - The learning rate $\alpha$ is often selected as a small constant. In this case, the condition $\sum_t \alpha_t^2(s) < \infty$ no longer holds. When $\alpha$ is constant, it can still be shown that the algorithm converges in the expectation sense.
What the theorem says
- Under suitable conditions, TD learning converges to the true state-value function $v_\pi(s)$.
- The conditions are:
  - $\sum_t \alpha_t(s) = \infty$ (the learning rates must be large enough in total, guaranteeing infinitely many effective updates);
  - $\sum_t \alpha_t^2(s) < \infty$ (the learning rates must shrink, so that the updates settle down instead of oscillating).
Interpretation
- The first condition ensures that the TD algorithm keeps absorbing new information and does not stop learning prematurely.
- The second condition ensures that the updates do not oscillate around the convergence point because of overly large step sizes.
Choice of learning rate
- In practice, $\alpha_t(s)$ is usually set to a small constant (e.g., $0.1$).
- Strictly speaking this violates $\sum_t \alpha_t^2(s) < \infty$, but convergence to a neighborhood of $v_\pi(s)$ can still be guaranteed in the expectation sense.
Algorithm properties

| TD/Sarsa learning | MC learning |
|---|---|
| Online: TD learning is online. It can update the state/action values immediately after receiving a reward. | Offline: MC learning is offline. It has to wait until an episode has been completely collected. |
| Continuing tasks: Since TD learning is online, it can handle both episodic and continuing tasks. | Episodic tasks: Since MC learning is offline, it can only handle episodic tasks that have terminal states. |
| Bootstrapping: TD bootstraps because the update of a value relies on the previous estimate of this value. Hence, it requires initial guesses. | Non-bootstrapping: MC does not bootstrap, because it directly estimates state/action values without any initial guess. |
| Low estimation variance: TD has a lower estimation variance than MC because it involves fewer random variables. For instance, Sarsa only requires samples of $R_{t+1}, S_{t+1}, A_{t+1}$. | High estimation variance: To estimate $q_\pi(s_t, a_t)$, we need samples of $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$; for an episode of length $L$, this return involves many random variables. |

TD (temporal-difference) learning and MC (Monte Carlo) learning are both model-free methods, but they differ in important ways.
- Online vs. offline
  - TD updates online: as soon as the reward $r_{t+1}$ is received, $v(s_t)$ can be updated immediately.
  - MC updates offline: it must wait until the whole episode finishes before computing the return and updating.
  - Implication: TD suits real-time learning scenarios, while MC suits settings where complete trajectories are collected first.
- Task types
  - TD can handle both episodic tasks (with terminal states) and continuing tasks (without terminal states).
  - MC can only handle episodic tasks, because it needs the complete return as the update target.
  - Implication: TD is more flexible and fits long-running systems (e.g., an agent learning over an infinite horizon).
- Bootstrapping
  - TD bootstraps: the update relies on current estimates (such as $v(s_{t+1})$).
  - MC does not bootstrap: it updates directly with the complete return $G_t$.
  - Implication: TD updates sooner because it does not wait for the whole episode, but it depends on the initial values.
- Estimation variance (a code sketch follows this list)
  - TD: low variance, because each update depends only on one immediate reward and the estimate of the next state.
  - MC: high variance, because the return involves many random variables, making the estimate less stable (though unbiased).
  - Implication: TD learning usually converges faster and more stably than MC, but it may introduce bias because bootstrapping uses approximate estimates.
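A minimal sketch contrasting the two update styles on a single recorded episode (the episode, $\gamma$, and $\alpha$ are made-up values): TD updates after every step and bootstraps, MC waits for the full return.

```python
# Contrast of TD(0) and MC updates on one recorded episode.
gamma, alpha = 0.9, 0.1
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)]   # (s_t, r_{t+1}, s_{t+1}); state 3 is terminal
V_td = [0.0] * 4
V_mc = [0.0] * 4

# TD: online, updates immediately after each step and bootstraps on V_td[s_next].
for s, r, s_next in episode:
    V_td[s] -= alpha * (V_td[s] - (r + gamma * V_td[s_next]))
# On this first pass TD only moves V_td[2]; earlier states catch up over later episodes.

# MC: offline, waits for the episode to end, then updates each state toward its full return G_t.
G = 0.0
for s, r, _ in reversed(episode):
    G = r + gamma * G                               # accumulate the discounted return backward
    V_mc[s] -= alpha * (V_mc[s] - G)

print(V_td, V_mc)
```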
Sarsa
Base Sarsa
Sarsa algorithm
- First, our aim is to estimate the action values of a given policy $\pi$.
- Suppose we have some experience $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$.
- We can use the following Sarsa algorithm to estimate the action values:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big) \Big],$$
$$q_{t+1}(s,a) = q_t(s,a), \quad \forall (s,a) \neq (s_t, a_t),$$
- where $t = 0,1,2,\ldots$
  - $q_t(s_t,a_t)$ is an estimate of $q_\pi(s_t, a_t)$;
  - $\alpha_t(s_t,a_t)$ is the learning rate depending on $(s_t,a_t)$.
Sarsa update rule
- Current estimate: $q_t(s_t,a_t)$
- TD target: $r_{t+1} + \gamma q_t(s_{t+1},a_{t+1})$
  - $r_{t+1}$: the immediate reward received after taking action $a_t$.
  - $\gamma q_t(s_{t+1}, a_{t+1})$: the estimated long-run return of continuing from $(s_{t+1}, a_{t+1})$.
- TD error: $q_t(s_t,a_t) - \text{TD target}$
At every update, Sarsa pulls $q(s_t,a_t)$ a little closer to the target "one-step reward + estimated value of the next action".
Intuition
- We have an old estimate $q_t(s_t,a_t)$ of how good $(s_t,a_t)$ is.
- We now observe one step of real reward $r_{t+1}$, together with the estimate $q_t(s_{t+1},a_{t+1})$ of the next pair $(s_{t+1},a_{t+1})$.
- This provides a new target (the TD target).
- Sarsa does not replace the estimate outright; it moves $q(s_t,a_t)$ toward this target gradually.
- Relationship between Sarsa and TD
  - Replace $v(s)$ in the TD algorithm with $q(s,a)$ → we obtain Sarsa.
  - Sarsa is the action-value version of TD learning.
- Mathematical expression
  - The Sarsa algorithm solves
$$q_\pi(s,a) = \mathbb{E}[R + \gamma q_\pi(S', A') \mid s,a], \quad \forall s,a.$$
  - This is another expression of the Bellman equation, expressed in terms of action values.
- Theorem (Convergence of Sarsa learning)
  By the Sarsa algorithm, $q_t(s,a)$ converges with probability $1$ to the action value $q_\pi(s,a)$ as $t \to \infty$, for all $(s,a)$, if $\sum_t \alpha_t(s,a) = \infty$ and $\sum_t \alpha_t^2(s,a) < \infty$.
  - Remarks:
    - This theorem says the action values of a given policy $\pi$ can be found by Sarsa.
Pseudocode: Policy searching by Sarsa
- For each episode, do
  - If the current $s_t$ is not the target state, do
    - Collect the experience $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$:
      - Take action $a_t \sim \pi_t(s_t)$.
      - Generate $r_{t+1}, s_{t+1}$.
      - Take action $a_{t+1} \sim \pi_t(s_{t+1})$.
    - Update the q-value:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big( r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}) \big) \Big]$$
    - Update the policy:
$$\pi_{t+1}(a \mid s_t) = 1 - \frac{\epsilon}{|\mathcal{A}|} (|\mathcal{A}| - 1), \quad \text{if } a = \arg\max_a q_{t+1}(s_t, a)$$
$$\pi_{t+1}(a \mid s_t) = \frac{\epsilon}{|\mathcal{A}|}, \quad \text{otherwise}$$
The Sarsa pseudocode boils down to three steps (a code sketch follows):
- Sample an interaction: $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$.
- Update the Q-value: pull it toward "immediate reward + next-step Q".
- Update the policy: adjust the action probabilities according to the latest Q-values.
This realizes an online reinforcement-learning loop that "learns Q-values and improves the policy at the same time".
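A minimal Sarsa loop following the pseudocode above, sketched in Python/NumPy. The toy chain environment `step`, its reward structure, and the hyper-parameters are illustrative assumptions; the $\epsilon$-greedy policy is represented implicitly by always selecting actions from the current Q-values:

```python
import numpy as np

n_states, n_actions = 5, 2
gamma, alpha, eps = 0.9, 0.1, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; reward 1 on reaching the last state."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next, s_next == n_states - 1

def eps_greedy(s):
    """Epsilon-greedy action selection from the current Q-values (the policy-update step)."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

for episode in range(500):
    s = int(rng.integers(n_states - 1))   # random non-terminal start state
    a = eps_greedy(s)
    for _ in range(100):                  # cap episode length to keep the sketch fast
        r, s_next, done = step(s, a)
        a_next = eps_greedy(s_next)       # next action drawn from the current epsilon-greedy policy
        # Sarsa update: pull q(s_t, a_t) toward r_{t+1} + gamma * q_t(s_{t+1}, a_{t+1})
        Q[s, a] -= alpha * (Q[s, a] - (r + gamma * Q[s_next, a_next]))
        s, a = s_next, a_next
        if done:
            break

print(np.argmax(Q, axis=1))               # greedy actions after training (tends toward 1, "move right")
```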
Remarks about Sarsa
- The policy of $s_t$ is updated immediately after $q(s_t,a_t)$ is updated → based on Generalized Policy Iteration (GPI).
- The policy is $\epsilon$-greedy instead of greedy → it balances exploitation and exploration.
Core Idea vs Complication
- Core idea: use an algorithm to solve the Bellman equation of a given policy.
- Complication: emerges when we try to find optimal policies and work efficiently.
Expected Sarsa
Algorithm
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]\big)\Big],$$
$$q_{t+1}(s, a) = q_t(s, a), \quad \forall (s,a) \neq (s_t,a_t),$$
- where
$$\mathbb{E}[q_t(s_{t+1}, A)] = \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$$
- is the expected value of $q_t(s_{t+1}, a)$ under policy $\pi_t$.
Compared to Sarsa:
- The TD target is changed from $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ (as in Sarsa) to $r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]$ (as in Expected Sarsa); see the sketch at the end of this subsection.
- It needs more computation per step. But it is beneficial in the sense that it reduces the estimation variance, because it reduces the random variables in Sarsa from $\{s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\}$ to $\{s_t, a_t, r_{t+1}, s_{t+1}\}$.
What does the algorithm do mathematically?
- Expected Sarsa is a stochastic approximation algorithm for solving the following equation:
$$q_\pi(s,a) = \mathbb{E}\Big[ R_{t+1} + \gamma\, \mathbb{E}_{A_{t+1} \sim \pi(\cdot \mid S_{t+1})}\big[q_\pi(S_{t+1}, A_{t+1})\big] \,\Big|\, S_t=s, A_t=a \Big], \quad \forall s,a.$$
- The above equation is another expression of the Bellman equation:
$$q_\pi(s,a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s, A_t=a].$$
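A small sketch of how the Expected Sarsa target is computed; the q-values, the policy probabilities, and $\gamma$ below are made-up illustrative numbers:

```python
import numpy as np

# Expected Sarsa target: r + gamma * sum_a pi(a|s') * q(s', a).
def expected_sarsa_target(r, q_next_row, pi_probs, gamma=0.9):
    return r + gamma * float(np.dot(pi_probs, q_next_row))

q_next_row = np.array([1.0, 2.0, 0.5])   # q_t(s_{t+1}, a) for each action a
pi_probs   = np.array([0.1, 0.8, 0.1])   # pi_t(a | s_{t+1})
target = expected_sarsa_target(1.0, q_next_row, pi_probs)
print(target)                            # 1 + 0.9 * (0.1*1.0 + 0.8*2.0 + 0.1*0.5) = 2.575
# The update itself keeps the Sarsa form:  Q[s, a] -= alpha * (Q[s, a] - target)
```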
n-step Sarsa
Introduction
- The definition of the action value is
$$q_\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a].$$
- The discounted return $G_t$ can be written in different forms:
  - Sarsa:
$$G_t^{(1)} = R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}),$$
$$G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}),$$
$$\vdots$$
  - n-step Sarsa:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}),$$
$$\vdots$$
  - MC:
$$G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$
- It should be noted that
$$G_t = G_t^{(1)} = G_t^{(2)} = G_t^{(n)} = G_t^{(\infty)},$$
- where the superscripts merely indicate the different decomposition structures of $G_t$.
Algorithm analysis
- Sarsa aims to solve
$$q_\pi(s,a) = \mathbb{E}[G_t^{(1)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s,a].$$
- MC learning aims to solve
$$q_\pi(s,a) = \mathbb{E}[G_t^{(\infty)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s,a].$$
- An intermediate algorithm called n-step Sarsa aims to solve
$$q_\pi(s,a) = \mathbb{E}[G_t^{(n)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s,a].$$
- The algorithm of n-step Sarsa is
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\Big].$$
- n-step Sarsa is more general because it becomes the (one-step) Sarsa algorithm when $n=1$ and the MC learning algorithm when $n=\infty$.
Properties
- n-step Sarsa needs the experience
$$(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n}).$$
- Since $(r_{t+n}, s_{t+n}, a_{t+n})$ has not been collected at time $t$, we cannot implement n-step Sarsa at step $t$. However, we can wait until time $t+n$ and then update the q-value of $(s_t, a_t)$:
$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t,a_t)\Big[q_{t+n-1}(s_t,a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n})\big)\Big].$$
- Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, its performance is a blend of the two (see the sketch below):
  - If $n$ is large, its performance is close to MC learning and hence has a large variance but a small bias.
  - If $n$ is small, its performance is close to Sarsa and hence has a relatively large bias (due to the initial guess) but relatively low variance.
- Finally, n-step Sarsa also performs policy evaluation. It can be combined with a policy-improvement step to search for optimal policies.
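A small sketch of the n-step target computation (the rewards, bootstrap value, and $\gamma$ are made-up numbers); $n = 1$ recovers the Sarsa target, and letting the reward list grow while the bootstrap term is discounted away approaches the MC return:

```python
# n-step target: r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * q(s_{t+n}, a_{t+n}).
def n_step_target(rewards, q_boot, gamma=0.9):
    g = 0.0
    for i, r in enumerate(rewards):          # rewards = [r_{t+1}, ..., r_{t+n}]
        g += (gamma ** i) * r
    return g + (gamma ** len(rewards)) * q_boot

print(n_step_target([1.0], q_boot=2.0))            # n = 1: the ordinary Sarsa target
print(n_step_target([1.0, 0.0, 0.5], q_boot=2.0))  # n = 3: more sampled rewards, less bootstrapping
# The update applied at time t + n:  Q[s_t, a_t] -= alpha * (Q[s_t, a_t] - n_step_target(...))
```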
Summary
Sarsa is on-policy: it learns and evaluates the policy it is currently executing. Q-learning is off-policy: it can learn from samples generated by an arbitrary behaviour policy while still converging toward the optimal policy. TD methods are the unifying core framework: through the TD target and the TD error, the value estimates are driven step by step toward the solution of the Bellman equation.