RL【7-1】:Temporal-difference Learning
Series Overview
Fundamental Tools
RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation
Algorithm
RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent
Method
RL【7-1】:Temporal-difference Learning
RL【7-2】:Temporal-difference Learning
Contents
- Series Overview
- Fundamental Tools
- Algorithm
- Method
- Preface
- Stochastic Algorithms
- TD Learning of State Values
- Sarsa
  - Base Sarsa
  - Expected Sarsa
  - n-step Sarsa
- Summary
Preface
This series records my study notes for Prof. Shiyu Zhao's (赵世钰) course "Mathematical Foundations of Reinforcement Learning" on Bilibili. For the course itself, see:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Stochastic Algorithms
First: a simple mean estimation problem
Calculate $w = \mathbb{E}[X]$ based on some i.i.d. samples $\{x\}$ of $X$.
- By writing $g(w) = w - \mathbb{E}[X]$, we can reformulate the problem as a root-finding problem: $g(w) = 0$.
- Since we can only obtain samples $\{x\}$ of $X$, the noisy observation is
$$\tilde{g}(w,\eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta.$$
- Then, according to the RM algorithm, solving $g(w) = 0$ gives
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k (w_k - x_k).$$
Problem setup
We want to compute the expectation of a random variable $X$:
$$w = \mathbb{E}[X],$$
but we cannot obtain $\mathbb{E}[X]$ directly; we only have a set of i.i.d. samples $\{x\}$ of $X$.
Reformulation as root finding
We rewrite the problem as
$$g(w) = w - \mathbb{E}[X] = 0.$$
That is, if we can find the $w$ satisfying $g(w) = 0$, that solution is exactly $\mathbb{E}[X]$.
Noisy observation
Since we can only observe samples $x$, what we actually obtain is not $g(w)$ but a noisy observation:
$$\tilde{g}(w,\eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta,$$
where $\eta = \mathbb{E}[X] - x$ is the noise, whose expectation is $0$.
RM update rule
The Robbins–Monro algorithm updates $w$ iteratively so that it converges to $\mathbb{E}[X]$. The update rule is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k),$$
and substituting the noisy observation above gives
$$w_{k+1} = w_k - \alpha_k (w_k - x_k).$$
Intuition (a numerical sketch follows this list)
- The current estimate is $w_k$.
- We receive a new sample $x_k$.
- The difference $w_k - x_k$ corrects the estimate: if $w_k$ is larger than the sample $x_k$, the update pushes it down; otherwise it pushes it up.
- $\alpha_k$ is the step size, usually decreasing with the iteration count (e.g., $\alpha_k = 1/k$) to guarantee convergence.
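To make the iteration concrete, here is a minimal sketch in Python/NumPy. The distribution of $X$, the initial guess, and the step size $\alpha_k = 1/k$ are illustrative assumptions, not part of the notes:

```python
import numpy as np

# Robbins-Monro iteration w_{k+1} = w_k - alpha_k * (w_k - x_k) for w = E[X].
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=10_000)  # i.i.d. samples of X with E[X] = 2

w = 0.0                          # initial guess
for k, x in enumerate(samples, start=1):
    alpha = 1.0 / k              # step size satisfying the RM conditions
    w -= alpha * (w - x)         # with alpha_k = 1/k this is exactly the running sample mean

print(w)                         # close to E[X] = 2
```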
Second: a more complex problem
Estimate the mean of a function $v(X)$: $w = \mathbb{E}[v(X)]$, based on some i.i.d. random samples $\{x\}$ of $X$.
- To solve this problem, we define
$$g(w) = w - \mathbb{E}[v(X)],$$
$$\tilde{g}(w,\eta) = w - v(x) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta.$$
- Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k [w_k - v(x_k)].$$
Problem setup
Instead of estimating the mean of the random variable $X$ itself, we now want to estimate the expectation of a function $v(X)$:
$$w = \mathbb{E}[v(X)],$$
where $v(\cdot)$ is a known function and $X$ is a random variable. We can obtain i.i.d. samples $\{x\}$ of $X$, but not $\mathbb{E}[v(X)]$ directly.
Reformulation as root finding
As in the mean estimation problem, we rewrite the goal as a root-finding problem:
$$g(w) = w - \mathbb{E}[v(X)] = 0.$$
Noisy observation
We cannot compute $\mathbb{E}[v(X)]$ directly, but we can observe it through samples $x$. Hence, define a noisy observation:
$$\tilde{g}(w, \eta) = w - v(x).$$
Expanding it,
$$\tilde{g}(w, \eta) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta,$$
where $\eta = \mathbb{E}[v(X)] - v(x)$ is zero-mean noise.
RM update rule
The Robbins–Monro algorithm updates $w$ iteratively so that it converges to $\mathbb{E}[v(X)]$:
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k).$$
Substituting $\tilde{g}(w,\eta)$ gives
$$w_{k+1} = w_k - \alpha_k [w_k - v(x_k)].$$
Intuition
- The current estimate is $w_k$.
- We draw a new sample $x_k$ and evaluate $v(x_k)$.
- The update gradually pulls $w_k$ toward $v(x_k)$.
- After many iterations, $w_k$ converges to the average of the sampled values, i.e., $\mathbb{E}[v(X)]$.
Third: an even more complex problem
Calculate $w = \mathbb{E}[R + \gamma v(X)]$, where $R, X$ are random variables, $\gamma$ is a constant, and $v(\cdot)$ is a function.
- Suppose we can obtain samples $\{x\}$ and $\{r\}$ of $X$ and $R$. We define
$$g(w) = w - \mathbb{E}[R + \gamma v(X)],$$
$$\tilde{g}(w,\eta) = w - [r + \gamma v(x)] = (w - \mathbb{E}[R + \gamma v(X)]) + (\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]) \doteq g(w) + \eta.$$
- Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k \big[w_k - \big(r_k + \gamma v(x_k)\big)\big].$$
Problem setup
We want to estimate the expectation
$$w = \mathbb{E}[R + \gamma v(X)],$$
where:
- $R, X$ are random variables;
- $\gamma$ is a constant;
- $v(\cdot)$ is a function.
In other words, the target jointly accounts for the random reward $R$ and the function value $v(X)$, weighted by $\gamma$.
Reformulation as root finding
Rewrite the problem as solving $g(w) = 0$:
$$g(w) = w - \mathbb{E}[R + \gamma v(X)].$$
Clearly, the solution is $w^\star = \mathbb{E}[R + \gamma v(X)]$.
Noisy observation
Since we cannot obtain $\mathbb{E}[R + \gamma v(X)]$ directly, we observe it through samples $(r, x)$:
$$\tilde{g}(w, \eta) = w - [r + \gamma v(x)].$$
Expanding it,
$$\tilde{g}(w, \eta) = (w - \mathbb{E}[R + \gamma v(X)]) + \big(\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]\big),$$
which can be written as
$$\tilde{g}(w, \eta) \doteq g(w) + \eta,$$
where $\eta$ is a zero-mean noise term.
RM iteration
The Robbins–Monro algorithm updates $w$ recursively, gradually approaching the solution:
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k).$$
Substituting the noisy observation gives
$$w_{k+1} = w_k - \alpha_k \Big(w_k - \big(r_k + \gamma v(x_k)\big)\Big).$$
Intuition (a sketch follows this list)
- The current estimate is $w_k$;
- the sample $(r_k, x_k)$ gives an approximate target value $r_k + \gamma v(x_k)$;
- the update moves $w_k$ toward this approximate target;
- as the number of iterations grows, $w_k$ converges to $\mathbb{E}[R + \gamma v(X)]$.
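A minimal sketch of the same iteration with the target $\mathbb{E}[R + \gamma v(X)]$; the distributions of $R$ and $X$, the choice $v(x) = x^2$, and $\gamma = 0.9$ are all illustrative assumptions:

```python
import numpy as np

# RM iteration toward w = E[R + gamma * v(X)], using one sample (r, x) per step.
rng = np.random.default_rng(0)
gamma = 0.9
v = lambda x: x ** 2

w = 0.0
for k in range(1, 10_001):
    x = rng.uniform(0.0, 1.0)                 # sample of X
    r = rng.normal(1.0, 0.5)                  # sample of R
    alpha = 1.0 / k
    w -= alpha * (w - (r + gamma * v(x)))     # pull w toward the sampled target r + gamma * v(x)

print(w)   # approx. E[R] + gamma * E[X^2] = 1 + 0.9 / 3 ≈ 1.3
```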
TD Learning of State Values
Algorithm description
- The data/experience required by the algorithm:
  - $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$ or $\{(s_t, r_{t+1}, s_{t+1})\}_t$, generated by following the given policy $\pi$.
- The TD learning algorithm is
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \Big] \quad (1)$$
$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \quad (2)$$
- where $t = 0,1,2,\ldots$. Here, $v_t(s_t)$ is the estimate of $v_\pi(s_t)$; $\alpha_t(s_t)$ is the learning rate of $s_t$ at time $t$.
- At time $t$, only the value of the visited state $s_t$ is updated, whereas the values of the unvisited states $s \neq s_t$ remain unchanged.
Background
In reinforcement learning, we want to estimate the state-value function of a policy $\pi$:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s, \pi],$$
- where $G_t$ is the discounted return obtained when starting from state $s$.
However, we usually do not have a model of the environment (transition probabilities / reward distribution), so we cannot solve the Bellman equation directly; we can only update the estimates from sampled trajectories.
Required data
- The algorithm only needs trajectories sampled under the policy $\pi$:
  - a full trajectory $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$, or
  - a set of transition triples $\{(s_t, r_{t+1}, s_{t+1})\}_t$.
- This means TD learning can learn online: a single step of experience (state, reward, next state) is enough to update.
TD update rule
Update the visited state:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \Big] \quad (1)$$
Leave the unvisited states unchanged:
$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \quad (2)$$
Interpretation (a code sketch follows this block)
The update rule says that the value estimate of the current state $s_t$ is pulled toward the TD target:
$$\text{TD target} = r_{t+1} + \gamma v_t(s_{t+1}),$$
- i.e., "one-step reward + discounted value of the next state".
The size of the update is governed by the TD error:
$$\delta_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big).$$
- It measures the mismatch between the prediction and the one-step observation.
The learning rate $\alpha_t(s_t)$ determines the update magnitude:
- a larger $\alpha$ gives faster but less stable updates;
- a smaller $\alpha$ gives slower but more stable updates.
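A minimal tabular TD(0) sketch of updates (1)-(2). The toy environment (5 states, random transitions, reward only on termination) and the constant learning rate are illustrative assumptions; any stream of samples $(s_t, r_{t+1}, s_{t+1})$ generated under $\pi$ would work:

```python
import numpy as np

# Tabular TD(0): only the visited state is updated, all others keep their values.
n_states, gamma, alpha = 5, 0.9, 0.1
TERMINAL = n_states                        # dummy terminal state with value fixed at 0
V = np.zeros(n_states + 1)
rng = np.random.default_rng(0)

for episode in range(1_000):
    s = int(rng.integers(n_states))        # start from a random non-terminal state
    while s != TERMINAL:
        s_next = int(rng.integers(n_states + 1))   # sampled next state under pi (toy dynamics)
        r = 1.0 if s_next == TERMINAL else 0.0     # reward only on termination
        td_target = r + gamma * V[s_next]          # TD target  r_{t+1} + gamma * v_t(s_{t+1})
        td_error = V[s] - td_target                # TD error   delta_t = v_t(s_t) - TD target
        V[s] -= alpha * td_error                   # update (1) for s_t; all other states keep (2)
        s = s_next
```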
Algorithm properties
- The TD algorithm can be annotated as
$$v_{t+1}(s_t) = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t)\Big[\underbrace{v_t(s_t) - \big[\underbrace{r_{t+1} + \gamma v_t(s_{t+1})}_{\text{TD target } \bar v_t}\big]}_{\text{TD error } \delta_t}\Big]. \quad (3)$$
- Here,
$$\bar v_t \doteq r_{t+1} + \gamma v(s_{t+1})$$
  - is called the TD target.
$$\delta_t \doteq v(s_t) - [r_{t+1} + \gamma v(s_{t+1})] = v(s_t) - \bar v_t$$
  - is called the TD error.
- It is clear that the new estimate $v_{t+1}(s_t)$ is a combination of the current estimate $v_t(s_t)$ and the TD error.
- First, why is $\bar v_t$ called the TD target?
  - That is because the algorithm drives $v(s_t)$ towards $\bar v_t$.
  - To see that,
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)[v_t(s_t) - \bar v_t]$$
$$\implies v_{t+1}(s_t) - \bar v_t = v_t(s_t) - \bar v_t - \alpha_t(s_t)[v_t(s_t) - \bar v_t]$$
$$\implies v_{t+1}(s_t) - \bar v_t = [1 - \alpha_t(s_t)][v_t(s_t) - \bar v_t]$$
$$\implies |v_{t+1}(s_t) - \bar v_t| = |1 - \alpha_t(s_t)|\,|v_t(s_t) - \bar v_t|$$
  - Since $\alpha_t(s_t)$ is a small positive number, we have
$$0 < 1 - \alpha_t(s_t) < 1.$$
  - Therefore,
$$|v_{t+1}(s_t) - \bar v_t| \le |v_t(s_t) - \bar v_t|,$$
  - which means $v(s_t)$ is driven towards $\bar v_t$!
- Second, what is the interpretation of the TD error?
$$\delta_t = v(s_t) - [r_{t+1} + \gamma v(s_{t+1})]$$
  - It is a difference between two consecutive time steps.
  - It reflects the discrepancy between $v_t$ and $v_\pi$. To see that, denote
$$\delta_{\pi,t} \doteq v_\pi(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})].$$
  - Note that
$$\mathbb{E}[\delta_{\pi,t} \mid S_t = s_t] = v_\pi(s_t) - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t] = 0.$$
  - If $v_t = v_\pi$, then $\delta_t$ should be zero (in the expectation sense).
  - Hence, if $\delta_t$ is not zero, then $v_t$ is not equal to $v_\pi$.
  - The TD error can be interpreted as an innovation, i.e., new information obtained from the experience $(s_t, r_{t+1}, s_{t+1})$.
- Other properties:
  - The TD algorithm in (3) only estimates the state value of a given policy.
    - It does not estimate the action values.
    - It does not search for optimal policies.
  - Later, we will see how to estimate action values and then search for optimal policies.
  - Nonetheless, the TD algorithm in (3) is fundamental for understanding the core idea.
Explanation of TD Algorithm Properties
- Recap of the TD update rule
  The core TD update is
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big].$$
  - It consists of three parts:
    - the current estimate: $v_t(s_t)$
    - the TD target: $\bar v_t = r_{t+1} + \gamma v_t(s_{t+1})$
    - the TD error: $\delta_t = v_t(s_t) - \bar v_t$
  The update therefore boils down to
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\delta_t,$$
  i.e., subtracting from the current estimate a fraction of its gap to the TD target.
- Why is $\bar v_t$ called the TD target?
  Intuition
  - The algorithm aims to make $v(s_t)$ approach $\bar v_t$.
  - Every update shrinks the gap between the two.
  Derivation
$$v_{t+1}(s_t) - \bar v_t = [1 - \alpha_t(s_t)]\,[v_t(s_t) - \bar v_t].$$
  Since the learning rate satisfies $0 < \alpha_t(s_t) < 1$, each iteration gives
$$|v_{t+1}(s_t) - \bar v_t| \le |v_t(s_t) - \bar v_t|.$$
  This shows that the estimate moves step by step toward the TD target.
- Interpretation of the TD error
  Definition:
$$\delta_t = v(s_t) - \big(r_{t+1} + \gamma v(s_{t+1})\big).$$
  Meaning:
  - it is the difference between two consecutive time steps;
  - it reflects the gap between the current estimate $v_t$ and the true value function $v_\pi$.
  Key facts:
  - $\mathbb{E}[\delta_t \mid S_t = s_t] = 0$ for every state exactly when $v_t = v_\pi$;
  - if the TD errors are zero (in expectation, for every state), the estimate is consistent with $v_\pi$;
  - if $\delta_t \neq 0$, the estimate deviates from the true value.
  Intuition:
  - the TD error can be viewed as an innovation, i.e., the new information obtained from each piece of experience;
  - the update uses the TD error to correct the estimate step by step.
Further Explanation: TD Target & TD Error
TD target $\bar v_t$
$$\bar v_t \doteq r_{t+1} + \gamma v(s_{t+1})$$
- Intuitive meaning:
  - It is an estimate of the return from the next moment on, built from the current reward $r_{t+1}$ and the value of the next state $v(s_{t+1})$.
  - It can be viewed as a one-step lookahead: starting from $s_t$, take one step, collect the immediate reward, and add the estimated value of the next state.
- Analogy:
  - If $v(s_t)$ is our forecast of a house price, then $\bar v_t$ is the corrected target built from "the latest transaction price plus the expected future trend".
  - Each update nudges the forecast toward this corrected target.
- Why is it a "target"?
  In the Bellman equation,
$$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s],$$
  - the TD target $\bar v_t$ is exactly a sample approximation of the right-hand side.
  So the TD target is a local, sample-based realization of the Bellman equation.
TD error $\delta_t$
$$\delta_t = v(s_t) - \big(r_{t+1} + \gamma v(s_{t+1})\big) = v(s_t) - \bar v_t$$
- Intuitive meaning:
  - It is the gap between the current estimate and the TD target.
  - If the gap is zero, the prediction matches the target; otherwise, its magnitude and sign tell us how much and in which direction to update.
- Interpretation 1: prediction error
  - $\delta_t$ asks: "How far is my prediction $v(s_t)$ from the prediction $\bar v_t$ corrected by the actually observed reward?"
- Interpretation 2: learning signal
  $\delta_t$ is what drives the update:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\,\delta_t$$
  - If $\delta_t > 0$: the prediction is too high and should decrease;
  - if $\delta_t < 0$: the prediction is too low and should increase.
- Interpretation 3: innovation
  - In statistics, an innovation is the difference between new information and the existing prediction.
  - In TD learning, $\delta_t$ is exactly the new information obtained from experience; it measures the difference between "what we learned" and "what we believed".
Relationship between TD target and TD error (a numerical sketch follows this list)
- The TD target provides the learning target;
- the TD error measures how far the current prediction is from that target;
- the update rule is
$$\text{New Estimate} = \text{Old Estimate} - \text{Learning Rate} \times \text{TD Error}.$$
- In short:
  - TD target = "where I should move toward"
  - TD error = "how far I currently am from the target"
  - TD update = "take a small step toward the target"
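A one-update numerical sketch with made-up values, showing the three quantities and the gap to the TD target shrinking by the factor $1 - \alpha$ derived above:

```python
# One TD update with illustrative numbers (alpha, gamma, values, reward are all made up).
alpha, gamma = 0.1, 0.9
v_s, v_s_next, r = 5.0, 4.0, 1.0            # current estimates and an observed reward

td_target = r + gamma * v_s_next            # "where I should move toward"          -> 4.6
td_error = v_s - td_target                  # "how far I am from the target"        -> 0.4
v_s_new = v_s - alpha * td_error            # "take a small step toward the target" -> 4.96

print(abs(v_s_new - td_target))             # 0.36 = (1 - alpha) * |v_s - td_target|
```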
The idea of the algorithm
- First, a new expression of the Bellman equation
  - The definition of the state value of $\pi$ is
$$v_\pi(s) = \mathbb{E}[R + \gamma G \mid S = s], \quad s \in \mathcal{S}, \quad (4)$$
  - where $G$ is the discounted return. Since
$$\mathbb{E}[G \mid S = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s,a)\, v_\pi(s') = \mathbb{E}[v_\pi(S') \mid S = s],$$
  - where $S'$ is the next state, we can rewrite (4) as
$$v_\pi(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid S = s], \quad s \in \mathcal{S}. \quad (5)$$
  - Equation (5) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation, an important tool for designing and analyzing TD algorithms.
  This shows that the value of the current state can be expressed as "one-step reward + the expected value of the next state".
- Second, solve the Bellman equation in (5) using the RM algorithm
  - In particular, by defining
$$g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s],$$
  - we can rewrite (5) as
$$g(v(s)) = 0.$$
  - Since we can only obtain the samples $r$ and $s'$ of $R$ and $S'$, the noisy observation we have is
$$\tilde g(v(s)) = v(s) - [r + \gamma v_\pi(s')] = \underbrace{\big(v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s]\big)}_{g(v(s))} + \underbrace{\big(\mathbb{E}[R + \gamma v_\pi(S') \mid s] - [r + \gamma v_\pi(s')]\big)}_{\eta}.$$
Solving the Bellman equation with the RM algorithm
- The Bellman equation is, formally, a fixed-point equation:
$$v(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid s].$$
- We can rewrite it as a root-finding problem:
$$g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s] = 0.$$
- Why rewrite it as $g(v(s)) = 0$?
  - So that we can apply stochastic approximation (the Robbins–Monro algorithm).
  - RM is designed precisely for solving root-finding problems under noisy observations.
- The RM approach
  - We cannot compute $\mathbb{E}[R + \gamma v_\pi(S') \mid s]$ directly; we can only approximate it from samples $(r, s')$. We therefore define
$$\tilde g(v(s)) = v(s) - [r + \gamma v_\pi(s')].$$
  - The RM update is
$$v_{k+1}(s) = v_k(s) - \alpha_k \tilde g(v_k(s)) = v_k(s) - \alpha_k \Big( v_k(s) - [r_k + \gamma v_\pi(s'_k)] \Big).$$
  - What this step does:
    - use the sample $(r_k, s'_k)$ to construct the approximate "gradient" $\tilde g$;
    - keep updating $v_k(s)$ so that it gradually approaches the solution of the Bellman equation.
- Therefore, the RM algorithm for solving $g(v(s)) = 0$ is
$$v_{k+1}(s) = v_k(s) - \alpha_k \tilde g(v_k(s)) = v_k(s) - \alpha_k \Big( v_k(s) - [r_k + \gamma v_\pi(s'_k)] \Big), \quad k = 1,2,3,\ldots \quad (6)$$
- where $v_k(s)$ is the estimate of $v_\pi(s)$ at the $k$th step; $r_k, s'_k$ are the samples of $R, S'$ obtained at the $k$th step.
- The RM algorithm in (6) has two assumptions that deserve special attention:
  - We must have the experience set $\{(s, r, s')\}$ for $k = 1,2,3,\ldots$.
  - We assume that $v_\pi(s')$ is already known for any $s'$.
- To remove the two assumptions in the RM algorithm, we can modify it:
  - One modification is that $\{(s,r,s')\}$ is changed to $\{(s_t, r_{t+1}, s_{t+1})\}$ so that the algorithm can utilize the sequential samples in an episode.
  - Another modification is that $v_\pi(s')$ is replaced by an estimate of it, because we don't know it in advance.
Connection to TD learning
- The RM algorithm above has two limitations:
  - it requires knowing $v_\pi(s')$, which we do not actually know;
  - it requires a set of experience samples $\{(s, r, s')\}$ for each state $s$, whereas what we actually collect is the sequential data of an episode.
- To remove these two limitations:
  - we replace the true value $v_\pi(s')$ with its current estimate $v(s')$;
  - we use the sequential samples $(s_t, r_{t+1}, s_{t+1})$ along an episode to update step by step.
- This yields exactly the TD update rule:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big) \Big],$$
- where
  - the TD target is $\bar v_t = r_{t+1} + \gamma v_t(s_{t+1})$,
  - and the TD error is $\delta_t = v_t(s_t) - \bar v_t$.
Intuition
- For every state $s$ along the trajectory, apply the TD update repeatedly until convergence.
- All the states "help each other" and converge together gradually.
Algorithm convergence
- By the TD algorithm (1), $v_t(s)$ converges with probability 1 to $v_\pi(s)$ for all $s \in \mathcal{S}$ as $t \to \infty$, if $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$.
- Remarks:
  - This theorem says that the state values of a given policy $\pi$ can be found by the TD algorithm.
  - $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ must hold for all $s \in \mathcal{S}$. At time step $t$, if $s = s_t$ (i.e., $s$ is visited at time $t$), then $\alpha_t(s) > 0$; otherwise $\alpha_t(s) = 0$ for all other $s \ne s_t$. This requires every state to be visited an infinite (or sufficiently large) number of times.
  - The learning rate $\alpha$ is often selected as a small constant. In this case, the condition $\sum_t \alpha_t^2(s) < \infty$ no longer holds. When $\alpha$ is constant, it can still be shown that the algorithm converges in the expectation sense.
What the theorem says
- Under suitable conditions, TD learning converges to the true state-value function $v_\pi(s)$.
- The conditions are:
  - $\sum_t \alpha_t(s) = \infty$ (the learning rates must be large enough in total, guaranteeing infinitely many effective updates);
  - $\sum_t \alpha_t^2(s) < \infty$ (the learning rates must shrink, so that the updates settle down instead of oscillating).
Interpretation
- The first condition ensures that the TD algorithm keeps absorbing new information and does not stop learning prematurely.
- The second condition ensures that the updates do not oscillate around the convergence point because of overly large step sizes.
Choice of learning rate
- In practice, $\alpha_t(s)$ is usually set to a small constant (e.g., $0.1$).
- Strictly speaking this violates $\sum_t \alpha_t^2(s) < \infty$, but convergence to a neighborhood of $v_\pi(s)$ can still be guaranteed in the expectation sense.
Algorithm properties

| TD/Sarsa learning | MC learning |
|---|---|
| Online: TD learning is online. It can update the state/action values immediately after receiving a reward. | Offline: MC learning is offline. It has to wait until an episode has been completely collected. |
| Continuing tasks: Since TD learning is online, it can handle both episodic and continuing tasks. | Episodic tasks: Since MC learning is offline, it can only handle episodic tasks that have terminal states. |
| Bootstrapping: TD bootstraps because the update of a value relies on the previous estimate of this value. Hence, it requires initial guesses. | Non-bootstrapping: MC does not bootstrap, because it directly estimates state/action values without any initial guess. |
| Low estimation variance: TD has a lower estimation variance than MC because it involves fewer random variables. For instance, Sarsa only requires samples of $R_{t+1}, S_{t+1}, A_{t+1}$. | High estimation variance: To estimate $q_\pi(s_t, a_t)$, we need samples of $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$; for an episode of length $L$, this return involves many random variables. |

TD (temporal-difference) learning and MC (Monte Carlo) learning are both model-free methods, but they differ in important ways.
- Online vs. offline
  - TD updates online: as soon as the reward $r_{t+1}$ is received, $v(s_t)$ can be updated immediately.
  - MC updates offline: it must wait until the whole episode finishes before computing the return and updating.
  - Implication: TD suits real-time learning scenarios, while MC suits settings where complete trajectories are collected first.
- Task types
  - TD can handle both episodic tasks (with terminal states) and continuing tasks (without terminal states).
  - MC can only handle episodic tasks, because it needs the complete return as the update target.
  - Implication: TD is more flexible and fits long-running systems (e.g., an agent learning over an infinite horizon).
- Bootstrapping
  - TD bootstraps: the update relies on current estimates (such as $v(s_{t+1})$).
  - MC does not bootstrap: it updates directly with the complete return $G_t$.
  - Implication: TD updates sooner because it does not wait for the whole episode, but it depends on the initial values.
- Estimation variance (a code sketch follows this list)
  - TD: low variance, because each update depends only on one immediate reward and the estimate of the next state.
  - MC: high variance, because the return involves many random variables, making the estimate less stable (though unbiased).
  - Implication: TD learning usually converges faster and more stably than MC, but it may introduce bias because bootstrapping uses approximate estimates.
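A minimal sketch contrasting the two update styles on a single recorded episode (the episode, $\gamma$, and $\alpha$ are made-up values): TD updates after every step and bootstraps, MC waits for the full return.

```python
# Contrast of TD(0) and MC updates on one recorded episode.
gamma, alpha = 0.9, 0.1
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)]   # (s_t, r_{t+1}, s_{t+1}); state 3 is terminal
V_td = [0.0] * 4
V_mc = [0.0] * 4

# TD: online, updates immediately after each step and bootstraps on V_td[s_next].
for s, r, s_next in episode:
    V_td[s] -= alpha * (V_td[s] - (r + gamma * V_td[s_next]))
# On this first pass TD only moves V_td[2]; earlier states catch up over later episodes.

# MC: offline, waits for the episode to end, then updates each state toward its full return G_t.
G = 0.0
for s, r, _ in reversed(episode):
    G = r + gamma * G                               # accumulate the discounted return backward
    V_mc[s] -= alpha * (V_mc[s] - G)

print(V_td, V_mc)
```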
Sarsa
Base Sarsa
Sarsa algorithm
- First, our aim is to estimate the action values of a given policy $\pi$.
- Suppose we have some experience $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$.
- We can use the following Sarsa algorithm to estimate the action values:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big) \Big],$$
$$q_{t+1}(s,a) = q_t(s,a), \quad \forall (s,a) \neq (s_t, a_t),$$
- where $t = 0,1,2,\ldots$
  - $q_t(s_t,a_t)$ is an estimate of $q_\pi(s_t, a_t)$;
  - $\alpha_t(s_t,a_t)$ is the learning rate depending on $(s_t,a_t)$.
Sarsa update rule
- Current estimate: $q_t(s_t,a_t)$
- TD target: $r_{t+1} + \gamma q_t(s_{t+1},a_{t+1})$
  - $r_{t+1}$: the immediate reward received after taking action $a_t$.
  - $\gamma q_t(s_{t+1}, a_{t+1})$: the estimated long-run return of continuing from $(s_{t+1}, a_{t+1})$.
- TD error: $q_t(s_t,a_t) - \text{TD target}$
At every update, Sarsa pulls $q(s_t,a_t)$ a little closer to the target "one-step reward + estimated value of the next action".
Intuition
- We have an old estimate $q_t(s_t,a_t)$ of how good $(s_t,a_t)$ is.
- We now observe one step of real reward $r_{t+1}$, together with the estimate $q_t(s_{t+1},a_{t+1})$ of the next pair $(s_{t+1},a_{t+1})$.
- This provides a new target (the TD target).
- Sarsa does not replace the estimate outright; it moves $q(s_t,a_t)$ toward this target gradually.
- Relationship between Sarsa and TD
  - Replace $v(s)$ in the TD algorithm with $q(s,a)$ → we obtain Sarsa.
  - Sarsa is the action-value version of TD learning.
- Mathematical expression
  - The Sarsa algorithm solves
$$q_\pi(s,a) = \mathbb{E}[R + \gamma q_\pi(S', A') \mid s,a], \quad \forall s,a.$$
  - This is another expression of the Bellman equation, expressed in terms of action values.
- Theorem (Convergence of Sarsa learning)
  By the Sarsa algorithm, $q_t(s,a)$ converges with probability $1$ to the action value $q_\pi(s,a)$ as $t \to \infty$, for all $(s,a)$, if $\sum_t \alpha_t(s,a) = \infty$ and $\sum_t \alpha_t^2(s,a) < \infty$.
  - Remarks:
    - This theorem says the action values of a given policy $\pi$ can be found by Sarsa.
Pseudocode: Policy searching by Sarsa
- For each episode, do
  - If the current $s_t$ is not the target state, do
    - Collect the experience $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$:
      - Take action $a_t \sim \pi_t(s_t)$.
      - Generate $r_{t+1}, s_{t+1}$.
      - Take action $a_{t+1} \sim \pi_t(s_{t+1})$.
    - Update the q-value:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big( r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}) \big) \Big]$$
    - Update the policy:
$$\pi_{t+1}(a \mid s_t) = 1 - \frac{\epsilon}{|\mathcal{A}|} (|\mathcal{A}| - 1), \quad \text{if } a = \arg\max_a q_{t+1}(s_t, a)$$
$$\pi_{t+1}(a \mid s_t) = \frac{\epsilon}{|\mathcal{A}|}, \quad \text{otherwise}$$
The Sarsa pseudocode boils down to three steps (a code sketch follows):
- Sample an interaction: $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$.
- Update the Q-value: pull it toward "immediate reward + next-step Q".
- Update the policy: adjust the action probabilities according to the latest Q-values.
This realizes an online reinforcement-learning loop that "learns Q-values and improves the policy at the same time".
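A minimal Sarsa loop following the pseudocode above, sketched in Python/NumPy. The toy chain environment `step`, its reward structure, and the hyper-parameters are illustrative assumptions; the $\epsilon$-greedy policy is represented implicitly by always selecting actions from the current Q-values:

```python
import numpy as np

n_states, n_actions = 5, 2
gamma, alpha, eps = 0.9, 0.1, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; reward 1 on reaching the last state."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next, s_next == n_states - 1

def eps_greedy(s):
    """Epsilon-greedy action selection from the current Q-values (the policy-update step)."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

for episode in range(500):
    s = int(rng.integers(n_states - 1))   # random non-terminal start state
    a = eps_greedy(s)
    for _ in range(100):                  # cap episode length to keep the sketch fast
        r, s_next, done = step(s, a)
        a_next = eps_greedy(s_next)       # next action drawn from the current epsilon-greedy policy
        # Sarsa update: pull q(s_t, a_t) toward r_{t+1} + gamma * q_t(s_{t+1}, a_{t+1})
        Q[s, a] -= alpha * (Q[s, a] - (r + gamma * Q[s_next, a_next]))
        s, a = s_next, a_next
        if done:
            break

print(np.argmax(Q, axis=1))               # greedy actions after training (tends toward 1, "move right")
```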
Remarks about Sarsa
- The policy of $s_t$ is updated immediately after $q(s_t,a_t)$ is updated → based on Generalized Policy Iteration (GPI).
- The policy is $\epsilon$-greedy instead of greedy → it balances exploitation and exploration.
Core Idea vs Complication
- Core idea: use an algorithm to solve the Bellman equation of a given policy.
- Complication: emerges when we try to find optimal policies and work efficiently.
Expected Sarsa
Algorithm
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]\big)\Big],$$
$$q_{t+1}(s, a) = q_t(s, a), \quad \forall (s,a) \neq (s_t,a_t),$$
- where
$$\mathbb{E}[q_t(s_{t+1}, A)] = \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$$
- is the expected value of $q_t(s_{t+1}, a)$ under policy $\pi_t$.
Compared to Sarsa:
- The TD target is changed from $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ (as in Sarsa) to $r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]$ (as in Expected Sarsa); see the sketch at the end of this subsection.
- It needs more computation per step. But it is beneficial in the sense that it reduces the estimation variance, because it reduces the random variables in Sarsa from $\{s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\}$ to $\{s_t, a_t, r_{t+1}, s_{t+1}\}$.
What does the algorithm do mathematically?
- Expected Sarsa is a stochastic approximation algorithm for solving the following equation:
$$q_\pi(s,a) = \mathbb{E}\Big[ R_{t+1} + \gamma\, \mathbb{E}_{A_{t+1} \sim \pi(\cdot \mid S_{t+1})}\big[q_\pi(S_{t+1}, A_{t+1})\big] \,\Big|\, S_t=s, A_t=a \Big], \quad \forall s,a.$$
- The above equation is another expression of the Bellman equation:
$$q_\pi(s,a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s, A_t=a].$$
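A small sketch of how the Expected Sarsa target is computed; the q-values, the policy probabilities, and $\gamma$ below are made-up illustrative numbers:

```python
import numpy as np

# Expected Sarsa target: r + gamma * sum_a pi(a|s') * q(s', a).
def expected_sarsa_target(r, q_next_row, pi_probs, gamma=0.9):
    return r + gamma * float(np.dot(pi_probs, q_next_row))

q_next_row = np.array([1.0, 2.0, 0.5])   # q_t(s_{t+1}, a) for each action a
pi_probs   = np.array([0.1, 0.8, 0.1])   # pi_t(a | s_{t+1})
target = expected_sarsa_target(1.0, q_next_row, pi_probs)
print(target)                            # 1 + 0.9 * (0.1*1.0 + 0.8*2.0 + 0.1*0.5) = 2.575
# The update itself keeps the Sarsa form:  Q[s, a] -= alpha * (Q[s, a] - target)
```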
n-step Sarsa
Introduction
- The definition of the action value is
$$q_\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a].$$
- The discounted return $G_t$ can be written in different forms:
  - Sarsa:
$$G_t^{(1)} = R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}),$$
$$G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}),$$
$$\vdots$$
  - n-step Sarsa:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}),$$
$$\vdots$$
  - MC:
$$G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$
- It should be noted that
$$G_t = G_t^{(1)} = G_t^{(2)} = G_t^{(n)} = G_t^{(\infty)},$$
- where the superscripts merely indicate the different decomposition structures of $G_t$.
Algorithm analysis
- Sarsa aims to solve
$$q_\pi(s,a) = \mathbb{E}[G_t^{(1)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s,a].$$
- MC learning aims to solve
$$q_\pi(s,a) = \mathbb{E}[G_t^{(\infty)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s,a].$$
- An intermediate algorithm called n-step Sarsa aims to solve
$$q_\pi(s,a) = \mathbb{E}[G_t^{(n)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s,a].$$
- The algorithm of n-step Sarsa is
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\Big].$$
- n-step Sarsa is more general because it becomes the (one-step) Sarsa algorithm when $n=1$ and the MC learning algorithm when $n=\infty$.
Properties
- n-step Sarsa needs the experience
$$(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n}).$$
- Since $(r_{t+n}, s_{t+n}, a_{t+n})$ has not been collected at time $t$, we cannot implement n-step Sarsa at step $t$. However, we can wait until time $t+n$ and then update the q-value of $(s_t, a_t)$:
$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t,a_t)\Big[q_{t+n-1}(s_t,a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n})\big)\Big].$$
- Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, its performance is a blend of the two (see the sketch below):
  - If $n$ is large, its performance is close to MC learning and hence has a large variance but a small bias.
  - If $n$ is small, its performance is close to Sarsa and hence has a relatively large bias (due to the initial guess) but relatively low variance.
- Finally, n-step Sarsa also performs policy evaluation. It can be combined with a policy-improvement step to search for optimal policies.
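A small sketch of the n-step target computation (the rewards, bootstrap value, and $\gamma$ are made-up numbers); $n = 1$ recovers the Sarsa target, and letting the reward list grow while the bootstrap term is discounted away approaches the MC return:

```python
# n-step target: r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * q(s_{t+n}, a_{t+n}).
def n_step_target(rewards, q_boot, gamma=0.9):
    g = 0.0
    for i, r in enumerate(rewards):          # rewards = [r_{t+1}, ..., r_{t+n}]
        g += (gamma ** i) * r
    return g + (gamma ** len(rewards)) * q_boot

print(n_step_target([1.0], q_boot=2.0))            # n = 1: the ordinary Sarsa target
print(n_step_target([1.0, 0.0, 0.5], q_boot=2.0))  # n = 3: more sampled rewards, less bootstrapping
# The update applied at time t + n:  Q[s_t, a_t] -= alpha * (Q[s_t, a_t] - n_step_target(...))
```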
Summary
Sarsa is on-policy: it learns and evaluates the policy it is currently executing. Q-learning is off-policy: it can learn from samples generated by an arbitrary behaviour policy while still converging toward the optimal policy. TD methods are the unifying core framework: through the TD target and the TD error, the value estimates are driven step by step toward the solution of the Bellman equation.