
RL【7-1】:Temporal-difference Learning

Series Contents

Fundamental Tools

RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation

Algorithm

RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent

Method

RL【7-1】:Temporal-difference Learning
RL【7-2】:Temporal-difference Learning


Table of Contents

  • Series Contents
    • Fundamental Tools
    • Algorithm
    • Method
  • Preface
  • Stochastic Algorithms
  • TD Learning of State Values
  • Sarsa
  • Summary


Preface

This series records my study notes for Professor Shiyu Zhao's course 【强化学习的数学原理】 (Mathematical Foundations of Reinforcement Learning) on Bilibili. For the full course content, please refer to:
Bilibili video: 【【强化学习的数学原理】课程:从零开始到透彻理解(完结)】
GitHub course material: Book-Mathematical-Foundation-of-Reinforcement-Learning


Stochastic Algorithms

First: Simple mean estimation problem

Calculate $w = \mathbb{E}[X]$, based on some i.i.d. samples $\{x\}$ of $X$.

  • By writing $g(w) = w - \mathbb{E}[X]$, we can reformulate the problem as a root-finding problem: $g(w) = 0$.

  • Since we can only obtain samples $\{x\}$ of $X$, the noisy observation is

    $\tilde{g}(w,\eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta.$

  • Then, according to the RM algorithm, the update for solving $g(w) = 0$ is

    $w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k (w_k - x_k).$

Problem setup

  • We want to compute the expectation of a random variable $X$:

    $w = \mathbb{E}[X],$

  • but we cannot obtain $\mathbb{E}[X]$ directly; we only have a set of i.i.d. samples $\{x\}$ drawn from $X$.

Reformulation as a root-finding problem

  • We rewrite the problem in the following form:

    $g(w) = w - \mathbb{E}[X] = 0.$

  • In other words, if we can find the $w$ that satisfies $g(w) = 0$, that solution is exactly $\mathbb{E}[X]$.

Noisy observation

  • Because we can only observe samples $x$, what we actually obtain is not $g(w)$ but a noisy observation:

    $\tilde{g}(w,\eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta,$

  • where $\eta = \mathbb{E}[X] - x$ is the noise, whose expectation is $0$.

RM update rule

  • The Robbins–Monro (RM) algorithm updates $w$ iteratively so that it converges to $\mathbb{E}[X]$. The update rule is:

    $w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k),$

  • and substituting the noisy observation above gives:

    $w_{k+1} = w_k - \alpha_k (w_k - x_k).$

Intuition behind the algorithm

  • The current estimate is $w_k$.
  • We receive a new sample $x_k$.
  • The difference $w_k - x_k$ corrects the estimate: if $w_k$ is larger than the sample $x_k$, the update pushes it down; otherwise it pushes it up.
  • $\alpha_k$ is the step size, usually decreasing with the iteration count (e.g., $\alpha_k = 1/k$) to guarantee convergence (see the sketch below).
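As a sanity check, here is a minimal NumPy sketch of this iteration. The Gaussian choice for $X$ (true mean $2$), the sample size, and the step size $\alpha_k = 1/k$ are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# RM iteration for mean estimation: w_{k+1} = w_k - alpha_k * (w_k - x_k), alpha_k = 1/k.
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=5000)  # assumed i.i.d. samples of X, E[X] = 2

w = 0.0  # initial guess
for k, x_k in enumerate(samples, start=1):
    alpha_k = 1.0 / k                 # step size satisfying the RM conditions
    w = w - alpha_k * (w - x_k)       # move the estimate toward the new sample

print(w)  # close to E[X] = 2; with alpha_k = 1/k this equals the running sample mean
```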

Second: A more complex problem

Estimate the mean of a function $v(X)$: $w = \mathbb{E}[v(X)]$, based on some i.i.d. random samples $\{x\}$ of $X$.

  • To solve this problem, we define

    $g(w) = w - \mathbb{E}[v(X)],$

    $\tilde{g}(w,\eta) = w - v(x) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta.$

  • Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is

    $w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k [w_k - v(x_k)].$

Problem setup

  • Instead of estimating the mean of the random variable $X$ itself, we now want to estimate the expectation of a function $v(X)$:

    $w = \mathbb{E}[v(X)],$

  • where $v(\cdot)$ is a known function and $X$ is a random variable. We can obtain i.i.d. samples $\{x\}$ of $X$, but we cannot obtain $\mathbb{E}[v(X)]$ directly.

Reformulation as a root-finding problem

  • As in the mean estimation problem, we rewrite the goal as a root-finding problem:

    $g(w) = w - \mathbb{E}[v(X)] = 0.$

Noisy observation

  • We cannot compute $\mathbb{E}[v(X)]$ directly, but we can observe it through samples $x$. We therefore define a noisy observation function:

    $\tilde{g}(w, \eta) = w - v(x).$

  • Expanding it:

    $\tilde{g}(w, \eta) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta,$

  • where $\eta = \mathbb{E}[v(X)] - v(x)$ is zero-mean noise.

RM update rule

  • The Robbins–Monro algorithm updates $w$ iteratively so that it converges to $\mathbb{E}[v(X)]$. The update rule is:

    $w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k).$

  • Substituting $\tilde{g}(w,\eta)$ gives:

    $w_{k+1} = w_k - \alpha_k [w_k - v(x_k)].$

Intuition behind the algorithm

  • The current estimate is $w_k$.
  • We take a new sample $x_k$ and compute the function value $v(x_k)$.
  • Each update pulls $w_k$ a small step toward $v(x_k)$.
  • After many iterations, $w_k$ converges to the average of the observed function values, i.e., $\mathbb{E}[v(X)]$ (see the sketch below).
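A minimal sketch of the same iteration for $\mathbb{E}[v(X)]$, assuming purely for illustration that $v(x) = x^2$ and $X \sim \mathcal{N}(0,1)$, so that the true value is $1$:

```python
import numpy as np

rng = np.random.default_rng(1)

def v(x):
    return x ** 2  # a known function of the random variable (assumed here)

w = 0.0
for k in range(1, 5001):
    x_k = rng.normal()               # draw one i.i.d. sample of X
    w = w - (1.0 / k) * (w - v(x_k)) # RM update toward v(x_k)

print(w)  # approaches E[v(X)] = 1
```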

Third: An even more complex problem

Calculate $w = \mathbb{E}[R + \gamma v(X)]$, where $R, X$ are random variables, $\gamma$ is a constant, and $v(\cdot)$ is a function.

  • Suppose we can obtain samples $\{x\}$ and $\{r\}$ of $X$ and $R$. We define

    $g(w) = w - \mathbb{E}[R + \gamma v(X)],$

    $\tilde{g}(w,\eta) = w - [r + \gamma v(x)] = (w - \mathbb{E}[R + \gamma v(X)]) + (\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]) \doteq g(w) + \eta.$

  • Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is

    $w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k \big(w_k - [r_k + \gamma v(x_k)]\big).$

Problem setup

  • We want to estimate the expectation:

    $w = \mathbb{E}[R + \gamma v(X)],$

  • where:

    • $R, X$ are random variables;
    • $\gamma$ is a constant;
    • $v(\cdot)$ is a function.
  • In other words, the target combines the random reward $R$ and the discounted function value $\gamma v(X)$ in one expectation.

Reformulation as a root-finding problem

  • Rewrite the problem as solving the equation $g(w) = 0$:

    $g(w) = w - \mathbb{E}[R + \gamma v(X)].$

  • Clearly, the solution is $w^\star = \mathbb{E}[R + \gamma v(X)]$.

Noisy observation

  • Since we cannot obtain $\mathbb{E}[R + \gamma v(X)]$ directly, we can only observe it through samples $(r, x)$:

    $\tilde{g}(w, \eta) = w - [r + \gamma v(x)].$

  • Expanding it:

    $\tilde{g}(w, \eta) = (w - \mathbb{E}[R + \gamma v(X)]) + \big(\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]\big).$

  • It can therefore be written as:

    $\tilde{g}(w, \eta) \doteq g(w) + \eta,$

  • where $\eta$ is a zero-mean noise term.

RM update rule

  • The Robbins–Monro algorithm updates $w$ recursively so that it approaches the solution:

    $w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k).$

  • Substituting the noisy observation gives:

    $w_{k+1} = w_k - \alpha_k \Big(w_k - \big(r_k + \gamma v(x_k)\big)\Big).$

Intuition

  1. The current estimate is $w_k$.
  2. Use the sample $(r_k, x_k)$ to compute the approximate target $r_k + \gamma v(x_k)$.
  3. The update moves $w_k$ toward this approximate target.
  4. As the number of iterations grows, $w_k$ converges to $\mathbb{E}[R + \gamma v(X)]$ (a sketch follows).
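A minimal sketch of this RM iteration with an entirely made-up setup ($R \sim \mathcal{N}(1, 0.5)$, $X$ uniform over three values, a tabular $v$, and $\gamma = 0.9$); the structure of the update is the point, not the particular numbers.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.9
v = np.array([0.0, 1.0, 2.0])        # a known function v(.) given as a table (assumed)

true_value = 1.0 + gamma * v.mean()  # E[R] + gamma * E[v(X)] for this toy setup

w = 0.0
for k in range(1, 10001):
    r_k = rng.normal(1.0, 0.5)       # sample of R
    x_k = rng.integers(0, 3)         # sample of X, uniform over {0, 1, 2}
    w = w - (1.0 / k) * (w - (r_k + gamma * v[x_k]))  # RM update toward r + gamma * v(x)

print(w, true_value)                 # the two numbers should be close
```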

TD Learning of State Values

Algorithm description

  • The data/experience required by the algorithm:

    • $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$ or $\{(s_t, r_{t+1}, s_{t+1})\}_t$, generated following the given policy $\pi$.
  • The TD learning algorithm is

    $v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \Big] \quad (1)$

    $v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \quad (2)$

    • where $t = 0,1,2,\ldots$. Here, $v_t(s_t)$ is the estimate of the state value $v_\pi(s_t)$; $\alpha_t(s_t)$ is the learning rate of $s_t$ at time $t$.
    • At time $t$, only the value of the visited state $s_t$ is updated, whereas the values of the unvisited states $s \neq s_t$ remain unchanged.

Background

  • In reinforcement learning, we want to estimate the state value function of a given policy $\pi$:

    $v_\pi(s) = \mathbb{E}[G_t \mid S_t = s, \pi],$

    • where $G_t$ is the discounted return obtained starting from state $s$.
  • However, we usually have no model of the environment (transition probabilities / reward distribution), so we cannot solve the Bellman equation directly; we can only update the estimates from sampled trajectories.

Data required by the algorithm

  • The algorithm only needs trajectories sampled under the policy $\pi$:
    • a full trajectory $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$, or
    • a set of triples $\{(s_t, r_{t+1}, s_{t+1})\}_t$.
  • This means TD learning can learn online: a single step of experience (state, reward, next state) is enough to perform an update.

The TD update rule

  1. Update the visited state:

    $v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \Big] \quad (1)$

  2. Leave the unvisited states unchanged:

    $v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \quad (2)$

Meaning

  • The update drives the value estimate of the current state $s_t$ toward the TD target:

    $\text{TD Target} = r_{t+1} + \gamma v_t(s_{t+1}),$

    • i.e., "one-step reward + discounted value of the next state".
  • The size of the update is governed by the TD error:

    $\delta_t = \big(r_{t+1} + \gamma v_t(s_{t+1})\big) - v_t(s_t).$

    • It measures the discrepancy between the prediction and the actual one-step observation.
  • The learning rate $\alpha_t(s_t)$ determines the update magnitude.

    • A larger $\alpha$ makes updates faster but less stable.
    • A smaller $\alpha$ makes updates slower but more stable (a code sketch of update (1)–(2) follows).
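Update (1)–(2) can be written as a one-line tabular routine. The following sketch assumes a small finite state space indexed by integers; the state indices, reward, and hyperparameters in the usage example are hypothetical.

```python
import numpy as np

def td0_update(v, s_t, r_tp1, s_tp1, alpha, gamma):
    """One tabular TD(0) step: only the visited state s_t is updated (eqs. (1)-(2))."""
    td_target = r_tp1 + gamma * v[s_tp1]
    td_error = v[s_t] - td_target
    v[s_t] = v[s_t] - alpha * td_error
    return v

# Hypothetical usage on a 4-state problem with a made-up transition (s_t=1, r=0.5, s_{t+1}=2):
v = np.zeros(4)
v = td0_update(v, s_t=1, r_tp1=0.5, s_tp1=2, alpha=0.1, gamma=0.9)
print(v)  # only v[1] has changed; all other states keep their previous values
```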

Algorithm properties

  • The TD algorithm can be annotated as

    $v_{t+1}(s_t) = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t)\Big[\underbrace{v_t(s_t) - \big(\underbrace{r_{t+1} + \gamma v_t(s_{t+1})}_{\text{TD target } \bar v_t}\big)}_{\text{TD error } \delta_t}\Big], \quad (3)$

    • Here,

      $\bar v_t \doteq r_{t+1} + \gamma v(s_{t+1})$

      • is called the TD target.

      $\delta_t \doteq v(s_t) - [r_{t+1} + \gamma v(s_{t+1})] = v(s_t) - \bar v_t$

      • is called the TD error.
    • It is clear that the new estimate $v_{t+1}(s_t)$ is a combination of the current estimate $v_t(s_t)$ and the TD error.

  • First, why is $\bar v_t$ called the TD target?

    • That is because the algorithm drives $v(s_t)$ towards $\bar v_t$.

    • To see that,

      $v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)[v_t(s_t) - \bar v_t]$

      $\implies v_{t+1}(s_t) - \bar v_t = v_t(s_t) - \bar v_t - \alpha_t(s_t)[v_t(s_t) - \bar v_t]$

      $\implies v_{t+1}(s_t) - \bar v_t = [1 - \alpha_t(s_t)][v_t(s_t) - \bar v_t]$

      $\implies |v_{t+1}(s_t) - \bar v_t| = |1 - \alpha_t(s_t)|\,|v_t(s_t) - \bar v_t|$

    • Since $\alpha_t(s_t)$ is a small positive number, we have

      $0 < 1 - \alpha_t(s_t) < 1.$

      • Therefore,

        $|v_{t+1}(s_t) - \bar v_t| \le |v_t(s_t) - \bar v_t|,$

      • which means $v(s_t)$ is driven towards $\bar v_t$!

  • Second, what is the interpretation of the TD error?

    $\delta_t = v(s_t) - [r_{t+1} + \gamma v(s_{t+1})]$

    • It is a difference between two consecutive time steps.

    • It reflects the deficiency between $v_t$ and $v_\pi$.
      To see that, denote

      $\delta_{\pi,t} \doteq v_\pi(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})].$

    • Note that

      $\mathbb{E}[\delta_{\pi,t} \mid S_t = s_t] = v_\pi(s_t) - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t] = 0.$

      • If $v_t = v_\pi$, then $\delta_t$ should be zero (in the expectation sense).
      • Hence, if $\delta_t$ is not zero, then $v_t$ is not equal to $v_\pi$.
    • The TD error can be interpreted as an innovation, i.e., new information obtained from the experience $(s_t, r_{t+1}, s_{t+1})$.

  • Other properties:

    • The TD algorithm in (3) only estimates the state value of a given policy.
      • It does not estimate the action values.
      • It does not search for optimal policies.
    • Later, we will see how to estimate action values and then search for optimal policies.
    • Nonetheless, the TD algorithm in (3) is fundamental for understanding the core idea.

Explanation of TD Algorithm Properties

  1. The TD update rule revisited
    • The core TD update is:

      $v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big].$

      • It consists of three parts:
        • the current estimate: $v_t(s_t)$
        • the TD target: $\bar v_t = r_{t+1} + \gamma v_t(s_{t+1})$
        • the TD error: $\delta_t = v_t(s_t) - \bar v_t$
    • Hence the update is essentially:

      $v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\delta_t,$

    • i.e., the current estimate minus a scaled version of its gap to the TD target.

  2. Why is $\bar v_t$ called the TD target?
    • Intuition

      • The algorithm aims to make $v(s_t)$ approach $\bar v_t$.
      • Every update shrinks the gap between the two.
    • Derivation

      $v_{t+1}(s_t) - \bar v_t = [1 - \alpha_t(s_t)]\,[v_t(s_t) - \bar v_t].$

      • Since the learning rate satisfies $0 < \alpha_t(s_t) < 1$, every iteration gives

        $|v_{t+1}(s_t) - \bar v_t| \le |v_t(s_t) - \bar v_t|.$

      • This shows that the estimate converges step by step toward the TD target.

  3. Interpretation of the TD error
    • Definition:

      $\delta_t = v(s_t) - \big(r_{t+1} + \gamma v(s_{t+1})\big).$

    • Meaning:

      • It is the difference between two consecutive time steps.
      • It reflects the gap between the current estimate $v_t$ and the true value function $v_\pi$.
    • Key conclusion:

      • $\mathbb{E}[\delta_{\pi,t} \mid S_t = s_t] = 0$; in the expectation sense, the TD error vanishes if and only if $v_t = v_\pi$.
      • If $\delta_t = 0$, the estimate agrees with the observation;
      • if $\delta_t \neq 0$, the estimate deviates from the true value.
    • Intuition:

      • The TD error can be understood as an innovation, i.e., the new information obtained from each piece of experience.
      • The update gradually corrects the estimate through the TD error.

Further Explanation: TD Target & TD Error

  1. TD target $\bar v_t$

    $\bar v_t \doteq r_{t+1} + \gamma v(s_{t+1})$

    • Intuitive meaning:
      • It is an estimate of the return from the next moment on, composed of the immediate reward $r_{t+1}$ and the value of the next state $v(s_{t+1})$.
      • It can be viewed as a one-step lookahead: start from $s_t$, take one step, collect the immediate reward, and add the estimated value of the next state.
    • Analogy:
      • If $v(s_t)$ is our current prediction of a house price, then $\bar v_t$ is a corrected target based on "the latest transaction price plus the expected future trend".
      • Each update moves the prediction a little closer to this corrected target.
    • Why is it a target?
      • Because in the Bellman equation:

        $v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s],$

        • the TD target $\bar v_t$ is exactly a sample approximation of the right-hand side.
      • So the TD target is a local realization of the Bellman equation.

  2. TD error $\delta_t$

    $\delta_t = v(s_t) - \big(r_{t+1} + \gamma v(s_{t+1})\big) = v(s_t) - \bar v_t$

    • Intuitive meaning:
      • It is the gap between the current estimate and the TD target.
      • If the gap is zero, the prediction is perfect; otherwise its magnitude and sign tell us in which direction to update.
    • Interpretation 1: prediction error
      • $\delta_t$ asks: "How far is my prediction $v(s_t)$ from the prediction $\bar v_t$ corrected by the reward I actually observed?"
    • Interpretation 2: learning signal
      • $\delta_t$ is the driving force of the update:

        $v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\,\delta_t$

        • If $\delta_t > 0$: the prediction is too high and should decrease;
        • if $\delta_t < 0$: the prediction is too low and should increase.
    • Interpretation 3: innovation
      • In statistics, an innovation is the difference between new information and the existing prediction.
      • In TD learning, $\delta_t$ is exactly the new information obtained from experience; it measures the difference between "what we just learned" and "what we previously believed".
  3. Relationship between the TD target and the TD error

    • The TD target provides the learning objective.
    • The TD error measures how far the current prediction is from that objective.
  • The update rule is simply:

    $\text{New Estimate} = \text{Old Estimate} - \text{Learning Rate} \times \text{TD Error}.$

    • In other words:
      • TD target = "where I should move";
      • TD error = "how far I am from the target";
      • TD update = "take a small step toward the target" (a tiny numeric example follows).
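A tiny numeric illustration of these three quantities, with made-up values for $v(s_t)$, $r_{t+1}$, $v(s_{t+1})$, $\gamma$, and $\alpha$; the sign convention follows $\delta_t = v(s_t) - \bar v_t$ as above.

```python
# Made-up values: v(s_t)=4.0, r_{t+1}=1.0, v(s_{t+1})=3.0, gamma=0.9, alpha=0.1.
v_st, r, v_stp1 = 4.0, 1.0, 3.0
gamma, alpha = 0.9, 0.1

td_target = r + gamma * v_stp1      # "where I should move": 1.0 + 0.9*3.0 = 3.7
td_error = v_st - td_target         # "how far I am from the target": 4.0 - 3.7 = 0.3
v_st_new = v_st - alpha * td_error  # "take a small step": 4.0 - 0.1*0.3 = 3.97

print(td_target, td_error, v_st_new)  # the prediction was too high, so it is nudged down
```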

The idea of the algorithm

  • First, a new expression of the Bellman equation

    • The definition of the state value of $\pi$ is

      $v_\pi(s) = \mathbb{E}[R + \gamma G \mid S = s], \quad s \in \mathcal{S} \quad (4)$

    • where $G$ is the discounted return. Since

      $\mathbb{E}[G \mid S = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s,a) v_\pi(s') = \mathbb{E}[v_\pi(S') \mid S = s],$

    • where $S'$ is the next state, we can rewrite (4) as

      $v_\pi(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid S = s], \quad s \in \mathcal{S}. \quad (5)$

    • Equation (5) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation, an important tool to design and analyze TD algorithms.

    It says that the value of the current state can be expressed as "one-step reward plus the expected value of the next state".

  • Second, solve the Bellman equation in (5) using the RM algorithm

    • In particular, by defining

      $g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s],$

    • we can rewrite (5) as

      $g(v(s)) = 0.$

    • Since we can only obtain the samples $r$ and $s'$ of $R$ and $S'$, the noisy observation we have is

      $\tilde g(v(s)) = v(s) - [r + \gamma v_\pi(s')] = \underbrace{\big(v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s]\big)}_{g(v(s))} + \underbrace{\big(\mathbb{E}[R + \gamma v_\pi(S') \mid s] - [r + \gamma v_\pi(s')]\big)}_{\eta}.$

    Solving the Bellman equation with the RM algorithm

    • Formally, the Bellman equation is a fixed-point equation:

      $v(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid s].$

    • We can rewrite it as a root-finding problem:

      $g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s] = 0.$

    • Why rewrite it as $g(v(s)) = 0$?

      • So that we can apply the stochastic approximation (Robbins–Monro, RM) algorithm.
      • RM is designed precisely for solving root-finding problems in the presence of noise.
    • What RM does

      • We cannot compute $\mathbb{E}[R + \gamma v_\pi(S') \mid s]$ directly; we can only approximate it through samples $(r, s')$. We therefore define:

        $\tilde g(v(s)) = v(s) - [r + \gamma v_\pi(s')].$

      • The RM update is:

        $v_{k+1}(s) = v_k(s) - \alpha_k \tilde g(v_k(s)) = v_k(s) - \alpha_k \Big( v_k(s) - [r_k + \gamma v_\pi(s'_k)] \Big).$

      • The meaning of this step:

        • use the sample $(r_k, s'_k)$ to construct the approximate "gradient" $\tilde g$;
        • then keep updating $v_k(s)$ so that it gradually approaches the solution of the Bellman equation.
  • Therefore, the RM algorithm for solving $g(v(s)) = 0$ is

    $v_{k+1}(s) = v_k(s) - \alpha_k \tilde g(v_k(s)) = v_k(s) - \alpha_k \Big( v_k(s) - [r_k + \gamma v_\pi(s'_k)] \Big), \quad k = 1,2,3,\ldots \quad (6)$

    • where $v_k(s)$ is the estimate of $v_\pi(s)$ at the $k$th step, and $r_k, s'_k$ are the samples of $R, S'$ obtained at the $k$th step.
    • The RM algorithm in (6) relies on two assumptions that deserve special attention:
      • we must have the experience set $\{(s, r, s')\}$ for $k = 1,2,3,\ldots$;
      • we assume that $v_\pi(s')$ is already known for any $s'$.
    • To remove these two assumptions, we can modify the RM algorithm:
      • one modification is to change $\{(s,r,s')\}$ to $\{(s_t, r_{t+1}, s_{t+1})\}$ so that the algorithm can utilize the sequential samples in an episode;
      • another modification is to replace $v_\pi(s')$ by an estimate of it, because we do not know it in advance.

    Connection to TD learning

    • The RM algorithm has two limitations:

      1. it requires $v_\pi(s')$, which we do not actually know;
      2. it requires samples $(s, r, s')$, ideally collected over whole episodes.
    • To resolve these two issues:

      • we replace the true value $v_\pi(s')$ with the current estimate $v(s')$;
      • we use the sequential samples $(s_t, r_{t+1}, s_{t+1})$ to update step by step.
    • This yields exactly the TD update rule:

      $v_{t+1}(s_t) = v_t(s_t) + \alpha_t \big[ r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \big].$

      • Here:
        • TD target: $\bar v_t = r_{t+1} + \gamma v_t(s_{t+1})$
        • TD error: $\delta_t = \bar v_t - v_t(s_t)$

    Intuition

    • Along a trajectory, every visited state $s$ is updated with the TD rule until the estimates converge.
    • In this sense, "all the states help each other converge gradually".

Algorithm

  • Algorithm convergence

    • By the TD algorithm (1), $v_t(s)$ converges with probability 1 to $v_\pi(s)$ for all $s \in \mathcal{S}$ as $t \to \infty$, if $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$.
    • Remarks:
      • This theorem says that the state value can be found by the TD algorithm for a given policy $\pi$.
      • The conditions $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ must hold for all $s \in \mathcal{S}$. At time step $t$, if $s = s_t$, meaning that $s$ is visited at time $t$, then $\alpha_t(s) > 0$; otherwise $\alpha_t(s) = 0$ for all other $s \ne s_t$. This requires that every state be visited an infinite (or sufficiently large) number of times.
      • The learning rate $\alpha$ is often selected as a small constant. In this case, the condition $\sum_t \alpha_t^2(s) < \infty$ is no longer valid. When $\alpha$ is constant, it can still be shown that the algorithm converges in the sense of expectation.

    Statement of the theorem

    • Under suitable conditions, TD learning converges to the true state value function $v_\pi(s)$.
    • The conditions are:
      • $\sum_t \alpha_t(s) = \infty$ (the learning rates must remain large enough to allow infinitely many effective updates);
      • $\sum_t \alpha_t^2(s) < \infty$ (the learning rates must not stay too large, so that the updates settle down instead of oscillating).

    Explanation

    • The first condition guarantees that the TD algorithm keeps absorbing new information and does not stop learning prematurely.
    • The second condition guarantees that the TD algorithm does not oscillate around the solution because of overly large update steps.

    Choice of learning rate

    • In practice, $\alpha_t(s)$ is usually set to a small constant (e.g., $0.1$).
    • Strictly speaking this violates $\sum_t \alpha_t^2(s) < \infty$, but convergence to a neighborhood of $v_\pi(s)$ can still be guaranteed in the expectation sense.
  • Algorithm properties

    | TD/Sarsa learning | MC learning |
    | --- | --- |
    | Online: TD learning is online. It can update the state/action values immediately after receiving a reward. | Offline: MC learning is offline. It has to wait until an episode has been completely collected. |
    | Continuing tasks: since TD learning is online, it can handle both episodic and continuing tasks. | Episodic tasks: since MC learning is offline, it can only handle episodic tasks that have terminal states. |
    | Bootstrapping: TD bootstraps, because the update of a value relies on the previous estimate of that value. Hence, it requires initial guesses. | Non-bootstrapping: MC does not bootstrap, because it can directly estimate state/action values without any initial guess. |
    | Low estimation variance: TD has lower estimation variance than MC because it involves fewer random variables. For instance, Sarsa only requires $R_{t+1}, S_{t+1}, A_{t+1}$. | High estimation variance: to estimate $q_\pi(s_t, a_t)$, we need samples of $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$. Suppose the length of each episode is $L$; then each sampled return involves on the order of $L$ random variables. |

    TD (temporal-difference) learning and MC (Monte Carlo) learning are both model-free methods, but they differ significantly. The sketch after this list contrasts the two update styles on one episode.

    1. Online vs offline
      • TD updates online: as soon as the reward $r_{t+1}$ is received, $v(s_t)$ can be updated.
      • MC updates offline: it must wait until the whole episode ends before the return can be computed and the update applied.
      • Implication: TD suits real-time learning, while MC suits settings where complete trajectories are collected.
    2. Task types
      • TD can handle both episodic tasks (with terminal states) and continuing tasks (without terminal states).
      • MC can only handle episodic tasks, because it needs the complete return as the update target.
      • Implication: TD is more flexible and fits systems that run indefinitely (e.g., an agent learning in a non-terminating environment).
    3. Bootstrapping
      • TD uses bootstrapping: the update relies on current estimates (such as $v(s_{t+1})$).
      • MC does not bootstrap: it directly uses the complete return $G_t$.
      • Implication: TD can update sooner because it does not wait for the full episode, but it depends on the initial values.
    4. Estimation variance
      • TD: low variance, because each update depends only on one immediate reward and the next-state estimate.
      • MC: high variance, because the return contains many random variables, making the (unbiased) estimate less stable.
      • Implication: TD usually converges faster and more stably than MC, but bootstrapping with approximate estimates may introduce bias.
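Below is a minimal sketch contrasting the two update styles on a single hand-made episode. The episode, the step size, and the incremental constant-$\alpha$ form of the MC update are illustrative assumptions; the point is only that TD bootstraps from $v(s_{t+1})$ after every transition, whereas MC waits for the full return $G_t$.

```python
import numpy as np

gamma, alpha = 0.9, 0.1
episode = [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 3)]   # (s_t, r_{t+1}, s_{t+1}); state 3 is terminal

# --- TD(0): online, bootstrapping from the current estimate of the next state ---
v_td = np.zeros(4)
for s, r, s_next in episode:
    v_td[s] -= alpha * (v_td[s] - (r + gamma * v_td[s_next]))

# --- MC: offline, uses the full discounted return from each visited state ---
v_mc = np.zeros(4)
G = 0.0
for s, r, _ in reversed(episode):                   # accumulate returns backwards
    G = r + gamma * G
    v_mc[s] -= alpha * (v_mc[s] - G)                # move the estimate toward G_t

print(v_td, v_mc)
```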

Sarsa

Base Sarsa

Sarsa algorithm

  • First, our aim is to estimate the action values of a given policy $\pi$.

  • Suppose we have some experience $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$.

  • We can use the following Sarsa algorithm to estimate the action values:

    $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big) \Big],$

    $q_{t+1}(s,a) = q_t(s,a), \quad \forall (s,a) \neq (s_t, a_t),$

    • where $t = 0,1,2,\ldots$
      • $q_t(s_t,a_t)$ is an estimate of $q_\pi(s_t, a_t)$;
      • $\alpha_t(s_t,a_t)$ is the learning rate depending on $(s_t,a_t)$.

    The Sarsa update rule

    • Current estimate: $q_t(s_t,a_t)$
    • TD target: $r_{t+1} + \gamma q_t(s_{t+1},a_{t+1})$
      • $r_{t+1}$: the immediate reward obtained after taking action $a_t$;
      • $\gamma q_t(s_{t+1}, a_{t+1})$: the estimated long-term return of continuing from $(s_{t+1}, a_{t+1})$.
    • TD error: $q_t(s_t,a_t) - \text{TD Target}$

    At every update, Sarsa pulls $q(s_t,a_t)$ a little closer to the target "one-step reward + estimated value of the next action".

    Intuition

    • We have an old estimate $q_t(s_t,a_t)$ of how good $(s_t,a_t)$ is.
    • We then observe one real reward $r_{t+1}$ and the estimated value $q_t(s_{t+1},a_{t+1})$ of the next state-action pair $(s_{t+1},a_{t+1})$.
    • Together they provide a new target (the TD target).
    • Sarsa does not replace the old estimate outright; it gradually pulls $q(s_t,a_t)$ toward this target.
  • Relationship between Sarsa and TD

    • Replacing $v(s)$ in the TD algorithm with $q(s,a)$ yields Sarsa.
    • In other words, Sarsa is the action-value version of TD learning.
  • Mathematical expression

    • The Sarsa algorithm solves:

      $q_\pi(s,a) = \mathbb{E}[R + \gamma q_\pi(S', A') \mid s,a], \quad \forall s,a.$

    • This is another expression of the Bellman equation, written in terms of action values.

Theorem (Convergence of Sarsa learning)

By the Sarsa algorithm, $q_t(s,a)$ converges with probability $1$ to the action value $q_\pi(s,a)$ as $t \to \infty$, for all $(s,a)$, if $\sum_t \alpha_t(s,a) = \infty$ and $\sum_t \alpha_t^2(s,a) < \infty$.

  • Remarks:
    • This theorem says the action value can be found by Sarsa for a given policy $\pi$.

Pseudocode: Policy searching by Sarsa

  • For each episode, do
    • If the current $s_t$ is not the target state, do
      • Collect the experience $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$:
        • Take action $a_t \sim \pi_t(s_t)$

        • Generate $r_{t+1}, s_{t+1}$

        • Take action $a_{t+1} \sim \pi_t(s_{t+1})$

        • Update the q-value:

          $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big( r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}) \big) \Big]$

        • Update the policy:

          $\pi_{t+1}(a \mid s_t) = 1 - \dfrac{\epsilon}{|\mathcal{A}|} (|\mathcal{A}| - 1), \quad \text{if } a = \arg\max_a q_{t+1}(s_t, a)$

          $\pi_{t+1}(a \mid s_t) = \dfrac{\epsilon}{|\mathcal{A}|}, \quad \text{otherwise}$

The Sarsa pseudocode can be summarized in three steps:

  1. Sample an interaction $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$.
  2. Update the Q-value: pull it toward "immediate reward + next-step Q".
  3. Update the policy: adjust the action probabilities according to the latest Q-values.

This realizes an online reinforcement learning process that learns Q-values and improves the policy at the same time. A Python sketch of this loop is given below.
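A compact Python sketch of this loop, assuming a tabular Q-table (NumPy array of shape `[num_states, num_actions]`) and a hypothetical environment interface `env_step(s, a) -> (r, s_next, done)`; none of these names come from the course material.

```python
import numpy as np

def epsilon_greedy(q_row, eps, rng):
    """Sample an action from the epsilon-greedy policy induced by one row of the Q-table.
    The greedy action gets probability 1 - eps + eps/|A|; every other action gets eps/|A|."""
    n = len(q_row)
    probs = np.full(n, eps / n)
    probs[np.argmax(q_row)] += 1.0 - eps
    return rng.choice(n, p=probs)

def sarsa_episode(env_step, q, s0, alpha=0.1, gamma=0.9, eps=0.1, max_steps=100, rng=None):
    """Run one episode of Sarsa; `env_step(s, a) -> (r, s_next, done)` is an assumed interface."""
    if rng is None:
        rng = np.random.default_rng()
    s = s0
    a = epsilon_greedy(q[s], eps, rng)
    for _ in range(max_steps):
        r, s_next, done = env_step(s, a)
        a_next = epsilon_greedy(q[s_next], eps, rng)            # the trailing A in (S, A, R, S', A')
        target = r + gamma * (0.0 if done else q[s_next, a_next])
        q[s, a] -= alpha * (q[s, a] - target)                   # pull q(s_t, a_t) toward the TD target
        if done:
            break
        s, a = s_next, a_next
    return q
```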

Remarks about Sarsa

  • The policy for $s_t$ is updated immediately after $q(s_t,a_t)$ is updated → this follows Generalized Policy Iteration (GPI).
  • The policy is $\epsilon$-greedy instead of greedy → it balances exploitation and exploration.

Core Idea vs Complication

  • Core idea: use an algorithm to solve the Bellman equation of a given policy.
  • Complication: it emerges when we try to find optimal policies and to do so efficiently.

Expected Sarsa

Algorithm

$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]\big)\Big],$

$q_{t+1}(s, a) = q_t(s, a), \quad \forall (s,a) \neq (s_t,a_t),$

  • where

    $\mathbb{E}[q_t(s_{t+1}, A)] = \sum_a \pi_t(a \mid s_{t+1}) q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$

  • is the expected value of $q_t(s_{t+1}, a)$ under policy $\pi_t$.

Compared to Sarsa:

  • The TD target is changed from $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ (as in Sarsa) to $r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]$ (as in Expected Sarsa); this change is isolated in the short sketch below.
  • It needs more computation per step. But it is beneficial in the sense that it reduces the estimation variance, because it reduces the random variables involved from $\{s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\}$ in Sarsa to $\{s_t, a_t, r_{t+1}, s_{t+1}\}$.

What does the algorithm do mathematically?

  • Expected Sarsa is a stochastic approximation algorithm for solving the following equation:

    $q_\pi(s,a) = \mathbb{E}\Big[ R_{t+1} + \gamma \, \mathbb{E}_{A_{t+1} \sim \pi(\cdot \mid S_{t+1})}\big[q_\pi(S_{t+1}, A_{t+1})\big] \,\Big|\, S_t=s, A_t=a \Big], \quad \forall s,a.$

  • The above equation is another expression of the Bellman equation:

    $q_\pi(s,a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s, A_t=a].$
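The following sketch isolates how the Expected Sarsa target is formed; the policy row and q-values are made-up numbers for illustration only.

```python
import numpy as np

def expected_sarsa_target(r, q_next_row, policy_next_row, gamma):
    """TD target of Expected Sarsa: r + gamma * sum_a pi(a | s') * q(s', a)."""
    return r + gamma * np.dot(policy_next_row, q_next_row)

# Hypothetical numbers: 3 actions, an epsilon-greedy-like policy row, and one row of q-values.
q_row = np.array([1.0, 2.0, 0.5])
pi_row = np.array([0.05, 0.90, 0.05])   # assumed pi_t(. | s_{t+1}); must sum to 1
print(expected_sarsa_target(r=1.0, q_next_row=q_row, policy_next_row=pi_row, gamma=0.9))
```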

n-step Sarsa

Introduction

  • The definition of the action value is

    $q_\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a].$

  • The discounted return $G_t$ can be written in different forms as

    • Sarsa

      $G_t^{(1)} = R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}),$

      $G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}),$

      $\vdots$

    • n-step Sarsa

      $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}),$

      $\vdots$

    • MC

      $G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$

  • It should be noted that

    $G_t = G_t^{(1)} = G_t^{(2)} = G_t^{(n)} = G_t^{(\infty)},$

  • where the superscripts merely indicate the different decomposition structures of $G_t$.

Algorithm analysis

  • Sarsa aims to solve

    $q_\pi(s,a) = \mathbb{E}[G_t^{(1)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s,a].$

  • MC learning aims to solve

    $q_\pi(s,a) = \mathbb{E}[G_t^{(\infty)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s,a].$

  • An intermediate algorithm called n-step Sarsa aims to solve

    $q_\pi(s,a) = \mathbb{E}[G_t^{(n)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s,a].$

  • The algorithm of n-step Sarsa is

    $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\Big].$

  • n-step Sarsa is more general: it reduces to the (one-step) Sarsa algorithm when $n=1$ and to the MC learning algorithm when $n=\infty$.

Properties

  • n-step Sarsa needs

    $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n}).$

  • Since $(r_{t+n}, s_{t+n}, a_{t+n})$ has not been collected at time $t$, we cannot implement n-step Sarsa at step $t$. However, we can wait until time $t+n$ to update the q-value of $(s_t,a_t)$:

    $q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t,a_t)\Big[q_{t+n-1}(s_t,a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n})\big)\Big].$

  • Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, its performance is a blend of the two:

    • if $n$ is large, its performance is close to MC learning and hence has a large variance but a small bias;
    • if $n$ is small, its performance is close to Sarsa and hence has a relatively large bias (due to the initial guess) and relatively low variance.
  • Finally, n-step Sarsa is also a policy-evaluation algorithm. It can be combined with a policy improvement step to search for optimal policies. A small sketch of the n-step target follows.
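A small helper that builds the n-step target from the collected rewards and the bootstrapped tail value; the rewards, tail value, and $\gamma$ in the example are hypothetical.

```python
def n_step_sarsa_target(rewards, q_tail, gamma):
    """n-step TD target: r_{t+1} + gamma r_{t+2} + ... + gamma^{n-1} r_{t+n} + gamma^n q(s_{t+n}, a_{t+n}).

    `rewards` is the list [r_{t+1}, ..., r_{t+n}] and `q_tail` is q(s_{t+n}, a_{t+n})."""
    target = 0.0
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r          # discounted sum of the n collected rewards
    target += (gamma ** len(rewards)) * q_tail  # bootstrapped tail value
    return target

# Hypothetical 3-step example: n = 3 rewards and a bootstrapped tail value of 1.5.
print(n_step_sarsa_target([1.0, 0.0, 2.0], q_tail=1.5, gamma=0.9))
# With n = 1 this recovers the Sarsa target; letting n grow toward the episode length recovers the MC return.
```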


Summary

Sarsa is on-policy: it learns and evaluates the very policy being executed. Q-learning is off-policy: it can sample with an arbitrary behavior policy while still converging toward the optimal policy. TD methods form the unifying framework: through the TD target and the TD error, the value estimate is driven step by step toward the solution of the Bellman equation.

