
RL【7-2】：Temporal-difference Learning

news 2025/9/11 5:41:07

Series Contents

Fundamental Tools

RL【1】：Basic Concepts
RL【2】：Bellman Equation
RL【3】：Bellman Optimality Equation

Algorithm

RL【4】：Value Iteration and Policy Iteration
RL【5】：Monte Carlo Learning
RL【6】：Stochastic Approximation and Stochastic Gradient Descent

Method

RL【7-1】：Temporal-difference Learning
RL【7-2】：Temporal-difference Learning

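Contents

Series Contents
- Fundamental Tools
- Algorithm
- Method
Preface
Q-learning
Off-policy vs On-policy
Summary

Preface

This series records my study notes on Professor Shiyu Zhao's Bilibili course "Mathematical Foundations of Reinforcement Learning" (【强化学习的数学原理】). For the course material itself, see:
Bilibili video: 【【强化学习的数学原理】课程：从零开始到透彻理解（完结）】
GitHub course material: Book-Mathematical-Foundation-of-Reinforcement-Learning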

Q-learning

Algorithm

The Q-learning algorithm is

$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}} q_t(s_{t+1}, a) \big] \Big],$

$q_{t+1}(s, a) = q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t).$

Q-learning is very similar to Sarsa. The two differ only in the TD target:
- The TD target in Q-learning is $r_{t+1} + \gamma \max_{a \in \mathcal{A}} q_t(s_{t+1}, a)$.
- The TD target in Sarsa is $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$.
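
The one-line difference between the two TD targets can be sketched in Python. This is a minimal tabular sketch, not code from the course: the function names, the NumPy layout `q[state, action]`, the example q-table, and the default `alpha` and `gamma` values are all assumptions made for illustration.

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step; q has shape (num_states, num_actions)."""
    td_target = r + gamma * np.max(q[s_next])   # max over all actions in s_next
    q[s, a] -= alpha * (q[s, a] - td_target)    # only the visited pair (s, a) changes
    return q

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One tabular Sarsa step; differs from Q-learning only in the TD target."""
    td_target = r + gamma * q[s_next, a_next]   # uses the action actually taken next
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q

# Hypothetical q-table: in state 1, action 1 looks better than action 0.
q = np.zeros((2, 2))
q[1] = [1.0, 3.0]

q_ql = q_learning_update(q.copy(), s=0, a=0, r=1.0, s_next=1)
q_sa = sarsa_update(q.copy(), s=0, a=0, r=1.0, s_next=1, a_next=0)
print(q_ql[0, 0], q_sa[0, 0])
```

With the same experience, Q-learning bootstraps from the larger value 3.0, while Sarsa bootstraps from the value of the action that was actually sampled (here action 0), so the two updated estimates differ.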

What does Q-learning do mathematically?

It aims to solve

$q(s, a) = \mathbb{E} \Big[ R_{t+1} + \gamma \max_a q(S_{t+1}, a) \;\Big|\; S_t = s, A_t = a \Big], \quad \forall s, a.$
This is the Bellman optimality equation expressed in terms of action values.

Further explanation of Q-learning

Update rule revisited

The Q-learning update can be rewritten in a more intuitive form:

$q_{t+1}(s_t, a_t) = (1 - \alpha_t) q_t(s_t, a_t) + \alpha_t \Big[ r_{t+1} + \gamma \max_{a} q_t(s_{t+1}, a) \Big].$

The new Q-value is a weighted average of the old estimate and the TD target.

Here the TD target is

$r_{t+1} + \gamma \max_{a} q_t(s_{t+1}, a).$

In other words, each Q-learning update pulls the current estimate $q_t(s_t, a_t)$ a little closer to the target "immediate reward plus the discounted estimate of the best action in the next state".
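
As a quick numerical check of the weighted-average form, take illustrative values that are not from the course: $\alpha_t = 0.1$, $\gamma = 0.9$, $q_t(s_t, a_t) = 2$, $r_{t+1} = 1$, and $\max_a q_t(s_{t+1}, a) = 5$. The TD target is

$1 + 0.9 \times 5 = 5.5,$

so the update gives

$q_{t+1}(s_t, a_t) = 0.9 \times 2 + 0.1 \times 5.5 = 2.35.$

The error-correction form gives the same value, $2 - 0.1 \times (2 - 5.5) = 2.35$: the estimate moves one tenth of the way from 2 toward 5.5.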

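Difference from Sarsa

In Sarsa, the TD target is

$r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}),$

that is, the update uses the next action $a_{t+1}$ that is actually taken.

In Q-learning, the TD target is

$r_{t+1} + \gamma \max_a q_t(s_{t+1}, a),$

that is, the update uses the estimated value of the best action in the next state and does not depend on the action sampled by the policy.

Therefore:

- Sarsa is on-policy: its update refers to the action sampled from the current policy.
- Q-learning is off-policy: its update always targets the greedy action in the next state, regardless of which action is actually taken there.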

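Mathematical meaning

Q-learning aims to solve

$q(s, a) = \mathbb{E}\Big[ R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \;\Big|\; S_t = s, A_t = a \Big], \quad \forall s, a.$

This is exactly the Bellman optimality equation.

Q-learning is therefore, in essence, a stochastic approximation algorithm for solving the Bellman optimality equation.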

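Convergence

Under suitable conditions (for example, learning rates satisfying $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, and every state-action pair being visited infinitely often), Q-learning converges to the optimal action-value function $q^*(s,a)$.

Once $q^*(s,a)$ is available, the optimal policy follows directly:

$\pi^*(s) = \arg\max_a q^*(s, a).$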
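
Reading a greedy policy off a q-table is a single argmax per state. The sketch below assumes a tabular setting with a NumPy array indexed as `q[state, action]`; the function name and the numbers in `q_star` are hypothetical and only illustrate the formula above.

```python
import numpy as np

def greedy_policy(q):
    """Return the greedy action index for every state of a (states x actions) q-table."""
    return np.argmax(q, axis=1)

# Hypothetical optimal action values for 2 states and 3 actions.
q_star = np.array([[0.5, 1.2, 0.3],
                   [2.0, 0.1, 0.4]])
policy = greedy_policy(q_star)
print(policy)
```

`np.argmax` breaks ties by returning the first maximizing index, so when several actions are equally good this sketch simply picks the lowest-numbered one.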

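Summary

- Sarsa updates toward the next action chosen by the current policy, so it is on-policy.
- Q-learning updates toward the assumed best action, so it is off-policy and can learn the optimal policy directly.
- Mathematically, Q-learning approximates the solution of the Bellman optimality equation.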

Pseudocode

Optimal policy search by Q-learning (on-policy version)
- For each episode, do
  - If the current $s_t$ is not the target state, do
    - Collect the experience $(s_t, a_t, r_{t+1}, s_{t+1})$: in particular, take action $a_t$ following $\pi_t(s_t)$ and generate $r_{t+1}, s_{t+1}$.
    - Update q-value:
      
      $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big( r_{t+1} + \gamma \max_a q_t(s_{t+1}, a) \big) \Big]$
    - Update policy:
      
      $\pi_{t+1}(a \mid s_t) = 1 - \frac{\epsilon}{|\mathcal{A}|}(|\mathcal{A}| - 1) \quad \text{if } a = \arg\max_a q_{t+1}(s_t, a)$
      
      $\pi_{t+1}(a \mid s_t) = \frac{\epsilon}{|\mathcal{A}|} \quad \text{otherwise}$
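
The policy-update step of the on-policy version builds an $\epsilon$-greedy distribution from one row of the q-table. A minimal sketch of that step follows; the function name `epsilon_greedy_probs`, the example q-values, and the seed are assumptions for illustration, not part of the course material.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of the epsilon-greedy policy for one state.

    q_row holds the current estimates q(s, a) for every action a in that state.
    """
    n = len(q_row)
    probs = np.full(n, epsilon / n)                          # non-greedy actions
    probs[np.argmax(q_row)] = 1.0 - (epsilon / n) * (n - 1)  # greedy action
    return probs

# Hypothetical q-values for a state with four actions.
probs = epsilon_greedy_probs(np.array([0.2, 1.0, 0.5, 0.1]), epsilon=0.2)

# Sampling an action from this distribution plays the role of "take a_t following pi_t(s_t)".
rng = np.random.default_rng(0)
action = int(rng.choice(len(probs), p=probs))
print(probs, action)
```

Because every action keeps probability at least $\epsilon / |\mathcal{A}|$, the same policy that is being improved also keeps exploring.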
Optimal policy search by Q-learning (off-policy version)
- For each episode $\{s_0, a_0, r_1, s_1, a_1, r_2, \dots\}$ generated by $\pi_b$, do
  - For each step $t = 0, 1, 2, \dots$ of the episode, do
    - Update q-value:
      
      $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big( r_{t+1} + \gamma \max_a q_t(s_{t+1}, a) \big) \Big]$
    - Update target policy:
      
      $\pi_{T,t+1}(a \mid s_t) = 1 \quad \text{if } a = \arg\max_a q_{t+1}(s_t, a)$
      
      $\pi_{T,t+1}(a \mid s_t) = 0 \quad \text{otherwise}$
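
The off-policy version can be tried end to end on a toy problem. The sketch below is an illustration under stated assumptions, not the course's code: the five-state chain environment, the constant step size, the number of episodes, and the choice to treat the target state as terminal (its q-values stay at 0) are all hypothetical. The behavior policy $\pi_b$ is uniformly random, while the learned target policy is greedy.

```python
import numpy as np

N_STATES, N_ACTIONS, TARGET = 5, 2, 4   # hypothetical chain: states 0..4, actions 0=left, 1=right
GAMMA, ALPHA = 0.9, 0.1                 # assumed discount rate and constant step size

def step(s, a):
    """Deterministic toy dynamics: move left or right, reward 1 on reaching TARGET."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return (1.0 if s_next == TARGET else 0.0), s_next

def generate_episode(rng):
    """Roll out the behavior policy pi_b (uniform random) from state 0 until TARGET."""
    s, episode = 0, []
    while s != TARGET:
        a = int(rng.integers(N_ACTIONS))
        r, s_next = step(s, a)
        episode.append((s, a, r, s_next))
        s = s_next
    return episode

rng = np.random.default_rng(0)
q = np.zeros((N_STATES, N_ACTIONS))     # q[TARGET] is never updated (treated as terminal)
for _ in range(500):
    for s, a, r, s_next in generate_episode(rng):
        td_target = r + GAMMA * q[s_next].max()     # max over actions, not the sampled one
        q[s, a] -= ALPHA * (q[s, a] - td_target)

target_policy = q.argmax(axis=1)        # greedy, deterministic target policy
print(target_policy[:TARGET], np.round(q[:TARGET, 1], 3))
```

Although the data come from a policy that moves left half of the time, the greedy target policy extracted from `q` moves right in every non-target state, which is the point of separating the behavior policy from the target policy.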

Off-policy vs On-policy

Two important concepts in TD learning are on-policy learning and off-policy learning.

There exist two policies in a TD learning task:
- The behavior policy is used to generate experience samples.
- The target policy is constantly updated toward an optimal policy.
On-policy vs off-policy:
- When the behavior policy is the same as the target policy, the learning is called on-policy.
- When they are different, the learning is called off-policy.

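Key concepts

Behavior Policy
Definition: the behavior policy is the policy the agent actually uses to take actions in the environment.
Role: it generates the experience samples $(s_t, a_t, r_{t+1}, s_{t+1})$.
Example:
Under an $\epsilon$-greedy policy, the agent usually chooses the action with the largest current estimated value but occasionally explores other actions at random. This is a typical behavior policy.

Target Policy
Definition: the target policy is the policy the agent wants to learn and approach; it is driven toward the optimal policy.
Role: it defines the target toward which the value function or Q-function is updated.
Examples:
In Sarsa, the target policy is the current $\epsilon$-greedy policy, because Sarsa learns the value of the policy it is executing.
In Q-learning, the target policy is the greedy policy (it always selects the action attaining $\max_a q(s,a)$), even while the behavior policy is still exploring.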

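On-policy Learning
Definition: when the behavior policy equals the target policy, the learning is on-policy.
Characteristic: the agent uses the policy it is currently executing both to generate samples and to update that policy's value estimates.
Representative algorithm: Sarsa.
Pros and cons:
Advantage: the learned policy is the one actually executed, which makes learning stable.
Drawback: the trade-off between exploration and exploitation can slow convergence.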

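Off-policy Learning
Definition: when the behavior policy differs from the target policy, the learning is off-policy.
Characteristic: the agent can generate data with one behavior policy while learning a different, better policy.
Representative algorithm: Q-learning.
Pros and cons:
Advantage: the optimal target policy can be learned while a highly exploratory behavior policy (such as a random policy) generates the data.
Drawback: off-policy methods can be harder to implement and analyze in general, and those that rely on importance sampling can suffer from high variance.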

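Exploratory Policy
Definition: a policy that ensures the agent covers all state-action pairs.
Why it is needed: if the agent always greedily picks the action that currently looks best, it may get stuck in a local optimum.
Typical example: the $\epsilon$-greedy policy.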

Advantages of off-policy learning

It can search for optimal policies based on the experience samples generated by any other policies.
As an important special case, the behavior policy can be selected to be exploratory.
- For example, if we would like to estimate the action values of all state–action pairs, we can use an exploratory policy to generate episodes visiting every state–action pair sufficiently many times.
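
A small experiment makes the coverage argument concrete. The sketch below counts how often each state-action pair is visited under a purely greedy policy and under a uniformly random one. Everything in it is a hypothetical setup chosen for illustration: the four-state ring environment, the fixed all-zero q-table (so the greedy action is always action 0), and the function name `visit_counts`.

```python
import numpy as np

def visit_counts(epsilon, n_steps=2000, n_states=4, n_actions=2, seed=0):
    """Count (state, action) visits on a toy ring MDP under an epsilon-greedy policy.

    The policy is built on a fixed all-zero q-table, so its greedy action is action 0.
    Action 1 moves to the next state on the ring; action 0 stays in place.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros((n_states, n_actions), dtype=int)
    s = 0
    for _ in range(n_steps):
        # With probability epsilon pick uniformly at random, otherwise act greedily.
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else 0
        counts[s, a] += 1
        s = (s + 1) % n_states if a == 1 else s
    return counts

greedy_counts = visit_counts(epsilon=0.0)       # never explores
exploratory_counts = visit_counts(epsilon=1.0)  # uniform random behavior policy
print(greedy_counts)
print(exploratory_counts)
```

The greedy run only ever sees the pair (state 0, action 0), so no algorithm could estimate the other action values from its data; the exploratory run visits every pair many times.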

How to judge if a TD algorithm is on-policy or off-policy?

Method
- First, check what the algorithm does mathematically.
- Second, check what things are required to implement the algorithm.
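There are two core checks:
1. Look at the mathematical form: does the action used in the TD target come from the behavior policy?
  - If the TD target uses an action sampled by the behavior policy (as in Sarsa), the algorithm is on-policy.
  - If the TD target uses a maximization over the action space (as in Q-learning), the algorithm is off-policy.
2. Look at the implementation: can the experience samples be generated by a policy other than the one being learned?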
Sarsa is on-policy.
- First, Sarsa aims to solve the Bellman equation of a given policy $\pi$:
  
  $q_\pi(s,a) = \mathbb{E}[R + \gamma q_\pi(S', A') \mid s,a], \quad \forall s,a,$
  - where $R \sim p(R \mid s,a)$, $S' \sim p(S' \mid s,a)$, and $A' \sim \pi(A' \mid S')$.
  Here $q_\pi(s,a)$ is the action-value function under the current policy $\pi$.
  
  The next action $A'$ is also selected according to the current policy $\pi$.
- Second, the algorithm is
  
  $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big[ r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}) \big] \Big],$
  - which requires $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$:
    - If $(s_t, a_t)$ is given, then $r_{t+1}$ and $s_{t+1}$ do not depend on any policy!
    - $a_{t+1}$ is generated following $\pi_t(s_{t+1})$.
    - $\pi_t$ is both the target and behavior policy.
  In other words, Sarsa uses the action $a_{t+1}$ that is actually executed to update $q(s_t, a_t)$.
Sarsa is on-policy because the next action $a_{t+1}$ used in the update is sampled from the current policy $\pi$.

→ Learning and behavior are consistent.
Monte Carlo learning is on-policy.
- First, the MC method aims to solve
  
  $q_\pi(s,a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots \mid S_t=s, A_t=a], \quad \forall s,a,$
  - where the sample is generated following a given policy $\pi$.
  MC does not perform step-by-step TD updates; it waits until the whole episode has finished and updates with the complete return.
- Second, the implementation of the MC method is
  
  $q(s_t, a_t) \approx r_{t+1} + \gamma r_{t+2} + \cdots$
- A policy is used to generate samples, which are then used to estimate the action values of that policy. Based on the action values, we can improve the policy.
MC is likewise on-policy, because the complete sampled trajectories must be generated by following the given policy $\pi$.
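
The MC estimate above amounts to averaging sampled discounted returns. The following sketch uses an every-visit average, which is an assumption (the notes do not say which variant is meant); the function name, the `(s, a, r)` episode format, and the two example episodes are hypothetical.

```python
from collections import defaultdict

def mc_action_values(episodes, gamma):
    """Every-visit MC estimate: average the discounted return observed after each (s, a)."""
    returns = defaultdict(list)
    for episode in episodes:                 # episode: list of (s, a, r) with r = r_{t+1}
        g = 0.0
        for s, a, r in reversed(episode):    # accumulate the return backwards in time
            g = r + gamma * g
            returns[(s, a)].append(g)
    return {sa: sum(gs) / len(gs) for sa, gs in returns.items()}

# Two hypothetical episodes, both assumed to be generated by the same policy.
episodes = [[(0, 1, 0.0), (1, 1, 1.0)],
            [(0, 1, 0.0), (1, 0, 0.0), (1, 1, 1.0)]]
q_hat = mc_action_values(episodes, gamma=0.9)
print(q_hat)
```

The returns being averaged depend on every later action in the episode, which is why the episodes have to come from the policy whose action values are being estimated.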
Q-learning is off-policy.
- First, Q-learning aims to solve the Bellman optimality equation
  
  $q(s,a) = \mathbb{E} \Big[ R_{t+1} + \gamma \max_a q(S_{t+1}, a) \;\Big|\; S_t=s, A_t=a \Big], \quad \forall s,a.$
- Second, the algorithm is
  
  $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}} q_t(s_{t+1}, a) \big] \Big],$
  - which requires $(s_t, a_t, r_{t+1}, s_{t+1})$:
    - If $(s_t, a_t)$ is given, then $r_{t+1}$ and $s_{t+1}$ do not depend on any policy!
    - The behavior policy to generate $a_t$ from $s_t$ can be anything, provided it visits every state-action pair sufficiently often. The target policy will converge to the optimal policy.
Q-learning is off-policy because its update uses a target that assumes the greedy action is always taken in the next state, independent of the policy actually being executed.

In other words, even when the behavior policy is exploratory (for example, $\epsilon$-greedy), the updates still converge toward the optimal policy.