
RL【5】：Monte Carlo Learning

news 2025/9/10 11:11:25

Series Contents

Fundamental Tools

RL【1】：Basic Concepts
RL【2】：Bellman Equation
RL【3】：Bellman Optimality Equation

Algorithm

RL【4】：Value Iteration and Policy Iteration
RL【5】：Monte Carlo Learning
RL【6】：Stochastic Approximation and Stochastic Gradient Descent

Method

Table of Contents

Series Contents
- Fundamental Tools
- Algorithm
- Method
Preface
Monte Carlo estimation
MC Basic
- Convert policy iteration to be model-free
- The MC Basic algorithm
MC Exploring Starts
MC $\varepsilon$-Greedy
- $\varepsilon$-Greedy Policy
- MC $\varepsilon$-Greedy Algorithm
Summary

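Preface

This series records my study notes on Prof. Shiyu Zhao's Bilibili course 【强化学习的数学原理】 (Mathematical Foundations of Reinforcement Learning). For the course itself, see:
Bilibili video: 【【强化学习的数学原理】课程：从零开始到透彻理解（完结）】
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning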

Monte Carlo estimation

Law of Large Numbers

Consider a random variable $X$. Suppose $\{x_j\}_{j=1}^N$ are i.i.d. samples of $X$. Let $\bar{x} = \frac{1}{N} \sum_{j=1}^N x_j$ be the average of the samples. Then,

$\mathbb{E}[\bar{x}] = \mathbb{E}[X],$

$\mathrm{Var}[\bar{x}] = \frac{1}{N}\mathrm{Var}[X].$
As a result, $\bar{x}$ is an unbiased estimate of $\mathbb{E}[X]$ and its variance decreases to zero as $N$ increases to infinity.
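The two properties above can be checked numerically. The following is a minimal sketch, assuming a fair six-sided die as the random variable $X$ (so $\mathbb{E}[X] = 3.5$); the die, the seed, and the helper name `sample_mean` are illustrative choices, not part of the course material.

```python
import random
import statistics

def sample_mean(n, rng):
    """Average of n rolls of a fair six-sided die, for which E[X] = 3.5."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is repeatable
spread = {}
for n in (10, 100, 1000):
    # Repeat the experiment 200 times to observe how x_bar fluctuates for this N.
    means = [sample_mean(n, rng) for _ in range(200)]
    spread[n] = (statistics.mean(means), statistics.pvariance(means))
    print(n, spread[n])
```

The empirical mean of $\bar{x}$ should stay near 3.5 for every $N$, while its empirical variance should shrink roughly in proportion to $1/N$.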

MC Basic

Convert policy iteration to be model-free

Policy iteration

Policy iteration has two steps in each iteration:
- Policy evaluation: $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$
- Policy improvement: $\pi_{k+1} = \arg\max_\pi \left( r_\pi + \gamma P_\pi v_{\pi_k} \right)$
Policy Evaluation
- Given the policy $\pi_k$, compute its state value function $v_{\pi_k}$.
- This step relies on the transition probabilities $p(s' \mid s, a)$ and the reward function $r(s, a)$.
Policy Improvement
- Given $v_{\pi_k}$, find a better policy $\pi_{k+1}$.
- In practice, for each state $s$, select the action that maximizes the action value.
Key issue: both steps use the model, i.e., the transition probability distribution and the reward distribution of the environment.
The elementwise form of the policy improvement step is:

$\pi_{k+1}(s) = \arg\max_\pi \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s,a) r + \gamma \sum_{s'} p(s' \mid s,a) v_{\pi_k}(s') \right]$

$= \arg\max_\pi \sum_a \pi(a \mid s) q_{\pi_k}(s,a), \quad s \in \mathcal{S}.$
- The key is $q_{\pi_k}(s,a)$!

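In reinforcement learning (RL), the model refers to the dynamics of the environment:

State transition model: $p(s' \mid s, a)$, the probability of moving from state $s$ to $s'$ after taking action $a$.
Reward model: $p(r \mid s, a)$, the distribution of the reward $r$ obtained after taking action $a$ in state $s$.

In other words, if the model is known, the environment can be simulated completely: the next state and the reward can be predicted.

Note: "not knowing the model" does not mean that the actions or the rewards themselves are unknown; it means that their probability distributions are unknown.

For example, the reward received after taking action $a$ can be observed directly;
however, taking the same action in the same state may yield different rewards or lead to different next states, and this distribution is unknown.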

Two expressions of action value:
- Expression 1 requires the model:
  
  $q_{\pi_k}(s,a) = \sum_r p(r \mid s,a) r + \gamma \sum_{s'} p(s' \mid s,a) v_{\pi_k}(s')$
- Expression 2 does not require the model:
  
  $q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$
- Idea to achieve model-free RL:
  - We can use expression 2 to calculate $q_{\pi_k}(s,a)$ based on data (samples or experiences).
Policy improvement requires $q_{\pi_k}(s,a)$. There are two ways to obtain it:
1. The expression that requires the model:
  - $p(r \mid s, a)$ and $p(s' \mid s, a)$, i.e., the model of the environment, must be known.
  - Without the model, it cannot be computed directly.
2. The expression that does not require the model:
  - $G_t$ is the return obtained by taking action $a$ in state $s$ and then following policy $\pi_k$.
  - The transition probabilities need not be known explicitly; it is enough to sample trajectories from the real environment.
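To make expression 1 concrete, here is a small sketch of the model-based computation for a single state-action pair. The probabilities, state names, and state values are made-up numbers for illustration only, and `q_from_model` is a hypothetical helper name.

```python
def q_from_model(p_r, p_s, v, gamma):
    """Expression 1: q(s,a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s').

    p_r maps reward -> probability, p_s maps next state -> probability,
    v maps state -> state value under the current policy.
    """
    expected_reward = sum(prob * r for r, prob in p_r.items())
    expected_next_value = sum(prob * v[s_next] for s_next, prob in p_s.items())
    return expected_reward + gamma * expected_next_value

# Hypothetical model of one (s, a) pair.
p_r = {0.0: 0.5, 1.0: 0.5}      # reward distribution p(r | s, a)
p_s = {"s1": 0.5, "s2": 0.5}    # transition distribution p(s' | s, a)
v = {"s1": 2.0, "s2": 4.0}      # state values v_pi(s')
q_value = q_from_model(p_r, p_s, v, gamma=0.9)
print(q_value)  # 0.5 + 0.9 * 3.0, i.e. about 3.2
```

Every quantity passed to `q_from_model` comes from the model, which is exactly what is unavailable in the model-free setting.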

The procedure of Monte Carlo estimation of action values:

Starting from $(s, a)$, following policy $\pi_k$, generate an episode.
The return of this episode is $g(s, a)$.
$g(s, a)$ is a sample of $G_t$ in

$q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$
Suppose we have a set of episodes and hence $\{ g^{(i)}(s,a) \}_{i=1}^N$. Then,

$q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \approx \frac{1}{N} \sum_{i=1}^N g^{(i)}(s,a).$
Fundamental idea: When the model is unavailable, we can use data.

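The core idea of Monte Carlo: replace the model with data.

Starting from the state-action pair $(s, a)$, follow policy $\pi_k$ to generate a complete trajectory (episode).
Record the return $g(s, a)$ of this trajectory; it is one sample of $G_t$.
Repeat the sampling many times to obtain a set of samples $\{g^{(i)}(s,a)\}$.
Approximate the expectation by the sample average.
Result: even without knowing $p(s' \mid s, a)$ and $p(r \mid s, a)$, we can estimate $q_{\pi_k}(s,a)$ by the Monte Carlo method and thereby obtain a model-free policy iteration.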
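The procedure can be sketched in a few lines. In this illustration the environment is a hypothetical one-step sampler (`sample_return`) whose internal probabilities are hidden from the estimator; the numbers are made up so that the true action value is 3.2, and `mc_estimate` only ever sees sampled returns.

```python
import random

GAMMA = 0.9

def sample_return(rng):
    """One episode from (s, a); the learner never reads these probabilities."""
    reward = 1.0 if rng.random() < 0.5 else 0.0
    tail = 2.0 if rng.random() < 0.5 else 4.0  # return collected after the first step
    return reward + GAMMA * tail

def mc_estimate(num_episodes, seed=0):
    """Average the returns of num_episodes sampled episodes."""
    rng = random.Random(seed)
    returns = [sample_return(rng) for _ in range(num_episodes)]
    return sum(returns) / len(returns)

estimate = mc_estimate(20000)
print(estimate)  # should be close to the true value 0.5 + 0.9 * 3.0
```

The estimator uses only observed returns, so it remains usable when the reward and transition distributions are unknown.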

The MC Basic algorithm

Description of the algorithm

Given an initial policy $\pi_0$, there are two steps at the $k$-th iteration.

Step 1: policy evaluation.
- This step is to obtain $q_{\pi_k}(s,a)$ for all $(s, a)$.
- Specifically, for each state-action pair $(s, a)$, run an infinite number of (or sufficiently many) episodes.
- The average of their returns is used to approximate $q_{\pi_k}(s,a)$.
Step 2: policy improvement.
- This step is to solve $\pi_{k+1}(s) = \arg\max_\pi \sum_a \pi(a \mid s) q_{\pi_k}(s,a)$ for all $s \in \mathcal{S}$. The greedy optimal policy is $\pi_{k+1}(a_k^* \mid s) = 1$ where $a_k^* = \arg\max_a q_{\pi_k}(s,a)$.

Exactly the same as the policy iteration algorithm, except

Estimate $q_{\pi_k}(s,a)$ directly, instead of solving $v_{\pi_k}(s)$.

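Background: why do we need MC Basic?

Classical policy iteration (PI) requires the model of the environment:
State transition probabilities $p(s' \mid s,a)$
Reward distribution $p(r \mid s,a)$

With the model, the state values $v_\pi(s)$ or action values $q_\pi(s,a)$ can be computed exactly from the Bellman equation.
In many real scenarios (e.g., games, robotics, recommender systems), however, the transition probabilities and the reward distribution are unknown; we only observe that taking an action yields some reward and leads to some next state.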

The two main steps of MC Basic

Step 1: Policy Evaluation

In model-based PI, policy evaluation relies on the Bellman equation:

$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} P_\pi(s' \mid s) v_\pi(s')$

In MC Basic, $q_\pi(s,a)$ is estimated directly from samples:

Starting from a state-action pair $(s, a)$, run sufficiently many complete episodes.

Each episode yields a return $g(s, a)$, i.e., the discounted reward accumulated from $(s, a)$ onward.

Average these returns:

$q_{\pi_k}(s,a) \approx \frac{1}{N} \sum_{i=1}^N g^{(i)}(s,a)$

Intuition:

This is like repeating an experiment and averaging: start from $(s, a)$ many times and use the average return obtained as the estimate of $q_{\pi_k}(s,a)$.

Step 2: Policy Improvement

Once $q_{\pi_k}(s,a)$ is available, perform a greedy update:

For each state $s$, select the action $a_k^*(s)$ that maximizes $q_{\pi_k}(s,a)$:

$a_k^*(s) = \arg\max_a q_{\pi_k}(s,a)$

The new policy $\pi_{k+1}$ becomes:

$\pi_{k+1}(a \mid s) = \begin{cases} 1, & a = a^*_k(s) \\ 0, & a \neq a^*_k(s) \end{cases}$

Intuition:

In each state, discard the actions that perform poorly and keep only the action with the highest estimated average return.
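The greedy update can be written as a small function over a table of estimated action values. This is only a sketch: the nested-dictionary layout and the name `greedy_policy` are assumptions made for illustration, and ties are broken in favor of the first action listed.

```python
def greedy_policy(q):
    """q maps state -> {action: estimated value}; returns state -> {action: probability}."""
    policy = {}
    for s, action_values in q.items():
        best = max(action_values, key=action_values.get)  # a*_k(s)
        policy[s] = {a: (1.0 if a == best else 0.0) for a in action_values}
    return policy

# Hypothetical estimates for two states and two actions.
q = {"s1": {"left": 0.2, "right": 0.7}, "s2": {"left": 0.5, "right": 0.1}}
policy = greedy_policy(q)
print(policy)
```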

Pseudocode: MC Basic algorithm (a model-free variant of policy iteration)

Initialization: Initial guess $\pi_0$.
Aim: Search for an optimal policy.
While the value estimate has not converged, for the $k$-th iteration, do
- For every state $s \in \mathcal{S}$, do
  - For every action $a \in \mathcal{A}(s)$, do
    - Collect sufficiently many episodes starting from $(s, a)$ following $\pi_k$.
    - MC-based policy evaluation step:
      - $q_{\pi_k}(s,a)$ = average return of all the episodes starting from $(s, a)$.
  - Policy improvement step:
    
    $a^*_k(s) = \arg\max_a q_{\pi_k}(s,a)$
    
    $\pi_{k+1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k+1}(a \mid s) = 0$ otherwise.
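The pseudocode above can be sketched on a toy problem. The environment below is a hypothetical deterministic corridor with states 0 to 3, where state 3 is a terminal target and entering it gives reward 1; it is not the grid world used in the course. Because both the environment and the policy are deterministic, all episodes from a given pair are identical, so the averaging is kept only to mirror the algorithm. A fixed number of iterations replaces the convergence test.

```python
N_STATES, TARGET, GAMMA = 4, 3, 0.9
ACTIONS = (-1, +1)  # move left / move right

def step(s, a):
    """Hypothetical corridor dynamics: walls clip the move, reward 1 on entering the target."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, (1.0 if s_next == TARGET else 0.0)

def episode_return(s, a, policy, length):
    """Discounted return of one episode from (s, a), truncated after `length` steps."""
    g, discount = 0.0, 1.0
    for _ in range(length):
        s, r = step(s, a)
        g += discount * r
        discount *= GAMMA
        if s == TARGET:
            break
        a = policy[s]
    return g

def mc_basic(num_iterations=5, episodes_per_pair=3, length=10):
    policy = {s: -1 for s in range(TARGET)}  # initial guess: always move left
    q = {}
    for _ in range(num_iterations):
        # Policy evaluation: average the returns of episodes started from every (s, a).
        for s in range(TARGET):
            for a in ACTIONS:
                returns = [episode_return(s, a, policy, length)
                           for _ in range(episodes_per_pair)]
                q[(s, a)] = sum(returns) / len(returns)
        # Policy improvement: act greedily with respect to the estimated q.
        policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(TARGET)}
    return policy, q

policy, q = mc_basic()
print(policy)
```

In this toy problem the policy improves one state per iteration, starting with the state next to the target, and ends up moving right everywhere.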

Remarks on MC Basic Algorithm

MC Basic is a variant of the policy iteration algorithm.
The model-free algorithms are built up based on model-based ones. It is, therefore, necessary to understand model-based algorithms first before studying model-free algorithms.
MC Basic is useful to reveal the core idea of MC-based model-free RL, but not practical due to low efficiency.
Why does MC Basic estimate action values instead of state values? That is because state values cannot be used to improve policies directly. When models are not available, we should directly estimate action values.
Since policy iteration is convergent, MC Basic is also guaranteed to converge given sufficient episodes.

MC Exploring Starts

Episode length

Findings:
- When the episode length is short, only the states that are close to the
  target have nonzero state values.
- As the episode length increases, the states that are closer to the target
  have nonzero values earlier than those farther away.
- The episode length should be sufficiently long.
- The episode length does not have to be infinitely long.

An episode starts from an initial state and follows some policy until a terminal state is reached or a truncation condition is triggered.

The episode length is the number of time steps contained in this process.
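The effect of the episode length can be illustrated with a small calculation. The sketch below assumes a hypothetical corridor with states 0 to 2 followed by a target at position 3, a reward of 1 on entering the target, and a policy that always moves right; `truncated_return` is an illustrative helper, not code from the course.

```python
GAMMA, TARGET = 0.9, 3

def truncated_return(start, length):
    """Return from `start` when always moving right for at most `length` steps."""
    g, s = 0.0, start
    for t in range(length):
        s += 1
        if s == TARGET:
            g += GAMMA ** t  # reward 1 on entering the target, discounted by gamma^t
            break
    return g

# Estimated values of states 0, 1, 2 for several episode lengths.
table = {length: [truncated_return(s, length) for s in range(TARGET)]
         for length in (1, 2, 3, 10)}
for length, values in table.items():
    print(length, values)
```

With length 1 only the state adjacent to the target has a nonzero estimate; by length 3 every state has reached its final value, and longer episodes change nothing further in this example.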


Policy Evaluation

Data-efficient methods

first-visit method
every-visit method

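Background

In Monte Carlo (MC) methods, we use episodes to estimate the action value $q_\pi(s,a)$ or the state value $v_\pi(s)$.
An episode may visit the same state or the same state-action pair several times.
The question is: when the same $(s, a)$ is visited several times in one episode, which visit's return should be used for the update?

First-Visit Method
Definition: only the first time $(s,a)$ is visited in an episode is the return following that visit used as a sample to update $q_\pi(s,a)$.
Properties:
Later visits within the same episode are ignored.
Fewer updates, so the data are used less efficiently; however, samples from different episodes are mutually independent and the estimate is unbiased.

Example:
If $(s_2,a_3)$ appears 3 times in an episode, only the return following its first appearance is used for the update.

Every-Visit Method

Definition: if $(s, a)$ appears several times in an episode, a return is recorded for each appearance and every visit is used for the update.

Properties:

More updates, so the data are used more efficiently.
However, the returns of multiple visits within the same episode are correlated (they are not independent samples), which may increase the variance.

Example:

If $(s_2,a_3)$ appears 3 times in an episode, all 3 returns are used for the update.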
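The difference between the two methods is easiest to see on a concrete episode. The sketch below uses a made-up four-step episode and a discount rate of 0.5 (chosen so that the numbers are exact); the episode format and the helper name `visit_returns` are assumptions for illustration.

```python
GAMMA = 0.5

def visit_returns(episode, first_visit):
    """episode: list of (state, action, reward received after the action).

    Returns a dict mapping (state, action) -> list of sampled returns.
    """
    # Discounted return following each time step, computed backward.
    g, returns_at = 0.0, [0.0] * len(episode)
    for t in range(len(episode) - 1, -1, -1):
        g = GAMMA * g + episode[t][2]
        returns_at[t] = g
    samples = {}
    for t, (s, a, _) in enumerate(episode):
        if first_visit and (s, a) in samples:
            continue  # ignore later visits within the same episode
        samples.setdefault((s, a), []).append(returns_at[t])
    return samples

episode = [("s1", "a2", 0.0), ("s2", "a3", 1.0), ("s1", "a2", 0.0), ("s2", "a3", 1.0)]
first = visit_returns(episode, first_visit=True)
every = visit_returns(episode, first_visit=False)
print(first)  # one sample per pair
print(every)  # one sample per visit
```

Here the first-visit method keeps 1.25 for `("s2", "a3")`, while the every-visit method keeps both 1.25 and 1.0.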

Policy Improvement

Another aspect in MC-based RL is when to update the policy. There are two methods.

The first method is, in the policy evaluation step, to collect all the episodes starting from a state-action pair and then use the average return to approximate the action value.
- This is the one adopted by the MC Basic algorithm.
- The problem of this method is that the agent has to wait until all episodes have been collected.
The second method uses the return of a single episode to approximate the action value.
- In this way, we can improve the policy episode-by-episode.

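When should the policy be updated?

Method 1 (batch update)
In the policy evaluation step, collect all the episodes starting from a state-action pair $(s,a)$.
Use the average return of these episodes as the estimate of the action value $q(s,a)$.
This is the method adopted by the MC Basic algorithm.
Drawback: the policy can be updated only after all episodes have been collected, so updates are delayed.

Method 2 (incremental update)
Use the return of a single episode to update the action value immediately.
Benefit: the policy can be improved episode by episode, without waiting for all the data.
This improves learning efficiency and lets the policy improve sooner.

Summary:

Method 1 = offline batch update (slow but stable, similar to policy iteration).
Method 2 = online incremental update (fast and flexible, similar to incremental MC / online RL).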
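One common way to realize episode-by-episode updates is to maintain a running average, so that each new return refines the estimate without storing or waiting for the whole batch. The text above does not prescribe this exact rule; the sketch below simply shows that the running average after $n$ returns equals the batch average of those $n$ returns. The return values are made up.

```python
def running_average(returns):
    """Update q after every new return g with q <- q + (g - q) / n."""
    q, history = 0.0, []
    for n, g in enumerate(returns, start=1):
        q += (g - q) / n
        history.append(q)
    return history

returns = [1.0, 0.0, 2.0, 1.0]  # hypothetical returns of four episodes
history = running_average(returns)
print(history)                      # estimate available after each episode
print(sum(returns) / len(returns))  # batch average, available only at the end
```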

Generalized policy iteration:

Not a specific algorithm.
It refers to the general idea or framework of switching between
policy-evaluation and policy-improvement processes.
Many model-based and model-free RL algorithms fall into this
framework.

Pseudocode: MC Exploring Starts (a sample-efficient variant of MC Basic)

Initialization: Initial guess $π0\pi_0$ .
Aim: Search for an optimal policy.
For each episode, do
- Episode generation: Randomly select a starting state–action pair $(s_0,a_0)$
  and ensure that all pairs can possibly be selected. Follow the current policy to generate an episode of length $T$: $s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T$
Episode generation
- Randomly select a starting state–action pair $(s_0,a_0)$ such that every $(s, a)$ has a chance of being selected.
- This satisfies the exploring starts assumption.
- Then follow the current policy $\pi_k$ to generate a complete episode of length $T$: $s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T$
- Difference from MC Basic: MC Basic collects many episodes from every $(s, a)$ before each policy update, whereas MC Exploring Starts draws one random starting pair per episode and updates the estimates and the policy after every episode.
- Policy evaluation and policy improvement:
  - Initialization: $g \gets 0$
  $g \gets 0$: $g$ stores the discounted return from the current time step onward.
  - For each step of the episode, $t = T-1, T-2, \ldots, 0$, do
    
    $g \gets \gamma g + r_{t+1}$
    - Use the first-visit method
      - If $(s_t,a_t)$ does not appear in $(s_0,a_0,s_1,a_1,\ldots,s_{t-1},a_{t-1})$, then
        
        $Returns(s_t,a_t) \gets Returns(s_t,a_t) + g$
        
        $q(s_t,a_t) = \text{average}(Returns(s_t,a_t))$
        
        $\pi(a \mid s_t) = 1 \quad \text{if } a = \arg\max_a q(s_t,a)$
    Traverse the episode backward (from the last step to the first):
    
    $t = T-1, T-2, \ldots, 0$
    1. Return update
      
      $g \gets \gamma g + r_{t+1}$
      
      This is the standard return computation: discount the return accumulated so far by $\gamma$ and add the current reward.
    2. First-visit MC method
      
      If $(s_t,a_t)$ already appeared earlier in the episode, skip it (only the first occurrence is updated).
      If this is the first occurrence of $(s_t,a_t)$:
      
      Record its return
      
      $Returns(s_t,a_t) \gets Returns(s_t,a_t) + g$
      
      Update the Q value to the average
      
      $q(s_t,a_t) = \text{average}(Returns(s_t,a_t))$
      
      Policy improvement: in that state, choose the action that maximizes $q(s_t,a)$
      
      $\pi(a \mid s_t) = \begin{cases} 1 & \text{if } a = \arg\max_a q(s_t,a) \\ 0 & \text{otherwise} \end{cases}$
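The steps above can be sketched as follows. The environment is again a hypothetical deterministic corridor (states 0 to 3, terminal target at 3, reward 1 on entering it), and names such as `generate_episode` and `mc_exploring_starts` are illustrative. `Returns` is kept as a list of sampled returns per pair, which is one reading of the pseudocode.

```python
import random

N_STATES, TARGET, GAMMA = 4, 3, 0.9
ACTIONS = (-1, +1)  # move left / move right

def step(s, a):
    """Hypothetical corridor dynamics: reward 1 on entering the target, else 0."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, (1.0 if s_next == TARGET else 0.0)

def generate_episode(s, a, policy, max_length):
    """Follow `policy` after the forced first pair (s, a); stop at the target."""
    episode = []
    for _ in range(max_length):
        s_next, r = step(s, a)
        episode.append((s, a, r))
        if s_next == TARGET:
            break
        s, a = s_next, policy[s_next]
    return episode

def mc_exploring_starts(num_episodes=200, max_length=10, seed=0):
    rng = random.Random(seed)
    policy = {s: -1 for s in range(TARGET)}  # initial guess: always move left
    returns = {(s, a): [] for s in range(TARGET) for a in ACTIONS}
    q = {pair: 0.0 for pair in returns}
    for _ in range(num_episodes):
        # Exploring start: every (s, a) has a nonzero chance of being selected.
        s0, a0 = rng.randrange(TARGET), rng.choice(ACTIONS)
        episode = generate_episode(s0, a0, policy, max_length)
        pairs = [(s, a) for s, a, _ in episode]
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):  # backward pass
            s, a, r = episode[t]
            g = GAMMA * g + r
            if (s, a) not in pairs[:t]:  # first visit of (s, a) in this episode
                returns[(s, a)].append(g)
                q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(ACTIONS, key=lambda b: q[(s, b)])
    return policy, q

policy, q = mc_exploring_starts()
print(policy)
```

Unlike the batch scheme, the policy here changes in the middle of data collection, so the averaged returns mix samples generated by different policies; in this toy problem the greedy policy still ends up moving right in every state.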
What is exploring starts?
- Exploring starts means we need to generate sufficiently many episodes
  starting from every state-action pair.
- Both MC Basic and MC Exploring Starts need this assumption.
In reinforcement learning (RL), exploring starts is a theoretical assumption used to guarantee that the MC algorithm can learn an optimal policy. Specifically:
- During learning, the starting point $(s_0, a_0)$ of each generated episode is not fixed; it can be selected randomly from all possible state-action pairs $(s, a)$;
- This guarantees that every $(s, a)$ pair is eventually visited, so that the value $q_\pi(s,a)$ of every action can be estimated.
In other words, exploring starts requires us to "force" episodes to start from all $(s, a)$ pairs of the environment in order to cover the whole space.
Why do we need to consider exploring starts?
- In theory, only if every action value for every state is well explored, can
  we select the optimal actions correctly. On the contrary, if an action is not explored, this action may happen to be the optimal one and hence be missed.
- In practice, exploring starts is difficult to achieve. For many applications, especially those involving physical interactions with environments, it is difficult to collect episodes starting from every state-action pair.
The reason is that policy improvement depends on correct Q-value estimates:
- In policy improvement, we update the policy:
  $\pi(a \mid s) = \begin{cases} 1, & a = \arg\max_a q_\pi(s,a) \\ 0, & \text{otherwise} \end{cases}$
- If some $(s, a)$ has never been explored, we do not know its value $q_\pi(s,a)$.
- As a result, we may wrongly regard a sub-optimal action as optimal and miss the truly optimal action.
Therefore, from a theoretical point of view, only when all $(s, a)$ pairs have been explored can we be sure that the optimal policy is not missed.

MC $\varepsilon$-Greedy

$\varepsilon$-Greedy Policy

What is an $\varepsilon$-greedy policy?

$\pi(a \mid s) =
\begin{cases}
1 - \dfrac{\varepsilon}{|\mathcal{A}(s)|} \big(|\mathcal{A}(s)| - 1\big), & \text{for the greedy action}, \\[8pt]
\dfrac{\varepsilon}{|\mathcal{A}(s)|}, & \text{for the other } |\mathcal{A}(s)| - 1 \text{ actions}.
\end{cases}$

where $\varepsilon \in [0,1]$ and $|\mathcal{A}(s)|$ is the number of actions for $s$.
The chance to choose the greedy action is never smaller than that of any other action, because $1 - \dfrac{\varepsilon}{|\mathcal{A}(s)|} (|\mathcal{A}(s)| - 1) = 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s)|} \geq \dfrac{\varepsilon}{|\mathcal{A}(s)|}$.

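Explanation:

In state $s$, the agent has a set of available actions $\mathcal{A}(s)$.
Greedy action: the action with the largest value under the current value estimate (e.g., $q(s, a)$).
The core idea of an $\varepsilon$-greedy policy:
Most of the time (with the larger probability), choose the currently best action (the greedy action).
Occasionally (with the smaller probability), choose one of the other actions at random.

where:

$\varepsilon \in [0,1]$ is the degree of exploration.
$|\mathcal{A}(s)|$ is the number of actions available in state $s$.

Properties of the probability assignment:

The probabilities of all actions sum to 1.
The probability of the greedy action is never lower than that of a non-greedy action.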
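The probability assignment can be computed directly from the definition. This is a minimal sketch; the action names and value estimates are made up, and `epsilon_greedy_probs` is a hypothetical helper that breaks ties in favor of the first action listed.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """q_values maps action -> estimated value for one state."""
    n = len(q_values)
    greedy = max(q_values, key=q_values.get)
    probs = {a: epsilon / n for a in q_values}      # non-greedy actions
    probs[greedy] = 1.0 - epsilon * (n - 1) / n     # greedy action
    return probs

q_values = {"up": 0.2, "down": 1.0, "left": -0.3, "right": 0.0}
probs = epsilon_greedy_probs(q_values, epsilon=0.2)
print(probs)  # "down" gets about 0.85, every other action about 0.05
```

Setting `epsilon=0` gives probability 1 to the greedy action, and `epsilon=1` gives the uniform distribution over the four actions.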

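Why use $\varepsilon$-greedy?

Balance between exploitation and exploration.
- When $\varepsilon = 0$, it becomes greedy! Less exploration but more exploitation.
- When $\varepsilon = 1$, it becomes a uniform distribution. More exploration but less exploitation.

A core tension in reinforcement learning is:

Exploration: try new actions to gain more information.
Exploitation: choose the action that currently looks best to obtain the largest reward.

$\varepsilon$-greedy is a simple compromise:

When $\varepsilon = 0$: a purely greedy policy that only chooses the currently best action.
Advantage: strong exploitation; the best known reward is obtained quickly.
Disadvantage: it may get stuck in a local optimum and miss a truly better action.

When $\varepsilon = 1$: a purely random policy (uniform distribution).
Advantage: strong exploration; all actions can be tried.
Disadvantage: weak exploitation; the value estimates are hardly used, so learning is inefficient.

When $\varepsilon$ takes an intermediate value:
the greedy action is chosen with the larger probability (exploiting current knowledge);
random actions are tried with a small probability (maintaining exploration).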

MC $\varepsilon$-Greedy Algorithm

How to embed $\varepsilon$-greedy into the MC-based RL algorithms?

Originally, the policy improvement step in MC Basic and MC Exploring Starts is to solve

$\pi_{k+1}(s) = \arg\max_{\pi \in \Pi} \sum_a \pi(a \mid s) q_{\pi_k}(s,a),$
- where $\Pi$ denotes the set of all possible policies. The optimal policy here is
  
  $\pi_{k+1}(a \mid s) = \begin{cases} 1, & a = a_k^*, \\ 0, & a \neq a_k^*, \end{cases}$
- where
  
  $a_k^* = \arg\max_a q_{\pi_k}(s,a)$
Now, the policy improvement step is changed to solve

$\pi_{k+1}(s) = \arg\max_{\pi \in \Pi_\varepsilon} \sum_a \pi(a \mid s) q_{\pi_k}(s,a),$
- where $\Pi_\varepsilon$ denotes the set of all $\varepsilon$-greedy policies with a fixed value of $\varepsilon$. The optimal policy here is
  
  $\pi_{k+1}(a \mid s) = \begin{cases} 1 - \frac{|\mathcal{A}(s)| - 1}{|\mathcal{A}(s)|}\varepsilon, & a = a_k^*, \\ \frac{1}{|\mathcal{A}(s)|}\varepsilon, & a \neq a_k^*, \end{cases}$
- MC $\varepsilon$-Greedy is the same as MC Exploring Starts except that the former uses $\varepsilon$-greedy policies.
- It does not require exploring starts, but it still requires visiting all state–action pairs, in a different form.

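Background review: MC Basic & MC Exploring Starts

In Monte Carlo (MC) methods, we estimate the action value function $q(s, a)$ by sampling (running many episodes) and then improve the policy:

MC Basic: requires exploring starts (starting from any $(s, a)$ pair) to guarantee that all $(s, a)$ pairs are sampled.
MC Exploring Starts: also relies on exploring starts, but is more efficient.

Introducing $\varepsilon$-greedy

The original policy improvement (purely greedy)

In MC Basic and MC Exploring Starts, policy improvement is purely greedy:

$\pi_{k+1}(a \mid s) = \begin{cases} 1, & a = a^*_k \\ 0, & a \neq a^*_k \end{cases}$

where $a_k^* = \arg\max_a q_{\pi_k}(s,a)$.

Meaning: always choose only the estimated best action.

Drawback: it easily gets stuck in a local optimum (actions that were never explored may be ignored forever).

Policy improvement with $\varepsilon$-greedy

The policy formula is changed to:

$\pi_{k+1}(a \mid s) = \begin{cases} 1 - \dfrac{|\mathcal{A}(s)| - 1}{|\mathcal{A}(s)|}\varepsilon, & a = a^*_k \\ \dfrac{1}{|\mathcal{A}(s)|}\varepsilon, & a \neq a^*_k \end{cases}$

Explanation:

Greedy action: chosen with the larger probability (close to 1 when $\varepsilon$ is small).
Non-greedy actions: also chosen with a small probability (exploration).
The parameter $\varepsilon$ controls the strength of exploration:
$\varepsilon = 0 \;\Rightarrow\;$ purely greedy (same as the original).
$\varepsilon = 1 \;\Rightarrow\;$ purely random policy.

Benefits:

Exploring starts is no longer required: even without manually chosen starting points, an $\varepsilon$-greedy policy with $\varepsilon > 0$ gives every action a nonzero probability, so sufficiently long episodes can cover all $(s, a)$ pairs.
Every action is guaranteed a chance of being sampled.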

Pseudocode: MC $\varepsilon$-Greedy (a variant of MC Exploring Starts)

Initialization: Initial guess $\pi_0$ and the value of $\varepsilon \in [0,1]$
Aim: Search for an optimal policy.
For each episode, do
- Episode generation: Randomly select a starting state–action pair $(s_0,a_0)$. Following the current policy, generate an episode of length $T$: $s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T$.
  Episode generation
  - Input: the current policy $\pi_k$
  - Procedure: starting from some $(s_0,a_0)$, interact with the environment to generate an episode:
    
    $s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T$
  - Key point: this yields a complete trajectory, which is used to compute the returns $G_t$.
- Policy evaluation and policy improvement:
  - Initialization: $g \gets 0$
  - For each step of the episode, $t = T-1, T-2, \ldots, 0$, do
    
    $g \gets \gamma g + r_{t+1}$
    
    Meaning: compute the discounted return $G_t$ from time $t$ onward.
    
    The reward information of the whole episode is used to update the values.
    - Use the every-visit method:
      
      $Returns(s_t,a_t) \gets Returns(s_t,a_t) + g$
      
      $q(s_t,a_t) = \text{average}(Returns(s_t,a_t))$
      - Let $a^* = \arg\max_a q(s_t,a)$ and update the policy:
        
        $\pi(a \mid s_t) = \begin{cases} 1 - \frac{|\mathcal{A}(s_t)|-1}{|\mathcal{A}(s_t)|}\varepsilon, & a = a^* \\ \frac{1}{|\mathcal{A}(s_t)|}\varepsilon, & a \neq a^* \end{cases}$
      Purpose: keep accumulating return samples for the same $(s, a)$ pair and use their average as the estimate of $q(s, a)$. The difference:
      
      first-visit: update only at the first visit to $(s, a)$ in an episode.
      every-visit: update at every visit to $(s, a)$.
      
      Every-visit makes more efficient use of the data, but its samples within an episode are correlated, which may add variance; first-visit is more robust but slower.
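The pseudocode above can be sketched on the same kind of toy problem. The corridor environment (states 0 to 3, terminal target at 3, reward 1 on entering it), the value $\varepsilon = 0.3$, and all function names are assumptions made for illustration. The average of `Returns` is maintained as a running average, which gives the same value as storing every sample.

```python
import random

N_STATES, TARGET, GAMMA, EPSILON = 4, 3, 0.9, 0.3
ACTIONS = (-1, +1)  # move left / move right

def step(s, a):
    """Hypothetical corridor dynamics: reward 1 on entering the target, else 0."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, (1.0 if s_next == TARGET else 0.0)

def epsilon_greedy(q, s):
    """Epsilon-greedy action probabilities for state s under the estimates q."""
    best = max(ACTIONS, key=lambda a: q[(s, a)])
    n = len(ACTIONS)
    return {a: (1 - EPSILON * (n - 1) / n if a == best else EPSILON / n)
            for a in ACTIONS}

def mc_epsilon_greedy(num_episodes=500, max_length=50, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(TARGET) for a in ACTIONS}
    counts = {pair: 0 for pair in q}
    policy = {s: epsilon_greedy(q, s) for s in range(TARGET)}
    for _ in range(num_episodes):
        # Episode generation: random starting pair, then follow the current policy.
        s, a = rng.randrange(TARGET), rng.choice(ACTIONS)
        episode = []
        for _step in range(max_length):
            s_next, r = step(s, a)
            episode.append((s, a, r))
            if s_next == TARGET:
                break
            s = s_next
            a = rng.choices(ACTIONS, weights=[policy[s][b] for b in ACTIONS])[0]
        # Backward pass with every-visit updates.
        g = 0.0
        for s, a, r in reversed(episode):
            g = GAMMA * g + r
            counts[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]  # running average of returns
            policy[s] = epsilon_greedy(q, s)
    return policy, q

policy, q = mc_epsilon_greedy()
print(policy)
```

Every action keeps a probability of at least $\varepsilon / |\mathcal{A}(s)|$ throughout, so the learned policy is the best one found within the $\varepsilon$-greedy family rather than a deterministic greedy policy.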

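Summary

The MC $\varepsilon$-Greedy algorithm is a Monte Carlo policy iteration method that needs neither a model of the environment nor exploring starts. It estimates $q(s, a)$ by averaging sampled returns and uses $\varepsilon$-greedy policy improvement to guarantee exploration, which makes it a feasible, asymptotically convergent reinforcement learning method for real environments.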