RL【4】:Value Iteration and Policy Iteration
Series Contents
Table of Contents
- Series Contents
- Preface
- Value iteration algorithm
- Policy iteration algorithm
- Truncated policy iteration algorithm
- Summary
Preface
This series records my study notes for Professor Shiyu Zhao's course "Mathematical Foundations of Reinforcement Learning" on Bilibili. For the course itself, please refer to:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Value iteration algorithm
How to solve the Bellman optimality equation?
$$v = f(v) = \max_{\pi} \left( r_\pi + \gamma P_\pi v \right)$$
This means:
- $v^*(s)$ is the value (expected return) obtained by starting from state $s$ and following an optimal policy.
- The optimal value of each state is determined by selecting the optimal action (the maximization).
We know that the contraction mapping theorem suggests an iterative algorithm:
$$v_{k+1} = f(v_k) = \max_{\pi} \left( r_\pi + \gamma P_\pi v_k \right), \quad k = 1,2,3,\ldots$$
where $v_0$ can be arbitrary.
This algorithm, which is called value iteration, can eventually find the optimal state value and an optimal policy.
The algorithm can be decomposed into two steps.
- Step 1: policy update. This step is to solve
  $$\pi_{k+1} = \arg\max_{\pi} \left( r_\pi + \gamma P_\pi v_k \right),$$
  where $v_k$ is given.
  Meaning: at the $k$-th iteration, given the current value estimate $v_k$, choose an optimal policy $\pi_{k+1}$.
  Intuition: this is a greedy selection step: under the current value estimate, pick the best action in each state.
- Step 2: value update.
  $$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$$
  Meaning: use the newly obtained policy $\pi_{k+1}$ to update the state values.
  Intuition: this step recomputes the values, i.e., it asks "how much return do I get under the new policy?"
- Question: Is $v_k$ a state value?
- No, because it is not guaranteed that $v_k$ satisfies a Bellman equation.
Value iteration algorithm - Elementwise form
- Step 1: Policy update
  The elementwise form of
  $$\pi_{k+1} = \arg\max_{\pi} \big(r_\pi + \gamma P_\pi v_k\big)$$
  is
  $$\pi_{k+1}(s) = \arg\max_{\pi} \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_k(s') \right), \quad s \in \mathcal{S},$$
  $$\pi_{k+1}(s) = \arg\max_{\pi} \sum_a \pi(a \mid s)\, q_k(s,a), \quad s \in \mathcal{S}.$$
  The optimal policy solving the above optimization problem is
  $$\pi_{k+1}(a \mid s) = \begin{cases} 1, & a = a^*_k(s), \\ 0, & a \neq a^*_k(s), \end{cases}$$
  where $a^*_k(s) = \arg\max_a q_k(s,a)$. $\pi_{k+1}$ is called a greedy policy, since it simply selects the action with the greatest q-value.
  Here $q_k(s,a)$ is the q-value of taking action $a$ in state $s$: the immediate reward plus the discounted future value.
  - Meaning: in state $s$, the new policy $\pi_{k+1}$ picks the action with the largest q-value.
  - Result: the resulting optimal policy is deterministic.
  $\pi_{k+1}$ is a greedy policy: in every state it chooses the action whose current q-value is largest.
- Step 2: Value update
  The elementwise form of
  $$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$$
  is
  $$v_{k+1}(s) = \sum_a \pi_{k+1}(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_k(s') \right), \quad s \in \mathcal{S},$$
  $$v_{k+1}(s) = \sum_a \pi_{k+1}(a \mid s)\, q_k(s,a), \quad s \in \mathcal{S}.$$
  Since $\pi_{k+1}$ is greedy, the above equation is simply
  $$v_{k+1}(s) = \max_a q_k(s,a).$$
  Meaning: the new state value $v_{k+1}(s)$ is the q-value of the best action in each state.
  Intuition: the value function keeps approaching the optimal values; every iteration moves closer to satisfying the Bellman optimality equation.
- In short (see the sketch below):
  Step 1 (Policy Update): use $v_k$ to derive a greedy policy $\pi_{k+1}$, i.e., pick the action with the largest q-value in each state.
  Step 2 (Value Update): update the values under the new policy $\pi_{k+1}$; because the policy is greedy, this reduces to $v_{k+1}(s) = \max_a q_k(s,a)$.
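To make the two elementwise steps concrete, here is a minimal Python/NumPy sketch for a single state. The array layout is my own assumption (not from the lecture): `P[s, a, s_next]` stores $p(s' \mid s,a)$ and `R[s, a]` stores the expected immediate reward $\sum_r p(r \mid s,a)\, r$.

```python
import numpy as np

def one_state_update(P, R, v, s, gamma=0.9):
    """One elementwise value-iteration step for a single state s (sketch).

    P: (num_states, num_actions, num_states), P[s, a, s'] = p(s' | s, a)
    R: (num_states, num_actions), R[s, a] = expected immediate reward
    v: (num_states,), current value estimate v_k
    """
    # Step 1 (policy update): q_k(s, a) = E[r | s, a] + gamma * sum_{s'} p(s' | s, a) v_k(s')
    q_sa = R[s, :] + gamma * P[s, :, :] @ v      # one q-value per action
    greedy_action = int(np.argmax(q_sa))         # a*_k(s), the greedy action

    # Step 2 (value update): v_{k+1}(s) = max_a q_k(s, a)
    new_value = float(q_sa.max())
    return greedy_action, new_value
```

Sweeping this update over all states gives one full iteration $v_k \to v_{k+1}$; a complete loop with a stopping criterion is sketched after the pseudocode below.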
Value iteration algorithm - Pseudocode
- Procedure summary
  $$v_k(s) \;\to\; q_k(s,a) \;\to\; \text{greedy policy } \pi_{k+1}(a \mid s) \;\to\; \text{new value } v_{k+1}(s) = \max_a q_k(s,a)$$
  The core loop of value iteration:
  - use the current state values $v_k(s)$ to compute the action values $q_k(s,a)$;
  - derive the greedy policy $\pi_{k+1}$ from $q_k(s,a)$;
  - update the state values under that policy to obtain the new $v_{k+1}$;
  - repeat until convergence.
- Pseudocode
  Initialization: the probability models $p(r \mid s,a)$ and $p(s' \mid s,a)$ for all $(s,a)$ are known. Initial guess $v_0$.
  The environment model is known:
  - Transition probabilities $p(s' \mid s,a)$: the probability of reaching $s'$ after taking action $a$ in state $s$.
  - Reward probabilities $p(r \mid s,a)$: the probability of receiving reward $r$ after taking action $a$ in state $s$.
  - An initial guess $v_0$ is given, e.g., the all-zero vector; this amounts to starting from an arbitrary guess of the value table.
- Aim: search for the optimal state value and an optimal policy solving the Bellman optimality equation.
  The goal is to find:
  - the optimal state value $v^*$,
  - an optimal policy $\pi^*$,
  which satisfy the Bellman optimality equation.
- While $v_k$ has not converged, in the sense that $\| v_k - v_{k-1} \|$ is greater than a predefined small threshold, for the $k$-th iteration, do
  - For every state $s \in \mathcal{S}$, do
    - For every action $a \in \mathcal{A}(s)$, do
      q-value: $q_k(s,a) = \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_k(s')$
    - Maximum action value: $a^*_k(s) = \arg\max_a q_k(s,a)$
    - Policy update: $\pi_{k+1}(a \mid s) = \begin{cases} 1, & a = a^*_k(s), \\ 0, & \text{otherwise} \end{cases}$
    - Value update: $v_{k+1}(s) = \max_a q_k(s,a)$
  Stopping criterion:
  - Stop iterating once $\| v_k - v_{k-1} \|$ drops below the threshold,
  - i.e., once the old and new value functions barely differ, the learned $v_k$ is already close to the optimal $v^*$.
Overall explanation
- Value iteration simply keeps alternating between:
  - computing $q_k(s,a)$ from the current $v_k$,
  - extracting the greedy policy from $q_k(s,a)$,
  - using that policy to update $v_{k+1}$.
- Each iteration moves closer to the optimal $v^*$ and $\pi^*$.
- After convergence, $v_k \approx v^*$ and $\pi_k \approx \pi^*$. A minimal implementation sketch follows.
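Putting the pseudocode together, a minimal value-iteration loop could look as follows. This is only a sketch under the same assumed model arrays `P[s, a, s_next]` and `R[s, a]` as above; the function name, the tolerance `tol`, and the iteration cap are illustrative choices rather than part of the course material.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6, max_iter=10_000):
    """Value iteration: returns (v, policy) approximating v* and a greedy optimal policy.

    P: (S, A, S) transition probabilities, R: (S, A) expected immediate rewards.
    """
    num_states, num_actions, _ = P.shape
    v = np.zeros(num_states)                          # arbitrary initial guess v_0

    for _ in range(max_iter):
        # q_k(s, a) for all states and actions at once
        q = R + gamma * np.einsum("sat,t->sa", P, v)
        v_new = q.max(axis=1)                         # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:           # ||v_{k+1} - v_k|| below the threshold
            v = v_new
            break
        v = v_new

    # Policy update: recover the greedy (deterministic) policy from the final values
    q = R + gamma * np.einsum("sat,t->sa", P, v)
    policy = q.argmax(axis=1)
    return v, policy
```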
Policy iteration algorithm
Algorithm description
- Given an arbitrary initial policy $\pi_0$,
- Step 1: policy evaluation (PE)
  This step calculates the state value of $\pi_k$:
  $$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$
  Note that $v_{\pi_k}$ is a state value function.
  Meaning:
  - Given the current policy $\pi_k$, compute its state value function $v_{\pi_k}$.
  - This asks: if we keep following $\pi_k$, what is the long-run cumulative return from each state $s$?
  Result: we obtain $v_{\pi_k}$, the value function under that policy.
- Step 2: policy improvement (PI)
  $$\pi_{k+1} = \arg\max_{\pi} \left( r_\pi + \gamma P_\pi v_{\pi_k} \right)$$
  - The maximization is componentwise!
  Meaning:
  - Given $v_{\pi_k}$, find a better policy $\pi_{k+1}$.
  - Concretely, in every state $s$ we choose the action that maximizes "immediate reward + discounted future return".
  Note:
  - "The maximization is componentwise" means that an optimal action is chosen separately for each state $s$, rather than by a single global maximization.
  - Hence $\pi_{k+1}$ is typically a greedy policy: in each state it selects the action that currently looks best.
- The algorithm leads to a sequence
  $$\pi_0 \xrightarrow{PE} v_{\pi_0} \xrightarrow{PI} \pi_1 \xrightarrow{PE} v_{\pi_1} \xrightarrow{PI} \pi_2 \xrightarrow{PE} v_{\pi_2} \xrightarrow{PI} \cdots$$
  - where PE = policy evaluation, PI = policy improvement.
  Intuition
  - PE step: ask "if I faithfully follow the current policy, how much return do I get?"
  - PI step: ask "if I act a bit more greedily and switch to a better action in some states, can I do better?"
Q & A
- Q1: In the policy evaluation step, how do we get the state value $v_{\pi_k}$ by solving the Bellman equation?
  The Bellman equation for policy $\pi_k$ is
  $$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}.$$
  Closed-form solution:
  $$v_{\pi_k} = (I - \gamma P_{\pi_k})^{-1} r_{\pi_k}.$$
  Iterative solution:
  $$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0,1,2,\ldots$$
  Policy iteration is therefore an iterative algorithm with another iterative algorithm embedded in its policy evaluation step! A sketch comparing the two solutions follows.
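As a concrete illustration of Q1, the following sketch compares the closed-form and iterative solutions. It assumes (my convention) that the policy-induced quantities are already available as a transition matrix `P_pi[s, s_next]` and a reward vector `r_pi[s]`.

```python
import numpy as np

def policy_evaluation_closed_form(P_pi, r_pi, gamma=0.9):
    """Solve v = r_pi + gamma * P_pi v exactly: v = (I - gamma * P_pi)^{-1} r_pi."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def policy_evaluation_iterative(P_pi, r_pi, gamma=0.9, tol=1e-8, max_iter=100_000):
    """Iterate v^{(j+1)} = r_pi + gamma * P_pi v^{(j)} until the change is tiny."""
    v = np.zeros_like(r_pi, dtype=float)
    for _ in range(max_iter):
        v_new = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```

Both functions return (numerically) the same $v_{\pi_k}$; the iterative one is exactly the inner loop that gets embedded inside policy iteration.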
- Q2: In the policy improvement step, why is the new policy $\pi_{k+1}$ better than $\pi_k$?
  - Lemma (Policy Improvement): If
    $$\pi_{k+1} = \arg\max_{\pi} \big(r_\pi + \gamma P_\pi v_{\pi_k}\big),$$
    then
    $$v_{\pi_{k+1}} \geq v_{\pi_k}, \quad \forall k.$$
- Q3: Why can such an iterative algorithm finally reach an optimal policy?
  Since every iteration improves the policy, we have
  $$v_{\pi_0} \leq v_{\pi_1} \leq v_{\pi_2} \leq \cdots \leq v_{\pi_k} \leq \cdots \leq v^*.$$
  As a result, $v_{\pi_k}$ keeps increasing and will converge.
  Theorem (Convergence of Policy Iteration)
  - The state value sequence $\{v_{\pi_k}\}_{k=0}^{\infty}$ generated by the policy iteration algorithm converges to the optimal state value $v^*$.
  - As a result, the policy sequence $\{\pi_k\}_{k=0}^{\infty}$ converges to an optimal policy.
- Q4: What is the relationship between this policy iteration algorithm and the previous value iteration algorithm?
  - This is related to the answer to Q3; the two algorithms are compared in detail in the truncated policy iteration section below.
Policy iteration algorithm - Elementwise form
- Step 1: Policy Evaluation
  Matrix-vector form:
  $$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0,1,2,\ldots$$
  Elementwise form:
  $$v_{\pi_k}^{(j+1)}(s) = \sum_a \pi_k(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}^{(j)}(s') \right), \quad s \in \mathcal{S}.$$
  - Stop when $j \to \infty$ or $j$ is sufficiently large, or when $\| v_{\pi_k}^{(j+1)} - v_{\pi_k}^{(j)} \|$ is sufficiently small.
  By iterating this update, $v_{\pi_k}^{(j)}$ converges to the true value function $v_{\pi_k}$ of the policy.
- Step 2: Policy Improvement
  Matrix-vector form:
  $$\pi_{k+1} = \arg\max_\pi \left( r_\pi + \gamma P_\pi v_{\pi_k} \right)$$
  Elementwise form:
  $$\pi_{k+1}(s) = \arg\max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}(s') \right), \quad s \in \mathcal{S}.$$
  Here, the term in parentheses is $q_{\pi_k}(s,a)$, the action value under policy $\pi_k$. Let
  $$a^*_k(s) = \arg\max_a q_{\pi_k}(s,a).$$
  Then the greedy policy is
  $$\pi_{k+1}(a \mid s) = \begin{cases} 1, & a = a^*_k(s) \\ 0, & a \neq a^*_k(s). \end{cases}$$
  Given $v_{\pi_k}$, we look for a new policy $\pi_{k+1}$ that maximizes the long-run return in every state.
Overall understanding
- Policy evaluation (PE): compute how valuable each state is under the current policy $\pi_k$, by iterating the update above until it converges.
- Policy improvement (PI): based on $v_{\pi_k}$, pick a better action in every state, which yields the new policy $\pi_{k+1}$.
Pseudocode: Policy Iteration Algorithm
- Initialization: the probability models $p(r \mid s,a)$ and $p(s' \mid s,a)$ for all $(s,a)$ are known. Initial guess $\pi_0$.
  The environment model is known:
  - Reward distribution $p(r \mid s,a)$: the probability of receiving reward $r$ when taking action $a$ in state $s$.
  - Transition distribution $p(s' \mid s,a)$: the probability of moving to state $s'$ after taking action $a$ in state $s$.
  The policy $\pi_0$ is initialized arbitrarily (e.g., choosing actions at random in every state).
- Aim: search for the optimal state value and an optimal policy.
  The goal is to find:
  - the optimal state value $v^*$,
  - an optimal policy $\pi^*$,
  i.e., a solution of the Bellman optimality equation.
- While the policy has not converged, for the $k$-th iteration, do
  - Policy evaluation:
    - Initialization: an arbitrary initial guess $v_{\pi_k}^{(0)}$.
    - While $v_{\pi_k}^{(j)}$ has not converged, for the $j$-th iteration, do
      - For every state $s \in \mathcal{S}$, do
        $$v_{\pi_k}^{(j+1)}(s) = \sum_a \pi_k(a \mid s) \left[ \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}^{(j)}(s') \right]$$
  - Policy improvement:
    - For every state $s \in \mathcal{S}$, do
      $$q_{\pi_k}(s,a) = \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}(s'),$$
      - $a^*_k(s) = \arg\max_a q_{\pi_k}(s,a)$,
      - $\pi_{k+1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k+1}(a \mid s) = 0$ otherwise.
  Meaning:
  - PE: with the policy $\pi_k$ fixed, iterate until the true state value function $v_{\pi_k}$ of that policy is obtained.
  - PI: based on the evaluation result $v_{\pi_k}$, improve the policy so that it always picks the action that currently looks best in every state.
  Convergence
  If the policy stops changing after some iteration ($\pi_{k+1} = \pi_k$), we have found an optimal policy $\pi^*$, and the corresponding $v_{\pi^*}$ is the optimal state value $v^*$. A minimal implementation sketch follows.
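A minimal policy-iteration sketch under the same assumed model arrays `P[s, a, s_next]` and `R[s, a]` used earlier (the names, the tolerance, and the representation of a deterministic policy as an integer array are my own choices):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_tol=1e-8):
    """Policy iteration with (numerically) exact policy evaluation."""
    num_states, num_actions, _ = P.shape
    policy = np.zeros(num_states, dtype=int)      # arbitrary initial deterministic policy pi_0

    while True:
        # Policy evaluation: iterate v <- r_pi + gamma * P_pi v until convergence
        P_pi = P[np.arange(num_states), policy]   # (S, S): transition rows chosen by pi_k
        r_pi = R[np.arange(num_states), policy]   # (S,): rewards under pi_k
        v = np.zeros(num_states)
        while True:
            v_new = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_new - v)) < eval_tol:
                v = v_new
                break
            v = v_new

        # Policy improvement: greedy with respect to q_{pi_k}(s, a)
        q = R + gamma * np.einsum("sat,t->sa", P, v)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # policy no longer changes -> optimal
            return v, policy
        policy = new_policy
```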
Truncated policy iteration algorithm
Compare value iteration and policy iteration
- Matrix-vector form
  - Policy iteration: start from $\pi_0$
    - Step 1: Policy evaluation (PE)
      $$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$
    - Step 2: Policy improvement (PI)
      $$\pi_{k+1} = \arg\max_\pi \left( r_\pi + \gamma P_\pi v_{\pi_k} \right)$$
  - Value iteration: start from $v_0$
    - Step 1: Policy update (PU)
      $$\pi_{k+1} = \arg\max_\pi \left( r_\pi + \gamma P_\pi v_k \right)$$
    - Step 2: Value update (VU)
      $$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$$
- The two algorithms are very similar:
  - Policy iteration:
    $$\pi_0 \xrightarrow{PE} v_{\pi_0} \xrightarrow{PI} \pi_1 \xrightarrow{PE} v_{\pi_1} \xrightarrow{PI} \pi_2 \xrightarrow{PE} v_{\pi_2} \xrightarrow{PI} \cdots$$
  - Value iteration:
    $$u_0 \xrightarrow{PU} \pi'_1 \xrightarrow{VU} u_1 \xrightarrow{PU} \pi'_2 \xrightarrow{VU} u_2 \xrightarrow{PU} \cdots$$
  - where
    - PE = policy evaluation, PI = policy improvement,
    - PU = policy update, VU = value update.
  Intuition for the two algorithms
  - Policy iteration: evaluate $v_{\pi_k}$ to full precision before each policy update; it converges in few iterations, but each evaluation is expensive.
  - Value iteration: never solve for $v_\pi$ exactly; instead update $v_k$ one step at a time; each step is cheap, but more iterations are needed to converge.
  What they have in common
  - Both are built on the Bellman optimality equation.
  - Both alternate between a value update and a policy improvement, and both converge to an optimal policy $\pi^*$ and the optimal state value $v^*$.
Differences
- Policy Iteration (PI)
  - Policy evaluation (PE): in every iteration, $v_{\pi_k}$ is computed exactly (i.e., the equation $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$ is solved).
  - Policy improvement (PI): with the exact $v_{\pi_k}$ known, the policy is updated to $\pi_{k+1}$.
  - Characteristics:
    - Each iteration is very accurate, so only a few iterations are needed to converge.
    - But each iteration is costly, because it requires solving a large linear system (or iterating the evaluation until convergence).
  - Suitable when the state space is small enough that exact evaluation in every iteration is affordable.
- Value Iteration (VI)
  - Policy update (PU) + value update (VU):
    - $v_{\pi_k}$ is never solved exactly;
    - instead, starting from $v_k$, a greedy policy $\pi_{k+1}$ is derived and then only one value update is performed.
  - Characteristics:
    - Each step is cheap (a single Bellman update).
    - More iterations are required to converge.
  - Suitable when the state space is large and exact policy evaluation is infeasible, so the values can only be refined incrementally.
- Let's compare the steps carefully:
  - They start from the same initial condition.
  - In policy iteration, solving $v_{\pi_1} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}$ requires an iterative algorithm (an infinite number of iterations).
  - In value iteration, $v_1 = r_{\pi_1} + \gamma P_{\pi_1} v_0$ is a single one-step update.
  - Consider the step of solving $v_{\pi_1} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}$:
    $$v_{\pi_1}^{(0)} = v_0$$
    $$\text{Value iteration} \leftarrow v_1 \leftarrow v_{\pi_1}^{(1)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(0)}$$
    $$v_{\pi_1}^{(2)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(1)}$$
    $$\cdots$$
    $$\text{Truncated policy iteration} \leftarrow \bar{v}_1 \leftarrow v_{\pi_1}^{(j)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(j-1)}$$
    $$\cdots$$
    $$\text{Policy iteration} \leftarrow v_{\pi_1} \leftarrow v_{\pi_1}^{(\infty)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(\infty)}$$
  - Here
    - the value iteration algorithm performs this update once;
    - the policy iteration algorithm performs it an infinite number of times;
    - the truncated policy iteration algorithm performs it a finite number of times (say $j$); the remaining iterations from $j$ to $\infty$ are truncated. A small sketch of this spectrum follows.
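The following small helper (again assuming a policy-induced transition matrix `P_pi` and reward vector `r_pi`, as in the policy evaluation sketch earlier) makes this spectrum explicit: one backup reproduces the value-iteration update, many backups approach full policy evaluation, and stopping after $j$ steps is the truncated variant.

```python
import numpy as np

def truncated_evaluation(P_pi, r_pi, v0, gamma=0.9, num_steps=5):
    """Run num_steps Bellman backups for a fixed policy, starting from v0 (sketch).

    num_steps = 1        -> the value-iteration update
    num_steps very large -> full policy evaluation (policy iteration)
    anything in between  -> truncated policy iteration
    """
    v = np.asarray(v0, dtype=float).copy()
    for _ in range(num_steps):
        v = r_pi + gamma * P_pi @ v
    return v
```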
Pseudocode: Truncated policy iteration algorithm
Background
- The core of Policy Iteration (PI) consists of two steps:
  - Policy evaluation (PE): exactly solve for the state value $v_{\pi_k}$ of the current policy $\pi_k$.
  - Policy improvement (PI): based on $v_{\pi_k}$, update to a better policy $\pi_{k+1}$.
- However, making the policy evaluation exact requires iterating until convergence, which is computationally expensive.
- To speed things up, we can run only a finite number of approximate evaluation steps; this yields Truncated Policy Iteration (TPI).
It is essentially a compromise between Policy Iteration and Value Iteration.
- Initialization: the probability models $p(r \mid s,a)$ and $p(s' \mid s,a)$ for all $(s,a)$ are known. Initial guess $\pi_0$.
  Initialization
  - The environment model is known: the reward distribution $p(r \mid s,a)$ and the transition distribution $p(s' \mid s,a)$.
  - An initial policy $\pi_0$ is chosen arbitrarily.
- Aim: search for the optimal state value and an optimal policy.
- While the policy has not converged, for the $k$-th iteration, do
  - Policy evaluation:
    - Initialization: select the initial guess as $v_k^{(0)} = v_{k-1}$. The maximum number of iterations is set to $j_{\text{truncate}}$.
    - While $j < j_{\text{truncate}}$, do
      - For every state $s \in \mathcal{S}$, do
        $$v_k^{(j+1)}(s) = \sum_a \pi_k(a \mid s) \left[ \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_k^{(j)}(s') \right]$$
    - Set $v_k = v_k^{(j_{\text{truncate}})}$.
  Truncated policy evaluation, step by step:
  - Initialize $v_k^{(0)} = v_{k-1}$, i.e., warm-start from the result of the previous outer iteration.
  - Set the maximum number of inner iterations to $j_{\text{truncate}}$.
  - Run the update above only a finite number of times; after $j_{\text{truncate}}$ iterations we obtain an approximation $v_k \approx v_{\pi_k}$.
  - Unlike full policy evaluation, we do not wait for $v_{\pi_k}^{(j)}$ to converge; the iteration is cut off early.
  - Policy improvement:
    - For every state $s \in \mathcal{S}$, do
      $$q_k(s,a) = \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_k(s'),$$
      - $a^*_k(s) = \arg\max_a q_k(s,a)$,
      - $\pi_{k+1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k+1}(a \mid s) = 0$ otherwise.
  Policy improvement, step by step:
  - For every state $s$, first compute the q-value of every action:
    $$q_k(s,a) = \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_k(s')$$
  - Select the best action:
    $$a_k^*(s) = \arg\max_a q_k(s,a)$$
  - Update the policy:
    $$\pi_{k+1}(a \mid s) = \begin{cases} 1, & a = a_k^*(s) \\ 0, & a \neq a_k^*(s) \end{cases}$$
  A minimal implementation sketch follows.
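Here is a truncated-policy-iteration sketch under the same assumed model arrays `P[s, a, s_next]` and `R[s, a]`; `j_truncate` is the only new parameter, and the warm start $v_k^{(0)} = v_{k-1}$ is obtained by simply reusing `v` across outer iterations.

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=10, max_outer_iter=1_000):
    """Truncated policy iteration: j_truncate evaluation sweeps per outer iteration."""
    num_states, num_actions, _ = P.shape
    policy = np.zeros(num_states, dtype=int)       # arbitrary initial policy pi_0
    v = np.zeros(num_states)                       # v_0; reused as warm start (v_k^{(0)} = v_{k-1})

    for _ in range(max_outer_iter):
        # Truncated policy evaluation: only j_truncate Bellman backups under pi_k
        P_pi = P[np.arange(num_states), policy]    # (S, S)
        r_pi = R[np.arange(num_states), policy]    # (S,)
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v

        # Policy improvement: greedy with respect to q_k(s, a)
        q = R + gamma * np.einsum("sat,t->sa", P, v)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):     # policy stops changing -> (near-)optimal
            break
        policy = new_policy

    return v, policy
```

Setting `j_truncate = 1` essentially recovers value iteration, while a very large `j_truncate` behaves like policy iteration.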
Convergence
Proposition (Value Improvement)
- Consider the iterative algorithm for solving the policy evaluation step:
  $$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0,1,2,\ldots$$
- If the initial guess is selected as $v_{\pi_k}^{(0)} = v_{\pi_{k-1}}$, it holds that
  $$v_{\pi_k}^{(j+1)} \geq v_{\pi_k}^{(j)}$$
- for every $j = 0,1,2,\ldots$
Overall understanding
- Policy Iteration: solve $v_{\pi_k}$ exactly in every iteration; each iteration is expensive, but few iterations are needed.
- Value Iteration: perform only one value update per iteration; each step is cheap, but many iterations are needed.
- Truncated Policy Iteration: a compromise; perform a finite number ($j_{\text{truncate}}$) of value updates per iteration.
  - Large $j_{\text{truncate}}$: behaves more like Policy Iteration.
  - Small $j_{\text{truncate}}$: behaves more like Value Iteration.
Summary
Truncated policy iteration (TPI) performs only a finite number of updates in the policy evaluation step, trading per-iteration cost against convergence speed: the larger $j_{\text{truncate}}$ is, the closer it behaves to Policy Iteration; the smaller it is, the closer it behaves to Value Iteration.