小白讲强化学习:从零开始的4x4网格世界探索
引言:为什么我们需要强化学习?
问题描述
想象一下,我们有一个聪明的机器人小智,它被放置在一个4×4的网格世界中。这个世界的规则如下:
- 起始位置:左上角 (1,1)
- 目标位置:右下角 (4,4)
- 障碍物:位于 (2,2) 和 (3,3),机器人无法通过
- 可用动作:上、下、左、右、不动(5种动作)
- 目标:找到从起点到终点的最优路径
网格世界示意图:
```
+---+---+---+---+
| S |   |   |   |   S = 起点 (1,1)
+---+---+---+---+
|   | X |   |   |   X = 障碍物
+---+---+---+---+
|   |   | X |   |
+---+---+---+---+
|   |   |   | G |   G = 目标 (4,4)
+---+---+---+---+
```
为什么传统方法不够用?
- 动态环境:如果环境经常变化(障碍物位置改变),预编程的路径就失效了
- 不确定性:机器人的动作可能不总是成功(比如有10%的概率滑向其他方向)
- 奖励优化:我们希望机器人不仅能到达目标,还要找到最优(最短、最安全)的路径
- 自主学习:我们希望机器人能够自己探索和学习,而不是被告知每一步该怎么做
这就是强化学习发挥作用的地方!强化学习让智能体通过与环境的交互,在试错中学习最优策略。
第一章:状态表示
1.1 什么是状态?
**状态(State)**是对当前环境情况的完整描述。在我们的网格世界中,状态就是机器人当前所在的位置。
1.2 数学表示
我们可以用坐标 $(x, y)$ 来表示状态,其中:
- $x$ 表示行号(1-4)
- $y$ 表示列号(1-4)
状态空间:$S = \{(x,y) \mid x,y \in \{1,2,3,4\},\ (x,y) \neq (2,2),\ (x,y) \neq (3,3)\}$
总共有 $4 \times 4 - 2 = 14$ 个有效状态(减去2个障碍物位置)。
1.3 状态编码
为了便于计算机处理,我们将二维坐标转换为一维索引:
编码公式:
$$\text{state\_id} = (x-1) \times 4 + (y-1)$$
解码公式:
$$x = \lfloor \text{state\_id} / 4 \rfloor + 1$$
$$y = (\text{state\_id} \bmod 4) + 1$$
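举个小例子(数字只是演示):取状态 $(3,2)$,
$$\text{state\_id} = (3-1)\times 4 + (2-1) = 9,\qquad x = \lfloor 9/4 \rfloor + 1 = 3,\quad y = (9 \bmod 4) + 1 = 2$$
正好恢复出原坐标。注意:下一节的代码实现为了方便直接采用0-based索引(坐标从 $(0,0)$ 数起),编码公式相应简化为 $\text{state\_id} = x \times 4 + y$。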
1.4 Python实现
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns

class GridWorld:
    def __init__(self):
        self.grid_size = 4
        self.start_state = (0, 0)          # 转换为0-based索引
        self.goal_state = (3, 3)
        self.obstacles = [(1, 1), (2, 2)]  # 转换为0-based索引
        # 动作定义:上、下、左、右、不动
        self.actions = {
            0: (-1, 0),  # 上
            1: (1, 0),   # 下
            2: (0, -1),  # 左
            3: (0, 1),   # 右
            4: (0, 0)    # 不动
        }
        self.action_names = ['上', '下', '左', '右', '不动']

    def state_to_coord(self, state_id):
        """将状态ID转换为坐标"""
        x = state_id // self.grid_size
        y = state_id % self.grid_size
        return (x, y)

    def coord_to_state(self, x, y):
        """将坐标转换为状态ID"""
        return x * self.grid_size + y

    def is_valid_state(self, x, y):
        """检查状态是否有效"""
        if x < 0 or x >= self.grid_size or y < 0 or y >= self.grid_size:
            return False
        if (x, y) in self.obstacles:
            return False
        return True

    def get_valid_states(self):
        """获取所有有效状态"""
        valid_states = []
        for x in range(self.grid_size):
            for y in range(self.grid_size):
                if self.is_valid_state(x, y):
                    valid_states.append(self.coord_to_state(x, y))
        return valid_states

    def visualize_grid(self):
        """可视化网格世界"""
        fig, ax = plt.subplots(figsize=(8, 8))
        # 画网格
        for i in range(self.grid_size + 1):
            ax.axhline(i, color='black', linewidth=1)
            ax.axvline(i, color='black', linewidth=1)
        # 标记起点
        start_rect = Rectangle(self.start_state, 1, 1, facecolor='green', alpha=0.7)
        ax.add_patch(start_rect)
        ax.text(self.start_state[1] + 0.5, self.start_state[0] + 0.5, 'S',
                ha='center', va='center', fontsize=16, fontweight='bold')
        # 标记终点
        goal_rect = Rectangle(self.goal_state, 1, 1, facecolor='red', alpha=0.7)
        ax.add_patch(goal_rect)
        ax.text(self.goal_state[1] + 0.5, self.goal_state[0] + 0.5, 'G',
                ha='center', va='center', fontsize=16, fontweight='bold')
        # 标记障碍物
        for obs in self.obstacles:
            obs_rect = Rectangle(obs, 1, 1, facecolor='black', alpha=0.8)
            ax.add_patch(obs_rect)
            ax.text(obs[1] + 0.5, obs[0] + 0.5, 'X',
                    ha='center', va='center', color='white', fontsize=16, fontweight='bold')
        # 标记状态ID
        for x in range(self.grid_size):
            for y in range(self.grid_size):
                if self.is_valid_state(x, y):
                    state_id = self.coord_to_state(x, y)
                    ax.text(y + 0.1, x + 0.9, f'({state_id})', fontsize=10, color='blue')
        ax.set_xlim(0, self.grid_size)
        ax.set_ylim(0, self.grid_size)
        ax.set_aspect('equal')
        ax.invert_yaxis()  # 让(0,0)在左上角
        ax.set_title('4×4网格世界状态表示', fontsize=16)
        plt.tight_layout()
        return fig

# 创建环境并可视化
env = GridWorld()
print("有效状态列表:", env.get_valid_states())
print("状态总数:", len(env.get_valid_states()))

# 测试状态转换
test_coord = (2, 1)
state_id = env.coord_to_state(test_coord[0], test_coord[1])
recovered_coord = env.state_to_coord(state_id)
print(f"原坐标: {test_coord}, 状态ID: {state_id}, 恢复坐标: {recovered_coord}")
```
第二章:马尔科夫过程
2.1 什么是马尔科夫性质?
马尔科夫性质:未来的状态只依赖于当前状态,而不依赖于过去的历史。
数学表达:
$$P(S_{t+1} = s' \mid S_t = s, S_{t-1} = s_{t-1}, \ldots, S_0 = s_0) = P(S_{t+1} = s' \mid S_t = s)$$
简单来说:“未来只看现在,不看过去”
2.2 马尔科夫过程 (Markov Process)
马尔科夫过程是一个二元组:$\mathcal{M} = (S, P)$
- $S$:状态空间
- $P$:状态转移概率矩阵
状态转移概率:
$$P_{ss'} = P(S_{t+1} = s' \mid S_t = s)$$
满足:$\sum_{s' \in S} P_{ss'} = 1$(概率和为1)
2.3 我们网格世界中的马尔科夫过程
假设机器人的动作有一定随机性:
- 80%概率按预期方向移动
- 20%概率随机移动到相邻的有效位置(按这一假设构造转移概率的小示意见下方;下一节的实现则简化为均匀随机游走)
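如果想严格按上面 80% / 20% 的假设构造状态转移概率,可以参考下面这个小示意(纯属演示:slip_prob=0.2 等数值是假设,这里把"随机移动"近似为在其余动作间均匀"打滑",env 沿用上文的 GridWorld;它不是下一节实现的一部分):

```python
import numpy as np

def slip_transition_row(env, states, state_id, action_id, slip_prob=0.2):
    """示例:按"1-slip_prob 的概率按预期动作移动,slip_prob 的概率滑向其他动作"构造一行转移概率。"""
    row = np.zeros(len(states))

    def landing_state(a_id):
        # 执行动作 a_id 后实际落到的状态(撞墙或越界则留在原地)
        x, y = env.state_to_coord(state_id)
        dx, dy = env.actions[a_id]
        nx, ny = x + dx, y + dy
        if env.is_valid_state(nx, ny):
            return env.coord_to_state(nx, ny)
        return state_id

    # 80%:按预期动作移动
    row[states.index(landing_state(action_id))] += 1.0 - slip_prob
    # 20%:在其余动作间均匀"打滑"
    other_actions = [a for a in env.actions if a != action_id]
    for a in other_actions:
        row[states.index(landing_state(a))] += slip_prob / len(other_actions)
    return row
```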
2.4 Python实现
```python
class MarkovProcess:
    def __init__(self, grid_world):
        self.env = grid_world
        self.states = grid_world.get_valid_states()
        self.n_states = len(self.states)
        # 构建状态转移概率矩阵
        self.transition_matrix = self._build_transition_matrix()

    def _build_transition_matrix(self):
        """构建状态转移概率矩阵"""
        # 这里我们假设机器人随机选择动作
        P = np.zeros((self.n_states, self.n_states))
        for i, state_id in enumerate(self.states):
            x, y = self.env.state_to_coord(state_id)
            # 获取所有可能的下一个状态
            next_states = []
            for action_id, (dx, dy) in self.env.actions.items():
                next_x, next_y = x + dx, y + dy
                if self.env.is_valid_state(next_x, next_y):
                    next_state_id = self.env.coord_to_state(next_x, next_y)
                    next_states.append(next_state_id)
                else:
                    # 如果移动无效,留在原地
                    next_states.append(state_id)
            # 均匀分布概率(每个动作概率相等)
            for next_state_id in next_states:
                next_state_idx = self.states.index(next_state_id)
                P[i][next_state_idx] += 1.0 / len(self.env.actions)
        return P

    def simulate_random_walk(self, start_state_id, steps=100):
        """模拟随机游走"""
        current_state_idx = self.states.index(start_state_id)
        trajectory = [start_state_id]
        for _ in range(steps):
            # 根据转移概率选择下一个状态
            next_state_idx = np.random.choice(self.n_states,
                                              p=self.transition_matrix[current_state_idx])
            next_state_id = self.states[next_state_idx]
            trajectory.append(next_state_id)
            current_state_idx = next_state_idx
            # 如果到达目标状态,结束模拟
            if next_state_id == self.env.coord_to_state(*self.env.goal_state):
                break
        return trajectory

    def visualize_transition_matrix(self):
        """可视化状态转移矩阵"""
        fig, ax = plt.subplots(figsize=(10, 8))
        sns.heatmap(self.transition_matrix,
                    xticklabels=self.states,
                    yticklabels=self.states,
                    annot=True, fmt='.3f', cmap='Blues', ax=ax)
        ax.set_title('状态转移概率矩阵')
        ax.set_xlabel('下一个状态')
        ax.set_ylabel('当前状态')
        plt.tight_layout()
        return fig

# 创建马尔科夫过程
mp = MarkovProcess(env)

# 模拟随机游走
start_state = env.coord_to_state(*env.start_state)
trajectory = mp.simulate_random_walk(start_state, steps=50)
print("随机游走轨迹(状态ID):", trajectory[:10], "...")
print("轨迹长度:", len(trajectory))

# 转换为坐标显示
coord_trajectory = [env.state_to_coord(state_id) for state_id in trajectory[:10]]
print("对应坐标轨迹:", coord_trajectory)
```
第三章:马尔科夫决策过程(MDP)
3.1 从马尔科夫过程到MDP
马尔科夫决策过程在马尔科夫过程的基础上增加了动作和奖励:
$$\mathcal{M} = (S, A, P, R, \gamma)$$
- $S$:状态空间
- $A$:动作空间
- $P$:状态转移概率 $P(s'|s,a)$
- $R$:奖励函数 $R(s,a,s')$
- $\gamma$:折扣因子,$\gamma \in [0,1]$
3.2 关键概念详解
3.2.1 状态转移概率
$$P_{ss'}^{a} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$$
3.2.2 奖励函数
我们设计如下奖励结构:
- 到达目标:+100
- 撞墙(动作无效):-10
- 正常移动:-1(鼓励快速到达目标)
- 不动:-5(不鼓励原地不动)
3.2.3 策略 (Policy)
策略 $\pi$ 定义了在每个状态下选择动作的规则:
$$\pi(a|s) = P(A_t = a \mid S_t = s)$$
3.2.4 价值函数
状态价值函数:在策略 $\pi$ 下,从状态 $s$ 开始的期望回报
$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 = s\right]$$
动作价值函数:在状态 $s$ 执行动作 $a$,然后遵循策略 $\pi$ 的期望回报
$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 = s, A_0 = a\right]$$
3.3 Python实现
```python
class GridWorldMDP:
    def __init__(self, grid_world, gamma=0.9):
        self.env = grid_world
        self.states = grid_world.get_valid_states()
        self.n_states = len(self.states)
        self.n_actions = len(grid_world.actions)
        self.gamma = gamma
        # 构建转移概率和奖励函数
        self.P, self.R = self._build_mdp()

    def _build_mdp(self):
        """构建MDP的转移概率和奖励函数"""
        # P[s][a][s'] = 从状态s执行动作a转移到状态s'的概率
        P = np.zeros((self.n_states, self.n_actions, self.n_states))
        # R[s][a] = 在状态s执行动作a的即时奖励
        R = np.zeros((self.n_states, self.n_actions))
        for i, state_id in enumerate(self.states):
            x, y = self.env.state_to_coord(state_id)
            for action_id, (dx, dy) in self.env.actions.items():
                next_x, next_y = x + dx, y + dy
                # 检查移动是否有效
                if self.env.is_valid_state(next_x, next_y):
                    next_state_id = self.env.coord_to_state(next_x, next_y)
                    next_state_idx = self.states.index(next_state_id)
                    # 设置转移概率(确定性环境)
                    P[i][action_id][next_state_idx] = 1.0
                    # 设置奖励
                    if (next_x, next_y) == self.env.goal_state:
                        R[i][action_id] = 100   # 到达目标
                    elif action_id == 4:        # 不动
                        R[i][action_id] = -5
                    else:
                        R[i][action_id] = -1    # 正常移动
                else:
                    # 无效移动,留在原地
                    P[i][action_id][i] = 1.0
                    R[i][action_id] = -10       # 撞墙惩罚
        return P, R

    def step(self, state_id, action_id):
        """执行一步动作"""
        state_idx = self.states.index(state_id)
        # 根据转移概率采样下一个状态
        next_state_idx = np.random.choice(self.n_states, p=self.P[state_idx][action_id])
        next_state_id = self.states[next_state_idx]
        # 获取奖励
        reward = self.R[state_idx][action_id]
        # 检查是否终止
        done = (next_state_id == self.env.coord_to_state(*self.env.goal_state))
        return next_state_id, reward, done

    def reset(self):
        """重置环境到初始状态"""
        return self.env.coord_to_state(*self.env.start_state)

# 创建MDP
mdp = GridWorldMDP(env)

# 测试环境交互
state = mdp.reset()
print(f"初始状态: {state}, 坐标: {env.state_to_coord(state)}")

for step in range(5):
    action = np.random.randint(5)  # 随机选择动作
    next_state, reward, done = mdp.step(state, action)
    print(f"步骤 {step + 1}:")
    print(f"  状态: {state} -> {next_state}")
    print(f"  动作: {action} ({env.action_names[action]})")
    print(f"  奖励: {reward}")
    print(f"  终止: {done}")
    state = next_state
    if done:
        print("到达目标!")
        break
```
第四章:贝尔曼方程
4.1 贝尔曼方程的直觉理解
贝尔曼方程描述了价值函数的递归关系:
“一个状态的价值 = 立即奖励 + 未来状态价值的期望”
4.2 贝尔曼期望方程
4.2.1 状态价值函数的贝尔曼方程
$$V^\pi(s) = \sum_{a} \pi(a|s) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^\pi(s') \right]$$
逐步分解:
- $\sum_{a} \pi(a|s)$:对所有可能的动作求期望
- $\sum_{s'} P_{ss'}^{a}$:对所有可能的下一状态求期望
- $R_{ss'}^{a}$:立即奖励
- $\gamma V^\pi(s')$:折扣后的未来价值
4.2.2 动作价值函数的贝尔曼方程
$$Q^\pi(s,a) = \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma \sum_{a'} \pi(a'|s')\, Q^\pi(s',a') \right]$$
4.2.3 两个价值函数的关系
$$V^\pi(s) = \sum_{a} \pi(a|s)\, Q^\pi(s,a)$$
$$Q^\pi(s,a) = \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^\pi(s') \right]$$
4.3 贝尔曼最优方程
最优策略:$\pi^* = \arg\max_\pi V^\pi(s),\ \forall s$
最优价值函数:
$$V^*(s) = \max_a Q^*(s,a)$$
$$Q^*(s,a) = \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^*(s') \right]$$
最优贝尔曼方程:
$$V^*(s) = \max_a \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^*(s') \right]$$
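在动手看4.4节的完整实现之前,可以先看一个按上式做一次"贝尔曼最优备份"的简短 numpy 示意(假设已经有了上文 GridWorldMDP 构造出的 P、R 数组,奖励只依赖 $s$ 和 $a$;这只是对完整实现的压缩版草图):

```python
import numpy as np

def bellman_optimality_backup(P, R, V, gamma=0.9):
    """对所有状态做一次最优备份:V'(s) = max_a Σ_s' P[s,a,s'] * (R[s,a] + γ V[s'])。
    P 形状 (n_states, n_actions, n_states),R 形状 (n_states, n_actions)。"""
    # Q[s, a] = R[s, a] + γ * Σ_s' P[s, a, s'] * V[s']
    Q = R + gamma * np.einsum('san,n->sa', P, V)
    return Q.max(axis=1)

# 用法示意:从全零的 V 开始反复备份,直到变化小于阈值,即收敛到 V*
# V = np.zeros(mdp.n_states)
# for _ in range(1000):
#     V_new = bellman_optimality_backup(mdp.P, mdp.R, V, gamma=mdp.gamma)
#     if np.max(np.abs(V_new - V)) < 1e-6:
#         break
#     V = V_new
```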
4.4 Python实现
```python
class BellmanSolver:
    def __init__(self, mdp, tolerance=1e-6, max_iterations=1000):
        self.mdp = mdp
        self.tolerance = tolerance
        self.max_iterations = max_iterations

    def policy_evaluation(self, policy):
        """策略评估:计算给定策略的价值函数"""
        V = np.zeros(self.mdp.n_states)
        for iteration in range(self.max_iterations):
            V_new = np.zeros(self.mdp.n_states)
            for s in range(self.mdp.n_states):
                # 贝尔曼期望方程
                value = 0
                for a in range(self.mdp.n_actions):
                    action_value = 0
                    for s_next in range(self.mdp.n_states):
                        action_value += self.mdp.P[s][a][s_next] * (
                            self.mdp.R[s][a] + self.mdp.gamma * V[s_next])
                    value += policy[s][a] * action_value
                V_new[s] = value
            # 检查收敛
            if np.max(np.abs(V_new - V)) < self.tolerance:
                print(f"策略评估在第 {iteration + 1} 次迭代后收敛")
                return V_new
            V = V_new
        print(f"策略评估在 {self.max_iterations} 次迭代后未收敛")
        return V

    def value_iteration(self):
        """价值迭代:求解最优价值函数"""
        V = np.zeros(self.mdp.n_states)
        policy = np.zeros((self.mdp.n_states, self.mdp.n_actions))
        iteration_values = []  # 记录每次迭代的价值函数
        for iteration in range(self.max_iterations):
            V_new = np.zeros(self.mdp.n_states)
            for s in range(self.mdp.n_states):
                # 贝尔曼最优方程
                action_values = []
                for a in range(self.mdp.n_actions):
                    action_value = 0
                    for s_next in range(self.mdp.n_states):
                        action_value += self.mdp.P[s][a][s_next] * (
                            self.mdp.R[s][a] + self.mdp.gamma * V[s_next])
                    action_values.append(action_value)
                V_new[s] = max(action_values)
            iteration_values.append(V_new.copy())
            # 检查收敛
            if np.max(np.abs(V_new - V)) < self.tolerance:
                print(f"价值迭代在第 {iteration + 1} 次迭代后收敛")
                break
            V = V_new
        # 提取最优策略
        for s in range(self.mdp.n_states):
            action_values = []
            for a in range(self.mdp.n_actions):
                action_value = 0
                for s_next in range(self.mdp.n_states):
                    action_value += self.mdp.P[s][a][s_next] * (
                        self.mdp.R[s][a] + self.mdp.gamma * V[s_next])
                action_values.append(action_value)
            # 最优动作
            best_action = np.argmax(action_values)
            policy[s][best_action] = 1.0
        return V, policy, iteration_values

    def visualize_value_function(self, V, title="价值函数"):
        """可视化价值函数"""
        # 创建一个4x4的矩阵来显示价值
        value_grid = np.full((4, 4), np.nan)
        for i, state_id in enumerate(self.mdp.states):
            x, y = self.mdp.env.state_to_coord(state_id)
            value_grid[x, y] = V[i]
        fig, ax = plt.subplots(figsize=(8, 8))
        # 创建热力图
        im = ax.imshow(value_grid, cmap='RdYlBu_r', alpha=0.8)
        # 添加文本标注
        for x in range(4):
            for y in range(4):
                if not np.isnan(value_grid[x, y]):
                    ax.text(y, x, f'{value_grid[x, y]:.1f}',
                            ha="center", va="center", color="black", fontweight="bold")
        # 标记特殊位置
        start_x, start_y = self.mdp.env.start_state
        goal_x, goal_y = self.mdp.env.goal_state
        ax.add_patch(Rectangle((start_y - 0.4, start_x - 0.4), 0.8, 0.8,
                               fill=False, edgecolor='green', linewidth=3))
        ax.add_patch(Rectangle((goal_y - 0.4, goal_x - 0.4), 0.8, 0.8,
                               fill=False, edgecolor='red', linewidth=3))
        # 标记障碍物
        for obs_x, obs_y in self.mdp.env.obstacles:
            ax.add_patch(Rectangle((obs_y - 0.5, obs_x - 0.5), 1, 1,
                                   facecolor='black', alpha=0.8))
        ax.set_title(title, fontsize=16)
        ax.set_xticks(range(4))
        ax.set_yticks(range(4))
        plt.colorbar(im)
        plt.tight_layout()
        return fig

# 创建求解器
solver = BellmanSolver(mdp)

# 求解最优价值函数和策略
print("开始价值迭代求解...")
optimal_V, optimal_policy, iteration_history = solver.value_iteration()

print("\n最优价值函数:")
for i, state_id in enumerate(mdp.states):
    coord = env.state_to_coord(state_id)
    print(f"状态 {state_id} {coord}: V* = {optimal_V[i]:.2f}")

print("\n最优策略:")
for i, state_id in enumerate(mdp.states):
    coord = env.state_to_coord(state_id)
    best_action = np.argmax(optimal_policy[i])
    print(f"状态 {state_id} {coord}: 最优动作 = {best_action} ({env.action_names[best_action]})")

# 可视化最优价值函数
fig = solver.visualize_value_function(optimal_V, "最优价值函数")
plt.show()
```
第五章:策略梯度方法
5.1 策略梯度的基本思想
前面我们学习了基于价值的方法(如价值迭代),现在我们学习基于策略的方法。
策略梯度直接优化参数化的策略:
$$\pi_\theta(a|s) = P(A_t = a \mid S_t = s, \theta)$$
其中 $\theta$ 是策略的参数。
5.2 目标函数
我们的目标是最大化期望回报:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$$
对于有限步骤的情况:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1}\right]$$
5.3 策略梯度定理
策略梯度定理告诉我们如何计算目标函数的梯度:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s,a)\right]$$
直觉理解:
- 如果 $Q^{\pi_\theta}(s,a) > 0$(好的动作),增加 $\pi_\theta(a|s)$ 的概率
- 如果 $Q^{\pi_\theta}(s,a) < 0$(坏的动作),减少 $\pi_\theta(a|s)$ 的概率
5.4 REINFORCE算法
最简单的策略梯度算法:
- 用当前策略 $\pi_\theta$ 收集一个完整的回合
- 计算每个时间步的回报 $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}$
- 更新参数:$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t$(折扣回报的计算见下面的小片段)
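只看第2步的话,下面这几行演示了如何从一条奖励序列倒序算出各时间步的折扣回报(rewards 是随手写的示例数字;这与 5.5 节 REINFORCE 类中的 calculate_returns 是同一个思路):

```python
def discounted_returns(rewards, gamma=0.9):
    """倒序累积:G_t = r_{t+1} + γ * G_{t+1}"""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns

print(discounted_returns([-1, -1, -1, 100], gamma=0.9))
# 输出约为 [70.19, 79.1, 89.0, 100.0]
```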
5.5 Python实现
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import deque
import random

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.softmax(self.fc3(x), dim=-1)
        return x

class REINFORCE:
    def __init__(self, state_size, action_size, lr=0.01, gamma=0.99):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        # 创建策略网络
        self.policy_net = PolicyNetwork(state_size, action_size)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        # 存储轨迹
        self.reset_episode()

    def reset_episode(self):
        """重置回合数据"""
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []

    def state_to_tensor(self, state_id, mdp):
        """将状态ID转换为one-hot编码的张量"""
        state_idx = mdp.states.index(state_id)
        state_vector = torch.zeros(len(mdp.states))
        state_vector[state_idx] = 1.0
        return state_vector.unsqueeze(0)

    def select_action(self, state_tensor):
        """根据当前策略选择动作"""
        action_probs = self.policy_net(state_tensor)
        action_dist = torch.distributions.Categorical(action_probs)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        return action.item(), log_prob

    def store_transition(self, state, action, reward, log_prob):
        """存储转移数据"""
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.log_probs.append(log_prob)

    def calculate_returns(self):
        """计算折扣回报"""
        returns = []
        G = 0
        # 从后往前计算
        for reward in reversed(self.rewards):
            G = reward + self.gamma * G
            returns.insert(0, G)
        return returns

    def update_policy(self):
        """更新策略网络"""
        returns = self.calculate_returns()
        # 标准化回报(减少方差)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        # 计算策略损失
        policy_loss = []
        for log_prob, G in zip(self.log_probs, returns):
            policy_loss.append(-log_prob * G)
        policy_loss = torch.stack(policy_loss).sum()
        # 梯度更新
        self.optimizer.zero_grad()
        policy_loss.backward()
        self.optimizer.step()
        return policy_loss.item()

class PolicyGradientTrainer:
    def __init__(self, mdp, agent):
        self.mdp = mdp
        self.agent = agent

    def train_episode(self):
        """训练一个回合"""
        state = self.mdp.reset()
        self.agent.reset_episode()
        total_reward = 0
        steps = 0
        max_steps = 100
        while steps < max_steps:
            # 状态编码
            state_tensor = self.agent.state_to_tensor(state, self.mdp)
            # 选择动作
            action, log_prob = self.agent.select_action(state_tensor)
            # 执行动作
            next_state, reward, done = self.mdp.step(state, action)
            # 存储转移
            self.agent.store_transition(state, action, reward, log_prob)
            total_reward += reward
            state = next_state
            steps += 1
            if done:
                break
        # 更新策略
        loss = self.agent.update_policy()
        return total_reward, steps, loss

    def train(self, num_episodes=1000):
        """训练多个回合"""
        rewards_history = []
        steps_history = []
        loss_history = []
        for episode in range(num_episodes):
            total_reward, steps, loss = self.train_episode()
            rewards_history.append(total_reward)
            steps_history.append(steps)
            loss_history.append(loss)
            if (episode + 1) % 100 == 0:
                avg_reward = np.mean(rewards_history[-100:])
                avg_steps = np.mean(steps_history[-100:])
                print(f"回合 {episode + 1}: 平均奖励 = {avg_reward:.2f}, 平均步数 = {avg_steps:.2f}")
        return rewards_history, steps_history, loss_history

    def test_policy(self, num_episodes=10):
        """测试训练好的策略"""
        test_rewards = []
        test_paths = []
        for _ in range(num_episodes):
            state = self.mdp.reset()
            path = [state]
            total_reward = 0
            steps = 0
            max_steps = 100
            while steps < max_steps:
                state_tensor = self.agent.state_to_tensor(state, self.mdp)
                # 使用贪心策略(选择概率最大的动作)
                with torch.no_grad():
                    action_probs = self.agent.policy_net(state_tensor)
                    action = torch.argmax(action_probs).item()
                next_state, reward, done = self.mdp.step(state, action)
                path.append(next_state)
                total_reward += reward
                state = next_state
                steps += 1
                if done:
                    break
            test_rewards.append(total_reward)
            test_paths.append(path)
        return test_rewards, test_paths

# 创建和训练策略梯度智能体
print("创建策略梯度智能体...")
pg_agent = REINFORCE(state_size=len(mdp.states), action_size=mdp.n_actions, lr=0.01)
trainer = PolicyGradientTrainer(mdp, pg_agent)

print("开始训练...")
rewards_history, steps_history, loss_history = trainer.train(num_episodes=500)

# 测试训练好的策略
print("\n测试训练好的策略...")
test_rewards, test_paths = trainer.test_policy(num_episodes=5)

print("测试结果:")
for i, (reward, path) in enumerate(zip(test_rewards, test_paths)):
    coord_path = [env.state_to_coord(state) for state in path]
    print(f"测试 {i+1}: 奖励 = {reward:.1f}, 路径长度 = {len(path)}")
    print(f"  路径: {coord_path[:10]}{'...' if len(path) > 10 else ''}")

# 可视化训练过程
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 奖励曲线
axes[0].plot(rewards_history)
axes[0].set_title('训练奖励')
axes[0].set_xlabel('回合')
axes[0].set_ylabel('总奖励')

# 步数曲线
axes[1].plot(steps_history)
axes[1].set_title('回合步数')
axes[1].set_xlabel('回合')
axes[1].set_ylabel('步数')

# 损失曲线
axes[2].plot(loss_history)
axes[2].set_title('策略损失')
axes[2].set_xlabel('回合')
axes[2].set_ylabel('损失')

plt.tight_layout()
plt.show()
```
第六章:Q-Learning
6.1 Q-Learning的基本思想
Q-Learning是一种无模型的强化学习算法,它直接学习动作价值函数 Q ( s , a ) Q(s,a) Q(s,a),而不需要知道环境的转移概率。
6.2 Q-Learning的数学原理
Q-Learning基于以下更新规则:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
参数解释:
- $\alpha$:学习率($0 < \alpha \le 1$)
- $\gamma$:折扣因子($0 \le \gamma \le 1$)
- $r$:即时奖励
- $s'$:下一个状态
- $\max_{a'} Q(s',a')$:下一个状态的最大Q值
更新规则的直觉:
- $r + \gamma \max_{a'} Q(s',a')$:目标值(我们希望 $Q(s,a)$ 接近的值)
- $Q(s,a)$:当前值
- $r + \gamma \max_{a'} Q(s',a') - Q(s,a)$:TD误差
- 用TD误差来调整当前的Q值(下面给出一次更新的数值小例子)
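数值小例子(数字纯属假设):设 $\alpha=0.1$、$\gamma=0.9$,当前 $Q(s,a)=2$,执行动作后得到 $r=-1$,且下一状态的最大Q值 $\max_{a'}Q(s',a')=10$,则
$$\text{目标值} = -1 + 0.9 \times 10 = 8,\qquad \text{TD误差} = 8 - 2 = 6,\qquad Q(s,a) \leftarrow 2 + 0.1 \times 6 = 2.6$$
可以看到,Q值朝目标值走了一小步,步长由学习率 $\alpha$ 控制。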
6.3 ε-贪心策略
为了平衡探索和利用,我们使用ε-贪心策略:
$$\pi(a|s) = \begin{cases} 1-\epsilon + \dfrac{\epsilon}{|A|} & \text{if } a = \arg\max_{a'} Q(s,a') \\[6pt] \dfrac{\epsilon}{|A|} & \text{otherwise} \end{cases}$$
简化版本:
- 以概率 $1-\epsilon$ 选择最优动作(利用)
- 以概率 $\epsilon$ 随机选择动作(探索)
6.4 Q-Learning算法流程
- 初始化:Q表为0
- 对于每个回合:
- 初始化状态 $s$
- 对于回合中的每一步:
- 用ε-贪心策略选择动作 $a$
- 执行动作,观察奖励 $r$ 和下一状态 $s'$
- 更新Q值:$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$
- $s \leftarrow s'$
6.5 Python实现
```python
class QLearningAgent:
    def __init__(self, state_size, action_size, lr=0.1, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
        self.state_size = state_size
        self.action_size = action_size
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        # 初始化Q表
        self.q_table = np.zeros((state_size, action_size))

    def select_action(self, state_idx):
        """ε-贪心策略选择动作"""
        if np.random.random() < self.epsilon:
            # 探索:随机选择动作
            return np.random.randint(self.action_size)
        else:
            # 利用:选择Q值最大的动作
            return np.argmax(self.q_table[state_idx])

    def update_q_table(self, state_idx, action, reward, next_state_idx, done):
        """更新Q表"""
        current_q = self.q_table[state_idx][action]
        if done:
            # 终止状态,没有未来奖励
            target_q = reward
        else:
            # 使用贝尔曼方程计算目标Q值
            target_q = reward + self.gamma * np.max(self.q_table[next_state_idx])
        # Q-Learning更新规则
        td_error = target_q - current_q
        self.q_table[state_idx][action] += self.lr * td_error
        return abs(td_error)

    def decay_epsilon(self):
        """衰减探索率"""
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def get_greedy_action(self, state_idx):
        """获取贪心动作(用于测试)"""
        return np.argmax(self.q_table[state_idx])

class QLearningTrainer:
    def __init__(self, mdp, agent):
        self.mdp = mdp
        self.agent = agent

    def train_episode(self):
        """训练一个回合"""
        state = self.mdp.reset()
        state_idx = self.mdp.states.index(state)
        total_reward = 0
        steps = 0
        max_steps = 100
        td_errors = []
        while steps < max_steps:
            # 选择动作
            action = self.agent.select_action(state_idx)
            # 执行动作
            next_state, reward, done = self.mdp.step(state, action)
            next_state_idx = self.mdp.states.index(next_state)
            # 更新Q表
            td_error = self.agent.update_q_table(state_idx, action, reward, next_state_idx, done)
            td_errors.append(td_error)
            total_reward += reward
            state = next_state
            state_idx = next_state_idx
            steps += 1
            if done:
                break
        # 衰减探索率
        self.agent.decay_epsilon()
        return total_reward, steps, np.mean(td_errors)

    def train(self, num_episodes=1000):
        """训练多个回合"""
        rewards_history = []
        steps_history = []
        td_error_history = []
        epsilon_history = []
        for episode in range(num_episodes):
            total_reward, steps, avg_td_error = self.train_episode()
            rewards_history.append(total_reward)
            steps_history.append(steps)
            td_error_history.append(avg_td_error)
            epsilon_history.append(self.agent.epsilon)
            if (episode + 1) % 100 == 0:
                avg_reward = np.mean(rewards_history[-100:])
                avg_steps = np.mean(steps_history[-100:])
                print(f"回合 {episode + 1}: 平均奖励 = {avg_reward:.2f}, "
                      f"平均步数 = {avg_steps:.2f}, ε = {self.agent.epsilon:.3f}")
        return rewards_history, steps_history, td_error_history, epsilon_history

    def test_policy(self, num_episodes=10):
        """测试学习到的策略"""
        test_rewards = []
        test_paths = []
        for _ in range(num_episodes):
            state = self.mdp.reset()
            state_idx = self.mdp.states.index(state)
            path = [state]
            total_reward = 0
            steps = 0
            max_steps = 100
            while steps < max_steps:
                # 使用贪心策略
                action = self.agent.get_greedy_action(state_idx)
                next_state, reward, done = self.mdp.step(state, action)
                path.append(next_state)
                total_reward += reward
                state = next_state
                state_idx = self.mdp.states.index(state)
                steps += 1
                if done:
                    break
            test_rewards.append(total_reward)
            test_paths.append(path)
        return test_rewards, test_paths

    def visualize_q_table(self):
        """可视化Q表"""
        fig, axes = plt.subplots(1, 5, figsize=(20, 4))
        action_names = ['上', '下', '左', '右', '不动']
        for action in range(5):
            # 创建4x4网格来显示Q值
            q_grid = np.full((4, 4), np.nan)
            for i, state_id in enumerate(self.mdp.states):
                x, y = self.mdp.env.state_to_coord(state_id)
                q_grid[x, y] = self.agent.q_table[i][action]
            im = axes[action].imshow(q_grid, cmap='RdYlBu_r')
            axes[action].set_title(f'Q值 - 动作: {action_names[action]}')
            # 添加数值标注
            for x in range(4):
                for y in range(4):
                    if not np.isnan(q_grid[x, y]):
                        axes[action].text(y, x, f'{q_grid[x, y]:.1f}',
                                          ha="center", va="center", color="black", fontsize=8)
            # 标记障碍物
            for obs_x, obs_y in self.mdp.env.obstacles:
                axes[action].add_patch(Rectangle((obs_y - 0.5, obs_x - 0.5), 1, 1,
                                                 facecolor='black', alpha=0.8))
            axes[action].set_xticks(range(4))
            axes[action].set_yticks(range(4))
            plt.colorbar(im, ax=axes[action])
        plt.tight_layout()
        return fig

# 创建和训练Q-Learning智能体
print("创建Q-Learning智能体...")
q_agent = QLearningAgent(
    state_size=len(mdp.states),
    action_size=mdp.n_actions,
    lr=0.1,
    gamma=0.9,
    epsilon=1.0,
    epsilon_decay=0.995,
    epsilon_min=0.01
)
q_trainer = QLearningTrainer(mdp, q_agent)

print("开始Q-Learning训练...")
q_rewards, q_steps, q_td_errors, q_epsilons = q_trainer.train(num_episodes=1000)

# 测试学习到的策略
print("\n测试Q-Learning策略...")
q_test_rewards, q_test_paths = q_trainer.test_policy(num_episodes=5)

print("Q-Learning测试结果:")
for i, (reward, path) in enumerate(zip(q_test_rewards, q_test_paths)):
    coord_path = [env.state_to_coord(state) for state in path]
    print(f"测试 {i+1}: 奖励 = {reward:.1f}, 路径长度 = {len(path)}")
    print(f"  路径: {coord_path}")

# 可视化Q表
fig = q_trainer.visualize_q_table()
plt.show()

# 可视化训练过程
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

axes[0, 0].plot(q_rewards)
axes[0, 0].set_title('Q-Learning 训练奖励')
axes[0, 0].set_xlabel('回合')
axes[0, 0].set_ylabel('总奖励')

axes[0, 1].plot(q_steps)
axes[0, 1].set_title('Q-Learning 回合步数')
axes[0, 1].set_xlabel('回合')
axes[0, 1].set_ylabel('步数')

axes[1, 0].plot(q_td_errors)
axes[1, 0].set_title('Q-Learning TD误差')
axes[1, 0].set_xlabel('回合')
axes[1, 0].set_ylabel('平均TD误差')

axes[1, 1].plot(q_epsilons)
axes[1, 1].set_title('Q-Learning 探索率衰减')
axes[1, 1].set_xlabel('回合')
axes[1, 1].set_ylabel('ε值')

plt.tight_layout()
plt.show()
```
第七章:近端策略优化 (PPO)
7.1 PPO的背景和动机
PPO是目前最流行的深度强化学习算法之一。它解决了传统策略梯度方法的几个问题:
- 样本效率低:每次更新只能使用一次数据
- 训练不稳定:策略更新步长难以控制
- 容易陷入局部最优:贪心更新可能破坏策略
7.2 PPO的核心思想
PPO通过限制策略更新的幅度来确保训练稳定性。
7.2.1 重要性采样 (Importance Sampling)
当我们用旧策略 $\pi_{\theta_{old}}$ 收集的数据来更新新策略 $\pi_\theta$ 时,需要使用重要性采样:
$$\mathbb{E}_{s,a \sim \pi_{\theta_{old}}}\!\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\, A^{\pi_{\theta_{old}}}(s,a)\right]$$
其中比率 $r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ 表示新旧策略的概率比。
7.2.2 优势函数 (Advantage Function)
优势函数衡量某个动作相对于平均水平的好坏:
$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$$
直觉理解:
- $A(s,a) > 0$:这个动作比平均水平好
- $A(s,a) < 0$:这个动作比平均水平差
- $A(s,a) = 0$:这个动作就是平均水平
7.2.3 GAE (Generalized Advantage Estimation)
实际中,我们使用GAE来估计优势函数:
$$A_t = \delta_t + (\gamma \lambda)\,\delta_{t+1} + (\gamma \lambda)^2\,\delta_{t+2} + \cdots$$
其中 $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ 是TD误差。
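下面是按这两个式子计算GAE的一个简短示意(纯 numpy;rewards、values 用的是随手假设的数字。7.5 节 PPOAgent.compute_gae 中是同样的倒序递推,只是额外处理了终止标志):

```python
import numpy as np

def gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    """倒序递推:A_t = δ_t + γλ A_{t+1},其中 δ_t = r_t + γ V(s_{t+1}) - V(s_t)。"""
    values = np.append(values, next_value)   # 末尾补上最后一个状态的价值估计
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

print(gae(rewards=[-1, -1, 100], values=[5.0, 20.0, 60.0], next_value=0.0))
```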
7.3 PPO的损失函数
7.3.1 剪切目标函数
PPO的核心是剪切概率比率:
$$L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta) A_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t\big)\right]$$
其中:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$
- $\epsilon$ 是剪切参数(通常为0.1或0.2)
- $\text{clip}(x, a, b)$ 将 $x$ 限制在 $[a,b]$ 范围内(这个剪切目标对应的PyTorch写法见下面的小片段)
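把剪切目标直接翻译成PyTorch,大致就是下面几行(ratio、advantages 用随机张量代替,仅作示意;7.5 节 update_policy 中用的正是同样的写法):

```python
import torch

ratio = torch.exp(torch.randn(8) * 0.1)   # 假设的新旧策略概率比 r_t(θ)
advantages = torch.randn(8)               # 假设的优势估计 A_t
eps = 0.2

surr1 = ratio * advantages
surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
policy_loss = -torch.min(surr1, surr2).mean()  # 取负号:最大化目标等价于最小化损失
print(policy_loss)
```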
7.3.2 价值函数损失
$$L^{VF}(\theta) = \mathbb{E}_t\!\left[\big(V_\theta(s_t) - V_t^{target}\big)^2\right]$$
7.3.3 熵奖励
为了鼓励探索,加入熵奖励:
$$L^{ENT}(\theta) = \mathbb{E}_t\!\left[H\big(\pi_\theta(\cdot|s_t)\big)\right]$$
7.3.4 总损失函数
实际优化时最小化的总损失为(与7.5节代码中的 total_loss 写法一致):
$$L(\theta) = -L^{CLIP}(\theta) + c_1 L^{VF}(\theta) - c_2 L^{ENT}(\theta)$$
7.4 PPO算法流程
- 收集轨迹:用当前策略收集N步数据
- 计算优势:使用GAE计算优势函数
- 多次更新:用同一批数据更新K次策略
- 重复:回到步骤1
7.5 Python实现
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(ActorCritic, self).__init__()
        # 共享的特征提取层
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        # Actor网络(策略)
        self.actor = nn.Linear(hidden_size, action_size)
        # Critic网络(价值函数)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # 策略分布
        action_logits = self.actor(x)
        action_probs = F.softmax(action_logits, dim=-1)
        # 状态价值
        state_value = self.critic(x)
        return action_probs, state_value

class PPOAgent:
    def __init__(self, state_size, action_size, lr=3e-4, gamma=0.99,
                 lambda_gae=0.95, epsilon_clip=0.2, c1=0.5, c2=0.01):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        self.lambda_gae = lambda_gae
        self.epsilon_clip = epsilon_clip
        self.c1 = c1  # 价值函数损失系数
        self.c2 = c2  # 熵奖励系数
        # 创建Actor-Critic网络
        self.network = ActorCritic(state_size, action_size)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)
        # 存储轨迹数据
        self.reset_buffers()

    def reset_buffers(self):
        """重置缓冲区"""
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []
        self.values = []
        self.dones = []

    def state_to_tensor(self, state_id, mdp):
        """将状态转换为张量"""
        state_idx = mdp.states.index(state_id)
        state_vector = torch.zeros(len(mdp.states))
        state_vector[state_idx] = 1.0
        return state_vector.unsqueeze(0)

    def select_action(self, state_tensor):
        """选择动作并返回相关信息"""
        with torch.no_grad():
            action_probs, state_value = self.network(state_tensor)
        # 创建动作分布并采样
        action_dist = Categorical(action_probs)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        return action.item(), log_prob.item(), state_value.item()

    def store_transition(self, state, action, reward, log_prob, value, done):
        """存储转移数据"""
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.log_probs.append(log_prob)
        self.values.append(value)
        self.dones.append(done)

    def compute_gae(self, next_value=0):
        """计算GAE优势函数"""
        advantages = []
        gae = 0
        # 添加最后一个价值
        values = self.values + [next_value]
        # 从后往前计算GAE
        for t in reversed(range(len(self.rewards))):
            next_non_terminal = 1.0 - self.dones[t]
            next_value = values[t + 1]
            # TD误差
            delta = self.rewards[t] + self.gamma * next_value * next_non_terminal - values[t]
            # GAE计算
            gae = delta + self.gamma * self.lambda_gae * next_non_terminal * gae
            advantages.insert(0, gae)
        return advantages

    def update_policy(self, epochs=4):
        """更新策略网络"""
        # 计算优势函数和回报
        advantages = self.compute_gae()
        advantages = torch.tensor(advantages, dtype=torch.float32)
        # 计算回报(优势 + 价值)
        returns = advantages + torch.tensor(self.values, dtype=torch.float32)
        # 标准化优势
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        # 转换数据为张量
        old_log_probs = torch.tensor(self.log_probs, dtype=torch.float32)
        actions = torch.tensor(self.actions, dtype=torch.long)
        # 将状态列表拼接为 (N, n_states) 的张量(每个存储的状态本身是 (1, n_states))
        states_tensor = torch.cat(self.states, dim=0)
        # 多次更新策略
        total_policy_loss = 0
        total_value_loss = 0
        total_entropy = 0
        for epoch in range(epochs):
            # 前向传播
            action_probs, state_values = self.network(states_tensor)
            # 计算当前策略的log概率
            action_dist = Categorical(action_probs)
            current_log_probs = action_dist.log_prob(actions)
            entropy = action_dist.entropy().mean()
            # 计算概率比率
            ratio = torch.exp(current_log_probs - old_log_probs)
            # PPO剪切目标
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.epsilon_clip, 1 + self.epsilon_clip) * advantages
            policy_loss = -torch.min(surr1, surr2).mean()
            # 价值函数损失
            value_loss = F.mse_loss(state_values.squeeze(), returns)
            # 总损失
            total_loss = policy_loss + self.c1 * value_loss - self.c2 * entropy
            # 反向传播
            self.optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(self.network.parameters(), 0.5)
            self.optimizer.step()
            total_policy_loss += policy_loss.item()
            total_value_loss += value_loss.item()
            total_entropy += entropy.item()
        # 清空缓冲区
        self.reset_buffers()
        return (total_policy_loss / epochs,
                total_value_loss / epochs,
                total_entropy / epochs)

class PPOTrainer:
    def __init__(self, mdp, agent):
        self.mdp = mdp
        self.agent = agent

    def train_episode(self, max_steps=100):
        """训练一个回合"""
        state = self.mdp.reset()
        total_reward = 0
        steps = 0
        while steps < max_steps:
            # 转换状态
            state_tensor = self.agent.state_to_tensor(state, self.mdp)
            # 选择动作
            action, log_prob, value = self.agent.select_action(state_tensor)
            # 执行动作
            next_state, reward, done = self.mdp.step(state, action)
            # 存储转移
            self.agent.store_transition(state_tensor, action, reward, log_prob, value, done)
            total_reward += reward
            state = next_state
            steps += 1
            if done:
                break
        return total_reward, steps

    def train(self, num_episodes=1000, update_frequency=20):
        """训练多个回合"""
        rewards_history = []
        steps_history = []
        policy_loss_history = []
        value_loss_history = []
        entropy_history = []
        episode_rewards = []
        episode_steps = []
        for episode in range(num_episodes):
            total_reward, steps = self.train_episode()
            episode_rewards.append(total_reward)
            episode_steps.append(steps)
            # 每update_frequency回合更新一次策略
            if (episode + 1) % update_frequency == 0:
                policy_loss, value_loss, entropy = self.agent.update_policy()
                # 记录平均表现
                avg_reward = np.mean(episode_rewards)
                avg_steps = np.mean(episode_steps)
                rewards_history.append(avg_reward)
                steps_history.append(avg_steps)
                policy_loss_history.append(policy_loss)
                value_loss_history.append(value_loss)
                entropy_history.append(entropy)
                if len(rewards_history) % 10 == 0:
                    print(f"更新 {len(rewards_history)}: 平均奖励 = {avg_reward:.2f}, "
                          f"平均步数 = {avg_steps:.2f}")
                # 重置临时记录
                episode_rewards = []
                episode_steps = []
        return (rewards_history, steps_history, policy_loss_history,
                value_loss_history, entropy_history)

    def test_policy(self, num_episodes=10):
        """测试学习到的策略"""
        test_rewards = []
        test_paths = []
        for _ in range(num_episodes):
            state = self.mdp.reset()
            path = [state]
            total_reward = 0
            steps = 0
            max_steps = 100
            while steps < max_steps:
                state_tensor = self.agent.state_to_tensor(state, self.mdp)
                # 使用贪心策略
                with torch.no_grad():
                    action_probs, _ = self.agent.network(state_tensor)
                    action = torch.argmax(action_probs).item()
                next_state, reward, done = self.mdp.step(state, action)
                path.append(next_state)
                total_reward += reward
                state = next_state
                steps += 1
                if done:
                    break
            test_rewards.append(total_reward)
            test_paths.append(path)
        return test_rewards, test_paths

# 示例使用PPO
print("创建PPO智能体...")
ppo_agent = PPOAgent(
    state_size=len(mdp.states),
    action_size=mdp.n_actions,
    lr=3e-4,
    gamma=0.99,
    lambda_gae=0.95,
    epsilon_clip=0.2
)
ppo_trainer = PPOTrainer(mdp, ppo_agent)

print("开始PPO训练...")
ppo_results = ppo_trainer.train(num_episodes=1000, update_frequency=20)
ppo_rewards, ppo_steps, ppo_policy_loss, ppo_value_loss, ppo_entropy = ppo_results

# 测试PPO策略
print("\n测试PPO策略...")
ppo_test_rewards, ppo_test_paths = ppo_trainer.test_policy(num_episodes=5)

print("PPO测试结果:")
for i, (reward, path) in enumerate(zip(ppo_test_rewards, ppo_test_paths)):
    coord_path = [env.state_to_coord(state) for state in path]
    print(f"测试 {i+1}: 奖励 = {reward:.1f}, 路径长度 = {len(path)}")
    print(f"  路径: {coord_path}")
```
总结与回顾
强化学习的核心概念
- 状态表示:如何描述环境的当前情况
- 马尔科夫过程:未来只依赖于现在,不依赖于过去
- 贝尔曼方程:价值函数的递归关系
- 策略梯度:直接优化策略参数
- Q-Learning:学习动作价值函数
- PPO:稳定高效的策略优化
算法特点对比
| 算法 | 类型 | 优点 | 缺点 | 适用场景 |
| --- | --- | --- | --- | --- |
| 价值迭代 | 基于模型 | 理论保证,收敛性好 | 需要环境模型 | 小规模、已知环境 |
| 策略梯度 | 基于策略 | 直接优化策略 | 方差大,样本效率低 | 连续动作空间 |
| Q-Learning | 基于价值 | 无需环境模型 | 需要探索策略 | 离散动作空间 |
| PPO | Actor-Critic | 稳定、高效 | 实现复杂 | 现代深度强化学习 |
学习建议
- 从简单开始:先理解基本概念,再学习复杂算法
- 动手实践:编程实现有助于深入理解
- 可视化分析:观察学习过程和结果
- 参数调优:学会调整超参数
- 扩展应用:尝试在其他问题上应用
进一步学习方向
- 深度强化学习:DQN、A3C、SAC等
- 多智能体强化学习:竞争与合作
- 层次强化学习:分解复杂任务
- 元学习:学会如何学习
- 实际应用:游戏、机器人、推荐系统等
附录:完整代码整合
```python
# 完整的强化学习实验代码
def run_complete_experiment():
    """运行完整的强化学习实验"""
    # 创建环境
    env = GridWorld()
    mdp = GridWorldMDP(env)

    print("4×4网格世界强化学习实验")
    print("=" * 50)

    # 1. 价值迭代
    print("\n1. 运行价值迭代...")
    solver = BellmanSolver(mdp)
    optimal_V, optimal_policy, _ = solver.value_iteration()

    # 2. 策略梯度
    print("\n2. 训练策略梯度...")
    pg_agent = REINFORCE(len(mdp.states), mdp.n_actions)
    pg_trainer = PolicyGradientTrainer(mdp, pg_agent)
    pg_rewards, _, _ = pg_trainer.train(num_episodes=500)

    # 3. Q-Learning
    print("\n3. 训练Q-Learning...")
    q_agent = QLearningAgent(len(mdp.states), mdp.n_actions)
    q_trainer = QLearningTrainer(mdp, q_agent)
    q_rewards, _, _, _ = q_trainer.train(num_episodes=1000)

    # 4. PPO
    print("\n4. 训练PPO...")
    ppo_agent = PPOAgent(len(mdp.states), mdp.n_actions)
    ppo_trainer = PPOTrainer(mdp, ppo_agent)
    ppo_rewards, _, _, _, _ = ppo_trainer.train(num_episodes=1000)

    print("\n实验完成!所有算法都已训练完毕。")
    return {
        'env': env,
        'mdp': mdp,
        'value_iteration': (optimal_V, optimal_policy),
        'policy_gradient': (pg_agent, pg_rewards),
        'q_learning': (q_agent, q_rewards),
        'ppo': (ppo_agent, ppo_rewards)
    }

# 运行完整实验
if __name__ == "__main__":
    results = run_complete_experiment()
    print("强化学习教程学习完成!")
```
恭喜你完成了强化学习的完整学习旅程! 🎉
通过这个教程,你已经掌握了强化学习的核心概念和主要算法。现在你可以:
- 理解强化学习的基本原理
- 实现经典的强化学习算法
- 分析和比较不同算法的性能
- 将这些知识应用到新的问题中
继续探索和实践,强化学习的世界等待着你去发现更多精彩的内容!