
小白讲强化学习:从零开始的4x4网格世界探索

引言:为什么我们需要强化学习?

问题描述

想象一下,我们有一个聪明的机器人小智,它被放置在一个4×4的网格世界中。这个世界的规则如下:

  • 起始位置:左上角 (1,1)
  • 目标位置:右下角 (4,4)
  • 障碍物:位于 (2,2) 和 (3,3),机器人无法通过
  • 可用动作:上、下、左、右、不动(5种动作)
  • 目标:找到从起点到终点的最优路径
网格世界示意图:
+---+---+---+---+
| S |   |   |   |  S = 起点 (1,1)
+---+---+---+---+
|   | X |   |   |  X = 障碍物
+---+---+---+---+
|   |   | X |   |
+---+---+---+---+
|   |   |   | G |  G = 目标 (4,4)
+---+---+---+---+

为什么传统方法不够用?

  1. 动态环境:如果环境经常变化(障碍物位置改变),预编程的路径就失效了
  2. 不确定性:机器人的动作可能不总是成功(比如有10%的概率滑向其他方向)
  3. 奖励优化:我们希望机器人不仅能到达目标,还要找到最优(最短、最安全)的路径
  4. 自主学习:我们希望机器人能够自己探索和学习,而不是被告知每一步该怎么做

这就是强化学习发挥作用的地方!强化学习让智能体通过与环境的交互,在试错中学习最优策略。


第一章:状态表示

1.1 什么是状态?

状态(State)是对当前环境情况的完整描述。在我们的网格世界中,状态就是机器人当前所在的位置。

1.2 数学表示

我们可以用坐标 $(x, y)$ 来表示状态,其中:

  • $x$ 表示行号(1-4)
  • $y$ 表示列号(1-4)

状态空间:$S = \{(x, y) \mid x, y \in \{1,2,3,4\},\ (x, y) \neq (2,2),\ (x, y) \neq (3,3)\}$

总共有 $4 \times 4 - 2 = 14$ 个有效状态(减去2个障碍物位置)。

1.3 状态编码

为了便于计算机处理,我们将二维坐标转换为一维索引:

编码公式
$$\text{state\_id} = (x-1) \times 4 + (y-1)$$

解码公式
$$x = \lfloor \text{state\_id} / 4 \rfloor + 1$$
$$y = (\text{state\_id} \bmod 4) + 1$$
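
作为补充,下面给出一个按上述 1-based 公式的最小示意片段(仅用于演示编码/解码的对应关系;注意 1.4 节的 GridWorld 实现改用 0-based 索引,对应公式变为 state_id = x × 4 + y):

# 示意片段:1-based 坐标与状态ID的互转(与后文 0-based 实现等价,仅索引起点不同)
def encode_state(x, y, grid_size=4):
    """(x, y) -> state_id,x、y 取值 1..grid_size"""
    return (x - 1) * grid_size + (y - 1)

def decode_state(state_id, grid_size=4):
    """state_id -> (x, y)"""
    x = state_id // grid_size + 1
    y = state_id % grid_size + 1
    return (x, y)

# 简单自检:起点(1,1) -> 0,终点(4,4) -> 15
assert encode_state(1, 1) == 0
assert encode_state(4, 4) == 15
assert decode_state(10) == (3, 3)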

1.4 Python实现

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns


class GridWorld:
    def __init__(self):
        self.grid_size = 4
        self.start_state = (0, 0)  # 转换为0-based索引
        self.goal_state = (3, 3)
        self.obstacles = [(1, 1), (2, 2)]  # 转换为0-based索引

        # 动作定义:上、下、左、右、不动
        self.actions = {
            0: (-1, 0),  # 上
            1: (1, 0),   # 下
            2: (0, -1),  # 左
            3: (0, 1),   # 右
            4: (0, 0)    # 不动
        }
        self.action_names = ['上', '下', '左', '右', '不动']

    def state_to_coord(self, state_id):
        """将状态ID转换为坐标"""
        x = state_id // self.grid_size
        y = state_id % self.grid_size
        return (x, y)

    def coord_to_state(self, x, y):
        """将坐标转换为状态ID"""
        return x * self.grid_size + y

    def is_valid_state(self, x, y):
        """检查状态是否有效"""
        if x < 0 or x >= self.grid_size or y < 0 or y >= self.grid_size:
            return False
        if (x, y) in self.obstacles:
            return False
        return True

    def get_valid_states(self):
        """获取所有有效状态"""
        valid_states = []
        for x in range(self.grid_size):
            for y in range(self.grid_size):
                if self.is_valid_state(x, y):
                    valid_states.append(self.coord_to_state(x, y))
        return valid_states

    def visualize_grid(self):
        """可视化网格世界"""
        fig, ax = plt.subplots(figsize=(8, 8))

        # 画网格
        for i in range(self.grid_size + 1):
            ax.axhline(i, color='black', linewidth=1)
            ax.axvline(i, color='black', linewidth=1)

        # 标记起点
        start_rect = Rectangle(self.start_state, 1, 1, facecolor='green', alpha=0.7)
        ax.add_patch(start_rect)
        ax.text(self.start_state[1] + 0.5, self.start_state[0] + 0.5, 'S',
                ha='center', va='center', fontsize=16, fontweight='bold')

        # 标记终点
        goal_rect = Rectangle(self.goal_state, 1, 1, facecolor='red', alpha=0.7)
        ax.add_patch(goal_rect)
        ax.text(self.goal_state[1] + 0.5, self.goal_state[0] + 0.5, 'G',
                ha='center', va='center', fontsize=16, fontweight='bold')

        # 标记障碍物
        for obs in self.obstacles:
            obs_rect = Rectangle(obs, 1, 1, facecolor='black', alpha=0.8)
            ax.add_patch(obs_rect)
            ax.text(obs[1] + 0.5, obs[0] + 0.5, 'X', ha='center', va='center',
                    color='white', fontsize=16, fontweight='bold')

        # 标记状态ID
        for x in range(self.grid_size):
            for y in range(self.grid_size):
                if self.is_valid_state(x, y):
                    state_id = self.coord_to_state(x, y)
                    ax.text(y + 0.1, x + 0.9, f'({state_id})', fontsize=10, color='blue')

        ax.set_xlim(0, self.grid_size)
        ax.set_ylim(0, self.grid_size)
        ax.set_aspect('equal')
        ax.invert_yaxis()  # 让(0,0)在左上角
        ax.set_title('4×4网格世界状态表示', fontsize=16)
        plt.tight_layout()
        return fig


# 创建环境并可视化
env = GridWorld()
print("有效状态列表:", env.get_valid_states())
print("状态总数:", len(env.get_valid_states()))

# 测试状态转换
test_coord = (2, 1)
state_id = env.coord_to_state(test_coord[0], test_coord[1])
recovered_coord = env.state_to_coord(state_id)
print(f"原坐标: {test_coord}, 状态ID: {state_id}, 恢复坐标: {recovered_coord}")

第二章:马尔科夫过程

2.1 什么是马尔科夫性质?

马尔科夫性质:未来的状态只依赖于当前状态,而不依赖于过去的历史。

数学表达:
$$P(S_{t+1} = s' \mid S_t = s, S_{t-1} = s_{t-1}, \ldots, S_0 = s_0) = P(S_{t+1} = s' \mid S_t = s)$$

简单来说:“未来只看现在,不看过去”

2.2 马尔科夫过程 (Markov Process)

马尔科夫过程是一个二元组:$\mathcal{M} = (S, P)$

  • $S$:状态空间
  • $P$:状态转移概率矩阵

状态转移概率:
$$P_{ss'} = P(S_{t+1} = s' \mid S_t = s)$$

满足:$\sum_{s' \in S} P_{ss'} = 1$(概率和为1)

2.3 我们网格世界中的马尔科夫过程

假设机器人的动作有一定随机性:

  • 80%概率按预期方向移动
  • 20%概率随机移动到相邻的有效位置(一种可能的构造方式见下面的示意片段)
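
正文 2.4 节的示例为了简化,让机器人在每个状态等概率地随机选动作;作为补充,下面给出一个体现上述 80%/20% 随机性的转移分布小示意(stochastic_transition 是为说明而引入的假设函数名,复用前文 GridWorld 的接口;这里把剩余 20% 的概率在全部 5 个动作的落点之间均匀分摊,走不通的方向折算为留在原地):

# 示意片段(非正文实现):给定预期动作,构造"80%按预期方向、20%随机"的转移分布
def stochastic_transition(env, state_id, intended_action, p_intended=0.8):
    x, y = env.state_to_coord(state_id)
    probs = {}  # next_state_id -> 概率

    def move(action_id):
        dx, dy = env.actions[action_id]
        nx, ny = x + dx, y + dy
        if env.is_valid_state(nx, ny):
            return env.coord_to_state(nx, ny)
        return state_id  # 无效移动则留在原地

    # 80% 概率按预期方向移动
    s_intended = move(intended_action)
    probs[s_intended] = probs.get(s_intended, 0.0) + p_intended

    # 其余 20% 概率在所有动作的落点之间均匀分摊
    others = [move(a) for a in env.actions]
    for s_next in others:
        probs[s_next] = probs.get(s_next, 0.0) + (1.0 - p_intended) / len(others)

    return probs  # 各概率之和为 1

# 用法示例:从起点(状态0)尝试向右移动
# print(stochastic_transition(env, 0, intended_action=3))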

2.4 Python实现

class MarkovProcess:
    def __init__(self, grid_world):
        self.env = grid_world
        self.states = grid_world.get_valid_states()
        self.n_states = len(self.states)

        # 构建状态转移概率矩阵
        self.transition_matrix = self._build_transition_matrix()

    def _build_transition_matrix(self):
        """构建状态转移概率矩阵"""
        # 这里我们假设机器人随机选择动作
        P = np.zeros((self.n_states, self.n_states))

        for i, state_id in enumerate(self.states):
            x, y = self.env.state_to_coord(state_id)

            # 获取所有可能的下一个状态
            next_states = []
            for action_id, (dx, dy) in self.env.actions.items():
                next_x, next_y = x + dx, y + dy
                if self.env.is_valid_state(next_x, next_y):
                    next_state_id = self.env.coord_to_state(next_x, next_y)
                    next_states.append(next_state_id)
                else:
                    # 如果移动无效,留在原地
                    next_states.append(state_id)

            # 均匀分布概率(每个动作概率相等)
            for next_state_id in next_states:
                next_state_idx = self.states.index(next_state_id)
                P[i][next_state_idx] += 1.0 / len(self.env.actions)

        return P

    def simulate_random_walk(self, start_state_id, steps=100):
        """模拟随机游走"""
        current_state_idx = self.states.index(start_state_id)
        trajectory = [start_state_id]

        for _ in range(steps):
            # 根据转移概率选择下一个状态
            next_state_idx = np.random.choice(self.n_states,
                                              p=self.transition_matrix[current_state_idx])
            next_state_id = self.states[next_state_idx]
            trajectory.append(next_state_id)
            current_state_idx = next_state_idx

            # 如果到达目标状态,结束模拟
            if next_state_id == self.env.coord_to_state(*self.env.goal_state):
                break

        return trajectory

    def visualize_transition_matrix(self):
        """可视化状态转移矩阵"""
        fig, ax = plt.subplots(figsize=(10, 8))
        sns.heatmap(self.transition_matrix,
                    xticklabels=self.states,
                    yticklabels=self.states,
                    annot=True, fmt='.3f', cmap='Blues', ax=ax)
        ax.set_title('状态转移概率矩阵')
        ax.set_xlabel('下一个状态')
        ax.set_ylabel('当前状态')
        plt.tight_layout()
        return fig


# 创建马尔科夫过程
mp = MarkovProcess(env)

# 模拟随机游走
start_state = env.coord_to_state(*env.start_state)
trajectory = mp.simulate_random_walk(start_state, steps=50)

print("随机游走轨迹(状态ID):", trajectory[:10], "...")
print("轨迹长度:", len(trajectory))

# 转换为坐标显示
coord_trajectory = [env.state_to_coord(state_id) for state_id in trajectory[:10]]
print("对应坐标轨迹:", coord_trajectory)

第三章:马尔科夫决策过程(MDP)

3.1 从马尔科夫过程到MDP

马尔科夫决策过程在马尔科夫过程的基础上增加了动作和奖励:

$$\mathcal{M} = (S, A, P, R, \gamma)$$

  • $S$:状态空间
  • $A$:动作空间
  • $P$:状态转移概率 $P(s' \mid s, a)$
  • $R$:奖励函数 $R(s, a, s')$
  • $\gamma$:折扣因子,$\gamma \in [0, 1]$

3.2 关键概念详解

3.2.1 状态转移概率

$$P_{ss'}^a = P(S_{t+1} = s' \mid S_t = s, A_t = a)$$

3.2.2 奖励函数

我们设计如下奖励结构:

  • 到达目标:+100
  • 撞墙(动作无效):-10
  • 正常移动:-1(鼓励快速到达目标)
  • 不动:-5(不鼓励原地不动;这组奖励的直接翻译见下面的示意片段)
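
这组奖励设计可以直接翻译成几行代码(示意片段,reward_fn 为说明而引入的假设函数名;3.3 节 GridWorldMDP._build_mdp 中使用的是同样的数值):

# 示意片段:按上面的奖励表返回即时奖励
def reward_fn(reached_goal, valid_move, stayed):
    if reached_goal:
        return 100    # 到达目标
    if not valid_move:
        return -10    # 撞墙(动作无效)
    if stayed:
        return -5     # 原地不动
    return -1         # 正常移动
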
3.2.3 策略 (Policy)

策略 $\pi$ 定义了在每个状态下选择动作的规则:
$$\pi(a \mid s) = P(A_t = a \mid S_t = s)$$

3.2.4 价值函数

状态价值函数:在策略 $\pi$ 下,从状态 $s$ 开始的期望回报:
$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 = s\right]$$

动作价值函数:在状态 $s$ 执行动作 $a$,然后遵循策略 $\pi$ 的期望回报:
$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 = s, A_0 = a\right]$$

3.3 Python实现

class GridWorldMDP:
    def __init__(self, grid_world, gamma=0.9):
        self.env = grid_world
        self.states = grid_world.get_valid_states()
        self.n_states = len(self.states)
        self.n_actions = len(grid_world.actions)
        self.gamma = gamma

        # 构建转移概率和奖励函数
        self.P, self.R = self._build_mdp()

    def _build_mdp(self):
        """构建MDP的转移概率和奖励函数"""
        # P[s][a][s'] = 从状态s执行动作a转移到状态s'的概率
        P = np.zeros((self.n_states, self.n_actions, self.n_states))
        # R[s][a] = 在状态s执行动作a的即时奖励
        R = np.zeros((self.n_states, self.n_actions))

        for i, state_id in enumerate(self.states):
            x, y = self.env.state_to_coord(state_id)

            for action_id, (dx, dy) in self.env.actions.items():
                next_x, next_y = x + dx, y + dy

                # 检查移动是否有效
                if self.env.is_valid_state(next_x, next_y):
                    next_state_id = self.env.coord_to_state(next_x, next_y)
                    next_state_idx = self.states.index(next_state_id)

                    # 设置转移概率(确定性环境)
                    P[i][action_id][next_state_idx] = 1.0

                    # 设置奖励
                    if (next_x, next_y) == self.env.goal_state:
                        R[i][action_id] = 100  # 到达目标
                    elif action_id == 4:  # 不动
                        R[i][action_id] = -5
                    else:
                        R[i][action_id] = -1  # 正常移动
                else:
                    # 无效移动,留在原地
                    P[i][action_id][i] = 1.0
                    R[i][action_id] = -10  # 撞墙惩罚

        return P, R

    def step(self, state_id, action_id):
        """执行一步动作"""
        state_idx = self.states.index(state_id)

        # 根据转移概率采样下一个状态
        next_state_idx = np.random.choice(self.n_states, p=self.P[state_idx][action_id])
        next_state_id = self.states[next_state_idx]

        # 获取奖励
        reward = self.R[state_idx][action_id]

        # 检查是否终止
        done = (next_state_id == self.env.coord_to_state(*self.env.goal_state))

        return next_state_id, reward, done

    def reset(self):
        """重置环境到初始状态"""
        return self.env.coord_to_state(*self.env.start_state)


# 创建MDP
mdp = GridWorldMDP(env)

# 测试环境交互
state = mdp.reset()
print(f"初始状态: {state}, 坐标: {env.state_to_coord(state)}")

for step in range(5):
    action = np.random.randint(5)  # 随机选择动作
    next_state, reward, done = mdp.step(state, action)

    print(f"步骤 {step + 1}:")
    print(f"  状态: {state} -> {next_state}")
    print(f"  动作: {action} ({env.action_names[action]})")
    print(f"  奖励: {reward}")
    print(f"  终止: {done}")

    state = next_state
    if done:
        print("到达目标!")
        break

第四章:贝尔曼方程

4.1 贝尔曼方程的直觉理解

贝尔曼方程描述了价值函数的递归关系

“一个状态的价值 = 立即奖励 + 未来状态价值的期望”

4.2 贝尔曼期望方程

4.2.1 状态价值函数的贝尔曼方程

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P_{ss'}^a \left[R_{ss'}^a + \gamma V^\pi(s')\right]$$

逐步分解

  1. $\sum_{a} \pi(a \mid s)$:对所有可能的动作求期望
  2. $\sum_{s'} P_{ss'}^a$:对所有可能的下一状态求期望
  3. $R_{ss'}^a$:立即奖励
  4. $\gamma V^\pi(s')$:折扣后的未来价值

4.2.2 动作价值函数的贝尔曼方程

$$Q^\pi(s,a) = \sum_{s'} P_{ss'}^a \left[R_{ss'}^a + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s',a')\right]$$

4.2.3 两个价值函数的关系

$$V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s,a)$$

$$Q^\pi(s,a) = \sum_{s'} P_{ss'}^a \left[R_{ss'}^a + \gamma V^\pi(s')\right]$$

4.3 贝尔曼最优方程

最优策略:$\pi^* = \arg\max_\pi V^\pi(s),\ \forall s$

最优价值函数
$$V^*(s) = \max_a Q^*(s,a)$$

$$Q^*(s,a) = \sum_{s'} P_{ss'}^a \left[R_{ss'}^a + \gamma V^*(s')\right]$$

最优贝尔曼方程
$$V^*(s) = \max_a \sum_{s'} P_{ss'}^a \left[R_{ss'}^a + \gamma V^*(s')\right]$$
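
下面给出一个直接按贝尔曼最优方程做同步备份的紧凑示意(向量化写法,思路与 4.4 节的 BellmanSolver 一致;假设使用第三章构造的 mdp.P、mdp.R、mdp.gamma):

# 紧凑示意:价值迭代 = 反复套用贝尔曼最优方程,直到价值函数收敛
import numpy as np

def value_iteration_compact(P, R, gamma, tol=1e-6, max_iter=1000):
    """P: [S, A, S'] 转移概率,R: [S, A] 即时奖励"""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)        # P @ V 的形状为 [S, A]
        V_new = Q.max(axis=1)          # V*(s) = max_a Q*(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)          # 每个状态的贪心动作
    return V, policy

# 用法示例(假设已按第三章构造好 mdp):
# V_star, greedy_actions = value_iteration_compact(mdp.P, mdp.R, mdp.gamma)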

4.4 Python实现

class BellmanSolver:def __init__(self, mdp, tolerance=1e-6, max_iterations=1000):self.mdp = mdpself.tolerance = toleranceself.max_iterations = max_iterationsdef policy_evaluation(self, policy):"""策略评估:计算给定策略的价值函数"""V = np.zeros(self.mdp.n_states)for iteration in range(self.max_iterations):V_new = np.zeros(self.mdp.n_states)for s in range(self.mdp.n_states):# 贝尔曼期望方程value = 0for a in range(self.mdp.n_actions):action_value = 0for s_next in range(self.mdp.n_states):action_value += self.mdp.P[s][a][s_next] * (self.mdp.R[s][a] + self.mdp.gamma * V[s_next])value += policy[s][a] * action_valueV_new[s] = value# 检查收敛if np.max(np.abs(V_new - V)) < self.tolerance:print(f"策略评估在第 {iteration + 1} 次迭代后收敛")return V_newV = V_newprint(f"策略评估在 {self.max_iterations} 次迭代后未收敛")return Vdef value_iteration(self):"""价值迭代:求解最优价值函数"""V = np.zeros(self.mdp.n_states)policy = np.zeros((self.mdp.n_states, self.mdp.n_actions))iteration_values = []  # 记录每次迭代的价值函数for iteration in range(self.max_iterations):V_new = np.zeros(self.mdp.n_states)for s in range(self.mdp.n_states):# 贝尔曼最优方程action_values = []for a in range(self.mdp.n_actions):action_value = 0for s_next in range(self.mdp.n_states):action_value += self.mdp.P[s][a][s_next] * (self.mdp.R[s][a] + self.mdp.gamma * V[s_next])action_values.append(action_value)V_new[s] = max(action_values)iteration_values.append(V_new.copy())# 检查收敛if np.max(np.abs(V_new - V)) < self.tolerance:print(f"价值迭代在第 {iteration + 1} 次迭代后收敛")breakV = V_new# 提取最优策略for s in range(self.mdp.n_states):action_values = []for a in range(self.mdp.n_actions):action_value = 0for s_next in range(self.mdp.n_states):action_value += self.mdp.P[s][a][s_next] * (self.mdp.R[s][a] + self.mdp.gamma * V[s_next])action_values.append(action_value)# 最优动作best_action = np.argmax(action_values)policy[s][best_action] = 1.0return V, policy, iteration_valuesdef visualize_value_function(self, V, title="价值函数"):"""可视化价值函数"""# 创建一个4x4的矩阵来显示价值value_grid = np.full((4, 4), np.nan)for i, state_id in enumerate(self.mdp.states):x, y = self.mdp.env.state_to_coord(state_id)value_grid[x, y] = V[i]fig, ax = plt.subplots(figsize=(8, 8))# 创建热力图im = ax.imshow(value_grid, cmap='RdYlBu_r', alpha=0.8)# 添加文本标注for x in range(4):for y in range(4):if not np.isnan(value_grid[x, y]):text = ax.text(y, x, f'{value_grid[x, y]:.1f}',ha="center", va="center", color="black", fontweight="bold")# 标记特殊位置start_x, start_y = self.mdp.env.start_stategoal_x, goal_y = self.mdp.env.goal_stateax.add_patch(Rectangle((start_y-0.4, start_x-0.4), 0.8, 0.8, fill=False, edgecolor='green', linewidth=3))ax.add_patch(Rectangle((goal_y-0.4, goal_x-0.4), 0.8, 0.8, fill=False, edgecolor='red', linewidth=3))# 标记障碍物for obs_x, obs_y in self.mdp.env.obstacles:ax.add_patch(Rectangle((obs_y-0.5, obs_x-0.5), 1, 1, facecolor='black', alpha=0.8))ax.set_title(title, fontsize=16)ax.set_xticks(range(4))ax.set_yticks(range(4))plt.colorbar(im)plt.tight_layout()return fig# 创建求解器
solver = BellmanSolver(mdp)# 求解最优价值函数和策略
print("开始价值迭代求解...")
optimal_V, optimal_policy, iteration_history = solver.value_iteration()print("\n最优价值函数:")
for i, state_id in enumerate(mdp.states):coord = env.state_to_coord(state_id)print(f"状态 {state_id} {coord}: V* = {optimal_V[i]:.2f}")print("\n最优策略:")
for i, state_id in enumerate(mdp.states):coord = env.state_to_coord(state_id)best_action = np.argmax(optimal_policy[i])print(f"状态 {state_id} {coord}: 最优动作 = {best_action} ({env.action_names[best_action]})")# 可视化最优价值函数
fig = solver.visualize_value_function(optimal_V, "最优价值函数")
plt.show()

第五章:策略梯度方法

5.1 策略梯度的基本思想

前面我们学习了基于价值的方法(如价值迭代),现在我们学习基于策略的方法。

策略梯度直接优化参数化的策略:
$$\pi_\theta(a \mid s) = P(A_t = a \mid S_t = s, \theta)$$

其中 $\theta$ 是策略的参数。

5.2 目标函数

我们的目标是最大化期望回报:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$$

对于有限步骤的情况:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1}\right]$$

5.3 策略梯度定理

策略梯度定理告诉我们如何计算目标函数的梯度:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot Q^{\pi_\theta}(s,a)\right]$$

直觉理解

  • 如果 $Q^{\pi_\theta}(s,a) > 0$(好的动作),增加 $\pi_\theta(a \mid s)$ 的概率
  • 如果 $Q^{\pi_\theta}(s,a) < 0$(坏的动作),减少 $\pi_\theta(a \mid s)$ 的概率

5.4 REINFORCE算法

最简单的策略梯度算法:

  1. 用当前策略 $\pi_\theta$ 收集一个完整的回合
  2. 计算每个时间步的回报 $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}$(计算方式见下面的示意片段)
  3. 更新参数:$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$
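
其中第 2 步的折扣回报可以从回合末尾向前递推得到,下面是一个最小示意(与 5.5 节 REINFORCE 类中 calculate_returns 的思路相同):

# 示意片段:G_t = r_{t+1} + γ * G_{t+1},从回合末尾向前累加
def discounted_returns(rewards, gamma=0.99):
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns

# 用法示例:三步小回合,奖励为 -1, -1, +100(gamma=0.9)
# print(discounted_returns([-1, -1, 100], gamma=0.9))  # 约为 [79.1, 89.0, 100.0]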

5.5 Python实现

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import deque
import randomclass PolicyNetwork(nn.Module):def __init__(self, state_size, action_size, hidden_size=64):super(PolicyNetwork, self).__init__()self.fc1 = nn.Linear(state_size, hidden_size)self.fc2 = nn.Linear(hidden_size, hidden_size)self.fc3 = nn.Linear(hidden_size, action_size)def forward(self, x):x = F.relu(self.fc1(x))x = F.relu(self.fc2(x))x = F.softmax(self.fc3(x), dim=-1)return xclass REINFORCE:def __init__(self, state_size, action_size, lr=0.01, gamma=0.99):self.state_size = state_sizeself.action_size = action_sizeself.gamma = gamma# 创建策略网络self.policy_net = PolicyNetwork(state_size, action_size)self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)# 存储轨迹self.reset_episode()def reset_episode(self):"""重置回合数据"""self.states = []self.actions = []self.rewards = []self.log_probs = []def state_to_tensor(self, state_id, mdp):"""将状态ID转换为one-hot编码的张量"""state_idx = mdp.states.index(state_id)state_vector = torch.zeros(len(mdp.states))state_vector[state_idx] = 1.0return state_vector.unsqueeze(0)def select_action(self, state_tensor):"""根据当前策略选择动作"""action_probs = self.policy_net(state_tensor)action_dist = torch.distributions.Categorical(action_probs)action = action_dist.sample()log_prob = action_dist.log_prob(action)return action.item(), log_probdef store_transition(self, state, action, reward, log_prob):"""存储转移数据"""self.states.append(state)self.actions.append(action)self.rewards.append(reward)self.log_probs.append(log_prob)def calculate_returns(self):"""计算折扣回报"""returns = []G = 0# 从后往前计算for reward in reversed(self.rewards):G = reward + self.gamma * Greturns.insert(0, G)return returnsdef update_policy(self):"""更新策略网络"""returns = self.calculate_returns()# 标准化回报(减少方差)returns = torch.tensor(returns)returns = (returns - returns.mean()) / (returns.std() + 1e-8)# 计算策略损失policy_loss = []for log_prob, G in zip(self.log_probs, returns):policy_loss.append(-log_prob * G)policy_loss = torch.stack(policy_loss).sum()# 梯度更新self.optimizer.zero_grad()policy_loss.backward()self.optimizer.step()return policy_loss.item()class PolicyGradientTrainer:def __init__(self, mdp, agent):self.mdp = mdpself.agent = agentdef train_episode(self):"""训练一个回合"""state = self.mdp.reset()self.agent.reset_episode()total_reward = 0steps = 0max_steps = 100while steps < max_steps:# 状态编码state_tensor = self.agent.state_to_tensor(state, self.mdp)# 选择动作action, log_prob = self.agent.select_action(state_tensor)# 执行动作next_state, reward, done = self.mdp.step(state, action)# 存储转移self.agent.store_transition(state, action, reward, log_prob)total_reward += rewardstate = next_statesteps += 1if done:break# 更新策略loss = self.agent.update_policy()return total_reward, steps, lossdef train(self, num_episodes=1000):"""训练多个回合"""rewards_history = []steps_history = []loss_history = []for episode in range(num_episodes):total_reward, steps, loss = self.train_episode()rewards_history.append(total_reward)steps_history.append(steps)loss_history.append(loss)if (episode + 1) % 100 == 0:avg_reward = np.mean(rewards_history[-100:])avg_steps = np.mean(steps_history[-100:])print(f"回合 {episode + 1}: 平均奖励 = {avg_reward:.2f}, 平均步数 = {avg_steps:.2f}")return rewards_history, steps_history, loss_historydef test_policy(self, num_episodes=10):"""测试训练好的策略"""test_rewards = []test_paths = []for _ in range(num_episodes):state = self.mdp.reset()path = [state]total_reward = 0steps = 0max_steps = 100while steps < max_steps:state_tensor = self.agent.state_to_tensor(state, self.mdp)# 使用贪心策略(选择概率最大的动作)with torch.no_grad():action_probs = self.agent.policy_net(state_tensor)action = 
torch.argmax(action_probs).item()next_state, reward, done = self.mdp.step(state, action)path.append(next_state)total_reward += rewardstate = next_statesteps += 1if done:breaktest_rewards.append(total_reward)test_paths.append(path)return test_rewards, test_paths# 创建和训练策略梯度智能体
print("创建策略梯度智能体...")
pg_agent = REINFORCE(state_size=len(mdp.states), action_size=mdp.n_actions, lr=0.01)
trainer = PolicyGradientTrainer(mdp, pg_agent)print("开始训练...")
rewards_history, steps_history, loss_history = trainer.train(num_episodes=500)# 测试训练好的策略
print("\n测试训练好的策略...")
test_rewards, test_paths = trainer.test_policy(num_episodes=5)print("测试结果:")
for i, (reward, path) in enumerate(zip(test_rewards, test_paths)):coord_path = [env.state_to_coord(state) for state in path]print(f"测试 {i+1}: 奖励 = {reward:.1f}, 路径长度 = {len(path)}")print(f"  路径: {coord_path[:10]}{'...' if len(path) > 10 else ''}")# 可视化训练过程
fig, axes = plt.subplots(1, 3, figsize=(15, 5))# 奖励曲线
axes[0].plot(rewards_history)
axes[0].set_title('训练奖励')
axes[0].set_xlabel('回合')
axes[0].set_ylabel('总奖励')# 步数曲线
axes[1].plot(steps_history)
axes[1].set_title('回合步数')
axes[1].set_xlabel('回合')
axes[1].set_ylabel('步数')# 损失曲线
axes[2].plot(loss_history)
axes[2].set_title('策略损失')
axes[2].set_xlabel('回合')
axes[2].set_ylabel('损失')plt.tight_layout()
plt.show()

第六章:Q-Learning

6.1 Q-Learning的基本思想

Q-Learning是一种无模型的强化学习算法,它直接学习动作价值函数 $Q(s,a)$,而不需要知道环境的转移概率。

6.2 Q-Learning的数学原理

Q-Learning基于以下更新规则:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

参数解释

  • $\alpha$:学习率($0 < \alpha \le 1$)
  • $\gamma$:折扣因子($0 \le \gamma \le 1$)
  • $r$:即时奖励
  • $s'$:下一个状态
  • $\max_{a'} Q(s',a')$:下一个状态的最大Q值

更新规则的直觉

  • $r + \gamma \max_{a'} Q(s',a')$:目标值(我们希望 $Q(s,a)$ 接近的值)
  • $Q(s,a)$:当前值
  • $r + \gamma \max_{a'} Q(s',a') - Q(s,a)$:TD误差
  • 用TD误差来调整当前的Q值(见下面的示意片段)
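
下面给出与上式逐项对应的单步表格型更新示意(q_update 为说明而引入的函数名,完整实现见 6.5 节的 QLearningAgent.update_q_table):

# 示意片段:单步Q-Learning更新
import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """Q 为形状 [n_states, n_actions] 的Q表;s、a、s_next 为索引"""
    target = r if done else r + gamma * np.max(Q[s_next])  # 目标值
    td_error = target - Q[s, a]                            # TD误差
    Q[s, a] += alpha * td_error                            # 朝目标值移动一小步
    return td_error

# 用法示例:2个状态、2个动作的小Q表
# Q = np.zeros((2, 2))
# q_update(Q, s=0, a=1, r=-1.0, s_next=1, done=False)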

6.3 ε-贪心策略

为了平衡探索与利用,我们使用ε-贪心策略:

$$\pi(a \mid s) = \begin{cases} 1-\epsilon + \dfrac{\epsilon}{|A|} & \text{if } a = \arg\max_{a'} Q(s,a') \\ \dfrac{\epsilon}{|A|} & \text{otherwise} \end{cases}$$

简化版本:

  • 以概率 $1-\epsilon$ 选择最优动作(利用)
  • 以概率 $\epsilon$ 随机选择动作(探索),如下面的示意片段所示
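
对应的动作选择逻辑只有几行(示意片段;完整版本见 6.5 节的 QLearningAgent.select_action):

# 示意片段:ε-贪心动作选择
import numpy as np

def epsilon_greedy(q_row, epsilon):
    """q_row: 某个状态下所有动作的Q值(一维数组)"""
    if np.random.random() < epsilon:
        return np.random.randint(len(q_row))  # 探索:随机动作
    return int(np.argmax(q_row))              # 利用:当前最优动作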

6.4 Q-Learning算法流程

  1. 初始化:Q表为0
  2. 对于每个回合
    • 初始化状态 $s$
    • 对于回合中的每一步:
      • 用ε-贪心策略选择动作 $a$
      • 执行动作,观察奖励 $r$ 和下一状态 $s'$
      • 更新Q值:$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$
      • $s \leftarrow s'$

6.5 Python实现

class QLearningAgent:def __init__(self, state_size, action_size, lr=0.1, gamma=0.99, epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):self.state_size = state_sizeself.action_size = action_sizeself.lr = lrself.gamma = gammaself.epsilon = epsilonself.epsilon_decay = epsilon_decayself.epsilon_min = epsilon_min# 初始化Q表self.q_table = np.zeros((state_size, action_size))def select_action(self, state_idx):"""ε-贪心策略选择动作"""if np.random.random() < self.epsilon:# 探索:随机选择动作return np.random.randint(self.action_size)else:# 利用:选择Q值最大的动作return np.argmax(self.q_table[state_idx])def update_q_table(self, state_idx, action, reward, next_state_idx, done):"""更新Q表"""current_q = self.q_table[state_idx][action]if done:# 终止状态,没有未来奖励target_q = rewardelse:# 使用贝尔曼方程计算目标Q值target_q = reward + self.gamma * np.max(self.q_table[next_state_idx])# Q-Learning更新规则td_error = target_q - current_qself.q_table[state_idx][action] += self.lr * td_errorreturn abs(td_error)def decay_epsilon(self):"""衰减探索率"""if self.epsilon > self.epsilon_min:self.epsilon *= self.epsilon_decaydef get_greedy_action(self, state_idx):"""获取贪心动作(用于测试)"""return np.argmax(self.q_table[state_idx])class QLearningTrainer:def __init__(self, mdp, agent):self.mdp = mdpself.agent = agentdef train_episode(self):"""训练一个回合"""state = self.mdp.reset()state_idx = self.mdp.states.index(state)total_reward = 0steps = 0max_steps = 100td_errors = []while steps < max_steps:# 选择动作action = self.agent.select_action(state_idx)# 执行动作next_state, reward, done = self.mdp.step(state, action)next_state_idx = self.mdp.states.index(next_state)# 更新Q表td_error = self.agent.update_q_table(state_idx, action, reward, next_state_idx, done)td_errors.append(td_error)total_reward += rewardstate = next_statestate_idx = next_state_idxsteps += 1if done:break# 衰减探索率self.agent.decay_epsilon()return total_reward, steps, np.mean(td_errors)def train(self, num_episodes=1000):"""训练多个回合"""rewards_history = []steps_history = []td_error_history = []epsilon_history = []for episode in range(num_episodes):total_reward, steps, avg_td_error = self.train_episode()rewards_history.append(total_reward)steps_history.append(steps)td_error_history.append(avg_td_error)epsilon_history.append(self.agent.epsilon)if (episode + 1) % 100 == 0:avg_reward = np.mean(rewards_history[-100:])avg_steps = np.mean(steps_history[-100:])print(f"回合 {episode + 1}: 平均奖励 = {avg_reward:.2f}, "f"平均步数 = {avg_steps:.2f}, ε = {self.agent.epsilon:.3f}")return rewards_history, steps_history, td_error_history, epsilon_historydef test_policy(self, num_episodes=10):"""测试学习到的策略"""test_rewards = []test_paths = []for _ in range(num_episodes):state = self.mdp.reset()state_idx = self.mdp.states.index(state)path = [state]total_reward = 0steps = 0max_steps = 100while steps < max_steps:# 使用贪心策略action = self.agent.get_greedy_action(state_idx)next_state, reward, done = self.mdp.step(state, action)path.append(next_state)total_reward += rewardstate = next_statestate_idx = self.mdp.states.index(state)steps += 1if done:breaktest_rewards.append(total_reward)test_paths.append(path)return test_rewards, test_pathsdef visualize_q_table(self):"""可视化Q表"""fig, axes = plt.subplots(1, 5, figsize=(20, 4))action_names = ['上', '下', '左', '右', '不动']for action in range(5):# 创建4x4网格来显示Q值q_grid = np.full((4, 4), np.nan)for i, state_id in enumerate(self.mdp.states):x, y = self.mdp.env.state_to_coord(state_id)q_grid[x, y] = self.agent.q_table[i][action]im = axes[action].imshow(q_grid, cmap='RdYlBu_r')axes[action].set_title(f'Q值 - 动作: {action_names[action]}')# 添加数值标注for x in range(4):for 
y in range(4):if not np.isnan(q_grid[x, y]):axes[action].text(y, x, f'{q_grid[x, y]:.1f}',ha="center", va="center", color="black", fontsize=8)# 标记障碍物for obs_x, obs_y in self.mdp.env.obstacles:axes[action].add_patch(Rectangle((obs_y-0.5, obs_x-0.5), 1, 1, facecolor='black', alpha=0.8))axes[action].set_xticks(range(4))axes[action].set_yticks(range(4))plt.colorbar(im, ax=axes[action])plt.tight_layout()return fig# 创建和训练Q-Learning智能体
print("创建Q-Learning智能体...")
q_agent = QLearningAgent(state_size=len(mdp.states), action_size=mdp.n_actions,lr=0.1, gamma=0.9, epsilon=1.0,epsilon_decay=0.995,epsilon_min=0.01
)q_trainer = QLearningTrainer(mdp, q_agent)print("开始Q-Learning训练...")
q_rewards, q_steps, q_td_errors, q_epsilons = q_trainer.train(num_episodes=1000)# 测试学习到的策略
print("\n测试Q-Learning策略...")
q_test_rewards, q_test_paths = q_trainer.test_policy(num_episodes=5)print("Q-Learning测试结果:")
for i, (reward, path) in enumerate(zip(q_test_rewards, q_test_paths)):coord_path = [env.state_to_coord(state) for state in path]print(f"测试 {i+1}: 奖励 = {reward:.1f}, 路径长度 = {len(path)}")print(f"  路径: {coord_path}")# 可视化Q表
fig = q_trainer.visualize_q_table()
plt.show()# 可视化训练过程
fig, axes = plt.subplots(2, 2, figsize=(12, 10))axes[0,0].plot(q_rewards)
axes[0,0].set_title('Q-Learning 训练奖励')
axes[0,0].set_xlabel('回合')
axes[0,0].set_ylabel('总奖励')axes[0,1].plot(q_steps)
axes[0,1].set_title('Q-Learning 回合步数')
axes[0,1].set_xlabel('回合')
axes[0,1].set_ylabel('步数')axes[1,0].plot(q_td_errors)
axes[1,0].set_title('Q-Learning TD误差')
axes[1,0].set_xlabel('回合')
axes[1,0].set_ylabel('平均TD误差')axes[1,1].plot(q_epsilons)
axes[1,1].set_title('Q-Learning 探索率衰减')
axes[1,1].set_xlabel('回合')
axes[1,1].set_ylabel('ε值')plt.tight_layout()
plt.show()

第七章:近端策略优化 (PPO)

7.1 PPO的背景和动机

PPO是目前最流行的深度强化学习算法之一。它解决了传统策略梯度方法的几个问题:

  1. 样本效率低:每次更新只能使用一次数据
  2. 训练不稳定:策略更新步长难以控制
  3. 容易陷入局部最优:贪心更新可能破坏策略

7.2 PPO的核心思想

PPO通过限制策略更新的幅度来确保训练稳定性。

7.2.1 重要性采样 (Importance Sampling)

当我们用旧策略 $\pi_{\theta_{\text{old}}}$ 收集的数据来更新新策略 $\pi_\theta$ 时,需要使用重要性采样:

$$\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s,a)\right]$$

其中比率 $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ 表示新旧策略的概率比。

7.2.2 优势函数 (Advantage Function)

优势函数衡量某个动作相对于平均水平的好坏:
$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$$

直觉理解

  • $A(s,a) > 0$:这个动作比平均水平好
  • $A(s,a) < 0$:这个动作比平均水平差
  • $A(s,a) = 0$:这个动作就是平均水平

7.2.3 GAE (Generalized Advantage Estimation)

实际中,我们使用GAE来估计优势函数:

$$A_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + (\gamma\lambda)^2 \delta_{t+2} + \cdots$$

其中 $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ 是TD误差。
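
按上式从轨迹末尾向前递推即可得到每个时间步的优势,下面是一个最小示意(假设 rewards、values、dones 为等长列表,last_value 为末状态的价值估计;对应的完整实现见 7.5 节的 PPOAgent.compute_gae):

# 示意片段:GAE 的递推计算 A_t = δ_t + γλ(1 - done_t) A_{t+1}
def compute_gae_simple(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    values = list(values) + [last_value]
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]  # TD误差 δ_t
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages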

7.3 PPO的损失函数

7.3.1 剪切目标函数

PPO的核心是剪切概率比率:

$$L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\, A_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t\right)\right]$$

其中:

  • $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$
  • $\epsilon$ 是剪切参数(通常为0.1或0.2)
  • $\text{clip}(x, a, b)$ 将 $x$ 限制在 $[a, b]$ 范围内(对应的核心代码见下面的示意片段)
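
剪切目标的核心只有几行,下面给出一个 PyTorch 示意片段(假设 log_probs、old_log_probs、advantages 为同形状的一维张量;与 7.5 节 update_policy 中的写法一致,这里取负号把最大化目标写成可最小化的损失):

# 示意片段:PPO 剪切代理损失
import torch

def clipped_surrogate_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    ratio = torch.exp(log_probs - old_log_probs)                      # r_t(θ)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(surr1, surr2).mean()                            # 负号:最小化该损失等价于最大化 L^CLIP
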
7.3.2 价值函数损失

$$L^{VF}(\theta) = \mathbb{E}_t\!\left[\left(V_\theta(s_t) - V_t^{\text{target}}\right)^2\right]$$

7.3.3 熵奖励

为了鼓励探索,加入熵奖励:
$$L^{ENT}(\theta) = \mathbb{E}_t\!\left[H\!\left(\pi_\theta(\cdot \mid s_t)\right)\right]$$

7.3.4 总损失函数

$$L(\theta) = L^{CLIP}(\theta) - c_1 L^{VF}(\theta) + c_2 L^{ENT}(\theta)$$

该目标取最大化;7.5 节的代码实现中对其取负号,作为需要最小化的总损失。

7.4 PPO算法流程

  1. 收集轨迹:用当前策略收集N步数据
  2. 计算优势:使用GAE计算优势函数
  3. 多次更新:用同一批数据更新K次策略
  4. 重复:回到步骤1

7.5 Python实现

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categoricalclass ActorCritic(nn.Module):def __init__(self, state_size, action_size, hidden_size=64):super(ActorCritic, self).__init__()# 共享的特征提取层self.fc1 = nn.Linear(state_size, hidden_size)self.fc2 = nn.Linear(hidden_size, hidden_size)# Actor网络(策略)self.actor = nn.Linear(hidden_size, action_size)# Critic网络(价值函数)self.critic = nn.Linear(hidden_size, 1)def forward(self, x):x = F.relu(self.fc1(x))x = F.relu(self.fc2(x))# 策略分布action_logits = self.actor(x)action_probs = F.softmax(action_logits, dim=-1)# 状态价值state_value = self.critic(x)return action_probs, state_valueclass PPOAgent:def __init__(self, state_size, action_size, lr=3e-4, gamma=0.99, lambda_gae=0.95, epsilon_clip=0.2, c1=0.5, c2=0.01):self.state_size = state_sizeself.action_size = action_sizeself.gamma = gammaself.lambda_gae = lambda_gaeself.epsilon_clip = epsilon_clipself.c1 = c1  # 价值函数损失系数self.c2 = c2  # 熵奖励系数# 创建Actor-Critic网络self.network = ActorCritic(state_size, action_size)self.optimizer = optim.Adam(self.network.parameters(), lr=lr)# 存储轨迹数据self.reset_buffers()def reset_buffers(self):"""重置缓冲区"""self.states = []self.actions = []self.rewards = []self.log_probs = []self.values = []self.dones = []def state_to_tensor(self, state_id, mdp):"""将状态转换为张量"""state_idx = mdp.states.index(state_id)state_vector = torch.zeros(len(mdp.states))state_vector[state_idx] = 1.0return state_vector.unsqueeze(0)def select_action(self, state_tensor):"""选择动作并返回相关信息"""with torch.no_grad():action_probs, state_value = self.network(state_tensor)# 创建动作分布并采样action_dist = Categorical(action_probs)action = action_dist.sample()log_prob = action_dist.log_prob(action)return action.item(), log_prob.item(), state_value.item()def store_transition(self, state, action, reward, log_prob, value, done):"""存储转移数据"""self.states.append(state)self.actions.append(action)self.rewards.append(reward)self.log_probs.append(log_prob)self.values.append(value)self.dones.append(done)def compute_gae(self, next_value=0):"""计算GAE优势函数"""advantages = []gae = 0# 添加最后一个价值values = self.values + [next_value]# 从后往前计算GAEfor t in reversed(range(len(self.rewards))):if t == len(self.rewards) - 1:next_non_terminal = 1.0 - self.dones[t]next_value = values[t + 1]else:next_non_terminal = 1.0 - self.dones[t]next_value = values[t + 1]# TD误差delta = self.rewards[t] + self.gamma * next_value * next_non_terminal - values[t]# GAE计算gae = delta + self.gamma * self.lambda_gae * next_non_terminal * gaeadvantages.insert(0, gae)return advantagesdef update_policy(self, epochs=4):"""更新策略网络"""# 计算优势函数和回报advantages = self.compute_gae()advantages = torch.tensor(advantages, dtype=torch.float32)# 计算回报(优势 + 价值)returns = advantages + torch.tensor(self.values, dtype=torch.float32)# 标准化优势advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)# 转换数据为张量old_log_probs = torch.tensor(self.log_probs, dtype=torch.float32)actions = torch.tensor(self.actions, dtype=torch.long)# 将状态列表转换为张量states_tensor = torch.stack([s for s in self.states])# 多次更新策略total_policy_loss = 0total_value_loss = 0total_entropy = 0for epoch in range(epochs):# 前向传播action_probs, state_values = self.network(states_tensor)# 计算当前策略的log概率action_dist = Categorical(action_probs)current_log_probs = action_dist.log_prob(actions)entropy = action_dist.entropy().mean()# 计算概率比率ratio = torch.exp(current_log_probs - old_log_probs)# PPO剪切目标surr1 = ratio * advantagessurr2 = torch.clamp(ratio, 1 - self.epsilon_clip, 1 + self.epsilon_clip) * advantagespolicy_loss = -torch.min(surr1, surr2).mean()# 价值函数损失value_loss = 
F.mse_loss(state_values.squeeze(), returns)# 总损失total_loss = policy_loss + self.c1 * value_loss - self.c2 * entropy# 反向传播self.optimizer.zero_grad()total_loss.backward()torch.nn.utils.clip_grad_norm_(self.network.parameters(), 0.5)self.optimizer.step()total_policy_loss += policy_loss.item()total_value_loss += value_loss.item()total_entropy += entropy.item()# 清空缓冲区self.reset_buffers()return (total_policy_loss / epochs, total_value_loss / epochs, total_entropy / epochs)class PPOTrainer:def __init__(self, mdp, agent):self.mdp = mdpself.agent = agentdef train_episode(self, max_steps=100):"""训练一个回合"""state = self.mdp.reset()total_reward = 0steps = 0while steps < max_steps:# 转换状态state_tensor = self.agent.state_to_tensor(state, self.mdp)# 选择动作action, log_prob, value = self.agent.select_action(state_tensor)# 执行动作next_state, reward, done = self.mdp.step(state, action)# 存储转移self.agent.store_transition(state_tensor, action, reward, log_prob, value, done)total_reward += rewardstate = next_statesteps += 1if done:breakreturn total_reward, stepsdef train(self, num_episodes=1000, update_frequency=20):"""训练多个回合"""rewards_history = []steps_history = []policy_loss_history = []value_loss_history = []entropy_history = []episode_rewards = []episode_steps = []for episode in range(num_episodes):total_reward, steps = self.train_episode()episode_rewards.append(total_reward)episode_steps.append(steps)# 每update_frequency回合更新一次策略if (episode + 1) % update_frequency == 0:policy_loss, value_loss, entropy = self.agent.update_policy()# 记录平均表现avg_reward = np.mean(episode_rewards)avg_steps = np.mean(episode_steps)rewards_history.append(avg_reward)steps_history.append(avg_steps)policy_loss_history.append(policy_loss)value_loss_history.append(value_loss)entropy_history.append(entropy)if len(rewards_history) % 10 == 0:print(f"更新 {len(rewards_history)}: 平均奖励 = {avg_reward:.2f}, "f"平均步数 = {avg_steps:.2f}")# 重置临时记录episode_rewards = []episode_steps = []return (rewards_history, steps_history, policy_loss_history, value_loss_history, entropy_history)def test_policy(self, num_episodes=10):"""测试学习到的策略"""test_rewards = []test_paths = []for _ in range(num_episodes):state = self.mdp.reset()path = [state]total_reward = 0steps = 0max_steps = 100while steps < max_steps:state_tensor = self.agent.state_to_tensor(state, self.mdp)# 使用贪心策略with torch.no_grad():action_probs, _ = self.agent.network(state_tensor)action = torch.argmax(action_probs).item()next_state, reward, done = self.mdp.step(state, action)path.append(next_state)total_reward += rewardstate = next_statesteps += 1if done:breaktest_rewards.append(total_reward)test_paths.append(path)return test_rewards, test_paths# 示例使用PPO
print("创建PPO智能体...")
ppo_agent = PPOAgent(state_size=len(mdp.states), action_size=mdp.n_actions,lr=3e-4,gamma=0.99,lambda_gae=0.95,epsilon_clip=0.2
)ppo_trainer = PPOTrainer(mdp, ppo_agent)print("开始PPO训练...")
ppo_results = ppo_trainer.train(num_episodes=1000, update_frequency=20)
ppo_rewards, ppo_steps, ppo_policy_loss, ppo_value_loss, ppo_entropy = ppo_results# 测试PPO策略
print("\n测试PPO策略...")
ppo_test_rewards, ppo_test_paths = ppo_trainer.test_policy(num_episodes=5)print("PPO测试结果:")
for i, (reward, path) in enumerate(zip(ppo_test_rewards, ppo_test_paths)):coord_path = [env.state_to_coord(state) for state in path]print(f"测试 {i+1}: 奖励 = {reward:.1f}, 路径长度 = {len(path)}")print(f"  路径: {coord_path}")

总结与回顾

强化学习的核心概念

  1. 状态表示:如何描述环境的当前情况
  2. 马尔科夫过程:未来只依赖于现在,不依赖于过去
  3. 贝尔曼方程:价值函数的递归关系
  4. 策略梯度:直接优化策略参数
  5. Q-Learning:学习动作价值函数
  6. PPO:稳定高效的策略优化

算法特点对比

| 算法 | 类型 | 优点 | 缺点 | 适用场景 |
| --- | --- | --- | --- | --- |
| 价值迭代 | 基于模型 | 理论保证,收敛性好 | 需要环境模型 | 小规模、已知环境 |
| 策略梯度 | 基于策略 | 直接优化策略 | 方差大,样本效率低 | 连续动作空间 |
| Q-Learning | 基于价值 | 无需环境模型 | 需要探索策略 | 离散动作空间 |
| PPO | Actor-Critic | 稳定、高效 | 实现复杂 | 现代深度强化学习 |

学习建议

  1. 从简单开始:先理解基本概念,再学习复杂算法
  2. 动手实践:编程实现有助于深入理解
  3. 可视化分析:观察学习过程和结果
  4. 参数调优:学会调整超参数
  5. 扩展应用:尝试在其他问题上应用

进一步学习方向

  1. 深度强化学习:DQN、A3C、SAC等
  2. 多智能体强化学习:竞争与合作
  3. 层次强化学习:分解复杂任务
  4. 元学习:学会如何学习
  5. 实际应用:游戏、机器人、推荐系统等

附录:完整代码整合

# 完整的强化学习实验代码
def run_complete_experiment():
    """运行完整的强化学习实验"""
    # 创建环境
    env = GridWorld()
    mdp = GridWorldMDP(env)

    print("4×4网格世界强化学习实验")
    print("=" * 50)

    # 1. 价值迭代
    print("\n1. 运行价值迭代...")
    solver = BellmanSolver(mdp)
    optimal_V, optimal_policy, _ = solver.value_iteration()

    # 2. 策略梯度
    print("\n2. 训练策略梯度...")
    pg_agent = REINFORCE(len(mdp.states), mdp.n_actions)
    pg_trainer = PolicyGradientTrainer(mdp, pg_agent)
    pg_rewards, _, _ = pg_trainer.train(num_episodes=500)

    # 3. Q-Learning
    print("\n3. 训练Q-Learning...")
    q_agent = QLearningAgent(len(mdp.states), mdp.n_actions)
    q_trainer = QLearningTrainer(mdp, q_agent)
    q_rewards, _, _, _ = q_trainer.train(num_episodes=1000)

    # 4. PPO
    print("\n4. 训练PPO...")
    ppo_agent = PPOAgent(len(mdp.states), mdp.n_actions)
    ppo_trainer = PPOTrainer(mdp, ppo_agent)
    ppo_rewards, _, _, _, _ = ppo_trainer.train(num_episodes=1000)

    print("\n实验完成!所有算法都已训练完毕。")

    return {
        'env': env,
        'mdp': mdp,
        'value_iteration': (optimal_V, optimal_policy),
        'policy_gradient': (pg_agent, pg_rewards),
        'q_learning': (q_agent, q_rewards),
        'ppo': (ppo_agent, ppo_rewards)
    }


# 运行完整实验
if __name__ == "__main__":
    results = run_complete_experiment()
    print("强化学习教程学习完成!")

恭喜你完成了强化学习的完整学习旅程! 🎉

通过这个教程,你已经掌握了强化学习的核心概念和主要算法。现在你可以:

  1. 理解强化学习的基本原理
  2. 实现经典的强化学习算法
  3. 分析和比较不同算法的性能
  4. 将这些知识应用到新的问题中

继续探索和实践,强化学习的世界等待着你去发现更多精彩的内容!

