「日拱一码」057 Inverse Reinforcement Learning (IRL)
Contents
Introduction to IRL
Core Idea
Key Concepts in IRL
Maximum Margin IRL
Feature Expectation Matching
Probabilistic IRL
Code Example: IRL in a Grid World
Characteristics of IRL
Advantages
Limitations
Practical Recommendations
Introduction to IRL
Inverse Reinforcement Learning (IRL) infers a reward function from expert demonstrations instead of learning a policy directly. Unlike standard reinforcement learning, which optimizes behavior for a given reward, IRL aims to understand the "why" behind the expert's behavior.
Core Idea
- Input: expert demonstration trajectories
- Output: an inferred reward function
- Applications: robot learning, autonomous driving, game AI, and more
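Stated a bit more formally (a generic formulation, not tied to any particular IRL algorithm): standard RL is given a full MDP $(S, A, T, \gamma, R)$ and searches for an optimal policy, whereas IRL is given the MDP without its reward plus demonstrations $D = \{\tau_1, \dots, \tau_N\}$ from an expert policy $\pi_E$, and searches for a reward $\hat{R}$ under which the expert is (near-)optimal:

$$
\mathbb{E}_{\pi_E}\!\Big[\sum_{t=0}^{\infty} \gamma^{t}\, \hat{R}(s_t)\Big]
\;\ge\;
\mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty} \gamma^{t}\, \hat{R}(s_t)\Big]
\quad \text{for all policies } \pi
$$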
Key Concepts in IRL
Maximum Margin IRL
Find a reward function under which the expert policy outperforms all other policies by as large a margin as possible.
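For a linear reward $R_w(s) = w^\top \phi(s)$, this idea is commonly written as a constrained max-margin problem (one standard formulation in the spirit of apprenticeship learning; the feature expectations $\mu(\pi)$ are defined in the next subsection):

$$
\max_{w,\,t:\ \|w\|_2 \le 1}\; t
\quad \text{s.t.} \quad
w^\top \mu(\pi_E) \;\ge\; w^\top \mu(\pi) + t \quad \text{for every other policy } \pi
$$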
Feature Expectation Matching
The feature expectations of the learned policy should match those of the expert policy.
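Here the feature expectation of a policy is its expected discounted sum of state features. With a linear reward, matching feature expectations implies matching expected returns for every weight vector, which is why this condition is enough to imitate the expert:

$$
\mu(\pi) = \mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t)\Big],
\qquad
\mu(\hat{\pi}) \approx \mu(\pi_E) \;\Longrightarrow\; w^\top \mu(\hat{\pi}) \approx w^\top \mu(\pi_E) \ \text{ for any } w
$$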
Probabilistic IRL
Frame IRL as a probabilistic inference problem.
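The best-known example is maximum entropy IRL, which assumes that demonstrated trajectories occur with probability proportional to their exponentiated return and fits the reward parameters by maximum likelihood (shown here for the deterministic-dynamics case):

$$
P(\tau \mid \theta) \;\propto\; \exp\!\Big(\sum_{s_t \in \tau} R_\theta(s_t)\Big),
\qquad
\theta^{*} = \arg\max_{\theta}\; \sum_{\tau \in D} \log P(\tau \mid \theta)
$$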
Code Example: IRL in a Grid World
```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt


# Grid-world environment
class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.state = (0, 0)
        self.actions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        x, y = self.state
        dx, dy = self.actions[action]
        new_x = max(0, min(self.size - 1, x + dx))
        new_y = max(0, min(self.size - 1, y + dy))
        self.state = (new_x, new_y)
        done = (self.state == self.goal)
        reward = 1.0 if done else -0.1
        return self.state, reward, done

    def get_features(self, state):
        """Convert a state into a one-hot feature vector."""
        features = np.zeros(self.size ** 2)
        idx = state[0] * self.size + state[1]
        features[idx] = 1.0
        return features


# Generate expert demonstrations
def generate_expert_trajectories(env, n_trajectories=10):
    expert_trajs = []
    for _ in range(n_trajectories):
        traj = []
        state = env.reset()
        done = False
        while not done:
            # Expert policy: move right or down toward the goal
            if np.random.rand() < 0.5 and state[1] < env.size - 1:
                action = 0  # right
            else:
                action = 1  # down
            next_state, reward, done = env.step(action)
            features = env.get_features(state)
            traj.append((state, action, features))
            state = next_state
        expert_trajs.append(traj)
    return expert_trajs


# IRL model: linear reward R(s) = w^T phi(s)
class IRLModel(nn.Module):
    def __init__(self, feature_size):
        super().__init__()
        self.reward_weights = nn.Parameter(torch.randn(feature_size))

    def forward(self, features):
        return torch.matmul(features, self.reward_weights)


# Max-margin IRL training via feature expectation matching
def train_irl(env, expert_trajs, epochs=100, lr=0.01):
    feature_size = env.size ** 2
    model = IRLModel(feature_size)
    optimizer = optim.Adam(model.parameters(), lr=lr)

    # Expert feature expectations (average feature counts per trajectory)
    expert_feature_exp = np.zeros(feature_size)
    for traj in expert_trajs:
        for _, _, features in traj:
            expert_feature_exp += features
    expert_feature_exp /= len(expert_trajs)
    expert_feature_exp = torch.FloatTensor(expert_feature_exp)

    max_steps = env.size ** 2 * 2  # cap rollouts so a bad reward cannot loop forever

    for epoch in range(epochs):
        # Roll out the learner policy under the current reward
        learner_trajs = []
        for _ in range(len(expert_trajs)):
            traj = []
            state = env.reset()
            done = False
            steps = 0
            while not done and steps < max_steps:
                # Simple greedy policy: pick the action whose next state
                # has the highest predicted reward
                best_action = 0
                best_reward = -float('inf')
                for action in range(len(env.actions)):
                    env_copy = GridWorld(env.size)
                    env_copy.state = state
                    next_state, _, _ = env_copy.step(action)
                    next_features = torch.FloatTensor(env.get_features(next_state))
                    next_reward = model(next_features).item()
                    if next_reward > best_reward:
                        best_reward = next_reward
                        best_action = action
                next_state, _, done = env.step(best_action)
                features = env.get_features(state)
                traj.append((state, best_action, features))
                state = next_state
                steps += 1
            learner_trajs.append(traj)

        # Learner feature expectations
        learner_feature_exp = np.zeros(feature_size)
        for traj in learner_trajs:
            for _, _, features in traj:
                learner_feature_exp += features
        learner_feature_exp /= len(learner_trajs)
        learner_feature_exp = torch.FloatTensor(learner_feature_exp)

        # Max-margin-style loss: push the expert's expected reward above the
        # learner's; the small L2 penalty keeps the weights bounded. (The loss
        # must go through `model` so that gradients reach reward_weights.)
        loss = (model(learner_feature_exp) - model(expert_feature_exp)
                + 0.01 * (model.reward_weights ** 2).sum())

        # Update reward parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

    return model


# Main program
if __name__ == "__main__":
    env = GridWorld(size=5)

    # Generate expert trajectories
    expert_trajs = generate_expert_trajectories(env, n_trajectories=20)
    print(f"Generated {len(expert_trajs)} expert trajectories")

    # Train the IRL model
    irl_model = train_irl(env, expert_trajs, epochs=100, lr=0.1)

    # Visualize the learned reward function
    reward_map = np.zeros((env.size, env.size))
    for i in range(env.size):
        for j in range(env.size):
            features = torch.FloatTensor(env.get_features((i, j)))
            reward_map[i, j] = irl_model(features).item()

    plt.imshow(reward_map, cmap='hot')
    plt.colorbar()
    plt.title("Learned Reward Function")
    plt.show()
```
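As a quick sanity check (not part of the original script, and assuming the code above has already been run so that `env`, `irl_model`, and `GridWorld` exist), you can roll out the same greedy one-step-lookahead policy under the learned reward and verify that it actually walks from the start state to the goal:

```python
import torch

state = env.reset()
path = [state]
for _ in range(env.size ** 2):                  # hard cap on rollout length
    best_action, best_value = 0, -float('inf')
    for action in range(len(env.actions)):
        probe = GridWorld(env.size)             # probe each action from the current state
        probe.state = state
        next_state, _, _ = probe.step(action)
        value = irl_model(torch.FloatTensor(env.get_features(next_state))).item()
        if value > best_value:
            best_value, best_action = value, action
    state, _, done = env.step(best_action)
    path.append(state)
    if done:
        break

print("Greedy path under the learned reward:", path)
```

If training succeeded, the printed path should end at the goal (4, 4).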
Characteristics of IRL
Advantages
- No need to hand-design a reward function in advance
- Can recover the latent objective behind expert behavior
- Well suited to tasks where the reward is hard to specify explicitly
Limitations
- Computationally expensive: most methods solve or approximate an RL problem inside every update
- The solution is not unique: many reward functions (including degenerate ones such as a constant reward) explain the same behavior
- Requires a sufficient number of expert demonstrations
Practical Recommendations
- Feature design:
  - Choose features that capture the essence of the task
  - Consider letting a neural network learn features automatically (the sketch after this list shows one way)
- Algorithm choice:
  - Maximum margin IRL: simple and straightforward
  - Probabilistic IRL: more robust but more complex
  - Deep IRL: suited to high-dimensional state spaces
- Further improvements:
  - Model the reward function with a deep network (see the sketch after this list)
  - Combine with a GAN-style framework (e.g., GAIL)
  - Add regularization to prevent overfitting
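As a minimal sketch of these last points (and of the earlier suggestion to learn features with a neural network): the linear `IRLModel` above can be replaced by a small MLP, whose hidden layers play the role of learned features, while weight decay provides the L2 regularization. The class name `DeepRewardNet` and the layer sizes are illustrative choices, not part of the original example:

```python
import torch
import torch.nn as nn
import torch.optim as optim


class DeepRewardNet(nn.Module):
    """Reward model R_theta(s): a small MLP over raw state features."""

    def __init__(self, feature_size, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, features):
        # Works for a single feature vector or a batch; returns scalar reward(s)
        return self.net(features).squeeze(-1)


# Drop-in replacement for IRLModel in train_irl; weight_decay adds the
# L2 regularization mentioned above
feature_size = 25                       # e.g. a 5x5 grid with one-hot state features
model = DeepRewardNet(feature_size)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Quick shape check
dummy = torch.rand(8, feature_size)
print(model(dummy).shape)               # torch.Size([8])
```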