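I. Principles of Meta-Reinforcement Learning

1. The Core Idea of Meta-Learning

Meta-reinforcement learning (Meta-RL) trains an agent to adapt quickly to new tasks. Its core idea is to learn knowledge that is shared across a distribution of tasks, rather than a policy for one task. It differs from conventional reinforcement learning as follows: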

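Dimension        Conventional RL                  Meta-RL
Goal             Solve a single task              Adapt quickly to new tasks drawn from a task distribution
Training         Many interactions on one task    Alternating training across multiple tasks
Generalization   Task-specific policy             Policy transferable across tasks

2. The MAML Algorithm Framework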

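Model-Agnostic Meta-Learning (MAML) achieves fast adaptation through a two-level optimization:

  1. Inner loop: take a small number of gradient steps on a single task, starting from the shared initial parameters.

  2. Outer loop: update the shared initial parameters across tasks, using each task's loss after adaptation.

Mathematical formulation (standard MAML notation, shown for a single inner step):

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)$$

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(\theta_i')$$

Here α is the inner (adaptation) learning rate and β is the outer (meta) learning rate. The outer gradient is taken with respect to the initial parameters θ, so it differentiates through the inner update.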

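II. MAML Implementation Steps (with Gymnasium)

We implement MAML on variants of the HalfCheetah task:

  1. Define the task distribution: perturb the robot's body masses and joint damping to generate different tasks.

  2. Build the policy network: an Actor-Critic architecture in PyTorch.

  3. Implement the two-level optimization: inner-loop task adaptation plus an outer-loop meta-update.

  4. Test fast adaptation: evaluate the adapted policy on newly sampled tasks.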
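
Before the full implementation, the two-level optimization in step 3 can be checked by hand on a toy problem. The sketch below is not part of the HalfCheetah code: it assumes hypothetical one-dimensional tasks with loss 0.5 * (theta - c)^2, for which the inner step and the exact meta-gradient have closed forms. All function names and numbers are illustrative.

```python
# Toy check of the MAML meta-gradient on 1-D quadratic tasks (illustrative only).
# Task c has loss L_c(theta) = 0.5 * (theta - c)**2, so grad L_c(theta) = theta - c.

def inner_step(theta, c, alpha):
    """One inner-loop gradient step on task c."""
    return theta - alpha * (theta - c)

def post_adaptation_loss(theta, c, alpha):
    """Loss of task c evaluated at the adapted parameter."""
    adapted = inner_step(theta, c, alpha)
    return 0.5 * (adapted - c) ** 2

def maml_grad(theta, c, alpha):
    """Exact meta-gradient via the chain rule through the inner step.

    d(adapted)/d(theta) = 1 - alpha * L''(theta) = 1 - alpha for this loss.
    """
    adapted = inner_step(theta, c, alpha)
    return (1.0 - alpha) * (adapted - c)

def first_order_grad(theta, c, alpha):
    """First-order approximation: treats d(adapted)/d(theta) as 1."""
    return inner_step(theta, c, alpha) - c

def meta_train(task_centers, alpha=0.1, beta=0.5, steps=200, theta=5.0):
    """Outer loop: average the meta-gradient over all tasks and update theta."""
    for _ in range(steps):
        grad = sum(maml_grad(theta, c, alpha) for c in task_centers) / len(task_centers)
        theta -= beta * grad
    return theta

if __name__ == "__main__":
    centers = [-1.0, 0.0, 4.0]
    print(f"meta-learned theta: {meta_train(centers):.4f}")  # converges to the task mean, 1.0000
```

For these quadratic tasks the meta-learned initialization is simply the mean of the task optima. The factor (1 - alpha) in maml_grad is the second-order term that the full implementation obtains automatically by differentiating through the inner update.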


III. Code Implementation

import time

import gymnasium as gym
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn, optim


# ================== Configuration ==================
class MAMLConfig:
    env_name = "HalfCheetah-v5"
    num_tasks = 20
    adaptation_steps = 10   # inner-loop gradient steps per task
    adaptation_lr = 0.1     # inner-loop (adaptation) learning rate
    hidden_dim = 256        # hidden layer width
    gamma = 0.99            # discount factor
    tau = 0.95              # GAE lambda
    meta_batch_size = 8     # tasks per meta-update
    meta_lr = 3e-4          # outer-loop (meta) learning rate
    total_epochs = 1000
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    clip_grad = 0.5         # gradient-norm clipping threshold


# ================== Policy Network ==================
class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        h = MAMLConfig.hidden_dim
        # Separate feature extractors for the actor and the critic
        self.actor_net = nn.Sequential(
            nn.Linear(state_dim, h), nn.LayerNorm(h), nn.Tanh(),
            nn.Linear(h, h), nn.LayerNorm(h), nn.Tanh(),
        )
        self.critic_net = nn.Sequential(
            nn.Linear(state_dim, h), nn.LayerNorm(h), nn.Tanh(),
            nn.Linear(h, h), nn.LayerNorm(h), nn.Tanh(),
        )
        self.actor_mean = nn.Linear(h, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        self.critic = nn.Linear(h, 1)
        self._init_weights()

    def _init_weights(self):
        # Orthogonal initialization for the hidden layers
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.orthogonal_(m.weight, gain=nn.init.calculate_gain("tanh"))
                nn.init.constant_(m.bias, 0)
        # Smaller initial weights for the policy head
        nn.init.orthogonal_(self.actor_mean.weight, gain=0.01)
        # Unit gain for the value head
        nn.init.orthogonal_(self.critic.weight, gain=1.0)

    @staticmethod
    def _mlp(x, params, prefix):
        # Functional version of the Sequential blocks: Linear layers at
        # indices 0 and 3, LayerNorm at indices 1 and 4, each followed by tanh
        for lin, norm in ((0, 1), (3, 4)):
            x = F.linear(x, params[f"{prefix}.{lin}.weight"], params[f"{prefix}.{lin}.bias"])
            x = F.layer_norm(x, (MAMLConfig.hidden_dim,),
                             params[f"{prefix}.{norm}.weight"], params[f"{prefix}.{norm}.bias"])
            x = torch.tanh(x)
        return x

    def forward(self, state, params=None):
        # With params=None the module's own weights are used; otherwise the
        # same network is evaluated with the given (adapted) weights
        if params is None:
            params = dict(self.named_parameters())
        if state.dim() == 1:
            state = state.unsqueeze(0)  # add a batch dimension
        actor_features = self._mlp(state, params, "actor_net")
        critic_features = self._mlp(state, params, "critic_net")
        mean = F.linear(actor_features, params["actor_mean.weight"], params["actor_mean.bias"])
        value = F.linear(critic_features, params["critic.weight"], params["critic.bias"]).squeeze(-1)
        log_std = params["log_std"].unsqueeze(0).expand_as(mean)
        return mean, log_std, value

    def sample_action(self, state, params=None):
        mean, log_std, _ = self.forward(state, params)
        dist = torch.distributions.Normal(mean, log_std.exp())
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)
        # Drop the batch dimension when a single state was passed in
        if action.shape[0] == 1:
            action = action.squeeze(0)
        return action, log_prob


# ================== Task Generator ==================
class TaskGenerator:
    def __init__(self):
        self.default_params = self._get_default_params()

    def _get_default_params(self):
        env = gym.make(MAMLConfig.env_name)
        model = env.unwrapped.model
        params = {
            "mass": model.body_mass.copy(),
            "damping": model.dof_damping.copy(),
        }
        env.close()
        return params

    def sample_task(self):
        # Scale every body mass and joint damping by a random factor in [0.5, 2.0]
        return {
            name: default * np.random.uniform(0.5, 2.0, size=default.shape)
            for name, default in self.default_params.items()
        }


# ================== MAML Trainer ==================
class MAMLTrainer:
    def __init__(self):
        env = gym.make(MAMLConfig.env_name)
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.shape[0]
        env.close()
        self.policy = ActorCritic(self.state_dim, self.action_dim).to(MAMLConfig.device)
        self.meta_optimizer = optim.Adam(self.policy.parameters(), lr=MAMLConfig.meta_lr)
        self.task_gen = TaskGenerator()
        self.tasks = [self.task_gen.sample_task() for _ in range(MAMLConfig.num_tasks)]

    @staticmethod
    def _make_env(task):
        # Build an environment whose dynamics follow the task parameters
        env = gym.make(MAMLConfig.env_name)
        env.unwrapped.model.body_mass[:] = task["mass"]
        env.unwrapped.model.dof_damping[:] = task["damping"]
        return env

    @staticmethod
    def _to_tensor(array):
        return torch.as_tensor(array, dtype=torch.float32, device=MAMLConfig.device)

    def _rollout(self, env, params):
        # Collect one episode with the given weights; no gradients are needed here
        states, actions, rewards, values, dones = [], [], [], [], []
        obs, _ = env.reset()
        done = False
        while not done:
            with torch.no_grad():
                state = self._to_tensor(obs)
                action, _ = self.policy.sample_action(state, params)
                value = self.policy(state, params)[2].item()
            next_obs, reward, terminated, truncated, _ = env.step(action.cpu().numpy())
            states.append(obs)
            actions.append(action)
            rewards.append(float(reward))
            values.append(value)
            dones.append(terminated)  # a time-limit truncation still bootstraps from the value
            obs = next_obs
            done = terminated or truncated
        with torch.no_grad():
            last_value = self.policy(self._to_tensor(obs), params)[2].item()
        returns, advantages = self._compute_gae(rewards, values, dones, last_value)
        return self._to_tensor(np.array(states)), torch.stack(actions), returns, advantages, sum(rewards)

    def _compute_gae(self, rewards, values, dones, last_value):
        values = values + [last_value]
        gae = 0.0
        advantages = []
        for t in reversed(range(len(rewards))):
            not_done = 1.0 - float(dones[t])
            delta = rewards[t] + MAMLConfig.gamma * values[t + 1] * not_done - values[t]
            gae = delta + MAMLConfig.gamma * MAMLConfig.tau * not_done * gae
            advantages.append(gae)
        advantages.reverse()
        advantages = self._to_tensor(advantages)
        returns = advantages + self._to_tensor(values[:-1])
        # Normalize the advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        return returns, advantages

    def _loss(self, params, states, actions, returns, advantages):
        mean, log_std, values = self.policy(states, params)
        dist = torch.distributions.Normal(mean, log_std.exp())
        log_probs = dist.log_prob(actions).sum(-1)
        policy_loss = -(log_probs * advantages).mean()
        value_loss = F.mse_loss(values, returns)
        entropy_loss = -dist.entropy().mean()  # entropy regularization
        return policy_loss + 0.5 * value_loss + 0.01 * entropy_loss

    def adapt_task(self, task, num_steps, create_graph=True):
        # Inner loop: a few policy-gradient steps on one task, starting from the meta-parameters
        env = self._make_env(task)
        fast_weights = dict(self.policy.named_parameters())
        for _ in range(num_steps):
            states, actions, returns, advantages, _ = self._rollout(env, fast_weights)
            loss = self._loss(fast_weights, states, actions, returns, advantages)
            # create_graph=True keeps this update differentiable for the meta-gradient
            grads = torch.autograd.grad(loss, list(fast_weights.values()),
                                        create_graph=create_graph, allow_unused=True)
            fast_weights = {
                name: p if g is None else p - MAMLConfig.adaptation_lr * g
                for (name, p), g in zip(fast_weights.items(), grads)
            }
        env.close()
        return fast_weights

    def meta_update(self, tasks):
        # Outer loop: evaluate each adapted policy on a fresh trajectory and
        # accumulate the gradient with respect to the meta-parameters
        self.meta_optimizer.zero_grad()
        meta_loss = 0.0
        for task in tasks:
            fast_weights = self.adapt_task(task, MAMLConfig.adaptation_steps)
            env = self._make_env(task)
            states, actions, returns, advantages, _ = self._rollout(env, fast_weights)
            env.close()
            task_loss = self._loss(fast_weights, states, actions, returns, advantages) / len(tasks)
            task_loss.backward()  # backpropagate per task so its graph can be freed
            meta_loss += task_loss.item()
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), MAMLConfig.clip_grad)
        self.meta_optimizer.step()
        return meta_loss

    def train(self):
        for epoch in range(MAMLConfig.total_epochs):
            idx = np.random.choice(len(self.tasks), MAMLConfig.meta_batch_size, replace=False)
            loss = self.meta_update([self.tasks[i] for i in idx])
            if (epoch + 1) % 50 == 0:
                print(f"Epoch {epoch+1:04d} | Meta Loss: {loss:.1f}")
                self._evaluate()

    def _evaluate(self, num_tasks=3):
        total_rewards = []
        for _ in range(num_tasks):
            task = self.task_gen.sample_task()
            # No meta-gradient is needed at evaluation time
            fast_weights = self.adapt_task(task, MAMLConfig.adaptation_steps, create_graph=False)
            env = self._make_env(task)
            *_, episode_reward = self._rollout(env, fast_weights)
            env.close()
            total_rewards.append(episode_reward)
        print(f"Evaluation | Avg Reward: {np.mean(total_rewards):.1f}")


if __name__ == "__main__":
    start = time.time()
    print(f"Start time: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start))}")
    print("Initializing environment...")
    trainer = MAMLTrainer()
    trainer.train()
    end = time.time()
    print(f"Training finished at: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(end))}")
    print(f"Training finished, elapsed: {end - start:.2f} s")
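
The advantage estimates in _compute_gae follow Generalized Advantage Estimation (GAE), with the config value tau playing the role usually written as lambda. As a minimal pure-Python sketch of the same recursion, assuming an episode with no early termination, the standalone function below (compute_gae is an illustrative name, and the three-step episode is hand-made rather than taken from training) can be checked by hand:

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one episode without early termination.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    values = list(values) + [last_value]   # append the bootstrap value
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # Value targets are the advantages added back onto the baseline values
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns

if __name__ == "__main__":
    adv, ret = compute_gae(rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0],
                           last_value=0.0, gamma=0.5, lam=1.0)
    print(adv)  # [1.75, 1.5, 1.0]
```

With lam=1.0 and a zero value function, the advantages reduce to plain discounted returns, which is why the last entry is 1.0 and each earlier entry is 1 + 0.5 times the next one.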

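IV. Key Code Walkthrough

  1. Task generator

    • New tasks are generated by rescaling the robot's body masses and joint damping coefficients.

    • Each task therefore has different physical dynamics.

  2. Two-level optimization

    • adapt_task: the inner loop runs policy-gradient updates on a single task.

    • meta_update: the outer loop updates the initial parameters across tasks.

  3. Fast policy adaptation

    • torch.autograd.grad is called with create_graph=True, so the inner updates stay differentiable and the meta-gradient includes second-order terms.

    • Task-specific updates are applied to a separate copy of the parameters (the fast weights), leaving the shared initial parameters untouched.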
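
The effect of create_graph=True in item 3 can be seen in isolation. The sketch below is not taken from the trainer: it assumes a hypothetical scalar task with loss (theta - target)^4, chosen only because its second derivative is not constant, and compares the full meta-gradient with the first-order approximation that ignores the inner update. The variable names and values are illustrative.

```python
import torch

def task_loss(theta, target):
    # Hypothetical task loss; the quartic makes the second-order term visible
    return (theta - target) ** 4

theta = torch.tensor(2.0, requires_grad=True)   # meta-parameter
target, alpha = 0.0, 0.01                       # illustrative values

# Inner step: create_graph=True keeps the gradient itself differentiable
(inner_grad,) = torch.autograd.grad(task_loss(theta, target), theta, create_graph=True)
fast_theta = theta - alpha * inner_grad         # adapted ("fast") weight

# Outer step: differentiate the post-adaptation loss w.r.t. the meta-parameter
(meta_grad,) = torch.autograd.grad(task_loss(fast_theta, target), theta)

# First-order approximation: ignore how fast_theta depends on theta
fast_detached = fast_theta.detach().requires_grad_(True)
(first_order_grad,) = torch.autograd.grad(task_loss(fast_detached, target), fast_detached)

print(f"full meta-gradient:  {meta_grad.item():.4f}")
print(f"first-order version: {first_order_grad.item():.4f}")
```

The two values differ by the factor d(fast_theta)/d(theta) = 1 - alpha * 12 * theta^2, which is 0.52 for these numbers. Dropping create_graph=True in the inner step would remove that factor and turn the update into a first-order approximation.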


V. Sample Training Output

Start time: 2025-03-19 12:49:54
Initializing environment...
Epoch 0050 | Meta Loss: 18.0
Evaluation | Avg Reward: -299.6
Epoch 0100 | Meta Loss: 21.5
Evaluation | Avg Reward: -193.3
Epoch 0150 | Meta Loss: 14.8
Evaluation | Avg Reward: -199.7
Epoch 0200 | Meta Loss: 25.3
Evaluation | Avg Reward: -317.4
Epoch 0250 | Meta Loss: 16.7
Evaluation | Avg Reward: -174.8
Epoch 0300 | Meta Loss: 24.3
Evaluation | Avg Reward: -277.6
Epoch 0350 | Meta Loss: 12.3
Evaluation | Avg Reward: -249.0
Epoch 0400 | Meta Loss: 25.4
Evaluation | Avg Reward: -253.4
Epoch 0450 | Meta Loss: 13.6
Evaluation | Avg Reward: -222.1
Epoch 0500 | Meta Loss: 27.9
Evaluation | Avg Reward: -295.4
Epoch 0550 | Meta Loss: 23.3
Evaluation | Avg Reward: -484.5
Epoch 0600 | Meta Loss: 17.2
Evaluation | Avg Reward: -315.4
Epoch 0650 | Meta Loss: 16.0
Evaluation | Avg Reward: -250.3
Epoch 0700 | Meta Loss: 20.9
Evaluation | Avg Reward: -300.3
Epoch 0750 | Meta Loss: 33.4
Evaluation | Avg Reward: -305.0
Epoch 0800 | Meta Loss: 61.8
Evaluation | Avg Reward: -260.7
Epoch 0850 | Meta Loss: 10.9
Evaluation | Avg Reward: -311.5
Epoch 0900 | Meta Loss: 24.7
Evaluation | Avg Reward: -299.8
Epoch 0950 | Meta Loss: 14.5
Evaluation | Avg Reward: -321.9
Epoch 1000 | Meta Loss: 12.0
Evaluation | Avg Reward: -275.3
Training finished at: 2025-03-20 09:28:03
Training finished, elapsed: 74288.70 s

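VI. Summary and Extensions

This article implemented MAML, the core paradigm of meta-reinforcement learning, and showed how a policy can be adapted quickly to new tasks. Readers may want to try the following extensions:

  1. Efficient exploration strategies: combine MAML with Proximal Policy Optimization (PPO) or Soft Actor-Critic (SAC) to improve sample efficiency.

  2. Multi-modal task adaptation: use a task-conditioned policy network to handle discrete task types.

In the next article, we will explore multi-agent reinforcement learning (MARL) and implement the MADDPG algorithm!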


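Notes

  1. Install the dependencies:

    pip install "gymnasium[mujoco]" torch

  2. Full training benefits from GPU acceleration (8 GB or more of GPU memory recommended). The sample run above took about 20.6 hours.

  3. If the environment fails to initialize, check that the MuJoCo Python bindings are installed. The gymnasium[mujoco] extra uses the open-source mujoco package, which does not need a license key file:

    python -c "import mujoco; print(mujoco.__version__)"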
