
Title: The Gymnasium Cart Pole Environment and the REINFORCE Algorithm (Introduction to Reinforcement Learning, Part 2)


Table of Contents

  • I. The Gymnasium Cart Pole Environment
  • II. The REINFORCE Algorithm
    • 1. How the Algorithm Works
    • 2. Implementation of the REINFORCE Algorithm


I. The Gymnasium Cart Pole Environment

The Gymnasium Cart Pole environment is a dynamics simulation of an inverted pendulum mounted on a cart.

State space:

0: Cart Position

1: Cart Velocity

2: Pole Angle

3: Pole Angular Velocity

Action space:

0: Push cart to the left

1: Push cart to the right

Immediate reward:

To encourage keeping the pendulum upright for as long as possible, a reward of +1 is given at every timestep.

Episode termination criteria:

Termination: Pole Angle is greater than ±12°

Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)

Truncation: Episode length is greater than 500 for CartPole-v1 (200 for CartPole-v0; the training loop below also caps each episode at 200 steps)
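
The spaces and the episode loop described above can be checked directly against the Gymnasium API. Below is a minimal sketch that creates the environment and runs a single episode with a random policy (not the trained policy from Section II), purely to probe the interface:

import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)   # 4-dimensional Box: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)        # Discrete(2): 0 = push left, 1 = push right

state, info = env.reset(seed=0)
total_reward = 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # random action, just to probe the interface
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # +1 for every step the pole stays within the limits
print(f"Return of one random-policy episode: {total_reward}")
env.close()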


II. The REINFORCE Algorithm

1. How the Algorithm Works

For the principles of the REINFORCE algorithm and its Python implementation, we follow Foundations of Deep Reinforcement Learning: Theory and Practice in Python.
It should be noted that we adopt the improved REINFORCE update here, which subtracts a baseline from the return:

$\nabla_{\theta} J(\pi_\theta) \approx \sum_{t=0}^{T} \left( R_t(\tau) - b \right) \nabla_{\theta} \log \pi_\theta(a_t|s_t)$

where $b$ is the mean return over the whole trajectory, i.e. a constant baseline for each trajectory:

$b = \frac{1}{T} \sum_{t=0}^{T} R_t(\tau)$
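
As a concrete illustration of the baseline, the following sketch computes the discounted returns $R_t(\tau)$ and the constant baseline $b$ for a short, made-up reward sequence, in the same way as the train function in Section 2:

import numpy as np

gamma = 0.99                      # discount factor, as in the code below
rewards = [1.0, 1.0, 1.0, 1.0]    # made-up rewards from one short episode

T = len(rewards)
rets = np.empty(T, dtype=np.float32)
future_ret = 0.0
for t in reversed(range(T)):      # R_t(tau) = sum_{t'=t}^{T} gamma^{t'-t} * r_{t'}
    future_ret = rewards[t] + gamma * future_ret
    rets[t] = future_ret

baseline = rets.mean()            # b = (1/T) * sum_t R_t(tau)
advantages = rets - baseline      # the (R_t(tau) - b) factors that weight the log-probabilities
print(rets)                       # [3.9404  2.9701  1.99    1.    ]
print(baseline, advantages)       # ~2.475, centered returns with mean ~0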
In addition, we stop REINFORCE training once the inverted pendulum has been controlled successfully for 15 consecutive episodes, and then save the policy network.

At test time, the saved policy network is loaded and, even over a longer test horizon, it still balances the pendulum well.
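
The saving and loading mentioned here use PyTorch's module serialization, as implemented by save_model and test_process in Section 2. A minimal sketch of that pair, assuming pi is a trained Pi instance; note that on PyTorch 2.6 and later, torch.load defaults to weights_only=True and refuses a fully pickled module, so that flag has to be relaxed (or a state_dict saved instead):

model_path = "./reinforce_pi.pt"   # same path as in the complete code below

# after training: save the whole policy module, as save_model() does
torch.save(pi, model_path)

# at test time: reload the module and switch it to evaluation mode
# (on PyTorch >= 2.6, pass weights_only=False here for a pickled module)
pi_model = torch.load(model_path)
pi_model.eval()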


2. Implementation of the REINFORCE Algorithm

The policy network for the REINFORCE algorithm:

class Pi(nn.Module):
    # The policy network to be optimized by reinforcement learning
    def __init__(self, in_dim, out_dim):  # in_dim = 4, out_dim = 2
        # super(Pi, self).__init__()
        super().__init__()
        # a policy network
        layers = [
            nn.Linear(in_dim, 64),   # 4 -> 64
            nn.ReLU(),               # activation function
            nn.Linear(64, out_dim),  # 64 -> 2
        ]
        self.model = nn.Sequential(*layers)
        self.onpolicy_reset()  # initialize memory
        self.train()           # set the model to training mode

    def onpolicy_reset(self):
        self.log_probs = []
        self.rewards = []

    def forward(self, x):  # x -> state
        pdparam = self.model(x)  # forward pass
        # pdparam -> probability distribution parameters,
        # such as the logits of a categorical distribution
        return pdparam

    def act(self, state):
        # Convert the state from a NumPy array to a PyTorch tensor,
        # then sample an action from the distribution output by the policy network
        x = torch.from_numpy(state.astype(np.float32))
        # print("state: {}".format(state))
        pdparam = self.forward(x)  # forward pass to obtain the distribution parameters
        # print("pdparam: {}".format(pdparam))
        pd = torch.distributions.Categorical(logits=pdparam)  # probability distribution
        # print("pd.probs: {}\t pd.logits: {}".format(pd.probs, pd.logits))
        action = pd.sample()  # sample an action from pi(a|s) via pd
        # log probability of the sampled action under the distribution pd:
        # $\log(\pi_{\theta}(a_t|s_t))$, where $\pi_{\theta}$ is the policy network,
        # $a_t$ the action and $s_t$ the state at time step $t$
        log_prob = pd.log_prob(action)   # log_prob of pi(a|s)
        self.log_probs.append(log_prob)  # store for training
        return action.item()  # extract the value of a single-element tensor as a scalar
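
As a quick usage check (assuming the imports from the complete code at the end of the post), the policy can be exercised on its own before any training; with randomly initialized weights the two actions are sampled with roughly equal probability:

env = gym.make('CartPole-v1')
pi = Pi(env.observation_space.shape[0], env.action_space.n)  # Pi(4, 2)

state, _ = env.reset()
action = pi.act(state)             # samples 0 or 1 from the categorical distribution
print(action, pi.log_probs[-1])    # the log-probability of the action is stored for training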

Backpropagation training of the policy network:

def train(pi, optimizer):
    # Monte Carlo estimate of the policy-gradient loss, followed by a
    # gradient-ascent update of the policy parameters. Strictly, the loss
    # should be averaged over several sampled trajectories; for simplicity
    # a single trajectory is used per update.
    # Inner gradient-ascent loop of the REINFORCE algorithm
    T = len(pi.rewards)
    rets = np.empty(T, dtype=np.float32)  # initialize returns
    future_ret = 0.0
    # compute the returns efficiently in reverse order
    # R_t(\tau) = \Sigma_{t'=t}^{T} {\gamma^{t'-t} r_{t'}}
    for t in reversed(range(T)):
        future_ret = pi.rewards[t] + gamma * future_ret
        rets[t] = future_ret
    baseline = sum(rets) / T
    rets = torch.tensor(rets)
    rets = rets - baseline  # modify the returns by subtracting a baseline
    log_probs = torch.stack(pi.log_probs)
    # - R_t(\tau) * log(\pi_{\theta}(a_t|s_t)); negative for maximizing
    loss = - log_probs * rets
    loss = torch.sum(loss)  # - \Sigma_{t=0}^{T} [R_t(\tau) * log(\pi_{\theta}(a_t|s_t))]
    optimizer.zero_grad()
    # backpropagate: compute the gradients of the loss w.r.t. the model parameters (\theta)
    loss.backward()
    # gradient ascent: update the weights of the policy network
    optimizer.step()
    return loss

Multi-episode reinforcement learning training; REINFORCE training ends once the pendulum has been controlled successfully for enough consecutive episodes.

def train_main():
    env = gym.make('CartPole-v1', render_mode="human")
    in_dim = env.observation_space.shape[0]  # 4
    out_dim = env.action_space.n  # 2
    pi = Pi(in_dim, out_dim)  # an instance of the policy network for the REINFORCE algorithm
    optimizer = optim.Adam(pi.parameters(), lr=0.01)
    episode = 0
    continuous_solved_episode = 0
    # for epi in range(300): # episode = 300
    while continuous_solved_episode <= 14:
        # state = env.reset() # gym
        state, _ = env.reset()  # gymnasium
        for t in range(200):  # cap each episode at 200 timesteps
            action = pi.act(state)
            # state, reward, done, _ = env.step(action)  # gym
            state, reward, done, _, _ = env.step(action)  # gymnasium
            pi.rewards.append(reward)
            env.render()
            if done:
                break
        loss = train(pi, optimizer)  # train per episode
        total_reward = sum(pi.rewards)
        solved = total_reward > 195.0
        episode += 1
        if solved:
            continuous_solved_episode += 1
        else:
            continuous_solved_episode = 0
        print(f'Episode {episode}, loss: {loss}, '
              f'total_reward: {total_reward}, solved: {solved}, '
              f'continuous_solved: {continuous_solved_episode}')
        pi.onpolicy_reset()  # on-policy: clear memory after training
    save_model(pi)

A short screen recording of training:

REINFORCE_training

Testing is done with the network in evaluation mode; during testing the pendulum can be balanced over a longer horizon.

def test_process():
    env = gym.make('CartPole-v1', render_mode="human")
    # in_dim = env.observation_space.shape[0] # 4
    # out_dim = env.action_space.n # 2
    # pi_model = Pi(in_dim, out_dim)
    pi_model = torch.load(model_path)
    # set the model to evaluation mode
    pi_model.eval()
    # run the policy forward pass without tracking gradients
    with torch.no_grad():
        pi_model.onpolicy_reset()  # on-policy: clear memory before testing
        state, _ = env.reset()  # gymnasium
        steps = 600
        for t in range(steps):  # longer horizon than the 200-step training cap
            action = pi_model.act(state)
            state, reward, done, _, _ = env.step(action)
            pi_model.rewards.append(reward)
            env.render()
            if done:
                break
        total_reward = sum(pi_model.rewards)
        solved = total_reward >= steps
        print(f'[Test] total_reward: {total_reward}, solved: {solved}')

A short screen recording of testing:

REINFORCE_testing

Complete code:

import gymnasium as gym
# import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import sys

gamma = 0.99  # discount factor
model_path = "./reinforce_pi.pt"


class Pi(nn.Module):
    # The policy network to be optimized by reinforcement learning
    def __init__(self, in_dim, out_dim):  # in_dim = 4, out_dim = 2
        # super(Pi, self).__init__()
        super().__init__()
        # a policy network
        layers = [
            nn.Linear(in_dim, 64),   # 4 -> 64
            nn.ReLU(),               # activation function
            nn.Linear(64, out_dim),  # 64 -> 2
        ]
        self.model = nn.Sequential(*layers)
        self.onpolicy_reset()  # initialize memory
        self.train()           # set the model to training mode

    def onpolicy_reset(self):
        self.log_probs = []
        self.rewards = []

    def forward(self, x):  # x -> state
        pdparam = self.model(x)  # forward pass
        # pdparam -> probability distribution parameters,
        # such as the logits of a categorical distribution
        return pdparam

    def act(self, state):
        # Convert the state from a NumPy array to a PyTorch tensor,
        # then sample an action from the distribution output by the policy network
        x = torch.from_numpy(state.astype(np.float32))
        pdparam = self.forward(x)  # forward pass to obtain the distribution parameters
        pd = torch.distributions.Categorical(logits=pdparam)  # probability distribution
        action = pd.sample()  # sample an action from pi(a|s) via pd
        # log probability of the sampled action under pd: $\log(\pi_{\theta}(a_t|s_t))$
        log_prob = pd.log_prob(action)
        self.log_probs.append(log_prob)  # store for training
        return action.item()  # extract the value of a single-element tensor as a scalar


def train(pi, optimizer):
    # Monte Carlo estimate of the policy-gradient loss, followed by a
    # gradient-ascent update of the policy parameters; for simplicity a
    # single trajectory is used per update instead of an average over many.
    # Inner gradient-ascent loop of the REINFORCE algorithm
    T = len(pi.rewards)
    rets = np.empty(T, dtype=np.float32)  # initialize returns
    future_ret = 0.0
    # compute the returns efficiently in reverse order
    # R_t(\tau) = \Sigma_{t'=t}^{T} {\gamma^{t'-t} r_{t'}}
    for t in reversed(range(T)):
        future_ret = pi.rewards[t] + gamma * future_ret
        rets[t] = future_ret
    baseline = sum(rets) / T
    rets = torch.tensor(rets)
    rets = rets - baseline  # modify the returns by subtracting a baseline
    log_probs = torch.stack(pi.log_probs)
    # - R_t(\tau) * log(\pi_{\theta}(a_t|s_t)); negative for maximizing
    loss = - log_probs * rets
    loss = torch.sum(loss)  # - \Sigma_{t=0}^{T} [R_t(\tau) * log(\pi_{\theta}(a_t|s_t))]
    optimizer.zero_grad()
    loss.backward()   # backpropagate: compute gradients of the loss w.r.t. \theta
    optimizer.step()  # gradient ascent: update the weights of the policy network
    return loss


def save_model(pi):
    print("pi.state_dict(): {}\n\n".format(pi.state_dict()))
    for param_tensor in pi.state_dict():
        print(param_tensor, "\t", pi.state_dict()[param_tensor].size())
    torch.save(pi, model_path)


def train_main():
    env = gym.make('CartPole-v1', render_mode="human")
    in_dim = env.observation_space.shape[0]  # 4
    out_dim = env.action_space.n  # 2
    pi = Pi(in_dim, out_dim)  # an instance of the policy network for the REINFORCE algorithm
    optimizer = optim.Adam(pi.parameters(), lr=0.01)
    episode = 0
    continuous_solved_episode = 0
    # for epi in range(300): # episode = 300
    while continuous_solved_episode <= 14:
        # state = env.reset() # gym
        state, _ = env.reset()  # gymnasium
        for t in range(200):  # cap each episode at 200 timesteps
            action = pi.act(state)
            # state, reward, done, _ = env.step(action)  # gym
            state, reward, done, _, _ = env.step(action)  # gymnasium
            pi.rewards.append(reward)
            env.render()
            if done:
                break
        loss = train(pi, optimizer)  # train per episode
        total_reward = sum(pi.rewards)
        solved = total_reward > 195.0
        episode += 1
        if solved:
            continuous_solved_episode += 1
        else:
            continuous_solved_episode = 0
        print(f'Episode {episode}, loss: {loss}, '
              f'total_reward: {total_reward}, solved: {solved}, '
              f'continuous_solved: {continuous_solved_episode}')
        pi.onpolicy_reset()  # on-policy: clear memory after training
    save_model(pi)


def usage():
    if len(sys.argv) != 2:
        print("Usage: python ./REINFORCE.py --train/--test")
        sys.exit()
    mode = sys.argv[1]
    return mode


def test_process():
    env = gym.make('CartPole-v1', render_mode="human")
    # in_dim = env.observation_space.shape[0] # 4
    # out_dim = env.action_space.n # 2
    # pi_model = Pi(in_dim, out_dim)
    pi_model = torch.load(model_path)
    # set the model to evaluation mode
    pi_model.eval()
    # run the policy forward pass without tracking gradients
    with torch.no_grad():
        pi_model.onpolicy_reset()  # on-policy: clear memory before testing
        state, _ = env.reset()  # gymnasium
        steps = 600
        for t in range(steps):  # longer horizon than the 200-step training cap
            action = pi_model.act(state)
            state, reward, done, _, _ = env.step(action)
            pi_model.rewards.append(reward)
            env.render()
            if done:
                break
        total_reward = sum(pi_model.rewards)
        solved = total_reward >= steps
        print(f'[Test] total_reward: {total_reward}, solved: {solved}')


if __name__ == '__main__':
    mode = usage()
    if mode == "--train":
        train_main()
    elif mode == "--test":
        test_process()
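
With the full script saved as REINFORCE.py (the file name assumed by the usage() message), training and testing are started from the command line:

python ./REINFORCE.py --train   # train until 15 consecutive solved episodes, then save ./reinforce_pi.pt
python ./REINFORCE.py --test    # load the saved policy and run one 600-step test episode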

Copyright notice: This is an original article by the blog author, released under the CC 4.0 BY license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/woyaomaishu2/article/details/146382384
Author: wzf@robotics_notes

