
Reinforcement Learning 1.1: Using the Gymnasium Library

Link to the study materials

Environment setup script: installs the dependencies automatically and starts a virtual display

import sys, os
if 'google.colab' in sys.modules and not os.path.exists('.setup_complete'):
    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/setup_colab.sh -O- | bash
    !touch .setup_complete

# This code creates a virtual display to draw game images on.
# It will have no effect if your machine has a monitor.
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY")) == 0:
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'

Import the dependencies

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Install Gymnasium

!pip install gymnasium

Render one frame as an RGB array, then print observation_space and action_space to see the dimensions of the agent's input and output.

import gymnasium as gym

env = gym.make("MountainCar-v0", render_mode="rgb_array")
env.reset()

plt.imshow(env.render())
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

The three main Gymnasium interfaces

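| Method | Call format | Returns | Use |
|--------|-------------|---------|-----|
| reset | obs, info = env.reset(seed=?) | Initial observation + auxiliary dict | Start a new episode |
| step | obs, r, terminated, truncated, info = env.step(a) | 5-tuple (mind the unpacking) | Advance the environment by one step |
| render | rgb = env.render() | Current frame (rgb_array) or a pop-up window | Visualization |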

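The 5 elements returned by step:
new_observation: the new state after the action; the agent uses it to choose its next action.
reward: the immediate reward for this step.
terminated: the environment itself reports that the episode is over (a terminal state was reached).
truncated: the episode was cut short from outside, for example because the time limit ran out.
info: extra information for debugging; it can be ignored at first.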
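The reset/step contract above can be sketched as a generic episode loop. The block below is a minimal illustration, not Gymnasium code: `CountdownEnv` and `run_episode` are hypothetical names, and the toy environment only imitates the `reset` and `step` call signatures so that the example runs without any dependencies.

```python
# CountdownEnv is a hypothetical toy environment. It is NOT part of Gymnasium;
# it only imitates the reset/step signatures described above.
class CountdownEnv:
    """The state counts down from `start`; the episode terminates at 0."""

    def __init__(self, start=3):
        self.start = start
        self.state = start

    def reset(self, seed=None):
        self.state = self.start
        return self.state, {}  # (observation, info)

    def step(self, action):
        self.state -= action          # the action is how much to subtract
        terminated = self.state <= 0  # the task itself is finished
        truncated = False             # a time-limit wrapper would set this
        return self.state, -1.0, terminated, truncated, {}


def run_episode(env, policy, max_steps):
    """Run one episode; return (total_reward, steps, terminated, truncated)."""
    obs, info = env.reset(seed=0)
    total_reward, steps = 0.0, 0
    terminated = truncated = False
    for t in range(max_steps):
        # Unpack all five values; stop on either end-of-episode flag.
        obs, reward, terminated, truncated, info = env.step(policy(obs, t))
        total_reward += reward
        steps += 1
        if terminated or truncated:
            break
    return total_reward, steps, terminated, truncated


result = run_episode(CountdownEnv(start=3), lambda obs, t: 1, max_steps=10)
print(result)
```

Here the run ends after three steps with terminated set to True and truncated still False. The same loop shape should work with a real Gymnasium environment, since it relies only on the two calls listed in the table.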

"Rewind" the environment twice, using a different seed each time

# Set seed to reproduce initial state in stochastic environment
obs0, info = env.reset(seed=0)
print("initial observation code:", obs0)

obs0, info = env.reset(seed=1)
print("initial observation code:", obs0)

# Note: in MountainCar, observation is just two numbers: car position and velocity
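Why a seed "rewinds" the start state can be illustrated with a small stand-in. This is a sketch under assumptions: `ToyCarStart` is a hypothetical class that uses Python's `random` module, whereas Gymnasium uses its own NumPy-based generator, so the actual numbers will differ. The start range (position uniform in [-0.6, -0.4], velocity 0) is the one described in the MountainCar documentation.

```python
import random


class ToyCarStart:
    """Hypothetical stand-in: imitates only how reset(seed=...) picks a start state."""

    def __init__(self):
        self.rng = random.Random()

    def reset(self, seed=None):
        if seed is not None:
            # Re-create the generator only when a seed is passed; otherwise
            # keep drawing from the existing random stream.
            self.rng = random.Random(seed)
        position = self.rng.uniform(-0.6, -0.4)
        return (position, 0.0), {}  # (observation, info)


toy_env = ToyCarStart()
first, _ = toy_env.reset(seed=0)
second, _ = toy_env.reset(seed=1)
replay, _ = toy_env.reset(seed=0)

print("seed=0:", first)
print("seed=1:", second)
print("seed=0 again:", replay)
```

The same seed reproduces the same start state, while a different seed gives a different one.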


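Send action 2 (push right) to the environment and get back the five return values for the next step:

  • In this run, the position in new_obs is about 0.0008 larger than before, so the car really did move slightly to the right (the exact shift depends on the starting position);
  • reward is the immediate reward for this step (in MountainCar it is -1 on every step);
  • terminated is still False: the car has not reached the flag yet;
  • truncated is also False: the step budget has not run out.

In short: we manually pushed right for a single frame and printed the new state to see what changed.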

print("taking action 2 (right)")
new_obs, reward, terminated, truncated, _ = env.step(2)

print("new observation code:", new_obs)
print("reward:", reward)
print("is game over?:", terminated)
print("is game truncated due to time limit?:", truncated)

# Note: as you can see, the car has moved to the right slightly (around 0.0005)
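To see where a shift of this size comes from, here is a minimal pure-Python sketch of a single MountainCar update. The constants and the update rule are written from my reading of Gymnasium's MountainCar-v0 source and should be treated as assumptions to check against your installed version; `mountain_car_step` is a hypothetical helper, not a library function.

```python
import math

# Assumed constants (from Gymnasium's MountainCar-v0 as I understand it).
FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED = -1.2, 0.6, 0.07


def mountain_car_step(position, velocity, action):
    """One update: action 0 pushes left, 1 does nothing, 2 pushes right."""
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the car stops dead at the left wall
    return position, velocity


# Shift in position after one push to the right, from rest,
# for three possible starting positions.
shifts = {}
for start in (-0.6, -0.5, -0.4):
    new_pos, _ = mountain_car_step(start, 0.0, action=2)
    shifts[start] = new_pos - start
    print(f"start={start:+.1f}  shift={shifts[start]:.6f}")
```

Under these assumptions, a start at -0.5 gives a shift of roughly 0.0008, and the shift grows or shrinks as the start moves toward -0.6 or -0.4, because the slope term `cos(3 * position)` changes with position. That would explain why different runs report different values.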


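Complete the hand-crafted policy exercise:
You are given a MountainCar. The default code only ever pushes right, but the slope is too steep and gravity drags the car back to the left, so it never reaches the flag.
Goal: without any RL algorithm, use hard-coded logic (if-else, loops, exploiting acceleration, swinging back and forth) to get the car to the flag at the far right.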

The environment is defined as follows

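from IPython import display

# Create env manually to set time limit. Please don't change this.
# Note: gym.make already applies MountainCar-v0's registered 200-step limit,
# so episodes are truncated after 200 steps even though this wrapper allows more.
TIME_LIMIT = 250
env = gym.wrappers.TimeLimit(
    gym.make("MountainCar-v0", render_mode="rgb_array"),
    max_episode_steps=TIME_LIMIT + 1,
)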
actions = {"left": 0, "stop": 1, "right": 2}

Looking at the velocity alone is enough: accelerate the car in the direction it is already moving.

def policy(obs, t):
    # You can use the observation (a tuple of position and velocity),
    # the current time step, or both, if you want.
    position, velocity = obs

    # Push in the direction the car is already moving, the way you would
    # make a swing go farther and faster. The notebook's example policy
    # (always push right) does not work; you don't need anything
    # sophisticated here, and any hard-coded policy that works is fine.
    if velocity > 0:
        return actions["right"]
    else:
        return actions["left"]
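The difference between the two policies can be checked without rendering anything. The sketch below reuses a pure-Python approximation of the MountainCar dynamics; the constants, the goal test, and the fixed start at position -0.5 are assumptions taken from my reading of Gymnasium's MountainCar-v0, and `steps_to_goal` is a hypothetical helper. It runs in double precision, so step counts may not match the real environment exactly.

```python
import math

# Assumed constants (from Gymnasium's MountainCar-v0 as I understand it).
FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED, GOAL_POS = -1.2, 0.6, 0.07, 0.5


def mountain_car_step(position, velocity, action):
    """One update: action 0 pushes left, 1 does nothing, 2 pushes right."""
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the car stops dead at the left wall
    return position, velocity


def steps_to_goal(policy, position=-0.5, velocity=0.0, time_limit=250):
    """Return the step at which the flag is reached, or None on timeout."""
    for t in range(1, time_limit + 1):
        action = policy(position, velocity)
        position, velocity = mountain_car_step(position, velocity, action)
        if position >= GOAL_POS and velocity >= 0:
            return t
    return None


def always_right(position, velocity):
    return 2


def swing(position, velocity):
    # Same rule as the policy above: push in the direction of motion.
    return 2 if velocity > 0 else 0


always_right_steps = steps_to_goal(always_right)
swing_steps = steps_to_goal(swing)
print("always right:", always_right_steps)
print("swing:", swing_steps)
```

Under these assumptions, pushing right the whole time never builds enough energy to climb the hill within the time limit, while the swing rule adds energy on every step and reaches the flag before the limit.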

Reset the environment and run the episode

plt.figure(figsize=(4, 3))
display.clear_output(wait=True)

obs, _ = env.reset()
for t in range(TIME_LIMIT):
    plt.gca().clear()

    action = policy(obs, t)  # Call your policy
    obs, reward, terminated, truncated, _ = env.step(action)  # Pass the action chosen by the policy to the environment

    # We don't do anything with reward here because MountainCar is a very simple environment,
    # and reward is a constant -1. Therefore, your goal is to end the episode as quickly as possible.

    # Draw game image on display.
    plt.imshow(env.render())

    display.display(plt.gcf())
    display.clear_output(wait=True)

    if terminated or truncated:
        print("Well done!")
        break
else:
    print("Time limit exceeded. Try again.")

display.clear_output(wait=True)


Check whether the task was solved

assert obs[0] > 0.47
print("You solved it!")

Task complete
