Reinforcement Learning 1.1: Using the Gymnasium Library
Learning materials link
Environment setup script: automatically installs dependencies and starts a virtual display
import sys, os

if 'google.colab' in sys.modules and not os.path.exists('.setup_complete'):
    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/setup_colab.sh -O- | bash
    !touch .setup_complete

# This code creates a virtual display to draw game images on.
# It will have no effect if your machine has a monitor.
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY")) == 0:
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'
Import dependencies
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Install Gymnasium
!pip install gymnasium
Render one frame as an RGB array, then print observation_space / action_space to see the input and output dimensions
import gymnasium as gym

env = gym.make("MountainCar-v0", render_mode="rgb_array")
env.reset()
plt.imshow(env.render())
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
The three main Gymnasium interfaces
Method | Call format | Returns | Usage |
---|---|---|---|
reset | obs, info = env.reset(seed=?) | initial observation + info dict | start a new episode |
step | obs, r, terminated, truncated, info = env.step(a) | 5-tuple (mind the unpacking) | advance the environment one step |
render | rgb = env.render() | current frame (rgb_array) or a pop-up window | visualization |
The 5 elements returned by step:
new_observation: the new state after the action, used by the agent for its next decision.
reward: the immediate reward for this step.
terminated: the environment itself declares the episode over.
truncated: the episode was cut off by the time limit.
info: extra debugging information; safe to ignore for now.
"Rewind" the environment twice with different seeds
# Set seed to reproduce initial state in stochastic environment
obs0, info = env.reset(seed=0)
print("initial observation code:", obs0)

obs0, info = env.reset(seed=1)
print("initial observation code:", obs0)
# Note: in MountainCar, observation is just two numbers: car position and velocity
Feed action 2 (push right) into the environment and unpack the resulting 5-tuple:
new_obs: the position is slightly larger than before, so the car really did move a little to the right;
reward: the immediate reward for this step (in MountainCar it is -1 on every step until the episode ends);
terminated: still False, the flag has not been reached;
truncated: also False, the step budget is not yet exhausted.
In short: we manually press "right" for one frame and print the new state to see the change.
print("taking action 2 (right)")
new_obs, reward, terminated, truncated, _ = env.step(2)

print("new observation code:", new_obs)
print("reward:", reward)
print("is game over?:", terminated)
print("is game truncated due to time limit?:", truncated)
# Note: as you can see, the car has moved to the right slightly (around 0.0005)
The "hand-crafted policy" exercise:
You are given a MountainCar. The default code simply accelerates to the right forever, but the slope is too steep: gravity drags the car back to the left, and it never reaches the flag.
Goal: without any RL algorithm, use hard-coded logic (if-else, loops, exploiting momentum, swinging back and forth) to make the car reach the flag on the far right on its own.
The environment is defined as follows
from IPython import display

# Create env manually to set time limit. Please don't change this.
TIME_LIMIT = 250
env = gym.wrappers.TimeLimit(
    gym.make("MountainCar-v0", render_mode="rgb_array"),
    max_episode_steps=TIME_LIMIT + 1,
)
actions = {"left": 0, "stop": 1, "right": 2}
Considering the velocity alone is enough: steer the car to accelerate in its current direction of travel
def policy(obs, t):
    # You can use the observation (a tuple of position and velocity),
    # the current time step, or both.
    position, velocity = obs
    # Pump like a swing: push in the direction the car is already moving,
    # so each pass up the slope gains momentum.
    if velocity > 0:
        return actions["right"]
    else:
        return actions["left"]
Reset and run the game
plt.figure(figsize=(4, 3))
display.clear_output(wait=True)

obs, _ = env.reset()
for t in range(TIME_LIMIT):
    plt.gca().clear()

    action = policy(obs, t)  # Call your policy
    # Pass the action chosen by the policy to the environment
    obs, reward, terminated, truncated, _ = env.step(action)

    # We don't do anything with reward here because MountainCar is a very simple environment,
    # and reward is a constant -1. Therefore, your goal is to end the episode as quickly as possible.

    # Draw game image on display.
    plt.imshow(env.render())
    display.display(plt.gcf())
    display.clear_output(wait=True)

    if terminated or truncated:
        print("Well done!")
        break
else:
    print("Time limit exceeded. Try again.")

display.clear_output(wait=True)
Verify completion
assert obs[0] > 0.47
print("You solved it!")
Task complete.