Reinforcement Learning 1.1: Using the Gymnasium Library
Learning materials link
Environment setup script: automatically installs dependencies and starts a virtual display
import sys, os

if 'google.colab' in sys.modules and not os.path.exists('.setup_complete'):
    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/setup_colab.sh -O- | bash
    !touch .setup_complete

# This code creates a virtual display to draw game images on.
# It will have no effect if your machine has a monitor.
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY")) == 0:
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'
Import dependencies
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Install Gymnasium
!pip install gymnasium
Render one frame as an RGB array, then print observation_space / action_space to see the input and output dimensions
import gymnasium as gym

env = gym.make("MountainCar-v0", render_mode="rgb_array")
env.reset()
plt.imshow(env.render())
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
The three main Gymnasium interfaces
Method | Call format | Returns | Usage |
---|---|---|---|
reset | obs, info = env.reset(seed=?) | initial observation + info dict | start a new episode |
step | obs, r, terminated, truncated, info = env.step(a) | 5-tuple (mind the unpacking) | advance the environment one step |
render | rgb = env.render() | current frame (rgb_array) or a pop-up window | visualization |
The 5 elements returned by step:
new_observation: the new state after the action, used by the agent for its next decision.
reward: the immediate reward for this step.
terminated: the environment itself declares the episode over.
truncated: the episode was cut off by the time limit.
info: extra debugging information; safe to ignore for now.
"Rewind" the environment twice with different seeds
# Set seed to reproduce initial state in stochastic environment
obs0, info = env.reset(seed=0)
print("initial observation code:", obs0)

obs0, info = env.reset(seed=1)
print("initial observation code:", obs0)
# Note: in MountainCar, observation is just two numbers: car position and velocity
Feed action 2 (push right) into the environment and unpack the resulting 5-tuple:
new_obs: the position is slightly larger than before, so the car really did move a little to the right;
reward: the immediate reward for this step (in MountainCar it is -1 on every step until the episode ends);
terminated: still False, the flag has not been reached;
truncated: also False, the step budget is not yet exhausted.
In short: we manually press "right" for one frame and print the new state to see the change.
print("taking action 2 (right)")
new_obs, reward, terminated, truncated, _ = env.step(2)

print("new observation code:", new_obs)
print("reward:", reward)
print("is game over?:", terminated)
print("is game truncated due to time limit?:", truncated)
# Note: as you can see, the car has moved to the right slightly (around 0.0005)
The "hand-crafted policy" exercise:
You are given a MountainCar. The default code simply accelerates to the right forever, but the slope is too steep: gravity drags the car back to the left, and it never reaches the flag.
Goal: without any RL algorithm, use hard-coded logic (if-else, loops, exploiting momentum, swinging back and forth) to make the car reach the flag on the far right on its own.
The environment is defined as follows
from IPython import display

# Create env manually to set time limit. Please don't change this.
TIME_LIMIT = 250
env = gym.wrappers.TimeLimit(
    gym.make("MountainCar-v0", render_mode="rgb_array"),
    max_episode_steps=TIME_LIMIT + 1,
)
actions = {"left": 0, "stop": 1, "right": 2}
Considering the velocity alone is enough: steer the car to accelerate in its current direction of travel
def policy(obs, t):
    # You can use the observation (a tuple of position and velocity),
    # the current time step, or both.
    position, velocity = obs
    # Pump like a swing: push in the direction the car is already moving,
    # so each pass up the slope gains momentum.
    if velocity > 0:
        return actions["right"]
    else:
        return actions["left"]
Reset and run the game
plt.figure(figsize=(4, 3))
display.clear_output(wait=True)

obs, _ = env.reset()
for t in range(TIME_LIMIT):
    plt.gca().clear()

    action = policy(obs, t)  # Call your policy
    # Pass the action chosen by the policy to the environment
    obs, reward, terminated, truncated, _ = env.step(action)

    # We don't do anything with reward here because MountainCar is a very simple environment,
    # and reward is a constant -1. Therefore, your goal is to end the episode as quickly as possible.

    # Draw game image on display.
    plt.imshow(env.render())
    display.display(plt.gcf())
    display.clear_output(wait=True)

    if terminated or truncated:
        print("Well done!")
        break
else:
    print("Time limit exceeded. Try again.")

display.clear_output(wait=True)
Verify completion
assert obs[0] > 0.47
print("You solved it!")
Task complete.