
Gymnasium Taxi-v3 Environment and the Q-learning Algorithm: Introduction to Reinforcement Learning I

news 2025/10/20 23:00:59

Contents

I. Simulation Environment
II. Algorithm and Implementation
- 1. The Q-learning Algorithm
- 2. Training
- 3. Testing

I. Simulation Environment

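This is a beginner-level reinforcement learning example built on gymnasium. It simulates a passenger hailing a taxi and the taxi carrying the passenger to a destination, and can be viewed as an application of reinforcement learning to path planning.

The simulated environment looks roughly like the figure below.

Fig 1. The simulated environment

This is a 5×5 grid world. In the figure, - and | denote walls or obstacles, while : and blank cells denote passable areas.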

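The four letters R, G, Y, and B mark the locations where the passenger can wait (pickup) and get off (destination). The state encoding also covers combinations in which the pickup location equals the destination, but the environment does not start an episode in such a state. Of the four letters, the passenger's pickup location is shown in blue and the destination in purple/magenta. In the program the letters map to the values R: 0, G: 1, Y: 2, B: 3.

Before pickup, the passenger's position is the waiting location; after pickup, the passenger's position is the taxi's position, which is represented by the value 4. The passenger position pass_index therefore takes the values 0: R, 1: G, 2: Y, 3: B, 4: In taxi. The destination index dest_index uses the same values except that it has no 4: In taxi entry.

The taxi's position is given by the coordinates (row_coordinate, col_coordinate); R in the top-left corner is at (0, 0). A yellow block marks the taxi when it is empty, and a green block marks it when it carries the passenger. The taxi has six actions: 0: Move south (down), 1: Move north (up), 2: Move east (right), 3: Move west (left), 4: Pickup passenger, 5: Drop off passenger.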

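This world has 25 (grid cells) × 5 (passenger positions) × 4 (destinations) = 500 states in total.

The state of the system is written as (row_coordinate, col_coordinate, pass_index, dest_index). In the environment shown in the figure above, the state observed by the agent is (3 4 2 3).

The state is encoded as ((row_coordinate × 5 + col_coordinate) × 5 + pass_index) × 4 + dest_index, which gives a single integer representing the state.

So (3 4 2 3) is encoded as the state code {[(3×5)+4]×5+2}×4+3=391.

The encoding formula shows that increasing the taxi's row coordinate by 1 increases the code by 100 (and decreasing it by 1 decreases the code by 100), while increasing the column coordinate by 1 increases the code by 20.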
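The encoding can be checked with a few lines of plain Python. This is a minimal sketch that re-implements the formula from the text; the function names encode and decode are chosen here for illustration and are standalone helpers, not calls into gymnasium.

```python
def encode(taxi_row, taxi_col, pass_idx, dest_idx):
    """Pack the four state components into one integer in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + pass_idx) * 4 + dest_idx


def decode(state):
    """Invert encode(): peel off the components from least to most significant."""
    state, dest_idx = divmod(state, 4)
    state, pass_idx = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, pass_idx, dest_idx


print(encode(3, 4, 2, 3))                       # 391
print(decode(391))                              # (3, 4, 2, 3)
print(encode(4, 4, 2, 3) - encode(3, 4, 2, 3))  # 100: one row down
print(encode(3, 4, 2, 3) - encode(3, 3, 2, 3))  # 20: one column right
```

Because each component is multiplied by the sizes of all components to its right, every one of the 500 combinations gets a distinct code.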

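The immediate reward in this environment is defined as follows:

-1 per step unless another reward is triggered;

+20 for delivering the passenger to the destination;

-10 for executing "pickup" or "drop-off" illegally, that is, at the wrong location.

So for the current state shown in the environment figure above, the transitions and immediate rewards for each (deterministic) action are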

{0: [(1.0, 491, -1, False)],
 1: [(1.0, 291, -1, False)],
 2: [(1.0, 391, -1, False)],
 3: [(1.0, 371, -1, False)],
 4: [(1.0, 391, -10, False)],
 5: [(1.0, 391, -10, False)]}

Each entry has the format action: [(transition probability, code of next state, reward, done)], where done indicates whether the passenger has been delivered correctly.
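To make the tuple layout concrete, the sketch below unpacks the transition table for state 391 copied from the listing above. The helper describe and the ACTION_NAMES list are illustrative names introduced here, not part of gymnasium.

```python
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Transition table for state 391, copied from the listing above.
transitions = {
    0: [(1.0, 491, -1, False)],
    1: [(1.0, 291, -1, False)],
    2: [(1.0, 391, -1, False)],
    3: [(1.0, 371, -1, False)],
    4: [(1.0, 391, -10, False)],
    5: [(1.0, 391, -10, False)],
}


def describe(table):
    """Flatten {action: [(prob, next_state, reward, done), ...]} into readable rows."""
    rows = []
    for action, outcomes in table.items():
        for prob, next_state, reward, done in outcomes:
            rows.append((ACTION_NAMES[action], prob, next_state, reward, done))
    return rows


for name, prob, next_state, reward, done in describe(transitions):
    print(f"{name:8s} p={prob} next={next_state} reward={reward} done={done}")
```

Moving east leaves the state at 391 because the taxi is already in the rightmost column, and both pickup and drop-off are penalised because the passenger is neither at the taxi's cell nor inside the taxi.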

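II. Algorithm and Implementation

1. The Q-learning Algorithm

We use the widely used Q-learning algorithm. For background on Q-learning, see the book "Fundamentals of Reinforcement Learning" by Rafael Ris-Ala.

Q-learning maintains a Q-table that holds a Q-value for every state-action pair.

All entries of the Q-table are initialised to 0. The Q-values are then updated iteratively with the following learning rule,
$\begin{aligned} Q'(s,a) &= Q(s,a)+\alpha\left[R(s,a) + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]\\ &= (1-\alpha)\, Q(s,a) + \alpha\left[R(s,a) + \gamma \max_{a'} Q(s', a')\right] \end{aligned}$
The Q-values in the Q-table represent the experience gained through reinforcement learning.

Here

$s$ is the current state

$a$ is the action executed in the current state

$s^{'}$ is the next state

$a^{'}$ is an action available in the next state

${Q(s,a)}$ is the existing Q-value

${Q'(s,a)}$ is the updated Q-value

$\alpha$ is the learning rate (update rate)

$R (s, a)$ is the immediate reward obtained after executing action $a$ in state $s$

$\gamma$ is the reward discount factor

$\max_{a'} Q(s', a')$ is the largest Q-value over the actions available in the next state, taken from the current Q-table. The greedy action in the next state is therefore $\arg\max_{a'} Q(s', a')$; the action actually executed during training may still be a random exploratory one.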
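A single update is easy to trace by hand. The sketch below applies the rule repeatedly to one Q-value, assuming alpha = 0.1 and gamma = 0.6 (the values used in the training code later in this article), a reward of -1, and a next state whose Q-values are still all zero. The function name q_update is chosen here for illustration.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.6):
    """One Q-learning step: move Q(s, a) a fraction alpha toward the TD target."""
    td_target = reward + gamma * max_next_q
    return q_sa + alpha * (td_target - q_sa)


q = 0.0
for _ in range(3):
    q = q_update(q, reward=-1, max_next_q=0.0)
    print(round(q, 4))
# prints -0.1, then -0.19, then -0.271
```

Each step closes 10% of the remaining gap between the current estimate and the target of -1, which is why the value approaches the target gradually instead of jumping to it.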

The following is simple Python example code (run in Jupyter).

2. Training

Simulation environment:

import gymnasium as gym

env = gym.make("Taxi-v3", render_mode="ansi")
current_state, info = env.reset()
print(env.render())

print("current_state: {}".format(current_state))
# gym.make returns a wrapped environment; the Taxi-specific helpers
# (decode, encode, locs, P) are reached through env.unwrapped
taxi = env.unwrapped
taxi_row, taxi_col, pass_loc, dest_idx = taxi.decode(current_state)
# passenger's initial location: one of the four letters, shown in blue
# passenger's destination: one of the four letters, shown in purple/magenta
# location indices of the four letters: top-left 0, top-right 1, bottom-left 2, bottom-right 3
print("{} {} {} {}".format(taxi_row, taxi_col, pass_loc, dest_idx))
print("pass_idx: {}, coordinates: {}".format(pass_loc, taxi.locs[pass_loc]))
print("dest_idx: {}, coordinates: {}".format(dest_idx, taxi.locs[dest_idx]))
# transition table of the current state: {action: [(probability, next_state, reward, done)]}
print(taxi.P[current_state])
# 0 = south, 1 = north, 2 = east, 3 = west, 4 = pickup, 5 = dropoff

print("Total number of actions: {}".format(env.action_space.n))
print("Total number of states: {}".format(env.observation_space.n))

taxi.encode(taxi_row, taxi_col, pass_loc, dest_idx)
# ((taxi_row*5 + taxi_col)*5 + pass_loc)*4 + dest_idx

Q-learning training:

# training

import numpy as np
q_table = np.zeros([env.observation_space.n,  env.action_space.n])
# q_table.shape, q_table
alpha = 0.1  # learning rate
gamma = 0.6  # discount factor
epsilon = 0.2  # probability of taking a random (exploratory) action instead of the greedy one

from IPython.display import clear_output
from time import sleep

iteration_array = []
penalty_array = []

for i in range(100000):
    current_state, info = env.reset()
    iterations, penalties, reward = 0, 0, 0
    done = False
    while not done:
        if np.random.uniform(0,1) < epsilon:
            action = env.action_space.sample()
            # exploration
        else:
            action = np.argmax(q_table[current_state])
            # exploitation
    
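        next_state, reward, terminated, truncated, info = env.step(action)
        # Taxi-v3 is wrapped in a 200-step time limit; end the episode on either signal
        done = terminated or truncated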
        current_q_value = q_table[current_state, action]
        next_max_q_value = np.max(q_table[next_state])
        update_q_table = (1-alpha)*(current_q_value) + alpha*(reward+gamma*next_max_q_value)  
        # Q learning, update a Q value in the Q table
        q_table[current_state, action] = update_q_table
        current_state = next_state

        iterations += 1
        if reward == -10:
            penalties += 1

    iteration_array.append(iterations)
    penalty_array.append(penalties)
    
    if i% 100 == 0:
        clear_output(wait=True)
        print('Episode: {}'.format(i))

print("Training completed.")

Let us look at how the number of steps per episode and the number of penalties (-10) for illegal pickups or drop-offs change during training.

import matplotlib.pyplot as plt
import numpy as np

iteration_array = np.array(iteration_array)
x = np.arange(len(iteration_array))
penalty_array = np.array(penalty_array)
plt.plot(x, iteration_array)
plt.title('iteration steps')
plt.show()

plt.plot(x, penalty_array)
plt.title('penalty numbers')
plt.show()

Fig 2. Number of steps needed to complete one episode

Fig 3. Number of penalties (-10) for illegal pickups or drop-offs per episode
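Per-episode curves over 100000 episodes are noisy, so smoothing them with a moving average can make the trend easier to read. The following is a minimal sketch and not part of the original code; moving_average and window are hypothetical names, and the sample numbers are made up for illustration.

```python
def moving_average(values, window):
    """Return the mean of every consecutive `window`-length slice of `values`."""
    if window <= 0:
        raise ValueError("window must be positive")
    values = [float(v) for v in values]
    if len(values) < window:
        return []
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # slide the window: add the newest value, drop the oldest one
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


steps_per_episode = [200, 200, 150, 90, 40, 20, 15, 13]  # made-up sample data
print(moving_average(steps_per_episode, 4))
```

The function accepts any sequence of numbers, so the recorded iteration_array or penalty_array could be passed in directly and the result plotted with plt.plot in the same way as the raw curves.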

We can also inspect the Q-table to get an intuitive feel for it.

# q_table
for i in range(len(q_table)):
    print("{}: {}".format(i, q_table[i]))

Output format: state i: [Q-value(state i, action 0), Q-value(state i, action 1), Q-value(state i, action 2), Q-value(state i, action 3), Q-value(state i, action 4), Q-value(state i, action 5)]

0: [0. 0. 0. 0. 0. 0.]
1: [ -2.41837066  -2.3639511   -2.41837066  -2.3639511   -2.27325184
 -11.3639511 ]
2: [ -1.870144  -1.45024   -1.870144  -1.45024   -0.7504   -10.45024 ]
3: [ -2.3639511   -2.27325184  -2.3639511   -2.27325184  -2.1220864
 -11.27325184]
4: [ -2.4961915   -2.49715753  -2.4961915   -2.4971953  -11.33502848
 -11.38792439]
5: [0. 0. 0. 0. 0. 0.]
6: [ -2.4961915   -2.49734168  -2.4961915   -2.49740291 -11.2916401
 -11.21719859]
7: [ -2.48236806  -2.48581571  -2.48236806  -2.48705227 -11.21642269
 -10.4176698 ]
8: [ -2.27325184  -2.35586584  -2.41127252  -2.35406925 -11.11505892
 -10.91130772]
9: [ -2.47061344  -2.48195367  -2.48692052  -2.47983331 -11.22554927
 -11.33555549]
 ...
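The printed rows can be sanity-checked by hand. State 2 decodes to taxi at (0, 0), passenger waiting at R, destination Y, so the best action should be pickup, and its Q-value should equal the discounted return of the shortest remaining route. The sketch below assumes that route is pickup, four moves south, then drop-off, with gamma = 0.6 as in training; the helper names are illustrative.

```python
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Two rows copied from the printed Q-table above.
q_rows = {
    2: [-1.870144, -1.45024, -1.870144, -1.45024, -0.7504, -10.45024],
    4: [-2.4961915, -2.49715753, -2.4961915, -2.4971953, -11.33502848, -11.38792439],
}


def greedy_action(q_values):
    """Index of the largest Q-value (the first one on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


for state, row in q_rows.items():
    print(state, ACTION_NAMES[greedy_action(row)])

# pickup (-1), four moves south (-1 each), successful drop-off (+20)
print(round(discounted_return([-1, -1, -1, -1, -1, 20], gamma=0.6), 4))  # -0.7504
```

The computed return matches the pickup entry of state 2 in the table, which suggests that this Q-value has converged to the value of the optimal route.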

3. Testing

Now we test using the learned Q-table.

# testing 1
step_array = []
penalty_array = []
for i in range(30):
    current_state, info = env.reset()
    # print(current_state)
    # print(env.render())
    steps = 0
    penalties = 0
    test_done = False

    while not test_done:
        action = np.argmax(q_table[current_state])
        # print("action: {}".format(action))
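        next_state, reward, terminated, truncated, info = env.step(action)
        # stop on success or when the 200-step time limit truncates the episode
        test_done = terminated or truncated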
        current_state = next_state
        # print("steps {}, action {}".format(steps, action))
        steps += 1
    
        if reward == -10:
            penalties += 1

    step_array.append(steps)
    penalty_array.append(penalties)

print(step_array)
print(penalty_array)

Output

[8, 10, 14, 13, 12, 14, 11, 7, 10, 17, 17, 11, 12, 14, 10, 8, 16, 12, 14, 17, 9, 17, 9, 16, 13, 12, 15, 8, 12, 13]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

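The results show that the pickup-and-delivery task is completed in about 12 steps on average (between 7 and 17 steps across the 30 test episodes), with no penalties for illegal pickups or drop-offs.

Let us run one more test and watch how the taxi plans its path.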
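The summary figures for the test run can be computed directly from the printed list. This is a small sketch using only the standard library; the variable name steps is chosen here, and the data is the step list shown in the output above.

```python
from statistics import mean

# Steps per test episode, copied from the output above.
steps = [8, 10, 14, 13, 12, 14, 11, 7, 10, 17, 17, 11, 12, 14, 10,
         8, 16, 12, 14, 17, 9, 17, 9, 16, 13, 12, 15, 8, 12, 13]

print(f"episodes={len(steps)}, mean={mean(steps):.2f}, min={min(steps)}, max={max(steps)}")
# prints: episodes=30, mean=12.37, min=7, max=17
```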

# testing 2 with animation
from IPython.display import clear_output
import time

clear_output(wait=True)
current_state, info = env.reset()
# print(current_state)
print(env.render())
steps = 0
penalties = 0
test_done = False

while not test_done:
    action = np.argmax(q_table[current_state])
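    next_state, reward, terminated, truncated, info = env.step(action)
    # stop on success or when the 200-step time limit truncates the episode
    test_done = terminated or truncated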
    current_state = next_state
    
    steps += 1
    if reward == -10:
        penalties += 1

    clear_output(wait=True)
    print(env.render())
    time.sleep(1)