TensorFlow Deep Learning in Practice (33): Deep Deterministic Policy Gradient

news 2025/8/22 6:26:24

- 0. Preface
- 1. Deep Deterministic Policy Gradient
- 2. Problem and Model Analysis
- 3. Building the DDPG Model
- Summary
- Series Links

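0. Preface

Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy algorithm for learning continuous actions. It combines ideas from Deterministic Policy Gradient (DPG) and Deep Q-Network (DQN): it borrows experience replay and slowly updated target networks from DQN, and it builds on DPG, which is what allows it to operate over continuous action spaces.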

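1. Deep Deterministic Policy Gradient

Deep Q-Networks (DQN) and their variants have been very successful on problems where the state space is continuous and the action space is discrete. In Atari games, for example, the input space consists of raw pixels, while the actions are discrete: [up, down, left, right, no-op]. Now consider problems with a continuous action space. For example, a reinforcement learning (RL) agent driving a car has to turn the steering wheel, and that action lives in a continuous action space.
One way to handle this is to discretize the action space and keep using DQN or one of its variants. A better solution, however, is to use a policy gradient algorithm. In policy gradient methods, a neural network approximates the policy $\pi(A|s)$ directly. In the simplest form, the network learns to select reward-maximizing actions by adjusting its weights with gradient ascent, hence the name "policy gradient".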
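To make the cost of discretization concrete, here is a minimal sketch in plain Python. The steering range of [-1, 1], the bin count, and the helper names `discretize` and `nearest` are illustrative assumptions, not part of the original article.

```python
def discretize(low, high, n_bins):
    """Return n_bins evenly spaced actions covering [low, high]."""
    step = (high - low) / (n_bins - 1)
    return [low + i * step for i in range(n_bins)]


def nearest(actions, target):
    """Pick the discrete action closest to the desired continuous one."""
    return min(actions, key=lambda a: abs(a - target))


# Hypothetical steering range, split into 5 discrete actions.
coarse = discretize(-1.0, 1.0, 5)

# Hypothetical action that a continuous controller would have preferred.
desired = 0.3
chosen = nearest(coarse, desired)

print(coarse, chosen, abs(chosen - desired))
```

A finer grid reduces the rounding error, but a grid with n bins per dimension has n^d entries for d action dimensions, which is one reason a policy that outputs the continuous action directly is attractive.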
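In this section we focus on the Deep Deterministic Policy Gradient (DDPG) algorithm, a reinforcement learning algorithm proposed by Google DeepMind. DDPG is implemented with two networks: an actor network and a critic network.
The actor network approximates the optimal policy deterministically, that is, it outputs the preferred action for any given input state. The critic, in turn, evaluates the optimal action-value function using the action chosen by the actor. The general DDPG architecture is shown in the figure below: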

Figure: DDPG architecture

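On the left is the critic network, which takes the state vector $S$ and the action taken $A$ as input and outputs the Q-value for that state and action. On the right is the actor network, which takes the state vector $S$ as input and predicts the best action $A$ to take.
The actor outputs the preferred action; the critic receives both the input state and the action taken and evaluates the Q-value. The critic network is trained by minimizing the difference between the estimated Q-value and the target Q-value. The gradient of the Q-value with respect to the action is then backpropagated to train the actor network. As a result, if the critic is good enough, it pushes the actor toward actions with the best value function.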
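The two training signals described above can be written compactly. The notation is an assumption made for illustration: $\mu_\theta$ denotes the actor with parameters $\theta$, $y_i$ is the target Q-value for the $i$-th sample, and $N$ is the batch size.

$$
L_{\text{critic}} = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - Q(S_i, A_i) \big)^2
$$

$$
\nabla_{\theta} J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{A} Q(S_i, A) \Big|_{A = \mu_\theta(S_i)} \, \nabla_{\theta}\, \mu_\theta(S_i)
$$

The first expression is the mean squared error minimized by the critic. The second is the chain rule applied through the critic: the gradient of the Q-value with respect to the action is multiplied by the gradient of the action with respect to the actor's parameters.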

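2. Problem and Model Analysis

In this section we use Deep Deterministic Policy Gradient (DDPG) to solve the classic Inverted Pendulum control problem. In this problem the agent can only apply a torque that swings the pendulum to the left or to the right.
What makes this problem hard for Q-learning is that the actions are continuous rather than discrete. Instead of choosing between two discrete actions such as -1 and +1, we must choose from the infinitely many actions between -2 and +2.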
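DDPG uses two networks:

Actor (policy network): outputs an action given the current state
Critic (value network): predicts whether an action is good or bad given the current state and the action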

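In addition, DDPG uses two target networks, which make training more stable. Put simply, we learn from estimated targets, and because the target networks are updated slowly, those targets stay stable. It is as if the agent had an idea of how to do better and tried it out for a while until it found something better, instead of relearning how to solve the problem after every single action. DDPG also uses experience replay: tuples of (state, action, reward, next state) are stored, and learning samples from all accumulated experience rather than only from the most recent experience.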
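The slow target update can be illustrated with a small sketch in plain Python. The weight values and the function name `soft_update` are hypothetical; real networks hold arrays of weights, but the arithmetic per weight is the same.

```python
def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau toward the online weight."""
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_weights, online_weights)]


# Hypothetical weights: the target starts at zero, the online network is fixed.
target = [0.0, 0.0]
online = [1.0, -2.0]
tau = 0.005

for _ in range(1000):
    target = soft_update(target, online, tau)

# Against a fixed online network, the gap shrinks by (1 - tau) per update,
# so after k updates a fraction (1 - tau) ** k of the initial gap remains.
remaining_gap = (1.0 - tau) ** 1000
print(target, remaining_gap)
```

With a small `tau` the target network trails the online network smoothly, which is what keeps the learning targets from shifting abruptly.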

3. Building the DDPG Model

(1) First, import the required libraries:

import os
# The code below uses keras.ops, which requires Keras 3 (the default in TensorFlow 2.16+).
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

(2) Instantiate the inverted pendulum environment and inspect its properties:

env = gym.make("Pendulum-v1", render_mode="human")

num_states = env.observation_space.shape[0]
print("Size of State Space ->  {}".format(num_states))
num_actions = env.action_space.shape[0]
print("Size of Action Space ->  {}".format(num_actions))

upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]

print("Max Value of Action ->  {}".format(upper_bound))
print("Min Value of Action ->  {}".format(lower_bound))

(3) To help the Actor network explore better, perturb its actions with noise, specifically noise generated by an Ornstein-Uhlenbeck process:

class OUActionNoise:
    def __init__(self, mean, std_deviation, theta=0.15, dt=1e-2, x_initial=None):
        self.theta = theta
        self.mean = mean
        self.std_dev = std_deviation
        self.dt = dt
        self.x_initial = x_initial
        self.reset()

    def __call__(self):
        x = (
            self.x_prev
            + self.theta * (self.mean - self.x_prev) * self.dt
            + self.std_dev * np.sqrt(self.dt) * np.random.normal(size=self.mean.shape)
        )
        # Store x into x_prev
        # Makes next noise dependent on current one
        self.x_prev = x
        return x

    def reset(self):
        if self.x_initial is not None:
            self.x_prev = self.x_initial
        else:
            self.x_prev = np.zeros_like(self.mean)
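To see what this update rule produces, the following standalone sketch simulates the same recurrence for a scalar in plain Python. The function name `ou_path` and the `seed` parameter are assumptions made for this illustration; the default `theta`, `dt`, and standard deviation mirror the values used in this article.

```python
import math
import random


def ou_path(n_steps, mean=0.0, std_dev=0.2, theta=0.15, dt=1e-2, x0=0.0, seed=0):
    """Simulate a scalar Ornstein-Uhlenbeck path with the update used above."""
    rng = random.Random(seed)
    x = x0
    path = []
    for _ in range(n_steps):
        # Pull toward the mean, plus a Gaussian kick scaled by sqrt(dt).
        x = x + theta * (mean - x) * dt + std_dev * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


# Noisy path: each sample starts from the previous one, so it drifts smoothly.
noisy = ou_path(200)

# With the random term switched off, only the pull toward the mean remains.
decay = ou_path(3, std_dev=0.0, x0=1.0)

print(noisy[:3])
print(decay)
```

Because each sample depends on the previous one, the noise is correlated over time, and consecutive actions are pushed in a similar direction instead of jittering independently.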

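(4) Define the experience replay buffer and the parameter update strategy.
The Critic loss is the mean squared error of $y - Q(S, A)$, where $y$ is the expected return as seen by the target networks and $Q(S, A)$ is the action value predicted by the Critic network. $y$ is a moving target that the Critic model tries to approach; we keep this target stable by updating the target models slowly.
The Actor loss is computed from the mean of the values that the Critic network assigns to the actions taken by the Actor network. We want to maximize this quantity, so the code minimizes its negative.
The Actor network is therefore updated so that, for a given state, it produces the action with the highest value predicted by the Critic.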
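The two losses can be worked through by hand on a toy batch. All numbers below are made up for illustration; in the real model they would be outputs of the target Critic, the Critic, and the Critic evaluated on the Actor's actions.

```python
gamma = 0.99

# Hypothetical batch of three transitions.
rewards = [-1.0, -0.5, 0.0]
target_next_q = [-10.0, -8.0, -6.0]  # target critic on (next state, target actor action)
critic_q = [-11.0, -8.0, -6.5]       # critic on the stored (state, action) pairs
actor_q = [-9.0, -7.0, -6.0]         # critic on (state, current actor action)

# Target: y = r + gamma * Q_target(S', A')
y = [r + gamma * q for r, q in zip(rewards, target_next_q)]

# Critic loss: mean squared error between y and Q(S, A)
critic_loss = sum((yi - qi) ** 2 for yi, qi in zip(y, critic_q)) / len(y)

# Actor loss: negative mean Q-value, so minimizing it maximizes the value
actor_loss = -sum(actor_q) / len(actor_q)

print(y, critic_loss, actor_loss)
```

Minimizing `actor_loss` with gradient descent is the same as maximizing the mean Q-value, which is why the negative sign appears in the update code.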

class Buffer:
    def __init__(self, buffer_capacity=100000, batch_size=64):
        # Number of "experiences" to store at max
        self.buffer_capacity = buffer_capacity
        # Num of tuples to train on.
        self.batch_size = batch_size

        # It tells us the number of times record() was called.
        self.buffer_counter = 0

        # Instead of a list of tuples, as the experience replay concept suggests,
        # we use a separate np.array for each tuple element
        self.state_buffer = np.zeros((self.buffer_capacity, num_states))
        self.action_buffer = np.zeros((self.buffer_capacity, num_actions))
        self.reward_buffer = np.zeros((self.buffer_capacity, 1))
        self.next_state_buffer = np.zeros((self.buffer_capacity, num_states))

    # Takes (s,a,r,s') observation tuple as input
    def record(self, obs_tuple):
        # Set index to zero if buffer_capacity is exceeded,
        # replacing old records
        index = self.buffer_counter % self.buffer_capacity

        self.state_buffer[index] = obs_tuple[0]
        self.action_buffer[index] = obs_tuple[1]
        self.reward_buffer[index] = obs_tuple[2]
        self.next_state_buffer[index] = obs_tuple[3]

        self.buffer_counter += 1

    # Eager execution is turned on by default in TensorFlow 2. Decorating with
    # tf.function allows TensorFlow to build a static graph out of the logic and
    # computations in our function. This provides a large speed up for blocks of
    # code that contain many small TensorFlow operations such as this one.
    @tf.function
    def update(
        self,
        state_batch,
        action_batch,
        reward_batch,
        next_state_batch,
    ):
        # Training and updating Actor & Critic networks.
        with tf.GradientTape() as tape:
            target_actions = target_actor(next_state_batch, training=True)
            y = reward_batch + gamma * target_critic(
                [next_state_batch, target_actions], training=True
            )
            critic_value = critic_model([state_batch, action_batch], training=True)
            critic_loss = keras.ops.mean(keras.ops.square(y - critic_value))

        critic_grad = tape.gradient(critic_loss, critic_model.trainable_variables)
        critic_optimizer.apply_gradients(
            zip(critic_grad, critic_model.trainable_variables)
        )

        with tf.GradientTape() as tape:
            actions = actor_model(state_batch, training=True)
            critic_value = critic_model([state_batch, actions], training=True)
            # Used `-value` as we want to maximize the value given
            # by the critic for our actions
            actor_loss = -keras.ops.mean(critic_value)

        actor_grad = tape.gradient(actor_loss, actor_model.trainable_variables)
        actor_optimizer.apply_gradients(
            zip(actor_grad, actor_model.trainable_variables)
        )

    # We compute the loss and update parameters
    def learn(self):
        # Get sampling range
        record_range = min(self.buffer_counter, self.buffer_capacity)
        # Randomly sample indices
        batch_indices = np.random.choice(record_range, self.batch_size)

        # Convert to tensors
        state_batch = keras.ops.convert_to_tensor(self.state_buffer[batch_indices])
        action_batch = keras.ops.convert_to_tensor(self.action_buffer[batch_indices])
        reward_batch = keras.ops.convert_to_tensor(self.reward_buffer[batch_indices])
        reward_batch = keras.ops.cast(reward_batch, dtype="float32")
        next_state_batch = keras.ops.convert_to_tensor(
            self.next_state_buffer[batch_indices]
        )

        self.update(state_batch, action_batch, reward_batch, next_state_batch)


# This updates target parameters slowly,
# based on rate `tau`, which is much less than one.
def update_target(target, original, tau):
    target_weights = target.get_weights()
    original_weights = original.get_weights()

    for i in range(len(target_weights)):
        target_weights[i] = original_weights[i] * tau + target_weights[i] * (1 - tau)

    target.set_weights(target_weights)
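The modulo indexing in `record()` is what turns the fixed-size arrays into a ring buffer. A stripped-down sketch in plain Python shows the overwrite behavior; the class name `RingBuffer` and the string items are stand-ins chosen for this illustration.

```python
class RingBuffer:
    """Fixed-capacity store that overwrites the oldest entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = [None] * capacity
        self.counter = 0  # total number of record() calls so far

    def record(self, item):
        # Same indexing as Buffer.record(): wrap around once capacity is reached.
        self.items[self.counter % self.capacity] = item
        self.counter += 1

    def valid_size(self):
        # Same bound as Buffer.learn(): only sample from slots that were written.
        return min(self.counter, self.capacity)


ring = RingBuffer(capacity=3)
for t in range(5):
    ring.record("transition-{}".format(t))

print(ring.items, ring.valid_size())
```

After five writes into three slots, the two oldest transitions have been replaced, and sampling is restricted to the slots that actually hold data.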

(5) Define the Actor and Critic networks as basic fully connected networks:

def get_actor():
    # Initialize weights between -3e-3 and 3e-3
    last_init = keras.initializers.RandomUniform(minval=-0.003, maxval=0.003)

    inputs = layers.Input(shape=(num_states,))
    out = layers.Dense(256, activation="relu")(inputs)
    out = layers.Dense(256, activation="relu")(out)
    outputs = layers.Dense(1, activation="tanh", kernel_initializer=last_init)(out)

    # Our upper bound is 2.0 for Pendulum.
    outputs = outputs * upper_bound
    model = keras.Model(inputs, outputs)
    return model


def get_critic():
    # State as input
    state_input = layers.Input(shape=(num_states,))
    state_out = layers.Dense(16, activation="relu")(state_input)
    state_out = layers.Dense(32, activation="relu")(state_out)

    # Action as input
    action_input = layers.Input(shape=(num_actions,))
    action_out = layers.Dense(32, activation="relu")(action_input)

    # Both are passed through separate layers before concatenating
    concat = layers.Concatenate()([state_out, action_out])

    out = layers.Dense(256, activation="relu")(concat)
    out = layers.Dense(256, activation="relu")(out)
    outputs = layers.Dense(1)(out)

    # Outputs a single value for the given state-action pair
    model = keras.Model([state_input, action_input], outputs)

    return model

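Note that the final layer of the Actor is initialized with values between -0.003 and 0.003. This keeps the initial outputs away from 1 and -1, where the tanh activation saturates and its gradient shrinks to nearly zero.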
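The saturation effect is easy to check numerically. The sketch below uses only the standard library; the hidden activation values are hypothetical, and the bound ignores the bias term.

```python
import math


def tanh_grad(z):
    """Derivative of tanh at pre-activation z: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


# Gradient in the linear region versus deep in the saturated region.
grad_center = tanh_grad(0.0)
grad_saturated = tanh_grad(5.0)

# Hypothetical worst case for the small initializer: 256 hidden units that all
# output 1.0, each multiplied by the largest allowed weight of 0.003.
max_pre_activation = 256 * 1.0 * 0.003
grad_small_init = tanh_grad(max_pre_activation)

print(grad_center, grad_saturated, max_pre_activation, grad_small_init)
```

Under these assumptions, the small initializer keeps the pre-activation well inside the region where tanh still passes a useful gradient back to the earlier layers.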

(6) The policy() function returns the action produced by the Actor network, with some noise added for exploration:

def policy(state, noise_object):
    sampled_actions = keras.ops.squeeze(actor_model(state))
    noise = noise_object()
    # Adding noise to action
    sampled_actions = sampled_actions.numpy() + noise

    # We make sure action is within bounds
    legal_action = np.clip(sampled_actions, lower_bound, upper_bound)

    return [np.squeeze(legal_action)]
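The clipping step matters because added noise can push an action outside the range the environment accepts. A minimal sketch in plain Python, with a hypothetical actor output and hypothetical noise samples:

```python
def clip(value, low, high):
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


lower_bound, upper_bound = -2.0, 2.0

# Hypothetical deterministic actor output near the upper bound.
actor_output = 1.9
# Hypothetical exploration noise samples.
noise_samples = [0.05, 0.3, -0.1]

actions = [clip(actor_output + n, lower_bound, upper_bound) for n in noise_samples]
print(actions)
```

The second sample would have produced 2.2, which is outside the action range, so it is clamped to the upper bound.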

(7) Define the hyperparameters:

std_dev = 0.2
ou_noise = OUActionNoise(mean=np.zeros(1), std_deviation=float(std_dev) * np.ones(1))

actor_model = get_actor()
critic_model = get_critic()

target_actor = get_actor()
target_critic = get_critic()

# Making the weights equal initially
target_actor.set_weights(actor_model.get_weights())
target_critic.set_weights(critic_model.get_weights())

# Learning rate for actor-critic models
critic_lr = 0.002
actor_lr = 0.001

critic_optimizer = keras.optimizers.Adam(critic_lr)
actor_optimizer = keras.optimizers.Adam(actor_lr)

total_episodes = 100
# Discount factor for future rewards
gamma = 0.99
# Used to update target networks
tau = 0.005

buffer = Buffer(50000, 64)

(8) Implement the training loop and iterate over episodes. At each time step, sample an action with policy(), train with learn(), and update the target networks at rate tau:

# To store reward history of each episode
ep_reward_list = []
# To store average reward history of last few episodes
avg_reward_list = []

# Takes about 4 min to train
for ep in range(total_episodes):
    prev_state, _ = env.reset()
    episodic_reward = 0

    while True:
        tf_prev_state = keras.ops.expand_dims(
            keras.ops.convert_to_tensor(prev_state), 0
        )

        action = policy(tf_prev_state, ou_noise)
        # Receive state and reward from environment.
        state, reward, done, truncated, _ = env.step(action)

        buffer.record((prev_state, action, reward, state))
        episodic_reward += reward

        buffer.learn()

        update_target(target_actor, actor_model, tau)
        update_target(target_critic, critic_model, tau)

        # End this episode when `done` or `truncated` is True
        if done or truncated:
            break

        prev_state = state

    ep_reward_list.append(episodic_reward)

    # Mean of last 40 episodes
    avg_reward = np.mean(ep_reward_list[-40:])
    print("Episode * {} * Avg Reward is ==> {}".format(ep, avg_reward))
    avg_reward_list.append(avg_reward)

# Plotting graph
plt.plot(avg_reward_list)
plt.xlabel("Episode")
plt.ylabel("Avg. Episodic Reward")
plt.show()

# Save the weights
actor_model.save_weights("pendulum_actor.weights.h5")
critic_model.save_weights("pendulum_critic.weights.h5")

target_actor.save_weights("pendulum_target_actor.weights.h5")
target_critic.save_weights("pendulum_target_critic.weights.h5")
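The plotted curve is a trailing average over the most recent episodes, which smooths out episode-to-episode variation. The same computation can be sketched in plain Python; the reward values and the function name `trailing_mean` are hypothetical, and a window of 2 is used only to keep the example short.

```python
def trailing_mean(values, window=40):
    """Mean of the last `window` values at each position (fewer at the start)."""
    means = []
    for i in range(len(values)):
        recent = values[max(0, i + 1 - window): i + 1]
        means.append(sum(recent) / len(recent))
    return means


# Hypothetical episodic rewards that improve over time.
rewards = [-1200.0, -900.0, -600.0, -300.0]
smoothed = trailing_mean(rewards, window=2)
print(smoothed)
```

Early entries average over fewer episodes than the window size, so the first points of the curve are noisier than the later ones.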

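You can experiment with different learning rates, tau values, and architectures for the Actor and Critic networks. The inverted pendulum problem has low complexity, but DDPG also performs well on many other problems.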

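Summary

Deep Deterministic Policy Gradient (DDPG) is a policy-gradient-based reinforcement learning algorithm for problems with continuous action spaces. It extends the Actor-Critic approach and combines it with techniques from deep reinforcement learning such as target networks and experience replay, which makes it suitable for high-dimensional tasks with continuous action spaces.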

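Series Links

TensorFlow Deep Learning in Practice (1): Neural Networks and the Model Training Process Explained
TensorFlow Deep Learning in Practice (2): Building Neural Networks with TensorFlow
TensorFlow Deep Learning in Practice (3): Common Activation Functions in Deep Learning Explained
TensorFlow Deep Learning in Practice (4): Regularization Techniques Explained
TensorFlow Deep Learning in Practice (5): Neural Network Performance Optimization Techniques Explained
TensorFlow Deep Learning in Practice (6): Regression Analysis Explained
TensorFlow Deep Learning in Practice (7): Classification Tasks Explained
TensorFlow Deep Learning in Practice (8): Convolutional Neural Networks
TensorFlow Deep Learning in Practice (9): Building a VGG Model for Image Classification
TensorFlow Deep Learning in Practice (10): Transfer Learning Explained
TensorFlow Deep Learning in Practice (11): Style Transfer Explained
TensorFlow Deep Learning in Practice (12): Word Embedding Techniques Explained
TensorFlow Deep Learning in Practice (13): Neural Embeddings Explained
TensorFlow Deep Learning in Practice (14): Recurrent Neural Networks Explained
TensorFlow Deep Learning in Practice (15): Encoder-Decoder Architecture
TensorFlow Deep Learning in Practice (16): Attention Mechanisms Explained
TensorFlow Deep Learning in Practice (17): Principal Component Analysis Explained
TensorFlow Deep Learning in Practice (18): K-means Clustering Explained
TensorFlow Deep Learning in Practice (19): Restricted Boltzmann Machines
TensorFlow Deep Learning in Practice (20): Self-Organizing Maps Explained
TensorFlow Deep Learning in Practice (21): Transformer Architecture Explained and Implemented
TensorFlow Deep Learning in Practice (22): Implementing Transformer Machine Translation from Scratch
TensorFlow Deep Learning in Practice (23): Autoencoders Explained and Implemented
TensorFlow Deep Learning in Practice (24): Convolutional Autoencoders Explained and Implemented
TensorFlow Deep Learning in Practice (25): Variational Autoencoders Explained and Implemented
TensorFlow Deep Learning in Practice (26): Generative Adversarial Networks Explained and Implemented
TensorFlow Deep Learning in Practice (27): CycleGAN Explained and Implemented
TensorFlow Deep Learning in Practice (28): Diffusion Models
TensorFlow Deep Learning in Practice (29): Self-Supervised Learning
TensorFlow Deep Learning in Practice (30): Reinforcement Learning (RL)
TensorFlow Deep Learning in Practice (31): The Gymnasium Reinforcement Learning Simulation Library
TensorFlow Deep Learning in Practice (32): Deep Q-Network (DQN)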