
RL【1】:Basic Concepts

Series Table of Contents


Table of Contents

  • Series Table of Contents
  • Preface
  • Fundamental concepts in Reinforcement Learning
  • Markov decision process (MDP)
  • Summary


Preface

This series records my study notes for Professor Shiyu Zhao's Bilibili course 【强化学习的数学原理】 (Mathematical Foundations of Reinforcement Learning). For the full course content, see:
Bilibili videos: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning


Fundamental concepts in Reinforcement Learning

  • State: The status of the agent with respect to the environment.
  • State space: the set of all states. $S = \{ s_i \}^N_{i=1}$
  • Action: For each state, an action is one of the possible moves or operations that the agent can take.
  • Action space of a state: the set of all possible actions of a state. $A(s_i) = \{ a_i \}^N_{i=1}$
  • State transition: When taking an action, the agent may move from one state to another. Such a process is called state transition. State transition defines the interaction with the environment.
  • Policy $\pi$: A policy tells the agent what actions to take at a state.
  • Reward: a real number we get after taking an action.
    • A positive reward represents encouragement to take such actions.
    • A negative reward represents punishment for taking such actions.
    • Reward can be interpreted as a human-machine interface, through which we can guide the agent to behave as we expect.
    • The reward depends on the state and action, but not the next state.
  • Trajectory: A trajectory is a state-action-reward chain.
  • Return: The return of a trajectory is the sum of all the rewards collected along that trajectory.
  • Discount rate: The discount rate is a scalar factor $\gamma \in [0,1)$ that determines the present value of future rewards. It specifies how much importance the agent assigns to rewards received in the future compared to immediate rewards.
    • A smaller value of $\gamma$ makes the agent more short-sighted, emphasizing immediate rewards.
    • A value closer to 1 encourages long-term planning by valuing distant rewards nearly as much as immediate ones.
  • Discounted return $G_t$: The discounted return is the cumulative reward the agent aims to maximize, defined as the weighted sum of future rewards starting from time step $t$ (see the numerical sketch after this list).
    • $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
    • It captures both immediate and future rewards while incorporating the discount rate $\gamma$, thereby balancing short-term and long-term gains.
  • Episode: When interacting with the environment following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).
    • An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks.
    • Some tasks may have no terminal states, meaning the interaction with the environment will never end. Such tasks are called continuing tasks.
    • In fact, we can treat episodic and continuing tasks in a unified mathematical way by converting episodic tasks to continuing tasks.
      • Option 1: Treat the target state as a special absorbing state. Once the agent reaches an absorbing state, it will never leave, and all subsequent rewards are $r = 0$.
      • Option 2: Treat the target state as a normal state with a policy. The agent can still leave the target state and gains $r = +1$ each time it enters the target state.
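
The following is a minimal numerical sketch of the return and discounted return defined above. The trajectory, its per-step rewards, and the value of $\gamma$ are illustrative assumptions, not values taken from the course.

```python
# Minimal sketch: computing the discounted return G_t for a finite trajectory.
# The reward sequence and gamma below are made-up illustrative values.

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Hypothetical episode: a normal move (0), a forbidden-area penalty (-1),
# another normal move (0), then reaching the target (+1).
rewards = [0, -1, 0, 1]
gamma = 0.9

print(discounted_return(rewards, gamma))     # 0 + 0.9*(-1) + 0.81*0 + 0.729*1 = -0.171

# Option 2 above: if the agent kept receiving r = +1 at the target forever,
# the discounted return would still converge (to 1 / (1 - gamma) = 10 here),
# which is why gamma < 1 lets episodic and continuing tasks be treated uniformly.
print(sum(gamma ** k for k in range(1000)))  # ~ 10.0
```

Rerunning the first computation with a smaller $\gamma$ (e.g. 0.5) gives $-0.375$: the distant $+1$ now counts for less, which is the short-sightedness mentioned above.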

Markov decision process (MDP)

  • Sets:
    • State: the set of states $\mathcal{S}$
    • Action: the set of actions $\mathcal{A}(s)$ associated with state $s \in \mathcal{S}$
    • Reward: the set of rewards $\mathcal{R}(s, a)$
  • Probability distribution:
    • State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s' | s, a)$
    • Reward probability: at state $s$, taking action $a$, the probability of receiving reward $r$ is $p(r | s, a)$
  • Policy: at state $s$, the probability of choosing action $a$ is $\pi(a | s)$
  • Markov property (memoryless property): the next state and reward depend only on the current state and action, not on the earlier history (see the sampling sketch after this list).
    • $p(s_{t+1} | a_t, s_t, \ldots, a_0, s_0) = p(s_{t+1} | a_t, s_t)$
    • $p(r_{t+1} | a_t, s_t, \ldots, a_0, s_0) = p(r_{t+1} | a_t, s_t)$
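
The sets and probability distributions above fully specify an MDP, so a trajectory can be sampled step by step. Below is a minimal sketch assuming a hypothetical two-state MDP (its states, probabilities, and rewards are made up for illustration, not taken from the course): $p(s'|s,a)$ and $p(r|s,a)$ are stored as dictionaries, the policy $\pi(a|s)$ is stochastic, and each sampling step consults only the current state-action pair, which is exactly the Markov property.

```python
import random

# Minimal sketch of an MDP as data: p(s'|s,a), p(r|s,a), and a stochastic
# policy pi(a|s). The two-state MDP below is a made-up example.

# State transition probabilities p(s'|s,a): {(s, a): {s': prob}}
p_next = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s2": 0.9, "s1": 0.1},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 0.9, "s2": 0.1},
}

# Reward probabilities p(r|s,a): {(s, a): {r: prob}}
p_reward = {
    ("s1", "stay"): {0.0: 1.0},
    ("s1", "move"): {1.0: 1.0},
    ("s2", "stay"): {0.0: 1.0},
    ("s2", "move"): {-1.0: 1.0},
}

# Stochastic policy pi(a|s): {s: {a: prob}}
policy = {
    "s1": {"stay": 0.2, "move": 0.8},
    "s2": {"stay": 0.5, "move": 0.5},
}

def sample(dist):
    """Draw one outcome from a {value: probability} dictionary."""
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs, k=1)[0]

def rollout(s, steps, gamma=0.9):
    """Sample a trajectory starting from state s and return its discounted return G_0."""
    g, discount = 0.0, 1.0
    for _ in range(steps):
        a = sample(policy[s])         # a  ~ pi(.|s)
        r = sample(p_reward[(s, a)])  # r  ~ p(.|s, a), depends only on (s, a)
        s = sample(p_next[(s, a)])    # s' ~ p(.|s, a), the Markov property
        g += discount * r
        discount *= gamma
    return g

print(rollout("s1", steps=20))
```

Averaging many such rollouts approximates the expected discounted return from s1 under this policy, which is the kind of quantity the later lectures build on.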

Summary

This first lecture gives a systematic introduction to the common concepts in RL, with intuitive explanations. It then restates these concepts in the mathematical language of the Markov decision process, laying the groundwork for the lectures that follow.
