
RL【6】:Stochastic Approximation and Stochastic Gradient Descent

Series Contents

Fundamental Tools

RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation

Algorithm

RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent

Method


Contents

  • Series Contents
    • Fundamental Tools
    • Algorithm
    • Method
  • Preface
  • Robbins-Monro Algorithm
  • Stochastic Gradient Descent
  • BGD, MBGD, and SGD
  • Summary
  • Conclusion


Preface

This series records my study notes for Professor Shiyu Zhao's (赵世钰) course "Mathematical Foundations of Reinforcement Learning" on Bilibili (B站). For the original course material, please refer to:
Bilibili videos: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning

Tips: this post mainly records some extended material that involves fairly involved mathematical derivations, so for now it is only a brief record and will be refined further later.


Robbins-Monro Algorithm

Problem statement

Suppose we would like to find the root of the equation $g(w) = 0$, where $w \in \mathbb{R}$ is the variable to be solved and $g : \mathbb{R} \to \mathbb{R}$ is a function.

  • Many problems can be eventually converted to this root-finding problem. For example, suppose $J(w)$ is an objective function to be minimized. Then, the optimization problem can be converted to

    $$g(w) = \nabla_w J(w) = 0.$$

  • Note that an equation like $g(w) = c$ with $c$ a constant can also be converted to the above form by treating $g(w) - c$ as a new function.

The Robbins–Monro (RM) algorithm can solve this problem:

$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k), \quad k = 1, 2, 3, \ldots$$

  • where
    • $w_k$ is the $k$-th estimate of the root.
    • $\tilde{g}(w_k, \eta_k) = g(w_k) + \eta_k$ is the $k$-th noisy observation.
    • $a_k$ is a positive coefficient.
  • The function $g(w)$ is a black box!
    • Input sequence: $\{ w_k \}$
    • Noisy output sequence: $\{ \tilde{g}(w_k, \eta_k) \}$
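As an illustration (not part of the original lecture notes), the following is a minimal Python sketch of the RM iteration applied to a hypothetical black-box function $g(w) = 2(w - 3)$ with root $w^* = 3$, observed with additive zero-mean noise; the names `g_tilde` and `robbins_monro` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_tilde(w):
    """Noisy observation of the black-box g(w) = 2 * (w - 3); the root is w* = 3."""
    return 2.0 * (w - 3.0) + rng.normal(scale=1.0)

def robbins_monro(w1=10.0, num_iters=5000):
    """RM iteration: w_{k+1} = w_k - a_k * g~(w_k, eta_k) with a_k = 1/k."""
    w = w1
    for k in range(1, num_iters + 1):
        a_k = 1.0 / k            # satisfies sum a_k = inf and sum a_k^2 < inf
        w = w - a_k * g_tilde(w)
    return w

print(robbins_monro())  # expected to be close to the root w* = 3
```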

Robbins-Monro Theorem

In the Robbins–Monro algorithm, if

  1. $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$;
  2. $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
  3. $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$;

where $\mathcal{H}_k = \{ w_k, w_{k-1}, \ldots \}$, then $w_k$ converges with probability 1 (w.p.1) to the root $w^*$ satisfying $g(w^*) = 0$.

Convergence properties

  1. $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$
    • This condition indicates that $g$ is monotonically increasing, which ensures that the root of $g(w) = 0$ exists and is unique.
    • It also requires the gradient to be bounded from above.
  2. $\sum_{k=1}^\infty a_k = \infty$ and $\sum_{k=1}^\infty a_k^2 < \infty$
    • The condition $\sum_{k=1}^\infty a_k^2 < \infty$ ensures that $a_k$ converges to zero as $k \to \infty$.
    • The condition $\sum_{k=1}^\infty a_k = \infty$ ensures that $a_k$ does not converge to zero too fast.
    • Why is this condition important?
      • Summing up $w_2 = w_1 - a_1 \tilde{g}(w_1, \eta_1)$, $w_3 = w_2 - a_2 \tilde{g}(w_2, \eta_2)$, $\ldots$, $w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k)$ leads to

        $$w_1 - w_\infty = \sum_{k=1}^\infty a_k \tilde{g}(w_k, \eta_k).$$

      • Suppose $w_\infty = w^*$. If $\sum_{k=1}^\infty a_k < \infty$, then $\sum_{k=1}^\infty a_k \tilde{g}(w_k, \eta_k)$ may be bounded. In that case, if the initial guess $w_1$ is chosen arbitrarily far away from $w^*$, the above equality cannot hold.

  3. $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$
    • A special yet common case is that $\{\eta_k\}$ is an i.i.d. stochastic sequence satisfying $\mathbb{E}[\eta_k] = 0$ and $\mathbb{E}[\eta_k^2] < \infty$.
    • The observation error $\eta_k$ is not required to be Gaussian.
  4. What $\{a_k\}$ satisfies the two conditions?
    • One typical sequence is $a_k = \frac{1}{k}$.
    • It holds that $\lim_{n \to \infty} \left( \sum_{k=1}^n \frac{1}{k} - \ln n \right) = \kappa$,
      • where $\kappa \approx 0.577$ is called the Euler–Mascheroni constant (also called Euler's constant).
    • It is also notable that $\sum_{k=1}^\infty \frac{1}{k^2} = \frac{\pi^2}{6} < \infty$.
      • This limit also has a specific name in number theory: it is $\zeta(2)$, the value of the Riemann zeta function at 2, known from the Basel problem.
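To make the role of condition 2 concrete, here is a small experiment of my own (not from the lecture): the same noisy root-finding problem is run with $a_k = 0.5/k$ (for which $\sum_k a_k = \infty$) and with $a_k = 0.5/k^2$ (for which $\sum_k a_k < \infty$). Starting far from the root, the second choice stalls because the total possible movement is bounded.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_rm(step_fn, w1=50.0, num_iters=20000):
    """RM iteration on g(w) = w - 3 observed with zero-mean noise; the root is w* = 3."""
    w = w1
    for k in range(1, num_iters + 1):
        g_obs = (w - 3.0) + rng.normal(scale=0.5)   # noisy observation g~(w_k, eta_k)
        w = w - step_fn(k) * g_obs
    return w

# sum a_k = inf: the iterate can travel all the way from w1 = 50 toward w* = 3
print("a_k = 0.5/k  :", run_rm(lambda k: 0.5 / k))
# sum a_k < inf: the total movement is bounded, so the iterate stalls far from w*
print("a_k = 0.5/k^2:", run_rm(lambda k: 0.5 / k ** 2))
```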

Stochastic Gradient Descent

Algorithm description

  • Suppose we aim to solve the following optimization problem:

    $$\min_{w} \; J(w) = \mathbb{E}[f(w, X)]$$

    • $w$ is the parameter to be optimized.
    • $X$ is a random variable. The expectation is with respect to $X$.
    • $w$ and $X$ can be either scalars or vectors. The function $f(\cdot)$ is scalar-valued.
  • Three Methods

    • Method 1: gradient descent (GD)

      $$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]$$

      • Drawback: the expected value is difficult to obtain.
    • Method 2: batch gradient descent (BGD)

      $$\mathbb{E}[\nabla_w f(w_k, X)] \approx \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i),$$

      $$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i).$$

      • Drawback: it requires many samples in each iteration for each $w_k$.
    • Method 3: stochastic gradient descent (SGD)

      $$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k),$$

      • Compared to the gradient descent method: replace the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$ by the stochastic gradient $\nabla_w f(w_k, x_k)$.
      • Compared to the batch gradient descent method: let $n = 1$.
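As a concrete illustration, the sketch below (my own code) applies SGD to the common mean-estimation example $f(w, X) = \frac{1}{2}(w - X)^2$, for which $J(w) = \mathbb{E}[f(w, X)]$ is minimized at $w^* = \mathbb{E}[X]$ and the stochastic gradient is simply $w - x_k$; the sample distribution and step sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Example objective: f(w, X) = 0.5 * (w - X)^2, so J(w) = E[f(w, X)] is minimized at w* = E[X].
# The stochastic gradient is grad_w f(w, x_k) = w - x_k.
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)   # i.i.d. samples of X with E[X] = 5

w = 0.0
for k, x_k in enumerate(samples, start=1):
    alpha_k = 1.0 / k               # step sizes satisfying the RM conditions
    w = w - alpha_k * (w - x_k)     # SGD: w_{k+1} = w_k - alpha_k * grad_w f(w_k, x_k)

print(w, samples.mean())            # both are close to E[X] = 5
```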

Convergence analysis

  • From GD to SGD:

    $$w_{k+1} = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)] \quad \Longrightarrow \quad w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$

    • $\nabla_w f(w_k, x_k)$ can be viewed as a noisy measurement of $\mathbb{E}[\nabla_w f(w, X)]$:

      $$\nabla_w f(w_k, x_k) = \mathbb{E}[\nabla_w f(w, X)] + \underbrace{\big(\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w, X)]\big)}_{\eta}.$$

    • Since

      $$\nabla_w f(w_k, x_k) \neq \mathbb{E}[\nabla_w f(w, X)],$$

      it is natural to ask whether $w_k$ still converges to the minimizer $w^*$ as $k \to \infty$.

  • SGD as RM Algorithm

    • We next show that SGD is a special RM algorithm, so that its convergence naturally follows. The aim of SGD is to minimize

      $$J(w) = \mathbb{E}[f(w, X)].$$

    • This problem can be converted to a root-finding problem:

      $$\nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = 0.$$

    • Let

      $$g(w) = \nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)].$$

    • Then, the aim of SGD is to find the root of $g(w) = 0$.

  • SGD = RM

    • What we can measure is

      $$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \underbrace{\mathbb{E}[\nabla_w f(w, X)]}_{g(w)} + \underbrace{\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]}_{\eta}.$$

      • The RM algorithm for solving $g(w) = 0$ is then $w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k) = w_k - a_k \nabla_w f(w_k, x_k)$, which is exactly the SGD algorithm.
      • Therefore, SGD is a special RM algorithm.
  • Convergence of SGD

    • In the SGD algorithm, if
      1. $0 < c_1 \leq \nabla_w^2 f(w, X) \leq c_2$;
      2. $\sum_{k=1}^\infty a_k = \infty$ and $\sum_{k=1}^\infty a_k^2 < \infty$;
      3. $\{x_k\}_{k=1}^\infty$ is i.i.d.;
    • then $w_k$ converges to the root of $\nabla_w \mathbb{E}[f(w, X)] = 0$ with probability 1.
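The derivation above treats the stochastic gradient as an unbiased noisy measurement of the true gradient, i.e. $\mathbb{E}[\eta] = 0$. As a small sanity check (my own code, again using the $f(w, X) = \frac{1}{2}(w - X)^2$ example), the empirical mean of the noise term at a fixed $w$ is close to zero:

```python
import numpy as np

rng = np.random.default_rng(3)

# For f(w, X) = 0.5 * (w - X)^2 with E[X] = 5, the true gradient at w is g(w) = w - 5.
w = 2.0
x = rng.normal(loc=5.0, scale=2.0, size=100_000)   # samples of X

stochastic_grads = w - x                  # grad_w f(w, x_i) for each sample
true_grad = w - 5.0                       # g(w) = E[grad_w f(w, X)]
noise = stochastic_grads - true_grad      # eta = grad_w f(w, x) - g(w)

print(noise.mean())                       # close to 0: the measurement noise has zero mean
print(stochastic_grads.mean(), true_grad) # the stochastic gradient averages to the true gradient
```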

A deterministic formulation

  • Consider the optimization problem:

    $$\min_{w} J(w) = \frac{1}{n} \sum_{i=1}^n f(w, x_i),$$

    • $f(w, x_i)$ is a parameterized function.
    • $w$ is the parameter to be optimized.
    • $\{x_i\}_{i=1}^n$ is a set of real numbers, where $x_i$ does not have to be a sample of any random variable.
  • The gradient descent algorithm for solving this problem is

    $$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i).$$

  • Suppose the set is large and we can only fetch a single number every time. In this case, we can use the following iterative algorithm:

    $$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k).$$

  • Questions:

    • Is this algorithm SGD? It does not involve any random variables or expected values.
    • How should we use the finite set of numbers $\{x_i\}_{i=1}^n$? Should we sort these numbers in a certain order and then use them one by one? Or should we randomly sample a number from the set?
  • Answer:

    • A quick answer to the above questions is that we can introduce a random variable manually and convert the deterministic formulation to the stochastic formulation of SGD.

    • In particular, suppose $X$ is a random variable defined on the set $\{x_i\}_{i=1}^n$.

    • Suppose its probability distribution is uniform such that

      $$p(X = x_i) = \frac{1}{n}.$$

    • Then, the deterministic optimization problem becomes a stochastic one:

      $$\min_{w} J(w) = \frac{1}{n} \sum_{i=1}^n f(w, x_i) = \mathbb{E}[f(w, X)].$$

      • The last equality in the above equation is strict instead of approximate. Therefore, the algorithm is SGD.
      • The estimate converges if $x_k$ is uniformly and independently sampled from $\{x_i\}_{i=1}^n$. Note that $x_k$ may repeatedly take the same value in $\{x_i\}_{i=1}^n$ since it is sampled randomly.
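This conversion can be mimicked directly in code. The sketch below (my own example with a made-up data set, using $f(w, x_i) = \frac{1}{2}(w - x_i)^2$ so the minimizer is the average of the set) draws $x_k$ uniformly and independently, with replacement, from the finite set:

```python
import numpy as np

rng = np.random.default_rng(4)

# A finite set of plain numbers; no randomness is assumed in how they were produced.
x_set = np.array([1.0, 4.0, 2.5, 7.0, 3.5])
# Minimize J(w) = (1/n) * sum_i 0.5 * (w - x_i)^2; the minimizer is the average of the set.

w = 0.0
for k in range(1, 50_001):
    x_k = rng.choice(x_set)          # uniform i.i.d. sampling WITH replacement
    alpha_k = 1.0 / k
    w = w - alpha_k * (w - x_k)      # SGD step on f(w, x_k) = 0.5 * (w - x_k)^2

print(w, x_set.mean())               # both are close to the minimizer 3.6
```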

BGD, MBGD, and SGD

  • Suppose we would like to minimize $J(w) = \mathbb{E}[f(w, X)]$ given a set of random samples $\{x_i\}_{i=1}^n$ of $X$. The BGD, MBGD, and SGD algorithms solving this problem are, respectively,
    • $w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i),$ (BGD)
      • In the BGD algorithm, all the samples are used in every iteration. When $n$ is large, $\frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i) \approx \mathbb{E}[\nabla_w f(w_k, X)]$.
    • $w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j),$ (MBGD)
      • In the MBGD algorithm, $\mathcal{I}_k$ is a subset of $\{1, \ldots, n\}$ with size $|\mathcal{I}_k| = m$. The set $\mathcal{I}_k$ is obtained by $m$ i.i.d. samplings.
    • $w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k).$ (SGD)
      • In the SGD algorithm, $x_k$ is randomly sampled from $\{x_i\}_{i=1}^n$ at time $k$.
  • Compare MBGD with BGD and SGD:
    • Compared to SGD, MBGD has less randomness because it uses more samples instead of just one as in SGD.
    • Compared to BGD, MBGD does not require using all the samples in every iteration, which makes it more flexible and efficient.
    • If $m = 1$, MBGD becomes SGD.
    • If $m = n$, MBGD does NOT become BGD strictly speaking, because MBGD uses $n$ randomly fetched samples whereas BGD uses all $n$ numbers. In particular, MBGD may use a value in $\{x_i\}_{i=1}^n$ multiple times whereas BGD uses each number exactly once.
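To make the comparison tangible, here is a minimal side-by-side sketch of my own (not from the lecture), again using $f(w, x) = \frac{1}{2}(w - x)^2$; the sample distribution, batch size $m = 50$, and iteration counts are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)    # samples of X; the minimizer of J is near E[X] = 5
n = len(x)

def run(batch_size, num_iters=2_000):
    """BGD, MBGD, or SGD depending on batch_size (n, m, or 1)."""
    w = 0.0
    for k in range(1, num_iters + 1):
        if batch_size == n:
            batch = x                                # BGD: use all n samples every iteration
        else:
            batch = rng.choice(x, size=batch_size)   # MBGD/SGD: i.i.d. sampling with replacement
        grad = np.mean(w - batch)                    # average of grad_w f(w, x_j) over the batch
        w = w - (1.0 / k) * grad
    return w

print("BGD  (all n samples):", run(n))
print("MBGD (m = 50)       :", run(50))
print("SGD  (m = 1)        :", run(1))
```

All three estimates end up near the minimizer; SGD is the noisiest per iteration, MBGD sits in between, and BGD is the most stable but touches every sample at every step.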

Summary

  • Mean estimation: compute $\mathbb{E}[X]$ using $\{x_k\}$

    $$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k).$$

  • RM algorithm: solve $g(w) = 0$ using $\{\tilde{g}(w_k, \eta_k)\}$

    $$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k).$$

  • SGD algorithm: minimize $J(w) = \mathbb{E}[f(w, X)]$ using $\{\nabla_w f(w_k, x_k)\}$

    $$w_{k+1} = w_k - a_k \nabla_w f(w_k, x_k).$$


Conclusion


