RL【6】:Stochastic Approximation and Stochastic Gradient Descent
Series Contents
Fundamental Tools
RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation
Algorithm
RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent
Method
Contents
- Series Contents
- Fundamental Tools
- Algorithm
- Method
- Preface
- Robbins-Monro Algorithm
- Stochastic Gradient Descent
- BGD, MBGD, and SGD
- Summary
Preface
This series records my study notes on Prof. Shiyu Zhao's course "Mathematical Foundations of Reinforcement Learning" on Bilibili. For the course itself, please see:
Bilibili lectures: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Tips: this article mainly records some extended material involving fairly involved mathematical derivations, so for now it is only a brief record; it will be refined further later.
Robbins-Monro Algorithm
Problem statement
Suppose we would like to find the root of the equation $g(w) = 0$, where $w \in \mathbb{R}$ is the variable to be solved and $g : \mathbb{R} \to \mathbb{R}$ is a function.

- Many problems can eventually be converted to this root-finding problem. For example, suppose $J(w)$ is an objective function to be minimized. Then the optimization problem can be converted to
  $$g(w) = \nabla_w J(w) = 0.$$
- Note that an equation like $g(w) = c$ with a constant $c$ can also be converted to the above form by treating $g(w) - c$ as a new function.

The Robbins–Monro (RM) algorithm can solve this problem:

$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k), \quad k = 1, 2, 3, \ldots$$

- where
  - $w_k$ is the $k$-th estimate of the root;
  - $\tilde{g}(w_k, \eta_k) = g(w_k) + \eta_k$ is the $k$-th noisy observation;
  - $a_k$ is a positive coefficient.
- The function $g(w)$ is a black box! Only two sequences are observable:
  - Input sequence: $\{w_k\}$
  - Noisy output sequence: $\{\tilde{g}(w_k, \eta_k)\}$
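As a concrete illustration, the RM iteration can be sketched in a few lines of Python. The black-box function, the noise level, and the step sizes $a_k = 1/k$ below are all illustrative assumptions, not part of the algorithm itself:

```python
import random

def rm_solve(noisy_g, w1=0.0, n_iters=5000):
    """Robbins-Monro iteration w_{k+1} = w_k - a_k * g_tilde(w_k), with a_k = 1/k."""
    w = w1
    for k in range(1, n_iters + 1):
        w -= (1.0 / k) * noisy_g(w)
    return w

# Illustrative black box: g(w) = 2(w - 1), whose root is w* = 1,
# observable only through additive zero-mean Gaussian noise.
random.seed(0)
noisy_g = lambda w: 2.0 * (w - 1.0) + random.gauss(0.0, 0.5)
w_star = rm_solve(noisy_g)  # approaches the root w* = 1
```

Note that the iteration never evaluates $g$ itself, only the noisy observations $\tilde{g}(w_k, \eta_k)$, exactly as in the black-box setting above.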
Robbins-Monro Theorem

In the Robbins–Monro algorithm, if

- $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$;
- $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
- $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$;

where $\mathcal{H}_k = \{w_k, w_{k-1}, \ldots\}$, then $w_k$ converges with probability 1 (w.p.1) to the root $w^*$ satisfying $g(w^*) = 0$.
Convergence properties

- $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$
  - This condition indicates that $g$ is monotonically increasing, which ensures that the root of $g(w) = 0$ exists and is unique.
  - The gradient is also bounded from above.
- $\sum_{k=1}^\infty a_k = \infty$ and $\sum_{k=1}^\infty a_k^2 < \infty$
  - The condition $\sum_{k=1}^\infty a_k^2 < \infty$ ensures that $a_k$ converges to zero as $k \to \infty$.
  - The condition $\sum_{k=1}^\infty a_k = \infty$ ensures that $a_k$ does not converge to zero too fast.
  - Why is this condition important? Summing up $w_2 = w_1 - a_1 \tilde{g}(w_1, \eta_1)$, $w_3 = w_2 - a_2 \tilde{g}(w_2, \eta_2)$, $\ldots$, $w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k)$ leads to
    $$w_1 - w_\infty = \sum_{k=1}^\infty a_k \tilde{g}(w_k, \eta_k).$$
    Suppose $w_\infty = w^*$. If $\sum_{k=1}^\infty a_k < \infty$, then $\sum_{k=1}^\infty a_k \tilde{g}(w_k, \eta_k)$ may be bounded. Then, if the initial guess $w_1$ is chosen arbitrarily far away from $w^*$, the above equality would be invalid.
- $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$
  - A special yet common case is that $\{\eta_k\}$ is an i.i.d. stochastic sequence satisfying $\mathbb{E}[\eta_k] = 0$ and $\mathbb{E}[\eta_k^2] < \infty$.
  - The observation error $\eta_k$ is not required to be Gaussian.
- Which $\{a_k\}$ satisfies the two conditions?
  - One typical sequence is $a_k = \frac{1}{k}$.
  - It holds that $\lim_{n \to \infty} \left( \sum_{k=1}^n \frac{1}{k} - \ln n \right) = \kappa$, where $\kappa \approx 0.577$ is the Euler–Mascheroni constant (also called Euler's constant); hence $\sum_{k=1}^\infty \frac{1}{k} = \infty$.
  - It is notable that $\sum_{k=1}^\infty \frac{1}{k^2} = \frac{\pi^2}{6} < \infty$. Evaluating this limit is known in number theory as the Basel problem.
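The two step-size properties of $a_k = 1/k$ can be checked numerically: the partial sums of $1/k$ grow without bound (tracking $\ln n + \kappa$), while the partial sums of $1/k^2$ approach $\pi^2/6$. A small sketch:

```python
import math

n = 1_000_000
sum_ak = sum(1.0 / k for k in range(1, n + 1))        # partial sum of a_k
sum_ak2 = sum(1.0 / k ** 2 for k in range(1, n + 1))  # partial sum of a_k^2

# The harmonic partial sum tracks ln(n) + kappa, so it keeps growing with n,
# while the squared sum has already nearly reached its finite limit pi^2 / 6.
print(sum_ak - math.log(n))  # close to the Euler-Mascheroni constant 0.5772...
print(sum_ak2)               # close to pi^2 / 6
```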
Stochastic Gradient Descent

Algorithm description

- Suppose we aim to solve the following optimization problem:
  $$\min_{w} \; J(w) = \mathbb{E}[f(w, X)]$$
  - $w$ is the parameter to be optimized.
  - $X$ is a random variable. The expectation is with respect to $X$.
  - $w$ and $X$ can be either scalars or vectors. The function $f(\cdot)$ is scalar-valued.

Three Methods

- Method 1: gradient descent (GD)
  $$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]$$
  - Drawback: the expected value is difficult to obtain.
- Method 2: batch gradient descent (BGD)
  $$\mathbb{E}[\nabla_w f(w_k, X)] \approx \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i), \qquad w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i).$$
  - Drawback: it requires many samples in each iteration for each $w_k$.
- Method 3: stochastic gradient descent (SGD)
  $$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$
  - Compared to gradient descent: the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$ is replaced by the stochastic gradient $\nabla_w f(w_k, x_k)$.
  - Compared to batch gradient descent: let $n = 1$.
Convergence analysis

- From GD to SGD:
  $$w_{k+1} = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)] \quad \Longrightarrow \quad w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$
- $\nabla_w f(w_k, x_k)$ can be viewed as a noisy measurement of $\mathbb{E}[\nabla_w f(w, X)]$:
  $$\nabla_w f(w_k, x_k) = \mathbb{E}[\nabla_w f(w, X)] + \underbrace{\big(\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w, X)]\big)}_{\eta}.$$
- Since $\nabla_w f(w_k, x_k) \neq \mathbb{E}[\nabla_w f(w, X)]$ in general, it is not obvious whether $w_k \to w^*$ as $k \to \infty$, and a convergence analysis is needed.
SGD as RM Algorithm

- The aim of SGD is to minimize
  $$J(w) = \mathbb{E}[f(w, X)].$$
- This problem can be converted to a root-finding problem:
  $$\nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = 0.$$
- Let
  $$g(w) = \nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)].$$
  Then, the aim of SGD is to find the root of $g(w) = 0$.
- What we can measure is
  $$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \underbrace{\mathbb{E}[\nabla_w f(w, X)]}_{g(w)} + \underbrace{\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]}_{\eta}.$$
- The RM algorithm for solving $g(w) = 0$ is then
  $$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k) = w_k - a_k \nabla_w f(w_k, x_k),$$
  which is exactly the SGD algorithm.
- Therefore, SGD is a special RM algorithm, and its convergence naturally follows from the Robbins–Monro theorem.
Convergence of SGD

In the SGD algorithm, if

- $0 < c_1 \leq \nabla_w^2 f(w, X) \leq c_2$;
- $\sum_{k=1}^\infty a_k = \infty$ and $\sum_{k=1}^\infty a_k^2 < \infty$;
- $\{x_k\}_{k=1}^\infty$ is i.i.d.;

then $w_k$ converges to the root of $\nabla_w \mathbb{E}[f(w, X)] = 0$ with probability 1.
A deterministic formulation

- Consider the optimization problem:
  $$\min_{w} J(w) = \frac{1}{n} \sum_{i=1}^n f(w, x_i),$$
  - $f(w, x_i)$ is a parameterized function.
  - $w$ is the parameter to be optimized.
  - $\{x_i\}_{i=1}^n$ is a set of real numbers, where $x_i$ does not have to be a sample of any random variable.
- The gradient descent algorithm for solving this problem is
  $$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i).$$
- Suppose the set is large and we can only fetch a single number each time. In this case, we can use the following iterative algorithm:
  $$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k).$$
- Questions:
  - Is this algorithm SGD? It does not involve any random variables or expected values.
  - How should we use the finite set of numbers $\{x_i\}_{i=1}^n$? Should we sort these numbers in a certain order and then use them one by one, or should we randomly sample a number from the set?
- Answer:
  - A quick answer to the above questions is that we can manually introduce a random variable and convert the deterministic formulation to the stochastic formulation of SGD.
  - In particular, let $X$ be a random variable defined on the set $\{x_i\}_{i=1}^n$ with the uniform distribution $p(X = x_i) = \frac{1}{n}$.
  - Then, the deterministic optimization problem becomes a stochastic one:
    $$\min_{w} J(w) = \frac{1}{n} \sum_{i=1}^n f(w, x_i) = \mathbb{E}[f(w, X)].$$
  - The last equality in the above equation is strict rather than approximate. Therefore, the algorithm is SGD.
  - The estimate converges if $x_k$ is uniformly and independently sampled from $\{x_i\}_{i=1}^n$. Note that $x_k$ may repeatedly take the same value in $\{x_i\}_{i=1}^n$ since it is sampled randomly.
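A minimal sketch of this deterministic-to-stochastic conversion, assuming the illustrative cost $f(w, x) = (w - x)^2 / 2$ (so $\nabla_w f = w - x$, and the minimizer of $J$ is the average of the $x_i$), with $x_k$ sampled uniformly with replacement:

```python
import random

random.seed(1)
xs = [2.0, 4.0, 6.0, 8.0]  # the finite deterministic set {x_i}

w = 0.0
for k in range(1, 20001):
    x_k = random.choice(xs)  # uniform i.i.d. sampling; repeats are allowed
    grad = w - x_k           # stochastic gradient of f(w, x) = (w - x)^2 / 2
    w -= (1.0 / k) * grad    # a_k = 1/k satisfies the RM step-size conditions

# w approaches the minimizer of J, i.e. the mean of xs (here 5.0)
```

With this particular $f$ and $a_k = 1/k$, each update is exactly the incremental average of the sampled values, so $w$ converges to the mean of the set.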
BGD, MBGD, and SGD

- Suppose we would like to minimize $J(w) = \mathbb{E}[f(w, X)]$ given a set of random samples $\{x_i\}_{i=1}^n$ of $X$. The BGD, MBGD, and SGD algorithms solving this problem are, respectively,
  $$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i), \quad \text{(BGD)}$$
  $$w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j), \quad \text{(MBGD)}$$
  $$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k). \quad \text{(SGD)}$$
  - In the BGD algorithm, all the samples are used in every iteration. When $n$ is large, $\frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i) \approx \mathbb{E}[\nabla_w f(w_k, X)]$.
  - In the MBGD algorithm, $\mathcal{I}_k$ is a subset of $\{1, \ldots, n\}$ with size $|\mathcal{I}_k| = m$. The set $\mathcal{I}_k$ is obtained by $m$ i.i.d. samplings.
  - In the SGD algorithm, $x_k$ is randomly sampled from $\{x_i\}_{i=1}^n$ at time $k$.
- Comparing MBGD with BGD and SGD:
  - Compared to SGD, MBGD has less randomness because it uses more samples instead of just one.
  - Compared to BGD, MBGD does not require using all the samples in every iteration, making it more flexible and efficient.
  - If $m = 1$, MBGD becomes SGD.
  - If $m = n$, MBGD does not become BGD, strictly speaking, because MBGD uses $n$ randomly fetched samples whereas BGD uses all $n$ numbers. In particular, MBGD may use a value in $\{x_i\}_{i=1}^n$ multiple times, whereas BGD uses each number exactly once.
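The three gradient estimates can be compared side by side. The data set and the cost $f(w, x) = (w - x)^2 / 2$ below are illustrative assumptions; only the BGD estimate uses every sample, while MBGD and SGD draw with replacement:

```python
import random

random.seed(2)
xs = [float(i) for i in range(1, 101)]  # n = 100 samples of X; mean is 50.5
w, m = 0.0, 32

# BGD: average the gradient w - x over all n samples (deterministic given xs).
grad_bgd = sum(w - x for x in xs) / len(xs)

# MBGD: average over a mini-batch of m i.i.d. draws (with replacement).
batch = [random.choice(xs) for _ in range(m)]
grad_mbgd = sum(w - x for x in batch) / m

# SGD: a single randomly drawn sample.
grad_sgd = w - random.choice(xs)

# grad_bgd equals w - mean(xs) exactly; grad_mbgd and grad_sgd are noisy
# versions of it, with the mini-batch estimate less noisy than the single-sample one.
```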
Summary

- Mean estimation: compute $\mathbb{E}[X]$ using $\{x_k\}$:
  $$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k).$$
- RM algorithm: solve $g(w) = 0$ using $\{\tilde{g}(w_k, \eta_k)\}$:
  $$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k).$$
- SGD algorithm: minimize $J(w) = \mathbb{E}[f(w, X)]$ using $\{\nabla_w f(w_k, x_k)\}$:
  $$w_{k+1} = w_k - a_k \nabla_w f(w_k, x_k).$$
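The three updates above are closely related; in particular, the mean-estimation update with $a_k = 1/k$ reproduces the running sample average exactly, as this small check (with an arbitrary illustrative data stream) shows:

```python
xs = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]  # illustrative data stream {x_k}

w = 0.0  # with a_1 = 1, the initial guess is discarded at the first step
for k, x_k in enumerate(xs, start=1):
    w -= (1.0 / k) * (w - x_k)  # w_{k+1} = w_k - (1/k)(w_k - x_k)

# After k steps, w equals the exact average of the first k samples,
# so the final w equals sum(xs) / len(xs).
```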