RL【6】:Stochastic Approximation and Stochastic Gradient Descent
Series Contents
Fundamental Tools
RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation
Algorithm
RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent
Method
Contents
- Series Contents
- Fundamental Tools
- Algorithm
- Method
- Preface
- Robbins-Monro Algorithm
- Stochastic Gradient Descent
- BGD, MBGD, and SGD
- Summary
Preface
This series records my study notes on Prof. Shiyu Zhao's course "Mathematical Foundations of Reinforcement Learning" on Bilibili. For the course itself, please see:
Bilibili lectures: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Tips: this article mainly records some extended material involving fairly involved mathematical derivations, so for now it is only a brief record; it will be refined further later.
Robbins-Monro Algorithm
Problem statement
Suppose we would like to find the root of the equation $g(w) = 0$, where $w \in \mathbb{R}$ is the variable to be solved and $g : \mathbb{R} \to \mathbb{R}$ is a function.

- Many problems can eventually be converted to this root-finding problem. For example, suppose $J(w)$ is an objective function to be minimized. Then the optimization problem can be converted to
  $$g(w) = \nabla_w J(w) = 0.$$
- Note that an equation like $g(w) = c$ with a constant $c$ can also be converted to the above form by treating $g(w) - c$ as a new function.

The Robbins–Monro (RM) algorithm can solve this problem:

$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k), \quad k = 1, 2, 3, \ldots$$

- where
  - $w_k$ is the $k$-th estimate of the root;
  - $\tilde{g}(w_k, \eta_k) = g(w_k) + \eta_k$ is the $k$-th noisy observation;
  - $a_k$ is a positive coefficient.
- The function $g(w)$ is a black box! Only two sequences are observable:
  - Input sequence: $\{w_k\}$
  - Noisy output sequence: $\{\tilde{g}(w_k, \eta_k)\}$
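As a concrete illustration, the RM iteration can be sketched in a few lines of Python. The black-box function, the noise level, and the step sizes $a_k = 1/k$ below are all illustrative assumptions, not part of the algorithm itself:

```python
import random

def rm_solve(noisy_g, w1=0.0, n_iters=5000):
    """Robbins-Monro iteration w_{k+1} = w_k - a_k * g_tilde(w_k), with a_k = 1/k."""
    w = w1
    for k in range(1, n_iters + 1):
        w -= (1.0 / k) * noisy_g(w)
    return w

# Illustrative black box: g(w) = 2(w - 1), whose root is w* = 1,
# observable only through additive zero-mean Gaussian noise.
random.seed(0)
noisy_g = lambda w: 2.0 * (w - 1.0) + random.gauss(0.0, 0.5)
w_star = rm_solve(noisy_g)  # approaches the root w* = 1
```

Note that the iteration never evaluates $g$ itself, only the noisy observations $\tilde{g}(w_k, \eta_k)$, exactly as in the black-box setting above.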
Robbins-Monro Theorem

In the Robbins–Monro algorithm, if

- $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$;
- $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
- $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$;

where $\mathcal{H}_k = \{w_k, w_{k-1}, \ldots\}$, then $w_k$ converges with probability 1 (w.p.1) to the root $w^*$ satisfying $g(w^*) = 0$.
Convergence properties

- $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$
  - This condition indicates that $g$ is monotonically increasing, which ensures that the root of $g(w) = 0$ exists and is unique.
  - The gradient is also bounded from above.
- $\sum_{k=1}^\infty a_k = \infty$ and $\sum_{k=1}^\infty a_k^2 < \infty$
  - The condition $\sum_{k=1}^\infty a_k^2 < \infty$ ensures that $a_k$ converges to zero as $k \to \infty$.
  - The condition $\sum_{k=1}^\infty a_k = \infty$ ensures that $a_k$ does not converge to zero too fast.
  - Why is this condition important? Summing up $w_2 = w_1 - a_1 \tilde{g}(w_1, \eta_1)$, $w_3 = w_2 - a_2 \tilde{g}(w_2, \eta_2)$, $\ldots$, $w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k)$ leads to
    $$w_1 - w_\infty = \sum_{k=1}^\infty a_k \tilde{g}(w_k, \eta_k).$$
    Suppose $w_\infty = w^*$. If $\sum_{k=1}^\infty a_k < \infty$, then $\sum_{k=1}^\infty a_k \tilde{g}(w_k, \eta_k)$ may be bounded. Then, if the initial guess $w_1$ is chosen arbitrarily far away from $w^*$, the above equality would be invalid.
- $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$
  - A special yet common case is that $\{\eta_k\}$ is an i.i.d. stochastic sequence satisfying $\mathbb{E}[\eta_k] = 0$ and $\mathbb{E}[\eta_k^2] < \infty$.
  - The observation error $\eta_k$ is not required to be Gaussian.
- Which $\{a_k\}$ satisfies the two conditions?
  - One typical sequence is $a_k = \frac{1}{k}$.
  - It holds that $\lim_{n \to \infty} \left( \sum_{k=1}^n \frac{1}{k} - \ln n \right) = \kappa$, where $\kappa \approx 0.577$ is the Euler–Mascheroni constant (also called Euler's constant); hence $\sum_{k=1}^\infty \frac{1}{k} = \infty$.
  - It is notable that $\sum_{k=1}^\infty \frac{1}{k^2} = \frac{\pi^2}{6} < \infty$. Evaluating this limit is known in number theory as the Basel problem.
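The two step-size properties of $a_k = 1/k$ can be checked numerically: the partial sums of $1/k$ grow without bound (tracking $\ln n + \kappa$), while the partial sums of $1/k^2$ approach $\pi^2/6$. A small sketch:

```python
import math

n = 1_000_000
sum_ak = sum(1.0 / k for k in range(1, n + 1))        # partial sum of a_k
sum_ak2 = sum(1.0 / k ** 2 for k in range(1, n + 1))  # partial sum of a_k^2

# The harmonic partial sum tracks ln(n) + kappa, so it keeps growing with n,
# while the squared sum has already nearly reached its finite limit pi^2 / 6.
print(sum_ak - math.log(n))  # close to the Euler-Mascheroni constant 0.5772...
print(sum_ak2)               # close to pi^2 / 6
```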
Stochastic Gradient Descent

Algorithm description

- Suppose we aim to solve the following optimization problem:
  $$\min_{w} \; J(w) = \mathbb{E}[f(w, X)]$$
  - $w$ is the parameter to be optimized.
  - $X$ is a random variable. The expectation is with respect to $X$.
  - $w$ and $X$ can be either scalars or vectors. The function $f(\cdot)$ is scalar-valued.

Three Methods

- Method 1: gradient descent (GD)
  $$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]$$
  - Drawback: the expected value is difficult to obtain.
- Method 2: batch gradient descent (BGD)
  $$\mathbb{E}[\nabla_w f(w_k, X)] \approx \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i), \qquad w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i).$$
  - Drawback: it requires many samples in each iteration for each $w_k$.
- Method 3: stochastic gradient descent (SGD)
  $$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$
  - Compared to gradient descent: the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$ is replaced by the stochastic gradient $\nabla_w f(w_k, x_k)$.
  - Compared to batch gradient descent: let $n = 1$.
Convergence analysis

- From GD to SGD:
  $$w_{k+1} = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)] \quad \Longrightarrow \quad w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$
- $\nabla_w f(w_k, x_k)$ can be viewed as a noisy measurement of $\mathbb{E}[\nabla_w f(w, X)]$:
  $$\nabla_w f(w_k, x_k) = \mathbb{E}[\nabla_w f(w, X)] + \underbrace{\big(\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w, X)]\big)}_{\eta}.$$
- Since $\nabla_w f(w_k, x_k) \neq \mathbb{E}[\nabla_w f(w, X)]$ in general, it is not obvious whether $w_k \to w^*$ as $k \to \infty$, and a convergence analysis is needed.
SGD as RM Algorithm

- The aim of SGD is to minimize
  $$J(w) = \mathbb{E}[f(w, X)].$$
- This problem can be converted to a root-finding problem:
  $$\nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = 0.$$
- Let
  $$g(w) = \nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)].$$
  Then, the aim of SGD is to find the root of $g(w) = 0$.
- What we can measure is
  $$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \underbrace{\mathbb{E}[\nabla_w f(w, X)]}_{g(w)} + \underbrace{\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]}_{\eta}.$$
- The RM algorithm for solving $g(w) = 0$ is then
  $$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k) = w_k - a_k \nabla_w f(w_k, x_k),$$
  which is exactly the SGD algorithm.
- Therefore, SGD is a special RM algorithm, and its convergence naturally follows from the Robbins–Monro theorem.
Convergence of SGD

In the SGD algorithm, if

- $0 < c_1 \leq \nabla_w^2 f(w, X) \leq c_2$;
- $\sum_{k=1}^\infty a_k = \infty$ and $\sum_{k=1}^\infty a_k^2 < \infty$;
- $\{x_k\}_{k=1}^\infty$ is i.i.d.;

then $w_k$ converges to the root of $\nabla_w \mathbb{E}[f(w, X)] = 0$ with probability 1.
A deterministic formulation

- Consider the optimization problem:
  $$\min_{w} J(w) = \frac{1}{n} \sum_{i=1}^n f(w, x_i),$$
  - $f(w, x_i)$ is a parameterized function.
  - $w$ is the parameter to be optimized.
  - $\{x_i\}_{i=1}^n$ is a set of real numbers, where $x_i$ does not have to be a sample of any random variable.
- The gradient descent algorithm for solving this problem is
  $$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i).$$
- Suppose the set is large and we can only fetch a single number each time. In this case, we can use the following iterative algorithm:
  $$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k).$$
- Questions:
  - Is this algorithm SGD? It does not involve any random variables or expected values.
  - How should we use the finite set of numbers $\{x_i\}_{i=1}^n$? Should we sort these numbers in a certain order and then use them one by one, or should we randomly sample a number from the set?
- Answer:
  - A quick answer to the above questions is that we can manually introduce a random variable and convert the deterministic formulation to the stochastic formulation of SGD.
  - In particular, let $X$ be a random variable defined on the set $\{x_i\}_{i=1}^n$ with the uniform distribution $p(X = x_i) = \frac{1}{n}$.
  - Then, the deterministic optimization problem becomes a stochastic one:
    $$\min_{w} J(w) = \frac{1}{n} \sum_{i=1}^n f(w, x_i) = \mathbb{E}[f(w, X)].$$
  - The last equality in the above equation is strict rather than approximate. Therefore, the algorithm is SGD.
  - The estimate converges if $x_k$ is uniformly and independently sampled from $\{x_i\}_{i=1}^n$. Note that $x_k$ may repeatedly take the same value in $\{x_i\}_{i=1}^n$ since it is sampled randomly.
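A minimal sketch of this deterministic-to-stochastic conversion, assuming the illustrative cost $f(w, x) = (w - x)^2 / 2$ (so $\nabla_w f = w - x$, and the minimizer of $J$ is the average of the $x_i$), with $x_k$ sampled uniformly with replacement:

```python
import random

random.seed(1)
xs = [2.0, 4.0, 6.0, 8.0]  # the finite deterministic set {x_i}

w = 0.0
for k in range(1, 20001):
    x_k = random.choice(xs)  # uniform i.i.d. sampling; repeats are allowed
    grad = w - x_k           # stochastic gradient of f(w, x) = (w - x)^2 / 2
    w -= (1.0 / k) * grad    # a_k = 1/k satisfies the RM step-size conditions

# w approaches the minimizer of J, i.e. the mean of xs (here 5.0)
```

With this particular $f$ and $a_k = 1/k$, each update is exactly the incremental average of the sampled values, so $w$ converges to the mean of the set.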
BGD, MBGD, and SGD

- Suppose we would like to minimize $J(w) = \mathbb{E}[f(w, X)]$ given a set of random samples $\{x_i\}_{i=1}^n$ of $X$. The BGD, MBGD, and SGD algorithms solving this problem are, respectively,
  $$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i), \quad \text{(BGD)}$$
  $$w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j), \quad \text{(MBGD)}$$
  $$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k). \quad \text{(SGD)}$$
  - In the BGD algorithm, all the samples are used in every iteration. When $n$ is large, $\frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i) \approx \mathbb{E}[\nabla_w f(w_k, X)]$.
  - In the MBGD algorithm, $\mathcal{I}_k$ is a subset of $\{1, \ldots, n\}$ with size $|\mathcal{I}_k| = m$. The set $\mathcal{I}_k$ is obtained by $m$ i.i.d. samplings.
  - In the SGD algorithm, $x_k$ is randomly sampled from $\{x_i\}_{i=1}^n$ at time $k$.
- Comparing MBGD with BGD and SGD:
  - Compared to SGD, MBGD has less randomness because it uses more samples instead of just one.
  - Compared to BGD, MBGD does not require using all the samples in every iteration, making it more flexible and efficient.
  - If $m = 1$, MBGD becomes SGD.
  - If $m = n$, MBGD does not become BGD, strictly speaking, because MBGD uses $n$ randomly fetched samples whereas BGD uses all $n$ numbers. In particular, MBGD may use a value in $\{x_i\}_{i=1}^n$ multiple times, whereas BGD uses each number exactly once.
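The three gradient estimates can be compared side by side. The data set and the cost $f(w, x) = (w - x)^2 / 2$ below are illustrative assumptions; only the BGD estimate uses every sample, while MBGD and SGD draw with replacement:

```python
import random

random.seed(2)
xs = [float(i) for i in range(1, 101)]  # n = 100 samples of X; mean is 50.5
w, m = 0.0, 32

# BGD: average the gradient w - x over all n samples (deterministic given xs).
grad_bgd = sum(w - x for x in xs) / len(xs)

# MBGD: average over a mini-batch of m i.i.d. draws (with replacement).
batch = [random.choice(xs) for _ in range(m)]
grad_mbgd = sum(w - x for x in batch) / m

# SGD: a single randomly drawn sample.
grad_sgd = w - random.choice(xs)

# grad_bgd equals w - mean(xs) exactly; grad_mbgd and grad_sgd are noisy
# versions of it, with the mini-batch estimate less noisy than the single-sample one.
```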
Summary

- Mean estimation: compute $\mathbb{E}[X]$ using $\{x_k\}$:
  $$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k).$$
- RM algorithm: solve $g(w) = 0$ using $\{\tilde{g}(w_k, \eta_k)\}$:
  $$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k).$$
- SGD algorithm: minimize $J(w) = \mathbb{E}[f(w, X)]$ using $\{\nabla_w f(w_k, x_k)\}$:
  $$w_{k+1} = w_k - a_k \nabla_w f(w_k, x_k).$$
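The three updates above are closely related; in particular, the mean-estimation update with $a_k = 1/k$ reproduces the running sample average exactly, as this small check (with an arbitrary illustrative data stream) shows:

```python
xs = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]  # illustrative data stream {x_k}

w = 0.0  # with a_1 = 1, the initial guess is discarded at the first step
for k, x_k in enumerate(xs, start=1):
    w -= (1.0 / k) * (w - x_k)  # w_{k+1} = w_k - (1/k)(w_k - x_k)

# After k steps, w equals the exact average of the first k samples,
# so the final w equals sum(xs) / len(xs).
```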