
Hung-yi Lee Reinforcement Learning: Introductory Notes

Date: 2021-08-16 12:48:45


Table of Contents

- Concepts in Reinforcement Learning
- Difficulties in RL
- A3C Method Brief Introduction
- Policy-based Approach - Learn an Actor (Policy Gradient Method)
  - 1. Decide Function of Actor Model (NN? ...)
  - 2. Decide Goodness of this Function
  - 3. Choose the best function
- On-Policy v.s. Off-Policy
- Importance Sampling (On-Policy → Off-Policy)
- PPO Algorithm —— Proximal Policy Optimization
- Value-based Approach - Learn a Critic
  - Q-learning
  - Double DQN
  - Other Advanced Structure of Q-Learning
- A3C Method - Asynchronous Advantage Actor-Critic
  - Advantage Actor-Critic (A2C Method)
  - Asynchronous Advantage Actor-Critic (A3C Method)
- Sparse Reward
  - Reward Shaping
  - Curriculum Learning
  - Hierarchical Reinforcement Learning
- ICM —— Intrinsic Curiosity Module

Concepts in Reinforcement Learning

The main goal of Reinforcement Learning is to maximize the Total Reward. The Total Reward is the sum of all rewards in one Episode, so the model doesn't know which steps in the episode were good and which were bad. Only a few actions obtain a positive reward (e.g. firing and killing an enemy in the Space War game gives a positive reward, while moving gives none), so letting the model find these right actions is very important.

Difficulties in RL

- Reward delay: only "Fire" obtains a reward, but moving before firing is also important even though moving itself gets no reward. How do we let the model learn to move properly? Likewise, in a chess game it may be better to sacrifice immediate reward to gain more long-term reward.
- The agent's actions may affect the environment: we need to explore the observation space as much as possible, and to explore the possible action combinations as much as possible.

A3C Method Brief Introduction

The A3C method is the most popular model combining the policy-based method and the value-based method. To learn the A3C model we first need the concepts of policy-based and value-based learning; the details of A3C are in the [A3C Method - Asynchronous Advantage Actor-Critic](#a3c-method---asynchronous-advantage-actor-critic) section below.

Policy-based Approach - Learn an Actor (Policy Gradient Method)

This approach tries to learn a policy (also called an actor). It accepts an observation as input and outputs an action. The policy (actor) can be any model; if you use a Neural Network as your actor, you are doing Deep Reinforcement Learning.

$$Input(Observation) \rightarrow Actor/Policy \rightarrow Output(Action)$$

There are three steps to build DRL:

1. Decide Function of Actor Model (NN? …)

Here we use the NN as our Actor, so:

- The input of this NN is the machine's observation, represented as a vector or matrix (e.g. the image pixels of a game frame).
- The output of this NN is an action probability distribution. The most important point is that we shouldn't always choose the action with the highest probability; the decision should be stochastic, sampled according to the probability distribution (a minimal sketch follows below).
- The advantage of an NN over a Q-table: in complex scenes we can't enumerate all observations (for example, all pixel combinations of a game), but a Neural Network guarantees we always get an output, even for observations that never appeared in the training data.
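A minimal sketch of such an actor, assuming PyTorch and hypothetical dimensions (`obs_dim=4`, `n_actions=2`); the point is that the action is sampled from the output distribution instead of always taking the arg-max:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps an observation vector to a probability distribution over discrete actions."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

actor = Actor(obs_dim=4, n_actions=2)      # hypothetical sizes
dist = actor(torch.randn(4))               # observation -> action distribution
action = dist.sample()                     # stochastic decision, not argmax
log_prob = dist.log_prob(action)           # kept for the policy gradient later
```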

2. Decide Goodness of this Function

Since we use a Neural Network as our function model, we need to decide the goodness of this model, i.e. a standard to judge the performance of the current model. We use $\overline{R(\theta)}$ to express this standard, where $\theta$ denotes the parameters of the current model.

- Given an actor $\pi_\theta(t)$ with network parameters $\theta$, where $t$ is the observation (input).
- Use the actor $\pi_\theta(t)$ to play the video game until the game finishes.
- Sum all the rewards in this episode and denote the total as $R(\theta) = \sum_{t=1}^T r_t$.

Note: $R(\theta)$ is a random variable, because even if we use the same actor $\pi_\theta(t)$ to play the same game many times, we can get different values of $R(\theta)$ (randomness in the game and in action selection). So we want to maximize $\overline{R(\theta)}$, the expectation of $R(\theta)$, and we use $\overline{R(\theta)}$ to express the goodness of $\pi_\theta(t)$.

How to calculate $\overline{R(\theta)}$?

An episode is considered as a trajectory $\tau$:

- $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T\}$ $\rightarrow$ all the history in this episode
- $R(\tau) = \sum_{t=1}^T r_t$

Different $\tau$ have different probabilities of appearing, and the probability of $\tau$ depends on the parameters $\theta$ of the actor $\pi_\theta(t)$. So we define the probability of $\tau$ as $P(\tau|\theta)$:

$$\overline{R(\theta)} = \sum_\tau P(\tau|\theta) R(\tau)$$

We use the actor $\pi_\theta(t)$ to play the game N times and obtain the list $\{\tau^1, \tau^2, ..., \tau^N\}$. Each $\tau^n$ has a reward $R(\tau^n)$, and the mean of these $R(\tau^n)$ approximately equals the expectation $\overline{R(\theta)}$:

$$\overline{R(\theta)} \approx \frac{1}{N}\sum_{n=1}^N R(\tau^n)$$

3. Choose the best function

Now we need to know how to find $\theta$; here we use the Gradient Ascent method.

Problem statement:

$$\theta^* = \arg\max_\theta \overline{R(\theta)}, \qquad \overline{R(\theta)} = \sum_{\tau} P(\tau|\theta) R(\tau)$$

Gradient ascent: start with $\theta^0$, then

$$\theta^1 = \theta^0 + \eta \nabla \overline{R(\theta^0)}, \qquad \theta^2 = \theta^1 + \eta \nabla \overline{R(\theta^1)}, \qquad ...$$

Here $\theta$ consists of all the parameters of the current Neural Network, $\theta = \{w_1, w_2, w_3, ..., b_1, b_2, b_3, ...\}$, and

$$\nabla \overline{R(\theta)} = \begin{bmatrix} \partial \overline{R(\theta)} / \partial w_1 \\ \partial \overline{R(\theta)} / \partial w_2 \\ \vdots \\ \partial \overline{R(\theta)} / \partial b_1 \\ \partial \overline{R(\theta)} / \partial b_2 \\ \vdots \end{bmatrix}$$

It's time to calculate the gradient of $\overline{R(\theta)} = \sum_{\tau} P(\tau|\theta) R(\tau)$. Since $R(\tau)$ does not depend on $\theta$, the gradient can be expressed as:

$$\nabla \overline{R(\theta)} = \sum_\tau R(\tau) \nabla P(\tau|\theta) = \sum_\tau R(\tau) P(\tau|\theta) \frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} = \sum_\tau R(\tau) P(\tau|\theta) \nabla \log P(\tau|\theta)$$

Note: $\frac{d \log f(x)}{dx} = \frac{1}{f(x)} \frac{d f(x)}{dx}$

Using the policy $\pi_\theta$ to play the game N times, we obtain $\{\tau^1, \tau^2, \tau^3, ...\}$:

$$\nabla \overline{R(\theta)} \approx \frac{1}{N}\sum_{n=1}^N R(\tau^n) \nabla \log P(\tau^n|\theta)$$

How to calculate $\nabla \log P(\tau|\theta)$?

Since $\tau$ is the history of one episode:

$$P(\tau|\theta) = P(s_1) \prod_{t=1}^{T} P(a_t|s_t, \theta)\, P(r_t, s_{t+1}|s_t, a_t)$$

$$\log P(\tau|\theta) = \log P(s_1) + \sum_{t=1}^{T} \left[ \log P(a_t|s_t, \theta) + \log P(r_t, s_{t+1}|s_t, a_t) \right]$$

Ignoring the terms that are not related to $\theta$:

$$\nabla \log P(\tau|\theta) = \sum_{t=1}^T \nabla \log P(a_t|s_t, \theta)$$

So the final result for $\nabla \overline{R(\theta)}$ is:

$$\nabla \overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N \sum_{t=1}^T R(\tau^n) \nabla \log P(a_t^n|s_t^n, \theta)$$

The meaning of this equation is very clear:

- If $R(\tau^n)$ is positive $\rightarrow$ tune $\theta$ to increase $P(a_t^n|s_t^n)$.
- If $R(\tau^n)$ is negative $\rightarrow$ tune $\theta$ to decrease $P(a_t^n|s_t^n)$.

This method resolves the Reward Delay problem from the [Difficulties in RL](#difficulties-in-rl) chapter, because here we use the cumulative reward of one entire episode, $R(\tau^n)$, not just the immediate reward obtained after taking one action.
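The update above can be written as a surrogate loss whose gradient matches $\frac{1}{N}\sum_n\sum_t R(\tau^n)\nabla\log P(a_t^n|s_t^n,\theta)$ for one sampled episode. A minimal sketch, assuming the `Actor` above and that `log_probs` collects the per-step log-probabilities of one episode:

```python
import torch

def reinforce_loss(log_probs, episode_return):
    """Negative of R(tau) * sum_t log P(a_t | s_t, theta); minimizing this
    performs gradient ascent on the expected episode reward."""
    return -episode_return * torch.stack(log_probs).sum()

# Hypothetical usage after playing one episode:
#   loss = reinforce_loss(log_probs, sum(rewards))
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```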

Add a Baseline - b

To avoid all $R(\tau^n)$ being positive (there should be some negative values to tell the model not to take certain actions in certain states), we can subtract a baseline $b$. The equation changes to:

$$\nabla \overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N \sum_{t=1}^T \left(R(\tau^n) - b\right) \nabla \log P(a_t^n|s_t^n, \theta)$$

Assign a Suitable Weight to Each Action

Using the total reward $R(\tau)$ to tune the probabilities of all the actions in an episode also has a disadvantage. Consider the following two episodes:

In the left example, one episode has total reward R = 5, so the probabilities of all actions in this episode get increased. But the main positive reward comes from $a_1$, while $a_2$ and $a_3$ give no positive reward, yet their probabilities are increased as well. Similarly, in the right example $a_1$ is a bad action, but $a_2$ may not be, so the probability of $a_2$ shouldn't be decreased.

To avoid this problem, we assign a different weight $R$ to each $a_t$: the discounted sum of the rewards $r_{t'}$ obtained after $a_t$. The equation becomes:

$$\nabla \overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N \sum_{t=1}^T \left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right) \nabla \log P(a_t^n|s_t^n, \theta)$$

Note: $\gamma$ is called the discount factor, with $\gamma < 1$.

We can use $A^\theta(s_t, a_t)$ to denote the term $\left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right)$ in the equation above, called the Advantage Function. It evaluates how good it is to take $a_t$ in state $s_t$ rather than other actions.
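A small helper, as a sketch, that turns a reward sequence into these per-step weights $\sum_{t'=t}^T \gamma^{t'-t} r_{t'} - b$ (the baseline `b` is treated as a plain constant here):

```python
import numpy as np

def discounted_weights(rewards, gamma=0.99, baseline=0.0):
    """For each step t: sum_{t'>=t} gamma^(t'-t) * r_{t'}  -  baseline."""
    weights = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        weights[t] = running - baseline
    return weights

# Only the last action is rewarded; earlier actions get discounted credit, not the full return.
print(discounted_weights([0.0, 0.0, 1.0], gamma=0.9))   # [0.81, 0.9, 1.0]
```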

On-Policy v.s. Off-Policy

On-Policy and Off-Policy are two different modes of learning:

- On-Policy: the agent learns the rules by interacting with the environment itself (it learns from its own experience).
- Off-Policy: the agent learns the rules by watching another policy interact with the environment (it learns from others).

Our Policy Gradient method is On-Policy, so why do we need an Off-Policy mode? Because we sample N times and use the mean to approximate the expectation $\overline{R(\theta)} = \sum_\tau P(\tau|\theta) R(\tau)$. But once we update $\theta$, $P(\tau|\theta)$ changes, so we have to do the N samplings all over again; sampling after every update costs a lot of time. The solution is to build a model $\pi_\theta$ that accepts training data collected by another model $\pi_{\theta'}$: use $\pi_{\theta'}$ to collect data and train $\theta$ with it. Since $\theta'$ does not change, the sampled data can be reused.

Importance Sampling (On-Policy → Off-Policy)

Importance Sampling is a method to estimate the expectation $E_{x\sim p}[f(x)]$ by sampling from another distribution $q(x)$. We already know:

$$E_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^N f(x^i)$$

But if we only have $\{x^i\}$ sampled from $q(x)$, how can we use these samples to calculate $E_{x\sim p}[f(x)]$? We can rewrite the equation above:

$$E_{x\sim p}[f(x)] = \int p(x) f(x)\,dx = \int f(x) \frac{p(x)}{q(x)} q(x)\,dx = E_{x\sim q}\!\left[f(x)\frac{p(x)}{q(x)}\right]$$

That means we can estimate the expectation under distribution $p(x)$ by sampling $\{x^i\}$ from another distribution $q(x)$; we only need the rectification term $\frac{p(x)}{q(x)}$ (the importance weight). Now consider our model $\pi_\theta$ as $p(x)$ and $\pi_{\theta'}$ as $q(x)$, and use $q(x)$ to sample data to tune $p(x)$:
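A quick numerical illustration of the idea (not part of the original notes): estimate $E_{x\sim p}[x^2]$ for $p = \mathcal{N}(0,1)$ using only samples from $q = \mathcal{N}(1,1)$, rectified by $p(x)/q(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2
p = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)          # target distribution N(0, 1)
q = lambda x: np.exp(-(x - 1) ** 2 / 2) / np.sqrt(2 * np.pi)    # sampling distribution N(1, 1)

x = rng.normal(loc=1.0, scale=1.0, size=100_000)   # samples come from q only
print(np.mean(f(x) * p(x) / q(x)))                 # ~1.0, i.e. E_{x~p}[x^2]
```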

$$\nabla \overline{R(\theta)} = E_{\tau\sim p_\theta(\tau)}\!\left[R(\tau) \nabla \log p_\theta(\tau)\right] = E_{\tau\sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)} R(\tau) \nabla \log p_\theta(\tau)\right]$$

Then we can use $\theta'$ to sample many times and train $\theta$ many times; only after many iterations do we update $\theta'$. Continuing to transform the equation, now written per state-action pair with the advantage function:

$$E_{(s_t, a_t)\sim \pi_\theta}\!\left[A^\theta(s_t, a_t) \nabla \log p_\theta(a_t^n|s_t^n)\right] = E_{(s_t, a_t)\sim \pi_{\theta'}}\!\left[\frac{P_\theta(s_t, a_t)}{P_{\theta'}(s_t, a_t)} A^{\theta'}(s_t, a_t) \nabla \log p_\theta(a_t^n|s_t^n)\right]$$

Write $P_{\theta'}(s_t, a_t) = P_{\theta'}(a_t|s_t) P_{\theta'}(s_t)$ and $P_\theta(s_t, a_t) = P_\theta(a_t|s_t) P_\theta(s_t)$. If we assume the observation $s$ is not related to the actor $\theta$ (ignoring how actions change the environment), then $P_\theta(s_t) = P_{\theta'}(s_t)$ and the equation becomes:

$$E_{(s_t, a_t)\sim \pi_{\theta'}}\!\left[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)} A^{\theta'}(s_t, a_t) \nabla \log p_\theta(a_t^n|s_t^n)\right]$$

Here we define:

$$J^{\theta'}(\theta) = E_{(s_t, a_t)\sim \pi_{\theta'}}\!\left[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)} A^{\theta'}(s_t, a_t)\right]$$

Note: since we use $\theta'$ to sample data for $\theta$, the distribution of $\theta$ cannot become very different from that of $\theta'$. How do we measure the difference between the two distributions, and keep the update in check when $\theta$ drifts away from $\theta'$? This is where the PPO algorithm comes in.

PPO Algorithm —— Proximal Policy Optimization

PPO is the solution to the question above: it avoids the problem caused by the difference between $\theta$ and $\theta'$. The objective function is:

$$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, KL(\theta, \theta')$$

where $KL(\theta, \theta')$ is the divergence between the action distributions output by policy $\theta$ and policy $\theta'$. The algorithm flow is:

- Initialize the policy parameters $\theta$.
- In each iteration:
  - Use $\theta^k$ to interact with the environment, collect $\{(s_t, a_t)\}$ and compute $A^{\theta^k}(s_t, a_t)$.
  - Update $J_{PPO}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\, KL(\theta, \theta^k)$ several times.
  - If $KL(\theta, \theta^k) > KL_{max}$, the KL term weighs too much in the objective, so increase $\beta$.
  - If $KL(\theta, \theta^k) < KL_{min}$, the KL term weighs too little, so decrease $\beta$ (a sketch of this adaptive update follows below).
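A rough sketch of the adaptive-KL objective in PyTorch. The sample-based KL estimate, the factor of 2 used to adapt $\beta$, and the `kl_min`/`kl_max` values are my own assumptions, not from the notes:

```python
import torch

def ppo_kl_penalty_loss(new_log_probs, old_log_probs, advantages, beta):
    """Loss for J_PPO(theta) = E[(p_theta / p_theta') * A] - beta * KL, to be minimized."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    surrogate = (ratio * advantages).mean()
    approx_kl = (old_log_probs - new_log_probs).mean()   # crude sample-based KL estimate
    return -(surrogate - beta * approx_kl), approx_kl

def adapt_beta(beta, kl, kl_min=0.01, kl_max=0.05):
    """Raise beta when theta drifts too far from theta^k, lower it when too timid."""
    if kl > kl_max:
        beta *= 2.0
    elif kl < kl_min:
        beta /= 2.0
    return beta
```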

Value-based Approach - Learn a Critic

A critic doesn't choose an action (this differs from an actor); it evaluates the performance of a given actor. An actor can then be derived from a critic.

Q-learning

Q-Learning is a classical value-based method. It evaluates the score of an observation under an actor $\pi$; this function is called the state value function $V^\pi(s)$. The score is the total reward accumulated from the current observation to the end of the episode.

How to estimate $V^\pi(s)$?

We know we need the total reward to express the performance of the current actor $\pi_\theta$, but how do we get this value?

Monte-Carlo based approach

Starting from state $S_a$ (observation), the cumulative reward until the end of the episode is $G_a$; starting from state $S_b$, it is $G_b$. So we can estimate the value of an observation $s_a$ under an actor $\pi_\theta$. A low value can be explained by two possibilities:

a) the current observation is bad: even a good actor cannot get a high value from it.

b) the actor has a bad performance.

In many cases we can't enumerate all observations to compute all the rewards $G_i$. The solution is to use a Neural Network to fit the function from observation to value $G$.

Fit the NN with pairs $(S_a, G_a)$, minimizing the difference between the NN output $V^\pi(S_a)$ and the Monte-Carlo reward $G_a$.

Temporal-Difference approach

The MC approach works, but you must wait until the end of an episode to get the total reward, and reaching the end state can take a very long time in some cases. The Temporal-Difference approach addresses this problem.

Given a trajectory $\{..., s_t, a_t, r_t, s_{t+1}, ...\}$, we should have:

$$V^\pi(s_t) = V^\pi(s_{t+1}) + r_t$$

So we can fit the NN by minimizing the difference between $V^\pi(s_t) - V^\pi(s_{t+1})$ and $r_t$.

A practical tip: we are training a single model $V^\pi$, so the two outputs $V^\pi(s_t)$ and $V^\pi(s_{t+1})$ are generated from the same parameter group $\theta$. When we update $\theta$ after one iteration, both $V^\pi(s_t)$ and $V^\pi(s_{t+1})$ change in the next iteration, which makes training unstable.

The tip is: fix a parameter group $\theta'$ to generate $V^\pi(s_{t+1})$, and only update $\theta$ for $V^\pi(s_t)$. After N iterations, set $\theta' = \theta$. The network with fixed parameters is called the Target Network.
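A sketch of this TD update with a frozen target network, assuming PyTorch, a hypothetical 4-dimensional observation, and batched tensors of shape `(batch, 4)` for states and `(batch, 1)` for rewards (no discount factor, to match the equation above):

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
target_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
target_net.load_state_dict(value_net.state_dict())      # theta' starts equal to theta
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def td_update(s_t, r_t, s_next):
    """Fit V(s_t) toward r_t + V_target(s_{t+1}); only theta is updated."""
    with torch.no_grad():
        target = r_t + target_net(s_next)                # frozen target network theta'
    loss = nn.functional.mse_loss(value_net(s_t), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every N updates: target_net.load_state_dict(value_net.state_dict())
```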

MC v.s. TD

- Monte-Carlo has larger variance. This comes from the randomness of $G_a$: $G_a$ is the sum of all rewards $r_t$, each $r_t$ is a random variable, and the sum of these variables has a larger variance. Playing a game N times with the same policy, the reward set $\{G_a, G_b, G_c, ...\}$ has a large variance.
- Temporal-Difference also has a problem: $V^\pi(s_{t+1})$ may be estimated incorrectly (unlike Monte-Carlo, it doesn't accumulate the reward to the end of the episode), so even if $r_t$ is correct, $V^\pi(s_t) - V^\pi(s_{t+1})$ may not be.

In practice, people prefer the TD method.

Q-value approach → $Q^\pi(s, a)$

In the current state (observation), enumerate all valid actions and calculate the Q-value of each one.

Note: in the current state we force the actor to take a specific action in order to compute that action's value, but in the following steps actions are chosen according to the actor $\pi_\theta$ until the end of the episode.

Use Q-value to learn an actor

We can learn an actor $\pi$ with the Q-value function. The loop is: use $\pi$ to interact with the environment, learn $Q^\pi(s, a)$ from the collected data, find a new policy $\pi'$ that is "better" than $\pi$, replace $\pi$ with $\pi'$, and repeat.

The question is: how do we guarantee that $\pi'$ is better than $\pi$?

If $\pi'$ is better than $\pi$, then:

$$V^{\pi'}(s_i) \geqslant V^{\pi}(s_i), \qquad \forall s_i \in S$$

We can compute $\pi'$ from $\pi$ with the equation below:

$$\pi'(s) = \arg\max_a Q^\pi(s, a)$$

Note: this approach is not suitable for continuous actions, only for discrete actions.

But if we always choose the best action according to $Q^\pi$, we will never discover other, possibly better, actions. So we use Exploration methods when choosing the action to take.

Epsilon Greedy

Set a probability $\varepsilon$: take the max-Q-value action or a random action, as shown below. Typically, $\varepsilon$ decreases over time.

$$a = \begin{cases} \arg\max_a Q(s, a) & \text{with probability } 1-\varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}$$
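A small helper for this rule (illustrative only):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon take a random action, otherwise the argmax action."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```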

Boltzmann Exploration

Since $Q^\pi$ is a Neural Network, we can convert its Q-values into a probability for each action and use this probability to decide which action to take, as shown below:

$$P(a_i|s) = \frac{\exp(Q(s, a_i))}{\sum_a \exp(Q(s, a))}$$

Q-values may be negative, so we apply the exponential function to make them positive.
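A matching sketch for Boltzmann exploration; subtracting the maximum is only for numerical stability and does not change the probabilities:

```python
import numpy as np

def boltzmann_action(q_values, rng=None):
    """Sample an action with probability proportional to exp(Q(s, a))."""
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    probs = np.exp(q - q.max())
    probs /= probs.sum()
    return int(rng.choice(len(q), p=probs))
```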

Replay Buffer

A replay buffer stores a large amount of experience data. When you train your Q-Network, you randomly choose a batch from the buffer to fit it.

An experience is a tuple $\{s_t, a_t, r_t, s_{t+1}\}$.

The experience in the buffer may come from different policies $\{\pi_{\theta_1}, \pi_{\theta_2}, \pi_{\theta_3}, ...\}$.

Drop the oldest experience when the buffer is full (see the sketch below).
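A minimal replay buffer sketch with exactly these three behaviors: store tuples, sample a random batch, and drop the oldest entries when full:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of (s_t, a_t, r_t, s_{t+1}) experience tuples."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is dropped automatically

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```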

Typical Q-Learning Algorithm

Here is the main algorithm flow of Q-learning:

- Initialize the Q-function $Q$; initialize the target Q-function $\hat{Q} = Q$.
- In each episode, for each step $t$:
  - Given state $s_t$, take an action $a_t$ based on $Q$ ($\varepsilon$-greedy exploration).
  - Obtain the reward $r_t$ and the next state $s_{t+1}$.
  - Store the experience $\{s_t, a_t, r_t, s_{t+1}\}$ in the replay buffer.
  - Sample a batch of experience $\{(s_i, a_i, r_i, s_{i+1}), (s_j, a_j, r_j, s_{j+1}), ...\}$ from the buffer.
  - Compute the target $y = r_i + \max_a \hat{Q}(s_{i+1}, a)$.
  - Update the parameters of $Q$ to make $Q(s_i, a_i)$ close to $y$.
  - Every N steps, set $\hat{Q} = Q$.
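The target computation above, as a sketch. Note the discount factor `gamma`, which the formula in the list omits but is normally included:

```python
import torch

def dqn_targets(rewards, next_states, q_target_net, gamma=0.99):
    """y_i = r_i + gamma * max_a Q_hat(s_{i+1}, a), using the frozen target network."""
    with torch.no_grad():
        next_q = q_target_net(next_states).max(dim=1).values
    return rewards + gamma * next_q
```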

Double DQN

Double DQN is designed to solve a problem of DQN, described below:

Q-values are always over-estimated in DQN training. (In the original figure, the orange curve is the Q-value estimated by the DQN network and the blue curve the one estimated by Double DQN, while the orange and blue lines are their real cumulative rewards. The estimated curves sit above the real lines, and the blue line sits above the orange line, which means Double DQN achieves a greater true value than DQN.)

Why does DQN always over-estimate the Q-value?

Because when we compute the target $y = r_t + \max_a Q^\pi(s_{t+1}, a)$, we always choose the best action and thus the highest Q-value. This tends to over-estimate the target, so the real cumulative reward can be lower than the target value. Since the Q function is trained to approach this target, the output of the Q-Network ends up higher than the actual cumulative reward.

$$Q(s_t, a_t) \qquad \Longleftrightarrow \qquad r_t + \max_a Q(s_{t+1}, a)$$

Double DQN resolution

To avoid the above problem, we use two Q-Networks in training: one is in charge of choosing the best action, the other estimates its Q-value.

$$Q(s_t, a_t) \qquad \Longleftrightarrow \qquad r_t + Q'(s_{t+1}, \arg\max_a Q(s_{t+1}, a))$$

Here $Q$ selects the best action in each state, while $Q'$ estimates the Q-value of that action. This has two advantages:

- If $Q$ over-estimates the Q-value of action $a$, the action may still be selected, but its final value won't be over-estimated, because it is evaluated by $Q'$.
- If $Q'$ over-estimates some action $a$, it is also safe, because $Q$ won't select $a$ ($a$ is not the best action under policy $Q$).

In the DQN algorithm we already have two networks: the online network $\theta$ and the target network $\theta'$ (which is kept fixed). So we use the online network $\theta$ to select the action and the target network $\theta'$ to estimate the Q-value.
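A sketch of the Double DQN target matching the discussion above: the online network selects the action, the target network evaluates it (`gamma` again assumed):

```python
import torch

def double_dqn_targets(rewards, next_states, q_net, q_target_net, gamma=0.99):
    """y = r + gamma * Q'(s_{t+1}, argmax_a Q(s_{t+1}, a))."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)          # selection: Q
        next_q = q_target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation: Q'
    return rewards + gamma * next_q
```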

Other Advanced Structure of Q-Learning

Dueling DQN

Split the output into two parts, $Q^\pi(s_t, a_t) = V(s_t) + A(s_t, a_t)$, which means the final Q-value is the sum of a state value and an action advantage.
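A sketch of such an output head. The mean-centring of the advantage stream is common practice to keep the V/A split identifiable; it is a detail not covered in the note above:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s, a) = V(s) + A(s, a), computed from a shared feature vector."""
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)              # V(s_t)
        self.advantage = nn.Linear(feat_dim, n_actions)  # A(s_t, a_t)

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)       # centre A so the split is unique
```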

Prioritized Replay

When we sample a batch of experience from the replay buffer, we don't select uniformly at random. Prioritized Replay marks the experience that had a high loss in the last iteration and increases its probability of being selected in the next batch.

Multi-Step

Change the experience format in the replay buffer: instead of storing one step $\{s_t, a_t, r_t, s_{t+1}\}$, store N steps $\{s_t, a_t, r_t, s_{t+1}, ..., s_{t+N}, a_{t+N}, r_{t+N}, s_{t+N+1}\}$.

Noise Net

This method is used to explore more actions: add some noise to the current network $Q$ at the beginning of each episode.

(Figure in the original notes: a comparison of these different Q-learning variants.)

A3C Method - Asynchronous Advantage Actor-Critic

Why do we need the A3C method? It is designed to address the variance problem of Policy Gradient. In the Policy Gradient method, even if we visit the same state $s_t$ and take the same action $a_t$ N times, we may get very different total rewards $G$. This is because of the randomness in the cumulative reward in the equation below:

$$\nabla \overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N \sum_{t=1}^T \left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right) \nabla \log P(a_t^n|s_t^n, \theta)$$

The term $\left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right)$ can vary a lot, because each $r_t$ is a random variable with large variance.

Unless we sample enough times to cover all possible rewards, the model will not be stable, and sampling that much is hard to do. If we can replace $\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n$ with its expectation, we solve this problem.

Advantage Actor-Critic (A2C Method)

We have already introduced the value-based method. The definition of $Q^{\pi_\theta}(s_t, a_t)$ is the expected total reward of taking action $a_t$ in the current state $s_t$; the definition of $V^{\pi_\theta}(s_t)$ is the expected reward of the current state $s_t$ (just the state value, without specifying which action to take). Now we change the equation to:

$$\nabla \overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N \sum_{t=1}^T \left(Q^{\pi_\theta}(s_t^n, a_t^n) - V^{\pi_\theta}(s_t^n)\right) \nabla \log P(a_t^n|s_t^n, \theta)$$

Note: here the state value replaces the baseline $b$.

We can infer the value of $Q^{\pi_\theta}(s_t, a_t)$ from $V^{\pi_\theta}$:

$$Q^{\pi}(s_t, a_t) = E\!\left[r_t + V^\pi(s_{t+1})\right] \quad \rightarrow \quad r_t + V^\pi(s_{t+1})$$

We should use the expectation because $r_t$ is a random variable, but it is hard to compute, so we drop the expectation. The equation becomes:

$$\nabla \overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N \sum_{t=1}^T \left(r_t + V^{\pi_\theta}(s_{t+1}^n) - V^{\pi_\theta}(s_t^n)\right) \nabla \log P_\theta(a_t^n|s_t^n)$$
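A sketch of this update as actor and critic losses, assuming PyTorch tensors for one batch of transitions and adding the usual discount factor `gamma` (the formula above omits it):

```python
import torch

def a2c_losses(log_probs, values, next_values, rewards, gamma=0.99):
    """Advantage = r_t + gamma * V(s_{t+1}) - V(s_t); returns (actor_loss, critic_loss)."""
    with torch.no_grad():
        targets = rewards + gamma * next_values
    advantages = targets - values
    actor_loss = -(log_probs * advantages.detach()).mean()       # policy gradient term
    critic_loss = torch.nn.functional.mse_loss(values, targets)  # fit V toward the TD target
    return actor_loss, critic_loss
```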

The algorithm flow of the Advantage Actor-Critic method: the actor $\pi$ interacts with the environment to collect data, a critic $V^\pi(s)$ is learned from that data (by TD or MC), and the actor is then updated with the advantage-weighted policy gradient above.

Tips

- There are two networks to train in this algorithm: the actor $\pi_\theta$ and the critic $V^\pi(s)$. Both accept the same input $s$ and differ only in their output: a scalar $V(s)$ for the critic network and a probability distribution $P(a|s)$ for the actor network. So the two networks can share the first few layers.
- Use the output entropy as a regularizer for $\pi(s)$; this makes the action probabilities more even, so the model does more exploration.

Asynchronous Advantage Actor-Critic (A3C Method)

A3C is designed to speed up A2C. It maintains one global network and creates N workers; each worker interacts with its own environment, computes gradients, and updates the global network:

1. Copy the global parameters $\theta_1$.
2. Sample some data.
3. Compute gradients.
4. Update the global model.

Note: all workers run in parallel, so by the time $\nabla\theta_1$ has been computed and sent back to the global model, $\theta$ may already have changed (updated by another worker, so it may no longer equal $\theta_1$). We still use $\nabla\theta_1$ to update the current parameters: $\theta \leftarrow \theta + \eta \nabla\theta_1$.

Sparse Reward

In reinforcement learning, the reward is how the agent knows which actions are good. But only a few actions obtain a positive reward (e.g. only firing and destroying an enemy gives a positive reward in the Spaceship game); most actions get no reward at all (e.g. moving left or right). This phenomenon is called Sparse Reward.

Reward Shaping

Typically, few states give a positive reward during training, so we can create some extra rewards to guide the agent toward certain actions in certain states. For example, if we want to train a plane agent to destroy an enemy plane, the true reward comes from "fire and destroy the enemy", but at the start of the game our plane doesn't know how to find the enemy, so we can add an extra positive reward whenever our plane flies toward the enemy plane.

Note: this method needs domain knowledge to design which behaviors deserve a positive reward and how much reward should be assigned.

Curriculum Learning

Typically, a hard task can be split into many simple tasks. Curriculum Learning starts from simple training examples and then makes them harder and harder.

The most common technique is Reverse Curriculum Learning, explained below:

- Given a goal state $s_g$.
- Sample some states $\{s_1, s_2, ...\}$ close to $s_g$.
- Each state has a reward for reaching the goal state $s_g$; compute $\{R(s_1), R(s_2), ...\}$.
- Delete the states whose reward is too large (reaching the goal from there is too easy) or too small (too difficult).
- Sample more states near the remaining $\{s_1, s_2, ...\}$ and repeat.

Hierarchical Reinforcement Learning

An entire model can be split into different hierarchies: the top-level model only gives high-level orders, and the low-level model chooses the actual actions. For example, to train a plane agent, the top-level model only gives the next target waypoint, while the low-level model controls the plane (turn left or turn right) to fly to that target.

Here is a game example: the blue point is the agent, which is asked to reach the yellow point. The pink point is a temporary target given by the high-level model, while the low-level model follows this instruction and controls the agent to reach the pink point.

ICM —— Intrinsic Curiosity Module

The ICM algorithm lets the model do more exploration. It adds an extra reward function $r^i_t$ which accepts three parameters $(s_t, a_t, s_{t+1})$. The network then needs to maximize the total value $\sum_{t=1}^N (r^i_t + r_t)$.

Now let's see how ICM calculates the reward $r^i_t$ at each step $t$.

There are two networks in the ICM module:

- Network 1 predicts the next state $s_{t+1}$ after taking action $a_t$; if $s_{t+1}$ is hard to predict, the intrinsic reward $r^i$ is large.
- Network 2 deals with the fact that many features are unrelated to the actions (e.g. the position of the sun in the Spaceship War game). If we simply maximized the reward of hard-to-predict states, the agent would stand still and watch the sun move. So we need to extract the features that are useful for choosing actions; that is the job of Network 2, which predicts the action $a_t$ from $s_t$ and $s_{t+1}$.
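A compact sketch of the two ICM networks, assuming PyTorch and hypothetical layer sizes: the forward-model prediction error serves as the intrinsic reward, and the inverse model keeps the feature encoder focused on action-relevant features.

```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    def __init__(self, obs_dim, n_actions, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)  # Network 1
        self.inverse_model = nn.Linear(feat_dim * 2, n_actions)         # Network 2
        self.n_actions = n_actions

    def intrinsic_reward(self, s_t, a_t, s_next):
        """r^i_t: error of predicting phi(s_{t+1}) from (phi(s_t), a_t); hard to predict -> big reward."""
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        a_onehot = nn.functional.one_hot(a_t, self.n_actions).float()
        phi_pred = self.forward_model(torch.cat([phi_t, a_onehot], dim=-1))
        return (phi_pred - phi_next).pow(2).mean(dim=-1)

    def inverse_loss(self, s_t, a_t, s_next):
        """Predict a_t from (phi(s_t), phi(s_{t+1})) so phi keeps only action-relevant features."""
        phi = torch.cat([self.encoder(s_t), self.encoder(s_next)], dim=-1)
        return nn.functional.cross_entropy(self.inverse_model(phi), a_t)
```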
