Reinforcement Learning with Code 【Chapter 10. Actor Critic】

This note records how the author began learning RL. It presents both theoretical understanding and code practice. Many materials are referenced, such as Zhao Shiyu's Mathematical Foundations of Reinforcement Learning.
The code follows Mofan's reinforcement learning course.

Contents

  • Reinforcement Learning with Code 【Chapter 10. Actor Critic】
      • 10.1 The simplest actor-critic algorithm (QAC)
      • 10.2 Advantage Actor-Critic (A2C)
      • 10.3 Off-policy Actor-Critic
    • Reference

10.1 The simplest actor-critic algorithm (QAC)

Recall that the idea of the policy gradient method is to search for an optimal policy by maximizing a scalar metric $J(\theta)$. The metric has three options: the average state value $\mathbb{E}[v_\pi(S)]$, the average one-step reward $\mathbb{E}[r_\pi(S)]$, or the state value of a specific starting state $s_0$.

According to the policy gradient theorem in Chapter 9, the gradient-ascent update is

$$
\begin{aligned} \theta_{t+1} & = \theta_t + \alpha \nabla_\theta J(\theta_t) \\ & = \theta_t + \alpha \mathbb{E}_{S\sim\eta, A\sim\pi} [\nabla_\theta \ln \pi(A|S;\theta_t)\,q_\pi(S,A)] \end{aligned}
$$

where $\eta$ is a distribution of the states. Since the true gradient is unknown, we can use a stochastic gradient to approximate it, hence we have

$$
\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\,q_t(s_t,a_t)
$$

  • In policy gradient methods such as REINFORCE, we use the Monte-Carlo idea to approximate the true value: $q_t(s_t,a_t)$ is approximated by the discounted return of an episode, that is, $q_t(s_t,a_t)=\sum_{k=t+1}^T \gamma^{k-t-1}r_k$.
  • If $q_t(s_t,a_t)$ is instead estimated by value function approximation, and the value function is updated with TD learning, the corresponding algorithms are usually called actor-critic. Actor-critic methods can therefore be seen as a kind of policy gradient method.
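As a small illustration of the Monte-Carlo estimate used by REINFORCE, the sketch below computes the discounted return for every step of one episode. The function name `discounted_returns` and the reward list are made up for illustration; they are not taken from the referenced course code.

```python
def discounted_returns(rewards, gamma):
    """Monte-Carlo estimates q_t(s_t, a_t) for one episode.

    rewards[t] is r_{t+1}, the reward received after the action at step t.
    Returns a list g with g[t] = sum_{k=t+1}^{T} gamma^(k-t-1) * r_k.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each return reuses the one after it:
    # g[t] = r_{t+1} + gamma * g[t+1].
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


episode_rewards = [0.0, 0.0, 1.0]  # hypothetical episode with a final reward
g = discounted_returns(episode_rewards, gamma=0.5)
print(g)
```

An actor-critic method replaces these full-episode returns with a learned value function, so it can update after every step instead of waiting for the episode to end.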

When we use a parameterized value function $q(s,a;w)$ to approximate $q_t(s_t,a_t)$, and the value function is updated by Sarsa (a TD learning method), the algorithm is called Q actor-critic (QAC). The core idea of QAC is

$$
\text{QAC:} \left\{ \begin{aligned} \text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\, q(s_t,a_t;w_t) \\ \text{Critic}: w_{t+1} & = w_t + \alpha_w [r_{t+1}+\gamma q(s_{t+1},a_{t+1};w_t) - q(s_t,a_t;w_t)]\nabla_w q(s_t,a_t;w_t) \end{aligned} \right.
$$

The actor uses value function approximation in place of the true value $q_t(s_t,a_t)$, while the critic updates the value function with Sarsa.

​ We can write the objective function of the update rule

$$
\text{QAC:} \left\{ \begin{aligned} \textcolor{red}{\text{Actor}: \max_\theta J(\theta)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta, A\sim\pi}[\ln\pi(A|S;\theta)\,q(S,A;w_t)]} \\ \textcolor{red}{\text{Critic}: \min_w J(w)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta, A\sim\pi}[(R + \gamma q(S^\prime,A^\prime;w_t) - q(S,A;w))^2]} \end{aligned} \right.
$$

where $A^\prime$ is the action taken in the next state $S^\prime$, and the TD target $R + \gamma q(S^\prime,A^\prime;w_t)$ is treated as a constant when differentiating with respect to $w$.

Pseudocode

[Image: pseudocode of the QAC algorithm]
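A single QAC update can be sketched in plain Python as follows. This is a minimal illustration, not the referenced course code: it assumes a tabular softmax actor (`theta[s][a]` are action preferences) and a tabular critic (`q[s][a]`), so $\nabla_w q(s,a;w)$ is simply an indicator on the entry $(s,a)$. The names `softmax` and `qac_step`, the default step sizes, and the numbers in the example are all hypothetical.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def qac_step(theta, q, s, a, r, s_next, a_next,
             alpha_theta=0.1, alpha_w=0.5, gamma=0.9):
    """Apply one QAC update in place and return the Sarsa TD error."""
    pi = softmax(theta[s])
    q_sa = q[s][a]  # critic value under w_t, used by both updates

    # Actor: theta <- theta + alpha_theta * grad ln pi(a|s) * q(s, a).
    # For a softmax policy, d ln pi(a|s) / d theta[s][b] = 1{b == a} - pi(b|s).
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_theta * grad_log_pi * q_sa

    # Critic (Sarsa): w <- w + alpha_w * TD error * grad q(s, a).
    td_error = r + gamma * q[s_next][a_next] - q_sa
    q[s][a] += alpha_w * td_error
    return td_error


# Hypothetical two-state, two-action example.
theta = [[0.0, 0.0], [0.0, 0.0]]
q = [[1.0, 0.0], [0.0, 2.0]]
td = qac_step(theta, q, s=0, a=0, r=1.0, s_next=1, a_next=1)
print(theta[0], q[0][0], td)
```

In a full training loop, `a_next` would be sampled from `softmax(theta[s_next])` before calling `qac_step`, since Sarsa is on-policy. Terminal states are not handled in this sketch.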

10.2 Advantage Actor-Critic (A2C)

The core idea of A2C is to introduce a baseline to reduce the estimation variance. That is,

$$
\mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)\,q_\pi(S,A)] = \mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)(q_\pi(S,A)-b(S))]
$$

where the additional baseline $b(S)$ is a scalar function of $S$. Adding a baseline does not change the expectation, because

$$
\begin{aligned} \mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)\,b(S)] & = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\pi(a|s;\theta_t) \nabla_\theta \ln \pi(a|s;\theta_t)\,b(s) \\ & = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t)\,b(s) \\ & = \sum_{s\in\mathcal{S}}\eta(s)\,b(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t) \\ & = \sum_{s\in\mathcal{S}}\eta(s)\,b(s)\,\nabla_\theta\sum_{a\in\mathcal{A}}\pi(a|s;\theta_t) \\ & = \sum_{s\in\mathcal{S}}\eta(s)\,b(s)\,\nabla_\theta 1 = 0 \end{aligned}
$$
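The zero-expectation property can be checked numerically for a single state. The sketch below assumes a softmax policy over three actions with made-up preferences and an arbitrary baseline value; `softmax` and `grad_log_pi` are illustrative helper names.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(prefs, a):
    """Gradient of ln pi(a|s) w.r.t. the preferences of a softmax policy."""
    pi = softmax(prefs)
    return [(1.0 if j == a else 0.0) - pi[j] for j in range(len(prefs))]


prefs = [0.2, -1.0, 0.7]  # hypothetical theta for one state s
baseline = 3.5            # any scalar b(s)

# E_{A~pi}[ grad ln pi(A|s) * b(s) ], computed exactly by summing over actions.
pi = softmax(prefs)
expectation = [0.0] * len(prefs)
for a, prob in enumerate(pi):
    g = grad_log_pi(prefs, a)
    for j in range(len(prefs)):
        expectation[j] += prob * g[j] * baseline

print(expectation)  # every component is zero up to floating-point rounding
```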

How do we find the optimal baseline? The derivation is omitted here. The baseline that minimizes the variance is

$$
b^*(s) = \frac{\mathbb{E}_{A\sim\pi}[\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2\, q_\pi(s,A)]}{\mathbb{E}_{A\sim\pi}[\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2]}
$$

But it is too complex to use in practice. If the weight $\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2$ is removed, we obtain a suboptimal baseline with a concise expression:

$$
\textcolor{red}{b^\dagger (s) = \mathbb{E}_{A\sim\pi}[q_\pi(s,A)] = v_\pi(s)}
$$

That is, the suboptimal baseline is the state value of state $s$.
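To make the comparison between baselines concrete, the following sketch evaluates, for one state, the second moment $\mathbb{E}_{A\sim\pi}[\|\nabla_\theta \ln\pi(A|s)\|^2 (q_\pi(s,A)-b)^2]$ of the single-sample gradient estimate under three choices of $b$: no baseline, $v_\pi(s)$, and $b^*(s)$. Because the mean of the estimate does not depend on $b$, a smaller second moment means a smaller variance. The softmax policy, the preferences, and the action values are made-up numbers, and the helper names are hypothetical.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def squared_grad_norms(prefs):
    """||grad ln pi(a|s)||^2 for each action a of a softmax policy."""
    pi = softmax(prefs)
    norms = []
    for a in range(len(prefs)):
        grad = [(1.0 if j == a else 0.0) - pi[j] for j in range(len(prefs))]
        norms.append(sum(g * g for g in grad))
    return norms


def second_moment(prefs, q, b):
    """E_{A~pi}[ ||grad ln pi(A|s)||^2 * (q(s,A) - b)^2 ]."""
    pi = softmax(prefs)
    norms = squared_grad_norms(prefs)
    return sum(p * n * (qa - b) ** 2 for p, n, qa in zip(pi, norms, q))


def optimal_baseline(prefs, q):
    """b*(s): the q-values averaged with weights pi(a|s) * ||grad ln pi||^2."""
    pi = softmax(prefs)
    norms = squared_grad_norms(prefs)
    num = sum(p * n * qa for p, n, qa in zip(pi, norms, q))
    den = sum(p * n for p, n in zip(pi, norms))
    return num / den


prefs = [0.2, -1.0, 0.7]    # hypothetical theta for one state
q_values = [1.0, 2.0, 4.0]  # hypothetical q_pi(s, a)
pi = softmax(prefs)
v = sum(p * qa for p, qa in zip(pi, q_values))  # suboptimal baseline v_pi(s)
b_star = optimal_baseline(prefs, q_values)

m_none = second_moment(prefs, q_values, 0.0)
m_v = second_moment(prefs, q_values, v)
m_star = second_moment(prefs, q_values, b_star)
print(m_none, m_v, m_star)
```

By construction `m_star` is the smallest of the three. How `m_v` compares with `m_none` depends on the numbers chosen.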

When $b(s)=v_\pi(s)$, the gradient-ascent algorithm becomes

$$
\begin{aligned} \theta_{t+1} & = \theta_t + \alpha\mathbb{E}[\nabla_\theta \ln\pi(A|S;\theta_t)[q_\pi(S,A)-v_\pi(S)]] \\ & = \theta_t + \alpha\mathbb{E}[\nabla_\theta\ln\pi(A|S;\theta_t)\, \delta_\pi(S,A)] \end{aligned}
$$

Here,

$$
\textcolor{red}{\delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S)}
$$

is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a|s)q_\pi(s,a)$ is the mean of the action values. If $\delta_\pi(s,a)>0$, the corresponding action has a greater value than the mean.
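A tiny numeric illustration of the advantage function, using made-up action values and a made-up policy for a single state:

```python
# Hypothetical q_pi(s, a) and pi(a|s) for one state with three actions.
q_values = [1.0, 2.0, 4.0]
policy = [0.5, 0.25, 0.25]

# v_pi(s) is the policy-weighted mean of the action values.
v = sum(p * q for p, q in zip(policy, q_values))

# Advantage of each action over that mean.
advantages = [q - v for q in q_values]

# The advantage itself averages to zero under the policy.
mean_advantage = sum(p * d for p, d in zip(policy, advantages))

print(v, advantages, mean_advantage)
```

Here the third action has a positive advantage, so a policy gradient step weighted by the advantage would raise its probability, while the first action would be made less likely.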

The stochastic version is

$$
\begin{aligned} \theta_{t+1} & = \theta_t + \alpha\nabla_\theta \ln\pi(a_t|s_t;\theta_t)[q_t(s_t,a_t)-v_t(s_t)] \\ & = \theta_t + \alpha\nabla_\theta\ln\pi(a_t|s_t;\theta_t)\, \delta_t(s_t,a_t) \end{aligned}
$$

We need to estimate $q_t(s_t,a_t)$ and $v_t(s_t)$. There are several ways:

  • If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by Monte-Carlo learning, the algorithm is called REINFORCE with a baseline.
  • If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C).

In the TD case, the advantage is approximated by the TD error:

$$
\begin{aligned} q_t(s_t,a_t) - v_t(s_t) & = r_{t+1} +\gamma q_t(s_{t+1},a_{t+1}) - v_t(s_t) \\ & \textcolor{red}{\approx r_{t+1} +\gamma v_t(s_{t+1}) - v_t(s_t)} \end{aligned}
$$

Hence, we do not need to maintain two networks for $v_\pi(s)$ and $q_\pi(s,a)$; one network representing $v_\pi(s)$ is enough.

In A2C we use one policy network $\pi(a|s;\theta)$ and one state value network $v(s;w)$. The core idea of A2C is

$$
\text{A2C}: \left \{ \begin{aligned} \text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\ \text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \textcolor{blue}{\delta_t}\nabla_\theta \ln\pi(a_t|s_t;\theta_t) \\ \text{Critic}: w_{t+1} & = w_t + \alpha_w \textcolor{blue}{\delta_t} \nabla_w v(s_t;w_t) \end{aligned} \right.
$$

We can write the objective functions behind the update rules:

$$
\text{A2C}: \left\{ \begin{aligned} \text{Advantage:}\ \Delta(S) & = R+\gamma v(S^\prime;w_t) - v(S;w_t) \\ \textcolor{red}{\text{Actor}: \max_\theta J(\theta)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta, A\sim\pi}[\ln\pi(A|S;\theta)\,\Delta(S)]} \\ \textcolor{red}{\text{Critic}: \min_w J(w)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta}[(R + \gamma v(S^\prime;w_t) - v(S;w))^2]} \end{aligned} \right.
$$

Here $R$ and $S^\prime$ are the reward and next state observed after taking $A$ in $S$, so $\Delta(S)$ also depends on $A$. Both $\Delta(S)$ and the TD target $R + \gamma v(S^\prime;w_t)$ are treated as constants when differentiating.

Pseudocode

[Image: pseudocode of the A2C algorithm]
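One A2C update can be sketched in plain Python as below. As a simplifying assumption, the policy "network" is a tabular softmax over preferences `theta[s][a]` and the value "network" is a table `v[s]`, so the gradients have closed forms. The names `softmax` and `a2c_step`, the step sizes, and the example transition are hypothetical.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_step(theta, v, s, a, r, s_next,
             alpha_theta=0.1, alpha_w=0.5, gamma=0.5):
    """Apply one A2C update in place and return the TD error delta_t."""
    # Advantage estimate: delta_t = r_{t+1} + gamma * v(s_{t+1}) - v(s_t).
    delta = r + gamma * v[s_next] - v[s]

    # Actor: theta <- theta + alpha_theta * delta_t * grad ln pi(a_t|s_t).
    pi = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_theta * delta * grad_log_pi

    # Critic: w <- w + alpha_w * delta_t * grad v(s_t); tabular, so only v[s] moves.
    v[s] += alpha_w * delta
    return delta


# Hypothetical transition: state 0, action 1, reward 0.5, next state 1.
theta = [[0.0, 0.0], [0.0, 0.0]]
v = [1.0, 2.0]
delta = a2c_step(theta, v, s=0, a=1, r=0.5, s_next=1)
print(delta, theta[0], v[0])
```

The same `delta` drives both updates, which is why only one value table (or network) is needed. Terminal states are not handled in this sketch.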

10.3 Off-policy Actor-Critic

Importance Sampling

The key technique for converting the AC algorithm to off-policy is importance sampling. Consider a random variable $X\in\mathcal{X}$ and suppose that $p_0(X)$ is a probability distribution. Our goal is to estimate $\mathbb{E}_{X\sim p_0}[X]$. Suppose we also know another probability distribution $p_1(X)$ of $X$. How can we use samples drawn from $p_1(X)$ to estimate $\mathbb{E}_{X\sim p_0}[X]$? The technique is importance sampling. Suppose we have i.i.d. samples $\{x_i\}^n_{i=1}$ generated by $p_1(X)$, and that $p_1(x)>0$ wherever $p_0(x)>0$. Then

$$
\begin{aligned} \mathbb{E}_{X\sim p_0}[X] & = \sum_{x\in\mathcal{X}}p_0(x)x = \sum_{x\in\mathcal{X}}p_1(x)\underbrace{\frac{p_0(x)}{p_1(x)}x}_{f(x)} = \mathbb{E}_{X\sim p_1}[f(X)] \\ \mathbb{E}_{X\sim p_0}[X] & = \mathbb{E}_{X\sim p_1}[f(X)] \approx \bar{f} = \frac{1}{n} \sum^n_{i=1}f(x_i) = \frac{1}{n} \sum^n_{i=1} \underbrace{\frac{p_0(x_i)}{p_1(x_i)}}_{\text{importance weight}}x_i \end{aligned}
$$
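The estimator above can be written as a short reusable function. This is a minimal sketch with hypothetical names (`importance_sampling_mean`, and dictionaries mapping each value to its probability). To keep the result deterministic, the sample is hand-picked so that its empirical frequencies match $p_1$ exactly.

```python
def importance_sampling_mean(samples, p0, p1):
    """Estimate E_{X~p0}[X] from samples that were drawn from p1.

    p0 and p1 map each possible value x to its probability.
    Each sample is reweighted by the importance weight p0(x) / p1(x).
    """
    return sum(p0[x] / p1[x] * x for x in samples) / len(samples)


p0 = {1: 0.25, 2: 0.75}  # hypothetical target distribution
p1 = {1: 0.5, 2: 0.5}    # hypothetical sampling distribution
samples = [1, 2, 2, 1]   # hand-picked: frequencies match p1 exactly

plain_mean = sum(samples) / len(samples)          # estimates E_{p1}[X]
estimate = importance_sampling_mean(samples, p0, p1)  # estimates E_{p0}[X]
true_mean = sum(p * x for x, p in p0.items())

print(plain_mean, estimate, true_mean)
```

The plain average recovers the mean under $p_1$, while the reweighted average recovers the mean under $p_0$, even though no sample was drawn from $p_0$.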
An Example

Consider $X\in\mathcal{X}=\{+1,-1\}$. Suppose $p_0$ is a probability distribution satisfying

$$
p_0(X=+1)=0.5, \quad p_0(X=-1)=0.5
$$

The expectation of $X$ over $p_0$ is

$$
\mathbb{E}_{X\sim p_0}[X] = (+1)\times 0.5 + (-1) \times 0.5 = 0
$$

Suppose $p_1$ is another probability distribution satisfying

$$
p_1(X=+1)=0.8, \quad p_1(X=-1)=0.2
$$

The expectation of $X$ over $p_1$ is

$$
\mathbb{E}_{X\sim p_1}[X] = (+1)\times 0.8 + (-1) \times 0.2 = 0.6
$$

We can use importance sampling to estimate $\mathbb{E}_{X\sim p_0}[X]$ from data sampled under $p_1$:

$$
\mathbb{E}_{X\sim p_0}[X] \approx \frac{1}{n}\sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)}x_i
$$

import numpy as np
import matplotlib.pyplot as plt

# Reproducible results
np.random.seed(0)

# Elements of X and their probabilities
elements = [1, -1]
probs1 = [0.5, 0.5]  # target distribution p0
probs2 = [0.8, 0.2]  # sampling distribution p1

# Importance sampling
sample_times = 300
sample_list = []      # raw samples drawn from p1
i_sample_list = []    # importance-weighted samples (p0 / p1) * x
average_list = []     # running plain average of the samples
importance_list = []  # running average of the weighted samples
for i in range(sample_times):
    sample = np.random.choice(elements, p=probs2)
    sample_list.append(sample)
    average_list.append(np.mean(sample_list))
    if sample == elements[0]:
        i_sample_list.append(probs1[0] / probs2[0] * sample)
    elif sample == elements[1]:
        i_sample_list.append(probs1[1] / probs2[1] * sample)
    importance_list.append(np.mean(i_sample_list))

plt.plot(range(len(sample_list)), sample_list, 'o', markerfacecolor='none', label='sample data')
plt.plot(range(len(average_list)), average_list, 'b--', label='average')
plt.plot(range(len(importance_list)), importance_list, 'g--', label='importance sampling')
plt.axhline(y=0.6, color='r', linestyle='--')  # E[X] under p1
plt.axhline(y=0, color='r', linestyle='--')    # E[X] under p0
plt.ylim(-1.5, 2.5)        # limit the y-axis range
plt.xlim(0, sample_times)  # limit the x-axis range
plt.legend(loc='upper right')
plt.show()
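[Image: plot produced by the code above, showing the sampled data, the plain running average, and the importance-sampling estimate]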

Off-policy policy gradient theorem

With importance sampling, we are ready to present the off-policy policy gradient theorem. Suppose that $\beta$ is a behavior policy. Our goal is to use the samples generated by the behavior policy $\beta$ to learn a target policy $\pi$ that maximizes the following metric:

$$
\max_\theta J(\theta) = \mathbb{E}_{S\sim d_\beta}[v_\pi(S)]
$$

where $d_\beta$ is the stationary state distribution under $\beta$.

Theorem 10.1 (Stochastic off-policy policy gradient theorem). In the discounted case where $\gamma\in(0,1)$, the gradient of $J(\theta)$ is

$$
\textcolor{red}{\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho, A\sim\beta}\Big[\underbrace{\frac{\pi(A|S;\theta)}{\beta(A|S)}}_{\text{importance weight}} \nabla_\theta \ln \pi(A|S;\theta)\, q_\pi(S,A) \Big]}
$$

where the state distribution $\rho$ is

$$
\rho(s) \triangleq \sum_{s^\prime\in\mathcal{S}} d_\beta(s^\prime) \Pr\nolimits_\pi(s|s^\prime)
$$

where $\Pr_\pi(s|s^\prime)=\sum_{k=0}^\infty \gamma^k[P^k_\pi]_{s^\prime,s}=[(I-\gamma P_\pi)^{-1}]_{s^\prime,s}$ is the discounted total probability of transitioning from $s^\prime$ to $s$ under policy $\pi$.
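The quantity $\Pr_\pi(s|s^\prime)$ and the resulting $\rho$ can be computed for a small example. The sketch below uses only the standard library and approximates $(I-\gamma P_\pi)^{-1}$ by truncating the series $\sum_k \gamma^k P_\pi^k$. The two-state transition matrix, the distribution `d_beta`, and the helper names are made-up for illustration.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def discounted_transition_sum(p_pi, gamma, n_terms=2000):
    """Approximate sum_{k>=0} gamma^k P_pi^k = (I - gamma P_pi)^{-1}."""
    n = len(p_pi)
    # term holds gamma^k P_pi^k, starting from the identity (k = 0).
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [[0.0] * n for _ in range(n)]
    for _ in range(n_terms):
        for i in range(n):
            for j in range(n):
                total[i][j] += term[i][j]
        term = [[gamma * x for x in row] for row in mat_mul(term, p_pi)]
    return total


p_pi = [[0.9, 0.1],   # hypothetical P_pi; row s' gives Pr(next state | s')
        [0.2, 0.8]]
d_beta = [0.5, 0.5]   # hypothetical state distribution under beta
gamma = 0.9

pr = discounted_transition_sum(p_pi, gamma)
n_states = len(d_beta)
rho = [sum(d_beta[sp] * pr[sp][s] for sp in range(n_states))
       for s in range(n_states)]
print(rho, sum(rho))
```

Note that $\rho$ as defined here is not normalized: each row of $(I-\gamma P_\pi)^{-1}$ sums to $1/(1-\gamma)$, so the entries of `rho` sum to $1/(1-\gamma)$ rather than to 1.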

The off-policy policy gradient is also invariant to any additional baseline $b(s)$. In particular, we have
$$
\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho, A\sim\beta}\Big[\frac{\pi(A|S;\theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S;\theta) \big( q_\pi(S,A) - b(S) \big) \Big]
$$

When we take the state value as the baseline, $b(S)=v_\pi(S)$, we obtain the advantage function

$$
\delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S)
$$

The corresponding stochastic gradient-ascent algorithm is

$$
\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)} \nabla_\theta \ln\pi(a_t|s_t;\theta_t)(q_t(s_t,a_t)-v_t(s_t))
$$

The advantage function can be replaced by the TD error, that is,

$$
q_t(s_t,a_t)-v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \triangleq \delta_t(s_t,a_t)
$$

In off-policy A2C we use the behavior policy $\beta$ to collect samples, and learn one policy network $\pi(a|s;\theta)$ and one value network $v(s;w)$. The core idea of off-policy A2C is

$$
\text{off-policy A2C}: \left \{ \begin{aligned} \text{Behavior policy}: a_t & \sim\beta(\cdot|s_t) \\ \text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\ \text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)} \delta_t\nabla_\theta \ln\pi(a_t|s_t;\theta_t) \\ \text{Critic}: w_{t+1} & = w_t + \alpha_w \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)}\delta_t \nabla_w v(s_t;w_t) \end{aligned} \right.
$$

We can rewrite the update rules in terms of objective functions:

$$
\text{off-policy A2C}: \left \{ \begin{aligned} \text{Behavior policy}: A & \sim\beta(\cdot|S) \\ \text{Advantage}: \Delta(S) & = R + \gamma v(S^\prime;w_t) - v(S;w_t) \\ \text{Actor}: \max_\theta J(\theta) & = \mathbb{E}_{S\sim\rho,A\sim\beta}\Big[\frac{\pi(A|S;\theta_t)}{\beta(A|S)}\Delta(S) \ln\pi(A|S;\theta)\Big] \\ \text{Critic}: \min_w J(w) & = \mathbb{E}_{S\sim\rho,A\sim\beta}\Big[\frac{\pi(A|S;\theta_t)}{\beta(A|S)}(R + \gamma v(S^\prime;w_t) - v(S;w))^2\Big] \end{aligned} \right.
$$

Here the importance weight, $\Delta(S)$, and the TD target $R + \gamma v(S^\prime;w_t)$ are treated as constants when differentiating, so the gradients match the update rules above up to a constant factor.

Pseudocode

[Image: pseudocode of the off-policy actor-critic algorithm]
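One off-policy A2C update can be sketched as follows. As in a tabular setting, the target policy is assumed to be a softmax over preferences `theta[s][a]` and the value function a table `v[s]`. The behavior policy only enters through `beta_prob`, the probability $\beta(a_t|s_t)$ with which the action was actually chosen. The function name, step sizes, and example numbers are hypothetical.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def off_policy_a2c_step(theta, v, s, a, r, s_next, beta_prob,
                        alpha_theta=0.1, alpha_w=0.5, gamma=0.5):
    """Apply one off-policy A2C update in place.

    Returns (importance weight, TD error) for inspection.
    """
    pi = softmax(theta[s])
    weight = pi[a] / beta_prob  # pi(a_t|s_t) / beta(a_t|s_t)
    delta = r + gamma * v[s_next] - v[s]

    # Actor: theta <- theta + alpha_theta * weight * delta * grad ln pi(a_t|s_t).
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_theta * weight * delta * grad_log_pi

    # Critic: w <- w + alpha_w * weight * delta * grad v(s_t).
    v[s] += alpha_w * weight * delta
    return weight, delta


# Hypothetical transition collected by a behavior policy that picked
# action 1 in state 0 with probability 0.25.
theta = [[0.0, 0.0], [0.0, 0.0]]
v = [1.0, 2.0]
weight, delta = off_policy_a2c_step(theta, v, s=0, a=1, r=0.5,
                                    s_next=1, beta_prob=0.25)
print(weight, delta, theta[0], v[0])
```

Compared with the on-policy update, the only change is the factor `weight`. When the behavior policy equals the target policy, the weight is 1 and the update reduces to ordinary A2C.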

Reference

Zhao Shiyu's course (Mathematical Foundations of Reinforcement Learning)
