Reinforcement Learning: Basic Development

This post is basically a collection of useful material I came across, organized in one place.

References

https://blog.csdn.net/weixin_48878618/article/details/133590646

https://blog.csdn.net/weixin_42769131/article/details/104783188?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522166792845916800182132771%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fall.%2522%257D&request_id=166792845916800182132771&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2allfirst_rank_ecpm_v1~rank_v31_ecpm-1-104783188-null-null.142%5Ev63%5Ewechat,201%5Ev3%5Eadd_ask,213%5Ev2%5Et3_esquery_v3&utm_term=gym%E5%A4%9A%E5%B1%82%E6%84%9F%E7%9F%A5%E6%9C%BA&spm=1018.2226.3001.4187

https://blog.csdn.net/Scc_hy/article/details/128297350

I. Custom gym environment

1. Define the environment class

The class below is adapted from the classic MountainCar environment (see the header comment).

"""
http://incompleteideas.net/sutton/MountainCar/MountainCar1.cp
permalink: https://perma.cc/6Z2N-PFWC
"""import math
import numpy as np
import gym
from gym import spaces
from gym.utils import seedingclass GridEnv(gym.Env):metadata = {'render.modes': ['human', 'rgb_array'],'video.frames_per_second': 30}def __init__(self, goal_velocity=0):self.min_position = -1.2self.max_position = 0.6self.max_speed = 0.07self.goal_position = 0.5self.goal_velocity = goal_velocityself.force = 0.001self.gravity = 0.0025self.low = np.array([self.min_position, -self.max_speed], dtype=np.float32)self.high = np.array([self.max_position, self.max_speed], dtype=np.float32)self.viewer = Noneself.action_space = spaces.Discrete(3)self.observation_space = spaces.Box(self.low, self.high, dtype=np.float32)self.seed()def seed(self, seed=None):self.np_random, seed = seeding.np_random(seed)return [seed]def step(self, action):assert self.action_space.contains(action), "%r (%s) invalid" % (action, type(action))position, velocity = self.statevelocity += (action - 1) * self.force + math.cos(3 * position) * (-self.gravity)velocity = np.clip(velocity, -self.max_speed, self.max_speed)position += velocityposition = np.clip(position, self.min_position, self.max_position)if (position == self.min_position and velocity < 0): velocity = 0done = bool(position >= self.goal_position and velocity >= self.goal_velocity)reward = -1.0self.state = (position, velocity)return np.array(self.state), reward, done, action, {}def reset(self):self.state = np.array([self.np_random.uniform(low=-0.6, high=-0.4), 0])return np.array(self.state)def _height(self, xs):return np.sin(3 * xs) * .45 + .55def render(self, mode='human'):screen_width = 600screen_height = 400world_width = self.max_position - self.min_positionscale = screen_width / world_widthcarwidth = 40carheight = 20if self.viewer is None:from gym.envs.classic_control import renderingself.viewer = rendering.Viewer(screen_width, screen_height)xs = np.linspace(self.min_position, self.max_position, 100)ys = self._height(xs)xys = list(zip((xs - self.min_position) * scale, ys * scale))self.track = rendering.make_polyline(xys)self.track.set_linewidth(4)self.viewer.add_geom(self.track)clearance = 10l, r, t, b = -carwidth / 2, carwidth / 2, carheight, 0car = rendering.FilledPolygon([(l, b), (l, t), (r, t), (r, b)])car.add_attr(rendering.Transform(translation=(0, clearance)))self.cartrans = rendering.Transform()car.add_attr(self.cartrans)self.viewer.add_geom(car)frontwheel = rendering.make_circle(carheight / 2.5)frontwheel.set_color(.5, .5, .5)frontwheel.add_attr(rendering.Transform(translation=(carwidth / 4, clearance)))frontwheel.add_attr(self.cartrans)self.viewer.add_geom(frontwheel)backwheel = rendering.make_circle(carheight / 2.5)backwheel.add_attr(rendering.Transform(translation=(-carwidth / 4, clearance)))backwheel.add_attr(self.cartrans)backwheel.set_color(.5, .5, .5)self.viewer.add_geom(backwheel)flagx = (self.goal_position - self.min_position) * scaleflagy1 = self._height(self.goal_position) * scaleflagy2 = flagy1 + 50flagpole = rendering.Line((flagx, flagy1), (flagx, flagy2))self.viewer.add_geom(flagpole)flag = rendering.FilledPolygon([(flagx, flagy2), (flagx, flagy2 - 10), (flagx + 25, flagy2 - 5)])flag.set_color(.8, .8, 0)self.viewer.add_geom(flag)pos = self.state[0]self.cartrans.set_translation((pos - self.min_position) * scale, self._height(pos) * scale)self.cartrans.set_rotation(math.cos(3 * pos))return self.viewer.render(return_rgb_array=mode == 'rgb_array')def get_keys_to_action(self):return {(): 1, (276,): 0, (275,): 2, (275, 276): 1}  # control with left and right arrow keysdef close(self):if self.viewer:self.viewer.close()self.viewer = None

2. Modify the initialization code of the module that holds the environment class

Save the environment class above as grid_mdp.py inside the classic_control package, then add the following imports to that package's __init__.py.

File path: '/home/xxx/gym/gym/envs/classic_control/__init__.py'

from gym.envs.classic_control import grid_mdp
from gym.envs.classic_control.grid_mdp import GridEnv

3. Register the environment with gym

File path: '/home/xxx/gym/gym/envs/__init__.py'

register(
    id='GridWorld-v0',
    entry_point='gym.envs.classic_control:GridEnv',
    max_episode_steps=200,
    reward_threshold=100.0,
)
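
If you prefer not to edit the installed gym package at all, the same registration can usually be done at runtime from your own script. A minimal sketch (the module path my_envs.grid_mdp is a hypothetical location for your own copy of GridEnv):

from gym.envs.registration import register

register(
    id='GridWorld-v0',
    entry_point='my_envs.grid_mdp:GridEnv',  # hypothetical module path to your GridEnv class
    max_episode_steps=200,
    reward_threshold=100.0,
)

After this call, gym.make('GridWorld-v0') works in the same process without touching gym's source tree.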

4. Test code

 
import gym
import time
env = gym.make('GridWorld-v0')

for eq in range(10):
    obs = env.reset()
    done = False
    rewards = 0
    while not done:
        action = env.action_space.sample()
        # the custom env returns (obs, reward, done, action, info)
        observation, reward, done, action, info = env.step(action)
        env.render()
        rewards += reward
    print(rewards)  # total reward of the episode


II. Custom MuJoCo environments in gym

1. Using the MuJoCo environments built into gym

  • gym ships with examples that use MuJoCo environments: they load a MuJoCo model and wrap the relevant APIs in gym style.
import gym

env = gym.make('Ant-v4', render_mode='human')  # create the Ant environment
observation, info = env.reset()                # reset the environment and get the initial observation

for _ in range(1000):                          # run 1000 steps
    env.render()                               # render the environment
    action = env.action_space.sample()         # sample a random action from the action space
    print(action)
    # step the environment and get the next observation, reward, etc.
    next_observation, reward, terminated, truncated, info = env.step(action)
    print(_)
    if terminated or truncated:
        break

env.close()  # close the environment


2. The MuJoCo example bundled with gym

  • The key point is the base class imported via from gym.envs.mujoco import MujocoEnv; the bundled Ant environment below subclasses it. A minimal custom subclass sketch follows the listing.
import numpy as np

from gym import utils
from gym.envs.mujoco import MujocoEnv
from gym.spaces import Box

DEFAULT_CAMERA_CONFIG = {
    "distance": 4.0,
}


class AntEnv(MujocoEnv, utils.EzPickle):
    metadata = {
        "render_modes": [
            "human",
            "rgb_array",
            "depth_array",
        ],
        "render_fps": 20,
    }

    def __init__(
        self,
        xml_file="ant.xml",
        ctrl_cost_weight=0.5,
        use_contact_forces=False,
        contact_cost_weight=5e-4,
        healthy_reward=1.0,
        terminate_when_unhealthy=True,
        healthy_z_range=(0.2, 1.0),
        contact_force_range=(-1.0, 1.0),
        reset_noise_scale=0.1,
        exclude_current_positions_from_observation=True,
        **kwargs
    ):
        utils.EzPickle.__init__(
            self,
            xml_file,
            ctrl_cost_weight,
            use_contact_forces,
            contact_cost_weight,
            healthy_reward,
            terminate_when_unhealthy,
            healthy_z_range,
            contact_force_range,
            reset_noise_scale,
            exclude_current_positions_from_observation,
            **kwargs
        )

        self._ctrl_cost_weight = ctrl_cost_weight
        self._contact_cost_weight = contact_cost_weight
        self._healthy_reward = healthy_reward
        self._terminate_when_unhealthy = terminate_when_unhealthy
        self._healthy_z_range = healthy_z_range
        self._contact_force_range = contact_force_range
        self._reset_noise_scale = reset_noise_scale
        self._use_contact_forces = use_contact_forces
        self._exclude_current_positions_from_observation = (
            exclude_current_positions_from_observation
        )

        obs_shape = 27
        if not exclude_current_positions_from_observation:
            obs_shape += 2
        if use_contact_forces:
            obs_shape += 84

        observation_space = Box(
            low=-np.inf, high=np.inf, shape=(obs_shape,), dtype=np.float64
        )

        MujocoEnv.__init__(
            self, xml_file, 5, observation_space=observation_space, **kwargs
        )

    @property
    def healthy_reward(self):
        return (
            float(self.is_healthy or self._terminate_when_unhealthy)
            * self._healthy_reward
        )

    def control_cost(self, action):
        control_cost = self._ctrl_cost_weight * np.sum(np.square(action))
        return control_cost

    @property
    def contact_forces(self):
        raw_contact_forces = self.data.cfrc_ext
        min_value, max_value = self._contact_force_range
        contact_forces = np.clip(raw_contact_forces, min_value, max_value)
        return contact_forces

    @property
    def contact_cost(self):
        contact_cost = self._contact_cost_weight * np.sum(
            np.square(self.contact_forces)
        )
        return contact_cost

    @property
    def is_healthy(self):
        state = self.state_vector()
        min_z, max_z = self._healthy_z_range
        is_healthy = np.isfinite(state).all() and min_z <= state[2] <= max_z
        return is_healthy

    @property
    def terminated(self):
        terminated = not self.is_healthy if self._terminate_when_unhealthy else False
        return terminated

    def step(self, action):
        xy_position_before = self.get_body_com("torso")[:2].copy()
        self.do_simulation(action, self.frame_skip)
        xy_position_after = self.get_body_com("torso")[:2].copy()

        xy_velocity = (xy_position_after - xy_position_before) / self.dt
        x_velocity, y_velocity = xy_velocity

        forward_reward = x_velocity
        healthy_reward = self.healthy_reward

        rewards = forward_reward + healthy_reward

        costs = ctrl_cost = self.control_cost(action)

        terminated = self.terminated
        observation = self._get_obs()
        info = {
            "reward_forward": forward_reward,
            "reward_ctrl": -ctrl_cost,
            "reward_survive": healthy_reward,
            "x_position": xy_position_after[0],
            "y_position": xy_position_after[1],
            "distance_from_origin": np.linalg.norm(xy_position_after, ord=2),
            "x_velocity": x_velocity,
            "y_velocity": y_velocity,
            "forward_reward": forward_reward,
        }
        if self._use_contact_forces:
            contact_cost = self.contact_cost
            costs += contact_cost
            info["reward_ctrl"] = -contact_cost

        reward = rewards - costs

        if self.render_mode == "human":
            self.render()
        return observation, reward, terminated, False, info

    def _get_obs(self):
        position = self.data.qpos.flat.copy()
        velocity = self.data.qvel.flat.copy()

        if self._exclude_current_positions_from_observation:
            position = position[2:]

        if self._use_contact_forces:
            contact_force = self.contact_forces.flat.copy()
            return np.concatenate((position, velocity, contact_force))
        else:
            return np.concatenate((position, velocity))

    def reset_model(self):
        noise_low = -self._reset_noise_scale
        noise_high = self._reset_noise_scale

        qpos = self.init_qpos + self.np_random.uniform(
            low=noise_low, high=noise_high, size=self.model.nq
        )
        qvel = (
            self.init_qvel
            + self._reset_noise_scale * self.np_random.standard_normal(self.model.nv)
        )
        self.set_state(qpos, qvel)

        observation = self._get_obs()

        return observation

    def viewer_setup(self):
        assert self.viewer is not None
        for key, value in DEFAULT_CAMERA_CONFIG.items():
            if isinstance(value, np.ndarray):
                getattr(self.viewer.cam, key)[:] = value
            else:
                setattr(self.viewer.cam, key, value)
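
Following the same pattern, here is a minimal sketch of a custom MujocoEnv subclass for the cart-pole model defined in section III below. Assumptions: a gym ~0.26-style API as in the Ant listing above, the XML saved at /home/xxx/inverted_pendulum_custom.xml, and an illustrative reward/termination rule rather than gym's official InvertedPendulum logic.

import numpy as np

from gym import utils
from gym.envs.mujoco import MujocoEnv
from gym.spaces import Box


class MyPendulumEnv(MujocoEnv, utils.EzPickle):
    # render_fps should equal round(1 / dt); with MuJoCo's default timestep of 0.002 s
    # and frame_skip=2, dt = 0.004 s, so render_fps = 250
    metadata = {
        "render_modes": ["human", "rgb_array", "depth_array"],
        "render_fps": 250,
    }

    def __init__(self, **kwargs):
        utils.EzPickle.__init__(self, **kwargs)
        # 2 joint positions (slide_joint, hinge_joint) + 2 joint velocities
        observation_space = Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float64)
        MujocoEnv.__init__(
            self,
            "/home/xxx/inverted_pendulum_custom.xml",  # assumed save location of the XML in section III
            2,  # frame_skip
            observation_space=observation_space,
            **kwargs,
        )

    def step(self, action):
        # action has one entry per actuator defined in the XML (3 here)
        self.do_simulation(action, self.frame_skip)
        observation = self._get_obs()
        # illustrative rule: +1 per step while the pole (hinge angle, qpos[1]) stays near upright
        terminated = bool(
            not np.isfinite(observation).all() or abs(observation[1]) > 0.2
        )
        reward = 1.0
        if self.render_mode == "human":
            self.render()
        return observation, reward, terminated, False, {}

    def reset_model(self):
        qpos = self.init_qpos + self.np_random.uniform(
            low=-0.01, high=0.01, size=self.model.nq
        )
        qvel = self.init_qvel + self.np_random.uniform(
            low=-0.01, high=0.01, size=self.model.nv
        )
        self.set_state(qpos, qvel)
        return self._get_obs()

    def _get_obs(self):
        return np.concatenate([self.data.qpos, self.data.qvel]).ravel()

    def viewer_setup(self):
        assert self.viewer is not None
        self.viewer.cam.distance = 4.0  # pull the camera back a little


if __name__ == "__main__":
    env = MyPendulumEnv()
    obs, info = env.reset()
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    print(obs.shape, reward, terminated)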

III. Defining an inverted pendulum model in MuJoCo

  • The inverted_pendulum.xml bundled with MuJoCo's examples is a useful reference; a new model is written from scratch here (a quick loading check follows the XML).
<?xml version="1.0" ?>
<mujoco>
    <default>
        <geom rgba=".8 .6 .4 1"/>
    </default>
    <asset>
        <texture type="skybox" builtin="gradient" rgb1="1 1 1" rgb2=".6 .8 1" width="256" height="256"/>
    </asset>
    <worldbody>
        <body name="cart" pos="0 0 0">
            <geom type="box" size="0.2 0.1 0.1" rgba="1 0 0 1"/>
            <joint type="slide" axis="1 0 0" name="slide_joint"/>
            <body name="pole" pos="0 0 -0.5">
                <joint type="hinge" pos="0 0 0.5" axis="0 1 0" name="hinge_joint"/>
                <geom type="cylinder" size="0.05 0.5" rgba="0 1 0 1"/>
                <inertial pos="0 0 0" mass="10" diaginertia="0.1 0.1 0.1"/>
            </body>
        </body>
    </worldbody>
    <actuator>
        <motor ctrllimited="true" ctrlrange="-1 1" joint="slide_joint" gear="100"/>
        <position name="position_servo" joint="slide_joint" kp="500"/>
        <velocity name="velocity_servo" joint="slide_joint" kv="100"/>
    </actuator>
</mujoco>
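
Before wiring the model into gym, it can be loaded directly with the official mujoco Python bindings to check that it parses and simulates. A minimal sketch (the file name inverted_pendulum_custom.xml is an assumed save location):

import mujoco
import numpy as np

model = mujoco.MjModel.from_xml_path("inverted_pendulum_custom.xml")
data = mujoco.MjData(model)

# 2 joint positions, 2 joint velocities, 3 actuators (motor + position servo + velocity servo)
print("nq:", model.nq, "nv:", model.nv, "nu:", model.nu)

data.ctrl[:] = np.zeros(model.nu)  # zero command on all actuators
for _ in range(100):
    mujoco.mj_step(model, data)

print("qpos:", data.qpos, "qvel:", data.qvel)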


IV. Reinforcement learning algorithms for the inverted pendulum

1. Discrete action space

import logging

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gym
from torch.distributions import Bernoulli
from torch.autograd import Variable
from itertools import count

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class PGN(nn.Module):
    def __init__(self):
        super(PGN, self).__init__()
        self.linear1 = nn.Linear(4, 24)
        self.linear2 = nn.Linear(24, 36)
        self.linear3 = nn.Linear(36, 1)

    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = torch.sigmoid(self.linear3(x))
        return x


class CartAgent(object):
    def __init__(self, learning_rate, gamma):
        self.pgn = PGN()
        self.gamma = gamma
        self._init_memory()
        self.optimizer = torch.optim.RMSprop(self.pgn.parameters(), lr=learning_rate)

    def memorize(self, state, action, reward):
        # save to memory for mini-batch gradient descent
        self.state_pool.append(state)
        self.action_pool.append(action)
        self.reward_pool.append(reward)
        self.steps += 1

    def learn(self):
        self._adjust_reward()

        # policy gradient
        self.optimizer.zero_grad()
        for i in range(self.steps):
            # all steps collected over several games
            state = self.state_pool[i]
            action = torch.FloatTensor([self.action_pool[i]])
            reward = self.reward_pool[i]

            probs = self.act(state)
            m = Bernoulli(probs)
            loss = -m.log_prob(action) * reward
            loss.backward()
        self.optimizer.step()

        self._init_memory()

    def act(self, state):
        return self.pgn(state)

    def _init_memory(self):
        self.state_pool = []
        self.action_pool = []
        self.reward_pool = []
        self.steps = 0

    def _adjust_reward(self):
        # discount rewards backwards through time
        running_add = 0
        for i in reversed(range(self.steps)):
            if self.reward_pool[i] == 0:
                running_add = 0
            else:
                running_add = running_add * self.gamma + self.reward_pool[i]
                self.reward_pool[i] = running_add

        # normalize reward
        reward_mean = np.mean(self.reward_pool)
        reward_std = np.std(self.reward_pool)
        for i in range(self.steps):
            self.reward_pool[i] = (self.reward_pool[i] - reward_mean) / reward_std


def train():
    # hyper parameters
    BATCH_SIZE = 5
    LEARNING_RATE = 0.01
    GAMMA = 0.99
    NUM_EPISODES = 500

    env = gym.make('CartPole-v1', render_mode='human')
    cart_agent = CartAgent(learning_rate=LEARNING_RATE, gamma=GAMMA)

    for i_episode in range(NUM_EPISODES):
        next_state = env.reset()[0]

        for t in count():
            state = torch.from_numpy(next_state).float()
            probs = cart_agent.act(state)

            m = Bernoulli(probs)
            action = m.sample()
            action = action.data.numpy().astype(int).item()

            next_state, reward, done, _, _ = env.step(action)
            # the final action's reward is set to 0
            if done:
                reward = 0

            cart_agent.memorize(state, action, reward)

            if done:
                logger.info('Episode {}: durations {}'.format(i_episode, t))
                break

        # update parameters every BATCH_SIZE episodes
        if i_episode > 0 and i_episode % BATCH_SIZE == 0:
            cart_agent.learn()


if __name__ == '__main__':
    train()
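
For reference, the learn() method above implements the REINFORCE update. With the discounted and normalized return $\hat{G}_t$ produced by _adjust_reward() and the Bernoulli policy $\pi_\theta(a_t \mid s_t)$ given by the sigmoid output, the accumulated loss is

$$L(\theta) = -\sum_t \log \pi_\theta(a_t \mid s_t)\,\hat{G}_t,$$

so the single RMSprop step descends this loss, which is equivalent to ascending the policy-gradient estimate $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{G}_t$.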


2. Continuous action space

# python3
# Create Date: 2022-12-27
# Func: PPO with continuous action output
# =====================================================================================================
import torch
import torch.nn as nn
from torch.nn import functional as F
import numpy as np
import gym
import copy
import random
from collections import deque
from tqdm import tqdm
import typing as typ


class policyNet(nn.Module):
    """
    Continuous action: normal distribution (mean, std)
    """
    def __init__(self, state_dim: int, hidden_layers_dim: typ.List, action_dim: int):
        super(policyNet, self).__init__()
        self.features = nn.ModuleList()
        for idx, h in enumerate(hidden_layers_dim):
            self.features.append(nn.ModuleDict({
                'linear': nn.Linear(hidden_layers_dim[idx - 1] if idx else state_dim, h),
                'linear_action': nn.ReLU(inplace=True)
            }))

        self.fc_mu = nn.Linear(hidden_layers_dim[-1], action_dim)
        self.fc_std = nn.Linear(hidden_layers_dim[-1], action_dim)

    def forward(self, x):
        for layer in self.features:
            x = layer['linear_action'](layer['linear'](x))

        mean_ = 2.0 * torch.tanh(self.fc_mu(x))
        # np.log(1 + np.exp(2))
        std = F.softplus(self.fc_std(x))
        return mean_, std


class valueNet(nn.Module):
    def __init__(self, state_dim, hidden_layers_dim):
        super(valueNet, self).__init__()
        self.features = nn.ModuleList()
        for idx, h in enumerate(hidden_layers_dim):
            self.features.append(nn.ModuleDict({
                'linear': nn.Linear(hidden_layers_dim[idx - 1] if idx else state_dim, h),
                'linear_activation': nn.ReLU(inplace=True)
            }))

        self.head = nn.Linear(hidden_layers_dim[-1], 1)

    def forward(self, x):
        for layer in self.features:
            x = layer['linear_activation'](layer['linear'](x))
        return self.head(x)


def compute_advantage(gamma, lmbda, td_delta):
    td_delta = td_delta.detach().numpy()
    adv_list = []
    adv = 0
    for delta in td_delta[::-1]:
        adv = gamma * lmbda * adv + delta
        adv_list.append(adv)
    adv_list.reverse()
    return torch.FloatTensor(adv_list)


class PPO:
    """
    PPO algorithm using the clipped surrogate objective
    """
    def __init__(
        self,
        state_dim: int,
        hidden_layers_dim: typ.List,
        action_dim: int,
        actor_lr: float,
        critic_lr: float,
        gamma: float,
        PPO_kwargs: typ.Dict,
        device: torch.device
    ):
        self.actor = policyNet(state_dim, hidden_layers_dim, action_dim).to(device)
        self.critic = valueNet(state_dim, hidden_layers_dim).to(device)
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)

        self.gamma = gamma
        self.lmbda = PPO_kwargs['lmbda']
        self.ppo_epochs = PPO_kwargs['ppo_epochs']  # number of training epochs per collected trajectory
        self.eps = PPO_kwargs['eps']                # clipping range of the PPO surrogate objective
        self.count = 0
        self.device = device

    def policy(self, state):
        state = torch.FloatTensor([state]).to(self.device)
        mu, std = self.actor(state)
        action_dist = torch.distributions.Normal(mu, std)
        action = action_dist.sample()
        return [action.item()]

    def update(self, samples: deque):
        self.count += 1
        state, action, reward, next_state, done = zip(*samples)

        state = torch.FloatTensor(state).to(self.device)
        action = torch.tensor(action).view(-1, 1).to(self.device)
        reward = torch.tensor(reward).view(-1, 1).to(self.device)
        reward = (reward + 8.0) / 8.0  # rescale the reward, as in the TRPO example, to ease training
        next_state = torch.FloatTensor(next_state).to(self.device)
        done = torch.FloatTensor(done).view(-1, 1).to(self.device)

        td_target = reward + self.gamma * self.critic(next_state) * (1 - done)
        td_delta = td_target - self.critic(state)
        advantage = compute_advantage(self.gamma, self.lmbda, td_delta.cpu()).to(self.device)

        mu, std = self.actor(state)
        action_dists = torch.distributions.Normal(mu.detach(), std.detach())
        # the action follows a normal distribution
        old_log_probs = action_dists.log_prob(action)

        for _ in range(self.ppo_epochs):
            mu, std = self.actor(state)
            action_dists = torch.distributions.Normal(mu, std)
            log_prob = action_dists.log_prob(action)

            # e^(log(a) - log(b)) = a / b
            ratio = torch.exp(log_prob - old_log_probs)
            surr1 = ratio * advantage
            surr2 = torch.clamp(ratio, 1 - self.eps, 1 + self.eps) * advantage

            actor_loss = torch.mean(-torch.min(surr1, surr2)).float()
            critic_loss = torch.mean(
                F.mse_loss(self.critic(state).float(), td_target.detach().float())
            ).float()
            self.actor_opt.zero_grad()
            self.critic_opt.zero_grad()
            actor_loss.backward()
            critic_loss.backward()
            self.actor_opt.step()
            self.critic_opt.step()


class replayBuffer:
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def __len__(self):
        return len(self.buffer)

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)


def play(env, env_agent, cfg, episode_count=2):
    for e in range(episode_count):
        s, _ = env.reset()
        done = False
        episode_reward = 0
        episode_cnt = 0
        while not done:
            env.render()
            a = env_agent.policy(s)
            n_state, reward, done, _, _ = env.step(a)
            episode_reward += reward
            episode_cnt += 1
            s = n_state
            if (episode_cnt >= 3 * cfg.max_episode_steps) or (episode_reward >= 3 * cfg.max_episode_rewards):
                break

        print(f'Get reward {episode_reward}. Last {episode_cnt} times')
    env.close()


class Config:
    num_episode = 1200
    state_dim = None
    hidden_layers_dim = [128, 128]
    action_dim = 20
    actor_lr = 1e-4
    critic_lr = 5e-3
    PPO_kwargs = {
        'lmbda': 0.9,
        'eps': 0.2,
        'ppo_epochs': 10
    }
    gamma = 0.9
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    buffer_size = 20480
    minimal_size = 1024
    batch_size = 128
    save_path = r'./ac_model.ckpt'
    # episode stopping criteria
    max_episode_rewards = 260
    max_episode_steps = 260

    def __init__(self, env):
        self.state_dim = env.observation_space.shape[0]
        try:
            self.action_dim = env.action_space.n
        except Exception as e:
            self.action_dim = env.action_space.shape[0]
        print(f'device={self.device} | env={str(env)}')


def train_agent(env, cfg):
    ac_agent = PPO(
        state_dim=cfg.state_dim,
        hidden_layers_dim=cfg.hidden_layers_dim,
        action_dim=cfg.action_dim,
        actor_lr=cfg.actor_lr,
        critic_lr=cfg.critic_lr,
        gamma=cfg.gamma,
        PPO_kwargs=cfg.PPO_kwargs,
        device=cfg.device
    )
    tq_bar = tqdm(range(cfg.num_episode))
    rewards_list = []
    now_reward = 0
    bf_reward = -np.inf
    for i in tq_bar:
        buffer_ = replayBuffer(cfg.buffer_size)
        tq_bar.set_description(f'Episode [ {i+1} / {cfg.num_episode} ]')
        s, _ = env.reset()
        done = False
        episode_rewards = 0
        steps = 0
        while not done:
            a = ac_agent.policy(s)
            n_s, r, done, _, _ = env.step(a)
            buffer_.add(s, a, r, n_s, done)
            s = n_s
            episode_rewards += r
            steps += 1
            if (episode_rewards >= cfg.max_episode_rewards) or (steps >= cfg.max_episode_steps):
                break

        ac_agent.update(buffer_.buffer)
        rewards_list.append(episode_rewards)
        now_reward = np.mean(rewards_list[-10:])
        if bf_reward < now_reward:
            torch.save(ac_agent.actor.state_dict(), cfg.save_path)
            bf_reward = now_reward

        tq_bar.set_postfix({'lastMeanRewards': f'{now_reward:.2f}', 'BEST': f'{bf_reward:.2f}'})
    env.close()
    return ac_agent


if __name__ == '__main__':
    print('==' * 35)
    print('Training Pendulum-v1')
    env = gym.make('Pendulum-v1')
    cfg = Config(env)
    ac_agent = train_agent(env, cfg)
    ac_agent.actor.load_state_dict(torch.load(cfg.save_path))
    play(gym.make('Pendulum-v1', render_mode="human"), ac_agent, cfg)
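
For reference, compute_advantage() above is the generalized advantage estimator GAE(λ). With TD errors $\delta_t = r_t + \gamma V(s_{t+1})(1 - d_t) - V(s_t)$ it accumulates

$$\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l\, \delta_{t+l},$$

and the actor loss inside update() is the clipped PPO surrogate

$$L^{\mathrm{CLIP}}(\theta) = -\,\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\hat{A}_t\right)\right],\qquad r_t(\theta) = \exp\!\big(\log\pi_\theta(a_t \mid s_t) - \log\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)\big),$$

which is exactly what the surr1 / surr2 / torch.clamp lines compute.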


3. Continuous action space for the inverted pendulum

import logging

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gym
from torch.distributions import Normal
from torch.autograd import Variable
from itertools import count

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class PGN(nn.Module):
    def __init__(self):
        super(PGN, self).__init__()
        self.linear1 = nn.Linear(4, 24)
        self.linear2 = nn.Linear(24, 36)
        self.mean = nn.Linear(36, 1)                # outputs the mean of the action distribution
        self.std = nn.Parameter(torch.tensor(0.1))  # learnable standard deviation

    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        mean = self.mean(x)
        return mean


class CartAgent(object):
    def __init__(self, learning_rate, gamma):
        self.pgn = PGN()
        self.gamma = gamma
        self._init_memory()
        self.optimizer = torch.optim.RMSprop(self.pgn.parameters(), lr=learning_rate)

    def memorize(self, state, action, reward):
        # save to memory for mini-batch gradient descent
        self.state_pool.append(state)
        self.action_pool.append(action)
        self.reward_pool.append(reward)
        self.steps += 1

    def learn(self):
        self._adjust_reward()

        # policy gradient
        self.optimizer.zero_grad()
        for i in range(self.steps):
            state = self.state_pool[i]
            action = torch.FloatTensor([self.action_pool[i]])
            reward = self.reward_pool[i]

            mean = self.act(state)
            m = Normal(mean, self.pgn.std)
            loss = -m.log_prob(action) * reward
            loss.backward()
        self.optimizer.step()

        self._init_memory()

    def act(self, state):
        return self.pgn(state)

    def _init_memory(self):
        self.state_pool = []
        self.action_pool = []
        self.reward_pool = []
        self.steps = 0

    def _adjust_reward(self):
        running_add = 0
        for i in reversed(range(self.steps)):
            if self.reward_pool[i] == 0:
                running_add = 0
            else:
                running_add = running_add * self.gamma + self.reward_pool[i]
                self.reward_pool[i] = running_add

        reward_mean = np.mean(self.reward_pool)
        reward_std = np.std(self.reward_pool)
        for i in range(self.steps):
            self.reward_pool[i] = (self.reward_pool[i] - reward_mean) / reward_std


def train():
    BATCH_SIZE = 5
    LEARNING_RATE = 0.01
    GAMMA = 0.99
    NUM_EPISODES = 500

    env = gym.make('InvertedPendulum-v2', render_mode='human')
    cart_agent = CartAgent(learning_rate=LEARNING_RATE, gamma=GAMMA)

    for i_episode in range(NUM_EPISODES):
        next_state = env.reset()[0]

        for t in count():
            # print("++++++++++++++:", t)
            state = torch.from_numpy(next_state).float()
            mean = cart_agent.act(state)
            # print(cart_agent.pgn.std)
            if cart_agent.pgn.std <= 0:
                cart_agent.pgn.std = nn.Parameter(torch.tensor(0.01))

            m = Normal(mean, cart_agent.pgn.std)
            action = m.sample()
            action = action.data.numpy().astype(float).item()
            action = np.clip(action, -3.0, 3.0)  # limit the applied force to [-3.0, 3.0]
            action = [action]

            next_state, reward, done, _, _ = env.step(action)
            if done:
                reward = 0

            cart_agent.memorize(state, action, reward)

            if done:
                logger.info('Episode {}: durations {}'.format(i_episode, t))
                break

        if i_episode > 0 and i_episode % BATCH_SIZE == 0:
            cart_agent.learn()


if __name__ == '__main__':
    train()
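
One caveat of the policy above: std is a raw nn.Parameter, so the optimizer can push it to zero or below, which is why train() resets it by hand. A common alternative, sketched below (not the author's code), keeps an unconstrained parameter and maps it through softplus, the same trick the PPO policyNet in section IV.2 uses for its state-dependent std:

import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianHead(nn.Module):
    """Sketch of a positivity-safe Gaussian policy head."""

    def __init__(self, in_features: int, action_dim: int = 1):
        super().__init__()
        self.mean = nn.Linear(in_features, action_dim)
        self.raw_std = nn.Parameter(torch.zeros(action_dim))  # unconstrained parameter

    def forward(self, x):
        mean = self.mean(x)
        std = F.softplus(self.raw_std) + 1e-5  # always strictly positive, no manual reset needed
        return mean, std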
