Reinforcement Learning with Code 【Code 1. Tabular Q-learning】

This note records how the author began to learn reinforcement learning; both the theoretical understanding and the code practice are presented. It draws on several materials, such as Zhao Shiyu's Mathematical Foundations of Reinforcement Learning.
The code follows Mofan's reinforcement learning course.

Contents

  • Reinforcement Learning with Code 【Code 1. Tabular Q-learning】
    • 1.1 Problem and result
    • 1.2 Environment
    • 1.3 Tabular Q-learning Algorithm
    • 1.4 Run the main script
    • 1.5 Check the Q table
    • Reference

1.1 Problem and result

Consider the problem in which a little mouse (the red block) wants to avoid the traps (the black blocks) and get the cheese (the yellow circle), as the figure below shows.

[Figure: a 4×4 maze grid with the red mouse starting at the top-left cell, two black trap cells, and the yellow cheese.]

This chapter aims to implement the tabular Q-learning algorithm to solve this problem.

1.2 Environment

We use Python's tkinter package to build the environment that the agent interacts with.

import numpy as np
import time
import tkinter as tk

UNIT = 40   # pixels per grid cell
MAZE_H = 4  # grid height
MAZE_W = 4  # grid width


class Maze(tk.Tk, object):
    def __init__(self):
        super(Maze, self).__init__()
        # action space
        self.action_space = ['up', 'down', 'right', 'left']
        self.n_actions = len(self.action_space)
        # build the GUI
        self.title('Maze env')
        self.geometry('{0}x{1}'.format(MAZE_W * UNIT, MAZE_H * UNIT))  # window size "width x height"
        self._build_maze()

    def _build_maze(self):
        self.canvas = tk.Canvas(self, bg='white',
                                height=MAZE_H * UNIT,
                                width=MAZE_W * UNIT)  # background canvas
        # create grids
        for c in range(UNIT, MAZE_W * UNIT, UNIT):  # vertical separator lines
            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT
            self.canvas.create_line(x0, y0, x1, y1)
        for r in range(UNIT, MAZE_H * UNIT, UNIT):  # horizontal separator lines
            x0, y0, x1, y1 = 0, r, MAZE_W * UNIT, r
            self.canvas.create_line(x0, y0, x1, y1)
        # origin: the center of the first grid cell
        origin = np.array([UNIT/2, UNIT/2])
        # hell1
        hell1_center = origin + np.array([UNIT * 2, UNIT])
        self.hell1 = self.canvas.create_rectangle(
            hell1_center[0] - (UNIT/2 - 5), hell1_center[1] - (UNIT/2 - 5),
            hell1_center[0] + (UNIT/2 - 5), hell1_center[1] + (UNIT/2 - 5),
            fill='black')
        # hell2
        hell2_center = origin + np.array([UNIT, UNIT * 2])
        self.hell2 = self.canvas.create_rectangle(
            hell2_center[0] - (UNIT/2 - 5), hell2_center[1] - (UNIT/2 - 5),
            hell2_center[0] + (UNIT/2 - 5), hell2_center[1] + (UNIT/2 - 5),
            fill='black')
        # create the oval at the terminal cell (the cheese)
        oval_center = origin + np.array([UNIT*2, UNIT*2])
        self.oval = self.canvas.create_oval(
            oval_center[0] - (UNIT/2 - 5), oval_center[1] - (UNIT/2 - 5),
            oval_center[0] + (UNIT/2 - 5), oval_center[1] + (UNIT/2 - 5),
            fill='yellow')
        # create the red rect: the agent, starting at the top-left cell
        self.rect = self.canvas.create_rectangle(
            origin[0] - (UNIT/2 - 5), origin[1] - (UNIT/2 - 5),
            origin[0] + (UNIT/2 - 5), origin[1] + (UNIT/2 - 5),
            fill='red')
        # pack all: display the canvas
        self.canvas.pack()

    def get_state(self, rect):
        # convert the coordinate observation to a state tuple
        # use the normalized center as the state such as
        # |(1,1)|(2,1)|(3,1)|...
        # |(1,2)|(2,2)|(3,2)|...
        # |(1,3)|(2,3)|(3,3)|...
        # |....
        x0, y0, x1, y1 = self.canvas.coords(rect)
        x_center = (x0 + x1) / 2
        y_center = (y0 + y1) / 2
        state = ((x_center - (UNIT/2)) / UNIT + 1, (y_center - (UNIT/2)) / UNIT + 1)
        return state

    def reset(self):
        self.update()
        self.after(500)  # delay 500 ms
        self.canvas.delete(self.rect)  # delete the old agent rectangle
        origin = np.array([UNIT/2, UNIT/2])
        self.rect = self.canvas.create_rectangle(
            origin[0] - (UNIT/2 - 5), origin[1] - (UNIT/2 - 5),
            origin[0] + (UNIT/2 - 5), origin[1] + (UNIT/2 - 5),
            fill='red')
        # return observation
        return self.get_state(self.rect)

    def step(self, action):
        # one interaction between the agent and the environment
        s = self.get_state(self.rect)  # current state of the agent
        base_action = np.array([0, 0])
        reach_boundary = False
        if action == self.action_space[0]:    # up
            if s[1] > 1:
                base_action[1] -= UNIT
            else:  # hit the boundary: reward = -1 and stay in place
                reach_boundary = True
        elif action == self.action_space[1]:  # down
            if s[1] < MAZE_H:
                base_action[1] += UNIT
            else:
                reach_boundary = True
        elif action == self.action_space[2]:  # right
            if s[0] < MAZE_W:
                base_action[0] += UNIT
            else:
                reach_boundary = True
        elif action == self.action_space[3]:  # left
            if s[0] > 1:
                base_action[0] -= UNIT
            else:
                reach_boundary = True

        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent
        s_ = self.get_state(self.rect)  # next state

        # reward function
        if s_ == self.get_state(self.oval):      # reach the terminal
            reward = 1
            done = True
            s_ = 'success'
        elif s_ == self.get_state(self.hell1):   # reach a trap
            reward = -1
            s_ = 'block_1'
            done = False
        elif s_ == self.get_state(self.hell2):
            reward = -1
            s_ = 'block_2'
            done = False
        else:
            reward = 0
            done = False

        if reach_boundary:
            reward = -1

        return s_, reward, done

    def render(self):
        time.sleep(0.15)
        self.update()


if __name__ == '__main__':
    def test():
        for t in range(10):
            s = env.reset()
            print(s)
            while True:
                env.render()
                a = 'right'
                s, r, done = env.step(a)
                print(s)
                if done:
                    break

    env = Maze()
    env.after(100, test)  # call test() after a 100 ms delay
    env.mainloop()

An important part of this environment is the design of the reward function, which is as follows:

$$
\text{reward} = \begin{cases} 1, & \text{if the mouse reaches the cheese} \\ -1, & \text{if it reaches a trap or the boundary} \\ 0, & \text{otherwise} \end{cases}
$$
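For reference, the same rule can be written as a standalone function. This is only a sketch: the hypothetical helper reward_fn below mirrors the logic inside Maze.step, where the terminal and trap cells are renamed 'success', 'block_1', and 'block_2'.

# A minimal sketch of the reward rule as a standalone function
# (hypothetical helper; the real logic lives inside Maze.step above).
def reward_fn(next_state, reach_boundary):
    if next_state == 'success':                                  # reached the cheese
        return 1
    if reach_boundary or next_state in ('block_1', 'block_2'):   # boundary or trap
        return -1
    return 0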

Let us explain some functions of the class Maze.

  • First, the function _build_maze draws the initial maze: the grid lines, the two traps, the cheese, and the agent.
    In this example we use the normalized center of each grid cell as the state of each block.
  • Second, the function get_state converts the canvas coordinates of a block to a numerical state such as (1,1), (2,1), ⋯ (a worked example follows this list).
  • Third, the function reset places the mouse back in the starting grid and returns the initial state.
  • Then, the function step lets the agent interact with the environment for one step and returns the next state, the reward, and the done flag.
  • Finally, the function render updates the window after a short delay.
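To make the center-to-state conversion concrete, here is a small sanity check of the arithmetic used in get_state, extracted into a standalone function (a sketch assuming UNIT = 40 as in the environment above):

# Standalone check of the conversion inside Maze.get_state.
UNIT = 40

def coords_to_state(x0, y0, x1, y1):
    x_center = (x0 + x1) / 2
    y_center = (y0 + y1) / 2
    return ((x_center - UNIT/2) / UNIT + 1, (y_center - UNIT/2) / UNIT + 1)

print(coords_to_state(5, 5, 35, 35))    # (1.0, 1.0) -- the starting cell
print(coords_to_state(45, 5, 75, 35))   # (2.0, 1.0) -- one step to the right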

1.3 Tabular Q-learning Algorithm

import numpy as np
import pandas as pd


class QLearningTable():
    def __init__(self, actions, learning_rate=0.05, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions  # action list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy  # epsilon-greedy policy parameter
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # append a new all-zero row for an unseen state, using the
            # coordinate string as the index
            # (DataFrame.append was removed from pandas, so use pd.concat)
            self.q_table = pd.concat([
                self.q_table,
                pd.DataFrame(data=np.zeros((1, len(self.actions))),
                             columns=self.q_table.columns,
                             index=[state])
            ])

    def choose_action(self, observation):
        self.check_state_exist(observation)
        # action selection: epsilon-greedy algorithm
        if np.random.uniform() < self.epsilon:
            state_action = self.q_table.loc[observation, :]
            # some actions may have the same value; randomly choose one of them
            # (state_action == np.max(state_action) generates a bool mask)
            action = np.random.choice(state_action[state_action == np.max(state_action)].index)
        else:
            # choose a random action
            action = np.random.choice(self.actions)
        return action

    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'success':
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update

We store the Q-table as a pandas DataFrame. The functions are explained as follows.

  • First, the function check_state_exist checks whether a state already exists in the Q-table; if not, we append an all-zero row for it. States therefore enter the table lazily, only once they have actually been visited.
  • Second, the function choose_action follows the ϵ-greedy algorithm. In the textbook parameterization below, ϵ is the exploration weight, whereas in the code e_greedy = 0.9 is the probability of taking the greedy action:

$$
\pi(a|s) = \left\{ \begin{aligned} &1 - \frac{\epsilon}{|\mathcal{A}(s)|}\big(|\mathcal{A}(s)|-1\big), && \text{for the greedy action} \\ &\frac{\epsilon}{|\mathcal{A}(s)|}, && \text{for the other } |\mathcal{A}(s)|-1 \text{ actions} \end{aligned} \right.
$$
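A minimal sketch of the same selection rule over a plain dict of action values; q_values here is a hypothetical stand-in for one row of the Q-table, and as in QLearningTable.choose_action, epsilon is the probability of exploiting:

import numpy as np

def epsilon_greedy(q_values, epsilon=0.9):
    actions = list(q_values.keys())
    if np.random.uniform() < epsilon:
        best = max(q_values.values())
        # break ties among equally valued greedy actions at random
        return np.random.choice([a for a in actions if q_values[a] == best])
    return np.random.choice(actions)

print(epsilon_greedy({'up': 0.0, 'down': 0.1, 'right': 0.1, 'left': -0.5}))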

  • Third, the function learn updates the Q value as the Q-learning algorithm prescribes:

$$
\text{Q-learning}: \left\{ \begin{aligned} q_{t+1}(s_t,a_t) &= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[ q_t(s_t,a_t) - \big( r_{t+1} + \gamma \max_{a\in\mathcal{A}(s_{t+1})} q_t(s_{t+1},a) \big) \Big] \\ q_{t+1}(s,a) &= q_t(s,a), \quad \text{for all } (s,a) \ne (s_t,a_t) \end{aligned} \right.
$$
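A one-line numeric sketch of this update with toy numbers (QLearningTable.learn applies the same rule to the DataFrame entries):

alpha, gamma = 0.05, 0.9
q_sa, r, max_q_next = 0.0, -1.0, 0.1   # q_t(s_t,a_t), r_{t+1}, max_a q_t(s_{t+1},a)
q_sa += alpha * (r + gamma * max_q_next - q_sa)
print(q_sa)   # -0.0455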

1.4 Run the main script

Running the following main script ties all the pieces together.

from maze_env_custom import Maze
from RL_brain import QLearningTable

MAX_EPISODE = 30


def update():
    for episode in range(MAX_EPISODE):
        # initial observation: the state tuple derived from the rect's canvas coordinates
        observation = env.reset()
        while True:
            # fresh env
            env.render()
            # RL chooses an action based on the observation ['up', 'down', 'right', 'left']
            action = RL.choose_action(str(observation))
            # RL takes the action and gets the next observation and reward
            observation_, reward, done = env.step(action)
            # RL learns from this transition
            RL.learn(str(observation), action, reward, str(observation_))
            # swap observation
            observation = observation_
            # break the while loop at the end of this episode
            if done:
                break
        # show the q_table
        print(RL.q_table)
        print('\n')
    # end of game
    print('game over')
    env.destroy()


if __name__ == "__main__":
    env = Maze()
    RL = QLearningTable(env.action_space)
    env.after(100, update)
    env.mainloop()
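The loop above prints the Q-table after every episode. If you would rather inspect it offline, a small optional sketch (assuming a CSV file named q_table.csv, which is not part of the original code) is:

import pandas as pd

# Optional: persist the learned table after training (hypothetical file name).
RL.q_table.to_csv('q_table.csv')
# reload later with:
# q_table = pd.read_csv('q_table.csv', index_col=0)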

1.5 Check the Q table

After a long run we can check the Q-table to judge whether the learning is reasonable. The Q-table is as follows:

                  up      down     right          left
(1.0, 1.0) -0.226208  0.000963  0.000000 -9.750000e-02
(1.0, 2.0)  0.000024  0.005773  0.000000 -5.000000e-02
(2.0, 1.0) -0.050000  0.000000  0.000000  5.247904e-07
(2.0, 2.0)  0.000000 -0.050000 -0.050000  0.000000e+00
block_2     0.000000  0.000000  0.000000  1.793534e-04
(2.0, 4.0) -0.097500 -0.050000  0.336315  2.916072e-03
(1.0, 4.0)  0.002162 -0.140781  0.112337 -5.000000e-02
(1.0, 3.0)  0.000008  0.033479 -0.050000 -9.739821e-02
block_1     0.000000  0.097500  0.000000  0.000000e+00
(4.0, 2.0)  0.000000  0.006525 -0.050000 -5.000000e-02
success     0.000000  0.000000  0.000000  0.000000e+00
(3.0, 1.0) -0.050000 -0.047750  0.000000  0.000000e+00
(3.0, 4.0)  0.722610 -0.050000  0.000000  1.298347e-02
(4.0, 1.0) -0.050000  0.000101 -0.050000  0.000000e+00
(4.0, 3.0)  0.000000  0.000000  0.000000  1.426250e-01

For example, at the starting cell (1.0, 1.0), moving up or left makes the mouse hit the boundary and receive reward −1; hence the corresponding Q values in the table are negative.
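To read the greedy policy off the learned table, a short sketch using pandas (DataFrame.idxmax(axis=1) returns, for each state, the column, i.e. the action, with the largest Q value):

# Extract the greedy action per state from the learned Q-table.
greedy_policy = RL.q_table.idxmax(axis=1)
print(greedy_policy)
# e.g. state (3.0, 4.0) -> 'up', which heads straight for the cheese at (3.0, 3.0)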


Reference

Zhao Shiyu's course, Mathematical Foundations of Reinforcement Learning.
Mofan's Reinforcement Learning course.
