Problem: OpenAI Gym Lunar Lander model is not converging
Background:
I am trying to use deep reinforcement learning with Keras to train an agent to learn how to play the Lunar Lander OpenAI Gym environment. The problem is that my model is not converging. Here is my code:
```python
import numpy as np
import gym
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers


def get_random_action(epsilon):
    return np.random.rand(1) < epsilon


def get_reward_prediction(q, a):
    qs_a = np.concatenate((q, table[a]), axis=0)
    x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
    x[0] = qs_a
    guess = model.predict(x[0].reshape(1, x.shape[1]))
    r = guess[0][0]
    return r


results = []
epsilon = 0.05
alpha = 0.003
gamma = 0.3
environment_parameters = 8
num_of_possible_actions = 4
obs = 15
mem_max = 100000
epochs = 3
total_episodes = 15000

possible_actions = np.arange(0, num_of_possible_actions)
table = np.zeros((num_of_possible_actions, num_of_possible_actions))
table[np.arange(num_of_possible_actions), possible_actions] = 1

env = gym.make('LunarLander-v2')
env.reset()

i_x = np.random.random((5, environment_parameters + num_of_possible_actions))
i_y = np.random.random((5, 1))

model = Sequential()
model.add(Dense(512, activation='relu', input_dim=i_x.shape[1]))
model.add(Dense(i_y.shape[1]))
opt = optimizers.adam(lr=alpha)
model.compile(loss='mse', optimizer=opt, metrics=['accuracy'])

total_steps = 0
i_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
i_y = np.zeros(shape=(1, 1))
mem_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
mem_y = np.zeros(shape=(1, 1))
max_steps = 40000

for episode in range(total_episodes):
    g_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
    g_y = np.zeros(shape=(1, 1))
    q_t = env.reset()
    episode_reward = 0
    for step_number in range(max_steps):
        if episode < obs:
            a = env.action_space.sample()
        else:
            if get_random_action(epsilon, total_episodes, episode):
                a = env.action_space.sample()
            else:
                actions = np.zeros(shape=num_of_possible_actions)
                for i in range(4):
                    actions[i] = get_reward_prediction(q_t, i)
                a = np.argmax(actions)
        # env.render()
        qa = np.concatenate((q_t, table[a]), axis=0)
        s, r, episode_complete, data = env.step(a)
        episode_reward += r
        if step_number is 0:
            g_x[0] = qa
            g_y[0] = np.array([r])
            mem_x[0] = qa
            mem_y[0] = np.array([r])
        g_x = np.vstack((g_x, qa))
        g_y = np.vstack((g_y, np.array([r])))
        if episode_complete:
            for i in range(0, g_y.shape[0]):
                if i is 0:
                    g_y[(g_y.shape[0] - 1) - i][0] = g_y[(g_y.shape[0] - 1) - i][0]
                else:
                    g_y[(g_y.shape[0] - 1) - i][0] = g_y[(g_y.shape[0] - 1) - i][0] + gamma * g_y[(g_y.shape[0] - 1) - i + 1][0]
            if mem_x.shape[0] is 1:
                mem_x = g_x
                mem_y = g_y
            else:
                mem_x = np.concatenate((mem_x, g_x), axis=0)
                mem_y = np.concatenate((mem_y, g_y), axis=0)
            if np.alen(mem_x) >= mem_max:
                for l in range(np.alen(g_x)):
                    mem_x = np.delete(mem_x, 0, axis=0)
                    mem_y = np.delete(mem_y, 0, axis=0)
        q_t = s
        if episode_complete and episode >= obs:
            if episode % 10 == 0:
                model.fit(mem_x, mem_y, batch_size=32, epochs=epochs, verbose=0)
        if episode_complete:
            results.append(episode_reward)
            break
```
I am running tens of thousands of episodes and my model still won't converge. Over the first ~5000 episodes the average change in policy decreases while the average reward increases, but then it goes off the deep end and the average reward per episode actually goes down. I've tried messing with the hyperparameters, but I haven't gotten anywhere with that. I'm trying to model my code after the DeepMind DQN paper.
Solution:
You might want to change your `get_random_action` function to decay epsilon with each episode. After all, assuming your agent can learn an optimal policy, at some point you won't want to take random actions at all, right? Here's a slightly different version of `get_random_action` that would do this for you:
```python
def get_random_action(epsilon, total_episodes, episode):
    explore_prob = epsilon - (epsilon * (episode / total_episodes))
    return np.random.rand(1) < explore_prob
```
In this modified version of your function, epsilon will decrease slightly with each episode. This may help your model converge.
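As a quick sanity check (this snippet is not from the original answer; it just plugs in the question's own values of epsilon = 0.05 and total_episodes = 15000), you can print the exploration probability at a few points in training to see the linear decay:

```python
# Hyperparameters taken from the question
epsilon = 0.05
total_episodes = 15000

# Exploration probability under the linear decay at the start, middle, and end of training
for episode in (0, 7500, 14999):
    explore_prob = epsilon - (epsilon * (episode / total_episodes))
    print(f"episode {episode}: explore_prob = {explore_prob:.6f}")
# episode 0: explore_prob = 0.050000
# episode 7500: explore_prob = 0.025000
# episode 14999: explore_prob = 0.000003
```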
There are a handful of ways to decay a parameter. For more info, check out this Wikipedia article.
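For example, a common alternative to the linear schedule above is multiplicative (exponential) decay with a floor, so the agent never stops exploring entirely. The sketch below uses illustrative values for `epsilon_min` and `decay_rate`; they are assumptions, not numbers from the question or the answer:

```python
# Illustrative exponential-decay schedule (values are assumptions, not from the question)
epsilon = 1.0         # start fully exploratory
epsilon_min = 0.01    # lower bound so some exploration always remains
decay_rate = 0.995    # multiply epsilon by this after every episode
total_episodes = 15000

for episode in range(total_episodes):
    # ... run one episode, choosing a random action with probability epsilon ...
    epsilon = max(epsilon_min, epsilon * decay_rate)
```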