%matplotlib inline
Based on the tutorial by:
Author: Adam Paszke
https://github.com/apaszke
This tutorial shows how to use PyTorch to train a Deep Q Learning (DQN) agent
on the CartPole-v0 task from the OpenAI Gym
https://gym.openai.com/
Task
You can find an official leaderboard with various algorithms and visualizations at the
Gym website
https://gym.openai.com/envs/CartPole-v0
The agent has to decide between two actions - moving the cart left or right - so that the pole attached to it stays upright.
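In this task, the reward is +1 for every timestep, and the episode ends if the pole falls over too far or the cart moves too far from the center.
This means better-performing episodes run for longer and accumulate a larger return.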
Neural networks can solve the task purely by looking at the scene.
Strictly speaking, we will present the state as the difference between the current screen patch and the previous one. This will allow the agent to take the velocity of the pole into account from one image.
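To illustrate why a frame difference carries velocity information, here is a minimal sketch on a toy one-row "screen". The helper make_frame and the 8-pixel width are hypothetical and only stand in for the rendered screen patches used in this tutorial.

```python
import numpy as np


def make_frame(pole_x, width=8):
    """Hypothetical 1-row 'screen': 1.0 where the pole is, 0.0 elsewhere."""
    frame = np.zeros((1, width), dtype=np.float32)
    frame[0, pole_x] = 1.0
    return frame


# Two consecutive frames: the pole moved one pixel to the right.
previous_screen = make_frame(pole_x=3)
current_screen = make_frame(pole_x=4)

# The state is the difference between the current and the previous frame.
# Negative entries mark where the pole was, positive entries where it is now.
state = current_screen - previous_screen
print(state)
```

A single frame only shows where the pole is. The signed difference also shows which way it moved, which is what the agent needs to react in time.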
Packages
First, let's import the packages we need. For the environment, we need
gym https://gym.openai.com/docs
(install it using pip install gym).
We'll also use the following from PyTorch:
neural networks (torch.nn)
optimization (torch.optim)
automatic differentiation (torch.autograd)
utilities for vision tasks (torchvision - a separate package https://github.com/pytorch/vision).
!pip install gym[atari]
!pip install pyglet==1.5.0
#!apt-get install python-opengl -y
#!pip install PyOpenGL
#!pip install PyOpenGL_accelerate
!pip install pyvirtualdisplay
import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple, deque
from itertools import count
from PIL import Image
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T
# to display things
import os
from pyvirtualdisplay import Display
from matplotlib import animation , rc
display = Display(visible=0, size=(1400, 900))
display.start()
os.environ["DISPLAY"] = ":" + str(display.display) + "." + str(display._obj._screen)
# setup the environment
env = gym.make('CartPole-v0').unwrapped
# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display
plt.ion()
# if gpu is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env.reset()
plt.imshow(env.render('rgb_array'))
plt.grid(False)
frame = []
env.reset()
total_reward = 0
for i in range(100):
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    total_reward += reward
    img = plt.imshow(env.render('rgb_array'))
    frame.append([img])
    if done:
        break
print("Game terminated after", len(frame), " steps with reward ", total_reward)
fig = plt.figure()
anim = animation.ArtistAnimation(fig, frame, interval=100, repeat_delay=1000, blit=True)
rc('animation', html='jshtml')
anim
We'll be using experience replay memory for training our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.
For this, we're going to need two classes:
Transition - a named tuple representing a single transition in our environment. It essentially maps (state, action) pairs to their (next_state, reward) result, with the state being the screen difference image as described later on.
ReplayMemory - a cyclic buffer of bounded size that holds the transitions observed recently. It also implements a .sample() method for selecting a random batch of transitions for training.
# the structure of the transition that we store
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))
# stores the Experience Replay buffer
class ReplayMemory(object):
    def __init__(self, capacity):
        self.cap = capacity
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
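The following minimal sketch shows the two behaviours the buffer relies on: a bounded deque silently evicts the oldest transition once it is full, and a random sample can be transposed into one batch per field. It uses a bare deque instead of the class above so that it runs on its own; the capacity of 3 and the integer placeholder states are assumptions made purely for illustration.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))

# A tiny capacity makes the cyclic behaviour easy to see.
memory = deque([], maxlen=3)
for step in range(5):
    # Placeholder integers stand in for real screen-difference states.
    memory.append(Transition(step, step % 2, step + 1, 1.0))

# Only the 3 most recent transitions remain; steps 0 and 1 were evicted.
oldest_state = memory[0].state

# Draw a random mini-batch, then transpose the list of Transitions
# into a single Transition whose fields are tuples of batch values.
random.seed(0)
transitions = random.sample(memory, 2)
batch = Transition(*zip(*transitions))
print(batch.state, batch.reward)
```

The transposed form is convenient because each field (all states, all actions, and so on) can later be turned into one tensor per batch.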
Now, let's define our model. But first, let's quickly recap what a DQN is.
Our environment is deterministic, so all equations presented here are also formulated deterministically for the sake of simplicity. In the reinforcement learning literature, they would also contain expectations over stochastic transitions in the environment.
Our aim will be to train a policy that tries to maximize the discounted, cumulative reward $R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t - t_0} r_t$, where $R_{t_0}$ is also known as the return. The discount, $\gamma$, should be a constant between $0$ and $1$ that ensures the sum converges. It makes rewards from the uncertain far future less important for our agent than the ones in the near future that it can be fairly confident about.
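As a small worked example of the discounted return, the sketch below sums a finite list of rewards, weighting the reward received k steps in the future by the discount raised to the power k. The function name discounted_return and the sample values are illustrative assumptions, not part of the tutorial's code.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite episode."""
    total = 0.0
    # Accumulate from the last reward backwards so that each step
    # multiplies everything that comes after it by gamma once.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three timesteps with reward +1 each and a discount of 0.5:
# 1 + 0.5 * 1 + 0.25 * 1 = 1.75
value = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(value)
```

With a discount strictly below 1 and bounded rewards, the terms shrink geometrically, which is why the infinite sum stays finite.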
The main idea behind Q-learning is that if we had a function $Q^*: \text{State} \times \text{Action} \rightarrow \mathbb{R}$ that could tell us what our return would be if we were to take an action in a given state, then we could easily construct a policy that maximizes our rewards:
$$\pi^*(s) = \arg\max_a Q^*(s, a)$$
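To make the construction concrete, here is a minimal sketch of a greedy policy over a hand-written table of action values. The state names, action names, and numbers in q_table are entirely hypothetical; in the tutorial these values come from a neural network rather than a lookup table.

```python
ACTIONS = ("left", "right")

# Hypothetical action values, indexed by (state, action).
q_table = {
    ("pole_leaning_left", "left"): 0.8,
    ("pole_leaning_left", "right"): 0.2,
    ("pole_leaning_right", "left"): 0.1,
    ("pole_leaning_right", "right"): 0.9,
}


def greedy_policy(state):
    """Pick the action with the highest estimated return in this state."""
    return max(ACTIONS, key=lambda action: q_table[(state, action)])


print(greedy_policy("pole_leaning_left"))
print(greedy_policy("pole_leaning_right"))
```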
However, we don't know everything about the world, so we don't have access to $Q^*$. But, since neural networks are universal function approximators, we can simply create one and train it to resemble $Q^*$.
For our training update rule, we'll use the fact that every $Q$ function for some policy obeys the Bellman equation:
$$Q^{\pi}(s, a) = r + \gamma Q^{\pi}(s', \pi(s'))$$
The difference between the two sides of the equality is known as the temporal difference error, $\delta$:
$$\delta = Q(s, a) - \left(r + \gamma \max_{a'} Q(s', a')\right)$$
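The sketch below computes the temporal difference error for a single transition using plain numbers. It assumes the convention that the error is the current estimate minus the bootstrapped target (reward plus the discounted best value of the next state); the function name td_error and the sample values are illustrative only.

```python
def td_error(q_sa, reward, gamma, next_q_values):
    """Temporal difference error for one transition.

    q_sa          -- current estimate for the action taken in the current state
    reward        -- reward observed after taking that action
    gamma         -- discount factor
    next_q_values -- estimates for every action in the next state
    """
    # Bootstrapped target: immediate reward plus the discounted
    # value of the best action available in the next state.
    target = reward + gamma * max(next_q_values)
    return q_sa - target


# target = 1.0 + 0.5 * 3.0 = 2.5, so the error is 2.0 - 2.5 = -0.5
delta = td_error(q_sa=2.0, reward=1.0, gamma=0.5, next_q_values=[1.0, 3.0])
print(delta)
```

A negative error here means the current estimate is too low relative to the target, so training should push it upward.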
To minimise this error, we will use the Smooth L1 Loss aka Huber loss https://en.wikipedia.org/wiki/Huber_loss. The Huber loss acts like the mean squared error when the error is small, but like the mean absolute error when the e