In [1]:

```
%matplotlib inline
```

Based on the tutorial by:

**Author**: `Adam Paszke` (https://github.com/apaszke)

This tutorial shows how to use PyTorch to train a Deep Q-Learning (DQN) agent
on the CartPole-v0 task from the `OpenAI Gym` (https://gym.openai.com/).

**Task**

You can find an official leaderboard with various algorithms and visualizations at the
`Gym website` (https://gym.openai.com/envs/CartPole-v0).

The agent has to decide between two actions, moving the cart left or right, so that the pole attached to it stays upright.

In this task, the reward is +1 for every timestep. The environment terminates if:

- the pole falls over too far, or
- the cart moves more than 2.4 units away from center.

This means that better-performing episodes run for longer and accumulate a larger return.
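The reward and termination rules above can be sketched in a few lines. This is only an illustration, not the Gym source: the helper names `is_terminal` and `episode_return` are hypothetical, and the 12 degree angle limit is an assumption taken from Gym's CartPole implementation rather than from the text above.

```python
import math

# The 2.4 limit comes from the text above; the 12-degree angle limit is an
# assumption based on Gym's CartPole implementation.
X_LIMIT = 2.4
THETA_LIMIT = math.radians(12)


def is_terminal(cart_x, pole_theta):
    """Return True once the cart or the pole leaves the allowed range."""
    return abs(cart_x) > X_LIMIT or abs(pole_theta) > THETA_LIMIT


def episode_return(trajectory):
    """Sum the +1 rewards over a list of (cart_x, pole_theta) pairs."""
    total = 0.0
    for cart_x, pole_theta in trajectory:
        total += 1.0  # +1 for every timestep, including the terminating one
        if is_terminal(cart_x, pole_theta):
            break
    return total


# The third step tilts the pole past the limit, so the episode stops there.
toy_trajectory = [(0.0, 0.0), (0.1, 0.05), (0.2, 0.3), (0.3, 0.0)]
toy_return = episode_return(toy_trajectory)
```

An agent that keeps the pole balanced for more steps collects a proportionally larger return, which is exactly what the training signal rewards.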

Neural networks can solve the task purely by looking at the scene.

- we'll use a patch of the screen centered on the cart as the observation of the current state
- our actions are move left or move right

Strictly speaking, we will present the state as the difference between the current screen patch and the previous one. This will allow the agent to take the velocity of the pole into account from one image.
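To see why a difference image carries velocity information, here is a minimal sketch with NumPy on a toy grayscale patch. The helper `make_patch` and the patch sizes are made up for illustration; the real observations come from the rendered screen.

```python
import numpy as np


def make_patch(pole_column, width=8, height=4):
    """Build a toy grayscale screen patch with a vertical 'pole' of ones."""
    patch = np.zeros((height, width), dtype=np.float32)
    patch[:, pole_column] = 1.0
    return patch


previous_patch = make_patch(pole_column=3)
current_patch = make_patch(pole_column=4)

# State as described above: current patch minus previous patch.
state = current_patch - previous_patch

# Columns the pole moved into are +1 and columns it left are -1, so the
# direction of motion is visible in a single array.
arrived = np.where(state[0] > 0)[0]
left = np.where(state[0] < 0)[0]
```

If the pole does not move between two frames, the difference is all zeros, so a non-zero state always signals motion.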

**Packages**

First, let's import the packages we need. We need `gym` (https://gym.openai.com/docs) for the environment (install it with `pip install gym`).
We'll also use the following from PyTorch:

- neural networks (`torch.nn`)
- optimization (`torch.optim`)
- automatic differentiation (`torch.autograd`)
- utilities for vision tasks (`torchvision`, a separate package: https://github.com/pytorch/vision)

In [2]:

```
!pip install gym[atari]
```

In [3]:

```
!pip install pyglet==1.5.0
```

In [4]:

```
#!apt-get install python-opengl -y
#!pip install PyOpenGL
#!pip install PyOpenGL_accelerate
!pip install pyvirtualdisplay
```

In [5]:

```
import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple, deque
from itertools import count
from PIL import Image
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T
# to display things
import os
from pyvirtualdisplay import Display
from matplotlib import animation, rc
display = Display(visible=0, size=(1400, 900))
display.start()
os.environ["DISPLAY"] = ":" + str(display.display) + "." + str(display._obj._screen)
# setup the environment
env = gym.make('CartPole-v0').unwrapped
# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    # note: this rebinds the name `display`, shadowing the virtual display above
    from IPython import display
plt.ion()
# if gpu is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

In [6]:

```
env.reset()
plt.imshow(env.render('rgb_array'))
plt.grid(False)
```

In [7]:

```
frame = []
env.reset()
total_reward = 0
# play up to 100 steps with random actions, recording every rendered frame
for i in range(100):
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    total_reward += reward
    img = plt.imshow(env.render('rgb_array'))
    frame.append([img])
    if done:
        break
print("Game terminated after", len(frame), " steps with reward ", total_reward)
```

In [8]:

```
fig = plt.figure()
anim = animation.ArtistAnimation(fig, frame, interval=100, repeat_delay=1000, blit=True)
rc('animation', html='jshtml')
anim
```


We'll be using experience replay memory for training our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.

For this, we're going to need two classes:

- `Transition` - a named tuple representing a single transition in our environment. It essentially maps (state, action) pairs to their (next_state, reward) result, with the state being the screen difference image as described later on.
- `ReplayMemory` - a cyclic buffer of bounded size that holds the transitions observed recently. It also implements a `.sample()` method for selecting a random batch of transitions for training.

In [9]:

```
# the structure of the transition that we store
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))


# stores the Experience Replay buffer
class ReplayMemory(object):
    def __init__(self, capacity):
        self.cap = capacity
        # a deque with maxlen drops the oldest transition once it is full
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```
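As a quick usage sketch, the snippet below fills a tiny buffer and samples from it. The definitions are repeated (without the unused `cap` attribute) so that the block runs on its own, and the integers are placeholders standing in for the screen-difference states. The `Transition(*zip(*batch))` transpose is one common way to turn a list of transitions into per-field tuples; treat it as an illustration rather than the only option.

```python
import random
from collections import deque, namedtuple

# Repeated from the cell above so that this sketch runs on its own.
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))


class ReplayMemory(object):
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


memory = ReplayMemory(capacity=3)

# Push five toy transitions; integers stand in for the real image states.
for step in range(5):
    memory.push(step, step % 2, step + 1, 1.0)

# Only the three most recent transitions survive in the cyclic buffer.
stored_states = [t.state for t in memory.memory]

# Sample a random batch, then transpose the list of Transitions into a single
# Transition whose fields are tuples (one tuple per field).
batch = memory.sample(2)
fields = Transition(*zip(*batch))
```

Because the buffer is bounded, old experience is discarded automatically, and because sampling is random, consecutive transitions rarely end up in the same batch.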

Now, let's define our model. But first, let's quickly recap what a DQN is.

Our environment is deterministic, so all equations presented here are also formulated deterministically for the sake of simplicity. In the reinforcement learning literature, they would also contain expectations over stochastic transitions in the environment.

Our aim will be to train a policy that tries to maximize the discounted,
cumulative reward
$R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t - t_0} r_t$, where
$R_{t_0}$ is also known as the *return*. The discount,
$\gamma$, should be a constant between $0$ and $1$
that ensures the sum converges. It makes rewards from the uncertain far
future less important for our agent than the ones in the near future
that it can be fairly confident about.
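The return formula is easy to check numerically. The sketch below computes it for finite reward lists; the function name `discounted_return` and the value `gamma = 0.99` are illustrative choices, not values taken from the text.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * r_{t0 + k} for a finite list of rewards."""
    total = 0.0
    # Accumulate from the last reward backwards: R_t = r_t + gamma * R_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# CartPole-style rewards: +1 per timestep.
short_episode = [1.0] * 10
long_episode = [1.0] * 200
gamma = 0.99  # illustrative value, not taken from the text

short_return = discounted_return(short_episode, gamma)
long_return = discounted_return(long_episode, gamma)

# With a constant reward of 1, the infinite sum is bounded by 1 / (1 - gamma),
# which is why a discount below 1 makes the return converge.
upper_bound = 1.0 / (1.0 - gamma)
```

Longer episodes still yield larger returns, but each additional step contributes a little less than the one before it.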

The main idea behind Q-learning is that if we had a function $Q^*: \mathrm{State} \times \mathrm{Action} \to \mathbb{R}$ that could tell us what our return would be if we were to take an action in a given state, then we could easily construct a policy that maximizes our rewards:

$$\pi^*(s) = \arg\max_a Q^*(s, a) \tag{1}$$

However, we don't know everything about the world, so we don't have access to $Q^*$. But, since neural networks are universal function approximators, we can simply create one and train it to resemble $Q^*$.
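Equation (1) says the optimal policy simply picks the action with the largest Q-value. Here is a minimal sketch of that rule on a hand-written table; the state names and Q-values are made up for illustration, and a trained network would supply the values instead.

```python
def greedy_action(q_values):
    """Return the index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for two states; index 0 = move left, 1 = move right.
q_table = {
    'pole_leaning_left': [0.9, 0.2],
    'pole_leaning_right': [0.1, 0.7],
}

# The greedy policy maps every state to its highest-valued action.
policy = {state: greedy_action(q) for state, q in q_table.items()}
```

With only two actions the argmax is trivial, which is why the quality of the Q-value estimates is what determines the quality of the policy.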

For our training update rule, we'll use the fact that every $Q$ function for some policy obeys the Bellman equation:

$$Q^{\pi}(s, a) = r + \gamma Q^{\pi}(s', \pi(s')) \tag{2}$$

The difference between the two sides of the equality is known as the temporal difference error, $\delta$:

$$\delta = Q(s, a) - \left(r + \gamma \max_a Q(s', a)\right) \tag{3}$$

To minimise this error, we will use the **Smooth L1 Loss**, also known as the **Huber loss** (https://en.wikipedia.org/wiki/Huber_loss).
The Huber loss acts like the mean squared error when the error is small, but like the mean
absolute error when the e