---
title: Reinforcement Learning
categories: session
---

**To be completed**

+ **Goal** Understand and be able to implement Q-learning
+ **Reading** Russell and Norvig, Chapter 23
+ [Eirik's slides from 2022](<Reinforcement Learning Slides 2.pdf>)

# Exercises

Last session we discussed the Q-function,
$$Q(s,a) = \sum_{s'}P(s'|s,a)[R(s,a,s') + \gamma \max_{a'}Q(s',a')]$$
and the optimal policy defined in terms of the Q-function:
$$\pi^*(s) = \operatorname*{argmax}_a Q(s,a)$$
We also discussed iterative estimation of the utilities and the policies.
In this session, we will implement an iterative estimation algorithm for
the Q-values, known as Q-learning.
This is a model-free, off-policy reinforcement learning algorithm:
model-free because it never estimates the transition model $P(s'|s,a)$,
and off-policy because it learns the value of the greedy policy while
following a more exploratory one.

The exercise outline below is based partly on Eirik's assignment in 2022
and partly on the Gymnasium
[tutorial on Blackjack](https://www.gymlibrary.dev/environments/toy_text/blackjack/).

Note that I have not explicitly asked you to output any diagnostics
along the way.  You will almost certainly have to add some yourself,
so that you know what is going on.

## Goal and overview

The goal for this session is to implement an agent that can solve the
Frozen Lake problem as well as possible, using Q-learning.
The skeleton for the Agent will look like this:
```python
class Agent:
    def __init__(self, env, learning_rate=0.1,
                 initial_epsilon=1.0, epsilon_decay=1.0/50000,
                 final_epsilon=0.1, discount_factor=0.95):
        # epsilon_decay is subtracted from epsilon once per episode
        pass

    def get_action(self, obs):
        pass

    def update(self, obs, action, reward, terminated, next_obs):
        pass

    def decay_epsilon(self):
        pass
```

Thus we need four methods.  The first three are
the constructor, the move generator, and the model updater.
The last method reduces $\epsilon$, which is the probability
of making a random move instead of the best move according to
the model.

In order to run the model, you can use the following script:
```python
import matplotlib.pyplot as plt   # useful for the diagnostics you add later
from tqdm import tqdm             # progress bars for longer training runs

import gymnasium as gym

from Agent import Agent

env = gym.make('FrozenLake-v1', desc=None, map_name="4x4",
               is_slippery=False, render_mode="human")

agent = Agent(env)

for episode in range(30):
    obs, info = env.reset()
    done = False

    # play one episode
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)

        # update the agent
        agent.update(obs, action, reward, terminated, next_obs)

        # record whether the episode is over and move to the next state
        done = terminated or truncated
        obs = next_obs

    agent.decay_epsilon()
```

**Note**
We have set `is_slippery=False` above.
That is useful for the initial testing;
we will change it to `True` later.

**Note 2**
We use 30 episodes.  This is ridiculously few, but the animation
is slow, and we need to be able to run the script several times during testing.

**Note 3**
You can turn off the animation by dropping the `render_mode` argument
(or passing `render_mode=None`).  This is a lot faster, but you will
need some other way to see what is going on.

## 1. Constructor

Implement the constructor.
You need to record all the parameters and initialise the Q-table.
You can use Eirik's initial Q-values below, or use a defaultdict as the
[Blackjack tutorial](https://www.gymlibrary.dev/environments/toy_text/blackjack/)
does.
```python
import numpy as np

initialQ = np.array([
    [0.009, 0.192, 0.007, 0.009],
    [0.003, 0.002, 0.003, 0.17],
    [0.003, 0.002, 0.001, 0.067],
    [0.001, 0.001, 0.002, 0.037],
    [0.526, 0.002, 0.001, 0.002],
    [0., 0., 0., 0.],
    [0.046, 0., 0., 0.],
    [0., 0., 0., 0.],
    [0.002, 0.002, 0.002, 0.709],
    [0.001, 0.597, 0.001, 0.001],
    [0.945, 0., 0., 0.],
    [0., 0., 0., 0.],
    [0., 0., 0., 0.],
    [0.02, 0.012, 0.898, 0.016],
    [0.061, 0.991, 0.092, 0.068],
    [0., 0., 0., 0.]
])
```

In this format, `initialQ[state][action]` is the tentative value for
$Q(s,a)$.
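
A minimal constructor sketch, assuming a numpy Q-table initialised to
zeroes.  The attribute names (`self.Q`, `self.lr`, and so on) are my own
choice and you are free to use others, but the later sketches assume them:

```python
import numpy as np

class Agent:
    def __init__(self, env, learning_rate=0.1,
                 initial_epsilon=1.0, epsilon_decay=1.0/50000,
                 final_epsilon=0.1, discount_factor=0.95):
        self.env = env
        self.lr = learning_rate
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon
        self.discount_factor = discount_factor
        # one row per state, one column per action
        # (swap in Eirik's initialQ here if you prefer)
        self.Q = np.zeros((env.observation_space.n, env.action_space.n))
```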

## 2. Move generator

The move generator `get_action()` has to return a valid action,
that is, an integer in the range 0--3 for the Frozen Lake problem.
With probability $\epsilon$ you want to return a random action
(see last session for a code example), and with probability $1-\epsilon$,
the action which maximises $Q$ according to the current estimate.
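
A minimal $\epsilon$-greedy sketch, assuming the attribute names from
the constructor sketch above:

```python
import numpy as np

def get_action(self, obs):
    # explore: with probability epsilon, pick a uniformly random action
    if np.random.random() < self.epsilon:
        return self.env.action_space.sample()
    # exploit: otherwise pick the action with the highest Q-value estimate
    return int(np.argmax(self.Q[obs]))
```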

+ Implement get_action().
+ Test the simulator.  It should work already at this stage.

## 3. Diagnostic output

+ Add code to count the number of times you win the game
(one approach is sketched below).
+ Turn off the animation and increase the number of episodes.
+ Is the default strategy able to win the game ever?
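
A possible way to count wins, assuming the training loop from the
overview; in Frozen Lake the agent receives reward 1 exactly when it
reaches the goal, and 0 otherwise:

```python
n_episodes = 10000
wins = 0
for episode in range(n_episodes):
    obs, info = env.reset()
    done = False
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        agent.update(obs, action, reward, terminated, next_obs)
        done = terminated or truncated
        obs = next_obs
    # the final reward is 1 only if the agent reached the goal
    wins += int(reward > 0)
    agent.decay_epsilon()
print(f"Won {wins} of {n_episodes} episodes")
```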

## 4. Model updater

Now we need some way to update the Q-table.
Q-learning is based on one very simple update rule:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(\left[ R(s,a,s') + \gamma \max\limits_{a'}Q(s',a')\right] - Q(s,a)\right),$$
where $\alpha$ is the learning rate, which controls the speed of convergence.
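
The rule translates almost directly into code.  A sketch, again assuming
the attribute names from the constructor sketch; note that the future
term $\max_{a'}Q(s',a')$ must be zero when the episode has terminated,
since there is no future state to act in:

```python
import numpy as np

def update(self, obs, action, reward, terminated, next_obs):
    # the estimated future value is zero in a terminal state
    future_q = (not terminated) * np.max(self.Q[next_obs])
    # temporal-difference error: target minus current estimate
    td_error = reward + self.discount_factor * future_q - self.Q[obs][action]
    # move the estimate a fraction (the learning rate) towards the target
    self.Q[obs][action] += self.lr * td_error
```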

+ Implement update()

## 5. Epsilon decay

```python
self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
```


+ What does the above line do?
+ Do the attribute names match the ones you have used?
+ Implement decay_epsilon().

## 6. Testing

+ Test the system.  Add more diagnostic output as required.
+ Turn the animation off to be able to test realistically.
+ Try both Eirik's default Q-table and one initialised with zeroes only
(a zero table is sketched below).  Does this matter a lot?
+ Does the Q-table change a lot during training?
+ What happens when you change the training parameters (input to the Agent
constructor)?
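
A zero-initialised Q-table of the right shape can be built from the
environment's discrete spaces, as in the constructor sketch:

```python
import numpy as np

initialQ = np.zeros((env.observation_space.n, env.action_space.n))
```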

## 7.  The slippery ice

Change the environment to be slippery:
```python
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True)
```


+ Repeat the tests from Exercise 6.  What do you observe?
