---
title: Reinforcement Learning
categories: session
---

+ **Goal** Understand and be able to implement Q-learning
    + Builds on the Markov Decision Process Model [MDP]()
      from last week.
    + Next week: [Deep Q-Learning]()
+ **Reading** Russell and Norvig Chapter 23
+ [Eirik's slides from 2022](Reinforcement Learning Slides 2.pdf)
+ [Eirik's Jupyter Notebook 2022](https://github.com/eiriksfa/pai-rl/blob/main/pai_rl/notebooks/frozen_lake.ipynb)

# Exercises

![RL Agent](rlagent.png)

In this session we will create a Reinforcement Learning Agent, as 
depicted in the diagram.  Note the internal model, with one function
to update it and one function that uses it to choose an action.
You should keep this picture in mind throughout.
  
Last session we discussed the Q-function,
$$Q(s,a) = \sum_{s'}P(s'|s,a)[R(s,a,s') + \gamma \max_{a'}Q(s',a')]$$
and the optimal policy derived from the Q-function:
$$\pi^*(s) = \mathop{\text{argmax }}\limits_aQ(s,a)$$
We also discussed iterative estimation of the utilities and the policies.
This session, we will implement an iterative estimation algorithm for
the Q-values, known as Q-learning.
This is a model-free, off-policy reinforcement learning algorithm:
model-free because the agent never estimates the transition model
$P(s'|s,a)$ explicitly, and off-policy because the update bootstraps from
the greedy action $\max_{a'}Q(s',a')$ even when the action actually taken
was an exploratory one.

The exercise outline below is based partly on Eirik's assignment from 2022
and partly on the Gymnasium 
[tutorial on Blackjack](https://www.gymlibrary.dev/environments/toy_text/blackjack/).

Note that I have not asked you explicitly to output any diagnostics
along the way.  You will almost certainly have to add some yourself,
so that you know what is going on.

## Goal and overview

The goal for this session is to implement an agent that can solve the
Frozen Lake problem as well as possible, using Q-learning.
The skeleton for the Agent will look like this:
```python  
class Agent:
    def __init__(self, env, learning_rate=0.1,
                 initial_epsilon=1.0, epsilon_decay=10**(-50000),
                 final_epsilon=0.1, discount_factor=0.95):
        # store the parameters and initialise the Q-table
        pass

    def get_action(self, obs):
        # choose an action for the observed state (epsilon-greedy)
        pass

    def update(self, obs, action, reward, terminated, next_obs):
        # update the Q-table from one observed transition
        pass

    def decay_epsilon(self):
        # reduce the exploration probability epsilon
        pass
```
Thus we need four methods.  The most obvious ones are
the constructor, the move generator, and the model updater.
The last method reduces $\epsilon$, which is the probability
of making a random move instead of the best move according to
the model.

To train and run the agent, you can use the following script:
```python
import matplotlib.pyplot as plt   # handy for plotting diagnostics later
from tqdm import tqdm             # progress bar for longer training runs
from Agent import Agent           # your Agent class, assumed to live in Agent.py

import gymnasium as gym

env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False, render_mode="human")

agent = Agent(env)

for episode in range(30):
    obs, info = env.reset()
    done = False

    # play one episode
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)

        # update the agent
        agent.update(obs, action, reward, terminated, next_obs)

        # check whether the episode is over and move on to the next observation
        done = terminated or truncated
        obs = next_obs

    agent.decay_epsilon()
```

**Note**
We have set `is_slippery=False` above.
That's useful for the initial testing;
we will change it to `True` later.

**Note 2**
We use only 30 episodes.  This is far too few for learning, but the
animation is slow, and we need to be able to run the script several
times while testing.

**Note 3**
You can turn off the animation by changing to `render_mode=None`
(or simply dropping the `render_mode` argument).  This is a lot faster,
but you will need some other way to see what is going on.

## 1. Constructor

Implement the constructor.
You need to record all the parameters and initialise the Q-table.
You can use Eirik's initial Q-values below, or use a `defaultdict`
as in the
[Blackjack tutorial](https://www.gymlibrary.dev/environments/toy_text/blackjack/).
```python  
import numpy as np

initialQ = np.array([
		[0.009, 0.192, 0.007, 0.009],  
		[0.003, 0.002, 0.003, 0.17],  
		[0.003, 0.002, 0.001, 0.067],  
		[0.001, 0.001, 0.002, 0.037],  
		[0.526, 0.002, 0.001, 0.002],  
		[0., 0., 0., 0.],  
		[0.046, 0., 0., 0.],  
		[0., 0., 0., 0.],  
		[0.002, 0.002, 0.002, 0.709],  
		[0.001, 0.597, 0.001, 0.001],  
		[0.945, 0., 0., 0.],  
		[0., 0., 0., 0.],  
		[0., 0., 0., 0.],  
		[0.02, 0.012, 0.898, 0.016],  
		[0.061, 0.991, 0.092, 0.068],  
		[0., 0., 0., 0.]  
	])  
```  
In this format `initialQ[state][action]` is the tentative value for
$Q$(`state`,`action`).
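
Here is a minimal sketch of a possible constructor, assuming a NumPy
Q-table indexed as `Q[state][action]` and starting from zeroes; the
attribute names are my own suggestions, not prescribed by the exercise:
```python
import numpy as np

class Agent:
    def __init__(self, env, learning_rate=0.1,
                 initial_epsilon=1.0, epsilon_decay=10**(-50000),
                 final_epsilon=0.1, discount_factor=0.95):
        self.env = env
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        # exploration schedule
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon
        # Q-table: one row per state, one column per action, all zeroes
        # (assign initialQ.copy() here instead to use Eirik's values)
        self.Q = np.zeros((env.observation_space.n, env.action_space.n))
```
The later sketches in this exercise assume these attribute names.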
  
## 2. Move generator
  
The move generator `get_action()` has to return a valid action,
that is, an integer in the range 0–3 for the Frozen Lake problem.
With probability $\epsilon$ you want to return a random action
(see last session for a code example), and with probability $1-\epsilon$
the action which maximises $Q$ according to the current estimate.

+ Implement `get_action()`.
+ Test the simulator.  It should work already at this stage.
+ Test it also with `initial_epsilon=0`, so that you can see what happens
  when the random moves are avoided altogether.

This move generator is known as the *Epsilon-Greedy Algorithm*:
it makes the greedy choice, except with probability $\epsilon$.
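
A minimal sketch of an epsilon-greedy `get_action()`, to be placed inside
the `Agent` class, assuming the attribute names from the constructor sketch
in step 1:
```python
import numpy as np

def get_action(self, obs):
    # explore: with probability epsilon, pick a random action
    if np.random.random() < self.epsilon:
        return self.env.action_space.sample()
    # exploit: otherwise pick the best action under the current Q estimate
    return int(np.argmax(self.Q[obs]))
```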
  

## 3. Diagnostic output

+ Add code to count the number of times you win the game
  (a sketch follows after this list).
+ Turn off the animation and increase the number of episodes.
+ Is the default strategy ever able to win the game?
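
One possible way to count wins, building on the training loop in the script
above; it relies on the fact that Frozen Lake only gives a reward of 1 when
the goal is reached, and `n_episodes` is just an example value:
```python
n_episodes = 10_000   # example value; feasible once the animation is off
wins = 0

for episode in range(n_episodes):
    obs, info = env.reset()
    done = False
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        agent.update(obs, action, reward, terminated, next_obs)
        done = terminated or truncated
        obs = next_obs
    if reward > 0:    # the final reward is 1 only if the goal was reached
        wins += 1
    agent.decay_epsilon()

print(f"Won {wins} of {n_episodes} episodes")
```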

## 4. Model updater

Now we need some way to update the Q-table.  
Q-learning is based on one very simple update rule:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(\left[
      R(s,a,s') + \gamma \max\limits_{a'}Q(s',a')\right] - Q(s,a)\right),$$
where $\alpha$ is the learning rate, which controls the speed of convergence.

+ Look at the correction term in the update rule
  (the parenthesis multiplied by $\alpha$).
  Are all the quantities known to the agent?
  What are the corresponding variables in Python?
+ Implement `update()` (a sketch follows after this list).
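
A minimal sketch of `update()` implementing the rule above, assuming the
attribute names from the constructor sketch; note that a terminated episode
has no successor value to bootstrap from:
```python
import numpy as np

def update(self, obs, action, reward, terminated, next_obs):
    # no future value when the episode ended in next_obs
    future_q = 0.0 if terminated else np.max(self.Q[next_obs])
    # temporal-difference error: target minus current estimate
    td_error = reward + self.discount_factor * future_q - self.Q[obs][action]
    # move the estimate a fraction (the learning rate) towards the target
    self.Q[obs][action] += self.learning_rate * td_error
```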
  
## 5. Epsilon decay

```python
self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
```

+ What does the above line do?
+ Do the attribute names match the ones you have used?
+ Implement `decay_epsilon()` (a method wrapper is sketched below).
+ Reflect: Why should $\epsilon$ decline during training?
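
The same line, wrapped as a method with the attribute names from the
earlier sketches:
```python
def decay_epsilon(self):
    # linear decay, never dropping below final_epsilon
    self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
```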

## 6. Testing

+ Test the system.  Add more diagnostic output as required.
    + Turn the animation off to be able to test realistically.
+ Try both Eirik's default Q-table and one initialised with zeroes only
  (see the snippet after this list).  Does this matter a lot?
+ Does the Q-table change a lot during training?
+ What happens when you change the training parameters (input to the Agent
  constructor)? 
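
For the all-zero variant, one option is the following, assuming discrete
observation and action spaces as in Frozen Lake:
```python
import numpy as np

# one row per state, one column per action, every estimate starting at zero
initialQ = np.zeros((env.observation_space.n, env.action_space.n))
```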
  
## 7.  The slippery ice

Change the environment to be slippery:
```python
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True)
```
  
+ Repeat the tests from the previous section.  What do you observe?

## 8.  Reflection

+ Reflect upon your solution.
+ What are the key elements of Q-learning?
+ Which design decisions are critical to make Q-learning work?
+ Is Q-learning an appropriate solution to the problem?

## 9.  Optional

Adapt your solution for other problems in Gymnasium, such as 
[Blackjack](https://www.gymlibrary.dev/environments/toy_text/blackjack/).
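
One detail that will likely need adapting: Blackjack observations are tuples
(player sum, dealer's showing card, usable ace) rather than single integers,
so a fixed NumPy array cannot be indexed by them directly.  A `defaultdict`
keyed by the observation, as in the tutorial, is one way around this
(a sketch, not the only option):
```python
from collections import defaultdict

import gymnasium as gym
import numpy as np

env = gym.make('Blackjack-v1')

# Q[obs] is created on first access as a zero vector with one entry per
# action, so any hashable observation (here a tuple) can serve as the key
Q = defaultdict(lambda: np.zeros(env.action_space.n))
```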