---
title: Reinforcement Learning Part 1
categories: session
---

**To be completed**

+ **Goal** Understand and be able to implement Q-learning
+ **Reading** Russell and Norvig Chapter 23
+ [Eirik's slides from 2022](Reinforcement Learning Slides 2.pdf)

# Exercises

  
Last session we discussed the Q-function,
$$Q(s,a) = \sum_{s'}P(s'|s,a)[R(s,a,s') + \gamma \max_{a'}Q(s',a')]$$
and the optimal policy defined in terms of the Q-function:  
$$\pi^*(s) = \mathop{\text{argmax }}\limits_aQ(s,a)$$
We also discussed iterative estimation of the utilities and the policies.
This session, we will implement an iterative estimation algorithm for
the Q-values, known as Q-learning.
This is a model-free, off-policy reinforcement learning algorithm.

The exercise outline below is based partly on Eirik's assignment in 2022
and partly on the Gymnasium 
[tutorial on Blackjack](https://www.gymlibrary.dev/environments/toy_text/blackjack/).

Note that I have not explicitly asked you to output any diagnostics 
along the way.  You will almost certainly have to do this yourself, so that 
you know what is going on.

## Goal and overview

The goal for this session is to implement an agent that can solve the
Frozen Lake problem as well as possible, using Q-learning.
The skeleton for the Agent will look like this:
```python  
class Agent:
    def __init__( self, env, learning_rate=0.1,
        initial_epsilon=1.0, epsilon_decay=1.0/50000,
        final_epsilon=0.1, discount_factor=0.95):
        pass
    def get_action(self, obs):
        pass
    def update( self, obs, action, reward, terminated, next_obs):
        pass
    def decay_epsilon(self):
        pass
```
Thus we need four methods.  The most obvious ones are 
the constructor, the move generator, and the model updater.
The last method reduces $\epsilon$, which is the probability
of making a random move instead of the best move according to
the model.

In order to run the model, you can use the following script:
```python
import matplotlib.pyplot as plt
from tqdm import tqdm
from Agent import Agent

import gymnasium as gym

env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False, render_mode="human")

done = False
observation, info = env.reset()

action = env.action_space.sample()
observation, reward, terminated, truncated, info = env.step(action)

agent = Agent( env )

for episode in range(30):
    obs, info = env.reset()
    done = False

    # play one episode
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)

        # update the agent
        agent.update(obs, action, reward, terminated, next_obs)

        # update the done flag and move on to the next state
        done = terminated or truncated
        obs = next_obs

    agent.decay_epsilon()
```

**Note**
We have set `is_slippery=False` above.
That's useful for the initial testing;
we will change it to `True` later.

**Note 2**
We use 30 episodes.  This is ridiculously little, but the animation
is slow, and we need to be able to run it several times in testing.

**Note 3**
You can turn off the animation by dropping the `render_mode` argument
(or setting `render_mode=None`).  This is a lot faster, but you will need
some other way to see what is going on. 

## 1. Constructor

Implement the constructor. 
You need to record all the parameters and initialise the Q-table.
You can use Eirik's initial Q-values below, or you can use a
`defaultdict` as in the 
[Blackjack tutorial](https://www.gymlibrary.dev/environments/toy_text/blackjack/).
```python  
import numpy as np

initialQ = np.array([  
		[0.009, 0.192, 0.007, 0.009],  
		[0.003, 0.002, 0.003, 0.17],  
		[0.003, 0.002, 0.001, 0.067],  
		[0.001, 0.001, 0.002, 0.037],  
		[0.526, 0.002, 0.001, 0.002],  
		[0., 0., 0., 0.],  
		[0.046, 0., 0., 0.],  
		[0., 0., 0., 0.],  
		[0.002, 0.002, 0.002, 0.709],  
		[0.001, 0.597, 0.001, 0.001],  
		[0.945, 0., 0., 0.],  
		[0., 0., 0., 0.],  
		[0., 0., 0., 0.],  
		[0.02, 0.012, 0.898, 0.016],  
		[0.061, 0.991, 0.092, 0.068],  
		[0., 0., 0., 0.]  
	])  
```  
In this format `initialQ[state][action]` is the tentative value for 
$Q$(`state`,`action`).
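
For illustration, a minimal constructor along these lines might look as
follows.  It assumes a zero-initialised NumPy Q-table and particular
attribute names (`epsilon`, `epsilon_decay`, `final_epsilon`, `q_values`),
which are choices of this sketch rather than requirements; substitute
`initialQ` above if you prefer Eirik's starting values.
```python
import numpy as np

class Agent:
    def __init__(self, env, learning_rate=0.1,
                 initial_epsilon=1.0, epsilon_decay=1.0/50000,
                 final_epsilon=0.1, discount_factor=0.95):
        self.env = env
        self.lr = learning_rate
        self.discount_factor = discount_factor
        # Exploration parameters for the epsilon-greedy policy.
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon
        # Q-table: one row per state, one column per action.
        self.q_values = np.zeros((env.observation_space.n, env.action_space.n))
```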
  
## 2. Move generator
  
The move generator `get_action()` has to return a valid action,
that is, an integer in the range 0--3 for the Frozen Lake problem.
With probability $\epsilon$ you want to return a random action
(see last session for a code example), and with probability $1-\epsilon$,
the action which maximises $Q$ according to the current estimate.
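
As an illustration, an epsilon-greedy move generator might look roughly
like this, assuming the `self.epsilon` and `self.q_values` attributes
from the constructor sketch above:
```python
import numpy as np

def get_action(self, obs):
    """Return a random action with probability epsilon, the greedy action otherwise."""
    if np.random.random() < self.epsilon:
        # Explore: pick a random valid action.
        return self.env.action_space.sample()
    # Exploit: pick the action with the highest estimated Q-value.
    return int(np.argmax(self.q_values[obs]))
```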

+ Implement `get_action()`.
+ Test the simulator.  It should work already at this stage.
  

## 3. Diagnostic output

+ Add code to count the number of times you win the game,
  for instance as sketched below.
+ Turn off the animation and increase the number of episodes.
+ Is the default strategy ever able to win the game?
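
One possible way to count wins, shown as a variation of the training loop
above and assuming the `env` and `agent` objects from that script.  In
Frozen Lake the only non-zero reward is the +1 given for reaching the goal,
so the reward at the end of an episode tells us whether it was won:
```python
wins = 0
n_episodes = 1000            # use more episodes once the animation is off

for episode in range(n_episodes):
    obs, info = env.reset()
    done = False
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        agent.update(obs, action, reward, terminated, next_obs)
        done = terminated or truncated
        obs = next_obs
    agent.decay_epsilon()
    if reward > 0:           # the final reward is 1 only if we reached the goal
        wins += 1

print(f"Won {wins} out of {n_episodes} episodes")
```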

## 4. Model updater

Now we need some way to update the Q-table.  
Q-learning is based on one very simple update rule:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(\left[
      R(s,a,s') + \gamma \max\limits_{a'}Q(s',a')\right] - Q(s,a)\right),$$
where $\alpha$ is the learning rate, which controls the speed of convergence.

+ Implement `update()`
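
A direct transcription of the update rule into code might look like this.
It assumes the attribute names from the constructor sketch above, and it
drops the future term when the episode has terminated, since there is no
next state to bootstrap from:
```python
import numpy as np

def update(self, obs, action, reward, terminated, next_obs):
    # Value of the best action in the next state; zero if the episode ended there.
    future_q = 0.0 if terminated else np.max(self.q_values[next_obs])
    # Temporal-difference error: target minus current estimate.
    td_error = reward + self.discount_factor * future_q - self.q_values[obs][action]
    # Move the estimate a fraction alpha (the learning rate) towards the target.
    self.q_values[obs][action] += self.lr * td_error
```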
  
## 5. Epsilon decay

```python
self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
```

+ What does the above line do?  
+ Do the attribute names match the ones you have used?
+ Implement `decay_epsilon()`; a minimal version is sketched below.
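
Wrapped in a method, and assuming the attribute names from the constructor
sketch above, this could be as simple as:
```python
def decay_epsilon(self):
    # Reduce epsilon linearly after each episode, but never below final_epsilon.
    self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
```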
  
## 6. Testing

+ Test the system.  Add more diagnostic output as required.
    + Turn the animation off to be able to test realistically.
+ Try both Eirik's default Q-table and one initialised with zeroes only.
  Does this matter a lot?
+ Does the Q-table change a lot during training?
+ What happens when you change the training parameters (input to the Agent
  constructor)? 
  
## 7. The slippery ice

Change the environment to be slippery:
```python  
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True)  
```  
  
+ Repeat the tests from the testing section above.  What do you observe?

## 8. Reflection

+ Reflect upon your solution.
+ What are the key elements of Q-learning?
+ Which design decisions are critical to make Q-learning work?
+ Is Q-learning an appropriate solution to the problem?

## 9. Optional

Adapt your solution for other problems in Gymnasium, such as 
[Blackjack](https://www.gymlibrary.dev/environments/toy_text/blackjack/).
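
Note that Blackjack observations are tuples rather than single integers,
so the fixed-size NumPy Q-table no longer fits directly.  One possible
adaptation, in the spirit of the Blackjack tutorial, is a dictionary-based
Q-table (the environment name `Blackjack-v1` is the one used by Gymnasium):
```python
from collections import defaultdict

import gymnasium as gym
import numpy as np

env = gym.make("Blackjack-v1")

# Observations are tuples (player sum, dealer card, usable ace), so index
# the Q-table by observation instead of allocating a fixed array of states.
q_values = defaultdict(lambda: np.zeros(env.action_space.n))

obs, info = env.reset()
print(obs, q_values[obs])    # the row for a new observation starts at zero
```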