# Reinforcement Learning

## Changes from 4aade5fa86615f3fb780245310da53a93b0ac39f to baccf62c764978f0597ffc8b7604e1b7580af8c6

---
title: Reinformcement Learning Part 1
categories: session
---

**To be completed**

+ **Goal** Understand and be able to implement Q-learning
+ **Reading** Russel and Norvig Chapter 23
+ [Eirik's slides from 2022](Reinforcement Learning Slides 2.pdf)

# Exercises

Last session we implemented  the Q-Function:

Last session we discussedl  the Q-Function,
$$Q(s,a) = \sum_{s'}P(s'|s,a)[R(s,a,s') + \gamma \max_{a'}Q(s',a')]$$

and the function for the optimal policy based on the results from the Q-Function:

$\pi^*(s) = \mathop{\text{argmax }}\limits_aQ(s,a)$

We also had an assignment where we moved based on a list of calculated utilities.

In this assignment we will be implementing table based Q-learning, a model-free, off-policy reinforcement learning algorithm. For now, we want to just try to play using a list of calculated q-values instead of the utilities, you can use the following q-values:

$$\pi^*(s) = \mathop{\text{argmax }}\limits_aQ(s,a)$$
We also discussed iterative estimation of the utilities and the policies.
This session, we will implement an iterative estimation algorithm for
the Q-values, knowns as Q-learning.
This is a model-free, off-policy reinforcement learning algorithm.

The exercise outline below is based partly on Eirik's assigment in 2022
and partly on the Gymnasium
[tutorial on Blackjack](https://www.gymlibrary.dev/environments/toy_text/blackjack/).

Note that I have not asked you explicitly to output any diagnostics
on the way.  You almost certainly have to do this yourself, so that
you know what is going on.

The goal for this session is to implement an agent that can solve the
Frozen Lake problem as well as possible, using Q-learning.
The skeleton for the Agent will look like this:
python
def create_filled_q_table() -> np.array:
"""Creates a 'filled' q-table for the default frozen-lake environment.

Returns:
Filled q-table
"""
return np.array([
class Agent:
def __init__( self, env, learning_rate=0.1,
initial_epsilon=1.0, epsilon_decay=10**(-50000),
final_epsilon=0.1, discount_factor=0.95):
pass
def get_action(self, obs):
pass
def update( self, obs, action, reward, terminated, next_obs):
pass
def decay_epsilon(self):
pass

Thus we need four methods.  The most obvious ones are
the constructor, the move generator, and model updater.
The last method reduces $\epsilon$ which is the probability
of making a random move instead of the best move according to
the model.

In order to run the model, you can use the following script:
python
import matplotlib.pyplot as plt
from tqdm import tqdm
from Agent import Agent

import gymnasium as gym

env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True,render_mode="human")

done = False
observation, info = env.reset()

action = env.action_space.sample()
observation, reward, terminated, truncated, info = env.step(action)

agent = Agent( env )

for episode in range(10**5):
obs, info = env.reset()
done = False

# play one episode
while not done:
action = agent.get_action(obs)
next_obs, reward, terminated, truncated, info = env.step(action)

# update the agent
agent.update(obs, action, reward, terminated, next_obs)

# update if the environment is done and the current obs
done = terminated or truncated
obs = next_obs

agent.decay_epsilon()


### 1A Constructor

Implement the constructor.
You need to record all the parameters and initialise the Q-table.
You can use Eirik's initial Q-values below, or it is also possible
to use a defaultdict as does the
[Blackjack tutorial](https://www.gymlibrary.dev/environments/toy_text/blackjack/).
python
initalQ = np.array([
[0.009, 0.192, 0.007, 0.009],
[0.003, 0.002, 0.003, 0.17],
[0.003, 0.002, 0.001, 0.067],
[0.001, 0.001, 0.002, 0.037],
[0.526, 0.002, 0.001, 0.002],
[0., 0., 0., 0.],
[0.046, 0., 0., 0.],
[0., 0., 0., 0.],
[0.002, 0.002, 0.002, 0.709],
[0.001, 0.597, 0.001, 0.001],
[0.945, 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0.02, 0.012, 0.898, 0.016],
[0.061, 0.991, 0.092, 0.068],
[0., 0., 0., 0.]
])


### Part A

We will use the latter in this task, so if you did not already, ***make a python implementation of this policy***. You can use the following 'skeleton':

python
def optimal_policy(q_sa: np.array, state: int) -> int:
"""RL-policy for optimal play.

Args:
q_sa: q-table
state: current state

Returns:
optimal action for current state and q-table.
"""
...


### Part B

Below I have implemented two functions play and test_performance.

python
import numpy as np
import gym

from collections.abc import Callable

def play(env: gym.Env, q_sa: np.array, policy: Callable, m_ep_length: int = 100, render: bool = True) -> None:
"""Plays one episode of environment env using an optimal policy from the q-table q_sa.

Args:
env: Gym environment
q_sa: q-table
policy: the policy function to play with
m_ep_length: max episode length, default; 100, should be set higher for large environments
render: render environment ?

Returns:
None
"""
state, info = env.reset(return_info=True)
if render:
env.render()
j = 0
while j < m_ep_length:
j += 1
action = policy(q_sa, state) # TODO: Implement this function first!
new_state, reward, done, info = env.step(action)
state = new_state
if render:
env.render()
if done:
break

def test_performance(env: gym.Env, q_sa: np.array, policy: Callable, n_episodes: int = 1000, m_ep_length: int = 100) -> float:
"""

Args:
env: Gym environment
q_sa: q-table
policy: the policy function to play with
n_episodes: number of episodes
m_ep_length: max episode length

Returns:
average reward
"""
rewards = 0
for i in range(n_episodes):
state, info = env.reset(return_info=True)
reward = 0
j = 0
while j < m_ep_length:
j += 1
action = policy(q_sa, state) # TODO: Implement
new_state, reward, done, info = env.step(action)
state = new_state
if done:
break
rewards += reward
return rewards / n_episodes
In this format initialQ[state][action] is the tenative value for
$Q$(state,action).

### 2B. Move generator

if __name__ == "__main__":
environment = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True)
q_table = create_filled_q_table() # TODO: Implement
play(environment, q_table, optimal_policy) # TODO: Implement policy

The move generator get_action() has to return a valid action,
that is an integer in the 0--3 range for the Frozen Lake problem.
With probability $\epsilon$ you want to return a random action
(see last session for code example), and with probability $1-\epsilon$,
the action which maximises $Q$ according to the current estimate.

+ Implement get_action().

Using either this implementation, or your own:
### 3C. Model updater

Now we need some way to update the Q-table.
Q-learning is based on one very simple update rule:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(\left[ R(s,a,s') + \gamma \max\limits_{a'}Q(s',a')\right] - Q(s,a)\right),$$
where $\alpha$ is the learning rate, which controls the speed of convergence.

+ Implement update()

- ***Using your optimal policy from part 1, play at least one episode***
-- Is it playing "optimally"?
- ***Calculate the performance of the policy using the test_performance function.***
-- What does the number here mean?
### 3D. Epislon decay

python
self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)


+ What does the above line do?
+ Do the attribute name match the ones you have used?
+ Implement decay_epsilon().

In this task we will create functions to update our own q-table, for now you can turn make the environment deterministic by turning of the 'slippery' argument when making the environment:

python
environment = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False)


### Part A

First we need to create an empty q-table, as the gym framework supports frozen-lake environment of different sizes, we need to initialize it with the size "state_space x action_space".
You can get them from:
python
n_states = env.observation_space.n # state-space
n_actions = env.action_space.n


**Create a function to initialize a q-table**

You can use the following 'skeleton':

python
def initialize_q_table(env: gym.Env) -> np.array:
"""Creates and returns an empty q-table of size state_space x action_space.

Args:
env: Gym environment

Returns:
np.array of q-table of size state_space x action_space
"""
...


### Part B

Now we need some way to update our q-table.
Recall the Q Temporal-Difference function to update the q-values:

$Q(s,a) \leftarrow Q(s,a) + \alpha[R(s,a,s') + \gamma \max\limits_{a'}Q(s',a') - Q(s,a)]$

**Implement a function to calculate the value to be updated (the part on the right side of the arrow)**

You can use the following 'skeleton':

python
def q_temporal_difference(q_sa: np.array, action: int, reward: float, start_state: int, end_state: int, alpha: float = 0.85, gamma: float = 0.98) -> float:
"""Calculates the q-update.

Args:
q_sa: q-table
action: action we are taking
reward: result of R(s,a,s')
start_state: start state
end_state: end state (after taking action a)
alpha: learning-rate
gamma: discount

Returns:
q-td update value
"""
...


### Part C

Using the functions we implemented above, we want to update the q-table by simply having the agent play the game a lot.

***Implement a q-learning function***, you can use the following 'skeleton' and/or take inspiration from the test_performance implementation.

python
def q_learning(env: gym.Env, policy: Callable, n_episodes: int = 10000, m_ep_length: int = 100) -> np.array:
"""q-learning implementation to update a q-table.

Args:
env: gym environment
policy: policy function
n_episodes: number of episodes to train on
m_ep_length: maximum episode length

Returns:
updated q-table
"""
...


Note that your agent might behave odd (or not work at all), if you use your optimal policy on an empty q-table, so you may want to edit it to take a random action if it has issues differentiating between actions.

### Part D

- Use the updated q-table with the play function and the test_performance function.
- Print out the q-table

Are there any potential problems?
What if you train it again with slippery=True ?

(From now on, we will play with a stochastic environment, set slippery=True).

We need some way to encourage exploration, to prevent the agent from only trying to repeat the first sequence that got him to the goal.

There are multiple ways to implement this;

1. We can set a static epsilon $\epsilon$ value, and set the action to some random action a if some random number n is below $\epsilon$.
2. We usually want to encourage exploration in earlier training phases, and encourage exploitation in the later ones. We can therefore use a similar approach to 1, but with the addition of decaying $\epsilon$ over time.
3. The third option (non-exhaustive) is to create a policy that picks an action based on a weighted probability-distribution created based on the q-values. The weighting can then change over time to encourage exploration early, and exploitation later. A modified version of this function could also be used when 'playing' the game, if you want a policy that not necessarily always picks the option with the highest utility.

### Part A

***Implement \_one\_ of the functions above***, you can use the following 'skeleton':

python

def epsilon_policy(q_sa: np.array, state: int, env: gym.Env, eps: float = 0.2) -> int:
"""RL epsilon-greedy policy.

Args:
q_sa: q-table
state: current state
env: gym environment
eps: epsilon

Returns:
action with a 1-eps chance of being exploitation, eps chance of being exploration
"""
...



### Part B

- Use the updated q-table with the play function and the test_performance function.
- Print out the q-table

### Part A

***Modify your q-learning algorithm to call test_performance every n-episode. Save this in a table and plot the result using matplotlib.***

We now have a way to calculate the performance over time/training episodes.

### Part B

***Experiment with the different hyperparameters (epsilon, learning-rate, gamma, etc) and compare them using the method in part A.***
You can also try with other versions of the frozen-lake environment (e.g. the 8x8 map), they have a function to create random maps.