Revision 2071c852751535b6df1d28671a9dbb31a4fb600d (click the page title to view the current version)

Reinforcement Learning

Changes from 2071c852751535b6df1d28671a9dbb31a4fb600d to 4aade5fa86615f3fb780245310da53a93b0ac39f

---
title: Reinformcement Learning Part 1
categories: session
---

**To be completed**

+ [Reinforcement Learning Slides 2.pdf]()
+ **Goal** Understand and be able to implement Q-learning
+ **Reading** Russel and Norvig Chapter 23
+ [Eirik's slides from 2022](Reinforcement Learning Slides 2.pdf)

# Exercises

## Task 1  
  
Last week we implemented  the Q-Function:  
Last session we implemented  the Q-Function:  
  
$$Q(s,a) = \sum_{s'}P(s'|s,a)[R(s,a,s') + \gamma \max_{a'}Q(s',a')]$$
  
and the function for the optimal policy based on the results from the Q-Function:  
  
$\pi^*(s) = \mathop{\text{argmax }}\limits_aQ(s,a)$  
  
We also had an assignment where we moved based on a list of calculated utilities.  
  
In this assignment we will be implementing table based Q-learning, a model-free, off-policy reinforcement learning algorithm. For now, we want to just try to play using a list of calculated q-values instead of the utilities, you can use the following q-values:  
  
```python  
def create_filled_q_table() -> np.array:  
	"""Creates a 'filled' q-table for the default frozen-lake environment.  
  
	Returns:  
		Filled q-table  
	"""  
	return np.array([  
		[0.009, 0.192, 0.007, 0.009],  
		[0.003, 0.002, 0.003, 0.17],  
		[0.003, 0.002, 0.001, 0.067],  
		[0.001, 0.001, 0.002, 0.037],  
		[0.526, 0.002, 0.001, 0.002],  
		[0., 0., 0., 0.],  
		[0.046, 0., 0., 0.],  
		[0., 0., 0., 0.],  
		[0.002, 0.002, 0.002, 0.709],  
		[0.001, 0.597, 0.001, 0.001],  
		[0.945, 0., 0., 0.],  
		[0., 0., 0., 0.],  
		[0., 0., 0., 0.],  
		[0.02, 0.012, 0.898, 0.016],  
		[0.061, 0.991, 0.092, 0.068],  
		[0., 0., 0., 0.]  
	])  
```  
  
### Part A  
  
We will use the latter in this task, so if you did not already, ***make a python implementation of this policy***. You can use the following 'skeleton':  
  
```python  
def optimal_policy(q_sa: np.array, state: int) -> int:  
	"""RL-policy for optimal play.  
  
	Args:  
		q_sa: q-table  
		state: current state  
  
	Returns:  
		optimal action for current state and q-table.  
	"""  
	...  
```  
  
### Part B  
  
Below I have implemented two functions ```play``` and ```test_performance```.  
  
```python  
import numpy as np  
import gym  
  
from collections.abc import Callable  
  
def play(env: gym.Env, q_sa: np.array, policy: Callable, m_ep_length: int = 100, render: bool = True) -> None:  
	"""Plays one episode of environment env using an optimal policy from the q-table q_sa.  
  
	Args:  
		env: Gym environment  
		q_sa: q-table  
		policy: the policy function to play with  
		m_ep_length: max episode length, default; 100, should be set higher for large environments  
		render: render environment ?  
  
	Returns:  
		None  
	"""  
	state, info = env.reset(return_info=True)  
	if render:  
		env.render()  
	j = 0  
	while j < m_ep_length:  
		j += 1  
		action = policy(q_sa, state) # TODO: Implement this function first!  
		new_state, reward, done, info = env.step(action)  
		state = new_state  
		if render:  
			env.render()  
		if done:  
			break  
  
  
def test_performance(env: gym.Env, q_sa: np.array, policy: Callable, n_episodes: int = 1000, m_ep_length: int = 100) -> float:  
	"""  
  
	Args:  
		env: Gym environment  
		q_sa: q-table  
		policy: the policy function to play with  
		n_episodes: number of episodes  
		m_ep_length: max episode length  
  
	Returns:  
		average reward  
	"""  
	rewards = 0  
	for i in range(n_episodes):  
		state, info = env.reset(return_info=True)  
		reward = 0  
		j = 0  
		while j < m_ep_length:  
			j += 1  
			action = policy(q_sa, state) # TODO: Implement  
			new_state, reward, done, info = env.step(action)  
			state = new_state  
			if done:  
				break  
		rewards += reward  
	return rewards / n_episodes  
  
  
if __name__ == "__main__":  
	environment = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True)  
	q_table = create_filled_q_table() # TODO: Implement  
	play(environment, q_table, optimal_policy) # TODO: Implement policy  
```  
  
Using either this implementation, or your own:  
  
- ***Using your optimal policy from part 1, play at least one episode***  
-- Is it playing "optimally"?  
- ***Calculate the performance of the policy using the ```test_performance``` function.***  
-- What does the number here mean?  
  
## Task 2  
  
In this task we will create functions to update our own q-table, for now you can turn make the environment deterministic by turning of the 'slippery' argument when making the environment:  
  
```python  
environment = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False)  
```  
  
### Part A  
  
First we need to create an empty q-table, as the gym framework supports frozen-lake environment of different sizes, we need to initialize it with the size "state_space x action_space".  
You can get them from:  
```python  
n_states = env.observation_space.n # state-space  
n_actions = env.action_space.n  
```  
  
**Create a function to initialize a q-table**  
  
You can use the following 'skeleton':  
  
```python  
def initialize_q_table(env: gym.Env) -> np.array:  
	"""Creates and returns an empty q-table of size state_space x action_space.  
  
	Args:  
		env: Gym environment  

	Returns:  
		np.array of q-table of size state_space x action_space  
	"""  
	...  
```  
  
### Part B  
  
Now we need some way to update our q-table.  
Recall the Q Temporal-Difference function to update the q-values:  
  
$Q(s,a) \leftarrow Q(s,a) + \alpha[R(s,a,s') + \gamma \max\limits_{a'}Q(s',a') - Q(s,a)]$  
  
**Implement a function to calculate the value to be updated (the part on the right side of the arrow)**  
  
You can use the following 'skeleton':  
  
```python  
def q_temporal_difference(q_sa: np.array, action: int, reward: float, start_state: int, end_state: int, alpha: float = 0.85, gamma: float = 0.98) -> float:  
	"""Calculates the q-update.
  
	Args:  
		q_sa: q-table  
		action: action we are taking  
		reward: result of R(s,a,s')  
		start_state: start state  
		end_state: end state (after taking action a)  
		alpha: learning-rate
		gamma: discount  

	Returns:  
		q-td update value  
	"""  
	...  
```  
  
### Part C  
  
Using the functions we implemented above, we want to update the q-table by simply having the agent play the game a lot.  
  
***Implement a q-learning function***, you can use the following 'skeleton' and/or take inspiration from the ```test_performance``` implementation.  
  
```python  
def q_learning(env: gym.Env, policy: Callable, n_episodes: int = 10000, m_ep_length: int = 100) -> np.array:  
	"""q-learning implementation to update a q-table.  
  
	Args:  
		env: gym environment  
		policy: policy function  
		n_episodes: number of episodes to train on  
		m_ep_length: maximum episode length  
  
	Returns:  
		updated q-table  
	"""  
	...  
```  
  
Note that your agent might behave odd (or not work at all), if you use your optimal policy on an empty q-table, so you may want to edit it to take a random action if it has issues differentiating between actions.  
  
### Part D  
  
***Try out your algorithm***;  
- Use the updated q-table with the ```play``` function and the ```test_performance``` function.  
- Print out the q-table  
  
Are there any potential problems?  
What if you train it again with ```slippery=True``` ?  
  
## Task 3  
  
(From now on, we will play with a stochastic environment, set ```slippery=True```).  
  
We need some way to encourage exploration, to prevent the agent from only trying to repeat the first sequence that got him to the goal.  
  
There are multiple ways to implement this;  
  
1. We can set a static epsilon $\epsilon$ value, and set the action to some random action a if some random number n is below $\epsilon$.  
2. We usually want to encourage exploration in earlier training phases, and encourage exploitation in the later ones. We can therefore use a similar approach to 1, but with the addition of decaying $\epsilon$ over time.  
3. The third option (non-exhaustive) is to create a policy that picks an action based on a weighted probability-distribution created based on the q-values. The weighting can then change over time to encourage exploration early, and exploitation later. A modified version of this function could also be used when 'playing' the game, if you want a policy that not necessarily always picks the option with the highest utility.  
  
### Part A  
  
***Implement \_one\_ of the functions above***, you can use the following 'skeleton':  
  
```python  
  
def epsilon_policy(q_sa: np.array, state: int, env: gym.Env, eps: float = 0.2) -> int:  
	"""RL epsilon-greedy policy.  
	Policy for exploration/exploitation tradeoff.  
  
	Args:  
		q_sa: q-table  
		state: current state  
		env: gym environment  
		eps: epsilon  
  
	Returns:  
		action with a 1-eps chance of being exploitation, eps chance of being exploration  
	"""  
	...  
  
```  
  
### Part B  
  
***Try out your algorithm***;  
- Use the updated q-table with the ```play``` function and the ```test_performance``` function.  
- Print out the q-table  
  
## Task 4  
  
### Part A  
  
***Modify your q-learning algorithm to call ```test_performance``` every n-episode. Save this in a table and plot the result using matplotlib.***  
  
We now have a way to calculate the performance over time/training episodes.  
  
### Part B  
  
***Experiment with the different hyperparameters (epsilon, learning-rate, gamma, etc) and compare them using the method in part A.***  
You can also try with other versions of the frozen-lake environment (e.g. the 8x8 map), they have a function to create random maps.  
  
  
## Task 5 (Extra)  
  
Please inform me if you get to this point early, as I might change the task, but for now:  
  
***Implement SARSA, and repeat similar experiments from task 4 to compare the two.***