Reinforcement Learning 1

title: Reinforcement Learning
categories: session

+ **Reading:** Russell and Norvig, Chapter 16.1 and Chapter 23.1
+ [A video demo](
+ Eirik's lecture notes from 2022
    + [Slides (PDF)](Reinforcement Learning Slides 1.pdf)
    + [Notes (PDF)](Reinforcement Learning Notes 1.pdf)

# Exercises

## Task 1

Recall the requirements to model a problem with the MDP framework:

- Sequential decision problem
- Stochastic environment
- Fully observable
- Markovian transition model
- Additive rewards

And the properties:

- A set of states $s \in S$
- A set of actions $a \in A$
- A transition function $T(s,a,s')$
- A reward function $R(s,a,s')$
- A start state $s_0 \in S$
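
As a minimal sketch, the five properties above can be held in one small Python structure. The class name `MDP`, its field names and the two-state toy problem are illustrative assumptions, not something taken from the book or the slides.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list        # S
    actions: list       # A
    transitions: dict   # (s, a) -> {s': probability}
    rewards: dict       # (s, a, s') -> reward
    start: str          # the start state

    def T(self, s, a, s_next):
        """Transition probability T(s, a, s'); 0 if the move is not listed."""
        return self.transitions.get((s, a), {}).get(s_next, 0.0)

    def R(self, s, a, s_next):
        """Reward R(s, a, s'); 0 if none is listed."""
        return self.rewards.get((s, a, s_next), 0.0)

    def is_valid(self):
        """Every listed (s, a) pair must define a probability distribution."""
        return all(abs(sum(dist.values()) - 1.0) < 1e-9
                   for dist in self.transitions.values())


# Hypothetical toy problem: "move" succeeds with probability 0.9.
toy = MDP(
    states=["A", "B"],
    actions=["stay", "move"],
    transitions={
        ("A", "stay"): {"A": 1.0},
        ("A", "move"): {"B": 0.9, "A": 0.1},
        ("B", "stay"): {"B": 1.0},
        ("B", "move"): {"B": 1.0},
    },
    rewards={("A", "move", "B"): 1.0},
    start="A",
)

print(toy.is_valid(), toy.T("A", "move", "B"), toy.R("A", "move", "B"))
```

Writing a problem down in this form is a quick way to check the requirements: if you cannot fill in `transitions` without knowing earlier states, the model is not Markovian.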

### Part A

Find a problem on CodinGame (preferably one you have worked on), and check if it fulfills the requirements above.
If not, can you think of how you can change the problem (e.g. by adding randomness to actions)? If this is not possible, try with another problem.

### Part B

Using the properties of an MDP:

Can you make a graphical representation of the modified problem from Part A? You can choose a subset of the states (and actions if necessary) to reduce the size of the representation. Use either the Dynamic Decision Network from the book (ch 16.1.4), or a simple representation as was done on the slides. (E.g. from [here]( or [here]( )

### Part C

- Does the problem have a finite or infinite horizon?
- If you were to attempt to solve the MDP, could the current horizon pose a problem? Why or why not?

### Part D

- Does the problem have a discounted reward?
- If you were to attempt to solve the MDP, what discount factor would make sense to use for the utility function?
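
To get a feeling for what a discount factor does, the following sketch computes the discounted sum of a reward sequence for a few values of $\gamma$. The function name `discounted_return` and the example episode are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence r_0, r_1, ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Hypothetical episode: a single reward of 1 that arrives on the tenth step.
late_reward = [0.0] * 9 + [1.0]

for gamma in (1.0, 0.99, 0.9, 0.5):
    print(gamma, round(discounted_return(late_reward, gamma), 4))
```

A small $\gamma$ makes the agent nearly indifferent to rewards that lie many steps ahead, while $\gamma = 1$ weighs them all equally, which is only safe when every episode is guaranteed to end.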

## Exploring the Frozen Lake (Task 2)

We will be using the Gymnasium framework (formerly OpenAI Gym) to test concepts and ideas
from reinforcement learning, both today and in the next two sessions, so you should install it
and familiarize yourself with it today to make sure that everything works.  You may want to
consult the documentation, but you should try playing with it first.

+ [Gymnasium](
  formerly known as Gym from OpenAI
+ [Frozen Lake](

Gymnasium can be installed with pip.  (pygame is only needed for graphical rendering, e.g. with render_mode="human".)
pip install gymnasium
pip install pygame

Check the installed version either with
pip freeze

or from Python:
import gymnasium
print(gymnasium.__version__)

It should be newer than 0.23.0.

### Part A

**Familiarize yourself with the FrozenLake environment**
You can import and start the simulation of the Frozen Lake like so:
import gymnasium as gym
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True, render_mode="human")
obs, info = env.reset()

You should now hopefully see a render of the environment. With render_mode="human" the window updates automatically whenever you call env.reset() or env.step().

**Try out some of these functions and see what they do:**
observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.unwrapped.P # The MDP
env.unwrapped.P[s] # Transitions from state s, indexed by action
env.unwrapped.P[s][a] # Transitions from state s given action a
More information on standard actions can be found [here](
More information on the FrozenLake environment can be found [here](
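
The transition lists can be hard to read at first. The sketch below collapses one such list into a distribution over next states. It assumes that each entry is a tuple (probability, next state, reward, terminated), and the hand-written `example` list is what you would expect for state 0 and action 0 (left) on the slippery 4x4 map. Both are assumptions that you should compare against your own environment.

```python
from collections import defaultdict


def next_state_distribution(transitions):
    """Add up the probabilities of all entries that lead to the same state."""
    dist = defaultdict(float)
    for prob, next_state, reward, terminated in transitions:
        dist[next_state] += prob
    return dict(dist)


# Hand-written stand-in for one transition list (assumed, not read from Gymnasium):
# slipping up or moving left both bump into the wall and stay in state 0,
# slipping down leads to state 4.
example = [(1 / 3, 0, 0.0, False), (1 / 3, 0, 0.0, False), (1 / 3, 4, 0.0, False)]

print(next_state_distribution(example))
```

With an environment at hand you would pass in one of its own transition lists instead of `example`.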

**Try to make a custom Frozen Lake map**

When creating a Frozen Lake environment you can pass a custom map with the ```desc``` argument, e.g.:
fl_map = ["HFFFS", "FHHFF", "FFFFH", "HFFHG"]
env = gym.make('FrozenLake-v1', desc=fl_map, is_slippery=True)
This gives a custom map with 4 rows and 5 columns.
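
Before training on a custom map it is worth checking that the goal can be reached at all. The following is a minimal sketch of such a check using breadth-first search over the non-hole tiles. The function name `goal_reachable` and the variant map are assumptions for illustration. In the example map above, the goal in the bottom-right corner has a hole both above it and to its left, so the check returns False for it; turning one of those two tiles into F makes the goal reachable.

```python
from collections import deque


def goal_reachable(desc):
    """Breadth-first search from S to G that never steps onto an H tile."""
    rows, cols = len(desc), len(desc[0])
    start = next((r, c) for r in range(rows) for c in range(cols)
                 if desc[r][c] == "S")
    queue, seen = deque([start]), {start}
    while queue:
        r, c = queue.popleft()
        if desc[r][c] == "G":
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen and desc[nr][nc] != "H"):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


example_map = ["HFFFS", "FHHFF", "FFFFH", "HFFHG"]   # the map from the text
variant_map = ["HFFFS", "FHHFF", "FFFFF", "HFFHG"]   # hypothetical fix: one H -> F

print(goal_reachable(example_map), goal_reachable(variant_map))
```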

## Task 3

Recall the value/utility-function:

$$U(s) = \mathop\max\limits_{a \in A(s)} \sum\limits_{s'}P(s'|s,a)[R(s,a,s') + \gamma U(s')]$$

The Q-Function:

$$Q(s,a) = \sum\limits_{s'}P(s'|s,a)[R(s,a,s') + \gamma \mathop\max\limits_{a'}Q(s',a')]$$

And the function to extract an optimal policy from the Q-Function:

$$\pi^*(s) = \mathop{\mathrm{argmax}}\limits_aQ(s,a)$$
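
One possible way to turn the three formulas into code is sketched below on a tiny hand-made model. The model format `T[s][a] = [(probability, s_next, reward), ...]`, the function names and the two-state example are assumptions, and the Q-function is computed from $U$ by using $U(s') = \max_{a'} Q(s',a')$. Treat it as a starting point for Part A, not as a reference solution.

```python
# Hypothetical model: T[s][a] is a list of (probability, s_next, reward).
T = {
    0: {0: [(1.0, 0, 0.0)],                       # action 0: stay in state 0
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},       # action 1: usually reach state 1
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 1, 0.0)]},  # state 1 is absorbing
}


def q_value(T, U, s, a, gamma):
    """Q(s, a) = sum over s' of P(s'|s,a) * [R(s,a,s') + gamma * U(s')]."""
    return sum(p * (r + gamma * U[s2]) for p, s2, r in T[s][a])


def bellman_backup(T, U, gamma):
    """One sweep of U(s) = max over a of Q(s, a), for every state."""
    return {s: max(q_value(T, U, s, a, gamma) for a in T[s]) for s in T}


def greedy_policy(T, U, gamma):
    """pi(s) = argmax over a of Q(s, a)."""
    return {s: max(T[s], key=lambda a: q_value(T, U, s, a, gamma)) for s in T}


# Repeating the backup from U = 0 is value iteration.
U = {s: 0.0 for s in T}
for _ in range(100):
    U = bellman_backup(T, U, gamma=0.5)

policy = greedy_policy(T, U, gamma=0.5)
print(U, policy)
```

For this toy model the fixed point can be checked by hand: taking action 1 in state 0 gives $U(0) = 0.8 + 0.2 \cdot 0.5 \cdot U(0)$, so $U(0) = 8/9$.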

### Part A

Implement the above functions in Python.

### Part B

Given a FrozenLake map and a list of pre-calculated expected utilities (here for the default FrozenLake 4x4 map):
utilities = [0.41,0.38,0.35,0.34,0.43,0,0.12,0,0.45,0.48,0.43,0,0,0.59,0.71,1]

- Test out the utility function, and check whether it works as it should and gives sensible values.
- Use the function to extract an optimal policy to move around the map (discount factor can be e.g. 0.99).
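
The following sketch shows how the policy extraction could look without a running environment. It rebuilds the slippery 4x4 dynamics by hand under stated assumptions: the intended direction and the two perpendicular directions each occur with probability 1/3, entering G pays reward 1, and holes and the goal are terminal. Terminal successors are not bootstrapped, so the utility 1 listed for the goal is not added on top of the reward. The helper names are hypothetical. With Gymnasium installed you would use `env.unwrapped.P` in place of `build_model`.

```python
MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]        # default 4x4 layout
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]    # 0=left, 1=down, 2=right, 3=up
utilities = [0.41, 0.38, 0.35, 0.34, 0.43, 0, 0.12, 0,
             0.45, 0.48, 0.43, 0, 0, 0.59, 0.71, 1]


def build_model(desc):
    """P[s][a] = list of (probability, next_state, reward, terminated)."""
    n_rows, n_cols = len(desc), len(desc[0])
    P = {}
    for row in range(n_rows):
        for col in range(n_cols):
            s = row * n_cols + col
            P[s] = {}
            for a in range(4):
                if desc[row][col] in "GH":        # terminal tile: nothing happens
                    P[s][a] = [(1.0, s, 0.0, True)]
                    continue
                outcomes = []
                for b in ((a - 1) % 4, a, (a + 1) % 4):   # slip left/right of a
                    dr, dc = MOVES[b]
                    r2 = min(max(row + dr, 0), n_rows - 1)  # walls block moves
                    c2 = min(max(col + dc, 0), n_cols - 1)
                    tile = desc[r2][c2]
                    outcomes.append((1 / 3, r2 * n_cols + c2,
                                     float(tile == "G"), tile in "GH"))
                P[s][a] = outcomes
    return P


def q_value(P, U, s, a, gamma):
    """Expected reward plus discounted utility; terminal successors add no utility."""
    return sum(p * (r + gamma * U[s2] * (not done)) for p, s2, r, done in P[s][a])


def greedy_policy(P, U, gamma):
    """Best action per state (arbitrary for terminal states, where all Q are 0)."""
    return [max(range(4), key=lambda a: q_value(P, U, s, a, gamma)) for s in P]


P = build_model(MAP)
policy = greedy_policy(P, utilities, gamma=0.99)

# Print the policy as arrows, keeping the letters for holes and the goal.
for row in range(4):
    print("".join(MAP[row][col] if MAP[row][col] in "GH"
                  else "<v>^"[policy[row * 4 + col]] for col in range(4)))
```

Because the lake is slippery, the greedy action often points away from a hole instead of straight at the goal, so do not be surprised by arrows that look indirect.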