---
title: Reinforcement Learning
categories: session
---

+ **Goal** Understand and be able to implement Q-learning
+ Builds on the Markov Decision Process model [MDP]() from last week.
+ Next week: [Deep Q-Learning]()
+ **Reading** Russell and Norvig Chapter 23
+ [Eirik's slides from 2022](Reinforcement Learning Slides 2.pdf)
+ [Eirik's Jupyter Notebook 2022](https://github.com/eiriksfa/pai-rl/blob/main/pai_rl/notebooks/frozen_lake.ipynb)

# Exercises

![RL Agent](rlagent.png)

In this session we will create a Reinforcement Learning agent, as depicted in the diagram. Note the internal model, with one function to update it and one function using it to choose an action. You should keep this picture in mind throughout.

Last session we discussed the Q-function,
$$Q(s,a) = \sum_{s'}P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'}Q(s',a')\right]$$
and the function for the optimal policy based on the Q-function:
$$\pi^*(s) = \mathop{\text{argmax}}\limits_a Q(s,a)$$
We also discussed iterative estimation of the utilities and the policies. This session, we will implement an iterative estimation algorithm for the Q-values, known as Q-learning. This is a model-free, off-policy reinforcement learning algorithm.

The exercise outline below is based partly on Eirik's assignment from 2022 and partly on the Gymnasium [tutorial on Blackjack](https://www.gymlibrary.dev/environments/toy_text/blackjack/). Note that I have not asked you explicitly to output any diagnostics along the way. You will almost certainly have to add some yourself, so that you know what is going on.

## Goal and overview

The goal for this session is to implement an agent that solves the Frozen Lake problem as well as possible, using Q-learning. The skeleton for the agent will look like this:

```python
class Agent:
    def __init__(
            self,
            env,
            learning_rate=0.1,
            initial_epsilon=1.0,
            epsilon_decay=10**(-50000),
            final_epsilon=0.1,
            discount_factor=0.95):
        pass

    def get_action(self, obs):
        pass

    def update(
            self,
            obs,
            action,
            reward,
            terminated,
            next_obs):
        pass

    def decay_epsilon(self):
        pass
```

Thus we need four methods. The most obvious ones are the constructor, the move generator, and the model updater. The last method reduces $\epsilon$, which is the probability of making a random move instead of the best move according to the model.

To run the agent, you can use the following script:

```python
import matplotlib.pyplot as plt
from Agent import Agent
import gymnasium as gym

env = gym.make('FrozenLake-v1', desc=None, map_name="4x4",
               is_slippery=False, render_mode="human")

done = False
observation, info = env.reset()
action = env.action_space.sample()
observation, reward, terminated, truncated, info = env.step(action)

agent = Agent(env)

for episode in range(30):
    obs, info = env.reset()
    done = False

    # play one episode
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)

        # update the agent
        agent.update(obs, action, reward, terminated, next_obs)

        # record whether the environment is done and move to the next observation
        done = terminated or truncated
        obs = next_obs

    agent.decay_epsilon()
```

**Note** We have set `is_slippery=False` above. That is useful for the initial testing; we will change it to `True` later.

**Note 2** We use 30 episodes. This is ridiculously few, but the animation is slow, and we need to be able to run the script several times while testing.

**Note 3** You can turn off the animation by changing to `render_mode="rgb_array"`. This is a lot faster, but you will need some other way to see what is going on.
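Once the animation is off (Note 3), you need another way to follow the training. One possible diagnostic, sketched below under the assumption that the `Agent` class above is in place, is to count the episodes that end in a win; in Frozen Lake the reward is 1 only when the goal is reached, so the final reward of an episode tells you whether it was won. The episode count of 1000 is an arbitrary choice.

```python
# A rough sketch of a head-less training run that counts wins.
# Assumes the Agent class and environment settings from the script above.
import gymnasium as gym
from Agent import Agent

env = gym.make('FrozenLake-v1', desc=None, map_name="4x4",
               is_slippery=False, render_mode="rgb_array")
agent = Agent(env)

wins = 0
n_episodes = 1000          # arbitrary; pick something large enough to learn
for episode in range(n_episodes):
    obs, info = env.reset()
    done = False
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        agent.update(obs, action, reward, terminated, next_obs)
        done = terminated or truncated
        obs = next_obs
    agent.decay_epsilon()
    wins += int(reward > 0)  # the goal is the only state giving reward 1

print(f"Won {wins} of {n_episodes} episodes")
```

Exercise 3 below asks for exactly this kind of counter, so treat the sketch as one possible starting point rather than the required solution.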
## 1. Constructor

Implement the constructor. You need to record all the parameters and initialise the Q-table. You can use Eirik's initial Q-values below, or you can use a `defaultdict` as the [Blackjack tutorial](https://www.gymlibrary.dev/environments/toy_text/blackjack/) does.

```python
import numpy as np

initialQ = np.array([
    [0.009, 0.192, 0.007, 0.009],
    [0.003, 0.002, 0.003, 0.17],
    [0.003, 0.002, 0.001, 0.067],
    [0.001, 0.001, 0.002, 0.037],
    [0.526, 0.002, 0.001, 0.002],
    [0., 0., 0., 0.],
    [0.046, 0., 0., 0.],
    [0., 0., 0., 0.],
    [0.002, 0.002, 0.002, 0.709],
    [0.001, 0.597, 0.001, 0.001],
    [0.945, 0., 0., 0.],
    [0., 0., 0., 0.],
    [0., 0., 0., 0.],
    [0.02, 0.012, 0.898, 0.016],
    [0.061, 0.991, 0.092, 0.068],
    [0., 0., 0., 0.]
])
```

In this format `initialQ[state][action]` is the tentative value for $Q$(`state`,`action`).

## 2. Move generator

The move generator `get_action()` has to return a valid action, that is, an integer in the range 0 to 3 for the Frozen Lake problem. With probability $\epsilon$ you want to return a random action (see last session for a code example), and with probability $1-\epsilon$ the action which maximises $Q$ according to the current estimate.

+ Implement `get_action()`.
+ Test the simulator. It should work already at this stage.
+ Test it also with `initial_epsilon=0`, so that you can see what happens when it avoids the random moves altogether.

This move generator is known as the *Epsilon-Greedy Algorithm*. It makes the greedy choice, except that with probability $\epsilon$ it picks at random.

## 3. Diagnostic output

+ Add code to count the number of times you win the game.
+ Turn off the animation and increase the number of episodes.
+ Is the default strategy ever able to win the game?

## 4. Model updater

Now we need some way to update the Q-table. Q-learning is based on one very simple update rule:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(\left[ R(s,a,s') + \gamma \max\limits_{a'}Q(s',a')\right] - Q(s,a)\right),$$
where $\alpha$ is the learning rate, which controls the speed of convergence.

+ Look at the correction term in the update rule (the parenthesis multiplied by $\alpha$). Are all the quantities known to the agent? What are the corresponding variables in Python?
+ Implement `update()`. (There is a sketch at the end of the page if you get stuck, but try your own version first.)

## 5. Epsilon decay

```python
self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
```

+ What does the above line do?
+ Do the attribute names match the ones you have used?
+ Implement `decay_epsilon()`.
+ Reflect: Why should $\epsilon$ decline during training?

## 6. Testing

+ Test the system. Add more diagnostic output as required.
+ Turn the animation off to be able to test realistically.
+ Try both Eirik's default Q-table and one initialised with zeroes only. Does this matter a lot?
+ Does the Q-table change a lot during training?
+ What happens when you change the training parameters (the input to the Agent constructor)?

## 7. The slippery ice

Change the environment to be slippery:

```python
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True)
```

+ Repeat the tests from exercise 6. What do you observe?

## 8. Reflection

+ Reflect upon your solution.
+ What are the key elements of Q-learning?
+ Which design decisions are critical to making Q-learning work?
+ Is Q-learning an appropriate solution to the problem?

## 9. Optional

Adapt your solution to other problems in Gymnasium, such as [Blackjack](https://www.gymlibrary.dev/environments/toy_text/blackjack/).
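For exercise 9, note that Blackjack observations are not small integers but tuples (player sum, dealer's showing card, usable ace), so a fixed NumPy table cannot be indexed by them directly. The `defaultdict` approach mentioned in exercise 1 handles both kinds of observation. The sketch below only shows the changed pieces; the environment name and spaces are the standard Gymnasium ones, while the variable names are my own.

```python
from collections import defaultdict
import numpy as np
import gymnasium as gym

env = gym.make("Blackjack-v1")

# One row of Q-values per observation, created lazily on first visit.
# Tuple observations work directly as dictionary keys, so the same
# lookup also works for Frozen Lake's integer states.
q_values = defaultdict(lambda: np.zeros(env.action_space.n))

obs, info = env.reset()
greedy_action = int(np.argmax(q_values[obs]))  # same argmax as on Frozen Lake
```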
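Finally, for reference, here is a minimal sketch of how the update rule in exercise 4 might be written, assuming a NumPy-backed Q-table as in exercise 1 and `import numpy as np` at the top of the file. The attribute names `self.q_values`, `self.lr` and `self.discount_factor` are my own choices; replace them with whatever you used in your constructor.

```python
def update(self, obs, action, reward, terminated, next_obs):
    # Estimated value of the next state; zero if the episode has ended,
    # since there is no future reward to collect from a terminal state.
    future_q = (not terminated) * np.max(self.q_values[next_obs])

    # Temporal-difference error: the bracketed target minus the current estimate.
    td_error = reward + self.discount_factor * future_q - self.q_values[obs][action]

    # Move the estimate a fraction alpha (the learning rate) towards the target.
    self.q_values[obs][action] += self.lr * td_error
```

Try to write your own version before comparing it with this one.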