
TensorFlow for Reinforcement Learning

Slides rl-3.pdf

Task 1 - TensorFlow/Keras

We will start today by getting used to TensorFlow/Keras. You can also adapt the exercises to PyTorch or similar if preferred (but code examples will be given in TensorFlow).

First, install TensorFlow:
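For example, with pip:

    pip install tensorflow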

We will only be using the Keras API; the documentation is available at keras.io.

Verify in Python with:
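For example (this should print the installed version without errors):

    import tensorflow as tf
    print(tf.__version__)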

Part A - Perceptron

We can make a single perceptron with Keras like this:
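For example, a minimal sketch: a single Dense unit with one input (linear activation by default):

    import tensorflow as tf

    # A single perceptron: one input, one output unit
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1,)),
        tf.keras.layers.Dense(1)
    ])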

and do a forward propagation with:
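For example (note that Keras expects a batch, so a single value is wrapped as a 1x1 array):

    import numpy as np

    # Forward-propagate one input value; the result is a 1x1 tensor
    output = model(np.array([[1.0]]))
    print(output)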

Furthermore, we can get and set the current weights with:
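For example:

    # get_weights returns a list [kernel, bias] of numpy arrays
    print(model.get_weights())

    # set_weights expects the same structure back
    model.set_weights([np.array([[2.0]]), np.array([0.5])])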

Tasks/Questions

  • Test out different values for the weight and bias.
  • How do you forward-propagate multiple values at once?
  • Can you plot the graph for some range of inputs?
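As a hint for the last two questions, a possible sketch (assuming matplotlib is installed):

    import numpy as np
    import matplotlib.pyplot as plt

    # Multiple values are forward-propagated as a batch: one row per input
    xs = np.linspace(-5, 5, 100).reshape(-1, 1)
    ys = model(xs).numpy()

    plt.plot(xs, ys)
    plt.xlabel("input")
    plt.ylabel("output")
    plt.show()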

Task 2 - Q-Values from an ANN

We still want to work with Q-values, meaning that we would like the network to output a value for each possible action. Our FrozenLake environment has 4 possible actions, and we already know the Q-values for all possible states, making it easy to fit a neural network.

Part A - Creating a network

The following code creates a neural network that takes a state (one value) as input and outputs 4 values (one for each action). It also assumes 16 possible states (0-15):
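A sketch along those lines, assuming TensorFlow 2.x (the hidden-layer size and the choice of optimizer/loss are assumptions):

    import numpy as np
    import tensorflow as tf

    # x_data: one row per state, each row containing a single state number 0-15
    x_data = np.arange(16, dtype=np.float32).reshape(-1, 1)

    # Scale the input so all features have the same magnitude
    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(x_data)

    # One input value -> normalization -> hidden layer -> 4 outputs (one Q-value per action)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1,)),
        normalizer,
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(4)
    ])

    model.compile(optimizer="adam", loss="mse")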

Answer the following:

  • What does the x_data look like (data type, contents, structure)? This will be used as network input, that is, each element should be a state.
  • What is the design (structure) of this neural network? Look primarily at the line defining model.

The normalizer (which is defined after x_data) scales the input so that all features have the same magnitude.

The model.compile statement defines the optimizer (the algorithm that tunes the weights) and the loss function (which quantifies the cost of the current errors).

Tasks/Questions

Part B - Training

As we already have Q-values, let us train the network on the data:
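A sketch, assuming the Q-table from last week's exercises is available as a 16x4 numpy array named q_table (the name and the number of epochs are assumptions):

    # y_data: rows correspond to states, columns to actions (16 x 4)
    y_data = q_table

    # Train the network to reproduce the Q-table
    model.fit(x_data, y_data, epochs=500, verbose=0)

    # Apply the network to predict Q-values for every state
    print(model(x_data))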

  • y_data is the Q-table in the format we have used before. Rows correspond to states and columns to actions.
  • model.fit trains the network
  • model(x_data) applies the network to predict the Q-values for each state in x_data.

Discuss/answer the following.

  • Test out the forward propagation; are the values similar to what you would expect from a Q-table?
  • Plot the utility given optimal play. (Do this manually if you do not instantly see how to program it.)
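For the plotting part, one possible sketch (the utility of a state under optimal play is the maximum Q-value over its actions, and FrozenLake's 16 states form a 4x4 grid):

    import numpy as np
    import matplotlib.pyplot as plt

    # Best achievable Q-value per state = utility under optimal play
    utility = np.max(model(x_data).numpy(), axis=1)

    plt.imshow(utility.reshape(4, 4))
    plt.colorbar()
    plt.show()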

Part C - FrozenLake

Given the model trained above and an optimal policy (argmax of output), can you move around the environment/solve the problem?
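A possible rollout sketch, assuming the Gymnasium version of FrozenLake-v1 from earlier exercises:

    import gymnasium as gym
    import numpy as np

    env = gym.make("FrozenLake-v1", render_mode="human")
    state, info = env.reset()
    done = False
    while not done:
        # Greedy policy: the action with the highest predicted Q-value
        q_values = model(np.array([[state]], dtype=np.float32))
        action = int(np.argmax(q_values))
        state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated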

Task 3 - DQN

Given the exercises from last week, we now only need an implementation of a replay buffer to build a DQN (Deep Q-Network) agent. The replay buffer needs two methods: one to store experiences (state, action, reward, next_state) and one to sample a random batch from the buffer.

Implement these two methods.
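A minimal sketch of such a buffer (the class and method names are just suggestions):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=10000):
            # A deque discards the oldest experiences once capacity is reached
            self.buffer = deque(maxlen=capacity)

        def store(self, state, action, reward, next_state):
            # Save one experience tuple
            self.buffer.append((state, action, reward, next_state))

        def sample(self, batch_size):
            # Draw a random minibatch of stored experiences
            return random.sample(self.buffer, batch_size)

In practice you may also want to store a done flag with each experience, so terminal states can be handled correctly when computing targets.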

Task 4 - MountainCar

Until now we have been working on the FrozenLake environment. Try to solve the MountainCar environment using techniques we have learned in this course.
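As a starting point (note that, unlike FrozenLake, the observations here are two continuous values, so the network input shape must change):

    import gymnasium as gym

    env = gym.make("MountainCar-v0")
    state, info = env.reset()
    print(env.observation_space)  # Box with 2 values: position and velocity
    print(env.action_space)       # Discrete(3): push left, no push, push right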

I will update this page with a DQN solution later (hopefully before the end of the day).