Deep Q-Learning was introduced in 2014. Since then, many improvements have been made. So, today we'll look at four strategies that dramatically improve the training and the results of our DQN agents: fixed Q-targets, Double DQN, Dueling DQN (aka DDQN), and Prioritized Experience Replay (PER).
We'll implement an agent that learns to play Doom's Deadly Corridor scenario. Our AI must navigate toward the fundamental goal (the vest) while surviving at the same time by killing enemies.
We saw in the Deep Q Learning article that, when we want to calculate the TD error (aka the loss), we calculate the difference between the TD target (Q_target) and the current Q value (estimation of Q).
But we don’t have any idea of the real TD target. We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q value for the next state.
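In symbols, with w denoting the network's weights, this TD target is written as:

$$ y_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; w) $$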
However, the problem is that we are using the same parameters (weights) for estimating the target and the Q value. As a consequence, there is a big correlation between the TD target and the parameters (w) we are changing.
Therefore, it means that at every step of training, our Q values shift but also the target value shifts. So, we’re getting closer to our target but the target is also moving. It’s like chasing a moving target! This leads to a big oscillation in training.
It's as if you were a cowboy (the Q estimation) trying to catch a cow (the Q-target): you have to get closer (reduce the error).
At each time step, you’re trying to approach the cow, which also moves at each time step (because you use the same parameters).
This leads to a very strange path of chasing (a big oscillation in training).
Instead, we can use the idea of fixed Q-targets introduced by DeepMind: use a separate network with fixed parameters (let's call them w⁻) for estimating the TD target, and copy the parameters from our DQN network to the target network only every tau steps.
Thanks to this procedure, we’ll have more stable learning because the target function stays fixed for a while.
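With fixed Q-targets, the target is computed with the separate, periodically updated weights w⁻ instead of the weights we are training:

$$ y_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; w^{-}) $$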
Implementing fixed Q-targets is pretty straightforward:
- First, we create two networks (DQNetwork, TargetNetwork).
- Then, we create a function that takes our DQNetwork parameters and copies them to our TargetNetwork.
- Finally, during training, we calculate the TD target using our target network and update the target network with the DQNetwork every tau steps (tau is a hyper-parameter that we define).
```python
# Instantiate the DQNetwork
DQNetwork = DQNNet(state_size, action_size, learning_rate, name="DQNetwork")

# Instantiate the target network
TargetNetwork = DQNNet(state_size, action_size, learning_rate, name="TargetNetwork")

# This function helps us copy one set of variables to another
# In our case, we use it to copy the parameters of DQNetwork to TargetNetwork
# Thanks to the very good implementation of Arthur Juliani https://github.com/awjuliani
def update_target_graph():
    # Get the parameters of our DQNetwork
    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "DQNetwork")

    # Get the parameters of our TargetNetwork
    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "TargetNetwork")

    op_holder = []

    # Update our TargetNetwork parameters with the DQNetwork parameters
    for from_var, to_var in zip(from_vars, to_vars):
        op_holder.append(to_var.assign(from_var))
    return op_holder
```
```python
# Get Q values for next_state from the target network
Qs_next_state = sess.run(TargetNetwork.output, feed_dict={TargetNetwork.inputs_: next_states_mb})

...

if tau > max_tau:
    # Update the parameters of our TargetNetwork with the DQNetwork weights
    update_target = update_target_graph()
    sess.run(update_target)
    tau = 0
    print("Model updated")
```
Double DQN, or double learning, was introduced by Hado van Hasselt. This method handles the problem of the overestimation of Q-values. To understand this problem, remember how we calculate the TD target: the max operator uses the same Q values both to select and to evaluate an action.
By calculating the TD target, we face a simple problem: how are we sure that the best action for the next state is the action with the highest Q-value?
We know that the accuracy of q values depends on what action we tried and what neighboring states we explored.
As a consequence, at the beginning of the training, we don’t have enough information about the best action to take. Therefore, taking the maximum q value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly given a higher Q value than the optimal best action, the learning will be complicated.
The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q-value generation. We use our DQN network to select the best action to take for the next state (the action with the highest Q value), and we use our target network to calculate the target Q value of taking that action at the next state.
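Written out, the Double DQN target is:

$$ y_t = r_{t+1} + \gamma \, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; w); w^{-}\big) $$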
Therefore, Double DQN helps us reduce the overestimation of q values and, as a consequence, helps us train faster and have more stable learning.
```python
### DOUBLE DQN Logic
# Use DQNetwork to select the action to take at next_state (a'), the action with the highest Q-value
# Use TargetNetwork to calculate the Q value of Q(s', a')

# Get Q values for next_state from the DQNetwork
Qs_next_state = sess.run(DQNetwork.output, feed_dict={DQNetwork.inputs_: next_states_mb})

# Calculate Qtarget(s') with the TargetNetwork
q_target = sess.run(TargetNetwork.output, feed_dict={TargetNetwork.inputs_: next_states_mb})

# Set Q_target = r if the episode ends at s+1, otherwise Q_target = r + gamma * Qtarget(s', a')
for i in range(0, len(batch)):
    terminal = dones_mb[i]

    # We got a'
    action = np.argmax(Qs_next_state[i])

    # If we are in a terminal state, the target only equals the reward
    if terminal:
        target_Qs_batch.append(rewards_mb[i])
    else:
        # Take the Qtarget for action a'
        target = rewards_mb[i] + gamma * q_target[i][action]
        target_Qs_batch.append(target)
```
Double Q-Learning ([1] H. van Hasselt 2010) was proposed to solve the problem of large overestimations of action values (Q-values) in basic Q-Learning. In basic Q-Learning, the optimal policy of the agent is always to choose the best action in any given state, under the assumption that the best action is the one with the maximum expected/estimated Q-value. However, the agent knows nothing about the environment at the beginning: it has to estimate Q(s, a) first and update these estimates at each iteration. Such Q-values contain a lot of noise, and we are never sure whether the action with the maximum estimated Q-value is really the best one.

As a consequence, the agent can keep choosing an action that is not actually optimal simply because it happens to have the maximum estimated Q-value. This problem is called the overestimation of action values (Q-values). When it occurs, the noise in the estimated Q-values causes large positive biases in the update procedure. As a result, the learning process becomes very complicated and messy.
Since the best Q(s', a) used in the target is itself estimated from the current, very noisy Q-values, the difference between the target and the current Q(s, a) in the loss function is also noisy and contains positive biases caused by this noise. These positive biases have a tremendous impact on the update procedure.
By the way, if the noise on all Q-values were uniformly distributed, that is, if all Q-values were equally overestimated, then overestimation would not be a problem, since the noise would not change the difference between Q(s', a) and Q(s, a). More details are in [1] H. van Hasselt 2010, Section 2.
Double Q-Learning uses two different action-value functions, Q and Q', as estimators. Even though Q and Q' are both noisy, their noise is independent, so using one function to select the action and the other to evaluate it removes the upward bias. That is, this algorithm solves the problem of overestimation of action values. The proof is in [1] H. van Hasselt 2010, Section 3.
The update procedure is slightly different from the basic version.
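To make this concrete, here is a minimal sketch of the tabular Double Q-Learning step from [1] (function and variable names are mine, for illustration only): with probability 0.5 we update Q_A, using Q_A to select the greedy next action and Q_B to evaluate it; otherwise the roles are swapped.

```python
import numpy as np

def double_q_update(q_a, q_b, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Double Q-Learning step; q_a and q_b are (n_states, n_actions) arrays.
    Terminal-state handling is omitted for brevity."""
    if np.random.rand() < 0.5:
        # Q_A selects the greedy next action, Q_B evaluates it
        best_next = np.argmax(q_a[s_next])
        td_target = r + gamma * q_b[s_next, best_next]
        q_a[s, a] += alpha * (td_target - q_a[s, a])
    else:
        # Roles swapped: Q_B selects, Q_A evaluates
        best_next = np.argmax(q_b[s_next])
        td_target = r + gamma * q_a[s_next, best_next]
        q_b[s, a] += alpha * (td_target - q_b[s, a])
```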
Double DQN is proposed in [2] H. van Hasselt, 2016. Inspired by Double Q-Learning, Double DQN uses two different Deep Neural Networks, Deep Q Network (DQN) and Target Network.
Note that there is no learning rate α when updating the Q-values since it will be used in the optimization stage of updating the parameters of the Deep Q Network.
Remember that Q-values correspond to how good it is to be at that state and taking an action at that state Q(s,a).
So we can decompose Q(s,a) as the sum of:
- V(s): the value of being at that state;
- A(s,a): the advantage of taking that action at that state (i.e., how much better it is to take this action versus all other possible actions at that state).

With Dueling DQN (aka DDQN), we want to separate the estimators of these two elements, using two new streams:
- one stream that estimates the state value V(s);
- one stream that estimates the advantage of each action A(s,a).
And then we combine these two streams through a special aggregation layer to get an estimate of Q(s,a).
Wait? But why do we need to calculate these two elements separately if then we combine them?
By decoupling the estimation, intuitively our DDQN can learn which states are (or are not) valuable without having to learn the effect of each action at each state (since it’s also calculating V(s)). With our normal DQN, we need to calculate the value of each action at that state. But what’s the point if the value of the state is bad? What’s the point to calculate all actions at one state when all these actions lead to death? As a consequence, by decoupling we’re able to calculate V(s). This is particularly useful for states where their actions do not affect the environment in a relevant way. In this case, it’s unnecessary to calculate the value of each action. For instance, moving right or left only matters if there is a risk of collision. And, in most states, the choice of action has no effect on what happens.
It will be clearer if we take the example in the paper Dueling Network Architectures for Deep Reinforcement Learning.
We see that the value network streams pay attention (the orange blur) to the road, and in particular to the horizon where the cars are spawned. It also pays attention to the score.
On the other hand, the advantage stream in the first frame on the right does not pay much attention to the road, because there are no cars in front (so the action choice is practically irrelevant). But, in the second frame, it pays attention, as there is a car immediately in front of it, and making a choice of action is crucial and very relevant.
Concerning the aggregation layer, we want to generate the Q values for each action at that state. We might be tempted to combine the streams by simply summing them: Q(s,a) = V(s) + A(s,a).
But if we do that, we’ll fall into the issue of identifiability, that is — given Q(s,a) we’re unable to find A(s,a) and V(s).
And not being able to find V(s) and A(s,a) given Q(s,a) will be a problem for our backpropagation. To avoid this problem, we can force our advantage function estimator to have 0 advantage at the chosen action.
To do that, we subtract the mean advantage over all possible actions of the state from the advantage of each action.
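This gives the aggregation used in the Dueling Network Architectures paper:

$$ Q(s,a) = V(s) + \left( A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a') \right) $$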
Therefore, this architecture helps us accelerate the training. We can calculate the value of a state without calculating the Q(s,a) for each action at that state. And it can help us find much more reliable Q values for each action by decoupling the estimation between two streams.
The only thing to do is to modify the DQN architecture by adding these new streams: Link
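For illustration, here is a minimal Keras sketch of a dueling head. A dense trunk stands in for the convolutional layers used on Doom frames, and the names and layer sizes are mine, not the notebook's exact architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dueling_dqn(state_size, action_size, learning_rate):
    # Shared trunk (a dense layer stands in for the conv layers used on image inputs)
    inputs = layers.Input(shape=(state_size,))
    x = layers.Dense(64, activation="relu")(inputs)

    # Value stream: V(s), one scalar per state
    v = layers.Dense(32, activation="relu")(x)
    v = layers.Dense(1)(v)

    # Advantage stream: A(s, a), one value per action
    a = layers.Dense(32, activation="relu")(x)
    a = layers.Dense(action_size)(a)

    # Aggregation layer: Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))
    q = layers.Lambda(
        lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
    )([v, a])

    model = Model(inputs=inputs, outputs=q)
    model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate))
    return model
```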
Prioritized Experience Replay (PER) was introduced in 2015 by Tom Schaul. The idea is that some experiences may be more important than others for our training, but might occur less frequently.
Because we sample the batch uniformly (selecting the experiences randomly) these rich experiences that occur rarely have practically no chance to be selected.
That’s why, with PER, we try to change the sampling distribution by using a criterion to define the priority of each tuple of experience.
We want to take in priority experience where there is a big difference between our prediction and the TD target since it means that we have a lot to learn about it.
We use the magnitude of our TD error (its absolute value) as the priority.
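As in the PER paper, the priority of a transition is its TD error magnitude plus a small constant e that guarantees no experience ends up with zero priority:

$$ p_t = |\delta_t| + e $$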
And we store that priority with each experience in the replay buffer.
But we can’t just do greedy prioritization, because it will lead to always training the same experiences (that have big priority), and thus over-fitting.
So we introduce stochastic prioritization, which turns each priority into a probability of being chosen for replay.
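Concretely, the probability of sampling experience i is its priority raised to the power a, normalized over all stored priorities (a = 0 gives uniform sampling, a = 1 gives full prioritization; this is the PER_a hyperparameter in the code below):

$$ P(i) = \frac{p_i^{a}}{\sum_k p_k^{a}} $$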
As a consequence, at each time step we will sample a batch according to this probability distribution and train our network on it.
But, we still have a problem here. Remember that with normal Experience Replay, we use a stochastic update rule. As a consequence, the way we sample the experiences must match the underlying distribution they came from.
With normal Experience Replay, we select our experiences uniformly at random. There is no bias, because each experience has the same chance of being selected, so we can update our weights normally.
But, because we use priority sampling, purely random sampling is abandoned. As a consequence, we introduce a bias toward high-priority samples (more chances to be selected).
And, if we update our weights normally, we take the risk of over-fitting. Samples that have high priority are likely to be used for training many times in comparison with low priority experiences (= bias). As a consequence, we’ll update our weights with only a small portion of experiences that we consider to be really interesting.
To correct this bias, we use importance sampling weights (IS) that will adjust the updating by reducing the weights of the often seen samples.
The weights corresponding to high-priority samples receive only a small adjustment (because the network will see these experiences many times), whereas those corresponding to low-priority samples receive a full update.
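The importance-sampling weight of experience i is defined as follows, where N is the size of the replay buffer, P(i) is its sampling probability, and b is the exponent annealed from its initial value (PER_b in the code) up to 1:

$$ w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{b} $$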
The role of b is to control how much these importance sampling weights affect learning. In practice, the b parameter is annealed up to 1 over the duration of the training, because these weights are more important at the end of learning when our q values begin to converge. The unbiased nature of updates is most important near convergence, as explained in this article.
This time, the implementation will be a little bit fancier.
First of all, we can't implement PER by simply keeping the experience replay buffer sorted according to the priorities. That would not be efficient at all, due to O(n log n) for insertion and O(n) for sampling.
As explained in this really good article, we need to use another data structure instead of sorting an array — an unsorted Sumtree.
A SumTree is a binary tree, i.e., a tree in which each node has at most two children. The leaves (the deepest nodes) contain the priority values, and a separate data array, whose entries point to the leaves, contains the experiences.
Updating the tree and sampling will be really efficient (O(log n)).
Then, we create a memory object that will contain our sumtree and data.
Next, to sample a minibatch of size k, the range [0, total_priority] will be divided into k ranges. A value is uniformly sampled from each range.
Finally, the transitions (experiences) that correspond to each of these sampled values are retrieved from the sumtree.
It will be much clearer when we dive into the complete details in the notebook.
I will use the Morvan Zhou SumTree code from this link. So, first, we create a SumTree object class:
```python
import numpy as np

class SumTree(object):
    data_pointer = 0

    # Here we initialize the tree with all nodes = 0 and the data with all values = 0
    def __init__(self, capacity):
        # Number of leaf nodes (final nodes) that contain experiences
        self.capacity = capacity

        # Generate the tree with all node values = 0
        # To understand this calculation (2 * capacity - 1), look at the schema below
        # Remember we are in a binary tree (each node has at most 2 children),
        # so 2x the number of leaves (capacity) - 1 (root node)
        # Parent nodes = capacity - 1
        # Leaf nodes = capacity
        self.tree = np.zeros(2 * capacity - 1)

        # Contains the experiences (so the size of data is capacity)
        self.data = np.zeros(capacity, dtype=object)
```
First, we want to build a tree with all nodes = 0 and initialize the data with all values = 0. So, we define the number of leaf nodes (final nodes) that contain experiences. Next, with the self.tree = np.zeros(2 * capacity - 1) line we generate the tree with all node values = 0. To understand this calculation (2 * capacity - 1), look at the schema below:

```
      0
     / \
    0   0
   / \ / \
  0  0 0  0
```

Here we are in a binary tree (each node has at most 2 children), so the tree has 2x the number of leaves (capacity) - 1 (the root node). So, counting all nodes: parent nodes = capacity - 1 and leaf nodes = capacity. Finally, we define our data array, which contains the experiences (so the size of data is capacity).
Second, we define an add function that will add our priority score to a sumtree leaf and add the experience to data:
```python
def add(self, priority, data):
    # Look at what index we want to put the experience at
    tree_index = self.data_pointer + self.capacity - 1

    """ tree:
                 0
                / \
               0   0
              / \ / \
     tree_index 0 0  0   We fill the leaves from left to right
    """

    # Update the data frame
    self.data[self.data_pointer] = data

    # Update the leaf
    self.update(tree_index, priority)

    # Add 1 to data_pointer
    self.data_pointer += 1

    # If we're above the capacity, we go back to the first index (we overwrite)
    if self.data_pointer >= self.capacity:
        self.data_pointer = 0
```
While putting new data into our tree, we fill the leaves from left to right, so the first thing we do is look at what index we want to put the experience at:

```python
tree_index = self.data_pointer + self.capacity - 1
```

This is what our tree will look like while we start filling it:

```
          0
         / \
        0   0
       / \ / \
tree_index 0 0 0
```

So, while adding new information to our tree, we do 3 steps: we write the experience into the data array, we update the leaf with the new priority score (and propagate the change up the tree), and we add 1 to data_pointer. If we reach the capacity limit, we go back to the first index and start overwriting.
As I said, next we create a function to update the leaf priority score and propagate the change through the tree:
```python
def update(self, tree_index, priority):
    # Change = new priority score - former priority score
    change = priority - self.tree[tree_index]
    self.tree[tree_index] = priority

    # Then propagate the change through the tree
    # (this loop is faster than a recursive call)
    while tree_index != 0:
        tree_index = (tree_index - 1) // 2
        self.tree[tree_index] += change
```
In the update function, we first calculate the change in priority: we subtract the previous priority score from the new priority and overwrite the previous priority with the new one. After that, we propagate the change up through the tree in a while loop.
Here is how our tree looks, with leaf 6 highlighted:

```
     0
    / \
   1   2
  / \ / \
 3  4 5 [6]
```

The numbers in this tree are indexes, not priority values. So, for example, if we update the priority score of the leaf at index 6, we then need to update its parent node at index 2:
```python
tree_index = (tree_index - 1) // 2
tree_index = (6 - 1) // 2
tree_index = 2  # (// is integer division, so the result is rounded down)
```
The last step is to update that parent node with the calculated change (the while loop then continues this propagation up to the root):

```python
self.tree[2] += change
```
Next, we must build a function to get a leaf from our tree. So, we’ll build a function to get the leaf_index, priority value of that leaf, and experience associated with that leaf index:
```python
def get_leaf(self, v):
    parent_index = 0

    while True:
        left_child_index = 2 * parent_index + 1
        right_child_index = left_child_index + 1

        # If we reach the bottom, end the search
        if left_child_index >= len(self.tree):
            leaf_index = parent_index
            break
        else:  # downward search, always searching for a higher priority node
            if v <= self.tree[left_child_index]:
                parent_index = left_child_index
            else:
                v -= self.tree[left_child_index]
                parent_index = right_child_index

    data_index = leaf_index - self.capacity + 1

    return leaf_index, self.tree[leaf_index], self.data[data_index]

@property
def total_priority(self):
    return self.tree[0]  # Returns the root node (the sum of all priorities)
```
To understand what we are doing, let's look at our tree from an index perspective:

```
       0       -> storing the priority sum
      / \
     1   2
    / \ / \
   3  4 5  6   -> storing priorities of individual experiences
```

Here we loop in a while loop. The first thing we do is find the left and right child indexes of the current parent. We keep descending until we reach a leaf. Once we know the leaf index, we calculate the corresponding data index, and finally we return the leaf index, the leaf priority, and the data stored at that index. At the end I also wrote the `total_priority` property, which returns the root node (the sum of all priorities). Now we have finished constructing our SumTree object; next we'll build a memory object. Writing this tutorial I relied on code from this link. So, same as before, we'll create a Memory object:
```python
class Memory(object):  # stored as (state, action, reward, next_state) in the SumTree
    PER_e = 0.01  # Hyperparameter used to avoid experiences having 0 probability of being taken
    PER_a = 0.6   # Hyperparameter used to trade off between taking only high-priority experiences and sampling randomly
    PER_b = 0.4   # importance-sampling exponent, increased from this initial value to 1
    PER_b_increment_per_sampling = 0.001

    absolute_error_upper = 1.  # clipped absolute error

    def __init__(self, capacity):
        # Making the tree
        self.tree = SumTree(capacity)
```
Here we defined three hyperparameters:
- PER_e, a small constant that ensures no experience ends up with zero probability of being selected;
- PER_a, which controls the trade-off between taking only experiences with high priority and sampling uniformly at random;
- PER_b, the importance-sampling exponent, which starts at 0.4 and is annealed up to 1.

Before, we created a SumTree that contains the priority scores at its leaves and the experiences in a data array. Now, differently from the previous tutorials, we won't use deque(), because at each timestep the indexes of our experiences would shift by one. We prefer to use a simple array and overwrite it when our memory is full.

Next, we define a function to store a new experience in our tree. Each new experience will get a score of max_priority (it will then be improved when we use this experience to train our agent). An experience, e.g. in the CartPole game, would be (state, action, reward, next_state, done). So, we define our store function:
```python
def store(self, experience):
    # Find the max priority among the leaves
    max_priority = np.max(self.tree.tree[-self.tree.capacity:])

    # If the max priority = 0, we can't set priority = 0,
    # since this experience would never have a chance of being selected.
    # So we use a minimum priority instead
    if max_priority == 0:
        max_priority = self.absolute_error_upper

    self.tree.add(max_priority, experience)  # set the max priority for the new experience
```
So, here we search for the maximum priority among the leaf nodes that contain experiences. If we can't find any priority in our tree (the max is 0), we set the priority to absolute_error_upper, in our case 1, and store the experience with it. Otherwise, we store the experience with the maximum priority we found.
Next, we create a sample function, which will be used to pick a batch from our tree memory to train our model on. First, we divide the range [0, priority_total] into n priority segments. Then a value is uniformly sampled from each segment. Finally, we search the sumtree for the experiences whose priority scores correspond to the sampled values.
```python
def sample(self, n):
    # Create a minibatch list that will contain the sampled experiences
    minibatch = []

    b_idx = np.empty((n,), dtype=np.int32)

    # Calculate the priority segment
    # Here, as explained in the paper, we divide the range [0, ptotal] into n ranges
    priority_segment = self.tree.total_priority / n  # priority segment

    for i in range(n):
        # A value is uniformly sampled from each range
        a, b = priority_segment * i, priority_segment * (i + 1)
        value = np.random.uniform(a, b)

        # The experience that corresponds to each value is retrieved
        index, priority, data = self.tree.get_leaf(value)

        b_idx[i] = index

        minibatch.append([data[0], data[1], data[2], data[3], data[4]])

    return b_idx, minibatch
```
And finally, we create a function, to update the priorities on the tree:
```python
def batch_update(self, tree_idx, abs_errors):
    abs_errors += self.PER_e  # add PER_e to avoid priorities of 0
    clipped_errors = np.minimum(abs_errors, self.absolute_error_upper)
    ps = np.power(clipped_errors, self.PER_a)

    for ti, p in zip(tree_idx, ps):
        self.tree.update(ti, p)
```
Now we have finished our Memory and SumTree classes. Don't worry, everything is uploaded to GitHub, so you can download this code! The above code is in the PER.py script.
Now we can continue with our main agent code. We'll modify it so that we can switch between deque() and prioritized experience replay with a simple Boolean flag; this will help us compare the results.
So, now we know how our Prioritized Experience Replay memory works. I stored the SumTree and Memory classes in the PER.py script, and we'll import them with the following new line: from PER import *.
In our DQN agent initialization, we create an object self.MEMORY = Memory(memory_size) with memory_size = 10000. So when using PER memory, instead of self.memory = deque(maxlen=2000) we'll use self.MEMORY. And to easily control whether we use PER or not, we'll add a self.USE_PER = True Boolean flag.
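In code, the memory-related part of the agent's initialization might look like the sketch below (names follow the tutorial's conventions; the full version is in the downloadable code):

```python
from collections import deque
from PER import Memory  # SumTree-backed prioritized memory from the PER.py script

class DQNAgent:
    def __init__(self, memory_size=10000):
        # Sketch of the memory-related attributes only
        self.USE_PER = True                  # True: sample with PER, False: sample uniformly from the deque
        self.MEMORY = Memory(memory_size)    # prioritized replay memory
        self.memory = deque(maxlen=2000)     # plain experience replay buffer
```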
Note: all memory_size = 10000 experiences are kept in memory, so if you set this number too high you may run out of RAM. While implementing PER, almost all of the functions stay the same as before; only a few of them change a little. For example, the remember function will now look like this:
```python
def remember(self, state, action, reward, next_state, done):
    experience = state, action, reward, next_state, done
    if self.USE_PER:
        self.MEMORY.store(experience)
    else:
        self.memory.append(experience)
```
As I already said, with this Boolean flag we choose which memory type our agent will use.
The replay function (the way we sample our minibatches) changes more: we take the minibatches either from the PER memory or from the deque. If we take them from the PER memory, we must also recalculate the absolute_errors and update our memory with them.
```python
def replay(self):
    if self.USE_PER:
        # Sample a minibatch from the PER memory
        tree_idx, minibatch = self.MEMORY.sample(self.batch_size)
    else:
        # Randomly sample a minibatch from the deque memory
        minibatch = random.sample(self.memory, min(len(self.memory), self.batch_size))

    '''
    everything stays the same here as before
    '''

    target_old = np.array(target)

    '''
    everything stays the same here as before
    '''

    if self.USE_PER:
        absolute_errors = np.abs(target_old[i] - target[i])
        # Update priority
        self.MEMORY.batch_update(tree_idx, absolute_errors)

    # Train the Neural Network with batches
    self.model.fit(state, target, batch_size=self.batch_size, verbose=0)
```
Our run function doesn't change. From this point, you should download the code from the GitHub link.
Now, let's look at two examples of the same CartPole balancing game, where I trained our agent for 1000 steps. I trained two variants: one without PER and one with PER.
Both agents were trained with double dueling Deep Q Network, epsilon greedy update, and soft update disabled.
First, let's look at the results of training our agent without PER; they look very similar to those in my previous tutorial. The best average score our agent could hit was around 1060:
Now let's look at our agent while it was using Prioritized Experience Replay memory:
This agent is a Dueling Double Deep Q Learning agent with PER and fixed Q-targets. The notebook is here
That’s all! You’ve just created a smarter agent that learns to play Doom. Awesome! Remember that if you want to have an agent with really good performance, you need many more GPU hours (about two days of training)!
However, with only 2–3 hours of training on CPU (yes, CPU), our agent understood that it needed to kill enemies before being able to move forward. If it moves forward without killing enemies, it will be killed before getting the vest.
Don't forget to implement each part of the code by yourself. It's really important to try to modify the code I gave you. Try to add epochs, change the architecture, add fixed Q-targets, change the learning rate, use a harder environment, and so on. Experiment, have fun!
Remember that this was a big article, so be sure to really understand why we use these new strategies, how they work, and the advantages of using them.
In the next article, we'll learn about an awesome hybrid method between value-based and policy-based reinforcement learning algorithms. It is a baseline for state-of-the-art algorithms: Advantage Actor-Critic (A2C). You'll implement an agent that learns to play Outrun!
References:
https://medium.com/@qempsil0914/deep-q-learning-part2-double-deep-q-network-double-dqn-b8fc9212bbb2
https://jaromiru.com/2016/11/07/lets-make-a-dqn-double-learning-and-prioritized-experience-replay/
https://danieltakeshi.github.io/2019/07/14/per/
https://contest.openai.com/2018-1/details/
https://pylessons.com/CartPole-PER/
http://www.florens.io/projects/priorexprepl/