Recurrent neural networks in general maintain state information about data previously passed through the network. This is true of both vanilla RNNs and LSTMs. This “hidden state,” as it is called, is passed back into the network along with each new element of a sequence of data points. Each output of the network is therefore a function not only of the input variables but also of the hidden state that serves as the network’s “memory” of what it has seen in the past.
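To make that recurrence concrete, here is a minimal sketch using PyTorch’s nn.RNNCell (the sizes and the random data are arbitrary placeholders, not anything from the examples below):

import torch
import torch.nn as nn

rnn_cell = nn.RNNCell(input_size=1, hidden_size=8)
h = torch.zeros(1, 8)               # the initial hidden state ("memory")
for x_t in torch.randn(5, 1, 1):    # a 5-step sequence, batch of 1, 1 feature
    h = rnn_cell(x_t, h)            # each step sees the input AND the prior state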
It helps to understand the gap that LSTMs fill in the abilities of traditional RNNs. Vanilla RNNs suffer from vanishing or exploding gradients. Roughly speaking, when the chain rule is applied to the equation that governs “memory” within the network, an exponential term is produced. Depending on the weights involved, that term may grow very large or shrink toward zero very rapidly. LSTMs do not suffer (as badly) from this problem of vanishing gradients and are therefore able to maintain a longer “memory,” making them ideal for learning temporal data.
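You can see the effect in a toy example that has nothing to do with LSTMs themselves: repeatedly multiplying by the same weight in the forward pass yields a gradient proportional to that weight raised to a large power (the 0.5 and the 50 steps are arbitrary choices for illustration):

import torch

W = torch.tensor([0.5], requires_grad=True)
h = torch.ones(1)
for _ in range(50):
    h = W * h        # 50 repeated applications of the same weight
h.backward()
print(W.grad)        # ~50 * 0.5**49, vanishingly small; a weight > 1 explodes instead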
Now, you likely already knew the back story behind LSTMs. You are here because you are having trouble taking your conceptual knowledge and turning it into working code.
A quick search of the PyTorch user forums will yield dozens of questions on how to define an LSTM’s architecture, how to shape the data as it moves from layer to layer, and what to do with the data when it comes out the other end. Many of those questions have no answers, and many more are answered at a level that the beginners asking them find difficult to understand.
Suffice it to say, understanding data flow through an LSTM is the number one pain point I have encountered in practice. And it seems like I’m not alone.
Perhaps the single most difficult concept to grasp when learning LSTMs after other types of networks is how the data flows through the layers of the model. It’s not magic, but it may seem so. Pictures may help:
After an LSTM layer (or set of LSTM layers), we typically add a fully connected layer to the network for final output via the nn.Linear() class. The input size of the final nn.Linear() layer will always be equal to the number of hidden nodes in the LSTM layer that precedes it. We will go over two examples of defining network architecture and passing inputs through the network:
Consider some time-series data, perhaps stock prices. Given the past seven days’ worth of stock prices for a particular product, we wish to predict the eighth day’s price. In this case, we wish our output to be a single value. We will evaluate the accuracy of this single value using MSE, so for both prediction and performance evaluation, we need a single-valued output from the seven-day input. Therefore, we would define our network architecture as something like this:
import torch.nn as nn

input_size = 1   # The number of variables in your sequence data.
n_hidden = 100   # The number of hidden nodes in the LSTM layer.
n_layers = 2     # The number of LSTM layers to stack.
out_size = 1     # The size of the output you desire from your RNN.

lstm = nn.LSTM(input_size, n_hidden, n_layers, batch_first=True)
linear = nn.Linear(n_hidden, out_size)
We can pin down some specifics of how this machine works:

- The input to the LSTM layer must be of shape (batch_size, sequence_length, number_features), where batch_size refers to the number of sequences per batch and number_features is the number of variables in your time series.
- The output of the LSTM layer will be of shape (batch_size, sequence_length, hidden_size). Take another look at the flow chart I created above.
- The nn.Linear() layer requires an input size corresponding to the number of hidden nodes in the preceding LSTM layer. Therefore, we must reshape the LSTM output into the form (batches, n_hidden).

Important note: batches is not the same as batch_size in the sense that they are not the same number. However, the idea is the same in that we are dividing up the output of the LSTM layer into batches number of pieces, where each piece is of size n_hidden, the number of hidden LSTM nodes. Here, batches = batch_size * sequence_length.
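A concrete shape check makes this reshape clear (the 32, 7, and 100 below are arbitrary example sizes):

import torch

lstm_out = torch.randn(32, 7, 100)     # (batch_size, sequence_length, n_hidden)
linear_in = lstm_out.reshape(-1, 100)
print(linear_in.shape)                 # torch.Size([224, 100]); batches = 32 * 7 = 224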
Here is some code that simulates passing input data x
through the entire network, following the protocol above:
import torch
import torch.nn as nn

input_size = 1   # The number of variables in your sequence data.
n_hidden = 100   # The number of hidden nodes in the LSTM layer.
n_layers = 2     # The total number of LSTM layers to stack.
out_size = 1     # The size of the output you desire from your RNN.

lstm = nn.LSTM(input_size, n_hidden, n_layers, batch_first=True)
linear = nn.Linear(n_hidden, out_size)

# Data Flow Protocol:
# 1. network input shape: (batch_size, seq_length, num_features)
# 2. LSTM output shape:   (batch_size, seq_length, hidden_size)
# 3. Linear input shape:  (batch_size * seq_length, hidden_size)
# 4. Linear output shape: (batch_size * seq_length, out_size)

hs = None                # initial hidden state (PyTorch zero-fills it when None)
x = get_batches(data)    # get_batches() is a user-defined data loader (not shown)
lstm_out, hs = lstm(x, hs)
linear_in = lstm_out.reshape(-1, n_hidden)
linear_out = linear(linear_in)
Recall that out_size = 1
because we only wish to know a single value, and that single value will be evaluated using MSE as the metric.
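To see the protocol end to end, here is a quick check with dummy data standing in for get_batches() (the batch size of 32 and the random targets are placeholders of my choosing):

import torch
import torch.nn as nn

lstm = nn.LSTM(1, 100, 2, batch_first=True)
linear = nn.Linear(100, 1)

x = torch.randn(32, 7, 1)                  # 32 sequences of 7 days, 1 feature
lstm_out, hs = lstm(x, None)               # lstm_out: (32, 7, 100)
pred = linear(lstm_out.reshape(-1, 100))   # pred: (32 * 7, 1)

criterion = nn.MSELoss()
targets = torch.randn(32 * 7, 1)           # dummy targets, one per time step
loss = criterion(pred, targets)
print(pred.shape, loss.item())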
In this example, we want to generate some text. A model is trained on a large body of text, perhaps a book, and then fed a sequence of characters. The model will look at each character and predict which character should come next. This time our problem is one of classification rather than regression, and we must alter our architecture accordingly. I created this diagram to sketch the general idea:
Perhaps our model has trained on a text of millions of words made up of 50 unique characters. What this means is that when our network gets a single character, we wish to know which of the 50 characters comes next. Therefore our network output for a single character will be 50 probabilities corresponding to each of 50 possible next characters.
Additionally, we will one-hot encode each character in a string of text, meaning the number of variables (input_size = 50
) is no longer one as it was before, but rather is the size of the one-hot encoded character vectors.
input_size = 50     # representing the one-hot encoded vector size
hidden_size = 100   # number of hidden nodes in the LSTM layer
n_layers = 2        # number of LSTM layers
output_size = 50    # output of 50 scores for the next character

lstm = nn.LSTM(input_size, hidden_size, n_layers, batch_first=True)
linear = nn.Linear(hidden_size, output_size)
As far as shaping the data between layers, there isn’t much difference. The logic is identical:
input_size = 50     # representing the one-hot encoded vector size
hidden_size = 100   # number of hidden nodes in the LSTM layer
n_layers = 2        # number of LSTM layers
output_size = 50    # output of 50 scores for the next character

lstm = nn.LSTM(input_size, hidden_size, n_layers, batch_first=True)
linear = nn.Linear(hidden_size, output_size)

# Data Flow Protocol:
# 1. network input shape: (batch_size, seq_length, num_features)
# 2. LSTM output shape:   (batch_size, seq_length, hidden_size)
# 3. Linear input shape:  (batch_size * seq_length, hidden_size)
# 4. Linear output shape: (batch_size * seq_length, out_size)

hs = None                # initial hidden state
x = get_batches(data)    # user-defined data loader (not shown)
x, hs = lstm(x, hs)
x = x.reshape(-1, hidden_size)
x = linear(x)
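At generation time, the 50 raw scores per character can be turned into the probabilities described earlier. One common approach (my assumption here, not something shown in the original listings) is softmax followed by sampling:

import torch
import torch.nn.functional as F

scores = torch.randn(1, 50)        # raw scores for one character position
probs = F.softmax(scores, dim=1)   # 50 probabilities summing to 1
next_char_idx = torch.multinomial(probs, num_samples=1)  # sample the next character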
However, this scenario presents a unique challenge. Because we are dealing with categorical predictions, we will likely want to use cross-entropy loss to train our model. In this case, it is especially important to know your loss function’s requirements. For example, take a look at the input requirements for PyTorch’s nn.CrossEntropyLoss() (emphasis mine, because let’s be honest, some documentation needs help):
The input is expected to contain raw, unnormalized scores for each class. The input has to be a Tensor of size either (minibatch, C)…
This criterion [Cross Entropy Loss] expects a class index in the range [0, C-1] as the target for each value of a 1D tensor of size minibatch.
Okay, no offense PyTorch, but that’s shite. I’m not sure it’s even English. Let me translate:

- The input to the loss function is the raw, unnormalized output of your final linear layer, shaped (minibatch, C), where C is the number of classes. In our case, C = 50.
- The target is not one-hot encoded. Instead, it is a 1D tensor of class indices, one integer in the range [0, C-1] for each prediction.

What this means for you is that you will have to shape your training data in two different ways. Inputs x will be one-hot encoded, but your targets y must be label encoded. Further, the one-hot columns of x should be indexed in line with the label encoding of y.
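A tiny shape demonstration of the criterion itself (the scores and indices below are random placeholders):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
scores = torch.randn(4, 50)            # (minibatch, C): raw linear-layer output
targets = torch.tensor([3, 0, 49, 7])  # class indices in [0, C-1], NOT one-hot
loss = criterion(scores, targets)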
import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import OneHotEncoder

# A simplified example: assume 26 unique characters
alphabet = list('abcdefghijklmnopqrstuvwxyz')

# two sample sequences, inputs and targets
x = np.array(list('abc'))   # inputs
y = np.array(list('xyz'))   # targets

# define one-hot encoder and label encoder
# (OneHotEncoder expects a 2D array, hence the reshape)
onehot_encoder = OneHotEncoder(sparse=False).fit(np.array(alphabet).reshape(-1, 1))
label_encoder = {ch: i for i, ch in enumerate(alphabet)}

# Use Cross Entropy Loss for the classification problem
criterion = nn.CrossEntropyLoss()

# Transform inputs and targets
x = onehot_encoder.transform(x.reshape(-1, 1))   # one-hot encoded, shape (3, 26)
y = torch.tensor([label_encoder[ch] for ch in y])  # label encoded, shape (3,)
# Define architecture:
input_size = 50     # representing the one-hot encoded vector size
hidden_size = 100   # number of hidden nodes in the LSTM layer
n_layers = 2        # number of LSTM layers
output_size = 50    # output of 50 scores for the next character

lstm = nn.LSTM(input_size, hidden_size, n_layers, batch_first=True)
linear = nn.Linear(hidden_size, output_size)

# feed forward
hs = None                        # initial hidden state
x = get_batches(data)            # -> input x:   (batch_size, seq_length, num_features)
x, hs = lstm(x, hs)              # -> LSTM out:  (batch_size, seq_length, hidden_size)
x = x.reshape(-1, hidden_size)   # -> Linear in: (batch_size * seq_length, hidden_size)
x = linear(x)                    # -> Linear out: (batch_size * seq_length, out_size)

# calculate loss
loss = criterion(x, y)           # x: raw scores, y: label-encoded targets
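From there, a standard training step backpropagates the loss and updates the weights. This sketch assumes the Adam optimizer, which is my choice for illustration, not something specified above:

import torch.optim as optim

params = list(lstm.parameters()) + list(linear.parameters())
optimizer = optim.Adam(params, lr=0.001)

optimizer.zero_grad()   # clear gradients from the previous step
loss.backward()         # backpropagate through the linear and LSTM layers
optimizer.step()        # update the weights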
LSTMs can be complex in their implementation. Most of this complexity can be eliminated by understanding the individual needs of the problem you are trying to solve, and then shaping your data accordingly.
If you’d like to take a look at the full, working Jupyter Notebooks for the two examples above, please visit them on my GitHub.
I hope this article has helped in your understanding of the flow of data through an LSTM!
Resources:
https://towardsdatascience.com/lstms-in-pytorch-528b0440244
https://towardsdatascience.com/pytorch-lstms-for-time-series-data-cd16190929d7