
Minimal PyTorch LSTM example for regression and classification tasks


The Idea Behind RNNs

Recurrent neural networks in general maintain state information about data previously passed through the network. This is true of both vanilla RNNs and LSTMs. This “hidden state”, as it is called, is passed back into the network along with each new element of a sequence of data points. Therefore, each output of the network is a function not only of the input variables but also of the hidden state, which serves as a “memory” of what the network has seen in the past.
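To make the recurrence concrete, here is a toy sketch of that loop using PyTorch's nn.RNNCell (the sizes are arbitrary, chosen only for illustration):

import torch
import torch.nn as nn

# A single recurrent cell: each new input is combined with the previous hidden state.
cell = nn.RNNCell(input_size=1, hidden_size=8)

h = torch.zeros(1, 8)              # the initial "memory" is all zeros
for x_t in torch.randn(5, 1, 1):   # a toy sequence of 5 one-feature inputs
    h = cell(x_t, h)               # each output depends on the input AND the hidden state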

Where Vanilla RNNs Fail

It helps to understand the gap that LSTMs fill in the abilities of traditional RNNs. Vanilla RNNs suffer from rapidly vanishing or exploding gradients. Roughly speaking, when the chain rule is applied to the equation that governs “memory” within the network, an exponential term is produced. Depending on the conditions, that exponential term may grow very large or shrink to nothing very rapidly. LSTMs do not suffer (as badly) from this problem of vanishing gradients and are therefore able to maintain a longer “memory”, making them well suited to learning temporal data.
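A back-of-the-envelope illustration of that exponential term: if the factor the gradient is repeatedly multiplied by sits even slightly below or above 1, a hundred time steps is enough to make it vanish or explode:

# Toy illustration of the exponential term in the recurrent gradient.
factor = 0.9
print(factor ** 100)   # ~2.7e-05: the signal from early time steps all but vanishes

factor = 1.1
print(factor ** 100)   # ~13780.6: or it blows up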

Pain Points of LSTMs in PyTorch

Now, you likely already knew the back story behind LSTMs. You are here because you are having trouble taking your conceptual knowledge and turning it into working code.

A quick search of the PyTorch user forums will yield dozens of questions on how to define an LSTM’s architecture, how to shape the data as it moves from layer to layer, and what to do with the data when it comes out the other end. Many of those questions have no answers, and many more are answered at a level that is difficult to understand by the beginners who are asking them.

Suffice it to say, understanding data flow through an LSTM is the number one pain point I have encountered in practice. And it seems like I’m not alone.

Understanding Data Flow: LSTM Layer

Perhaps the single most difficult concept to grasp when learning LSTMs after other types of networks is how the data flows through the layers of the model. It’s not magic, but it may seem so. Pictures may help:

Diagram by Wesley Neill.
  1. An LSTM layer is composed of a set of M hidden nodes. This value M is assigned by the user when the model object is instantiated. Much like traditional neural networks, while guidelines exist, it is a somewhat arbitrary choice.
  2. When a single sequence S of length N is passed into the network, each individual element s_i of the sequence S is passed through every hidden node.
  3. Each hidden node gives a single output for each input it sees. This results in an overall output from the hidden layer of shape (N, M).
  4. If mini-batches of B sequences are fed to the network, an additional dimension is added, resulting in an output of shape (B, N, M). A quick shape check follows below.
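Here is that shape check, using hypothetical sizes (B = 4 sequences, N = 7 steps, M = 100 hidden nodes):

import torch
import torch.nn as nn

B, N, M = 4, 7, 100   # batch size, sequence length, hidden nodes (arbitrary)
lstm = nn.LSTM(input_size=1, hidden_size=M, batch_first=True)

x = torch.randn(B, N, 1)   # B sequences of length N, one feature each
out, (h, c) = lstm(x)
print(out.shape)           # torch.Size([4, 7, 100]), i.e. (B, N, M)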

Understanding Data Flow: Fully Connected Layer

After an LSTM layer (or set of LSTM layers), we typically add a fully connected layer to the network for final output via the nn.Linear() class.

  1. The input size for the final nn.Linear() layer will always be equal to the number of hidden nodes in the LSTM layer that precedes it.
  2. The output of this final fully connected layer will depend on the form of the targets and/or loss function you are using.

Understanding Data Flow: Examples

We will go over 2 examples of defining network architecture and passing inputs through the network:

  1. Regression
  2. Classification

Example 1a: Regression Network Architecture

Consider some time-series data, perhaps stock prices. Given the past 7 days' worth of stock prices for a particular product, we wish to predict the 8th day's price. In this case, we wish our output to be a single value. We will evaluate the accuracy of this single value using MSE, so for both prediction and for performance evaluation, we need a single-valued output from the seven-day input. Therefore, we would define our network architecture as something like this:

import torch.nn as nn

input_size = 1    # The number of variables in your sequence data.
n_hidden   = 100  # The number of hidden nodes in the LSTM layer.
n_layers   = 2    # The total number of stacked LSTM layers.
out_size   = 1    # The size of the output you desire from your RNN.

lstm   = nn.LSTM(input_size, n_hidden, n_layers, batch_first=True)
linear = nn.Linear(n_hidden, out_size)

Example 1b: Shaping Data Between Layers

We can pin down some specifics of how this machine works.

  1. The input to the LSTM layer must be of shape (batch_size, sequence_length, number_features), where batch_size refers to the number of sequences per batch and number_features is the number of variables in your time series.
  2. The output of your LSTM layer will be shaped like (batch_size, sequence_length, hidden_size). Take another look at the flow chart I created above.
  3. The input of our fully connected nn.Linear() layer requires an input size corresponding to the number of hidden nodes in the preceding LSTM layer. Therefore we must reshape our data into the form (batches, n_hidden).

Important note: batches is not the same number as batch_size. Because the fully connected layer sees one hidden vector per time step, we flatten the LSTM output into batches = batch_size * seq_length rows, where each row is of size n_hidden, the number of hidden LSTM nodes.
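A quick numeric check of that note, with made-up sizes:

import torch

batch_size, seq_length, n_hidden = 4, 7, 100   # made-up sizes
lstm_out = torch.randn(batch_size, seq_length, n_hidden)

flat = lstm_out.reshape(-1, n_hidden)
print(flat.shape)   # torch.Size([28, 100]): batches = batch_size * seq_length = 28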

Here is some code that simulates passing input data x through the entire network, following the protocol above:

input_size = 1    # The number of variables in your sequence data. 
n_hidden   = 100  # The number of hidden nodes in the LSTM layer.
n_layers   = 2    # The total number of LSTM layers to stack.
out_size   = 1    # The size of the output you desire from your RNN.

lstm   = nn.LSTM(input_size, n_hidden, n_layers, batch_first=True)
linear = nn.Linear(n_hidden, out_size)

# Data Flow Protocol:
# 1. network input shape: (batch_size, seq_length, num_features)
# 2. LSTM output shape: (batch_size, seq_length, hidden_size)
# 3. Linear input shape:  (batch_size * seq_length, hidden_size)
# 4. Linear output: (batch_size * seq_length, out_size)

hs = None                                   # initial hidden state; None defaults to zeros
x = get_batches(data)                       # get_batches is your own batching helper
lstm_out, hs = lstm(x, hs)
linear_in = lstm_out.reshape(-1, n_hidden)
linear_out = linear(linear_in)

Recall that out_size = 1 because we only wish to know a single value, and that single value will be evaluated using MSE as the metric.
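For completeness, here is a self-contained version of the same forward pass, with random tensors standing in for get_batches and real prices (the batch size of 32 is an arbitrary choice):

import torch
import torch.nn as nn

input_size, n_hidden, n_layers, out_size = 1, 100, 2, 1
lstm      = nn.LSTM(input_size, n_hidden, n_layers, batch_first=True)
linear    = nn.Linear(n_hidden, out_size)
criterion = nn.MSELoss()

x = torch.randn(32, 7, input_size)   # 32 sequences of 7 daily prices (random stand-ins)
y = torch.randn(32 * 7, out_size)    # one target per time step, flattened

lstm_out, hs = lstm(x)               # the hidden state defaults to zeros when omitted
pred = linear(lstm_out.reshape(-1, n_hidden))
loss = criterion(pred, y)
print(pred.shape, loss.item())       # torch.Size([224, 1]) and a scalar loss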

Example 2a: Classification Network Architecture

In this example, we want to generate some text. A model is trained on a large body of text, perhaps a book, and then fed a sequence of characters. The model will look at each character and predict which character should come next. This time our problem is one of classification rather than regression, and we must alter our architecture accordingly. I created this diagram to sketch the general idea:

Diagram by Wesley Neill.

Perhaps our model has trained on a text of millions of words made up of 50 unique characters. What this means is that when our network sees a single character, we wish to know which of the 50 characters comes next. Therefore our network output for a single character will be 50 raw scores, one for each of the 50 possible next characters (a softmax would turn these into probabilities).

Additionally, we will one-hot encode each character in a string of text, meaning the number of variables (input_size = 50) is no longer one as it was before, but rather is the size of the one-hot encoded character vectors.
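As a tiny illustration, one-hot encoding a single character out of a 50-character vocabulary looks like this (the index 7 is an arbitrary stand-in for some character's position):

import torch

vocab_size = 50
char_index = 7                    # hypothetical index of some character
one_hot = torch.zeros(vocab_size)
one_hot[char_index] = 1.0         # a 50-dim vector with a single 1 at the character's index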

input_size  = 50  # representing the one-hot encoded vector size
hidden_size = 100 # number of hidden nodes in the LSTM layer
n_layers    = 2   # number of LSTM layers
output_size = 50  # output of 50 scores for the next character

lstm   = nn.LSTM(input_size, hidden_size, n_layers, batch_first=True)
linear = nn.Linear(hidden_size, output_size)

Example 2b: Shaping Data Between Layers

As far as shaping the data between layers, there isn’t much difference. The logic is identical:

input_size  = 50  # representing the one-hot encoded vector size
hidden_size = 100 # number of hidden nodes in the LSTM layer
n_layers    = 2   # number of LSTM layers
output_size = 50  # output of 50 scores for the next character

lstm   = nn.LSTM(input_size, hidden_size, n_layers, batch_first=True)
linear = nn.Linear(hidden_size, output_size)

# Data Flow Protocol
# 1. network input shape: (batch_size, seq_length, num_features)
# 2. LSTM output shape: (batch_size, seq_length, hidden_size)
# 3. Linear input shape:  (batch_size * seq_length, hidden_size)
# 4. Linear output: (batch_size * seq_length, output_size)

hs = None                      # initial hidden state; None defaults to zeros
x = get_batches(data)          # get_batches is your own batching helper
x, hs = lstm(x, hs)
x = x.reshape(-1, hidden_size)
x = linear(x)

Example 2c: Training Challenges

However, this scenario presents a unique challenge. Because we are dealing with categorical predictions, we will likely want to use cross-entropy loss to train our model. In cases like this, it is especially important to know your loss function's requirements. For example, take a look at PyTorch's nn.CrossEntropyLoss() input requirements (emphasis mine, because let's be honest some documentation needs help):

The input is expected to contain raw, unnormalized scores for each class. The input has to be a Tensor of size either (minibatch, C)…

This criterion [Cross Entropy Loss] expects a class index in the range [0, C-1] as the target for each value of a 1D tensor of size minibatch.

Okay, no offense PyTorch, but that’s shite. I’m not sure it’s even English. Let me translate:

  1. The prediction (called the input above, even though there are two inputs) should be of shape (minibatch, C), where C is the number of possible classes. In our example C = 50.
  2. The target, which is the second input, should be a 1D tensor of size (minibatch,) containing class indices. In other words, the target should not be one-hot encoded; it should be label encoded. A minimal shape check follows this list.
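Here is that shape check, using random scores:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits  = torch.randn(8, 50)           # (minibatch, C): raw, unnormalized scores
targets = torch.randint(0, 50, (8,))   # 1D tensor of class indices in [0, C-1]
loss = criterion(logits, targets)      # shapes satisfy the requirements above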

What this means for you is that you will have to shape your training data in two different ways. Inputs x will be one-hot encoded but your targets y must be label encoded. Further, the one-hot columns of x should be indexed in line with the label encoding of y.

import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import OneHotEncoder

# Assume 26 unique characters
alphabet = list('abcdefghijklmnopqrstuvwxyz')

# two sample sequences, inputs and targets
x = np.array(list('abc')) # inputs
y = np.array(list('xyz')) # targets

# define one-hot encoder and label encoder
# (sklearn's OneHotEncoder expects a 2D array, hence the reshape)
onehot_encoder = OneHotEncoder(sparse=False).fit(np.array(alphabet).reshape(-1, 1))
label_encoder  = {ch: i for i, ch in enumerate(alphabet)}

# Use Cross Entropy Loss for classification problem
criterion = nn.CrossEntropyLoss()

# Transform inputs and targets
x = onehot_encoder.transform(x.reshape(-1, 1))    # (3, 26) one-hot rows
y = torch.tensor([label_encoder[ch] for ch in y]) # (3,) label-encoded class indices

# Define architecture (sizes now match the 26-character alphabet):
input_size  = len(alphabet) # 26, the one-hot encoded vector size
hidden_size = 100           # number of hidden nodes in the LSTM layer
n_layers    = 2             # number of LSTM layers
output_size = len(alphabet) # one score per possible next character

lstm   = nn.LSTM(input_size, hidden_size, n_layers, batch_first=True)
linear = nn.Linear(hidden_size, output_size)

# feed forward
x = torch.tensor(x, dtype=torch.float32).unsqueeze(0) # -> input x:    (batch_size, seq_length, num_features) = (1, 3, 26)
x, hs = lstm(x)                                       # -> LSTM out:   (batch_size, seq_length, hidden_size)
x = x.reshape(-1, hidden_size)                        # -> Linear in:  (batch_size * seq_length, hidden_size)
x = linear(x)                                         # -> Linear out: (batch_size * seq_length, output_size)

# calculate loss
loss = criterion(x, y)
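From here, a standard training step would backpropagate and update both modules; a sketch, assuming the Adam optimizer (any optimizer would do):

# One optimization step over the parameters of both modules.
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(linear.parameters()), lr=1e-3)
optimizer.zero_grad()
loss.backward()
optimizer.step()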

Final Thoughts

LSTMs can be complex in their implementation. Most of this complexity can be eliminated by understanding the individual needs of the problem you are trying to solve, and then shaping your data accordingly.

If you’d like to take a look at the full, working Jupyter Notebooks for the two examples above, please visit them on my GitHub:

  1. Regression Example
  2. Classification Example

I hope this article has helped in your understanding of the flow of data through an LSTM!

Resources:

https://towardsdatascience.com/lstms-in-pytorch-528b0440244

https://towardsdatascience.com/pytorch-lstms-for-time-series-data-cd16190929d7
