## The Idea Behind RNNs

## Where Vanilla RNNs Fail

## Pain Points of LSTMs in PyTorch

## Understanding Data Flow: LSTM Layer

## Understanding Data Flow: Fully Connected Layer

## Understanding Data Flow: Examples

## Example 1a: Regression Network Architecture

## Example 1b: Shaping Data Between Layers

## Example 2a: Classification Network Architecture

## Example 2b: Shaping Data Between Layers

## Example 2c: Training Challenges

## Final Thoughts

Recurrent neural networks in general maintain state information about data previously passed through the network. This is true of both vanilla RNNs and LSTMs. This “hidden state”, as it is called, is passed back into the network along with each new element of a sequence of data points. Therefore, each output of the network is a function not only of the input variables but also of the hidden state, which serves as “memory” of what the network has seen in the past.
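To make the idea of a hidden state concrete, here is a minimal sketch of a vanilla RNN stepping through one sequence by hand. The sizes and the random input are assumptions chosen purely for illustration; `nn.RNNCell` is PyTorch's single-step vanilla RNN.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
input_size, hidden_size, seq_length = 1, 4, 5

cell = nn.RNNCell(input_size, hidden_size)          # one vanilla RNN step
sequence = torch.randn(seq_length, 1, input_size)   # (seq_length, batch=1, features)

h = torch.zeros(1, hidden_size)  # initial hidden state: no "memory" yet
outputs = []
for x_t in sequence:
    # each step sees the new element AND the hidden state from the previous step
    h = cell(x_t, h)
    outputs.append(h)

outputs = torch.stack(outputs)   # (seq_length, batch, hidden_size)
print(outputs.shape)
```

The loop is what `nn.RNN` and `nn.LSTM` do internally when you hand them a whole sequence at once.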

It helps to understand the gap that LSTMs fill in the abilities of traditional RNNs. Vanilla RNNs suffer from rapid **gradient vanishing** or **gradient explosion**. Roughly speaking, when the chain rule is applied to the equation that governs “memory” within the network, an exponential term is produced. If certain conditions are met, that exponential term may grow very large or disappear very rapidly. LSTMs do not suffer (as badly) from this problem of vanishing gradients and are therefore able to maintain longer “memory”, making them well suited to learning from sequential data.
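The exponential term can be illustrated with a deliberately simplified toy: pretend each step back in time multiplies the gradient by the same scalar factor. Real networks multiply by matrices rather than a single number, so this is only a sketch of the intuition, and the factors 0.9 and 1.1 are assumptions picked for illustration.

```python
def gradient_scale(factor, n_steps):
    """Size of a gradient after being multiplied by `factor` once per time step."""
    return factor ** n_steps

for factor in (0.9, 1.1):
    scales = [gradient_scale(factor, n) for n in (1, 10, 100)]
    print(factor, scales)
```

With these factors, 100 steps shrink the gradient to roughly 0.00003 or blow it up to roughly 13,780, even though each individual step changes it by only 10%.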

Now, you likely already knew the back story behind LSTMs. You are here because you are having trouble taking your conceptual knowledge and turning it into working code.

A quick search of the PyTorch user forums will yield dozens of questions on how to define an LSTM’s architecture, how to shape the data as it moves from layer to layer, and what to do with the data when it comes out the other end. Many of those questions have no answers, and many more are answered at a level that is difficult to understand by the beginners who are asking them.

Suffice it to say, understanding data flow through an LSTM is the number one pain point I have encountered in practice. And it seems like I’m not alone.

Perhaps the single most difficult concept to grasp when learning LSTMs after other types of networks is how the data flows through the layers of the model. It’s not magic, but it may seem so. Pictures may help:

- An LSTM layer is comprised of a set of *M* hidden nodes. This value *M* is assigned by the user when the model object is instantiated. Much like traditional neural networks, while guidelines exist, it is a somewhat arbitrary choice.
- When a single sequence **S** of length *N* is passed into the network, each individual element *s_i* of the sequence **S** is passed through every hidden node.
- Each hidden node gives a single output for each input it sees. This results in overall output from the hidden layer of shape **(*N*, *M*)**.
- If mini-batches of **B** sequences are fed to the network, there is an additional dimension added, resulting in an output of shape **(B, *N*, *M*)**.
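The shapes described in the list above can be checked directly. This is a minimal sketch: the values of `B`, `N`, `M` and `n_features` are illustrative assumptions, and `batch_first=True` is used so that the batch dimension comes first.

```python
import torch
import torch.nn as nn

B, N, M = 3, 7, 100   # batch size, sequence length, hidden nodes (illustrative values)
n_features = 1        # variables per sequence element (assumed)

lstm = nn.LSTM(input_size=n_features, hidden_size=M, batch_first=True)

single = torch.randn(1, N, n_features)   # one sequence, kept as a batch of 1
batch = torch.randn(B, N, n_features)    # mini-batch of B sequences

out_single, _ = lstm(single)
out_batch, (h_n, c_n) = lstm(batch)

print(out_single.shape)  # torch.Size([1, 7, 100])
print(out_batch.shape)   # torch.Size([3, 7, 100])
```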

After an LSTM layer (or set of LSTM layers), we typically add a fully connected layer to the network for final output via the `nn.Linear()` class.

- The input size for the final `nn.Linear()` layer will always be equal to the number of hidden nodes in the LSTM layer that precedes it (for a standard, unidirectional LSTM).
- The output of this final fully connected layer will depend on the form of the targets and/or loss function you are using.
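As a small sketch of the two points above, the linear layer's input size can be read straight off the LSTM it follows. The sizes and the random input are assumptions for illustration.

```python
import torch
import torch.nn as nn

hidden_size, out_size = 100, 1   # illustrative values
lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)

# in_features of the linear layer matches the LSTM's number of hidden nodes
linear = nn.Linear(lstm.hidden_size, out_size)

lstm_out, _ = lstm(torch.randn(4, 7, 1))        # (4, 7, 100)
y = linear(lstm_out.reshape(-1, hidden_size))   # (4 * 7, 1)
print(linear.in_features, y.shape)
```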

We will go over 2 examples of defining network architecture and passing inputs through the network:

- Regression
- Classification

Consider some time-series data, perhaps stock prices. Given the past 7 days' worth of prices for a particular stock, we wish to predict the 8th day’s price. In this case, we wish our output to be a single value. We will evaluate the accuracy of this single value using MSE, so for both prediction and performance evaluation, we need a single-valued output from the seven-day input. Therefore, we would define our network architecture as something like this:

```
import torch.nn as nn

input_size = 1 # The number of variables in your sequence data.
n_hidden = 100 # The number of hidden nodes in the LSTM layer.
n_layers = 2 # The total number of LSTM layers to stack.
out_size = 1 # The size of the output you desire from your RNN.
# batch_first=True makes the LSTM expect (batch_size, seq_length, num_features)
lstm = nn.LSTM(input_size, n_hidden, n_layers, batch_first=True)
linear = nn.Linear(n_hidden, out_size)
```

We can pin down some specifics of how this machine works.

- **The input** to the LSTM layer must be of shape `(batch_size, sequence_length, number_features)` when the layer is created with `batch_first=True`, where `batch_size` refers to the number of sequences per batch and `number_features` is the number of variables in your time series.
- **The output** of your LSTM layer will be shaped like `(batch_size, sequence_length, hidden_size)`. Take another look at the flow chart I created above.
- **The input** of our fully connected `nn.Linear()` layer requires an input size corresponding to the number of hidden nodes in the preceding LSTM layer. Therefore we reshape our data into the form `(batches, n_hidden)`.

**Important note:** `batches` is not the same as `batch_size` in the sense that they are not the same number: here `batches` equals `batch_size * sequence_length`. However, the idea is the same in that we are dividing up the output of the LSTM layer into `batches` pieces, where each piece is of size `n_hidden`, the number of hidden LSTM nodes.
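To see what that reshape does to the rows, here is a minimal sketch with tiny, made-up sizes. The tensor filled by `torch.arange` is only a stand-in for a real LSTM output so that each row is easy to recognise.

```python
import torch

batch_size, seq_length, n_hidden = 2, 3, 4   # tiny illustrative sizes

# stand-in for a real LSTM output of shape (batch_size, seq_length, n_hidden)
lstm_out = torch.arange(batch_size * seq_length * n_hidden, dtype=torch.float32)
lstm_out = lstm_out.reshape(batch_size, seq_length, n_hidden)

linear_in = lstm_out.reshape(-1, n_hidden)
batches = linear_in.shape[0]
print(batches)  # 6, i.e. batch_size * seq_length

# row (b * seq_length + t) of linear_in is time step t of sequence b
b, t = 1, 2
print(torch.equal(linear_in[b * seq_length + t], lstm_out[b, t]))  # True
```

Nothing is mixed or averaged by the reshape: every time step of every sequence simply becomes its own row.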

Here is some code that simulates passing input data `x` through the entire network, following the protocol above:

```
import torch
import torch.nn as nn

input_size = 1 # The number of variables in your sequence data.
n_hidden = 100 # The number of hidden nodes in the LSTM layer.
n_layers = 2 # The total number of LSTM layers to stack.
out_size = 1 # The size of the output you desire from your RNN.
lstm = nn.LSTM(input_size, n_hidden, n_layers, batch_first=True)
linear = nn.Linear(n_hidden, out_size)
# Data Flow Protocol:
# 1. network input shape: (batch_size, seq_length, num_features)
# 2. LSTM output shape: (batch_size, seq_length, n_hidden)
# 3. Linear input shape: (batch_size * seq_length, n_hidden)
# 4. Linear output: (batch_size * seq_length, out_size)
x = torch.randn(32, 7, input_size) # random stand-in for get_batches(data)
hs = None # no previous hidden state yet, so PyTorch starts from zeros
lstm_out, hs = lstm(x, hs)
linear_in = lstm_out.reshape(-1, n_hidden)
linear_out = linear(linear_in)
```

Recall that `out_size = 1` because we only wish to know a single value, and that single value will be evaluated using MSE as the metric.
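Note that the reshape above yields one value per time step, so a 7-day window produces 7 outputs. If you want exactly one prediction per sequence, one common option is to keep only the LSTM output at the last time step. The sketch below shows that variation; it is an assumption about how you might proceed, not necessarily what the accompanying notebooks do, and the random tensors are stand-ins for real price windows and targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, n_hidden, n_layers, out_size = 1, 100, 2, 1
batch_size, seq_length = 16, 7   # 16 windows of 7 days each (illustrative)

lstm = nn.LSTM(input_size, n_hidden, n_layers, batch_first=True)
linear = nn.Linear(n_hidden, out_size)
criterion = nn.MSELoss()

x = torch.randn(batch_size, seq_length, input_size)   # stand-in for price windows
target = torch.randn(batch_size, out_size)             # stand-in for day-8 prices

lstm_out, _ = lstm(x)                 # (batch_size, seq_length, n_hidden)
last_step = lstm_out[:, -1, :]        # (batch_size, n_hidden): state after day 7
prediction = linear(last_step)        # (batch_size, out_size)
loss = criterion(prediction, target)  # scalar MSE
print(prediction.shape, loss.item())
```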

In this example, we want to generate some text. A model is trained on a large body of text, perhaps a book, and then fed a sequence of characters. The model will look at each character and predict which character should come next. This time our problem is one of classification rather than regression, and we must alter our architecture accordingly. I created this diagram to sketch the general idea:

Perhaps our model has trained on a text of millions of words made up of 50 unique characters. What this means is that when our network gets a single character, we wish to know which of the 50 characters comes next. Therefore our network output for a single character will be 50 scores, one for each of the 50 possible next characters; a softmax turns those scores into probabilities.
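As a minimal sketch of how such an output might be read, the snippet below converts 50 scores into probabilities and picks the most likely next character. The random `scores` tensor is a stand-in for a real network output.

```python
import torch

torch.manual_seed(0)
n_chars = 50

scores = torch.randn(n_chars)               # stand-in for the network's output for one character
probs = torch.softmax(scores, dim=0)        # 50 non-negative values that sum to 1
next_char_index = int(torch.argmax(probs))  # index of the most likely next character

print(probs.sum().item(), next_char_index)
```

Softmax preserves the ordering of the scores, so the highest score and the highest probability always point at the same character.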

Additionally, we will one-hot encode each character in a string of text, meaning the number of variables (`input_size = 50`) is no longer one as it was before, but rather is the size of the one-hot encoded character vectors.

```
import torch.nn as nn

input_size = 50 # representing the one-hot encoded vector size
hidden_size = 100 # number of hidden nodes in the LSTM layer
n_layers = 2 # number of LSTM layers
output_size = 50 # output of 50 scores for the next character
lstm = nn.LSTM(input_size, hidden_size, n_layers, batch_first=True)
linear = nn.Linear(hidden_size, output_size)
```

As far as shaping the data between layers, there isn’t much difference. The logic is identical:

```
import torch
import torch.nn as nn

input_size = 50 # representing the one-hot encoded vector size
hidden_size = 100 # number of hidden nodes in the LSTM layer
n_layers = 2 # number of LSTM layers
output_size = 50 # output of 50 scores for the next character
lstm = nn.LSTM(input_size, hidden_size, n_layers, batch_first=True)
linear = nn.Linear(hidden_size, output_size)
# Data Flow Protocol
# 1. network input shape: (batch_size, seq_length, num_features)
# 2. LSTM output shape: (batch_size, seq_length, hidden_size)
# 3. Linear input shape: (batch_size * seq_length, hidden_size)
# 4. Linear output: (batch_size * seq_length, output_size)
x = torch.randn(32, 10, input_size) # random stand-in for get_batches(data)
hs = None # no previous hidden state yet, so PyTorch starts from zeros
x, hs = lstm(x, hs)
x = x.reshape(-1, hidden_size)
x = linear(x)
```

However, this scenario presents a unique challenge. Because we are dealing with categorical predictions, we will likely want to use **cross-entropy loss** to train our model. In this case, it is **so important** to know your loss function’s requirements. For example, take a look at PyTorch’s `nn.CrossEntropyLoss()` input requirements (emphasis mine, because let’s be honest some documentation needs help):

**The input** is expected to contain raw, unnormalized scores for each class. **The input** has to be a Tensor of size either (minibatch, C)…

This criterion [Cross Entropy Loss] expects a class index in the range [0, C-1] as **the target** for each value of a **1D tensor** of size minibatch.

Okay, no offense PyTorch, but that’s shite. I’m not sure it’s even English. Let me translate:

- The prediction (called **the input** above, even though there are two inputs) should be of shape **(minibatch, C)** where **C** is the number of possible classes. In our example **C = 50**.
- The target, which is the second input, should be a 1D tensor of size **(minibatch)**, with one class index per prediction. In other words, the target **should not** be one-hot encoded. However, it **should be** label encoded.
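Those two requirements can be sketched in a few lines. The random tensors here are stand-ins for real predictions and targets, and `minibatch = 8` is an arbitrary illustrative size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
minibatch, C = 8, 50
criterion = nn.CrossEntropyLoss()

prediction = torch.randn(minibatch, C)       # raw scores, no softmax applied
target = torch.randint(0, C, (minibatch,))   # 1D tensor of class indices (int64)

loss = criterion(prediction, target)
print(prediction.shape, target.shape, loss.item())

# if your targets are already one-hot encoded, argmax recovers the labels
one_hot_target = nn.functional.one_hot(target, num_classes=C)
labels = one_hot_target.argmax(dim=1)
print(torch.equal(labels, target))  # True
```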

What this means for you is that you will have to shape your training data in two different ways. Inputs `x` will be one-hot encoded but your targets `y` must be label encoded. Further, the one-hot columns of `x` should be indexed in line with the label encoding of `y`.

```
import string
import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import OneHotEncoder

# Assume 26 unique characters
alphabet = list(string.ascii_lowercase)
# one sample sequence and its targets
x = np.array(list('abc')) # inputs
y = np.array(list('xyz')) # targets
# define one-hot encoder and label encoder (both index characters in alphabet order)
onehot_encoder = OneHotEncoder().fit(np.array(alphabet).reshape(-1, 1))
label_encoder = {ch: i for i, ch in enumerate(alphabet)}
# Use Cross Entropy Loss for classification problem
criterion = nn.CrossEntropyLoss()
# Transform input and targets
x = onehot_encoder.transform(x.reshape(-1, 1)).toarray() # (seq_length, num_features)
x = torch.tensor(x, dtype=torch.float32).unsqueeze(0) # (batch_size=1, seq_length, num_features)
y = torch.tensor([label_encoder[ch] for ch in y]) # (batch_size * seq_length,)
# Define architecture:
input_size = len(alphabet) # one-hot vector size (26 here, 50 in the text above)
hidden_size = 100 # number of hidden nodes in the LSTM layer
n_layers = 2 # number of LSTM layers
output_size = len(alphabet) # one score per possible next character
lstm = nn.LSTM(input_size, hidden_size, n_layers, batch_first=True)
linear = nn.Linear(hidden_size, output_size)
# feed forward (the toy x above stands in for get_batches(data))
hs = None # no previous hidden state yet, so PyTorch starts from zeros
x, hs = lstm(x, hs) # -> LSTM out: (batch_size, seq_length, hidden_size)
x = x.reshape(-1, hidden_size) # -> Linear in: (batch_size * seq_length, hidden_size)
x = linear(x) # -> Linear out: (batch_size * seq_length, output_size)
# calculate loss
loss = criterion(x, y)
```

LSTMs can be complex in their implementation. Most of this complexity can be eliminated by understanding the individual needs of the problem you are trying to solve, and then shaping your data accordingly.

If you’d like to take a look at the full, working Jupyter Notebooks for the two examples above, please visit them on my GitHub:

I hope this article has helped in your understanding of the flow of data through an LSTM!
