## Overview

## Introduction

## Introduction to Transfer Learning

## Let’s dive into the code

## Loading Data

## Model Building

## Model Training

## Model Validation

## Test with your own image

## What are Pre-trained Models and how to Pick the Right Pre-trained Model?

### ImageNet vs. MNIST

## Case Study: Emergency vs Non-Emergency Vehicle Classification

## Solving the Challenge using Convolutional Neural Networks (CNNs)

## Solving the Challenge using Transfer Learning

## End Notes

- The art of transfer learning could transform the way you build machine learning and deep learning models
- Learn how transfer learning works using PyTorch and how it ties into using pre-trained models
- We’ll work on a real-world dataset and compare the performance of a model built using convolutional neural networks (CNNs) versus one built using transfer learning

I was working on a computer vision project last year where we had to build a robust face detection model. The concept behind that is fairly straightforward – it’s the execution part that always sticks in my mind.

Given the size of the dataset we had, building a model from scratch was a real challenge. It was going to be time-consuming and a strain on the computational resources we had. We had to find a solution quickly because we were working with a tight deadline.

This is when the powerful concept of transfer learning came to our rescue. It is a really helpful tool to have in your data scientist armory, especially when you’re working with limited time and computational power.

So in this article, we will learn all about transfer learning and how to leverage it on a real-world project using Python. We’ll also discuss the role of pre-trained models in this space and how they’ll change the way you build machine learning pipelines.

Transfer Learning is a technique where a model trained for a certain task is used for another similar task. In deep learning, there are two major transfer learning approaches:

1. Fine-tuning: A pre-trained model is loaded and all of its weights are trained further on the new data. Starting from learned weights instead of a random initialization usually makes training faster and more stable.

2. Feature Extraction: As in fine-tuning, a pre-trained model is loaded, but the weights of all layers except, say, the last one are frozen, and only the unfrozen layers are trained.

In both approaches, the output layer is modified according to our needs. And we may add or delete layers depending on different factors.
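The difference between the two approaches comes down to which parameters receive gradient updates. The sketch below illustrates this on a tiny stand-in network (the `make_model` and `count_trainable` helpers are hypothetical, and no real pre-trained model is involved) by counting the trainable parameters in each case.

```python
import torch.nn as nn

def make_model():
    # Hypothetical stand-in for a pre-trained network: a small "backbone"
    # followed by a 10-class output layer.
    return nn.Sequential(
        nn.Linear(8, 16),
        nn.ReLU(),
        nn.Linear(16, 10),
    )

def count_trainable(model):
    # Number of scalar parameters that will receive gradient updates.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Fine-tuning: replace the output layer and keep every weight trainable.
fine_tuned = make_model()
fine_tuned[2] = nn.Linear(16, 2)

# Feature extraction: freeze everything, then replace the output layer.
# The freshly created layer is trainable by default.
extractor = make_model()
for param in extractor.parameters():
    param.requires_grad = False
extractor[2] = nn.Linear(16, 2)

print(count_trainable(fine_tuned))  # backbone + new output layer
print(count_trainable(extractor))   # new output layer only
```

In the feature-extraction case only the weights and biases of the new output layer remain trainable, which is why it is cheaper to train.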

Let me illustrate the concept of transfer learning using an example. Picture this – you want to learn a topic from a domain you’re completely new to. Pick any domain and any topic – you can think of deep learning and neural networks as well.

What are the different approaches you would take to understand the topic? Off the top of my head:

- Search online for resources
- Read articles and blogs
- Refer to books
- Look out for video tutorials, and so on

All of these will help you get comfortable with the topic. In this situation, you are the only person who is putting in all the effort.

But there’s another approach, which might yield better results in a short amount of time.

You can consult a domain/topic expert who has a solid grasp of the topic you want to learn. This person will transfer their knowledge to you, thus expediting your learning process. *The first approach, where you are putting in all the effort alone, is an example of learning from scratch. The second approach is referred to as transfer learning. There is a knowledge transfer happening from an expert in that domain to a person who is new to it.*

Yes, the idea behind transfer learning is that straightforward!

Neural Networks and Convolutional Neural Networks (CNNs) are examples of learning from scratch. Both these networks extract features from a given set of images (in the case of an image-related task) and then classify the images into their respective classes based on these extracted features.

This is where transfer learning and pre-trained models are so useful. Let’s understand a bit about the latter concept in the next section.

Let’s build a Dog vs Cat classifier using a pre-trained resnet34. You can download the dataset from here.

We will start with importing the necessary packages.

```
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import torchvision
from torchvision import datasets, models, transforms
```

We will use **torchvision** and **torch.utils.data** packages for loading the data.

```
# note: this rebinds the name `transforms` from the module to the composed pipeline
transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    # ImageNet channel means and standard deviations
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

train_set = datasets.ImageFolder("data/train", transforms)
# assumption: the validation images live in data/val, laid out like data/train
val_set = datasets.ImageFolder("data/val", transforms)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=4,
                                           shuffle=True, num_workers=4)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=4,
                                         shuffle=False, num_workers=4)

classes = train_set.classes
device = torch.device("cuda:0" if torch.cuda.is_available()
                      else "cpu")
```

The above code is the same for both approaches.
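`ImageFolder` expects one sub-directory per class and assigns class indices in sorted order of the folder names. The sketch below mimics that mapping using only the standard library; the folder names `cats` and `dogs` are assumptions about how the downloaded data is organized, and `class_to_idx` here is a hypothetical helper, not the torchvision attribute.

```python
import tempfile
from pathlib import Path

def class_to_idx(root):
    # Sub-directory names, sorted alphabetically, become class indices.
    names = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(names)}

with tempfile.TemporaryDirectory() as root:
    # Create the class folders in "wrong" order to show that sorting decides.
    for name in ["dogs", "cats"]:
        (Path(root) / name).mkdir()
    mapping = class_to_idx(root)

print(mapping)
```

This ordering is what determines which output index of the model corresponds to which class.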

First, let’s import the pre-trained resnet34. For fine-tuning, that is all we need to do, whereas for feature extraction we also need to freeze the weights.

**Fine-tuning**

```
model = models.resnet34(pretrained=True)
```

**Feature Extraction**

```
model = models.resnet34(pretrained=True)
for param in model.parameters():
    param.requires_grad = False
```

From here on, the code for both approaches is the same.

In ResNet34, the last layer is a fully-connected layer with 1000 neurons. Since we are doing binary classification we will alter the final layer to have two neurons.

```
num_ftrs = model.fc.in_features
# a newly created layer is trainable by default, even if the rest is frozen
model.fc = nn.Linear(num_ftrs, 2)
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```
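In the feature-extraction case only the new layer needs updating, so the optimizer can also be given just the parameters that still require gradients. This is a minimal sketch on a small stand-in model (not the ResNet34 above), with made-up inputs and labels, that checks the frozen layer stays untouched after one optimization step.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
# Stand-in for a pre-trained network with a 1000-class output layer.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1000))

# Freeze everything, then swap in a new 2-class output layer.
for param in model.parameters():
    param.requires_grad = False
model[2] = nn.Linear(model[2].in_features, 2)

# Hand the optimizer only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=0.001, momentum=0.9)

frozen_before = model[0].weight.clone()
head_before = model[2].bias.clone()

# One training step on made-up data.
inputs = torch.randn(4, 4)
labels = torch.tensor([0, 1, 0, 1])
loss = nn.CrossEntropyLoss()(model(inputs), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

frozen_unchanged = torch.equal(model[0].weight, frozen_before)
head_changed = not torch.equal(model[2].bias, head_before)
print(frozen_unchanged, head_changed)
```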

Training and Validation will be the same as we do normally in PyTorch.

```
for epoch in range(25):
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        inputs = inputs.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # loss summed over all batches of this epoch
    print(running_loss)
print('Finished Training')
```

Now our Transfer learned model is ready, let’s validate our model over the validation set.

```
class_correct = [0.0 for _ in range(2)]
class_total = [0.0 for _ in range(2)]
model.eval()  # switch BatchNorm layers to inference behaviour
with torch.no_grad():
    for i, data in enumerate(val_loader, 0):
        inputs, labels = data
        inputs = inputs.to(device)
        labels = labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        c = (predicted == labels)
        # use the actual batch size: the last batch may hold fewer than 4 images
        for j in range(labels.size(0)):
            label = labels[j].item()
            class_correct[label] += c[j].item()
            class_total[label] += 1

for i in range(2):
    print('Accuracy of %5s : %2d %%' % (
        classes[i], 100 * class_correct[i] / class_total[i]))
```
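The per-class bookkeeping in the validation loop can be seen in isolation in the following sketch, which uses plain Python lists and made-up labels and predictions. The helper name `per_class_accuracy` is hypothetical.

```python
def per_class_accuracy(labels, predictions, num_classes):
    correct = [0] * num_classes
    total = [0] * num_classes
    for label, pred in zip(labels, predictions):
        total[label] += 1
        if label == pred:
            correct[label] += 1
    # Guard against classes that never appear in the labels.
    return [c / t if t else float("nan") for c, t in zip(correct, total)]

# Made-up ground truth and model predictions for a two-class problem.
labels      = [0, 0, 0, 1, 1, 1, 1]
predictions = [0, 1, 0, 1, 1, 0, 1]
acc = per_class_accuracy(labels, predictions, num_classes=2)
print(acc)
```

Reporting accuracy per class makes it easier to spot a model that does well on one class and poorly on the other.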

```
from PIL import Image

model.eval()
img_name = "1.jpeg"  # change this to the name of your image file.

def predict_image(image_path, model):
    # convert to RGB so grayscale or RGBA files also yield 3 channels
    image = Image.open(image_path).convert("RGB")
    image_tensor = transforms(image)
    image_tensor = image_tensor.unsqueeze(0)  # add the batch dimension
    image_tensor = image_tensor.to(device)
    with torch.no_grad():
        output = model(image_tensor)
    index = output.argmax().item()
    if index == 0:
        return "Cat"
    elif index == 1:
        return "Dog"
    else:
        return

predict_image(img_name, model)
```
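The function above returns only the predicted class. If a confidence score is also wanted, the raw outputs (logits) can be passed through a softmax. The sketch below shows the computation in plain Python on made-up logits; in PyTorch, `torch.softmax(output, dim=1)` performs the same calculation on a batch.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5]            # made-up scores for "Cat" and "Dog"
probs = softmax(logits)
index = probs.index(max(probs))  # same class the argmax of the logits picks
print(index, probs)
```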

So, that’s how you do transfer learning in PyTorch, I hope you enjoyed it. If you’ve made it this far and found any errors in any of the above or can think of any ways to make it clearer for future readers, don’t hesitate to drop a comment. Thanks!

With this technique, learning can be faster and more accurate, and can need less training data. The **size of the dataset** and its **similarity to the original dataset** (the one on which the network was initially trained) are the two key factors to consider before applying transfer learning. There are four scenarios:

- **Small** dataset and **similar** to the original: train only the (last) fully connected layer
- **Small** dataset and **different** from the original: train only the fully connected layers
- **Large** dataset and **similar** to the original: freeze the earlier layers (simple features) and train the rest of the layers
- **Large** dataset and **different** from the original: reuse the network architecture and retrain the whole model, using the trained weights as a starting point
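The four scenarios above amount to a small lookup table. As a sketch, with the hypothetical helper `transfer_strategy` and the labels `"small"`/`"large"` and `"similar"`/`"different"` chosen purely for illustration:

```python
def transfer_strategy(dataset_size, similarity):
    """Map the four scenarios to a training strategy.

    dataset_size: "small" or "large"; similarity: "similar" or "different".
    """
    strategies = {
        ("small", "similar"): "train only the last fully connected layer",
        ("small", "different"): "train only the fully connected layers",
        ("large", "similar"): "freeze the earlier layers and train the rest",
        ("large", "different"): "retrain the whole network, starting from the pre-trained weights",
    }
    return strategies[(dataset_size, similarity)]

print(transfer_strategy("small", "similar"))
print(transfer_strategy("large", "different"))
```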

Pre-trained models are super useful in any deep learning project that you’ll work on. Not all of us have the unlimited computational power of the top tech behemoths. We need to make do with our local machines so pre-trained models are a blessing there. *A pre-trained model, as you might have surmised already, is a model already designed and trained by a certain person or team to solve a specific problem.*

Recall that we learn the weights and biases while training models like Neural networks and CNNs. These weights and biases, when multiplied with the image pixels, help to generate features.

Pre-trained models share their learning by passing their weights and biases matrix to a new model. So, whenever we do transfer learning, we will first select the right pre-trained model and then pass its weight and bias matrix to the new model.
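In PyTorch, passing the weights and biases to a new model typically means copying one model's `state_dict` into another model with the same architecture. Here is a minimal sketch with two small stand-in networks (the `make_net` helper is hypothetical, and neither network is actually pre-trained):

```python
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

torch.manual_seed(0)
source = make_net()   # plays the role of the pre-trained model
target = make_net()   # new model with a different random initialization

# Copy every weight and bias tensor from the source into the target.
target.load_state_dict(source.state_dict())

# After the copy, both networks compute the same function.
x = torch.randn(5, 4)
same_output = torch.allclose(source(x), target(x))
print(same_output)
```

Calls such as `models.vgg16_bn(pretrained=True)` do this loading step internally, using weights stored in a downloaded file.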

There are many pre-trained models available out there. We need to decide which one is best suited to our problem. For now, let’s consider that we have three pre-trained networks available – BERT, ULMFiT, and VGG16.

Our task is to classify the images (as we have been doing in the previous articles of this series). So, which of these pre-trained models will you pick? Let me first give you a quick overview of these pre-trained networks which will help us to decide on the right pre-trained model.

BERT and ULMFiT are used for language modeling and VGG16 is used for image classification tasks. And if you look at the problem at hand, it is an image classification one. So it stands to reason that we will pick VGG16.

Now, VGG16 can have different weights, i.e. VGG16 trained on ImageNet or VGG16 trained on MNIST:

Now, to decide on the right pre-trained model for our problem, we should explore these ImageNet and MNIST datasets. The ImageNet dataset consists of 1000 classes and a total of 1.2 million images. Some of the classes in this data are animals, cars, shops, dogs, food, instruments, etc.:

MNIST, on the other hand, consists of images of handwritten digits. It includes 10 classes, the digits 0 to 9:

We will be working on a project where we need to classify images into emergency and non-emergency vehicles (we will discuss this in more detail in the next section). This dataset includes images of vehicles so a VGG16 model trained on the ImageNet dataset would be more useful for us as it has images of vehicles.

This, in a nutshell, is how we should decide on the right pre-trained model based on our problem.

Ideally, we would be using the Identify the Apparels problem for this article. Unfortunately, this isn’t possible here because VGG16 requires that the images should be of the shape (224,224,3) (the images in the other problem are of shape (28,28)). One way to combat this could have been to resize these (28,28) images to (224,224,3) but this will not make sense intuitively.

Here’s the good part – we’ll be working on a brand new project! Here, we aim to classify the vehicles as emergency or non-emergency.

Let’s now start with understanding the problem and visualizing a few examples. *You can download the images using this link*. First, import the required libraries:

```
# importing the libraries
import pandas as pd
import numpy as np
from tqdm import tqdm
# for reading and displaying images
from skimage.io import imread
from skimage.transform import resize
import matplotlib.pyplot as plt
%matplotlib inline
# for creating validation set
from sklearn.model_selection import train_test_split
# for evaluating the model
from sklearn.metrics import accuracy_score
# PyTorch libraries and modules
import torch
from torch.autograd import Variable
from torch.nn import Linear, ReLU, CrossEntropyLoss, Sequential, Conv2d, MaxPool2d, Module, Softmax, BatchNorm2d, Dropout
from torch.optim import Adam, SGD
# torchvision for pre-trained models
from torchvision import models
```

Next, we will read the .csv file containing the image name and the corresponding label:

```
# loading dataset
train = pd.read_csv('emergency_train.csv')
train.head()
```

There are two columns in the .csv file:

- **image_names:** The name of each image in the dataset
- **emergency_or_not:** Whether that particular image belongs to the emergency or non-emergency class. 0 means that the image is a non-emergency vehicle and 1 represents an emergency vehicle

Next, we will load all the images and store them in an array format:

```
# loading training images
train_img = []
for img_name in tqdm(train['image_names']):
    # defining the image path
    image_path = 'images/' + img_name
    # reading the image
    img = imread(image_path)
    # normalizing the pixel values
    img = img/255
    # resizing the image to (224,224,3)
    img = resize(img, output_shape=(224,224,3), mode='constant', anti_aliasing=True)
    # converting the type of pixel to float 32
    img = img.astype('float32')
    # appending the image into the list
    train_img.append(img)
# converting the list to numpy array
train_x = np.array(train_img)
train_x.shape
```

It took approximately 12 seconds to load these images. There are 1,646 images in our dataset and we have reshaped all of them to (224,224,3) since VGG16 requires all the images in this particular shape. Let’s now visualize a few images from the dataset:

```
# Exploring the data
index = 10
plt.imshow(train_x[index])
if (train['emergency_or_not'][index] == 1):
    print('It is an Emergency vehicle')
else:
    print('It is a Non-Emergency vehicle')
```

This is a police car and hence has a label of Emergency vehicle. Now we will store the target in a separate variable:

```
# defining the target
train_y = train['emergency_or_not'].values
```

Let’s create a validation set to evaluate our model:

```
# create validation set
train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, test_size = 0.1, random_state = 13, stratify=train_y)
(train_x.shape, train_y.shape), (val_x.shape, val_y.shape)
```

We have 1,481 images in the training set and the remaining 165 images in the validation set. We now have to convert the dataset into torch format:

```
# converting training images into torch format
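# move the channel axis first: (N, 224, 224, 3) -> (N, 3, 224, 224)
# (a plain reshape would keep the element order and scramble the pixels)
train_x = np.ascontiguousarray(train_x.transpose(0, 3, 1, 2))
train_x = torch.from_numpy(train_x)
# converting the target into torch format
train_y = train_y.astype(int)
train_y = torch.from_numpy(train_y)
# shape of training data
train_x.shape, train_y.shape
```

Similarly, we will convert the validation set:

```
# converting validation images into torch format
# move the channel axis first, exactly as for the training images
val_x = np.ascontiguousarray(val_x.transpose(0, 3, 1, 2))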
val_x = torch.from_numpy(val_x)
# converting the target into torch format
val_y = val_y.astype(int)
val_y = torch.from_numpy(val_y)
# shape of validation data
val_x.shape, val_y.shape
```
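PyTorch convolution layers expect images in channels-first layout, (N, C, H, W), whereas the arrays loaded above are channels-last, (N, H, W, C). The small NumPy sketch below, on a made-up 2×2 image, shows how moving the channel axis differs from reshaping to the same target shape.

```python
import numpy as np

# One tiny "image": height 2, width 2, 3 channels (values 0..11).
images = np.arange(12).reshape(1, 2, 2, 3)

moved = images.transpose(0, 3, 1, 2)    # axes reordered to (N, C, H, W)
reshaped = images.reshape(1, 3, 2, 2)   # same shape, element order unchanged

# Channel 0 of the pixel grid in the original layout.
first_channel = images[0, :, :, 0]
print(moved.shape, reshaped.shape)
print(np.array_equal(moved[0, 0], first_channel))
print(np.array_equal(reshaped[0, 0], first_channel))
```

Both results have the shape (1, 3, 2, 2), but only the transposed array keeps each channel's pixels together.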

Our data is ready! In the next section, we will build a Convolutional Neural Network (CNN) before we use the pre-trained model to solve this problem.

We are finally at the model-building part! Before using transfer learning to solve the problem, let’s use a CNN model and set a benchmark for ourselves.

We will build a very simple CNN architecture with two convolutional layers to extract features from images and a dense layer at the end to classify these features:

```
class Net(Module):
    def __init__(self):
        super(Net, self).__init__()
        self.cnn_layers = Sequential(
            # Defining a 2D convolution layer
            Conv2d(3, 4, kernel_size=3, stride=1, padding=1),
            BatchNorm2d(4),
            ReLU(inplace=True),
            MaxPool2d(kernel_size=2, stride=2),
            # Defining another 2D convolution layer
            Conv2d(4, 8, kernel_size=3, stride=1, padding=1),
            BatchNorm2d(8),
            ReLU(inplace=True),
            MaxPool2d(kernel_size=2, stride=2),
        )
        self.linear_layers = Sequential(
            Linear(8 * 56 * 56, 2)
        )

    # Defining the forward pass
    def forward(self, x):
        x = self.cnn_layers(x)
        # flatten the feature maps to (batch_size, 8 * 56 * 56)
        x = x.view(x.size(0), -1)
        x = self.linear_layers(x)
        return x
```
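The input size of the dense layer, `8 * 56 * 56`, follows from how each layer changes the spatial size of a 224×224 input. A short sketch of that arithmetic, using the standard output-size formula for convolution and pooling layers (the `conv_out` helper is hypothetical):

```python
def conv_out(size, kernel, stride, padding):
    # Standard output-size formula for convolution and pooling layers.
    return (size + 2 * padding - kernel) // stride + 1

size = 224
size = conv_out(size, kernel=3, stride=1, padding=1)   # first Conv2d keeps 224
size = conv_out(size, kernel=2, stride=2, padding=0)   # first MaxPool2d halves it
size = conv_out(size, kernel=3, stride=1, padding=1)   # second Conv2d keeps the size
size = conv_out(size, kernel=2, stride=2, padding=0)   # second MaxPool2d halves it

channels = 8   # output channels of the second convolution
flat_features = channels * size * size
print(size, flat_features)
```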

Let’s now define the optimizer, learning rate, and the loss function for our model and use a GPU to train the model:

```
# defining the model
model = Net()
# defining the optimizer
optimizer = Adam(model.parameters(), lr=0.0001)
# defining the loss function
criterion = CrossEntropyLoss()
# checking if GPU is available
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()
print(model)
```

This is what the architecture of the model looks like. Finally, we will train the model for 15 epochs. I am setting the *batch_size* of the model to 128 (you can play around with this):

```
# batch size of the model
batch_size = 128
# number of epochs to train the model
n_epochs = 15
for epoch in range(1, n_epochs+1):
    # shuffle the sample order for this epoch
    permutation = torch.randperm(train_x.size()[0])
    # keep track of the training loss of each batch
    training_loss = []
    for i in tqdm(range(0,train_x.size()[0], batch_size)):
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = train_x[indices], train_y[indices]
        if torch.cuda.is_available():
            batch_x, batch_y = batch_x.cuda(), batch_y.cuda()
        optimizer.zero_grad()
        # forward pass
        outputs = model(batch_x)
        loss = criterion(outputs,batch_y)
        training_loss.append(loss.item())
        # backward pass and parameter update
        loss.backward()
        optimizer.step()
    training_loss = np.average(training_loss)
    print('epoch: \t', epoch, '\t training loss: \t', training_loss)
```

This will print a summary of the training as well. The training loss is decreasing after each epoch and that’s a good sign. Let’s check the training as well as the validation accuracy:

```
# prediction for training set
prediction = []
target = []
model.eval()  # use the BatchNorm running statistics during evaluation
permutation = torch.randperm(train_x.size()[0])
for i in tqdm(range(0,train_x.size()[0], batch_size)):
    indices = permutation[i:i+batch_size]
    batch_x, batch_y = train_x[indices], train_y[indices]
    if torch.cuda.is_available():
        batch_x = batch_x.cuda()
    with torch.no_grad():
        output = model(batch_x)
    # the predicted class is the index of the largest output score
    predictions = np.argmax(output.cpu().numpy(), axis=1)
    prediction.append(predictions)
    # labels stay on the CPU for scikit-learn
    target.append(batch_y.numpy())
# training accuracy (mean of the per-batch accuracies)
accuracy = []
for i in range(len(prediction)):
    accuracy.append(accuracy_score(target[i],prediction[i]))
print('training accuracy: \t', np.average(accuracy))
```

We got a training accuracy of around 82% which is a good score. Let’s now check the validation accuracy:

```
# prediction for validation set
prediction_val = []
target_val = []
model.eval()  # use the BatchNorm running statistics during evaluation
permutation = torch.randperm(val_x.size()[0])
for i in tqdm(range(0,val_x.size()[0], batch_size)):
    indices = permutation[i:i+batch_size]
    batch_x, batch_y = val_x[indices], val_y[indices]
    if torch.cuda.is_available():
        batch_x = batch_x.cuda()
    with torch.no_grad():
        output = model(batch_x)
    # the predicted class is the index of the largest output score
    predictions = np.argmax(output.cpu().numpy(), axis=1)
    prediction_val.append(predictions)
    # labels stay on the CPU for scikit-learn
    target_val.append(batch_y.numpy())
# validation accuracy (mean of the per-batch accuracies)
accuracy_val = []
for i in range(len(prediction_val)):
    accuracy_val.append(accuracy_score(target_val[i],prediction_val[i]))
print('validation accuracy: \t', np.average(accuracy_val))
```
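One caveat about the accuracy computation used here: it averages the per-batch accuracies, so a smaller final batch carries the same weight as a full one. The sketch below, on made-up counts for two batches of 128 and 37 images, shows how this can differ from the overall fraction of correct predictions.

```python
# (correct predictions, batch size) for two batches; the counts are made up.
batches = [(96, 128), (37, 37)]

# Mean of the per-batch accuracies, as computed in the code above.
per_batch = [correct / size for correct, size in batches]
mean_of_batches = sum(per_batch) / len(per_batch)

# Overall accuracy: all correct predictions over all samples.
total_correct = sum(correct for correct, _ in batches)
total_seen = sum(size for _, size in batches)
overall = total_correct / total_seen

print(round(mean_of_batches, 4), round(overall, 4))
```

With equally sized batches the two numbers coincide; with a short last batch they can drift apart.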

The validation accuracy comes out to be 76%. Now that we have a benchmark with us, it’s time to use transfer learning to solve this emergency versus non-emergency vehicle classification problem. Let’s get rolling!

I’ve touched on this above and I’ll reiterate it here – we will be using the VGG16 pre-trained model trained on the ImageNet dataset. Let’s look at the steps we will be following to train the model using transfer learning:

- First, we will load the weights of the pre-trained model – VGG16 in our case
- Then we will fine-tune the model as per the problem at hand
- Next, we will use these pre-trained weights and extract features for our images
- Finally, we will train the fine-tuned model using the extracted features

So, let’s start by loading the weights of the model:

```
# loading the pretrained model
model = models.vgg16_bn(pretrained=True)
```

We will now fine tune the model. We will not be training the layers of the VGG16 model and hence let’s freeze the weights of these layers:

```
# Freeze model weights
for param in model.parameters():
    param.requires_grad = False
```

Since we only have 2 classes to predict and VGG16 is trained on ImageNet which has 1000 classes, we need to update the final layer as per our problem:

```
# Add on classifier
model.classifier[6] = Sequential(
    Linear(4096, 2))
for param in model.classifier[6].parameters():
    param.requires_grad = True
```

Since we will be training only the last layer, I have set the *requires_grad* as True for the last layer. Let’s set the training to GPU:

```
# checking if GPU is available
if torch.cuda.is_available():
    model = model.cuda()
```

We’ll now use the model and extract features for both the training and validation images. I will set the *batch_size* as 128 (again, you can increase or decrease this *batch_size* per your requirement):

```
# batch_size
batch_size = 128
# extracting features for train data
data_x = []
label_x = []
inputs,labels = train_x, train_y
# use the BatchNorm running statistics of the frozen feature layers
model.features.eval()
for i in tqdm(range(int(train_x.shape[0]/batch_size)+1)):
    input_data = inputs[i*batch_size:(i+1)*batch_size]
    label_data = labels[i*batch_size:(i+1)*batch_size]
    if torch.cuda.is_available():
        input_data = input_data.cuda()
    # no gradients are needed for the frozen layers
    with torch.no_grad():
        x = model.features(input_data)
    data_x.extend(x.cpu().numpy())
    label_x.extend(label_data.numpy())
```

Similarly, let’s extract features for our validation images:

```
# extracting features for validation data
data_y = []
label_y = []
inputs,labels = val_x, val_y
for i in tqdm(range(int(val_x.shape[0]/batch_size)+1)):
    input_data = inputs[i*batch_size:(i+1)*batch_size]
    label_data = labels[i*batch_size:(i+1)*batch_size]
    if torch.cuda.is_available():
        input_data = input_data.cuda()
    # no gradients are needed for the frozen layers
    with torch.no_grad():
        x = model.features(input_data)
    data_y.extend(x.cpu().numpy())
    label_y.extend(label_data.numpy())
```

Next, we will convert these data into torch format:

```
# converting the features into torch format
x_train = torch.from_numpy(np.array(data_x))
x_train = x_train.view(x_train.size(0), -1)
y_train = torch.from_numpy(np.array(label_x))
x_val = torch.from_numpy(np.array(data_y))
x_val = x_val.view(x_val.size(0), -1)
y_val = torch.from_numpy(np.array(label_y))
```
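As a sanity check on the flattening step: the convolutional part of VGG16 contains five 2×2 max-pooling stages and ends with 512 channels, so a 224×224 input yields a 512×7×7 feature map per image. The arithmetic can be sketched as:

```python
input_size = 224
num_pool_stages = 5      # each 2x2 max-pool halves height and width
final_channels = 512     # channels in the last convolutional block of VGG16

spatial = input_size // (2 ** num_pool_stages)
flat_features = final_channels * spatial * spatial
print(spatial, flat_features)
```

The flattened length matches the input size of the first fully connected layer in the VGG16 classifier, which is why the extracted features can be fed to `model.classifier` directly.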

We also have to define the optimizer and the loss function for our model:

```
import torch.optim as optim
# specify loss function (categorical cross-entropy)
criterion = CrossEntropyLoss()
# specify optimizer (Adam, updating only the new last layer) and learning rate
optimizer = optim.Adam(model.classifier[6].parameters(), lr=0.0005)
```

It’s time to train the model. We will train it for 30 epochs with a batch_size set to 128:

```
# batch size
batch_size = 128
# number of epochs to train the model
n_epochs = 30
for epoch in tqdm(range(1, n_epochs+1)):
    # shuffle the sample order for this epoch
    permutation = torch.randperm(x_train.size()[0])
    # keep track of the training loss of each batch
    training_loss = []
    for i in range(0,x_train.size()[0], batch_size):
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = x_train[indices], y_train[indices]
        if torch.cuda.is_available():
            batch_x, batch_y = batch_x.cuda(), batch_y.cuda()
        optimizer.zero_grad()
        # forward pass through the classifier only, on the extracted features
        outputs = model.classifier(batch_x)
        loss = criterion(outputs,batch_y)
        training_loss.append(loss.item())
        # backward pass and parameter update
        loss.backward()
        optimizer.step()
    training_loss = np.average(training_loss)
    print('epoch: \t', epoch, '\t training loss: \t', training_loss)
```

Here is a summary of the model. **You can see that the loss has decreased and hence we can say that the model is improving.** Let’s validate this by looking at the training and validation accuracies:

```
# prediction for training set
prediction = []
target = []
# disable dropout in the classifier during evaluation
model.classifier.eval()
permutation = torch.randperm(x_train.size()[0])
for i in tqdm(range(0,x_train.size()[0], batch_size)):
    indices = permutation[i:i+batch_size]
    batch_x, batch_y = x_train[indices], y_train[indices]
    if torch.cuda.is_available():
        batch_x = batch_x.cuda()
    with torch.no_grad():
        output = model.classifier(batch_x)
    # the predicted class is the index of the largest output score
    predictions = np.argmax(output.cpu().numpy(), axis=1)
    prediction.append(predictions)
    # labels stay on the CPU for scikit-learn
    target.append(batch_y.numpy())
# training accuracy (mean of the per-batch accuracies)
accuracy = []
for i in range(len(prediction)):
    accuracy.append(accuracy_score(target[i],prediction[i]))
print('training accuracy: \t', np.average(accuracy))
```

We got an accuracy of ~ 84% on the training set. Let’s now check the validation accuracy:

```
# prediction for validation set
prediction_val = []
target_val = []
# disable dropout in the classifier during evaluation
model.classifier.eval()
permutation = torch.randperm(x_val.size()[0])
for i in tqdm(range(0,x_val.size()[0], batch_size)):
    indices = permutation[i:i+batch_size]
    batch_x, batch_y = x_val[indices], y_val[indices]
    if torch.cuda.is_available():
        batch_x = batch_x.cuda()
    with torch.no_grad():
        output = model.classifier(batch_x)
    # the predicted class is the index of the largest output score
    predictions = np.argmax(output.cpu().numpy(), axis=1)
    prediction_val.append(predictions)
    # labels stay on the CPU for scikit-learn
    target_val.append(batch_y.numpy())
# validation accuracy (mean of the per-batch accuracies)
accuracy_val = []
for i in range(len(prediction_val)):
    accuracy_val.append(accuracy_score(target_val[i],prediction_val[i]))
print('validation accuracy: \t', np.average(accuracy_val))
```

The validation accuracy of the model is similar, i.e., around 83%. **The training and validation accuracies are almost in sync and hence we can say that the model generalizes well.** Here is the summary of our results:

| Model | Training Accuracy | Validation Accuracy |
|---|---|---|
| CNN | 81.57% | 76.26% |
| VGG16 | 83.70% | 83.47% |

We can infer that the accuracies have improved by using the VGG16 pre-trained model as compared to the CNN model. Got to love the art of transfer learning!
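To put numbers on that comparison, the gap between training and validation accuracy and the gain in validation accuracy can be computed directly from the reported figures. The `results` dictionary below simply restates the table.

```python
# Accuracies (in percent) as reported in the results table.
results = {
    "CNN":   {"train": 81.57, "val": 76.26},
    "VGG16": {"train": 83.70, "val": 83.47},
}

# A smaller train-validation gap suggests less overfitting.
gaps = {name: acc["train"] - acc["val"] for name, acc in results.items()}
for name, gap in gaps.items():
    print(f"{name}: train-validation gap = {gap:.2f} points")

val_gain = results["VGG16"]["val"] - results["CNN"]["val"]
print(f"validation gain from transfer learning = {val_gain:.2f} points")
```

The much smaller gap for VGG16 is what backs the statement that the transfer-learning model generalizes better on this dataset.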

In this article, we learned how to use pre-trained models and transfer learning to solve an image classification problem. We first understood what pre-trained models are and how to choose the right pre-trained model depending on the problem at hand. Then we took a case study of classifying images of vehicles as emergency or non-emergency. We solved this case study using a CNN model first and then we used the VGG16 pre-trained model to solve the same problem.

We found that using the VGG16 pre-trained model significantly improved the model performance and we got better results as compared to the CNN model. I hope you now have a clear understanding of how to use transfer learning and the right pre-trained model to solve problems using PyTorch.

I encourage you to take other image classification problems and try to apply transfer learning to solve them. This will help you to grasp the concept much more clearly.

