Did you know that the first neural networks date back to the early 1950s?
Deep Learning (DL) and Neural Networks (NN) are currently driving some of the most ingenious inventions of this century. Their incredible ability to learn from data and the environment makes them the first choice of machine learning scientists.
Deep learning and neural networks lie at the heart of products such as self-driving cars, image recognition software and recommender systems. Being a powerful family of algorithms, they also adapt well to various data types.
People think neural networks are an extremely difficult topic to learn. As a result, some avoid them altogether, and many who do use them treat them as a black box. Is there any point in doing something without knowing how it is done? NO!
In this article, I’ve attempted to explain the concept of a neural network in simple words. Understanding this article requires a little bit of biology and lots of patience. By the end of this article, you should be a confident analyst ready to start working with neural networks. In case you don’t understand something, I’m always available in the comments section.
Note: This article is best suited for intermediate users in data science & machine learning. Beginners might find it challenging.
A Neural Network (NN), also called an Artificial Neural Network (ANN), is named after its artificial representation of the working of the human nervous system. Remember this diagram? Most of us were taught it in high school!
Flashback recap: Let’s start by understanding how our nervous system works. The nervous system comprises billions of nerve cells, or neurons. A neuron has the following structure:
The major components are:
In simple terms, each neuron takes input from numerous other neurons through the dendrites. It then performs the required processing on the input and sends another electrical pulse through the axon into the terminal nodes, from where it is transmitted to numerous other neurons.
An ANN works in a very similar fashion. The general structure of a neural network looks like:
This figure depicts a typical neural network with working of a single neuron explained separately. Let’s understand this.
The inputs to each neuron are like the dendrites. Just like in the human nervous system, a neuron (artificial though!) collates all the inputs and performs an operation on them. Lastly, it transmits the output to all the neurons (of the next layer) to which it is connected. A neural network is divided into layers of 3 types:
Let’s start by looking into the functionality of each neuron with examples.
In this section, we will explore the working of a single neuron with easy examples. The idea is to give you some intuition on how a neuron computes outputs using the inputs. A typical neuron looks like:
The different components are:
Here f is known as an activation function. It makes a neural network extremely flexible and gives it the capability to estimate complex non-linear relationships in data. It can be a Gaussian function, a logistic function, a hyperbolic tangent or even a linear function in simple cases.
Let’s implement 3 fundamental functions (OR, AND and NOT) using neural networks. This will help us understand how they work. You can think of these as classification problems where we predict the output (0 or 1) for different combinations of inputs.
We will model these like linear classifiers with the following activation function:
The AND function can be implemented as:
The output of this neuron is:
The truth table for this implementation is:
Here we can see that the AND function is successfully implemented. Column ‘a’ agrees with ‘X1 AND X2’. Note that the bias unit weight here is -1.5, but that is not a fixed value. Intuitively, it can be anything that makes the total value positive only when both x1 and x2 are 1. So any value between -2 and -1 would work.
The OR function can be implemented as:
The output of this neuron is:
The truth table for this implementation is:
Column ‘a’ complies with ‘X1 OR X2’. We can see that, just by changing the bias unit weight, we can implement an OR function. This is very similar to the one above. Intuitively, you can understand that here, the bias unit is such that the weighted sum will be positive if any of x1 or x2 becomes positive.
Just like the previous cases, the NOT function can be implemented as:
The output of this neuron is:
The truth table for this implementation is:
Again, the compliance with desired value proves functionality. I hope with these examples, you’re getting some intuition into how a neuron inside a Neural Network works. Here I have used a very simple activation function.
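The three neurons above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the diagrams are not reproduced here, so apart from the AND bias weight of -1.5 quoted in the text, the weights below (input weights of 1, an OR bias of -0.5, a NOT bias of 0.5 with input weight -1) and the threshold activation `step` are my assumptions, and the helper names are hypothetical.

```python
def step(z):
    """Assumed threshold activation: 1 when the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0


def neuron(bias_weight, weights, inputs):
    """Weighted sum of the inputs plus the bias-unit weight, passed through step()."""
    z = bias_weight + sum(w * x for w, x in zip(weights, inputs))
    return step(z)


def and_gate(x1, x2):
    # Bias weight -1.5 comes from the text; input weights of 1 are assumed.
    return neuron(-1.5, [1, 1], [x1, x2])


def or_gate(x1, x2):
    # Only the bias weight changes compared to AND (assumed value -0.5).
    return neuron(-0.5, [1, 1], [x1, x2])


def not_gate(x1):
    # Assumed weights: bias 0.5, input weight -1.
    return neuron(0.5, [-1], [x1])


if __name__ == "__main__":
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, and_gate(x1, x2), or_gate(x1, x2))
    print([not_gate(x) for x in (0, 1)])
```

Changing only the bias weight is enough to turn the AND neuron into an OR neuron, which is the point made above.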
Note: Generally a logistic function is used in place of the simple function I used here, because it is differentiable and makes it possible to compute a gradient. There’s just one catch: it outputs a floating-point value rather than exactly 0 or 1.
After understanding the working of a single neuron, let’s try to understand how a neural network can model complex relations using multiple layers. To understand this further, we will take the example of an XNOR function. As a recap, the truth table of an XNOR function looks like:
Here we can see that the output is 1 when both inputs are same, otherwise 0. This sort of a relationship cannot be modeled using a single neuron. (Don’t believe me? Give it a try!) Thus we will use a multi-layer network. The idea behind using multiple layers is that complex relations can be broken into simpler functions and combined.
Let’s break down the XNOR function, writing A for X1 and B for X2 (here '+' means OR, '.' means AND and ' means NOT):
A XNOR B = NOT ( A XOR B )
= NOT [ (A+B).(A'+B') ]
= (A+B)' + (A'+B')'
= (A'.B') + (A.B)
Now we can implement it using any of the simplified cases. I will show you how to implement this using 2 cases.
Here the challenge is to design a neuron to model A’.B’ . This can be easily modeled using the following:
The output of this neuron is:
The truth table for this function is:
Now that we have modeled the individual components, we can combine them using a multi-layer network. First, let’s look at the schematic diagram of that network:
Here we can see that in layer 1, we will determine A’.B’ and A.B individually. In layer 2, we will take their output and implement an OR function on top. This would complete the entire Neural Network. The final network would look like this:
If you notice carefully, this is nothing but a combination of the different neurons which we have already drawn. The different outputs represent different units:
The functionality can be verified using the truth table:
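As a rough check of the two-layer construction, here is a minimal sketch. The network diagram is not reproduced here, so the weights are assumptions chosen to match the description: a threshold activation, an A’.B’ neuron with bias 0.5 and input weights -1, an A.B neuron with bias -1.5 and input weights 1, and an OR neuron with bias -0.5 in layer 2.

```python
def step(z):
    """Assumed threshold activation: 1 when z is non-negative, else 0."""
    return 1 if z >= 0 else 0


def xnor(a, b):
    # Layer 1: two neurons computed independently from the inputs.
    both_off = step(0.5 - a - b)    # A'.B' (assumed weights)
    both_on = step(-1.5 + a + b)    # A.B   (bias from the AND example)
    # Layer 2: an OR neuron on top of the two layer-1 outputs.
    return step(-0.5 + both_off + both_on)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xnor(a, b))
```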
I think now you can get some intuition into how multiple layers work. Let’s do another implementation of the same case.
In the above example, we had to separately calculate A’.B’. What if we want to implement the function using just the basic AND, OR and NOT functions? Consider the following schematic:
Here you can see that we had to use 3 hidden layers. The working will be similar to what we did before. The network looks like:
Here the neurons perform following actions:
Note that a neuron typically feeds into every neuron of the next layer except the bias unit. In this case, I’ve omitted a few connections from layer 1 to layer 2. This is because their weights are 0 and drawing them would make the diagram visually cumbersome.
The truth table is:
Finally, we have successfully implemented the XNOR function. This method is more complicated than case 1, so you should always prefer case 1. But the idea here is to show how complicated functions can be broken down across multiple layers. I hope the advantages of multiple layers are clearer now.
Now that we have had a look at some basic examples, let’s define a generic structure into which every neural network falls. We will also see the equations to be followed to determine the output given an input. This is known as Forward Propagation.
A generic Neural Network can be defined as:
It has L layers with 1 input layer, 1 output layer and L-2 hidden layers. Terminology:
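Since the output of each layer forms the input of the next layer, let’s define the equation that determines the output of the (i+1)th layer using the output of the ith layer as input.
The input to the (i+1)th layer is:
$A_i = [\, a_i^{(0)}, a_i^{(1)}, \ldots, a_i^{(N_i)} \,]$, with dimension $1 \times (N_i + 1)$
The weight matrix from the ith to the (i+1)th layer, where $W_{jk}^{(i)}$ is the weight from node j of layer i to node k of layer i+1, is:
$W^{(i)} = \begin{bmatrix} W_{01}^{(i)} & W_{02}^{(i)} & \cdots & W_{0N_{i+1}}^{(i)} \\ W_{11}^{(i)} & W_{12}^{(i)} & \cdots & W_{1N_{i+1}}^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ W_{N_i 1}^{(i)} & W_{N_i 2}^{(i)} & \cdots & W_{N_i N_{i+1}}^{(i)} \end{bmatrix}$, with dimension $(N_i + 1) \times N_{i+1}$
The output of the (i+1)th layer can be calculated as:
$A_{i+1} = f(A_i \cdot W^{(i)})$, with dimension $1 \times N_{i+1}$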
Using these equations for each subsequent layer, we can determine the final output. The number of neurons in the output layer will depend on the type of problem. It can be 1 for regression or binary classification problem or multiple for multi-class classification problems.
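The forward pass can be sketched with NumPy as below. This is an illustration rather than a reference implementation: the logistic activation, the random weights and the layer sizes are assumptions, and I assume that index 0 of each layer is a bias unit fixed at 1, which is prepended before each multiplication.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation (an assumed choice for f)."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, weights, activation=sigmoid):
    """Propagate one input row through the network.

    x       : 1-D sequence of input values (without the bias unit).
    weights : list of arrays; weights[i] has shape (N_i + 1, N_{i+1}),
              with row 0 holding the bias-unit weights.
    """
    a = np.asarray(x, dtype=float)
    for W in weights:
        a_with_bias = np.concatenate(([1.0], a))  # prepend the bias unit
        a = activation(a_with_bias @ W)           # A_{i+1} = f(A_i . W^(i))
    return a


# Hypothetical network: 3 inputs, one hidden layer of 4 neurons, 2 outputs.
rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2]
weights = [rng.normal(size=(n_in + 1, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
output = forward([0.5, -1.0, 2.0], weights)
print(output.shape)
```

Each weight matrix has one more row than the previous layer has neurons, which is what makes the product of a 1 x (N_i + 1) row with an (N_i + 1) x N_{i+1} matrix well defined.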
But this only determines the output for one run. The ultimate objective is to update the weights of the model in order to minimize the loss function. The weights are updated using the back-propagation algorithm, which we’ll study next.
The back-propagation (BP) algorithm works by determining the loss (or error) at the output and then propagating it back into the network. The weights are updated to minimize the error resulting from each neuron. I will not go into the details of the algorithm, but I will try to give you some intuition into how it works.
The first step in minimizing the error is to determine the gradient of the final output with respect to each node. Since it is a multi-layer network, determining the gradient is not very straightforward.
Let’s understand the gradients for multi-layer networks. Let’s take a step back from neural networks and consider a very simple system such as the following:
Here there are 3 inputs, which undergo some simple processing as follows:
Now we need to determine the gradients of the output e with respect to a, b, c and d. The following cases are very straightforward:
However, for determining the gradients for a and b, we need to apply the chain rule.
And, this way the gradient can be computed by simply multiplying the gradient of the input to a node with that of the output of that node. If you’re still confused, just read the equation carefully 5 times and you’ll get it!
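To make the chain rule concrete, here is a small sketch on a made-up system of the same shape. The original processing steps are not reproduced here, so d = a - b and e = d * c are assumptions chosen only for illustration.

```python
def forward(a, b, c):
    """Hypothetical system: d depends on a and b, and e depends on d and c."""
    d = a - b
    e = d * c
    return d, e


def gradients(a, b, c):
    """Gradients of the output e with respect to a, b, c and d."""
    d, _ = forward(a, b, c)
    de_dc = d                    # direct: e = d * c
    de_dd = c                    # direct: e = d * c
    dd_da, dd_db = 1.0, -1.0     # local gradients of d = a - b
    return {
        "a": de_dd * dd_da,      # chain rule: de/da = de/dd * dd/da
        "b": de_dd * dd_db,      # chain rule: de/db = de/dd * dd/db
        "c": de_dc,
        "d": de_dd,
    }


if __name__ == "__main__":
    print(gradients(5, 2, 4))
```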
But the actual cases are not that simple. Let’s take another example. Consider a case where a single input is fed into multiple nodes in the next layer, as is almost always the case with neural networks.
In this case, the gradients of all the other nodes will be very similar to the above example, except for ‘m’, because m is fed into 2 nodes. Here I’ll show how to determine the gradient for m; the rest you should calculate on your own.
Here you can see that the gradient is simply the summation of the two different gradients. I hope the cloud cover is slowly vanishing and things are becoming lucid. Just understand these concepts and we’ll come back to this.
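The summation over paths can be checked numerically. The system below is hypothetical (the original diagram is not reproduced here): I assume m feeds two nodes, p = m * x and q = m + y, which are then multiplied to give the output.

```python
def output(m, x, y):
    """Hypothetical system in which m feeds two nodes, p and q."""
    p = m * x
    q = m + y
    return p * q


def grad_m(m, x, y):
    """Gradient of the output with respect to m: the sum over both paths."""
    p = m * x
    q = m + y
    via_p = q * x      # (d out / d p) * (d p / d m)
    via_q = p * 1.0    # (d out / d q) * (d q / d m)
    return via_p + via_q


if __name__ == "__main__":
    h = 1e-5
    numeric = (output(2.0 + h, 3.0, 1.0) - output(2.0 - h, 3.0, 1.0)) / (2 * h)
    print(grad_m(2.0, 3.0, 1.0), numeric)
```

The finite-difference estimate agreeing with the summed value is a quick sanity check that both paths must be counted.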
Before moving forward, let’s sum up the entire process behind optimization of a neural network. The various steps involved in each iteration are:
Till now we have covered #1 – #3 and we have some intuition into #5. Now let’s work through #4 – #6. We’ll use the same generic structure of NN as described in section 4.
#4- Find the error
$e_L^{(i)} = y^{(i)} - a_L^{(i)}, \quad i = 1, 2, \ldots, N_L$
Here $y^{(i)}$ is the actual outcome from training data
#5- Back-propagating the error into the network
The error for layer L-1 should be determined first using the following:
where $i = 0, 1, 2, \ldots, N_{L-1}$ ($N_{L-1}$ being the number of nodes in the (L-1)th layer)
Intuition from the concepts discussed in former half of this section:
This process has to be repeated consecutively from the (L-1)th layer to the 2nd layer. Note that the first layer is just the inputs.
#6- Update weights to minimize the error
Use the following update rule for weights: $W_{ik}^{(l)} = W_{ik}^{(l)} + a_l^{(i)} \cdot e_{l+1}^{(k)}$
I hope the convention is clear. I suggest you go through it multiple times and if still there are questions, I’ll be happy to take them on through comments below.
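Steps #4 and #6 can be tried out on the smallest possible case: a single logistic neuron with no hidden layer, learning the OR function. This is only a sketch under assumptions. It skips step #5 entirely (there is no hidden layer to propagate into), sums the update over all training rows, and introduces a `learning_rate` factor that is not part of the rule as written above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# OR truth table; column 0 is the bias unit (always 1).
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)


def train(X, y, learning_rate=0.5, epochs=2000):
    """Repeat: forward pass, find the error (#4), update the weights (#6)."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        a = sigmoid(X @ W)                   # forward propagation
        e = y - a                            # #4: error at the output
        W = W + learning_rate * (X.T @ e)    # #6: W_i = W_i + a_i * e, summed over rows
    return W


W = train(X, y)
predictions = (sigmoid(X @ W) >= 0.5).astype(int)
print(W, predictions)
```

For a network with hidden layers, the error would first have to be propagated back layer by layer (step #5) before the same kind of update is applied to each weight matrix.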
With this we have successfully understood how a neural network works. Please feel free to discuss further if needed.
This article is focused on the fundamentals of a neural network and how it works. I hope you now understand the working of a neural network and will never use it as a black box again. It becomes really easy once you try it out in practice as well.
Therefore, in my upcoming article, I’ll explain applications of neural networks in Python. I’ll focus more on the practical aspects of neural networks than on the theory. Two applications come to my mind immediately:
I hope you enjoyed this. I would love if you could share your feedback through comments below. Looking forward to interacting with you further on this!
The power of artificial intelligence is beyond our imagination. We all know robots have already reached a testing phase in some of the powerful countries of the world. Governments and large companies are spending billions on developing this ultra-intelligent creature. The recent emergence of robots has gained the attention of many research houses across the world.
Does it excite you as well? For me personally, learning about robots and developments in AI started with deep curiosity and excitement! Let’s learn about computer vision today.
The earliest research in computer vision started way back in the 1950s. Since then, we have come a long way but still find ourselves far from the ultimate objective. But with neural networks and deep learning, we have become empowered like never before.
Applications of deep learning in vision have taken this technology to a different level and made sophisticated things like self-driving cars possible in the near future. In this article, I will also introduce you to Convolutional Neural Networks, which form the crux of deep learning applications in computer vision.
Note: This article is inspired by Stanford’s Class on Visual Recognition. Understanding this article requires prior knowledge of Neural Networks. If you are new to neural networks, you can start here. Another useful resource on basics of deep learning can be found here.
As the name suggests, the aim of computer vision (CV) is to imitate the functionality of the human eye and the brain components responsible for the sense of sight.
Actions such as recognizing an animal, describing a view or differentiating among visible objects are a cake-walk for humans. You’d be surprised to know that it took decades of research to give a computer the ability to detect an object with reasonable accuracy.
The field of computer vision has witnessed continual advancements in the past 5 years. One of the most talked-about advancements is Convolutional Neural Networks (CNNs). Today, deep CNNs form the crux of the most sophisticated computer vision applications, such as self-driving cars, auto-tagging of friends in our Facebook pictures, facial security features, gesture recognition, automatic number plate recognition, etc.
Let’s get familiar with it a bit more:
Object detection is considered to be the most basic application of computer vision. The rest of the developments in computer vision are achieved by making small enhancements on top of this. In real life, every time we (humans) open our eyes, we unconsciously detect objects.
Since it is super-intuitive for us, we fail to appreciate the key challenges involved when we try to design systems similar to our eye. Let’s start by looking at some of the key roadblocks:
These are just some of the challenges, which I brought up so that you can appreciate the complexity of the tasks which your eye and brain duo perform with such utter ease. Breaking up these challenges and solving them individually is possible today in computer vision. But we’re still decades away from a system which can get anywhere close to our human eye (which can do everything!).
This brilliance of our human body is the reason why researchers have been trying to break the enigma of computer vision by analyzing the visual mechanics of humans or other animals. Some of the earliest work in this direction was done by Hubel and Wiesel with their famous cat experiment in 1959. Read more about it here.
This was the first study which emphasized the importance of edge detection for solving the computer vision problem. They were awarded the Nobel Prize for their work.
Before diving into convolutional neural networks, let’s take a quick look at the traditional, or rather elementary, techniques used in computer vision before deep learning became popular.
Various techniques other than deep learning are available for computer vision. They work well for simpler problems, but as the data becomes huge and the task becomes complex, they are no substitute for deep CNNs. Let’s briefly discuss two simple approaches.
I hope this gives some intuition into the challenges faced by approaches other than deep learning. Please note that more sophisticated techniques can be used than the ones discussed above but they would rarely beat a deep learning model.
Let’s discuss some properties of neural networks. I will skip the basics of neural networks here as I have already covered them in my previous article – Fundamentals of Deep Learning – Starting with Neural Networks.
Once your fundamentals are sorted, let’s learn in detail some important concepts such as activation functions, data preprocessing, initializing weights and dropouts.
There are various activation functions which can be used and this is an active area of research. Let’s discuss some of the popular options:
To summarize, ReLU is mostly the activation function of choice. If the caveats are kept in mind, these can be used very efficiently.
For images, generally the following preprocessing steps are done:
Note that normalization is generally not done in images.
There can be various techniques for initializing weights. Let’s consider a few of them:
One more thing must be remembered while using ReLU as the activation function: the weight initialization might be such that some of the neurons never get activated because of negative input. This is something that should be checked. You might be surprised to know that 10-20% of the ReLUs might be dead at a particular time during training, and even at the end.
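One way to run such a check is sketched below. The helper `dead_fraction` is hypothetical and works under an assumption: a unit counts as dead if its ReLU output is zero for every example in the batch you pass in, so the estimate depends on how representative that batch is.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def dead_fraction(pre_activations):
    """Fraction of units whose ReLU output is zero for every example.

    pre_activations: array of shape (batch_size, n_units) holding the
    weighted sums fed into one ReLU layer.
    """
    outputs = relu(pre_activations)
    dead = np.all(outputs == 0.0, axis=0)   # True where a unit never fires
    return dead.mean()


# Made-up weighted sums for a batch of 2 examples and 4 units.
z = np.array([[1.0, -1.0, -2.0, 0.5],
              [-3.0, -0.5, -1.0, 2.0]])
print(dead_fraction(z))
```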
These were just some of the concepts I discussed here. Some more concepts can be of importance like batch normalization, stochastic gradient descent, dropouts which I encourage you to read on your own.
Before going into the details, let’s first try to get some intuition into why deep networks work better.
As we learned from the drawbacks of earlier approaches, they are unable to cater to the vast amount of variations in images. Deep CNNs work by consecutively modeling small pieces of information and combining them deeper in network.
One way to understand them is that the first layer will try to detect edges and form templates for edge detection. Then subsequent layers will try to combine them into simpler shapes and eventually into templates of different object positions, illumination, scales, etc. The final layers will match an input image with all the templates and the final prediction is like a weighted sum of all of them. So, deep CNNs are able to model complex variations and behaviour giving highly accurate predictions.
There is an interesting paper on visualization of deep features in CNNs which you can go through to get more intuition – Understanding Neural Networks Through Deep Visualization.
For the purpose of explaining CNNs and finally showing an example, I will be using the CIFAR-10 dataset, which you can download from here. This dataset has 60,000 images across 10 labels, with 6,000 images of each type. Each image is colored and 32×32 in size.
A CNN typically consists of 3 types of layers:
You might find local response normalization layers in some older CNNs such as AlexNet, but they are rarely used these days. We’ll consider these one by one.
Since convolution layers form the crux of the network, I’ll consider them first. Each layer can be visualized in the form of a block or a cuboid. For instance in the case of CIFAR-10 data, the input layer would have the following form:
Here you can see, this is the original image which is 32×32 in height and width. The depth here is 3 which corresponds to the Red, Green and Blue colors, which form the basis of colored images. Now a convolution layer is formed by running a filter over it. A filter is another block or cuboid of smaller height and width but same depth which is swept over this base block. Let’s consider a filter of size 5x5x3.
We start this filter from the top left corner and sweep it till the bottom right corner. This filter is nothing but a set of weights, i.e. 5x5x3 = 75 weights + 1 bias = 76 parameters. At each position, the weighted sum of the pixels is calculated as $W^T X + b$ and a new value is obtained. A single filter will result in a volume of size 28x28x1 as shown above.
Note that multiple filters are generally run at each step. Therefore, if 10 filters are used, the output would look like:
Here the filter weights are parameters which are learned during the back-propagation step. You might have noticed that we got a 28×28 block as output when the input was 32×32. Why so? Let’s look at a simpler case.
Suppose the initial image had size 6x6xd and the filter has size 3x3xd. Here I’ve kept the depth as d because it can be anything and it’s immaterial as it remains the same in both. Since depth is same, we can have a look at the front view of how filter would work:
Here we can see that the result would be a 4x4x1 volume block. Notice that there is a single output for the entire depth at each location of the filter. But you need not do this visualization all the time. Let’s define a generic case where the image has dimension NxNxd and the filter has dimension FxFxd. Also, let’s define another term, stride (S), which is the number of cells (in the above matrix) to move in each step. In the above case, we had a stride of 1, but it can be a higher value as well. So the size of the output will be:
output size = (N – F)/S + 1
You can validate the first case where N=32, F=5, S=1. The output had 28 pixels which is what we get from this formula as well. Please note that some S values might result in non-integer result and we generally don’t use such values.
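The formula and the sweep itself can be sketched with NumPy. This is a deliberately naive loop for illustration, not how any library implements convolution; the function names and the all-ones image and filter are made up, and I chose to raise an error when (N - F) is not divisible by the stride, in line with the note above.

```python
import numpy as np


def conv_output_size(n, f, stride=1):
    """Output width/height: (N - F) / S + 1."""
    if (n - f) % stride != 0:
        raise ValueError("(N - F) is not divisible by the stride")
    return (n - f) // stride + 1


def convolve_single_filter(image, filt, bias=0.0, stride=1):
    """Sweep one F x F x d filter over an N x N x d image.

    Returns a 2-D array holding W^T X + b at each filter position.
    """
    n, f = image.shape[0], filt.shape[0]
    out = conv_output_size(n, f, stride)
    result = np.zeros((out, out))
    for row in range(out):
        for col in range(out):
            r, c = row * stride, col * stride
            patch = image[r:r + f, c:c + f, :]          # full depth at this location
            result[row, col] = np.sum(patch * filt) + bias
    return result


image = np.ones((32, 32, 3))   # stand-in for one CIFAR-10 image
filt = np.ones((5, 5, 3))      # 75 weights; the bias is passed separately
feature_map = convolve_single_filter(image, filt, bias=1.0)
print(feature_map.shape)
```

Running several filters simply repeats this sweep once per filter and stacks the resulting maps along the depth axis.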
Let’s consider an example to consolidate our understanding. Starting with the same image as before of size 32×32, we need to apply 2 convolution layers consecutively: first 10 filters of size 7 with stride 1, then 6 filters of size 5 with stride 2. Before looking at the solution below, just think about 2 things:
Here is the answer:
Notice here that the size of the images is getting shrunk consecutively. This will be undesirable in case of deep networks where the size would become very small too early. Also, it would restrict the use of large size filters as they would result in faster size reduction.
To prevent this, we generally use a stride of 1 along with zero-padding of size (F-1)/2. Zero-padding is nothing but adding additional zero-value pixels towards the border of the image.
Consider the example we saw above with 6×6 image and 3×3 filter. The required padding is (3-1)/2=1. We can visualize the padding as:
Here you can see that the image now becomes 8×8 because of padding of 1 on each side. So now the output will be of size 6×6 same as the original image.
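The padding step can be sketched with `numpy.pad`. The helper names are hypothetical, and the all-ones 6×6 image is just a stand-in for the example above.

```python
import numpy as np


def same_padding(f):
    """Padding that keeps the size unchanged at stride 1: (F - 1) / 2."""
    return (f - 1) // 2


def zero_pad(image, pad):
    """Add `pad` rows/columns of zeros on every side; depth is left untouched."""
    return np.pad(image, ((pad, pad), (pad, pad), (0, 0)),
                  mode="constant", constant_values=0)


image = np.ones((6, 6, 1))
pad = same_padding(3)                           # 3x3 filter
padded = zero_pad(image, pad)
output_size = (padded.shape[0] - 3) // 1 + 1    # (N - F)/S + 1 on the padded image
print(padded.shape, output_size)
```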
Now let’s summarize a convolution layer as following:
Some additional points to be taken into consideration:
Having understood the convolution layer, let’s move on to the pooling layer.
When we use padding in a convolution layer, the image size remains the same. So pooling layers are used to reduce the size of the image. They work by sampling in each layer using filters. Consider the following 4×4 layer. If we use a 2×2 filter with stride 2 and max-pooling, we get the following response:
Here you can see that each of the four 2×2 blocks is reduced to a single value, its maximum. Generally, max-pooling is used, but other options like average pooling can be considered.
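Max-pooling is easy to sketch in NumPy. The 4×4 values below are made up (the original figure is not reproduced here), and `max_pool` is a hypothetical helper for a single 2-D slice.

```python
import numpy as np


def max_pool(layer, size=2, stride=2):
    """Max-pooling over a 2-D layer using a size x size window."""
    n = layer.shape[0]
    out = (n - size) // stride + 1
    pooled = np.zeros((out, out), dtype=layer.dtype)
    for row in range(out):
        for col in range(out):
            r, c = row * stride, col * stride
            pooled[row, col] = layer[r:r + size, c:c + size].max()
    return pooled


# Made-up 4x4 layer.
layer = np.array([[1, 3, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]])
pooled = max_pool(layer)
print(pooled)
```

Swapping `.max()` for `.mean()` (with a float output array) would give average pooling instead.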
At the end of convolution and pooling layers, networks generally use fully-connected layers in which each pixel is considered as a separate neuron just like a regular neural network. The last fully-connected layer will contain as many neurons as the number of classes to be predicted. For instance, in CIFAR-10 case, the last fully-connected layer will have 10 neurons.
I recommend reading the prior section multiple times and getting a hang of the concepts before moving forward.
In this section, I will discuss the AlexNet architecture in detail. To give you some background, AlexNet is the winning solution of the ImageNet Challenge 2012. This is one of the most reputed computer vision challenges, and 2012 was the first time that a deep learning network was used for solving this problem.
Also, this resulted in a significantly better result as compared to previous solutions. I will share the network architecture here and review all the concepts learned above.
The detailed solution has been explained in this paper. I will explain the overall architecture of the network here. AlexNet consists of an 11-layer CNN with the following architecture:
Here you can see 11 layers between input and output. Let’s discuss each one of them individually. Keep in mind that the output of each layer will be the input of the next layer.
I understand this is a complicated structure, but once you understand the layers, it’ll give you a much better understanding of the architecture. Note that you will find a different representation of the structure if you look at the AlexNet paper. This is because GPUs were not very powerful at that time and the authors used 2 GPUs for training the network, so the processing was divided between the two.
I highly encourage you to go through the other advanced solutions of ImageNet challenges after 2012 to get more ideas of how people design these networks. Some of the interesting solutions are:
This video gives a brief overview and comparison of these solutions towards the end.
Having understood the theoretical concepts, let’s move on to the fun part (practical) and make a basic CNN on the CIFAR-10 dataset which we’ve downloaded before.
I’ll be using GraphLab for the purpose of running algorithms. Instead of GraphLab, you are free to use alternative tools such as Torch, Theano, Keras, Caffe, TensorFlow, etc. But GraphLab allows a quick and dirty implementation as it takes care of the weight initialization and network architecture on its own.
We’ll work on the CIFAR-10 dataset which you can download from here. The first step is to load the data. This data is packed in a specific format which can be loaded using the following code:
```python
import pandas as pd
import numpy as np
import cPickle

# Define a function to load each batch as a dictionary:
def unpickle(file):
    with open(file, 'rb') as fo:
        batch = cPickle.load(fo)
    return batch

# Make dictionaries by calling the above function:
batch1 = unpickle('data/data_batch_1')
batch2 = unpickle('data/data_batch_2')
batch3 = unpickle('data/data_batch_3')
batch4 = unpickle('data/data_batch_4')
batch5 = unpickle('data/data_batch_5')
batch_test = unpickle('data/test_batch')

# Define a function to convert this dictionary into a dataframe
# with an image pixel array column and a label column:
def get_dataframe(batch):
    df = pd.DataFrame(batch['data'])
    df['image'] = df.values.tolist()
    df.drop(range(3072), axis=1, inplace=True)
    df['label'] = batch['labels']
    return df

# Define train and test files:
train = pd.concat([get_dataframe(batch1), get_dataframe(batch2),
                   get_dataframe(batch3), get_dataframe(batch4),
                   get_dataframe(batch5)], ignore_index=True)
test = get_dataframe(batch_test)
```
We can verify this data by looking at the head and shape of the data as follows:
```python
print train.shape, test.shape
```
Since we’ll be using graphlab, the next step is to convert this into a graphlab SFrame and run a neural network. Let’s convert the data first:
```python
import graphlab as gl

gltrain = gl.SFrame(train)
gltest = gl.SFrame(test)
```
GraphLab can automatically create a neural network based on the data. Let’s run that as a baseline model before going into an advanced model.
```python
model = gl.neuralnet_classifier.create(gltrain, target='label', validation_set=None)
```
Here it used a simple fully connected network with 2 hidden layers and 10 neurons each. Let’s evaluate this model on test data.
As you can see, we have a pretty low accuracy of ~15%. This is because it is a very basic network. Let’s try to make a CNN now. But if we go about training a deep CNN from scratch, we will face the following challenges:
To overcome these challenges, we can use pre-trained networks. These are nothing but networks like AlexNet which are pre-trained on many images, so the weights for the deep layers have already been determined. The only challenge is to find a pre-trained network which has been trained on images similar to the ones we want to classify. If the pre-trained network was not built on images from a similar domain, the features will not exactly make sense and the classifier will not reach a high accuracy.
Before proceeding further, we need to convert these images into the size used in ImageNet which we’re using for classification. The GraphLab model is based on 256×256 size images, so we need to convert our images to that size. Let’s do it using the following code:
```python
# Convert pixels to graphlab image format
gltrain['glimage'] = gl.SArray(gltrain['image']).pixel_array_to_image(32, 32, 3, allow_rounding = True)
gltest['glimage'] = gl.SArray(gltest['image']).pixel_array_to_image(32, 32, 3, allow_rounding = True)

# Remove the original column
gltrain.remove_column('image')
gltest.remove_column('image')
```
Here we can see that a new column of type graphlab image has been created, but the images are in 32×32 size. So we convert them to 256×256 using the following code:
```python
# Convert into 256x256 size
gltrain['image'] = gl.image_analysis.resize(gltrain['glimage'], 256, 256, 3)
gltest['image'] = gl.image_analysis.resize(gltest['glimage'], 256, 256, 3)

# Remove old column:
gltrain.remove_column('glimage')
gltest.remove_column('glimage')
```
Now we can see that the image has been converted into the desired size. Next, we will load the ImageNet pre-trained model in graphlab and use the features created in its last layer into a simple classifier and make predictions.
Let’s start by loading the pre-trained model.
```python
# Load the pre-trained model:
pretrained_model = gl.load_model('http://s3.amazonaws.com/GraphLab-Datasets/deeplearning/imagenet_model_iter45')
```
Now we have to use this model and extract features which will be passed into a classifier. Note that the following operations may take a lot of computing time. I use a Macbook Pro 15″ and I had to leave it running for a whole night!
```python
gltrain['features'] = pretrained_model.extract_features(gltrain)
gltest['features'] = pretrained_model.extract_features(gltest)
```
Let’s have a look at the data to make sure we have the features:
Though we have the features with us, notice that a lot of them are zeros. You can understand this as a result of the smaller data set. ImageNet was created on 1.2 million images, so there would be many features in those images that don’t make sense for this data, thus resulting in zero outcome.
Now let’s create a classifier using graphlab. The advantage of the “classifier” function is that it will automatically create various classifiers and choose the best model.
```python
simple_classifier = gl.classifier.create(gltrain, features = ['features'], target = 'label')
```
The various outputs are:
The final model selection is based on a validation set with 5% of the data. The results are:
So we can see that Boosted Trees Classifier has been chosen as the final model. Let’s look at the results on test data:
So we can see that the test accuracy is now ~50%. It’s a decent jump from 15% to 50% but there is still huge potential to do better. The idea here was to get you started and I will skip the next steps. Here are some things which you can try:
You can find many open-source solutions for this dataset which give >95% accuracy. You should check those out. Please feel free to try them and post your solutions in comments below.
Now, it’s time to take the plunge and actually play with some other real datasets. So are you ready to take on the challenge? Accelerate your deep learning journey with the following Practice Problems:
|Practice Problem: Identify the Apparels||Identify the type of apparel for given images|
|Practice Problem: Identify the Digits||Identify the digit in given images|
In this article, we covered the basics of computer vision using deep Convolution Neural Networks (CNNs). We started by appreciating the challenges involved in designing artificial systems which mimic the eye. Then, we looked at some of the traditional techniques, prior to deep learning, and got some intuition into their drawbacks.
We moved on to understanding some aspects of tuning a neural network, such as activation functions, weight initialization and data preprocessing. Next, we got some intuition into why deep CNNs should work better than traditional approaches and we understood the different elements present in a general deep CNN.
Subsequently, we consolidated our understanding by analyzing the architecture of AlexNet, the winning solution of the ImageNet 2012 challenge. Finally, we took the CIFAR-10 data and implemented a CNN on it using a pre-trained AlexNet deep network.
I hope you liked this article. Did you find it useful? Please feel free to share your feedback through the comments below. And to gain expertise in working with neural networks, try out the deep learning practice problem – Identify the Digits.
In my previous article, I discussed the implementation of neural networks using TensorFlow. Continuing the series of articles on neural network libraries, I have decided to throw light on Keras – supposedly the best deep learning library so far.
I have been working on deep learning for some time now and, in my experience, the most difficult thing when dealing with neural networks is the never-ending range of parameters to tune. As the depth of a neural network increases, it becomes increasingly difficult to take care of all the parameters. Mostly, people rely on intuition and experience to tune them. In reality, research on this topic is still very active.
Thankfully we have Keras, which takes care of a lot of this hard work and provides an easier interface!
In this article, I am going to share my experience of working in deep learning. We will begin with an overview of Keras, its features and how it differs from other libraries. We will then look at a simple implementation of neural networks in Keras. Finally, I will take you through a hands-on exercise on parameter tuning in neural networks.
Keras is a high-level library used specifically for building neural network models. It is written in Python and is compatible with both Python 2.7 and 3.5. Keras was developed to let you turn ideas into working models quickly. It has a simple and highly modular interface, which makes it easier to create even complex neural network models. The library abstracts low-level libraries, namely Theano and TensorFlow, so that the user is free from the “implementation details” of these libraries.
The key features of Keras are:
Being a high-level library with a simpler interface, Keras certainly shines as one of the best deep learning libraries available. A few features of Keras that stand out in comparison with other libraries are:
Given the above reasons, it is no surprise that Keras is becoming increasingly popular as a deep learning library.
A neural network is a special type of machine learning (ML) algorithm. So, like every ML algorithm, it follows the usual ML workflow of data preprocessing, model building and model evaluation. For the sake of conciseness, I have listed out a to-do list of how to approach a neural network problem.
Before starting this experiment, make sure you have Keras installed on your system. Refer to the official installation guide. We will use TensorFlow as the backend, so make sure this is set in your config file. If not, follow the steps given here.
Here, we solve our deep learning practice problem – Identify the Digits. Let’s take a look at our problem statement:
Our problem is an image recognition problem: identifying digits from a given 28 x 28 image. We have a subset of images for training and the rest for testing our model. So first, download the train and test files. The dataset contains a zipped file of all the images, and train.csv and test.csv list the names of the corresponding train and test images. No additional features are provided in the datasets, just the raw images in ‘.png’ format.
a) Import all the necessary libraries
%pylab inline
import os
import numpy as np
import pandas as pd
# note: scipy.misc.imread was removed in SciPy 1.2, so this import needs an older SciPy
from scipy.misc import imread
from sklearn.metrics import accuracy_score
import tensorflow as tf
import keras
b) Let’s set a seed value, so that we can control our model’s randomness
# To stop potential randomness
seed = 128
rng = np.random.RandomState(seed)
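To see what a fixed seed buys us, here is a minimal sketch in plain Python. It uses the standard library's random.Random as a stand-in for np.random.RandomState, and the file names and the sample_filenames helper are hypothetical, made up for illustration only.

```python
import random

SEED = 128  # same seed value as the article; any fixed integer works


def sample_filenames(seed, names, k=3):
    """Draw k names using a private generator seeded with `seed`."""
    rng = random.Random(seed)  # stdlib analogue of np.random.RandomState(seed)
    return [rng.choice(names) for _ in range(k)]


names = ["%d.png" % i for i in range(10)]  # hypothetical file names
first = sample_filenames(SEED, names)
second = sample_filenames(SEED, names)

# Two generators built from the same seed produce the same sequence of draws.
print(first == second)
```

Keep in mind that a private generator like this only controls the draws made through it. As far as I know, it does not by itself fix other sources of randomness, such as weight initialization inside the deep learning library.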
c) The first step is to set directory paths, for safekeeping!
root_dir = os.path.abspath('../..')
data_dir = os.path.join(root_dir, 'data')
sub_dir = os.path.join(root_dir, 'sub')
# check for existence
os.path.exists(root_dir)
os.path.exists(data_dir)
os.path.exists(sub_dir)
a) Now let us read our datasets. These are in .csv formats, and have a filename along with the appropriate labels
train = pd.read_csv(os.path.join(data_dir, 'Train', 'train.csv'))
test = pd.read_csv(os.path.join(data_dir, 'Test.csv'))
sample_submission = pd.read_csv(os.path.join(data_dir, 'Sample_Submission.csv'))
train.head()
b) Let us see what our data looks like! We read our image and display it.
img_name = rng.choice(train.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
img = imread(filepath, flatten=True)
pylab.imshow(img, cmap='gray')
pylab.axis('off')
pylab.show()
c) The above image is represented as a numpy array, as seen below
d) For easier data manipulation, let’s store all our images as numpy arrays
# read every training image as a float32 array
temp = []
for img_name in train.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
    img = imread(image_path, flatten=True)
    img = img.astype('float32')
    temp.append(img)

# stack into one array, scale pixels to [0, 1] and flatten each 28 x 28 image to 784 values
train_x = np.stack(temp)
train_x /= 255.0
train_x = train_x.reshape(-1, 784).astype('float32')

# same steps for the test images
temp = []
for img_name in test.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'test', img_name)
    img = imread(image_path, flatten=True)
    img = img.astype('float32')
    temp.append(img)

test_x = np.stack(temp)
test_x /= 255.0
test_x = test_x.reshape(-1, 784).astype('float32')
train_y = keras.utils.np_utils.to_categorical(train.label.values)
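The to_categorical call above turns integer labels into one-hot vectors, which is the target format that a softmax output layer expects. The following is a rough sketch of that idea in plain Python; to_one_hot is a hypothetical helper written for illustration, not the Keras implementation.

```python
def to_one_hot(labels, num_classes=None):
    """Convert integer class labels into one-hot rows (lists of floats)."""
    if num_classes is None:
        num_classes = max(labels) + 1  # infer the class count from the data
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows


encoded = to_one_hot([2, 0, 1])
for row in encoded:
    print(row)
```

For the digits problem there are 10 classes, so each label becomes a vector of length 10 with a single 1 in it.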
e) As this is a typical ML problem, to test the proper functioning of our model we create a validation set. Let’s take a split size of 70:30 for train set vs validation set
split_size = int(train_x.shape[0]*0.7)
train_x, val_x = train_x[:split_size], train_x[split_size:]
train_y, val_y = train_y[:split_size], train_y[split_size:]
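The same 70:30 split can be sketched on plain Python lists to make the slicing explicit. The train_val_split helper and the toy data are assumptions for illustration only.

```python
def train_val_split(samples, labels, train_fraction=0.7):
    """Split samples and labels at the same index, keeping their order."""
    split = int(len(samples) * train_fraction)
    train = (samples[:split], labels[:split])
    val = (samples[split:], labels[split:])
    return train, val


samples = list(range(100))           # toy stand-ins for images
labels = [s % 10 for s in samples]   # toy stand-ins for digit labels

(train_s, train_l), (val_s, val_l) = train_val_split(samples, labels)
print(len(train_s), len(val_s))
```

Note that this split is sequential: the first 70% of rows go to training and the rest to validation. If the file happened to be ordered by label, shuffling before splitting would be the safer choice.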
a) Now comes the main part! Let us define our neural network architecture. We define a neural network with 3 layers: input, hidden and output. The number of neurons in the input and output layers is fixed, as the input is our 28 x 28 image (784 values once flattened) and the output is a 10 x 1 vector representing the class. We take 50 neurons in the hidden layer. Here, we use Adam as our optimization algorithm, which is an efficient variant of the gradient descent algorithm. There are a number of other optimizers available in keras (refer here). In case you don’t understand any of these terms, check out the article on the fundamentals of neural networks to learn in more depth how they work.
# define vars
input_num_units = 784
hidden_num_units = 50
output_num_units = 10
epochs = 5
batch_size = 128

# import keras modules
from keras.models import Sequential
from keras.layers import Dense

# create model
model = Sequential([
    Dense(output_dim=hidden_num_units, input_dim=input_num_units, activation='relu'),
    Dense(output_dim=output_num_units, input_dim=hidden_num_units, activation='softmax'),
])

# compile the model with necessary attributes
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
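The model above ends in a softmax layer and is compiled with a categorical cross-entropy loss. As a hedged illustration of what those two pieces compute, here is a small plain-Python sketch. It follows the standard textbook definitions and is not the Keras code itself.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)


# An untrained network that scores all 10 digits equally...
probs = softmax([0.0] * 10)
target = [0.0] * 10
target[3] = 1.0  # suppose the true digit is 3

# ...assigns probability 1/10 to the true class, so the loss is ln(10).
loss = categorical_crossentropy(target, probs)
print(round(loss, 4))
```

A loss close to ln(10), roughly 2.3, in the first training steps is therefore a sign that the network is still guessing uniformly.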
b) It’s time to train our model
trained_model = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))
a) To test our model with our own eyes, let’s visualize its predictions
pred = model.predict_classes(test_x)
img_name = rng.choice(test.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'test', img_name)
img = imread(filepath, flatten=True)
# file names are numbered consecutively after the training images
test_index = int(img_name.split('.')[0]) - train.shape[0]
print("Prediction is: %d" % pred[test_index])
pylab.imshow(img, cmap='gray')
pylab.axis('off')
pylab.show()
Prediction is: 8
b) We see that our model performs well despite being very simple. Now we create a submission with our model
sample_submission.filename = test.filename
sample_submission.label = pred
sample_submission.to_csv(os.path.join(sub_dir, 'sub02.csv'), index=False)
I feel that hyperparameter tuning is harder for neural networks than for any other machine learning algorithm. You would be insane to apply Grid Search, as there are numerous parameters to tune in a neural network.
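To put a number on why an exhaustive grid search gets out of hand, here is a minimal sketch that counts the training runs for a small, made-up search space. All the values below are illustrative assumptions, not recommendations.

```python
import itertools
import random

# Hypothetical search space: three candidate values for each of five hyperparameters.
search_space = {
    "hidden_units": [50, 100, 500],
    "hidden_layers": [1, 3, 5],
    "dropout": [0.0, 0.2, 0.5],
    "batch_size": [32, 128, 512],
    "epochs": [5, 25, 50],
}

# A full grid trains one model per combination: 3 ** 5 = 243 runs.
grid = list(itertools.product(*search_space.values()))
num_runs = len(grid)
print(num_runs)

# One common alternative is to try only a random subset of the grid.
rng = random.Random(128)
random_trials = rng.sample(grid, 10)
print(len(random_trials))
```

Each extra hyperparameter multiplies the number of runs, and every run is a full training job, which is why sampling a handful of configurations is often the more practical starting point.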
Note: I have discussed a few more details, on when to apply neural networks in the following article An Introduction to Implementing Neural Networks using TensorFlow
Some important parameters to look out for while optimizing neural networks are:
Also, there may be many more hyperparameters depending on the type of architecture. For example, if you use a convolutional neural network, you would have to look at hyperparameters like convolutional filter size, pooling value, etc.
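Filter size and pooling size directly determine how large each feature map is, so it helps to work the arithmetic out before building a network. The sketch below assumes unpadded convolutions with stride 1 and non-overlapping pooling; the helper names and layer sizes are illustrative assumptions.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Width (or height) of a feature map after a convolution."""
    return (input_size + 2 * padding - filter_size) // stride + 1


def pool_output_size(input_size, pool_size):
    """Width (or height) after non-overlapping pooling."""
    return input_size // pool_size


size = 28                         # 28 x 28 input image
size = conv_output_size(size, 5)  # 5 x 5 filter -> 24
size = pool_output_size(size, 2)  # 2 x 2 pooling -> 12
size = conv_output_size(size, 5)  # 5 x 5 filter -> 8
size = pool_output_size(size, 2)  # 2 x 2 pooling -> 4
size = conv_output_size(size, 4)  # 4 x 4 filter -> 1
print(size)
```

With these choices the image shrinks from 28 x 28 to 1 x 1 by the last convolution, so a larger filter or one more pooling step would leave nothing to convolve over.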
The best way to pick good parameters is to understand your problem domain. Research the previously applied techniques on your data, and most importantly ask experienced people for insights to the problem. It’s the only way you can try to ensure you get a “good enough” neural network model.
Let us take our knowledge of hyperparameters and start tweaking our neural network model.
%pylab inline
import os
import numpy as np
import pandas as pd
from scipy.misc import imread
from sklearn.metrics import accuracy_score
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, Convolution2D, Flatten, MaxPooling2D, Reshape, InputLayer

# To stop potential randomness
seed = 128
rng = np.random.RandomState(seed)

root_dir = os.path.abspath('../..')
data_dir = os.path.join(root_dir, 'data')
sub_dir = os.path.join(root_dir, 'sub')

# check for existence
os.path.exists(root_dir)
os.path.exists(data_dir)
os.path.exists(sub_dir)

train = pd.read_csv(os.path.join(data_dir, 'Train', 'train.csv'))
test = pd.read_csv(os.path.join(data_dir, 'Test.csv'))
sample_submission = pd.read_csv(os.path.join(data_dir, 'Sample_Submission.csv'))

temp = []
for img_name in train.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
    img = imread(image_path, flatten=True)
    img = img.astype('float32')
    temp.append(img)

train_x = np.stack(temp)
train_x /= 255.0
train_x = train_x.reshape(-1, 784).astype('float32')

temp = []
for img_name in test.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'test', img_name)
    img = imread(image_path, flatten=True)
    img = img.astype('float32')
    temp.append(img)

test_x = np.stack(temp)
test_x /= 255.0
test_x = test_x.reshape(-1, 784).astype('float32')

train_y = keras.utils.np_utils.to_categorical(train.label.values)

split_size = int(train_x.shape[0]*0.7)
train_x, val_x = train_x[:split_size], train_x[split_size:]
train_y, val_y = train_y[:split_size], train_y[split_size:]
# define vars
input_num_units = 784
hidden_num_units = 500
output_num_units = 10
epochs = 5
batch_size = 128

model = Sequential([
    Dense(output_dim=hidden_num_units, input_dim=input_num_units, activation='relu'),
    Dense(output_dim=output_num_units, input_dim=hidden_num_units, activation='softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_500 = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))
# define vars
input_num_units = 784
hidden1_num_units = 50
hidden2_num_units = 50
hidden3_num_units = 50
hidden4_num_units = 50
hidden5_num_units = 50
output_num_units = 10
epochs = 5
batch_size = 128

model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu'),
    Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu'),
    Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu'),
    Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu'),
    Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_5d = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))
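Making a network wider and making it deeper change its capacity in quite different ways. As a rough sketch, the following counts the trainable weights and biases of the three fully connected architectures used so far; dense_param_count is a hypothetical helper written for this comparison.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)


shallow = dense_param_count([784, 50, 10])                  # one hidden layer of 50
wide = dense_param_count([784, 500, 10])                    # one hidden layer of 500
deep = dense_param_count([784, 50, 50, 50, 50, 50, 10])     # five hidden layers of 50

print(shallow, wide, deep)
```

Most of the parameters sit in the first layer, because it connects all 784 pixels to every hidden unit. Widening that layer to 500 units therefore costs far more parameters than stacking four extra layers of 50.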
# define vars
input_num_units = 784
hidden1_num_units = 50
hidden2_num_units = 50
hidden3_num_units = 50
hidden4_num_units = 50
hidden5_num_units = 50
output_num_units = 10
epochs = 5
batch_size = 128
dropout_ratio = 0.2

model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(dropout_ratio),
    Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu'),
    Dropout(dropout_ratio),
    Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu'),
    Dropout(dropout_ratio),
    Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu'),
    Dropout(dropout_ratio),
    Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu'),
    Dropout(dropout_ratio),
    Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_5d_with_drop = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))
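The model above places a Dropout layer with a ratio of 0.2 after every hidden layer. Here is a minimal plain-Python sketch of the commonly used "inverted dropout" formulation, assuming that kept activations are rescaled during training so that nothing needs to change at prediction time. The dropout function below is illustrative, not the library's code.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` and rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is a no-op at prediction time
    keep = 1.0 - rate
    # Dividing by `keep` preserves the expected value of each activation.
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(128)
layer_output = [1.0] * 10000  # toy activations from a hidden layer

dropped = dropout(layer_output, 0.2, rng)
zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)

print(round(zero_fraction, 2), round(mean_activation, 2))
```

Roughly a fifth of the units are silenced on each pass while the average activation stays about the same, which discourages the network from relying on any single unit.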
# define vars
input_num_units = 784
hidden1_num_units = 50
hidden2_num_units = 50
hidden3_num_units = 50
hidden4_num_units = 50
hidden5_num_units = 50
output_num_units = 10
epochs = 50
batch_size = 128

model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(0.2),
    Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu'),
    Dropout(0.2),
    Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu'),
    Dropout(0.2),
    Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu'),
    Dropout(0.2),
    Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu'),
    Dropout(0.2),
    Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_5d_with_drop_more_epochs = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))
# define vars
input_num_units = 784
hidden1_num_units = 500
hidden2_num_units = 500
hidden3_num_units = 500
hidden4_num_units = 500
hidden5_num_units = 500
output_num_units = 10
epochs = 25
batch_size = 128

model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(0.2),
    Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu'),
    Dropout(0.2),
    Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu'),
    Dropout(0.2),
    Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu'),
    Dropout(0.2),
    Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu'),
    Dropout(0.2),
    Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_deep_n_wide = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))

pred = model.predict_classes(test_x)
sample_submission.filename = test.filename
sample_submission.label = pred
sample_submission.to_csv(os.path.join(sub_dir, 'sub03.csv'), index=False)
# reshape data
train_x_temp = train_x.reshape(-1, 28, 28, 1)
val_x_temp = val_x.reshape(-1, 28, 28, 1)

# define vars
input_shape = (784,)
input_reshape = (28, 28, 1)
conv_num_filters = 5
conv_filter_size = 5
pool_size = (2, 2)
hidden_num_units = 50
output_num_units = 10
epochs = 5
batch_size = 128

model = Sequential([
    InputLayer(input_shape=input_reshape),
    Convolution2D(25, 5, 5, activation='relu'),
    MaxPooling2D(pool_size=pool_size),
    Convolution2D(25, 5, 5, activation='relu'),
    MaxPooling2D(pool_size=pool_size),
    Convolution2D(25, 4, 4, activation='relu'),
    Flatten(),
    Dense(output_dim=hidden_num_units, activation='relu'),
    Dense(output_dim=output_num_units, input_dim=hidden_num_units, activation='softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_conv = model.fit(train_x_temp, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x_temp, val_y))
This result blows your mind, doesn’t it? Even with such a short training time, the performance is way better! This shows that a better architecture can certainly boost your performance when dealing with neural networks.
It’s time to let go of the training wheels. There are many things you can try and many tweaks to make. Try this on your end and let us know how it goes!
Now, you have a basic overview of Keras and a hands-on experience of implementing neural networks. There is still much more you can do. For example, I really like the implementation of keras to build image analogies. In this project, the authors train a neural network to understand an image, and recreate learnt attributes to another image. As seen below, the first two images are given as input, where the model trains on the first image and on giving input as second image, gives output as the third image.
Neural network tuning is still considered a “dark art”. So, don’t expect to get the best model on your first try. Build, evaluate and iterate: this is how you become a better neural network practitioner.
Another point you should know is that there are other methods to ensure that you get a “good enough” neural network model without training it from scratch. Techniques like pre-training and transfer learning are essential to know when you are implementing neural network models to solve real-life problems.
I hope you found this article helpful. Now, it’s time for you to practice and read as much as you can. Good luck! If you have any recommendations or suggestions on neural networks, I’d love to interact with you in the comments. If you have any more doubts or queries, feel free to drop them in the comments below. Try out the practice problem Identify the Digits yourself and let me know how it went.
Fundamentals of Deep Learning – Starting with Artificial Neural Network
Deep Learning for Computer Vision – Introduction to Convolution Neural Networks
Optimizing Neural Networks Tutorial using Keras (Image recognition)