January 13, 2020
January 15, 2020

## Introduction

Did you know that the first neural networks were developed in the early 1950s?

Deep Learning (DL) and Neural Networks (NN) are currently driving some of the most ingenious inventions of this century. Their remarkable ability to learn from data and from their environment makes them the first choice of many machine learning scientists.

Deep learning and neural networks lie at the heart of products such as self-driving cars, image recognition software and recommender systems. Being such a powerful family of algorithms, they also adapt well to many different data types.

Many people think neural networks are an extremely difficult topic to learn. As a result, some avoid them altogether, and others use them as a black box. Is there any point in doing something without knowing how it is done? NO!

In this article, I’ve attempted to explain the concept of a neural network in simple words. Understanding this article requires a little bit of biology and lots of patience. By the end of this article, you should be a confident analyst, ready to start working with neural networks. In case you don’t understand something, I’m always available in the comments section.

1. What is a Neural Network?
2. How a Single Neuron works?
3. Why multi-layer networks are useful?
4. General Structure of a Neural Network
5. Back-propagation ( Really Important )

## 1. What is a Neural Network?

A Neural Network (NN), also called an Artificial Neural Network (ANN), is named after its artificial representation of the working of the human nervous system. Remember this diagram? Most of us were taught it in high school!

Flashback recap: Let’s start by understanding how our nervous system works. The nervous system comprises millions of nerve cells, or neurons. A neuron has the following structure:

The major components are:

• Dendrites: They take input from other neurons in the form of electrical impulses
• Cell body: It generates inferences from those inputs and decides what action to take
• Axon terminals: They transmit outputs in the form of electrical impulses

In simple terms, each neuron takes input from numerous other neurons through its dendrites. It then performs the required processing on the input and sends another electrical pulse through the axon to the terminal nodes, from where it is transmitted to numerous other neurons.

An ANN works in a very similar fashion. The general structure of a neural network looks like:

This figure depicts a typical neural network with working of a single neuron explained separately. Let’s understand this.

The inputs to each neuron are like the dendrites. Just like in the human nervous system, a neuron (artificial though!) collates all the inputs and performs an operation on them. Lastly, it transmits the output to all the neurons (of the next layer) to which it is connected. A neural network is divided into layers of 3 types:

1. Input Layer: The training observations are fed through these neurons
2. Hidden Layers: These are the intermediate layers between input and output which help the Neural Network learn the complicated relationships involved in data.
3. Output Layer: The final output is produced here, computed from the outputs of the preceding layers. For example, in a classification problem with 5 classes, the output layer will have 5 neurons.

Let’s start by looking into the functionality of each neuron with examples.

## 2. How a Single Neuron works?

In this section, we will explore the working of a single neuron with easy examples. The idea is to give you some intuition on how a neuron computes its output from its inputs. A typical neuron looks like:

The different components are:

1. x1, x2,…, xN: Inputs to the neuron. These can either be the actual observations from the input layer or intermediate values from one of the hidden layers.
2. x0: Bias unit. This is a constant value added to the input of the activation function. It works like an intercept term and typically has the value +1.
3. w0, w1, w2,…, wN: Weights on each input. Note that even the bias unit has a weight.
4. a: Output of the neuron, which is calculated as:

#### a = f( w0*x0 + w1*x1 + … + wN*xN )

Here f is known as an activation function. It is what makes a neural network extremely flexible and gives it the capability to estimate complex non-linear relationships in data. It can be a Gaussian function, a logistic function, a hyperbolic function or even a linear function in simple cases.

Let’s implement 3 fundamental functions (OR, AND, NOT) using neural networks. This will help us understand how they work. You can think of these as classification problems where we predict the output (0 or 1) for different combinations of inputs.

We will model these like linear classifiers with the following activation function:

### Example 1: AND

The AND function can be implemented as:

The output of this neuron is:

#### a = f( -1.5 + x1 + x2 )

The truth table for this implementation is:

Here we can see that the AND function is successfully implemented: column ‘a’ matches ‘X1 AND X2’. Note that the bias unit weight here is -1.5, but that is not a fixed value. Intuitively, it can be anything that makes the total positive only when both x1 and x2 are 1. So any value strictly between -2 and -1 would work.

### Example 2: OR

The OR function can be implemented as:

The output of this neuron is:

#### a = f( -0.5 + x1 + x2 )

The truth table for this implementation is:

Column ‘a’ matches ‘X1 OR X2’. We can see that, just by changing the bias unit weight, we can implement an OR function. This is very similar to the one above. Intuitively, the bias here is such that the weighted sum is positive if either x1 or x2 is 1.

### Example 3: NOT

Just like the previous cases, the NOT function can be implemented as:

The output of this neuron is:

#### a = f( 1 – 2*x1 )

The truth table for this implementation is:

Again, the match with the desired values shows that the neuron works. I hope that with these examples you’re getting some intuition into how a neuron inside a neural network works. Here I have used a very simple activation function.
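The three gates above can be checked with a few lines of Python. This is a minimal sketch: the weights are the ones given in the examples, but the activation is assumed to be a step function that outputs 1 when the weighted sum is zero or more (the original figure is not reproduced here), and the helper names `step` and `neuron` are only illustrative.

```python
def step(z):
    """Assumed activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0

def neuron(weights, inputs):
    """Weighted sum over [x0 = 1 (bias), x1, ..., xN], followed by step()."""
    z = sum(w * x for w, x in zip(weights, [1] + list(inputs)))
    return step(z)

# Weights are listed as [w0 (bias), w1, w2], taken from the examples above
AND_W = [-1.5, 1, 1]
OR_W = [-0.5, 1, 1]
NOT_W = [1, -2]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", neuron(AND_W, (x1, x2)), "OR:", neuron(OR_W, (x1, x2)))
for x1 in (0, 1):
    print(x1, "NOT:", neuron(NOT_W, (x1,)))
```

Changing only the bias weight from -1.5 to -0.5 turns the AND neuron into an OR neuron, which is exactly the point made in Example 2.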

Note: Generally a logistic function is used in place of the one I used here, because it is differentiable and makes it possible to compute a gradient. There’s just one catch: it outputs a floating-point value between 0 and 1, not exactly 0 or 1.

## 3. Why multi-layer networks are useful?

After understanding the working of a single neuron, let’s try to understand how a neural network can model complex relations using multiple layers. To do so, we will take the example of an XNOR function. As a recap, the truth table of an XNOR function looks like:

Here we can see that the output is 1 when both inputs are the same, and 0 otherwise. This sort of relationship cannot be modeled using a single neuron. (Don’t believe me? Give it a try!) Thus we will use a multi-layer network. The idea behind using multiple layers is that complex relations can be broken into simpler functions and combined.

Let’s break down the XNOR function.

        X1 XNOR X2 = NOT ( X1 XOR X2 )
                   = NOT [ (A+B).(A'+B') ]       (Note: A = X1 and B = X2; '+' means OR, '.' means AND and ' means NOT)
                   = (A+B)' + (A'+B')'
                   = (A'.B') + (A.B)


Now we can implement it using any of these equivalent forms. I will show you how to implement it using 2 of them.

### Case 1: X1 XNOR X2 = (A’.B’) + (A.B)

Here the challenge is to design a neuron to model A’.B’ . This can be easily modeled using the following:

The output of this neuron is:

#### a = f( 0.5 – x1 – x2 )

The truth table for this function is:

Now that we have modeled the individual components, we can combine them using a multi-layer network. First, let’s look at the schematic diagram of that network:

Here we can see that in layer 1, we will determine A’.B’ and A.B individually. In layer 2, we will take their output and implement an OR function on top. This would complete the entire Neural Network. The final network would look like this:

If you notice carefully, this is nothing but a combination of the different neurons which we have already drawn. The different outputs represent different units:

1. a1: implements A’.B’
2. a2: implements A.B
3. a3: implements OR which works on a1 and a2, thus effectively (A’.B’ + A.B)

The functionality can be verified using the truth table:
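As a quick check of this network, here is a minimal sketch in Python. It reuses the weights from the neurons above and assumes the same step activation as in section 2 (1 for a non-negative weighted sum, 0 otherwise); the function name `xnor` is only illustrative.

```python
def step(z):
    """Assumed activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0

def xnor(x1, x2):
    # Layer 1: two neurons working on the raw inputs
    a1 = step(0.5 - x1 - x2)     # A'.B'
    a2 = step(-1.5 + x1 + x2)    # A.B
    # Layer 2: OR on top of the two layer-1 outputs
    a3 = step(-0.5 + a1 + a2)    # (A'.B') + (A.B)
    return a3

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xnor(x1, x2))
```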

I think you can now get some intuition into how multiple layers work. Let’s do another implementation of the same function.

### Case 2: X1 XNOR X2 = NOT [ (A+B).(A’+B’) ]

In the above example, we had to calculate A’.B’ separately. What if we want to implement the function using just the basic AND, OR and NOT functions? Consider the following schematic:

Here you can see that we had to use 3 hidden layers. The working will be similar to what we did before. The network looks like:

Here the neurons perform following actions:

1. a1: same as A
2. a2: implements A’
3. a3: same as B
4. a4: implements B’
5. a5: implements OR, effectively A+B
6. a6: implements OR, effectively A’+B’
7. a7: implements AND, effectively (A+B).(A’+B’)
8. a8: implements NOT, effectively NOT [ (A+B).(A’+B’) ] which is the final XNOR

Note that a neuron typically feeds into every neuron of the next layer except the bias unit. In this case, I’ve omitted a few connections from layer 1 to layer 2. This is because their weights are 0, and drawing them would make the figure cumbersome to read.

The truth table is:

Finally, we have successfully implemented the XNOR function. This method is more complicated than case 1, so case 1 is preferable in practice. But the idea here is to show how complicated functions can be broken down across multiple layers. I hope the advantages of multiple layers are clearer now.
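The eight neurons of case 2 can be sketched the same way. The AND, OR and NOT weights are the ones from section 2; the weights of the pass-through neurons a1 and a3 (bias -0.5, input weight 1) are my own assumption, since the article only says they are the "same as" A and B. The step activation is again assumed to output 1 for a non-negative weighted sum.

```python
def step(z):
    """Assumed activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0

def xnor_basic_gates(x1, x2):
    # Hidden layer 1: pass-through and NOT neurons
    a1 = step(-0.5 + x1)         # same as A (assumed weights)
    a2 = step(1 - 2 * x1)        # A'
    a3 = step(-0.5 + x2)         # same as B (assumed weights)
    a4 = step(1 - 2 * x2)        # B'
    # Hidden layer 2: two OR neurons
    a5 = step(-0.5 + a1 + a3)    # A + B
    a6 = step(-0.5 + a2 + a4)    # A' + B'
    # Hidden layer 3: AND neuron
    a7 = step(-1.5 + a5 + a6)    # (A+B).(A'+B')
    # Output layer: NOT neuron
    a8 = step(1 - 2 * a7)        # NOT [ (A+B).(A'+B') ]
    return a8

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xnor_basic_gates(x1, x2))
```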

## 4. General Structure of a Neural Network

Now that we have had a look at some basic examples, let’s define a generic structure that every neural network fits into. We will also see the equations used to determine the output for a given input. This is known as forward propagation.

A generic Neural Network can be defined as:

It has L layers with 1 input layer, 1 output layer and L-2 hidden layers. Terminology:

• $L$: number of layers
• $N_i$: number of neurons in the $i$th layer, excluding the bias unit, where $i = 1, 2, \dots, L$
• $a_i^{(j)}$: the output of the $j$th neuron in the $i$th layer, where $i = 1, 2, \dots, L$ and $j = 0, 1, 2, \dots, N_i$

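Since the output of each layer forms the input of the next layer, let’s define the equation that determines the output of the $(i+1)$th layer from the output of the $i$th layer.

The inputs to the $(i+1)$th layer are:

$$A_i = \left[\, a_i^{(0)},\ a_i^{(1)},\ \dots,\ a_i^{(N_i)} \,\right]$$

Dimension: $1 \times (N_i + 1)$

The weight matrix from the $i$th to the $(i+1)$th layer is:

$$W^{(i)} = \begin{bmatrix}
W_{01}^{(i)} & W_{02}^{(i)} & \cdots & W_{0N_{i+1}}^{(i)} \\
W_{11}^{(i)} & W_{12}^{(i)} & \cdots & W_{1N_{i+1}}^{(i)} \\
\vdots & \vdots & \ddots & \vdots \\
W_{N_i 1}^{(i)} & W_{N_i 2}^{(i)} & \cdots & W_{N_i N_{i+1}}^{(i)}
\end{bmatrix}$$

Dimension: $(N_i + 1) \times N_{i+1}$, where $W_{jk}^{(i)}$ is the weight from node $j$ of layer $i$ to node $k$ of layer $i+1$.

The output of the $(i+1)$th layer can be calculated as:

$$A_{i+1} = f\left( A_i \, W^{(i)} \right)$$

Dimension: $1 \times N_{i+1}$ (the bias unit $a_{i+1}^{(0)}$ is then added before this output is fed into the following layer)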

Using these equations for each subsequent layer, we can determine the final output. The number of neurons in the output layer will depend on the type of problem. It can be 1 for regression or binary classification problem or multiple for multi-class classification problems.
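The forward propagation equations can be sketched in NumPy as follows. This is only an illustration under a few assumptions of mine: a logistic activation for f, random weights, and a small 3-4-2 architecture. The helper name `forward` is hypothetical. Each weight matrix has one row per node of the current layer (including the bias unit) and one column per node of the next layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, f=sigmoid):
    """Return the inputs fed into each layer (bias included) plus the final output."""
    a = np.asarray(x, dtype=float).reshape(1, -1)
    activations = []
    for W in weights:
        a = np.hstack([np.ones((a.shape[0], 1)), a])  # prepend bias unit a(0) = 1
        activations.append(a)                          # A_i, shape 1 x (N_i + 1)
        a = f(a @ W)                                   # A_{i+1} = f(A_i . W(i))
    activations.append(a)                              # output layer, no bias added
    return activations

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # N_1, N_2, N_3 (assumed architecture)
weights = [rng.normal(0.0, 0.5, size=(n_in + 1, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

acts = forward([0.2, -0.1, 0.7], weights)
print([a.shape for a in acts])  # [(1, 4), (1, 5), (1, 2)]
```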

But this is just determining the output from 1 run. The ultimate objective is to update the weights of the model in order to minimize the loss function. The weights are updated using the back-propagation algorithm, which we’ll study next.

## 5. Back-Propagation

The back-propagation (BP) algorithm works by determining the loss (or error) at the output and then propagating it back into the network. The weights are updated to minimize the error attributed to each neuron. I will not go into the details of the algorithm, but I will try to give you some intuition into how it works.

The first step in minimizing the error is to determine the gradient of the final output with respect to each node. Since it is a multi-layer network, determining the gradient is not very straightforward.

Let’s understand the gradients for multi-layer networks. Let’s take a step back from neural networks and consider a very simple system, as follows:

Here there are 3 inputs, which are processed simply as:

#### e = d * c = (a-b)*c

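Now we need to determine the gradient of the output e with respect to a, b, c and d. The following cases are very straightforward:

$$\frac{\partial e}{\partial c} = d = a - b, \qquad \frac{\partial e}{\partial d} = c$$

However, to determine the gradients for a and b, we need to apply the chain rule:

$$\frac{\partial e}{\partial a} = \frac{\partial e}{\partial d} \cdot \frac{\partial d}{\partial a} = c \cdot 1 = c, \qquad \frac{\partial e}{\partial b} = \frac{\partial e}{\partial d} \cdot \frac{\partial d}{\partial b} = c \cdot (-1) = -c$$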

In this way, the gradient can be computed by simply multiplying the gradient of a node’s output with respect to its input by the gradient of the final output with respect to that node’s output. If you’re still confused, just read the equation carefully 5 times and you’ll get it!
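To make the chain rule concrete, here is a small sketch for the system e = (a - b) * c. It computes the gradients by hand, working backwards from the output, and compares them against a finite-difference estimate. The function names and the input values (5, 2, 4) are arbitrary choices of mine.

```python
def forward_backward(a, b, c):
    # Forward pass
    d = a - b
    e = d * c
    # Backward pass: start at the output and move towards the inputs
    de_dd = c              # direct: e = d * c
    de_dc = d              # direct: e = d * c
    de_da = de_dd * 1.0    # chain rule: dd/da = +1
    de_db = de_dd * -1.0   # chain rule: dd/db = -1
    return e, {"a": de_da, "b": de_db, "c": de_dc, "d": de_dd}

def numerical_grad(fn, args, idx, h=1e-6):
    """Central finite difference with respect to args[idx]."""
    hi, lo = list(args), list(args)
    hi[idx] += h
    lo[idx] -= h
    return (fn(*hi) - fn(*lo)) / (2 * h)

e, grads = forward_backward(5.0, 2.0, 4.0)
print(e, grads)  # 12.0 {'a': 4.0, 'b': -4.0, 'c': 3.0, 'd': 4.0}

system = lambda a, b, c: (a - b) * c
print([numerical_grad(system, (5.0, 2.0, 4.0), i) for i in range(3)])
```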

But actual cases are not that simple. Let’s take another example. Consider a case where a single input is fed into multiple nodes in the next layer, as is almost always the case in a neural network.

In this case, the gradients of all the others will be very similar to the above example, except for ‘m’, because m is fed into 2 nodes. Here, I’ll show how to determine the gradient for m; the rest you should calculate on your own.

Here you can see that the gradient is simply the summation of the two different gradients. I hope the cloud cover is slowly vanishing and things are becoming lucid. Just understand these concepts and we’ll come back to this.
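The summation rule can be illustrated with a tiny made-up system. Note that this is not the system from the figure, which is not reproduced here: the nodes p, q and the inputs n, k are hypothetical and chosen only so that m feeds into two nodes.

```python
def forward_backward(m, n, k):
    # Forward pass: m feeds into two different nodes, p and q
    p = m * n
    q = m + k
    out = p * q
    # Backward pass
    dout_dp = q
    dout_dq = p
    # m reaches the output through two paths, so the two contributions are summed
    dout_dm = dout_dp * n + dout_dq * 1.0
    return out, dout_dm

out, grad_m = forward_backward(3.0, 2.0, 1.0)
print(out, grad_m)  # 24.0 14.0
```

Expanding the expression by hand gives out = n*m^2 + n*k*m, whose derivative with respect to m is 2*n*m + n*k = 14 at these values, which agrees with the summed gradient.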

Before moving forward, let’s sum up the entire process behind the optimization of a neural network. The steps involved are as follows (steps 3 to 6 are repeated in every iteration):

1. Select a network architecture, i.e. the number of hidden layers, the number of neurons in each layer and the activation function
2. Initialize the weights randomly
3. Use forward propagation to determine the output
4. Find the error of the model using the known labels
5. Back-propagate the error into the network and determine the error for each node
6. Update the weights to minimize the error

So far we have covered #1 to #3 and we have some intuition into #5. Now let’s go through #4 to #6. We’ll use the same generic structure of a NN as described in section 4.

#4- Find the error

$$e_L^{(i)} = y^{(i)} - a_L^{(i)}, \qquad i = 1, 2, \dots, N_L$$

Here $y^{(i)}$ is the actual outcome from the training data.

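#5- Back-propagating the error into the network

The error for layer L-1 should be determined first, using the following:

$$e_{L-1}^{(i)} = \left( \sum_{k=1}^{N_L} W_{ik}^{(L-1)} \, e_L^{(k)} \right) \cdot f'(x)^{(i)}$$

where $i = 0, 1, 2, \dots, N_{L-1}$ (number of nodes in the (L-1)th layer)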

Intuition from the concepts discussed in former half of this section:

• We saw that the gradient at a node is a function of the gradients of all the nodes in the next layer. Here, the error at a node is based on the weighted sum of the errors of all the nodes of the next layer that take the output of this node as input. Since errors are calculated using the gradient at each node, the factor $f'(x)^{(i)}$ comes into the picture.
• $f'(x)^{(i)}$ refers to the derivative of the activation function for the inputs coming into that node. Note that x refers to the weighted sum of all inputs to the present node, before the activation function is applied.
• The chain rule is followed here by multiplying the gradient of the current node, i.e. $f'(x)^{(i)}$, with that of the subsequent nodes, which comes from the first half of the RHS of the equation.

This process has to be repeated consecutively from L-1th layer to 2nd layer. Note that the first layer is just the inputs.

#6- Update weights to minimize the error

Use the following update rule for the weights:

$$W_{ik}^{(l)} = W_{ik}^{(l)} + a_l^{(i)} \cdot e_{l+1}^{(k)}$$

where,

• $l = 1, 2, \dots, (L-1)$ | index of layers (excluding the last layer)
• $i = 0, 1, \dots, N_l$ | index of node in the $l$th layer
• $k = 1, 2, \dots, N_{l+1}$ | index of node in the $(l+1)$th layer
• $W_{ik}^{(l)}$ refers to the weight from the $i$th node of the $l$th layer to the $k$th node of the $(l+1)$th layer
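Steps #4 to #6 can be put together in a short NumPy sketch. Several things here are assumptions of mine and not part of the article: a logistic activation (so that f'(x) = a(1 - a)), one hidden layer with 4 neurons, the XNOR truth table as training data, a learning rate `lr` that scales each update, and updates summed over all four training rows. The function name `train_step` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(a):
    """Prepend the bias unit (constant 1) as column 0."""
    return np.hstack([np.ones((a.shape[0], 1)), a])

def train_step(X, Y, W1, W2, lr=0.5):
    # Forward propagation (step 3)
    A1 = add_bias(X)            # inputs to layer 2, bias included
    H = sigmoid(A1 @ W1)        # hidden layer output
    A2 = add_bias(H)            # inputs to the output layer, bias included
    out = sigmoid(A2 @ W2)
    # Step 4: error at the output layer, e = y - a
    e_out = Y - out
    # Step 5: weighted sum of next-layer errors times f'(x); the bias row of W2 is skipped
    e_hidden = (e_out @ W2[1:].T) * H * (1.0 - H)
    # Step 6: W += a * e, scaled by the assumed learning rate (arrays updated in place)
    W2 += lr * A2.T @ e_out
    W1 += lr * A1.T @ e_hidden
    return float(np.mean(e_out ** 2))  # squared error before the update, for monitoring

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[1.], [0.], [0.], [1.]])       # XNOR targets
W1 = rng.normal(0.0, 1.0, size=(3, 4))       # (N_1 + 1) x N_2
W2 = rng.normal(0.0, 1.0, size=(5, 1))       # (N_2 + 1) x N_3

history = [train_step(X, Y, W1, W2) for _ in range(5000)]
print("first error:", history[0], "last error:", history[-1])
```

The exact numbers depend on the random initialization, so treat the printed errors as an indication that the error shrinks over the iterations and not as a reference result.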

I hope the convention is clear. I suggest you go through it multiple times, and if there are still questions, I’ll be happy to take them in the comments below.

With this we have successfully understood how a neural network works. Please feel free to discuss further if needed.

## End Notes

This article is focused on the fundamentals of a neural network and how it works. I hope you now understand the working of a neural network and won’t ever use it as a black box. It becomes really easy once you also try it out in practice.

Therefore, in my upcoming article, I’ll explain applications of neural networks in Python. I’ll focus more on the practical aspects of neural networks than on theory. Two applications come to my mind immediately:

1. Image Processing
2. Natural Language Processing

I hope you enjoyed this. I would love if you could share your feedback through comments below. Looking forward to interacting with you further on this!

## Introduction to Convolution Neural Networks

The power of artificial intelligence is beyond our imagination. We all know robots have already reached a testing phase in some of the powerful countries of the world. Governments and large companies are spending billions on developing these ultra-intelligent creatures. The recent emergence of robots has gained the attention of many research houses across the world.

Does it excite you as well? Personally, learning about robots and developments in AI started with a deep curiosity and excitement in me! Let’s learn about computer vision today.

The earliest research in computer vision started way back in the 1950s. Since then, we have come a long way but still find ourselves far from the ultimate objective. But with neural networks and deep learning, we have become empowered like never before.

Applications of deep learning in vision have taken this technology to a different level and made sophisticated things like self-driving cars possible in the near future. In this article, I will also introduce you to Convolution Neural Networks, which form the crux of deep learning applications in computer vision.

Note: This article is inspired by Stanford’s Class on Visual Recognition. Understanding this article requires prior knowledge of Neural Networks. If you are new to neural networks, you can start here. Another useful resource on basics of deep learning can be found here.

1. Challenges in Computer Vision
2. Overview of Traditional Approaches
3. Review of Neural Networks Fundamentals
4. Introduction to Convolution Neural Networks
5. Case Study: Increasing power of CNNs in IMAGENET competition
6. Implementing CNNs using GraphLab (Practical in Python)

## 1. Challenges in Computer Vision (CV)

As the name suggests, the aim of computer vision (CV) is to imitate the functionality of the human eye and of the brain components responsible for our sense of sight.

Actions such as recognizing an animal, describing a view or differentiating among visible objects are a cake-walk for humans. You’d be surprised to know that it took decades of research to give a computer the ability to detect an object with reasonable accuracy.

The field of computer vision has witnessed continual advancements in the past 5 years. One of the most frequently cited advancements is Convolution Neural Networks (CNNs). Today, deep CNNs form the crux of the most sophisticated computer vision applications, such as self-driving cars, auto-tagging of friends in our Facebook pictures, facial security features, gesture recognition, automatic number plate recognition, etc.

Let’s get familiar with it a bit more:

Object detection is considered to be the most basic application of computer vision. Most other developments in computer vision are achieved by making small enhancements on top of it. In real life, every time we (humans) open our eyes, we unconsciously detect objects.

Since it is super-intuitive for us, we fail to appreciate the key challenges involved when we try to design systems similar to our eye. Let’s start by looking at some of the key roadblocks:

1. Variations in Viewpoint
   • The same object can have different positions and angles in an image depending on the relative position of the object and the observer.
   • There can also be different positions. For instance, look at the following images:
   • Though it’s obvious to us that these are the same object, it is not very easy to teach this to a computer (robots or machines).
2. Difference in Illumination
   • Different images can have different light conditions. For instance:
   • Though this image is so dark, we can still recognize that it is a cat. Teaching this to a computer is another challenge.
3. Hidden parts of images
   • Images need not necessarily be complete. Small or large portions of an image might be hidden, which makes the detection task difficult. For instance:
   • Here, only the face of the puppy is visible, and that too partially, posing another challenge for the computer.
4. Background Clutter
   • Some objects might blend into the background. For instance:
   • If you observe carefully, you can find a man in this image. As simple as it looks, it’s an uphill task for a computer to learn.

These are just some of the challenges, which I brought up so that you can appreciate the complexity of the tasks that your eye and brain handle with such ease. Breaking these challenges up and solving them individually is possible today in computer vision. But we’re still decades away from a system that can get anywhere close to the human eye (which can do everything!).

This brilliance of the human body is the reason why researchers have been trying to break the enigma of computer vision by analyzing the visual mechanics of humans and other animals. Some of the earliest work in this direction was done by Hubel and Wiesel with their famous cat experiment in 1959. Read more about it here.

This was the first study which emphasized the importance of edge detection for solving the computer vision problem. They were awarded the Nobel Prize for their work.

Before diving into convolutional neural networks, let’s take a quick look at the traditional, or rather elementary, techniques used in computer vision before deep learning became popular.

## 2. Overview of Traditional Approaches

Various techniques other than deep learning are available for computer vision. They work well for simpler problems, but as the data becomes huge and the task becomes complex, they are no substitute for deep CNNs. Let’s briefly discuss two simple approaches.

1. KNN (K-Nearest Neighbours)
   • Each image is matched with all the images in the training data. The top K with minimum distances are selected. The majority class of those top K is predicted as the output class of the image.
   • Various distance metrics can be used, like L1 distance (sum of absolute differences), L2 distance (square root of the sum of squared differences), etc.
   • Drawbacks:
      • Even if we take images of the same object with the same illumination and orientation, the object might lie in different locations of the image, i.e. left, right or center. For instance:
      • Here the same dog is on the right side in the first image and on the left side in the second. Though it’s the same dog, KNN would give a large distance between the 2 images.
      • Similarly, the other challenges mentioned in section 1 will be faced by KNN.
2. Linear Classifiers
   • They use a parametric approach where each pixel value is considered as a parameter.
   • It’s like a weighted sum of the pixel values, with the dimension of the weights matrix depending on the number of outcomes.
   • Intuitively, we can understand this in terms of a template. The weights form a template image which is matched against every image. This approach will also have difficulty overcoming the challenges discussed in section 1, as a single template is difficult to design for all the different cases.
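The KNN drawback described above is easy to reproduce. The sketch below is illustrative only: the helper names are mine, and the "image" is a synthetic 8×8 array with a bright square standing in for the dog. Shifting the square sideways leaves the content unchanged but produces a large pixel-wise distance.

```python
import numpy as np

def l1_distance(a, b):
    return np.abs(a - b).sum()

def l2_distance(a, b):
    return np.sqrt(((a - b) ** 2).sum())

def knn_predict(train_X, train_y, query, k=3, distance=l1_distance):
    """Majority class among the k training items closest to the query."""
    dists = np.array([distance(x, query) for x in train_X])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Same object, different location: a 3x3 square moved 4 pixels to the right
img = np.zeros((8, 8))
img[2:5, 1:4] = 1.0
shifted = np.roll(img, 4, axis=1)
print(l1_distance(img, img), l1_distance(img, shifted))  # 0.0 18.0

# Tiny toy dataset to show the voting step
train_X = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
train_y = np.array([0, 0, 1, 1])
print(knn_predict(train_X, train_y, np.array([5, 6]), k=3))  # 1
```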

I hope this gives some intuition into the challenges faced by approaches other than deep learning. Please note that more sophisticated techniques can be used than the ones discussed above but they would rarely beat a deep learning model.

## 3. Review of Neural Networks Fundamentals

Let’s discuss some properties of neural networks. I will skip the basics of neural networks here as I have already covered them in my previous article – Fundamentals of Deep Learning – Starting with Neural Networks.

Once your fundamentals are sorted, let’s learn in detail some important concepts such as activation functions, data preprocessing, initializing weights and dropouts.

### Activation Functions

There are various activation functions which can be used and this is an active area of research. Let’s discuss some of the popular options:

1. Sigmoid Function
   • Equation: $\sigma(x) = 1/(1+e^{-x})$
   • Sigmoid activation, also used in logistic regression, squashes the input space from (-inf, inf) to (0, 1)
   • But it has various problems and is almost never used in CNNs:
      1. Saturated neurons kill the gradient
         • If you observe the graph of the sigmoid carefully, for inputs below -5 or above 5 the output is very close to 0 or 1 respectively. In these regions the gradients are almost zero: the tangents are almost parallel to the x-axis, so the slope is ~0.
         • Since gradients get multiplied in back-propagation, this small gradient will virtually stop back-propagation into further layers, thus killing the gradient.
      2. Outputs are not zero-centered
         • All the outputs are between 0 and 1. As these become inputs to the next layer, the gradients on the weights of the next layer will be either all positive or all negative. So the path to the optimum will be zig-zag. I will skip the mathematics here. Please refer to the Stanford class mentioned above for details.
      3. Taking the exp() is computationally expensive
         • Though not a big drawback, it has a slight negative impact
2. tanh activation
   • It is simply the hyperbolic tangent function, with the form: $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$
   • It is generally preferred over sigmoid because it solves problem #2, i.e. the outputs are in the range (-1, 1) and hence zero-centered.
   • But it still kills the gradient when it saturates, and is thus not a recommended choice.
3. ReLU (Rectified Linear Unit)
   • Equation: f(x) = max( 0 , x )
   • It is the most commonly used activation function for CNNs. It has the following advantages:
      • The gradient won’t saturate in the positive region
      • Computationally very efficient, as only simple thresholding is required
      • Empirically found to converge faster than sigmoid or tanh.
   • But it still has the following disadvantages:
      • The output is not zero-centered and is never negative
      • The gradient is killed for x<0. A few techniques like leaky ReLU and parametric ReLU are used to overcome this, and I encourage you to look them up
      • The gradient is not defined at x=0. But this can be easily handled using sub-gradients, and it poses few practical challenges as x=0 is generally a rare case

To summarize, ReLU is mostly the activation function of choice. If the caveats are kept in mind, it can be used very efficiently.
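The saturation behaviour discussed above can be seen by evaluating the derivatives at a few points. This is a small illustrative sketch; the function names are mine, and for ReLU I pick 0 as the sub-gradient at x = 0, which is one common convention and an assumption here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Sub-gradient: 0 is chosen at x = 0 (assumed convention)
    return (x > 0).astype(float)

xs = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
print("sigmoid'(x):", sigmoid_grad(xs))  # peaks at 0.25 for x = 0, nearly 0 at the ends
print("tanh'(x):   ", tanh_grad(xs))     # peaks at 1 for x = 0, nearly 0 at the ends
print("relu'(x):   ", relu_grad(xs))     # 0 for x <= 0, 1 for x > 0
```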

### Data Preprocessing

For images, generally the following preprocessing steps are done:

1. Same Size Images: All images are converted to the same size and generally in square shape.
2. Mean Centering: For each pixel position, the mean value across all images is subtracted from that pixel. Sometimes (but rarely) a single mean per red, green and blue channel is subtracted instead.

Note that normalization is generally not done in images.
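Both flavours of mean centering take a line each in NumPy. The sketch below uses random numbers in place of real images, and the array layout (images, height, width, channels) is an assumption of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for 100 RGB images of size 32x32 (random values, not real data)
images = rng.integers(0, 256, size=(100, 32, 32, 3)).astype(np.float64)

# Per-pixel mean centering: one mean per pixel position and channel
mean_image = images.mean(axis=0)               # shape (32, 32, 3)
centered = images - mean_image

# Per-channel mean centering: one mean each for red, green and blue
channel_mean = images.mean(axis=(0, 1, 2))     # shape (3,)
centered_by_channel = images - channel_mean

print(mean_image.shape, channel_mean.shape)
```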

### Weight Initialization

There are various techniques for initializing weights. Let’s consider a few of them:

1. All zeros
   • This is generally a bad idea because in this case all the neurons will generate the same output initially, and similar gradients would flow back in back-propagation
   • The results are generally undesirable as the network won’t train properly.
2. Gaussian Random Variables
   • The weights can be initialized from a random Gaussian distribution with 0 mean and a small standard deviation (0.1 to 1e-5)
   • This works for shallow networks, i.e. ~5 hidden layers, but not for deep networks
   • In deep networks, the small weights make the outputs small, and as you move towards the end the values become even smaller. Thus the gradients will also become small, resulting in gradient killing at the end.
   • Note that you need to experiment to find the standard deviation of the Gaussian distribution that works well for your network.
3. Xavier Initialization
   • It suggests that the variance of the Gaussian distribution of the weights for each neuron should depend on the number of inputs to the layer.
   • The recommended variance is 1/n_in, i.e. a standard deviation of sqrt(1/n_in). So the numpy code for initializing the weights of a layer with n_in inputs and n_out outputs is: np.random.randn(n_in, n_out)*np.sqrt(1/n_in)
   • More recent research suggests that for ReLU neurons the recommended initialization is: np.random.randn(n_in, n_out)*np.sqrt(2/n_in). Read this blog post for more details.
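The effect of the initialization scale can be seen by pushing random data through a stack of layers and looking at the spread of the final activations. The setup below is an assumption made for illustration: 10 layers of width 256, tanh activations, and 200 random input rows. The function name `final_activation_std` is hypothetical.

```python
import numpy as np

def final_activation_std(init_scale, n_layers=10, width=256, seed=0):
    """Std of the last layer's activations for a given weight scale rule."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((200, width))
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * init_scale(width)
        a = np.tanh(a @ W)
    return float(a.std())

small = final_activation_std(lambda n_in: 0.01)                 # fixed small std
xavier = final_activation_std(lambda n_in: np.sqrt(1.0 / n_in)) # Xavier scaling

print("small gaussian:", small)   # activations collapse towards 0
print("xavier:        ", xavier)  # activations keep a usable spread
```

With the fixed small standard deviation the activations shrink at every layer, which is the "values become even smaller" effect described above; with Xavier scaling they stay at a usable magnitude.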

One more thing must be remembered while using ReLU as the activation function: the weight initialization might be such that some of the neurons never get activated because of negative input. This is something that should be checked. You might be surprised to know that 10-20% of the ReLUs might be dead at a particular time while training, and even at the end.

These were just some of the concepts I discussed here. Some more concepts can be of importance like batch normalization, stochastic gradient descent, dropouts which I encourage you to read on your own.

## 4. Introduction to Convolution Neural Networks

Before going into the details, let’s first try to get some intuition into why deep networks work better.

As we learned from the drawbacks of earlier approaches, they are unable to cater to the vast amount of variations in images. Deep CNNs work by consecutively modeling small pieces of information and combining them deeper in network.

One way to understand them is that the first layer will try to detect edges and form templates for edge detection. Then subsequent layers will try to combine them into simpler shapes and eventually into templates of different object positions, illumination, scales, etc. The final layers will match an input image with all the templates and the final prediction is like a weighted sum of all of them. So, deep CNNs are able to model complex variations and behaviour giving highly accurate predictions.

There is an interesting paper on visualization of deep features in CNNs which you can go through to get more intuition – Understanding Neural Networks Through Deep Visualization.

For the purpose of explaining CNNs and finally showing an example, I will be using the CIFAR-10 dataset, which you can download from here. This dataset has 60,000 images with 10 labels and 6,000 images of each type. Each image is colored and 32×32 in size.

A CNN typically consists of 3 types of layers:

1. Convolution Layer
2. Pooling Layer
3. Fully Connected Layer

You might find some local response normalization layers in some old CNNs, but they are rarely used these days. We’ll consider the three layer types one by one.

### Convolution Layer

Since convolution layers form the crux of the network, I’ll consider them first. Each layer can be visualized in the form of a block or a cuboid. For instance in the case of CIFAR-10 data, the input layer would have the following form:

Here you can see, this is the original image which is 32×32 in height and width. The depth here is 3 which corresponds to the Red, Green and Blue colors, which form the basis of colored images. Now a convolution layer is formed by running a filter over it. A filter is another block or cuboid of smaller height and width but same depth which is swept over this base block. Let’s consider a filter of size 5x5x3.

We start this filter at the top left corner and sweep it across the image until it reaches the bottom right corner. This filter is nothing but a set of weights, i.e. 5x5x3 = 75 weights + 1 bias = 76 parameters. At each position, the weighted sum of the pixels is calculated as $W^TX + b$ and a new value is obtained. A single filter will result in a volume of size 28x28x1 as shown above.
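The sweep described above can be written as two nested loops. This is a deliberately naive sketch meant for reading, not for speed; the function name `conv_single_filter` is mine, and the all-ones image and filter are placeholders chosen so the result is easy to check by hand.

```python
import numpy as np

def conv_single_filter(image, W, b, stride=1):
    """Slide one F x F x depth filter over an H x W x depth image."""
    height, width, _ = image.shape
    F = W.shape[0]
    out_h = (height - F) // stride + 1
    out_w = (width - F) // stride + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r * stride:r * stride + F, c * stride:c * stride + F, :]
            out[r, c] = np.sum(patch * W) + b   # weighted sum over all 75 pixels plus bias
    return out

image = np.ones((32, 32, 3))   # placeholder for a CIFAR-10 sized image
W = np.ones((5, 5, 3))         # 75 weights
b = 1.0                        # 1 bias

out = conv_single_filter(image, W, b)
print(out.shape)   # (28, 28)
print(out[0, 0])   # 76.0 here, since every pixel and weight is 1
```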

Note that multiple filters are generally run at each step. Therefore, if 10 filters are used, the output would look like:

Here the filter weights are parameters which are learned during the back-propagation step. You might have noticed that we got a 28×28 block as output when the input was 32×32. Why so? Let’s look at a simpler case.

Suppose the initial image had size 6x6xd and the filter has size 3x3xd. Here I’ve kept the depth as d because it can be anything and it’s immaterial as it remains the same in both. Since depth is same, we can have a look at the front view of how filter would work:

Here we can see that the result would be a 4x4x1 volume block. Notice there is a single output for the entire depth at each location of the filter. But you need not do this visualization all the time. Let’s define a generic case where the image has dimension NxNxd and the filter has FxFxd. Also, let’s define another term, stride (S), which is the number of cells (in the above matrix) to move in each step. In the above case, we had a stride of 1 but it can be a higher value as well. So the size of the output will be:

output size = (N – F)/S + 1

You can validate the first case where N=32, F=5, S=1. The output had 28 pixels which is what we get from this formula as well. Please note that some S values might result in non-integer result and we generally don’t use such values.
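As a quick check, the formula can be wrapped in a small helper. This is a minimal sketch; `conv_output_size` is a hypothetical name, and raising an error for stride values that do not divide evenly is just one way to flag the non-integer case.

```python
def conv_output_size(n, f, s=1):
    """Return (N - F)/S + 1, or raise if the stride does not divide evenly."""
    if (n - f) % s != 0:
        raise ValueError(
            "stride %d gives a non-integer size for input %d and filter %d" % (s, n, f))
    return (n - f) // s + 1

print(conv_output_size(32, 5, 1))  # 28
print(conv_output_size(6, 3, 1))   # 4
```

For example, N=32, F=5 with S=2 would give 14.5, so the helper raises instead of returning a size.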

Let’s consider an example to consolidate our understanding. Starting with the same image as before of size 32×32, we need to apply 2 convolution steps consecutively: first 10 filters of size 7, stride 1 and next 6 filters of size 5, stride 2. Before looking at the solution below, just think about 2 things:

1. What should be the depth of each filter?
2. What will be the resulting size of the images in each step?
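One way to check your answers is to apply the output size formula step by step. The sketch below assumes no padding and keeps the result as a float so that a non-integer size stays visible; `output_size` is a hypothetical helper.

```python
def output_size(n, f, s):
    # (N - F)/S + 1, kept as a float so a non-integer result stays visible
    return (n - f) / float(s) + 1

size, depth = 32, 3                    # the 32x32x3 input image
steps = [(10, 7, 1), (6, 5, 2)]        # (number of filters, filter size, stride)
results = []
for k, f, s in steps:
    new_size = output_size(size, f, s)
    results.append((depth, new_size, k))
    print("filter depth %d -> output %s x %s x %d" % (depth, new_size, new_size, k))
    size, depth = new_size, k
```

The filter depth always matches the depth of its input: 3 for the first step and 10 for the second. The first step gives 26x26x10. For the second step the formula gives 11.5, which is exactly the non-integer situation mentioned earlier; many libraries round such a result down, giving 11x11x6.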

Notice here that the size of the images shrinks at every step. This is undesirable in deep networks, where the size would become very small too early. It would also restrict the use of large filters, as they would result in faster size reduction.

To prevent this, we generally use a stride of 1 along with zero-padding of size (F-1)/2. Zero-padding is nothing but adding additional zero-value pixels towards the border of the image.

Consider the example we saw above with 6×6 image and 3×3 filter. The required padding is (3-1)/2=1. We can visualize the padding as:

Here you can see that the image now becomes 8×8 because of padding of 1 on each side. So now the output will be of size 6×6 same as the original image.
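The same padding can be reproduced with numpy’s `np.pad`. This is a small sketch using a toy 6×6 single-channel image with made-up values.

```python
import numpy as np

image = np.arange(36).reshape(6, 6)      # toy 6x6 single-channel image
f = 3                                    # filter size
pad = (f - 1) // 2                       # (3-1)/2 = 1
padded = np.pad(image, pad, mode='constant', constant_values=0)
out_size = (padded.shape[0] - f) // 1 + 1   # stride 1
print(padded.shape, out_size)            # (8, 8) 6
```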

Now let’s summarize a convolution layer as following:

• Input size: W1 x H1 x D1
• Hyper-parameters:
  • K: #filters
  • F: filter size (FxF)
  • S: stride
• Output size: W2 x H2 x D2
  • W2 = (W1 – F)/S + 1
  • H2 = (H1 – F)/S + 1
  • D2 = K
• #parameters = (F.F.D).K + K
  • F.F.D : Number of parameters for each filter (analogous to volume of the cuboid)
  • (F.F.D).K : Volume of each filter multiplied by the number of filters
  • +K: adding K parameters for the bias term
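The parameter count is easy to verify in code. A minimal sketch, where `conv_layer_params` is a hypothetical helper and D is the depth of the layer’s input:

```python
def conv_layer_params(f, d, k):
    """(F*F*D)*K weights plus K bias terms."""
    return (f * f * d) * k + k

print(conv_layer_params(5, 3, 1))    # 76, the single 5x5x3 filter from earlier
print(conv_layer_params(5, 3, 10))   # 760, ten such filters
```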

Some additional points to be taken into consideration:

• K is usually set to a power of 2 for computational efficiency
• F is generally taken as odd number
• F=1 might sometimes be used and it makes sense because there is a depth component involved
• Filters might be called kernels sometimes

Having understood the convolution layer, let’s move on to the pooling layer.

### Pooling Layer

When we use padding in a convolution layer, the image size remains the same. So, pooling layers are used to reduce the size of the image. They work by sampling each depth slice using filters. Consider the following 4×4 layer. If we use a 2×2 filter with stride 2 and max-pooling, we get the following response:

Here you can see that each of the four 2×2 blocks is reduced to a single value, its maximum. Generally, max-pooling is used but other options like average pooling can be considered.
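Max-pooling is simple enough to sketch directly in numpy. The 4×4 values below are made up for illustration (they are not the ones from the figure), and `max_pool` is a hypothetical helper.

```python
import numpy as np

def max_pool(layer, size=2, stride=2):
    """Take the maximum of each size x size window of a 2-D layer."""
    n = layer.shape[0]
    out_n = (n - size) // stride + 1
    out = np.zeros((out_n, out_n), dtype=layer.dtype)
    for i in range(out_n):
        for j in range(out_n):
            window = layer[i * stride:i * stride + size,
                           j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

layer = np.array([[1, 3, 2, 1],
                  [4, 6, 5, 0],
                  [7, 2, 9, 8],
                  [1, 0, 3, 4]])
pooled = max_pool(layer)
print(pooled)   # [[6 5]
                #  [7 9]]
```

Swapping `window.max()` for `window.mean()` would turn this into average pooling.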

### Fully Connected Layer

At the end of convolution and pooling layers, networks generally use fully-connected layers in which each pixel is considered as a separate neuron just like a regular neural network. The last fully-connected layer will contain as many neurons as the number of classes to be predicted. For instance, in CIFAR-10 case, the last fully-connected layer will have 10 neurons.

## 5. Case Study: AlexNet

I recommend reading the prior section multiple times and getting a hang of the concepts before moving forward.

In this section, I will discuss the AlexNet architecture in detail. To give you some background, AlexNet is the winning solution of the ImageNet Challenge 2012. This is one of the most reputed computer vision challenges, and 2012 was the first time that a deep learning network was used for solving this problem.

Also, this resulted in a significantly better result as compared to previous solutions. I will share the network architecture here and review all the concepts learned above.

The detailed solution has been explained in this paper. I will explain the overall architecture of the network here. AlexNet consists of an 11-layer CNN with the following architecture:

Here you can see 11 layers between input and output. Let’s discuss each one of them individually. Note that the output of each layer will be the input of the next layer, so keep that in mind.

• Layer 0: Input image
  • Size: 227 x 227 x 3
  • Note that in the paper referenced above, the network diagram has 224x224x3 printed which appears to be a typo.
• Layer 1: Convolution with 96 filters, size 11×11, stride 4, padding 0
  • Size: 55 x 55 x 96
  • (227 – 11)/4 + 1 = 55 is the size of the outcome
  • 96 depth because 1 set denotes 1 filter and there are 96 filters
• Layer 2: Max-Pooling with 3×3 filter, stride 2
  • Size: 27 x 27 x 96
  • (55 – 3)/2 + 1 = 27 is the size of the outcome
  • Depth is the same as before, i.e. 96, because pooling is done independently on each depth slice
• Layer 3: Convolution with 256 filters, size 5×5, stride 1, padding 2
  • Size: 27 x 27 x 256
  • Because of padding of (5-1)/2=2, the original size is restored
  • 256 depth because of 256 filters
• Layer 4: Max-Pooling with 3×3 filter, stride 2
  • Size: 13 x 13 x 256
  • (27 – 3)/2 + 1 = 13 is the size of the outcome
  • Depth is the same as before, i.e. 256, because pooling is done independently on each depth slice
• Layer 5: Convolution with 384 filters, size 3×3, stride 1, padding 1
  • Size: 13 x 13 x 384
  • Because of padding of (3-1)/2=1, the original size is restored
  • 384 depth because of 384 filters
• Layer 6: Convolution with 384 filters, size 3×3, stride 1, padding 1
  • Size: 13 x 13 x 384
  • Because of padding of (3-1)/2=1, the original size is restored
  • 384 depth because of 384 filters
• Layer 7: Convolution with 256 filters, size 3×3, stride 1, padding 1
  • Size: 13 x 13 x 256
  • Because of padding of (3-1)/2=1, the original size is restored
  • 256 depth because of 256 filters
• Layer 8: Max-Pooling with 3×3 filter, stride 2
  • Size: 6 x 6 x 256
  • (13 – 3)/2 + 1 = 6 is the size of the outcome
  • Depth is the same as before, i.e. 256, because pooling is done independently on each depth slice
• Layer 9: Fully Connected with 4096 neurons
  • In this layer, each of the 6x6x256=9216 pixels is fed into each of the 4096 neurons, and the weights are determined by back-propagation.
• Layer 10: Fully Connected with 4096 neurons
  • Similar to layer #9
• Layer 11: Fully Connected with 1000 neurons
  • This is the last layer and has 1000 neurons because the ImageNet data has 1000 classes to be predicted.
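The sizes listed above can be re-derived mechanically. The sketch below uses the output size formula extended with a padding term P, i.e. (N + 2P – F)/S + 1, which reduces to the earlier formula when P = 0; `out_size` and the layer table are written out here purely for illustration.

```python
def out_size(n, f, s, p=0):
    # output size with zero-padding p on each side
    return (n + 2 * p - f) // s + 1

# (name, kind, number of filters, filter size, stride, padding)
layers = [
    ("Layer 1", "conv", 96, 11, 4, 0),
    ("Layer 2", "pool", None, 3, 2, 0),
    ("Layer 3", "conv", 256, 5, 1, 2),
    ("Layer 4", "pool", None, 3, 2, 0),
    ("Layer 5", "conv", 384, 3, 1, 1),
    ("Layer 6", "conv", 384, 3, 1, 1),
    ("Layer 7", "conv", 256, 3, 1, 1),
    ("Layer 8", "pool", None, 3, 2, 0),
]

size, depth = 227, 3      # Layer 0: the input image
shapes = []
for name, kind, k, f, s, p in layers:
    size = out_size(size, f, s, p)
    if kind == "conv":
        depth = k         # pooling leaves the depth unchanged
    shapes.append((size, size, depth))
    print("%s: %d x %d x %d" % (name, size, size, depth))

fc_inputs = size * size * depth
print("inputs to Layer 9: %d" % fc_inputs)   # 9216
```

Running it reproduces the sizes in the list, ending with the 6 x 6 x 256 = 9216 values that feed the first fully connected layer.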

I understand this is a complicated structure, but once you understand the layers, it’ll give you a much better understanding of the architecture. Note that you will find a different representation of the structure if you look at the AlexNet paper. This is because at that time GPUs were not very powerful and they used 2 GPUs for training the network. So the processing was divided between the two.

I highly encourage you to go through the other advanced solutions of ImageNet challenges after 2012 to get more ideas of how people design these networks. Some of the interesting solutions are:

• ZFNet: winner of 2013 challenge
• GoogLeNet: winner of 2014 challenge
• VGGNet: a good solution from 2014 challenge
• ResNet: winner of 2015 challenge designed by Microsoft Research Team

This video gives a brief overview and comparison of these solutions towards the end.

## 6. Implementing CNNs using GraphLab

Having understood the theoretical concepts, let’s move on to the fun part (practical) and make a basic CNN on the CIFAR-10 dataset which we’ve downloaded before.

I’ll be using GraphLab for the purpose of running algorithms. Instead of GraphLab, you are free to use alternative tools such as Torch, Theano, Keras, Caffe, TensorFlow, etc. But GraphLab allows a quick and dirty implementation as it takes care of the weight initialization and network architecture on its own.

We’ll work on the CIFAR-10 dataset which you can download from here. The first step is to load the data. This data is packed in a specific format which can be loaded using the following code:

import pandas as pd
import numpy as np
import cPickle

#Define a function to load each batch as dictionary:
def unpickle(file):
    fo = open(file, 'rb')
    dict = cPickle.load(fo)  #each CIFAR-10 batch file is a pickled dictionary
    fo.close()
    return dict

#Make dictionaries by calling the above function:
batch1 = unpickle('data/data_batch_1')
batch2 = unpickle('data/data_batch_2')
batch3 = unpickle('data/data_batch_3')
batch4 = unpickle('data/data_batch_4')
batch5 = unpickle('data/data_batch_5')
batch_test = unpickle('data/test_batch')

#Define a function to convert this dictionary into dataframe with image pixel array and labels:
def get_dataframe(batch):
    df = pd.DataFrame(batch['data'])
    df['image'] = df.values.tolist()  #one list of 3072 pixel values per row
    df.drop(range(3072),axis=1,inplace=True)
    df['label'] = batch['labels']
    return df

#Define train and test files:
train = pd.concat([get_dataframe(batch1),get_dataframe(batch2),get_dataframe(batch3),get_dataframe(batch4),get_dataframe(batch5)],ignore_index=True)
test = get_dataframe(batch_test)

We can verify this data by looking at the head and shape of the data as follows:

print train.head()

print train.shape, test.shape

Since we’ll be using graphlab, the next step is to convert this into a graphlab SFrame and run neural network. Let’s convert the data first:

import graphlab as gl
gltrain = gl.SFrame(train)
gltest = gl.SFrame(test)

GraphLab can automatically create a neural network based on the data. Let’s run that as a baseline model before going into an advanced model.

model = gl.neuralnet_classifier.create(gltrain, target='label', validation_set=None)

Here it used a simple fully connected network with 2 hidden layers and 10 neurons each. Let’s evaluate this model on test data.

model.evaluate(gltest)

As you can see, we have a pretty low accuracy of ~15%. This is because it is a very basic network. Let’s try to make a CNN now. But if we go about training a deep CNN from scratch, we will face the following challenges:

1. The available data is too little to capture all the required features
2. Training deep CNNs generally requires a GPU as a CPU is not powerful enough to perform the required calculations. Thus we won’t be able to run it on our system. We can probably rent an Amazon AWS instance.

To overcome these challenges, we can use pre-trained networks. These are nothing but networks like AlexNet which are pre-trained on many images and for which the weights of the deep layers have been determined. The only challenge is to find a pre-trained network which has been trained on images similar to the ones we want to train on. If the pre-trained network was not built on images from a similar domain, then the features will not exactly make sense and the classifier will not have a high accuracy.

Before proceeding further, we need to convert these images into the size used in ImageNet which we’re using for classification. The GraphLab model is based on 256×256 size images. So we need to convert our images to that size. Let’s do it using the following code:

#Convert pixels to graphlab image format
gltrain['glimage'] = gl.SArray(gltrain['image']).pixel_array_to_image(32, 32, 3, allow_rounding = True)
gltest['glimage'] = gl.SArray(gltest['image']).pixel_array_to_image(32, 32, 3, allow_rounding = True)
#Remove the original column
gltrain.remove_column('image')
gltest.remove_column('image')
gltrain.head()

Here we can see that a new column of type graphlab image has been created but the images are in 32×32 size. So we convert them to 256×256 using following code:

#Convert into 256x256 size
gltrain['image'] = gl.image_analysis.resize(gltrain['glimage'], 256, 256, 3)
gltest['image'] = gl.image_analysis.resize(gltest['glimage'], 256, 256, 3)
#Remove old column:
gltrain.remove_column('glimage')
gltest.remove_column('glimage')
gltrain.head()

Now we can see that the image has been converted into the desired size. Next, we will load the ImageNet pre-trained model in graphlab and use the features created in its last layer into a simple classifier and make predictions.

#Load the pre-trained model:
pretrained_model = gl.load_model('http://s3.amazonaws.com/GraphLab-Datasets/deeplearning/imagenet_model_iter45')

Now we have to use this model and extract features which will be passed into a classifier. Note that the following operations may take a lot of computing time. I use a Macbook Pro 15″ and I had to leave it for whole night!

gltrain['features'] = pretrained_model.extract_features(gltrain)
gltest['features'] = pretrained_model.extract_features(gltest)

Let’s have a look at the data to make sure we have the features:

gltrain.head()

Though we have the features with us, notice that a lot of them are zeros. You can understand this as a result of the smaller data set. ImageNet was created on 1.2 million images. So there would be many features in those images that don’t make sense for this data, thus resulting in zero outcome.

Now let’s create a classifier using graphlab. The advantage of the “classifier” function is that it will automatically create various classifiers and choose the best model.

simple_classifier = gl.classifier.create(gltrain, features = ['features'], target = 'label')

The various outputs are:

1. Boosted Trees Classifier
2. Random Forest Classifier
3. Decision Tree Classifier
4. Logistic Regression Classifier

The final model selection is based on a validation set with 5% of the data. The results are:

So we can see that Boosted Trees Classifier has been chosen as the final model. Let’s look at the results on test data:

simple_classifier.evaluate(gltest)

So we can see that the test accuracy is now ~50%. It’s a decent jump from 15% to 50% but there is still huge potential to do better. The idea here was to get you started and I will skip the next steps. Here are some things which you can try:

1. Remove the redundant features in the data
2. Perform hyper-parameter tuning in models
3. Search for pre-trained models which are trained on images similar to this dataset

You can find many open-source solutions for this dataset which give >95% accuracy. You should check those out. Please feel free to try them and post your solutions in comments below.

## Projects

Now, it’s time to take the plunge and actually play with some other real datasets. So are you ready to take on the challenge? Accelerate your deep learning journey with the following Practice Problems:

## End Notes

In this article, we covered the basics of computer vision using deep Convolutional Neural Networks (CNNs). We started by appreciating the challenges involved in designing artificial systems which mimic the eye. Then, we looked at some of the traditional techniques, prior to deep learning, and got some intuition into their drawbacks.

We moved on to understanding some aspects of tuning a neural network such as activation functions, weight initialization and data pre-processing. Next, we got some intuition into why deep CNNs should work better than traditional approaches and we understood the different elements present in a general deep CNN.

Subsequently, we consolidated our understanding by analyzing the architecture of AlexNet, the winning solution of ImageNet 2012 challenge. Finally, we took the CIFAR-10 data and implemented a CNN on it using a pre-trained AlexNet deep network.

I hope you liked this article. Did you find it useful? Please feel free to share your feedback through comments below. And to gain expertise in working with neural networks, try out the deep learning practice problem – Identify the Digits.

## Optimizing Neural Networks using Keras (with Image recognition case study)

In my previous article, I discussed the implementation of neural networks using TensorFlow. Continuing the series of articles on neural network libraries, I have decided to throw light on Keras – supposedly the best deep learning library so far.

I have been working on deep learning for some time now and, according to me, the most difficult thing when dealing with Neural Networks is the never-ending range of parameters to tune. With increase in depth of a Neural Network, it becomes increasingly difficult to take care of all the parameters. Mostly, people rely on intuition and experience to tune them. In reality, research is still very active on this topic.

Thankfully we have Keras, which takes care of a lot of this hard work and provides an easier interface!

In this article, I am going to share my experience of working in deep learning. We will begin with an overview of Keras, its features and what differentiates it from other libraries. We will then look at a simple implementation of neural networks in Keras. And then, I will take you through a hands-on exercise on parameter tuning in neural networks.

## 1. Keras : Overview

Keras is a high level library, used specially for building neural network models. It is written in Python and is compatible with both Python – 2.7 & 3.5. Keras was specifically developed for fast execution of ideas. It has a simple and highly modular interface, which makes it easier to create even complex neural network models. This library abstracts low level libraries, namely Theano and TensorFlow so that, the user is free from “implementation details” of these libraries.

The key features of Keras are:

• Modularity : Modules necessary for building a neural network are included in a simple interface so that Keras is easier to use for the end user.
• Minimalistic : Implementation is short and concise.
• Extensibility : It’s very easy to write a new module for Keras, which makes it suitable for advanced research.

Being a high level library with a simpler interface, Keras certainly shines as one of the best deep learning libraries available. A few features of Keras which stand out in comparison with other libraries are:

• In comparison to Theano and TensorFlow, it takes in all the advantages of both of these libraries and tries to give a better “user experience”.
• As Keras is a python library, it is more accessible to general public because of Python’s inherent simplicity as a programming language.
• A similar library in comparison to Keras is Lasagne, but having used both I can say that Keras is much easier to use.

Given the above reasons, it is no surprise that Keras is increasingly becoming popular as a deep learning library.

## 3. Keras : Limitations

• I think that having a dependency on low level libraries like Theano / TensorFlow is a double edged sword. This is because Keras cannot go “out of the realms” of these libraries. For example, both Theano and TensorFlow do not support GPUs other than Nvidia (currently). And hence, Keras too doesn’t have the corresponding support.
• Also unlike Lasagne, Keras completely abstracts the low level languages. So, it is less flexible when it comes to building custom operations.
• The last point I’ll make is that Keras is relatively new. The first version was released in early 2015, and it has undergone many changes since then. Although Keras is already used in production, you should think twice before deploying Keras models to production.

## 4. General way to solve problems with Neural Networks

Neural networks are a special type of machine learning (ML) algorithm. So, like every ML algorithm, they follow the usual ML workflow of data preprocessing, model building and model evaluation. For the sake of conciseness, I have listed out a To-Do list of how to approach a Neural Network problem.

• Check if it is a problem where Neural Network gives you uplift over traditional algorithms (refer to the checklist in the section above)
• Do a survey of which Neural Network architecture is most suitable for the required problem
• Define Neural Network architecture through whichever language / library you choose.
• Convert data to right format and divide it in batches
• Pre-process the data according to your needs
• Augment Data to increase size and make better trained models
• Feed batches to Neural Network
• Train and monitor changes in training and validation data sets
• Test your model, and save it for future use

## 5. Starting with a simple Keras implementation on “Identify the Digits”

Before starting this experiment, make sure you have Keras installed on your system. Refer to the official installation guide. We will use TensorFlow as the backend, so make sure this is set in your config file. If not, follow the steps given here.

Here, we solve our deep learning practice problem – Identify the Digits.  Let’s take a look at our problem statement:

Our problem is an image recognition problem, to identify digits from a given 28 x 28 image. We have a subset of images for training and the rest for testing our model. So first, download the train and test files. The dataset contains a zipped file of all the images and both the train.csv and test.csv have the name of corresponding train and test images. Any additional features are not provided in the datasets, just the raw images are provided in ‘.png’ format.

Let’s start:

a) Import all the necessary libraries

%pylab inline
import os
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

import tensorflow as tf
import keras



b) Let’s set a seed value, so that we can control our model’s randomness

# To stop potential randomness
seed = 128
rng = np.random.RandomState(seed)

c) The first step is to set directory paths, for safekeeping!

root_dir = os.path.abspath('../..')
data_dir = os.path.join(root_dir, 'data')
sub_dir = os.path.join(root_dir, 'sub')

# check for existence
os.path.exists(root_dir)
os.path.exists(data_dir)
os.path.exists(sub_dir)

### STEP 1: Data Loading and Preprocessing

a) Now let us read our datasets. These are in .csv formats, and have a filename along with the appropriate labels

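train = pd.read_csv(os.path.join(data_dir, 'Train', 'train.csv'))
# the location of test.csv is assumed here; point this at wherever your copy lives
test = pd.read_csv(os.path.join(data_dir, 'Train', 'test.csv'))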

train.head()

b) Let us see what our data looks like! We read our image and display it.

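from PIL import Image  # Pillow, used here to read the PNG file

img_name = rng.choice(train.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)

img = np.array(Image.open(filepath).convert('L'))  # grayscale array, values 0-255

pylab.imshow(img, cmap='gray')
pylab.axis('off')
pylab.show()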

c) The above image is represented as numpy array, as seen below

d) For easier data manipulation, let’s store all our images as numpy arrays

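from PIL import Image  # Pillow, used here to read the PNG files

temp = []
for img_name in train.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
    img = np.array(Image.open(image_path).convert('L'))  # grayscale, values 0-255
    img = img.astype('float32')
    temp.append(img)

train_x = np.stack(temp)

train_x /= 255.0
train_x = train_x.reshape(-1, 784).astype('float32')

temp = []
for img_name in test.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'test', img_name)
    img = np.array(Image.open(image_path).convert('L'))  # grayscale, values 0-255
    img = img.astype('float32')
    temp.append(img)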

test_x = np.stack(temp)

test_x /= 255.0
test_x = test_x.reshape(-1, 784).astype('float32')

train_y = keras.utils.np_utils.to_categorical(train.label.values)
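The last line converts each integer label into a one-hot vector of length 10, which is the form the softmax output layer is compared against. The same idea can be sketched in plain numpy; the labels below are made up, and this is an illustration of the encoding rather than Keras’ own implementation.

```python
import numpy as np

labels = np.array([4, 9, 1, 7])        # hypothetical digit labels
num_classes = 10
one_hot = np.eye(num_classes)[labels]  # pick row i of the identity matrix for label i
print(one_hot.shape)                   # (4, 10)
print(one_hot[0])                      # a single 1 at index 4, zeros elsewhere
```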

e) As this is a typical ML problem, to test the proper functioning of our model we create a validation set. Let’s take a split size of 70:30 for train set vs validation set

split_size = int(train_x.shape[0]*0.7)

train_x, val_x = train_x[:split_size], train_x[split_size:]
train_y, val_y = train_y[:split_size], train_y[split_size:]
train.label.iloc[split_size:]  # labels of the rows that went into the validation set

### STEP 2: Model Building

a) Now comes the main part! Let us define our neural network architecture. We define a neural network with 3 layers: input, hidden and output. The number of neurons in input and output are fixed, as the input is our 28 x 28 image and the output is a 10 x 1 vector representing the class. We take 50 neurons in the hidden layer. Here, we use Adam as our optimization algorithm, which is an efficient variant of the Gradient Descent algorithm. There are a number of other optimizers available in keras (refer here). In case you don’t understand any of these terminologies, check out the article on fundamentals of neural network to know more in depth of how it works.

# define vars
input_num_units = 784
hidden_num_units = 50
output_num_units = 10

epochs = 5
batch_size = 128

# import keras modules

from keras.models import Sequential
from keras.layers import Dense

# create model
model = Sequential([
Dense(output_dim=hidden_num_units, input_dim=input_num_units, activation='relu'),
Dense(output_dim=output_num_units, input_dim=hidden_num_units, activation='softmax'),
])

# compile the model with necessary attributes
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

b) It’s time to train our model

trained_model = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))
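To make the architecture less of a black box, here is a rough numpy sketch of the forward pass that a 784-50-10 network with relu and softmax activations computes. It is not how Keras is implemented: the weights are random and untrained, the input batch is random noise standing in for flattened images, and all names are chosen just for this illustration.

```python
import numpy as np

rng = np.random.RandomState(128)
input_num_units, hidden_num_units, output_num_units = 784, 50, 10

# randomly initialised parameters (untrained, for illustration only)
W1 = rng.randn(input_num_units, hidden_num_units) * 0.01
b1 = np.zeros(hidden_num_units)
W2 = rng.randn(hidden_num_units, output_num_units) * 0.01
b2 = np.zeros(output_num_units)

def forward(x):
    hidden = np.maximum(0, x.dot(W1) + b1)          # relu on the hidden layer
    scores = hidden.dot(W2) + b2
    scores -= scores.max(axis=1, keepdims=True)     # for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)     # softmax over the 10 classes

batch = rng.rand(128, input_num_units)              # stand-in for 128 flattened images
probs = forward(batch)
n_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape, n_params)                        # (128, 10) 39760
```

Each row of `probs` sums to 1, and training adjusts the 39,760 weights and biases so that the largest probability lands on the correct digit.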

### STEP 3: Model Evaluation

a) To test our model with our own eyes, let’s visualize its predictions

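from PIL import Image  # Pillow, used here to read the PNG file

pred = model.predict_classes(test_x)

img_name = rng.choice(test.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'test', img_name)

img = np.array(Image.open(filepath).convert('L'))  # grayscale array, values 0-255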

test_index = int(img_name.split('.')[0]) - train.shape[0]

print "Prediction is: ", pred[test_index]

pylab.imshow(img, cmap='gray')
pylab.axis('off')
pylab.show()

Prediction is:  8


b) We see that our model performs well despite being very simple. Now we create a submission with our model

# build the submission table from the test filenames and the predicted labels
sample_submission = pd.DataFrame({'filename': test.filename, 'label': pred})
sample_submission.to_csv(os.path.join(sub_dir, 'sub02.csv'), index=False)

## 6. Hyperparameters to look out for in Neural Networks

I feel that hyperparameter tuning is harder in neural networks than in any other machine learning algorithm. You would be insane to apply Grid Search, as there are numerous parameters when it comes to tuning a neural network.

Note: I have discussed a few more details, on when to apply neural networks in the following article An Introduction to Implementing Neural Networks using TensorFlow

Some important parameters to look out for while optimizing neural networks are:

• Type of architecture
• Number of Layers
• Number of Neurons in a layer
• Regularization parameters
• Learning Rate
• Type of optimization / backpropagation technique to use
• Dropout rate
• Weight sharing

Also, there may be many more hyperparameters depending on the type of architecture. For example, if you use a convolutional neural network, you would have to look at hyperparameters like convolutional filter size, pooling value, etc.
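To get a feel for why a plain grid search becomes impractical, it helps to count the combinations. The grid below is entirely hypothetical (both the hyperparameter names and the candidate values are made up for illustration), and every combination would mean one full training run.

```python
from itertools import product

# hypothetical candidate values for a handful of hyperparameters
grid = {
    "num_layers": [1, 2, 3, 5],
    "neurons_per_layer": [50, 100, 500],
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout_rate": [0.0, 0.2, 0.5],
    "optimizer": ["sgd", "adam", "rmsprop"],
}

combinations = list(product(*grid.values()))
print(len(combinations))   # 324
```

Even this small grid needs 324 training runs, and adding one more hyperparameter with three candidate values triples that number.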

The best way to pick good parameters is to understand your problem domain. Research the previously applied techniques on your data, and most importantly, ask experienced people for insights into the problem. It’s the only way you can try to ensure you get a “good enough” neural network model.

Here are some resources for tips and tricks for training neural networks. (Resource 1, Resource 2, Resource 3)

## 7. Getting your hands dirty

Let us take our knowledge of hyperparameters and start tweaking our neural network model.

• As we did before, we redo all the pre-requisite things. Let’s import the modules
%pylab inline

import os
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
import tensorflow as tf
import keras

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, Convolution2D, Flatten, MaxPooling2D, Reshape, InputLayer
• As before, set seed value
# To stop potential randomness
seed = 128
rng = np.random.RandomState(seed)
• Set paths for further use
root_dir = os.path.abspath('../..')
data_dir = os.path.join(root_dir, 'data')
sub_dir = os.path.join(root_dir, 'sub')

# check for existence
os.path.exists(root_dir)
os.path.exists(data_dir)
os.path.exists(sub_dir)
• Read the datasets and convert them to usable form
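from PIL import Image  # Pillow, used here to read the PNG files

train = pd.read_csv(os.path.join(data_dir, 'Train', 'train.csv'))
# the location of test.csv is assumed here; point this at wherever your copy lives
test = pd.read_csv(os.path.join(data_dir, 'Train', 'test.csv'))

temp = []
for img_name in train.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
    img = np.array(Image.open(image_path).convert('L'))  # grayscale, values 0-255
    img = img.astype('float32')
    temp.append(img)

train_x = np.stack(temp)

train_x /= 255.0
train_x = train_x.reshape(-1, 784).astype('float32')

temp = []
for img_name in test.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'test', img_name)
    img = np.array(Image.open(image_path).convert('L'))  # grayscale, values 0-255
    img = img.astype('float32')
    temp.append(img)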

test_x = np.stack(temp)

test_x /= 255.0
test_x = test_x.reshape(-1, 784).astype('float32')

train_y = keras.utils.np_utils.to_categorical(train.label.values)
• Divide our train data into training and validation
split_size = int(train_x.shape[0]*0.7)

train_x, val_x = train_x[:split_size], train_x[split_size:]
train_y, val_y = train_y[:split_size], train_y[split_size:]
• Let’s start our tweaking! Let’s change our model to be “wide”, i.e. increase the number of neurons in our hidden layer
# define vars
input_num_units = 784
hidden_num_units = 500
output_num_units = 10
epochs = 5
batch_size = 128

model = Sequential([
Dense(output_dim=hidden_num_units, input_dim=input_num_units, activation='relu'),

Dense(output_dim=output_num_units, input_dim=hidden_num_units, activation='softmax'),
])
•  Let’s test this model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

trained_model_500 = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))
• We see that this model performs significantly better than before! Now instead of “wide”, we try making our model “deep”. We add four more hidden layers with 50 neurons each
# define vars
input_num_units = 784
hidden1_num_units = 50
hidden2_num_units = 50
hidden3_num_units = 50
hidden4_num_units = 50
hidden5_num_units = 50
output_num_units = 10

epochs = 5
batch_size = 128

model = Sequential([
Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu'),
Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu'),
Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu'),
Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu'),

Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])
•  Any guesses on how this model would perform?
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

trained_model_5d = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))
• Looks like we didn’t get what we expected. This may be because our model may be overfitting. To deal with this, we use a method called dropout. Dropout is essentially randomly turning off parts of the model so that it does not “overlearn” a concept (To read more about dropout, check out the article on core concepts of neural networks)
# define vars
input_num_units = 784
hidden1_num_units = 50
hidden2_num_units = 50
hidden3_num_units = 50
hidden4_num_units = 50
hidden5_num_units = 50
output_num_units = 10

epochs = 5
batch_size = 128

dropout_ratio = 0.2

model = Sequential([
Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
Dropout(dropout_ratio),
Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu'),
Dropout(dropout_ratio),
Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu'),
Dropout(dropout_ratio),
Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu'),
Dropout(dropout_ratio),
Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu'),
Dropout(dropout_ratio),

Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])
• Now let’s check our accuracy
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

trained_model_5d_with_drop = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))
• Something seems off. Our model is not performing well enough. One reason may be that we are not training it to its full potential. Let’s increase our training epochs to 50 and check it out!
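Before simply training longer, it can help to look at the numbers. The object returned by `model.fit` keeps per-epoch metrics in its `history` attribute, a dictionary of lists. The helper below is a rough sketch, not part of Keras: `summarize_history` is a name I made up, the example numbers are invented, and the metric keys are an assumption (older Keras versions use `acc` / `val_acc`, newer ones `accuracy` / `val_accuracy`).

```python
def summarize_history(history):
    """Summarize per-epoch accuracy lists from a Keras-style history dict."""
    # Assumption: the key is 'acc' in older Keras and 'accuracy' in newer ones.
    train_key = 'acc' if 'acc' in history else 'accuracy'
    train_acc = history[train_key]
    val_acc = history['val_' + train_key]

    # Index of the epoch with the highest validation accuracy.
    best = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return {
        'best_epoch': best + 1,                        # 1-based epoch number
        'best_val_acc': val_acc[best],
        'final_gap': train_acc[-1] - val_acc[-1],      # large positive gap hints at overfitting
        'still_improving': best == len(val_acc) - 1,   # best epoch was the last one
    }


# Invented numbers, only to show the shape of the input.
example = {
    'acc': [0.60, 0.75, 0.82, 0.86, 0.88],
    'val_acc': [0.65, 0.78, 0.84, 0.87, 0.89],
}
summary = summarize_history(example)
print(summary)
```

You could call it as `summarize_history(trained_model_5d_with_drop.history)`. If validation accuracy is still rising in the last epoch, more epochs are probably worth trying.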
# define vars
input_num_units = 784
hidden1_num_units = 50
hidden2_num_units = 50
hidden3_num_units = 50
hidden4_num_units = 50
hidden5_num_units = 50
output_num_units = 10

epochs = 50
batch_size = 128
model = Sequential([
Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
Dropout(0.2),
Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu'),
Dropout(0.2),
Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu'),
Dropout(0.2),
Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu'),
Dropout(0.2),
Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu'),
Dropout(0.2),

Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])
• Well I’m excited to see what will happen. Are you?
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

trained_model_5d_with_drop_more_epochs = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))
• Yes! This is good. We see an increase in accuracy. (As an optional assignment, you could try increasing the number of epochs to train even longer.) Let’s try another thing: we make our model both deep and wide! We also apply all the tweaks we learnt before. To get results faster, we reduce the training epochs to 25, but you are free to increase them if you want.
# define vars
input_num_units = 784
hidden1_num_units = 500
hidden2_num_units = 500
hidden3_num_units = 500
hidden4_num_units = 500
hidden5_num_units = 500
output_num_units = 10

epochs = 25
batch_size = 128

model = Sequential([
Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
Dropout(0.2),
Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu'),
Dropout(0.2),
Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu'),
Dropout(0.2),
Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu'),
Dropout(0.2),
Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu'),
Dropout(0.2),

Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])
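To get a feel for how much bigger this model is, we can count its trainable parameters by hand. A fully connected layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and dropout layers add none. The helper name `dense_param_count` is just for illustration.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


narrow = dense_param_count([784, 50, 50, 50, 50, 50, 10])
wide = dense_param_count([784, 500, 500, 500, 500, 500, 10])
print(narrow)  # 49960
print(wide)    # 1399510
```

The wide model has roughly 28 times as many parameters as the 50-unit one, which is a big part of why each epoch takes longer.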
• Forgive me for the spoilers, but it’s clear that this model should do better than all our previous models. Still, let’s check it out.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

trained_model_deep_n_wide = model.fit(train_x, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x, val_y))
• Seems like we broke all the records! Let’s submit this model to the solution checker.
# predict a class label for every test image
pred = model.predict_classes(test_x)
# fill in the sample submission and save it
sample_submission.filename = test.filename
sample_submission.label = pred
sample_submission.to_csv(os.path.join(sub_dir, 'sub03.csv'), index=False)
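In case you are wondering what `predict_classes` does: the softmax layer outputs one probability per digit, and the predicted class is simply the index of the largest one. Here is a small pure-Python sketch of that step; the function name and the probabilities are made up for illustration. As far as I know, newer Keras releases no longer include `predict_classes`, so there you would apply an argmax to the output of `model.predict` instead.

```python
def classes_from_probabilities(prob_rows):
    """Return the index of the largest probability in each row."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in prob_rows]


# Invented softmax outputs for three samples and three classes.
probs = [
    [0.05, 0.90, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
]
labels = classes_from_probabilities(probs)
print(labels)  # [1, 0, 2]
```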
• As a last tweak, we will try changing the type of our model. Until now we have built multilayer perceptrons (MLPs). Let’s now switch to a convolutional neural network (CNN). (For an in-depth introduction to CNNs, go through this article.) A CNN needs its input arranged in a specific format, as a grid of pixels with a channel dimension rather than a flat vector, so let’s reshape our data and feed it to our CNN.
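To picture what this reshaping means: each image is stored as a flat row of 784 pixel values, and the CNN wants it as 28 rows by 28 columns with 1 channel. The sketch below does the same rearrangement for a single image with plain Python lists, assuming the row-major order that NumPy’s `reshape` uses by default. The function name `reshape_flat_image` is hypothetical.

```python
def reshape_flat_image(flat, height=28, width=28):
    """Turn a flat list of height * width pixels into [row][col][channel]."""
    if len(flat) != height * width:
        raise ValueError('expected %d values, got %d' % (height * width, len(flat)))
    # Row-major order: pixel (r, c) sits at flat index r * width + c.
    return [[[flat[r * width + c]] for c in range(width)]
            for r in range(height)]


flat = list(range(784))  # stand-in for one flattened image
image = reshape_flat_image(flat)
print(len(image), len(image[0]), len(image[0][0]))  # 28 28 1
print(image[1][0][0])                               # 28, first pixel of the second row
```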
# reshape data

train_x_temp = train_x.reshape(-1, 28, 28, 1)
val_x_temp = val_x.reshape(-1, 28, 28, 1)

# define vars
input_shape = (784,)
input_reshape = (28, 28, 1)

conv_num_filters = 5
conv_filter_size = 5

pool_size = (2, 2)

hidden_num_units = 50
output_num_units = 10

epochs = 5
batch_size = 128

model = Sequential([
InputLayer(input_shape=input_reshape),

Convolution2D(25, 5, 5, activation='relu'),
MaxPooling2D(pool_size=pool_size),

Convolution2D(25, 5, 5, activation='relu'),
MaxPooling2D(pool_size=pool_size),

Convolution2D(25, 4, 4, activation='relu'),

Flatten(),

Dense(output_dim=hidden_num_units, activation='relu'),

Dense(output_dim=output_num_units, input_dim=hidden_num_units, activation='softmax'),
])

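# compile the model before training, as with the earlier models
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

trained_model_conv = model.fit(train_x_temp, train_y, nb_epoch=epochs, batch_size=batch_size, validation_data=(val_x_temp, val_y))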

This result blows your mind, doesn’t it? Even with such a short training time, the performance is way better! This shows that a better architecture can certainly boost your performance when dealing with neural networks.
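It is also worth checking how the image shrinks as it passes through the convolution and pooling layers of this model. Assuming the convolutions use no padding and a stride of 1 (the Keras defaults for these layers, as far as I know) and each pooling layer halves the width and height, we can trace the size by hand. The helper names are made up for this sketch.

```python
def conv_out(size, kernel):
    """Output width of a convolution with no padding and stride 1."""
    return size - kernel + 1


def pool_out(size, pool):
    """Output width of non-overlapping max pooling."""
    return size // pool


# The convolution and pooling layers of the CNN, in order.
layers = [
    ('conv 5x5', conv_out, 5),
    ('pool 2x2', pool_out, 2),
    ('conv 5x5', conv_out, 5),
    ('pool 2x2', pool_out, 2),
    ('conv 4x4', conv_out, 4),
]

size = 28        # the input images are 28 x 28
trace = [size]
for name, op, k in layers:
    size = op(size, k)
    trace.append(size)
    print(name, '->', size, 'x', size)

num_filters = 25  # filters in the last convolution layer
flattened_units = size * size * num_filters
print('units after Flatten:', flattened_units)  # 25
```

Under these assumptions the last convolution reduces each feature map to a single value, so `Flatten` hands just 25 numbers to the dense layers.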

It’s time to let go of the training wheels. There are many things you can try and many tweaks to make. Try this on your end and let us know how it goes!

## 8. Where to go from here?

Now you have a basic overview of Keras and hands-on experience of implementing neural networks. There is still much more you can do. For example, I really like the use of Keras to build image analogies. In this project, the authors train a neural network to understand an image and transfer the learnt attributes to another image. The first two images are given as input: the model trains on the first image and, when given the second image as input, produces the third image as output.

Neural network tuning is still considered a “dark art”, so don’t expect to get the best model on your first try. Build, evaluate and reiterate: this is how you become a better neural network practitioner.

Another point you should know is that there are other ways to get a “good enough” neural network model without training it from scratch. Techniques like pre-training and transfer learning are essential to know when you are implementing neural network models to solve real-life problems.

## End Notes

I hope you found this article helpful. Now, it’s time for you to practice and read as much as you can. Good luck! If you have any recommendations or suggestions on neural networks, I’d love to interact with you in the comments. If you have any more doubts or queries, feel free to drop them in the comments below. Try out the practice problem Identify the Digits yourself and let me know what your experience was.

Fundamentals of Deep Learning – Starting with Artificial Neural Network
https://www.analyticsvidhya.com/blog/2016/03/introduction-deep-learning-fundamentals-neural-networks/

Deep Learning for Computer Vision – Introduction to Convolution Neural Networks
https://www.analyticsvidhya.com/blog/2016/04/deep-learning-computer-vision-introduction-convolution-neural-networks/

Optimizing Neural Networks Tutorial using Keras (Image recognition)
https://www.analyticsvidhya.com/blog/2016/10/tutorial-optimizing-neural-networks-using-keras-with-image-recognition-case-study/

##### Amir Masoud Sefidian
Data Scientist, Machine Learning Engineer, Researcher, Software Developer