In a standard Convolutional Neural Network, an input image is passed through the network to produce a predicted label, and the forward pass is pretty straightforward, as shown in the image below:
Each convolutional layer except the first one (which takes in the input image) takes in the output of the previous convolutional layer and produces an output feature map that is then passed to the next convolutional layer. For L layers, there are L direct connections – one between each layer and its subsequent layer.
The DenseNet architecture is all about modifying this standard CNN architecture like so:
In a DenseNet architecture, each layer is connected to every other layer, hence the name Densely Connected Convolutional Network. For L layers, there are L(L+1)/2 direct connections: for each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs into all subsequent layers.
That is really it. As simple as it may sound, DenseNets essentially connect every layer to every other layer; this is the main idea, and it is extremely powerful. The input of a layer inside DenseNet is the concatenation of the feature maps from all previous layers.
DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
Okay, so now we know that the input of the \(\ell\)-th layer is the concatenation of the feature maps from layers \(1, 2, \ldots, \ell-1\), but is this concatenation even possible?
At this point in time, I want you to think about whether we can concatenate the features from the first layer of a DenseNet with the last layer of the DenseNet? If we can, why? If we can’t, what do we need to do to make this possible?
So, here’s what I think – it would not be possible to concatenate the feature maps if the size of feature maps is different. So, to be able to perform the concatenation operation, we need to make sure that the size of the feature maps that we are concatenating is the same. Right?
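To see this concretely, here is a minimal sketch (with made-up shapes): torch.cat along the channel dimension works only when the spatial dimensions agree.

import torch

a = torch.randn(1, 64, 56, 56)           # feature maps from an earlier layer
b = torch.randn(1, 32, 56, 56)           # same spatial size, different channel count
print(torch.cat([a, b], dim=1).shape)    # torch.Size([1, 96, 56, 56])

c = torch.randn(1, 32, 28, 28)           # a downsampled feature map
try:
    torch.cat([a, c], dim=1)             # 56x56 vs 28x28 -- cannot be concatenated
except RuntimeError as err:
    print('Concatenation failed:', err)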
But we can’t just keep the feature maps the same size throughout the network – an essential part of convolutional networks is down-sampling layers that change the size of feature maps. For example, look at the VGG architecture below:
The input of shape 224x224x3 is downsampled to 7x7x512 towards the end of the network.
To facilitate both down-sampling and feature concatenation, the authors divided the network into multiple dense blocks; inside a dense block, the feature map size remains the same.
Dividing the network into densely connected blocks solves the problem that we discussed above: the Convolution + Pooling operations outside the dense blocks perform the downsampling, while inside a dense block all feature maps have the same size, so feature concatenation is possible.
The authors refer to the layers between the dense blocks as transition layers, which do the convolution and pooling. From the paper, we know that the transition layers used in the DenseNet architecture consist of a batch-norm layer and a 1×1 convolution followed by a 2×2 average pooling layer.
Given that the transition layers are pretty easy, let’s quickly implement them here:
# imports used by the code snippets throughout this post
from collections import OrderedDict
import torch
import torch.nn.functional as F
from torch import nn, Tensor

class _Transition(nn.Sequential):
    def __init__(self, num_input_features, num_output_features):
        super(_Transition, self).__init__()
        self.add_module('norm', nn.BatchNorm2d(num_input_features))
        self.add_module('relu', nn.ReLU(inplace=True))
        self.add_module('conv', nn.Conv2d(num_input_features, num_output_features,
                                          kernel_size=1, stride=1, bias=False))
        self.add_module('pool', nn.AvgPool2d(kernel_size=2, stride=2))
Essentially, the 1x1 conv reduces the number of channels from num_input_features to num_output_features, and the 2x2 average pooling with stride 2 then halves the spatial dimensions, performing the actual downsampling.
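As a quick sanity check (the shapes below are illustrative, with num_output_features set to half of num_input_features as the paper does between blocks), a _Transition halves both the channel count and the spatial resolution:

t = _Transition(num_input_features=256, num_output_features=128)
x = torch.randn(1, 256, 28, 28)
print(t(x).shape)   # torch.Size([1, 128, 14, 14]) -- channels reduced by the 1x1 conv, H and W halved by the pooling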
Let’s consider a network with L layers, each of which performs a non-linear transformation \(H_\ell\). The output of the \(\ell\)-th layer of the network is denoted as \(x_\ell\) and the input image is represented as \(x_0\).
We know that traditional feed-forward networks connect the output of the \(\ell\)-th layer to the \((\ell+1)\)-th layer, i.e. \(x_\ell = H_\ell(x_{\ell-1})\). A ResNet-style skip connection can be represented as:
\(x_\ell = H_\ell(x_{\ell-1}) + x_{\ell-1}\)
In the DenseNet architecture, the dense connectivity can be represented as:
\(x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])\)
where \([x_0, x_1, \ldots, x_{\ell-1}]\) represents the concatenation of the feature maps produced by layers \(0, 1, \ldots, \ell-1\).
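To make the dense connectivity equation concrete, here is a minimal sketch in which each \(H_\ell\) is stood in by a single 3x3 convolution (the channel counts are illustrative, not from the paper):

growth_rate = 32
x0 = torch.randn(1, 64, 28, 28)                 # x_0: input feature maps to the block
features = [x0]
H = nn.ModuleList([
    nn.Conv2d(64 + i * growth_rate, growth_rate, kernel_size=3, padding=1)
    for i in range(3)
])
for H_l in H:                                   # x_l = H_l([x_0, x_1, ..., x_{l-1}])
    x_l = H_l(torch.cat(features, dim=1))
    features.append(x_l)
print(torch.cat(features, dim=1).shape)         # torch.Size([1, 160, 28, 28])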
Now that we understand that a DenseNet architecture is divided into multiple dense blocks, let’s look at a single dense block in a little more detail. Essentially, we know that inside a dense block, each layer is connected to every other layer and the feature map size remains the same.
Let’s try and understand what’s really going on inside a dense block. We have some gray input features that are then passed to LAYER_0. LAYER_0 performs a non-linear transformation to add purple features to the gray features. These are then used as input to LAYER_1, which performs a non-linear transformation to also add orange features to the gray and purple ones. And so on, until the final output of this 3-layer dense block is a concatenation of gray, purple, orange, and green features.
In a dense block, each layer adds some features on top of the existing feature maps.
Therefore, as you can see, the size of the feature map grows after a pass through each dense layer, because the new features are concatenated to the existing ones. One can think of the features as a global state of the network, with each layer adding K features on top of it. This parameter K is referred to as the growth rate of the network.
We already know by now from the figure that DenseNets are divided into multiple DenseBlocks. The various DenseNet architectures are summarized in the paper.
Each architecture consists of four DenseBlocks with a varying number of layers. For example, DenseNet-121 has [6, 12, 24, 16] layers in the four dense blocks, whereas DenseNet-169 has [6, 12, 32, 32] layers.
We can see that the first part of the DenseNet architecture consists of a 7x7 stride-2 Conv layer followed by a 3x3 stride-2 MaxPooling layer. The fourth dense block is followed by a Classification Layer that accepts the feature maps of all layers of the network to perform the classification.
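As a small sketch (using the standard 224x224 ImageNet input, and the layer choices listed in the table), the stem therefore reduces the spatial resolution by a factor of 4 before the first dense block:

stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
print(stem(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 64, 56, 56])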
Also, the convolution operations inside each of these architectures use bottleneck layers: the 1×1 conv first reduces the number of channels in the input, and the 3×3 conv then operates on this reduced-channel version of the input rather than on the full input.
By now, we know that each layer produces K feature maps, which are then concatenated to the previous feature maps. Therefore, the number of inputs is quite high, especially for later layers in the network. This has huge computational requirements, and to make the network more efficient, the authors decided to utilize bottleneck layers.
1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. In our experiments, we let each 1×1 convolution produce 4k feature-maps.
We know K refers to the growth rate, so what the authors settled on is for the 1x1 conv to first produce 4*K feature maps, and to then perform the 3x3 conv on these 4*K feature maps.
We are now ready and have all the building blocks to implement DenseNet in PyTorch.
The first thing we need is to implement the dense layer inside a dense block.
class _DenseLayer(nn.Module):
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate, memory_efficient=False):
        super(_DenseLayer, self).__init__()
        self.add_module('norm1', nn.BatchNorm2d(num_input_features))
        self.add_module('relu1', nn.ReLU(inplace=True))
        self.add_module('conv1', nn.Conv2d(num_input_features, bn_size * growth_rate,
                                           kernel_size=1, stride=1, bias=False))
        self.add_module('norm2', nn.BatchNorm2d(bn_size * growth_rate))
        self.add_module('relu2', nn.ReLU(inplace=True))
        self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate,
                                           kernel_size=3, stride=1, padding=1, bias=False))
        self.drop_rate = float(drop_rate)
        self.memory_efficient = memory_efficient

    def bn_function(self, inputs):
        "Bottleneck function: concatenate the incoming feature maps and apply BN-ReLU-1x1 conv"
        concated_features = torch.cat(inputs, 1)
        bottleneck_output = self.conv1(self.relu1(self.norm1(concated_features)))
        return bottleneck_output

    def forward(self, input):
        if isinstance(input, Tensor):
            prev_features = [input]
        else:
            prev_features = input
        bottleneck_output = self.bn_function(prev_features)
        new_features = self.conv2(self.relu2(self.norm2(bottleneck_output)))
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate,
                                     training=self.training)
        return new_features
A DenseLayer accepts a list of previous feature maps as input, concatenates them, and performs bn_function on the concatenated feature maps to get bottleneck_output; this is done for computational efficiency. Finally, the 3x3 convolution is performed to get new_features, which are K (growth_rate) feature maps.
It should now be easy to map the above implementation with the figure shown below for reference again:
Let’s say the above is an implementation of LAYER_2. First, LAYER_2 accepts the gray, purple, and orange feature maps and concatenates them. Next, LAYER_2 performs the bottleneck operation on them to create bottleneck_output for computational efficiency. Finally, the layer performs the \(H_\ell\) operation as in the equation above to generate new_features. These new_features are the green features in the figure.
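As a small usage sketch (the channel counts are just for illustration), passing a list of three feature maps to a _DenseLayer returns growth_rate new feature maps, regardless of how many maps come in:

layer_2 = _DenseLayer(num_input_features=128, growth_rate=32, bn_size=4, drop_rate=0)
gray   = torch.randn(1, 64, 28, 28)    # the block input
purple = torch.randn(1, 32, 28, 28)    # output of LAYER_0
orange = torch.randn(1, 32, 28, 28)    # output of LAYER_1
green  = layer_2([gray, purple, orange])
print(green.shape)                      # torch.Size([1, 32, 28, 28]) -- K new feature maps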
Great! So far we have successfully implemented Transition and Dense layers.
Now, we are ready to implement the DenseBlock, which consists of multiple such DenseLayers.
class _DenseBlock(nn.ModuleDict):
    _version = 2

    def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate, memory_efficient=False):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            layer = _DenseLayer(
                num_input_features + i * growth_rate,
                growth_rate=growth_rate,
                bn_size=bn_size,
                drop_rate=drop_rate,
                memory_efficient=memory_efficient,
            )
            self.add_module('denselayer%d' % (i + 1), layer)

    def forward(self, init_features):
        features = [init_features]
        for name, layer in self.items():
            new_features = layer(features)
            features.append(new_features)
        return torch.cat(features, 1)
Let’s map the implementation of this DenseBlock to the figure again. Say we pass the number of layers num_layers as 3 to create a dense block, and imagine that num_input_features (the gray features in the figure) is 64. We already know that the authors chose the bottleneck size bn_size for the 1x1 conv to be 4, and let’s take growth_rate to be 32 (the same for all networks in the paper).
Great, so the first layer LAYER_0 accepts the 64 num_input_features and outputs 32 extra features. Next, LAYER_1 accepts 96 features (num_input_features + 1 * growth_rate) and again outputs 32 extra features. Finally, LAYER_2 accepts 128 features (num_input_features + 2 * growth_rate) and adds 32 green features on top, which are then concatenated to the existing features and returned by the DenseBlock.
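The same walkthrough as a quick sanity check in code:

block = _DenseBlock(num_layers=3, num_input_features=64, bn_size=4,
                    growth_rate=32, drop_rate=0)
x = torch.randn(1, 64, 28, 28)
print(block(x).shape)   # torch.Size([1, 160, 28, 28]) -- 64 + 3 * 32 channels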
At this stage, it should be really easy for you to map the implementation of a dense block with the above figures.
Finally, we are now ready to implement the DenseNet architecture, as we have already implemented the DenseLayer and DenseBlock.
class DenseNet(nn.Module):
    def __init__(self, growth_rate=32, block_config=(6, 12, 24, 16),
                 num_init_features=64, bn_size=4, drop_rate=0, num_classes=1000, memory_efficient=False):
        super(DenseNet, self).__init__()

        # Convolution and pooling part from table-1
        self.features = nn.Sequential(OrderedDict([
            ('conv0', nn.Conv2d(3, num_init_features, kernel_size=7, stride=2,
                                padding=3, bias=False)),
            ('norm0', nn.BatchNorm2d(num_init_features)),
            ('relu0', nn.ReLU(inplace=True)),
            ('pool0', nn.MaxPool2d(kernel_size=3, stride=2, padding=1)),
        ]))

        # Add multiple denseblocks based on config
        # for densenet-121 config: [6, 12, 24, 16]
        num_features = num_init_features
        for i, num_layers in enumerate(block_config):
            block = _DenseBlock(
                num_layers=num_layers,
                num_input_features=num_features,
                bn_size=bn_size,
                growth_rate=growth_rate,
                drop_rate=drop_rate,
                memory_efficient=memory_efficient
            )
            self.features.add_module('denseblock%d' % (i + 1), block)
            num_features = num_features + num_layers * growth_rate
            if i != len(block_config) - 1:
                # add a transition layer between denseblocks to downsample
                trans = _Transition(num_input_features=num_features,
                                    num_output_features=num_features // 2)
                self.features.add_module('transition%d' % (i + 1), trans)
                num_features = num_features // 2

        # Final batch norm
        self.features.add_module('norm5', nn.BatchNorm2d(num_features))

        # Linear layer
        self.classifier = nn.Linear(num_features, num_classes)

        # Official init from torch repo.
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        features = self.features(x)
        out = F.relu(features, inplace=True)
        out = F.adaptive_avg_pool2d(out, (1, 1))
        out = torch.flatten(out, 1)
        out = self.classifier(out)
        return out
Let’s use the above implementation to create the densenet-121 architecture.
def _densenet(arch, growth_rate, block_config, num_init_features, pretrained, progress,
              **kwargs):
    # NOTE: loading of pretrained weights is omitted in this simplified version
    model = DenseNet(growth_rate, block_config, num_init_features, **kwargs)
    return model

def densenet121(pretrained=False, progress=True, **kwargs):
    return _densenet('densenet121', 32, (6, 12, 24, 16), 64, pretrained, progress,
                     **kwargs)
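As a quick sanity check (remember that pretrained weight loading is omitted here), we can instantiate the model and run a dummy forward pass:

model = densenet121()
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 1000]) -- one logit per ImageNet class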
Here’s what happens. First, we initialize the stem of the DenseNet architecture – this is the convolution and pooling part from table-1.
This part of the code does that:
self.features = nn.Sequential(OrderedDict([
    ('conv0', nn.Conv2d(3, num_init_features, kernel_size=7, stride=2,
                        padding=3, bias=False)),
    ('norm0', nn.BatchNorm2d(num_init_features)),
    ('relu0', nn.ReLU(inplace=True)),
    ('pool0', nn.MaxPool2d(kernel_size=3, stride=2, padding=1)),
]))
Next, we create a DenseBlock for each entry in the config, with the corresponding number of layers.
This part of the code does this:
for i, num_layers in enumerate(block_config):
    block = _DenseBlock(
        num_layers=num_layers,
        num_input_features=num_features,
        bn_size=bn_size,
        growth_rate=growth_rate,
        drop_rate=drop_rate,
        memory_efficient=memory_efficient
    )
    self.features.add_module('denseblock%d' % (i + 1), block)
Finally, we add Transition layers between the DenseBlocks.
if i != len(block_config) - 1:
    # add a transition layer between denseblocks to downsample
    trans = _Transition(num_input_features=num_features,
                        num_output_features=num_features // 2)
    self.features.add_module('transition%d' % (i + 1), trans)
    num_features = num_features // 2
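To see how num_features evolves in densenet-121, here is a small sketch that replays the channel bookkeeping of the loop above (64 initial features, growth rate 32, config [6, 12, 24, 16]):

num_features = 64                        # num_init_features
for i, num_layers in enumerate([6, 12, 24, 16]):
    num_features += num_layers * 32      # each dense block adds num_layers * growth_rate channels
    print(f'after denseblock{i + 1}: {num_features} channels')
    if i != 3:                           # transition layer halves the channel count
        num_features //= 2
        print(f'after transition{i + 1}: {num_features} channels')
# denseblock1: 256 -> transition1: 128 -> denseblock2: 512 -> transition2: 256
# -> denseblock3: 1024 -> transition3: 512 -> denseblock4: 1024 (input to the classifier)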
And that’s all the magic behind DenseNets!
ResNet significantly changed the view of how to parametrize the functions in deep networks. DenseNet (dense convolutional network) is to some extent the logical extension of this [Huang et al., 2017]. As a result, DenseNet is characterized by both the connectivity pattern where each layer connects to all the preceding layers and the concatenation operation (rather than the addition operator in ResNet) to preserve and reuse features from earlier layers. To understand how to arrive at it, let’s take a small detour to mathematics.
Recall the Taylor expansion for functions. For the point x=0 it can be written as
\(f(x) = f(0) + f'(0)\,x + \frac{f''(0)}{2!}\,x^2 + \frac{f'''(0)}{3!}\,x^3 + \ldots\)
The key point is that it decomposes a function into increasingly higher-order terms. In a similar vein, ResNet decomposes functions into
\(f(x) = x + g(x).\)
That is, ResNet decomposes f into a simple linear term and a more complex nonlinear one. What if we want to capture (not necessarily add) information beyond two terms? One solution was DenseNet [Huang et al., 2017].
The main difference between ResNet (left) and DenseNet (right) in cross-layer connections: the use of addition and use of concatenation.
As shown in the figure, the key difference between ResNet and DenseNet is that in the latter case outputs are concatenated (denoted by [,]) rather than added. As a result, we perform a mapping from x to its values after applying an increasingly complex sequence of functions:
\(x \to [x,\, f_1(x),\, f_2([x, f_1(x)]),\, f_3([x, f_1(x), f_2([x, f_1(x)])]),\, \ldots]\)
In the end, all these functions are combined in an MLP to reduce the number of features again. In terms of implementation, this is quite simple: rather than adding terms, we concatenate them. The name DenseNet arises from the fact that the dependency graph between variables becomes quite dense. The last layer of such a chain is densely connected to all previous layers. The dense connections are shown in the following figure.
Dense connections in DenseNet.
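Going back to the core difference, it really is one line of code: ResNet adds the residual while DenseNet concatenates it. A minimal sketch with a toy convolution standing in for \(f_1\):

x = torch.randn(1, 8, 4, 4)
f1 = nn.Conv2d(8, 8, kernel_size=3, padding=1)      # toy stand-in for f_1

resnet_style   = x + f1(x)                          # addition: channel count stays at 8
densenet_style = torch.cat([x, f1(x)], dim=1)       # concatenation: channel count grows to 16
print(resnet_style.shape, densenet_style.shape)     # [1, 8, 4, 4] and [1, 16, 4, 4]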
The main components that compose a DenseNet are dense blocks and transition layers. The former defines how the inputs and outputs are concatenated, while the latter controls the number of channels so that it is not too large.
DenseNet uses the modified “batch normalization, activation, and convolution” structure of ResNet. First, we implement this convolution block structure.
import torch
from torch import nn
from d2l import torch as d2l

def conv_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=3, padding=1))
A dense block consists of multiple convolution blocks, each using the same number of output channels. In the forward propagation, however, we concatenate the input and output of each convolution block on the channel dimension.
class DenseBlock(nn.Module):
    def __init__(self, num_convs, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate the input and output of each block on the channel
            # dimension
            X = torch.cat((X, Y), dim=1)
        return X
In the following example, we define a DenseBlock instance with 2 convolution blocks of 10 output channels each. When using an input with 3 channels, we will get an output with 3 + 2×10 = 23 channels. The number of convolution block channels controls the growth in the number of output channels relative to the number of input channels. This is also referred to as the growth rate.
blk = DenseBlock(2, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape
torch.Size([4, 23, 8, 8])
Since each dense block will increase the number of channels, adding too many of them will lead to an excessively complex model. A transition layer is used to control the complexity of the model. It reduces the number of channels with a 1×1 convolutional layer and halves the height and width with an average pooling layer of stride 2, further reducing the complexity of the model.
def transition_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))
Apply a transition layer with 10 channels to the output of the dense block in the previous example. This reduces the number of output channels to 10 and halves the height and width.
blk = transition_block(10)
blk(Y).shape
torch.Size([4, 10, 4, 4])
Next, we will construct a DenseNet model. DenseNet first uses the same single convolutional layer and max-pooling layer as in ResNet.
class DenseNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
Then, similar to the four modules made up of residual blocks that ResNet uses, DenseNet uses four dense blocks. Similar to ResNet, we can set the number of convolutional layers used in each dense block. Here, we set it to 4, consistent with the ResNet-18 model. Furthermore, we set the number of channels (i.e., growth rate) for the convolutional layers in the dense block to 32, so 128 channels will be added to each dense block.
In ResNet, the height and width are reduced between each module by a residual block with a stride of 2. Here, we use the transition layer to halve the height and width and halve the number of channels. Similar to ResNet, a global pooling layer and a fully connected layer are connected at the end to produce the output.
@d2l.add_to_class(DenseNet)
def __init__(self, num_channels=64, growth_rate=32, arch=(4, 4, 4, 4),
             lr=0.1, num_classes=10):
    super(DenseNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, num_convs in enumerate(arch):
        self.net.add_module(f'dense_blk{i+1}', DenseBlock(num_convs,
                                                          growth_rate))
        # The number of output channels in the previous dense block
        num_channels += num_convs * growth_rate
        # A transition layer that halves the number of channels is added
        # between the dense blocks
        if i != len(arch) - 1:
            num_channels //= 2
            self.net.add_module(f'tran_blk{i+1}', transition_block(
                num_channels))
    self.net.add_module('last', nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)
Since we are using a deeper network here, in this section, we will reduce the input height and width from 224 to 96 to simplify the computation.
model = DenseNet(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)
There are a few other terms that the paper talks about which are important concepts in DenseNet.
DenseNet is composed of dense blocks. In those blocks, the layers are densely connected together: each layer receives the output feature maps of all previous layers as input.
This extreme use of residual connections creates a form of deep supervision, because each layer receives more supervision from the loss function thanks to the shorter connections.
A dense block is a group of layers connected to all their previous layers. A single layer looks like this:
The authors found that the pre-activation mode (BN and ReLU before the Conv) was more efficient than the usual post-activation mode.
Note that the authors use zero padding before the convolution in order to keep the feature-map size fixed.
Instead of summing the residual like in ResNet, DenseNet concatenates all the feature maps. It would be impracticable to concatenate feature maps of different sizes (although some resizing may work). Thus in each dense block, the feature maps of each layer have the same size. However, down-sampling is essential to CNN. Transition layers between two dense blocks assure this role.
A transition layer is made of a batch normalization layer, a 1×1 convolution, and a 2×2 average pooling layer with stride 2.
Concatenating residuals instead of summing them has a downside when the model is very deep: It generates a lot of input channels!
You may now wonder how could I say in the introduction that DenseNet has fewer parameters than usual SotA networks. There are two reasons:
First of all, a DenseNet’s convolution generates a low number of feature maps. The authors recommend 32 for optimal performance but show SotA results with only 12 output channels!
The number of output feature maps of a layer is defined as the growth rate. DenseNet has a lower need for wide layers because as layers are densely connected there is little redundancy in the learned features. All layers of the same dense block share collective knowledge. The growth rate regulates how much new information each layer contributes to the global state.
The second reason DenseNet has few parameters despite concatenating many residuals together is that each 3×3 convolution can be upgraded with a bottleneck.
A layer of a dense block with a bottleneck will be: batch normalization, ReLU, 1×1 convolution (producing 4k feature maps), batch normalization, ReLU, 3×3 convolution (producing k feature maps).
With a growth rate of 32, the tenth layer would receive 288 feature maps from the preceding layers alone! Thanks to the bottleneck, at most 4k = 128 feature maps are fed to the 3×3 convolution. This helps the network scale to hundreds, if not thousands, of layers.
The authors further improve the compactness of the model with compression. This compression happens in the transition layer.
Normally the transition layer’s convolution does not change the number of feature maps. With compression, its number of output feature maps is θ·m, where m is the number of input feature maps and θ is a compression factor between 0 and 1.
Note that the compression factor θ has the same role as the parameter α in MobileNet.
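Here is a minimal sketch of a compressing transition layer, reusing the _Transition module implemented earlier (the helper name is mine; θ = 0.5 is the value the paper uses for the DenseNet-BC variants):

def compressed_transition(num_input_features, theta=0.5):
    # output channels = floor(theta * m), with 0 < theta <= 1
    num_output_features = int(theta * num_input_features)
    return _Transition(num_input_features, num_output_features)

blk = compressed_transition(256, theta=0.5)
print(blk(torch.randn(1, 256, 28, 28)).shape)   # torch.Size([1, 128, 14, 14])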
Congratulations! Today, together, we successfully understood what DenseNets are and also understood the torchvision implementation of DenseNets. I hope that by now you have a very thorough understanding of the DenseNet architecture.
The main components that compose DenseNet are dense blocks and transition layers. For the latter, we need to keep the dimensionality under control when composing the network by adding transition layers that shrink the number of channels again. In terms of cross-layer connections, unlike ResNet, where inputs and outputs are added together, DenseNet concatenates inputs and outputs on the channel dimension. Although these concatenation operations reuse features to achieve computational efficiency, unfortunately, they lead to heavy GPU memory consumption. As a result, applying DenseNet may require more complex memory-efficient implementations that may increase training time [Pleiss et al., 2017].
Resources:
https://amaarora.github.io/2020/08/02/densenets.html
https://d2l.ai/chapter_convolutional-modern/densenet.html
https://towardsdatascience.com/understanding-and-visualizing-densenets-7f688092391a