Inception and Versions of the Inception Network

Sep 17, 2021

In this blog, we will go through a gentle introduction to the Inception architecture, introduced by Google in 2014, and to the later versions of the Inception network.

The Problems

Before the Inception network, researchers were looking for ways to build deeper models, because the most straightforward way to improve the performance of a deep neural network is to increase its size. Some of the problems they faced were:

The vanishing gradient problem arises as the model gets deeper: the gradient shrinks as it is propagated back through many layers, so layers far from the output receive very small updates and do not train properly. Also, very deep networks are prone to overfitting.
Convolution layers with larger filters are more computationally expensive than those with smaller filters.

Let’s understand this with an example. Suppose we have an image of dimension n*n*c and apply a Conv layer with a 3*3 filter: the number of multiplications at each position of the filter is 3*3*c. With a 5*5 filter it is 5*5*c, i.e. a 5*5 filter is about 2.78 times (25/9) more computationally expensive than a 3*3 filter.
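The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name and the channel count c = 64 are made up for illustration, and the ratio does not depend on c.

```python
def mults_per_position(kernel_size: int, in_channels: int) -> int:
    """Multiplications one k*k filter performs at a single position."""
    return kernel_size * kernel_size * in_channels


c = 64  # hypothetical number of input channels
cost_3x3 = mults_per_position(3, c)
cost_5x5 = mults_per_position(5, c)
ratio = cost_5x5 / cost_3x3

print(cost_3x3, cost_5x5, round(ratio, 2))  # prints: 576 1600 2.78
```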

It is always difficult to choose the filter size in convolution layers. Because the location and extent of the relevant information varies hugely from image to image, choosing the right kernel size for the convolution operation becomes tough. A larger kernel is preferred for information that is distributed more globally, and a smaller kernel is preferred for information that is distributed more locally.

The Solution

To address these problems, Google introduced the Inception architecture in 2014.

Researchers kept improving on it, which led to several versions of Inception. The most common ones are Inception-v1, Inception-v2, Inception-v3, Inception-v4, and Inception-ResNet-v2.

Let’s look at each one by one:

Inception-v1

The complete architecture is divided into three parts:

Stem: This is the starting part of the architecture, right after the input layer. It consists of simple max pool layers and convolution layers with ReLU activation.
Output classifier: This is the last part of the network. After the final feature maps are pooled and flattened, it consists of a fully connected layer followed by a softmax function.
Inception module: This is the middle and most important part of the architecture, and it is what makes Inception different from other networks.

Let’s take a deep dive into it.

Here is the ‘naive’ Inception block. The idea behind it: why not have filters of multiple sizes operate at the same level?

It applies convolutions with 3 different filter sizes (1*1, 3*3, 5*5) and max pooling to the previous layer's output in parallel, then concatenates the results and sends them to the next Inception module. The result is that features from different convolutions and from max pooling are combined in one layer, without using a separate layer for each operation.

Do note that concatenation along the channel dimension is only possible when the outputs have the same height and width. That is why padding is applied in each branch, to keep all of them at the same dimensions.
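To see why the padding matters, here is a small shape calculation for a naive block. It is a sketch under assumptions: the 28*28*192 input and the per-branch filter counts are illustrative values chosen for this example, and the helper names are made up.

```python
def conv_output_size(size: int, kernel: int, padding: int, stride: int = 1) -> int:
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def same_padding(kernel: int) -> int:
    """Padding that preserves the size for an odd kernel at stride 1."""
    return (kernel - 1) // 2


in_size, in_channels = 28, 192  # illustrative input feature map
branches = [
    ("1*1 conv", 1, 64),
    ("3*3 conv", 3, 128),
    ("5*5 conv", 5, 32),
    ("3*3 max pool", 3, in_channels),  # pooling keeps the input channel count
]

# Without padding, every branch ends up with a different height and width.
unpadded = [conv_output_size(in_size, k, 0) for (_name, k, _ch) in branches]

# With "same" padding all branches agree, so they can be concatenated.
out_sizes = [conv_output_size(in_size, k, same_padding(k)) for (_name, k, _ch) in branches]
out_channels = sum(ch for (_name, _k, ch) in branches)

print(unpadded)                 # prints: [28, 26, 24, 26]
print(out_sizes, out_channels)  # prints: [28, 28, 28, 28] 416
```

Notice that the pooling branch passes all 192 input channels through, so the concatenated output of the naive block is wider than its input.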

The above is just the naive Inception block. To make it less computationally expensive, a 1*1 Conv filter is used before the 3*3 and 5*5 convolutions. The 1*1 convolutions reduce the number of input channels, so when the 3*3 and 5*5 convolutions are applied, far fewer multiplications take place and the whole block becomes cheaper to compute.
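A rough count of multiplications shows how much the 1*1 reduction can save. The sizes below (a 28*28*192 input, a reduction to 16 channels, and 32 output filters) are assumptions picked for illustration, and `conv_mults` is a hypothetical helper, not part of any library.

```python
def conv_mults(out_size: int, kernel: int, in_channels: int, out_channels: int) -> int:
    """Total multiplications of a stride-1, same-padded convolution."""
    return out_size * out_size * out_channels * kernel * kernel * in_channels


size, c_in = 28, 192      # illustrative input feature map
c_reduce, c_out = 16, 32  # illustrative 1*1 reduction and 5*5 output channels

# 5*5 convolution applied directly to all 192 input channels.
direct = conv_mults(size, 5, c_in, c_out)

# 1*1 convolution down to 16 channels, then the 5*5 convolution.
reduced = conv_mults(size, 1, c_in, c_reduce) + conv_mults(size, 5, c_reduce, c_out)

print(direct, reduced)            # prints: 120422400 12443648
print(round(direct / reduced, 2)) # prints: 9.68
```

With these numbers the reduced branch needs roughly a tenth of the multiplications of the direct 5*5 convolution.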

Note, however, that in the max-pooling branch the 1*1 convolution is applied after the pooling layer rather than before it. The reason is the same: to reduce the number of channels that this branch contributes to the output concatenation.

Inception is a deep network. To prevent the middle part of the network from “dying out” (the vanishing gradient problem), the authors introduced two auxiliary classifiers attached to intermediate layers. Softmax is applied in each of them, and an auxiliary loss is calculated against the same labels used by the output classifier.

The total loss function is a weighted sum of the auxiliary losses and the real loss. The weight value used in the paper was 0.3 for each auxiliary loss.

Total loss = real loss + (0.3 * auxiliary loss1) + (0.3 * auxiliary loss2)
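The formula can be sketched in plain Python as follows. The function names and the toy logits are made up for this example; a real training setup would use the loss functions of its deep learning framework instead.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    return -math.log(softmax(logits)[label])


def total_loss(main_logits, aux1_logits, aux2_logits, label, aux_weight=0.3):
    """Real loss plus the two auxiliary losses, each weighted by aux_weight."""
    real = cross_entropy(main_logits, label)
    aux1 = cross_entropy(aux1_logits, label)
    aux2 = cross_entropy(aux2_logits, label)
    return real + aux_weight * aux1 + aux_weight * aux2


# Toy logits for a 3-class problem; all three heads see the same label.
loss = total_loss([2.0, 0.5, -1.0], [1.0, 0.8, 0.1], [0.3, 0.2, 0.1], label=0)
print(loss)
```

All three classifiers are scored against the same label; only the weights differ.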

Inception-v2

Inception v2 follows the same approach as Inception-v1, with some changes that make it more efficient in terms of both speed and accuracy.

What’s new

These are some of the updates in Inception-v2 compared with Inception-v1:

5*5 convolutions are replaced with two stacked 3*3 convolutions to make them less computationally expensive. As described earlier, a 5*5 convolution is about 2.78 times more expensive than a 3*3 convolution, so two 3*3 convolutions cost 2*9 = 18 multiplications per position and input channel instead of 25 while covering the same 5*5 region, which leads to a boost in performance.

Here, 5*5 convolutions are replaced with two 3*3 convolutions
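A quick check that two stacked 3*3 convolutions see the same 5*5 region at a lower cost. This sketch assumes stride 1 and the same number of channels in every layer; the helper names are hypothetical.

```python
def stacked_receptive_field(kernels):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernels:
        field += k - 1
    return field


def stacked_cost(kernels):
    """Multiplications per position and channel pair, summed over the stack."""
    return sum(k * k for k in kernels)


single = [5]     # one 5*5 convolution
double = [3, 3]  # two 3*3 convolutions in sequence

saving = 1 - stacked_cost(double) / stacked_cost(single)

print(stacked_receptive_field(single), stacked_cost(single))  # prints: 5 25
print(stacked_receptive_field(double), stacked_cost(double))  # prints: 5 18
print(round(saving, 2))                                       # prints: 0.28
```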

Some 3*3 convolutions are replaced with a combination of 1*3 and 3*1 convolutions, which reduces computation and improves speed. Note that the 1*3 and 3*1 convolutions must be applied one after the other: a 1*3 Conv only looks along the horizontal direction (one kind of feature), whereas a 3*1 Conv looks along the vertical direction of the input. This method was found to be 33% cheaper than a single 3*3 convolution (3 + 3 = 6 multiplications instead of 9).

Similarly, some 7*7 convolutions are replaced with a combination of 1*7 and 7*1 convolutions.
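The idea of splitting an n*n filter into a 1*n and an n*1 filter can be illustrated on a tiny example. In the sketch below, applying a 1*3 kernel and then a 3*1 kernel gives the same result as a single 3*3 kernel built from their outer product. The kernels and the test image are arbitrary, and this exact equivalence only holds for this linear toy case: in the network the kernels are learned and separated by an activation, so the pair is a cheaper substitute rather than an identical operation.

```python
def correlate2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


row = [[1, 0, -1]]     # 1*3 kernel (horizontal direction)
col = [[1], [2], [1]]  # 3*1 kernel (vertical direction)
full = [[c[0] * r for r in row[0]] for c in col]  # equivalent 3*3 kernel

image = [[(3 * i + 5 * j) % 7 for j in range(5)] for i in range(5)]  # arbitrary 5*5 input

two_step = correlate2d(correlate2d(image, row), col)  # 3 + 3 = 6 mults per output
one_step = correlate2d(image, full)                   # 3 * 3 = 9 mults per output

print(two_step == one_step)  # prints: True
```

For a 7*7 filter the same split costs 7 + 7 = 14 multiplications per output instead of 49.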

All the above updates are used to build 3 types of Inception blocks, which are stacked to form the Inception v2 architecture.

Inception-v3

Inception v3 is similar to Inception v2, with some updates to the loss function, the optimizer, and the use of batch normalization.

What’s new

These are some of the updates in Inception-v3 compared with Inception-v2:

The RMSProp optimizer is used
Batch normalization is used in the auxiliary classifier
Label smoothing is used (a type of regularizing component added to the loss function that prevents the network from overfitting)
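Label smoothing is easy to show on a toy example: instead of a one-hot target, a small amount of probability is spread over all classes. The sketch below assumes a smoothing value of 0.1 and a 4-class problem purely for illustration, and the function names are made up.

```python
import math


def smooth_labels(label, num_classes, epsilon=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = epsilon / num_classes
    return [(1 - epsilon) + uniform if k == label else uniform
            for k in range(num_classes)]


def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a target distribution."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)


targets = smooth_labels(label=2, num_classes=4)  # roughly [0.025, 0.025, 0.925, 0.025]

overconfident = [0.001, 0.001, 0.997, 0.001]  # hypothetical, nearly one-hot prediction
calibrated = targets                          # prediction that matches the smoothed target

loss_overconfident = cross_entropy(overconfident, targets)
loss_calibrated = cross_entropy(calibrated, targets)
print(loss_overconfident > loss_calibrated)  # prints: True
```

With smoothed targets, a nearly one-hot prediction is no longer the best possible answer, which discourages the network from becoming overconfident.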

Inception v4

In Inception v4 the network was made deeper, the stem (the starting part of the Inception architecture) was changed, and more uniform choices were made for the Inception blocks.

What’s new

The stem part is changed
The number of Inception modules is increased
Inception modules are made more uniform, i.e. the same numbers of filters are used in the modules
The three types of Inception module are named A, B, and C (similar to the Inception modules in Inception-v2)

Inception ResNet v2

Inspired by the performance of ResNet, residual connections are introduced in the Inception modules.

For the residual addition, the input of the module and the concatenated output of its operations must have the same dimensions. Therefore padding is applied in each operation, and a 1*1 convolution is applied at the end to make the number of channels equal.
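The channel matching can be sketched on a tiny feature map. A 1*1 convolution is just a linear map over the channels of each pixel, so it can turn a 3-channel branch output into a 2-channel one that can be added to a 2-channel input. Everything here (the shapes, the values, the weights, and the function names) is made up for illustration, and activations are left out.

```python
def conv1x1(feature_map, weights):
    """1*1 convolution: a per-pixel linear map over channels.

    feature_map is indexed [height][width][in_channels];
    weights is indexed [out_channels][in_channels].
    """
    return [
        [[sum(w * x for w, x in zip(w_row, pixel)) for w_row in weights]
         for pixel in row]
        for row in feature_map
    ]


def residual_add(block_input, branch_output, projection):
    """Project the branch output to the input's channel count, then add."""
    projected = conv1x1(branch_output, projection)
    return [
        [[a + b for a, b in zip(p_in, p_out)] for p_in, p_out in zip(r_in, r_out)]
        for r_in, r_out in zip(block_input, projected)
    ]


block_input = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]                # 2*2 map, 2 channels
branch_output = [[[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [2, 0, 0]]]  # 2*2 map, 3 channels
projection = [[1, 0, 0], [0, 1, 1]]                               # maps 3 channels to 2

result = residual_add(block_input, branch_output, projection)
print(result)  # prints: [[[2, 3], [3, 5]], [[6, 8], [9, 8]]]
```

Because padding keeps the height and width unchanged, only the channel count needs this extra projection before the addition.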