CNNs mimic animal visual perception and are a deep learning technique widely used for image recognition. While CNNs are a powerful image recognition tool, it is easy to get lost in the seemingly convoluted mathematical details - particularly the matrix dimensions - when implementing them. In my first blog post here, I will try to un-convolute the mathematics that specifically goes into correctly implementing a CNN. Nailing down the proper dimensions of all the matrices in a CNN helps avoid errors such as dimension mismatch errors in TensorFlow. This blog post will NOT delve into how and why CNNs work. There is already plenty of quality material online for that.
In this post, I will focus on a single CNN with 2 convolution layers, 1 fully connected hidden layer and an output softmax layer. Let's take an image that is h pixels high and w pixels wide, with c channels. A gray scale image has only 1 channel (c = 1), while an RGB image has 3. The convolution operation is represented by T(*) throughout the network. The entire network is shown below, and I will break it down in the following sections.
Entire network with 2 convolution layers, 1 fully connected layer and a final softmax layer
The Convolution layer-1 (CL-1)
The input image is convolved with d1 maps of dimension (p1 x p1 x c) - one map at a time, with same padding and stride s1. A bias b is added to the result of each convolution between the image and a map. The result then goes through the activation function A1() to generate CL-1. The dimensions of the resulting CL-1 are [(w/s1) x (h/s1) x d1], assuming s1 divides w and h evenly; otherwise each quotient is rounded up.
Convolution of input image with d1 maps, forming CL-1
The number of maps, or depth (d1), is also the number of features that the network extracts in this convolution operation.
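To make the CL-1 bookkeeping concrete, here is a minimal sketch in plain Python. The helper name same_conv_output_shape and the numbers (a 28 x 28 gray scale image, p1 = 5, d1 = 32, s1 = 2) are my own illustrative assumptions. Shapes are written in the same (width, height, depth) order used in this post; note that TensorFlow's default image layout puts height before width.

```python
def same_conv_output_shape(width, height, stride, num_maps):
    """Shape (width, height, depth) of a layer produced by a same-padded convolution."""
    # With same padding, each spatial size becomes ceil(size / stride).
    # (size + stride - 1) // stride is an exact integer version of that ceiling.
    out_w = (width + stride - 1) // stride
    out_h = (height + stride - 1) // stride
    return (out_w, out_h, num_maps)


# Illustrative values (assumptions, not prescribed anywhere above).
w, h, c = 28, 28, 1      # gray scale image
p1, d1, s1 = 5, 32, 2    # map size, number of maps, stride

map_shape = (p1, p1, c)        # a single map spans all c channels
maps_shape = (p1, p1, c, d1)   # all d1 maps stacked along a last axis
bias_shape = (d1,)             # one bias per map
cl1_shape = same_conv_output_shape(w, h, s1, d1)

print(cl1_shape)  # (14, 14, 32)
```

If s1 does not divide w or h, the helper rounds up, so a 7 x 5 input with stride 2 would give a 4 x 3 layer.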
The Convolution layer-2 (CL-2)
This operation is exactly like the previous step, with the input image replaced by CL-1. CL-1 is convolved with d2 maps of dimension (p2 x p2 x d1) - one map at a time, with same padding and stride s2. A bias b is added to the result of each convolution between CL-1 and a map. The result then goes through the activation function A2() to generate CL-2. The dimensions of the resulting CL-2 are [(w/(s1 x s2)) x (h/(s1 x s2)) x d2].
Convolution of CL-1 with d2 maps, forming CL-2
Notice that the total depth of the convolved layer is d2 at this point, which is the number of features that have been extracted from the original image so far.
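The two convolution steps can be chained in a few lines to check the CL-2 formula. This is only a sketch: the helper same_conv_size and the values (28 x 28 input, d1 = 32, s1 = 2, p2 = 3, d2 = 64, s2 = 2) are assumptions chosen so that s1 x s2 divides w and h evenly.

```python
def same_conv_size(size, stride):
    # Integer ceil(size / stride): spatial size after a same-padded convolution.
    return (size + stride - 1) // stride


# Illustrative values (assumptions).
w, h, d1 = 28, 28, 32
s1 = 2
p2, d2, s2 = 3, 64, 2

cl1_shape = (same_conv_size(w, s1), same_conv_size(h, s1), d1)

# Each CL-2 map must span the full depth of CL-1, hence the d1 in its shape.
map2_shape = (p2, p2, d1)
cl2_shape = (
    same_conv_size(cl1_shape[0], s2),
    same_conv_size(cl1_shape[1], s2),
    d2,
)

# When s1 * s2 divides w and h, this matches [w/(s1 x s2)] x [h/(s1 x s2)] x d2.
assert cl2_shape == (w // (s1 * s2), h // (s1 * s2), d2)

print(cl1_shape, cl2_shape)  # (14, 14, 32) (7, 7, 64)
```

The thing to watch is the third entry of map2_shape: it has to equal the depth of the layer being convolved, which is d1 here and not the channel count c of the original image.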
The hidden and the output layer
Finally, CL-2 is fully connected to a hidden layer of n neurons (and therefore n weight arrays). The dimension of each weight array is [(w/(s1 x s2)) x (h/(s1 x s2)) x d2], the same as CL-2. Again, a bias b is added to each product of CL-2 with a weight array (summed over all elements, giving one number per neuron), which then goes through the activation function A3(). A softmax layer with m classes yields the final output.
CL-2 connects fully to hidden layer with n neurons, which then connects to the output m-class softmax layer
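In code, the n weight arrays of the hidden layer are usually stored as one matrix, with CL-2 flattened into a vector first. The sketch below works out those shapes and the resulting parameter counts. The values (a 7 x 7 x 64 CL-2, n = 128, m = 10) are illustrative, and the (n, m) shape for the softmax weights is my assumption of a standard dense output layer, since the post does not spell it out.

```python
# Illustrative values (assumptions).
cl2_shape = (7, 7, 64)   # (width, height, depth) of CL-2
n, m = 128, 10           # hidden neurons, output classes

# Flattening CL-2 turns each weight array into a vector of this length.
flat_size = cl2_shape[0] * cl2_shape[1] * cl2_shape[2]

hidden_weights_shape = (flat_size, n)  # n weight arrays, one per column
hidden_bias_shape = (n,)               # one bias per hidden neuron
output_weights_shape = (n, m)          # assumed dense softmax layer
output_bias_shape = (m,)

hidden_params = flat_size * n + n
output_params = n * m + m

print(flat_size, hidden_params, output_params)  # 3136 401536 1290
```

Even in this small example the hidden layer holds far more parameters than the output layer, because its input size is the product of all three CL-2 dimensions.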
Keeping track of all the tensor dimensions comes in very handy when implementing a deep neural network such as a CNN. It is advisable to work out the dimensions properly before declaring the various maps, weights and biases. This reduces the chances of triggering dimension mismatch errors during code execution. Working with variable names like w, h, d1, d2, etc. instead of absolute numbers makes the code more modular, which increases its re-usability for other applications.
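One possible way to keep everything in variables is to collect them in a small container and derive every shape from it. The class name CnnDims, the dictionary keys and the example values below are hypothetical, and the softmax weight shape (n, m) is again an assumption, so treat this as a sketch and not a fixed recipe.

```python
from dataclasses import dataclass


def _ceil_div(a, b):
    # Exact integer ceiling of a / b, as used for same padding.
    return (a + b - 1) // b


@dataclass(frozen=True)
class CnnDims:
    """Hypothetical container for the symbols used in this post."""
    w: int
    h: int
    c: int
    p1: int
    d1: int
    s1: int
    p2: int
    d2: int
    s2: int
    n: int
    m: int

    def shapes(self):
        # Spatial sizes after CL-1 and CL-2 (same padding, strides s1 and s2).
        w1, h1 = _ceil_div(self.w, self.s1), _ceil_div(self.h, self.s1)
        w2, h2 = _ceil_div(w1, self.s2), _ceil_div(h1, self.s2)
        return {
            "image": (self.w, self.h, self.c),
            "maps_1": (self.p1, self.p1, self.c, self.d1),
            "CL-1": (w1, h1, self.d1),
            "maps_2": (self.p2, self.p2, self.d1, self.d2),
            "CL-2": (w2, h2, self.d2),
            "hidden_weights": (w2 * h2 * self.d2, self.n),
            "output_weights": (self.n, self.m),
        }


# Example: a 32 x 32 RGB image; swapping in other numbers needs no other change.
rgb = CnnDims(w=32, h=32, c=3, p1=5, d1=16, s1=2, p2=3, d2=32, s2=2, n=64, m=10)
for name, shape in rgb.shapes().items():
    print(name, shape)
```

Re-using the network for, say, gray scale images then only means constructing another CnnDims with c=1; every dependent shape follows automatically.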