Understanding Convolutional Neural Networks
A convolutional neural network (ConvNet, or CNN) is a type of deep neural network most commonly applied to visual imagery. These networks are loosely modeled on the human visual system and consist of interconnected artificial neurons. Just as a neuron is the basic functional unit of the human nervous system, an artificial neuron is the basic computational unit of a neural network. (Editor's Note: If you haven't already seen it, make sure you check out our blog post about Deep Learning for more foundational info about the topic.)
In a convolutional neural network, at least one layer uses a special kind of mathematical operation called convolution instead of general matrix multiplication.
A convolutional neural network is made up of an input layer, an output layer, and several hidden layers in between (a neural layer is a stack of neurons in a single line). A neuron in the input layer receives an input, performs some computation on it (typically a weighted sum), and then applies a non-linear function, called an activation function, to yield the neuron's final output.
Neurons are named after the activation functions they use: for example, sigmoid neurons, ReLU neurons, and tanh neurons. Each neuron may form connections with multiple other neurons, and each connection has its own weight.
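As a rough illustration of the neuron described above, here is a minimal Python sketch. The function names, the example inputs and weights, and the bias term are assumptions added for illustration; the article itself only mentions weighted connections and an activation function.

```python
import math


def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    # Passes positive values through and clips negative values to zero.
    return max(0.0, z)


def tanh(z):
    # Squashes any real number into the range (-1, 1).
    return math.tanh(z)


def neuron_output(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus a bias (the bias is an assumption here),
    # followed by the non-linear activation function.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical example values.
inputs = [1.0, -2.0, 0.5]
weights = [0.4, 0.3, -0.2]
bias = 0.1

for name, fn in [("sigmoid", sigmoid), ("ReLU", relu), ("tanh", tanh)]:
    print(name, neuron_output(inputs, weights, bias, fn))
```

The same weighted sum gives three different outputs depending on the activation function, which is why neurons are named after the function they use.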
Between the input layer and the final output sit the hidden layers. They are called hidden because their inputs and outputs are not directly visible from outside the network; only the network's input and final output are. In a convolutional network, the hidden layers consist of a series of layers of different types, including convolutional layers. Each convolutional layer convolves its input and passes the result to the next layer.
Neural networks that have multiple hidden layers are called deep networks, and they often produce more accurate results than shallow networks on complex tasks. Machine learning that applies these networks is therefore called deep learning.
Types of Layers in a Convolutional Network
As noted earlier, a neural layer is a collection of neurons in a single line. All neurons in one layer perform the same type of mathematical operation, and the layer is named after that operation. The following are the popular kinds of layers in a convolutional neural network.
Convolutional Layers
All neurons in a convolutional layer apply the convolution operation to the inputs they receive. The common hyperparameters of a convolutional layer are:
- Filter size
- Stride
For example, let’s use a layer with a filter of size 5*5*3 and an input image of size 40*40 with 3 channels. To calculate the convolution (a dot product between the filter and a patch of the input), the third dimension of the filter must equal the number of channels in the input.
To get the convolved output, we slide the convolutional filter over the whole input image a certain number of pixels at a time; this step size is called the stride. After each convolution, the output shrinks (with a 5*5 filter and a stride of 1, it reduces from 40*40 to 36*36, for instance), depending on the filter size and the stride.
In a deep neural network, repeated convolutions may shrink the output significantly. To counter this, it is common to add zeros around the boundary of the input (zero padding) so that the resulting output has the same size as the input.
So, given an input image of size N*N, a filter of size F, a stride of S, and zero padding of size P added to the input, the output size will be:
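The sliding-filter computation can be sketched in plain Python as follows. This is a deliberately naive, unoptimized illustration, not how a real framework implements it. The function name `conv2d_single_filter` and the all-ones example data are assumptions, and, as is common in deep learning code, the filter is applied without flipping (strictly speaking, a cross-correlation).

```python
def conv2d_single_filter(image, kernel, stride=1):
    """Slide one f x f x C filter over an H x W x C image (nested lists), no padding."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    f = len(kernel)
    if len(kernel[0][0]) != channels:
        raise ValueError("filter depth must equal the number of input channels")

    out_h = (height - f) // stride + 1
    out_w = (width - f) // stride + 1
    output = [[0.0] * out_w for _ in range(out_h)]

    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the filter and the patch it currently covers.
            total = 0.0
            for di in range(f):
                for dj in range(f):
                    for c in range(channels):
                        total += image[i * stride + di][j * stride + dj][c] * kernel[di][dj][c]
            output[i][j] = total
    return output


# Hypothetical data: a 40*40 image with 3 channels and a 5*5*3 filter, all ones.
image = [[[1.0] * 3 for _ in range(40)] for _ in range(40)]
kernel = [[[1.0] * 3 for _ in range(5)] for _ in range(5)]

feature_map = conv2d_single_filter(image, kernel, stride=1)
print(len(feature_map), len(feature_map[0]))  # 36 36
```

With a stride of 1 and no padding, the 5*5 filter fits in 36 positions along each axis of the 40*40 input, which is where the 36*36 output comes from.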
(N-F+2P)/S +1
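The formula above is easy to check with a small helper. The function name is an assumption, and rounding down when the stride does not divide evenly is a common convention rather than something stated in this article.

```python
def conv_output_size(n, f, p=0, s=1):
    # (N - F + 2P) / S + 1, rounded down when the stride does not divide evenly
    # (the rounding rule is an assumption; conventions differ between libraries).
    return (n - f + 2 * p) // s + 1


print(conv_output_size(40, 5))        # 36: no padding, stride 1
print(conv_output_size(40, 5, p=2))   # 40: padding of 2 preserves the input size
print(conv_output_size(40, 5, s=2))   # 18: stride 2 roughly halves the output
```

For a stride of 1 and an odd filter size F, choosing P = (F - 1) / 2 keeps the output the same size as the input, as the second call shows.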
Pooling Layers
Pooling layers are used to reduce the size of their inputs, which speeds up processing and analysis. They are typically placed after convolutional layers to reduce the spatial size (width and height) of the input. This reduces the number of parameters in later layers and, hence, the computation. The hyperparameters of a pooling layer include:
- Filter size
- Stride
- Max or average pooling
For example, consider the 4 x 4 matrix shown below:
1 | 2 | 3 | 4 |
2 | 8 | 1 | 1 |
1 | 3 | 4 | 1 |
4 | 5 | 1 | 2 |
If max pooling is applied to this matrix, it results in the 2 x 2 output shown below:
8 | 4 |
5 | 4 |
For the above pooling, a filter of size 2 and a stride of 2 are applied to form the 2 x 2 matrix. This filter size and stride are the most commonly used in pooling. In the example above, each cell of the output holds the maximum value of the corresponding 2 x 2 window of the input. If we use average pooling, we take the average of the numbers in each window instead of the maximum. However, max pooling is the more common choice. So if the input of a pooling layer is nh x nw x nc, with filter size f and stride s, the resulting output will be [{(nh - f) / s + 1} x {(nw - f) / s + 1} x nc].
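The pooling example can be reproduced with a short Python sketch. The function name `pool2d` and its `mode` parameter are assumptions made for illustration, and the sketch handles a single channel only.

```python
def pool2d(matrix, f=2, s=2, mode="max"):
    """Apply max or average pooling to a 2-D list with filter size f and stride s."""
    out_h = (len(matrix) - f) // s + 1
    out_w = (len(matrix[0]) - f) // s + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values covered by the current f x f window.
            window = [matrix[i * s + di][j * s + dj] for di in range(f) for dj in range(f)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        output.append(row)
    return output


matrix = [
    [1, 2, 3, 4],
    [2, 8, 1, 1],
    [1, 3, 4, 1],
    [4, 5, 1, 2],
]

print(pool2d(matrix, mode="max"))      # [[8, 4], [5, 4]]
print(pool2d(matrix, mode="average"))  # [[3.25, 2.25], [3.25, 2.0]]
```

For an input with several channels, the same operation would be applied to each channel independently, which is why the number of channels nc is unchanged in the output size.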
Fully Connected Layer
Fully connected layers are so called because they connect every neuron in one layer to every neuron in the next. In a fully connected layer, each output dimension depends on every input dimension.
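A fully connected layer can be sketched as one weighted sum per output neuron, taken over all of the inputs. The function name, the bias terms, and the example numbers below are assumptions for illustration, and the activation function is left out to keep the sketch short.

```python
def fully_connected(inputs, weights, biases):
    # weights has one row per output neuron and one column per input,
    # so every output is connected to every input.
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical layer with 3 inputs and 2 outputs.
inputs = [1.0, 2.0, 3.0]
weights = [
    [1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
]
biases = [0.0, 1.0]

print(fully_connected(inputs, weights, biases))  # [-2.0, 4.0]
```

With 3 inputs and 2 outputs the layer needs 3 * 2 = 6 weights, which shows how quickly the parameter count grows when every input is connected to every output.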
Need More Info?
Even when you understand the basics, convolutional networks can be complex. Working with an AI vendor, like Clarifai, can help you architect, design and implement solutions. Contact us. We'd be happy to answer your questions or demonstrate how Clarifai's Computer Vision AI can integrate into your application.