CNN

A Convolutional Neural Network (CNN) is a neural network architecture used mainly for image classification and object detection.

History

Feature Selection

Convolution

The convolution layer extracts features from the image using kernels (also called filters). A kernel is a small 2D matrix of numbers. Convolution slides the kernel over the image in steps given by the stride (the number of pixels per shift). At each position, the overlapping pixels are multiplied element-wise by the kernel and summed to give one output value (see this great explanation). A stride greater than 1 shrinks the output relative to the input. Different kernels accentuate different features or produce different effects, such as edge detection, sharpening, or blurring.
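The sliding-window computation can be sketched in NumPy. This is a minimal illustration, not an optimized implementation: the function name `conv2d`, the 6x6 test image, and the Prewitt-style kernel are assumptions chosen for the example, and no padding is applied. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the output matrix."""
    kh, kw = kernel.shape
    # Output size without padding: floor((input - kernel) / stride) + 1
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]
            # Element-wise multiply the patch by the kernel, then sum
            out[i, j] = np.sum(patch * kernel)
    return out

# Hypothetical 6x6 "image" whose values increase from left to right
image = np.arange(36, dtype=float).reshape(6, 6)

# Prewitt-style kernel that responds to horizontal intensity changes
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

print(conv2d(image, edge_kernel, stride=1).shape)  # (4, 4)
print(conv2d(image, edge_kernel, stride=2).shape)  # (2, 2)
```

With a stride of 2 the output drops from 4x4 to 2x2, which is the dimension reduction described above.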

The convolved output is then passed through a ReLU activation function, and the result is called a feature map. In practice, a convolutional layer has multiple filters and outputs one feature map per filter.
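As a rough sketch of a layer with several filters, the snippet below applies four randomly initialized 3x3 kernels to one image and passes each result through ReLU. The names `relu` and `conv_layer`, the random inputs, and the choice of stride 1 with no padding are assumptions for illustration only. In a trained network the kernel values would be learned.

```python
import numpy as np

def relu(x):
    """ReLU keeps positive values and replaces negatives with zero."""
    return np.maximum(x, 0.0)

def conv_layer(image, kernels):
    """Apply each kernel (stride 1, no padding), then ReLU.

    Returns an array of shape (n_filters, out_h, out_w): one feature map per filter.
    """
    n_filters, kh, kw = kernels.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    maps = np.zeros((n_filters, out_h, out_w))
    for f in range(n_filters):
        for i in range(out_h):
            for j in range(out_w):
                maps[f, i, j] = np.sum(image[i:i + kh, j:j + kw] * kernels[f])
    return relu(maps)

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))        # hypothetical single-channel image
kernels = rng.normal(size=(4, 3, 3))   # four hypothetical 3x3 filters

feature_maps = conv_layer(image, kernels)
print(feature_maps.shape)  # (4, 6, 6)
```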

Pooling

The pooling layer downsamples each feature map to reduce the computational load, memory usage, and number of parameters in later layers, while retaining the important information. Spatial pooling techniques include Max Pooling (most common), Average Pooling, and Sum Pooling.
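Max pooling can be sketched as follows. The function name `max_pool2d`, the 2x2 window with stride 2, and the sample values are assumptions for the example. Swapping `.max()` for `.mean()` or `.sum()` would give average or sum pooling.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = feature_map[r:r + size, c:c + size].max()
    return out

# Hypothetical 4x4 feature map
fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 9, 5],
                 [1, 1, 3, 7]], dtype=float)

pooled = max_pool2d(fmap)
print(pooled)
# [[6. 2.]
#  [2. 9.]]
```

The 4x4 input becomes 2x2, so later layers process a quarter as many values.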

Classification

The image matrix is flattened into a vector and fed to a standard fully connected network, whose final layer ends with a softmax activation function that outputs one probability per class.
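A minimal sketch of this classification head is shown below, assuming four 3x3 pooled feature maps, one hidden layer of 16 units, and three output classes. All sizes, the random weights, and the names `W1`, `W2`, `b1`, `b2` are hypothetical. A real network would learn the weights during training.

```python
import numpy as np

def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: 4 feature maps of 3x3
feature_maps = rng.normal(size=(4, 3, 3))

# Flatten to a vector of length 4 * 3 * 3 = 36
x = feature_maps.reshape(-1)

# Hypothetical weights: 36 inputs -> 16 hidden units -> 3 classes
W1 = rng.normal(scale=0.1, size=(16, x.size))
b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(3, 16))
b2 = np.zeros(3)

hidden = np.maximum(W1 @ x + b1, 0.0)  # fully connected layer + ReLU
probs = softmax(W2 @ hidden + b2)      # final layer + softmax

print(probs.shape)              # (3,)
print(int(np.argmax(probs)))    # index of the predicted class
```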

Datasets

There are various large annotated image datasets that allow model training and benchmarking for image classification, detection, and segmentation.

Name	By	Segmentation
ImageNet	Li Fei-Fei, Stanford	No
COCO	Microsoft	Yes

Evaluation

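For object detection (localisation & classification), mAP (Mean Average Precision) is often used. The score depends in large part on the IoU (Intersection over Union) threshold, which sets how much a predicted box must overlap a ground-truth box to count as a correct detection.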
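IoU for two axis-aligned boxes can be computed as in the sketch below. The `(x_min, y_min, x_max, y_max)` box format, the function name `iou`, and the sample boxes are assumptions for illustration. Datasets and libraries differ in their box conventions.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

pred = (0, 0, 2, 2)    # hypothetical predicted box
truth = (1, 1, 3, 3)   # hypothetical ground-truth box

score = iou(pred, truth)
print(round(score, 4))   # 0.1429 (intersection 1, union 7)
print(score >= 0.5)      # False: not a match at an IoU threshold of 0.5
```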

mAP is calculated by taking the mean AP over all classes and/or over all IoU thresholds, depending on the detection challenge.
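The sketch below shows one way to compute AP per class and then average it into mAP. It assumes the detections for each class are already sorted by descending confidence and already flagged as true or false positives at a single IoU threshold. The function name `average_precision`, the flags, and the ground-truth counts are hypothetical. It uses all-point interpolation of the precision-recall curve, which is one of several conventions, so individual challenges may differ in the details.

```python
import numpy as np

def average_precision(is_true_positive, n_ground_truth):
    """AP for one class from detections sorted by descending confidence."""
    tp = np.asarray(is_true_positive, dtype=float)
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / n_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)
    # Replace each precision with the best precision at any higher recall
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Area under the curve: precision times each recall increment
    recall_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - recall_prev) * precision))

# Hypothetical results at one IoU threshold: 1 = true positive, 0 = false positive
detections = {
    "cat": ([1, 0, 1], 2),  # (flags sorted by confidence, number of ground-truth boxes)
    "dog": ([1, 1], 2),
}

ap_per_class = {name: average_precision(flags, n_gt)
                for name, (flags, n_gt) in detections.items()}
mean_ap = sum(ap_per_class.values()) / len(ap_per_class)

print({k: round(v, 4) for k, v in ap_per_class.items()})  # {'cat': 0.8333, 'dog': 1.0}
print(round(mean_ap, 4))                                   # 0.9167
```

To average over IoU thresholds as well, the same calculation would be repeated with the true-positive flags recomputed at each threshold, and the resulting mAP values averaged.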