Basics of Neural Networks

A perceptron is the simplest unit of a neural network. The neuron first weights its inputs and then sums them together with a constant called the bias. An activation function is then applied to this sum, which produces the output of the neuron.
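As a minimal sketch of the weighted sum, bias and activation described above, the snippet below implements a single neuron in plain Python. The function names, the step activation and the AND-gate weights are illustrative assumptions, not part of the original text.

```python
def perceptron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

def step(z):
    """Classic perceptron activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0

# Hypothetical weights and bias that make the neuron behave like a logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5
outputs = [perceptron([a, b], and_weights, and_bias, step)
           for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```

Only the input pair (1, 1) pushes the weighted sum above zero, so only that pair fires the neuron.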

A deep neural network is simply a neural network with many hidden layers between the input and output layers. The architecture of the hidden layers can be very complex, as in a CNN or an LSTM.

Activation Function

An activation function tells a perceptron what its output should be. The choice of function differs between the input/hidden layers and the output layer.

For input/hidden layers, ReLU (Rectified Linear Unit) is very popular compared to the now less common sigmoid & tanh functions because it mitigates the vanishing gradient problem and converges faster. However, it is susceptible to dead neurons, so variants such as Leaky ReLU, Maxout and other functions were created to address this.

For the output layer, the choice depends on the type of task the model is being trained for.
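To make the dead-neuron issue concrete, here is a small sketch of ReLU and Leaky ReLU together with their gradients. The slope `alpha=0.01` is a common but arbitrary choice assumed for illustration.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # The gradient is zero for negative inputs, so a neuron stuck
    # in that region receives no updates (a "dead" neuron).
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    # alpha is the small slope kept for negative inputs (assumed value).
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

for z in (-2.0, 3.0):
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```

For a negative input the ReLU gradient is exactly zero, while the leaky variant keeps a small non-zero gradient so the weights can still move.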

Output Type	Function
Binary Classification	`Sigmoid`
Multi-Class Classification	`Softmax`
Regression	`Linear`
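The three output functions in the table can be sketched in a few lines of standard-library Python. The function names here are illustrative and are not tied to any specific library API.

```python
import math

def sigmoid(z):
    """Squashes a real number into (0, 1); read as P(class = 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(z):
    """Identity: the raw value is the regression prediction."""
    return z

print(sigmoid(0.0))  # 0.5
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```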

Backpropagation

Backpropagation is short for "backward propagation of errors." Training a model is about minimising the loss, and backpropagation does this by fine-tuning the weights of the neural network based on the error obtained in the previous iteration. The gradients of the loss are computed by working backwards from the last layer to the first layer.

There are various essential terms that need to be defined for an NN during training.
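The backward pass can be illustrated on a deliberately tiny network: one ReLU hidden neuron followed by one linear output, with a squared-error loss. This is a sketch under those assumptions. The parameter values and the learning rate of 0.1 are made up for the example.

```python
def forward(params, x, target):
    w1, b1, w2, b2 = params
    z = w1 * x + b1           # hidden pre-activation
    h = max(0.0, z)           # ReLU
    y = w2 * h + b2           # linear output
    loss = (y - target) ** 2  # squared error
    return z, h, y, loss

def backward(params, x, target):
    """Gradients of the loss w.r.t. (w1, b1, w2, b2), from the output layer back."""
    w1, b1, w2, b2 = params
    z, h, y, _ = forward(params, x, target)
    dy = 2.0 * (y - target)             # dL/dy
    dw2, db2 = dy * h, dy               # output layer
    dh = dy * w2                        # error passed back to the hidden layer
    dz = dh * (1.0 if z > 0 else 0.0)   # through the ReLU
    dw1, db1 = dz * x, dz               # first layer
    return [dw1, db1, dw2, db2]

params = [0.5, 0.1, -0.3, 0.2]  # illustrative values
x, target = 2.0, 1.0
grads = backward(params, x, target)

# One gradient-descent step with a hypothetical learning rate of 0.1.
new_params = [p - 0.1 * g for p, g in zip(params, grads)]
print(forward(params, x, target)[3], forward(new_params, x, target)[3])
```

The second printed loss is smaller than the first, which is the effect a single weight update is meant to have.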

Optimizer

Optimizers are learning algorithms used to change the weights & biases of the neural network so as to obtain the minimal loss. Gradient descent is the most classic of them, while the most widely used optimizer now is Adam (Adaptive Moment Estimation). More in this article.

Optimizer	Description
Gradient Descent	Most basic & classic
Adam	Most popular and a common default choice. Adaptive learning rate using EWMAs (exponentially weighted moving averages) of the 1st & 2nd moments of the gradient
RMSprop	Also very popular. Builds on Adagrad by using an EWMA of the squared gradient
Adagrad	Works well on sparse data. Adaptive learning rate using accumulated squared gradients
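To show how the update rules differ, the sketch below runs plain gradient descent and Adam on a toy one-parameter loss, f(w) = (w - 3)^2. The toy loss, the learning rate of 0.1 and the step count are assumptions for illustration. The Adam defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly quoted ones.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, steps=200):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g      # EWMA of the gradient (1st moment)
        v = beta2 * v + (1 - beta2) * g * g  # EWMA of the squared gradient (2nd moment)
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

print(gradient_descent(0.0), adam(0.0))
```

Both runs should end close to w = 3. On a problem this simple, plain gradient descent is entirely adequate, and the adaptive scaling of Adam matters more when gradients differ widely in scale across parameters.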

Loss Function

The loss (or cost) function measures how far the model's predictions are from the true targets, and it is the quantity that training minimises. It is thus important to choose one that best represents the type of data and learning. Below are some examples.

Type	Loss Function
Binary Classification	`binary_crossentropy`
Multi-class Classification	`categorical_crossentropy`
Regression	`mse`
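The losses named in the table can be written out directly. The following is a plain-Python sketch of the underlying formulas for single examples, not the library implementations. The `eps` clipping value is an assumption used to avoid log(0).

```python
import math

def binary_crossentropy(y_true, p, eps=1e-12):
    """y_true is 0 or 1; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_crossentropy(y_onehot, probs, eps=1e-12):
    """y_onehot is a one-hot target; probs is a predicted class distribution."""
    return -sum(t * math.log(max(q, eps)) for t, q in zip(y_onehot, probs))

def mse(y_true, y_pred):
    """Mean squared error over a list of regression targets."""
    return sum((t - q) ** 2 for t, q in zip(y_true, y_pred)) / len(y_true)

print(binary_crossentropy(1, 0.9), binary_crossentropy(1, 0.1))
print(categorical_crossentropy([0, 1, 0], [0.2, 0.7, 0.1]))
print(mse([1.0, 2.0], [1.5, 2.5]))  # 0.25
```

A confident correct prediction (0.9 for a true label of 1) gives a much smaller cross-entropy than a confident wrong one (0.1).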

Learning Rate

Learning rate (lr), or step size, is often the most important hyperparameter to adjust when using an optimizer. Too large an lr can prevent the model from finding the minimal loss, whereas an lr that is too small can make the training process take too long.

To check whether the learning rate is a good one, plot the loss over epochs. Ideally, the loss should drop steadily and then level off at a low value.

Loss over epoch for varying learning rates. Source
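The same behaviour can be reproduced numerically. The sketch below runs gradient descent on the toy loss f(w) = w^2 with three learning rates. The three values are assumptions chosen to show the "too small", "reasonable" and "too large" cases.

```python
def loss_curve(lr, steps=20, w=5.0):
    """Loss f(w) = w^2 after each gradient-descent step, starting from w = 5."""
    losses = []
    for _ in range(steps):
        w -= lr * 2.0 * w  # the gradient of w^2 is 2w
        losses.append(w * w)
    return losses

for lr in (0.001, 0.1, 1.1):  # too small, reasonable, too large
    curve = loss_curve(lr)
    print(f"lr={lr}: first={curve[0]:.3f} last={curve[-1]:.3f}")
```

With the smallest rate the loss barely moves in 20 steps, with the middle rate it falls close to zero, and with the largest rate every step overshoots the minimum and the loss grows.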

Batch Size & Epoch

Because of memory limitations, we cannot feed the entire training data to the network in one go, so we divide it into batches. Given a total training sample size of 400 and a batch size of 4, we require 400/4 = 100 iterations to complete one epoch, which is one full pass over all training samples.

Multiple epochs are required to reduce the loss to a minimum.

Term	Description
Iteration	one forward/backward pass on one batch
Batch Size	number of training samples used in one forward/backward pass (1 iteration)
Epoch	one forward/backward pass over ALL training samples (many iterations)
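The relationship between the three terms can be checked with a short loop. This sketch uses the 400-sample, batch-size-4 example from the text. The helper name `iterate_minibatches` and the choice of 3 epochs are assumptions.

```python
import math

def iterate_minibatches(samples, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(400))  # stand-in for 400 training samples
batch_size, epochs = 4, 3   # 3 epochs is an arbitrary choice for the sketch

iterations_per_epoch = math.ceil(len(samples) / batch_size)
total_iterations = 0
for epoch in range(epochs):
    for batch in iterate_minibatches(samples, batch_size):
        total_iterations += 1  # one forward/backward pass would happen here

print(iterations_per_epoch, total_iterations)  # 100 300
```

In real training the samples are usually shuffled at the start of each epoch, which this sketch leaves out.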

Batch Normalization

While a good activation function like ReLU or its variants can reduce the vanishing gradient problem, the problem might still return later in training. Batch Normalization, a technique introduced in 2015, addresses this by normalizing each batch of layer inputs to zero mean and unit variance, then applying a learned scale and shift. Other benefits are listed below.

Faster convergence
Less sensitivity to the initial weights
More robust to hyperparameter choices
Requires less data for generalization

Training will be slower per epoch, as each epoch takes more time to compute the normalization. However, fewer epochs are usually required to reach convergence.

The results will not be good, however, if the batch size is very small, since the batch mean and variance will not be representative of the dataset. Other effects are listed in this article.
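The normalization step itself is short. The sketch below applies it to one feature across a single mini-batch in training mode. The names `gamma` (scale) and `beta` (shift) follow the usual convention, and they are fixed to 1 and 0 here for illustration although they are learned during training.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift it."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased (population) variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [10.0, 12.0, 14.0, 20.0]  # one feature's activations over a batch of 4
normed = batch_norm(batch)
print(normed)
```

The output has a mean of approximately 0 and a variance of approximately 1. With a batch this small, the mean and variance depend heavily on which few samples land in the batch, which is the small-batch weakness noted above. At inference time, implementations typically use running averages collected during training instead of the batch statistics.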

Dropout

Dropout is one of the most popular regularization techniques for DNNs. Proposed in 2012, it is a simple algorithm: at every training iteration, it randomly selects neurons to be ignored, which helps prevent overfitting.

The dropout rate is a hyperparameter usually set between 0.1 and 0.5, and a dropout layer is placed before an NN layer. Some tips for adjusting the rate include:

Increase the dropout rate if the model is overfitting, and vice versa
It might help to increase the dropout rate for large layers and reduce it for smaller ones.
Many SOTA architectures only use dropout after the last hidden layer, so it might be worth a try if full dropout is too strong
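As a rough illustration of the mechanism, here is a sketch of "inverted" dropout, the variant in which the kept activations are rescaled during training so that nothing needs to change at inference time. The function signature, the seed and the rate of 0.3 are assumptions for the example.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero each activation with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is disabled at inference time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
layer_output = [1.0] * 10000
dropped = dropout(layer_output, rate=0.3, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(zero_fraction, mean_activation)
```

Roughly 30% of the activations are zeroed, while the rescaling keeps the mean activation close to its original value of 1.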