Basics of Neural Networks
A perceptron is the simplest unit of a neural network. It first weights its inputs, sums them up, and adds a constant called the bias. An activation function is then applied to produce the neuron's output.
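As a minimal illustration, a perceptron can be written in a few lines of plain Python; the step activation and the example numbers are just assumptions for this sketch:

```python
# A perceptron: weight the inputs, add the bias, apply an activation.
# The step activation and the example values below are illustrative only.
def perceptron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # weighted sum + bias
    return 1 if z > 0 else 0                                # step activation

print(perceptron([1.0, 0.5], weights=[0.6, -0.4], bias=-0.1))  # -> 1
```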

A deep neural network is simply a neural network with many hidden layers between the input and output layers. The architecture of the hidden layers can be very complex, as in a CNN or LSTM.

Activation Function
An activation function determines what a perceptron's output should be, given its weighted input. The choice of function differs between the input/hidden layers and the output layer.
For input/hidden layers, ReLU (Rectified Linear Unit) is very popular compared to the now mostly obsolete sigmoid & tanh functions because it avoids the vanishing gradient problem and converges faster. However, it is susceptible to dead neurons, so variants like Leaky ReLU, Maxout and others were created to address this.
For the output layer, the choice depends on the type of task we are training for (see the sketch after the table below).
Output Type | Function |
---|---|
Binary Classification | Sigmoid |
Multi-Class Classification | Softmax |
Regression | Linear |
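As a concrete example, here is a minimal Keras sketch of the three output choices; the layer sizes, input shape, and class count are placeholder assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hidden layers use ReLU; only the output activation changes with the task.

# Binary classification: one sigmoid output
binary_model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),
    layers.Dense(1, activation='sigmoid'),
])

# Multi-class classification (here 5 classes): softmax over the classes
multiclass_model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),
    layers.Dense(5, activation='softmax'),
])

# Regression: a linear (identity) output
regression_model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),
    layers.Dense(1, activation='linear'),
])
```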
Backpropagation
Backpropagation is short for "backward propagation of errors." Training a model is about minimising the loss, and to do that we fine-tune the weights of the neural network based on the error obtained in the previous iteration. This is done by working backwards from the last layer to the first input layer.
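To make the idea concrete, here is a rough NumPy sketch of backpropagation for a single sigmoid neuron with binary cross-entropy loss; the sample values and learning rate are arbitrary assumptions:

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])  # one training sample with 3 features
y = 1.0                         # target label
w = np.zeros(3)                 # weights
b = 0.0                         # bias
lr = 0.1                        # learning rate (step size)

for _ in range(100):
    # forward pass: weighted sum + bias, then sigmoid activation
    z = np.dot(w, x) + b
    a = 1.0 / (1.0 + np.exp(-z))

    # loss: binary cross-entropy for this single sample
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))

    # backward pass: for sigmoid + cross-entropy, dL/dz simplifies to (a - y)
    dz = a - y
    dw = dz * x   # chain rule: dL/dw = dL/dz * dz/dw
    db = dz

    # gradient descent update, working backwards from the error
    w -= lr * dw
    b -= lr * db
```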
There are several essential terms that need to be defined when training a NN.
Optimizer
Optimizers are learning algorithms used to adjust the weights & biases of the neural network optimally to obtain the minimal loss. Gradient descent is the most classic of them, while the most widely used optimizer now is Adam (Adaptive Moment Estimation). More in this article.
Optimizer | Description |
---|---|
Gradient Descent | Most basic & classic |
Adam | Most popular and a strong default choice. Adaptive learning using EWMA of the 1st & 2nd moments of the gradients |
RMSprop | 2nd most popular. EWMA of the squared gradients |
Adagrad | Able to train on sparse data. Adapts the learning rate using accumulated squared gradients |
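In Keras, for instance, each of these optimizers can be instantiated directly when compiling a model; the learning rates below are illustrative values, not recommendations:

```python
from tensorflow import keras

sgd = keras.optimizers.SGD(learning_rate=0.01)           # plain gradient descent
adam = keras.optimizers.Adam(learning_rate=0.001)        # EWMA of 1st & 2nd moments
rmsprop = keras.optimizers.RMSprop(learning_rate=0.001)  # EWMA of squared gradients
adagrad = keras.optimizers.Adagrad(learning_rate=0.01)   # accumulated squared gradients

# e.g. model.compile(optimizer=adam, loss=...) or simply optimizer='adam'
```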
Loss Function
The loss (or cost) function measures how far the model's predictions are from the targets, and training aims to minimise it. It is thus important to choose one that best represents the type of data and learning task. Below are some examples (using Keras loss names), with a short compile sketch after the table.
Type | Loss Function |
---|---|
Binary Classification | binary_crossentropy |
Multi-class Classification | categorical_crossentropy |
Regression | mse |
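As a sketch, the loss is passed to Keras at compile time; the model below is a placeholder binary classifier, with the swaps for the other tasks noted in comments:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder binary-classification model; layer sizes are illustrative.
# For multi-class: Dense(n_classes, activation='softmax') + 'categorical_crossentropy'
# For regression:  Dense(1, activation='linear')          + 'mse'
model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```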
Learning Rate
Learning rate (lr), or step size, is the most important parameter to adjust when using an optimizer. Too large a lr can cause the model to overshoot and never find the minimal loss, whereas a lr that is too small can make training take far too long.

To check whether you are using a good learning rate, we can plot the loss over epochs. It should ideally decrease steadily and flatten out to a consistent value over time, as in the sketch below.
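A rough sketch of such a plot, assuming the compiled `model` from the earlier sketch and some placeholder training data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder training data for illustration (1000 samples, 10 features)
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=(1000,))

history = model.fit(X, y, epochs=50, batch_size=32, verbose=0)

# The training loss should decrease steadily and flatten out over time
plt.plot(history.history['loss'])
plt.xlabel('epoch')
plt.ylabel('training loss')
plt.show()
```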

Batch Size & Epoch
Because of memory limitations, we usually cannot feed the entire training set to the network in one go, so we divide it into batches. Given a total of 400 training samples and a batch size of 4, we need 400/4 = 100 iterations to complete one epoch, which is one full pass over all training samples.
Multiple epochs are required to reduce the loss to a minimum.
Term | Description |
---|---|
Iteration | one forward/backward pass of a single batch |
Batch Size | the number of training samples in one forward/backward pass (1 iteration) |
Epoch | one forward/backward pass over ALL training samples (many iterations) |
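Following the 400-sample example above, a self-contained Keras sketch (the data here is random placeholder data):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# 400 placeholder samples with 10 features each
X = np.random.rand(400, 10)
y = np.random.randint(0, 2, size=(400,))

model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# batch_size=4 -> 400 / 4 = 100 iterations per epoch; epochs=10 repeats the full pass 10 times
model.fit(X, y, batch_size=4, epochs=10)
```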
Batch Normalization
While a good activation function like ReLU or its variants can reduce the vanishing gradient problem, it might still return during later training. In 2015, a technique called Batch Normalization was introduced to avoid this by normalizing each batch of layer inputs using the batch mean and variance, then scaling and shifting them. Other benefits are as listed.
- Faster convergence
- Decrease initial weights importance
- Robust to hyperparameters
- Requires less data for generalization
Training will be slower, as each epoch takes more time to compute the normalization. However, fewer epochs are usually required to reach convergence.
However, the results will not be good if the batch size is very small, since the mean and variance will not be representative of the dataset. Other effects are listed in this article.
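A hedged sketch of how BatchNormalization layers might be placed in a Keras model; the layer sizes and the placement after each Dense layer are assumptions, not a prescription:

```python
from tensorflow import keras
from tensorflow.keras import layers

# BatchNormalization normalizes each batch using its mean and variance,
# then applies a learnable scale and shift before passing it on.
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.BatchNormalization(),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(1, activation='sigmoid'),
])
```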
Dropout
Dropout is one of the most popular regularization techniques for DNNs. Proposed in 2012, it is a simple algorithm: for every iteration, it randomly selects neurons to be ignored during training, and thus prevents overfitting.

The hyperparameter dropout rate is usually set to 0.1-0.5, and the dropout layer is placed before a NN layer. Some tricks for adjusting the rate include (a short Keras sketch follows the list):
- Increase the dropout rate if the model is overfitting, and vice versa
It might also help to increase the dropout rate for large layers and reduce it for smaller ones.
Many SOTA architectures only use dropout after the last hidden layer, so it might be worth a try if full dropout is too strong.
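A minimal Keras sketch of dropout placement; the 0.3 rate and layer sizes are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Dropout(rate) randomly zeroes that fraction of the previous layer's outputs
# on each training iteration; it is inactive at inference time.
model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(10,)),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(1, activation='sigmoid'),
])
```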