Transformers
The Transformer is an architecture built around attention, a mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer consists of two separate components: an encoder that reads the text input and a decoder that produces a prediction for the task. Because attention processes all tokens in parallel rather than sequentially, these models can be trained much faster than recurrent architectures.
More on transformers.
History
Advances in NLP before the Transformer can broadly be classified into three types, summarized below; this article gives an excellent explanation.
Type | Example | Limitation |
---|---|---|
Bag of Words | TF-IDF | large feature matrix, no contextual relationship between words |
Word Embedding | Word2vec | unable to represent a word with more than one meaning |
Language Models | ELMo | - |

BERT
BERT, or Bidirectional Encoder Representations from Transformers, was introduced by Google in 2018. Its key technical innovation is applying bidirectional training of the Transformer encoder to language modelling, through an objective called the Masked Language Model.
BERT is pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers). It is now used within Google Search.
The pretrained models usually come in two sizes, base and large, each available in a cased and an uncased variant (see the sketch after the table).
Type | Description |
---|---|
Base | comparable in size to the OpenAI GPT Transformer, to allow a direct performance comparison |
Large | a much larger model, used to achieve state-of-the-art performance |
Cased | the model preserves case, so it differentiates between casings |
Uncased | the model lowercases all text, so casing is ignored |
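As a minimal sketch of how these variants are picked in practice, the snippet below loads one of them with the Hugging Face transformers library; the checkpoint names are the ones published on the Hugging Face Hub, and the printed tokens are illustrative.

```python
from transformers import AutoModel, AutoTokenizer

# Published checkpoints: bert-base-uncased, bert-base-cased,
#                        bert-large-uncased, bert-large-cased
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The uncased tokenizer lowercases its input, so casing is ignored.
print(tokenizer.tokenize("Transformers are Fun"))  # ['transformers', 'are', 'fun']
```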
Masked Language Model
Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence.

This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask future tokens. It allows the model to learn a bidirectional representation of the sentence.
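To see masked-token prediction in action, the sketch below uses the Hugging Face fill-mask pipeline with a pretrained BERT checkpoint; the completion mentioned in the comment is indicative, not a guaranteed output.

```python
from transformers import pipeline

# A pretrained BERT model predicting a masked word from its full context.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))  # e.g. 'paris' with the highest score
```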
In practice, only 80% of the selected tokens are replaced with [MASK]; 10% are replaced with a random word, and the remaining 10% keep the original word. See more.
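The snippet below is a simplified, self-contained sketch of this 80/10/10 rule on a whitespace-tokenized sentence; the helper name mask_tokens and the toy vocabulary are illustrative, not part of BERT's actual preprocessing code.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Select ~15% of tokens, then replace 80% of them with [MASK],
    10% with a random token, and keep 10% unchanged."""
    masked = list(tokens)
    labels = [None] * len(tokens)      # None = not selected, ignored by the loss
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok            # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original token
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens, vocab=tokens))
```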
Next Sentence Prediction
During pretraining, the model concatenates two masked sentences as its input. Sometimes they are sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.
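A minimal sketch of how a sentence pair is packed into a single input, using the Hugging Face tokenizer; the example sentences are arbitrary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair is packed as: [CLS] sentence A [SEP] sentence B [SEP]
enc = tokenizer("The man went to the store.", "He bought a gallon of milk.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])  # 0s mark sentence A, 1s mark sentence B
```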
This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
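As a sketch of that feature-extraction workflow, the snippet below takes the hidden state of the [CLS] token as a fixed-size sentence feature; using [CLS] this way is one common convention, not the only option, and the example sentences are arbitrary.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I loved this movie.", "The plot made no sense."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token's hidden state as a fixed-size sentence feature.
features = outputs.last_hidden_state[:, 0, :]  # shape: (2, 768)
# `features` can now be fed to any standard classifier, e.g. logistic regression.
```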
DistilBERT
DistilBERT is a lightweight version of BERT developed by Hugging Face. It runs 60% faster than BERT and retains more than 95% of its performance while having 40% fewer parameters.
It uses a model optimization technique called (knowledge) distillation, where a large model, called the teacher, is compressed into a smaller model, called the student.
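Below is a simplified sketch of a distillation objective: the student is trained to match the teacher's softened output distribution alongside the usual supervised loss. The hyperparameters are illustrative, and DistilBERT's actual training loss also includes additional terms (such as a cosine-embedding loss), so this is a sketch of the general technique rather than the exact recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend cross-entropy on the true labels with a KL term that pushes the
    student's softened distribution toward the teacher's."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # scale by T^2 to keep gradient magnitudes comparable
    return alpha * hard_loss + (1 - alpha) * soft_loss
```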