Transformers
The Transformer is an architecture built around attention, a mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer consists of two separate components: an encoder that reads the text input and a decoder that produces a prediction for the task. Because attention processes all tokens in parallel rather than sequentially, these models can be trained much faster than recurrent architectures.
More on transformers.
History
Advances in NLP before the Transformer can broadly be classified into three types, summarized below; this article gives an excellent explanation.
Type | Example | Limitation |
---|---|---|
Bag of Words | TF-IDF | large feature matrix, no contextual relationship between words |
Word Embedding | Word2vec | unable to represent a word with more than one meaning |
Language Models | ELMo | - |

BERT
BERT, or Bidirectional Encoder Representations from Transformers, was introduced by Google in 2018. Its key technical innovation is applying bidirectional training of the Transformer encoder to language modelling, through an objective called the Masked Language Model.
BERT is pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers). It is now used within Google Search.
The pretrained models usually come in two sizes, base and large, each available in a cased and an uncased variant (see the sketch after the table).
Type | Description |
---|---|
Base | comparable in size to the OpenAI GPT Transformer, to allow a direct performance comparison |
Large | a much larger model, used to achieve state-of-the-art performance |
Cased | the model preserves case, so it differentiates between casings |
Uncased | the model lowercases all text, so casing is ignored |
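As a minimal sketch of how these variants are picked in practice, the snippet below loads one of them with the Hugging Face transformers library; the checkpoint names are the ones published on the Hugging Face Hub, and the printed tokens are illustrative.

```python
from transformers import AutoModel, AutoTokenizer

# Published checkpoints: bert-base-uncased, bert-base-cased,
#                        bert-large-uncased, bert-large-cased
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The uncased tokenizer lowercases its input, so casing is ignored.
print(tokenizer.tokenize("Transformers are Fun"))  # ['transformers', 'are', 'fun']
```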
Masked Language Model
Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence.

This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask future tokens. It allows the model to learn a bidirectional representation of the sentence.
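To see masked-token prediction in action, the sketch below uses the Hugging Face fill-mask pipeline with a pretrained BERT checkpoint; the completion mentioned in the comment is indicative, not a guaranteed output.

```python
from transformers import pipeline

# A pretrained BERT model predicting a masked word from its full context.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))  # e.g. 'paris' with the highest score
```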
In practice, only 80% of the selected tokens are replaced with [MASK]; 10% are replaced with a random word, and the remaining 10% keep the original word. See more.
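The snippet below is a simplified, self-contained sketch of this 80/10/10 rule on a whitespace-tokenized sentence; the helper name mask_tokens and the toy vocabulary are illustrative, not part of BERT's actual preprocessing code.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Select ~15% of tokens, then replace 80% of them with [MASK],
    10% with a random token, and keep 10% unchanged."""
    masked = list(tokens)
    labels = [None] * len(tokens)      # None = not selected, ignored by the loss
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok            # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original token
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens, vocab=tokens))
```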
Next Sentence Prediction
During pretraining, the model concatenates two masked sentences as its input. Sometimes they are sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.
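A minimal sketch of how a sentence pair is packed into a single input, using the Hugging Face tokenizer; the example sentences are arbitrary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair is packed as: [CLS] sentence A [SEP] sentence B [SEP]
enc = tokenizer("The man went to the store.", "He bought a gallon of milk.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])  # 0s mark sentence A, 1s mark sentence B
```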
This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
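As a sketch of that feature-extraction workflow, the snippet below takes the hidden state of the [CLS] token as a fixed-size sentence feature; using [CLS] this way is one common convention, not the only option, and the example sentences are arbitrary.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I loved this movie.", "The plot made no sense."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token's hidden state as a fixed-size sentence feature.
features = outputs.last_hidden_state[:, 0, :]  # shape: (2, 768)
# `features` can now be fed to any standard classifier, e.g. logistic regression.
```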
DistilBERT
DistilBERT is a lightweight version of BERT developed by Hugging Face. It runs 60% faster than BERT and retains more than 95% of its performance while having 40% fewer parameters.
It uses a model optimization technique called (knowledge) distillation, where a large model, called the teacher, is compressed into a smaller model, called the student.
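Below is a simplified sketch of a distillation objective: the student is trained to match the teacher's softened output distribution alongside the usual supervised loss. The hyperparameters are illustrative, and DistilBERT's actual training loss also includes additional terms (such as a cosine-embedding loss), so this is a sketch of the general technique rather than the exact recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend cross-entropy on the true labels with a KL term that pushes the
    student's softened distribution toward the teacher's."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # scale by T^2 to keep gradient magnitudes comparable
    return alpha * hard_loss + (1 - alpha) * soft_loss
```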