
Transformers

The Transformer is an architecture built on an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Because attention processes all tokens in parallel rather than one after another, these models can be trained much faster than recurrent ones.
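As a rough illustration of the attention mechanism at the heart of the Transformer, here is a minimal sketch of scaled dot-product attention in NumPy (the function name and toy shapes are mine, for illustration only):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V                                    # weighted sum of values

# 4 tokens with 8-dimensional embeddings (arbitrary toy sizes)
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 8)
```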

More on transformers.

History

Advances in NLP leading up to transformers can broadly be classified into three stages, of which this article gives an excellent explanation.

| Type | Example | Limitation |
| --- | --- | --- |
| Bag of Words | TF-IDF | large feature matrix, no contextual relationship between words |
| Word Embedding | Word2vec | unable to represent a word with more than one meaning |
| Language Models | ELMo | - |
Advancements in NLP. Y-axis is the no. of parameters in millions. Source

BERT

BERT, or Bidirectional Encoder Representations from Transformers, was introduced by Google in 2018. Its key technical innovation is applying bidirectional training of the Transformer encoder to language modelling, through an objective called the Masked Language Model.

BERT is pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers). It is now used within Google Search.

The pretrained models usually come in two sizes, base and large, each available in a cased and an uncased variant; see the loading sketch after the table.

| Type | Description |
| --- | --- |
| Base | comparable in size to the OpenAI Transformer, to allow a performance comparison |
| Large | a much larger model, used to achieve SOTA performance |
| Cased | the model differentiates between letter casings |
| Uncased | the model does not differentiate between letter casings |
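As a concrete example, a minimal sketch of loading these variants, assuming the Hugging Face transformers library (the checkpoint names are the standard ones on the Hugging Face Hub; the large checkpoints are multi-gigabyte downloads):

```python
from transformers import AutoModel, AutoTokenizer

# Each variant is identified by a checkpoint name on the Hugging Face Hub.
for checkpoint in ["bert-base-uncased", "bert-base-cased",
                   "bert-large-uncased", "bert-large-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    print(checkpoint, model.num_parameters())
```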

Masked Language Model

Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence.

Masked Language Model. Source
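This masked-word prediction can be tried directly with the Hugging Face fill-mask pipeline; a minimal sketch (the example sentence is arbitrary):

```python
from transformers import pipeline

# bert-base-uncased was pretrained with the masked language modelling objective
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))
```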

This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.

In practice, only 80% of the selected tokens are actually replaced with [MASK]; 10% are replaced with a random word, and the remaining 10% keep the original word. See more.
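A rough sketch of that masking scheme, written from the description above (the token IDs, mask ID, and vocabulary size are placeholders, not BERT's actual vocabulary):

```python
import random

MASK_ID = 103          # placeholder for the [MASK] token id
VOCAB_SIZE = 30522     # placeholder vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """BERT-style masking: of the ~15% selected tokens,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels.append(tok)                            # model must predict the original token
            roll = random.random()
            if roll < 0.8:
                inputs[i] = MASK_ID                       # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: replace with a random token
            # remaining 10%: keep the original token
        else:
            labels.append(-100)                           # ignored by the loss (PyTorch convention)
    return inputs, labels

print(mask_tokens([7592, 1010, 2088, 999]))
```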

Next Sentence Prediction

During pretraining, the model concatenates two masked sentences as its input. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.
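A sketch of scoring such a sentence pair, assuming the Hugging Face transformers library and its BertForNextSentencePrediction head (the example sentences are arbitrary):

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# The tokenizer joins the two sentences as [CLS] A [SEP] B [SEP]
encoding = tokenizer("The cat sat on the mat.", "It then fell asleep.",
                     return_tensors="pt")
logits = model(**encoding).logits

# Index 0 = "B follows A", index 1 = "B is a random sentence"
probs = torch.softmax(logits, dim=-1)
print(probs)
```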

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
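For instance, a minimal sketch of that feature-extraction approach, using the [CLS] token's hidden state as the sentence feature and scikit-learn for the classifier (the toy sentiment dataset is made up):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I loved this film.", "Terrible, a waste of time.",
             "Absolutely wonderful.", "I want my money back."]
labels = [1, 0, 1, 0]   # toy sentiment labels

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    # Use the [CLS] token's hidden state as a fixed-size sentence feature
    features = bert(**batch).last_hidden_state[:, 0, :].numpy()

clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))
```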

More on BERT. 1, 2.


DistilBERT

DistilBERT is a lightweight version of BERT developed by Hugging Face. It is faster, running 60% faster than BERT, and smaller, retaining more than 95% of BERT's performance with 40% fewer parameters.

It uses a model optimization technique called (knowledge) distillation, where a large model, called the teacher, is compressed into a smaller model, called the student.
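A minimal sketch of the distillation loss, assuming PyTorch (the temperature and toy logits are placeholders; DistilBERT's actual training objective also combines this with masked-language-modelling and cosine-embedding losses):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradients keep roughly the same magnitude as the hard loss
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2

# Toy logits over a 5-class output for a batch of 3 examples
teacher_logits = torch.randn(3, 5)
student_logits = torch.randn(3, 5, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```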