Natural Language

Natural Language Processing (NLP) is the practice of training a model to recognize and interpret language.

Tokenizing

An essential preprocessing step for natural language is to tokenize the text. This is a feature extraction technique that breaks a sentence into a list of tokens: words, subwords, or characters. See this article by Analytics Vidhya for a more detailed explanation.
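As a quick illustration (a plain-Python sketch; the sample sentence is made up), word-level and character-level tokenization can be as simple as:

# word-level tokenization via whitespace splitting
sentence = "Tokenizing breaks text into pieces."
words = sentence.split()   # ['Tokenizing', 'breaks', 'text', 'into', 'pieces.']

# character-level tokenization
chars = list(sentence)     # ['T', 'o', 'k', 'e', 'n', ...]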

Note

To use a pretrained model, we also have to use its associated tokenizer so that the text is split the same way it was during pretraining.

Jay Alammar gives a wonderful visualization using the DistilBERT tokenizer, showing a sentence being split into its words and subwords and then converted into vectors of IDs.

DistilBERT tokenization. Source

In a table or dataframe, the tokens and their IDs look like this.

Table representation of tokenization. Source
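To make the figures concrete, here is a short sketch using the transformers library; the sample sentence is an assumption, and the exact subword split depends on the DistilBERT vocabulary:

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# split into words/subwords; pieces that continue a word are prefixed with '##'
tokens = tokenizer.tokenize("A visually stunning rumination on love")

# map each token to its ID in the pretrained vocabulary
ids = tokenizer.convert_tokens_to_ids(tokens)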

Embedding

Embedding is a method of representing words as vectors of numbers, which also enables us to calculate how similar words are to each other. Below is an example of the word embedding for the word "king", using GloVe.

[ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , 
-0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , 
-0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 , 
-0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 , 
0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 , 1.9927 , 
-0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 , 0.34344 , 
-0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 , -1.0137 , 
-0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 , -0.51042 ] 

We can visualize this list of floats as a heatmap and compare it with two other words, "man" and "woman". The latter two are more similar to each other, showing that embeddings can capture the meaning of, and associations between, words.

King-Man-Woman word embedding comparison. Source
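The same comparison can be reproduced in code. A minimal sketch with gensim (the model name below comes from gensim's downloader catalogue; the similarity values are approximate):

import gensim.downloader as api

# downloads and loads the 50-dimensional GloVe vectors on first use
glove = api.load("glove-wiki-gigaword-50")

glove["king"]                      # the 50 floats shown above
glove.similarity("man", "woman")   # relatively high cosine similarity
glove.similarity("king", "woman")  # noticeably lower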

Libraries

HuggingFace

HuggingFace has become the de facto hub for NLP modelling with neural networks, offering a wide variety of SOTA transformer architectures and an easy-to-use API that integrates with PyTorch and TensorFlow.

from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification
import tensorflow as tf

# load the pretrained tokenizer and model from the same checkpoint,
# adding a 2-class classification head on top
model_nm = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_nm)
model = TFDistilBertForSequenceClassification.from_pretrained(model_nm, num_labels=2)
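A hedged usage sketch for this tokenizer/model pair (the sample sentences and batch settings are assumptions, not part of the snippet above):

# tokenize a small batch into TensorFlow tensors
inputs = tokenizer(
    ["I loved this movie", "Terrible plot and acting"],
    padding=True, truncation=True, return_tensors="tf",
)

# forward pass; the new head is untrained, so the logits are only
# meaningful after fine-tuning
outputs = model(inputs)
probs = tf.nn.softmax(outputs.logits, axis=-1)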

spaCy is a popular general-purpose NLP library.
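A minimal spaCy sketch (assumes the small English pipeline has been installed with python -m spacy download en_core_web_sm; the sample sentence is made up):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy tokenizes, tags, and parses text in one pass.")

# each token carries linguistic annotations out of the box
for token in doc:
    print(token.text, token.pos_, token.lemma_)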