How to Use Vectors, Tokens, and Embeddings to Create Natural Language Models

by curvature
Learn what vectors, tokens, and embeddings are, how they are created, and how they are used to create natural language models.

Introduction

Natural language processing (NLP) is the field of computer science that deals with analyzing and generating natural language texts. NLP has many applications, such as machine translation, sentiment analysis, text summarization, chatbots, and more. However, to perform these tasks, machines need to understand the meaning and structure of human language, which is not an easy feat.

One of the main challenges of NLP is how to represent text data in a way that machines can process and learn from. This is where vectors, tokens, and embeddings come in. These are the building blocks of natural language models, which are the core components of NLP systems. In this blog post, we will explain what vectors, tokens, and embeddings are, how they are created, and how they are used to train natural language models.

What are vectors, tokens, and embeddings?

Vectors

A vector is a list of numbers that represents some information or data. For example, a vector can represent the coordinates of a point in a two-dimensional space, such as [2, 3]. Vectors can also have more than two dimensions, such as [1, 2, 3, 4]. The number of dimensions of a vector is also called its length or size.


Vectors are useful for storing and manipulating numerical data, and they support the operations of linear algebra and statistics. However, vectors can also represent non-numerical data, such as text, images, and audio. This is done by mapping each element of the data to a number or a range of numbers and storing the results in a vector. For example, we can map each letter of the alphabet to a number from 1 to 26 and represent the word “cat” as the vector [3, 1, 20].
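
A minimal sketch of that letter-to-number mapping (the 1-to-26 scheme is only an illustration, not a standard text encoding):

```python
# Map each lowercase letter to a number from 1 to 26 and collect the numbers in a vector.
def word_to_vector(word):
    return [ord(ch) - ord("a") + 1 for ch in word.lower() if ch.isalpha()]

print(word_to_vector("cat"))  # [3, 1, 20]
```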

Tokens

A token is a unit of text that has some meaning or significance. For example, a token can be a word, a punctuation mark, a symbol, a number, etc. Tokens are the basic elements of a text, and they can be used to analyze the structure, syntax, semantics, and meaning of a text.

Tokenization is the process of dividing a text into tokens, based on some rules or criteria. For example, we can tokenize a sentence by splitting it at every space or punctuation mark and discarding the whitespace and punctuation, which gives us a list of tokens that represent the words of the sentence. The sentence “Hello, world!” can then be tokenized as ["Hello", "world"].

However, tokenization is not always straightforward, and there can be different ways to tokenize the same text, depending on the language, the task, and the preference. For example, we can also tokenize a sentence by splitting it at every character, and then obtain a list of tokens that represent the letters of the sentence. For example, the sentence “Hello, world!” can be tokenized as ["H", "e", "l", "l", "o", ",", "w", "o", "r", "l", "d", "!"].
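
Both tokenization strategies can be sketched in a few lines of Python (real tokenizers, such as those in NLTK or spaCy, handle many more edge cases than this):

```python
import re

sentence = "Hello, world!"

# Word-level tokenization: split on runs of non-word characters and drop empty strings.
word_tokens = [t for t in re.split(r"\W+", sentence) if t]
print(word_tokens)  # ['Hello', 'world']

# Character-level tokenization: every non-space character becomes its own token.
char_tokens = [ch for ch in sentence if not ch.isspace()]
print(char_tokens)  # ['H', 'e', 'l', 'l', 'o', ',', 'w', 'o', 'r', 'l', 'd', '!']
```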

Embeddings

An embedding is a vector representation of a token, such that the vector captures some information or properties of the token, such as its meaning, context, or relation to other tokens. For example, an embedding can represent the word “cat” as a vector [0.2, -0.5, 0.7, 0.1], where each number corresponds to some feature or dimension of the word, such as its category, sentiment, frequency, etc.

Embeddings are useful for transforming text data into numerical data, which can then be fed to machine learning models, such as neural networks, to perform various NLP tasks. Embeddings can also be used to measure the similarity or distance between tokens, based on their vector representations. For example, we can use the cosine similarity to compare the embeddings of two words, and see how close or far they are in the vector space.
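
For instance, a minimal cosine-similarity check between two made-up embedding vectors might look like this:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (||a|| * ||b||); values near 1 mean the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat = [0.2, -0.5, 0.7, 0.1]  # made-up embeddings, not taken from a real model
dog = [0.3, -0.4, 0.6, 0.2]
print(round(cosine_similarity(cat, dog), 2))  # about 0.98: the two vectors are very similar
```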

How to create vectors, tokens, and embeddings?

There are many methods and techniques to create vectors, tokens, and embeddings for text data, and each one has its own advantages and disadvantages. Here, we will briefly introduce some of the most common and popular ones and walk through small examples of each.

One-hot encoding

One-hot encoding is a simple and intuitive way to create vectors for tokens, by using a binary representation. The idea is to assign a unique index to each token in the vocabulary, and then create a vector of the same size as the vocabulary, with all zeros except for a one at the index of the token. For example, if we have a vocabulary of four words, ["cat", "dog", "bird", "fish"], and we assign the indices [0, 1, 2, 3] to them, then we can create one-hot vectors for each word as follows:

  • “cat” -> [1, 0, 0, 0]
  • “dog” -> [0, 1, 0, 0]
  • “bird” -> [0, 0, 1, 0]
  • “fish” -> [0, 0, 0, 1]
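
A minimal sketch of this encoding, using the same four-word vocabulary:

```python
vocabulary = ["cat", "dog", "bird", "fish"]
token_to_index = {token: i for i, token in enumerate(vocabulary)}

def one_hot(token):
    # A vector of zeros with a single 1 at the token's index in the vocabulary.
    vector = [0] * len(vocabulary)
    vector[token_to_index[token]] = 1
    return vector

print(one_hot("bird"))  # [0, 0, 1, 0]
```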

One-hot encoding is easy to implement and understand, and it can preserve the uniqueness and identity of each token. However, it also has some drawbacks, such as:

  • It is inefficient and sparse, as it requires a large vector for a large vocabulary, and most of the elements are zeros.
  • It does not capture any information or relation between tokens, as all the vectors are orthogonal and have the same distance from each other.

Bag-of-words

Bag-of-words is a simple and widely used way to create vectors for documents, by using a frequency-based representation. The idea is to count the number of occurrences of each token in the document, and then create a vector of the same size as the vocabulary, with the counts as the elements. For example, if we have a vocabulary of four words, ["cat", "dog", "bird", "fish"], and we have a document that contains the sentence “The cat and the dog play with the fish”, then we can create a bag-of-words vector for the document as follows:

  • “The cat and the dog play with the fish” -> [1, 1, 0, 1]
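
A short sketch of the same count, restricted to the four-word vocabulary (words outside the vocabulary, such as “play”, are simply ignored):

```python
vocabulary = ["cat", "dog", "bird", "fish"]

def bag_of_words(text):
    # Count how many times each vocabulary word occurs in the lowercased text.
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

print(bag_of_words("The cat and the dog play with the fish"))  # [1, 1, 0, 1]
```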

Bag-of-words is simple and effective, and it can capture the frequency and importance of each token in the document. However, it also has some drawbacks, such as:

  • It is order-insensitive and context-ignorant, as it does not preserve the order or position of the tokens in the document, and it does not consider the surrounding tokens or the syntax of the sentence.
  • It is prone to noise and overfitting, as it can be affected by common or rare tokens, and it can have a high dimensionality and sparsity.

N-grams

N-grams are a way to create tokens and vectors for text data, by using a sequence-based representation. The idea is to split the text into sequences of n consecutive tokens, and then treat each sequence as a token. For example, if we have a sentence “The cat and the dog play with the fish”, and we use n=2, then we can create n-grams tokens and vectors as follows:

  • “The cat and the dog play with the fish” -> ["The cat", "cat and", "and the", "the dog", "dog play", "play with", "with the", "the fish"]
  • ["The cat", "cat and", "and the", "the dog", "dog play", "play with", "with the", "the fish"] -> [1, 1, 1, 1, 1, 1, 1, 1] (each bigram occurs exactly once in this sentence)
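
A small sketch that produces the bigrams and their counts:

```python
from collections import Counter

def ngrams(text, n=2):
    # Slide a window of n consecutive words over the token list.
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("The cat and the dog play with the fish", n=2)
print(bigrams)           # ['The cat', 'cat and', 'and the', ...]
print(Counter(bigrams))  # every bigram appears once in this sentence
```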

N-grams are useful and flexible: they capture some of the local order and context of the tokens in the text, and they can handle variable-length texts. However, they also have some drawbacks, such as:

  • They are computationally expensive and memory-intensive, as they require a large number of tokens and vectors, and they can have a high dimensionality and sparsity.
  • They are sensitive to the choice of n, as different values of n can produce different results and performance.

Term frequency-inverse document frequency (TF-IDF)

TF-IDF is a way to create vectors for documents, by using a weighted frequency-based representation. The idea is to calculate the term frequency (TF) and the inverse document frequency (IDF) for each token in the document, and then multiply them to obtain the TF-IDF score. The TF-IDF score reflects how important a token is to a document in a collection of documents, adjusted for the fact that some tokens appear more frequently in general. The higher the TF-IDF score, the more relevant the token is to the document.


The term frequency (TF) is the number of times a token appears in a document, divided by the total number of tokens in the document. The term frequency measures how often a token occurs in a document, and it can vary from 0 to 1. For example, if we have a document that contains the sentence “The cat and the dog play with the fish” (nine tokens), and we have a vocabulary of four words, ["cat", "dog", "bird", "fish"], then we can calculate the term frequency for each word as follows:

  • “cat” -> 1 / 9 ≈ 0.11
  • “dog” -> 1 / 9 ≈ 0.11
  • “bird” -> 0 / 9 = 0
  • “fish” -> 1 / 9 ≈ 0.11

The inverse document frequency (IDF) is the logarithm (here, the natural logarithm) of the total number of documents in the collection, divided by the number of documents that contain the token. The inverse document frequency measures how rare or common a token is in the collection, and it can vary from 0 to infinity. For example, if we have a collection of three documents, ["The cat and the dog play with the fish", "The bird and the fish swim in the water", "The cat and the bird fly in the sky"], and we have a vocabulary of four words, ["cat", "dog", "bird", "fish"], then we can calculate the inverse document frequency for each word as follows:

  • “cat” -> log(3 / 2) ≈ 0.41
  • “dog” -> log(3 / 1) ≈ 1.10
  • “bird” -> log(3 / 2) ≈ 0.41
  • “fish” -> log(3 / 2) ≈ 0.41

The TF-IDF score is the product of the term frequency and the inverse document frequency. The TF-IDF score measures how important a token is to a document in a collection of documents, and it can vary from 0 to infinity. For example, if we have a collection of three documents, ["The cat and the dog play with the fish", "The bird and the fish swim in the water", "The cat and the bird fly in the sky"], and we have a vocabulary of four words, ["cat", "dog", "bird", "fish"], then we can calculate the TF-IDF score for each word in each document as follows:

  • “The cat and the dog play with the fish” -> [0.11 * 0.41, 0.11 * 1.10, 0 * 0.41, 0.11 * 0.41] -> [0.05, 0.12, 0, 0.05]
  • “The bird and the fish swim in the water” -> [0 * 0.41, 0 * 1.10, 0.11 * 0.41, 0.11 * 0.41] -> [0, 0, 0.05, 0.05]
  • “The cat and the bird fly in the sky” -> [0.11 * 0.41, 0 * 1.10, 0.11 * 0.41, 0 * 0.41] -> [0.05, 0, 0.05, 0]
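
The whole calculation fits in a short sketch, using the natural logarithm and the same three documents (note that libraries such as scikit-learn's TfidfVectorizer use a smoothed variant of the formula, so their numbers will differ slightly):

```python
import math

documents = [
    "The cat and the dog play with the fish",
    "The bird and the fish swim in the water",
    "The cat and the bird fly in the sky",
]
vocabulary = ["cat", "dog", "bird", "fish"]

def tf(term, tokens):
    # Term frequency: occurrences of the term divided by the number of tokens in the document.
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    # Inverse document frequency: natural log of (total documents / documents containing the term).
    containing = sum(1 for doc in docs if term in doc.lower().split())
    return math.log(len(docs) / containing)

for doc in documents:
    tokens = doc.lower().split()
    print([round(tf(word, tokens) * idf(word, documents), 2) for word in vocabulary])
# [0.05, 0.12, 0.0, 0.05]
# [0.0, 0.0, 0.05, 0.05]
# [0.05, 0.0, 0.05, 0.0]
```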

TF-IDF is simple and effective, and it can capture the frequency and importance of each token in the document. However, it also has some drawbacks, such as:

  • It is order-insensitive and context-ignorant, as it does not preserve the order or position of the tokens in the document, and it does not consider the surrounding tokens or the syntax of the sentence.
  • It is prone to noise and overfitting, as it can be affected by common or rare tokens, and it can have a high dimensionality and sparsity.

Word2Vec

Word2Vec is a way to create embeddings for words, by using a neural network-based representation. The idea is to train a shallow neural network to predict a word from its surrounding words (or vice versa), and then use the learned weights of the hidden layer as the embedding for the word. For example, if we have the sentence “The cat and the dog play with the fish”, and we use a window size of 2, then we can train the network to predict “cat” from its neighbors “The”, “and”, and “the”, or to predict those neighbors given “cat”. This way, the network learns the context and meaning of each word, and assigns a vector to it.
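
In practice, Word2Vec embeddings are usually trained with a library rather than from scratch; a minimal sketch with gensim (assuming gensim 4.x, and a toy corpus that is far too small to learn meaningful vectors) looks like this:

```python
from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens. Far too small to learn useful
# embeddings, but enough to show the API.
sentences = [
    "the cat and the dog play with the fish".split(),
    "the bird and the fish swim in the water".split(),
    "the cat and the bird fly in the sky".split(),
]

# Train a skip-gram model (sg=1) with 50-dimensional vectors and a context window of 2.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"][:5])                # first five dimensions of the embedding for "cat"
print(model.wv.similarity("cat", "dog"))  # cosine similarity between two word vectors
```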

Word2Vec is powerful and efficient, and it can capture semantic and syntactic similarities and relations between words. However, it also has some drawbacks, such as:

  • It requires a large amount of data and computational resources to train the network, and it can take a long time to converge.
  • It does not account for the polysemy of words, i.e., the fact that some words can have multiple meanings depending on the context. For example, the word “bank” can mean a financial institution or a river bank, but Word2Vec will assign the same vector to it regardless of the context.

Conclusion

In this blog post, we covered some of the most common and popular methods and techniques for creating vectors, tokens, and embeddings for text data. We discussed what they are, how they are created, and what their advantages and disadvantages are, and we worked through some small examples along the way.
