In this article, we cover key components and layers of LLMs including embeddings and attention layers.
As computers don’t understand words directly, we convert them into numbers:
In language models, the units we process aren't always complete words. These units, called tokens, can be whole words, parts of words, single characters, or common phrases.
For large language models like GPT-4, the average token is roughly four characters long. For example, common English words like "the" or "and" are single tokens, while longer words like "uncomfortable" might be split into multiple tokens ("un", "comfort", "able").
The simplest way to represent tokens is through one-hot encoding. Let's see an example with the sentence "the cat sat on the mat":
|     | cat | mat | on | sat | the |
|-----|-----|-----|----|-----|-----|
| the | 0   | 0   | 0  | 0   | 1   |
| cat | 1   | 0   | 0  | 0   | 0   |
| sat | 0   | 0   | 0  | 1   | 0   |
| on  | 0   | 0   | 1  | 0   | 0   |
| the | 0   | 0   | 0  | 0   | 1   |
| mat | 0   | 1   | 0  | 0   | 0   |
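As a concrete illustration, here is a minimal Python sketch (using only NumPy, with hypothetical variable names) that builds exactly this one-hot table for the sentence above:

```python
import numpy as np

sentence = "the cat sat on the mat".split()
vocab = sorted(set(sentence))            # ['cat', 'mat', 'on', 'sat', 'the']
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# Each token becomes a vector of zeros with a single 1 at its vocabulary index.
one_hot = np.zeros((len(sentence), len(vocab)), dtype=int)
for row, tok in enumerate(sentence):
    one_hot[row, token_to_id[tok]] = 1

print(vocab)
print(one_hot)   # the same 6 x 5 table of zeros and ones shown above
```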
This approach is inefficient for real language models, where the vocabulary can contain tens of thousands of tokens (GPT-2, for example, uses a vocabulary of 50,257 tokens). Each word would need a vector with as many elements as the vocabulary, with only one position set to 1 and all the rest set to 0.
Instead, we use dense embeddings where each token is represented by a much smaller vector (typically between a few hundred and a few thousand dimensions). We can organize these in an embedding matrix $E$ of size $V \times d$, where $V$ is the vocabulary size and $d$ is the embedding dimension.
This embedding matrix converts tokens into numerical representations that capture semantic relationships. These embeddings are learned during the training process so that words with similar meanings end up with similar vectors. We can mathematically represent this with

$$e = x^\top E$$

where $E$ is the embedding matrix with size $V \times d$, $x$ is a token's one-hot vector of length $V$, and $e$ is the resulting $d$-dimensional embedding. To get a token's embedding, we multiply its one-hot encoding by this matrix, which simply picks out the corresponding row of $E$:

$$e_{\text{cat}} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \end{bmatrix} E$$
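Continuing the toy NumPy sketch from above, the snippet below verifies this row-selection view; the embedding dimension of 4 and the random matrix are arbitrary stand-ins for values that would be learned during training:

```python
import numpy as np

V, d = 5, 4                          # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))          # embedding matrix of shape (V, d); learned in practice

x = np.zeros(V)
x[0] = 1                             # one-hot vector for the token with index 0 ("cat")

e = x @ E                            # the token's dense embedding
assert np.allclose(e, E[0])          # identical to simply reading out row 0 of E
```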
Modern language models use clever tokenization methods like Byte-Pair Encoding (BPE) to break text into subword units. This helps the model handle rare words and new words it hasn't seen before.
BPE starts with individual characters and gradually merges the most frequently occurring pairs to form a vocabulary of tokens. For example, in GPT-2, some of the first merges are " t", " a", "he", "in", and "re", forming common English patterns.
These tokenization methods ensure that similar words share components in their representation, making the model more efficient and better at understanding language patterns.
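To see BPE in action, you can use OpenAI's tiktoken library. This assumes the package is installed, and the exact splits depend on the vocabulary (they may differ from the "un"/"comfort"/"able" example above):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")        # the BPE vocabulary used by GPT-2

for word in ["the", "and", "uncomfortable"]:
    ids = enc.encode(word)                 # token ids for the word
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", ids, pieces)

# Short common words map to a single token, while rarer words
# are broken into multiple subword pieces.
```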
What makes the transformer architecture truly powerful is something called attention. To draw an analogy with humans: when we communicate, we don't just string words together randomly. We understand context, meaning, and nuance. In other words, we pay attention.
LLMs use multiple spotlights of attention to focus on different parts of a text at once. For example, in the sentence "The pizza disappeared from the table because it was delicious", the transformer detects a strong connection between "pizza" and "delicious" and only a weak connection between "table" and "delicious". This helps the LLM understand what "it" refers to, just like our brains do naturally.
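Under the hood, these attention weights come from the scaled dot-product attention used inside transformers: each token produces a query, a key, and a value vector, and the weights are $\text{softmax}(QK^\top/\sqrt{d_k})$. Here is a minimal NumPy sketch on random toy matrices (the shapes and values are arbitrary, purely for illustration):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 6, 8                            # 6 tokens, 8-dimensional vectors (toy sizes)
Q = rng.normal(size=(seq_len, d_k))            # queries: what each token is looking for
K = rng.normal(size=(seq_len, d_k))            # keys: what each token offers
V = rng.normal(size=(seq_len, d_k))            # values: the information to be mixed

scores = Q @ K.T / np.sqrt(d_k)                # how strongly each token attends to every other
weights = softmax(scores, axis=-1)             # each row sums to 1: the "spotlights" of attention
output = weights @ V                           # each token's new, context-aware representation
```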
You will find that when you're chatting with an LLM, you can ask questions about topics you covered earlier in the chat. This is enabled by attention operating over the conversation so far, which the model keeps in its context window.
Before we deploy a machine learning model to end users, whether it is a chatbot, an image classifier, or a voice assistant, we first train it. LLMs typically go through three main training stages:
1. Pre-training: The model is fed hundreds of terabytes (TB) of unstructured text data, including billions of words from the internet, Wikipedia, books, and other curated language datasets. Here the LLM learns through a process called self-supervised learning: it reads through text and practices predicting which word should come next in a sequence, much like earlier language models (a minimal sketch of this objective appears after this list). At this stage, LLMs learn general language patterns, including grammar rules.
2. Fine-tuning: The model is then fine-tuned on specialized datasets for specific tasks. Depending on the application, this could be a combination of conversation, coding, math, or medical knowledge. Here the LLM typically learns through a process called supervised learning or supervised fine-tuning (SFT): it is given inputs (e.g. "What is 5+5?") along with the correct outputs (e.g. "10") and practices predicting those correct responses. At this stage, LLMs learn the specific knowledge and skills required for the application use case. For example, LLMs like ChatGPT are fine-tuned on following instructions and holding natural conversations with humans.
3. Reinforcement learning: Finally, the LLM is trained with reinforcement learning (RL), where the model receives rewards for generating responses that align with human preferences. Many different RL methods are applied to different LLMs, but a popular one is reinforcement learning from human feedback (RLHF). This process involves having human raters evaluate different possible responses: responses that are helpful, truthful, and safe are rewarded, while harmful or incorrect responses are penalized. At this stage, LLMs start behaving in ways that are more aligned with human values and expectations.
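To make the pre-training objective from step 1 more concrete, here is a minimal PyTorch sketch of next-token prediction. The logits here are random stand-ins for what a real transformer would produce, and all shapes are toy values chosen purely for illustration:

```python
import torch
import torch.nn.functional as F

# Toy setup: a batch of token-id sequences and random "model" outputs.
vocab_size, seq_len, batch = 100, 8, 2
token_ids = torch.randint(vocab_size, (batch, seq_len))

# In a real LLM these logits come from the transformer; here they are random.
logits = torch.randn(batch, seq_len, vocab_size)

# Self-supervised next-token prediction: the target for position t is the token at t+1.
predictions = logits[:, :-1, :]              # predictions for positions 0..T-2
targets = token_ids[:, 1:]                   # the "next" tokens at positions 1..T-1

loss = F.cross_entropy(predictions.reshape(-1, vocab_size), targets.reshape(-1))
print(loss)                                  # the quantity minimized during pre-training
```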