We acquire language from a very young age as an integral part of how we live our lives and communicate with others around us. Teaching machines to understand and generate human language, on the other hand, has been a major challenge for decades.
Imagine trying to teach a computer what the word “pizza” means. Unlike humans, computers can’t taste, smell, or experience pizza. Instead, they need to understand words through numbers and patterns. This is essentially how large language models (LLMs) work: they are advanced pattern recognition systems that can process and generate human-like language.
Before we get there, we first need to cover earlier work in natural language processing (NLP). Initial breakthroughs came with early language models that predicted the next word in a sequence. For example, an early language model could predict that the sequence “The sky is” might be followed by “blue”. However, these early models struggled with longer contexts and complex meanings.
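To make next-word prediction concrete, here is a minimal sketch that inspects the probability distribution a small public model (GPT-2) assigns to the next token, using the Hugging Face transformers library. This assumes transformers and PyTorch are installed; it is only an illustration, not one of the larger models discussed later.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The sky is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # scores for every vocabulary token at each position

next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token
top = torch.topk(next_token_probs, 5)                     # the five most likely continuations
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r:>12}  {prob.item():.3f}")
```

Running this prints the model's top candidate continuations of “The sky is” along with their probabilities.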
Next-word prediction might seem like a fairly trivial task, but it actually isn’t! Consider the following sentences and try to choose the correct option between brackets for each:
Skill | Example |
---|---|
Grammar Rules | My sister and I [is , are ] going to the park. |
Common Sense | Birds use their wings to [dance , fly ]. |
Safety | When crossing the street, I should look [both ways , at my phone ]. |
Spatial Reasoning | Tom ran upstairs to his bedroom. His toy car rolled down the [sea , stairs ]. |
Domain Knowledge | The largest planet in our solar system is [Mars , Jupiter ]. |
Mathematics | If I add up 5 and 5, I get [10 , 15 ]. |
Sentiment Analysis | After winning the game, the team felt [happy , sad ]. |
You are able to answer each one correctly thanks to the highlighted skill you have acquired through your human experience. Now let’s look at how we arrived at LLMs, which are able to answer these correctly, unlike the earlier language models.
Language processing and understanding with AI have come a long way over time. Below we divide the history of language models into five broad periods. As a reminder, we have a separate lesson on the broader history of AI here.
1. Early Days (1950s-2000s): In the early years, computers could only follow strict rules for processing language. Simple rule-based systems were created where computers followed fixed patterns like "if you see this word, respond with that word." While very basic, these systems laid the groundwork. In the 1980s, machine learning methods (e.g. neural networks) that could learn from data were introduced.
2. Foundation Era (2010-2016): Word embeddings like Word2Vec were developed, which helped models grasp relationships between words by converting them into numbers. Recurrent neural networks (RNNs) were the main architecture for language processing, but they had two main limitations: they had to process text word by word, in order, making them slow and inefficient, and they struggled to retain information from earlier parts of the text when processing longer sequences.
3. Transformer Revolution (2017-2019): The game-changing transformer architecture was introduced in 2017. This new approach gave models the ability to pay attention to different parts of a sentence at once, similar to how we focus on key words when reading. BERT and GPT-2 models showed that this attention mechanism could lead to much better language understanding.
4. Scaling Era (2020-2021): Language models grew in size and capability. GPT-3 showed that larger models trained on massive amounts of text could show remarkable capabilities like writing stories or answering questions, without needing specific training for each task. It was like the models developed a broad understanding of language, similar to how humans learn from reading many books.
5. Public LLM Era (2022-Present): LLMs became accessible to everyone through tools like ChatGPT. Large language models can now engage in natural conversations with humans and help with various tasks like writing, coding, and problem-solving. Models like GPT-4 and open-source alternatives have made AI a part of our daily lives, while raising important questions about how we should use this technology responsibly.
Large language models (LLMs) are often powered by the transformer architecture introduced in 2017 and its variants. They contain billions of interconnected artificial neurons that work together to process language patterns.
LLMs learn from vast amounts of text from the internet and curated language datasets to be able to generate human-like responses. Like most innovations in science, LLMs build on previous findings in the NLP domain. For example, LLMs convert words into numbers through word embeddings, which helps the model grasp relationships between words.
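As a small illustration, here is a sketch of a word-embedding lookup in PyTorch. The vocabulary size, embedding dimension, and token ids are made-up toy values; a freshly initialized table like this is random, so meaningful similarities between words only emerge after training.

```python
import torch
from torch import nn

# A toy embedding table: 10,000 vocabulary entries, each mapped to a 300-dimensional vector.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=300)

token_ids = torch.tensor([42, 1337])     # two hypothetical word ids, e.g. "pizza" and "pasta"
vectors = embedding(token_ids)           # shape: (2, 300) -- one vector per word

# Relationships between words can be measured as similarity between their vectors.
similarity = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(vectors.shape, similarity.item())
```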
Much like how CNNs use deep learning to recognize images, LLMs are trained to recognize language patterns. But they are also generative: they can output human-like responses.
The transformer is a deep neural network and the engine of modern LLMs. It contains billions of adjustable parameters and was originally designed with two main parts:
1. Encoder: The reader network that reads and understands the input text. It uses word embeddings with attention mechanisms to analyze relationships between all tokens in the input and creates a rich contextual representation of the input text.
2. Decoder: The writer network that generates the responses. It takes as input the encoder's processed information and makes use of attention mechanisms much like the encoder. In technical terms, a decoder generates probability distributions over possible next tokens.
This original architecture was used for tasks like translation, where the encoder processes the source language and the decoder generates the target language. For example, in an English-to-French translation system, the encoder would process "I love pizza" while the decoder would generate "J'aime la pizza" using the encoded representation as context.
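As a concrete sketch of this encoder-decoder setup, the snippet below runs a publicly available English-to-French model (Helsinki-NLP/opus-mt-en-fr) through the Hugging Face transformers library, assuming transformers and sentencepiece are installed. The model name is just one of many available translation models.

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"   # a public English-to-French encoder-decoder model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

inputs = tokenizer("I love pizza", return_tensors="pt")   # tokenize the English source sentence
# The encoder reads the English tokens; the decoder then generates French tokens
# one at a time, attending to the encoder's contextual representation.
generated_ids = model.generate(**inputs)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```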
Modern language models have evolved from this original design:
Most modern LLMs (like GPT-4, Claude, and LLaMA) use the decoder-only architecture, which can both understand and generate text.
A modern decoder-only transformer consists of several key components:
This architecture processes text in parallel rather than sequentially, making it much more efficient than previous approaches like RNNs. This parallelization, combined with the powerful attention mechanism, has made transformers the foundation of modern language processing. We provided a basic overview of this architecture in a previous article; here is a more detailed look:
Here is the Python and PyTorch pseudocode for a transformer:
```python
def transformer(tokens):                                     # Input: sequence of tokens
    embeddings = token_embedding(tokens)                     # Step 1: Convert tokens to embeddings
    residual = embeddings + positional_embedding(tokens)     # Step 2: Add positional information

    for block in transformer_blocks:                         # Step 3: Pass through n transformer blocks
        normalized = layer_norm(residual)                    # Apply layer normalization
        attention_output = multi_head_attention(normalized)  # Self-attention mechanism
        residual_middle = residual + attention_output        # First residual connection
        normalized = layer_norm(residual_middle)             # Apply another layer normalization
        mlp_output = mlp(normalized)                         # MLP: feed-forward neural network
        residual = residual_middle + mlp_output              # Second residual connection

    normalized = layer_norm(residual)                        # Step 4: Final layer normalization
    logits = unembedding(normalized)                         # Step 5: Convert back to vocabulary space
    probabilities = softmax(logits)                          # Step 6: Apply softmax to get probabilities
    return probabilities
```
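For readers who want to see the same structure as runnable code, here is a minimal PyTorch sketch of a decoder-only transformer. The sizes (a vocabulary of 1,000, 64-dimensional embeddings, 2 blocks) are arbitrary toy values, and nn.MultiheadAttention stands in for the attention mechanism; real LLMs are vastly larger and more carefully engineered.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed, attn_mask=mask)
        x = x + attn_out                 # first residual connection
        x = x + self.mlp(self.ln2(x))    # second residual connection
        return x

class DecoderOnlyTransformer(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, d_ff=256, n_blocks=2, max_len=128):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.positional_embedding = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_blocks))
        self.final_ln = nn.LayerNorm(d_model)
        self.unembedding = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.token_embedding(tokens) + self.positional_embedding(positions)
        for block in self.blocks:
            x = block(x)
        logits = self.unembedding(self.final_ln(x))
        return torch.softmax(logits, dim=-1)    # probability distribution over the vocabulary

# Usage: a batch with one sequence of 5 token ids
model = DecoderOnlyTransformer()
tokens = torch.randint(0, 1000, (1, 5))
probabilities = model(tokens)
print(probabilities.shape)   # torch.Size([1, 5, 1000])
```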
We explored several related layer types and their functions here, including MLPs. While batch normalization standardizes activations across the batch dimension, layer normalization instead standardizes activations across the feature dimension for each individual sample. This makes it independent of batch size and well suited for recurrent models or situations with small batch sizes, as the short sketch below illustrates. We'll cover the remaining layer types in the next article.
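A quick sketch of this in PyTorch (the tensor sizes are arbitrary toy values): nn.LayerNorm normalizes each token's feature vector on its own, independently of the other sequences in the batch.

```python
import torch
from torch import nn

x = torch.randn(2, 5, 16)        # (batch, sequence, features)
layer_norm = nn.LayerNorm(16)    # normalizes over the last (feature) dimension

y = layer_norm(x)
# Each token's 16 features now have approximately zero mean and unit variance,
# computed without looking at the other samples in the batch.
print(y[0, 0].mean().item(), y[0, 0].std().item())
```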