You can think of deep neural networks as massive Lego structures. They're made of simple pieces that can achieve remarkable things when connected in clever ways. Let's explore the key components that make these networks work.
Activation functions are like the "decision makers" in neural networks. They determine whether a neuron should be activated ("fire") or not, adding the non-linearity that allows networks to learn complex patterns. The table below compiles a few popular activation functions used in modern deep neural networks.
Activation Function | Description | Formula |
---|---|---|
ReLU | Returns input for positive values, zero otherwise | $f(x) = \max(0, x)$ |
LeakyReLU | Like ReLU, but allows small negative values | $f(x) = \max(\alpha x, x)$ with small $\alpha$ (e.g. 0.01) |
Sigmoid | S-shaped function, outputs between 0 and 1 | $\sigma(x) = \frac{1}{1 + e^{-x}}$ |
Tanh | Hyperbolic tangent function, outputs between -1 and 1 | $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ |
Softmax | Converts values into probabilities that sum to 1 | $\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$ |
Swish | Smooth function with non-monotonic properties | $f(x) = x \cdot \sigma(\beta x)$ |
GELU | Gaussian Error Linear Unit | $f(x) = x \cdot \Phi(x)$ where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution |
PReLU | Parametric ReLU, learns α during training | $f(x) = \max(\alpha x, x)$ where $\alpha$ is learnable |
We know we've thrown a few formulas at you. As mentioned in the previous article, we usually don't have to compute them ourselves, as software (e.g. PyTorch, TensorFlow) handles that for us. This applies to most of the concepts we cover in this unit and the next unit on large language models. The best way to understand these formulas and what they aim to achieve is to plug in a few relevant numbers and observe the output, or, better yet, to plot them using a tool like Desmos. Let's look at three popular activation functions that you'll be using in more detail:
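For instance, here's a minimal sketch (assuming ReLU, sigmoid, and tanh as the three) that plugs a handful of values into each function and prints the results:

```python
import torch
import torch.nn as nn

# Plug a few numbers into each activation and observe how it transforms them
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(nn.ReLU()(x))      # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
print(torch.sigmoid(x))  # tensor([0.1192, 0.3775, 0.5000, 0.6225, 0.8808])
print(torch.tanh(x))     # tensor([-0.9640, -0.4621, 0.0000, 0.4621, 0.9640])
```

Notice how ReLU zeroes out everything negative, sigmoid squashes all inputs into (0, 1), and tanh squashes them into (-1, 1) while keeping the sign.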
Given below are common types of layers used in deep neural networks. Each layer type serves a specific purpose in the neural network architecture, and modern deep learning systems often combine multiple layer types to achieve optimal performance.
Layer | Description | Formula |
---|---|---|
Fully Connected (FC) | Each neuron connects to all neurons in previous layer. Sometimes also known as dense layers. Use case: General purpose, final classification. | $y = f(Wx + b)$ where $f$ = activation function, $W$ = weights, $b$ = bias. |
Convolutional | Applies filters to detect patterns. Use case: Image processing, pattern detection. | $y_{i,j} = f\left(\sum_{m}\sum_{n} K_{m,n}\, x_{i+m,\,j+n} + b\right)$ where $K$ = kernel (i.e. filter), $b$ = bias. |
Pooling | Reduces dimensionality by summarizing regions. Use case: Downsampling, feature extraction. | $y_{i,j} = \max_{(m,n) \in R_{i,j}} x_{m,n}$ where $R_{i,j}$ = region around position $(i, j)$. |
Recurrent | Processes sequences with memory. Use case: Time series, text, sequences. | $h_t = f(W_h h_{t-1} + W_x x_t + b)$ where $h_t$ = hidden state at time $t$, and $W_h, W_x$ = weight matrices. |
Dropout | Randomly deactivates neurons during training. Use case: Regularization, preventing overfitting. | $y = m \odot x$ with $m \sim \text{Bernoulli}(p)$, where $\odot$ = element-wise multiply, $p$ = probability of keeping a neuron active. |
Batch Normalization | Normalizes layer inputs within each mini-batch. Use case: Stabilizing training, faster convergence. | $y = \gamma \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$ where $\mu_B$ = batch mean, $\sigma_B^2$ = batch variance, $\gamma, \beta$ = learnable parameters, and $\epsilon$ = small constant for numerical stability. |
Embedding | Maps discrete items to vectors. Use case: Text processing, categorical data. | $y = E[i]$ where $E$ = embedding matrix, $i$ = index of the input token. |
Attention | Weighs the importance of different parts of the input. Use case: Natural language processing, focusing. | $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ where $Q$, $K$, $V$ = query, key, value matrices, $d_k$ = dimension of keys. |
Don’t worry if these concepts or formulas seem unfamiliar at this stage. We'll cover convolutional and pooling layers in more depth in the next lesson on convolutional neural networks, and embedding and attention layers in the next unit on large language models. Here, let's focus on the remaining ones. See below for annotated depictions of how these layers work.
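As a quick code companion, here's a minimal sketch that instantiates these layers in PyTorch and pushes dummy data through them (all sizes here are illustrative choices):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16)             # Batch of 4 samples with 16 features each

fc = nn.Linear(16, 8)              # Fully connected: 16 inputs -> 8 outputs
print(fc(x).shape)                 # torch.Size([4, 8])

bn = nn.BatchNorm1d(16)            # Batch normalization over 16 features
print(bn(x).shape)                 # torch.Size([4, 16])

rnn = nn.RNN(input_size=16, hidden_size=8, batch_first=True)
seq = torch.randn(4, 10, 16)       # Batch of 4 sequences, 10 time steps each
output, h_n = rnn(seq)             # output: hidden state per step; h_n: final hidden state
print(output.shape, h_n.shape)     # torch.Size([4, 10, 8]) torch.Size([1, 4, 8])
```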
Note on fully connected (FC) layers vs. perceptron: The FC layer is a direct evolution of the perceptron model we covered earlier. While a single perceptron can only learn linear boundaries, stacking multiple FC layers creates a multi-layer perceptron (MLP). Each neuron in an MLP functions similarly to a perceptron, computing a weighted sum of inputs followed by an activation function. However, FC layers use more sophisticated activation functions (e.g. ReLU) instead of the simple step function from the original perceptron, and we train them using backpropagation rather than the perceptron learning rule. This allows MLPs to learn complex, non-linear decision boundaries.
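For example, here's a minimal sketch of an MLP built from stacked FC layers (the layer sizes are illustrative):

```python
import torch.nn as nn

# Stacked FC layers with ReLU activations learn non-linear boundaries
# that a single perceptron cannot
mlp = nn.Sequential(
    nn.Linear(2, 16),   # Input layer: 2 features -> 16 hidden neurons
    nn.ReLU(),          # Non-linearity (replaces the perceptron's step function)
    nn.Linear(16, 16),  # Hidden layer
    nn.ReLU(),
    nn.Linear(16, 1),   # Output layer: a single score, e.g. for binary classification
)
```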
Loss functions measure how far the model's predictions are from the true values. Common loss functions include:
Loss Function | Description | Formula |
---|---|---|
Cross Entropy Loss | Measures the difference between predicted probability distribution and true distribution. Use case: Classification problems (binary and multi-class). | $L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$ where $y_i$ = true probability, $\hat{y}_i$ = predicted probability, $C$ = number of classes. |
Mean Squared Error (MSE) | Measures the average squared difference between predictions and actual values. Use case: Regression problems. Penalizes larger errors more heavily than smaller ones. | $L = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ where $y_i$ = true value, $\hat{y}_i$ = predicted value, $n$ = number of samples. |
Mean Absolute Error (MAE) | Measures the average absolute difference between predictions and actual values. Use case: Regression problems where you need more robustness to outliers compared to MSE. | $L = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$ where $y_i$ = true value, $\hat{y}_i$ = predicted value, $n$ = number of samples. |
Binary Cross Entropy | Special case of cross entropy for binary classification problems. Use case: Binary classification problems. Effective with sigmoid activation in output layer. | $L = -\left[\,y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})\,\right]$ where $y \in \{0, 1\}$ = true label, $\hat{y}$ = predicted probability. |
The key point is to intuitively understand which loss function to use for a given problem or dataset (e.g. "Is it classification or regression?"). You can review the fundamentals of classification versus regression here.
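To build that intuition, here's a minimal sketch comparing these loss functions on toy values:

```python
import torch
import torch.nn as nn

# Regression losses: MSE squares the errors, MAE takes absolute values
y_true = torch.tensor([1.0, 2.0, 3.0])
y_pred = torch.tensor([1.5, 2.0, 5.0])
print(nn.MSELoss()(y_pred, y_true))  # tensor(1.4167) = (0.25 + 0 + 4) / 3
print(nn.L1Loss()(y_pred, y_true))   # tensor(0.8333) = (0.5 + 0 + 2) / 3

# Classification loss: cross entropy takes raw logits and integer class labels
logits = torch.tensor([[2.0, 0.5, 0.1]])  # Scores for 3 classes
label = torch.tensor([0])                 # True class index
print(nn.CrossEntropyLoss()(logits, label))  # tensor(0.3168)
```

Note how the large error on the third regression sample dominates MSE (4 out of 4.25 total) far more than MAE, which is exactly why MAE is more robust to outliers.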
Optimization algorithms determine how to update the weights based on the calculated gradients. Common optimization algorithms include:
Optimization Algorithm | Description | Formula |
---|---|---|
Stochastic Gradient Descent (SGD) | Updates parameters in the direction of the negative gradient, using small random subsets of data. Use case: Simple, memory-efficient baseline algorithm. Works well with large datasets when properly tuned. | $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$ where $\eta$ = learning rate, $\nabla L$ = gradient of loss function. |
SGD with Momentum | Adds a momentum term to accelerate SGD, helping overcome local minima and oscillations. Use case: Problems with noisy gradients. Speeds up convergence and reduces oscillation. | $v_{t+1} = \beta v_t + \nabla L(\theta_t)$, $\theta_{t+1} = \theta_t - \eta v_{t+1}$ where $\beta$ = momentum coefficient. |
Adam (Adaptive Moment Estimation) | Combines momentum and adaptive learning rates based on first and second moments of gradients. Use case: General-purpose optimizer that works well for most problems. Effective for large networks and noisy data. | $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla L(\theta_t)$, $v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla L(\theta_t))^2$, $\theta_{t+1} = \theta_t - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ where $\hat{m}_t, \hat{v}_t$ = bias-corrected first and second moment estimates. |
RMSprop | Adapts learning rates by dividing by a running average of squared gradients. Use case: Good for recurrent neural networks and non-stationary problems. | $v_t = \rho v_{t-1} + (1 - \rho)(\nabla L(\theta_t))^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\nabla L(\theta_t)$ where $\rho$ = decay rate. |
AdamW | Variant of Adam with decoupled weight decay for better regularization. Use case: Models requiring regularization. Offers improved generalization over Adam in many cases. | Adam update + separate weight decay: $\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t\right)$ where $\lambda$ = weight decay coefficient. |
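In PyTorch, each of these optimizers is available out of the box. Here's a minimal sketch of how each is instantiated (hyperparameter values are illustrative, and in practice you'd pick just one optimizer per model):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(16, 4)  # Any nn.Module works; a single FC layer keeps this short
params = list(model.parameters())

sgd      = optim.SGD(params, lr=0.01)                        # Plain SGD
momentum = optim.SGD(params, lr=0.01, momentum=0.9)          # SGD with momentum
rmsprop  = optim.RMSprop(params, lr=0.001, alpha=0.99)       # alpha = squared-gradient decay rate
adam     = optim.Adam(params, lr=0.001, betas=(0.9, 0.999))  # betas = moment decay coefficients
adamw    = optim.AdamW(params, lr=0.001, weight_decay=0.01)  # Decoupled weight decay
```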
Optimization algorithms aim to find the minimum of a loss function. A convex function is one where any line segment connecting two points on the function lies on or above the function's graph: think of a bowl shape where there's only one lowest point. In these ideal cases, algorithms like gradient descent are guaranteed to find the global minimum. Deep learning loss landscapes are typically non-convex with multiple local minima, saddle points, and plateaus. You'll remember some of these terms from calculus.
The plots below show a few optimization trajectories with different initialization points across the loss landscape of a function of two variables. The white paths represent how different algorithms navigate toward minima (blue regions) while avoiding getting stuck in suboptimal areas. Notice how the algorithms must traverse various contours (yellow, green, and orange regions representing higher loss values) to find the lowest points in the landscape.
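If you'd like to trace a trajectory like this yourself, here's a minimal sketch that follows gradient descent across a simple non-convex function (the function and starting point here are illustrative choices, not the ones from the plots):

```python
import torch

# f(x, y) = x^4 - 2x^2 + y^2 has two minima at (±1, 0) and a saddle point at (0, 0)
xy = torch.tensor([1.5, 1.0], requires_grad=True)
optimizer = torch.optim.SGD([xy], lr=0.05)

trajectory = []
for step in range(50):
    loss = xy[0]**4 - 2 * xy[0]**2 + xy[1]**2
    optimizer.zero_grad()
    loss.backward()              # Compute the gradient at the current point
    optimizer.step()             # Move downhill along the negative gradient
    trajectory.append(xy.detach().clone())

print(trajectory[-1])            # Converges toward the nearby minimum at (1, 0)
```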
Now that we’ve covered the key components of deep learning, let’s explore how they work together. In the previous article, we introduced the generic training process for deep neural networks using Python and PyTorch pseudocode. Below is an extended version that demonstrates how to define a model, set up loss functions and optimizers, and implement the training loop.
```python
# Step 0: Imports for neural network packages (PyTorch)
import torch
import torch.nn as nn
import torch.optim as optim

# Step 1: Define the neural network architecture
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, C):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)  # Fully connected layer with 128 neurons
        self.relu = nn.ReLU()                  # ReLU activation
        self.dropout = nn.Dropout(0.2)         # Dropout for regularization with p = 0.2
        self.batchnorm = nn.BatchNorm1d(128)   # Batch normalization for 128 neurons
        self.fc2 = nn.Linear(128, C)           # Fully connected layer with C neurons
        self.softmax = nn.Softmax(dim=1)       # Softmax for class probabilities at inference

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.batchnorm(x)
        return self.fc2(x)  # Return raw logits: CrossEntropyLoss applies softmax internally

# Step 2: Create the model instance
model = NeuralNetwork(input_size=20, C=3)  # Assume 20 input features and 3 classes

# Step 3: Define the loss function
loss_function = nn.CrossEntropyLoss()  # For multi-class classification

# Step 4: Define the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Adam is widely used (assume learning rate of 0.01)

# Step 5: Training loop
for epoch in range(10):                  # For some number of epochs (assume 10):
    for batch_x, batch_y in train_data:  # For each batch of training data:
        optimizer.zero_grad()            # 1. Clear gradients from the previous step
        predictions = model(batch_x)     # 2. Forward pass to make predictions
        loss = loss_function(predictions, batch_y)  # 3. Calculate error
        loss.backward()                  # 4. Calculate new gradients
        optimizer.step()                 # 5. Adjust weights based on gradients
```
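Once training finishes, a short inference sketch shows where the softmax layer comes in (here `test_x` is a hypothetical batch with the same number of input features):

```python
model.eval()                         # Disable dropout; use running batch-norm statistics
with torch.no_grad():                # No gradients needed for inference
    logits = model(test_x)           # Forward pass returns raw logits
    probs = model.softmax(logits)    # Now apply softmax to get class probabilities
    predicted = probs.argmax(dim=1)  # Pick the most likely class per sample
```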
You'll notice that the pseudocode above includes parameters that can be adjusted (e.g. `lr`, number of neurons in hidden layers). These are known as hyperparameters, and they are settings chosen before training begins. Finding the right values can be challenging but is crucial for performance.

- Learning rate (`lr`): Controls how much the model adjusts its weights in each step. If it's too high, the model may jump around and never find the best solution. If it's too low, learning becomes slow and may take forever to improve.
- Dropout rate (`p`): Helps prevent overfitting by randomly turning off some neurons during training. A higher dropout rate means more neurons are ignored in each pass, making the model more robust.
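To see the dropout rate in action, here's a minimal sketch. Note that in PyTorch's `nn.Dropout`, `p` is the probability of dropping a unit rather than keeping it:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # In PyTorch, p is the probability of zeroing a unit
x = torch.ones(8)

drop.train()    # Training mode: roughly half the values are zeroed,
print(drop(x))  # and survivors are scaled by 1/(1-p) to preserve the expected sum

drop.eval()     # Evaluation mode: dropout is disabled
print(drop(x))  # tensor([1., 1., 1., 1., 1., 1., 1., 1.])
```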