
Key Components of Deep Learning

You can think of deep neural networks as massive Lego structures. They're made of simple pieces that can achieve remarkable things when connected in clever ways. Let's explore the key components that make these networks work.

⚡ Activation Functions

Activation functions are like the "decision makers" in neural networks. They determine whether a neuron should be activated ("fire") or not, adding the non-linearity that allows networks to learn complex patterns. The table below compiles a few popular activation functions used in modern deep neural networks.


| Activation Function | Description | Formula |
| --- | --- | --- |
| ReLU | Returns input for positive values, zero otherwise | f(x) = \max(0, x) |
| LeakyReLU | Like ReLU, but allows small negative values | f(x) = \max(\alpha x, x), where \alpha \ll 1 |
| Sigmoid | S-shaped function, outputs between 0 and 1 | \sigma(x) = \frac{1}{1 + e^{-x}} |
| Tanh | Hyperbolic tangent function, outputs between -1 and 1 | \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} |
| Softmax | Converts values into probabilities that sum to 1 | \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} |
| Swish | Smooth function with non-monotonic properties | f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} |
| GELU | Gaussian Error Linear Unit | f(x) = x \cdot \Phi(x), where \Phi is the cumulative distribution function of the standard normal distribution |
| PReLU | Parametric ReLU, learns \alpha during training | f(x) = \max(\alpha x, x), where \alpha is learnable |

We know we've thrown a few formulas at you. As mentioned in the previous article, we usually don't have to compute them ourselves since software (e.g. PyTorch, TensorFlow) handles that for us. This applies to most of the concepts we cover in this unit and the next unit on large language models. The best way to understand these formulas and what they aim to achieve is to plug in a few relevant numbers and observe the output, or, better yet, plot them using a tool like Desmos (see the short sketch after the list below). Let's take a closer look at three popular activation functions that you'll use often:

  • ReLU (Rectified Linear Unit): This simply means "output 0 if the input is negative, otherwise output the input value". It is computationally efficient, works well in most hidden layers, and helps prevent vanishing gradients (we'll talk about this a bit more later). However, it can suffer from the dying ReLU problem, where neurons get stuck outputting zeros - this is what LeakyReLU tries to mitigate.
  • Sigmoid: Used for binary classification. It squishes input values to outputs between 0 and 1, creating an S-shaped curve. It provides a smooth gradient, but it can suffer from vanishing gradients and it's not zero-centered.
  • Softmax: Used in the output layer for multi-class classification. It converts values into probabilities that sum to 1. The formula reads "take e raised to each value, then divide by the sum of all these values". We do the former to make values positive and the latter to make them sum to 1.
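
For instance, here is a minimal sketch in plain Python (with made-up input values, as suggested above) of what "plugging in a few numbers" looks like for ReLU, sigmoid, and softmax:

import math

def relu(x):
    return max(0.0, x)                       # output 0 for negative inputs, the input itself otherwise

def sigmoid(x):
    return 1 / (1 + math.exp(-x))            # squashes any input into the range (0, 1)

def softmax(values):
    exps = [math.exp(v) for v in values]     # exponentiate to make every value positive
    total = sum(exps)
    return [e / total for e in exps]         # normalize so the outputs sum to 1

print(relu(-2.0), relu(3.0))                 # 0.0 3.0
print(sigmoid(0.0))                          # 0.5
print(softmax([2.0, 1.0, 0.1]))              # roughly [0.66, 0.24, 0.10]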

[Figure] Plots of common activation functions.

🧩 Common Types of Layers

The table below lists common types of layers used in deep neural networks. Each layer type serves a specific purpose in the network architecture, and modern deep learning systems often combine multiple layer types to achieve optimal performance.


| Layer | Description | Formula |
| --- | --- | --- |
| Fully Connected (FC) | Each neuron connects to all neurons in the previous layer. Sometimes also known as dense layers. Use case: General purpose, final classification. | y = \sigma(Wx + b), where \sigma = activation function, W = weights, b = bias. |
| Convolutional | Applies filters to detect patterns. Use case: Image processing, pattern detection. | y_{i,j} = \sigma(\sum_{m,n} K_{m,n} x_{i+m, j+n} + b), where K = kernel (i.e. filter), b = bias. |
| Pooling | Reduces dimensionality by summarizing regions. Use case: Downsampling, feature extraction. | y_{i,j} = \max_{m,n \in R_{i,j}} x_{m,n}, where R_{i,j} = region around position (i,j). |
| Recurrent | Processes sequences with memory. Use case: Time series, text, sequences. | h_t = \sigma(W_x x_t + W_h h_{t-1} + b), where h_t = hidden state at time t, W_x and W_h = weight matrices. |
| Dropout | Randomly deactivates neurons during training. Use case: Regularization, preventing overfitting. | y = x \odot \text{Bernoulli}(p), where \odot = element-wise multiply, p = probability of keeping a neuron active. |
| Batch Normalization | Normalizes layer inputs within each mini-batch. Use case: Stabilizing training, faster convergence. | y = \gamma \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta, where \mu_B = batch mean, \sigma_B^2 = batch variance, \gamma and \beta = learnable parameters. |
| Embedding | Maps discrete items to vectors. Use case: Text processing, categorical data. | y = W[i], where W = embedding matrix, i = index of the input token. |
| Attention | Weighs the importance of different parts of the input. Use case: Natural language processing, focusing on the most relevant parts of the input. | \text{A}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V, where Q, K, V = query, key, value matrices, d_k = dimension of the keys. |

Don’t worry if these concepts or formulas seem unfamiliar at this stage. We'll cover convolutional and pooling layers in more depth in the next lesson on convolutional neural networks, and embedding and attention layers in the next unit on large language models. Here, let's focus on the remaining ones. See below for annotated depictions of how these layers work.

[Figure] Depictions of common deep learning layers: fully connected (dense), dropout, and batch normalization. h, h', and h'' are respectively the first, second, and third hidden layers. Important to note: (1) Dropout is only applied during training. (2) Batch normalization normalizes activations across the batch dimension, denoted B, i.e. across the training samples in each iteration.

Note on fully connected (FC) layers vs. perceptron: The FC layer is a direct evolution of the perceptron model we covered earlier. While a single perceptron can only learn linear boundaries, stacking multiple FC layers creates a multi-layer perceptron (MLP). Each neuron in an MLP functions similarly to a perceptron, computing a weighted sum of inputs followed by an activation function. However, FC layers use more sophisticated activation functions (e.g. ReLU) instead of the simple step function from the original perceptron, and we train them using backpropagation rather than the perceptron learning rule. This allows MLPs to learn complex, non-linear decision boundaries.
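
To make these layers concrete, here is a small PyTorch sketch (with a made-up mini-batch) showing that a fully connected layer is just nn.Linear, that dropout only zeroes activations in training mode, and that batch normalization computes its statistics across the batch dimension:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8)                  # made-up mini-batch: 4 samples, 8 features each

fc = nn.Linear(8, 3)                   # fully connected layer: maps 8 features to 3 outputs per sample
print(fc(x).shape)                     # torch.Size([4, 3])

dropout = nn.Dropout(p=0.5)            # note: in PyTorch, p is the probability of *dropping* a value
dropout.train()                        # training mode: roughly half of the activations become 0
print(dropout(x))
dropout.eval()                         # evaluation mode: dropout does nothing
print(dropout(x))

batchnorm = nn.BatchNorm1d(8)          # one mean/variance per feature, computed across the 4 samples
y = batchnorm(x)
print(y.mean(dim=0))                   # approximately 0 for every feature
print(y.std(dim=0, unbiased=False))    # approximately 1 for every feature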

📉 Loss Functions

Loss functions measure how far the model's predictions are from the true values. Common loss functions include:

| Loss Function | Description | Formula |
| --- | --- | --- |
| Cross Entropy Loss | Measures the difference between the predicted probability distribution and the true distribution. Use case: Classification problems (binary and multi-class). | L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i), where y_i = true probability, \hat{y}_i = predicted probability, C = number of classes. |
| Mean Squared Error (MSE) | Measures the average squared difference between predictions and actual values. Use case: Regression problems. Penalizes larger errors more heavily than smaller ones. | \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where y_i = true value, \hat{y}_i = predicted value, n = number of samples. |
| Mean Absolute Error (MAE) | Measures the average absolute difference between predictions and actual values. Use case: Regression problems where you need more robustness to outliers compared to MSE. | \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, where y_i = true value, \hat{y}_i = predicted value, n = number of samples. |
| Binary Cross Entropy | Special case of cross entropy for binary classification problems. Use case: Binary classification problems. Effective with sigmoid activation in the output layer. | L = -\frac{1}{n}\sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)], where y_i \in \{0, 1\} = true label, \hat{y}_i = predicted probability. |

The key point is to intuitively understand which loss function to use for a given problem or dataset (e.g. "Is it classification or regression?"). You can review the fundamentals of classification versus regression here.
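
As a quick illustration, here is a minimal PyTorch sketch (with made-up predictions and targets) of matching the loss to the problem type - cross entropy for classification, mean squared error for regression:

import torch
import torch.nn as nn

# Classification: cross entropy takes raw scores (logits) and integer class labels
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])    # made-up scores for 2 samples, 3 classes
labels = torch.tensor([0, 1])                # true class index for each sample
print(nn.CrossEntropyLoss()(logits, labels))

# Regression: MSE compares continuous predictions against continuous targets
predictions = torch.tensor([2.5, 0.0, 2.1])
targets = torch.tensor([3.0, -0.5, 2.0])
print(nn.MSELoss()(predictions, targets))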

⚙️ Optimization Algorithms

Optimization algorithms determine how to update the weights based on the calculated gradients. Common optimization algorithms include:

| Optimization Algorithm | Description | Formula |
| --- | --- | --- |
| Stochastic Gradient Descent (SGD) | Updates parameters in the direction of the negative gradient, using small random subsets of the data. Use case: Simple, memory-efficient baseline algorithm. Works well with large datasets when properly tuned. | \theta_{t+1} = \theta_t - \alpha \nabla_\theta J(\theta; x^{(i)}, y^{(i)}), where \alpha = learning rate, \nabla_\theta J = gradient of the loss function. |
| SGD with Momentum | Adds a momentum term to accelerate SGD, helping overcome local minima and oscillations. Use case: Problems with noisy gradients. Speeds up convergence and reduces oscillation. | v_{t+1} = \gamma v_t + \alpha \nabla_\theta J(\theta_t); \theta_{t+1} = \theta_t - v_{t+1}, where \gamma = momentum coefficient. |
| Adam (Adaptive Moment Estimation) | Combines momentum and adaptive learning rates based on first and second moments of the gradients. Use case: General-purpose optimizer that works well for most problems. Effective for large networks and noisy data. | m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla_\theta J(\theta); v_t = \beta_2 v_{t-1} + (1-\beta_2) (\nabla_\theta J(\theta))^2; \hat{m}_t = \frac{m_t}{1-\beta_1^t}; \hat{v}_t = \frac{v_t}{1-\beta_2^t}; \theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} |
| RMSprop | Adapts learning rates by dividing by a running average of squared gradients. Use case: Good for recurrent neural networks and non-stationary problems. | v_t = \beta v_{t-1} + (1-\beta)(\nabla_\theta J(\theta))^2; \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} \nabla_\theta J(\theta), where \beta = decay rate. |
| AdamW | Variant of Adam with decoupled weight decay for better regularization. Use case: Models requiring regularization. Offers improved generalization over Adam in many cases. | Adam update followed by a separate weight decay step: \theta_{t+1} \leftarrow \theta_{t+1} - \alpha \lambda \theta_t, where \lambda = weight decay coefficient. |

Optimization algorithms aim to find the minimum of a loss function. A convex function is one where any line segment connecting two points on the function lies on or above the function's graph: think of a bowl shape where there's only one lowest point. In these ideal cases, algorithms like gradient descent are guaranteed to find the global minimum. Deep learning loss landscapes are typically non-convex with multiple local minima, saddle points, and plateaus. You'll remember some of these terms from calculus.
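
To see the basic update rule in action, here is a tiny PyTorch sketch of gradient descent on the convex, bowl-shaped toy function J(\theta) = \theta^2 (a made-up loss), whose only minimum is at \theta = 0:

import torch

theta = torch.tensor(5.0, requires_grad=True)   # made-up starting point
optimizer = torch.optim.SGD([theta], lr=0.1)    # plain SGD with learning rate 0.1

for step in range(50):
    loss = theta ** 2                           # convex toy loss J(theta) = theta^2
    loss.backward()                             # compute the gradient dJ/dtheta = 2*theta
    optimizer.step()                            # theta <- theta - lr * gradient
    optimizer.zero_grad()                       # clear the gradient for the next step

print(theta.item())                             # very close to 0, the global minimum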

The plots below show a few optimization trajectories with different initialization points across the loss landscape of a function with two variables x1x_1 and x2x_2. The white paths represent how different algorithms navigate toward minima (blue regions) while avoiding getting stuck in suboptimal areas. Notice how the algorithms must traverse through various contours (yellow, green, and orange regions representing higher loss values) to find the lowest points in the landscape.

[Figure] Optimization trajectories of different algorithms across the loss landscape of a two-variable function.

🔄 Putting It All Together

Now that we’ve covered the key components of deep learning, let’s explore how they work together. In the previous article, we introduced the generic training process for deep neural networks using Python and PyTorch pseudocode. Below is an extended version that demonstrates how to define a model, set up loss functions and optimizers, and implement the training loop.

# Step 0: Imports for neural networks packages (PyTorch)
import torch
import torch.nn as nn
import torch.optim as optim

# Step 1: Define the neural network architecture
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, C):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)        # Fully connected layer with 128 neurons
        self.relu = nn.ReLU()                        # ReLU activation
        self.dropout = nn.Dropout(0.2)               # Dropout for regularization with p = 0.2
        self.batchnorm = nn.BatchNorm1d(128)         # Batch normalization for 128 neurons
        self.fc2 = nn.Linear(128, C)                 # Fully connected layer with C neurons
        self.softmax = nn.Softmax(dim=1)             # Softmax activation (to turn logits into probabilities, e.g. at inference)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.batchnorm(x)
        return self.fc2(x)                           # Return raw scores (logits); apply self.softmax only when probabilities are needed

# Step 2: Create the model instance
model = NeuralNetwork(input_size=4, C=3)             # input_size = number of input features (assume 4), C = number of classes (assume 3)

# Step 3: Define the loss function
loss_function = nn.CrossEntropyLoss()                # For multi-class classification (applies softmax internally, so the model outputs raw logits)

# Step 4: Define the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Adam is widely used (assume learning rate of 0.01)

# Step 5: Training loop
for epoch in range(10):                              # For some number of epochs (assume 10):
    for batch_x, batch_y in train_data:              # For each batch of training data:
        predictions = model(batch_x)                 # 1. Forward pass to make predictions
        loss = loss_function(predictions, batch_y)   # 2. Calculate error
        loss.backward()                              # 3. Calculate new gradients
        optimizer.step()                             # 4. Adjust weights based on gradients
        optimizer.zero_grad()                        # 5. Clear previous gradients
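
After training, you would typically switch the model to evaluation mode (so dropout and batch normalization behave correctly) and measure performance on held-out data. Below is a minimal sketch, assuming a test_data loader shaped like train_data above:

# Step 6 (illustrative): Evaluate the trained model on held-out data
model.eval()                                         # Disable dropout; use running batch norm statistics
correct, total = 0, 0
with torch.no_grad():                                # No gradients needed during evaluation
    for batch_x, batch_y in test_data:               # test_data is assumed to be batched like train_data
        logits = model(batch_x)                      # Forward pass only
        predicted = logits.argmax(dim=1)             # Pick the class with the highest score
        correct += (predicted == batch_y).sum().item()
        total += batch_y.size(0)
print(f"Accuracy: {correct / total:.2%}")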

Choosing Hyperparameters

You’ll notice that the pseudocode above includes parameters that can be adjusted (e.g. lr, the number of neurons in hidden layers). These are known as hyperparameters, and they are settings chosen before training begins. Finding the right values can be challenging but is crucial for performance (the sketch after this list shows where each one appears in code).

  • Learning rate (lr): Controls how much the model adjusts its weights in each step. If it’s too high, the model may jump around and never find the best solution. If it’s too low, learning becomes slow and may take forever to improve.
  • Batch size: Training in smaller groups (batches) helps the model learn efficiently. It prevents memory overload, speeds up training, and helps the model generalize better rather than just memorizing the data.
  • Number of layers and neurons: Determines the model’s ability to learn patterns. More layers and neurons increase learning capacity but also require more data and tuning to avoid overfitting.
  • Dropout rate (p): Helps prevent overfitting by randomly turning off some neurons during training. A higher dropout rate means more neurons are ignored in each pass, making the model more robust.
  • Batch normalization: Normalizes activations in a layer to stabilize learning and speed up convergence. It helps prevent internal shifts in data distribution, improving training efficiency.
  • Number of epochs: Defines how many times the model will go through the entire dataset. More epochs improve learning but can lead to overfitting if too high.
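
To tie these hyperparameters back to the code, here is a minimal sketch (with made-up values and an assumed train_dataset) of where each one appears when setting up training in PyTorch:

from torch.utils.data import DataLoader

# Made-up hyperparameter choices - in practice you would tune them on a validation set
learning_rate = 0.001                                # lr: step size for each weight update
batch_size = 32                                      # number of samples per training batch
num_epochs = 10                                      # passes over the full dataset
dropout_rate = 0.2                                   # p passed to nn.Dropout in the model definition
hidden_size = 128                                    # neurons in the hidden layer (fc1 above)

train_data = DataLoader(train_dataset,               # train_dataset is assumed to exist
                        batch_size=batch_size,       # batch size is set on the data loader
                        shuffle=True)                # shuffling each epoch helps generalization

optimizer = optim.Adam(model.parameters(), lr=learning_rate)   # learning rate is set on the optimizer

for epoch in range(num_epochs):                      # number of epochs controls the outer training loop
    ...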