CNNs process images through several types of layers, each with a specialized job.
| Layer Type | Nickname | Description | Function |
| --- | --- | --- | --- |
| Convolutional | Pattern detectors | Slides filters across the image to detect patterns | Identifies edges, textures, and shapes at varying levels of complexity |
| Pooling | Summarizers | Reduces the size of feature maps while preserving important information | Makes the network more efficient and robust to small position changes |
| Fully Connected | Decision makers | Connects every neuron to all neurons in the previous layer | Combines detected features to make final classifications |
Together, these three layer types form the backbone of most CNN architectures. We'll now explore these layer types in more detail.
🦚 Convolutional Layers
Convolutional layers are the heart of CNNs. They work by sliding small filters (sometimes called kernels) across the image. Think of each filter as a tiny spotlight looking for a specific pattern.
How they work: Each filter is a small grid of numbers (usually 3×3 or 5×5)
What they do: As the filter slides across the image, it multiplies its values with the pixel values and adds them up
What they find: Early layers detect simple patterns like edges and corners, while deeper layers find more complex patterns like textures and object parts
A typical CNN uses multiple filters in each layer (often 32, 64, or 128 filters), each one searching for different patterns. The output of a convolutional layer is called a feature map, a map showing where each pattern was found in the image.
A convolutional layer applies the following formula:
$$y = \sigma(K * x + b), \qquad y_{i,j} = \sigma\!\left(\sum_{m,n} K_{m,n}\, x_{i+m,\, j+n} + b\right)$$
where $y$ is the output feature map, $x$ is the input image or feature map (depending on the layer's position in the network), $*$ denotes the convolution operation, $K$ is the kernel (i.e., filter) with learnable parameters, $b$ is the bias, $\sigma$ is the activation function (e.g., ReLU), $(i,j)$ are coordinates in the output feature map, and $(m,n)$ are coordinates within the kernel as it moves across the input. Let's look at a convolutional layer in action.
Convolutional layer in action. Assume filter size is 3, stride is 1, σ is ReLU, bias is 0.
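To make the formula concrete, here is a minimal NumPy sketch of the same operation under the same assumptions (3×3 filter, stride 1, ReLU, zero bias). Note that deep-learning "convolution" is implemented as cross-correlation (the kernel is not flipped), which is what the formula above describes; the function and example values are illustrative, not taken from any particular library:

```python
import numpy as np

def conv2d(x, K, b=0.0, stride=1):
    """Naive single-channel convolution: y[i,j] = ReLU(sum(K * region) + b)."""
    H, W = x.shape
    F = K.shape[0]                        # assume a square F x F kernel
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = x[i*stride:i*stride+F, j*stride:j*stride+F]
            y[i, j] = np.sum(K * region) + b   # sum over m, n of K[m,n] * x[i+m, j+n]
    return np.maximum(y, 0)                    # sigma = ReLU

x = np.arange(16, dtype=float).reshape(4, 4)   # illustrative 4x4 input
K = np.array([[1, 0, -1],
              [2, 0, -2],
              [1, 0, -1]], dtype=float)        # edge-detecting (Sobel-like) filter
print(conv2d(x, K).shape)                      # (2, 2)
```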
Key Parameters for Convolutional Layers
Two important parameters control how a filter moves across the input:
Filter Size ($F$): Determines the dimensions of the filter. Larger filters can capture more complex patterns but require more computation. We use a 3×3 filter ($F = 3$) above.
Stride ($S$): Controls how many pixels the filter moves at each step. With $S = 1$ above, the filter moves one pixel at a time, creating overlapping receptive fields. Larger strides produce smaller output feature maps but may lose information.
The output dimensions of a convolutional layer can be calculated as:
$$\text{Output Size} = \frac{\text{Input Size} - F}{S} + 1$$
For example, with a 4×4 input, a 3×3 filter, and a stride of 1, we get $(4 - 3)/1 + 1 = 2$. This is a 2×2 output feature map, as shown in the diagram.
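In code, this is a one-line helper (a hypothetical utility, not from any particular library):

```python
def conv_output_size(input_size: int, F: int, S: int = 1) -> int:
    """(Input Size - F) / S + 1, for a convolution with no padding."""
    return (input_size - F) // S + 1

print(conv_output_size(4, F=3, S=1))   # 2 -> a 2x2 feature map
```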
Calculation Example for Convolutional Layers
Let's walk through the calculation of one output cell of the convolution: the bottom-left value, which equals 6.
We position the 3×3 filter (denoted as K in the formula) over the corresponding region of the input:
$$\text{Region: } \begin{bmatrix} 2 & 1 & 3 \\ 6 & 9 & 2 \\ 4 & 7 & 5 \end{bmatrix} \qquad \text{Filter: } \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}$$
We multiply each pair of corresponding values and sum them all:
$$(2)(1) + (1)(0) + (3)(-1) + (6)(2) + (9)(0) + (2)(-2) + (4)(1) + (7)(0) + (5)(-1) = 2 - 3 + 12 - 4 + 4 - 5 = 6$$
Finally, we apply the ReLU activation function: $\text{ReLU}(6) = \max(0, 6) = 6$.
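We can check this arithmetic with a few lines of NumPy:

```python
import numpy as np

region = np.array([[2, 1, 3],
                   [6, 9, 2],
                   [4, 7, 5]])
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]])

s = (region * kernel).sum()   # elementwise products, summed -> 6
print(s, max(0, s))           # 6 6  (ReLU leaves positive values unchanged)
```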
Additional Details on Convolutional Layers
Padding: Often, we add extra "border" pixels (usually zeros) around the input to control the output size and preserve information at the edges. With proper padding, we can maintain the input dimensions in the output (see the sketch after this list).
Multiple filters: A convolutional layer typically applies many different filters (e.g., 32, 64, or 128) to the same input, each detecting a different pattern and producing its own output feature map.
Receptive field: Each neuron in a convolutional layer is connected to only a small region of the input (called its receptive field), unlike fully connected layers where each neuron connects to every input.
These parameters and design choices allow CNNs to efficiently learn hierarchical features: simple patterns in early layers and increasingly complex structures in deeper layers.
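For instance, here is a small sketch of zero padding with NumPy (illustrative values):

```python
import numpy as np

x = np.arange(16, dtype=float).reshape(4, 4)   # illustrative 4x4 input
x_padded = np.pad(x, pad_width=1)              # 1-pixel border of zeros -> 6x6

# With padding P, the output size becomes (Input + 2P - F)/S + 1,
# so a 3x3 filter on the padded input gives (4 + 2 - 3)/1 + 1 = 4:
# the 4x4 input size is preserved ("same" padding).
print(x_padded.shape)                          # (6, 6)
```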
🎯 Pooling Layers
After finding patterns with convolutional layers, pooling layers simplify the information.
How they work: They divide feature maps into small regions (usually 2×2 pixels) and keep only the most important information from each region
What they do: The most common type is max pooling, which simply keeps the highest value from each region
Why they matter: Pooling reduces the size of the feature maps (making computation faster), helps the network focus on important features, and makes detection more robust to small changes in position
For example, if a pattern is shifted slightly in an image, max pooling will still detect it because it preserves the strongest signals.
A max pooling layer applies the following formula:
$$y_{i,j} = \max_{m,n \in R_{i,j}} x_{m,n}$$
where $y$ is the output feature map, $x$ is the input feature map, $(i,j)$ are coordinates of the output feature map, and $R_{i,j}$ is the region in the input feature map that corresponds to output position $(i,j)$. Let's look at a pooling layer in action.
Pooling layer in action. Assume pool size is 2 and stride is 2.
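Here is a minimal NumPy sketch of max pooling under these assumptions (pool size 2, stride 2); the function and the input values are illustrative, though the top-left 2×2 region of this input matches the worked example below:

```python
import numpy as np

def max_pool2d(x, P=2, stride=2):
    """Naive max pooling: keep the largest value in each P x P region."""
    H, W = x.shape
    out_h = (H - P) // stride + 1      # floor((Input - P)/S) + 1
    out_w = (W - P) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = x[i*stride:i*stride+P, j*stride:j*stride+P].max()
    return y

x = np.array([[7., 3., 1., 0.],
              [1., 5., 2., 4.],
              [0., 2., 6., 1.],
              [3., 1., 2., 8.]])       # illustrative 4x4 feature map
print(max_pool2d(x))                   # [[7. 4.] [3. 8.]]
```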
Key Parameters for Pooling Layers
Two important parameters control how pooling operates:
Pool Size ($P$): Determines the dimensions of the pooling region, typically 2×2. Larger pool sizes result in more aggressive downsampling but may lose more information.
Stride ($S$): Controls how many pixels the pooling window moves at each step. For pooling, the stride is commonly set equal to the pool size (e.g., $S = 2$ for a 2×2 pool) to create non-overlapping regions, but it can differ.
The output dimensions of a pooling layer can be calculated as:
$$\text{Output Size} = \left\lfloor \frac{\text{Input Size} - P}{S} \right\rfloor + 1$$
Here $\lfloor \cdot \rfloor$ is the floor operation (round down to the nearest integer). For example, with a 4×4 input, a 2×2 pool size, and a stride of 2, we get $\lfloor (4 - 2)/2 \rfloor + 1 = 2$. This produces a 2×2 output feature map, as shown in the diagram.
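The same formula in code (again a hypothetical helper):

```python
import math

def pool_output_size(input_size: int, P: int, S: int) -> int:
    """floor((Input Size - P) / S) + 1"""
    return math.floor((input_size - P) / S) + 1

print(pool_output_size(4, P=2, S=2))   # 2
print(pool_output_size(5, P=2, S=2))   # 2 -- the floor drops the leftover row/column
```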
Calculation Example for Max Pooling
Let's walk through the calculation of one output cell of the max pooling: the top-left value, which equals 7.
We get the corresponding 2×2 region from the input feature map:
$$\text{Region: } \begin{bmatrix} 7 & 3 \\ 1 & 5 \end{bmatrix}$$
For max pooling, we simply find the maximum value in this region: max(7,3,1,5)=7.
This maximum value (7) becomes the output for this region. No activation function is typically applied after pooling.
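Verifying in NumPy:

```python
import numpy as np

region = np.array([[7, 3],
                   [1, 5]])
print(region.max())   # 7 -- the strongest activation in this region
```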
Types of Pooling
Several pooling variants exist, each with specific characteristics.
| Pooling Type | Description | Formula |
| --- | --- | --- |
| Max Pooling | Takes the maximum value from each region. Most common type; preserves the strongest feature activations. Use case: feature detection, when peak values matter most. | $y_{i,j} = \max_{m,n \in R_{i,j}} x_{m,n}$, where $R_{i,j}$ is the region around position $(i,j)$. |
| Average Pooling | Takes the average of all values in each region. Preserves overall feature intensity. Use case: when overall intensity is more important than specific peaks. | $y_{i,j} = \frac{1}{\lvert R_{i,j} \rvert} \sum_{m,n \in R_{i,j}} x_{m,n}$, where $\lvert R_{i,j} \rvert$ is the number of elements in region $R_{i,j}$. |
| Global Pooling | Performs pooling across the entire feature map, reducing spatial dimensions to a single value per channel. Use case: transition from spatial features to classification. | $y_{\max} = \max_{i,j} x_{i,j,c}$ and $y_{\text{avg}} = \frac{1}{H \times W} \sum_{i,j} x_{i,j,c}$, where $c$ is the channel, $H$ the height, and $W$ the width. |
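To contrast the three variants, here is a short NumPy sketch applying each to the same illustrative 4×4 single-channel feature map:

```python
import numpy as np

x = np.array([[7., 3., 1., 0.],
              [1., 5., 2., 4.],
              [0., 2., 6., 1.],
              [3., 1., 2., 8.]])       # illustrative single-channel map

# 2x2 regions with stride 2 (four non-overlapping regions)
regions = [x[i:i+2, j:j+2] for i in (0, 2) for j in (0, 2)]

max_pooled = np.array([r.max() for r in regions]).reshape(2, 2)
avg_pooled = np.array([r.mean() for r in regions]).reshape(2, 2)
print(max_pooled)                # [[7.   4.  ] [3.   8.  ]]
print(avg_pooled)                # [[4.   1.75] [1.5  4.25]]

# Global pooling collapses the entire map to one value per channel
print(x.max(), x.mean())         # 8.0 2.875
```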
Additional Details on Pooling Layers
No learnable parameters: Unlike convolutional layers, standard pooling layers have no weights to learn. They perform a fixed mathematical operation.
Dimensionality reduction: Pooling significantly reduces the spatial dimensions of feature maps, decreasing computational load in deeper layers.
Translation invariance: Pooling helps the network become less sensitive to exact positions of features, allowing it to recognize objects even if they're slightly shifted or rotated.
Information loss: Pooling deliberately discards spatial information, which can be a downside if precise locations matter for your task.
Modern CNN architectures sometimes minimize or eliminate pooling layers, instead relying on strided convolutions to reduce dimensions while preserving more spatial information.
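As a rough illustration of this alternative (reusing the `conv2d` sketch defined earlier, with illustrative random values), a stride-2 convolution produces the same downsampling as 2×2 pooling:

```python
import numpy as np

# A 2x2 filter moved with stride 2 halves the spatial size just like
# 2x2 max pooling, but its filter weights are learned during training.
x = np.random.rand(4, 4)               # illustrative input
K = np.random.rand(2, 2)               # stand-in for a learned 2x2 filter
print(conv2d(x, K, stride=2).shape)    # (2, 2) -- same shape as the pooled output
```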
⚖️ Fully Connected Layers
After several rounds of convolution and pooling, fully connected layers take all the extracted features and make the final decision. We explored how these layers work in detail here.
How they work: Every neuron in these layers connects to every neuron in the previous layer
What they do: They learn which combinations of high-level features are associated with specific classes
Why they matter: These layers transform the spatial data from convolution into final classifications
For example, if the convolutional layers detected whiskers, pointed ears, and a tail, the fully connected layers might conclude "cat" based on this combination of features.
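A minimal sketch of this final stage, assuming 8 pooled 2×2 feature maps and 3 classes (all values here are illustrative; real weights are learned during training):

```python
import numpy as np

feature_maps = np.random.rand(8, 2, 2)         # 8 channels of 2x2 pooled maps
flat = feature_maps.reshape(-1)                # flatten to a 32-value vector

W = np.random.rand(3, 32)                      # 3 classes x 32 features
b = np.zeros(3)
logits = W @ flat + b                          # every output sees every input value

probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> class probabilities
print(probs.argmax())                          # index of the predicted class
```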
Now that we’ve covered the key layer types in CNNs, you’re ready to build your own CNN model in the upcoming coding exercise! 🚀