
CNN Layers

CNNs process images through several types of layers, each with a special job.

| Layer Type | Nickname | Description | Function |
| --- | --- | --- | --- |
| Convolutional | Pattern detectors | Slides filters across the image to detect patterns | Identifies edges, textures, and shapes at varying levels of complexity |
| Pooling | Summarizers | Reduces the size of feature maps while preserving important information | Makes the network more efficient and robust to small position changes |
| Fully Connected | Decision makers | Connects all neurons to the previous layer | Combines detected features to make final classifications |

Together, these three layer types form the backbone of most CNN architectures. We'll now explore each of them in more detail.

🦚 Convolutional Layers

Convolutional layers are the heart of CNNs. They work by sliding small filters (sometimes called kernels) across the image. Think of each filter as a tiny spotlight looking for a specific pattern.

  • How they work: Each filter is a small grid of numbers (usually $3 \times 3$ or $5 \times 5$ pixels)
  • What they do: As the filter slides across the image, it multiplies its values with the pixel values and adds them up
  • What they find: Early layers detect simple patterns like edges and corners, while deeper layers find more complex patterns like textures and object parts

A typical CNN uses multiple filters in each layer (often 32, 64, or 128 filters), each one searching for different patterns. The output of a convolutional layer is called a feature map: a map showing where each pattern was found in the image.

A convolutional layer applies the following formula:

$$y = \sigma(K * x + b)$$

which, written out element by element, is

$$y_{i,j} = \sigma\left(\sum_{m,n} K_{m,n}\, x_{i+m,\, j+n} + b\right)$$

where $y$ = output feature map, $x$ = input image or feature map (depending on where the layer sits in the network), $*$ = the convolution operation, $K$ = kernel (i.e., filter) with learnable parameters, $b$ = bias, $\sigma$ = activation function (e.g., ReLU), $(i,j)$ = coordinates of the output feature map, and $(m,n)$ = coordinates within the kernel that move across the input. Let's look at a convolutional layer in action.

[Figure: Convolutional layer in action. Assume filter size is 3, stride is 1, $\sigma$ is ReLU, and bias is 0.]
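To make the formula concrete, here is a minimal single-channel NumPy sketch of the sliding-window computation (the function name and structure are our own; real frameworks vectorize this and handle multiple channels and filters):

```python
import numpy as np

def conv2d(x, kernel, bias=0.0, stride=1):
    """Slide a filter over a 2-D input: y[i,j] = ReLU(sum(K * region) + b)."""
    f = kernel.shape[0]                       # filter size F (square filter assumed)
    out = (x.shape[0] - f) // stride + 1      # output size: (Input Size - F) / S + 1
    y = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            region = x[i*stride:i*stride+f, j*stride:j*stride+f]
            y[i, j] = np.sum(region * kernel) + bias   # multiply-and-sum
    return np.maximum(y, 0)                   # sigma = ReLU
```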

Key Parameters for Convolutional Layers

Two important parameters control how a filter moves across the input:

  • Filter Size $(F)$: Determines the dimensions of the filter. Larger filters can capture more complex patterns but require more computation. We use a $3 \times 3$ filter $(F = 3)$ above.
  • Stride $(S)$: Controls how many pixels the filter moves in each step. With $S = 1$ above, the filter moves one pixel at a time, creating overlapping receptive fields. Larger strides result in smaller output feature maps but may lose information.

The output dimensions of a convolutional layer can be calculated as:

$$\text{Output Size} = \frac{\text{Input Size} - F}{S} + 1$$

For example, with a $4 \times 4$ input, a $3 \times 3$ filter, and a stride of 1, we get $\frac{4 - 3}{1} + 1 = 2$. This is a $2 \times 2$ output feature map, as shown in the diagram.
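As a quick sketch, the same formula in code (the helper name is ours); integer division also gives the floor behavior we will need for pooling below:

```python
def output_size(input_size: int, f: int, stride: int) -> int:
    """(Input Size - F) / S + 1, using floor division for non-exact fits."""
    return (input_size - f) // stride + 1

print(output_size(4, 3, 1))  # 2 -> a 2x2 feature map, as in the diagram
```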

Calculation Example for Convolutional Layers

Let's walk through the calculation of one output cell for convolution, the bottom-left, which equals 6:

  1. We position the $3 \times 3$ filter (denoted $K$ in the formula) over the corresponding region of the input:
$$\text{Region:} \begin{bmatrix} 2 & 6 & 4 \\ 1 & 9 & 7 \\ 3 & 2 & 5 \end{bmatrix} \quad \text{Filter:} \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$
  2. We multiply each pair of corresponding values and sum them all:
$$(2 \times 1) + (6 \times 2) + (4 \times 1) + (1 \times 0) + (9 \times 0) + (7 \times 0) + (3 \times -1) + (2 \times -2) + (5 \times -1)$$
  3. This calculates to: $2 + 12 + 4 + 0 + 0 + 0 - 3 - 4 - 5 = 18 - 12 = 6$
  4. Finally, we apply the ReLU activation function: $\text{ReLU}(6) = \max(0, 6) = 6$
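We can verify this arithmetic directly in NumPy (this particular filter happens to be a horizontal-edge detector):

```python
import numpy as np

region = np.array([[2, 6, 4],
                   [1, 9, 7],
                   [3, 2, 5]])
kernel = np.array([[ 1,  2,  1],
                   [ 0,  0,  0],
                   [-1, -2, -1]])

pre_activation = np.sum(region * kernel)   # 18 - 12 = 6
print(max(0, pre_activation))              # ReLU(6) = 6
```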

Additional Details on Convolutional Layers

  • Padding: Often, we add extra "border" pixels (usually zeros) around the input to control the output size and preserve information at the edges. With proper padding, we can maintain the input dimensions in the output (see the short sketch after this list).
  • Multiple filters: A convolutional layer typically applies many different filters (such as 32, 64, or 128) to the same input, each detecting different patterns and creating multiple output feature maps.
  • Receptive field: Each neuron in a convolutional layer is connected to only a small region of the input (called its receptive field), unlike fully connected layers, where each neuron connects to every input.
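As a rough illustration of the padding point above (the input values are made up): NumPy's np.pad adds the zero border, so a $4 \times 4$ input padded by one pixel becomes $6 \times 6$, and a $3 \times 3$ filter with stride 1 then yields a $4 \times 4$ output, matching the input size:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)       # a toy 4x4 input
x_padded = np.pad(x, pad_width=1)     # 1-pixel border of zeros -> 6x6

print(x_padded.shape)                 # (6, 6)
print((6 - 3) // 1 + 1)               # 4: output stays 4x4 with a 3x3 filter
```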

These parameters and design choices allow CNNs to efficiently learn hierarchical features: simple patterns in early layers and increasingly complex structures in deeper layers.

🎯 Pooling Layers

After finding patterns with convolutional layers, pooling layers simplify the information.

  • How they work: They divide feature maps into small regions (usually $2 \times 2$ pixels) and keep only the most important information from each region
  • What they do: The most common type is max pooling, which simply keeps the highest value from each region
  • Why they matter: Pooling reduces the size of the feature maps (making computation faster), helps the network focus on important features, and makes detection more robust to small changes in position

For example, if a pattern is shifted slightly in an image, max pooling will still detect it because it preserves the strongest signals.

A max pooling layer applies the following formula:

$$y_{i,j} = \max_{m,n \in R_{i,j}} x_{m,n}$$

where $y$ = output feature map, $x$ = input feature map, $(i,j)$ = coordinates of the output feature map, and $R_{i,j}$ = the region in the input feature map that corresponds to output position $(i,j)$. Let's look at a pooling layer in action.

[Figure: Pooling layer in action. Assume pool size is 2 and stride is 2.]
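As with convolution, here is a minimal NumPy sketch of max pooling (the function name is ours; frameworks ship this as a built-in layer):

```python
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    """Keep the maximum of each pool x pool region of a 2-D feature map."""
    out = (x.shape[0] - pool) // stride + 1       # floor((N - P) / S) + 1
    y = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            y[i, j] = x[i*stride:i*stride+pool, j*stride:j*stride+pool].max()
    return y                                      # no activation after pooling
```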

Key Parameters for Pooling Layers

Two important parameters control how pooling operates:

  • Pool Size $(P)$: Determines the dimensions of the pooling region, typically $2 \times 2$. Larger pool sizes result in more aggressive downsampling but might lose more information.
  • Stride $(S)$: Controls how many pixels the pooling window moves in each step. For pooling, the stride is commonly set equal to the pool size (e.g., $S = 2$ for a $2 \times 2$ pool) to create non-overlapping regions, but it can differ.

The output dimensions of a pooling layer can be calculated as:

$$\text{Output Size} = \left\lfloor \frac{\text{Input Size} - P}{S} \right\rfloor + 1$$

The $\lfloor \; \rfloor$ is the floor operation (round down to the nearest integer). For example, with a $4 \times 4$ input, a $2 \times 2$ pool size, and a stride of 2, we get $\lfloor\frac{4-2}{2}\rfloor + 1 = 2$. This produces a $2 \times 2$ output feature map, as shown in the diagram.
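The floor matters whenever the window does not fit exactly; a quick check (helper name ours):

```python
def pool_output_size(n: int, p: int, s: int) -> int:
    return (n - p) // s + 1          # // is floor division

print(pool_output_size(4, 2, 2))     # 2: exact fit, 2x2 output
print(pool_output_size(5, 2, 2))     # 2: floor(3/2) + 1, the last column is dropped
```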

Calculation Example for Max Pooling

Let's walk through the calculation of one output cell for max pooling, the top-left, which equals 7:

  1. We get the corresponding $2 \times 2$ region from the input feature map:
$$\text{Region:} \begin{bmatrix} 7 & 3 \\ 1 & 5 \end{bmatrix}$$
  2. For max pooling, we simply find the maximum value in this region: $\max(7, 3, 1, 5) = 7$.

This maximum value (7) becomes the output for this region. No activation function is typically applied after pooling.
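The same check in NumPy:

```python
import numpy as np

region = np.array([[7, 3],
                   [1, 5]])
print(region.max())   # 7: the strongest activation in the region survives
```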

Types of Pooling

Several pooling variants exist, each with specific characteristics.

| Pooling Type | Description | Formula |
| --- | --- | --- |
| Max Pooling | Takes the maximum value from each region. Most common type; preserves the strongest feature activations. Use case: feature detection, when peak values matter most. | $y_{i,j} = \max_{m,n \in R_{i,j}} x_{m,n}$, where $R_{i,j}$ = region around position $(i,j)$. |
| Average Pooling | Takes the average of all values in each region. Preserves overall feature intensity. Use case: when overall intensity is more important than specific peaks. | $y_{i,j} = \frac{1}{\lvert R_{i,j} \rvert} \sum_{m,n \in R_{i,j}} x_{m,n}$, where $\lvert R_{i,j} \rvert$ = number of elements in region $R_{i,j}$. |
| Global Pooling | Performs pooling across the entire feature map, reducing spatial dimensions to a single value per channel. Use case: transition from spatial features to classification. | $y_{\text{max}} = \max_{i,j} x_{i,j,c}$ and $y_{\text{avg}} = \frac{1}{H \times W} \sum_{i,j} x_{i,j,c}$, where $c$ = channel, $H$ = height, and $W$ = width. |
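A small sketch contrasting the three variants on a toy single-channel feature map (values are made up, reusing the region from the worked example as the top-left block):

```python
import numpy as np

x = np.array([[7., 3., 2., 1.],
              [1., 5., 4., 8.],
              [0., 2., 9., 6.],
              [4., 4., 3., 3.]])

# Split the 4x4 map into non-overlapping 2x2 regions (pool size = stride = 2).
regions = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3)

print(regions.max(axis=(2, 3)))    # max pooling:     [[7. 8.] [4. 9.]]
print(regions.mean(axis=(2, 3)))   # average pooling: [[4.   3.75] [2.5  5.25]]
print(x.max(), x.mean())           # global max / average pooling: 9.0 3.875
```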

Additional Details on Pooling Layers

  • No learnable parameters: Unlike convolutional layers, standard pooling layers have no weights to learn. They perform a fixed mathematical operation.
  • Dimensionality reduction: Pooling significantly reduces the spatial dimensions of feature maps, decreasing computational load in deeper layers.
  • Translation invariance: Pooling helps the network become less sensitive to exact positions of features, allowing it to recognize objects even if they're slightly shifted or rotated.
  • Information loss: Pooling deliberately discards spatial information, which can be a downside if precise locations matter for your task.

Modern CNN architectures sometimes minimize or eliminate pooling layers, instead relying on strided convolutions to reduce dimensions while preserving more spatial information.
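As a quick illustration of that trade-off, using the standard padded variant of the size formula, $\frac{\text{Input Size} - F + 2 \times \text{Padding}}{S} + 1$ (numbers made up): a $3 \times 3$ convolution with stride 2 and 1 pixel of padding halves a feature map just as $2 \times 2$ pooling does, but with learnable weights:

```python
n, f, s, pad = 8, 3, 2, 1
print((n - f + 2 * pad) // s + 1)   # 4: an 8x8 map shrinks to 4x4, like 2x2 pooling
```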

⚖️ Fully Connected Layers

After several rounds of convolution and pooling, fully connected layers take all the extracted features and make the final decision. We explored how these layers work in detail earlier.

  • How they work: Every neuron in these layers connects to every neuron in the previous layer
  • What they do: They learn which combinations of high-level features are associated with specific classes
  • Why they matter: These layers transform the spatial data from convolution into final classifications

For example, if the convolutional layers detected whiskers, pointed ears, and a tail, the fully connected layers might conclude "cat" based on this combination of features.
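A minimal sketch of this final step, with made-up shapes: the pooled feature maps are flattened into one vector, and a weight matrix maps that vector to class scores:

```python
import numpy as np

rng = np.random.default_rng(0)

features = rng.random((8, 2, 2))        # 8 pooled 2x2 feature maps (made-up shape)
x = features.reshape(-1)                # flatten to a 32-dimensional vector

W = rng.standard_normal((3, x.size))    # weights: every output sees every input
b = np.zeros(3)                         # biases for 3 hypothetical classes

logits = W @ x + b
probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> class probabilities
print(probs)                            # e.g. P(cat), P(dog), P(bird)
```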


Now that we’ve covered the key layer types in CNNs, you’re ready to build your own CNN model in the upcoming coding exercise! 🚀