Aiphabet

Introduction to CNNs

How can computers recognize objects in images? How can they tell the difference between a cat and a dog, or know if someone is smiling in a photo? The answer lies in computer vision and convolutional neural networks (CNNs).

๐Ÿ‘๏ธ What are CNNs?

A CNN is a neural network architecture that specializes in understanding images and other grid-like data. Much like our eyes and brain work together to recognize objects, CNNs process images layer by layer to identify patterns, shapes, and eventually entire objects.

When humans look at a picture of a dog, we don't just see random pixels. We notice the ears, the tail, and the nose, and put them together to recognize that it's a dog. CNNs work similarly, starting with simple patterns and building up to complex features. We introduced this concept in a previous article on deep learning, and we gave an overview of CNN architecture in our article on deep learning in practice. In this article, we'll explore CNNs in more detail, focusing on image processing and key layer types (e.g., convolutional, pooling).

This diagram illustrates a CNN architecture. It processes an input image through convolutional layers to extract features, pooling layers to reduce dimensionality, and fully connected layers to make a final prediction.
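To make "reduce dimensionality" concrete, the sketch below traces how the size of the data changes as it moves through a small CNN. The layer sizes (a 32 × 32 RGB input, 3 × 3 filters, 2 × 2 pooling, 10 output classes) are assumptions chosen for illustration, not the architecture in the diagram.

```python
# Hypothetical layer sizes chosen for illustration; they are not from the article.

def conv_output_size(size, kernel, stride=1, padding=0):
    """Width (or height) of a feature map after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, window):
    """Width (or height) after non-overlapping pooling with a square window."""
    return size // window

side, channels = 32, 3                       # 32 x 32 RGB input image
shapes = [("input", side, channels)]

side, channels = conv_output_size(side, kernel=3), 8    # 8 filters of size 3 x 3
shapes.append(("conv1", side, channels))
side = pool_output_size(side, window=2)                  # 2 x 2 pooling halves each side
shapes.append(("pool1", side, channels))

side, channels = conv_output_size(side, kernel=3), 16   # 16 filters of size 3 x 3
shapes.append(("conv2", side, channels))
side = pool_output_size(side, window=2)
shapes.append(("pool2", side, channels))

flattened = side * side * channels           # length of the vector fed to the fully connected layer
num_classes = 10                             # assumed number of output classes

for name, s, c in shapes:
    print(f"{name}: {s} x {s} x {c}")
print(f"flatten: {flattened} values -> fully connected -> {num_classes} class scores")
```

The grids shrink from 32 × 32 to 6 × 6 while the number of feature channels grows from 3 to 16, which is the usual trade a CNN makes: less spatial detail, more learned features.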

📜 Evolution of Computer Vision

Image processing and recognition with artificial intelligence have come a long way. Below, we group the history of computer vision into four main eras. As a reminder, we have a separate lesson on the broader history of artificial intelligence.

1. Early Days (1950s-1980s): In the early years, computers could only detect very simple edges and shapes. Simple rule-based systems were created where computers followed fixed patterns like "if you detect a corner, return to left." While very basic, these systems laid the groundwork.

2. Neural Network Era (1980s-2000s): Scientists started using neural networks, machine learning models that can learn from data, for image recognition. At first, these models weren't very accurate and needed a lot of hand-crafted features.

3. CNN Revolution (2012): A CNN called AlexNet shocked the world by winning the 2012 ImageNet recognition challenge by a wide margin, dramatically improving image recognition accuracy. This sparked the modern era of computer vision.

4. Deep Learning Boom (2012-Present): With more data and compute, CNNs have become very powerful over the past few years. They can now recognize objects and faces, and even help self-driving cars understand their surroundings with great accuracy. We have also seen the rise of generative image models (e.g., generative adversarial networks and diffusion models such as Stable Diffusion) that can generate full-scale images and artwork.

๐Ÿ–ผ๏ธ How Digital Images Work

Before we dive deeper into CNNs, let's understand what computers actually "see" when looking at images. Digital images are made up of tiny squares called pixels (short for "picture elements"). When you zoom in really close to any digital image, you'll see these squares.

Most digital images use the RGB color model, which stands for red (R), green (G), and blue (B). Each pixel contains three values, each from 0 to 255, representing how much red, green, and blue light that pixel contains. A 100 × 100 pixel color image is stored as three 100 × 100 grids: one for red values, one for green, and one for blue. These grids are called channels.
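As a small illustration of channels, the sketch below stores a made-up 2 × 2 image as nested Python lists and splits it into its red, green, and blue grids. The image and the helper name `split_channels` are hypothetical; real code would typically use an image library instead.

```python
# A tiny 2 x 2 color image stored pixel by pixel as (R, G, B) tuples, each value 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],      # red pixel, green pixel
    [(0, 0, 255), (255, 255, 255)],  # blue pixel, white pixel
]

def split_channels(img):
    """Return three grids (red, green, blue), each the same height and width as img."""
    return tuple([[pixel[k] for pixel in row] for row in img] for k in range(3))

red, green, blue = split_channels(image)
print(red)    # [[255, 0], [0, 255]]
print(green)  # [[0, 255], [0, 255]]
print(blue)   # [[0, 0], [255, 255]]

# A 100 x 100 color image therefore holds 100 * 100 * 3 numbers.
values_in_100x100 = 100 * 100 * 3
print(values_in_100x100)  # 30000
```

Note that the white pixel shows up as 255 in all three grids, since white is full red, green, and blue at once.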

Importantly, pixels and color models allow us to convert images into numerical data that a computer can process (a recurring theme in artificial intelligence). When a CNN processes an image, it's working with these numerical RGB values. The convolutional filters scan these grids of numbers, looking for patterns that range from simple edges to complex objects.
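Here is a rough sketch of what "scanning a grid" means for a single channel. The pixel values and the hand-written vertical-edge filter are assumptions for illustration; in a trained CNN the filter values are learned rather than chosen by hand.

```python
def convolve(grid, kernel):
    """Slide kernel over grid (no padding, stride 1) and sum the element-wise products.

    This is the sliding-window operation deep learning libraries call convolution
    (strictly speaking, cross-correlation, since the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(grid) - kh + 1
    out_w = len(grid[0]) - kw + 1
    return [
        [
            sum(grid[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# One channel of a made-up image: dark on the left, bright on the right.
channel = [
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
]

# Hand-written filter that responds where brightness increases from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

response = convolve(channel, vertical_edge)
print(response)  # [[0, 570, 570, 0], [0, 570, 570, 0]]
```

The output is zero over the flat dark and bright regions and large where the window straddles the boundary, so the filter acts as a simple edge detector.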

What about videos? Videos are simply sequences of images (called frames) shown rapidly (typically 24-30 frames per second). When shown quickly enough, our brain perceives smooth motion instead of separate images!
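To get a feel for how much numerical data a video contains, the short sketch below counts the values in a clip. The clip dimensions and the function name `video_value_count` are hypothetical, and the count assumes uncompressed RGB frames.

```python
def video_value_count(width, height, fps, seconds, channels=3):
    """Number of color values in an uncompressed clip: frames x pixels x channels."""
    frames = fps * seconds
    return frames * width * height * channels

# A 10-second clip of 100 x 100 RGB frames at 30 frames per second.
print(video_value_count(100, 100, fps=30, seconds=10))  # 9000000
```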

RGB color space visualization. There are other color spaces too (e.g. HSV, CMYK).

📚 Training Process

Training a CNN is like teaching a student: we show it lots of examples and help it learn from its mistakes. In a previous article, we explored the general training process for deep neural networks. For CNNs, this process similarly follows three main steps:

1. Show Examples: The model is fed thousands of labeled images from curated datasets like MNIST (handwritten digits from 0 to 9), CIFAR-10 (basic objects like cars, birds, and dogs), or ImageNet (millions of real-world images in thousands of categories).

2. Learning from Mistakes: The model makes a guess about what's in each image, and the label tells the model whether that guess was correct or incorrect. Through a process called backpropagation, the CNN adjusts its internal weights to make fewer mistakes next time.

3. Getting Better: The more images a model sees during training, the better it gets at recognizing patterns. We measure progress with different metrics depending on the task. For example, we might use accuracy for image classification (the fraction of images the model identifies correctly) and intersection over union (IoU, which measures how closely the model's predicted object boundaries match the true ones) for image segmentation.
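The two metrics mentioned in step 3 can be sketched in a few lines. The labels and masks below are made up for illustration, and the masks are assumed to be binary grids where 1 marks a pixel that belongs to the object.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks (grids of 0s and 1s)."""
    pairs = [(p, t) for p_row, t_row in zip(pred_mask, true_mask)
             for p, t in zip(p_row, t_row)]
    intersection = sum(1 for p, t in pairs if p and t)   # pixels both masks mark
    union = sum(1 for p, t in pairs if p or t)           # pixels either mask marks
    return intersection / union if union else 1.0        # two empty masks agree fully

# Classification: 3 of 4 made-up predictions are correct.
print(accuracy(["cat", "dog", "dog", "cat"], ["cat", "dog", "cat", "cat"]))  # 0.75

# Segmentation: the masks share 3 pixels and cover 5 pixels in total.
predicted_mask = [[0, 1, 1],
                  [0, 1, 1],
                  [0, 0, 0]]
true_mask = [[0, 0, 1],
             [0, 1, 1],
             [0, 0, 1]]
print(iou(predicted_mask, true_mask))  # 0.6
```

An IoU of 1.0 means the predicted region matches the true region exactly, and 0.0 means the two do not overlap at all.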

After training, CNNs can often recognize objects they've never seen before, much like we do! A person can recognize a car even if it's a model they haven't seen before.
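The loop of showing examples, learning from mistakes, and then handling unseen inputs can be sketched with a far simpler model than a CNN. The example below is an assumption-laden toy: a single neuron that learns to label an image as "bright" (1) or "dark" (0) from one made-up number, its average brightness. It uses plain gradient descent; backpropagation in a real CNN applies the same kind of update across many layers of weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, b, xs, ys):
    """Average cross-entropy loss: large when confident guesses are wrong."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

# Step 1: labeled examples (average brightness from 0 to 1, label 1 = bright).
train_x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
train_y = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0            # the model starts out knowing nothing
learning_rate = 1.0        # assumed value, chosen for this toy problem
loss_before = mean_loss(w, b, train_x, train_y)

# Step 2: guess, compare with the label, and nudge the weights to reduce the error.
for epoch in range(1000):
    grad_w = grad_b = 0.0
    for x, y in zip(train_x, train_y):
        error = sigmoid(w * x + b) - y     # how far the guess is from the label
        grad_w += error * x
        grad_b += error
    w -= learning_rate * grad_w / len(train_x)
    b -= learning_rate * grad_b / len(train_x)

# Step 3: the loss should be lower after training than before.
loss_after = mean_loss(w, b, train_x, train_y)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")

# Generalization: brightness values the model never saw during training.
unseen_x = [0.05, 0.25, 0.75, 0.95]
predictions = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in unseen_x]
print(predictions)
```

If training worked, the two dark unseen examples should be labeled 0 and the two bright ones 1, even though none of those exact values appeared in the training data.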

💡 Applications of CNNs

CNNs are everywhere in our modern world:

  • Face Recognition: Unlocking your phone with your face
  • Medical Imaging: Helping doctors spot diseases in X-rays and MRI scans
  • Self-Driving Cars: Helping vehicles understand their surroundings
  • Social Media: Auto-tagging friends in photos
  • Gaming: Creating realistic graphics and animations

🔮 The Future

CNNs continue to improve and find new applications. Here are a few interesting directions:

  • They are now better at understanding complex scenes with many moving parts
  • They are more efficient and able to run on mobile devices
  • They can be combined with other modalities (e.g., speech, language) for even smarter systems
  • They can help solve important problems in medicine, science, and technology

Want to see a CNN in action? In the coding exercise for this lesson, you'll build your own CNN model 🚀