How can computers recognize objects in images? How can they tell the difference between a cat and a dog, or know if someone is smiling in a photo? The answer lies in computer vision and convolutional neural networks (CNNs).
A CNN is a neural network architecture that specializes in understanding images and other grid-like data. Just as our eyes and brain work together to recognize objects, CNNs process images layer by layer to identify patterns, shapes, and eventually entire objects.
When humans look at a picture of a dog, we don't just see random pixels. We notice the ears, the tail, the nose, and put them all together to recognize it's a dog. CNNs work similarly, starting with simple patterns and building up to complex features. We introduced this concept in a previous article on deep learning, and we also gave an overview of CNN architecture in our article on deep learning in practice. In this article, we'll explore CNNs in more detail, focusing on image processing and key layer types (e.g. convolutional, pooling).
Image processing and recognition with artificial intelligence have come a long way over time. Below, we divide the history of computer vision into four main eras. As a reminder, we have a separate lesson on the broader history of artificial intelligence here.
1. Early Days (1950s-1980s): In the early years, computers could only detect very simple edges and shapes. Researchers built simple rule-based systems in which computers followed fixed rules like "if you detect a corner, turn left." While very basic, these systems laid the groundwork.
2. Neural Network Era (1980s-2000s): Scientists started using neural networks, machine learning models that can learn from data, for image recognition. At first, these models weren't very accurate and needed a lot of hand-crafted features.
3. CNN Revolution (2012): A CNN called AlexNet shocked the world by dramatically improving image recognition accuracy. This sparked the modern era of computer vision.
4. Deep Learning Boom (2012-Present): With more data and compute, CNNs have become very powerful. They can now recognize objects and faces, and even help self-driving cars understand their surroundings with great accuracy. We have also seen the rise of generative image models (e.g. generative adversarial networks, Stable Diffusion) that can generate full-scale images and artwork.
Before we dive deeper into CNNs, let's understand what computers actually "see" when looking at images. Digital images are made up of tiny squares called pixels (short for "picture elements"). When you zoom in really close to any digital image, you'll see these squares.
Most digital images use the RGB color model, which stands for red (R), green (G), and blue (B). Each pixel contains three values representing how much red, green, and blue it contains. A color image is therefore stored as three grids of numbers: one for red values, one for green, and one for blue. These grids are called channels.
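Here is a minimal sketch of this idea, assuming NumPy is available; the pixel values are made up purely for illustration:

```python
import numpy as np

# A tiny 2x2 color image: height x width x 3 channels (RGB).
# Each value ranges from 0 (none of that color) to 255 (full intensity).
image = np.array([
    [[255,   0,   0], [  0, 255,   0]],   # top row: a red pixel, a green pixel
    [[  0,   0, 255], [255, 255, 255]],   # bottom row: a blue pixel, a white pixel
], dtype=np.uint8)

print(image.shape)              # (2, 2, 3): 2 rows, 2 columns, 3 channels
red_channel = image[:, :, 0]    # the grid of red values
green_channel = image[:, :, 1]  # the grid of green values
blue_channel = image[:, :, 2]   # the grid of blue values
print(red_channel)
```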
Importantly, pixels and color models allow us to convert images into numerical data that a computer can process (a recurring theme in artificial intelligence). When a CNN processes an image, it's working with these numerical RGB values. The convolutional filters scan these grids of numbers, looking for patterns ranging from simple edges to complex objects.
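To make the "scanning" concrete, below is a hand-rolled sketch of a single convolution using NumPy. The 4x4 grayscale grid and the kernel values are made up for illustration; a real CNN learns its kernel values during training rather than using a fixed edge detector:

```python
import numpy as np

def convolve2d(grid, kernel):
    """Slide a small kernel over a 2D grid and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = grid.shape[0] - kh + 1
    out_w = grid.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(grid[i:i + kh, j:j + kw] * kernel)
    return output

# A grayscale grid with a bright region on the left and a dark region on the right.
grid = np.array([
    [255, 255, 0, 0],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
], dtype=float)

# A simple vertical edge-detection kernel: responds where brightness changes left to right.
kernel = np.array([
    [1, -1],
    [1, -1],
], dtype=float)

print(convolve2d(grid, kernel))  # large values mark the vertical edge in the middle
```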
What about videos? Videos are simply sequences of images (called frames) shown rapidly (typically 24-30 frames per second). When shown quickly enough, our brain perceives smooth motion instead of separate images!
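You can see this frame-by-frame structure directly with a short sketch, assuming OpenCV is installed (the video filename below is hypothetical):

```python
import cv2  # OpenCV (pip install opencv-python)

# "my_video.mp4" is a hypothetical file path used only for illustration.
capture = cv2.VideoCapture("my_video.mp4")
fps = capture.get(cv2.CAP_PROP_FPS)   # frames per second, often 24-30
print(f"Frames per second: {fps}")

frame_count = 0
while True:
    success, frame = capture.read()   # each frame is just an image (a NumPy array)
    if not success:
        break                         # no more frames
    frame_count += 1                  # a CNN could process each frame like a still image

capture.release()
print(f"Total frames read: {frame_count}")
```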
Training a CNN is like teaching a student: we show it lots of examples and help it learn from its mistakes. In a previous article, we explored the general training process for deep neural networks. For CNNs, the process follows the same three main steps (sketched in code after the list below):
1. Show Examples: The model is fed thousands of labeled images from curated datasets like MNIST (handwritten digits from 0 to 9), CIFAR-10 (basic objects like cars, birds, and dogs), or ImageNet (millions of real-world images in thousands of categories).
2. Learning from Mistakes: The model makes a guess about what's in each image, and the labels tell the model whether that guess was correct. Through a process called backpropagation, the CNN adjusts itself to make fewer mistakes next time.
3. Getting Better: The more images a model sees during training, the better it gets at recognizing patterns. We measure progress with different metrics depending on the task. For example, we might use accuracy for image classification (how many images the model identifies correctly) and intersection over union (IoU) for image segmentation (how well the model identifies the exact boundaries of objects in images).
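Here is a rough sketch of these three steps in PyTorch (our choice of framework here is an assumption; the lesson's coding exercise may use a different one). It uses random tensors as stand-in "images" so it runs on its own; in practice you would load a labeled dataset such as MNIST:

```python
import torch
import torch.nn as nn

# A small CNN: convolution + pooling layers to find patterns, then a classifier head.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # learnable filters scan the image
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling shrinks the grid
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = TinyCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Step 1 (show examples): random 28x28 stand-ins for labeled images (e.g. MNIST digits).
images = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))

for epoch in range(5):
    # Step 2 (learning from mistakes): guess, compare against labels, backpropagate.
    logits = model(images)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Step 3 (getting better): track a metric, here classification accuracy on this batch.
    accuracy = (logits.argmax(dim=1) == labels).float().mean()
    print(f"epoch {epoch}: loss={loss.item():.3f}, accuracy={accuracy.item():.2f}")
```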
After training, CNNs can often recognize objects they've never seen before, much like humans can recognize a car even if it's a model they've never encountered.
CNNs are everywhere in our modern world:
CNNs continue to improve and find new applications. Here are a few interesting directions:
Want to see a CNN in action? In the coding exercise for this lesson, you'll build your own CNN model!