Imagine you have a bunch of songs, but they're all mixed up and you don't know what genre they are. Unsupervised learning is like trying to sort these songs into playlists without knowing what the genres should be. You're just looking for patterns and similarities in the music.
Each song is like a data point , and all the musical information in the song (like instruments, tempo, lyrics, etc.) makes up its "dimensions" (d-dimensional space).
Clustering is a specific type of unsupervised learning where you're trying to group similar songs together.
You're creating a function that takes a song and puts it into a specific playlist or "cluster" . So, or , even though you didn't initially know what those categories were. The algorithm figures out the categories itself based on the music.
K-means, also known as Lloyd's algorithm, is one of the fundamental clustering methods. Its goal is to assign each example to one of clusters .
The Mathematics Behind K-means: K-means tries to minimize an objective function , which is written mathematically as:
where
Let's break down how K-means works step by step, using our song genre example:
Step 1: Initialization
You tell the computer you want K playlists. Let's say . The computer randomly picks three randomly generated points and designates them as the initial "center songs" for each playlist. These center songs will act as the initial representatives of each genre.
Step 2: Assignment
Each remaining song is assigned to the "playlist" of the closest "center song." "Closeness" here isn't physical distance, but rather how similar the songs are musically. This similarity is calculated based on the song's features (instruments, tempo, lyrics, etc.). The computer uses a distance formula to quantify this similarity – the smaller the distance, the more similar the songs.
Step 3: Calculate
For each of the three "playlists" the computer calculates a new "center song". This new center song isn't an actual song from the dataset, but rather a kind of average of all the songs in the playlist. It represents the average musical characteristics of that genre. The formula to calculate this "average" in multi-dimensional space:
where
Step 4: Update
The newly calculated "center songs" replace the old "center songs". These new, average songs now become the representatives of each playlist.
Step 5: Repeat
Steps 2, 3, and 4 are repeated. Songs might switch playlists if the "center songs" have moved (meaning the average characteristics of a genre have changed).
This process continues until the playlists stabilize, meaning songs mostly stop switching playlists. At this point, the algorithm has converged, and you have your playlists.
A challenge in unsupervised learning is evaluating results without having the labels. The Davies-Bouldin index is one way to measure how well a clustering algorithm performs.
where
A lower Davies-Bouldin index indicates better clustering, meaning a wider separation between different clusters and that points within each cluster are closer together.
Advantages:
Disadvantages: