K-Means Introduction

What is Unsupervised Learning?

Imagine you have a bunch of songs, but they're all mixed up and you don't know what genre they are. Unsupervised learning is like trying to sort these songs into playlists without knowing what the genres should be. You're just looking for patterns and similarities in the music.

Each song is like a data point $x_i$, and all the musical information in the song (like instruments, tempo, lyrics, etc.) makes up its "dimensions" (a point in $d$-dimensional space).

Clustering is a specific type of unsupervised learning where you're trying to group similar songs together.

You're creating a function $f$ that takes a song $x_i$ and puts it into a specific playlist or "cluster" $(C_1, C_2, ..., C_k)$. So, $f(\text{song}) = \text{rock playlist}$ or $f(\text{song}) = \text{pop playlist}$, even though you didn't initially know what those categories were. The algorithm figures out the categories itself based on the music.
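To make this concrete, here's a minimal sketch in Python (using NumPy) of how songs could be represented as data points. The feature names and values are invented purely for illustration; a real system would extract many more dimensions:

```python
import numpy as np

# Each row is one song; each column is one musical feature.
# Columns (hypothetical): tempo (BPM), loudness (dB), acousticness (0-1)
songs = np.array([
    [120.0,  -5.0, 0.10],  # an energetic pop track
    [ 70.0, -12.0, 0.85],  # a quiet acoustic ballad
    [140.0,  -4.0, 0.05],  # a loud rock song
    [ 75.0, -11.0, 0.90],  # another mellow acoustic song
])

# Each song is now a point x_i in d-dimensional space (here d = 3).
print(songs.shape)  # (4, 3): n = 4 songs, d = 3 dimensions
```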

K-Means: The Most Common Clustering Method

K-means, also known as Lloyd's algorithm, is one of the fundamental clustering methods. Its goal is to assign each example $x_1, x_2, ..., x_n$ to one of $k$ clusters $(C_1, C_2, ..., C_k)$.

The Mathematics Behind K-means: K-means tries to minimize an objective function $J$, which is written mathematically as:

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} ||x_i - \mu_j||^2$$

where

  • $\mu_j$ is the mean (center) of all examples in the $j$-th cluster
  • $||x_i - \mu_j||$ is the distance between a point and its cluster center
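As a minimal sketch, here's how $J$ could be computed with NumPy, assuming the points, cluster labels, and centers are stored as arrays (this layout is an assumption for illustration, reused in the sketches below):

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Sum of squared distances from each point to its assigned center (J)."""
    # X: (n, d) data points; labels: (n,) cluster index for each point;
    # centers: (k, d) cluster means mu_1, ..., mu_k.
    diffs = X - centers[labels]   # x_i - mu_j for each point's own cluster
    return np.sum(diffs ** 2)     # sum over all ||x_i - mu_j||^2
```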

The K-Means Algorithm: Walkthrough

Let's break down how K-means works step by step, using our song genre example:

Step 1: Initialization

You tell the computer you want $K$ playlists. Let's say $K = 3$. The computer picks three random points and designates them as the initial "center songs" for each playlist. These center songs will act as the initial representatives of each genre.
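One common way to do this in code is to pick $K$ random songs from the dataset as the starting centers (a variant known as Forgy initialization). A sketch, assuming the `songs` array from earlier and using $k = 2$ only because the toy dataset is tiny:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily

def init_centers(X, k):
    """Pick k distinct random data points as the initial "center songs"."""
    indices = rng.choice(len(X), size=k, replace=False)
    return X[indices].copy()

centers = init_centers(songs, k=2)
```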

Step 2: Assignment

Each remaining song is assigned to the "playlist" of the closest "center song." "Closeness" here isn't physical distance, but rather how similar the songs are musically. This similarity is calculated based on the song's features (instruments, tempo, lyrics, etc.). The computer uses a distance formula to quantify this similarity – the smaller the distance, the more similar the songs.
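A sketch of this assignment step, assuming plain Euclidean distance on the feature vectors (other distance measures are possible):

```python
import numpy as np

def assign_clusters(X, centers):
    """Assign each song to the playlist of its nearest "center song"."""
    # Squared Euclidean distance from every song to every center,
    # computed via broadcasting: the result has shape (n, k).
    dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.argmin(dists, axis=1)  # index of the closest center per song

labels = assign_clusters(songs, centers)
```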

Step 3: Calculate

For each of the three "playlists", the computer calculates a new "center song". This new center song isn't an actual song from the dataset, but rather a kind of average of all the songs in the playlist. It represents the average musical characteristics of that genre. The formula to calculate this "average" in multi-dimensional space is:

$$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$

where

  • $\mu_j$ is the new "center song" (representing the average musical features of playlist $j$)
  • $C_j$ is the $j$-th playlist, and $|C_j|$ is the number of songs in it
  • $x_i$ is the $i$-th song within the $j$-th playlist
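A sketch of this calculation, reusing the array layout (and `rng`) from the earlier snippets; the handling of an empty playlist is one simple convention, not something the formula itself specifies:

```python
import numpy as np

def update_centers(X, labels, k):
    """Recompute each center as the mean of the songs assigned to it."""
    centers = np.empty((k, X.shape[1]))
    for j in range(k):
        members = X[labels == j]                # all songs in playlist j
        if len(members) > 0:
            centers[j] = members.mean(axis=0)   # mu_j = (1/|C_j|) * sum x_i
        else:
            # No songs landed in this playlist; re-seed with a random song.
            centers[j] = X[rng.integers(len(X))]
    return centers
```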

Step 4: Update

The newly calculated "center songs" replace the old "center songs". These new, average songs now become the representatives of each playlist.

Step 5: Repeat

Steps 2, 3, and 4 are repeated. Songs might switch playlists if the "center songs" have moved (meaning the average characteristics of a genre have changed).

This process continues until the playlists stabilize, meaning songs mostly stop switching playlists. At this point, the algorithm has converged, and you have your KK playlists.
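Putting the step sketches above together gives a compact version of the whole loop (again just an illustrative sketch, with `max_iters` as a safety cap so the loop always terminates):

```python
import numpy as np

def kmeans(X, k, max_iters=100):
    """Lloyd's algorithm, assembled from the step sketches above."""
    centers = init_centers(X, k)                    # Step 1: initialization
    for _ in range(max_iters):
        labels = assign_clusters(X, centers)        # Step 2: assignment
        new_centers = update_centers(X, labels, k)  # Step 3: calculate
        if np.allclose(new_centers, centers):       # Step 5: stable -> stop
            break
        centers = new_centers                       # Step 4: update
    return labels, centers

labels, centers = kmeans(songs, k=2)
```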

How to Evaluate K-means Clusters

A challenge in unsupervised learning is evaluating results without ground-truth labels. The Davies-Bouldin index is one way to measure how well a clustering algorithm performs.

$$\text{DB} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \left[ \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right]$$

where

  • $K$ is the number of clusters
  • $c_i$ is the center of cluster $i$
  • $\sigma_i$ is the average distance of all points in cluster $i$ to $c_i$
  • $d(c_i, c_j)$ is the distance between centers $c_i$ and $c_j$

A lower Davies-Bouldin index indicates better clustering: the clusters are more widely separated, and the points within each cluster sit closer together.
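Here's a direct NumPy translation of the formula, assuming the same `X`, `labels`, and `centers` arrays as in the sketches above (scikit-learn also ships a ready-made `davies_bouldin_score(X, labels)` in `sklearn.metrics`):

```python
import numpy as np

def davies_bouldin(X, labels, centers):
    """Davies-Bouldin index, following the formula above."""
    k = len(centers)
    # sigma_i: average distance from the points in cluster i to center c_i
    sigma = np.array([
        np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
        for i in range(k)
    ])
    worst_ratios = []
    for i in range(k):
        ratios = [(sigma[i] + sigma[j]) / np.linalg.norm(centers[i] - centers[j])
                  for j in range(k) if j != i]
        worst_ratios.append(max(ratios))  # max over all j != i
    return np.mean(worst_ratios)          # average over the K clusters
```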

K-means Advantages and Disadvantages:

Advantages:

  • Easy to implement
  • Straightforward to understand

Disadvantages:

  • Must specify $K$ beforehand
  • Time complexity is $O(Kndt)$, where $K$ is the number of clusters, $n$ is the number of points, $d$ is the number of dimensions, and $t$ is the number of iterations.
  • Suffers from the curse of dimensionality: the more dimensions a dataset has, the more places there are for patterns to hide. It's like finding a marble in a tiny box versus a giant room with marbles floating in the air!
  • Doesn't guarantee an optimal clustering. Running it a few more times often helps (see the sketch after this list)!
  • The choice of initial centers affects the results
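Since the result depends on the random initial centers, a common remedy (sketched below, reusing the helper functions from the earlier snippets) is to run the algorithm several times and keep the clustering with the lowest objective $J$:

```python
import numpy as np

def kmeans_with_restarts(X, k, n_restarts=10):
    """Run k-means repeatedly and keep the run with the smallest J."""
    best_labels, best_centers, best_J = None, None, np.inf
    for _ in range(n_restarts):
        labels, centers = kmeans(X, k)            # fresh random start each run
        J = kmeans_objective(X, labels, centers)  # objective from earlier
        if J < best_J:
            best_labels, best_centers, best_J = labels, centers, J
    return best_labels, best_centers
```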