Aiphabet

The Idea Behind Nearest Neighbors Selection

How Should We Classify Fish?

Let us begin with a simple exercise.

Imagine a private Alaskan fishing company has hired you to classify the daily catches from its fishermen. Currently, the fishermen themselves classify the fish at the end of the day. Recently, however, they have complained that this season's unusually large catches often keep them classifying fish after regular business hours. The company wants to honor work-life balance, so it has decided to automate this labeling system. The work is essential, as the classification determines whether a fish gets shipped to Whole Foods's bass supplier or packed for Trader Joe's salmon supplier.

The latest technology available in the fish industry can scan any fish for its length and width. You scanned all 60 fish (30 salmon and 30 bass) sent by the company and obtained the following distribution:

[Figure: length and width of the 60 scanned fish (30 salmon, 30 bass)]

Now, you obtain a new fish of an unknown class. You scan it to obtain a width of 1.5 and a length of 1.2.

[Figure: the unknown fish (width 1.5, length 1.2) shown against the scanned fish]

Question:

Without any calculations or online research, think to yourself: intuitively, which class (bass or salmon) would you assign this unknown fish to? Take a moment to consider what your intuition tells you, and why.

Answer:

Visually, we can see that the fish with similar lengths and widths were often found to be salmon. However, there are a few exceptions where some outlier bass have lengths or widths that rival those of salmon. This makes the question non-trivial, as both answers are plausible given the information provided. We can't know for sure!

You might have thought to yourself, however, that the unknown fish was indeed a salmon. This is likely because a majority of the fish with similar lengths and widths were classified as salmon. In fact, given this training data, the unknown fish is much more likely to be a salmon than a bass. This line of reasoning is the basis of a common machine learning algorithm called k-nearest neighbors, which you will now learn about!

Defining Nearest Neighbors

Nearest Neighbors (NN), a popular machine learning algorithm, assumes that the classification label of an unknown item should be that of its closest example(s). In the previous section, we found that the unknown fish should be classified as a salmon because its closest examples (the fish with the closest lengths and widths) were also salmon.
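To make the definition concrete, here is a minimal sketch of the single-nearest-neighbor rule in Python. The measurements in training_fish are invented for illustration (they are not the company's data), the function name nearest_neighbor_label is hypothetical, and straight-line (Euclidean) distance is an assumed way to measure "closest".

```python
import math

# Hypothetical (width, length) measurements, invented for illustration only.
training_fish = [
    ((1.4, 1.2), "salmon"),
    ((1.6, 1.3), "salmon"),
    ((1.7, 1.0), "salmon"),
    ((0.6, 0.5), "bass"),
    ((0.8, 0.7), "bass"),
    ((0.5, 0.9), "bass"),
]


def nearest_neighbor_label(query, examples):
    """Return the label of the example closest to `query`.

    Closeness is measured with Euclidean distance, which is an
    assumption of this sketch rather than part of the definition.
    """
    _, closest_label = min(
        examples, key=lambda example: math.dist(query, example[0])
    )
    return closest_label


unknown_fish = (1.5, 1.2)  # width 1.5, length 1.2, as in the exercise
print(nearest_neighbor_label(unknown_fish, training_fish))  # salmon
```

With this made-up data, the closest example to the unknown fish is a salmon, so the sketch labels the unknown fish as a salmon.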

NN will not always yield correct results. The assumption that an unknown item should have the same label as its closest example is risky, and we will discuss the drawbacks of this design choice in the next section. In many problems, however, the closest examples do share the unknown item's label, so many AI professionals accept this assumption.
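The risk can be illustrated with a small Python sketch. The data below is invented for illustration and includes one outlier bass with salmon-like measurements. The function knn_label is a hypothetical helper, Euclidean distance is an assumed measure of closeness, and the majority vote over k examples is one common way to look at more than a single closest example.

```python
import math
from collections import Counter

# Hypothetical (width, length) data, invented for illustration only.
training_fish = [
    ((1.4, 1.2), "salmon"),
    ((1.6, 1.3), "salmon"),
    ((1.7, 1.0), "salmon"),
    ((0.6, 0.5), "bass"),
    ((0.8, 0.7), "bass"),
    ((1.5, 1.25), "bass"),  # outlier bass sitting among the salmon
]


def knn_label(query, examples, k):
    """Return the majority label among the k examples closest to `query`."""
    by_distance = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


unknown_fish = (1.5, 1.2)
print(knn_label(unknown_fish, training_fish, k=1))  # bass
print(knn_label(unknown_fish, training_fish, k=3))  # salmon
```

In this made-up example, the single closest fish is the outlier bass, so relying on one neighbor gives "bass", while a vote among the three closest fish gives "salmon". Neither answer is guaranteed to be right; the sketch only shows how much the result can depend on one unusual example.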