In the past few sections, we've neglected the magnitudes of our measurements. In real-world applications, differences in magnitude can heavily degrade the accuracy of k-NN classifiers. To design well-functioning k-NN models, we need to understand how measurement magnitudes affect k-NN and how to avoid their adverse effects.
Let's go back to the fish classification example. Imagine that our fish business has expanded heavily since we deployed our fish classification system. We're now reeling in more fish and, as a result, need to install more scanners to sustain the same throughput. However, the company we previously bought our scanners from has begun investing in technology that replaces the width measurement with weight. We therefore have the following two dimensions for classifying a fish: length and weight.
We rescan the 60 previous training fish to obtain the following training data:
Now, imagine we obtain a new unknown fish of weight 9 lb and length 1 in:
Using k-NN, where $k = 3$, what would you expect the closest fish to be? What final label do you think the fish will be assigned?
Because the three closest fish all turned out to be bass, we assigned the unknown fish the label of a bass. However, does this seem intuitive to you? Do you have any objections to how we classified our unknown fish?
The distance calculations performed exactly as our k-NN design intended. However, it might feel intuitive to you that most unknown fish below a length of about 1.5 in should be classified as salmon (as long as the weight is reasonable). After all, every fish in our training data with a length less than or equal to our unknown fish's is labeled salmon. Shouldn't at least one of the closest fish be salmon?
This occurred because of the magnitude differences between our measurements. In our distance calculation, we compare weight and length directly against each other, without considering that the much larger magnitude of the weight measurements lets weight dominate the distances in our k-NN model.
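To see this concretely, here is a minimal sketch with made-up measurements (the actual training values aren't reproduced here). A bass that roughly matches our unknown fish in weight but not in length still ends up closer than a salmon that matches in length but not in weight:

```python
import math

# Hypothetical measurements: (weight in lb, length in in).
unknown = (9.0, 1.0)
bass    = (9.5, 3.0)   # similar weight, very different length
salmon  = (5.0, 1.1)   # very different weight, similar length

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The 4 lb weight gap swamps the 0.1 in length match:
print(euclidean(unknown, bass))    # ~2.06 -> the bass is "closer"
print(euclidean(unknown, salmon))  # ~4.00
```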
How do you suppose we might be able to take these magnitudes into account when designing k-NNs with different measurement types?
A good start toward designing better k-NN models would be to scale down the weight measurements or to scale up the length measurements. This would let us compare weight and length more fairly, so that our system is not biased toward measurements with large or small magnitudes.
In application, we standardize our data. This is the technique of pre-processing our data to account for factors like the mean and standard deviation. The common way to standardize data is to assume the training data is randomly drawn from a normal (Gaussian) distribution, which allows us to use the following equation:

$$x' = \frac{x - \mu}{\sigma}$$

where $x'$ is our post-processed data point, $x$ is our pre-processed data point, $\mu$ is the mean measurement of the training data, and $\sigma$ is the standard deviation of the training data.
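As a minimal sketch of this equation, assuming the training data is stored as a NumPy array with one fish per row and one measurement type per column:

```python
import numpy as np

def standardize(train, query):
    """Apply x' = (x - mu) / sigma column-wise, where mu and sigma are
    the mean and standard deviation of the *training* data."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    # The query point is scaled with the training statistics (not its
    # own) so that it stays directly comparable to the training data.
    return (train - mu) / sigma, (query - mu) / sigma
```

Note that the unknown fish is standardized using the training data's mean and standard deviation, not its own, so that both end up on the same scale.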
After standardizing the training data and the unknown fish data, we obtain the following:
And after running k-NN with $k = 3$ on our newly standardized data, we obtain:
Wow! The results have completely shifted: now that the data is standardized, the three closest fish are actually salmon! Therefore, we should assign our unknown fish the label of salmon.
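Putting the pieces together, here is a sketch of the before-and-after comparison. The six training fish below are made-up stand-ins for our real training set, chosen only to mirror the scenario above:

```python
import numpy as np

# Hypothetical training set: columns are (weight in lb, length in in).
X = np.array([[10.0, 3.0], [9.4, 3.3], [8.8, 3.1],    # bass
              [5.0, 0.9], [5.2, 1.2], [5.4, 1.1]])    # salmon
y = np.array(["bass"] * 3 + ["salmon"] * 3)
query = np.array([9.0, 1.0])                          # the unknown fish

def three_nearest_labels(train, q):
    dists = np.sqrt(((train - q) ** 2).sum(axis=1))
    return y[np.argsort(dists)[:3]]

# Raw scale: the large weight differences dominate the distances.
print(three_nearest_labels(X, query))   # ['bass' 'bass' 'bass']

# Standardize with the training mean and standard deviation, then rerun.
mu, sigma = X.mean(axis=0), X.std(axis=0)
print(three_nearest_labels((X - mu) / sigma,
                           (query - mu) / sigma))  # ['salmon' 'salmon' 'salmon']
```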
Data standardization can heavily influence a k-NN model. In practice, AI specialists are very cognizant of this effect and routinely use this technique (and occasionally other, more specialized pre-processing methods) to keep measurement magnitude biases out of their results.
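For instance, in a scikit-learn workflow (reusing the same made-up data as the previous sketch), the scaler and classifier are typically chained into a single pipeline, so the scaling statistics are learned from the training data during fitting and reapplied automatically at prediction time:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same hypothetical six-fish training set as the previous sketch.
X = np.array([[10.0, 3.0], [9.4, 3.3], [8.8, 3.1],
              [5.0, 0.9], [5.2, 1.2], [5.4, 1.1]])
y = ["bass", "bass", "bass", "salmon", "salmon", "salmon"]

# StandardScaler learns mu and sigma from X during fit() and applies
# the same transform to any new point at predict() time.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[9.0, 1.0]]))  # -> ['salmon']
```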
Note that data preprocessing occurs for almost all machine learning models (sometimes in other forms). A similar bias effect can often be seen in other models when data preprocessing is skipped.