The Basics of the K-Nearest Neighbors (KNN) Algorithm
By Mbali Kalirane
The K-Nearest Neighbors (KNN) algorithm stands out from other supervised learning algorithms due to its instance-based approach to machine learning tasks. In this article, you’ll learn the basics behind KNN, understand how it works, and you’ll see an example of how KNN is calculated.
Table of Contents
- What is KNN?
- How does KNN Work?
- How is Distance Calculated in KNN?
- Example of Calculating KNN between Points
- How to Choose the Value of K?
- Conclusion
- Sources
What is K-Nearest Neighbors (KNN)?
KNN is a machine learning algorithm used for both classification and regression tasks. Unlike many traditional supervised learning models, such as linear regression, logistic regression, and decision trees, which learn an explicit model during training, KNN is an instance-based learning algorithm. In other words, it stores the entire training dataset and makes predictions based on the similarity of new instances to existing instances in the training data. The ‘k nearest neighbors’ are the existing data points that are most similar to the new data point, and ‘k’ is the number of neighbors considered. For example, the 3 nearest neighbors are the 3 data points closest to the new data point.
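As a quick illustration of this instance-based behavior, here is a minimal sketch using scikit-learn (assumed installed); the toy ages, incomes, and labels below are made up purely for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: each row is [age, income]; the labels are made up
X_train = [[25, 50000], [30, 70000], [35, 90000], [40, 60000]]
y_train = ["Yes", "No", "Yes", "No"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)          # no explicit model is learned; the data is stored
print(knn.predict([[35, 80000]]))  # the stored neighbors are consulted at prediction time
```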
How Does KNN Work?
The core idea behind KNN is straightforward: similar data points are likely to belong to the same class or have similar output values. Here’s a breakdown of how KNN works:
- **KNN Stores Training Data**: The KNN algorithm begins by storing all available data points and their corresponding labels.
- **KNN Calculates Distances Between Points**: When KNN encounters a new point that isn’t in the dataset, the algorithm calculates the distance between this point and every other point in the dataset. To calculate this distance, KNN typically uses a common distance metric such as Euclidean distance, Manhattan distance, or Minkowski distance.
- **KNN Finds the Nearest Neighbors**: After calculating distances, KNN identifies the k nearest neighbors to the new data point, that is, the points closest to it based on the calculated distances.
- **Majority Vote (Classification) / Average (Regression)**: How the k nearest neighbors determine the prediction depends on the type of task being solved. For a classification task, the new data point is assigned to the class that is most common among its k nearest neighbors (majority voting). For a regression task, the prediction is the average of the output values of the k nearest neighbors.
- **Make Predictions**: Finally, KNN assigns the predicted class label (for classification) or numerical value (for regression) to the new data point.
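To make these steps concrete, here is a minimal from-scratch sketch in plain Python. The function names (`euclidean`, `knn_predict`) are illustrative and not from any library:

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by a majority vote among its k nearest training points."""
    # Steps 1-2: the training data is simply stored; compute the distance
    # from the query point to every stored point
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3: keep the k nearest neighbors
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: a majority vote among the neighbors gives the prediction
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```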
How is Distance Calculated Between Points in KNN?
To calculate the distance between a new point and the existing points, KNN can use a variety of metrics. The choice of distance metric depends on the nature of the data and the problem at hand. The most commonly used distance metrics include:
- **Euclidean Distance**: The Euclidean distance between two points $( P(x_1, y_1) )$ and $( Q(x_2, y_2) )$ is given by:

  \[d(P, Q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}\]

- **Manhattan Distance**: Also known as city-block or L1 distance, the Manhattan distance between two points $( P(x_1, y_1) )$ and $( Q(x_2, y_2) )$ is given by:

  \[d(P, Q) = |x_2 - x_1| + |y_2 - y_1|\]

  This distance is calculated as the sum of the absolute differences of their coordinates.

- **Minkowski Distance**: This is a generalized distance metric that covers both the Euclidean and Manhattan distances. For two points $( P )$ and $( Q )$ in an n-dimensional space:

  \[d(P, Q) = \left( \sum_{i=1}^{n} |x_{2i} - x_{1i}|^p \right)^{\frac{1}{p}}\]

  where $( p )$ is a parameter. When $( p = 2 )$, it reduces to the Euclidean distance, and when $( p = 1 )$, it becomes the Manhattan distance.
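To see how these metrics relate, here is a small Python sketch of the Minkowski distance; setting $( p = 2 )$ or $( p = 1 )$ recovers the Euclidean and Manhattan distances (the function name is illustrative):

```python
def minkowski(point_p, point_q, p=2):
    """Minkowski distance; p=2 gives Euclidean, p=1 gives Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(point_p, point_q)) ** (1 / p)

print(minkowski((1, 2), (4, 6), p=2))  # Euclidean distance: 5.0
print(minkowski((1, 2), (4, 6), p=1))  # Manhattan distance: 7.0
```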
An Example of How to Calculate KNN Distances
Suppose we have a dataset with two features, such as age and income, and we want KNN to predict whether a person is likely to purchase a product based on these features.
Consider the following dataset with five data points:
\[\begin{array}{|c|c|c|c|} \hline \text{Data Point} & \text{Age (years)} & \text{Income (\$)} & \text{Purchased} \\ \hline A & 25 & 50000 & \text{Yes} \\ B & 30 & 70000 & \text{No} \\ C & 35 & 90000 & \text{Yes} \\ D & 40 & 60000 & \text{No} \\ E & 45 & 80000 & \text{Yes} \\ \hline \end{array}\]

1. Calculating the Distances
Assume we want to classify a new data point with age 35 and income USD 80,000. Let’s calculate the distances between this new point and the existing points in the dataset. We’ll use Euclidean distance as our distance metric.
Calculating the Euclidean distance between the new data point and each existing data point:
- **Distance to Data Point A (25, 50000):**

  \[\begin{align} & d(\text{New Point}, \text{A}) = \sqrt{(35 - 25)^2 + (80000 - 50000)^2} \\ & = \sqrt{10^2 + 30000^2} \approx 30000.00 \\ \end{align}\]

- **Distance to Data Point B (30, 70000):**

  \[\begin{align} & d(\text{New Point}, \text{B}) = \sqrt{(35 - 30)^2 + (80000 - 70000)^2} \\ & = \sqrt{5^2 + 10000^2} \approx 10000.00 \\ \end{align}\]

- **Distance to Data Point C (35, 90000):**

  \[\begin{align} & d(\text{New Point}, \text{C}) = \sqrt{(35 - 35)^2 + (80000 - 90000)^2} \\ & = \sqrt{0^2 + (-10000)^2} = 10000 \\ \end{align}\]

- **Distance to Data Point D (40, 60000):**

  \[\begin{align} & d(\text{New Point}, \text{D}) = \sqrt{(35 - 40)^2 + (80000 - 60000)^2} \\ & = \sqrt{(-5)^2 + 20000^2} \approx 20000.00 \\ \end{align}\]

- **Distance to Data Point E (45, 80000):**

  \[\begin{align} & d(\text{New Point}, \text{E}) = \sqrt{(35 - 45)^2 + (80000 - 80000)^2} \\ & = \sqrt{(-10)^2 + 0^2} = 10 \\ \end{align}\]
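These values can be checked with a few lines of Python using the standard library’s `math.dist` (Python 3.8+). Note that because income is on a much larger scale than age, the distances are dominated almost entirely by the income difference; in practice, features are usually scaled before applying KNN:

```python
import math

new_point = (35, 80000)
points = {"A": (25, 50000), "B": (30, 70000), "C": (35, 90000),
          "D": (40, 60000), "E": (45, 80000)}

for name, point in points.items():
    # math.dist computes the Euclidean distance between two points
    print(name, round(math.dist(new_point, point), 2))
# A 30000.0, B 10000.0, C 10000.0, D 20000.0, E 10.0
```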
2. Finding the Nearest Neighbors
Let’s assume that we’ve chosen $( k = 3 )$ as our number of neighbors for this example, and select the three nearest neighbors based on the calculated distances. From the calculations above, the three nearest neighbors to the new data point are: Data Point E, Data Point C, and Data Point B.
3. Predicting the Class
For classification, we perform a majority vote among the nearest neighbors to determine the class label of the new data point. Among the three nearest neighbors, two (C and E) have the label “Yes” and one (B) has the label “No”. Therefore, the predicted label of the new data point is “Yes”.
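Continuing the Python sketch from above, the neighbor selection and majority vote for this example can be written as:

```python
import math
from collections import Counter

new_point = (35, 80000)
data = {"A": ((25, 50000), "Yes"), "B": ((30, 70000), "No"),
        "C": ((35, 90000), "Yes"), "D": ((40, 60000), "No"),
        "E": ((45, 80000), "Yes")}

# Rank the points by Euclidean distance to the new point and keep the k = 3 nearest
nearest = sorted(data.items(), key=lambda item: math.dist(new_point, item[1][0]))[:3]
print([name for name, _ in nearest])                # ['E', 'C', 'B']
print(Counter(label for _, (_, label) in nearest))  # Counter({'Yes': 2, 'No': 1}) -> predict "Yes"
```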
How to Choose the Value of K in KNN?
So how do you choose the value of $( k )$, the number of neighbors? Several factors come into play when choosing $( k )$ for KNN, and it’s important to choose it appropriately, as it directly affects the performance of the algorithm. A value of $( k )$ that’s too small can lead to overfitting, while a value that’s too large can result in underfitting.
To ensure you choose the most appropriate value of k, here are some common methods you can use:
1. Grid Search
Grid search is a method to determine the optimal value of $( k )$. In this approach, a predefined range of values for $( k )$ is specified, and the model is trained and evaluated for each value within this range using a validation set or cross-validation. The value of $( k )$ that results in the best performance metric (such as accuracy, F1 score, or mean squared error) on the validation set is selected as the optimal $( k )$.
Grid search can be easily applied in Python, for example with scikit-learn’s GridSearchCV. Here’s a sketch of how it might be implemented:
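The example below assumes scikit-learn is installed and uses its built-in iris dataset as a stand-in for your own labelled data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # any labelled dataset will do

# Candidate values of k, evaluated with 5-fold cross-validation
param_grid = {"n_neighbors": list(range(1, 21))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

print(grid.best_params_)  # the value of k with the best cross-validated accuracy
print(grid.best_score_)   # the corresponding accuracy
```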
2. Rule of Thumb
A common rule of thumb is to choose $( k )$ as the square root of the total number of data points in the dataset. However, this rule may not always give the best results, so it should be used with care.
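As a quick sketch of this heuristic (with an optional adjustment to keep $( k )$ odd so that classification votes cannot tie):

```python
import math

n_samples = 150                       # size of the training set (illustrative)
k = int(round(math.sqrt(n_samples)))  # rule-of-thumb starting point
if k % 2 == 0:
    k += 1                            # an odd k avoids ties in binary classification
print(k)                              # 13 for n_samples = 150
```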
Conclusion
Unlike other supervised learning models, which learn an explicit model during training, KNN makes predictions based on the similarity of new instances to existing instances in the training data. One of KNN’s strengths is its ability to adapt to complex decision boundaries and handle nonlinear relationships in data. However, KNN’s performance is sensitive to the number of neighbors (k) as well as the distance metric used. Also, KNN’s reliance on the entire training dataset makes it memory-intensive and therefore unsuitable for large datasets. Despite these limitations, KNN is useful for small to medium-sized datasets where interpretability and flexibility are prioritized over computational efficiency.
Sources
- IBM: What is the K Nearest Neighbors Algorithm?
- Machine Learning Interviews: How does the KNN algorithm work? What are the advantages and disadvantages of KNN?