
Machine Learning From Scratch Series: K-Nearest Neighbors (KNN) Algorithm


Introduction

A famous quote states: “You are the average of the five people you spend the most time with.” Although we won’t be modeling the qualities of your friendships (portfolio project, anyone?), this tutorial teaches a simple and intuitive approach to classifying data points based on their neighbors. The k-nearest neighbors (KNN) algorithm is a supervised learning algorithm with an elegant execution and a surprisingly easy implementation. Because of this, KNN is a great learning opportunity for machine learning beginners to build a powerful classification or regression model in a few lines of Python. The code is available in this notebook.

Algorithm

KNN is a supervised machine learning algorithm. A supervised model has both a target variable and independent variables. The target variable, or dependent variable, denoted y, depends on the independent variables and is the value you seek to predict. The independent variables, denoted x (single-valued) or X (multi-valued), are known ahead of time and are used to predict y.

KNN can be used for both classification and regression. Classification models predict a categorical target variable (for example, which species a flower belongs to), while regression models predict a numeric target (for example, a house price).

Suppose you have a dataset of scalar attributes and the classes corresponding to those attributes, say D = {(X_1, y_1), (X_2, y_2), ..., (X_n, y_n)}, where each class y_i is drawn from {C_1, ..., C_m}.

Here, n is the total number of data points and m is the total number of classes. Instead of y being a class, it could also be a scalar value, in which case KNN can be used for regression, but for this tutorial we will focus on classification.

With a two-dimensional example, we can easily visualize the points in a plane. Assuming that points of the same class tend to cluster together in this space, we can classify a new point by the class that occurs most frequently among its nearby points. Thus, for a given point, k specifies how many of its nearest neighbors to consider, and the most frequently occurring class among those k neighbors is predicted to be the class of the point. For example, if k=3 and the three nearest neighbors have classes 0, 0, and 1, the point is assigned class 0.

Figure 1: This point is classified as group 1 when k=1. (Image by author.)

Figure 2: This point is classified as group 0 when k=3. (Image by author.)

Data

We’ll evaluate our algorithm on the iris dataset from the UCI Machine Learning Repository; however, any classification dataset consisting of scalar inputs will do. We’ll unpack the dataset, split it into training and testing sets (with the test set holding 20% of the original data), and standardize the attributes to have zero mean and unit variance, fitting the scaler on the training set only. Standardization is done because we don’t want to pass any judgment on which features are most important for predicting the class (for this analysis!); without it, features with larger numeric ranges would dominate the distance calculation.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Unpack the iris dataset, from UCI Machine Learning Repository
iris = datasets.load_iris()
X = iris['data']
y = iris['target']
# Split data into train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Preprocess data: standardize using statistics from the training set only
ss = StandardScaler().fit(X_train)
X_train, X_test = ss.transform(X_train), ss.transform(X_test)
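
If you want to sanity-check the result (an optional check, not in the original post), the iris data has 150 samples with four numeric features each:

print(X_train.shape, X_test.shape)   # (120, 4) and (30, 4) for the 80/20 split of the 150 iris samples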

Model Creation

Helper Functions

At several points in this algorithm we’ll need to find the most commonly occurring element in a group. The following function is an intuitive and Pythonic way to find the most common element in a list.

def most_common(lst):
    '''Returns the most common element in a list'''
    return max(set(lst), key=lst.count)
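
A quick illustration (not from the original post): the element with the highest count wins, and ties are broken arbitrarily by the set’s iteration order.

most_common([0, 1, 1, 2, 1])   # -> 1
most_common(['a', 'b', 'b'])   # -> 'b'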

Next, we’ll need to calculate the distance between a point and every point in a dataset. The most common and intuitive choice is Euclidean distance; however, other distance metrics can be used.

def euclidean(point, data):
    '''Euclidean distance between a point & each row of data'''
    return np.sqrt(np.sum((point - data)**2, axis=1))
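
Because point is broadcast against every row of data, the function returns one distance per row. A small check (illustrative values, not from the original post):

point = np.array([0.0, 0.0])
data = np.array([[3.0, 4.0],
                 [0.0, 1.0]])
euclidean(point, data)   # -> array([5., 1.])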

Implementation

Now, let’s begin to construct a KNN class. For a given KNN classifier, we’ll specify k and a distance metric. To keep the interface similar to that of the widely used scikit-learn suite, we’ll initialize self.X_train and self.y_train in a fit method; however, this could also be done in the constructor.

class KNeighborsClassifier():
    def __init__(self, k=5, dist_metric=euclidean):
        self.k = k
        self.dist_metric = dist_metric
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

Next, the bulk of the KNN algorithm is performed by the predict method. A dataset of attributes (X_test) is iterated through, point by point. For each data point, the following steps are performed:

  1. Distances to every point in the training dataset are calculated.
  2. The training dataset classes are sorted by distance to the data point.
  3. The first k classes are kept and stored in the neighbors list.

Now we simply map the list of nearest neighbors to our most_common function, returning a list of predictions for each point passed in X_test.

class KNeighborsClassifier():
    def __init__(self, k=5, dist_metric=euclidean):
        self.k = k
        self.dist_metric = dist_metric
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
    def predict(self, X_test):
        neighbors = []
        for x in X_test:
            distances = self.dist_metric(x, self.X_train)
            y_sorted = [y for _, y in sorted(zip(distances, self.y_train))]
            neighbors.append(y_sorted[:self.k])
        return list(map(most_common, neighbors))
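
Sorting zip(distances, self.y_train) pairs each training label with its distance and orders the pairs by distance, so slicing the first k entries yields the classes of the k nearest neighbors. As a quick illustration of how the class is used (assuming the arrays from the Data section above):

knn = KNeighborsClassifier(k=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)   # one predicted class per test point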

Lastly, an evaluate method is defined to conveniently measure the performance of our model. A dataset of attributes and their classes is passed as X_test and y_test, and the model’s predictions for the attributes are compared against the actual classes.

class KNeighborsClassifier():
    def __init__(self, k=5, dist_metric=euclidean):
        self.k = k
        self.dist_metric = dist_metric
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
    def predict(self, X_test):
        neighbors = []
        for x in X_test:
            distances = self.dist_metric(x, self.X_train)
            y_sorted = [y for _, y in sorted(zip(distances, self.y_train))]
            neighbors.append(y_sorted[:self.k])
        return list(map(most_common, neighbors))
    def evaluate(self, X_test, y_test):
        y_pred = self.predict(X_test)
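        # y_test is a NumPy array, so y_pred == y_test compares element-wise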
        accuracy = sum(y_pred == y_test) / len(y_test)
        return accuracy

Believe it or not, we’re finished: we can now use this algorithm to model classification problems. But for completeness, we should optimize k for the iris dataset. We can do this by iterating over a range of k values and plotting the performance of the model for each.

accuracies = []
ks = range(1, 30)
for k in ks:
    knn = KNeighborsClassifier(k=k)
    knn.fit(X_train, y_train)
    accuracy = knn.evaluate(X_test, y_test)
    accuracies.append(accuracy)
fig, ax = plt.subplots()
ax.plot(ks, accuracies)
ax.set(xlabel="k",
       ylabel="Accuracy",
       title="Performance of knn")
plt.show()

Figure 3: KNN accuracy versus k. (Image by author.)

It looks like our KNN model performs best at low values of k.

Conclusion

And with that, we’re done. We’ve implemented a simple and intuitive k-nearest neighbors algorithm in under 100 lines of Python code (under 50 excluding the plotting and data unpacking). The entire project code is included below.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
def most_common(lst):
    return max(set(lst), key=lst.count)
def euclidean(point, data):
    # Euclidean distance between a point & data
    return np.sqrt(np.sum((point - data)**2, axis=1))
class KNeighborsClassifier:
    def __init__(self, k=5, dist_metric=euclidean):
        self.k = k
        self.dist_metric = dist_metric
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
    def predict(self, X_test):
        neighbors = []
        for x in X_test:
            distances = self.dist_metric(x, self.X_train)
            y_sorted = [y for _, y in sorted(zip(distances, self.y_train))]
            neighbors.append(y_sorted[:self.k])
        return list(map(most_common, neighbors))
    def evaluate(self, X_test, y_test):
        y_pred = self.predict(X_test)
        accuracy = sum(y_pred == y_test) / len(y_test)
        return accuracy
# Unpack the iris dataset, from UCI Machine Learning Repository
iris = datasets.load_iris()
X = iris['data']
y = iris['target']
# Split data into train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Preprocess data
ss = StandardScaler().fit(X_train)
X_train, X_test = ss.transform(X_train), ss.transform(X_test)
# Test knn model across varying ks
accuracies = []
ks = range(1, 30)
for k in ks:
    knn = KNeighborsClassifier(k=k)
    knn.fit(X_train, y_train)
    accuracy = knn.evaluate(X_test, y_test)
    accuracies.append(accuracy)
# Visualize accuracy vs. k
fig, ax = plt.subplots()
ax.plot(ks, accuracies)
ax.set(xlabel="k",
       ylabel="Accuracy",
       title="Performance of knn")
plt.show()

Returning to our quote, “You are the average of the five people you spend the most time with,” the KNN classifier should instead say, “You are the most frequent of the k people you spend the most time with.” For an independent challenge, adapt this code to better fit the original quote by creating a KNN regression model, where a point’s prediction is the average of the scalar target values of its k nearest neighbors.
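
One possible starting point for that challenge (my own sketch, not part of the original post) reuses everything except the prediction step, which averages the k nearest target values instead of voting:

class KNeighborsRegressor:
    def __init__(self, k=5, dist_metric=euclidean):
        self.k = k
        self.dist_metric = dist_metric
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
    def predict(self, X_test):
        predictions = []
        for x in X_test:
            distances = self.dist_metric(x, self.X_train)
            y_sorted = [y for _, y in sorted(zip(distances, self.y_train))]
            # Average the k nearest targets instead of taking the most common class
            predictions.append(sum(y_sorted[:self.k]) / self.k)
        return predictions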

Source:

https://towardsdatascience.com/create-your-own-k-nearest-neighbors-algorithm-in-python-eb7093fc6339
