K-Nearest Neighbor (KNN) | Python

The K-Nearest Neighbor (KNN) algorithm falls under the Supervised Learning category and is used for both classification and regression, though it is more widely used for classification. KNN is popular in real-life scenarios because it is non-parametric: it makes no underlying assumptions about the distribution of the data.

How Does K-Nearest Neighbor Work?

The K-Nearest Neighbor algorithm works on the basis of feature similarity: the classification of a given data point is determined by how closely its features resemble those of the training set. For classification, the output is the most frequent class among the K most similar instances. In essence, each of the K neighbors votes for its class, and the class with the most votes is taken as the prediction.
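
As a minimal sketch of this distance-and-vote idea (the coordinates and labels below are made up for illustration), you can implement the prediction step directly with NumPy and a majority vote:

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical coordinates, for illustration only)
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],   # class A
                    [6.0, 6.5], [6.5, 6.0], [7.0, 7.2]])  # class B
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([1.8, 2.1])))  # -> 'A'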

Example

[Figure: six plotted data points, three yellow (Category-A) and three green (Category-B), with a new red point to classify]
Suppose you have a dataset with two variables which, when plotted, looks like the figure above: a total of six data points, three green and three yellow. The yellow data points belong to 'Category-A' and the green data points belong to 'Category-B'. The red data point in the feature space represents the new point whose class is to be predicted. The K-Nearest Neighbor algorithm starts by calculating the distance from the red point to all the other data points. With K = 3, it then finds the three nearest data points, all of which happen to be yellow. You can therefore say the red point belongs to 'Category-A' (the yellow class), because its nearest neighbors belong to that class.
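
To make the example concrete, here is a small sketch that mirrors it with scikit-learn; the exact coordinates in the original figure are not given, so the values below are invented placeholders:

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Hypothetical coordinates for the six points in the figure
points = np.array([[2.0, 4.0], [3.0, 5.0], [2.5, 4.5],   # yellow: Category-A
                   [7.0, 8.0], [8.0, 7.5], [7.5, 9.0]])  # green:  Category-B
labels = ['Category-A', 'Category-A', 'Category-A',
          'Category-B', 'Category-B', 'Category-B']

red_point = np.array([[3.0, 4.2]])         # the new point to classify
knn = KNeighborsClassifier(n_neighbors=3)  # K = 3, as in the example
knn.fit(points, labels)
print(knn.predict(red_point))              # -> ['Category-A']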

Python implementation of the KNN algorithm

Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

Importing the Dataset (Iris data)

The Iris dataset records the sepal-length, sepal-width, petal-length, and petal-width of three species of iris flower: Iris setosa, Iris virginica, and Iris versicolor. The task is to predict the "Class" to which each plant belongs. To import the dataset and load it into a pandas DataFrame, execute the following code:
IrisPath = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Assign column names to the dataset

headers = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Read the dataset into a pandas DataFrame:
ds = pd.read_csv(IrisPath, names = headers)

Sample dataset
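
You can print the first few rows to check that the load worked; for the UCI iris.data file the head should look roughly like this:

print(ds.head())

   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa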


Data Pre-Processing

X = ds.iloc[:, :-1].values
y = ds.iloc[:, 4].values
The X variable contains the first four columns of the dataset while y contains the labels.
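
A quick sanity check confirms the shapes (the full Iris dataset has 150 rows and four feature columns):

print(X.shape)  # (150, 4) - four feature columns
print(y.shape)  # (150,)   - one label per row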

Train Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.40)
The above code splits the dataset into 60% training data and 40% test data. Because the split is random, the exact numbers you see later may vary from run to run.
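
If you want reproducible numbers, one option is to pass a fixed random_state (the seed value below is arbitrary) and verify the split sizes:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=42)  # any fixed seed works
print(X_train.shape)  # (90, 4) - 60% of 150 rows
print(X_test.shape)   # (60, 4) - 40% of 150 rows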

Scale the Features

Before making any actual predictions, it is always good practice to scale the features so that all of them contribute on a uniform scale. StandardScaler standardizes each feature to zero mean and unit variance; note that it is fitted on the training data only, so no information from the test set leaks into the model.
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
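
A quick check shows the effect: each scaled training feature now has approximately zero mean and unit variance:

print(X_train.mean(axis=0).round(6))  # ~[0. 0. 0. 0.]
print(X_train.std(axis=0).round(6))   # ~[1. 1. 1. 1.]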

Fitting K-NN classifier to the Training data

The next step is to fit the K-NN classifier to the training data.
classifier = KNeighborsClassifier(n_neighbors = 8)
classifier.fit(X_train, y_train)
n_neighbors: defines the number of neighbors (K) the algorithm uses. Here it is set to 8.
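
The choice of K matters. One common way to pick it, sketched below with 5-fold cross-validation on the scaled training data (the range of K values is arbitrary), is to compare cross-validated accuracy:

from sklearn.model_selection import cross_val_score

for k in range(1, 16):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5)
    print(f"k={k:2d}  mean CV accuracy={scores.mean():.3f}")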

Predicting the Test Result

The final step is to make predictions on the test data and store them in the y_pred vector.
y_pred = classifier.predict(X_test)
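
Before computing aggregate metrics, you can eyeball a few individual predictions by lining up predicted and actual labels:

comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison.head(10))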

Confusion Matrix and Classification Report

Create the confusion matrix and classification report for your K-NN model to assess the accuracy of the classifier.
cfMatrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cfMatrix)
cReport = classification_report(y_test, y_pred)
print("Classification Report:")
print(cReport)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Confusion Matrix:
[[22  0  0]
 [ 0 20  1]
 [ 0  1 16]]
Classification Report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        22
Iris-versicolor       0.95      0.95      0.95        21
 Iris-virginica       0.94      0.94      0.94        17

       accuracy                           0.97        60
      macro avg       0.96      0.96      0.96        60
   weighted avg       0.97      0.97      0.97        60
Accuracy: 0.9666666666666667
The above results show that your KNN classifier labeled 58 of the 60 records in the test set correctly, for an accuracy of about 96.7%. Although the algorithm performed very well on this dataset, don't expect identical results in every application; the random train/test split alone means your numbers may differ.
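
Since matplotlib is already imported, you can also visualize how the test accuracy varies with K; this is a sketch, and the range of K values is arbitrary:

k_values = range(1, 26)
test_acc = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    test_acc.append(model.score(X_test, y_test))  # accuracy on the test set

plt.plot(k_values, test_acc, marker='o')
plt.xlabel('n_neighbors (K)')
plt.ylabel('Test accuracy')
plt.title('KNN test accuracy vs. K on Iris')
plt.show()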

Full Source | Python

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

IrisPath = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headers = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
ds = pd.read_csv(IrisPath, names = headers)

X = ds.iloc[:, :-1].values
y = ds.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.40)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

classifier = KNeighborsClassifier(n_neighbors = 8)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

cfMatrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cfMatrix)
cReport = classification_report(y_test, y_pred)
print("Classification Report:")
print(cReport)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)