K-Nearest Neighbor (KNN) | Python

The K-Nearest Neighbor (KNN) algorithm is a versatile supervised learning technique used for both classification and regression, although it is more commonly applied to classification problems. KNN is popular in practice because it is non-parametric: it makes no assumption about the distribution of the data. This flexibility makes KNN well suited to applications where the underlying data distribution is unknown or hard to characterize.

How does K-Nearest Neighbor work?

The K-Nearest Neighbor (KNN) algorithm operates on the principle of feature similarity. To classify a new data point, KNN measures how closely its features resemble those of the training instances. The predicted class is the majority class among the K most similar training instances: each of the K neighbors effectively casts a "vote" for its own class, and the class with the most votes becomes the prediction. This voting mechanism makes KNN a simple yet effective classifier.
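
To make the voting mechanism concrete, here is a minimal from-scratch sketch of a KNN classifier using NumPy; the function name knn_predict and its arguments are illustrative, not part of any library:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]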

Example

[Figure: scatter plot illustrating the K-Nearest Neighbor algorithm, with a red query point and its K=3 nearest neighbors highlighted]

In the dataset shown above, the KNN algorithm is used to predict the class of the red data point. It starts by computing the distance between the red point and every other point in the dataset. With K=3 specified, the algorithm then selects the 3 points with the smallest distance to the red point; in this example, all three happen to be yellow.

Since all three nearest neighbors are yellow points belonging to 'Category-A', the majority vote is unanimous and KNN predicts that the red point also belongs to 'Category-A'. In general, the prediction is simply the class held by the majority of the K most similar points in the feature space, which makes KNN a simple yet effective classification method.
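
The same scenario can be reproduced with scikit-learn. The 2-D coordinates below are hypothetical values chosen to mimic the figure, with three 'Category-A' points clustered near the query point:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical points: Category-A near the query, Category-B farther away
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [3.0, 3.2], [3.1, 2.8], [2.9, 3.0]])
y = np.array(['Category-A'] * 3 + ['Category-B'] * 3)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[1.1, 1.0]]))  # majority vote -> ['Category-A']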

Python implementation of the KNN algorithm

Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

Importing the Dataset (Iris data)

The Iris dataset records the sepal length, sepal width, petal length, and petal width of three species of iris flower: Iris setosa, Iris virginica, and Iris versicolor. The task is to predict the "Class" to which each plant belongs. To import the dataset and load it into a pandas dataframe, execute the following code:

IrisPath = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Assign column names to the dataset:

headers = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Read the dataset into a pandas dataframe:

ds = pd.read_csv(IrisPath, names=headers)

Sample dataset
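
To inspect the first few rows of the loaded dataframe, you can print its head (the exact values depend on the downloaded file, but the columns should match the headers assigned above):

print(ds.head())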


Data Pre-Processing

X = ds.iloc[:, :-1].values
y = ds.iloc[:, 4].values

The X variable contains the first four columns of the dataset (the features), while y contains the labels.
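
As a quick sanity check, the standard iris file should yield 150 rows with 4 feature columns:

print(X.shape)  # expected: (150, 4)
print(y.shape)  # expected: (150,)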

Train Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

The above code splits the dataset into 60% training data and 40% test data.
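
Note that no random_state is set, so the split (and therefore the metrics reported later) will differ on every run. For a reproducible, class-balanced split you could use a variant such as the following, where the seed value 42 is an arbitrary choice:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=42, stratify=y)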

Scale the Features

Before making any predictions, it is good practice to scale the features so that all of them contribute uniformly. Because KNN relies on distances between points, a feature on a larger scale would otherwise dominate the distance calculation. Note that the scaler is fitted on the training data only and then applied to both sets, which avoids leaking information from the test set:

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Fitting K-NN classifier to the Training data

The next step is to fit the K-NN classifier to the training data.

classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)
n_neighbors defines the number of neighbors (K) consulted by the algorithm; here it is set to 8.
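
The choice K=8 is somewhat arbitrary. A common way to choose K is to compare cross-validated accuracy over a range of candidate values, as in this sketch:

from sklearn.model_selection import cross_val_score

for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    print(k, scores.mean())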

Predicting the Test Result

The final step is to make predictions on the test data and store them in the y_pred vector.

y_pred = classifier.predict(X_test)
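
The fitted model can also classify a brand-new measurement. The new sample must pass through the same fitted scaler before prediction; the four values below are hypothetical sepal/petal measurements:

new_sample = [[5.1, 3.5, 1.4, 0.2]]  # hypothetical flower measurements
print(classifier.predict(scaler.transform(new_sample)))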

Confusion Matrix and Classification Report

Create the confusion matrix and classification report for your K-NN model to evaluate the accuracy of the classifier.

cfMatrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cfMatrix)
cReport = classification_report(y_test, y_pred)
print("Classification Report:")
print(cReport)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Confusion Matrix:
[[22  0  0]
 [ 0 20  1]
 [ 0  1 16]]
Classification Report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        22
Iris-versicolor       0.95      0.95      0.95        21
 Iris-virginica       0.94      0.94      0.94        17

       accuracy                           0.97        60
      macro avg       0.96      0.96      0.96        60
   weighted avg       0.97      0.97      0.97        60

Accuracy: 0.9666666666666667

The above results show that the KNN classifier predicted the classes of the 60 test records with roughly 97% accuracy (58 of 60 correct). Although the algorithm performed very well on this dataset, don't expect the same results in every application; also, because the train/test split is random, the exact numbers will vary from run to run.

Full Source | Python

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

IrisPath = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headers = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
ds = pd.read_csv(IrisPath, names=headers)

X = ds.iloc[:, :-1].values
y = ds.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

cfMatrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cfMatrix)
cReport = classification_report(y_test, y_pred)
print("Classification Report:")
print(cReport)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Conclusion

K-Nearest Neighbor (KNN) is a versatile supervised learning algorithm used for both classification and regression. It operates on the principle of feature similarity: predictions are made by finding the K most similar data points in the training set and taking a majority vote over their classes (for classification) or averaging their values (for regression). Because KNN is non-parametric and assumes nothing about the underlying data distribution, it remains applicable across a wide range of real-life scenarios.