k-Nearest Neighbors Classifier

Neighbors-based classification is a type of lazy learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbors of each point. Specifically, k-Nearest Neighbors (kNN) is an instance-based, non-parametric learning algorithm: the model is not explicitly trained, but instead makes predictions by comparing new data points to the nearest neighbors in the training data. The algorithm calculates the distances between the new data point and the training data, selects the k nearest points, and makes a prediction based on the classes of those neighbors.

Advantages: The algorithm is simple to implement, robust to noisy training data, and effective when the training data is large.

Disadvantages: The value of k must be selected, and the computational cost is high because the distance from each new instance to every training sample must be computed. A feedback loop can be added to determine the best number of neighbors.
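
To make the distance-and-vote procedure concrete, here is a minimal sketch that classifies a single query point with plain NumPy. The arrays X_train, y_train, and x_new are placeholder values for illustration only.

import numpy as np
from collections import Counter

# Placeholder training data (5 samples, 2 features) and binary labels
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.9]])
y_train = np.array([0, 0, 1, 1, 0])
x_new = np.array([1.4, 1.5])  # new point to classify
k = 3

# Distance from the new point to every training sample
dist = np.sqrt(((X_train - x_new)**2).sum(axis=1))

# Indices of the k nearest neighbors and majority vote of their labels
nearest = np.argsort(dist)[:k]
prediction = Counter(y_train[nearest]).most_common(1)[0][0]
print('Predicted class:', prediction)  # 0 for these placeholder values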

k-Nearest Neighbors in Python

Here is an example of k-nearest neighbors in Python using the scikit-learn library:

from sklearn.neighbors import KNeighborsClassifier

# Create a k-nearest neighbors classifier with 3 neighbors
model = KNeighborsClassifier(n_neighbors=3)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

In this example, we first create a k-nearest neighbors classifier with 3 neighbors using the KNeighborsClassifier class from scikit-learn. Then, we train the model on the training data using the fit method. Finally, we use the trained model to make predictions on the test set using the predict method. The number of neighbors is the main hyperparameter for tuning performance; increase n_neighbors to base each classification on a broader consensus of neighbors, as in the compact example below with 5 neighbors.

from sklearn.neighbors import KNeighborsClassifier

# Fit on training data (XA, yA) and predict labels for new data XB
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(XA, yA)
yP = knn.predict(XB)
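
Because the number of neighbors is a hyperparameter, a simple way to tune it is to scan several values of k and compare accuracy on held-out data. The sketch below assumes train/test splits named X_train, y_train, X_test, y_test as in the first example.

from sklearn.neighbors import KNeighborsClassifier

# Compare test accuracy for several candidate values of k
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))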

Optical Character Recognition with k-Nearest Neighbors

Optical character recognition (OCR) is the process of extracting text from images or scanned documents. OCR algorithms are typically based on machine learning models that are trained on large datasets of images containing text.

Here is an example of OCR using a k-nearest neighbors (kNN) classifier in Python using the scikit-learn library:

# Import necessary libraries
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the dataset of images of handwritten digits
digits = load_digits()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data,
                                                    digits.target,
                                                    random_state=0)

# Create a k-nearest neighbors classifier with 3 neighbors
model = KNeighborsClassifier(n_neighbors=3)

# Train the model using the training set
model.fit(X_train, y_train)

# Evaluate the model's performance on the test set
accuracy = model.score(X_test, y_test)
print("Accuracy: %0.2f" % accuracy)

In this example, we first load the digits dataset from scikit-learn, which contains images of handwritten digits. Then, we split the data into training and testing sets and create a k-nearest neighbors classifier with 3 neighbors using the KNeighborsClassifier class. Next, we train the model on the training set with the fit method. Finally, we evaluate the trained model on the test set with the score method, which reports the classification accuracy.

Note that this is just a simple example of OCR using a kNN classifier in scikit-learn; many more advanced OCR algorithms and libraries exist. kNN classifiers may not be the most effective models for OCR tasks because they are not well-suited to high-dimensional datasets such as images. Below is a complete example with a train/test dataset split that also displays a randomly selected test digit along with its predicted label.

from sklearn import datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)

# The digits dataset
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Split into train and test subsets (50% each)
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.5, shuffle=False)

# Learn the digits on the first half of the data
classifier.fit(X_train, y_train)

# Test on a random digit from the second half of the data
n = np.random.randint(int(n_samples/2), n_samples)
print('Predicted: ' + str(classifier.predict(digits.data[n:n+1])[0]))

# Show number
plt.imshow(digits.images[n], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
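
To see which digits the trained classifier confuses with one another, a confusion matrix over the full test half is a useful follow-up. The sketch below reuses classifier, X_test, and y_test from the example above.

from sklearn.metrics import confusion_matrix

# Predict every test digit and tabulate actual vs. predicted classes
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))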


k-Nearest Neighbors for Animal Classification

The task is to classify cats and dogs by weight. A total of 16 animals are measured, and the data is split into a train set (13 animals) and a test set (3 animals).

The accuracy of the classifier is defined as the number of correct test classifications divided by the total number in the test set.

$$\mathrm{Accuracy} = \frac{\mathrm{Correct}}{\mathrm{Total}} \times 100\%$$

What is the accuracy of the dog and cat classifier with k=2?
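
Once the test predictions are available, the accuracy calculation is a single comparison. The label arrays below are hypothetical placeholders, not the actual exercise data.

import numpy as np

# Hypothetical true and predicted labels for a 3-animal test set (1=dog, 0=cat)
y_true = np.array([1, 0, 1])
y_pred = np.array([1, 0, 0])

accuracy = np.mean(y_pred == y_true) * 100
print(f'Accuracy: {accuracy:.1f}%')  # 66.7% for these placeholder labels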

Solution Help


k-Nearest Neighbors Programming

Using only Python code (not a machine learning package), create a k-Nearest Neighbors classifier. Use the blobs test code below to evaluate the performance of the classifier.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

n_samples = 100
features, label = make_blobs(n_samples=n_samples, centers=2,
                             n_features=2, random_state=7,
                             cluster_std=1.0, center_box=(-5.0, 5.0))
data = pd.DataFrame({'x': features[:,0], 'y': features[:,1],
                     'z': label})
sns.pairplot(data, hue='z')
plt.show()

Solutions

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

n_samples = 100
features, label = make_blobs(n_samples=n_samples, centers=2,
                             n_features=2, random_state=7,
                             cluster_std=1.0, center_box=(-5.0, 5.0))
data = pd.DataFrame({'x': features[:,0], 'y': features[:,1],
                     'z': label})

# split data
train = data[0:70].copy()
test = data[70:].copy().reset_index(drop=True)

# evaluate on test data
k = 3  # k=3 Nearest Neighbors
correct = 0
for s in range(len(test)):
    x, y, z = test.iloc[s]
    # distance from the test point to every training point
    train['dist'] = np.sqrt((train['x']-x)**2 + (train['y']-y)**2)
    # sort by distance; the median of the k nearest labels acts
    # as a majority vote for the two classes (0 or 1)
    train = train.sort_values('dist')
    if int(train['z'].iloc[0:k].median()) == int(z):
        correct += 1
print(f'Correct: {100*correct/len(test)}%')
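
As a check on the hand-coded classifier, the same split can be run through scikit-learn's KNeighborsClassifier; the sketch below reuses the train and test DataFrames from the solution and should report a comparable accuracy.

from sklearn.neighbors import KNeighborsClassifier

# Fit scikit-learn kNN on the same split for comparison
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train[['x','y']], train['z'])
score = knn.score(test[['x','y']], test['z'])
print(f'scikit-learn accuracy: {100*score:.1f}%')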

✅ Knowledge Check

1. Which statement best describes the k-Nearest Neighbors (kNN) algorithm?

A. kNN is a type of eager learning that constructs a general internal model from training data.
Incorrect. kNN is a type of lazy learning, which means it doesn't construct a general internal model but makes predictions based on the training data instances.
B. kNN is a type of supervised learning where the model is trained explicitly using a training dataset.
Incorrect. While kNN is a type of supervised learning, it is not explicitly trained using a training dataset in the traditional sense. It relies on instance-based or non-parametric learning.
C. kNN classifies new data points by calculating distances between the new data point and existing data points in the training set.
Correct. The kNN algorithm works by calculating distances between a new data point and existing training data points and then makes a decision based on the majority vote from the nearest neighbors.
D. kNN does not require a hyperparameter to be set for determining the number of neighbors.
Incorrect. The number of neighbors, represented by 'k', is a hyperparameter in kNN that determines how many nearest neighbors will be considered when making a classification.

2. In the first Python OCR example using the scikit-learn library, how many neighbors are used for classification?

A. 5 neighbors
Incorrect. In the provided OCR example, the number of neighbors specified is 3, not 5.
B. 2 neighbors
Incorrect. In the provided OCR example, the number of neighbors specified is 3.
C. 3 neighbors
Correct. In the provided OCR example, the KNeighborsClassifier is created with n_neighbors=3, which means it uses 3 neighbors for classification.
D. The number of neighbors is determined dynamically based on the dataset.
Incorrect. While it's possible to use techniques to dynamically determine the best 'k', in the provided example, it is explicitly set to 3.

See also k-Nearest Neighbors for Regression

Return to Classification Overview