k-Nearest Neighbors Regression

k-Nearest Neighbors (k-NN) regression is a non-parametric, instance-based learning algorithm for regression tasks. It predicts the output value of a data point from the output values of the k nearest data points in the training set.

The k-NN algorithm has the following steps:

  • Choose the number of neighbors k and collect a training set. The training set consists of samples $(xs_i, ys_i)$, where $xs_i$ is the input (feature) vector and $ys_i$ is the corresponding target (label) value for each sample i.
  • Calculate the distance between the test point and every point in the training set. The distance $d_i$ between the test sample $xt$ and a stored training sample $xs_i$ is usually the Euclidean distance:
$$d_i = \sqrt{\sum_{j=1}^{n} (xt_j - xs_{i,j})^2}$$
where n is the number of features in each sample.
  • Select the k nearest points in the training set based on the distance.
  • Predict the output value of the test point as the average of the output values of the k nearest points. The prediction for a test sample $xt$ is then (see the sketch after this list):
$$\hat{y}(xt) = \frac{1}{k} \sum_{i=1}^{k} ys_i$$
where $ys_i$ are the target values of the k nearest neighbors of $xt$.
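
To make these steps concrete, here is a minimal from-scratch sketch in NumPy (the function name knn_predict and the variable names are illustrative, not from a library):

import numpy as np

def knn_predict(xt, Xs, ys, k=3):
    # Euclidean distance from the test sample to every training sample
    d = np.sqrt(np.sum((Xs - xt)**2, axis=1))
    # Indices of the k nearest training samples
    nearest = np.argsort(d)[:k]
    # Average the target values of the k nearest neighbors
    return np.mean(ys[nearest])

# Example with a small training set
Xs = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
ys = np.array([1, 2, 3, 4, 5])
print(knn_predict(np.array([1, 1]), Xs, ys, k=3))  # 2.0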

Neighbors-based regression is a form of lazy learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. The regression result is computed from the k nearest neighbors of each point as an average or a local linear approximation.
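
The plain average weighs all k neighbors equally. As a sketch of a weighted variant, scikit-learn's KNeighborsRegressor accepts weights='distance', which weighs each neighbor by the inverse of its distance so that closer points have more influence (the data here is illustrative):

from sklearn.neighbors import KNeighborsRegressor
import numpy as np

X = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([1, 2, 3, 4, 5])

# weights='distance' weighs each neighbor by 1/distance
knn = KNeighborsRegressor(n_neighbors=3, weights='distance')
knn.fit(X, y)
print(knn.predict([[1, 1]]))  # closer neighbors dominate the average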

Advantages: The algorithm is simple to implement, robust to noisy training data, and effective when the training set is large.

Disadvantages: The value of k must be determined, and the computational cost is high because the distance from each test instance to every training sample must be computed. A feedback loop, such as cross-validation over candidate values of k, can be added to determine the number of neighbors (see the sketch below).
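
One way to implement such a feedback loop is a cross-validated grid search over candidate values of n_neighbors. A minimal sketch with scikit-learn (the data and candidate grid are illustrative):

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
import numpy as np

# Illustrative 1-D training data
X = np.linspace(0, 10, 20).reshape(-1, 1)
y = np.sin(X).ravel()

# Search over candidate values of k with 5-fold cross-validation
grid = GridSearchCV(KNeighborsRegressor(),
                    {'n_neighbors': [1, 3, 5, 7]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)  # k with the best cross-validated R^2 score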

k-Nearest Neighbors Regression in Python

Below is an example of how to implement k-NN regression in Python using the scikit-learn library:

from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Assume that we have a training set of data points with input features X and output values y
X = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([1, 2, 3, 4, 5])

# Create a k-NN regressor with k=3
knn = KNeighborsRegressor(n_neighbors=3)

# Fit the regressor to the training data
knn.fit(X, y)

# Predict the output value of a new data point
x_new = np.array([[1, 1]])
y_pred = knn.predict(x_new)
print(y_pred)  # Output: [2.]

In this example, we have a training set of 5 data points with input features X and output values y. We create a k-NN regressor with k=3 and fit it to the training data. Then we use the regressor to predict the output value of a new data point. The prediction is the average of the output values of the 3 nearest data points in the training set (here the neighbors have targets 1, 2, and 3, so the prediction is 2.0). The number of nearest neighbors is the primary hyper-parameter for adjusting the performance of the regressor, such as changing to n_neighbors=5.

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, y)
yP = knn.predict(x_new)
print(yP)  # Output: [3.] (average of all 5 training targets)

MATLAB Live Script


✅ Knowledge Check

1. Which of the following statements best describes how k-Nearest Neighbors (k-NN) regression works?

A. It predicts the output value of a data point based solely on its own features without considering any other data points.
Incorrect. k-NN regression considers the output values of the k-nearest data points in the training set to predict the output value of a new data point.
B. It works by calculating the mode of the output values of the k-nearest data points in the training set.
Incorrect. While the mode (majority vote) is used in k-NN classification, k-NN regression predicts the output value as the average of the output values of the k-nearest points.
C. It predicts the output value of a data point based on the average of the output values of the k-nearest data points in the training set.
Correct. Exactly! k-NN regression predicts the output value by averaging the values of the k-nearest data points.
D. The algorithm tries to find the best fit line for the data points.
Incorrect. k-NN regression doesn't try to find a best fit line, unlike linear regression. It's based on instance-based learning.

2. Which of the following is NOT a disadvantage of the k-NN regression algorithm?

A. It's complex to implement.
Correct. k-NN is simple to implement, so implementation complexity is not one of its disadvantages.
B. Determining the value of k can be challenging.
Incorrect. Determining the optimal value of k can indeed be challenging, so this is a disadvantage of the k-NN algorithm.
C. It has a high computational cost since it computes the distance to all training samples.
Incorrect. This is indeed a disadvantage of k-NN. It requires calculating distances between the test sample and every sample in the training dataset.
D. The algorithm tries to construct a general internal model for predictions.
Incorrect. k-NN is a type of lazy learning, meaning it does not construct a general internal model but simply stores instances of the training data.

See also k-Nearest Neighbors for Classification