k-Nearest Neighbors Regression

k-Nearest Neighbors (k-NN) regression is a non-parametric, instance-based learning algorithm for regression tasks. It predicts the output value of a data point from the output values of the k nearest data points in the training set.

The k-NN algorithm has the following steps:

  • Choose the number of neighbors k and collect a training set. The training set consists of samples $(xs_i, ys_i)$, where $xs_i$ is the input (feature) vector and $ys_i$ is the corresponding target (label) value for each sample i.
  • Calculate the distance between the test point and every point in the training set. The distance $d_i$ between the test sample $xt$ and a stored training sample $xs_i$ is usually the Euclidean distance:
$$d_i = \sqrt{\sum_{j=1}^{n} (xt_j - xs_{i,j})^2}$$
where n is the number of features in each sample.
  • Select the k nearest points in the training set based on the distance.
  • Predict the output value of the test point as the average of the output values of the k nearest points. The prediction for a test sample $xt$ is then (see the sketch after this list):
$$\hat{y}(xt) = \frac{1}{k} \sum_{i=1}^{k} ys_i$$
where $ys_i$ are the target values of the k nearest neighbors of $xt$.
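
To make these steps concrete, here is a minimal from-scratch sketch in NumPy (the function name knn_predict and the variable names are illustrative, not from a library):

import numpy as np

def knn_predict(xt, Xs, ys, k=3):
    # Euclidean distance from the test sample to every training sample
    d = np.sqrt(np.sum((Xs - xt)**2, axis=1))
    # Indices of the k nearest training samples
    nearest = np.argsort(d)[:k]
    # Average the target values of the k nearest neighbors
    return np.mean(ys[nearest])

# Example with a small training set
Xs = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
ys = np.array([1, 2, 3, 4, 5])
print(knn_predict(np.array([1, 1]), Xs, ys, k=3))  # 2.0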

Neighbors-based regression is a form of lazy learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. The regression result is computed from the k nearest neighbors of each point as an average or a local linear approximation.
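
The plain average weighs all k neighbors equally. As a sketch of a weighted variant, scikit-learn's KNeighborsRegressor accepts weights='distance', which weighs each neighbor by the inverse of its distance so that closer points have more influence (the data here is illustrative):

from sklearn.neighbors import KNeighborsRegressor
import numpy as np

X = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([1, 2, 3, 4, 5])

# weights='distance' weighs each neighbor by 1/distance
knn = KNeighborsRegressor(n_neighbors=3, weights='distance')
knn.fit(X, y)
print(knn.predict([[1, 1]]))  # closer neighbors dominate the average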

Advantages: The algorithm is simple to implement, robust to noisy training data, and effective when the training set is large.

Disadvantages: The value of k must be determined, and the computational cost is high because the distance from each test instance to every training sample must be computed. A feedback loop, such as cross-validation over candidate values of k, can be added to determine the number of neighbors (see the sketch below).
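
One way to implement such a feedback loop is a cross-validated grid search over candidate values of n_neighbors. A minimal sketch with scikit-learn (the data and candidate grid are illustrative):

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
import numpy as np

# Illustrative 1-D training data
X = np.linspace(0, 10, 20).reshape(-1, 1)
y = np.sin(X).ravel()

# Search over candidate values of k with 5-fold cross-validation
grid = GridSearchCV(KNeighborsRegressor(),
                    {'n_neighbors': [1, 3, 5, 7]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)  # k with the best cross-validated R^2 score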

k-Nearest Neighbors Regression in Python

Below is an example of how to implement k-NN regression in Python using the scikit-learn library:

from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Assume that we have a training set of data points with input features X and output values y
X = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([1, 2, 3, 4, 5])

# Create a k-NN regressor with k=3
knn = KNeighborsRegressor(n_neighbors=3)

# Fit the regressor to the training data
knn.fit(X, y)

# Predict the output value of a new data point
x_new = np.array([[1, 1]])
y_pred = knn.predict(x_new)
print(y_pred)  # Output: [2.]

In this example, we have a training set of 5 data points with input features X and output values y. We create a k-NN regressor with k=3 and fit it to the training data. Then we use the regressor to predict the output value of a new data point. The prediction is the average of the output values of the 3 nearest data points in the training set (here the neighbors have targets 1, 2, and 3, so the prediction is 2.0). The number of nearest neighbors is the primary hyper-parameter for adjusting the performance of the regressor, such as changing to n_neighbors=5.

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, y)
yP = knn.predict(x_new)
print(yP)  # Output: [3.] (average of all 5 training targets)

MATLAB Live Script


✅ Knowledge Check

1. Which of the following statements best describes how k-Nearest Neighbors (k-NN) regression works?

A. It predicts the output value of a data point based solely on its own features without considering any other data points.
Incorrect. k-NN regression considers the output values of the k-nearest data points in the training set to predict the output value of a new data point.
B. It works by calculating the mode of the output values of the k-nearest data points in the training set.
Incorrect. While the mode (majority vote) is used in k-NN classification, k-NN regression predicts the output value as the average of the output values of the k-nearest points.
C. It predicts the output value of a data point based on the average of the output values of the k-nearest data points in the training set.
Correct. Exactly! k-NN regression predicts the output value by averaging the values of the k-nearest data points.
D. The algorithm tries to find the best fit line for the data points.
Incorrect. k-NN regression doesn't try to find a best fit line, unlike linear regression. It's based on instance-based learning.

2. Which of the following is NOT a disadvantage of the k-NN regression algorithm?

A. It's complex to implement.
Correct. k-NN is simple to implement, so implementation complexity is not one of its disadvantages.
B. Determining the value of k can be challenging.
Incorrect. Determining the optimal value of k can indeed be challenging, so this is a disadvantage of the k-NN algorithm.
C. It has a high computational cost since it computes the distance to all training samples.
Incorrect. This is indeed a disadvantage of k-NN. It requires calculating distances between the test sample and every sample in the training dataset.
D. The algorithm tries to construct a general internal model for predictions.
Incorrect. k-NN is a type of lazy learning, meaning it does not construct a general internal model but simply stores instances of the training data.

See also k-Nearest Neighbors for Classification