k-Nearest Neighbors Regression
k-Nearest Neighbors (k-NN) regression is a non-parametric, instance-based learning algorithm for regression tasks. It predicts the output value of a data point from the output values of the k nearest data points in the training set.
The k-NN algorithm has the following steps:
- Choose the number of neighbors k and collect a training set of samples $(x_{s,i}, y_{s,i})$, where $x_{s,i}$ is the input (feature) vector and $y_{s,i}$ is the corresponding target (label) value for each sample $i$.
- Calculate the distance between the data point and all the points in the training set. The distance $d_i$ between the test sample $x_t$ and a stored training sample $x_{s,i}$ is usually the Euclidean distance, $d_i = \lVert x_t - x_{s,i} \rVert_2$.
- Select the k-nearest points in the training set based on the distance.
- Predict the output value of the data point as the average of the output values of the k nearest points (see the sketch after this list). With $N_k(x_t)$ denoting the indices of the k nearest training samples to a test sample $x_t$, the prediction can be written as:

$$\hat{y}_t = \frac{1}{k} \sum_{i \in N_k(x_t)} y_{s,i}$$
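To make these steps concrete, here is a minimal from-scratch sketch in NumPy; the function name knn_predict and the toy data are illustrative, not from a library:

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 2: Euclidean distance from the query to every training sample
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 3: indices of the k nearest training samples
    nearest = np.argsort(distances)[:k]
    # Step 4: the prediction is the average of the k nearest target values
    return y_train[nearest].mean()

X_train = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
y_train = np.array([1, 2, 3, 4, 5])
print(knn_predict(X_train, y_train, np.array([1, 1]), k=3))  # 2.0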
Neighbors-based regression is a type of lazy learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. The regression result is computed from the k nearest neighbors of each point, as an average or a local linear approximation.
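As an aside on the averaging step, a common variant weights each neighbor by inverse distance so that closer neighbors count more; scikit-learn (used in the Python example below) exposes this through the weights parameter. A brief sketch with illustrative toy data:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([1, 2, 3, 4, 5])

# weights='distance' replaces the plain average with an average
# weighted by inverse distance to the query point
knn = KNeighborsRegressor(n_neighbors=3, weights='distance')
knn.fit(X, y)
print(knn.predict([[1, 1]]))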
Advantages: the algorithm is simple to implement, robust to noisy training data, and effective when the training set is large.
Disadvantages: the value of k must be determined, and the computation cost is high because the algorithm must compute the distance from each query to all of the training samples. A validation loop can be added to determine the number of neighbors, for example with cross-validation as sketched at the end of the Python example below.
k-Nearest Neighbors Regression in Python
Below is an example of how to implement k-NN regression in Python using the scikit-learn library:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
# Assume that we have a training set of data points with input features X and output values y
X = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([1, 2, 3, 4, 5])
# Create a k-NN regressor with k=3
knn = KNeighborsRegressor(n_neighbors=3)
# Fit the regressor to the training data
knn.fit(X, y)
# Predict the output value of a new data point
x_new = np.array([[1, 1]])
y_pred = knn.predict(x_new)
print(y_pred) # Output: [2.]
In this example, we have a training set of 5 data points with input features X and output values y. We create a k-NN regressor with k=3 and fit it to the training data. Then, we use the regressor to predict the output value of a new data point; the prediction is the average of the output values of the 3 nearest data points in the training set. The number of nearest neighbors is the primary hyperparameter for adjusting the performance of the regressor, such as changing to n_neighbors=5:
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, y)
yP = knn.predict(x_new)
print(yP) # Output: [3.]
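As noted in the disadvantages above, the value of k must be determined. A standard approach is to treat n_neighbors as a hyperparameter and search candidate values with cross-validation. Below is a sketch using scikit-learn's GridSearchCV; the synthetic data and the candidate range are illustrative (the 5-point toy set above is too small for a meaningful search):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Illustrative data: noisy samples of an underlying sine function
rng = np.random.default_rng(0)
Xs = rng.uniform(0, 10, size=(100, 1))
ys = np.sin(Xs).ravel() + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation over candidate values of k
params = {'n_neighbors': [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsRegressor(), params, cv=5)
search.fit(Xs, ys)
print(search.best_params_)  # k selected by cross-validation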
A MATLAB Live Script version of this example is also available.
✅ Knowledge Check
1. Which of the following statements best describes how k-Nearest Neighbors (k-NN) regression works?
- Incorrect. k-NN regression considers the output values of the k-nearest data points in the training set to predict the output value of a new data point.
- Incorrect. While the mode (majority vote) of the neighbors is used in k-NN classification, k-NN regression predicts the output value as the average of the output values of the k nearest points.
- Correct. Exactly! k-NN regression predicts the output value by averaging the values of the k-nearest data points.
- Incorrect. Unlike linear regression, k-NN regression does not try to find a best-fit line; it is an instance-based learning method.
2. Which of the following is NOT a disadvantage of the k-NN regression algorithm?
- Correct. One of the advantages of k-NN is its simplicity in implementation, so this is not a disadvantage of the algorithm.
- Incorrect. Determining the optimal value of k can indeed be challenging, so this statement identifies a genuine disadvantage of the k-NN algorithm.
- Incorrect. This is indeed a disadvantage of k-NN. It requires calculating distances between the test sample and every sample in the training dataset.
- Incorrect. k-NN is a type of lazy learning, meaning it does not construct a general internal model but simply stores instances of the training data.