K-Means Clustering
K-means clustering is an unsupervised machine learning algorithm used to identify clusters within a dataset. The goal of K-means clustering is to divide the data into a specified number of clusters (K) in a way that minimizes the variance within each cluster.
The algorithm works by first initializing K centroids, which are randomly selected points within the data. The data points are then assigned to their closest centroid, and the centroids are updated to the mean of the points assigned to them. This process is repeated until the centroids no longer move or the maximum number of iterations is reached.
Mathematically, the K-means clustering algorithm can be described as follows. Let `x_1, x_2,\ldots, x_n` be the data points, where `x_i \in \mathbb{R}^d` is a d-dimensional vector. Let `c_1, c_2,\ldots, c_k` be the centroids, where `c_j \in \mathbb{R}^d` is also a d-dimensional vector. The K-means algorithm works by iteratively updating the centroids and assigning each data point to its closest centroid.
The assignment step is defined as follows:
$$y_i = \arg\min_{j \in \{1, 2, \ldots, k\}} \|x_i - c_j\|^2$$
This step assigns each data point `x_i` to the cluster whose centroid is closest to it, as measured by the Euclidean distance.
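As a minimal sketch of this assignment step in NumPy (assuming `X` is an array of shape (n, d) holding the data points and `C` an array of shape (k, d) holding the centroids; the function name `assign_clusters` is only for illustration):

```python
import numpy as np

def assign_clusters(X, C):
    # Squared Euclidean distance from every point to every centroid,
    # shape (n, k): dists[i, j] = ||x_i - c_j||^2
    dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # y_i = argmin_j ||x_i - c_j||^2
    return dists.argmin(axis=1)
```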
The update step is defined as follows:
$$c_j = \frac{1}{|S_j|} \sum_{i \in S_j} x_i$$
This step updates the centroid of each cluster to be the mean of the data points assigned to it, where `S_j` is the set of indices of the data points assigned to cluster j. The algorithm terminates when the centroids no longer move or when the maximum number of iterations is reached.
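Continuing that sketch, the update step and overall loop might look as follows (reusing the illustrative `assign_clusters` above; as one possible convention, an empty cluster simply keeps its previous centroid):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as K randomly selected data points
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        y = assign_clusters(X, C)                 # assignment step
        C_new = C.copy()
        for j in range(k):
            if np.any(y == j):                    # update step: mean of S_j
                C_new[j] = X[y == j].mean(axis=0)
        moved = np.linalg.norm(C_new - C)
        C = C_new
        if moved < tol:                           # centroids no longer move
            break
    return assign_clusters(X, C), C
```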
The main hyper-parameter is the number of clusters (K) in the dataset. The algorithm then iteratively moves the K centroids and assigns each data point to its closest centroid.
Advantages: K-means is the most common and one of the simplest clustering algorithms.
Disadvantages: The number of clusters must be specified, although it can typically be determined by increasing the number of clusters until the objective function no longer changes significantly, as sketched below.
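As a sketch of that elbow-style check (using scikit-learn's `inertia_`, the within-cluster sum of squares, on an assumed feature array `X`; the helper name `elbow_scores` is illustrative):

```python
from sklearn.cluster import KMeans

def elbow_scores(X, k_max=10):
    # Objective value (within-cluster sum of squares) for K = 1..k_max;
    # choose K near where the curve stops dropping significantly
    return [KMeans(n_clusters=k, n_init=10).fit(X).inertia_
            for k in range(1, k_max + 1)]
```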
from sklearn.cluster import KMeans

# XA: training features, XB: test features, yB: true labels for XB
km = KMeans(n_clusters=2)
km.fit(XA)
yP = km.predict(XB)
# Arbitrary labels with unsupervised clustering may need to be reversed
n = len(yB)  # number of test points
if len(XB[yP!=yB]) > n/4: yP = 1 - yP
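For context, a self-contained version of the snippet above on synthetic data (here `make_blobs` and `train_test_split` are used only to create the assumed `XA`, `XB`, and `yB`; this setup is illustrative, not part of the original example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Two well-separated groups so cluster labels correspond to true labels
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
XA, XB, yA, yB = train_test_split(X, y, test_size=0.5, random_state=0)

km = KMeans(n_clusters=2, n_init=10)
km.fit(XA)
yP = km.predict(XB)

# Cluster labels are arbitrary; flip them if most predictions disagree
n = len(yB)
if np.sum(yP != yB) > n / 4:
    yP = 1 - yP
print(f"accuracy: {np.mean(yP == yB):.2f}")
```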
✅ Knowledge Check
1. In the K-means clustering algorithm, how are the initial centroids determined?
- Incorrect. The initial centroids are randomly selected points within the data.
- Incorrect. The initial centroids are not necessarily the first K data points. They are randomly selected.
- Correct. The initial centroids are indeed randomly selected from the data points.
- Incorrect. While there are various methods to initialize centroids, in the basic K-means description, they are randomly chosen.
2. How does the K-means clustering algorithm handle assigning data points to clusters?
- Incorrect. K-means assigns data points based on the minimum Euclidean distance, not the median distance.
- Incorrect. It assigns based on the minimum, not the maximum, Euclidean distance.
- Incorrect. Each data point is assigned to only one cluster based on the closest centroid.
- Correct. Exactly. The data point is assigned to the cluster with the centroid that is the closest in terms of Euclidean distance.