k-Means Clustering

In this lesson, we'll talk about the cluster analysis algorithm k-means clustering. By the end of this lesson, you should be able to identify the cluster centroid, describe the steps in the k-means method, and explain what the k in k-means stands for.

Cluster analysis is a technique that divides the samples in a data set into groups, or clusters. The goal of cluster analysis is to group similar items in the same cluster, where "similar" is defined by some metric that measures the similarity between data samples. The idea is to divide the data samples such that samples within a cluster are as close together as possible, and samples from different clusters are as far apart as possible.

k-Means is a classic algorithm used for cluster analysis, and it is very simple. The first step is to select k initial centroids. A centroid is simply the center of a cluster, as you see in the diagram here. Second, assign each sample in the data set to the centroid closest to it. In other words, you measure the distance between the sample and each cluster's centroid, then place the sample in the cluster with the closest centroid. Next, recalculate each centroid as the mean, or average, of the samples in its cluster. These two steps, assignment and centroid recalculation, are repeated until a stopping criterion is reached.

Below is an example showing how k-means works. Panel a in the diagram shows the original data set with some samples. In b, initial centroids are randomly selected. Panel c shows the first iteration: samples are assigned to the closest centroid. In d, the centroids are recalculated. Panel e shows the second iteration, where samples are again assigned to the closest centroid; notice that some samples changed cluster assignments during this step. In f, the centroids are recalculated again. Cluster assignment and centroid recalculation are repeated until some stopping criterion is reached, and you get your final clusters, as shown in f.

But how are the initial centroids chosen?
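Before we get to that question, the assign-and-recompute loop described above can be sketched in plain Python. This is a minimal illustration on 2-D points, not a production implementation; the function and variable names are my own:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means on 2-D points: assign, recompute, repeat."""
    rng = random.Random(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each sample to its closest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centroids, clusters = kmeans(points, k=2)
```

Here the stopping criterion is "centroids stopped moving," capped at `max_iters`; other common choices are a maximum number of iterations alone or a minimum change in error between iterations.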
The problem is that the initial centroids can affect the eventual clusters. As a result, the cluster results obtained with one set of initial centroids may not be the same as those obtained with a different set of starting centroids. There are numerous methods, with varying degrees of sophistication, for choosing the initial centroids for k-means. The simplest and most popular strategy is to run k-means on your data set repeatedly, each time with a different set of randomly chosen initial centroids. From there, you keep the run that produces the best clustering results.

A measure of error called the Within-Cluster Sum of Squared Error (WSSE) can be used to assess the cluster results. The error associated with a sample in a cluster is the distance between that sample and the cluster centroid; the square of that distance is the sample's squared error. To determine the squared error for a cluster, we add up the squared errors for all samples in that cluster. We then repeat this for all clusters to obtain the final Within-Cluster Sum of Squared Error for all clusters in the results of a cluster analysis run. When comparing two clustering results, the one with the smaller WSSE is the better numerical answer.
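As a sketch of that computation (the four points and the two candidate clusterings below are made up for illustration, and the `wsse` helper is my own name):

```python
def wsse(clusters, centroids):
    """Within-Cluster Sum of Squared Error: for each sample, square its
    distance to its cluster's centroid, then sum over all samples in the
    cluster, then over all clusters."""
    total = 0.0
    for cluster, (cx, cy) in zip(clusters, centroids):
        for (px, py) in cluster:
            total += (px - cx) ** 2 + (py - cy) ** 2
    return total

# Two hypothetical k = 2 clusterings of the same four points.
clusters_a = [[(0.0, 0.0), (1.0, 0.0)], [(4.0, 0.0), (5.0, 0.0)]]
centroids_a = [(0.5, 0.0), (4.5, 0.0)]

clusters_b = [[(0.0, 0.0)], [(1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]]
centroids_b = [(0.0, 0.0), (10.0 / 3.0, 0.0)]

print(wsse(clusters_a, centroids_a))  # 1.0 — tighter clusters, smaller error
print(wsse(clusters_b, centroids_b))  # about 8.67
```

Running k-means several times with different random initial centroids and keeping the run with the lowest WSSE is exactly the strategy described above.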
However, as we've discussed before, there's no ground truth to mathematically determine which set of clusters is more correct than another. In addition, note that increasing the number of clusters, that is, the value of k, always reduces the WSSE, so WSSE should be used with caution. It only makes sense to use WSSE to compare two sets of clusters that were generated from the same data set with the same value of k.

Also, the set of clusters with the smallest WSSE may not always be the best solution for the application at hand. Again, interpretation and domain knowledge about what the clusters should represent and how they will be used are crucial in determining which clustering results are best.