k-Means Vignette

Jeremy Werner

2019-04-07

k-means is a fairly simple algorithm that iteratively cycles through data of dimension \(n \times p\), assigning each point a cluster label based on its distance from the centroids at each step. The centroids can optionally be plotted; they are indicated by oversized icons of the same shape and color as the points they label.
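Formally, given points \(x_1, \dots, x_n \in \mathbb{R}^p\), the algorithm seeks labels \(\ell_i \in \{1, \dots, k\}\) and centroids \(\mu_1, \dots, \mu_k\) that (locally) minimize the within-cluster sum of squared Euclidean distances

\[
\sum_{j=1}^{k} \sum_{i \,:\, \ell_i = j} \lVert x_i - \mu_j \rVert^2,
\]

where each \(\mu_j\) is the mean of the points currently labeled \(j\).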

Consider the following examples that utilize the iris data set.

library(ArtisanalMachineLearning)
data(iris)

Example 1

Sepal Width and Petal Length from the iris data set are grouped into \(k = 2\) clusters.

data = iris[,2:3]
head(data)
##   Sepal.Width Petal.Length
## 1         3.5          1.4
## 2         3.0          1.4
## 3         3.2          1.3
## 4         3.1          1.5
## 5         3.6          1.4
## 6         3.9          1.7

Step 1

Points are randomly assigned to clusters and the initial centroids are calculated.

k_means = aml_k_means(data, k = 2, maximum_iterations = 0)
## Warning in .run_k_means_until_convergence(data, k, maximum_iterations):
## Convergence was not met, stopped at iteration 0
plot(k_means, plot_centroids = TRUE)
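For intuition, this initialization step can be sketched in a few lines of base R. This is illustrative only; the helper names are not part of AML and this is not the package's implementation.

# Illustrative sketch of the initialization step (not AML's internals)
initialize_labels = function(data, k) {
    sample(seq_len(k), nrow(data), replace = TRUE)  # one random label per point
}

compute_centroids = function(data, labels) {
    # Mean of the points carrying each observed label, one row per cluster
    as.matrix(aggregate(data, by = list(labels), FUN = mean)[, -1])
}

labels = initialize_labels(data, k = 2)
centroids = compute_centroids(data, labels)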

Step 2

Points are reassigned the label of the closest centroid by Euclidean distance.

k_means = aml_k_means(data, k = 2, maximum_iterations = 1)
## Warning in .run_k_means_until_convergence(data, k, maximum_iterations):
## Convergence was not met, stopped at iteration 1
plot(k_means, plot_centroids = TRUE)
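Sketched in the same spirit (again illustrative helpers, not AML's internals), the reassignment step computes every point's squared Euclidean distance to each centroid and takes the nearest:

# Illustrative sketch of the reassignment step (not AML's internals)
assign_to_nearest = function(data, centroids) {
    X = as.matrix(data)
    distances = sapply(seq_len(nrow(centroids)), function(j) {
        rowSums(sweep(X, 2, centroids[j, ])^2)  # squared distance to centroid j
    })
    apply(distances, 1, which.min)  # index of the closest centroid
}

labels = assign_to_nearest(data, centroids)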

Step 3

This is repeated until the maximum number of iterations is reached or the algorithm converges, i.e., no point changes its label between iterations.

k_means = aml_k_means(data, k = 2)
plot(k_means, plot_centroids = TRUE)

Convergence was achieved in the following number of iterations.

k_means$iterations
## [1] 4
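Combining the helpers sketched above gives a rough picture of the whole procedure. This is a sketch of the idea, not the package's implementation.

# Sketch of the full loop using the helpers above (not AML's internals)
run_k_means = function(data, k, maximum_iterations = 100) {
    labels = initialize_labels(data, k)
    for (iteration in seq_len(maximum_iterations)) {
        centroids = compute_centroids(data, labels)
        new_labels = assign_to_nearest(data, centroids)
        if (identical(new_labels, labels)) {
            # No label changed between iterations: converged
            return(list(labels = labels, centroids = centroids,
                        iterations = iteration))
        }
        labels = new_labels
    }
    warning("Convergence was not met")
    list(labels = labels, centroids = centroids, iterations = maximum_iterations)
}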

Example 2

Consider a data set with more than 2 dimensions. The same Euclidean-distance-based relabeling applies, but visualization requires dimensionality reduction, so the first two principal components are displayed.

data = iris[,1:4]
head(data)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
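The projection mentioned above amounts to roughly the following. This is a sketch; AML's exact centering and scaling choices are assumptions here.

# Sketch of the 2-D PCA projection used for display (AML's settings may differ)
pca = prcomp(data, center = TRUE, scale. = TRUE)
head(pca$x[, 1:2])  # the two coordinates the cluster plot would draw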

Run algorithm

k_means = aml_k_means(data, k = 5)
plot(k_means, plot_centroids = TRUE)

Even though we requested 5 clusters, only 3 were returned. This is not an error: a cluster is eliminated whenever no points are closest to its centroid during an iteration, so it receives no members.
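One quick way to confirm how many clusters survived, assuming the fitted object exposes its assignments (labels here is a hypothetical element name, not documented AML API):

# Count points per surviving cluster (`labels` is a hypothetical element name)
table(k_means$labels)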

Also, keep in mind that the displayed data has been reduced via principal component analysis, so seemingly mislabeled points are not errors. AML also supports t-SNE dimensionality reduction for potentially better viewing of clusters.

k_means = aml_k_means(data, k = 5)
plot(k_means, plot_centroids = TRUE, tsne_reduction = TRUE)
## Warning: In prcomp.default(X, retx = TRUE, center = pca_center, scale. = pca_scale, 
##     rank. = initial_dims) :
##  extra argument 'rank.' will be disregarded
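For comparison, a standalone t-SNE embedding with the Rtsne package looks roughly like this. This is a sketch; AML's tsne_reduction option likely uses its own settings.

# Sketch of a standalone t-SNE reduction (AML's settings may differ)
library(Rtsne)
# iris contains duplicate rows, which Rtsne rejects by default
tsne = Rtsne(as.matrix(data), dims = 2, perplexity = 30,
             check_duplicates = FALSE)
plot(tsne$Y, xlab = "t-SNE 1", ylab = "t-SNE 2")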