Data Mining Research

Monday, March 26, 2007

Combining PCA and K-means

Although widely used in practice, K-means has several drawbacks. The number of clusters has to be chosen in advance, and the result depends on the initial centroid locations. More details on how to handle these issues can be found on Data Mining Research (search for "clustering" in the upper bar).

Another weakness, common to clustering in general, concerns the visualization of the obtained clusters: data with many variables cannot be plotted directly. A possible solution is to preprocess the data using PCA (1). First, PCA is applied to the data, and the data is projected onto the leading principal components. Then the k-means algorithm is applied to the data in this new feature space. The aim is to make the different clusters easier to distinguish. The following picture shows the difference between plotting the data using two randomly chosen parameters and using the first two principal components.
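The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the picture: it relies on NumPy, the function names pca_project and kmeans are made up for this example, the data is synthetic (three Gaussian blobs in five dimensions), and the k-means is a plain Lloyd's iteration with random starting points.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the leading principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; the result depends on the random start."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic data (an assumption for this sketch): three blobs in 5 dimensions.
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0, 0, 0],
                    [10, 10, 0, 0, 0],
                    [0, 10, 10, 0, 0]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 5)) for c in centers])

Z = pca_project(X, n_components=2)   # step 1: map data to the PCA space
labels, centroids = kmeans(Z, k=3)   # step 2: cluster in that space
```

Plotting the two columns of Z colored by labels gives the kind of two-dimensional view discussed above. Because the starting centroids are random, a different seed can lead to a different partition, so in practice it is common to run k-means several times and keep the best result.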

(1) I.T. Jolliffe. Principal Component Analysis. Springer, 2002.