Data Mining Research - dataminingblog.com: k-means

Showing posts with label k-means.

Thursday, August 16, 2007

MLDM 2007: Anil K. Jain's presentation on clustering

As written in the previous post, Anil K. Jain was the invited speaker at MLDM 2007. He gave an interesting presentation about clustering, focusing on the user's dilemma. He started with a comprehensive introduction to clustering and then presented some of the future work he is involved in: semi-supervised clustering and clustering with co-association. Below is the abstract of his presentation:

Data clustering is a long standing research problem in pattern recognition, computer vision, machine learning, and data mining with applications in a number of diverse disciplines. The goal is to partition a set of n d-dimensional points into k clusters, where k may or may not be known. Most clustering techniques require the definition of a similarity measure between patterns, which is not easy to specify in the absence of any prior knowledge about cluster shapes. While a large number of clustering algorithms exist, there is no optimal algorithm. Each clustering algorithm imposes a specific structure on the data and has its own approach for estimating the number of clusters. No single algorithm can adequately handle various cluster shapes and structures that are encountered in practice. Instead of spending our effort in devising yet another clustering algorithm, there is a need to build upon the existing published techniques. In this talk we will address the following problems: (i) clustering via evidence accumulation, (ii) simultaneous clustering and dimensionality reduction, (iii) clustering under pair-wise constraints, and (iv) clustering with relevance feedback. Experimental results show that these approaches are promising in identifying arbitrary shaped clusters in multidimensional data.
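
As a side note (my formalization, not part of the abstract): the partitioning goal described above is usually written as the K-means objective, i.e., finding the partition C_1, ..., C_k of the n points that minimizes the within-cluster sum of squared distances,

\min_{C_1, \dots, C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i,

where \mu_j denotes the centroid of cluster C_j.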

He made some interesting remarks during his talk. I have noted three of them:

  • K-means was independently invented in 1955, 1957, 1965 and 1967 (!)
  • In a good feature space, any simple clustering algorithm will work
  • A clustering method is not the same as a clustering algorithm (an algorithm is an implementation of a particular method)
If you are interested, you can find more information related to his work.


Monday, March 26, 2007

Combining PCA and K-means

Although often used in practice, K-means has several drawbacks: the number of clusters has to be defined in advance, and the algorithm is sensitive to the starting centroid locations. More details on how to handle these issues can be found on Data Mining Research (search for clustering in the upper bar).

A weakness common to clustering in general concerns the visualization of the obtained clusters. A possible solution is to preprocess the data using PCA (1). First, the PCA procedure is applied to the data. Using the principal components, the data is mapped into the new feature space. Then, the K-means algorithm is applied to the data in this feature space. The final objective is to distinguish the different clusters more easily. The following picture shows the difference between plotting the data with two random parameters and with the first two principal components.
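
Below is a minimal sketch of this procedure, including the two plots mentioned above. The use of scikit-learn and of synthetic data is my assumption; the original post does not mention any particular library.

# Sketch of the PCA + K-means procedure described above.
# scikit-learn and the synthetic data are assumptions of this sketch.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic data: 3 clusters in a 10-dimensional space
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

# Step 1: map the data onto the first two principal components
X_pca = PCA(n_components=2).fit_transform(X)

# Step 2: apply K-means in the reduced feature space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)

# Compare: two arbitrary original parameters vs. the first two principal components
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=labels)
ax1.set_title("Two original parameters")
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
ax2.set_title("First two principal components")
plt.show()

Note that K-means could also be run in the original space, with the PCA projection used only for plotting; the two choices generally lead to different clusterings.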



(1) I.T. Jolliffe. Principal Component Analysis. Springer, 2002.


Tuesday, March 06, 2007

K-means: random starting centroids

One of the most common techniques for clustering is K-means (1). I have already written a few words about clustering algorithms on this blog.

The main drawbacks of K-means are certainly its restriction to numeric attributes, the unknown number of clusters K and the random starting centroid locations. The paper by Huang (2) proposes a possible solution to the first: an extension of K-means to categorical values. For the second, cluster validation techniques can be used to find a reliable value for K.

In this post, I want to discuss the third one. In K-means, the K initial centroids are chosen randomly. Therefore, running the algorithm P times can result in P different clusterings of the same data. A possible strategy for avoiding this problem is multiple runs: the K-means procedure is applied P times to the data, and each run is evaluated using a cluster validity index. The run with the maximum (or minimum, depending on the validity index) value is then chosen as the clustering result.
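
A minimal sketch of this multiple-runs strategy is given below. The silhouette index (higher is better) and scikit-learn are my assumptions; the post does not prescribe a particular validity index or implementation.

# Multiple-runs strategy: run K-means P times from random starting
# centroids and keep the run with the best validity index.
# The silhouette index is an assumption; any validity index would do.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

P, K = 10, 4
best_labels, best_score = None, -2.0  # silhouette lies in [-1, 1]
for p in range(P):
    # n_init=1 so that each run uses a single random initialization
    km = KMeans(n_clusters=K, init="random", n_init=1, random_state=p)
    labels = km.fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_labels, best_score = labels, score

print(f"Best silhouette over {P} runs: {best_score:.3f}")

Incidentally, scikit-learn's KMeans implements a related strategy through its n_init parameter, which keeps the run with the lowest within-cluster sum of squares.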

(1) A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
(2) Z. Huang. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery, 2(3):283-304, 1998.


Wednesday, November 22, 2006

Cluster validity: Clustering algorithms

Now that the clustering ideas have been introduced, let's look at existing clustering strategies. Several clustering techniques can be found in the literature. They can be divided into four main categories (1): partitional clustering (K-means, etc.), hierarchical clustering (BIRCH, etc.), density-based clustering (DBSCAN, etc.) and grid-based clustering (STING, etc.). In the literature, clustering can also be found under different expressions such as unsupervised learning, numerical taxonomy and partition (2).
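
As a pointer (the library is my assumption; the post does not reference any implementation), representatives of three of these four categories are readily available in scikit-learn:

# Representatives of three of the four categories, as implemented in
# scikit-learn (an assumption; the post does not name a library).
# STING (grid-based) has no scikit-learn implementation.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, Birch, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

partitional  = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hierarchical = Birch(n_clusters=2).fit_predict(X)
density      = DBSCAN(eps=0.3).fit_predict(X)  # recovers the two moons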

One of the most common techniques for clustering is K-means (3). The main reasons can be found in the drawbacks of the other categories (even if K-means has drawbacks of its own). Hierarchical clustering, for example, usually has a higher complexity, typically O(n^2). Density-based clustering algorithms often have non-intuitive parameters to tune. Finally, grid-based clustering algorithms do not always give clusters of good quality (1).

The main advantages of K-means are its computational efficiency and the simplicity of interpreting its results. Bolshakova and Azuaje (4) consider K-means to be the most widely used clustering algorithm in practice. This last point is a good indicator of its usefulness in real-life situations. The main drawbacks of K-means are certainly the random centroid locations and the unknown number of clusters K. This number has to be known in advance and is an input of the standard K-means algorithm. That is where cluster validity enters the game, and this is the topic of the next post.

(1) M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. J. of Intelligent Information Systems, 17(2-3):107-145, 2001.
(2) S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1999.
(3) A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
(4) N. Bolshakova and F. Azuaje. Cluster validation techniques for genome expression data. Signal Process., 83(4):825-833, 2003.

 