Data Mining Research - Cluster validity: Introduction to clustering

Tuesday, November 21, 2006

Cluster validity: Introduction to clustering

In the near future, I will use this blog to write about the recent research I'm involved in. I start today, and will continue over the following days, with an introduction to the topic I'm interested in: cluster validity.

Clustering is certainly the best-known example of unsupervised learning. The goal of clustering is to group data points that are similar according to a given similarity or distance measure (Euclidean distance is a common default). As Jain et al. write in (1), "clustering is a subjective process [...] This subjectivity makes the process of clustering difficult". Clustering techniques have been applied in various domains such as text mining, color image segmentation, sensory time series, information exploration and automatic counting in video sequences. In these domains, the number of clusters is usually not known in advance.
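To make "grouping by Euclidean distance" concrete, here is a minimal sketch in Python. The data points, the two centres and the helper name `nearest_centre` are made up for illustration; a real clustering algorithm would also have to find the centres itself.

```python
import math

def nearest_centre(point, centres):
    """Return the index of the centre closest to `point` (Euclidean distance)."""
    return min(range(len(centres)), key=lambda j: math.dist(point, centres[j]))

# Hypothetical 2-D data: two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Hypothetical centres, chosen by eye rather than learned.
centres = [(1.0, 1.5), (8.5, 8.5)]

# Each point joins the group of its nearest centre.
labels = [nearest_centre(p, centres) for p in points]
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Changing the distance function changes which points count as "similar", which is one place where the subjectivity mentioned by Jain et al. comes in.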

One goal of cluster validity is to estimate the most reliable number of clusters in a dataset. Before going into more detail about cluster validity, the next post will focus on clustering techniques.
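As a rough illustration of estimating the number of clusters, the sketch below runs a small k-means for several values of k and keeps the k with the highest mean silhouette width. This post does not name a validity index, so the silhouette is only one possible choice, picked here as an assumption. The toy data, the deterministic farthest-first seeding and all function names are likewise illustrative.

```python
import math

def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then keep adding the
    point that is farthest from the seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds

def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centres = farthest_first(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centre under Euclidean distance.
        labels = [min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points]
        # Update step: move each centre to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(centres[j])  # keep an empty cluster's centre
        if new_centres == centres:
            break
        centres = new_centres
    return labels

def silhouette(points, labels):
    """Mean silhouette width (needs at least two clusters).
    Points in singleton clusters contribute 0 by convention."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical data: three tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (5, 9), (5, 10), (6, 9)]

scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

On data this clean almost any index would agree. On real data, different indices can prefer different values of k.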

(1) A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.



Will Dwinnell said...

I'm curious as to whether you've investigated k-harmonic means (KHM) clustering? Some authors claim that KHM produces "better" clusters, though I suspect that this translates to clusters which better satisfy those authors' preferred measure of validity.


Sandro Saitta said...

Thanks for the references, I will have a look at them as I don't know KHM.
