Cluster validity: Existing indices

Thursday, November 23, 2006

The third - and final - post on cluster validity is about existing validity indices. As written in (1), the two fundamental issues in cluster validity are 1) the number of clusters present in the data and 2) how good the clustering itself is.

Several indices have been proposed in the literature. The main idea is to plot an index against the number of clusters and then analyze the plot. The Dunn index (2) combines the dissimilarity between clusters with their diameters to estimate the most reliable number of clusters. It is computationally expensive and sensitive to noise. The Silhouette index (3) uses the average dissimilarity between points to show the structure of the data and, consequently, its possible clusters. It is only suitable for estimating the first choice or best partition. The Davies-Bouldin index (4) is computed from the dispersion of each cluster and the dissimilarity between clusters. According to (5), the Davies-Bouldin index is among the best indices.

Silhouette, Dunn and Davies-Bouldin indices require the definition of at least two clusters. Finally, I want to point out that several other indices exist in the literature. Some are computationally expensive while others are unable to discover the real number of clusters in certain datasets (5).
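
To make the three indices more concrete, here is a minimal sketch in plain Python. It is not taken from the cited papers: it assumes Euclidean distance, the common textbook forms of each index (complete diameter and single-linkage separation for Dunn, centroid-based scatter for Davies-Bouldin, mean silhouette width for Silhouette), and the function names and toy data are only illustrative.

```python
import math
from collections import defaultdict

def group_by_label(points, labels):
    """Split points into clusters; all three indices need at least two."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("validity indices need at least two clusters")
    return list(clusters.values())

def dunn_index(points, labels):
    """Smallest between-cluster distance / largest diameter (higher is better)."""
    clusters = group_by_label(points, labels)
    max_diameter = max(math.dist(p, q) for c in clusters for p in c for q in c)
    min_separation = min(
        math.dist(p, q)
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
        for p in a
        for q in b
    )
    # Assumption: if every cluster is a single point, treat the index as infinite.
    return min_separation / max_diameter if max_diameter > 0 else math.inf

def davies_bouldin_index(points, labels):
    """Mean worst-case (scatter_i + scatter_j) / centroid distance (lower is better)."""
    clusters = group_by_label(points, labels)
    centers = [[sum(x) / len(c) for x in zip(*c)] for c in clusters]
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centers)
    ]
    k = len(clusters)
    # Note: this sketch does not guard against two identical centroids.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k

def silhouette_index(points, labels):
    """Mean silhouette width over all points, in [-1, 1] (higher is better)."""
    clusters = group_by_label(points, labels)
    widths = []
    for i, own in enumerate(clusters):
        for p in own:
            if len(own) == 1:
                widths.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other points of the same cluster
            a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
            # b: mean distance to the nearest other cluster
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != i
            )
            denom = max(a, b)
            widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)

# Toy data (made up): two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible structure
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

for name, index in [("Dunn", dunn_index),
                    ("Davies-Bouldin", davies_bouldin_index),
                    ("Silhouette", silhouette_index)]:
    print(name, index(points, good_labels), index(points, bad_labels))
```

In practice one would run the clustering algorithm for several numbers of clusters, compute an index for each resulting partition, and look at the plot: a high Dunn or Silhouette value, or a low Davies-Bouldin value, suggests a good partition. Note that the Dunn computation above compares all pairs of points, which illustrates why it is expensive on large datasets.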

(1) U. Maulik and S. Bandyopadhyay. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell., 24(12):1650-1654, 2002.
(2) J.C. Dunn. Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4:95-104, 1974.
(3) L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.
(4) D.L. Davies and D.W. Bouldin. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell., 1(2):224-227, 1979.
(5) M. Kim and R.S. Ramakrishna. New indices for cluster validity assessment. Pattern Recogn. Lett., 26(15):2353-2363, 2005.

9 comments:

jackie said...: hello, i heard that there is a clustering validity index called the overall mean inner cluster similarity and the overall mean inter cluster similarity, but i don't know how to compute it. do you know it? can you tell me?; 10:00 AM
Sandro Saitta said...: Hi Jackie,

I would be glad to help you if you give me some more details about the validity index you're talking about.
Do you have an author name, a date?; 10:38 PM
jackie said...: oh, yes, i know it's from the book: Finding groups in data. an introduction to cluster analysis, but i can't find the source. so i would appreciate it if you can help me. it's urgent, thanks!; 9:21 AM
Sandro Saitta said...: The book you mention is from Kaufman and Rousseeuw. They have developed the Silhouette width. I don't know if this is the validity index you're talking about. I will see if I can get the book to have a look inside it.; 9:44 AM
jackie said...: thank you for help. i just want to know how to compute the index of "the overall mean inner cluster similarity" and "the overall mean inter cluster similarity". by the way, i'm doing some work on clustering the web logs right now, do you know any validity index except the recommendation accuracy because i don't want to do recommendation right now. thanks again!; 10:19 AM
Sandro Saitta said...: Sorry, I cannot find any information on "overall mean inner cluster similarity". I have looked in the book you mentioned, but was not able to find anything. I have also looked on Google, and a search for "overall mean inner cluster similarity" gives only two hits (one of them is this blog). That is quite low for a validity index, so I think it must have a different name in the literature.

My advice is to use another validity index such as Silhouette or Davies-Bouldin. The Dunn index is not computationally efficient. Many other validity indices exist. I think Silhouette is a good first try. For a very short introduction (enough to implement it), you can read this paper: "Cluster Validation Techniques for Genome Expression Data". You can find it using Google.; 12:43 PM
jackie said...: hi, merry christmas! thank you for your advice. i'm trying,....i'll let you know as soon as i have the result.; 1:00 PM
Anonymous said...: hi there, does anyone know the SDbw index proposed by Halkidi et al.? I've implemented the SDbw index, but my implementation of the index does not minimise with the optimal number of clusters. Any code or a link to source code of this index will be appreciated. Thx.; 7:53 PM
Sandro Saitta said...: Hello,

I'm not sure, but I think you can find a Matlab version of this code on Mathworks. You should look for codes using "clustering", "cluster index", "cluster indices" as search terms.

Hope it helps.; 2:31 PM