
Data Mining Research - dataminingblog.com: November 2006


Thursday, November 30, 2006

Data mining people: Heikki Mannila

Here is a new post about data mining people. Today, Heikki Mannila is introduced. He holds a Ph.D. in computer science from the University of Helsinki. He has worked for companies such as Microsoft and Nokia, and was also a research director at the Helsinki Institute for Information Technology. He is currently an academy professor.

Heikki Mannila is well known in the data mining community due to his book Principles of Data Mining, written with David Hand and Padhraic Smyth. This book covers many topics in data mining and is a very good introduction to this field.

His research fields include algorithms, databases and data mining, which he applies in areas such as computational biology, paleontology, linguistics and ubiquitous computing. This information was mainly found on Mannila's webpage.


Wednesday, November 29, 2006

Data mining explained

On Devipriya's blog, there is a very interesting and complete introduction to data mining named "Who is mining your data?". This clearly written introduction is mainly intended for people who want to know what motivates data mining and what its possible applications are. Only a minimum of technical terms is used, so that any reader can understand what data mining is about.


Monday, November 27, 2006

Juice Analytics' Blog

It's always a pleasure for me to find interesting blogs about data mining and to present them here. Juice Analytics is a company that... well, let them define what they do in their own words: "Juice Analytics helps small and mid-market companies develop deep prospect and customer understanding through visualization and analytics of existing data". They have a weblog which is a mine of useful information about data and all processes related to data. Their posts clearly reflect their experience as well as their strong engagement with real problems coming from industry. I warmly recommend this blog.


Friday, November 24, 2006

Now boarding!

Here is some food for the weekend:

  • Will explains a good alternative to the standard Euclidean distance, the Mahalanobis distance, on his blog
  • Andy writes about the fact that Google seems to be starting to integrate blog posts into its results (pointed out by Matthew)
By the way, I would like to thank Joël Arnold for the nice drawing he made for me (picture on the right).
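For readers curious about the idea Will discusses, here is a minimal sketch (with made-up, randomly generated data, not taken from his post) of how the Mahalanobis distance differs from the Euclidean one by taking the covariance of the data into account:

```python
import numpy as np

# Illustrative 2-D data with correlated features (assumed values).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 1.2], [1.2, 1.0]], size=500)

def mahalanobis(x, mean, cov_inv):
    """sqrt((x - m)^T S^-1 (x - m)): distance of x from a distribution
    with mean m and inverse covariance matrix S^-1."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

point = np.array([1.0, 1.0])
print("Euclidean:  ", float(np.linalg.norm(point - mean)))
print("Mahalanobis:", mahalanobis(point, mean, cov_inv))
```

Because the two features are correlated, the Mahalanobis distance of a point along the main direction of spread is smaller than its raw Euclidean distance would suggest.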


Thursday, November 23, 2006

Cluster validity: Existing indices

The third - and final - post on cluster validity is about existing validity indices. As written in (1), the two fundamental issues in cluster validity are 1) determining the number of clusters present in the data and 2) assessing how good the clustering itself is.

Several indices have been proposed in the literature. The main idea is to plot them against the number of clusters and then analyze the plot. The Dunn index (2) combines the dissimilarity between clusters and their diameters to estimate the most reliable number of clusters. It is computationally expensive and sensitive to noise. The Silhouette index (3) uses the average dissimilarity between points to show the structure of the data and, consequently, its possible clusters. The Silhouette index is only suitable for estimating the first choice or best partition. The concepts of dispersion of a cluster and dissimilarity between clusters are used to compute the Davies-Bouldin index (4). According to (5), the Davies-Bouldin index is among the best indices.

The Silhouette, Dunn and Davies-Bouldin indices all require the definition of at least two clusters. Finally, I want to point out that several other indices exist in the literature. Some are computationally expensive, while others are unable to discover the real number of clusters in certain datasets (5).
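To make one of these indices concrete, here is a small sketch of the Silhouette index from (3) on made-up toy data (the points and values are invented for the example; a real implementation should also handle singleton clusters, which this sketch does not):

```python
import numpy as np

def silhouette(X, labels):
    """Average silhouette: (b - a) / max(a, b) per point, where a is the
    mean intra-cluster distance and b the mean distance to the nearest
    other cluster. Assumes every cluster has at least two points."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distance matrix
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    uniq = np.unique(labels)
    scores = []
    for i in range(len(X)):
        own = labels[i]
        same = (labels == own) & (np.arange(len(X)) != i)
        a = D[i][same].mean()
        b = min(D[i][labels == c].mean() for c in uniq if c != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated toy clusters: the silhouette should be close to 1.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(silhouette(X, labels))
```

Plotting this value for partitions with different numbers of clusters, and picking the maximum, is the typical way such an index is used.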

(1) U. Maulik and S. Bandyopadhyay. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell., 24(12):1650-1654, 2002.
(2) J.C. Dunn. Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4:95-104, 1974.
(3) L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.
(4) D.L. Davies and W. Bouldin. A cluster separation measure. IEEE PAMI, 1:224-227, 1979.
(5) M. Kim and R.S. Ramakrishna. New indices for cluster validity assessment. Pattern Recogn. Lett., 26(15):2353-2363, 2005.


Wednesday, November 22, 2006

Cluster validity: Clustering algorithms

Now that the clustering ideas have been introduced, let's look at existing clustering strategies. Several clustering techniques can be found in the literature. They can be divided into four main categories (1): partitional clustering (K-means, etc.), hierarchical clustering (BIRCH, etc.), density-based clustering (DBSCAN, etc.) and grid-based clustering (STING, etc.). In the literature, clustering also appears under different names, such as unsupervised learning, numerical taxonomy and partitioning (2).

One of the most common techniques for clustering is K-means (3). The main reasons for its popularity can be found in the drawbacks of the other categories (even if K-means has drawbacks of its own). Hierarchical clustering, for example, usually has a higher complexity, such as O(n^2). Density-based clustering algorithms often have non-intuitive parameters to tune. Finally, grid-based clustering algorithms do not always give clusters of good quality (1).

The main advantages of K-means are its computational efficiency and the ease of interpreting its results. Bolshakova and Azuaje (4) consider K-means to be the most widely used clustering algorithm in practice. This last point is a good indicator of its efficiency in real-life situations. The main drawbacks of K-means are certainly the random initial centroid locations and the unknown number of clusters K. This number has to be known in advance and is an input to the standard K-means algorithm. That's where cluster validity enters the game. And that is for the next post.
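To make the two drawbacks tangible, here is a minimal sketch of the standard K-means (Lloyd's) algorithm on made-up data; note that both K and the random initial centroids are explicit inputs:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Standard K-means (Lloyd's algorithm). Both K and the random
    initial centroid locations are inputs, as discussed above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its points
        # (keep the old centroid if a cluster became empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two obvious, made-up groups of points.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
              [10., 10.], [10., 11.], [11., 10.], [11., 11.]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

Running it with the wrong K, or with an unlucky random initialization, gives a worse partition with the same data, which is exactly why validity indices are needed.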

(1) M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. J. of Intelligent Information Systems, 17(2-3):107-145, 2001.
(2) S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1999.
(3) A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
(4) N. Bolshakova and F. Azuaje. Cluster validation techniques for genome expression data. Signal Process., 83(4):825-833, 2003.


Tuesday, November 21, 2006

Cluster validity: Introduction to clustering

In the near future, I will use this blog to write about recent research I'm involved in. I start today (and over the following days) with an introduction to the topic I'm interested in: cluster validity.

Clustering is certainly the best known example of unsupervised learning. The goal of clustering is to group data points that are similar according to a given similarity metric (the Euclidean distance by default). As Jain et al. write in (1), "clustering is a subjective process [...] This subjectivity makes the process of clustering difficult". Clustering techniques have been applied in various domains such as text mining, color image segmentation, sensory time series, information exploration and automatic counting in video sequences. In these domains, the number of clusters is usually not known in advance.

One goal of cluster validity is to estimate the most reliable number of clusters in a dataset. Before going into more detail about cluster validity, the next post will focus on clustering techniques.
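As a tiny illustration of the default metric mentioned above, the sketch below (with made-up points) computes the Euclidean distances that decide which points count as "similar":

```python
import numpy as np

# Three made-up points: a and b should end up in the same cluster,
# c in another, because of their Euclidean distances.
a, b, c = np.array([1.0, 2.0]), np.array([1.5, 1.8]), np.array([8.0, 9.0])

def euclidean(p, q):
    """Euclidean distance, the default similarity metric in clustering."""
    return float(np.sqrt(((p - q) ** 2).sum()))

print(euclidean(a, b))  # small distance: a and b are similar
print(euclidean(a, c))  # large distance: a and c are dissimilar
```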

(1) A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.


Monday, November 20, 2006

Google #1 in 2007

If you thought, like me, that Google was the most visited website in the world, then you're wrong. At the moment, the most visited website is Yahoo!, with 130 million visitors a month. However, this will change in 2007 according to a MarketWatch article (citing Citigroup). According to these predictions, Google will be the most visited website worldwide in 2007. And as you perhaps know, Google keeps track of every web search made on its search engine. Can you imagine the quantity of data Google will then be able to mine? (Picture from www.vivelavie.fr)


Friday, November 17, 2006

Now boarding!

Now Boarding! is a new kind of post on Data Mining Research. Every Friday, I will propose a trip to discover posts from other blogs dealing with data mining. It can be reading for the weekend, or just a glance at what is written on other data mining blogs worldwide. This Friday I propose three destinations:

  • Dean and Will are writing about free and inexpensive data mining tools. Their post can be seen as an extension to the poll started here a few days ago.
  • Matthew is writing about geographic visualization and makes an interesting comparison between Google Earth and Microsoft's Virtual Earth.
  • Will has a very exhaustive post about finding Matlab code and tools on the web, on his blog about data mining using Matlab.
Enjoy your trip and have a nice weekend!


Thursday, November 16, 2006

Mining data with Microsoft SQL Server

I recently spoke with people working at Microsoft. Unfortunately, they are not doing data mining or machine learning research in Switzerland; I think that everything concerning research is in Redmond. I suppose that they are involved in data mining for at least two reasons: MSN Search and Microsoft SQL Server. The website IT-director has an article concerning SQL Server. According to its author, David Norris, the data mining services in SQL Server are very interesting. He writes that "what Microsoft has done is to make data mining available on the desktop to everyone". However, his article gives no details about the data mining techniques used. Although I have never used SQL Server, I suspect that it is more about data analysis than about data mining proper.


Wednesday, November 15, 2006

Robots learning to grasp objects

The Stanford news service writes about robots learning to grasp objects. The complexity of this task involves several research areas in addition to machine learning, such as speech processing, navigation, manipulation, planning, reasoning and vision. The author of this interesting article states that "the ultimate aim for artificial intelligence is to build a robot that can create and execute plans to achieve a goal". Although this is an exciting aim, it is certainly not easy to achieve. Can you imagine the diversity of everyday actions that must be learned to obtain realistic behavior? Finally, I hope that the plans created to achieve a goal will not be the same as the ones chosen by the artificial intelligence in the movie I, Robot...


Tuesday, November 14, 2006

Small book review: The advanced internet searcher's handbook

In this post, I will share my thoughts on a recent read about internet search. First of all, I should say that I really enjoyed reading this book. It is written in an informal way, which is really nice for people whose mother tongue is not English (like me). When you read this book, you feel like Phil Bradley is speaking to you personally, and it's a real pleasure.

Basically, The advanced internet searcher's handbook shows you how to get information from the web, with simple and clear examples. The book covers nearly every kind of search on the internet. The main idea that comes out of this reading can be summarized in one sentence: don't trust a single search engine, and keep checking information against other sources. The book clearly shows the drawbacks of search engines such as Google. For example, a whole chapter is dedicated to the hidden web (not accessible to Google).

I conclude by saying that this book broadly covers search engine basics (the "advanced" in the title is certainly an overstatement) and has very interesting chapters about weblogs, mailing lists and newsgroups. To my mind, the chapters about finding people and search tips are of little use (too straightforward). Finally, the author is a librarian, not a computer scientist. Therefore, the "how to find information" aspect is very nice, but the "how to use the web" aspect is often less interesting (e.g. the finding-people chapter).


Microsoft and Google on campus

These days, the annual companies forum is taking place on our campus at EPFL (Switzerland). Among these companies are Microsoft and Google. Today and tomorrow, I will go and discuss with them what they do in data mining. Be sure that relevant gossip will be reported on this blog :-)


Monday, November 13, 2006

Kmining

As mentioned by Ralf in this post, Kmining is an important data mining resource. You might be interested in Kmining mainly if you 1) want to know about data mining conferences and related events or 2) want a list of people involved somehow in data mining. In addition, you will find news and basic data mining definitions. I warmly recommend this source of information, which is, to my mind, complementary to KDnuggets.


Thursday, November 09, 2006

Finding Chinese people

I was recently looking for an article about sensor placement on the web. Google and Google Scholar were both unable to give me the PDF version of it (certainly because it is not free). However, I found the first author's email address on the website of the journal where the article appeared. As you can imagine, it was no longer valid. So I decided to find the author's webpage to ask for a PDF version by email. That's where the trouble started.

It is certainly easier to find information on a longer and less common name. As an example, Saitta gives 535'000 hits on Google. With a name such as Liu, you get 75'000'000 hits. The problem with Chinese names is that 1) they are short and 2) they are shared by a very large number of people. This means that a name like Liu will very often be found among names in China. To my mind, this makes the problem of finding Chinese people on the internet far from straightforward.


Wednesday, November 08, 2006

Data mining platform poll

To continue with the YALE post, I propose a poll concerning your favorite data mining platform. You are certainly using several different platforms for data mining; for this poll, the idea is to vote only for the one you prefer or use most. I think it can give a good idea of the tools used for data mining by both researchers and practitioners.




Tuesday, November 07, 2006

Data Mining and Predictive Analytics

I had another subject in mind to post today, but after reading Will Dwinnell's comment on my blog, I started reading his blog (curiosity is a good skill!). In fact, the blog is co-written by Dean Abbott. Posts on their blog are very technical, and when reading them you can easily see that they have experience in the domain of data mining. Their blog originally started in 2003 and seems to have had a second life since October 2006. These days, they write on topics such as clustering and software, among others. I strongly recommend this blog.


Monday, November 06, 2006

Detecting fake Van Gogh

Here is yet another exciting data mining application: detecting fake paintings. According to NewScientist, researchers from Maastricht University (Netherlands) are using data mining techniques to discover whether paintings are original or not. For example, they train a neural network on some of Van Gogh's paintings to discover the traits left by their famous author. Even if this technique seems promising, human intervention cannot be avoided: a human is still involved in establishing the training set. Indeed, a human expert first has to tell the machine which paintings really are authentic.


Friday, November 03, 2006

When web mining meets clustering

Google is nowadays the most widely used search engine on the planet. A lot of people use it and are satisfied with its performance. However, Google suffers from several drawbacks. For example, a lot of results are redundant. It sometimes happens that Google gives you too many answers. Assume that the information you need is in a .pdf file linked from a specific webpage, itself belonging to an overall website. Google will perhaps give you three different links (the main website, the specific webpage and the .pdf file itself). Another drawback of Google (and many other free-text search engines) is the lack of structure among results. Information is given in a raw manner, without themes, hierarchies or categories. So it often happens that you are drowned under the information obtained. A search on the term data mining, for example, results in 52,600,000 hits.

Clusty, a recent search engine (Pittsburgh, 2004), is a good alternative to Google. Clusty is a meta search engine, which means it queries the top search engines and combines the results for the user. Clusty uses clustering techniques to group results into categories. The results are automatically clustered according to selected keywords. For the example of the term data mining, Clusty proposes 246 results that are part of 36,244,144 hits found. The figure below shows the results obtained.


Clusty proposes clusters and sub-clusters that can be browsed (left part of the figure). Information is not raw as in Google, but rather organized. Up to now, the only drawback I have noticed regarding Clusty concerns the ads. They are too close to the results, and this sometimes confuses the user.


Thursday, November 02, 2006

Mining crime information

Using data mining techniques to help fight crime sounds good. It is certainly an interesting topic. Can you imagine saying that you caught a criminal using a decision tree? :-) Although this view is very simplistic, data mining seems to be helpful in some situations, as pointed out by LocalTechWire. The main idea is to use data mining methods to identify crime trends and then anticipate crimes. That this is a trendy topic for data mining is evidenced by the publication of a new book by Colleen McCue: Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis.
