When web mining meets clustering

Friday, November 03, 2006

Google is currently the most widely used search engine on the planet. Many people use it and are satisfied with its performance. However, Google suffers from several drawbacks. For example, many results are redundant, and Google sometimes returns too many answers. Suppose the information you want is in a .pdf file linked from a specific webpage, which itself belongs to a larger website. Google may give you three different links: the main website, the specific webpage and the .pdf file itself. Another drawback of Google (and of many other free-text search engines) is the lack of structure among results. Information is presented in raw form, without themes, hierarchies or categories, so it is easy to drown in the results. A search on the term data mining, for example, returns 52,600,000 hits.

Clusty, a recent search engine (Pittsburgh, 2004), is a good alternative to Google. Clusty is a meta search engine, which means it queries top search engines and combines the results for the user. Clusty uses clustering techniques to group results into categories. The results are automatically clustered according to selected keywords. For the term data mining, Clusty returns 246 results out of the 36,244,144 hits found. The figure below shows the results obtained.

Clusty proposes clusters and sub-clusters that can be browsed (left part of the figure). Information is not raw as in Google, but organized. So far, the only drawback I have noticed with Clusty concerns the ads: they are too close to the results, which can confuse the user.
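
To make the idea concrete, here is a minimal Python sketch of the two steps described above: merging results from several engines, then grouping them by keyword. It is not Clusty's actual algorithm, which this post does not describe. The function names, the stopword list, the example URLs and the rule "assign each result to the most frequent keyword in its title" are all assumptions chosen for illustration.

```python
import re
from collections import Counter, defaultdict

# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "with", "on"}


def merge_results(*result_lists):
    """Combine (url, title) lists from several engines, dropping duplicate URLs."""
    seen = set()
    merged = []
    for results in result_lists:
        for url, title in results:
            if url not in seen:
                seen.add(url)
                merged.append((url, title))
    return merged


def keywords(title, query_terms):
    """Return the set of lowercase words in a title, minus stopwords and query terms."""
    words = re.findall(r"[a-z]+", title.lower())
    return {w for w in words if w not in STOPWORDS and w not in query_terms}


def cluster_by_keyword(results, query, max_clusters=2):
    """Group results under the keywords that appear in the most titles."""
    query_terms = set(query.lower().split())

    # Count in how many titles each keyword appears.
    doc_freq = Counter()
    for _, title in results:
        doc_freq.update(keywords(title, query_terms))

    # Most frequent keywords become cluster labels; ties are broken alphabetically.
    labels = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))[:max_clusters]

    clusters = defaultdict(list)
    for url, title in results:
        words = keywords(title, query_terms)
        label = next((lab for lab in labels if lab in words), "other")
        clusters[label].append(url)
    return dict(clusters)


# Hypothetical results from two engines (made-up URLs and titles).
engine_a = [
    ("http://a.example/software", "Data mining software tools"),
    ("http://b.example/course", "Data mining course notes"),
]
engine_b = [
    ("http://b.example/course", "Data mining course notes"),
    ("http://c.example/tools", "Open source software for data mining"),
    ("http://d.example/course2", "Online course in data mining"),
]

merged = merge_results(engine_a, engine_b)
clusters = cluster_by_keyword(merged, "data mining")
```

With this toy data, the duplicate course page is kept only once, and the four remaining results fall under the labels "course" and "software". A real engine would work on page snippets rather than titles alone, and would build sub-clusters by repeating the grouping inside each cluster.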

2 comments:

Abhinav said...: Your article on search engines made me think about the future of search engines. I ended up writing about future search engines on my blog.

I am not very satisfied with Clusty's results, but it is a good start.

I think that if a search returns some 1000 results, it is really impossible to visit each one. In reality there will be many more than 1000. So if we find 10 categories or important keywords (in my case), and each keyword/category is further divided into or associated with many others, then it really becomes possible to see all the important search results. It will be some kind of tree structure; 9:45 AM
Anonymous said...: Yes, that is definitely true: Google is widely used around the world as an avenue for references and resources. This is the only reason Search Engine Optimization exists. Web mining, on the other hand, is an application of data mining designed to analyze and discover patterns from the web.; 4:17 PM