What Google can't mine

Thursday, January 18, 2007

While reading a book about search and information, I found a chapter about the hidden web particularly interesting. Basically, the hidden web is the part of the Internet that is accessible to people but not to bots (such as Google's bots). In other words, these pages exist, but search engines do not list them simply because they are too difficult (or sometimes impossible?) to index. Examples include dynamically generated webpages, most databases behind websites, and pages requiring password access. More details can be found on Wikipedia under the term deep web.
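To make the idea concrete, here is a minimal sketch of a crawler that only follows plain hyperlinks. Everything in it is made up for illustration: the toy SITE dictionary, the page paths, and the crawl function are assumptions, not how Google or any real search engine works. It only shows why a page that is reachable solely through a search form or a login never ends up in the crawler's list.

```python
from collections import deque

# Hypothetical toy site. "links" are the plain hyperlinks a page exposes;
# "reached_by" is only a note on how a human gets to the page.
SITE = {
    "/": {"links": ["/about", "/search"], "reached_by": "link"},
    "/about": {"links": [], "reached_by": "link"},
    "/search": {"links": [], "reached_by": "link"},
    # Generated on demand when a visitor submits the search form.
    "/search?q=iceberg": {"links": [], "reached_by": "form"},
    # Only shown after a visitor logs in.
    "/members/report": {"links": [], "reached_by": "login"},
}


def crawl(site, start="/"):
    """Breadth-first crawl that follows plain hyperlinks and nothing else."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in site[page]["links"]:
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


visible = crawl(SITE)
hidden = sorted(set(SITE) - visible)

print("Indexed by the bot:", sorted(visible))
print("Hidden from the bot:", hidden)
```

In this toy setup the two hidden pages exist and a person can reach them, but no hyperlink points to them, so a link-following bot never sees them.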

Michael Bergman estimates that the hidden web is 400-550 times the size of the visible web. I think the iceberg metaphor fits this situation well. The question now is what Google and other search engines will do to access this information (or at least part of it).

1 comment:

Will Dwinnell said...

Another issue is the near-total reliance of some people on Google to search the World Wide Web. While I have found Google to be an effective search engine, I have found it useful to utilize a number of other search engines. Using more than one engine provides diversity of response and helps avoid search dead-ends ("Well, I can't find it using Google... It must not be on the Web.")

I suggest these alternatives, but there are others:

AllTheWeb
AltaVista
Clusty
Devil Finder
hakia
Ixquick

2:13 PM