Wednesday, September 03, 2008

Petabyte Age, Data Mining and Science

Natalie Glatzel has written an interesting post on the blog Tasty Data Goodies about an article by Chris Anderson, editor in chief of Wired. Chris' opinion is that scientific theory is becoming obsolete in the petabyte age. He writes that "[...] science can advance even without coherent models [...]". Basically, according to Chris, mining huge amounts of data to extract knowledge kills scientific theory.

As Natalie Glatzel points out, data mining is not meant to replace science and discovery in general. She writes that

"Data mining can really only point us in the right direction of new discovery by showing us relationships between data points; it can't generate new discoveries alone."

My opinion is that the issue raised by Chris Anderson is not due to the "petabyte age" but rather to the concepts behind data mining itself. Statisticians build a model and then test it. Data miners test the data first and then try to understand it. This is the basic difference between statistics and data mining, and it is distinct from the petabyte issue. Of course, data mining is one possible answer to the petabyte age. But in the late 1980s, data mining was already being used on data sets that were "small" compared to today's. Finally, we should remember that there is a big difference between getting knowledge and using it! As written by Natalie Glatzel:

"Although data mining may change the rules of the science game, it's definitely not the end of theory."

For more information, here is the link to Natalie Glatzel's post.

Romakanta said...

I consider myself a data miner. In my company, I work with a lot of statisticians on different projects. One of the biggest challenges I face is that we sometimes speak different languages!

Anonymous said...

Hi, good day to you. What's your opinion about the pros and cons of building a predictive model using data sets that contain "0" values? I'm using daily data where, on certain days, the value is "0". My supervisor advised me to aggregate the data weekly, but I somehow think that will not produce a genuine result.


Sandro Saitta said...

Romakanta: I have never worked with statisticians but I can imagine how difficult it may be (especially if they don't know about data mining).

Anonymous: Thanks for your question. I think you should give more details about your data set. You will certainly get more answers if you post your question on the KDnuggets forum.

