How many times have you read a data mining paper that does not illustrate its results on a common database available on the web? This is a provocative question, of course. However, based on my own reading in data mining, I estimate that 7 out of 10 papers use common data available on the web, such as the UCI repository.
These datasets were clearly created for the precise purpose of testing new data mining or machine learning algorithms, and they are supposed to represent real-world problems. The main drawback is that people keep using the same few databases and assume they cover a good proportion of real-world problems. Seriously, these databases certainly represent less than 0.1% of the existing real-life problems that can take advantage of data mining methodologies.
I agree with the paper by Lavrac et al. (1), which states that "[...] its existence (UCI database) has indirectly promoted a very narrow view of real-world data mining". I recently spoke with a professor at Carnegie Mellon University who uses data mining in civil engineering. He told me that for each application, he needed to adapt an existing data mining algorithm or develop a new one to fit his task. This is an excellent example of how off-the-shelf data mining algorithms are often not suited to real-world problems as they stand.
Finally, still according to (1), data mining algorithms may be "overfitting the UCI repository". This is certainly true, and the best thing to do would be to collect more data from real-world problems and see how difficult it is to apply standard algorithms to them.
(1) Lavrac N., Motoda H. and Fawcett T., Data Mining Lessons Learned, Machine Learning, 57, 5-11, 2004.