UCI/NIST Databases
How many times have you read a data mining paper that does not illustrate its results using a common database available on the web? This is a provocative question, of course. However, based on my personal reading experience in data mining, I estimate that 7 out of 10 papers use common data available on the web, such as the UCI and NIST databases.
I agree with Lavrac et al. (1), who state that "[...] its existence (UCI database) has indirectly promoted a very narrow view of real-world data mining". I recently spoke with a professor at Carnegie Mellon University. He uses data mining in civil engineering, and he told me that for each application he needed to adapt or develop a new data mining algorithm to fit his task. This is an excellent example of how standard data mining algorithms are often not suited, as they stand, to real-world problems.
Finally, still according to (1), data mining algorithms may be "overfitting the UCI repository". This is certainly true, and the best thing to do would be to collect more data from real-world problems and see how difficult it is to apply standard algorithms to them.
(1) Lavrac N., Motoda H. and Fawcett T., Data Mining Lessons Learned, Machine Learning, 57, 5-11, 2004.