Data Mining Research

Friday, October 20, 2006

Data mining to protect databases

According to Computing News, Secerno, a UK security company, says that the main weakness of databases is attacks from the inside. The problem stems from the large number of people who have access to confidential data. To counter this weakness, Secerno has developed a program that mines the standard usage of a database over a period of time. Once the program has learned this usage, it can be used to detect unusual queries. In my view, the main drawback lies in this critical learning period: if the database is misused during that time, the system may treat the misuse as normal and allow future attacks.
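To make the idea concrete, here is a minimal sketch of such a learn-then-detect scheme. It is not Secerno's actual method, which the article does not describe; the names `query_template` and `QueryBaseline` are hypothetical, and I assume that "standard usage" can be approximated by the set of query shapes each user issues during the learning period.

```python
import re
from collections import Counter


def query_template(sql: str) -> str:
    """Reduce a SQL string to a rough template by masking its literals."""
    template = sql.strip().lower()
    template = re.sub(r"'[^']*'", "?", template)   # mask string literals
    template = re.sub(r"\b\d+\b", "?", template)   # mask numeric literals
    return re.sub(r"\s+", " ", template)           # normalize whitespace


class QueryBaseline:
    """Hypothetical baseline: remembers which query templates each user ran."""

    def __init__(self) -> None:
        self.seen = Counter()

    def learn(self, user: str, sql: str) -> None:
        # Learning period: everything observed here is treated as normal.
        self.seen[(user, query_template(sql))] += 1

    def is_unusual(self, user: str, sql: str) -> bool:
        # Detection period: flag any (user, template) pair never seen before.
        return (user, query_template(sql)) not in self.seen


baseline = QueryBaseline()
baseline.learn("alice", "SELECT name FROM customers WHERE id = 42")

same_shape = baseline.is_unusual("alice", "select name from customers where id = 7")
new_shape = baseline.is_unusual("alice", "SELECT * FROM credit_cards")

# The drawback discussed above: a bad query issued during the learning
# period becomes part of the baseline and is no longer flagged afterwards.
baseline.learn("alice", "SELECT * FROM credit_cards")
after_poisoning = baseline.is_unusual("alice", "SELECT * FROM credit_cards")
```

In this sketch, `same_shape` is `False`, `new_shape` is `True`, and `after_poisoning` is `False`, which is exactly the weakness of the learning period: whatever happens during it is accepted later.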


Monday, August 14, 2006

UCI/NIST Databases

How many times have you read a paper about data mining that does not illustrate its results using a common database available on the web? This is a provocative question, of course. However, based on my own reading in data mining, I estimate that 7 out of 10 papers use common data sets available on the web, for example:

These data sets are clearly made for the precise purpose of testing new data mining or machine learning algorithms. They are supposed to represent real-world problems. The main drawback is that people keep using these few databases and assume they cover a good proportion of real-world problems. Seriously, these databases certainly represent less than 0.1% of the existing real-life problems that can take advantage of data mining methodologies.

I agree with the paper by Lavrac et al. (1), which states that "[...] its existence (UCI database) has indirectly promoted a very narrow view of real-world data mining". I recently talked with a professor at Carnegie Mellon University. He uses data mining in civil engineering, and he told me that for each application he needed to adapt an existing data mining algorithm or develop a new one to fit his task. This is an excellent example of the fact that off-the-shelf data mining algorithms are often not suited to real-world problems.

Finally, still according to (1), data mining algorithms may be "overfitting the UCI repository". This is certainly true, and the best thing to do would be to collect more data from real-world problems and see how difficult it is to apply standard algorithms to them.
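One way to read the "overfitting the UCI repository" remark is as a selection effect, and a small simulation can illustrate it. This is only a sketch under assumptions of my own, not something taken from (1): the function `benchmark_vs_fresh` and all its parameters are hypothetical, and every simulated algorithm has exactly the same true accuracy, so any difference in benchmark scores is pure chance.

```python
import random


def benchmark_vs_fresh(n_algorithms=50, n_benchmark=200, n_fresh=100_000,
                       true_accuracy=0.7, seed=0):
    """Pick the best of many equally good algorithms on one small benchmark,
    then measure the winner again on a large amount of fresh data."""
    rng = random.Random(seed)

    def measured_accuracy(n_examples):
        # Each example is classified correctly with probability true_accuracy.
        hits = sum(rng.random() < true_accuracy for _ in range(n_examples))
        return hits / n_examples

    # Every algorithm is scored on a benchmark of the same small size.
    benchmark_scores = [measured_accuracy(n_benchmark)
                        for _ in range(n_algorithms)]
    best_on_benchmark = max(benchmark_scores)

    # The winner is no better than the others, so on fresh data its score
    # falls back towards the shared true accuracy.
    winner_on_fresh = measured_accuracy(n_fresh)
    return best_on_benchmark, winner_on_fresh


best_on_benchmark, winner_on_fresh = benchmark_vs_fresh()
print(f"best score on the benchmark: {best_on_benchmark:.3f}")
print(f"same algorithm, fresh data:  {winner_on_fresh:.3f}")
```

The best benchmark score comes out above the fresh-data score even though no algorithm is really better than another. Reporting results on the same few databases again and again invites this kind of optimism, which is one more reason to test on newly collected real-world data.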

(1) Lavrac N., Motoda H. and Fawcett T., Data Mining Lessons Learned, Machine Learning, 57, 5-11, 2004.
