Garbage in, garbage out
I was recently going through the book Java Data Mining by Hornick et al. Unlike this earlier post, the focus today is not on using Java for data mining. However, I read a line in this book that made me react. Here it is: "These users [marketers] either know nothing about the techniques of data mining or do not need to know anything about data mining to reap its benefits". I remember attending a doctoral course on data mining where the teacher, a professor of machine learning, claimed that people should not use data mining as a black box tool. If they do, the so-called garbage in, garbage out situation is likely to happen. Do you think using data mining as a black box tool is justifiable? What is your opinion?
6 comments:
The suggestion that unsophisticated users could use data mining tools to seriously analyze their data is silly. While the basic idea of empirically modeling data is simple, its execution is not. A large portion of what constitutes expertise in this field is understanding various pitfalls of analysis and how to contend with them.
Consider the example of a metrologist I once worked with on a modeling project. With his knowledge of mathematics, especially geometry, he fell into the first trap that plagues beginning modelers: overfitting. He was quite insistent that the model which went through all the data points was the best possible model, and it took some explanation to get him to realize that generalization, not interpolation, was the goal. Many of the other hazards faced by the statistical modeler (many candidate predictors, missing values, etc.) are a good deal more subtle than overfitting.
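The trap described above can be reproduced in a few lines. What follows is a minimal sketch in plain Python, not anything taken from the project mentioned: the data (a straight line y = 2x + 1 with a made-up alternating error of +/-0.5) and all function names are assumptions chosen purely for illustration. It compares a polynomial forced through every training point with an ordinary least-squares line, and scores both on points that were not used for fitting.

```python
# Illustrative sketch: interpolation vs. generalization.
# All data and names are hypothetical, chosen only for this demonstration.

def true_value(x):
    """The assumed underlying relationship: y = 2x + 1."""
    return 2.0 * x + 1.0

def lagrange_predict(xs, ys, x):
    """Evaluate the unique polynomial passing through every (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(predict, xs, ys):
    """Average squared difference between predictions and targets."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Training data: the assumed line plus an alternating error of +/-0.5.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [true_value(x) + (0.5 if i % 2 == 0 else -0.5)
           for i, x in enumerate(train_x)]

# Held-out points lying between the training points, with error-free targets.
test_x = [0.5, 1.5, 3.5, 4.5]
test_y = [true_value(x) for x in test_x]

slope, intercept = fit_line(train_x, train_y)

def interpolant(x):
    """Model that goes through all the training points."""
    return lagrange_predict(train_x, train_y, x)

def line(x):
    """Simple model that does not try to hit every point."""
    return slope * x + intercept

for name, model in [("through all points", interpolant),
                    ("least-squares line", line)]:
    print(name,
          "| training MSE:", round(mean_squared_error(model, train_x, train_y), 4),
          "| held-out MSE:", round(mean_squared_error(model, test_x, test_y), 4))
```

On this toy data the polynomial reproduces the training points exactly, yet it does worse on the held-out points than the simple line. That gap is the difference between interpolation and generalization.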
In general, I think that end users could use descriptive algorithms and data visualizations profitably to navigate the data. However, prediction and regression are much more subtle, and should likely be handled by experts. In between are clustering and summarization, where a carefully designed interface could likely prevent mistakes in use. In short, data mining is not a single thing: there are many techniques, many algorithms, many tasks, and many interfaces; some are suitable for most users, and some aren't.
There was a time when I might have agreed with the above comment by John, but long experience observing business analysts and mid-level managers drawing conclusions which would never withstand any sort of serious statistical scrutiny has left me very doubtful.
Spreadsheets and later BI tools offered leaps in ease-of-use and data manipulation power, but few seemed to take the corresponding responsibility seriously. My concern can be summed up thus: When a business analyst with no statistical training runs into the room, excitedly proclaiming that "Two out of three of our customers also buy brand X!", does he mean, literally, 2 out of 3, or does he mean 67% of thousands of customers? I'm not saying that no-one uses these tools well, but, having worked as a post-sales consultant for Cognos, I was in the trenches alongside many users and saw all manner of abuse of data.
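The "2 out of 3" question can be made concrete with a confidence interval. The sketch below is only an illustration, written in plain Python: it uses the Wilson score interval, which is one standard choice among several, and the function name and the customer counts (3 versus 3,000) are assumptions made up for the example.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Same observed share (about 67%), very different amounts of evidence.
for successes, n in [(2, 3), (2000, 3000)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes} of {n}: plausible range {low:.2f} to {high:.2f}")
```

With only three customers the interval covers well over half of the possible range, so the headline figure says almost nothing. With three thousand customers the same share is pinned down to within a few percentage points.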
John and Will, thanks for these very good comments. It's nice to have the point of view of people with experience in the domain. I think it might be interesting to have a list of possible pitfalls that people using data mining could encounter (see new post).
Being one of the authors of the book mentioned, I thought I would try to present the other side of the coin... it is no fun when everybody agrees on a blog ;-)
At the last KDD conference, a panel was organized to try to understand why data mining is not a multi-billion-dollar business, as is the case, for example, with Business Intelligence (dealing with reports and OLAP: what I call the 'low end' of analytics, or human powered analytics as opposed to maths powered analytics).
It was striking to see that almost all the data mining experts who were present (and there are a lot of them at KDD) were claiming that you do need statisticians or data miners to do data mining, and the KXEN representative (Rob Cooley) was almost the only one claiming that automated data mining is possible. So, the point of view of the people participating in this blog is 'mainstream'... But progress has always been made because, one day, one guy stands up and says: "Wait a minute, is this really true?"
I agree with all that has been said on problems linked with missing values, outliers, good performance indicators, overfitting and underfitting, imbalanced classes, model validation, variable selection, leak variable detection, variable encoding, curse of dimensionality, deviation detection when applying a model, and descriptive power (ouch... This list is a good start for your other blog topic).
But I do not agree when people (experts) claim that all these topics cannot be solved with automated processes providing good solutions in 95% of the cases, because that is what we (KXEN) have done (for each of the topics mentioned above). And the solution is very simple: even the experts use books, articles, and techniques that they have been trained on, so I do not see any reason why software could not use the same techniques in an automated way...
The real question is: 'Is data mining technology mature enough in 2006 to automatically solve 95% of business situations?' My answer is clearly yes.
And this is linked with: where is data mining used today? I read the example of the meteorologist, and I was thinking that, for each meteorological mathematical model, there must be 1 million models developed on Earth to answer "what will customers buy next month?", "will my customer leave in the next 3 months?", "how many customers will not repay their credit?", "is this credit card transaction fraudulent?", "how many products will I sell next week?". If you count the number of mathematical models produced per year on our small planet, the vast majority are in CRM, Risk, Quality, and any kind of forecasts/predictions. There is a growing concern in the bio sphere, but, in terms of volume, it is nothing compared with CRM (a single KXEN customer generates thousands of mathematical models per year). For this, see http://www.kdnuggets.com/polls/2005/successful_data_mining_applications.htm
So the sentence in the book relates to the fact that, yes, we can fully automate Marketing Campaign Optimization; we can fully automate the computation of the Credit Risk probability of default. Wherever we can translate a business problem into a sequence of data mining tasks, we can automate now, today... with results comparable to those of a very good expert.
Be careful, nobody claims that, if you put very bright people on the same problem for one year, they will not get better results. Of course they will! But there are not enough experts on Earth for all the problems that can be optimized today using state-of-the-art automated data mining techniques...
Now, this said, there are phases which will always be 'human powered' (I am waiting for a guy to stand up and say: "Wait a minute..."):
1/ How to go from a (business) problem definition to a decomposition of data mining tasks using data mining functions (I am NOT talking about algorithms here, but functions as defined in JDM).
2/ How to perform a business validation of the findings of the maths powered engines. This is why all useful data mining function implementations must be verbose and 'speak' to business users in a way they can understand.
3/ How to access the data in a way a normal person (and not a database administrator) would like to. This is where collaboration with BI solutions is very interesting.
4/ How to discover new algorithms and techniques that will work in 97% of the cases...
OK. I am finished. I hope that I have managed to convey the passion of the other side of the force...
P.S. By the way, besides this one sentence, how was the book?? ;-)
I definitely agree with you, Erik, that agreement is no fun :-) To my mind, disagreement is the key to discovery and the exchange of ideas.
To begin with, it is very nice to have a comment from you, as an author of the book. Your comment is very interesting and I have discovered KXEN (a company I didn't know before). Challenges at KXEN are for sure exciting since automating data mining is certainly not an easy task.
Concerning the book, I haven't read it yet (apart from the sentence mentioned). However, now that one sentence of the book is as controversial as The Da Vinci Code, Java Data Mining is number one on my reading list :-) I will certainly write a post about it in January.