Data Mining Research - dataminingblog.com: Discussion on data mining pitfalls

Tuesday, December 12, 2006

Discussion on data mining pitfalls

After a few comments on the post Garbage in, garbage out, I think it would be interesting to discuss in more detail the pitfalls that arise when applying data mining techniques. I warmly encourage you to share your ideas. Here are two possible pitfalls that I have in mind:

  • Underfitting/overfitting
  • Data preparation (e.g. normalization)
Feel free to add elements, to discuss existing ones and perhaps give your personal experience with some of them.
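To illustrate the data preparation item, one common way normalization goes wrong is computing the scaling statistics on all the data, including the rows later used for testing. Here is a minimal sketch in plain Python, assuming a single numeric column; the names `fit_scaler` and `transform` and the toy values are made up for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Estimate mean and standard deviation from the training values only."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # avoid dividing by zero for a constant column
    return mu, sigma

def transform(values, mu, sigma):
    """Apply a previously fitted scaling to any set of values."""
    return [(v - mu) / sigma for v in values]

# Toy data (hypothetical): the test rows are never seen by fit_scaler.
train = [10.0, 12.0, 14.0, 16.0]
test = [18.0, 40.0]

mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)
```

The design choice is simply that the test rows are scaled with the training statistics, so nothing about the test set influences the model before evaluation.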


Comments:

Innar Liiv said...

If the goal is to find something novel and interesting, when do you stop?

Will Dwinnell said...

Much of what is expertise in data mining amounts to awareness of the many subtle hazards which face the analyst and understanding how to contend with them. Managers and clients frequently have no awareness of this, and novices, when they do recognize the problems, often employ inappropriate strategies.

Some of the issues are: sampling, missing values, outliers, very small data, very large data, imbalanced classes, model validation and variable selection.

Rob Cooley said...

A "pitfall" is a hidden hazard, not simply a hurdle or challenge to be overcome. I think this is an important distinction since mere hurdles are not as dangerous to novices as true pitfalls. A hurdle either stops the process or leads to a less than optimal result. This is usually not a disaster, just disappointing.

In my opinion, things like "very large data", "imbalanced data" or "variable selection" are hurdles. They may be difficult problems, but you can pretty much see them coming.

On the other hand, over-fitting can be properly characterized as a pitfall. Someone could train and deploy a model without being aware that they overfit during training.
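A toy sketch of how this can happen, in plain Python: a model that memorizes its training rows looks perfect on them, and the problem only shows up when it is scored on rows it has not seen. The data, the set of mislabeled rows, and the names `memorizer` and `accuracy` are assumptions made up for this illustration, not any particular tool's method.

```python
def true_label(x):
    """The real (unknown in practice) rule behind the data."""
    return 1 if x >= 10 else 0

# Hypothetical training rows whose labels were recorded wrongly (noise).
noisy = {2, 6, 14, 16}

# Even x values form the training set, odd x values the held-out set.
train = [(x, 1 - true_label(x) if x in noisy else true_label(x))
         for x in range(0, 20, 2)]
test = [(x, true_label(x)) for x in range(1, 20, 2)]

def memorizer(x):
    """1-nearest-neighbour: copy the label of the closest training row."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(model, rows):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)

train_acc = accuracy(memorizer, train)
test_acc = accuracy(memorizer, test)
gap = train_acc - test_acc  # a large gap is the warning sign
```

Looking only at `train_acc`, nothing seems wrong; the gap to `test_acc` is what reveals that the model has fitted the noise.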

In my experience, most but not all pitfalls can be overcome with technology (I've worked for KXEN for over 5 years, so I've seen a lot of evidence of things like automated over-fitting control).

So the things that are interesting to me are pitfalls that can't be solved with technology. These seem to be primarily associated with domain and process knowledge, not statistics knowledge.

One example is the pitfall of creating an input variable that contains information about the target or dependent variable (I call these "leaks"). Other than the perfect leak that is an exact replica of the target, I know of no technology to detect this pitfall. Only a human who understands where that variable came from, how it was created, and how it does (or doesn't) relate to the target can identify this pitfall.
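The "perfect leak" case is the easy one to automate. A minimal sketch, assuming small in-memory columns: it flags a column only when it is the target up to a one-to-one relabelling. The function name `is_perfect_leak` and the columns `age`, `cancel_code` and `churned` are hypothetical examples, and anything subtler than an exact replica would pass this check unnoticed.

```python
def is_perfect_leak(column, target):
    """True if the column equals the target up to a one-to-one relabelling."""
    mapping = {}
    for value, label in zip(column, target):
        if mapping.setdefault(value, label) != label:
            return False  # same input value seen with two different targets
    # Each distinct input value must map to its own distinct target value.
    return len(set(mapping.values())) == len(mapping)

# Hypothetical inputs for a churn model.
rows = {
    "age": [25, 40, 31, 58, 40],
    "cancel_code": ["N", "Y", "N", "Y", "Y"],  # assumed to be filled in after churn
}
churned = [0, 1, 0, 1, 1]

leaks = [name for name, col in rows.items() if is_perfect_leak(col, churned)]
```

Here only `cancel_code` is flagged. A variable that is merely correlated with the target because of how it was created would not be caught.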

I've found that I can successfully train novices in a few days to pick off these kinds of pitfalls, as long as the technology is handling the others. And by "novice" I mean a reasonably intelligent domain expert with no previous data mining experience, not "Cletus the slack-jawed yokel".

Sandro Saitta said...

Thanks to Innar and Will, we now have a (hopefully) fairly complete list of data mining... hmm, let's say difficulties for non-specialists.

Until today, I had not heard of KXEN. After Robert's comment, though, I think it may be interesting to take a look at this company and what they do.

Will Dwinnell said...

I understand Robert's point, although I'll make the following responses.

-For the record, I labeled these things "issues". By Robert's definition, these contain potential pitfalls for anyone who is not familiar with them (in my experience, a large fraction of people who attempt data mining). The pitfalls come in the form of inappropriate or inefficient responses to those issues.

-Some of these items remain open problems in the research community. Provost and co-authors, for instance, recently published new findings on dealing with imbalanced classes.

-Many commercial tools are still of the train-and-test variety (no cross-validation, little or no fitting control).

-Hey, I like Cletus! "Your carpeted floor, feels good between mah toes!"
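For readers unfamiliar with the difference between a single train-and-test split and cross-validation mentioned above, here is a minimal sketch in plain Python. The helpers `k_fold_indices` and `cross_validate`, the majority-class "model" and the toy labels are all assumptions for illustration; a real setup would normally shuffle or stratify the rows first.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; every row is tested exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(fit, score, rows, k=5):
    """Fit on k-1 folds, score on the remaining fold, repeat for each fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[j] for j in train_idx])
        scores.append(score(model, [rows[j] for j in test_idx]))
    return scores

def fit_majority(labels):
    """Toy 'model': always predict the most frequent training label."""
    return max(set(labels), key=labels.count)

def score_accuracy(model, labels):
    """Fraction of held-out labels equal to the constant prediction."""
    return sum(label == model for label in labels) / len(labels)

# Toy imbalanced labels (hypothetical): two rows of the rare class.
labels = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
scores = cross_validate(fit_majority, score_accuracy, labels, k=5)
mean_score = sum(scores) / len(scores)
```

A single split could land on one of the easy folds and report a perfect score; scoring every fold shows how much the estimate varies, which is especially visible with imbalanced classes.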
