Comments on Data Mining Research - dataminingblog.com: Readability of Decision Trees

Sandro Saitta (2009-01-08 17:05):
Thanks for your comment James!

James Pearce (2009-01-08 01:07):
Random forests have the advantage of opening up problems that a single decision tree might not deal with well, such as when a class of interest is relatively rare. I prefer boosted trees for this situation, though.

Sandro Saitta (2008-12-01 17:12):
Thanks for your contribution Shane!

Anonymous (2008-11-30 23:30):
I think random forests are both useful and powerful... with some caveats: you need lots of memory for big data (so they are not practical for all tasks), and readability is also a problem.
<A HREF="http://rattle.togaware.com/" REL="nofollow">Rattle</A> addresses the random forest readability issue by producing an <A HREF="http://datamining.togaware.com/survivor/rattle-audit-model-rf-importance.png" REL="nofollow">importance chart</A>.

Sandro Saitta (2008-11-28 14:31):
Hi Tim,

Thanks for your comment!

1) Binning the data into buckets is a nice way to avoid this "unreadability" problem. I have always used normalization or standardization, but never binning. The fact that each bin contains the same number of occurrences also avoids the issue of outliers.

2) Regarding random forests, I definitely agree on the issues with this technique (that's why I don't use random forests). However, I really like the concept of several models voting for the output class.

Tim Manns (2008-11-28 06:10):
Hi Sandro,

I have a few thoughts.

a) This bit: "In most projects, data have to be normalized before using decision tree. Therefore, once you plot the tree, values are meaningless." I reckon that is not necessarily true! It depends on your normalisation. You can normalise your data with meaning!

I like binning into 100 buckets, each with the same number of occurrences (say, customer rows). I do this for a few reasons, one being that I can then report customers as being in the top 5% of buckets. It is also an easy and fast way to rescale lots of data in SQL.
CART or C5.0 models built on this type of normalised data are actually quite easy to make sense of (e.g. "if stock price is above the 70% bucket").

b) Random forests don't work well with big datasets (millions of rows). I use fairly simple CART or C5.0 models. Sometimes I build a handful of models on sampled subsets and average them, but I'm not convinced hundreds of models is the best way to go. I always take time creating new derived 'information-rich' columns and using these as additional inputs to a decision tree or neural net.

I agree with the problems you describe and, for the reasons you mention, I don't follow the steps you describe. Maybe I'm jaded, but I believe random forests are a classic example of mad academia over practicality (and yes, I know that's controversial considering the brilliant guy who created random forests...).

- Tim
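Tim's equal-frequency binning (every bucket holds the same number of rows, so "bucket 95 and above" literally means "top 5%") can be sketched in a few lines. The function name `quantile_bin` and the sample data are illustrative, not from any library Tim mentions; in SQL the same effect is typically achieved with a window function such as NTILE.

```python
# Equal-frequency ("quantile") binning, as described in the comments:
# rank the values, then map ranks onto n_bins equal-sized buckets.
# quantile_bin and the sample prices are hypothetical illustrations.

def quantile_bin(values, n_bins=100):
    """Assign each value a bucket in 0..n_bins-1 by rank order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

prices = [12.0, 95.5, 3.2, 40.1, 77.8, 55.0, 21.9, 68.3, 88.1, 30.4]
buckets = quantile_bin(prices, n_bins=10)
# The smallest price lands in bucket 0, the largest in bucket 9,
# and a tree split like "price above bucket 7" reads as "top 30%".
```

Because buckets are defined by rank rather than value, a handful of extreme outliers cannot stretch the scale, which is the outlier-robustness point Sandro raises above.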
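Tim's "handful of models on sampled subsets, averaged" is essentially small-scale bagging. A minimal sketch of the idea, using a trivial one-threshold stump as a hypothetical stand-in for a real CART or C5.0 model (all names and data here are illustrative):

```python
import random

# Small-scale bagging: train one simple model per random subset,
# then let the models vote. train_stump is a toy placeholder for
# a real decision-tree learner, not an actual CART implementation.

def train_stump(rows):
    """Pick the threshold on x that best separates the 0/1 labels."""
    best_t, best_correct = None, 0
    for t in {x for x, _ in rows}:
        correct = sum((x > t) == bool(y) for x, y in rows)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def predict_vote(thresholds, x):
    """Majority vote over the individual models."""
    votes = sum(x > t for t in thresholds)
    return int(votes * 2 >= len(thresholds))

random.seed(0)
data = [(x, int(x > 50)) for x in range(100)]  # label: x above 50
models = [train_stump(random.sample(data, 30)) for _ in range(5)]
high = predict_vote(models, 80)  # clearly above the true boundary
low = predict_vote(models, 10)   # clearly below it
```

A handful of models like this keeps memory and training cost low, which is the practical trade-off Tim prefers over the hundreds of trees in a full random forest.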