Readability of Decision Trees
One of the most frequently cited advantages of decision trees is their readability. Many data miners (and I am one of them) justify the use of this technique because the resulting model is easy to understand (no black box). However, certain issues can make decision trees unreadable in practice.
First, there is normalization (or standardization). In most projects, the data have to be normalized before training a decision tree. As a result, once you plot the tree, the split values are meaningless. Of course, you can map the values back to the original scale, but that is an extra step.
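To illustrate, here is a minimal sketch assuming scikit-learn and a StandardScaler; the data and features are made up, but it shows how the thresholds stored in the fitted tree are in standardized units and how they can be mapped back to the original scale.

# Minimal sketch (scikit-learn assumed): split thresholds are in standardized
# units, so they must be converted back to the original scale to be readable.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 2))          # original units
y = (X[:, 0] > 55).astype(int)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)                          # typical preprocessing step

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_std, y)

t = tree.tree_
for node in range(t.node_count):
    if t.children_left[node] != -1:                      # internal (split) node
        f, thr = t.feature[node], t.threshold[node]
        # Undo the standardization: x_original = x_std * scale + mean
        thr_original = thr * scaler.scale_[f] + scaler.mean_[f]
        print(f"feature {f}: split at {thr:.2f} (std) = {thr_original:.2f} (original)")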
Second is the number of trees. In the project I work on at my job, I can have 100 or more decision trees per month (see this post for more details). It is clearly impossible to read all these trees, even if each one is understandable on its own. The same happens with random forests: when 1000 trees vote for a given class, how can one understand the process (or rules) that produces the class output?
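A rough sketch of the scale of the problem, assuming scikit-learn and a synthetic dataset: even just counting the leaves (the terminal "rules") across a 1000-tree forest shows why no human can read the whole ensemble.

# Rough sketch (scikit-learn assumed, synthetic data for illustration only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y)

# Total number of leaf "rules" a reader would have to inspect.
total_leaves = sum(tree.get_n_leaves() for tree in forest.estimators_)
print(f"{len(forest.estimators_)} trees, {total_leaves} leaf rules in total")
# The class output is an aggregate vote over all these trees, so no single
# readable rule explains an individual prediction.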
Decision trees still have a lot of advantages. However, the "readability" advantage should be taken with care: it may be valid in some applications, but it can often be a mirage.
6 comments:
Hi Sandro,
I have a few thoughts.
a) Hey, this bit:
"In most projects, data have to be normalized before using decision tree. Therefore, once you plot the tree, values are meaningless."
- I reckon not necessarily true!
Depends on your normalisation. You can normalise your data with meaning!
I like binning into 100 buckets, each with the same number of occurrences (say, customer rows). I do this for a few reasons, one being that I can then report customers as being in the top 5% of buckets, etc. It is also an easy and fast way to rescale lots of data in SQL. CART or C5.0 models built on this type of normalised data are actually quite easy to make sense of (eg, “if stock price is above the 70% bucket”); there's a quick sketch of this idea after my comment.
b) Random forests don't work well with big datasets (millions of rows). I use fairly simple CART or C5.0 models. Sometimes I build a handful of models on sampled subsets and average them, but I'm not convinced hundreds of models is the best way to go. I always take time creating new derived 'information-rich' columns and using these as additional inputs to a decision tree or neural net.
I agree with the problems you describe and, for the reasons you mention, I don't follow those steps myself. Maybe I'm jaded, but I believe random forests are a classic example of mad academia over practicality (and yes, I know that's controversial considering the brilliant guy who created them...).
- Tim
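A minimal sketch of the equal-frequency binning Tim describes, assuming pandas; the column name "stock_price" and the data are hypothetical, but the bucket labels behave as percentile ranks, which is what keeps the tree splits readable.

# Minimal sketch (pandas assumed): 100 equal-frequency buckets.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"stock_price": rng.lognormal(mean=3.0, sigma=0.5, size=10_000)})

# 100 buckets with (roughly) the same number of rows in each; the bucket label
# 0..99 is a percentile rank, so "above the 70% bucket" keeps its meaning.
df["stock_price_bucket"] = pd.qcut(df["stock_price"], q=100, labels=False, duplicates="drop")

print(df["stock_price_bucket"].value_counts().head())   # ~100 rows per bucket
print((df["stock_price_bucket"] >= 95).mean())          # share of rows in the top 5% buckets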
Hi Tim,
Thanks for your comment!
1) Binning the data into buckets is a nice way to avoid this "unreadability" problem. I have always used normalization or standardization, but never binning. Also, the fact that you have the same number of occurrences in each bin avoids the issue of outliers.
2) Regarding random forests, I definitely agree on the issues when using this technique (that's why I don't use random forests). However, I really like the concept of several models voting for the output class.
I think random forests are both useful and powerful... with some caveats... you need lots of memory for big data (so they're not practical for all tasks), and readability is also a problem. Rattle addresses the random forest readability issue by producing an importance chart.
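Rattle is an R front end, so as an analogous illustration in Python (scikit-learn assumed, with a standard toy dataset), a forest's variable-importance scores give one readable summary of an otherwise unreadable ensemble; this is a sketch of the same idea, not Rattle's actual chart.

# Analogous sketch (scikit-learn assumed): variable importance as a readable
# summary of a random forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(data.data, data.target)

# Rank features by mean decrease in impurity, the basis of an importance chart.
importances = sorted(zip(data.feature_names, forest.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances[:5]:
    print(f"{name}: {score:.3f}")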
Thanks for your contribution Shane!
Random forests have the advantage of opening up problems that a single decision tree might not deal with well, such as when a class of interest is relatively rare. I prefer boosted trees for this situation, though.
Thanks for your comment James!