Stratification for data mining
One common issue in data mining is the size of the data set, which is often limited. When this is the case, testing the model becomes a problem. Usually, 2/3 of the data are used for training and validation and 1/3 for final testing. By chance, the training or the test set may not be representative of the overall data set. Consider, for example, a data set of 200 samples and 10 classes. One of these 10 classes may well be missing from the validation or test set, especially if some classes have only a few samples.
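To get a feeling for how often this happens, here is a small simulation sketch in Python. It repeatedly draws a random 1/3 test set and counts how often at least one class is absent from it. The function name `prob_missing_class` and the two class distributions are made up for illustration; they do not come from a real data set.

```python
import random

def prob_missing_class(class_counts, test_fraction=1/3, trials=2000, seed=0):
    """Estimate the probability that a purely random test set misses a class.

    class_counts: number of samples per class (assumed to be all positive).
    """
    rng = random.Random(seed)
    # One label per sample, e.g. [0, 0, ..., 1, 1, ...]
    labels = [c for c, n in enumerate(class_counts) for _ in range(n)]
    n_test = round(len(labels) * test_fraction)
    n_classes = len(class_counts)

    missing = 0
    for _ in range(trials):
        # Random split without any stratification
        test_labels = rng.sample(labels, n_test)
        if len(set(test_labels)) < n_classes:
            missing += 1
    return missing / trials

# Two hypothetical data sets of 200 samples and 10 classes
balanced = [20] * 10
skewed = [60, 40, 30, 20, 15, 12, 10, 6, 4, 3]

print("balanced:", prob_missing_class(balanced))
print("skewed:  ", prob_missing_class(skewed))
```

The estimate depends strongly on the class distribution: the fewer samples the rarest classes have, the more often a random split leaves one of them out of the test set.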
To avoid this problem, you should make sure that each class is correctly represented in both the training and testing sets. This process is called stratification. For the training phase, one way to avoid doing stratification is to use k-fold cross-validation. Instead of having only one validation set with a given class distribution, k different validation sets are used. However, this process doesn't guarantee a correct class distribution among the training and validation sets.
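Here is a minimal sketch of what stratification can look like in plain Python, assuming the class labels are available as a simple list. The helper names `stratified_split` and `stratified_folds` are my own for this illustration. Libraries such as scikit-learn offer ready-made equivalents (the `stratify` argument of `train_test_split`, and `StratifiedKFold`).

```python
import random
from collections import defaultdict

def _indices_by_class(labels):
    """Group sample indices by class label."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    return by_class

def stratified_split(labels, test_fraction=1/3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idx in _indices_by_class(labels).values():
        rng.shuffle(idx)
        # At least one test sample per class...
        n_test = max(1, round(len(idx) * test_fraction))
        # ...but keep at least one for training when the class has 2+ samples.
        # (A class with a single sample ends up in the test set only.)
        if len(idx) > 1:
            n_test = min(n_test, len(idx) - 1)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def stratified_folds(labels, k=5, seed=0):
    """Assign every index to one of k folds, dealing each class out in turn."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for idx in _indices_by_class(labels).values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[(offset + j) % k].append(i)
        # Continue where the previous class stopped, so fold sizes stay even
        offset += len(idx)
    return folds

# Hypothetical data set: 200 samples, 10 classes of 20 samples each
labels = [i % 10 for i in range(200)]
train_idx, test_idx = stratified_split(labels)
folds = stratified_folds(labels, k=5)
print(len(train_idx), len(test_idx), [len(f) for f in folds])
```

With this approach, every class that has at least two samples appears in both the training and the test set, and each fold receives its share of every class as long as the class has at least k samples.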
And what about the test set? The test set can only be used once, on the final model. Therefore, no method such as cross-validation can be used. There is no guarantee that the test set contains all the classes that are present in the data set. This situation is more likely to happen when the number of samples is small and the number of classes is high. In this situation, the stratification process may be crucial. I'm wondering whether people usually apply stratification or not, and why. Feel free to comment on this issue based on your personal experience.
More details about stratification can be found in the book Data Mining: Practical Machine Learning Tools and Techniques, by Witten and Frank (2005).
5 comments:
Good point and really worth noting, otherwise you might be left wondering about bad results after modeling. Some more details at Wikipedia: stratified sampling
Hi Georg! Thanks for the link. I have just discovered your data mining blog. I will add it to my list soon.
Here is a post by Will about stratified sampling with a particular application in Matlab.
The answer depends a lot on the size of the data. I usually work with fairly large datasets so stratification on independent variables doesn't really make much sense. It's rather hard to know in advance which variables will be in the final model, and that makes stratification rather difficult.
On the other hand, I almost always do sampling on the dependent variable, e.g., if I'm predicting a rare event, then in the training sample I use all events and some non-events.
Thanks for sharing your experience Jeff. I think this is the case with most real-life data sets. However, in the case of the UCI repository, several data sets are quite small and have many classes. In this case, I think stratification may be useful.