
Friday, September 28, 2007

Introduction to feature selection (part 2)

This post continues the previous one about feature selection. I now give some examples of wrapper-based techniques for feature selection.

Wrapper techniques treat the classification algorithm as a black box, so any search strategy can be combined with any classifier; this is what makes wrapper approaches universal. The accuracy of the classification algorithm may be used as the objective function of the search strategy. As with any classification algorithm, wrapper feature selection techniques face the risk of overfitting during training. One way to reduce overfitting is to use a k-fold cross-validation strategy. Filter criteria such as the correlation coefficient and mutual information are cheaper to evaluate than cross-validation; however, when the number of features is high, they become difficult to evaluate.
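To make this concrete, here is a minimal sketch of the wrapper idea (my own illustration; scikit-learn, the wine data set and the subset_score helper are assumptions, not something taken from the references): the mean k-fold cross-validation accuracy of a black-box classifier serves as the objective function for a candidate feature subset.

# A minimal wrapper sketch: the classifier is a black box and the mean
# k-fold cross-validation accuracy is the objective function evaluated
# for a candidate feature subset (scikit-learn assumed).
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

def subset_score(feature_indices, X, y, k=5):
    # Objective function: mean k-fold CV accuracy on the chosen features.
    clf = KNeighborsClassifier(n_neighbors=5)  # any classifier can be plugged in
    return cross_val_score(clf, X[:, feature_indices], y, cv=k).mean()

print(subset_score([0, 6, 9], X, y))                 # one candidate subset
print(subset_score(list(range(X.shape[1])), X, y))   # all 13 features

Any search strategy can now be wrapped around such a scoring function, which is exactly why the approach is called a wrapper.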

Several search algorithms that select a subset of m features out of d exist in the literature. Although exhaustive search is guaranteed to find the optimal feature subset, it is not feasible unless d is small. The branch and bound procedure explores only a part of all possible feature combinations, yet it is guaranteed to find the optimal feature subset in less time than exhaustive search requires. Its main drawback is the monotonicity assumption on the feature selection objective function: adding a variable to the feature set should never decrease the value of the objective function, which is not always the case in practice.
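As a rough illustration of the combinatorial cost (again my own sketch, reusing the scikit-learn setup above), exhaustive wrapper search has to evaluate every one of the C(d, m) subsets of m features out of d; branch and bound prunes exactly this enumeration.

# Exhaustive wrapper search: score every subset of m features out of d with
# cross-validated accuracy and keep the best one. The number of subsets,
# C(d, m), grows very quickly, which is what branch and bound tries to prune.
from itertools import combinations
from math import comb

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
d, m = X.shape[1], 3
print("subsets to evaluate:", comb(d, m))   # already 286 for d = 13, m = 3

best_score, best_subset = -1.0, None
for subset in combinations(range(d), m):
    clf = KNeighborsClassifier(n_neighbors=5)
    score = cross_val_score(clf, X[:, list(subset)], y, cv=5).mean()
    if score > best_score:
        best_score, best_subset = score, subset

print(best_subset, round(best_score, 3))

With d = 13 this still runs in seconds, but with a few hundred features the enumeration becomes hopeless.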

Individual ranking procedures can be called naive methods. The idea is to rank each feature individually according to its predictive power. This technique is valid only if the features are independent, which is usually not the case in practice. Sequential forward selection (SFS) starts with a single feature and iteratively adds the feature that most increases the classification criterion. Sequential backward selection (SBS) starts with all features and iteratively removes the single feature whose elimination most increases the classification accuracy. Although combinations of features are taken into account with this technique, a large number of computations is necessary since it starts with the set of all features; this may not be feasible for very high-dimensional data sets. The plus-l take-away-r method combines the two former techniques by first applying SFS for l features and then SBS for r features. Although this method seems promising, values for l and r have to be given by the user. Sequential forward floating search (SFFS) and sequential backward floating search (SBFS) are generalizations of the two former methods in which l and r are determined automatically.
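Below is a small hand-rolled SFS sketch (my own illustration, not code from the references): it greedily adds the feature that yields the largest gain in cross-validated accuracy and stops when no addition helps. SBS would simply run the loop the other way, starting from all features and removing one at a time.

# Sequential forward selection (SFS): start from the empty set and greedily
# add the single feature that most improves the k-fold cross-validated
# accuracy of the black-box classifier; stop when no addition improves it.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sfs(X, y, k=5):
    clf = KNeighborsClassifier(n_neighbors=5)        # the black-box classifier
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -1.0
    while remaining:
        # Score every candidate subset obtained by adding one more feature.
        scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=k).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:             # no improvement: stop
            break
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_score

X, y = load_wine(return_X_y=True)
print(sfs(X, y))   # a small subset of the 13 wine features and its CV accuracy

Recent versions of scikit-learn also ship a SequentialFeatureSelector that implements both the forward and the backward variants, so in practice the loop above rarely has to be written by hand.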

I hope these two posts have introduced you to feature selection and more specifically to wrapper-based techniques. Feel free to comment.

(1) François, D., High-Dimensional Data Analysis: Optimal Metrics and Feature Selection, PhD thesis, Université catholique de Louvain, Louvain-la-Neuve, Belgium, 2007.
(2) Reunanen, J., Overfitting in Making Comparisons Between Variable Selection Methods, Journal of Machine Learning Research, 2003, 3, 1371-1382.
(3) Jain, A.K., Duin, R.P.W. and Mao, J., Statistical Pattern Recognition: A Review, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22, 1, 4-37.


6 comments:

Anonymous said...

For the sake of completeness, it might be worth mentioning the known problems associated with popular stepwise feature selection methods, and alternatives such as PLS and LARS/Lasso.

Sandro Saitta said...

Thanks for the link Jeff! The discussion seems really interesting and relevant. I will read it soon.

Anonymous said...

Hey Sandro, thanks for citing my thesis as a reference :) !

Sandro Saitta said...

Damien,

I also refer to your work in my thesis, since I found it relevant (in particular the discussion of the Euclidean distance in high-dimensional spaces).

John Aitchison said...

Hi Sandro

You might also want to alert your readers to the possibility of using Bayesian Model Averaging, including the bicreg procedure.

A good example is here
http://www.r-project.org/doc/Rnews/Rnews_2005-2.pdf

regards

John Aitchison

Sandro Saitta said...

Hi John,

Thanks for the link! I will have a look at this Bayesian Model Averaging. And I just saw that you're back in the blogosphere! Cool!

 