
Data Mining Research - dataminingblog.com: feature selection


Wednesday, January 09, 2008

Data Mining Methodologies for Supporting Engineers during System Identification

After four years of work, I am now close to finishing my PhD. Below is the abstract of my thesis, which is about data mining support for system identification:

Data alone are worth almost nothing. While data collection is increasing exponentially worldwide, a clear distinction between retrieving data and obtaining knowledge has to be made. Data are retrieved while measuring phenomena or gathering facts. Knowledge refers to data patterns and trends that are useful for decision making. Data interpretation creates a challenge that is particularly present in system identification, where thousands of models explain a given set of measurements. Interpreting such data manually is not reliable. One solution is to use data mining. This thesis thus proposes an integration of techniques from data mining, a field of research where the aim is to find knowledge from data, into an existing multiple-model system identification methodology.

It is shown that, within a framework for decision support, data mining techniques constitute a valuable tool for engineers performing system identification. For example, clustering techniques group similar models together in order to guide subsequent decisions, since clusters may indicate different possible states of a structure. A main issue concerns the number of clusters, which is usually unknown.

For determining the correct number of clusters in data and estimating the quality of a clustering algorithm, a score function is proposed. The score function is a reliable index for estimating the number of clusters in a given data set, thus helping engineers understand clustering results. Furthermore, useful information for engineers performing system identification is obtained through the use of feature selection techniques. They allow selection of the relevant parameters that explain candidate models. The core algorithm is a feature selection strategy based on global search.
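The score function itself is not reproduced in this abstract. As a rough illustration of the idea only, the following hypothetical Python sketch uses a generic validity index (the silhouette score) as a stand-in to pick the number of clusters for a toy data set:

```python
# Hypothetical sketch: choosing the number of clusters with a generic validity index.
# The silhouette score stands in here for the score function proposed in the thesis.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy stand-in for candidate-model features

best_k, best_score = None, -np.inf
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # higher means better-separated clusters
    if score > best_score:
        best_k, best_score = k, score

print(f"estimated number of clusters: {best_k} (silhouette {best_score:.2f})")
```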

In addition to providing information about the candidate model space, data mining is found to be a valuable tool for supporting decisions related to subsequent sensor placement. When integrated in a methodology for iterative sensor placement, clustering provides useful support by supplying a rational basis for placing subsequent sensors on existing structures. Regarding initial sensor placement, greedy and global search strategies should be selected according to the context. Experiments show that whereas global search is more efficient for initial sensor placement, a greedy strategy is more suitable for iterative sensor placement.


Friday, September 28, 2007

Introduction to feature selection (part 2)

This post continues the previous one about feature selection. I now give some examples of wrapper-based techniques for feature selection.

Wrapper techniques use the classification algorithm as a black box, so they can be combined with any search strategy. This makes wrapper approaches universal. The accuracy of the classification algorithm may be used as the objective function of the search strategy. As with any classification algorithm, wrapper feature selection techniques face the overfitting problem that may occur during training. One way to reduce overfitting is to use a k-fold cross-validation strategy. Filter criteria such as the correlation coefficient and mutual information are cheaper to evaluate than cross-validation; however, they become difficult to evaluate when the number of features is high.
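As a rough illustration (not taken from the references below), here is a minimal Python sketch of the wrapper objective, assuming a scikit-learn-style classifier and a placeholder data set; the classifier is treated as a black box and a candidate feature subset is scored by k-fold cross-validation:

```python
# Minimal sketch of the wrapper objective: a candidate feature subset is scored by
# running the classifier itself (treated as a black box) under k-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # placeholder data set

def wrapper_objective(features, k=5):
    """k-fold cross-validated accuracy of the classifier restricted to the chosen columns."""
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, list(features)], y, cv=k).mean()

print(wrapper_objective([0, 2]))  # score for the candidate subset {feature 0, feature 2}
```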

Several search algorithms that select a subset of m features out of d exist in the literature. Although exhaustive search is guaranteed to find the optimal feature subset, it is not feasible unless d is small. The branch and bound procedure explores only a part of all possible feature combinations. It is guaranteed to find the optimal feature subset in less time than exhaustive search. Its main drawback concerns the monotonicity assumption on the feature selection objective function: adding a variable to the feature set should never decrease the value of the objective function, which is not always the case.

Individual ranking procedures can be considered naive methods. The idea is to rank each feature individually according to its predictive power. This technique is valid only if the features are independent, which is usually not the case in practice. Sequential forward selection (SFS) starts with a single feature and iteratively adds the feature that most increases the classification criterion. Sequential backward selection (SBS) starts with all features and iteratively removes a single feature to increase the classification accuracy. Although combinations of features are taken into account with this technique, a high number of computations is necessary since it starts with the set of all features. This may not be feasible for very high-dimensional data sets. The plus-l take-away-r method combines the former two techniques by first applying SFS for l features and then SBS for r features. Although this method seems promising, values for l and r have to be given by the user. Sequential Forward Floating Search (SFFS) and Sequential Backward Floating Search (SBFS) are generalizations of the two former methods; in this case, l and r are determined automatically.
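To make the SFS idea concrete, here is a hypothetical Python sketch that greedily adds the feature giving the largest gain in cross-validated accuracy and stops when no candidate improves it; the classifier and data set are placeholders, not part of any particular method above:

```python
# Hypothetical sketch of Sequential Forward Selection (SFS): greedily add the feature
# that most improves cross-validated accuracy; stop when no candidate improves it.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # placeholder data set

def cv_accuracy(features):
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, features], y, cv=5).mean()

selected, remaining, best_so_far = [], list(range(X.shape[1])), 0.0
while remaining:
    scores = {f: cv_accuracy(selected + [f]) for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_so_far:
        break  # adding any further feature no longer helps
    selected.append(f_best)
    remaining.remove(f_best)
    best_so_far = scores[f_best]

print("selected features:", selected, "cv accuracy:", round(best_so_far, 3))
```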

I hope these two posts have introduced you to feature selection and more specifically to wrapper-based techniques. Feel free to comment.

Selected references:

(1) François, D., High-dimensional data analysis: optimal metrics and feature selection, Université catholique de Louvain, Louvain-la-Neuve, Belgium, 2007.
(2) Reunanen, J., Overfitting in Making Comparisons Between Variable Selection Methods, Journal of Machine Learning Research, 2003, 3, 1371-1382.
(3) Jain, A.K., Duin, R.P.W. and Mao, J., Statistical Pattern Recognition: A Review, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22, 1, 4-37.


Tuesday, September 25, 2007

Introduction to feature selection (part 1)

Feature selection is a technique used to reduce the number of features before applying a data mining algorithm. Irrelevant features may have negative effects on a prediction task. Moreover, the computational complexity of a classification algorithm may suffer from the curse of dimensionality when the number of features is large. When a data set has too many irrelevant variables and only a few examples, overfitting is likely to occur. In addition, data are usually better characterized using fewer variables. Feature selection has been applied in fields such as multimedia database search, image classification and biometric recognition. A comprehensive introduction to feature selection can be found in the paper by Guyon et al.

Feature selection techniques can be divided into three main categories: embedded approaches (feature selection is part of the classification algorithm, e.g. decision trees), filter approaches (features are selected before the classification algorithm is used) and wrapper approaches (the classification algorithm is used as a black box to find the best subset of attributes). By their very definition, embedded approaches are limited since they suit only a particular classification algorithm. Moreover, a feature that is relevant in general is not necessarily relevant for a given classification algorithm. Filter methods make the assumption that the feature selection process is independent from the classification step. The work by Kohavi et al. (1995) recommends replacing filter approaches with wrappers. The latter usually provide better results, the price being a higher computational complexity. Although already known in statistics and pattern recognition, wrappers are relatively new in the data mining community.
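As an illustration of the filter idea (again not taken from the references below), the following Python sketch ranks features by their mutual information with the class label, independently of any classifier; a wrapper would instead score candidate subsets with the classifier itself, as shown in part 2:

```python
# Sketch of a filter approach: rank features by mutual information with the class
# label, independently of any particular classifier (contrast with the wrapper idea).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)  # placeholder data set

mi = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(mi)[::-1]  # indices of the most informative features first
print("feature ranking by mutual information:", ranking)
```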

Selected references:

(1) Kudo, M. and Sklansky, J., Comparison of algorithms that select features for pattern classifiers, Pattern Recognition, 2000, 33, 25-41.
(2) Blum, A.L. and Langley, P., Selection of relevant features and examples in machine learning, Artificial Intelligence, 1997, 97, 245-271.
(3) Kohavi, R. and John, G., Feature Selection for Knowledge Discovery and Data Mining, The Wrapper Approach, Kluwer Academic Publishers, 1998, 33-50.


Wednesday, February 07, 2007

Feature selection

One of the most interesting and well-written papers I have read regarding data mining is certainly "An Introduction to Variable and Feature Selection" (Guyon and Elisseeff, 2003). It is freely available on the Journal of Machine Learning Research website. After reading this paper, you should have a good view of what feature selection is really about. Although it is not a popularized account, the paper is written in a very readable way.

Feature selection may be useful for facilitating data visualization, reducing storage requirements and improving the performance of learning algorithms. The paper starts with a checklist of crucial points to consider before applying any learning algorithm to your data. Then, topics such as variable ranking and variable subset selection are covered. A clear distinction is made between three different techniques for variable selection: wrappers, filters and embedded methods.

The article continues with dimensionality reduction and validation techniques in the context of variable selection. Finally, examples of open problems are outlined. I have read several papers in data mining and related topics, and this is certainly the most comprehensive and readable one. In addition to the paper, and for more details about a Matlab implementation, you can have a look at this post on Will's blog.


Friday, January 05, 2007

Around the world

After a short week of posts, let's have a look at what's going on around the world. Here are some recent posts on different data mining blogs:

  • If you want to know more about feature selection, Will's blog has a post about it, with a practical discussion and a possible implementation in Matlab.
  • Juice Analytics has a nice post about pie charts and their drawbacks.
  • Abbott's Analytics has a post titled Data Mining and Terrorism where Dean writes about a recent article from Jim Harper on this subject.
Have a nice week-end!
