Data Mining Research - A note on correlation

Tuesday, March 13, 2007

A note on correlation

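Correlation is often used as a preliminary technique to discover relationships between variables. More precisely, correlation is a measure of the linear relationship between two variables. Pearson's correlation coefficient is defined as:

r(x, y) = cov(x, y) / (s_x * s_y)

where cov(x, y) is the covariance of x and y, and s_x and s_y are their standard deviations.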

As noted above, the main drawback of correlation is its restriction to linear relationships. If the correlation between two variables is zero, they may still be non-linearly related. As Tan et al. (2006) point out, x and x^2 have a correlation of zero (when the values of x are symmetric around zero) but are clearly related, although not linearly. Keep in mind that non-linear does not mean polynomial. Consider for example x and cos(x). Although their correlation is close to zero, they are related.
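To make this concrete, here is a minimal sketch in plain Python. The `pearson` helper and the evenly spaced grid of x values are illustrative assumptions, not something taken from Tan et al.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative grid: 201 evenly spaced points on [-10, 10], symmetric around zero.
xs = [i / 10 for i in range(-100, 101)]

r_linear = pearson(xs, [2 * x + 1 for x in xs])     # exact linear relationship
r_square = pearson(xs, [x ** 2 for x in xs])        # deterministic, non-linear
r_cosine = pearson(xs, [math.cos(x) for x in xs])   # deterministic, not polynomial

print(r_linear, r_square, r_cosine)
```

On this symmetric grid the first value is 1 up to rounding and the other two are zero up to rounding, even though x^2 and cos(x) are fully determined by x. If the grid is shifted away from zero (say, only positive x), the correlation between x and x^2 is no longer zero.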

P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2006.


Kevin Hillstrom said...

What metric do you recommend using in place of correlation? In other words, what metric do you recommend, one that is good at telling me that "x" and "x^2" are essentially correlated?

Dean Abbott said...

Ahh, the ol' x vs. x^2 example! I use that one all the time in teaching about correlations. I usually use correlations as a first step in eliminating variables that are replicates of one another: if two variables are correlated at 0.95 or above (or -0.95 or below), they bring essentially the same information to the table, and only one is needed to convey that piece of information.
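A rough sketch of this kind of redundancy filter, in plain Python, might look as follows. The function name `drop_redundant`, the greedy keep-the-first strategy, and the toy data are all assumptions made for illustration; only the 0.95 threshold comes from the comment.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def drop_redundant(columns, threshold=0.95):
    """Keep a column only if |r| with every already-kept column is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: height in cm and in inches are replicates of each other.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34.0, 25.0, 51.0, 42.0, 29.0],
}

kept_columns = drop_redundant(data)
print(kept_columns)
```

Here `height_in` is dropped because it is almost perfectly correlated with `height_cm`, while `age` is kept. Which member of a correlated pair survives depends on column order, which is a simplification of this sketch.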

Regarding Kevin's question--it is a good one. I actually wrote a paper (not a very good one) for an IEEE Systems, Man, and Cybernetics conference on the topic of "nonlinear" correlations, and proposed an algorithm to find these nonlinear relationships. It basically fit simple nonlinear models of every variable and used a scoring metric (like R^2) to assess how related the variables were to each other nonlinearly. But I have never found anything very clean in this regard, and typically don't worry about these nonlinear relationships.
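The general idea can be sketched as follows. This is not the algorithm from the paper, whose details are not given here; the set of candidate transforms and the function name `best_nonlinear_r2` are hypothetical. The sketch fits y ~ a + b * g(x) for a few simple functions g and scores each fit with R^2, which for a one-predictor least-squares line equals the squared Pearson correlation between g(x) and y.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Hypothetical candidate transforms, chosen only for illustration.
TRANSFORMS = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "cosine": math.cos,
    "abs": abs,
}

def best_nonlinear_r2(xs, ys):
    """Return (transform name, R^2) of the best fit y ~ a + b * g(x)."""
    scores = {}
    for name, g in TRANSFORMS.items():
        gx = [g(x) for x in xs]
        # R^2 of a simple linear regression is the squared correlation.
        scores[name] = pearson(gx, ys) ** 2
    best = max(scores, key=scores.get)
    return best, scores[best]

xs = [i / 10 for i in range(-50, 51)]
ys = [3 * x * x - 2 for x in xs]   # hidden quadratic relationship

best_name, best_r2 = best_nonlinear_r2(xs, ys)
print(best_name, best_r2)
```

On this toy data the "identity" transform scores an R^2 of about zero, while "square" recovers the relationship almost perfectly. The result is only as good as the list of candidate transforms, which is one reason such approaches are rarely clean in practice.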

The biggest reason people worry about the linear correlation problem is that collinearity is a destructive problem in regression models. But "nonlinear correlation" is not, and in fact, including both terms in a regression model can be quite a good idea precisely because they are orthogonal (linearly).


damien françois said...

Indeed, correlation is only able to spot linear dependencies between variables. For nonlinear dependencies, one can consider an order-based version of the correlation, known as rank correlation, which correlates ranks instead of values. This approach is still not suitable for detecting non-monotonic relationships (such as x->x^2 over a domain centered on zero). In that case, mutual information can be used, but it is much more difficult to estimate than correlation.
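As a small illustration of the rank-based idea, the sketch below computes Spearman's rank correlation as the Pearson correlation of average ranks. The helper names and the test grid are assumptions; mutual information is left out because estimating it needs more machinery than fits in a short example.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [i / 10 for i in range(-30, 31)]       # symmetric around zero
exp_y = [math.exp(x) for x in xs]           # monotonic, non-linear
sq_y = [x * x for x in xs]                  # non-monotonic on this domain

print(pearson(xs, exp_y), spearman(xs, exp_y))
print(pearson(xs, sq_y), spearman(xs, sq_y))
```

For exp(x), the rank correlation is 1 while Pearson's coefficient is noticeably lower. For x^2 on the symmetric grid, both measures are zero up to rounding, which is exactly the limitation described above.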

Amit said...

Damien has answered that question to an extent. Spearman's Rank Correlation is a way of measuring correlation in monotonic relationships.

And to add, it depends on the two variables in question, whether Pearson's Correlation, described by Sandro, will give good results.

For example, if you are talking about 2 non-continuous variables, the story changes and you need either a Chi-square test or a point-biserial correlation.
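For two categorical variables, the chi-square statistic is computed from a contingency table of counts. The following is a minimal sketch with hypothetical counts. It returns only the statistic, not a p-value, and it assumes every expected count is non-zero.

```python
def chi_square_statistic(table):
    """Pearson's chi-square statistic for a contingency table (rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 tables of counts.
independent = [[20, 30], [40, 60]]   # same column proportions in each row
dependent = [[45, 5], [5, 45]]       # rows strongly predict columns

chi_independent = chi_square_statistic(independent)
chi_dependent = chi_square_statistic(dependent)
print(chi_independent, chi_dependent)
```

A statistic near zero is consistent with independence, while larger values indicate association. Turning the statistic into a p-value requires the chi-square distribution with (rows - 1) * (columns - 1) degrees of freedom.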

Push said...


I am using Pearson correlation for a movie rank prediction problem. What I am wondering is whether I will get good results even if there is no linear relationship between users who rank movies.

Pushkar Raste

Jeff Zanooda said...

In credit scoring, information value is routinely used in univariate analysis.

Another popular approach is to look at both Spearman's rank correlation and Hoeffding's measure of dependence.
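For reference, information value is usually computed over bins of a variable from the share of "good" and "bad" cases falling in each bin. Here is a minimal sketch, assuming the common definition IV = sum over bins of (p_good - p_bad) * ln(p_good / p_bad). The binning, the counts, and the function name are hypothetical, and every bin is assumed to contain at least one good and one bad case.

```python
import math

def information_value(good_counts, bad_counts):
    """Information value of a binned variable from per-bin good/bad counts."""
    total_good = sum(good_counts)
    total_bad = sum(bad_counts)
    iv = 0.0
    for good, bad in zip(good_counts, bad_counts):
        p_good = good / total_good   # share of all goods in this bin
        p_bad = bad / total_bad      # share of all bads in this bin
        # ln(p_good / p_bad) is the bin's weight of evidence.
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv

# Hypothetical counts per bin.
iv_uninformative = information_value([10, 20], [5, 10])   # same distribution
iv_informative = information_value([80, 20], [20, 80])    # very different

print(iv_uninformative, iv_informative)
```

When goods and bads are distributed identically across bins, the information value is zero. The more the two distributions differ, the larger it gets, and no linear relationship is required.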
