Standardization vs. normalization
In the overall knowledge discovery process, before data mining itself, data preprocessing plays a crucial role. One of the first steps concerns the normalization of the data. This step is very important when dealing with parameters of different units and scales. For example, some data mining techniques use the Euclidean distance. Therefore, all parameters should have the same scale for a fair comparison between them.
Two methods are well known for rescaling data. The first is normalization, which scales each numeric variable to the range [0,1]. One possible formula is min-max scaling:
x' = (x - min(x)) / (max(x) - min(x))
Alternatively, you can standardize your data set. Each variable is then transformed to have zero mean and unit variance, for example using the equation:
z = (x - mean(x)) / std(x)
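As a minimal sketch of the two rescaling methods just described, the pure-Python functions below apply min-max normalization and standardization to a single variable. The function names and the sample values are made up for illustration, and the population standard deviation is assumed here; the sample standard deviation is an equally common choice.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no scale information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample: one variable measured in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
normalized = min_max_normalize(heights_cm)
standardized = standardize(heights_cm)
```

In practice each variable (column) is rescaled separately, so that no single unit dominates a distance computation.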
Both of these techniques have their drawbacks. If you have outliers in your data set, normalizing your data will squeeze the "normal" data into a very small interval, because the minimum and maximum are set by the outliers. And in general, most data sets have outliers. When using standardization, you implicitly assume that your data are reasonably well described by a Gaussian law (with a certain mean and standard deviation). This may not be the case in reality.
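The outlier problem can be seen on a small made-up sample. In this sketch (the data and names are hypothetical), a single extreme value forces all the other values into a tiny fraction of the [0,1] range:

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Five typical values and one outlier (illustrative data only).
data = [10.0, 12.0, 11.0, 13.0, 12.0, 1000.0]
scaled = min_max_normalize(data)

# Width of the interval occupied by the five typical values after scaling.
typical = scaled[:-1]
spread = max(typical) - min(typical)
```

Here the typical values, which originally spanned 10 to 13, end up occupying less than 1% of the normalized range.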
So my question is what do you usually use when mining your data and why?
Note: Thanks to Benny Raphael for fruitful discussions on this topic.
6 comments:
Sometimes perhaps we can take logarithms of input data when they contain order-of-magnitude larger and smaller values. However, since logarithms are defined for positive values only, we need to take care when the input data may contain zero and negative values.
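A rough sketch of this idea follows. The first function is a plain base-10 log that refuses non-positive input. The second is one possible workaround for zero and negative values, a sign-preserving log(1 + |x|); it is an assumption of this example, not something proposed in the comment, and other choices (such as shifting the data) exist.

```python
import math


def log_transform(values):
    """Base-10 logarithm; only defined for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log10(v) for v in values]


def signed_log1p(values):
    """Sign-preserving log(1 + |x|): defined for zero and negative values too."""
    return [math.copysign(math.log1p(abs(v)), v) for v in values]


# Illustrative values spanning several orders of magnitude.
wide = [1e6, 1e9, 1e12]
compressed = log_transform(wide)

# Illustrative values including zero and a negative number.
mixed = [-1000.0, 0.0, 1000.0]
compressed_mixed = signed_log1p(mixed)
```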
You did very good work on your blog! :)
A few points come to mind:
1. Monotonic scaling of the data (assuming that distinct values are not collapsed) will have no effect on the most common logical learning algorithms (tree- and rule-induction algorithms).
2. There are robust alternatives, such as: subtract the median and divide by the IQR, or scale linearly so that the 5th and 95th percentiles meet some standard range.
3. Outliers (and, technically, high-leverage points) present an interesting challenge. One possibility is to Winsorize the data after scaling it.
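The two robust alternatives from point 2 could be sketched as follows. The helper names are hypothetical, and the percentile is computed by linear interpolation between order statistics, which is only one of several common conventions. Both transforms are linear, so they work equally well with negative and positive values.

```python
from statistics import median


def percentile(values, p):
    """Linearly interpolated percentile of values, with p in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def robust_scale(values):
    """Subtract the median and divide by the interquartile range (IQR)."""
    med = median(values)
    iqr = percentile(values, 75) - percentile(values, 25)
    if iqr == 0:
        raise ValueError("IQR is zero; cannot scale")
    return [(v - med) / iqr for v in values]


def percentile_scale(values, low_pct=5, high_pct=95, new_min=0.0, new_max=1.0):
    """Scale linearly so the low/high percentiles map to new_min/new_max."""
    lo = percentile(values, low_pct)
    hi = percentile(values, high_pct)
    if hi == lo:
        raise ValueError("percentiles coincide; cannot scale")
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


# Illustrative data with both negative and positive values.
data = [float(v) for v in range(-10, 11)]
robust = robust_scale(data)
pct_scaled = percentile_scale(data)
```

Note that with percentile scaling, values beyond the chosen percentiles land outside the target range; they can be left there or clamped afterwards.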
Thanks for your comment, fay. I agree with you on taking the log. I used to work with data in the range 10^6 to 10^12, for example. And thanks for the remark :-)
Will, your suggestions seem very interesting. I don't know the "winsorize" technique, but it seems it could be used in addition to normalization.
For readers who are not aware of this technique: "Winsorizing" data simply means clamping the extreme values.
This is similar to trimming the data, except that instead of discarding data, values greater than the specified upper limit are replaced with the upper limit, and those below the lower limit are replaced with the lower limit. Often, the specified range is indicated in terms of percentiles of the original distribution (like the 5th and 95th percentiles).
This process is sometimes used to make conventional measures more robust, as in the Winsorized variance.
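A minimal sketch of Winsorizing at the 5th and 95th percentiles is given below. The function names and the sample are hypothetical, and the percentile uses linear interpolation between order statistics, which is an assumption of this example rather than a fixed part of the technique.

```python
def percentile(values, p):
    """Linearly interpolated percentile of values, with p in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def winsorize(values, low_pct=5, high_pct=95):
    """Clamp values to the [low_pct, high_pct] percentile limits."""
    lo = percentile(values, low_pct)
    hi = percentile(values, high_pct)
    # Unlike trimming, no value is discarded: extremes are replaced by the limits.
    return [min(max(v, lo), hi) for v in values]


# Illustrative data: the values 1..20 plus one extreme outlier.
data = [float(v) for v in range(1, 21)] + [1000.0]
clamped = winsorize(data)
```

The output has the same length as the input, which is what distinguishes Winsorizing from trimming.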
Will, can you tell me how I can scale linearly so that the 5th and 95th percentiles meet some standard range?
Can this be done with both negative and positive values?
Another question:
If I want to compute an index where not only the units and scales are different, but also the input metrics into the index have different interpretations - specifically, one metric is better if the values are higher and another one is better if the values are lower, how can I compute an index that represents all numbers concisely and meaningfully?
Let's say I have Expenses ($), Profits($) and Turnover (%). Expenses and Turnover are better if lower, but Profits are better if higher.
If comparing two companies on these metrics, and I want to compute one index to show the "best" performing company on these parameters, how can I do this?
Sorry, not strictly data-mining relevant, but thought someone here might have an answer!
Tried using z-scores and normalizing, but that doesn't work due to the different hi-low interpretations.
Eventually used a reverse rank for Expenses and Turnover so that all have the same order. However, rank does not show the quantity difference between the two companies, just their ranks!
this is a great blog, thanks to all for helpful comments.
First, you can normalize/standardize your data. Alternatively, you could decide to manually assign weights to each of these metrics.
You can, for example, use an objective function. Let's say you want to maximize a function of the Expenses, Profits and Turnover. In the objective function, give a negative weight to Expenses and Turnover and a positive one to Profits. I don't know if this will work for your problem, but that would be my first guess.
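One way this could look, as a sketch only: standardize each metric across companies, then sum the standardized values with a signed weight per metric. The company figures and the weights of -1 and +1 are invented for illustration; in a real setting the weights would have to reflect how much each metric matters.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_index(metrics, weights):
    """Weighted sum of standardized metrics, one score per company.

    A negative weight marks a metric where lower values are better.
    """
    z = {name: standardize(vals) for name, vals in metrics.items()}
    n = len(next(iter(metrics.values())))
    return [sum(weights[name] * z[name][i] for name in metrics) for i in range(n)]


# Hypothetical figures for three companies (index 0, 1, 2).
metrics = {
    "expenses": [100.0, 120.0, 140.0],   # $, lower is better
    "profits": [30.0, 20.0, 10.0],       # $, higher is better
    "turnover": [5.0, 10.0, 15.0],       # %, lower is better
}
# Hypothetical weights: the sign encodes the direction, the size the importance.
weights = {"expenses": -1.0, "profits": 1.0, "turnover": -1.0}

scores = composite_index(metrics, weights)
best = max(range(len(scores)), key=scores.__getitem__)
```

Unlike a rank, the resulting score keeps the size of the gap between companies. With only two companies, though, standardized values are always -1 and +1, so the approach is more informative when more companies are compared.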