Data Mining Research


Thursday, November 29, 2007

Math Stats and Data Mining

I recently found a new data mining blog named "Math Stats and Data Mining", written by Rachel Graham. It is a very nice blog with a particular focus on statistics and making sense of data. I really like the way the posts are written: readable and entertaining, with a personal viewpoint. Certain posts are particularly interesting, such as the one on the Pythagorean Theorem or the one entitled "Why is Statistics So Scary?". You should definitely add it to your reading list if you're at all interested in maths: www.statisticgraphs.com


Friday, November 23, 2007

The two cultures according to Breiman

In a recent post on Data Mining Research, Will mentioned a paper entitled Statistical Modeling: The Two Cultures. This paper, written by Leo Breiman (the father of decision trees) and published in 2001 in Statistical Science, is intended for both statisticians and data miners. As the title indicates, Breiman compares two different cultures: the statistical culture, which assumes data models, and the data mining culture, which uses algorithmic models.

The whole paper compares these two ways of thinking and solving problems. The author suggests that algorithmic models should be used instead of data models. One of his main arguments is that data models are not applicable to a wide range of current problems. The strength of this article is that it explains complex ideas in a readable manner. Breiman is very good at showing the difference between the two approaches.

According to Breiman, the problem with statisticians can be explained this way:

"This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature."

This is of course not possible for very complex problems, which is one of the limitations of the statistical approach. By contrast, in data mining we treat the "mechanism devised by nature" as complex and unknown. The article then deals with topics such as the multiplicity of good models and the curse of dimensionality.
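To make the contrast between the two cultures concrete, here is a minimal sketch in Python. It is not taken from Breiman's paper: the "mechanism devised by nature" (a noisy sine curve), the sample sizes and the choice of k = 5 are all assumptions made for illustration. A straight-line fit stands in for a data model, and a k-nearest-neighbour average stands in for an algorithmic model.

```python
import numpy as np


def make_data(n, rng):
    """Hypothetical 'mechanism devised by nature': a nonlinear signal plus noise."""
    x = rng.uniform(-3.0, 3.0, size=n)
    y = np.sin(2.0 * x) + 0.1 * rng.normal(size=n)
    return x, y


def fit_linear(x_train, y_train):
    """Data-model culture: assume y = a*x + b and estimate a, b by least squares."""
    a, b = np.polyfit(x_train, y_train, deg=1)
    return lambda x: a * x + b


def fit_knn(x_train, y_train, k=5):
    """Algorithmic culture: no assumed form, average the k nearest training targets."""
    def predict(x):
        x = np.atleast_1d(x)
        # Pairwise distances between query points and training points.
        dist = np.abs(x[:, None] - x_train[None, :])
        nearest = np.argsort(dist, axis=1)[:, :k]
        return y_train[nearest].mean(axis=1)
    return predict


def mse(model, x, y):
    """Mean squared prediction error on held-out data."""
    return float(np.mean((model(x) - y) ** 2))


def compare(seed=0, n_train=500, n_test=200):
    """Fit both models on the same training set and score them on a test set."""
    rng = np.random.default_rng(seed)
    x_tr, y_tr = make_data(n_train, rng)
    x_te, y_te = make_data(n_test, rng)
    return {
        "linear": mse(fit_linear(x_tr, y_tr), x_te, y_te),
        "knn": mse(fit_knn(x_tr, y_tr), x_te, y_te),
    }


if __name__ == "__main__":
    print(compare())
```

On this toy problem the assumed linear form is simply wrong, so the nearest-neighbour predictor should reach a much lower test error. The example is deliberately one-sided: when the assumed model is close to the truth, the data-model approach can do just as well and is easier to interpret.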

Breiman's aim is not to say that data miners are more efficient than statisticians, but rather that statisticians should be open to a wider variety of tools. In conclusion, I think this paper is worth reading, whether you are a statistician or a data miner. I have read several papers during my PhD and this is certainly one of the most interesting ones.

Thanks to Will Dwinnell for mentioning this article.


Thursday, November 08, 2007

Data mining and statistics

I recently found an interesting paper about the connection between data mining and statistics. It is written by Diego Kuonen, who is now working at Statoo Consulting in Switzerland. The basic question driving his paper is whether data mining is statistical déjà vu.

After explaining what statistics is and why it is needed, he explains data mining using several definitions. He makes an interesting point when he writes that defining and understanding the business process are the most important parts of data mining tasks. He argues that:

"Even the most advanced algorithms cannot figure out what is most important."

He also refers to the garbage in, garbage out issue that has been previously discussed on Data Mining Research. He then concludes that data mining cannot be ignored by companies, since the advantages of knowledge extraction for businesses are enormous. I would like to quote a sentence I liked, in which he emphasizes the differences between data miners, statisticians and clients:

"[...] computer scientists focus upon database manipulations and processing algorithms; statisticians focus upon identifying and handling uncertainties; and clients focus upon integrating knowledge into the knowledge domain."

If you're interested, feel free to read the article.


Monday, June 04, 2007

Statistics vs data mining

I recently came across an article from DMReview about the differences between statistics and data mining. The article, by Kathy Lange, takes a business point of view (which is in general the point of view of the journal). After a short introduction comparing statistics and data mining, the author focuses on the use of predictive analytics for business and the so-called Data-Driven Decision-Making. One conclusion of the paper is that "From a business perspective, it doesn't really matter what you call it: statistics, data mining or predictive analytics." I guess it matters from the data mining point of view...


Friday, March 09, 2007

Data analysis blog

DataSciences Analytics is a blog dealing with all kinds of data analysis, such as statistics and data mining. The author, John Aitchison, claims that his blog is non-technical, and that is indeed the case. All posts are readable. The subjects covered are statistics in general, marketing and news related to data analysis. Posts are comprehensive and make for good reading material. Unfortunately, it is not possible to leave a comment on his blog (due to spam problems). However, the posts are worth reading and comments can be sent to the author through a special form.
