Data Mining Research

Wednesday, October 24, 2007

Data, Information, Knowledge and Wisdom

The aim of data mining is to draw understandable knowledge from raw data. Behind these notions of data and knowledge lies a more complex hierarchy. This hierarchy originated independently in knowledge management, design and information science (1). In knowledge management, the Data Information Knowledge Wisdom (DIKW) hierarchy, or pyramid, was introduced separately by Cleveland in 1982, Zeleny (2) in 1987 and Ackoff in 1989.

Zeleny maps the levels of the DIKW hierarchy to know-nothing, know-what, know-how and know-why, respectively. Ackoff (3) proposes comprehensive definitions of these terms. He writes that "[data] are products of observation". Data simply exist and have no significance. Information, which is inferred from data, answers questions such as who, what, where, when and how many. It consists of data linked together by relational connections. Knowledge is know-how and is acquired through learning; it is a useful collection of information. Ackoff proposes an additional layer named Understanding. It represents the why and makes it possible to synthesize new knowledge from existing knowledge. Finally, Wisdom is the ability to evaluate any choice. As written in Bellinger (4), "it asks questions to which there is no (easily-achievable) answer". In information science, Cleveland mentions a hierarchy for information, knowledge and wisdom.

Illustration of information, knowledge and wisdom. Originally published in THE FUTURIST (1992). Used with permission from the World Future Society, 7910 Woodmont Avenue, Suite 450, Bethesda, Maryland 20814. Telephone:301-656-8274; www.wfs.org.
According to Cleveland, this hierarchy was first mentioned by the poet T.S. Eliot in 1934:

"Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?"

I recently had a discussion about this hierarchy with a colleague. He told me that it is wrong and outdated, and he gave two main arguments. First, since English is an evolving language, the word information no longer means what it used to. In his opinion, information now sits above both Data and Knowledge. Second, he argues that databases contain knowledge, whereas according to the schema above, raw data do not. For him, the functional dependencies in a database are knowledge. In my opinion, functional dependencies come from the user (the domain knowledge) and are not found in the data. What do you think? Is the DIKW hierarchy obsolete? It is an open question, so any comment is welcome.
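To make the functional dependency argument a bit more concrete, here is a minimal Python sketch that tests whether a dependency such as zip -> city is satisfied by a set of rows. The function name `fd_holds`, the column names and the sample rows are all made up for illustration; none of them come from the discussion above. The sketch only checks the rows it is given.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on the lhs columns but differ on the rhs columns.

    rows: list of dicts (one dict per record); lhs, rhs: lists of column names.
    All names here are hypothetical, chosen for this illustration only.
    """
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        value = tuple(row[col] for col in rhs)
        # Remember the first rhs value seen for this lhs value;
        # any later row with the same lhs must carry the same rhs.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Made-up sample data: two customers share a zip code and a city.
rows = [
    {"name": "Ann", "zip": "1003", "city": "Lausanne"},
    {"name": "Bob", "zip": "1003", "city": "Lausanne"},
    {"name": "Cleo", "zip": "1201", "city": "Geneva"},
]

zip_determines_city = fd_holds(rows, ["zip"], ["city"])
city_determines_name = fd_holds(rows, ["city"], ["name"])
```

On this sample, zip -> city is satisfied and city -> name is not. Note what such a check can and cannot say: one conflicting pair of rows is enough to refute a dependency, but passing the check only shows that the rows present are consistent with it, not that the rule holds in the domain.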

(1) Sharma, N., The Origin of the Data Information Knowledge Wisdom Hierarchy, 2005, School Of Information, University of Michigan.
(2) Zeleny, M., Management Support Systems: Towards Integrated Knowledge Management, Human Systems Management, 1987, 7, 1, 59-70.
(3) Ackoff, R.L., From Data to Wisdom, Journal of Applied Systems Analysis, 1989, 16, 3-9.
(4) Bellinger, G. and Castro, D. and Mills, A., Data, Information, Knowledge, and Wisdom, 2005.


Thursday, January 18, 2007

What Google can't mine

While reading a book about search and information, I found one chapter, about the hidden web, particularly interesting. Basically, the hidden web is the part of the Internet that is accessible to people but not to bots (such as Google's crawlers). In other words, these pages exist, but they are not referenced by search engines simply because they are too difficult (or sometimes impossible?) to index. Examples include dynamically generated web pages, most databases behind websites, and pages requiring password access. More details can be found on Wikipedia under the term deep web.

Michael Bergman estimates the size of the hidden web at 400-550 times that of the visible web. I think the metaphor of an iceberg fits this situation well. The question now is: what will Google and other search engines do to access this information, or at least part of it?