Panel Discussion on Foundations of Data Mining at RSCTC2004 J. T. Yao University of Regina Web:
What is the Foundations of Data Mining? Data mining (DM) research mainly focuses on algorithms and methodologies. There is a lack of study on mathematical modeling of, or foundations of, data mining. The study of foundations of data mining is in its infancy, and there are probably more questions than answers (Mannila 2000).
What is the Foundations of Data Mining? Chen's approach (2002): data mining can be studied along three different but related dimensions. The philosophical dimension deals with the nature and scope of data mining. The technical dimension covers data mining methods and techniques. The social dimension concerns the social impact and consequences of data mining.
What is the Foundations of Data Mining? Xie and Raghavan's approach (2002): logical foundation of data mining based on Bacchus' probability logic. Precise definition of intuitive notions, such as ``pattern'', ``previously unknown knowledge'' and ``potentially useful knowledge''. A logic induction operator is defined for discovering ``previously unknown and potentially useful knowledge''.
What is the Foundations of Data Mining? Lin's (2002), Tsumoto's (2002), and Yao's (2001) approaches: Granular computing as a basis for data mining. A concept consists of two parts, the intension and the extension of the concept. The intension of a concept consists of the properties of the objects that are its instances. The extension of a concept is the set of those instances. A rule can be expressed in the form φ ⇒ ψ, where φ and ψ are the intensions of two concepts. Rules are interpreted using the extensions of the two concepts.
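As a hedged illustration of how a rule φ ⇒ ψ might be read through extensions, the sketch below writes m(φ) for the extension of the concept whose intension is φ. The notation m(·), the universe U, and the two conditions shown are assumptions made for illustration and are not taken from the slides.

```latex
% Assumed notation: U is the universe of objects, x \models \phi means
% "object x satisfies formula \phi", and m(\phi) is the extension of \phi.
\[
  m(\phi) = \{\, x \in U \mid x \models \phi \,\}, \qquad
  m(\psi) = \{\, x \in U \mid x \models \psi \,\}
\]
% One possible reading: the rule holds without exception when every
% instance of \phi is also an instance of \psi.
\[
  \phi \Rightarrow \psi \ \text{holds for every object} \iff m(\phi) \subseteq m(\psi)
\]
% A weaker reading only requires the two extensions to overlap.
\[
  m(\phi) \cap m(\psi) \neq \emptyset
\]
```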
A Multi-level Framework for Modeling Data Mining The kernel focuses on the study of knowledge without reference to data mining algorithms. The technique levels focus on data mining algorithms without reference to particular applications. The application levels focus on the utility of discovered knowledge with respect to particular application domains.
How do Rough Sets Contribute to FDM? Knowledge is an entity in the semantic levels of data. Knowledge embedded in data is related to semantic interpretations of data. The existence of knowledge in data is unrelated to whether we have an algorithm to extract it. We need to separate the study of knowledge and the study of data mining algorithms, and in turn to separate them from the study of utility of discovered knowledge.
How do Rough Sets Contribute to FDM? Concepts are used as a primitive notion of data mining: Every concept is understood as a unit of thought that consists of two parts, the intension and the extension of the concept. Tarski's approach is used to study concepts through the notions of a model and satisfiability. An information table is used as a model. The intension of a concept is expressed by a formula of a decision language in the information table. The extension of a concept is expressed by the subset of objects that satisfy the formula.
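A minimal sketch of this idea, assuming a toy information table and a tiny formula language, might look as follows. The table contents, the tuple encoding of formulas, and the names `TABLE`, `satisfies`, and `extension` are all hypothetical and chosen only for illustration.

```python
# Hypothetical information table: each object maps attribute names to values.
TABLE = {
    "o1": {"height": "tall", "hair": "blond", "class": "+"},
    "o2": {"height": "short", "hair": "blond", "class": "+"},
    "o3": {"height": "tall", "hair": "dark", "class": "-"},
    "o4": {"height": "short", "hair": "dark", "class": "-"},
}

# Assumed formula encoding (the intension of a concept):
#   ("atom", attribute, value), ("not", f), ("and", f, g), ("or", f, g)


def satisfies(obj, formula):
    """Return True if a single object (a row of the table) satisfies the formula."""
    op = formula[0]
    if op == "atom":
        _, attribute, value = formula
        return obj[attribute] == value
    if op == "not":
        return not satisfies(obj, formula[1])
    if op == "and":
        return satisfies(obj, formula[1]) and satisfies(obj, formula[2])
    if op == "or":
        return satisfies(obj, formula[1]) or satisfies(obj, formula[2])
    raise ValueError(f"unknown connective: {op!r}")


def extension(table, formula):
    """Return the set of object names that satisfy the formula."""
    return {name for name, obj in table.items() if satisfies(obj, formula)}


blond = ("atom", "hair", "blond")
tall_and_dark = ("and", ("atom", "height", "tall"), ("atom", "hair", "dark"))
print(sorted(extension(TABLE, blond)))
print(sorted(extension(TABLE, tall_and_dark)))
```

Here the formula plays the role of the intension, and the set returned by `extension` plays the role of the extension of the same concept within the table.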
How do Rough Sets Contribute to FDM? Rules are used to express relationships between concepts. Rules can be interpreted and classified in terms of the extensions of concepts, using measures based on probability theory. Many classes of rules can be defined: association rules, exception rules, peculiarity rules, similarity rules, negative association rules, and conditional association rules. Both concepts and rules are used as examples to illustrate the focus of discussion at the kernel level.
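One way to make the probabilistic reading concrete is to compute simple measures directly from the extensions of the two concepts in a rule. The sketch below is an assumption-laden illustration: the function name `rule_measures`, the choice of support, confidence, and coverage as measures, and the example sets are hypothetical and not taken from the slides.

```python
def rule_measures(ext_phi, ext_psi, universe_size):
    """Estimate probability-style measures of a rule from two extensions.

    ext_phi, ext_psi: sets of objects satisfying the left and right sides.
    universe_size: total number of objects in the (assumed) universe.
    """
    both = ext_phi & ext_psi  # objects that are instances of both concepts
    return {
        # fraction of all objects that satisfy both sides
        "support": len(both) / universe_size,
        # fraction of left-side instances that also satisfy the right side
        "confidence": len(both) / len(ext_phi) if ext_phi else None,
        # fraction of right-side instances that also satisfy the left side
        "coverage": len(both) / len(ext_psi) if ext_psi else None,
    }


# Hypothetical extensions over a universe of four objects.
ext_phi = {"o1", "o2", "o3"}
ext_psi = {"o1", "o2", "o4"}
print(rule_measures(ext_phi, ext_psi, universe_size=4))
```

Different thresholds or combinations of such measures could then be used to separate classes of rules, though the exact definitions of each class are left to the cited work.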
References
Chen, Z. The three dimensions of data mining foundation, FDM'02.
Lin, T.Y. Issues in modeling for data mining, COMPSAC'02.
Mannila, H. Theoretical frameworks for data mining, SIGKDD Explorations, (2), 30-32.
Tsumoto, S., Lin, T.Y., Peters, J.F. Foundations of Data Mining via Granular and Rough Computing, COMPSAC'02, 2002.
Yao, Y.Y. Modeling data mining with granular computing, COMPSAC'01.
Yao, Y.Y. A step towards the foundations of data mining, SPIE Vol. 5098, 2003.