Self-Organised Data Mining – 20 Years after GUHA-80
Martin Kejkula, KEG, 8th April 2004

2 Agenda
- Idea of Self-Organised Data Mining
  - GUHA-80 revival
- Process of Self-Organised Data Mining
  - Key factors for Self-Organised Data Mining
  - Metabase, Knowledge Base, etc.
- Proposed EverMiner system for Self-Organised Data Mining

3 Introduction
- Motivation: support X-Miner users
  - Collection of best practices and known problems
- Mueller, Lemke: Self-Organising Data Mining (2000)
- My thesis:
  - Design and test strings of jobs for EverMiner
  - Formalize and use heuristics

4 References (1)
- Hájek, P. – Havránek, T.: GUHA 80: An Application of Artificial Intelligence to Data Analysis. Computers and Artificial Intelligence, Vol. 1, 1982, pp.
- Hájek, P. – Ivánek, J.: Artificial Intelligence and Data Analysis. Proc. COMPSTAT’82, Wien, Physica Verlag 1982, pp.

5 References (2)
- Hájek, P. – Havránek, T.: GUHA-80 – An Application of Artificial Intelligence to Data Analysis. Matematické středisko biologických ústavů ČSAV, Praha, 1982.
- Jirků, P. – Havránek, T.: On Verbosity Levels in Cognitive Problem Solvers. Proc. Computational Linguistics, 1982,

6 References (3)
- Rauch, J.: EverMiner – studie projektu. Dokumentace projektu LISp-Miner,
- Mueller, J.-A. – Lemke, F.: Self-Organising Data Mining. Extracting Knowledge from Data. Dresden, Berlin, 2000.

7 GUHA-80: Main Features
- Application of artificial intelligence to exploratory data analysis
- Generates interesting views of given empirical data (recognizes interesting logical patterns)
- Views: relevant, useful

8 GUHA-80 Sources (1)
- GUHA
  - Automatically generate all interesting hypotheses
- Lenat’s AM
  - Jobs (tasks)
  - Agenda of jobs
  - Hundreds of heuristic rules
  - Concepts

9 GUHA-80 Sources (2)
- GUHA-80 vs. Lenat’s AM
  - Data
  - Data-processing procedures
- Statistical program packages
  - Effective modules

10 GUHA-80 Paradigm
- Open-ended data analysis
  - To maximize the interestingness value
- Hundreds of heuristic rules
  - Guide the definition and study of the next step
- Access potentially relevant rules, find the truly relevant rules, follow the truly relevant rules

11 Interestingness in GUHA-80
- No explicit definition
- Determined by the interplay of:
  - Heuristic rules
  - Weighting mechanisms
  - Testing in practice (adequate behaviour?)
- No algorithm, but constraints

12 Principles of GUHA-80
- Domain dependence (…exploratory data analysis)
- Combine human and machine capabilities
- More heuristics are relevant
- Interactivity with the user
- Non-routine (GUHA-80 is not for everyday data processing)

13 GUHA-80 Structure (1)

14 GUHA-80 Structure (2)
- Input empirical data
- Input parameters
  - How “interestingness” is understood
- Effective modules (the system’s knowledge)
  - Clustering procedures
  - GUHA procedures
- Agenda of jobs (priority/weight)
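
The agenda of jobs with priorities/weights listed on this slide can be sketched as a priority queue. This is only an illustration, not the GUHA-80 implementation: the names `Job`, `Agenda` and `run_agenda`, and the convention that a higher weight runs first, are assumptions made for the example.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    """A hypothetical unit of work; run() may propose follow-up jobs."""
    name: str
    weight: float                      # assumed: higher weight = run sooner
    run: Callable[[], List["Job"]]


class Agenda:
    """Priority queue of jobs ordered by descending weight."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def add(self, job: Job) -> None:
        # heapq is a min-heap, so the weight is negated.
        heapq.heappush(self._heap, (-job.weight, next(self._counter), job))

    def pop(self) -> Job:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def run_agenda(agenda: Agenda, max_steps: int = 100) -> List[str]:
    """Run jobs by weight; each job may add new jobs to the agenda."""
    executed = []
    for _ in range(max_steps):
        if len(agenda) == 0:
            break
        job = agenda.pop()
        executed.append(job.name)
        for follow_up in job.run():
            agenda.add(follow_up)
    return executed


agenda = Agenda()
agenda.add(Job("describe", 1.0, lambda: []))
agenda.add(Job("discretize", 5.0, lambda: [Job("4ft", 3.0, lambda: [])]))
order = run_agenda(agenda)
```

The `max_steps` bound is a design choice for the sketch: an open-ended analysis can keep generating jobs, so some stopping criterion is needed.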

15 GUHA-80 Structure (3)
- Heuristics: optimal way to realize a job
- Changing system of concepts
- Hierarchy of concepts (applicability)
- Possible unification of heuristics, jobs,…

20 GUHA-80 Input
- Data
- Input information
  - Decompositions/orderings of sets of quantities
  - Helps to understand “interestingness”

21 GUHA-80 Effective modules
- Evaluation of usual statistical characteristics,…
- Complicated procedures
- Synthesis of parameters (“job on job”)

22 GUHA-80
- Hundreds of heuristic rules
- No explicit definition of interestingness (exploration in a space)
- Interactivity with the user
- Non-routine character

23 Process of S-O Data Mining
[Diagram labels:]
- Empirical Data
- Chains of Data & Knowledge Processing Tasks
- Domain Knowledge,…
- All Interesting Views, Patterns
- DataSource, TimeTransf, SumatraTT, 4ft, KL, CF, …

24 Process of S-O Data Mining

25 Key Factors of S-O Data Mining
- Data Preparation
- Modeling
- Evaluation
- Knowledge Base
- Domain Knowledge

26 Data Preparation
- Discretization
  - Attribute-type dependent: Nominal/Ordinal/Interval/Ratio
  - Dependent on the type of coefficient
  - Discretization-Modeling Cycle (KL, 4ft, CF,…)
  - Known problem with intervals of categories without values
  - Usually not one target attribute

27 Attribute-type-dependent discretization
- Nominal
  - Classes of values
- Ordinal
  - Extreme/missing values
  - Type of coefficient
  - Usually not one target attribute

28 Intervals of Categories without Values

29 Intervals of Categories without Values
Solution:
- Statistics – extreme values
- 4ft Task: correlations, implications
- Potentially interesting patterns
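
One possible reading of this problem is that a few extreme values stretch the range of an attribute, so that equal-width discretization produces interval categories containing no rows. The following sketch illustrates only that reading; the function names and the equal-width binning are assumptions, not part of the original slides.

```python
from typing import List, Sequence, Tuple


def equal_width_bins(values: Sequence[float], n_bins: int) -> Tuple[List[float], List[int]]:
    """Return bin edges and per-bin counts for equal-width discretization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot build intervals")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def empty_intervals(values: Sequence[float], n_bins: int) -> List[Tuple[float, float]]:
    """List the intervals (categories) that contain no values."""
    edges, counts = equal_width_bins(values, n_bins)
    return [(edges[i], edges[i + 1]) for i, c in enumerate(counts) if c == 0]


# A single extreme value (100) leaves the middle categories empty.
data = [1, 2, 3, 4, 100]
edges, counts = equal_width_bins(data, 5)
gaps = empty_intervals(data, 5)
```

Inspecting the extreme values first, as the slide suggests, would reveal why the middle intervals stay empty before any mining task is run.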

30 Extreme/Missing Values
- 4ft: Find associations between extreme/missing values (implications/correlations)
- CF, KL: Find patterns with extreme/missing values

31 Data Preparation
- Classes of attributes
  - Partial cedents
  - Associations between attributes in one class
  - Associations between partial cedents

32 Evaluation-Modeling
- Input information for partial cedents
- Mining for Interesting Patterns
  - Exceptions
  - Missing values
  - Extreme values
- Discovered hypotheses
  - Groups of hypotheses
  - Coverage hypotheses/input data
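
The "coverage hypotheses/input data" item can be illustrated with a small sketch. It assumes, purely for the example, that a row counts as covered when it satisfies both the antecedent and the succedent of at least one discovered hypothesis, and that hypotheses are represented as Python predicates; neither assumption comes from the slides.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Hypothesis = Callable[[Row], bool]  # hypothetical: True if the row supports it


def coverage(rows: List[Row], hypotheses: List[Hypothesis]) -> Tuple[float, List[Row]]:
    """Return the share of covered rows and the list of uncovered rows."""
    covered_flags = [any(h(r) for h in hypotheses) for r in rows]
    uncovered = [r for r, flag in zip(rows, covered_flags) if not flag]
    ratio = sum(covered_flags) / len(rows) if rows else 0.0
    return ratio, uncovered


rows = [
    {"age": "old", "risk": "high"},
    {"age": "young", "risk": "low"},
    {"age": "young", "risk": "high"},
]
# One illustrative hypothesis: age(old) => risk(high)
hypotheses = [lambda r: r["age"] == "old" and r["risk"] == "high"]
ratio, uncovered = coverage(rows, hypotheses)
```

The uncovered rows are the natural input for a follow-up task that searches for associations covering them.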

33 Heuristic Rules (1)
Examples:
- IF many extreme/missing values are found, THEN search for associations with extreme/missing values
- IF 0 hypotheses are found, THEN set weaker quantifier (p, Base) values
- IF a subset of the input data is not covered by hypotheses, THEN search for associations covering these data
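
The second rule above can be sketched as a relaxation loop. The sketch assumes that (p, Base) are the parameters of the founded implication quantifier, which holds when a/(a+b) ≥ p and a ≥ Base, where a counts rows satisfying antecedent and succedent and b counts rows satisfying the antecedent only. The step sizes, the lower bounds and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def founded_implication(a: int, b: int, p: float, base: int) -> bool:
    """Founded implication: confidence a/(a+b) >= p and support a >= base."""
    return a >= base and (a + b) > 0 and a / (a + b) >= p


def mine_with_relaxation(
    tables: Dict[str, Tuple[int, int]],   # label -> (a, b) frequencies
    p: float,
    base: int,
    p_step: float = 0.05,                 # assumed relaxation step for p
    base_step: int = 5,                   # assumed relaxation step for Base
    p_min: float = 0.5,
    base_min: int = 1,
) -> Tuple[List[str], float, int]:
    """IF 0 hypotheses found THEN weaken (p, Base) and try again."""
    while True:
        found = [
            label for label, (a, b) in tables.items()
            if founded_implication(a, b, p, base)
        ]
        if found or (p <= p_min and base <= base_min):
            return found, p, base
        # Weaken the quantifier; rounding avoids floating-point drift in p.
        p = max(p_min, round(p - p_step, 10))
        base = max(base_min, base - base_step)


tables = {"A => S": (40, 10), "B => S": (8, 2)}
found, final_p, final_base = mine_with_relaxation(tables, p=0.95, base=50)
```

Both candidates have confidence 0.8, but only the first has enough supporting rows once the thresholds have been relaxed to p = 0.8 and Base = 35.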

34 Heuristic Rules (2)
Examples:
- IF a column of the input data matrix is of nominal type AND there is no associated table for discretization, THEN each value is one category (attribute creation)
- Use the “subset” coefficient type for nominal attributes
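
The first rule above can be sketched as a small attribute-creation function. The function name, the string encoding of attribute types and the dictionary form of the discretization table are assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence


def create_attribute(
    column: Sequence[str],
    attr_type: str,
    discretization_table: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Map each distinct column value to a category name."""
    distinct = sorted(set(column))
    if discretization_table is not None:
        # An associated table exists: use it to group the values.
        return {value: discretization_table[value] for value in distinct}
    if attr_type == "nominal":
        # The heuristic rule: each value becomes its own category.
        return {value: value for value in distinct}
    raise ValueError(f"no default discretization for type {attr_type!r}")


colours = ["red", "blue", "red", "green"]
default_categories = create_attribute(colours, "nominal")
grouped = create_attribute(
    colours, "nominal", {"red": "warm", "blue": "cold", "green": "cold"}
)
```

Other attribute types raise an error here on purpose: the slides say their discretization depends on the attribute type and the coefficient type, which this sketch does not try to cover.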

35 Metabase, Knowledge Base
- Metadata (Knowledge):
  - Results of previous X-Miner tasks
  - Domain knowledge
  - Interaction with the user (learning?)

36 GUHA-80 vs. X-Miner (1)
- Task parameters (partial cedents, …)
- SW, HW
- Experience with LM applications,…

37 GUHA-80 vs. X-Miner (2)
- More complex heuristics

38 EverMiner – Features
- Based on LISp-Miner (X-Miners)
- Agenda of jobs, priority/strings
- Heuristics
- Interaction with the user
- Allows the process to be repeated on new data (“check” vs. new KDD process)

39 EverMiner – where we are
- Experience (medicine, traffic, shares, sociology,…)
- Collection of heuristics (www, brainstorming)
- Co-operation with data preparation experts (FEL, SumatraTT)
- Testing “strings of jobs” (learning)

40 Discussion