Self-Organised Data Mining – 20 Years after GUHA-80
Martin Kejkula, KEG, 8th April 2004
http://gama.vse.cz/keg/

2 Agenda
- Idea of Self-Organised Data Mining
  - GUHA-80 revival
- Process of Self-Organised Data Mining
  - Key factors for Self-Organised Data Mining
  - Metabase, Knowledge Base, etc.
- Proposed EverMiner system for Self-Organised Data Mining

3 Introduction
- Motivation: support X-Miner users
  - Collection of best practices and known problems
- Mueller, Lemke: Self-Organising Data Mining (2000)
- My thesis:
  - Design and test strings of jobs for EverMiner
  - Formalization and use of heuristics

4 References (1)
- Hájek, P. – Havránek, T.: GUHA 80: An Application of Artificial Intelligence to Data Analysis. Computers and Artificial Intelligence, Vol. 1, 1982, pp. 107-134.
- Hájek, P. – Ivánek, J.: Artificial Intelligence and Data Analysis. Proc. COMPSTAT’82, Wien, Physica Verlag, 1982, pp. 54-60.

5 References (2)
- Hájek, P. – Havránek, T.: GUHA-80 – An Application of Artificial Intelligence to Data Analysis. Matematické středisko biologických ústavů ČSAV, Praha, 1982.
- Jirků, P. – Havránek, T.: On Verbosity Levels in Cognitive Problem Solvers. Proc. Computational Linguistics, 1982, http://acl.eldoc.ub.rug.nl/mirror/C/C82/

6 References (3)
- Rauch, J.: EverMiner – studie projektu [EverMiner: a project study]. Dokumentace projektu LISp-Miner [LISp-Miner project documentation], 2003.
- Mueller, J.-A. – Lemke, F.: Self-Organising Data Mining. Extracting Knowledge from Data. Dresden, Berlin, 2000.

7 GUHA-80: Main Features
- Application of artificial intelligence to exploratory data analysis
- Generates interesting views onto given empirical data (recognizes interesting logical patterns)
- Views should be relevant and useful

8 GUHA-80 Sources (1)
- GUHA
  - Automatically generate all interesting hypotheses
- Lenat’s AM
  - Jobs (tasks)
  - Agenda of jobs
  - Hundreds of heuristic rules
  - Concepts

9 GUHA-80 Sources (2)
- GUHA-80 vs. Lenat’s AM
  - Data
  - Data-processing procedures
- Statistical program packages
  - Effective modules

10 GUHA-80 Paradigm
- Open-ended data analysis
  - To maximize interestingness value
- Hundreds of heuristic rules
  - Guide to defining and studying the next step
- Accesses potentially relevant rules, finds truly relevant rules, follows truly relevant rules

11 Interestingness in GUHA-80
- No explicit definition
- Determined by the interplay of
  - Heuristic rules
  - Weighting mechanisms
  - Testing in practice (adequate behaviour?)
- No algorithm, but constraints

12 Principles of GUHA-80
- Domain dependence (…exploratory data analysis)
- Combine human capabilities with those of the machine
- More than one heuristic may be relevant
- Interactivity with the user
- Non-routine (GUHA-80 is not for everyday data processing)

13 GUHA-80 Structure (1)

14 GUHA-80 Structure (2)
- Input empirical data
- Input parameters
  - How “interestingness” is understood
- Effective modules (system’s knowledge)
  - Clustering procedures
  - GUHA procedures
- Agenda of jobs (priority/weight)
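The agenda of jobs with a priority/weight per job can be pictured as a priority queue. The following is only a minimal sketch under that assumption; the `Agenda` class, the job names, and the weights are hypothetical illustrations and are not taken from GUHA-80 itself.

```python
import heapq
import itertools


class Agenda:
    """Hypothetical agenda: the job with the highest weight is taken first."""

    def __init__(self):
        self._heap = []
        # Running counter breaks ties so equal weights keep insertion order.
        self._counter = itertools.count()

    def add(self, weight, name):
        # heapq implements a min-heap, so the weight is negated.
        heapq.heappush(self._heap, (-weight, next(self._counter), name))

    def pop(self):
        neg_weight, _, name = heapq.heappop(self._heap)
        return name, -neg_weight

    def __len__(self):
        return len(self._heap)


agenda = Agenda()
agenda.add(0.4, "cluster rows")               # illustrative job names
agenda.add(0.9, "run GUHA procedure")
agenda.add(0.4, "compute basic statistics")

# Drain the agenda in priority order.
order = [agenda.pop()[0] for _ in range(len(agenda))]
print(order)
```

In a fuller system the heuristics would add new jobs and adjust weights while the agenda is being processed; this sketch only shows the ordering.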

15 GUHA-80 Structure (3)
- Heuristics: optimal way to realize a job
- Changing system of concepts
- Hierarchy of concepts (applicability)
- Possible unification of heuristics, jobs,…

20 GUHA-80 Input
- Data
- Input information
  - Decompositions/orderings of sets of quantities
  - Helps to understand “interestingness”

21 GUHA-80 Effective modules
- Evaluation of usual statistical characteristics,…
- Complicated procedures
- Synthesis of parameters (“job on job”)

22 GUHA-80
- Hundreds of heuristic rules
- No explicit definition of interestingness (exploration in a space)
- Interactivity with the user
- Non-routine character

23 Process of S-O Data Mining
[Diagram labels] Empirical Data; Domain Knowledge,…; Chains of Data & Knowledge Processing Tasks (DataSource, TimeTransf, SumatraTT, 4ft, KL, CF, …); All Interesting Views, Patterns

24 Process of S-O Data Mining

25 Key Factors of S-O Data Mining
- Data Preparation
- Modeling
- Evaluation
- Knowledge Base
- Domain Knowledge

26 Data Preparation
- Discretization
  - Attribute type dependent: Nominal/Ordinal/Interval/Ratio
  - Type of coefficient dependent
  - Discretization-Modeling Cycle (KL, 4ft, CF,…)
  - Known problem with intervals of categories without values
  - Usually not one target attribute

27 Attribute type dependent discretization
- Nominal
  - Classes of values
- Ordinal
  - Extreme/missing values
  - Type of coefficient
  - Usually not one target attribute

28 Intervals of Categories without Values

29 Intervals of Categories without Values
Solution:
- Statistics – extreme values
- 4ft Task: correlations, implications
- Potentially interesting patterns
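One plausible way intervals without values can arise is equidistant discretization of a column that contains an extreme value. The sketch below illustrates only that reading; the function `equidistant_bins` and the sample data are hypothetical and not taken from the presentation.

```python
def equidistant_bins(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    if width == 0:
        # All values are identical: everything lands in the first interval.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value is clamped into the last interval.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


ages = [21, 22, 23, 24, 25, 26, 95]   # 95 plays the role of an extreme value
counts = equidistant_bins(ages, 5)
empty = [i for i, c in enumerate(counts) if c == 0]
print(counts, empty)
```

Here a single extreme value stretches the range so that the middle intervals stay empty, which is why checking basic statistics for extreme values before discretizing is a natural first step.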

30 Extreme/Missing Values
- 4ft: Find associations between extreme/missing values (impl/correl)
- CF, KL: Find patterns with extreme/missing values

31 Data Preparation
- Classes of attributes
  - Partial cedents
  - Associations between attributes in one class
  - Associations between partial cedents

32 Evaluation-Modeling
- Input information for partial cedents
- Mining for Interesting Patterns
  - Exceptions
  - Missing values
  - Extreme values
- Discovered hypotheses
  - Groups of hypotheses
  - Coverage hypotheses/input data

33 Heuristic Rules (1)
Examples:
- IF more extreme/missing values are found THEN search for associations with extreme/missing values
- IF 0 hypotheses are found THEN set less strict quantifier values (p, Base)
- IF a subset of the input data is not covered by hypotheses THEN search for associations covering these data
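The rule about relaxing the quantifier when no hypotheses are found can be sketched as a simple loop. This is a minimal illustration, not the EverMiner implementation: it assumes the founded implication quantifier of 4ft-Miner (a/(a+b) ≥ p and a ≥ Base, where a and b are the first-row frequencies of the four-fold table), and the step sizes, lower bounds, function names, and candidate data are all hypothetical.

```python
def founded_implication(a, b, p, base):
    """Founded implication: confidence a/(a+b) >= p and support a >= base."""
    return a >= base and (a + b) > 0 and a / (a + b) >= p


def mine(candidates, p, base):
    """Return names of candidate rules that satisfy the quantifier."""
    return [name for name, (a, b) in candidates.items()
            if founded_implication(a, b, p, base)]


def mine_with_relaxation(candidates, p, base,
                         p_step=0.05, base_step=5, p_min=0.5, base_min=1):
    """IF 0 hypotheses found THEN retry with less strict p and Base."""
    while True:
        found = mine(candidates, p, base)
        if found or (p <= p_min and base <= base_min):
            return found, p, base
        # Rounding avoids accumulating floating-point drift in p.
        p = max(p_min, round(p - p_step, 10))
        base = max(base_min, base - base_step)


# Hypothetical candidates: name -> (a, b) from the four-fold table.
candidates = {"A => S": (42, 8), "B => S": (12, 18)}
found, p, base = mine_with_relaxation(candidates, p=0.95, base=50)
print(found, p, base)
```

The lower bounds `p_min` and `base_min` stop the relaxation, so the loop terminates even when no candidate can ever qualify.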

34 Heuristic Rules (2)
Examples:
- IF nominal type of column (input data matrix) AND no associated table for discretization THEN each value is one category (attribute creation)
- Use “subset” coefficient type for nominal attributes
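The attribute-creation rule for nominal columns can be sketched as follows. The function `create_categories`, its parameters, and the sample data are hypothetical names chosen for illustration; only the rule itself comes from the slide.

```python
def create_categories(values, column_type, discretization_table=None):
    """Map each distinct raw value of a column to a category."""
    distinct = set(values)
    if discretization_table is not None:
        # An associated table exists: use its predefined categories.
        return {v: discretization_table[v] for v in distinct}
    if column_type == "nominal":
        # Rule from the slide: each value is one category.
        return {v: v for v in distinct}
    # Other column types need other heuristics, which this sketch omits.
    raise ValueError("no heuristic for column type: " + column_type)


colours = ["red", "blue", "red", "green"]
categories = create_categories(colours, "nominal")
print(sorted(categories))

grouped = create_categories(
    colours, "nominal",
    discretization_table={"red": "warm", "blue": "cold", "green": "cold"},
)
print(grouped["red"], grouped["blue"])
```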

35 Metabase, Knowledge Base
- Metadata (Knowledge):
  - Results of Previous X-Miner Tasks
  - Domain Knowledge
  - Interaction with User (learning?)

36 GUHA-80 vs. X-Miner (1)
- Task parameters (partial cedents, …)
- SW, HW
- Experience with LM applications,…

37 GUHA-80 vs. X-Miner (2)
- More complex heuristics

38 EverMiner – Features
- Based on LISp-Miner (X-Miners)
- Agenda of jobs, priority/strings
- Heuristics
- Interaction with user
- Enables the process to be repeated on new data (“check” vs. new KDD process)

39 EverMiner – where we are
- Experience (medicine, traffic, shares, sociology,…)
- Heuristics collection (www, brainstorming)
- Co-operation with data preparation experts (FEL, SumatraTT)
- Testing “Strings of jobs” (learning)

40 Discussion
