Published by Curtis Hoover. Modified over 9 years ago.
1 Self-Organised Data Mining – 20 Years after GUHA-80
Martin Kejkula
KEG, 8th April 2004
http://gama.vse.cz/keg/
2 Agenda
- Idea of Self-Organised Data Mining
- GUHA-80 revival
- Process of Self-Organised Data Mining
- Key factors for Self-Organised Data Mining
- Metabase, Knowledge Base, etc.
- Proposed EverMiner system for Self-Organised Data Mining
3 Introduction
- Motivation: support X-Miner users
  - Collection of best practices and known problems
- Mueller, Lemke: Self-Organising Data Mining (2000)
- My thesis:
  - Design/test strings of jobs for EverMiner
  - Formalization/using heuristics
4 References (1)
- Hájek, P. – Havránek, T.: GUHA 80: An Application of Artificial Intelligence to Data Analysis. Computers and Artificial Intelligence, Vol. 1, 1982, pp. 107-134.
- Hájek, P. – Ivánek, J.: Artificial Intelligence and Data Analysis. Proc. COMPSTAT’82, Wien, Physica Verlag, 1982, pp. 54-60.
5 References (2)
- Hájek, P. – Havránek, T.: GUHA-80 – An Application of Artificial Intelligence to Data Analysis. Matematické středisko biologických ústavů ČSAV, Praha, 1982.
- Jirků, P. – Havránek, T.: On Verbosity Levels in Cognitive Problem Solvers. Proc. Computational Linguistics, 1982, http://acl.eldoc.ub.rug.nl/mirror/C/C82/
6 References (3)
- Rauch, J.: EverMiner – studie projektu [EverMiner: a project study]. Dokumentace projektu LISp-Miner [LISp-Miner project documentation], 2003.
- Mueller, J.-A. – Lemke, F.: Self-Organising Data Mining. Extracting Knowledge from Data. Dresden, Berlin, 2000.
7 GUHA-80: Main Features
- Application of artificial intelligence to exploratory data analysis
- Goal: generate interesting views onto given empirical data (recognize interesting logical patterns)
- Views: relevant, useful
8 GUHA-80 Sources (1)
- GUHA
  - Automatically generate all interesting hypotheses
- Lenat’s AM
  - Jobs (tasks)
  - Agenda of jobs
  - Hundreds of heuristic rules
  - Concepts
9 GUHA-80 Sources (2)
- GUHA-80 vs. Lenat’s AM
  - Data
  - Data-processing procedures
- Statistical program packages
  - Effective modules
10 GUHA-80 Paradigm
- Open-ended data analysis
  - To maximize interestingness value
- Hundreds of heuristic rules
  - Guide to define and study the next step
- Access potentially relevant rules, find truly relevant rules, follow truly relevant rules
11 Interestingness in GUHA-80
- No explicit definition
- Determined by the interplay of
  - Heuristic rules
  - Weighting mechanisms
  - Testing in practice (adequate behaviour?)
- No algorithm, but constraints
12 Principles of GUHA-80
- Domain dependence (… exploratory data analysis)
- Combine human capabilities with those of the machine
- Multiple heuristics are relevant
- Interactivity with the user
- Non-routine (GUHA-80 is not for everyday data processing)
13 GUHA-80 Structure (1)
14 GUHA-80 Structure (2)
- Input empirical data
- Input parameters
  - How “interestingness” is understood
- Effective modules (the system’s knowledge)
  - Clustering procedures
  - GUHA procedures
- Agenda of jobs (priority/weight)
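The agenda of jobs with priorities/weights can be pictured as a priority queue. The following is only a minimal sketch of that idea, not the GUHA-80 implementation: the `Job` and `Agenda` names, the job titles and the numeric weights are all hypothetical and chosen purely for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    # heapq is a min-heap, so the weight is stored negated:
    # the job with the highest weight is popped first.
    sort_key: float
    seq: int
    name: str = field(compare=False)


class Agenda:
    """Hypothetical agenda: jobs are taken in order of decreasing weight."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: insertion order

    def add(self, name, weight):
        heapq.heappush(self._heap, Job(-weight, next(self._counter), name))

    def pop(self):
        return heapq.heappop(self._heap).name

    def __len__(self):
        return len(self._heap)


agenda = Agenda()
agenda.add("discretize nominal columns", weight=0.9)
agenda.add("run 4ft task", weight=0.6)
agenda.add("cluster rows", weight=0.6)

# Jobs come out by weight; equal weights keep their insertion order.
order = [agenda.pop() for _ in range(len(agenda))]
print(order)
```

In a fuller system, heuristic rules would add new jobs or change weights while the agenda is being processed; the sketch leaves that out.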
15 GUHA-80 Structure (3)
- Heuristics: the optimal way to realize a job
- Changing system of concepts
- Hierarchy of concepts (applicability)
- Possible unification of heuristics, jobs, …
20 GUHA-80 Input Data
- Input information
  - Decompositions/orderings of sets of quantities
  - Help to understand “interestingness”
21 GUHA-80 Effective modules
- Evaluation of usual statistical characteristics, …
- Complicated procedures
- Synthesis of parameters (“job on job”)
22 GUHA-80
- Hundreds of heuristic rules
- No explicit definition of interestingness (exploration in a space)
- Interactivity with the user
- Non-routine character
23 Process of S-O Data Mining
- Empirical Data
- Chains of Data & Knowledge Processing Tasks
- Domain Knowledge, …
- All Interesting Views, Patterns
- DataSource, TimeTransf, SumatraTT, 4ft, KL, CF, …
24 Process of S-O Data Mining
25 Key Factors of S-O Data Mining
- Data Preparation
- Modeling
- Evaluation
- Knowledge Base
- Domain Knowledge
26 Data Preparation
- Discretization
  - Attribute type dependent: Nominal/Ordinal/Interval/Ratio
  - Type of coefficient dependent
  - Discretization-Modeling Cycle (KL, 4ft, CF, …)
  - Known problem with intervals of categories without values
  - Usually not one target attribute
27 Attribute type dependent discretization
- Nominal
  - Classes of values
- Ordinal
  - Extreme/missing values
  - Type of coefficient
  - Usually not one target attribute
28 Intervals of Categories without Values
29 Intervals of Categories without Values
- Solution:
  - Statistics: extreme values
  - 4ft Task: correlations, implications
  - Potentially interesting patterns
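One way the problem of intervals without values can arise is equidistant discretization in the presence of an extreme value. The sketch below illustrates this under stated assumptions: the `equidistant_bins` helper and the sample data are hypothetical, and equal-width binning is just one possible discretization, not necessarily the one used in the X-Miner tools.

```python
def equidistant_bins(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: all values identical, everything lands in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is clamped into the last interval.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical column with one extreme value (100).
values = [1, 2, 2, 3, 3, 4, 100]
counts = equidistant_bins(values, n_bins=5)

# Indices of categories (intervals) that contain no values at all.
empty = [i for i, c in enumerate(counts) if c == 0]
print(counts, empty)
```

Here the single extreme value stretches the range so that the three middle intervals stay empty, which is the kind of situation where inspecting extreme values first, as the slide suggests, is useful.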
30 Extreme/Missing Values
- 4ft: Find associations between extreme/missing values (implications/correlations)
- CF, KL: Find patterns with extreme/missing values
31 Data Preparation
- Classes of attributes
  - Partial cedents
  - Associations between attributes in one class
  - Associations between partial cedents
32 Evaluation-Modeling
- Input information for partial cedents
- Mining for Interesting Patterns
  - Exceptions
  - Missing values
  - Extreme values
- Discovered hypotheses
  - Groups of hypotheses
  - Coverage of input data by hypotheses
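The slides do not define how coverage of the input data by hypotheses is measured. As a rough sketch, assume a hypothesis covers a row when every condition of its antecedent holds in that row; the functions, attribute names and data below are hypothetical illustrations of that assumption.

```python
def covers(antecedent, row):
    """Assumed notion of coverage: every antecedent condition holds in the row."""
    return all(row.get(attr) == value for attr, value in antecedent.items())


def uncovered_rows(rows, antecedents):
    """Rows not covered by any of the discovered hypotheses."""
    return [row for row in rows if not any(covers(a, row) for a in antecedents)]


def coverage(rows, antecedents):
    """Fraction of rows covered by at least one hypothesis."""
    if not rows:
        return 0.0
    return 1 - len(uncovered_rows(rows, antecedents)) / len(rows)


# Hypothetical data matrix and antecedents of two discovered hypotheses.
rows = [
    {"smoker": "yes", "bmi": "high"},
    {"smoker": "no", "bmi": "high"},
    {"smoker": "no", "bmi": "low"},
    {"smoker": "yes", "bmi": "low"},
]
antecedents = [{"smoker": "yes"}, {"bmi": "high"}]

rest = uncovered_rows(rows, antecedents)
print(coverage(rows, antecedents), rest)
```

The rows left in `rest` are the natural input for a follow-up task that searches for associations covering them.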
33 Heuristic Rules (1)
- Examples:
  - IF more extreme/missing values are found THEN search for associations with extreme/missing values
  - IF 0 hypotheses are found THEN set up weaker quantifier (p, Base) values
  - IF a subset of the input data is not covered by hypotheses THEN search for associations covering these data
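The second rule can be sketched with the founded implication quantifier, which holds when a/(a+b) ≥ p and a ≥ Base, where a is the number of rows satisfying both antecedent and succedent and b the number satisfying the antecedent only. The relaxation schedule (`p_step`, `p_min`), the function names and the candidate data are assumptions made for this illustration; for brevity only p is weakened, and Base could be lowered in the same manner.

```python
def founded_implication(a, b, p, base):
    """Founded implication: a / (a + b) >= p and a >= base."""
    return a >= base and a + b > 0 and a / (a + b) >= p


def mine(candidates, p, base):
    """Names of candidate hypotheses that satisfy the quantifier."""
    return [name for name, a, b in candidates if founded_implication(a, b, p, base)]


def mine_with_relaxation(candidates, p, base, p_step=0.05, p_min=0.5):
    """Heuristic: IF 0 hypotheses are found THEN retry with a weaker p.

    The step size and the lower bound are hypothetical parameters.
    """
    while True:
        found = mine(candidates, p, base)
        if found or p - p_step < p_min:
            return found, p
        p = round(p - p_step, 10)  # rounding avoids floating-point drift


# Hypothetical candidates: (name, a, b) taken from four-fold tables.
candidates = [("A => B", 40, 10), ("C => D", 15, 15)]
found, final_p = mine_with_relaxation(candidates, p=0.9, base=20)
print(found, final_p)
```

With these numbers nothing is found at p = 0.9 or p = 0.85; at p = 0.8 the first candidate passes, while the second still fails the Base condition.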
34 Heuristic Rules (2)
- Examples:
  - IF the column (of the input data matrix) is of nominal type AND there is no associated table for discretization THEN each value is one category (attribute creation)
  - Use the “subset” coefficient type for nominal attributes
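The first rule translates almost directly into code. This is a minimal sketch under assumptions: the `create_attribute` function, its parameters and the representation of a discretization table as a dictionary are hypothetical and do not reflect the LISp-Miner data model.

```python
def create_attribute(column, column_type, table=None):
    """Return a mapping from raw value to category for one data-matrix column."""
    distinct = set(column)
    if table is not None:
        # An associated discretization table exists: use it as given.
        return {value: table[value] for value in distinct}
    if column_type == "nominal":
        # Heuristic from the slide: each value becomes its own category.
        return {value: value for value in distinct}
    raise ValueError(f"no default discretization for {column_type!r} columns")


categories = create_attribute(["red", "blue", "red", "green"], "nominal")
print(categories)
```

Other column types are deliberately left unhandled here, since the slide gives a default only for nominal columns.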
35 Metabase, Knowledge Base
- Metadata (Knowledge):
  - Results of Previous X-Miner Tasks
  - Domain Knowledge
  - Interaction with User (learning?)
36 GUHA-80 vs. X-Miner (1)
- Task parameters (partial cedents, …)
- SW, HW
- Experience with LM applications, …
37 GUHA-80 vs. X-Miner (2)
- More complex heuristics
38 EverMiner – Features
- Based on LISp-Miner (X-Miners)
- Agenda of jobs, priority/strings
- Heuristics
- Interaction with the user
- Allows the process to be repeated on new data (“check” vs. new KDD process)
39 EverMiner – where we are
- Experience (medicine, traffic, shares, sociology, …)
- Heuristics collection (www, brainstorming)
- Co-operation with data preparation experts (FEL, SumatraTT)
- Testing “strings of jobs” (learning)
40 Discussion