GUHA - a summary


1. GUHA (General Unary Hypotheses Automaton) is a method of automatic generation of hypotheses based on empirical data, and thus a method of data mining. It is one of the oldest data mining methods: it was introduced in Hájek P., Havel I., Chytil M.: The GUHA method of automatic hypotheses determination, Computing 1 (1966), and it is still being developed. GUHA is a kind of automated exploratory data analysis: it systematically generates hypotheses supported by the data.

2. GUHA is primarily suitable for exploratory analysis of large data sets. The processed data form a rectangular matrix, where each row corresponds to an object belonging to the sample and each column corresponds to one investigated variable. A typical data matrix processed by GUHA has hundreds or thousands of rows and tens of columns. Exploratory analysis means that there is no single specific hypothesis to be tested on the data; rather, the aim is to get oriented in the domain of investigation, to analyse the behaviour of chosen variables, the interactions among them, etc. Such an inquiry is not blind but guided by some general (possibly vague) direction of research (some general problem).

3. GUHA systematically creates all hypotheses that are interesting from the point of view of a given general problem, on the basis of the given data. This is the main principle: "all interesting hypotheses". Clearly, this contains a dilemma: "all" means as many as possible, "only interesting" means "not too many". To cope with this dilemma, one may choose among different GUHA procedures and, having selected one, fix its numerous parameters in various ways. (The program guides the user and makes the selection of parameters easy.) Three remarks:
* GUHA procedures generate polyfactorial hypotheses, i.e. not only hypotheses relating one variable to another, but hypotheses expressing relations among single variables, pairs, triples, quadruples of variables, etc.
* GUHA offers hypotheses. The exploratory character implies that the hypotheses produced by the computer (numerous: typically tens or hundreds of hypotheses) are just supported by the data, not verified. You are expected to use this offer as inspiration, and possibly to select a few hypotheses for further testing.
* GUHA is not suitable for testing a single hypothesis: routine packages are good for this.

4. The GUHA procedure generates statements on association between complex Boolean attributes. These attributes are constructed from the predicates corresponding to the columns of the data matrix. Each such predicate (attribute) is endowed with a (finite) set of categories, each category being a subset of the range of the predicate. A literal has the form PRED(CAT), where PRED is a predicate and CAT is one of its categories (e.g. TEMPERATURE(>38)). A hypothesis (or better: an observational statement) has the form φ ~ ψ (φ is associated with ψ), where the attributes φ, ψ are built from literals (unary predicates) using the Boolean connectives ∧, ∨, ¬ (conjunction, disjunction, negation). Typically only some Boolean attributes are allowed, e.g. only conjunctions of finitely many literals containing each predicate at most once, for example

TEMPERATURE(>38 °C) ∧ PRESSURE(HIGH) ∧ SEX(MALE)
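The construction of Boolean attributes from literals can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`literal`, `conj`, `disj`, `neg`) and the three sample rows are hypothetical, and a category is modelled as a membership test on the value of a column.

```python
from typing import Callable, Dict

Row = Dict[str, object]
Attribute = Callable[[Row], bool]  # a Boolean attribute maps a row to True/False


def literal(pred: str, category: Callable[[object], bool]) -> Attribute:
    """PRED(CAT): true for a row when the value of column `pred` lies in the category."""
    return lambda row: category(row[pred])


def conj(*attrs: Attribute) -> Attribute:
    """Conjunction of Boolean attributes."""
    return lambda row: all(a(row) for a in attrs)


def disj(*attrs: Attribute) -> Attribute:
    """Disjunction of Boolean attributes."""
    return lambda row: any(a(row) for a in attrs)


def neg(attr: Attribute) -> Attribute:
    """Negation of a Boolean attribute."""
    return lambda row: not attr(row)


# Hypothetical sample rows, invented for this sketch only.
rows = [
    {"TEMPERATURE": 38.6, "PRESSURE": "HIGH", "SEX": "MALE"},
    {"TEMPERATURE": 36.9, "PRESSURE": "HIGH", "SEX": "FEMALE"},
    {"TEMPERATURE": 39.1, "PRESSURE": "LOW", "SEX": "MALE"},
]

fever = literal("TEMPERATURE", lambda t: t > 38)
high = literal("PRESSURE", lambda p: p == "HIGH")
male = literal("SEX", lambda s: s == "MALE")

# TEMPERATURE(>38) and PRESSURE(HIGH) and SEX(MALE): each predicate used at most once.
attribute = conj(fever, high, male)
satisfied = [attribute(row) for row in rows]
print(satisfied)
```

Each row of the data matrix either satisfies such an attribute or does not, which is all that is needed to count frequencies later.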

5. Given the data (a model M), each pair of Boolean attributes φ, ψ determines its four-fold frequency table. The association of φ with ψ is defined by choosing an associational quantifier ~, i.e. a function assigning to each four-fold table either 1 (associated) or 0 (not associated) and satisfying some natural monotonicity conditions. The formula φ ~ ψ is true in the data iff the function defining ~ gives 1 (TRUE) to the four-fold table given by φ, ψ. The four-fold table has the form:

         ψ    ¬ψ
   φ     a    b    r
  ¬φ     c    d    s
         k    l    m

where
* a is the number of objects in the data satisfying both φ and ψ;
* b is the number of objects in the data satisfying φ but not ψ;
* c is the number of objects in the data not satisfying φ but satisfying ψ;
* d is the number of objects in the data satisfying neither φ nor ψ;
* r = a+b, s = c+d, k = a+c, l = b+d and m = a+b+c+d.

Association means, roughly, that there are enough coincidences (a, d are big enough) and not too many differences (b, c are not too big). Thus, a quantifier ~ is associational if v_M(φ ~ ψ) = TRUE and a' ≥ a, b' ≤ b, c' ≤ c, d' ≥ d imply v_N(φ ~ ψ) = TRUE, too, where (a, b, c, d) is the four-fold table of φ, ψ in the model M and (a', b', c', d') is their table in the model N.
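A minimal Python sketch of the four-fold table and of the monotonicity condition follows. The names `FourFold`, `four_fold`, `is_better` and `is_associational_on` are assumptions made for this illustration. The quantifier "ad > bc" (often called simple association) is used only as one example of an associational quantifier, and the brute-force check covers small tables only, so it is a sanity test rather than a proof.

```python
from itertools import product
from typing import Callable, NamedTuple, Sequence


class FourFold(NamedTuple):
    a: int  # antecedent and succedent both hold
    b: int  # antecedent holds, succedent does not
    c: int  # antecedent does not hold, succedent does
    d: int  # neither holds

    @property
    def r(self) -> int:
        return self.a + self.b

    @property
    def s(self) -> int:
        return self.c + self.d

    @property
    def k(self) -> int:
        return self.a + self.c

    @property
    def l(self) -> int:
        return self.b + self.d

    @property
    def m(self) -> int:
        return self.a + self.b + self.c + self.d


def four_fold(phi: Sequence[bool], psi: Sequence[bool]) -> FourFold:
    """Count the four frequencies from the truth values of two attributes, row by row."""
    a = sum(1 for x, y in zip(phi, psi) if x and y)
    b = sum(1 for x, y in zip(phi, psi) if x and not y)
    c = sum(1 for x, y in zip(phi, psi) if not x and y)
    d = sum(1 for x, y in zip(phi, psi) if not x and not y)
    return FourFold(a, b, c, d)


def simple_association(t: FourFold) -> bool:
    """Example quantifier: coincidences outweigh differences (ad > bc)."""
    return t.a * t.d > t.b * t.c


def is_better(new: FourFold, old: FourFold) -> bool:
    """a' >= a, b' <= b, c' <= c, d' >= d."""
    return new.a >= old.a and new.b <= old.b and new.c <= old.c and new.d >= old.d


def is_associational_on(q: Callable[[FourFold], bool], max_count: int = 3) -> bool:
    """Brute-force check of the monotonicity condition on all tables with entries 0..max_count."""
    tables = [FourFold(*t) for t in product(range(max_count + 1), repeat=4)]
    return all(q(n) for o in tables if q(o) for n in tables if is_better(n, o))


# Hypothetical truth values of two attributes on eight objects.
phi = [True, True, True, False, False, False, True, False]
psi = [True, True, False, False, False, True, True, False]
table = four_fold(phi, psi)
print(table, table.r, table.s, table.k, table.l, table.m)
print(simple_association(table), is_associational_on(simple_association))
```

A function such as "a equals d" fails the same check, since increasing a alone can turn TRUE into FALSE.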

6. There are various types of associational quantifiers, formalising various kinds of association. Among them, implicational quantifiers formalise the association "many φ are ψ", and comparative quantifiers formalise the association "φ makes ψ more likely than ¬φ does". Some quantifiers just express observations on the data; others serve as tests of statistical hypotheses on unknown probabilities. Not all quantifiers are associational. The implicational ones do not depend on c, d; the comparative ones are symmetric (φ ~ ψ implies ψ ~ φ) and admit negation (φ ~ ψ implies ¬φ ~ ¬ψ). Various quantifiers are used.

7. The input for GUHA consists of (1) the data matrix and (2) parameters determining the syntactic restrictions on the pairs φ, ψ of Boolean attributes (antecedent - succedent) to be generated, the quantifier to be used, and a few other things. In particular, one has to declare the predicates that can occur in the antecedent and in the succedent, the minimal and maximal length of the antecedent/succedent (number of literals occurring), the kind and parameters of the quantifier used, the kind of processing of missing data (if any; three possibilities), etc.
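The two properties mentioned in point 6 can be checked on concrete quantifiers. The sketch below uses two examples that are not spelled out in this summary and are assumptions here: founded implication (a ≥ Base and a/(a+b) ≥ p) as an implicational quantifier, and "ad > bc" as a symmetric quantifier that admits negation. The function names and default parameter values are hypothetical.

```python
def founded_implication(a: int, b: int, c: int, d: int,
                        p: float = 0.9, base: int = 5) -> bool:
    """'Many antecedent cases are succedent cases': uses a and b only, never c or d."""
    return a >= base and a >= p * (a + b)


def simple_association(a: int, b: int, c: int, d: int) -> bool:
    """Coincidences outweigh differences: ad > bc."""
    return a * d > b * c


def swapped(a: int, b: int, c: int, d: int):
    """Four-fold table after exchanging antecedent and succedent (b and c trade places)."""
    return a, c, b, d


def negated(a: int, b: int, c: int, d: int):
    """Four-fold table after negating both attributes (a with d, b with c trade places)."""
    return d, c, b, a


# Implicational: the value is the same whatever c and d are.
independent_of_cd = all(
    founded_implication(19, 1, c, d) == founded_implication(19, 1, 0, 0)
    for c in range(50) for d in range(50)
)

# Symmetry and negation for ad > bc, checked on all small tables.
small = [(a, b, c, d) for a in range(5) for b in range(5)
         for c in range(5) for d in range(5)]
symmetric = all(simple_association(*t) == simple_association(*swapped(*t)) for t in small)
admits_negation = all(simple_association(*t) == simple_association(*negated(*t)) for t in small)

print(independent_of_cd, symmetric, admits_negation)
```

Founded implication itself is not symmetric: a table with a = 19, b = 1, c = 100 satisfies it, while the swapped table (b = 100) does not.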

8. The core program, LISp-Miner, produces all associations φ ~ ψ satisfying the syntactic restrictions and true in the data. The generation is not done blindly but uses various techniques that avoid exhaustive search. The associations found, together with various parameters, are not mechanically printed but saved in a solution file for further processing.

9. The program for interpretation of results enables the user to browse the associations found, sort them according to various criteria, select reasonably defined subsets, and output concise information of various kinds.

10. The GUHA method has deep logical and statistical foundations. It is being further developed at the Institute of Computer Science of the Academy of Sciences of the Czech Republic (Petr Hájek and his group) and at the Prague University of Economics under the name LISp-Miner (Jan Rauch and his group).
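To make the generation step concrete, here is a deliberately naive Python sketch that enumerates every conjunction of literals up to a maximal length as antecedent, every single literal as succedent, and keeps the pairs for which the chosen quantifier holds. It is an exhaustive search, which is exactly what the real program avoids, so it should not be read as the LISp-Miner algorithm. The function names, the founded-implication parameters and the eight data rows are all hypothetical.

```python
from itertools import combinations, product


def holds(row, conjunction):
    """True when the row satisfies every literal (predicate, category) of the conjunction."""
    return all(row[pred] == cat for pred, cat in conjunction)


def generate(rows, ante_preds, succ_preds, max_len, quantifier):
    """Exhaustively enumerate antecedent/succedent pairs and keep those true in the data."""
    found = []
    for n in range(1, max_len + 1):
        for preds in combinations(ante_preds, n):  # each predicate at most once
            categories = [sorted({row[p] for row in rows}) for p in preds]
            for cats in product(*categories):
                ante = tuple(zip(preds, cats))
                for sp in succ_preds:
                    for sv in sorted({row[sp] for row in rows}):
                        succ = ((sp, sv),)
                        a = b = c = d = 0
                        for row in rows:
                            x, y = holds(row, ante), holds(row, succ)
                            if x and y:
                                a += 1
                            elif x:
                                b += 1
                            elif y:
                                c += 1
                            else:
                                d += 1
                        if quantifier(a, b, c, d):
                            found.append((ante, succ, (a, b, c, d)))
    return found


def founded_implication(a, b, c, d, p=0.9, base=2):
    """Assumed example quantifier: a >= base and a/(a+b) >= p."""
    return a >= base and a >= p * (a + b)


# Hypothetical categorical data matrix, invented for this sketch only.
rows = [
    {"SMOKER": "yes", "AGE": "old", "COUGH": "yes"},
    {"SMOKER": "yes", "AGE": "old", "COUGH": "yes"},
    {"SMOKER": "yes", "AGE": "young", "COUGH": "yes"},
    {"SMOKER": "yes", "AGE": "young", "COUGH": "no"},
    {"SMOKER": "no", "AGE": "old", "COUGH": "no"},
    {"SMOKER": "no", "AGE": "young", "COUGH": "no"},
    {"SMOKER": "no", "AGE": "young", "COUGH": "no"},
    {"SMOKER": "no", "AGE": "old", "COUGH": "yes"},
]

found = generate(rows, ["SMOKER", "AGE"], ["COUGH"], max_len=2,
                 quantifier=founded_implication)
for ante, succ, table in found:
    print(ante, "->", succ, table)
```

The result is a list of records rather than printed text, loosely mirroring the idea of saving the associations for later browsing and sorting.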

Example. Assume we are observing children who have an allergic reaction to, say, tomato, apple, orange, cheese or milk. Thus, we have observations such as ‘Child x is allergic to milk’, ‘Child y is allergic to cheese’, ‘Child z is allergic to tomato’, etc. We write, for short, ‘Milk(x)’, ‘Cheese(y)’, ‘Tomato(z)’, etc. Milk(-), Cheese(-), Tomato(-), Orange(-) and Apple(-) are (unary) predicates of our observational language and x, y, z, … are variables. Expressions such as ‘Milk(x)’ and ‘Cheese(y)’ are atomic (open) formulae. We combine formulae by the logical connectives ¬ (not), ∧ (and), ∨ (or); e.g. Milk(x) ∧ ¬Cheese(y) means ‘Child x is allergic to milk and Child y is not allergic to cheese’.

However, instead of open formulae we are more interested in closed formulae, e.g. ‘All children are allergic to milk’, ‘Most children are not allergic to orange’, ‘There is a child allergic to tomato’, ‘In most cases, if a child is allergic to milk then she/he is allergic to cheese’. We write ∀x Milk(x), Wx ¬Orange(x), ∃x Tomato(x), and a binary quantifier applied to the pair (Milk(x), Cheese(x)).

The following table (matrix) with 5 columns and 15 rows records our observations, where ‘0’ and ‘1’ have the obvious meaning. Our observations support the following statements:
* There is a child allergic to tomato
* In all cases, if a child is allergic to cheese then she/he is allergic to milk, too
* Not all children are allergic to tomato
* Most children are allergic to milk
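The observation table itself did not survive in this transcript, so the Python sketch below uses an invented 15-row, 5-column 0/1 matrix, chosen only so that the four statements above come out true. It is not the original data. The reading of "most" as "more than half of the objects" is also an assumption of this sketch.

```python
COLUMNS = ("Tomato", "Apple", "Orange", "Cheese", "Milk")

# Invented data, for illustration only: one tuple per child, 1 = allergic, 0 = not allergic.
MATRIX = [
    (1, 0, 0, 1, 1), (0, 0, 0, 0, 1), (0, 1, 0, 1, 1), (0, 0, 0, 0, 1), (1, 0, 1, 0, 0),
    (0, 0, 0, 1, 1), (0, 1, 0, 0, 1), (0, 0, 0, 0, 0), (0, 0, 0, 1, 1), (1, 0, 0, 0, 1),
    (0, 0, 1, 0, 1), (0, 1, 0, 0, 0), (0, 0, 0, 1, 1), (0, 0, 0, 0, 1), (0, 0, 0, 0, 0),
]
children = [dict(zip(COLUMNS, row)) for row in MATRIX]


def exists(formula):
    """There is an object satisfying the formula."""
    return any(formula(ch) for ch in children)


def forall(formula):
    """Every object satisfies the formula."""
    return all(formula(ch) for ch in children)


def most(formula):
    """Assumed reading of 'most': more than half of the objects satisfy the formula."""
    return 2 * sum(1 for ch in children if formula(ch)) > len(children)


some_tomato = exists(lambda ch: ch["Tomato"] == 1)
cheese_implies_milk = forall(lambda ch: ch["Cheese"] == 0 or ch["Milk"] == 1)
not_all_tomato = not forall(lambda ch: ch["Tomato"] == 1)
most_milk = most(lambda ch: ch["Milk"] == 1)
all_milk = forall(lambda ch: ch["Milk"] == 1)

print(some_tomato, cheese_implies_milk, not_all_tomato, most_milk, all_milk)
```

On this invented matrix the statement ‘All children are allergic to milk’ is false, which illustrates that a closed formula is simply TRUE or FALSE in a given model.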

We define
* what data is (in this context)
* data mining, its goals and outcomes
* GUHA in general

In GUHA theory we define
* an observational predicate language
* a model
* TRUE/FALSE in a model
* a four-fold contingency table

We study non-standard quantifiers:
* associational better quantifiers
* implicational better quantifiers
* IB quantifiers are AB quantifiers
* interesting quantifiers
* logically justified associational quantifiers
* a probability-theoretically justified associational quantifier
* statistically justified associational quantifiers: the Fisher quantifier and the χ² quantifier

We study how, e.g., these quantifiers are implemented in the LISp-Miner system.
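As an illustration of the two statistically justified quantifiers named above, here is a minimal Python sketch using their usual textbook form, which is an assumption since this outline does not state the definitions. The Fisher quantifier requires ad > bc and a one-sided Fisher exact test p-value of at most α. The χ² quantifier requires ad > bc and m(ad - bc)² ≥ χ²_α · r·s·k·l. The function names are hypothetical, and the default critical value 3.841 corresponds to α = 0.05 with one degree of freedom.

```python
from math import comb


def fisher_p_value(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value: probability of at least `a` coincidences
    given the marginal sums r, k, l and the total m."""
    r, k, l, m = a + b, a + c, b + d, a + b + c + d
    tail = sum(comb(k, i) * comb(l, r - i) for i in range(a, min(r, k) + 1))
    return tail / comb(m, r)


def fisher_quantifier(a: int, b: int, c: int, d: int, alpha: float = 0.05) -> bool:
    """Assumed form: positive association (ad > bc) that is significant at level alpha."""
    return a * d > b * c and fisher_p_value(a, b, c, d) <= alpha


def chi2_quantifier(a: int, b: int, c: int, d: int, critical: float = 3.841) -> bool:
    """Assumed form: ad > bc and m(ad - bc)^2 >= critical * r*s*k*l.
    Written with a product instead of a quotient to avoid division by zero."""
    r, s, k, l, m = a + b, c + d, a + c, b + d, a + b + c + d
    return a * d > b * c and m * (a * d - b * c) ** 2 >= critical * r * s * k * l


print(fisher_p_value(3, 1, 1, 3))        # 17/70
print(fisher_quantifier(4, 0, 0, 4))     # p-value is 1/70
print(chi2_quantifier(4, 0, 0, 4), chi2_quantifier(3, 1, 1, 3))
```

Both functions only look at the four-fold table, so either can be plugged in wherever a quantifier on (a, b, c, d) is expected.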