Mining Association Rules of Simple Conjunctive Queries. Bart Goethals, Wim Le Page, Heikki Mannila. SIAM 08. 2015/8/26

Presentation transcript:

Outline.
Motivation
Preliminaries
Conqueror: algorithm
◦ Selection loop
◦ Projection loop
◦ Constants loop
◦ Eliminating redundancies
Experiments
Conclusion

Motivation.
The first query asks for the actors who have starred in a movie of the genre ‘drama’. The second query asks for the actors who have starred in both a ‘drama’ and a ‘comedy’ movie. Now suppose the answer to the first query consists of 1000 actors, and the answer to the second query consists of 900 actors.

Motivation (cont)
These answers reveal the potentially interesting pattern that actors starring in ‘drama’ movies typically (with a probability of 900/1000 = 90%) also star in a ‘comedy’ movie. In general, we are looking for pairs of queries Q1, Q2 such that Q1 asks for a set of tuples satisfying a certain condition and Q2 asks for those tuples satisfying a more specific condition.

Preliminaries.
Relational database: R(R1, …, Rn)
Definition 1: Simple Conjunctive Query
◦ π_X σ_F (R1 × ··· × Rn)
◦ F: a conjunction of equalities of the form Ri.A = Rj.B or Rk.A = “c”
◦ X: a set of attributes from R
Example: Q1: π_{A,B} R or Q2: π_{A,B} σ_{A=B} R

Preliminaries (cont)
Definition 2: Containment
◦ For two conjunctive queries Q1 and Q2 over R, we write Q1 ⊆ Q2 if for every possible instance I of R, Q1(I) ⊆ Q2(I).
Definition 3: Diagonal containment
◦ Q1 is diagonally contained in Q2 if Q1 is contained in a projection of Q2 (Q1 ⊆ π_X Q2). We write Q1 ⊆Δ Q2.

Preliminaries (cont)
Definition 4: Association Rule
◦ An association rule is of the form Q1 ⇒ Q2, such that Q1 and Q2 are both simple conjunctive queries and Q2 ⊆Δ Q1.

Preliminaries (cont)
Definition 5: Support
◦ The support of a conjunctive query Q in an instance I is the number of distinct tuples in the answer of Q on I.
◦ A query is said to be frequent in I if its support exceeds a given minimal support threshold.
◦ The support of an association rule Q1 ⇒ Q2 in I is the support of Q2 in I. An association rule is called frequent in I if Q2 is frequent in I.

Preliminaries (cont)
Definition 6: Confidence
◦ An association rule Q1 ⇒ Q2 is said to be confident if the support of Q2 divided by the support of Q1 exceeds a given minimal confidence threshold.
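
Definitions 5 and 6 can be illustrated with a minimal Python sketch. The function names and the toy answer sets below are assumptions made for illustration; only the 1000 and 900 actor counts come from the motivation slide.

```python
def support(answer_tuples):
    """Support of a query: the number of distinct tuples in its answer."""
    return len(set(answer_tuples))


def confidence(support_q1, support_q2):
    """Confidence of Q1 => Q2: support of Q2 divided by support of Q1."""
    return support_q2 / support_q1


# Hypothetical answers: 1000 'drama' actors, 900 of whom also star in a 'comedy'.
drama_actors = [("actor%d" % i,) for i in range(1000)]
drama_and_comedy_actors = drama_actors[:900]

sup_q1 = support(drama_actors)
sup_q2 = support(drama_and_comedy_actors)
conf = confidence(sup_q1, sup_q2)

# "Exceeds" is read as a strict comparison; the 0.8 threshold is an assumption.
is_confident = conf > 0.8
```

Duplicate tuples in an answer do not raise the support, because the definition counts distinct tuples.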

Conqueror: Algorithm
The algorithm is divided into two phases.
◦ In a first phase, all frequent simple conjunctive queries are generated.
◦ Then, in a second phase, all confident association rules over these frequent queries are generated.

Algorithm (cont)
Property 1:
◦ Let Q1 and Q2 be two simple conjunctive queries. If Q2 ⊆Δ Q1, then support(Q1) ≥ support(Q2).

Algorithm (cont)
Selection loop:
◦ Generate all instantiations of F, without constants, in a breadth-first manner.
Projection loop:
◦ For each generated selection, generate all instantiations of X in a breadth-first manner, and test their frequency.
Constants loop:
◦ For each generated query in the projection loop, add constant assignments to F in a breadth-first manner.

Algorithm (cont)
Selection loop
◦ We use so-called restricted growth strings to generate all partitions of the attributes.
◦ A restricted growth string is an array a[1...m], where m is the total number of attributes occurring in the database.
◦ A restricted growth string satisfies the following growth inequality (for i = 1, 2, ..., m − 1, and with a[1] = 1): a[i+1] ≤ 1 + max{a[1], a[2], ..., a[i]}.

Algorithm (cont)
◦ a[1] = 1
◦ i = 1: a[2] ≤ 1 + max{a[1]} = 2
◦ i = 2: a[3] ≤ 1 + max{a[1], a[2]} ≤ 3
EXAMPLE 4.
◦ Let A1, A2, A3, A4 be the set of all attributes occurring in the database. Then the restricted growth string 1221 represents the conjunction of equalities A1 = A4, A2 = A3.
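
The growth inequality and Example 4 can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the enumeration is in lexicographic order rather than the breadth-first order the selection loop uses.

```python
def restricted_growth_strings(m):
    """Yield every restricted growth string of length m as a list (a[1] = 1)."""
    def extend(prefix, current_max):
        if len(prefix) == m:
            yield list(prefix)
            return
        # Growth inequality: the next value is at most 1 + max of the prefix.
        for value in range(1, current_max + 2):
            prefix.append(value)
            yield from extend(prefix, max(current_max, value))
            prefix.pop()

    if m > 0:
        yield from extend([1], 1)


def equalities(rgs, attributes):
    """Translate a restricted growth string into attribute equalities.

    Attributes sharing a label form one block; each block is expressed as
    equalities between its first member and every other member.
    """
    blocks = {}
    for attribute, label in zip(attributes, rgs):
        blocks.setdefault(label, []).append(attribute)
    result = []
    for members in blocks.values():
        result.extend((members[0], other) for other in members[1:])
    return result


all_strings = list(restricted_growth_strings(4))
example_4 = equalities([1, 2, 2, 1], ["A1", "A2", "A3", "A4"])
```

For four attributes the generator produces 15 strings, one per partition of a four-element set, and the string 1221 yields the equalities A1 = A4 and A2 = A3 from Example 4.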

Algorithm (cont)
Before generating possible projections for a given selection, we first determine whether the selection represents a cartesian product.

Algorithm (cont)
What is a cartesian product?
◦ To determine whether a selection represents a cartesian product, we interpret each simple conjunctive query as an undirected graph, such that each relation or constant is a node, and each equality in the selection of the query is an edge between the nodes occurring in that equality.
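
The graph interpretation can be sketched with a small union-find structure. The slide does not state the decision rule, so this sketch assumes that a query counts as a cartesian product when its relations do not all fall into one connected component. The function name and the input encoding (equalities given as pairs of node names) are also assumptions.

```python
def is_cartesian_product(relations, equalities):
    """Return True if the relations are not all connected by equalities.

    relations:  list of relation names (graph nodes)
    equalities: list of (node, node) pairs, where a node is a relation name
                or a constant; each pair is an undirected edge
    """
    parent = {relation: relation for relation in relations}

    def find(node):
        parent.setdefault(node, node)  # constants are added as nodes lazily
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in equalities:
        root_left, root_right = find(left), find(right)
        parent[root_left] = root_right

    # Assumed rule: more than one component among the relations means
    # the query is a cartesian product.
    return len({find(relation) for relation in relations}) > 1


joined = is_cartesian_product(["M", "GM"], [("M", "GM")])
unjoined = is_cartesian_product(["M", "GM", "A"], [("M", "GM")])
```

Here `joined` is False because M and GM share an edge, while `unjoined` is True because A is not linked to the other two relations.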

Algorithm (cont)
Projection loop
◦ For every generated projection, we first check whether all more general queries are known to be frequent, and if so, the resulting query is evaluated against the database.

Algorithm (cont)
Constants loop
◦ Every block of attribute equalities of the selection can also be set equal to a constant.

Algorithm (cont)
Candidate evaluation
◦ Each candidate query is evaluated against the database by translating it to SQL.
◦ The result of such a query is then stored in a temporary table (τ).
   SELECT A, COUNT(*) AS sup FROM τ GROUP BY A
◦ The result of these queries is stored in a new temporary table (τ_A) holding the constant values together with their support.
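
The GROUP BY step can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table names tau and tau_A stand in for τ and τ_A, the rows are made up, and the slides do not say which database system was used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical temporary table tau holding the answer of a candidate query.
conn.execute("CREATE TEMP TABLE tau (A TEXT, B TEXT)")
conn.executemany(
    "INSERT INTO tau VALUES (?, ?)",
    [("x", "1"), ("x", "2"), ("y", "1")],
)

# Store each constant value of attribute A together with its support.
conn.execute(
    "CREATE TEMP TABLE tau_A AS "
    "SELECT A, COUNT(*) AS sup FROM tau GROUP BY A"
)

rows = conn.execute("SELECT A, sup FROM tau_A ORDER BY A").fetchall()

# Keep only constants whose support exceeds an assumed threshold.
min_support = 1
frequent_constants = [(value, sup) for value, sup in rows if sup > min_support]
```

With these rows, tau_A holds ("x", 2) and ("y", 1), and only the constant "x" passes the assumed threshold.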

Algorithm (cont)
Let τ_A and τ_B be the temporary tables holding the constant values for the attributes A and B together with their support. We can now generate the table τ_{A,B}. In the same way, the values for τ_{A,B,C} are obtained by a generated query over the temporary tables τ, τ_{A,B}, τ_{A,C}, τ_{B,C}.

Algorithm (cont)
Association rule generation
◦ For every query Q1, the algorithm finds all queries Q2 such that Q2 ⊆Δ Q1. It then computes the confidence of the rule Q1 ⇒ Q2 and tests whether the rule is confident.
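
The rule generation phase can be outlined as a nested loop over the frequent queries. This is a rough sketch, not the Conqueror implementation: the diagonal containment test is passed in as a caller-supplied predicate, and the query names and supports are hypothetical.

```python
def confident_rules(supports, diagonally_contained, min_confidence):
    """Return (Q1, Q2, confidence) for every confident rule Q1 => Q2.

    supports:             dict mapping each frequent query to its support
    diagonally_contained: predicate (q2, q1) -> True if q2 is diagonally
                          contained in q1 (assumed to be supplied by the caller)
    """
    rules = []
    for q1, support_q1 in supports.items():
        for q2, support_q2 in supports.items():
            if q1 != q2 and diagonally_contained(q2, q1):
                conf = support_q2 / support_q1
                if conf > min_confidence:  # "exceeds" read as strict
                    rules.append((q1, q2, conf))
    return rules


# Hypothetical frequent queries with their supports and containments.
supports = {"q_a": 100, "q_b": 80, "q_c": 20}
contained = {("q_b", "q_a"), ("q_c", "q_a"), ("q_c", "q_b")}

rules = confident_rules(
    supports,
    lambda q2, q1: (q2, q1) in contained,
    min_confidence=0.5,
)
```

Only q_a ⇒ q_b survives here, with confidence 80/100; the other two candidate rules have confidence 0.2 and 0.25.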

Algorithm (cont)
Eliminating redundancies
◦ Consider the following association rules, each based on a vertical containment:
   (1) π_{R.A,R.B,S.E} σ_{R.C=S.F}(R × S) ⇒ π_{R.A,S.E} σ_{R.C=S.F}(R × S)
   (2) π_{R.A,S.E} σ_{R.C=S.F}(R × S) ⇒ π_{R.A} σ_{R.C=S.F}(R × S)
   (3) π_{R.A,R.B,S.E} σ_{R.C=S.F}(R × S) ⇒ π_{R.A} σ_{R.C=S.F}(R × S)
◦ Now suppose the first association rule has a confidence of 100%. Then the confidence of the second and third association rules must be equal.
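
The equality of the two confidences follows directly from Definition 6. In the derivation below, the shorthand names Q_{ABE}, Q_{AE} and Q_{A} for the three queries are introduced only for this illustration.

```latex
\begin{align*}
Q_{ABE} &= \pi_{R.A,R.B,S.E}\,\sigma_{R.C=S.F}(R \times S), \quad
Q_{AE} = \pi_{R.A,S.E}\,\sigma_{R.C=S.F}(R \times S), \quad
Q_{A} = \pi_{R.A}\,\sigma_{R.C=S.F}(R \times S) \\[4pt]
\mathrm{conf}(Q_{ABE} \Rightarrow Q_{AE})
  &= \frac{\mathrm{support}(Q_{AE})}{\mathrm{support}(Q_{ABE})} = 1
  \;\Longrightarrow\; \mathrm{support}(Q_{AE}) = \mathrm{support}(Q_{ABE}) \\[4pt]
\mathrm{conf}(Q_{AE} \Rightarrow Q_{A})
  &= \frac{\mathrm{support}(Q_{A})}{\mathrm{support}(Q_{AE})}
   = \frac{\mathrm{support}(Q_{A})}{\mathrm{support}(Q_{ABE})}
   = \mathrm{conf}(Q_{ABE} \Rightarrow Q_{A})
\end{align*}
```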

Algorithm (cont)
LEMMA. An association rule Q1 ⇒ Q2 is redundant if
◦ 1. There exists an association rule Q3 ⇒ Q1 with confidence 100%
◦ 2. There exists an association rule Q4 ⇒ Q2 with confidence 100%, and Q4 ⊆Δ Q1

Experiments.
The IMDB snapshot consists of three tables, ACTORS (A), MOVIES (M) and GENRES (G), and two tables that represent the connections between them, namely ACTORMOVIES (AM) and GENREMOVIES (GM). We can conclude that every movie has a genre because the following association rule has 100% confidence:
π_{M.MID}(M) ⇒ π_{M.MID} σ_{GM.MID=M.MID}(M × GM)

Experiments (cont)
In our database, not every movie has an actor associated with it, as the following rule has only 75.91% confidence:
π_{M.MID}(M) ⇒ π_{M.MID} σ_{AM.MID=M.MID}(M × AM)
We can also find ‘frequent’ genres in which actors play. The rule for ‘Documentary’ (genre id 3) has 40.44% confidence, so 40.44% of the actors play in a ‘Documentary’, while the same rule for ‘Drama’ has 49.85% confidence.

Experiments (cont)
81.60% of the actors in genre ‘Music’ (genre id 16) play in only one movie, whereas the same rule for genre ‘Crime’ has only 49.87% confidence.

Conclusion.