Mining Tree-Query Associations in a Graph Bart Goethals University of Antwerp, Belgium Eveline Hoekx Jan Van den Bussche Hasselt University, Belgium.

Presentation transcript:

Mining Tree-Query Associations in a Graph Bart Goethals University of Antwerp, Belgium Eveline Hoekx Jan Van den Bussche Hasselt University, Belgium

2 Graph Data. A (directed) graph over a set of nodes N is a set G of edges: ordered pairs (i, j) with i, j ∈ N. [Figure: snapshot of a graph representing the complete metabolic pathway of a human.]

3 Graph Mining
– Transactional category: the dataset is a set of many small graphs (transactions); the frequency of a pattern is the number of transactions in which it occurs (at least once). ILP: Warmr. [AGM, FSG, TreeMiner, gSpan, FFSM]
– Single graph category: the dataset is a single large graph; the frequency of a pattern is the number of copies of the pattern in the large graph. [Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM, Jeh-Widom]
– Existing work focuses on pattern mining; there is little work on association rule mining.

4 Our work
– Single graph category
– Pattern + association rule mining
– Patterns with existential nodes and parameters
– An occurrence of the pattern in G is any homomorphism from the pattern into G.
– So far only considered in the ILP (transactional) setting

5 Example of a pattern. frequency = #{ x | ∃z : (5, z) ∈ G ∧ (z, 8) ∈ G ∧ (z, x) ∈ G }

6 Patterns are conjunctive queries. frequency = #{ x | ∃z : (5, z) ∈ G ∧ (z, 8) ∈ G ∧ (z, x) ∈ G }
In SQL: select distinct G3.to as x from G G1, G G2, G G3 where G1.from=5 and G1.to=G2.from and G1.to=G3.from and G2.to=8
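The SQL formulation on this slide can be tried out on a toy graph. The following is a minimal sketch, assuming an in-memory SQLite database rather than the system used in the talk; the function name `pattern_answers` and the sample edges are hypothetical. Because `from` and `to` are SQL keywords, the column names are quoted.

```python
import sqlite3

def pattern_answers(edges):
    """Return the distinct x with (5,z), (z,8), (z,x) in G for some z."""
    conn = sqlite3.connect(":memory:")
    # G(from, to) holds one row per directed edge.
    conn.execute('CREATE TABLE G ("from" INTEGER, "to" INTEGER)')
    conn.executemany("INSERT INTO G VALUES (?, ?)", edges)
    rows = conn.execute(
        '''
        SELECT DISTINCT G3."to" AS x
        FROM G G1, G G2, G G3
        WHERE G1."from" = 5
          AND G1."to" = G2."from"
          AND G1."to" = G3."from"
          AND G2."to" = 8
        '''
    ).fetchall()
    conn.close()
    return sorted(row[0] for row in rows)

# Hypothetical graph: only z = 1 satisfies both (5, z) and (z, 8).
sample_edges = [(5, 1), (1, 8), (1, 2), (5, 3), (3, 4), (6, 8)]
answers = pattern_answers(sample_edges)
print(answers, "frequency =", len(answers))
```

The frequency of the pattern is the number of rows returned, since `select distinct` already removes duplicate values of x.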

7 Example of an Association Rule

8 Features of the presented algorithms
– Pattern mining phase + association mining phase
– Restriction to trees => efficient algorithms
– Equivalence checking
– Apply theory of conjunctive database queries
– Database-oriented implementation

9 Outline rest of talk
– Formal problem definition
– Algorithms:
  1. Pattern Mining: overall approach; outer loop: incremental; inner loop: levelwise; equivalence checking
  2. Association Rule Mining
– Result management
– Experimental results
– Future work

10 Formal definition of a tree pattern. A tree pattern is a tree P whose nodes are called variables, where:
1. some variables are marked as existential (∃);
2. some variables are parameters (labeled with a constant);
3. the remaining variables are called distinguished.

11 Formal definition of a tree query. A tree query Q is a pair (H, P) where:
1. P is a tree pattern, the body of Q;
2. H is a tuple of distinguished variables and parameters of P, the head of Q. All distinguished variables of P must appear at least once in H.

12 Formal definition of a matching. A matching of a pattern P in a graph G is a homomorphism h: P → G, with h(z) = a for each parameter z labeled a.

13-19 Example: Matching. [Figure sequence: a pattern and a graph; the matchings of the pattern in the graph are added one per slide, five in total.]

20 Formal definition of frequency. The frequency of Q in G is the number of answers in the answer set. We define the answer set of Q in G as follows: Q(G) := { f(H) | f is a matching of P in G }
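The definitions of matching, answer set and frequency can be sketched by brute force. This is an illustrative sketch only, not the mining algorithm of the talk: it tries every assignment of graph nodes to the non-parameter variables, and the names `matchings`, `answer_set` and the sample data are hypothetical. It assumes the pattern has at least one edge.

```python
from itertools import product

def matchings(pattern_edges, params, graph_edges):
    """Yield each homomorphism h (as a dict) from the pattern into the graph.

    params maps parameter variables to their constant labels, so h(z) = a
    holds for every parameter z labeled a.
    """
    graph = set(graph_edges)
    nodes = sorted({n for edge in graph for n in edge})
    variables = sorted({v for edge in pattern_edges for v in edge})
    free = [v for v in variables if v not in params]
    for values in product(nodes, repeat=len(free)):
        h = dict(params)
        h.update(zip(free, values))
        # h is a matching if every pattern edge lands on a graph edge.
        if all((h[u], h[v]) in graph for u, v in pattern_edges):
            yield h

def answer_set(pattern_edges, params, head, graph_edges):
    """Q(G) = { f(H) | f is a matching of P in G }, as a set of tuples."""
    return {tuple(h[v] for v in head)
            for h in matchings(pattern_edges, params, graph_edges)}

# Pattern: a -> z, z -> b, z -> x with a labeled 5, b labeled 8, head (x).
P = [("a", "z"), ("z", "b"), ("z", "x")]
labels = {"a": 5, "b": 8}
G = [(5, 1), (5, 2), (1, 8), (2, 8), (1, 3), (2, 3), (2, 4)]

print(len(list(matchings(P, labels, G))))      # number of matchings
print(len(answer_set(P, labels, ("x",), G)))   # frequency
```

On this hypothetical graph there are more matchings than answers, because different matchings can agree on the head variable x; the frequency counts distinct head tuples, not matchings.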

21 Example: Matching. [Figure: the pattern, the graph and the five matchings, annotated with the resulting frequency.]

22 Problem statement 1: Tree query mining. Given a graph G and a threshold k, find all tree queries that have frequency at least k in G; those queries are called frequent.

23 Formal definition of an association rule. An association rule (AR) is of the form Q1 ⇒ Q2 with Q1 and Q2 tree queries. The AR is legal if Q2 ⊆ Q1. The confidence of the AR in a graph G is defined as the frequency of Q2 divided by the frequency of Q1.

24 Problem statement 2: Association rule mining. Input: a graph G, minsup, a tree query Q_left frequent in G, minconf. Output: all tree queries Q such that Q_left ⇒ Q is a legal and confident association rule in G.

25 Outline rest of talk
– Formal problem definition
– Algorithms:
  1. Pattern Mining: overall approach; outer loop: incremental; inner loop: levelwise; equivalence checking
  2. Association Rule Mining
– Result management
– Experimental results
– Future work

26 Pattern Mining Algorithm. Outer loop: generate, incrementally, all possible trees of increasing sizes, avoiding the generation of isomorphic trees. Inner loop: for each newly generated tree, generate all queries based on that tree, and test their frequency. [Figure: generated trees of increasing size over variables x1, x2, x3, x4, ...]

27 Outer loop. It is well known how to efficiently generate all trees uniquely up to isomorphism, based on a canonical form of trees. [Scoins, Li-Ruskey, Zaki, Chi-Yang-Muntz]
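To illustrate what a canonical form of trees buys, here is a small sketch of one standard encoding for rooted unordered trees (sorting the encodings of the children, in the style of the Aho-Hopcroft-Ullman tree isomorphism test). It is not claimed to be the canonical form used by the cited generators; the function name and the sample trees are hypothetical.

```python
def canonical_form(children, root):
    """Encode a rooted tree as a parenthesis string that ignores child order.

    children maps a node to the list of its children; leaves may be absent.
    Two rooted trees are isomorphic exactly when their encodings are equal.
    """
    encoded = sorted(canonical_form(children, c) for c in children.get(root, []))
    return "(" + "".join(encoded) + ")"

# The same tree shape written with different node names and child order.
tree_a = {1: [2, 3], 2: [4]}
tree_b = {"a": ["b", "c"], "c": ["d"]}
# A path on four nodes, which is a different shape.
path = {1: [2], 2: [3], 3: [4]}

print(canonical_form(tree_a, 1) == canonical_form(tree_b, "a"))
print(canonical_form(tree_a, 1) == canonical_form(path, 1))
```

A generator that only keeps trees whose encoding has not been seen before would avoid emitting isomorphic duplicates, although the cited methods avoid generating them in the first place.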

28 Inner loop: Levelwise approach. A query Q is characterized by:
– Π_Q: the set of existential nodes
– Σ_Q: the set of parameters
– λ_Q: the labeling of the parameters by constants
Q2 = (Π2, Σ2, λ2) specializes Q1 = (Π1, Σ1, λ1) if Π1 ⊆ Π2, Σ1 ⊆ Σ2, and λ2 agrees with λ1 on Σ1. If Q2 specializes Q1 then freq(Q2) ≤ freq(Q1). Most general query: T = (∅, ∅, ∅).
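The specialization relation on a fixed tree can be written down directly. The sketch below assumes a query is represented as a triple of a set of existential nodes, a set of parameters, and a dict labeling the parameters by constants; this representation and the name `specializes` are illustrative choices, not taken from the slides.

```python
def specializes(q2, q1):
    """True if q2 has at least the existential nodes and parameters of q1,
    and labels the parameters of q1 with the same constants."""
    pi2, sigma2, lam2 = q2
    pi1, sigma1, lam1 = q1
    return (pi1 <= pi2
            and sigma1 <= sigma2
            and all(lam2[v] == lam1[v] for v in sigma1))

most_general = (frozenset(), frozenset(), {})
q1 = (frozenset({"x2"}), frozenset({"x1"}), {"x1": 5})
q2 = (frozenset({"x2"}), frozenset({"x1", "x3"}), {"x1": 5, "x3": 8})
q3 = (frozenset({"x2"}), frozenset({"x1", "x3"}), {"x1": 6, "x3": 8})

print(specializes(q2, q1))            # same label on x1
print(specializes(q3, q1))            # label on x1 differs
print(specializes(q1, most_general))  # every query specializes the top
```

Since specialization can only lower the frequency, a levelwise search can discard every specialization of a query that is already infrequent.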

29 Inner loop: Candidate generation.
CanTab_{Π,Σ} = { λ | (Π, Σ, λ) is a candidate query }
FreqTab_{Π,Σ} = { λ | (Π, Σ, λ) is a frequent query }
Q' = (Π', Σ') is a parent of Q = (Π, Σ) if either:
– Σ' = Σ and Π has precisely one more node than Π', or
– Π' = Π and Σ has precisely one more node than Σ'.
Join Lemma: each candidacy table can be computed by taking the natural join of its parent frequency tables.
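The parent relation and the Join Lemma can be sketched in a few lines. This is a simplified illustration that assumes a parent is obtained by dropping exactly one node from the existential set or from the parameter set, and that tables are lists of dicts mapping parameter names to constants; the function names and the sample tables are hypothetical, and any further validity conditions on parents are ignored.

```python
def parents(pi, sigma):
    """All (pi', sigma') with one node fewer in pi or in sigma."""
    pi, sigma = frozenset(pi), frozenset(sigma)
    return ([(pi - {v}, sigma) for v in sorted(pi)]
            + [(pi, sigma - {v}) for v in sorted(sigma)])

def natural_join(left, right):
    """Natural join of two tables given as lists of dict rows."""
    out = []
    for a in left:
        for b in right:
            # Rows combine when they agree on every shared column.
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                row = {**a, **b}
                if row not in out:
                    out.append(row)
    return out

def candidate_table(parent_freq_tables):
    """Join Lemma: join all parent frequency tables."""
    result = [{}]
    for table in parent_freq_tables:
        result = natural_join(result, table)
    return result

# Hypothetical parent frequency tables for existential {x2}, parameters {x1, x3}.
freq_x2_x1 = [{"x1": 1}, {"x1": 6}]
freq_x2_x3 = [{"x3": 3}, {"x3": 4}]
freq_none_x1x3 = [{"x1": 1, "x3": 3}, {"x1": 6, "x3": 3}, {"x1": 2, "x3": 4}]

print(parents({"x2"}, {"x1", "x3"}))
print(candidate_table([freq_x2_x1, freq_x2_x3, freq_none_x1x3]))
```

A labeling survives the join only if each of its restrictions is frequent in the corresponding parent, which is the usual levelwise pruning argument.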

30 Inner loop: Frequency counting. Each candidacy table can be computed by a single SQL query (cf. the Join Lemma). Suppose G(from, to) is a table in the database; then each frequency table can also be computed with a single SQL query:
– Σ = ∅: formulate the query in SQL and count.
– Σ ≠ ∅: formulate the parameter-free query (Π, ∅) in SQL as an expression E; take the natural join of E with CanTab_{Π,Σ}; group by Σ; count each group.

31 Inner loop: Example. [Figure: tree query over x1, x2, x3, x4 with edges x1 → x2, x2 → x3 and x2 → x4; x2 is existential, x1 and x3 are parameters.]

32 Inner loop: Example. Join expression: CanTab_{{x2},{x1,x3}} = FreqTab_{{x2},{x1}} ⋈ FreqTab_{{x2},{x3}} ⋈ FreqTab_{∅,{x1,x3}}

33 Inner loop: Example. SQL expression E for ({x2}, ∅):
select distinct G1.from as x1, G2.to as x3, G3.to as x4 from G G1, G G2, G G3 where G1.to = G2.from and G3.from = G2.from

34 Inner loop: Example. SQL expression for filling the frequency table:
select distinct E.x1, E.x3, count(E.x4) from E, CanTab_{{x2},{x1,x3}} as CT where E.x1 = CT.x1 and E.x3 = CT.x3 group by E.x1, E.x3 having count(E.x4) >= k
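The two SQL expressions of this example can be run end to end on a toy graph. The sketch below uses an in-memory SQLite database instead of the system used in the talk, stores E as a view, and names the candidacy table simply `CanTab`; the function name, the sample edges and the sample candidates are hypothetical. The candidacy table is assumed to contain no duplicate rows.

```python
import sqlite3

def frequency_table(edges, candidates, k):
    """Rows (x1, x3, count) for candidate labelings with frequency >= k."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE G ("from" INTEGER, "to" INTEGER)')
    conn.executemany("INSERT INTO G VALUES (?, ?)", edges)
    conn.execute("CREATE TABLE CanTab (x1 INTEGER, x3 INTEGER)")
    conn.executemany("INSERT INTO CanTab VALUES (?, ?)", candidates)
    # E: the query with x2 existential and no parameters.
    conn.execute(
        '''
        CREATE VIEW E AS
        SELECT DISTINCT G1."from" AS x1, G2."to" AS x3, G3."to" AS x4
        FROM G G1, G G2, G G3
        WHERE G1."to" = G2."from" AND G3."from" = G2."from"
        '''
    )
    # Join E with the candidates, group by the parameters, count answers.
    rows = conn.execute(
        '''
        SELECT E.x1, E.x3, COUNT(E.x4)
        FROM E, CanTab AS CT
        WHERE E.x1 = CT.x1 AND E.x3 = CT.x3
        GROUP BY E.x1, E.x3
        HAVING COUNT(E.x4) >= ?
        ''',
        (k,),
    ).fetchall()
    conn.close()
    return sorted(rows)

sample_edges = [(1, 2), (2, 3), (2, 4), (2, 5), (6, 7), (7, 3)]
sample_candidates = [(1, 3), (6, 3), (1, 4)]
print(frequency_table(sample_edges, sample_candidates, k=2))
```

Each output row is one frequent labeling of the parameters (x1, x3) together with the number of distinct values of the distinguished variable x4.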

35 Equivalent queries. Queries Q1 and Q2 are equivalent if they have the same answer sets on all graphs G (up to renaming of the distinguished variables). Two cases of equivalent queries:
1. Q1 has fewer nodes than Q2
2. Q1 and Q2 have the same number of nodes

36 Equivalence theorem. A containment mapping from Q1 to Q2 is a homomorphism h: Q1 → Q2 that maps distinguished variables of Q1 one-to-one to distinguished variables of Q2, and maps parameters of Q1 to parameters of Q2, preserving labels. Two queries are equivalent if and only if there are containment mappings between them in both directions.
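The equivalence theorem suggests a direct, if naive, test. The sketch below searches all node mappings by brute force, which is only feasible for very small queries and is not the equivalence check of the talk. It assumes a query is a dict with its tree edges, its set of distinguished variables and a dict of parameter labels (all other nodes are existential), and that every query has at least one edge; these names are illustrative.

```python
from itertools import product

def has_containment_mapping(q1, q2):
    """Search for a containment mapping from q1 to q2."""
    nodes1 = sorted({v for e in q1["edges"] for v in e})
    nodes2 = sorted({v for e in q2["edges"] for v in e})
    edges2 = set(q2["edges"])
    for image in product(nodes2, repeat=len(nodes1)):
        h = dict(zip(nodes1, image))
        # h must send every edge of q1 to an edge of q2.
        if not all((h[u], h[v]) in edges2 for u, v in q1["edges"]):
            continue
        # Distinguished variables go one-to-one to distinguished variables.
        images = [h[v] for v in q1["dist"]]
        if any(t not in q2["dist"] for t in images):
            continue
        if len(set(images)) != len(images):
            continue
        # Parameters go to parameters with the same label.
        if all(q2["params"].get(h[v], object()) == a
               for v, a in q1["params"].items()):
            return True
    return False

def equivalent(q1, q2):
    """Containment mappings in both directions."""
    return has_containment_mapping(q1, q2) and has_containment_mapping(q2, q1)

# z -> x, versus z -> x plus a redundant existential child y of z.
qa = {"edges": [("z", "x")], "dist": {"x"}, "params": {}}
qb = {"edges": [("z", "x"), ("z", "y")], "dist": {"x"}, "params": {}}
# Same shape as qa, but z is a parameter labeled 5.
qc = {"edges": [("z", "x")], "dist": {"x"}, "params": {"z": 5}}

print(equivalent(qa, qb))
print(equivalent(qa, qc))
```

In the first pair the extra existential node y can be folded onto x, so the larger query is equivalent to the smaller one; in the second pair the labeled parameter has no counterpart to map to.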

37 Case 1: Q1 has fewer nodes than Q2. Redundancy lemma: Let Q be a tree query without selected nodes. Then Q has a redundancy if and only if it contains a subtree C in the form of a linear chain of ∃ nodes (possibly just a single node), such that the parent of C has another subtree that is at least as deep as C. [Figure: redundant subtree]

38 Case 2: Q1 and Q2 have the same number of nodes. Q1 and Q2 must be isomorphic. Canonical form of queries: refine the canonical ordering of the underlying unlabeled tree, taking into account node labels.

39 Association Mining Algorithm. Input: a graph G, minsup, a tree query Q_left frequent in G, minconf. Output: all tree queries Q such that Q_left ⇒ Q is a legal and confident association rule in G.

40 Containment mappings. For each tree query, generate all containment mappings from Q_left to Q, ignoring parameter assignments.

41 Instantiations. For each containment mapping, generate all parameter assignments such that Q_left ⇒ Q is frequent and confident.

42 Equivalent association rules. Equivalence checking of association rules is as hard as general graph isomorphism testing.

43 Outline rest of talk
– Result management
– Experimental results
– Future work

44 Result management Output: frequency tables stored in a relational database. Browser


46 Experimental results: Real-life datasets Food web  nodes   edges  frequency = 176

47 Experimental results: Real-life datasets Food web  nodes   edges  confidence = 11%

48 Experimental results: Performance. Fully implemented on top of IBM DB2. Preliminary performance results:
– pattern mining algorithm: adequate performance; huge number of patterns; constant overhead per discovered pattern
– association mining algorithm: very fast; constant overhead per discovered rule

49 Future work. Applications: scientific data mining. Loosen the restriction to trees.

50 References
– Bart Goethals, Eveline Hoekx and Jan Van den Bussche, Mining Tree Queries in a Graph, in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 61-69, ACM Press, 2005.
– Eveline Hoekx and Jan Van den Bussche, Mining for Tree-Query Associations in a Graph, to appear in Proceedings of the 2006 IEEE International Conference on Data Mining (ICDM 2006).