Mining Tree-Query Associations in a Graph Bart Goethals University of Antwerp, Belgium Eveline Hoekx Jan Van den Bussche Hasselt University, Belgium.

Mining Tree-Query Associations in a Graph Bart Goethals University of Antwerp, Belgium Eveline Hoekx Jan Van den Bussche Hasselt University, Belgium

2 Graph Data A (directed) graph over a set of nodes N is a set G of edges: ordered pairs  i  j  with i  j  N. Snapshot of a graph representing the complete metabolic pathway of a human.

3 Graph Mining Transactional category –dataset: set of many small graphs (transactions) –frequency:  transactions in which the pattern occurs (at least once) –ILP: Warmr [AGM, FSG, TreeMiner, gSpan, FFSM] Single graph category –dataset: single large graph –frequency:  copies of the pattern in the large graph [Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM, Jeh-Widom] Focus on pattern mining, few work on association rule mining!

4 Our work Single graph category Pattern + association rule mining Patterns with: –Existential nodes –Parameters Occurrence of the pattern in G is any homomorphism from the pattern in G. So far only considered in the ILP (transactional) setting

5 Example of a pattern frequency    x    z   5  z   G   z  8  G   z  x   G 

6 Patterns are conjunctive queries. frequency    x    z   5  z   G   z  8  G   z  x   G  select distinct G3.to as x from G G1, G G2, G G3 where G1.from=5 and G1.to=G2.from and G1.to=G3.from and G2.to=8

7 Example of an Association Rule

8 Features of the presented algorithms Pattern mining phase + association mining phase Restriction to trees => efficient algorithms Equivalence checking Apply theory of conjunctive database queries Database oriented implementation

9 Outline rest of talk Formal problem definition Algorithms: 1.Pattern Mining Overall approach Outer loop: incremental Inner loop: levelwise Equivalence checking 2. Association Rule Mining Result management Experimental results Future work

10 Formal definition of a tree pattern. A tree pattern is a tree P whose nodes are called variables, and: 1.some variables marked as existential  2.some variables are parameters (labeled with a constant) 3.remaining variables are called distinguished

11 Formal definition of a tree query. A tree query Q is a pair (H,P) where: 1.P is a tree pattern, the body of Q 2.H is a tuple of distinguished variables and parameters of P. All distinguished variables of P must appear at least once in H, the head of Q

12 Formal definition of a matching A matching of a pattern P in a graph G is a homomorphism h: P  G, with h  z  a, for parameters labeled a.

13 Example: Matching zz yzz x

14 Example: Matching zz yzz x

15 Example: Matching zz yzz x hh 

16 Example: Matching zz yzz x hh  hh 

17 Example: Matching zz yzz x hh  hh  hh 

18 Example: Matching zz yzz x hh  hh  hh  hh 

19 Example: Matching zz yzz x hh  hh  hh  hh  hh 

20 Formal definition of frequency The frequency of Q in G is #answers in the answer set. We define the answer set of Q in G as follows: Q  G  f(H)|f is a matching of P in G 

21 Example: Matching zz yzz x hh  hh  hh  hh  hh  frequency   

22 Problem statement 1: Tree query mining Given a graph G and a threshold k, find all tree queries that have frequency at least k in G, those queries are called frequent.

23 Formal definition of an association rule An association rule (AR) is of the form Q 1  Q 2 with Q 1 and Q 2 tree queries. The AR is legal if Q 2  Q 1. The confidence of the AR in a graph G is defined as the frequency of Q 2 divided by the frequency of Q 1.

24 Problem statement 2: Association rule mining Input: a graph G, minsup, a tree query Q left frequent in G, minconf Output: all tree queries Q such that Q left  Q is a legal and confident association rule in G.

25 Outline rest of talk Formal problem definition Algorithms: 1.Pattern Mining Overall approach Outer loop: incremental Inner loop: levelwise Equivalence checking 2. Association Rule Mining Result management Experimental results Future work

26 Pattern Mining Algorithm Outer loop: Generate, incrementally, all possible trees of increasing sizes. Avoid generation of isomorphic trees. Inner loop: For each newly generated tree, generate all queries based on that tree, and test their frequency.... x1x1 x4x4 x3x3 x2x2  x2x2 x1x1   x2x2 x1x1  xx   

27 Outer loop It is well known how to efficiently generate all trees uniquely up to isomorphism Based on canonical form of trees. [Scions, Li-Ruskey, Zaki, Chi-Young-Muntz]

28 Inner loop: Levelwise approach A query Q is characterized by  –  Q  set of existential nodes –  Q  set of parameters –Labeling Q  of the parameters by constants. Q          specializes Q          if     ,      and  agrees with  on  . If Q  specializes Q  then freq  Q    freq  Q    Most general query: T = ( , ,  )

29 Inner loop: Candidate generation CanTab   is a candidate query  FreqTab   is a frequent query  Q’=  ’  ’  is a parent of Q=  if either:  ’ and  has precisely one more node than  ’, or  ’ and  has precisely one more node than  ’ Join Lemma: Each candidacy table can be computed by taking the natural join of its parent frequency tables.

30 Inner loop: Frequency counting Each candidacy table can be computed by a single SQL query. (ref. Join lemma). Suppose: G  from  to  table in the database, then each frequency table can be computed with a single SQL query. –  »formulate in SQL and count –   »formulate   in SQL  E »natural join of E with CanTab  »group by  »count each group

31 Inner loop: Example  x    x   x    x   x  

32 Inner loop: Example  x    x   x    x   x   Join expression: CanTab {x  }{x ,x  } = FreqTab  x   x   ⋈ FreqTab  x   x   ⋈ FreqTab  x   x  

33 Inner loop: Example  x    x   x    x   x   SQL expression E for  x      select distinct G1.from as x1, G2.to as x3, G3.to as x4 from G G1, G G2, G G3 where G1.to = G2.from and G3.from = G2.from

34 Inner loop: Example  x    x   x    x   x   SQL expression for filling the frequency table: select distinct E.x1, E.x3, count(E.x4) from E, CanTab {x2}{x1,x3} as CT where E.x1 = CT.x1 and E.x3 = CT.x3 group by E.x1, E.x3 having count(E.x4) >= k

35 Equivalent queries Queries Q  and Q  are equivalent if same answer sets on all graphs G (up to renaming of the distinguished variables) 2 cases of equivalent queries: 1.Q 1 has fewer nodes than Q 2 2.Q 1 and Q 2 have the same number of nodes

36 Equivalence theorem A containment mapping from Q  to Q  is a h: Q   Q  that maps distinguished variables of Q  one-to-one to distinguished variables of Q , and maps parameters of Q  to parameters of Q , preserving labels Two queries are equivalent if and only if there are containment mappings between them in both directions.

37 Case  : Q  fewer nodes than Q 2 Redundancy lemma: Let Q be a tree query without selected nodes. Then Q has a redundancy if and only if it contains a subtree C in the form of a linear chain of  nodes (possibly just a single node), such that the parent of C has another subtree that is at least as deep as C. Redundant subtree

38 Case  : Q  and Q  same number of nodes Q  and Q  must be isomorphic. Canonical form of queries: refine the canonical ordering of the underlying unlabeled tree, taking into account node labels.

39 Association Mining Algorithm Input: a graph G, minsup, a tree query Q left frequent in G, minconf Output: all tree queries Q such that Q left  Q is a legal and confident association rule in G.

40 Containment mappings For each tree query, generate all containment mappings from Q left to Q, ignoring parameter assignments.

41 Instantiations For each containment mapping, generate all parameter assignments such that Q left  Q is frequent and confident.

42 Equivalent Association rules Equivalence checking of association rules is as hard as general graph isomorphism testing.

43 Outline rest of talk Result management Experimental results Future work

44 Result management Output: frequency tables stored in a relational database. Browser

46 Experimental results: Real-life datasets Food web  nodes   edges  frequency = 176

47 Experimental results: Real-life datasets Food web  nodes   edges  confidence = 11%

48 Experimental results: Performance Fully implemented on top of IBM DB2 Preliminary performance results: –pattern mining algorithm: adequate performance huge number of patterns constant overhead per discovered pattern –association mining algorithm: very fast constant overhead per discovered rule

49 Future work Applications: scientific data mining Loosen restriction to trees

50 References Bart Goethals, Eveline Hoekx and Jan Van den Bussche, Mining Tree Queries in a Graph, in Proceedings of the eleventh ACM SIGKDD International conference on Knowledge Discovery and Data Mining, p 61-69, ACM Press 2005 Eveline Hoekx and Jan Van den Bussche, Mining for Tree- Query Associations in a Graph, to appear in Proceedings of the 2006 IEEE International Conference on Data Mining (ICDM 2006)

Mining Tree-Query Associations in a Graph Bart Goethals University of Antwerp, Belgium Eveline Hoekx Jan Van den Bussche Hasselt University, Belgium.

Similar presentations

Presentation on theme: "Mining Tree-Query Associations in a Graph Bart Goethals University of Antwerp, Belgium Eveline Hoekx Jan Van den Bussche Hasselt University, Belgium."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Mining Tree-Query Associations in a Graph Bart Goethals University of Antwerp, Belgium Eveline Hoekx Jan Van den Bussche Hasselt University, Belgium.

Similar presentations

Presentation on theme: "Mining Tree-Query Associations in a Graph Bart Goethals University of Antwerp, Belgium Eveline Hoekx Jan Van den Bussche Hasselt University, Belgium."— Presentation transcript:

Similar presentations

About project

Feedback