
1 Subgraph Containment Search, Dayu Yuan, The Pennsylvania State University

2 Outline
- 1. Background & related work: preliminaries & problem definition; filter + verification (the feature-based index approach)
- 2. Lindex: a general index structure for subgraph search
- 3. Direct feature mining for subgraph search

3 Subgraph Search: Definition
- Problem definition: given a graph database D = {g1, g2, ..., gn} and a query graph q, the subgraph search algorithm returns D(q), the set of all database graphs that contain q as a subgraph.
- Solutions:
  - Brute force: for each query q, scan the whole database to find D(q) (sketched below).
  - Filter + verification: given a query q, first compute a candidate set C(q), then verify each graph in C(q) to obtain D(q).
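To make the brute-force baseline concrete, here is a minimal Python sketch. The function names and the `is_subgraph_iso` black box are illustrative assumptions, not part of the original slides; in practice the test would be a VF2-style subgraph isomorphism routine.

```python
def brute_force_search(database, q, is_subgraph_iso):
    """Scan every database graph and keep those that contain q.

    is_subgraph_iso(a, b) is an assumed black-box test for
    "a is subgraph-isomorphic to b" (NP-complete in general).
    """
    return [g for g in database if is_subgraph_iso(q, g)]
```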

4 Subgraph Search: Solutions
- Filter + verification rule: if a graph g contains the query q, then g must contain all of q's subgraphs.
- Inverted index of (key, value) pairs:
  - Key: a subgraph feature (a small fragment of the database graphs)
  - Value: its posting list (the IDs of all database graphs containing the key subgraph)
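A hedged sketch of the filter + verification pipeline built on such an inverted index, assuming the index is a plain dict from feature graph to a posting list stored as a set of graph IDs (all names are illustrative):

```python
def filter_and_verify(index, database, q, is_subgraph_iso):
    """Filter + verification over a feature-based inverted index.

    index: dict {feature graph -> set of IDs of database graphs
    containing that feature}. Any feature contained in q must be
    contained in every answer graph, so intersecting the posting
    lists of q's features yields a candidate set C(q) that is a
    superset of D(q).
    """
    contained = [f for f in index if is_subgraph_iso(f, q)]
    if contained:
        candidates = set.intersection(*(index[f] for f in contained))
    else:
        candidates = set(range(len(database)))  # no feature helps: check all
    # Verification: one subgraph isomorphism test per candidate.
    return sorted(g for g in candidates if is_subgraph_iso(q, database[g]))
```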

5 Subgraph Search: Related Work
- Response time = filtering cost + verification cost:
  - (1) Filtering cost (D -> C(q)): the cost of searching for the subgraph features contained in the query, plus the cost of loading and joining the posting lists.
  - (2) Verification cost (C(q) -> D(q)): one subgraph isomorphism test per candidate; these tests are NP-complete and dominate the overall cost.
- Related work reduces the verification cost by mining better subgraph features.
- Disadvantages:
  - (1) Different features require different index structure designs.
  - (2) Features are mined in "batch mode" (discussed later).

6 Outline
- 1. Background
- 2. Lindex: a general index structure for subgraph search
  - Compact (memory consumption)
  - Effective (filtering power)
  - Efficient (response time)
  - Experimental results
- 3. Direct feature mining for subgraph search

7 Lindex: A General Index Structure
- Contributions:
  - Orthogonal to related work (feature mining)
  - General: applicable to all subgraph/subtree features
  - Compact, effective, and efficient:
    - Compact: less memory consumption
    - Effective: prunes more false positives (with the same features)
    - Efficient: runs faster

8 Lindex: Compact
- Space saving (extension labeling):
  - Each edge in a graph is represented as a tuple of its endpoint IDs, endpoint labels, and edge label, so the label of a graph is its sequence of edge tuples.
  - The label of graph sg2 contains the label of its chosen parent sg1 as a prefix, so sg2 can be stored as just the extension: the edge tuples beyond sg1's label (see the sketch below).
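A minimal sketch of this space saving, assuming DFS-code-style edge tuples; the tuple layout, labels, and names are illustrative assumptions:

```python
# Each edge tuple: (src_id, dst_id, src_label, edge_label, dst_label),
# in the style of a DFS code.
parent_label = [(0, 1, "C", "-", "C"), (1, 2, "C", "-", "O")]   # label of sg1
child_label  = parent_label + [(2, 3, "O", "-", "H")]           # label of sg2

# Instead of storing child_label in full, store only the extension
# relative to the chosen parent, plus a reference to the parent.
child_stored = {
    "parent": "sg1",
    "extension": child_label[len(parent_label):],  # [(2, 3, "O", "-", "H")]
}
```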

9 Lindex: Empirical Evaluation of Memory (unit: KB)

Index \ Feature   DFG        ∆TCFG      MimR   Tree+∆     DFT
Feature count     7599/6238  9873/5712  5000   6172/3875  00/6172
Gindex            1359       1534       1348   1339       -
FGindex           -          1826       -      -          -
SwiftIndex        -          -          -      -          860
Lindex            677        841        772    676        671

10 Lindex: Effective in Filtering
- Definition (maxSub, minSuper): for a query q, maxSub(q) is the set of maximal indexed features that are subgraphs of q, and minSuper(q) is the set of minimal indexed features that are supergraphs of q.
- Example (from the lattice figure): (1) sg2 and sg4 are maxSub of q; (2) sg5 is minSuper of q.

11 Lindex: Effective in Filtering
- Strategy one: minimal supergraph filtering.
- Given a query q and Lindex L(D, S), every database graph containing some sg in minSuper(q) necessarily contains q, so those graphs are answers without verification. The candidate set on which an algorithm still has to check subgraph isomorphism is
  C(q) = ∩_{sg ∈ maxSub(q)} D(sg) \ ∪_{sg ∈ minSuper(q)} D(sg)
  (a sketch of this computation follows).
- Example: (1) sg2 and sg4 are maxSub of q; (2) sg5 is minSuper of q.
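A hedged sketch of that candidate computation, assuming the index is a dict of posting-list sets; in Lindex the maxSub/minSuper sets come from the lattice search, while here they are found naively, which yields the same candidate set:

```python
def candidate_set(q, index, all_ids, is_subgraph_iso):
    """Minimal supergraph filtering (a sketch; names are illustrative).

    index[f] is the posting list (a set of graph IDs) of feature f;
    all_ids is the full set of database graph IDs.
    """
    contained  = [f for f in index if is_subgraph_iso(f, q)]  # subgraphs of q
    containing = [f for f in index if is_subgraph_iso(q, f)]  # supergraphs of q
    cand = set(all_ids)
    for f in contained:
        cand &= index[f]
    # Graphs holding a supergraph of q surely hold q: answer directly.
    direct = set().union(*(index[f] for f in containing)) if containing else set()
    return cand - direct, direct   # (needs verification, verified answers)
```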

12 Lindex: Effective in Filtering
- Strategy two: postings partition, splitting each feature's posting list into a direct and an indirect value set:
  - Direct set of sg: the database graphs g such that sg can be extended to g without the extension being isomorphic to any other indexed feature.
  - Indirect set of sg: the remaining graphs in sg's posting list.
- Question posed on the slide (over the database-graphs and index figure): why is "b" in the direct value set of sg1 while "a" is not?

13 Lindex: Effective in Filtering
- Given a query q and Lindex L(D, S), combining strategy (1) with the postings partition of strategy (2) further shrinks the candidate set on which subgraph isomorphism must be checked (proof omitted).
- Example table from the slide: for query "a", the set of graphs that need to be verified shrinks from the traditional model to strategy (1), and again under strategies (1 + 2).

14 Lindex: Efficient in MaxSub Feature Search
- Instead of constructing a canonical label for every subgraph of q and comparing it with the labels stored in the index to check whether an indexed feature matches, Lindex traverses the graph lattice.
- The mappings constructed to check that a feature sg1 is contained in q can be reused: whether a lattice supergraph sg2 of sg1 is contained in q is checked by incrementally expanding the mappings from sg1 into q (e.g. node 1 of sg1 is mapped to node 1 of sg2), as sketched below.
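A simplified sketch of that incremental extension, assuming an embedding is a dict from feature nodes to query nodes and each lattice child records the single forward edge it adds; node labels and backward edges are ignored for brevity, and all names are illustrative:

```python
def extend_embeddings(embeddings, extension_edge, q_adj):
    """Try to grow each embedding of sg1 in q by sg2's extra edge.

    embeddings: list of dicts {sg1 node -> q node}.
    extension_edge: (src, dst, edge_label) turning sg1 into sg2,
    where src is already a node of sg1 and dst is the new node.
    q_adj: adjacency of q as {node: {neighbor: edge_label}}.
    """
    src, dst, elabel = extension_edge
    extended = []
    for emb in embeddings:
        anchor = emb[src]                      # image of src inside q
        for nbr, lbl in q_adj[anchor].items():
            if lbl == elabel and nbr not in emb.values():
                new_emb = dict(emb)
                new_emb[dst] = nbr             # map the new node injectively
                extended.append(new_emb)
    return extended   # nonempty means sg2 is also contained in q
```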

15 Lindex: Efficient in MinSuper Feature Search
- The set of minimal supergraphs of a query q in Lindex is a subset of the intersection of the descendant sets of q's subgraph nodes in the partial lattice, so only that intersection needs to be examined (see the sketch below).
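A minimal sketch of that pruning step, assuming precomputed descendant sets per lattice node (names are illustrative):

```python
def min_super_candidates(sub_nodes, descendants):
    """Prune the minSuper search space (a sketch).

    Every supergraph of q is, in the lattice, a descendant of each
    of q's subgraph nodes, so intersecting the descendant sets gives
    the only features that still need a containment check.
    descendants: {feature: set of lattice descendants of that feature}.
    """
    sets = [descendants[n] for n in sub_nodes]
    return set.intersection(*sets) if sets else set()
```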

16 Outline
- 1. Background
- 2. Lindex: a general index structure for subgraph search
  - Compact (memory consumption)
  - Effective (filtering power)
  - Efficient (response time)
  - Experimental results
- 3. Direct feature mining for subgraph search

17 Lindex: Experiments (AIDS dataset, 40,000 graphs)

18 Lindex: Experiments (AIDS dataset, 40,000 graphs)

19 Lindex: Experiments (AIDS dataset, 40,000 graphs)

20 Outline
- 1. Background
- 2. Lindex: a general index structure for subgraph search
- 3. Direct feature mining for subgraph search
  - Motivation
  - Problem definition & objective function
  - Branch & bound
  - Partition of the search space
  - Experimental results

21 Feature Mining: A Brief History
- (timeline figure) Three generations of graph feature mining: (1) all frequent subgraphs, (2) batch-mode feature selection, (3) direct feature mining.
- Applications: graph classification, graph containment search, and more.

22 Feature Mining: Motivation
- All previous feature selection algorithms for the subgraph search problem run in "batch mode":
  - They assume a stable database.
  - Frequent subgraph enumeration is the bottleneck.
  - The parameters (minimum support, etc.) are hard to tune.
- Our contributions:
  - The first direct feature mining algorithm for the subgraph search problem
  - Effective in index updating
  - Chooses high-quality features

23 Feature Mining: Problem Definition
- Previous work: given a graph database D, find a set of subgraph (subtree) features minimizing the response time over a training query set Q.
- Our work: given a graph database D and an already built index I with feature set P0, search for one new feature p such that the feature set P0 ∪ {p} minimizes the response time.

24 Feature Mining: Problem Definition
- Iterative index updating: given database D and the current index I with feature set P0,
  - (1) remove useless features: find a feature p in P0 to drop;
  - (2) add new features: find a new feature p to add;
  - (3) go back to (1).
- A sketch of this loop follows.
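One hedged reading of the loop, with every callable an illustrative assumption rather than the paper's API:

```python
def update_index_iteratively(P0, prune_useless, mine_best_feature, gain_stable):
    """Alternate feature removal and direct feature mining (a sketch)."""
    P = set(P0)
    while True:
        P = prune_useless(P)            # (1) drop features with little gain
        p = mine_best_feature(P)        # (2) direct mining of one new feature
        if p is None or gain_stable(P, p):
            return P                    # objective value has stabilized
        P.add(p)                        # (3) and repeat
```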

25 Feature Mining: More on the Objective Function
- (1) Pros and cons of using query logs: the objective functions of previous algorithms (e.g. Gindex, FGindex) also depend on queries, though only implicitly.
- (2) The selected features are "discriminative":
  - Previous work measures the discriminative power of sg w.r.t. sub(sg) or sup(sg), where sub(sg) denotes all subgraphs of sg and sup(sg) all supergraphs of sg.
  - Our objective function measures discriminative power w.r.t. the whole current feature set P0.
- (3) Computational issues: see the next slide.

26 Feature Mining: More on the Objective Function
- (figure: the minSup queries of p, minSupQueries(p, Q), shown as a subset of the training query set Q)
- Computing D(p) for every enumerated feature p is expensive.

27 Feature Mining: Challenges
- (1) The objective function is expensive to evaluate.
- (2) The search space for a new index subgraph feature p is exponential.
- (3) The objective function is neither monotonic nor anti-monotonic, so the Apriori rule cannot be used.
- (4) Traditional graph feature mining algorithms (e.g. LeapSearch) do not work, because they rely only on frequencies.

28 Feature Mining: Estimating the Objective Function
- The objective (gain) of a new subgraph feature p has an easy-to-compute upper bound Upp(p, P0) and lower bound Low(p, P0).
- Two ways to exploit the bounds:
  - (1) Lazy calculation (sketched below): gain(p, P0) need not be computed when Upp(p, P0) < gain(p*, P0), since p cannot beat the current best p*, or when Low(p, P0) > gain(p*, P0), since p certainly beats it.
  - (2) (proof omitted)
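A hedged sketch of the lazy calculation; the bound functions and the exact gain evaluator are illustrative assumptions:

```python
def lazy_best_feature(candidates, upp, low, exact_gain):
    """Pick the best feature while skipping expensive evaluations.

    upp(p) / low(p) bound gain(p, P0) from above / below; exact_gain(p)
    is the expensive evaluation, run only when the cheap bounds cannot
    already decide the comparison against the current best.
    """
    best, best_gain = None, float("-inf")
    for p in candidates:
        if upp(p) < best_gain:            # cannot beat the current best
            continue
        if low(p) > best_gain:            # surely beats it; compute the
            best, best_gain = p, exact_gain(p)  # exact value as the new bar
            continue
        g = exact_gain(p)                 # bounds inconclusive: pay the cost
        if g > best_gain:
            best, best_gain = p, g
    return best, best_gain
```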

29 Feature Mining: Branch and Bound
- Exhaustive search over the DFS tree: each graph (pattern) can be canonically labeled as a string, and the DFS tree is a prefix tree of these labels.
- Example walk over nodes n1..n7: depth-first search visits n1, n2, n3, n4 and finds that the current best pattern is n3. On reaching n5, we pre-observe that n5 and all of its offspring have gain less than n3's, prune that branch, and move on to n7.
- Such pruning is essential because the objective function is neither monotonic nor anti-monotonic.

30 Feature Mining: Branch and Bound
- For each branch (e.g. the one starting at n5), find a branch upper bound larger than the gain of every node on that branch.
- Theorem: for a feature p there exists an upper bound BUpp(p, P0) such that gain(p', P0) <= BUpp(p, P0) for all supergraphs p' of p (proof omitted).
- Although correct, this upper bound is not tight.
- A sketch of the pruned traversal follows.
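A minimal branch-and-bound sketch under the theorem's guarantee; `children`, `gain`, and `branch_upp` are illustrative assumptions:

```python
def branch_and_bound(root, children, gain, branch_upp):
    """DFS over the pattern prefix tree with branch pruning (a sketch).

    branch_upp(p) must satisfy gain(p') <= branch_upp(p) for every
    supergraph p' of p, so a whole subtree can be skipped whenever its
    bound cannot beat the best gain found so far.
    """
    best, best_gain = None, float("-inf")
    stack = [root]
    while stack:
        p = stack.pop()
        g = gain(p)
        if g > best_gain:
            best, best_gain = p, g
        for c in children(p):                 # extend the pattern by one edge
            if branch_upp(c) > best_gain:     # descend only promising branches
                stack.append(c)
    return best, best_gain
```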

31 Feature Mining: Heuristic Search Space Partition
- Problem: the search always starts from the same root and explores patterns in the same order.
- Observation: the new pattern p must be a supergraph of some existing pattern in P0 (e.g. p ⊃ p2 in Figure 4). A pattern r is a promising root when:
  - (1) a large proportion of the queries are supergraphs of r; otherwise few queries could use a feature p ⊃ r for filtering;
  - (2) the average candidate set size for queries ⊇ r is large, which means improvement on those queries is important.

32 Feature Mining: Heuristic Search Space Partition
- Procedure (see the sketch after this list):
  - (1) gain(p*) = 0
  - (2) Sort all roots r in P0 by their sPoint(r) score in decreasing order.
  - (3) Iterate: for i = 1 to |P0|:
    - if the branch upper bound BUpp(ri) < gain(p*), break;
    - else find the minimal supergraph queries minSup(ri, Q) and run p*(ri) = BranchAndBoundSearch(minSup(ri, Q), p*);
    - if gain(p*(ri)) > gain(p*), update p* = p*(ri).
- Discussion:
  - (1) Candidate features are enumerated as descendants of each root.
  - (2) Candidate features only need to be frequent on D(r), not on all of D, allowing a smaller minimum support.
  - (3) Roots are visited in decreasing sPoint(r) order, so a close-to-optimal feature is found quickly.
  - (4) The procedure extends to top-k feature selection.
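A hedged sketch of the partitioned search; `s_point`, `b_upp`, `min_sup_queries`, and `bb_search` are illustrative stand-ins for the slide's sPoint, BUpp, minSup(r, Q), and branch & bound search:

```python
def partitioned_search(P0, s_point, b_upp, min_sup_queries, bb_search):
    """Root-by-root search space partition (a sketch).

    Each existing feature r in P0 roots one partition; within it,
    branch & bound only considers the queries that are supergraphs
    of r, i.e. min_sup_queries(r).
    """
    best, best_gain = None, 0.0
    for r in sorted(P0, key=s_point, reverse=True):
        if b_upp(r) < best_gain:      # per the slide's heuristic: stop here
            break
        cand, g = bb_search(min_sup_queries(r), best_gain)
        if g > best_gain:
            best, best_gain = cand, g
    return best, best_gain
```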

33 Outline
- 1. Background
- 2. Lindex: a general index structure for subgraph search
- 3. Direct feature mining for subgraph search
  - Motivation
  - Problem definition & objective function
  - Branch & bound
  - Partition of the search space
  - Experimental results

34 Feature Mining: Experiments
- Setup: the same AIDS dataset D.
  - Index0: Gindex with minimum support 0.05
  - IndexDF: Gindex with minimum support 0.02 (1175 new features are added)
  - Index QG/BB/TK: Index0 updated by our algorithms, where BB = branch and bound, QG = search space partitioning, TK = top-k features returned in one iteration
- The methods are compared while achieving the same decrease in candidate set size.

35 Feature Mining: Experiments

36 Feature Mining: Experiments
- Setup: two datasets D1 & D2 that are 80% identical.
  - DF(D1): Gindex on dataset D1
  - DF(D2): Gindex on dataset D2
  - Index QG/BB/TK: DF(D1) updated by our algorithms (BB = branch and bound, QG = search space partitioning, TK = top-k features returned in one iteration)
- Exp 1: D2 = D1 + 20% new graphs
- Exp 2: D2 = 80% of D1 + 20% new graphs
- Updating iterates until the objective value is stable.

37 Feature Mining: Experiments (DF vs. iterative methods)

38 Feature Mining: Experiments

39 Feature Mining: Experiments (TCFG vs. iterative methods; MimR vs. iterative methods)
- Updating iterates until the gain is stable.

40 Conclusion
- 1. Lindex: an index structure general enough to support any features; it is compact, effective, and efficient.
- 2. Direct feature mining:
  - A third-generation algorithm, with no frequent-subgraph-enumeration bottleneck
  - Effective at updating the index to accommodate database changes
  - Runs much faster than building the index from scratch
  - The selected features filter more false positives than features selected from scratch

41 Thanks. Questions?

