1
Subgraph Containment Search
Dayu Yuan, The Pennsylvania State University
© Dayu Yuan, 9/7/2015
2
Outline
1. Background & Related Work: Preliminaries & Problem Definition; Filter + Verification (Feature-Based Index Approach)
2. Lindex: A general index structure for subgraph search
3. Direct feature mining for subgraph search
3
Subgraph Search: Definition
Problem Definition: Given a graph database D = {g1, g2, ..., gn} and a query graph q, the subgraph search algorithm returns D(q), the set of all database graphs containing q as a subgraph.
Solutions:
Brute force: For each query q, scan the whole database to find D(q).
Filter + Verification: Given a query q, first compute a candidate set C(q) with D(q) ⊆ C(q) ⊆ D, then verify each graph in C(q) to obtain D(q).
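The two strategies can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: graphs are modeled as toy edge-label sets so that "q is contained in g" reduces to set containment, and the function names (is_subgraph, brute_force_search, filter_verify_search) are illustrative.

```python
def is_subgraph(q, g):
    # Toy stand-in for a real subgraph-isomorphism test (NP-complete in
    # general): with graphs as edge-label sets it is plain set containment.
    return q <= g

def brute_force_search(q, database):
    # Scan every database graph and run the expensive test on each one.
    return [g for g in database if is_subgraph(q, g)]

def filter_verify_search(q, candidates):
    # Only verify the (hopefully much smaller) candidate set C(q);
    # correctness requires D(q) <= C(q), i.e. no false negatives in the filter.
    return [g for g in candidates if is_subgraph(q, g)]

db = [frozenset({"ab", "bc"}), frozenset({"ab"}), frozenset({"bc", "cd"})]
q = frozenset({"ab"})
print(len(brute_force_search(q, db)))
```

Both routines return the same answer set; filter + verification only pays the verification cost on C(q) instead of on all of D.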
4
Subgraph Search: Solutions
Filter + Verification rule: if a graph g contains the query q, then g must contain every subgraph of q.
Inverted index of (feature, posting list) pairs: the key is a subgraph feature (a small fragment of database graphs); the value is a posting list of the IDs of all database graphs containing that feature.
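The filtering rule above translates directly into posting-list intersection: C(q) is the intersection of the posting lists of all index features contained in q. A minimal sketch, again with toy edge-set graphs and assumed helper names:

```python
from functools import reduce

def build_index(database, features, contains):
    # Posting list of a feature f: ids of all database graphs containing f.
    return {f: {gid for gid, g in database.items() if contains(f, g)}
            for f in features}

def candidate_set(q, index, contains, all_ids):
    # Intersect the postings of every feature contained in the query.
    postings = [ids for f, ids in index.items() if contains(f, q)]
    if not postings:
        return set(all_ids)   # no feature filters q: must verify everything
    return reduce(set.intersection, postings)

contains = lambda a, b: a <= b   # toy containment on edge-label sets
db = {1: frozenset({"ab", "bc"}), 2: frozenset({"ab"}), 3: frozenset({"bc", "cd"})}
features = [frozenset({"ab"}), frozenset({"bc"})]
index = build_index(db, features, contains)
print(candidate_set(frozenset({"ab", "bc"}), index, contains, db))
```

Here both features are subgraphs of the query, so the candidate set is the intersection of their postings.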
5
Subgraph Search: Related Work
Response time = filtering cost + verification cost.
(1) Filtering cost (D → C(q)): the cost of searching for subgraph features contained in the query, plus the cost of loading the posting files and joining the postings.
(2) Verification cost (C(q) → D(q)): subgraph isomorphism tests, which are NP-complete and dominate the overall cost.
Related work reduces the verification cost by mining better subgraph features.
Disadvantages: (1) different index structure designs for different features; (2) "batch mode" feature mining (discussed later).
6
Outline
1. Background
2. Lindex: A general index structure for subgraph search — Compact (memory consumption), Effective (filtering power), Efficient (response time); experimental results
3. Direct feature mining for subgraph search
7
Lindex: A General Index Structure
Contributions:
Orthogonal to related work (feature mining).
General: applicable to all subgraph/subtree features.
Compact, effective, and efficient:
Compact: less memory consumption.
Effective: prunes more false positives (with the same features).
Efficient: runs faster.
8
Lindex: Compact
Space saving (extension labeling): each edge in a graph is represented by a label tuple. Given the label of a subgraph sg2 and the label of its chosen parent sg1 in the lattice, sg2 can be stored as just its extension over sg1 (the edges added to sg1), rather than its full label.
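The extension-labeling idea can be illustrated with a toy lattice node (the class and encoding here are assumed for illustration, not Lindex's actual layout): each feature stores only a parent pointer plus the extra edges, and its full edge set is rebuilt by walking the parent chain.

```python
class LatticeNode:
    def __init__(self, parent, extension):
        self.parent = parent        # chosen parent node, or None for a root
        self.extension = extension  # edges added on top of the parent

    def edges(self):
        # Reconstruct the full graph by accumulating extensions up the chain.
        node, acc = self, set()
        while node is not None:
            acc |= node.extension
            node = node.parent
        return acc

# sg2 is stored as "parent sg1 + one new edge" instead of a full label.
sg1 = LatticeNode(None, {(1, 2, "a"), (2, 3, "b")})
sg2 = LatticeNode(sg1, {(3, 4, "c")})
print(sorted(sg2.edges()))
```

The saving grows with feature size: deep lattice nodes store one edge each instead of a label proportional to the whole graph.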
9
Lindex: Empirical Evaluation of Memory
Index \ Feature:  DFG        | ∆TCFG      | MimR | Tree+∆     | DFT
Feature count:    7599/6238  | 9873/5712  | 5000 | 6172/3875  | 00/6172
Gindex:           1359       | 1534       | 1348 | 1339       |
FGindex:                     | 1826       |      |            |
SwiftIndex:                  |            |      |            | 860
Lindex:           677        | 841        | 772  | 676        | 671
(Units in KB)
10
Lindex: Effective in Filtering
Definition (maxSub, minSuper): maxSub(q) is the set of maximal index features that are subgraphs of q; minSuper(q) is the set of minimal index features that are supergraphs of q.
Example: (1) sg2 and sg4 are maxSub of q; (2) sg5 is minSuper of q.
11
Lindex: Effective in Filtering
Strategy One: Minimal Supergraph Filtering. Given a query q and Lindex L(D, S), any database graph containing a minimal supergraph feature of q (e.g. sg5) necessarily contains q, so it is a known answer and can be excluded from the candidate set that must undergo subgraph isomorphism tests.
Example: (1) sg2 and sg4 are maxSub of q; (2) sg5 is minSuper of q.
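A sketch of the strategy with assumed helper names: candidates come from intersecting the postings of q's maxSub features; graphs holding a minSuper feature are answers without verification.

```python
def answer_with_supergraph_filtering(maxsub_postings, minsuper_postings, verify):
    # Candidates: intersection of the postings of q's maxSub features.
    candidates = set.intersection(*maxsub_postings) if maxsub_postings else set()
    # Any graph containing a supergraph feature of q surely contains q.
    sure_hits = set().union(*minsuper_postings) if minsuper_postings else set()
    # Only the remaining candidates need the expensive isomorphism test.
    verified = {g for g in candidates - sure_hits if verify(g)}
    return sure_hits | verified

# Toy run: graph 2 contains a minSuper feature, so only graph 1 is verified.
answer = answer_with_supergraph_filtering([{1, 2, 3}, {1, 2, 4}], [{2}],
                                          verify=lambda g: g == 1)
print(answer)
```

The saving is exactly the verification cost of the graphs in the minSuper postings.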
12
Lindex: Effective in Filtering
Strategy Two: Postings Partition. Each posting list is split into a direct and an indirect value set.
Direct set of a feature sg: database graphs g such that sg can extend to g without being isomorphic to any other index feature.
Indirect set: the remaining graphs in sg's posting list.
Example (database graphs vs. index): why is "b" in the direct value set of sg1 while "a" is not?
13
Lindex: Effective in Filtering
Given a query q and Lindex L(D, S), the candidate set on which an algorithm must run subgraph isomorphism tests shrinks further under the postings partition. (Proof omitted.)
Example for query "a": the graphs needing verification under the traditional model vs. Strategy 1 vs. Strategies 1 + 2.
14
Lindex: Efficient in maxSub Feature Search
Instead of constructing a canonical label for each subgraph of q and comparing it against the existing labels in the index to check whether an index feature matches, Lindex traverses the graph lattice: the mappings constructed to check that a feature sg1 is contained in q are incrementally extended to check whether a supergraph sg2 of sg1 in the lattice is also contained in q.
Example: the label of sg2 extends the label of its chosen parent sg1, with node 1 of sg1 mapped to node 1 of sg2.
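The incremental step can be sketched as follows (a simplified illustration with assumed names, ignoring vertex labels): given the partial mappings that embed the parent feature into q, try to extend each one by the child's single extra edge instead of matching the child from scratch.

```python
def extend_mappings(parent_maps, extension_edge, q_edges):
    # parent_maps: embeddings of sg1 into q, as {sg1_node: q_node} dicts.
    # extension_edge: the one labeled edge sg2 adds on top of sg1.
    u, v, lbl = extension_edge
    out = []
    for m in parent_maps:
        for x, y, l2 in q_edges:
            if l2 != lbl:
                continue
            for a, b in ((x, y), (y, x)):     # try both edge orientations
                if u in m and m[u] != a:
                    continue                  # conflicts with the embedding
                if u not in m and a in m.values():
                    continue                  # would break injectivity
                if v in m and m[v] != b:
                    continue
                if v not in m and b in m.values():
                    continue
                if u not in m and v not in m and a == b:
                    continue                  # both endpoints new: no self-map
                out.append({**m, u: a, v: b})
    return out

# sg1 = edge (1,2,"a") embedded into q; sg2 adds edge (2,3,"b").
print(extend_mappings([{1: 10, 2: 11}], (2, 3, "b"),
                      [(10, 11, "a"), (11, 12, "b")]))
```

Each lattice step thus costs one edge extension per mapping rather than a full isomorphism search.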
15
Lindex: Efficient in minSuper Feature Search
The set of minimal supergraphs of a query q in Lindex is a subset of the intersection of the descendant sets of each subgraph node of q in the partial lattice.
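This containment means candidate minSuper features can be pruned with a few set intersections before any graph matching. A minimal sketch (names assumed; `descendants` maps each lattice node to its descendant set):

```python
def minsuper_candidates(maxsub_features, descendants):
    # Minimal supergraphs of q can only be lattice descendants of *every*
    # maximal subgraph feature of q, so intersect their descendant sets.
    sets = [descendants[f] for f in maxsub_features]
    return set.intersection(*sets) if sets else set()

# Toy lattice: only sg5 descends from both maxSub features sg2 and sg4.
descendants = {"sg2": {"sg5", "sg6"}, "sg4": {"sg5", "sg7"}}
print(minsuper_candidates(["sg2", "sg4"], descendants))
```

Only the surviving candidates need an actual supergraph check against q.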
16
Outline
1. Background
2. Lindex: A general index structure for subgraph search — Compact (memory consumption), Effective (filtering power), Efficient (response time); experimental results
3. Direct feature mining for subgraph search
17
Lindex: Experiments
Experiments on the AIDS dataset: 40,000 graphs
18
Lindex: Experiments
Experiments on the AIDS dataset: 40,000 graphs
19
Lindex: Experiments
Experiments on the AIDS dataset: 40,000 graphs
20
Outline
1. Background
2. Lindex: A general index structure for subgraph search
3. Direct feature mining for subgraph search — Motivation; Problem Definition & Objective Function; Branch & Bound; Partition of the Search Space; Experimental Results
21
Feature Mining: A Brief History
Applications of graph feature mining: graph classification, graph containment search, ...
Three generations: (1) all frequent subgraphs; (2) batch-mode feature selection; (3) direct feature mining.
22
Feature Mining: Motivation
All previous feature selection algorithms for the subgraph search problem work in "batch mode":
They assume a stable database.
They have a bottleneck: frequent subgraph enumeration.
Their parameters (minimum support, etc.) are hard to tune.
Our contributions:
The first direct feature mining algorithm for the subgraph search problem.
Effective in index updating.
Chooses high-quality features.
23
Feature Mining: Problem Definition
Previous work: Given a graph database D, find a set of subgraph (subtree) features minimizing the response time over a training query set Q.
Our work: Given a graph database D and an already built index I with feature set P0, search for a new feature p such that the new feature set P0 ∪ {p} minimizes the response time.
24
Feature Mining: Problem Definition
Iterative index updating: given the database D and the current index I with features P0:
(1) Remove useless features: find such a feature p in P0 and drop it.
(2) Add new features: mine a new feature p and add it.
(3) Go back to (1).
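The iteration above can be sketched as a small driver loop. This is a structural sketch only: mine_feature, prune_feature, and gain stand in for the paper's subroutines, and the stopping rule (no feature with positive gain) is an assumption.

```python
def update_index(features, mine_feature, prune_feature, gain, max_iters=10):
    for _ in range(max_iters):
        features = prune_feature(features)     # (1) drop useless features
        p = mine_feature(features)             # (2) mine one new feature p
        if p is None or gain(p, features) <= 0:
            break                              # no improving feature: stop
        features = features | {p}              # (3) go back to step (1)
    return features

# Toy run: a fixed candidate pool, unit gain, and no pruning.
pool = ["p1", "p2", "p3"]
mine = lambda feats: next((p for p in pool if p not in feats), None)
result = update_index(set(), mine, lambda f: f, lambda p, f: 1)
print(sorted(result))
```

Each pass touches one feature, so the index evolves incrementally instead of being rebuilt from scratch.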
25
Feature Mining: More on the Objective Function
(1) Pros and cons of using the query logs: the objective functions of previous algorithms (e.g. Gindex, FGindex) also depend on queries, though implicitly.
(2) The features selected are "discriminative." Previous work measures the discriminative power of a feature sg w.r.t. sub(sg) or sup(sg), where sub(sg) denotes all subgraphs of sg and sup(sg) denotes all supergraphs of sg. Our objective function measures discriminative power w.r.t. P0.
(3) Computation issues (next slide).
26
Feature Mining: More on the Objective Function
(Figure: the query set Q and the minimal-supergraph queries minSupQueries(p, Q).)
Computing D(p) for each enumerated feature p is expensive.
27
Feature Mining: Challenges
(1) The objective function is expensive to evaluate.
(2) The search space for the new index subgraph feature p is exponential.
(3) The objective function is neither monotonic nor anti-monotonic, so the Apriori rule cannot be used.
(4) Traditional graph feature mining algorithms (e.g. LeapSearch) do not work, since they rely only on frequencies.
28
Feature Mining: Estimating the Objective Function
The objective function of a new subgraph feature p has upper and lower bounds that are inexpensive to compute. Two ways to use them:
(1) Lazy calculation: gain(p, P0) need not be computed when Upp(p, P0) < gain(p*, P0) or Low(p, P0) > gain(p*, P0).
(2) (Proof omitted.)
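The upper-bound half of the lazy rule can be sketched as follows (a simplified illustration with assumed names; the slide's symmetric lower-bound test is omitted here): the exact gain is computed only when the cheap bound leaves the candidate in contention.

```python
def best_feature(candidates, upp, exact_gain):
    # upp(p) must be a valid upper bound on exact_gain(p).
    best, best_gain, exact_calls = None, float("-inf"), 0
    for p in candidates:
        if upp(p) <= best_gain:
            continue             # bound proves p cannot beat the best: skip
        g = exact_gain(p)        # expensive exact evaluation
        exact_calls += 1
        if g > best_gain:
            best, best_gain = p, g
    return best, exact_calls

# Toy run: gain(p) = p with bound p + 0.5; visiting candidates in
# decreasing order lets the bound skip everything after the first.
best, calls = best_feature([4, 3, 2, 1], lambda p: p + 0.5, lambda p: p)
print(best, calls)
```

The counter shows the point of the technique: most candidates never pay the expensive evaluation.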
29
Feature Mining: Branch and Bound
Exhaustive search follows the DFS tree: a graph pattern can be canonically labeled as a string, and the DFS tree is a prefix tree over these labels.
Example: depth-first search visits n1, n2, n3, n4, and finds that the current best pattern is n3. On visiting n5, we pre-observe that n5 and all its offspring have gain less than n3's, so we prune that branch and move on to n7.
Recall: the objective function is neither monotonic nor anti-monotonic.
30
Feature Mining: Branch and Bound
For each branch (e.g. the branch starting from n5), find a branch upper bound that exceeds the gain of every node on that branch.
Theorem: for a feature p, there exists an upper bound BUpp(p, P0) such that gain(p', P0) ≤ BUpp(p, P0) for every supergraph p' of p. (Proof omitted.)
Although correct, this upper bound is not tight.
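The pruning rule can be sketched over an explicit DFS tree. The node structure is a toy stand-in, and subtree_max plays the role of BUpp (here it is the exact subtree maximum, a valid if idealized upper bound); gains mirror the n1–n7 example from the previous slide.

```python
class Node:
    def __init__(self, name, gain, children=()):
        self.name, self.gain, self.children = name, gain, list(children)

def branch_and_bound(node, branch_upp, best):
    # best is a mutable [gain, name] pair tracking the incumbent.
    if node.gain > best[0]:
        best[0], best[1] = node.gain, node.name
    for child in node.children:
        if branch_upp(child) <= best[0]:
            continue                 # prune the whole subtree under child
        branch_and_bound(child, branch_upp, best)
    return best

def subtree_max(node):
    # Stand-in for BUpp: the true maximum gain in the subtree.
    return max([node.gain] + [subtree_max(c) for c in node.children])

n4 = Node("n4", 1); n3 = Node("n3", 5, [n4])
n6 = Node("n6", 2); n5 = Node("n5", 3, [n6])
n7 = Node("n7", 4)
n1 = Node("n1", 0, [Node("n2", 2, [n3, n5]), n7])
print(branch_and_bound(n1, subtree_max, [float("-inf"), None]))
```

With these gains the search finds n3, then prunes the n5 branch and n7 exactly as in the slide's walkthrough.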
31
Feature Mining: Heuristic-Based Search Space Partition
Problem: the search always starts from the same root and follows the same order.
Observation: the new graph pattern p must be a supergraph of some pattern in P0 (e.g. p ⊃ p2 in Figure 4). A good root r satisfies:
1) A large proportion of the queries are supergraphs of r; otherwise few queries would use p ⊃ r for filtering.
2) The average candidate-set size for queries ⊃ r is large, which means improvement over those queries is important.
32
Feature Mining: Heuristic-Based Search Space Partition
Procedure:
(1) gain(p*) = 0
(2) Sort all roots in P0 by sPoint(r) in decreasing order
(3) For i = 1 to |P0|:
    If the branch upper bound BUpp(ri) < gain(p*), break;
    else find the minimal-supergraph queries minSup(ri, Q), run Branch & Bound search over them to obtain p*(ri), and if gain(p*(ri)) > gain(p*), update p* = p*(ri).
Discussion:
(1) Candidate features are enumerated as descendants of the root.
(2) Candidate features need only be frequent on D(r), not on all of D, which permits a smaller minimum support.
(3) Roots are visited in decreasing sPoint(r) order, so a near-optimal feature is found quickly.
(4) Supports top-k feature selection.
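The outer loop of the procedure can be sketched as follows. All names (s_point, b_upp, search_branch) are stand-ins for the paper's sPoint, BUpp, and the inner branch-and-bound routine; the early break follows the procedure's stopping test.

```python
def partitioned_search(roots, s_point, b_upp, search_branch):
    best_gain, best = 0, None
    # Visit roots in decreasing sPoint order.
    for r in sorted(roots, key=s_point, reverse=True):
        if b_upp(r) < best_gain:
            break              # per the procedure: stop at a failing root
        g, p = search_branch(r, best_gain)   # inner branch & bound
        if g > best_gain:
            best_gain, best = g, p
    return best, best_gain

# Toy run: three roots with fixed scores, bounds, and branch results.
s_point = {"r1": 3, "r2": 2, "r3": 1}.get
b_upp = {"r1": 5, "r2": 4, "r3": 1}.get
results = {"r1": (4, "p_a"), "r2": (2, "p_b"), "r3": (1, "p_c")}
print(partitioned_search(["r1", "r2", "r3"], s_point, b_upp,
                         lambda r, bg: results[r]))
```

Here the third root is never searched: its branch upper bound already falls below the best gain found under the first root.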
33
Outline
1. Background
2. Lindex: A general index structure for subgraph search
3. Direct feature mining for subgraph search — Motivation; Problem Definition & Objective Function; Branch & Bound; Partition of the Search Space; Experimental Results
34
Feature Mining: Experiments
Same AIDS dataset D.
Index0: Gindex with minimum support 0.05.
IndexDF: Gindex with minimum support 0.02 (1,175 new features are added).
Index QG/BB/TK (index updated based on Index0) — BB: branch and bound; QG: search space partitioned; TK: top-k features returned in one iteration.
Comparison at the same decrease in candidate-set size.
35
Feature Mining: Experiments
36
Feature Mining: Experiments
Two datasets: D1 & D2 (80% overlap).
DF(D1): Gindex on dataset D1. DF(D2): Gindex on dataset D2.
Index QG/BB/TK (index updated based on DF(D1)) — BB: branch and bound; QG: search space partitioned; TK: top-k features returned in one iteration.
Exp1: D2 = D1 + 20% new graphs. Exp2: D2 = 80% of D1 + 20% new graphs.
Iterate until the objective value is stable.
37
Feature Mining: Experiments
DF vs. iterative methods
38
Feature Mining: Experiments
39
Feature Mining: Experiments
TCFG vs. iterative methods; MimR vs. iterative methods
Iterate until the gain is stable.
40
Conclusion
1. Lindex: an index structure general enough to support any features — compact, effective, and efficient.
2. Direct feature mining: a third-generation algorithm (no frequent-subgraph-enumeration bottleneck); effective at updating the index to accommodate changes; runs much faster than building the index from scratch; the selected features filter more false positives than features selected from scratch.
41
Thanks! Questions?