Subgraph Containment Search
Dayu Yuan, The Pennsylvania State University

Outline
- 1. Background & Related Work
  - Preliminaries & problem definition
  - Filter + verification (the feature-based index approach)
- 2. Lindex: a general index structure for subgraph search
- 3. Direct feature mining for subgraph search

Subgraph Search: Definition
- Problem definition:
  - Given a graph database D = {g1, g2, ..., gn} and a query graph q, subgraph search returns D(q), the set of all database graphs that contain q as a subgraph.
- Solutions:
  - Brute force: for each query q, scan the whole database and test every graph, yielding D(q) directly.
  - Filter + verification: given a query q, compute a candidate set C(q) using an index, then verify each graph in C(q) to obtain D(q).
[Figure: nested sets D ⊇ C(q) ⊇ D(q).]

Subgraph Search: Solutions
- Filter + verification:
  - Filtering rule: if a graph g contains the query q, then g must contain every subgraph of q. Hence any indexed feature f ⊆ q lets us discard all graphs not containing f.
  - Inverted index: (key, posting list) pairs
    - Key: a subgraph feature (a small fragment of database graphs)
    - Posting list: the IDs of all database graphs containing the key subgraph
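A minimal sketch of this pipeline in Python. The helpers features_in (enumerating indexed features contained in q) and is_subgraph_iso (e.g., a VF2-style test) are assumptions passed in as parameters, not part of the original slides:

    from functools import reduce

    def subgraph_search(q, db, index, features_in, is_subgraph_iso):
        """db: {graph_id: graph}; index: {feature: list of graph ids}.
        features_in(q, index) yields the indexed features contained in q;
        is_subgraph_iso(a, b) tests whether a is a subgraph of b."""
        posting_sets = [set(index[f]) for f in features_in(q, index)]
        # Filtering: a graph can contain q only if it contains every
        # indexed feature of q, so intersect the matching posting lists.
        candidates = reduce(set.intersection, posting_sets) if posting_sets else set(db)
        # Verification: run the expensive (NP-complete) test on candidates only.
        return {gid for gid in candidates if is_subgraph_iso(q, db[gid])}

The fewer false positives survive the intersection, the fewer isomorphism tests are paid in the second phase, which is why feature quality dominates response time.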

Subgraph Search: Related Work
- Response time:
  - (1) Filtering cost: D → C(q)
    - Cost of searching for the indexed subgraph features contained in the query
    - Cost of loading the posting lists and of intersecting them
  - (2) Verification cost: C(q) → D(q)
    - Subgraph isomorphism tests are NP-complete and dominate the overall cost
- Related work:
  - Reduces the verification cost by mining better subgraph features
  - Disadvantages:
    - (1) Each feature type requires a different index structure design
    - (2) "Batch mode" feature mining (discussed later)

Outline
- 1. Background
- 2. Lindex: a general index structure for subgraph search
  - Compact (memory consumption)
  - Effective (filtering power)
  - Efficient (response time)
  - Experimental results
- 3. Direct feature mining for subgraph search

Lindex: A General Index Structure
Contributions:
- Orthogonal to related work on feature mining
- General: applicable to all subgraph and subtree features
- Compact, effective, and efficient:
  - Compact: consumes less memory
  - Effective: prunes more false positives given the same feature set
  - Efficient: answers queries faster

Lindex: Compact
- Space saving (extension labeling):
  - Each edge of a graph is represented as a tuple (node id, node id, node label, edge label, node label), DFS-code style, so a graph's label is a sequence of such edge tuples. (The concrete tuples on the original slide were lost in extraction.)
  - When the label of subgraph sg2 extends the label of its chosen parent sg1 as a prefix, sg2 can be stored as just the extension edges beyond sg1, plus a pointer to sg1.
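A minimal sketch of this storage scheme, assuming a lattice in which each node keeps only its extension edges; the class and field names are illustrative, not Lindex's actual implementation:

    from dataclasses import dataclass, field

    # An edge tuple in DFS-code style:
    # (from_id, to_id, from_label, edge_label, to_label)

    @dataclass
    class LatticeNode:
        parent: "LatticeNode | None"                    # chosen parent in the lattice
        extension: list = field(default_factory=list)   # edge tuples beyond the parent
        posting: set = field(default_factory=set)       # ids of graphs containing it

        def full_label(self) -> list:
            """Reconstruct the full edge sequence by walking up to the root."""
            prefix = self.parent.full_label() if self.parent else []
            return prefix + self.extension

Because shared prefixes are stored exactly once along each lattice path, memory drops roughly in proportion to the label overlap between features.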

Lindex: Empirical Evaluation of Memory
[Table: memory consumption in KB, indexes (rows: Gindex, FGindex, SwiftIndex, Lindex) versus feature sets (columns: DF, ∆TCFG, MimR, Tree+∆, DFT), with a feature-count row. Most cell values were lost in extraction; surviving fragments include feature counts 7599/..., .../387, 500/6172, FGindex 1826 KB, and SwiftIndex 860 KB.]

Lindex: Effective in Filtering
- Definition (maxSub, minSuper):
  - maxSub(q): the maximal indexed subgraphs of q, i.e., features f ⊆ q with no other indexed feature f' satisfying f ⊂ f' ⊆ q
  - minSuper(q): the minimal indexed supergraphs of q, defined symmetrically
- Example (lattice figure): (1) sg2 and sg4 are maxSub of q; (2) sg5 is a minSuper of q

Lindex: Effective in Filtering (continued)
- Strategy One: Minimal Supergraph Filtering
  - Given a query q and a Lindex L(D, S), every database graph containing a minimal supergraph of q is guaranteed to contain q, so it can be reported without verification; subgraph isomorphism needs to be checked only on the remaining candidates.
  - Example: sg2 and sg4 are maxSub of q; sg5 is a minSuper of q
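The candidate-set formula on the original slide was lost in extraction; a plausible reconstruction from the definitions above (intersect the maxSub postings, then set aside the answers already guaranteed by minSuper postings) is:

    C(q) = \bigcap_{f \in \mathrm{maxSub}(q)} D(f) \;\setminus\; \bigcup_{g \in \mathrm{minSuper}(q)} D(g)

    D(q) = \Big( \bigcup_{g \in \mathrm{minSuper}(q)} D(g) \Big) \;\cup\; \{\, x \in C(q) : q \subseteq x \,\}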

Lindex: Effective in Filtering
- Strategy Two: Postings Partition
  - Each feature's posting list is split into a direct and an indirect value set.
  - Direct set of sg: database graphs g to which sg can extend without being a subgraph of any other indexed feature on the way to g
  - Indirect set: the remaining graphs in sg's posting list
[Figure: database graphs vs. the index lattice. Question posed: why is "b" in the direct value set of sg1 while "a" is not?]

Lindex: Effective in Filtering (combined)
- Given a query q and a Lindex L(D, S), combining both strategies further shrinks the candidate set on which subgraph isomorphism must be checked. (Proof omitted.)
[Table: for query "a", the graphs needing verification under the traditional model, under Strategy 1, and under Strategies 1 + 2; most cell values (e.g., graphs b, c) were lost in extraction.]

Lindex: Efficient in maxSub Feature Search
- Instead of constructing a canonical label for every subgraph of q and comparing it with the labels stored in the index, Lindex traverses the graph lattice: the mappings that witness sg1 ⊆ q are extended incrementally to test whether a lattice child sg2 ⊇ sg1 is also contained in q.
[Figure: node 1 of sg1 is mapped to node 1 of sg2; the embedding of the parent seeds the embedding of the child.]
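A sketch of this top-down traversal. It assumes a helper extend_embeddings(parent, child, embs, q) that grows each embedding of the parent into q by the child's extension edges and returns the surviving embeddings (empty if the child is not contained in q); all names are illustrative:

    def max_subgraphs(root, q, extend_embeddings):
        """root: lattice root (the empty feature, contained in every q).
        Returns the maximal indexed subgraphs of q."""
        maxsub, seen = set(), set()
        stack = [(root, [dict()])]        # the empty feature embeds trivially
        while stack:
            node, embs = stack.pop()
            if id(node) in seen:
                continue
            seen.add(id(node))
            live = [(c, e) for c in node.children
                    if (e := extend_embeddings(node, c, embs, q))]
            if not live and node is not root:
                maxsub.add(node)          # no lattice child fits in q: maximal
            stack.extend(live)
        return maxsub

Each feature is tested by extending an already-verified embedding of its parent rather than by a from-scratch isomorphism test, which is where the speedup comes from.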

Lindex: Efficient in minSuper Feature Search
- The set of minimal supergraphs of a query q in Lindex is a subset of the intersection of the descendant sets, in the partial lattice, of q's maximal subgraph features.
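In symbols, writing desc(f) for the lattice descendants of a feature f, the slide's statement reads:

    \mathrm{minSuper}(q) \;\subseteq\; \bigcap_{f \in \mathrm{maxSub}(q)} \mathrm{desc}(f)

so the minSuper search only needs to examine lattice nodes inside this intersection rather than the whole index.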

Outline
- 1. Background
- 2. Lindex: a general index structure for subgraph search
  - Compact (memory consumption)
  - Effective (filtering power)
  - Efficient (response time)
  - Experimental results
- 3. Direct feature mining for subgraph search

Lindex: Experiments
Experiments on the AIDS dataset (40,000 graphs). [Chart lost in extraction.]

Lindex: Experiments
Experiments on the AIDS dataset (40,000 graphs). [Chart lost in extraction.]

Lindex: Experiments
Experiments on the AIDS dataset (40,000 graphs). [Chart lost in extraction.]

Outline
- 1. Background
- 2. Lindex: a general index structure for subgraph search
- 3. Direct feature mining for subgraph search
  - Motivation
  - Problem definition & objective function
  - Branch & bound
  - Partition of the search space
  - Experimental results

Feature Mining: A Brief History
[Diagram: graph feature mining connects graphs to applications such as graph classification and graph containment search; approaches evolved from enumerating all frequent subgraphs, to batch-mode feature selection, to direct feature mining.]

Feature Mining: Motivation
- All previous feature selection algorithms for the subgraph search problem run in "batch mode":
  - They assume a stable database
  - Frequent subgraph enumeration is the bottleneck
  - Parameter settings (minimum support, etc.) are hard to tune
- Our contributions:
  - The first direct feature mining algorithm for the subgraph search problem
  - Effective at updating the index
  - Chooses high-quality features

Feature Mining: Problem Definition
- Previous work:
  - Given a graph database D, find a set of subgraph (or subtree) features that minimizes the response time over a training query set Q.
- Our work:
  - Given a graph database D and an already built index I with feature set P0, search for a single new feature p such that the feature set P0 ∪ {p} minimizes the response time.
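One way to formalize the per-feature objective (a reconstruction consistent with these slides, not necessarily the paper's exact notation): writing C_P(q) for the candidate set of q under feature set P, the mined feature maximizes the drop in verification work over the training queries:

    p^{*} = \arg\max_{p} \mathrm{gain}(p, P_0), \qquad
    \mathrm{gain}(p, P_0) = \sum_{q \in Q} \big( |C_{P_0}(q)| - |C_{P_0 \cup \{p\}}(q)| \big)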

Feature Mining: Problem Definition
- Iterative index updating: given database D and current index I with features P0,
  - (1) Remove useless features: find a feature p in P0 whose removal costs the least
  - (2) Add new features: find a new feature p that maximizes the gain
  - (3) Go to (1)
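A minimal sketch of this loop; least_useful, mine_best_feature, and the rebuild_postings_for hook are assumptions for illustration, not the paper's API:

    def update_index(index, P0, db, queries,
                     least_useful, mine_best_feature, rounds=10):
        for _ in range(rounds):
            P0.discard(least_useful(P0, db, queries))  # (1) drop a useless feature
            p = mine_best_feature(P0, db, queries)     # (2) mine a replacement
            if p is None:                              # no feature improves the gain
                break
            P0.add(p)
            index.rebuild_postings_for(p, db)          # hypothetical index hook
        return P0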

Feature Mining: More on the Objective Function
- (1) Pros and cons of using query logs:
  - The objective functions of previous algorithms (e.g., Gindex, FGindex) also depend on queries, though only implicitly.
- (2) Selected features are "discriminative":
  - Previous work measures the discriminative power of sg w.r.t. sub(sg) or sup(sg), where sub(sg) denotes all subgraphs of sg and sup(sg) all supergraphs of sg.
  - Our objective function measures discriminative power w.r.t. the current feature set P0.
- (3) Computational issues: see the next slide.

Feature Mining: More on the Objective Function
[Figure: the query set Q and its subset minSupQueries(p, Q), the queries affected by the candidate feature p.]
- Computing D(p) for each enumerated candidate feature p is expensive.

Feature Mining: Challenges
- (1) The objective function is expensive to evaluate
- (2) The search space for the new index feature p is exponential
- (3) The objective function is neither monotonic nor anti-monotonic, so the Apriori rule cannot be used
- (4) Traditional graph feature mining algorithms (e.g., LeapSearch) do not work here, since they rely solely on frequencies

Feature Mining: Estimating the Objective Function
- The objective value of a new subgraph feature p has easy-to-compute upper and lower bounds, Upp(p, P0) and Low(p, P0).
- Two ways to exploit the bounds:
  - (1) Lazy calculation: gain(p, P0) need not be computed when
    - Upp(p, P0) < gain(p*, P0), or
    - Low(p, P0) > gain(p*, P0)
  - (2) (Second approach; details and proof omitted on the slide.)
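A sketch of the lazy-evaluation rule, with upp, low, and gain as callables standing in for Upp(·, P0), Low(·, P0), and gain(·, P0); p* is tracked as the incumbent best:

    def maybe_update_best(p, best, upp, low, gain):
        """best = (p_star, gain_star). Skip the exact gain when bounds decide."""
        p_star, gain_star = best
        if upp(p) <= gain_star:      # cannot beat the incumbent: skip exact eval
            return best
        if low(p) > gain_star:       # certainly beats it; defer the exact value
            return (p, low(p))       # conservative: store the certified lower bound
        g = gain(p)                  # bounds inconclusive: pay the full cost
        return (p, g) if g > gain_star else best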

Feature Mining: Branch and Bound
- Exhaustive search over the DFS-code tree:
  - Every graph pattern has a canonical string label; the DFS-code tree is a prefix tree of these labels, so enumerating the tree enumerates candidate features.
- Example (tree with nodes n1–n7): depth-first search visits n1, n2, n3, n4 and finds that the current best pattern is n3. On reaching n5, a pre-computed bound shows that n5 and all its offspring have gain below n3's, so the branch is pruned and the search jumps to n7.
- Note: the objective function is neither monotonic nor anti-monotonic, so pruning must rely on bounds rather than Apriori-style rules.

Feature Mining: Branch and Bound
- For each branch (e.g., the branch rooted at n5), compute a branch upper bound that dominates the gain of every node on that branch.
- Theorem: for a feature p there exists a bound BUpp(p, P0) such that gain(p', P0) ≤ BUpp(p, P0) for every supergraph p' of p. (Proof omitted.)
- Although correct, this upper bound is not tight.
[Figure: the query set Q and its subset minSupQueries(p, Q).]
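A sketch of the pruned traversal; children enumerates DFS-code extensions of a pattern, and branch_upper_bound stands in for BUpp(·, P0):

    def branch_and_bound(root, children, gain, branch_upper_bound):
        best, best_gain = None, float("-inf")
        stack = [root]
        while stack:
            node = stack.pop()
            if branch_upper_bound(node) <= best_gain:
                continue                     # no descendant can beat the incumbent
            g = gain(node)
            if g > best_gain:
                best, best_gain = node, g
            # gain is neither monotonic nor anti-monotonic, so we must keep
            # expanding a node even when its own gain is low.
            stack.extend(children(node))
        return best, best_gain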

Feature Mining: Heuristic-Based Search Space Partition
- Problem: the search always starts from the same root and proceeds in the same order.
- Observation: the new pattern p must be a supergraph of some pattern already in P0 (e.g., p ⊃ p2 in Figure 4). A pattern r ∈ P0 is a promising root when:
  - 1) A large proportion of the queries are supergraphs of r; otherwise few queries could use a feature p ⊃ r for filtering
  - 2) The average candidate-set size for queries ⊇ r is large, so improvement on those queries matters most

Feature Mining: Heuristic-Based Search Space Partition
- Procedure (sketched in code below):
  - (1) gain(p*) = 0
  - (2) Sort all roots r ∈ P0 by their sPoint(r) score in decreasing order
  - (3) Iterate:
    - for i = 1 to |P0| do
      - if the branch upper bound BUpp(ri) < gain(p*) then break
      - else find the minimal-supergraph queries minSup(ri, Q)
      - p*(ri) = BranchAndBoundSearch(minSup(ri, Q), p*)
      - if gain(p*(ri)) > gain(p*) then update p* = p*(ri)
- Discussion:
  - (1) Candidate features are enumerated as descendants of a "root" feature
  - (2) Candidates need only be frequent on D(r), not on all of D, permitting a smaller minimum support
  - (3) Roots are visited in decreasing sPoint(r) order, so a near-optimal feature is found quickly
  - (4) Extends to top-k feature selection
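A runnable rendering of the procedure under the same placeholder names (sPoint, BUpp, min_sup_queries); the branch_and_bound parameter is assumed here to search only the descendants of the given root against the given query subset:

    def partitioned_search(P0, queries, sPoint, BUpp, min_sup_queries,
                           branch_and_bound):
        best, best_gain = None, 0.0
        # Visit the most promising roots first: descendants of high-sPoint
        # features are searched before low-scoring regions of the lattice.
        for r in sorted(P0, key=sPoint, reverse=True):
            if BUpp(r) < best_gain:      # per the slide: stop once a root's
                break                    # branch bound falls below gain(p*)
            q_r = min_sup_queries(r, queries)    # the queries that r affects
            cand, g = branch_and_bound(r, q_r, best_gain)
            if g > best_gain:
                best, best_gain = cand, g
        return best, best_gain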

Outline
- 1. Background
- 2. Lindex: a general index structure for subgraph search
- 3. Direct feature mining for subgraph search
  - Motivation
  - Problem definition & objective function
  - Branch & bound
  - Partition of the search space
  - Experimental results

Feature Mining: Experiments
- Setup: the same AIDS dataset D
  - Index0: Gindex with minimum support 0.05
  - IndexDF: Gindex with minimum support 0.02 (1,175 new features added)
  - Index QG/BB/TK: indexes updated starting from Index0
    - BB: branch and bound
    - QG: search space partitioned
    - TK: top-k features returned in one iteration
- Comparison: achieving the same decrease in candidate-set size

Feature Mining: Experiments
[Chart lost in extraction.]

Feature Mining: Experiments
- Setup: two datasets D1 and D2 (80% overlap)
  - DF(D1): Gindex on dataset D1
  - DF(D2): Gindex on dataset D2
  - Index QG/BB/TK: indexes updated starting from DF(D1)
    - BB: branch and bound
    - QG: search space partitioned
    - TK: top-k features returned in one iteration
- Exp1: D2 = D1 + 20% new graphs
- Exp2: D2 = 80% of D1 + 20% new graphs
- Iterate until the objective value is stable

Feature Mining: Experiments
DF vs. iterative methods [chart lost in extraction]

Feature Mining: Experiments
[Chart lost in extraction.]

Feature Mining: Experiments
TCFG vs. iterative methods; MimR vs. iterative methods [charts lost in extraction]
- Iterate until the gain is stable

Conclusion
- 1. Lindex: an index structure general enough to support any feature type
  - Compact
  - Effective
  - Efficient
- 2. Direct feature mining
  - A third-generation algorithm with no frequent-feature-enumeration bottleneck
  - Effective at updating the index to accommodate database changes
  - Runs much faster than rebuilding the index from scratch
  - The selected features filter more false positives than features mined from scratch

Thanks! Questions?