Cost-based Optimization of Graph Queries Silke Trißl Humboldt-Universität zu Berlin Knowledge Management in Bioinformatics IDAR 2007.

Presentation transcript:

Cost-based Optimization of Graph Queries Silke Trißl Humboldt-Universität zu Berlin Knowledge Management in Bioinformatics IDAR 2007

2 Silke Trißl - Cost-based Optimization of Graph Queries - 1st PhD Workshop IDAR
Motivation - Biological Networks
[Figure: biological network whose nodes carry attributes such as Name, Sequence, TYPE, Function, Location, …]
Navigation bar: Motivation, Optimization, Nodes, Paths, Future Work, Relational algebra, Conclusion

3 Querying Networks - PQL
- Pathway Query Language (PQL) [Leser, 2005]
- Syntax for querying graphs
- Find subgraphs matching the query graph

    SELECT B
    FROM network
    LET node A, node B, path P
    WHERE A.name = 'Glucose'
      AND A ISA compound
      AND B ISA enzyme
      AND P.path = A[-*]B;

[Figure: query graph. Node A (name = Glucose, ISA compound) is connected by path P to node B (ISA enzyme).]
Find all enzymes that are directly or indirectly affected by "Glucose".

4 Node Conditions
Nodes can contain conditions on
- Attributes: A.name = 'Glucose'
- TYPE (of hierarchy): A ISA compound
- Function (of ontology): A HASFUNC ('catalysis', GO)
- Location: A ISIN ('Human', taxonomy)
[Figure: query graph (A: name = Glucose, ISA compound; B: ISA enzyme; path P) and part of the TYPE hierarchy with the concepts root, molecule, interaction, macromolecule, compound, sugar, gene, protein, ion, mRNA, catalysis, inhibition, enzyme.]

5 Path Conditions
Paths can contain conditions on
- Edges: P.path = A[-*]B AND P.length = 1
- Path existence: P.path = A[-*]B
- Path length: P.path = A[-*]B AND P.length < 10
- Start node: P.start = A
- Containment: P { R
[Figure: query graph (A: name = Glucose, ISA compound; B: ISA gene; path P) and a graph containing the nodes a and b.]

6 Result of Graph Queries
- Search for matching subgraphs
- Find node and path bindings for the query variables in the network
[Figure: the query graph (A: name = Glucose, ISA compound; B: ISA enzyme; path P) matched against the network.]

7 Outline
- Motivation
- Optimize Graph Queries
  - Evaluate node conditions
  - Evaluate path conditions
- Future Work
  - Relational algebra for graph queries
- Conclusion

8 Evaluation of Node Conditions
- Node attributes
  - Select operator (σ) on the Node table
- Node types, functions, and locations
  - Hierarchy operator (χ): returns the specified concept and all successor concepts
[Figure: query graph (A: name = Glucose, ISA compound; B: ISA gene; path P).]
Query plan for node A: σ[name=Glucose](Node) ⋈[Node.TYPE=TYPE] χ[compound](TYPE)
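The query plan for node A can be illustrated with a small sketch. It is not taken from the talk: the parent/child edges of the TYPE hierarchy, the rows of the Node table, and the function names chi and plan_for_node_a are all assumptions made for illustration. Only the concept names and the operator semantics (χ returns a concept and all of its successors) come from the slides.

```python
from collections import deque

# Hypothetical excerpt of the TYPE hierarchy: parent concept -> child concepts.
# The concept names appear on the slide; the edges between them are assumed.
TYPE_HIERARCHY = {
    "root": ["molecule", "interaction"],
    "molecule": ["macromolecule", "compound", "ion"],
    "macromolecule": ["gene", "protein", "mRNA"],
    "compound": ["sugar"],
    "protein": ["enzyme"],
    "interaction": ["catalysis", "inhibition"],
}

# Hypothetical Node table rows: (node id, name, TYPE).
NODES = [
    (1, "Glucose", "sugar"),
    (2, "Hexokinase", "enzyme"),
    (3, "Glucose", "gene"),
]


def chi(concept, hierarchy):
    """Hierarchy operator: the concept itself plus all successor concepts."""
    result = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in hierarchy.get(current, []):
            if child not in result:
                result.add(child)
                queue.append(child)
    return result


def plan_for_node_a(nodes, hierarchy):
    """sigma[name=Glucose](Node) joined with chi[compound](TYPE) on Node.TYPE = TYPE."""
    selected = [row for row in nodes if row[1] == "Glucose"]  # select operator
    compound_types = chi("compound", hierarchy)                # hierarchy operator
    return [row for row in selected if row[2] in compound_types]  # join on TYPE


if __name__ == "__main__":
    print(sorted(chi("compound", TYPE_HIERARCHY)))
    print(plan_for_node_a(NODES, TYPE_HIERARCHY))
```

In this toy data only the node typed as sugar qualifies, because sugar is a successor of compound while gene is not.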

9 How to evaluate Path conditions?
- Recursively traverse the graph: Edge ⋈ Edge ⋈ … ⋈ …
  - Arbitrary number of joins
  - No possibility to optimize the execution
- Need for new logical and physical operators
[Figure: graph containing the nodes a and b.]

10 Path Existence Operator, Φ
- Node variables A and B
  - Set of nodes V bound to A
  - Set of nodes W bound to B
- Path variable P
  - Condition on P: path from A to B
- A Φ B returns the set of node pairs (v, w) for which paths from v ∈ V to w ∈ W in G exist.
[Figure: query graph (A: name = Glucose, ISA compound; B: ISA gene; path P).]
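A minimal sketch of the logical semantics of Φ, using a plain breadth-first search as the physical implementation. The function names are hypothetical, and the example graph is an assumption: its edge set is reconstructed from the GRIPP index table that appears later in the talk. The sketch also assumes that a path consists of at least one edge.

```python
from collections import deque

# Example graph as adjacency lists (edge set assumed, see lead-in).
GRAPH = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}


def reachable_from(graph, start):
    """All nodes reachable from start via at least one edge (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for successor in graph.get(current, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def phi(graph, v_set, w_set):
    """Path existence operator: pairs (v, w) with a path from v in V to w in W."""
    pairs = set()
    for v in v_set:
        for w in reachable_from(graph, v) & set(w_set):
            pairs.add((v, w))
    return pairs


if __name__ == "__main__":
    print(sorted(phi(GRAPH, {"D"}, {"C", "E"})))
```

Running one traversal per start node is exactly the approach that becomes expensive on large graphs, which is the motivation for index-based implementations of Φ.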

11 Physical Implementation of Φ
- Graph traversal at query time
  - Breadth-first or depth-first search
- Query precomputed index structure
  - Transitive closure (only for small graphs)
  - GRIPP [Trißl et al., 2007]
    - GRIPP index table, IND(G)
    - one instance for every node v in G

12 GRIPP Index Creation
- Depth-first traversal of G
- We reach a node v
  - for the first time
    - add tree instance of v to IND(G)
    - proceed traversal
  - again
    - add non-tree instance of v to IND(G)
    - do not traverse child nodes of v
[Figure: graph G with the nodes R, A, B, C, D, E, F, G, H; every node instance is labeled with the [pre, post] interval assigned during the traversal.]
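The traversal rules above can be sketched as follows. This is an illustration under assumptions, not the published GRIPP implementation: the function name is hypothetical, the graph is assumed to have a single root from which every node is reachable, and the edge set and child order are reconstructed from the index table shown in the talk.

```python
# Example graph as adjacency lists (edge set and child order assumed).
GRAPH = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}


def build_gripp_index(graph, root):
    """Depth-first traversal that assigns pre/post values to node instances.

    Returns rows (node, pre, post, inst) ordered by pre value, where inst is
    "tree" for the first visit of a node and "non" for every later visit.
    """
    index = []
    seen = set()
    counter = 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        if node in seen:
            # Reached again: add a non-tree instance, do not traverse children.
            index.append((node, pre, counter, "non"))
            counter += 1
            return
        # Reached for the first time: add a tree instance, proceed traversal.
        seen.add(node)
        slot = len(index)
        index.append(None)  # placeholder until the post value is known
        for child in graph.get(node, []):
            visit(child)
        index[slot] = (node, pre, counter, "tree")
        counter += 1

    visit(root)
    return index


if __name__ == "__main__":
    for row in build_gripp_index(GRAPH, "R"):
        print(row)
```

The recursive formulation keeps the sketch short; a graph with long paths would need an explicit stack to stay within Python's recursion limit.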

13 GRIPP Index Table, IND(G)
Is node C reachable from node D?
[Figure: graph G with the nodes R, A, B, C, D, E, F, G, H and their [pre, post] intervals; nodes C and D are highlighted.]

GRIPP index, IND(G):

node  pre  post  inst
R       0    21  tree
A       1    20  tree
B       2     7
E       3     4
F       5     6
C       8     9
D      10    19  tree
G      11    14  tree
B      12    13  non
H      15    18  tree
A      16    17  non

14 Order Tree, O(G)
[Figure: order tree O(G) next to the GRIPP index table IND(G) of slide 13.]
w is reachable from v in O(G) iff v_pre < w_pre < v_post
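The interval condition can be checked directly on the index table. The sketch below is illustrative: the function names are hypothetical, and the rows for B, E, F, and C are assumed to be tree instances because the slide leaves their inst column empty.

```python
# GRIPP index table from the slides: (node, pre, post, inst).
# The inst value "tree" for B, E, F, and C is an assumption.
IND = [
    ("R", 0, 21, "tree"),
    ("A", 1, 20, "tree"),
    ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"),
    ("F", 5, 6, "tree"),
    ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"),
    ("G", 11, 14, "tree"),
    ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"),
    ("A", 16, 17, "non"),
]


def tree_instance(node):
    """Return (pre, post) of the tree instance of a node."""
    for name, pre, post, inst in IND:
        if name == node and inst == "tree":
            return pre, post
    raise KeyError(node)


def reachable_in_order_tree(v, w):
    """True if some instance of w lies inside the interval of v's tree instance."""
    v_pre, v_post = tree_instance(v)
    return any(name == w and v_pre < pre < v_post for name, pre, _, _ in IND)


if __name__ == "__main__":
    print(reachable_in_order_tree("A", "C"))
    print(reachable_in_order_tree("D", "B"))
    print(reachable_in_order_tree("D", "C"))
```

A negative answer only says that w is not below v in the order tree. Node C is not below D in O(G), although it may still be reachable in G through a non-tree instance, which is what the query strategy on the following slides handles.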


16 Query strategy - Step 1
- Retrieve the reachable instance set of start node v, called RIS(v)
  - Retrieve RIS(D)
  - Requires only a single query on IND(G)
- If C ∈ RIS(D)
  - return true
  - stop the search
- Else
  - proceed to Step 2

17 Query strategy - Step 2
- Search for non-tree instances in RIS(v)
  - The nodes of these instances are hop nodes
- Check every i ∈ RIS(D)
  - If i is a tree instance [G and H]
    - Done
  - If i is a non-tree instance [A and B]
    - i has no successors in O(G), but possibly in G
    - proceed to Step 3

18 Query strategy - Step 3
- Extend the search
  - using hop nodes v1, …, vn
  - Obtain the tree instance of node B
  - Proceed to Step 1
- Repeat steps 1…3 until
  - an instance of node C is found
  - or no more hop nodes are available
Depth-first traversal of O(G) using hop nodes
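Steps 1 to 3 can be sketched as a loop over hop nodes. This is a simplified illustration, not the published GRIPP algorithm, which applies further pruning. The function names are hypothetical, and the rows for B, E, F, and C are assumed to be tree instances.

```python
# GRIPP index table from the slides: (node, pre, post, inst).
# The inst value "tree" for B, E, F, and C is an assumption.
IND = [
    ("R", 0, 21, "tree"),
    ("A", 1, 20, "tree"),
    ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"),
    ("F", 5, 6, "tree"),
    ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"),
    ("G", 11, 14, "tree"),
    ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"),
    ("A", 16, 17, "non"),
]


def tree_instance(node):
    """Return (pre, post) of the tree instance of a node."""
    for name, pre, post, inst in IND:
        if name == node and inst == "tree":
            return pre, post
    raise KeyError(node)


def ris(node):
    """Reachable instance set: all instances inside the node's tree interval."""
    pre, post = tree_instance(node)
    return [row for row in IND if pre < row[1] < post]


def reachable(start, target):
    """Is target reachable from start? Depth-first search over hop nodes."""
    used = set()      # nodes whose RIS has already been retrieved
    stack = [start]
    while stack:
        v = stack.pop()
        if v in used:
            continue
        used.add(v)
        instances = ris(v)                           # Step 1
        if any(name == target for name, _, _, _ in instances):
            return True
        for name, _, _, inst in instances:           # Step 2
            if inst == "non" and name not in used:
                stack.append(name)                   # Step 3: hop node
    return False


if __name__ == "__main__":
    print(reachable("D", "C"))
    print(reachable("B", "C"))
```

For the query from the slides, RIS(D) contains G, H, and the non-tree instances of B and A. C is not among them, so the search continues with the hop nodes and finds C in RIS(A).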

19 GRIPP - Sets of Nodes
[Figure: graph G with the nodes R, A, B, C, D, E, F, G, H, and the query graph A -P-> B, where variable A is bound to node D and variable B to the nodes C and E.]

20 GRIPP - Sets of Nodes
- Two different strategies
  - Single node pair
    - Evaluate reachability for every node pair separately
  - Set-oriented
    - Evaluate reachability for the set in one step

21 Query GRIPP - Single Node Pair
- First evaluate reachability(D,E)
- Then reachability(D,C) separately
[Figure: evaluation on the order tree; result shown: true]

22 Query GRIPP - Set-oriented
- First query the order tree completely
- Then search used nodes and target nodes
  - If pre_Used < pre_Target < post_Used → true

Used nodes:

node  pre  post
D      10    19
B       2     7
A       1    20

Target nodes:

node  pre  post
C       8     9
E       3     4

Result shown: true
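A sketch of the set-oriented strategy under the same assumptions as before: the order tree is first explored completely from the start node to collect the used nodes, and the used nodes are then compared with all target instances in one pass. Function names are hypothetical, and B, E, F, and C are assumed to be tree instances.

```python
# GRIPP index table from the slides: (node, pre, post, inst).
# The inst value "tree" for B, E, F, and C is an assumption.
IND = [
    ("R", 0, 21, "tree"),
    ("A", 1, 20, "tree"),
    ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"),
    ("F", 5, 6, "tree"),
    ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"),
    ("G", 11, 14, "tree"),
    ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"),
    ("A", 16, 17, "non"),
]


def tree_instance(node):
    """Return (pre, post) of the tree instance of a node."""
    for name, pre, post, inst in IND:
        if name == node and inst == "tree":
            return pre, post
    raise KeyError(node)


def used_nodes(start):
    """Start node plus all hop nodes, each with its tree interval."""
    used = {}
    stack = [start]
    while stack:
        v = stack.pop()
        if v in used:
            continue
        pre, post = tree_instance(v)
        used[v] = (pre, post)
        for name, p, _, inst in IND:
            if pre < p < post and inst == "non" and name not in used:
                stack.append(name)
    return used


def reachable_targets(start, targets):
    """Targets with an instance inside the interval of at least one used node."""
    intervals = list(used_nodes(start).values())
    return {
        name
        for name, pre, _, _ in IND
        if name in targets
        and any(u_pre < pre < u_post for u_pre, u_post in intervals)
    }


if __name__ == "__main__":
    print(sorted(used_nodes("D")))
    print(sorted(reachable_targets("D", {"C", "E"})))
```

For start node D the used nodes are D, B, and A, matching the table on the slide, and both targets C and E fall inside one of their intervals.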

23 Cost model
- Single node pair strategy
  - query time linear in size of target set
  - better for few target nodes
- Set-oriented strategy
  - almost constant query times
  - better for many target nodes
[Figure: average query time for both strategies and increasing size of target node set on a graph with 10,000 nodes and 20,000 edges.]

24 Outline
- Motivation
- Optimize Graph Queries
  - Evaluate node conditions
  - Evaluate path conditions
- Future Work
  - Relational algebra for graph queries
- Conclusion

25 Future Work
- Towards an algebra for graph queries
- Define new operators
  - Logical
  - Physical
- Determine cost functions
  - Estimate the size of result sets
- Define rewrite rules
  - Which operations can be pushed?

26 Future Work - New Operators
- Path length operator
  - Evaluate the length of a path
- Possible solution
  - Store parts of paths
  - e.g., up to length x [Giugno & Shasha, 2002]
[Figure: graph containing the nodes a and b.]

27 Future Work - Cost Model
- Assign cost models to physical operators
- Estimate the size of result sets
  - Between how many node pairs does a path exist?
    - Possibly of certain length?
- Possible solution
  - Sampling
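The talk names sampling only as a possible direction and gives no procedure. The following sketch shows one simple way such an estimate could work, purely as an assumption: sample start nodes uniformly, count the nodes reachable from each by breadth-first search, and scale the sum to the whole graph. The function names, the example graph (edge set reconstructed from the index table in the talk), and the requirement of at least one edge per path are all illustrative choices.

```python
import random
from collections import deque

# Example graph as adjacency lists (edge set assumed).
GRAPH = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}


def all_nodes(graph):
    """Every node that occurs as a source or a target of an edge."""
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)
    return sorted(nodes)


def count_reachable(graph, start):
    """Number of nodes reachable from start via at least one edge."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for successor in graph.get(current, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return len(seen)


def estimate_path_pairs(graph, sample_size, rng):
    """Estimate the number of node pairs (v, w) connected by a path."""
    nodes = all_nodes(graph)
    sample = rng.sample(nodes, min(sample_size, len(nodes)))
    total = sum(count_reachable(graph, v) for v in sample)
    return total * len(nodes) / len(sample)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    print(estimate_path_pairs(GRAPH, 3, rng))
    print(estimate_path_pairs(GRAPH, 9, rng))  # full sample: exact count
```

With a sample that covers all nodes the estimate equals the exact number of connected pairs; smaller samples trade accuracy for fewer traversals.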

28 Rewrite Query Plan

    SELECT B
    FROM network
    LET node A, node B, path P
    WHERE A.name = 'Glucose'
      AND A ISA compound
      AND B ISA enzyme
      AND P.path = A[-*]B;

[Figure: query graph (A: name = Glucose, ISA compound; B: ISA enzyme; path P).]
Query plan:
π[B]( (σ[name=Glucose](Node) ⋈[Node.TYPE=TYPE] χ[compound](TYPE)) Φ (Node ⋈[Node.TYPE=TYPE] χ[enzyme](TYPE)) )

29 Better Plan?
Plan 1:
π[B]( (σ[name=Glucose](Node) ⋈[Node.TYPE=TYPE] χ[compound](TYPE)) Φ (Node ⋈[Node.TYPE=TYPE] χ[enzyme](TYPE)) )
Cardinalities shown in the figure: 1 and 18,000
Plan 2:
Same operators (π[B], Φ, σ[name=Glucose], χ[compound], χ[enzyme]) over Node and TYPE, but the join with χ[enzyme](TYPE) uses the condition B.TYPE=TYPE.
Cardinalities shown in the figure: ,000 and 2,000

30 Conclusion
- Optimize the execution of graph queries
  - Use cost-based query optimization
- Extend relational algebra
  - New operators
    - Path existence operator, Φ
    - Path length operator
  - Cost functions
    - Estimate the size of result sets
  - Rewrite rules

Thanks for your attention Special thanks to my PhD supervisor Ulf Leser Silke Trißl Humboldt-Universität zu Berlin Work sponsored by IDAR 2007

32 References
- U. Leser. A query language for biological networks. Bioinformatics, 21 Suppl 2:ii33-ii39, Sep
- B. Eckman and P. G. Brown. Graph data management for molecular and cell biology. IBM J. Res & Dev., 50(6):545-560, Nov
- F. Sohler and R. Zimmer. Identifying active transcription factors and kinases from expression data using pathway queries. Bioinformatics, 21 Suppl 2:ii115-ii122, Sep
- J. McHugh and J. Widom. Query Optimization for XML. In Proc. of the VLDB Conference, pages 315-326, Morgan Kaufmann.
- Y. Wu, J. M. Patel, and H. V. Jagadish. Structural Join Order Selection for XML Query Optimization. In Proc. of the ICDE Conference, pages 443-454, IEEE Computer Society.
- S. Trißl and U. Leser. Fast and Practical Indexing and Querying of Very Large Graphs. In Proc. of the ACM SIGMOD Conference, to appear, ACM Press.