Cost-based Optimization of Graph Queries Silke Trißl Humboldt-Universität zu Berlin Knowledge Management in Bioinformatics IDAR 2007
2 Silke Trißl - Cost-based Optimization of Graph Queries - 1st PhD Workshop IDAR
Motivation – Biological Networks
[Figure: biological network; node attributes shown: Name, Sequence, TYPE, Function, Location, …]
Navigation bar: Motivation | Optimization | Nodes | Paths | Future Work | Relational algebra | Conclusion
Querying Networks - PQL

Pathway Query Language (PQL) [Leser, 2005]
– Syntax for querying graphs
– Find subgraphs matching the query graph

Example: find all enzymes that are directly or indirectly affected by 'Glucose'.

    SELECT B
    FROM network
    LET node A, node B, path P
    WHERE A.name = 'Glucose'
      AND A ISA compound
      AND B ISA enzyme
      AND P.path = A[-*]B;

Query graph: A (name = Glucose, ISA compound) --P--> B (ISA enzyme)
Node Conditions

Nodes can contain conditions on:
– Attributes: A.name = 'Glucose'
– TYPE (of hierarchy): A ISA compound
– Function (of ontology): A HASFUNC ('catalysis', GO)
– Location: A ISIN ('Human', taxonomy)

[Figure: partial TYPE hierarchy with the concepts root, molecule, interaction, macromolecule, compound, sugar, gene, protein, ion, mRNA, catalysis, inhibition, enzyme]
Query graph: A (name = Glucose, ISA compound) --P--> B (ISA enzyme)
Path Conditions

Paths can contain conditions on:
– Edges: P.path = A[-*]B AND P.length = 1
– Path existence: P.path = A[-*]B
– Path length: P.path = A[-*]B AND P.length < 10
– Start node: P.start = A
– Containment: P { R

Query graph: A (name = Glucose, ISA compound) --P--> B (ISA gene)
[Figure: example graph with nodes a and b]
Result of Graph Queries

Search for matching subgraphs
– Find node and path bindings for the query variables in the network

[Figure: query graph A (name = Glucose, ISA compound) --P--> B (ISA enzyme) matched against the network]
Outline

Motivation
Optimize Graph Queries
Evaluate node conditions
Evaluate path conditions
Future Work
Relational algebra for graph queries
Conclusion
Evaluation of Node Conditions

Node attributes
– Select operator (σ) on the Node table
Node types, functions, and locations
– Hierarchy operator (χ): returns the specified concept and all its successor concepts

Query graph: A (name = Glucose, ISA compound) --P--> B (ISA gene)

Query plan for node A:
    σ_{name=Glucose}(Node) ⋈_{Node.TYPE=TYPE} χ_{compound}(TYPE)
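The query plan for node A can be illustrated with a minimal Python sketch. The parent links in `TYPE_PARENT` and the rows in `NODES` are hypothetical toy data (the slide does not give the shape of the hierarchy), and `chi`, `sigma`, and `plan_for_node_a` are illustrative names, not part of PQL.

```python
from collections import deque

# Hypothetical child -> parent links of a TYPE hierarchy (assumed for illustration).
TYPE_PARENT = {
    "molecule": "root",
    "compound": "molecule",
    "sugar": "compound",
    "ion": "compound",
    "macromolecule": "molecule",
    "protein": "macromolecule",
    "enzyme": "protein",
}

# Hypothetical rows of the Node table.
NODES = [
    {"id": 1, "name": "Glucose", "TYPE": "sugar"},
    {"id": 2, "name": "Glucose", "TYPE": "enzyme"},
    {"id": 3, "name": "Hexokinase", "TYPE": "enzyme"},
]


def chi(concept, parent_of):
    """Hierarchy operator: the concept itself plus all successor concepts."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    result, queue = {concept}, deque([concept])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in result:
                result.add(child)
                queue.append(child)
    return result


def sigma(rows, predicate):
    """Select operator: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def plan_for_node_a(nodes, parent_of):
    """sigma_{name=Glucose}(Node) joined with chi_{compound}(TYPE) on Node.TYPE = TYPE."""
    selected = sigma(nodes, lambda row: row["name"] == "Glucose")
    types = chi("compound", parent_of)
    return [row for row in selected if row["TYPE"] in types]
```

With this toy data, only the node with id 1 satisfies both conditions, because its TYPE `sugar` lies below `compound` in the assumed hierarchy.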
How to Evaluate Path Conditions?

Recursively traverse the graph
– Arbitrary number of joins: Edge ⋈ Edge ⋈ … ⋈ …
– No possibility to optimize the execution

Need for new logical and physical operators
Path Existence Operator, Φ

Node variables A and B
– Set of nodes V bound to A
– Set of nodes W bound to B
Path variable P
– Condition on P: path from A to B

A Φ B returns the set of node pairs (v, w) for which a path from v ∈ V to w ∈ W exists in G.

Query graph: A (name = Glucose, ISA compound) --P--> B (ISA gene)
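As a rough illustration of what Φ computes, the following sketch evaluates it by plain breadth-first search at query time. The edge list `EDGES` is an assumption: it is inferred from the pre/post intervals of the GRIPP index table shown later in the deck, and `phi` is an illustrative name only.

```python
from collections import deque

# Assumed edge list of the example graph G (inferred from the index table).
EDGES = [
    ("R", "A"), ("A", "B"), ("B", "E"), ("B", "F"), ("A", "C"),
    ("A", "D"), ("D", "G"), ("G", "B"), ("D", "H"), ("H", "A"),
]


def successors(edges):
    """Build an adjacency list from (source, target) pairs."""
    adj = {}
    for source, target in edges:
        adj.setdefault(source, []).append(target)
    return adj


def reachable_from(adj, start):
    """All nodes reachable from start over at least one edge (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def phi(edges, V, W):
    """Path existence operator: pairs (v, w) with v in V, w in W, and a path v -> w."""
    adj = successors(edges)
    targets = set(W)
    return {(v, w) for v in V for w in reachable_from(adj, v) & targets}
```

For example, `phi(EDGES, {"D"}, {"C", "E"})` yields both pairs, since D reaches A over H and A reaches C and E.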
Physical Implementation of Φ

Graph traversal at query time
– Breadth-first or depth-first search
Query a precomputed index structure
– Transitive closure (only for small graphs)
– GRIPP [Trißl et al., 2007]
  – GRIPP index table, IND(G)
  – one instance for every node v in G
GRIPP Index Creation

Depth-first traversal of G
We reach a node v for the first time
– add a tree instance of v to IND(G)
– proceed with the traversal
We reach a node v again
– add a non-tree instance of v to IND(G)
– do not traverse the child nodes of v

[Figure: graph G with nodes R, A, B, C, D, E, F, G, H, annotated with pre/post intervals: R [0,21], A [1,20], B [2,7], E [3,4], F [5,6], C [8,9], D [10,19], G [11,14], B [12,13], H [15,18], A [16,17]]
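The traversal rules above can be sketched in a few lines of Python. The adjacency list `ADJ`, including the order of the children, is an assumption inferred from the pre/post intervals in the figure, and `build_gripp_index` is an illustrative name, not the authors' implementation.

```python
# Assumed adjacency list of G; child order chosen to match the figure's intervals.
ADJ = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}


def build_gripp_index(adj, root):
    """Depth-first traversal that assigns pre/post values to node instances."""
    index = []      # rows: (node, pre, post, inst), kept in pre-order
    seen = set()
    counter = 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        position = len(index)
        index.append(None)  # placeholder so rows stay sorted by pre value
        if node in seen:
            inst = "non"    # reached again: do not traverse the children
        else:
            inst = "tree"   # first visit: traverse the children
            seen.add(node)
            for child in adj.get(node, []):
                visit(child)
        post = counter
        counter += 1
        index[position] = (node, pre, post, inst)

    visit(root)
    return index
```

Called as `build_gripp_index(ADJ, "R")`, the sketch produces eleven instances for the nine nodes: one tree instance per node plus non-tree instances for B and A.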
GRIPP Index Table, IND(G)

Is node C reachable from node D?

[Figure: graph G with nodes R, A, B, C, D, E, F, G, H and their pre/post intervals; nodes C and D highlighted]

GRIPP index, IND(G):

    node  pre  post  inst
    R     0    21    tree
    A     1    20    tree
    B     2    7     tree
    E     3    4     tree
    F     5    6     tree
    C     8    9     tree
    D     10   19    tree
    G     11   14    tree
    B     12   13    non
    H     15   18    tree
    A     16   17    non
Order Tree, O(G)

[Figure: order tree O(G), drawn from the pre/post values of the index table IND(G) on the previous slide]

w is reachable from v in O(G) iff v.pre < w.pre < v.post
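The interval condition can be checked directly against the index table. The following sketch hard-codes IND(G) from the previous slide (with the first instance of every node taken as its tree instance, which is an assumption for the rows whose inst value is not shown); the function names are illustrative.

```python
# IND(G) as (node, pre, post, inst) rows.
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non"),
]


def tree_interval(index, node):
    """Return (pre, post) of the tree instance of a node."""
    return next((pre, post) for n, pre, post, inst in index
                if n == node and inst == "tree")


def reachable_in_order_tree(index, v, w):
    """True if some instance of w satisfies v.pre < w.pre < v.post."""
    v_pre, v_post = tree_interval(index, v)
    return any(n == w and v_pre < pre < v_post for n, pre, _, _ in index)
```

For the running question, `reachable_in_order_tree(IND, "D", "C")` is false, because C has pre value 8 and D covers only the interval from 10 to 19. The order tree alone therefore does not answer the query, which motivates the hop-node strategy on the following slides.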
Query Strategy – Step 1

Retrieve the reachable instance set of start node v, called RIS(v)
– Example: retrieve RIS(D)
– Requires only a single query on IND(G)

If C ∈ RIS(D)
– return true and stop the search
Else
– proceed to Step 2
Query Strategy – Step 2

Search for non-tree instances in RIS(v)
– The nodes of these instances are hop nodes

Check every i ∈ RIS(D)
– If i is a tree instance [G and H]: done
– If i is a non-tree instance [A and B]: i has no successors in O(G), but possibly in G; proceed to Step 3
Query Strategy – Step 3

Extend the search using hop nodes v_1, …, v_n
– Obtain the tree instance of node B
– Proceed to Step 1

Repeat Steps 1 to 3 until an instance of node C is found or no more hop nodes are available

Depth-first traversal of O(G) using hop nodes
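Steps 1 to 3 can be combined into a short, simplified sketch. It hard-codes IND(G) from the index table slide (rows without a shown inst value are assumed to be tree instances), keeps the index as a Python list instead of a database table, and omits any pruning of hop nodes; `ris` and `reachable` are illustrative names.

```python
# IND(G) as (node, pre, post, inst) rows.
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non"),
]


def ris(index, node):
    """Reachable instance set: instances inside the tree interval of node."""
    v_pre, v_post = next((pre, post) for n, pre, post, inst in index
                         if n == node and inst == "tree")
    return [row for row in index if v_pre < row[1] < v_post]


def reachable(index, start, target):
    """Depth-first search over the order tree, extended through hop nodes."""
    used = set()
    stack = [start]
    while stack:
        v = stack.pop()
        if v in used:
            continue
        used.add(v)
        instances = ris(index, v)                 # Step 1: one lookup on IND(G)
        if any(n == target for n, *_ in instances):
            return True
        for n, _, _, inst in instances:           # Step 2: find hop nodes
            if inst == "non" and n not in used:
                stack.append(n)                   # Step 3: continue from hop node
    return False
```

For the running example, RIS(D) contains G, H and the non-tree instances of B and A. Continuing from hop node A finds C, so `reachable(IND, "D", "C")` returns true.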
GRIPP – Sets of Nodes

[Figure: graph G with nodes R, A, B, C, D, E, F, G, H; query A --P--> B, with A bound to node D and B bound to nodes C and E]
GRIPP – Sets of Nodes

Two different strategies
Single node pair
– Evaluate reachability for every node pair separately
Set-oriented
– Evaluate reachability for the whole set in one step
Query GRIPP – Single Node Pair

First evaluate reachability(D, E)
Then evaluate reachability(D, C) separately
Result: true
Query GRIPP – Set-oriented

First query the order tree completely
Then search used nodes and target nodes
– If pre_Used < pre_Target < post_Used, return true

Used nodes:

    node  pre  post
    D     10   19
    B     2    7
    A     1    20

Target nodes:

    node  pre  post
    C     8    9
    E     3    4

Result: true for both target nodes
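A minimal sketch of the set-oriented strategy, under the same assumptions as the index table slide (IND(G) hard-coded, rows without a shown inst value taken as tree instances). It first collects all used nodes by following hop nodes, then compares every target instance against the used intervals; the function names are illustrative.

```python
# IND(G) as (node, pre, post, inst) rows.
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non"),
]


def tree_interval(index, node):
    """Return (pre, post) of the tree instance of a node."""
    return next((pre, post) for n, pre, post, inst in index
                if n == node and inst == "tree")


def used_nodes(index, start):
    """Phase 1: traverse the order tree completely, following all hop nodes."""
    used, stack = [], [start]
    while stack:
        v = stack.pop()
        if v in used:
            continue
        used.append(v)
        v_pre, v_post = tree_interval(index, v)
        for n, pre, _, inst in index:
            if inst == "non" and v_pre < pre < v_post and n not in used:
                stack.append(n)
    return used


def reachable_targets(index, start, targets):
    """Phase 2: keep targets with pre_Used < pre_Target < post_Used."""
    intervals = [tree_interval(index, u) for u in used_nodes(index, start)]
    return {n for n, pre, _, _ in index
            if n in targets and any(lo < pre < hi for lo, hi in intervals)}
```

For start node D the used nodes are D, A and B, and both targets C and E fall inside the interval of A, which matches the tables on the slide.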
Cost Model

Single node pair strategy
– query time linear in the size of the target set
– better for few target nodes
Set-oriented strategy
– almost constant query times
– better for many target nodes

[Figure: average query time for both strategies with increasing size of the target node set, on a graph with 10,000 nodes and 20,000 edges]
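The two observations suggest a simple decision rule for an optimizer. The sketch below is only an illustration: the cost constants `cost_per_pair` and `cost_set_oriented` are hypothetical placeholders, not values measured in the experiment, and would have to be calibrated for a concrete graph.

```python
def choose_strategy(num_targets, cost_per_pair=1.0, cost_set_oriented=25.0):
    """Pick the cheaper physical strategy for a given target set size.

    Assumes the single node pair strategy costs cost_per_pair per target
    (linear) and the set-oriented strategy costs a fixed amount (constant).
    Both constants are hypothetical.
    """
    single_pair_cost = cost_per_pair * num_targets
    if single_pair_cost <= cost_set_oriented:
        return "single-node-pair"
    return "set-oriented"
```

With the placeholder constants, `choose_strategy(3)` selects the single node pair strategy and `choose_strategy(100)` selects the set-oriented one.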
Future Work

Towards an algebra for graph queries
Define new operators
– Logical
– Physical
Determine cost functions
– Estimate the size of result sets
Define rewrite rules
– Which operations can be pushed?
Future Work – New Operators

Path length operator
– Evaluate the length of a path
Possible solution
– Store parts of paths, e.g., up to length x [Giugno & Shasha, 2002]

[Figure: example graph with nodes a and b]
Future Work – Cost Model

Assign cost models to physical operators
Estimate the size of result sets
– Between how many node pairs does a path exist?
– Possibly of a certain length?
Possible solution
– Sampling
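One way sampling could work is sketched below: pick a random subset of start nodes, count how many target nodes each reaches, and scale the count up to the full start set. This is only one possible estimator, not a method proposed in the talk; the adjacency list `ADJ` is an assumption inferred from the example index table, and all names are illustrative.

```python
import random
from collections import deque

# Assumed adjacency list of the example graph G.
ADJ = {
    "R": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
    "D": ["G", "H"], "G": ["B"], "H": ["A"],
}


def reachable_from(adj, start):
    """All nodes reachable from start over at least one edge (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def estimate_reachable_pairs(adj, V, W, sample_size, rng):
    """Estimate |V Phi W| from a random sample of start nodes."""
    starts, targets = sorted(V), set(W)
    if not starts:
        return 0.0
    sample = rng.sample(starts, min(sample_size, len(starts)))
    hits = sum(len(reachable_from(adj, v) & targets) for v in sample)
    return hits * len(starts) / len(sample)
```

When the sample covers all start nodes, the estimate equals the exact result size; smaller samples trade accuracy for fewer traversals. A seeded `random.Random` instance can be passed as `rng` to make runs repeatable.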
Rewrite Query Plan

Query:

    SELECT B
    FROM network
    LET node A, node B, path P
    WHERE A.name = 'Glucose'
      AND A ISA compound
      AND B ISA enzyme
      AND P.path = A[-*]B;

Query graph: A (name = Glucose, ISA compound) --P--> B (ISA enzyme)

Query plan:
    π_B( (σ_{name=Glucose}(Node) ⋈_{Node.TYPE=TYPE} χ_{compound}(TYPE)) Φ (Node ⋈_{Node.TYPE=TYPE} χ_{enzyme}(TYPE)) )
Better Plan?

[Figure: two alternative query plans built from σ_{name=Glucose}(Node), χ_{compound}(TYPE), χ_{enzyme}(TYPE), Φ, and π_B.
 Plan 1 joins the enzyme branch on Node.TYPE=TYPE; annotated cardinalities: 1 and 18,000.
 Plan 2 joins the enzyme branch on B.TYPE=TYPE; annotated cardinalities: …,000 and 2,000.]
Conclusion

Optimize the execution of graph queries
– Use cost-based query optimization
Extend relational algebra
– New operators
  – Path existence operator, Φ
  – Path length operator
– Cost functions
  – Estimate the size of result sets
– Rewrite rules
Thanks for your attention Special thanks to my PhD supervisor Ulf Leser Silke Trißl Humboldt-Universität zu Berlin Work sponsored by IDAR 2007
References

U. Leser. A query language for biological networks. Bioinformatics, 21 Suppl 2:ii33–ii39, Sep 2005.
B. Eckman and P. G. Brown. Graph data management for molecular and cell biology. IBM J. Res & Dev., 50(6):545–560, Nov.
F. Sohler and R. Zimmer. Identifying active transcription factors and kinases from expression data using pathway queries. Bioinformatics, 21 Suppl 2:ii115–ii122, Sep.
J. McHugh and J. Widom. Query Optimization for XML. In Proc. of the VLDB Conference, pages 315–326, Morgan Kaufmann.
Y. Wu, J. M. Patel, and H. V. Jagadish. Structural Join Order Selection for XML Query Optimization. In Proc. of the ICDE Conference, pages 443–454, IEEE Computer Society.
S. Trißl and U. Leser. Fast and Practical Indexing and Querying of Very Large Graphs. In Proc. of the ACM SIGMOD Conference, to appear, ACM Press.