Published by Polly Carr. Modified over 9 years ago.
1
A Comparison of Column, Row and Array DBMSs to Process Recursive Queries
Carlos Ordonez
ATT Labs
2
Acknowledgments
  Michael Stonebraker; visited MIT in 2013 and 2014
  Wellington Cabrera, PhD student; Achyuth Gurram, MS student
  Divesh Srivastava, for inviting me to spend my sabbatical at ATT
3
Introduction
  Recursion defined in ANSI SQL
  Graph algorithms: paths, reachability, neighborhood analysis
  Complexity: cubic, NP-completeness
  Before: deductive databases (Datalog)
  Harder query optimization than traditional SPJ queries
4
Directed Graphs
  Definitions:
    Directed graph G=(V,E), maybe cyclic!
    A vertex in V: i or j, with i,j=1..n
    An edge (i,j) has a direction and a weight v
    Storage: adjacency list table E, with |E|=N
  Problems:
    Transitive closure: vertices j reachable from i
    Power matrix E^k
5
Examples
  V=cities, E=roads. Is there some path from San Diego to NYC? That is: is there a path from i to j? Which is the shortest one?
  V=employees, E=manager -> employee
    q1: all employees under i
    q2: is j supervised by i?
  Bill of materials: the well-known part/subpart manufacturing DB; all subparts Y of part X
6
Recursive view R
  k: recursion depth; n=|V|, N=|E|
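The SQL shown on this slide did not survive in the transcript. As a minimal sketch of what a recursive view with bounded depth can look like, the following uses Python's `sqlite3` as a stand-in for the DBMSs compared in the talk. The table E(i, j, v), the view name R and the depth k come from the slides; the depth column `d`, the function name `recursive_view` and the choice to add weights along a path are assumptions made for illustration.

```python
import sqlite3

def recursive_view(edges, k):
    """Return rows (d, i, j, v): a path of d <= k edges from i to j with total weight v."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE E (i INTEGER, j INTEGER, v REAL)")
    con.executemany("INSERT INTO E VALUES (?, ?, ?)", edges)
    rows = con.execute(
        """
        WITH RECURSIVE R(d, i, j, v) AS (
            SELECT 1, i, j, v FROM E              -- base step: the edges themselves
            UNION ALL
            SELECT R.d + 1, R.i, E.j, R.v + E.v   -- recursive step: join R with E
            FROM R JOIN E ON R.j = E.i
            WHERE R.d < ?                         -- stop at recursion depth k
        )
        SELECT d, i, j, v FROM R ORDER BY d, i, j
        """,
        (k,),
    ).fetchall()
    con.close()
    return rows

if __name__ == "__main__":
    # A 3-cycle: without the depth bound this query would never terminate.
    for row in recursive_view([(1, 2, 1.0), (2, 3, 1.0), (3, 1, 1.0)], 3):
        print(row)
```

The depth bound is what keeps the query finite on a cyclic graph, since `UNION ALL` keeps every generated row.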
7
Transitive Closure: G+=(V,E’)
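A small sketch of computing the transitive closure G+ of the edge table E, again using `sqlite3` as a stand-in. It relies on SQLite's behavior that `UNION` inside a recursive query discards rows already produced, so it terminates on cyclic graphs without a depth bound; other DBMSs may restrict duplicate elimination inside the recursive view, so treat this as an assumption about SQLite only. The names `TC` and `transitive_closure` are hypothetical.

```python
import sqlite3

def transitive_closure(edges):
    """Return all pairs (i, j) such that j is reachable from i by one or more edges."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE E (i INTEGER, j INTEGER)")
    con.executemany("INSERT INTO E VALUES (?, ?)", edges)
    rows = con.execute(
        """
        WITH RECURSIVE TC(i, j) AS (
            SELECT i, j FROM E
            UNION                  -- UNION (not UNION ALL): repeated pairs are dropped
            SELECT TC.i, E.j FROM TC JOIN E ON TC.j = E.i
        )
        SELECT i, j FROM TC ORDER BY i, j
        """
    ).fetchall()
    con.close()
    return rows

if __name__ == "__main__":
    # Cycle 1 -> 2 -> 3 -> 1 plus an extra edge 3 -> 4.
    print(transitive_closure([(1, 2), (2, 3), (3, 1), (3, 4)]))
```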
8
Power matrix
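As an illustration of the power matrix E^k, here is a minimal pure-Python sketch that treats E as a dense n x n adjacency matrix and multiplies it by itself k-1 times. With a 0/1 adjacency matrix, entry (i, j) of E^k counts the walks of exactly k edges from i to j. The function names and the dense list-of-lists representation are assumptions for illustration; a DBMS would evaluate the same product as a join plus an aggregation.

```python
def mat_mult(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
        for i in range(n)
    ]

def power_matrix(e, k):
    """Compute E^k for k >= 1 using k-1 multiplications."""
    result = e
    for _ in range(k - 1):
        result = mat_mult(result, e)
    return result

if __name__ == "__main__":
    # Edges 0 -> 1, 0 -> 2, 1 -> 2 (vertices numbered from 0 here).
    e = [[0, 1, 1],
         [0, 0, 1],
         [0, 0, 0]]
    print(power_matrix(e, 2))  # the single 2-edge walk is 0 -> 1 -> 2
```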
9
Technical details
  Linear recursion; intuition: R=R*E
  Inner joins; no negation
  SELECTs must have same data types
  No GROUP BY, DISTINCT, HAVING, NOT IN, OUTER JOIN clauses inside R
  Any SQL query on R is valid
  Seminaive; recursion depth k: loop with k-1 joins or (rarely) fixpoint
10
Stonebraker: One size does not fit all!!
  Analytics:
    Row: OLTP, point queries
    Column: DSS/cube queries
    Array: math, science
  Other:
    Stream: one pass; in-RAM
    MMDB: OLTP
    Hadoop/noSQL: yawn
11
DBMS storage elevator story: row | column | array
  Row: old, single file, block, B-trees/hash, horizontal partitioning
  Column: new, multiple files, var. size blocks, ordered values, compressed, no row-level index!
  Array: very different storage; attributes={dimensions|columns}; chunk==subarray; multidimensional; grid index in RAM
12
Algorithms
  Semi-naïve: classical, general, reasonably efficient, expressive
  Direct: very efficient; TC only; in-place update; matrix-based; requires arrays; not good for SQL; not used today! [TKDE 2010, Teradata DBMS]
13
Semi-naive
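The algorithm on this slide is not preserved in the transcript. The following is a generic sketch of semi-naive evaluation for transitive closure, not necessarily the exact formulation presented: at each iteration only the pairs discovered in the previous iteration (the delta) are joined with E, and the loop stops when no new pair appears. The function name and the in-memory set representation are assumptions.

```python
def seminaive_closure(edges):
    """Transitive closure by semi-naive evaluation: join only the newest pairs with E."""
    edges = set(edges)
    # Index E by source vertex so the join on delta.j = E.i is a dictionary lookup.
    succ = {}
    for i, j in edges:
        succ.setdefault(i, set()).add(j)

    total = set(edges)   # all pairs found so far
    delta = set(edges)   # pairs found in the last iteration
    while delta:
        joined = {(i, t) for (i, j) in delta for t in succ.get(j, ())}
        delta = joined - total   # keep only pairs not seen before
        total |= delta
    return total

if __name__ == "__main__":
    print(sorted(seminaive_closure([(1, 2), (2, 3), (3, 1), (3, 4)])))
```

Subtracting `total` from each new batch is what makes the loop reach a fixpoint on cyclic graphs.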
14
Seminaive in SQL
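The SQL on this slide is also missing from the transcript. A hedged sketch of the "loop with k-1 joins" idea follows: the recursion is unrolled into ordinary INSERT ... SELECT statements, each joining only the rows of the previous depth with E. The single result table R with a depth column `d`, the early exit when a step produces no rows, and the use of DISTINCT in each step are assumptions for illustration, with `sqlite3` standing in for the DBMSs in the talk.

```python
import sqlite3

def seminaive_sql(edges, k):
    """Return distinct pairs (i, j) connected by a path of at most k edges."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE E (i INTEGER, j INTEGER)")
    con.execute("CREATE TABLE R (d INTEGER, i INTEGER, j INTEGER)")
    con.executemany("INSERT INTO E VALUES (?, ?)", edges)
    con.execute("INSERT INTO R SELECT DISTINCT 1, i, j FROM E")   # base step
    for d in range(2, k + 1):                                     # k-1 joins
        con.execute(
            "INSERT INTO R "
            "SELECT DISTINCT ?, R.i, E.j "
            "FROM R JOIN E ON R.j = E.i "
            "WHERE R.d = ?",               # join only the previous depth with E
            (d, d - 1),
        )
        produced = con.execute(
            "SELECT COUNT(*) FROM R WHERE d = ?", (d,)
        ).fetchone()[0]
        if produced == 0:                  # nothing new: deeper steps are empty too
            break
    rows = con.execute("SELECT DISTINCT i, j FROM R ORDER BY i, j").fetchall()
    con.close()
    return rows

if __name__ == "__main__":
    print(seminaive_sql([(1, 2), (2, 3), (3, 4)], 3))
```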
15
Optimizations: SPJ
  Relational algebra + physical operators
  Join: hash or sort-merge (nested loop does not make sense with E)
  Projection: push dup elimination & aggreg.
  Selection: push filter
  To be explored later:
    Outer joins
    External joins
    Indexing: row-level only?
16
Join: hash versus sort-merge
  Goal: O(N)
  Main computation:
  Join optimization:
    Column: projection={unordered, ordered values}
    Row: unordered, ordered versus index
    Array: default={ordered, indexed}; choice={sparse, dense}
17
Projection
  Duplicate elimination
    reachability
    binary edges
  Aggregation
    shortest/longest path
    count # paths
    length vs weight/cost
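To make the aggregation case concrete, here is a sketch of shortest-path cost with the MIN aggregation pushed into every step of the unrolled recursion, so each depth keeps one row per (i, j) pair instead of one row per path. This works because the minimum over all paths of d edges can be built from the minimum over paths of d-1 edges. The table layout R(d, i, j, v), the function name and the depth bound k as a parameter are assumptions for illustration.

```python
import sqlite3

def shortest_paths(edges, k):
    """Return {(i, j): minimum total weight over paths of at most k edges}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE E (i INTEGER, j INTEGER, v REAL)")
    con.execute("CREATE TABLE R (d INTEGER, i INTEGER, j INTEGER, v REAL)")
    con.executemany("INSERT INTO E VALUES (?, ?, ?)", edges)
    con.execute("INSERT INTO R SELECT 1, i, j, MIN(v) FROM E GROUP BY i, j")
    for d in range(2, k + 1):
        con.execute(
            "INSERT INTO R "
            "SELECT ?, R.i, E.j, MIN(R.v + E.v) "   # aggregation pushed into the step
            "FROM R JOIN E ON R.j = E.i "
            "WHERE R.d = ? "
            "GROUP BY R.i, E.j",
            (d, d - 1),
        )
    rows = con.execute(
        "SELECT i, j, MIN(v) FROM R GROUP BY i, j ORDER BY i, j"
    ).fetchall()
    con.close()
    return {(i, j): v for i, j, v in rows}

if __name__ == "__main__":
    edges = [(1, 2, 5.0), (2, 3, 1.0), (1, 3, 10.0), (3, 4, 1.0)]
    print(shortest_paths(edges, 3))
```

Replacing MIN with MAX or with a sum of path counts gives the other aggregations listed above, with the caveat that longest path and path counts are only meaningful under a depth bound when the graph has cycles.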
18
Selection
  Reduce |Rd|, correctness
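A sketch of what pushing a selection can look like, assuming a filter on the source vertex i. The recursive step joins on R.j = E.i and carries R.i through unchanged, so filtering i in the base step gives the same answer as filtering the final result while generating fewer rows in R. The function name, the `push_filter` flag and the depth column `d` are assumptions for illustration.

```python
import sqlite3

def reachable_from(edges, src, k, push_filter):
    """Return (vertices reachable from src within k edges, total rows generated in R)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE E (i INTEGER, j INTEGER)")
    con.executemany("INSERT INTO E VALUES (?, ?)", edges)
    base_filter = "WHERE i = :src" if push_filter else ""
    cte = f"""
        WITH RECURSIVE R(d, i, j) AS (
            SELECT 1, i, j FROM E {base_filter}     -- filter pushed into the base step
            UNION ALL
            SELECT R.d + 1, R.i, E.j FROM R JOIN E ON R.j = E.i
            WHERE R.d < :k
        )
    """
    answer = con.execute(
        cte + "SELECT DISTINCT j FROM R WHERE i = :src ORDER BY j",
        {"src": src, "k": k},
    ).fetchall()
    # Bind :src only when the query text actually contains it.
    count_params = {"src": src, "k": k} if push_filter else {"k": k}
    size = con.execute(cte + "SELECT COUNT(*) FROM R", count_params).fetchone()[0]
    con.close()
    return [j for (j,) in answer], size

if __name__ == "__main__":
    edges = [(1, 2), (2, 3), (4, 5), (5, 6)]
    print(reachable_from(edges, 1, 3, push_filter=False))
    print(reachable_from(edges, 1, 3, push_filter=True))
```

A filter on j would not be safe to push this way, because j is the join column and intermediate vertices would be discarded before the path reaches its destination.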
19
Issues with select operator
  Incorrect to use a predicate involving a join expression column in the recursive step
  Cycles => infinite recursion
  Monotonically increasing v: OK to prune
  Recursion depth k: required in practice
20
Benchmark with graphs
  Real
    Skewed
    Complex structure; sample==different
    But fixed size
  Synthetic
    Vary n, N
    Vary shape
    NEW: cliques!
21
Simulating realistic G: social nets, Internet
22
Join (stop=30 mins; array 2 chunk sizes)
23
Projection: duplicate elimination
24
Projection: aggregation
25
Selection: simple filter i=1
26
Ultimate benchmark (fair?): tuned storage, best join, aggr
27
Conclusions
  Query optimizations
    Confirm decades of research: required
    But impact definitely varies
    G knowledge helps (catch 22)
  Benchmark with tuned query processing
    Column DBMS faster; cliques/skewed degree OK
    Array DBMS competitive for dense/clique G
    Row DBMS reasonable
28
Conclusions
  Graph features impacting time and I/O
    Density: avg vertex degree; deg(i) skew
    Cliques: K
    Cycles: deep k
  Recommendations
    Tune DBMS storage
    Tune join (skewed hash join)
    Beware of large cliques and short cycles
    Increment recursion depth k gradually
29
Future work
  Develop operators for Array DBMS
  Time complexity based on G
  Beyond semi-naïve: Direct, logarithmic
  More specific graph problems like CC, neighborhood analysis, vertex similarity
  Query processing cost model
30
References on RQs
  SIGMOD papers: IBM
  TKDE 2010 paper; Teradata papers
  DOLAP 2014: Revisiting RQs in Column DBMS