A Comparison of Column, Row and Array DBMSs to Process Recursive Queries Carlos Ordonez ATT Labs.


1 A Comparison of Column, Row and Array DBMSs to Process Recursive Queries Carlos Ordonez ATT Labs

2 Acknowledgments: Michael Stonebraker (visited MIT in 2013 and 2014); Wellington Cabrera, PhD student; Achyuth Gurram, MS student; Divesh Srivastava, for inviting me to spend my sabbatical at ATT.

3 Introduction: Recursion is defined in ANSI SQL. Graph algorithms: paths, reachability, neighborhood analysis. Complexity: cubic, NP-completeness. Before: deductive databases (Datalog). Query optimization is harder than for traditional SPJ queries.

4 Directed Graphs. Definitions: directed graph G=(V,E), possibly cyclic! Vertices in V are denoted i or j, with i,j=1..n. An edge (i,j) has a direction and a weight v. Storage: adjacency list table E, with |E|=N. Problems: transitive closure (vertices j reachable from i); power matrix E^k.
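The definitions on this slide can be made concrete with a small sketch. It assumes a hypothetical toy graph stored as rows (i, j, v) of the edge table E, and computes the vertices j reachable from a given i with a plain breadth-first search. This is an illustration of the reachability problem only, not the SQL evaluation the talk benchmarks.

```python
from collections import defaultdict, deque

# Hypothetical toy graph: rows of the edge table E as (i, j, v).
# It contains the cycle 1 -> 2 -> 3 -> 1.
E = [(1, 2, 1.0), (2, 3, 2.0), (3, 1, 1.0), (3, 4, 5.0)]


def reachable_from(edges, source):
    """Return the set of vertices j reachable from `source` via one or more edges."""
    out = defaultdict(list)          # adjacency list: i -> [j, ...]
    for i, j, _v in edges:
        out[i].append(j)
    seen = set()
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in out[i]:
            if j not in seen:        # the visited set makes cycles harmless
                seen.add(j)
                queue.append(j)
    return seen


print(reachable_from(E, 1))
```

Because the graph is cyclic, vertex 1 is reachable from itself; the visited set is what keeps the search finite.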

5 Examples. V=cities, E=roads: is there some path from San Diego to NYC, that is, a path from i to j? Which is the shortest one? V=employees, E=manager -> employee: q1: all employees under i; q2: is j supervised by i? Bill of materials, the well-known part/subpart manufacturing DB: all subparts Y of part X.

6 Recursive view R. k: recursion depth; n=|V|, N=|E|
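The view definition itself did not survive in this transcript. As a minimal sketch of what a recursive view over E can look like, the following runs a `WITH RECURSIVE` query in SQLite through Python's `sqlite3` module. The schema R(d, i, j, v), the toy edges and the bound K are assumptions made for illustration; they are not taken from the slide.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE E (i INTEGER, j INTEGER, v REAL)")
# Hypothetical toy graph with the cycle 1 -> 2 -> 3 -> 1.
con.executemany("INSERT INTO E VALUES (?, ?, ?)",
                [(1, 2, 1.0), (2, 3, 2.0), (3, 1, 1.0), (3, 4, 5.0)])

K = 3  # maximum recursion depth k (assumed value)

rows = con.execute("""
    WITH RECURSIVE R(d, i, j, v) AS (
        SELECT 1, i, j, v FROM E                 -- base step, paths of length 1
        UNION ALL
        SELECT R.d + 1, R.i, E.j, R.v + E.v      -- recursive step, R joined with E
        FROM R JOIN E ON R.j = E.i
        WHERE R.d < ?                            -- depth bound, needed because G has a cycle
    )
    SELECT d, i, j, v FROM R ORDER BY d, i, j
""", (K,)).fetchall()

for row in rows:
    print(row)
```

Each row is one path: d is its length, (i, j) its endpoints and v the sum of edge weights. Without the depth bound this query would not terminate on the cyclic toy graph.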

7 Transitive Closure: G+=(V,E’)

8 Power matrix
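The figure for this slide is missing from the transcript. As a hedged illustration of the power matrix E^k, the sketch below stores E sparsely as a dictionary {(i, j): v} and multiplies it k-1 times by itself; with binary edges (v=1), entry (i, j) of E^k counts the paths of length k from i to j. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict


def multiply(A, B):
    """Sparse matrix product of two edge tables stored as {(i, j): v}."""
    by_row = defaultdict(list)               # index B by its row (join column)
    for (i, j), v in B.items():
        by_row[i].append((j, v))
    C = defaultdict(float)
    for (i, m), a in A.items():
        for j, b in by_row.get(m, []):       # match A's column m with B's row m
            C[(i, j)] += a * b
    return dict(C)


def power(E, k):
    """Compute E^k with k-1 products, for k >= 1."""
    P = dict(E)
    for _ in range(k - 1):
        P = multiply(P, E)
    return P


# Hypothetical toy graph with binary edges.
E = {(1, 2): 1.0, (2, 3): 1.0, (3, 1): 1.0, (3, 4): 1.0}
print(power(E, 2))
```

The product is the same join-plus-aggregation pattern as the recursive step of R: join on A's column equal to B's row, then sum per (i, j).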

9 Technical details. Linear recursion; intuition: R=R*E. Inner joins; no negation. SELECTs must have the same data types. No GROUP BY, DISTINCT, HAVING, NOT IN or OUTER JOIN clauses inside R. Any SQL query on R is valid. Seminaive; recursion depth k: loop with k-1 joins or (rarely) fixpoint.

10 Stonebraker: One size does not fit all!! Analytics: Row: OLTP, point queries Column: DSS/cube queries Array: math, science Other: Stream: one pass; in-RAM MMDB: OLTP Hadoop/noSQL: yawn

11 DBMS storage elevator story: row | column | array. Row: old, single file, block, B-trees/hash, horizontal partitioning. Column: new, multiple files, var. size blocks, ordered values, compressed, no row-level index! Array: very different storage; attributes={dimensions|columns}; chunk==subarray; multidimensional; grid index in RAM.

12 Algorithms. Semi-naïve: classical, general, reasonably efficient, expressive. Direct: very efficient; TC only; in-place update; matrix-based; requires arrays; not good for SQL; not used today! [TKDE 2010, Teradata DBMS]

13 Semi-naive

14 Seminaive in SQL
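The SQL shown on this slide is not in the transcript. A minimal sketch of a seminaive-style evaluation as a loop of at most k-1 joins is given below, using SQLite through Python's `sqlite3`. At step d only the rows produced at step d-1 are joined with E, and the loop stops early if a step adds nothing (fixpoint). The table layout R(d, i, j, v), the function name and the toy graph are assumptions for illustration.

```python
import sqlite3


def seminaive(con, k):
    """Materialize R up to recursion depth k with at most k-1 joins against E."""
    con.execute("DROP TABLE IF EXISTS R")
    con.execute("CREATE TABLE R (d INTEGER, i INTEGER, j INTEGER, v REAL)")
    # Base step: every edge is a path of length 1.
    con.execute("INSERT INTO R SELECT 1, i, j, v FROM E")
    for d in range(2, k + 1):
        # Recursive step: join only the partial result of depth d-1 with E.
        con.execute(
            "INSERT INTO R "
            "SELECT ?, R.i, E.j, R.v + E.v "
            "FROM R JOIN E ON R.j = E.i WHERE R.d = ?",
            (d, d - 1))
        new_rows = con.execute(
            "SELECT COUNT(*) FROM R WHERE d = ?", (d,)).fetchone()[0]
        if new_rows == 0:            # fixpoint reached before depth k
            break
    return con.execute(
        "SELECT d, i, j, v FROM R ORDER BY d, i, j").fetchall()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE E (i INTEGER, j INTEGER, v REAL)")
# Hypothetical toy graph with the cycle 1 -> 2 -> 3 -> 1.
con.executemany("INSERT INTO E VALUES (?, ?, ?)",
                [(1, 2, 1.0), (2, 3, 2.0), (3, 1, 1.0), (3, 4, 5.0)])

result = seminaive(con, 3)
print(len(result), "paths of length <= 3")
```

On a cyclic graph the fixpoint test never fires, which is why the explicit depth k is the practical stopping rule.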

15 Optimizations: SPJ. Relational algebra + physical operators. Join: hash or sort-merge (nested loop does not make sense with E). Projection: push duplicate elimination and aggregation. Selection: push filter. To be explored later: outer joins; external joins; indexing (row-level only?).

16 Join: hash versus sort-merge Goal: O(N) Main computation: Join optimization: Column: projection={unordered, ordered values} Row: unordered, ordered versus index Array: default={ordered, indexed} choice={sparse,dense}
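To make the two join algorithms concrete, here is a minimal in-memory sketch of both for the join condition R.j = E.i. It is a textbook illustration in plain Python with hypothetical function names, not how any of the benchmarked DBMSs implements its joins. The hash join does one pass over each input plus the output; the sort-merge join pays for sorting unless the inputs are already ordered on the join column.

```python
from collections import defaultdict


def hash_join(R, E):
    """Join R(i, j) with E(i, j) on R.j = E.i; return (R.i, E.j) pairs."""
    build = defaultdict(list)        # build phase: hash E on its join column i
    for ei, ej in E:
        build[ei].append(ej)
    out = []
    for ri, rj in R:                 # probe phase: look up R.j in the hash table
        for ej in build.get(rj, ()):
            out.append((ri, ej))
    return out


def sort_merge_join(R, E):
    """Same join, computed by sorting both inputs on the join column and merging."""
    R = sorted(R, key=lambda r: r[1])    # order R by j
    E = sorted(E, key=lambda e: e[0])    # order E by i
    out, a, b = [], 0, 0
    while a < len(R) and b < len(E):
        if R[a][1] < E[b][0]:
            a += 1
        elif R[a][1] > E[b][0]:
            b += 1
        else:
            key = R[a][1]
            b_end = b                    # find the run of E rows with this key
            while b_end < len(E) and E[b_end][0] == key:
                b_end += 1
            while a < len(R) and R[a][1] == key:
                for e in E[b:b_end]:     # pair each matching R row with the run
                    out.append((R[a][0], e[1]))
                a += 1
            b = b_end
    return out


# Hypothetical toy graph; joining E with itself gives paths of length 2.
E = [(1, 2), (2, 3), (3, 1), (3, 4)]
print(sorted(hash_join(E, E)))
print(sorted(sort_merge_join(E, E)))
```

Both functions return the same pairs up to order; which one is cheaper in practice depends on whether the storage already keeps the join column ordered or indexed, which is the row/column/array distinction the slide draws.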

17 Projection. Duplicate elimination: reachability; binary edges. Aggregation: shortest/longest path; count # paths; length vs weight/cost.
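A sketch of pushing the projection into each recursive step follows, again with SQLite via Python's `sqlite3`. Each iteration keeps one row per (i, j) with the minimum cost (aggregation), and the final queries derive reachability with DISTINCT and shortest paths with MIN. The iterative table R(d, i, j, v), the bound K and the toy edges are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE E (i INTEGER, j INTEGER, v REAL)")
# Hypothetical toy graph: two routes from 1 to 3 (direct, cost 5; via 2, cost 2).
con.executemany("INSERT INTO E VALUES (?, ?, ?)",
                [(1, 2, 1.0), (1, 3, 5.0), (2, 3, 1.0), (3, 4, 1.0)])

K = 3  # maximum recursion depth k (assumed value)

con.execute("CREATE TABLE R (d INTEGER, i INTEGER, j INTEGER, v REAL)")
con.execute("INSERT INTO R SELECT 1, i, j, MIN(v) FROM E GROUP BY i, j")
for d in range(2, K + 1):
    # Aggregation pushed into the step: one row per (i, j), cheapest cost only.
    con.execute("""
        INSERT INTO R
        SELECT ?, R.i, E.j, MIN(R.v + E.v)
        FROM R JOIN E ON R.j = E.i
        WHERE R.d = ?
        GROUP BY R.i, E.j
    """, (d, d - 1))

# Duplicate elimination: which pairs (i, j) are connected by some path?
reach = con.execute("SELECT DISTINCT i, j FROM R ORDER BY i, j").fetchall()

# Aggregation: cheapest path cost per pair over all depths up to K.
shortest = {(i, j): v for i, j, v in
            con.execute("SELECT i, j, MIN(v) FROM R GROUP BY i, j")}

print(reach)
print(shortest[(1, 3)], shortest[(1, 4)])
```

Keeping only the minimum per (i, j) at each step is safe for shortest paths because any cheapest path of d edges extends a cheapest path of d-1 edges.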

18 Selection Reduce |Rd|, correctness

19 Issues with select operator. It is incorrect to use a predicate involving a join expression column in the recursive step. Cycles => infinite recursion. Monotonically increasing v: OK to prune. Recursion depth k: required in practice.
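The sketch below illustrates a safe filter push on a cyclic toy graph, using SQLite via Python's `sqlite3`. The predicate i = 1 refers to the source vertex, which the recursive step carries along unchanged, so applying it in the base step gives the same answer as applying it to the finished result. A predicate on the join column j would not have this property. The depth bound K keeps the recursion finite despite the cycle. Schema, edges and K are assumed for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE E (i INTEGER, j INTEGER, v REAL)")
# Hypothetical toy graph with the cycle 1 -> 2 -> 3 -> 1.
con.executemany("INSERT INTO E VALUES (?, ?, ?)",
                [(1, 2, 1.0), (2, 3, 2.0), (3, 1, 1.0), (3, 4, 5.0)])

K = 3  # recursion depth bound (assumed value)

# Shared recursive step; the depth bound stops the cycle from recursing forever.
RECURSIVE_STEP = """
    UNION ALL
    SELECT R.d + 1, R.i, E.j, R.v + E.v
    FROM R JOIN E ON R.j = E.i
    WHERE R.d < :k
"""

# Filter applied late, on the complete recursive result.
late = con.execute(f"""
    WITH RECURSIVE R(d, i, j, v) AS (
        SELECT 1, i, j, v FROM E
        {RECURSIVE_STEP}
    )
    SELECT d, i, j, v FROM R WHERE i = 1 ORDER BY d, j
""", {"k": K}).fetchall()

# Filter pushed into the base step, so every later join sees fewer rows.
early = con.execute(f"""
    WITH RECURSIVE R(d, i, j, v) AS (
        SELECT 1, i, j, v FROM E WHERE i = 1
        {RECURSIVE_STEP}
    )
    SELECT d, i, j, v FROM R ORDER BY d, j
""", {"k": K}).fetchall()

print(early == late, len(early))
```

The pushed version joins 1, 1 and 2 rows against E across the three depths instead of 4 rows at every depth, which is the reduction in the size of each partial result that the selection slide aims for.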

20 Benchmark with graphs. Real: skewed; complex structure; sample==different; but fixed size. Synthetic: vary n, N; vary shape; NEW: cliques!

21 Simulating realistic G social nets, Internet

22 Join (stop=30 mins; array 2 chunk sizes)

23 Projection Duplicate elimination

24 Projection Aggregation

25 Selection simple filter i=1

26 Ultimate benchmark (fair?) tuned storage, best join, aggr

27 Conclusions. Query optimizations: confirm decades of research (required), but impact definitely varies; G knowledge helps (catch-22). Benchmark with tuned query processing: column DBMS faster, cliques/skewed degree OK; array DBMS competitive for dense/clique G; row DBMS reasonable.

28 Conclusions. Graph features impacting time and I/O: density (avg vertex degree; deg(i) skew); cliques: K; cycles: deep k. Recommendations: tune DBMS storage; tune join (skewed hash join); beware of large cliques and short cycles; increment recursion depth k gradually.

29 Future work: develop operators for array DBMS; time complexity based on G; beyond semi-naïve: direct, logarithmic; more specific graph problems like CC, neighborhood analysis, vertex similarity; query processing cost model.

30 References on RQs SIGMOD papers: IBM TKDE 2010 paper; Teradata papers DOLAP 2014: Revisiting RQs in Column DBMS

