Managing Data at Scale
Ke Yi and Dan Suciu
Dagstuhl 2016
Thomas’ Questions
- What are the key benefits from theory?
- What are the key/tough problems?
- What are the key math/FDM techniques?
- How does this extend DB theory?
- What should we teach our students?
- What is a good foundational paper?
1. Key Benefits from Theory
DB Theory Mission: guide the development of new data management techniques, driven by changes in:
- What data processing tasks are needed
- How data is processed
- Where data is processed
What
- ML, data analytics (R / GraphLab / GraphX): complex, ad-hoc queries, plus linear algebra, plus iterations (not well modeled by datalog)
- Distributed state (replicated objects)
- Graphics, data visualization (Halide)
- Information extraction / text processing
How
- Cloud computing (Amazon, MS, Google): major OS-level innovation is happening right now
- Distributed data processing (MapReduce, Spark, Pregel, GraphLab)
- Distributed transactions (Spanner): weak consistency (eventual consistency) vs. strong; from Bigtable to Spanner and Postgres-XL; Paxos vs. 2PC
Where
- Shared-nothing distributed systems: the motivation for MapReduce was data larger than main memory, and networks are faster than disks; MR is still common practice (Flume); theory model on the next slides
- Shared memory: motivated by the observation that data is not that big after all; theory models: PRAM (CRCW, EREW, ...)
- Chip trends: SIMD (QuickStep, EmptyHeaded), NVM (non-volatile memory), dark silicon (e.g., specialization)
2-3. Key Problems and Techniques
Some results/techniques from recent years (a subjective selection):
- Worst-case optimal algorithms
- Communication-optimal algorithms
- I/O-optimal algorithms + sampling (Ke)
AGM Bound [Atserias, Grohe, Marx 2011]
Worst-case output size: fix a number N and a full conjunctive query Q. For any database D with |R1^D| ≤ N, |R2^D| ≤ N, ..., how large can |Q(D)| be?
Examples:
- Q(x,y,z) = R1(x,y), R2(y,z): |Q| ≤ N^2
- Q(x,y,z,u) = R1(x,y), R2(y,z), R3(z,u): |Q| ≤ N^2
- Q(x,y,z,u,v) = R1(x,y), R2(y,z), R3(z,u), R4(u,v): |Q| ≤ N^3
In general, for any edge cover of Q of size ρ: |Q| ≤ N^ρ.
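To see that the first bound is tight, here is a minimal Python sketch (an illustration, not from the talk) of a database where the 2-path query reaches N^2:

```python
# A sketch (not from the talk): a database hitting the N^2 bound for the
# 2-path query Q(x,y,z) = R1(x,y), R2(y,z), using a single shared y-value.
N = 10
R1 = {(x, 0) for x in range(N)}   # N tuples, all with y = 0
R2 = {(0, z) for z in range(N)}   # N tuples, all with y = 0

# Join on y: every R1-tuple pairs with every R2-tuple.
Q = {(x, y, z) for (x, y) in R1 for (y2, z) in R2 if y == y2}
assert len(Q) == N ** 2           # the AGM bound N^rho with rho = 2 is tight
print(len(Q))
```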
AGM Bound [Atserias, Grohe, Marx 2011]
Theorem. If |Ri^D| ≤ N for i = 1, 2, ..., then max_D |Q(D)| = N^{ρ*}, where ρ* is the optimal fractional edge cover of Q’s hypergraph.

Example: Q(x,y,z) = R(x,y), S(y,z), T(z,x)

Primal LP (fractional edge cover):
  min  w_R + w_S + w_T
  x:   w_R + w_T ≥ 1
  y:   w_R + w_S ≥ 1
  z:   w_S + w_T ≥ 1

Dual LP (fractional vertex packing):
  max  v_x + v_y + v_z
  R:   v_x + v_y ≤ 1
  S:   v_y + v_z ≤ 1
  T:   v_x + v_z ≤ 1

Optimum: w_R = w_S = w_T = 1/2, hence ρ* = 3/2 and |Q| ≤ N^{3/2}.

Thm (upper bound). For any feasible w_R, w_S, w_T: |Q| ≤ N^{w_R + w_S + w_T}. Proof: Shearer’s lemma (next slide).
Thm (lower bound). For any feasible v_x, v_y, v_z: |Q| can reach N^{v_x + v_y + v_z}, witnessed by the “free” instance R = [N^{v_x}] × [N^{v_y}], S = [N^{v_y}] × [N^{v_z}], T = [N^{v_z}] × [N^{v_x}].
By LP duality the two bounds coincide at ρ*.
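The cover LP is small enough to solve mechanically. A sketch using scipy’s LP solver (my illustration; the query encoding is an assumption, not code from the talk):

```python
# Compute the fractional edge cover number rho* of a query's hypergraph.
from scipy.optimize import linprog

# Triangle query Q(x,y,z) = R(x,y), S(y,z), T(z,x):
# vertices = variables, edges = atoms.
edges = {"R": {"x", "y"}, "S": {"y", "z"}, "T": {"z", "x"}}
vertices = ["x", "y", "z"]

c = [1.0] * len(edges)                       # minimize sum of edge weights
# Each vertex must be covered: sum of incident edge weights >= 1,
# written as -A_ub @ w <= -1 for linprog's <= convention.
A_ub = [[-1.0 if v in e else 0.0 for e in edges.values()]
        for v in vertices]
b_ub = [-1.0] * len(vertices)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(edges))
print(res.x, res.fun)   # [0.5 0.5 0.5], rho* = 1.5  =>  |Q| <= N^1.5
```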
AGM Bound (Upper Bound) [Atserias, Grohe, Marx 2011]
Consider any instance D and the uniform probability space whose outcomes are the tuples of Q(D), for Q(x,y,z) = R(x,y), S(y,z), T(z,x). [Figure: an example output Q with 5 tuples, each of probability 1/5, together with the induced marginal probabilities on R(x,y), S(y,z), T(z,x).]

Three things to know about entropy:
- H(X) ≤ log |Ω|, with “=” for the uniform distribution
- H(X) ≤ H(X ∪ Y) (monotonicity)
- H(X ∪ Y) + H(X ∩ Y) ≤ H(X) + H(Y) (submodularity)

Shearer’s lemma: if X1, X2, ... k-cover X, then H(X1) + H(X2) + ... ≥ k·H(X).

3·log N ≥ log|R| + log|S| + log|T|
        ≥ H(xy) + H(yz) + H(xz)
        ≥ H(xyz) + H(y) + H(xz)      (submodularity on xy, yz)
        ≥ H(xyz) + H(xyz) + H(∅)     (submodularity on y, xz)
        = 2·H(xyz) = 2·log|Q|
Hence |Q| ≤ N^{3/2}.
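As a sanity check, the entropy argument can be verified numerically. A hedged Python sketch, using a made-up 5-tuple output relation (the exact tables from the slide’s figure are not recoverable):

```python
from collections import Counter
from math import log2

# Uniform distribution over the tuples of an example output Q(D).
Q = [("a", 1, "r"), ("a", 2, "q"), ("b", 1, "r"), ("b", 2, "q"), ("a", 1, "q")]

def H(attrs):
    """Entropy of the marginal distribution on the given attribute positions."""
    n = len(Q)
    counts = Counter(tuple(t[i] for i in attrs) for t in Q)
    return -sum(c / n * log2(c / n) for c in counts.values())

# Shearer's lemma with the 2-cover {xy, yz, xz} of {x, y, z}:
lhs = H((0, 1)) + H((1, 2)) + H((0, 2))
assert lhs >= 2 * H((0, 1, 2)) - 1e-9
print(lhs, 2 * H((0, 1, 2)))   # e.g. 5.37 >= 4.64 = 2*log2(5)
```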
Worst-Case Optimal Algorithm [Ngo, Ré, Rudra]

GenericJoin(Q, D):
  if Q is a ground atom:
    return “true” iff Q is in D
  choose any variable x
  compute A = Π_x(R1) ∩ Π_x(R2) ∩ ...
  for each a ∈ A:
    recurse on GenericJoin(Q[a/x], D)
  return the union of the results

Theorem. GenericJoin runs in time O(N^{ρ*}).
Note: every traditional query plan (one join at a time) is suboptimal!
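A compact Python sketch of GenericJoin for the triangle query (my illustration of the pseudocode above; a real worst-case optimal implementation needs sorted indexes or hash tries to meet the O(N^{ρ*}) bound):

```python
def generic_join(atoms, bound=()):
    """atoms: list of (variable-tuple, relation-as-set-of-tuples)."""
    env = dict(bound)
    free = [v for vs, _ in atoms for v in vs if v not in env]
    if not free:
        # Ground query: true iff every fully-bound atom is present in D.
        ok = all(tuple(env[v] for v in vs) in rel for vs, rel in atoms)
        return [env] if ok else []
    x = free[0]                        # choose any variable x
    # A = intersection of the x-projections of all atoms mentioning x,
    # restricted to tuples consistent with the already-bound variables.
    A = None
    for vs, rel in atoms:
        if x not in vs:
            continue
        proj = {t[vs.index(x)] for t in rel
                if all(t[i] == env[v] for i, v in enumerate(vs) if v in env)}
        A = proj if A is None else A & proj
    out = []
    for a in A:                        # recurse with x bound to a
        out.extend(generic_join(atoms, bound + ((x, a),)))
    return out

R = {(1, 2), (1, 3), (2, 3)}                  # edges of a small graph
T = {(z, x) for (x, z) in R}                  # reversed edges, for T(z,x)
triangle = [(("x", "y"), R), (("y", "z"), R), (("z", "x"), T)]
print(generic_join(triangle))                 # [{'x': 1, 'y': 2, 'z': 3}]
```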
Massively Parallel Communication (MPC) [Beame, Koutris, Suciu]
- Input data of size m (playing the same role as N before), initially spread over p servers, O(m/p) per server
- One round = local computation, then communication; an algorithm consists of several rounds
- L = communication load per round per server (maximum load for algorithms; average load for lower bounds)

Cost of simple strategies:
            Ideal    Practical (ε ∈ (0,1))   Naïve 1   Naïve 2
  Load L    m/p      m/p^{1-ε}                m         m/p
  Rounds r  1        O(1)                     1         p
Speedup
[Figure: speed vs. number of processors p.]
L = m/p: linear speedup. L = m/p^{1-ε}: sub-linear speedup.
Example: Cartesian Product
Q(x,y) = R(x) × S(y): place the servers on a √p × √p grid; partition R across rows and S across columns, so each server receives m/√p tuples of R and m/√p tuples of S. Load L = 2m/p^{1/2} = O(m/p^{1/2}).
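A simulation sketch of this grid algorithm (my illustration; the server grid, hashing scheme, and sizes are assumptions):

```python
# One-round grid algorithm for Q(x,y) = R(x) x S(y): servers form a
# sqrt(p) x sqrt(p) grid; each R-tuple is replicated to one row, each
# S-tuple to one column, so every (x,y) pair meets at exactly one server.
import math, random
from collections import defaultdict

p = 16
side = math.isqrt(p)                       # sqrt(p) = 4
m = 1000
R = [random.randrange(10**6) for _ in range(m)]
S = [random.randrange(10**6) for _ in range(m)]

recv = defaultdict(list)                   # server (i, j) -> received tuples
for x in R:
    i = hash(("R", x)) % side              # row chosen by hashing x
    for j in range(side):                  # replicate along the row
        recv[(i, j)].append(("R", x))
for y in S:
    j = hash(("S", y)) % side              # column chosen by hashing y
    for i in range(side):                  # replicate along the column
        recv[(i, j)].append(("S", y))

loads = [len(ts) for ts in recv.values()]
print(max(loads), 2 * m // side)           # max load close to 2m/sqrt(p)
```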
Full Conjunctive Queries [Beame, Koutris, Suciu; Afrati & Ullman]

One round, no skew (|S1| = |S2| = ...):
- Product: L = m/p^{1/2}
- Join: L = m/p
- Triangle: L = m/p^{2/3}
- Multiway: L = m/p^{1/τ*} (formula based on fractional edge packing)
- Lower bound: same as above

One round, with skew (relation sizes m1, m2, ...): increased load, decreased speedup
- Join: L = m/p^{1/2}
- In general: L = m/p^{1/ψ*}
- Lower bound: …

Multiple rounds (#rounds = r): speedup may improve (e.g., from 1/p^{2/3} to 1/p); mitigates skew for some queries
- Triangle: L = m/p^{1/2} with r = 1; L = m/p^{2/3} with r = 2
- In general: L = ? (open)
- Lower bound: L ≥ m/p^{1/ρ*} · 1/r (is the 1/r factor optimal?)

τ* = fractional edge packing / vertex cover; ρ* = fractional edge cover / vertex packing
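The packing LP behind the skew-free bound is the mirror image of the cover LP solved earlier; a sketch (my illustration):

```python
# Fractional edge packing number tau*: maximize total edge weight subject
# to each vertex carrying weight at most 1. By the table above, the
# skew-free one-round load is m/p^{1/tau*}.
from scipy.optimize import linprog

edges = {"R": {"x", "y"}, "S": {"y", "z"}, "T": {"z", "x"}}
vertices = ["x", "y", "z"]

c = [-1.0] * len(edges)                     # maximize = minimize the negation
A_ub = [[1.0 if v in e else 0.0 for e in edges.values()] for v in vertices]
b_ub = [1.0] * len(vertices)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(edges))
print(-res.fun)   # tau* = 1.5 for the triangle  =>  load m/p^{2/3}
```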
Challenges, Open Problems
- Single server: deepen the connection to information theory (FDs, statistics)
- Shared memory: GenericJoin on PRAM?
- Shared nothing: O(1) rounds, beyond CQ
- Explain why τ* governs skew-free data while ρ* governs skewed data
I/O-optimal Algorithms (Ke)
Key Techniques
- Convex optimization meets finite model theory meets information theory!
- Algorithms: novel “all-joins-at-once”
- Concentration bounds (Chernoff) for hash function guarantees
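To illustrate the last point, a hedged simulation of hash-partition load against a Chernoff tail bound (the parameters are arbitrary choices, not from the talk):

```python
# With m keys hashed to p buckets, Chernoff bounds guarantee the max bucket
# load stays close to the mean m/p with high probability.
import math, random
from collections import Counter

m, p = 100_000, 100
loads = Counter(random.randrange(p) for _ in range(m))  # ideal random hash
mu = m / p
# Chernoff: Pr[load >= (1+d)*mu] <= exp(-d^2 * mu / 3) for 0 < d <= 1;
# choose d so that a union bound over the p buckets leaves failure prob 1/1000.
d = math.sqrt(3 * math.log(p * 1000) / mu)
print(max(loads.values()), (1 + d) * mu)                # observed max vs. bound
```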
Some Tough Problems…
…that no one is addressing in PODS/ICDT! (Why?)
- Transactions!!! Consistent, globally distributed data:
  - Eventual consistency: efficient but wrong (NoSQL)
  - Strong consistency: correct but slow (Spanner, Postgres-XL)
  - Optimistic models: Parallel SI (PSI)
- Design the “right” DSL for ML + data transformations: SystemML (IBM)? TensorFlow (Google)? Expressive power / complexity / hierarchy?
- Design the “right” theoretical model for architectures to come: SMP, NVM, dark silicon
5. What Should We Teach Our Students?
What Alice is Missing:
- Convex optimization
- Information theory
- Models of computation for query processing (beyond relational machines)
- Chernoff/Hoeffding bounds and beyond
6. What Is a Good Foundational Paper?
Recipe for the Best PODS Paper: what do I know? If I knew, I would write it myself…
I only have thoughts about “types” of best papers:
- Best first paper
- Best technical paper
- Best last paper