Managing Data at Scale
Ke Yi and Dan Suciu
Dagstuhl 2016
Thomas’ Questions
- What are the key benefits from theory?
- What are the key/tough problems?
- What are the key math/FDM techniques?
- How does this extend DB theory?
- What should we teach our students?
- What is a good foundational paper?
1. Key Benefits from Theory
DB Theory Mission: guide the development of new data management techniques, driven by changes in:
- What data processing tasks are needed
- How data is processed
- Where data is processed
What
- ML, data analytics (R / GraphLab / GraphX): complex, ad-hoc queries, plus linear algebra, plus iterations (not well modeled by datalog)
- Distributed state (replicated objects)
- Graphics, data visualization (Halide)
- Information extraction / text processing
How
- Cloud computing (Amazon, MS, Google): major OS-level innovation is happening right now
- Distributed data processing (MapReduce, Spark, Pregel, GraphLab)
- Distributed transactions (Spanner): weak consistency (eventual consistency) vs. strong; from Bigtable to Spanner and Postgres-XL; Paxos vs. 2PC
Where
- Shared-nothing distributed systems: the motivation for MapReduce was data larger than main memory, and networks are faster than disks; MR is still common practice (Flume); theory model on the next slides
- Shared memory: motivated by the observation that data is not that big after all; theory models: PRAM (CRCW, EREW, ...)
- Chip trends: SIMD (QuickStep, EmptyHeaded), NVM (non-volatile memory), dark silicon (e.g., specialization)
2-3. Key Problems and Techniques
Some results/techniques from recent years (a subjective selection):
- Worst-case optimal algorithms
- Communication-optimal algorithms
- I/O-optimal algorithms + sampling (Ke)
AGM Bound [Atserias, Grohe, Marx 2011]
Worst-case output size: fix a number N and a full conjunctive query Q. For any database D with |R1^D| ≤ N, |R2^D| ≤ N, ..., how large can |Q(D)| be?
Examples:
- Q(x,y,z) = R1(x,y), R2(y,z): |Q| ≤ N^2
- Q(x,y,z,u) = R1(x,y), R2(y,z), R3(z,u): |Q| ≤ N^2
- Q(x,y,z,u,v) = R1(x,y), R2(y,z), R3(z,u), R4(u,v): |Q| ≤ N^3
In general, for any edge cover of Q of size ρ: |Q| ≤ N^ρ.
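To see that the first bound is tight, here is a minimal Python sketch (an illustration, not from the talk) of a database where the 2-path query reaches N^2:

```python
# A sketch (not from the talk): a database hitting the N^2 bound for the
# 2-path query Q(x,y,z) = R1(x,y), R2(y,z), using a single shared y-value.
N = 10
R1 = {(x, 0) for x in range(N)}   # N tuples, all with y = 0
R2 = {(0, z) for z in range(N)}   # N tuples, all with y = 0

# Join on y: every R1-tuple pairs with every R2-tuple.
Q = {(x, y, z) for (x, y) in R1 for (y2, z) in R2 if y == y2}
assert len(Q) == N ** 2           # the AGM bound N^rho with rho = 2 is tight
print(len(Q))
```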
AGM Bound [Atserias, Grohe, Marx 2011]
Theorem. If |Ri^D| ≤ N for i = 1, 2, ..., then max_D |Q(D)| = N^{ρ*}, where ρ* is the optimal fractional edge cover of Q’s hypergraph.

Example: Q(x,y,z) = R(x,y), S(y,z), T(z,x)

Primal LP (fractional edge cover):
  min  w_R + w_S + w_T
  x:   w_R + w_T ≥ 1
  y:   w_R + w_S ≥ 1
  z:   w_S + w_T ≥ 1

Dual LP (fractional vertex packing):
  max  v_x + v_y + v_z
  R:   v_x + v_y ≤ 1
  S:   v_y + v_z ≤ 1
  T:   v_x + v_z ≤ 1

Optimum: w_R = w_S = w_T = 1/2, hence ρ* = 3/2 and |Q| ≤ N^{3/2}.

Thm (upper bound). For any feasible w_R, w_S, w_T: |Q| ≤ N^{w_R + w_S + w_T}. Proof: Shearer’s lemma (next slide).
Thm (lower bound). For any feasible v_x, v_y, v_z: |Q| can reach N^{v_x + v_y + v_z}, witnessed by the “free” instance R = [N^{v_x}] × [N^{v_y}], S = [N^{v_y}] × [N^{v_z}], T = [N^{v_z}] × [N^{v_x}].
By LP duality the two bounds coincide at ρ*.
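The cover LP is small enough to solve mechanically. A sketch using scipy’s LP solver (my illustration; the query encoding is an assumption, not code from the talk):

```python
# Compute the fractional edge cover number rho* of a query's hypergraph.
from scipy.optimize import linprog

# Triangle query Q(x,y,z) = R(x,y), S(y,z), T(z,x):
# vertices = variables, edges = atoms.
edges = {"R": {"x", "y"}, "S": {"y", "z"}, "T": {"z", "x"}}
vertices = ["x", "y", "z"]

c = [1.0] * len(edges)                       # minimize sum of edge weights
# Each vertex must be covered: sum of incident edge weights >= 1,
# written as -A_ub @ w <= -1 for linprog's <= convention.
A_ub = [[-1.0 if v in e else 0.0 for e in edges.values()]
        for v in vertices]
b_ub = [-1.0] * len(vertices)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(edges))
print(res.x, res.fun)   # [0.5 0.5 0.5], rho* = 1.5  =>  |Q| <= N^1.5
```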
AGM Bound (Upper Bound) [Atserias, Grohe, Marx 2011]
Consider any instance D and the uniform probability space whose outcomes are the tuples of Q(D), for Q(x,y,z) = R(x,y), S(y,z), T(z,x). [Figure: an example output Q with 5 tuples, each of probability 1/5, together with the induced marginal probabilities on R(x,y), S(y,z), T(z,x).]

Three things to know about entropy:
- H(X) ≤ log |Ω|, with “=” for the uniform distribution
- H(X) ≤ H(X ∪ Y) (monotonicity)
- H(X ∪ Y) + H(X ∩ Y) ≤ H(X) + H(Y) (submodularity)

Shearer’s lemma: if X1, X2, ... k-cover X, then H(X1) + H(X2) + ... ≥ k·H(X).

3·log N ≥ log|R| + log|S| + log|T|
        ≥ H(xy) + H(yz) + H(xz)
        ≥ H(xyz) + H(y) + H(xz)      (submodularity on xy, yz)
        ≥ H(xyz) + H(xyz) + H(∅)     (submodularity on y, xz)
        = 2·H(xyz) = 2·log|Q|
Hence |Q| ≤ N^{3/2}.
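As a sanity check, the entropy argument can be verified numerically. A hedged Python sketch, using a made-up 5-tuple output relation (the exact tables from the slide’s figure are not recoverable):

```python
from collections import Counter
from math import log2

# Uniform distribution over the tuples of an example output Q(D).
Q = [("a", 1, "r"), ("a", 2, "q"), ("b", 1, "r"), ("b", 2, "q"), ("a", 1, "q")]

def H(attrs):
    """Entropy of the marginal distribution on the given attribute positions."""
    n = len(Q)
    counts = Counter(tuple(t[i] for i in attrs) for t in Q)
    return -sum(c / n * log2(c / n) for c in counts.values())

# Shearer's lemma with the 2-cover {xy, yz, xz} of {x, y, z}:
lhs = H((0, 1)) + H((1, 2)) + H((0, 2))
assert lhs >= 2 * H((0, 1, 2)) - 1e-9
print(lhs, 2 * H((0, 1, 2)))   # e.g. 5.37 >= 4.64 = 2*log2(5)
```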
Worst-Case Optimal Algorithm [Ngo, Ré, Rudra]

GenericJoin(Q, D):
  if Q is a ground atom:
    return “true” iff Q is in D
  choose any variable x
  compute A = Π_x(R1) ∩ Π_x(R2) ∩ ...
  for each a ∈ A:
    recurse on GenericJoin(Q[a/x], D)
  return the union of the results

Theorem. GenericJoin runs in time O(N^{ρ*}).
Note: every traditional query plan (one join at a time) is suboptimal!
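A compact Python sketch of GenericJoin for the triangle query (my illustration of the pseudocode above; a real worst-case optimal implementation needs sorted indexes or hash tries to meet the O(N^{ρ*}) bound):

```python
def generic_join(atoms, bound=()):
    """atoms: list of (variable-tuple, relation-as-set-of-tuples)."""
    env = dict(bound)
    free = [v for vs, _ in atoms for v in vs if v not in env]
    if not free:
        # Ground query: true iff every fully-bound atom is present in D.
        ok = all(tuple(env[v] for v in vs) in rel for vs, rel in atoms)
        return [env] if ok else []
    x = free[0]                        # choose any variable x
    # A = intersection of the x-projections of all atoms mentioning x,
    # restricted to tuples consistent with the already-bound variables.
    A = None
    for vs, rel in atoms:
        if x not in vs:
            continue
        proj = {t[vs.index(x)] for t in rel
                if all(t[i] == env[v] for i, v in enumerate(vs) if v in env)}
        A = proj if A is None else A & proj
    out = []
    for a in A:                        # recurse with x bound to a
        out.extend(generic_join(atoms, bound + ((x, a),)))
    return out

R = {(1, 2), (1, 3), (2, 3)}                  # edges of a small graph
T = {(z, x) for (x, z) in R}                  # reversed edges, for T(z,x)
triangle = [(("x", "y"), R), (("y", "z"), R), (("z", "x"), T)]
print(generic_join(triangle))                 # [{'x': 1, 'y': 2, 'z': 3}]
```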
Massively Parallel Communication (MPC) [Beame, Koutris, Suciu]
- Input data of size m (playing the same role as N before), initially spread over p servers, O(m/p) per server
- One round = local computation, then communication; an algorithm consists of several rounds
- L = communication load per round per server (maximum load for algorithms; average load for lower bounds)

Cost of simple strategies:
            Ideal    Practical (ε ∈ (0,1))   Naïve 1   Naïve 2
  Load L    m/p      m/p^{1-ε}                m         m/p
  Rounds r  1        O(1)                     1         p
Speedup
[Figure: speed vs. number of processors p.]
L = m/p: linear speedup. L = m/p^{1-ε}: sub-linear speedup.
Example: Cartesian Product
Q(x,y) = R(x) × S(y): place the servers on a √p × √p grid; partition R across rows and S across columns, so each server receives m/√p tuples of R and m/√p tuples of S. Load L = 2m/p^{1/2} = O(m/p^{1/2}).
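A simulation sketch of this grid algorithm (my illustration; the server grid, hashing scheme, and sizes are assumptions):

```python
# One-round grid algorithm for Q(x,y) = R(x) x S(y): servers form a
# sqrt(p) x sqrt(p) grid; each R-tuple is replicated to one row, each
# S-tuple to one column, so every (x,y) pair meets at exactly one server.
import math, random
from collections import defaultdict

p = 16
side = math.isqrt(p)                       # sqrt(p) = 4
m = 1000
R = [random.randrange(10**6) for _ in range(m)]
S = [random.randrange(10**6) for _ in range(m)]

recv = defaultdict(list)                   # server (i, j) -> received tuples
for x in R:
    i = hash(("R", x)) % side              # row chosen by hashing x
    for j in range(side):                  # replicate along the row
        recv[(i, j)].append(("R", x))
for y in S:
    j = hash(("S", y)) % side              # column chosen by hashing y
    for i in range(side):                  # replicate along the column
        recv[(i, j)].append(("S", y))

loads = [len(ts) for ts in recv.values()]
print(max(loads), 2 * m // side)           # max load close to 2m/sqrt(p)
```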
Full Conjunctive Queries [Beame, Koutris, Suciu; Afrati & Ullman]

One round, no skew (|S1| = |S2| = ...):
- Product: L = m/p^{1/2}
- Join: L = m/p
- Triangle: L = m/p^{2/3}
- Multiway: L = m/p^{1/τ*} (formula based on fractional edge packing)
- Lower bound: same as above

One round, with skew (relation sizes m1, m2, ...): increased load, decreased speedup
- Join: L = m/p^{1/2}
- In general: L = m/p^{1/ψ*}
- Lower bound: …

Multiple rounds (#rounds = r): speedup may improve (e.g., from 1/p^{2/3} to 1/p); mitigates skew for some queries
- Triangle: L = m/p^{1/2} with r = 1; L = m/p^{2/3} with r = 2
- In general: L = ? (open)
- Lower bound: L ≥ m/p^{1/ρ*} · 1/r (is the 1/r factor optimal?)

τ* = fractional edge packing / vertex cover; ρ* = fractional edge cover / vertex packing
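The packing LP behind the skew-free bound is the mirror image of the cover LP solved earlier; a sketch (my illustration):

```python
# Fractional edge packing number tau*: maximize total edge weight subject
# to each vertex carrying weight at most 1. By the table above, the
# skew-free one-round load is m/p^{1/tau*}.
from scipy.optimize import linprog

edges = {"R": {"x", "y"}, "S": {"y", "z"}, "T": {"z", "x"}}
vertices = ["x", "y", "z"]

c = [-1.0] * len(edges)                     # maximize = minimize the negation
A_ub = [[1.0 if v in e else 0.0 for e in edges.values()] for v in vertices]
b_ub = [1.0] * len(vertices)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(edges))
print(-res.fun)   # tau* = 1.5 for the triangle  =>  load m/p^{2/3}
```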
Challenges, Open Problems
- Single server: deepen the connection to information theory (FDs, statistics)
- Shared memory: GenericJoin on PRAM?
- Shared nothing: O(1) rounds, beyond CQ
- Explain why τ* governs skew-free data while ρ* governs skewed data
I/O-optimal Algorithms (Ke)
Key Techniques
- Convex optimization meets finite model theory meets information theory!
- Algorithms: novel “all-joins-at-once”
- Concentration bounds (Chernoff) for hash function guarantees
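To illustrate the last point, a hedged simulation of hash-partition load against a Chernoff tail bound (the parameters are arbitrary choices, not from the talk):

```python
# With m keys hashed to p buckets, Chernoff bounds guarantee the max bucket
# load stays close to the mean m/p with high probability.
import math, random
from collections import Counter

m, p = 100_000, 100
loads = Counter(random.randrange(p) for _ in range(m))  # ideal random hash
mu = m / p
# Chernoff: Pr[load >= (1+d)*mu] <= exp(-d^2 * mu / 3) for 0 < d <= 1;
# choose d so that a union bound over the p buckets leaves failure prob 1/1000.
d = math.sqrt(3 * math.log(p * 1000) / mu)
print(max(loads.values()), (1 + d) * mu)                # observed max vs. bound
```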
Some Tough Problems…
…that no one is addressing in PODS/ICDT! (Why?)
- Transactions!!! Consistent, globally distributed data:
  - Eventual consistency: efficient but wrong (NoSQL)
  - Strong consistency: correct but slow (Spanner, Postgres-XL)
  - Optimistic models: Parallel SI (PSI)
- Design the “right” DSL for ML + data transformations: SystemML (IBM)? TensorFlow (Google)? Expressive power / complexity / hierarchy?
- Design the “right” theoretical model for architectures to come: SMP, NVM, dark silicon
5. What Should We Teach Our Students?
What Alice is Missing:
- Convex optimization
- Information theory
- Models of computation for query processing (beyond relational machines)
- Chernoff/Hoeffding bounds and beyond
6. What Is a Good Foundational Paper?
Recipe for the Best PODS Paper: what do I know? If I knew, I would write it myself…
I only have thoughts about “types” of best papers:
- Best first paper
- Best technical paper
- Best last paper