
1 Managing Data at Scale Ke Yi and Dan Suciu Dagstuhl 2016

2-3 Thomas' Questions
What are the key benefits from theory?
What are the key/tough problems?
What are the key math/FDM techniques?
How does this extend DB theory?
What should we teach our students?
What is a good foundational paper?

4 1. Key Benefits from Theory
DB theory mission: guide the development of new data management techniques, driven by changes in:
What data processing tasks are needed
How data is processed
Where data is processed

5 What
ML, data analytics (R/GraphLab/GraphX)
Distributed state (replicated objects)
Graphics, data visualization (Halide)
Information extraction / text processing
Note: ML and data analytics mean complex, ad-hoc queries, plus linear algebra (R/GraphLab/GraphX), plus iteration, which is not well modeled by datalog.

6 How
Cloud computing (Amazon, MS, Google)
Distributed data processing (MapReduce, Spark, Pregel, GraphLab)
Distributed transactions (Spanner)
Note: major OS-level innovation is happening right now at Amazon, MS, and Google. Distributed transactions: weak consistency (eventual consistency); from Bigtable to Spanner and Postgres-XL; Paxos vs. 2PC.

7 Where
Shared nothing, distributed systems
Shared memory
Chip trends (SIMD, NVM, dark silicon)
Notes:
Shared nothing, distributed systems. Motivation for MR: data larger than main memory; networks are faster than disks. MR is still common practice (Flume). Theory model: next slides.
Shared memory. Motivation: data is not that big after all. Theory models: PRAM (CRCW, EREW, …).
Chip trends. SIMD (QuickStep, EmptyHeaded); NVM (non-volatile memory); dark silicon, e.g. specialization.

8 Thomas' Questions
What are the key benefits from theory?
What are the key/tough problems?
What are the key math/FDM techniques?
How does this extend DB theory?
What should we teach our students?
What is a good foundational paper?

9 2-3. Key Problems and Techniques
Some results/techniques from recent years (*a subjective selection):
Worst-case optimal algorithms
Communication-optimal algorithms
I/O-optimal algorithms + sampling (Ke)

10-12 AGM Bound [Atserias, Grohe, Marx 2011]
Worst-case output size: fix a number N and a full conjunctive query Q. For any database D such that |R1(D)| ≤ N, |R2(D)| ≤ N, …, how large can |Q(D)| be?
Examples:
Q(x,y,z) = R1(x,y), R2(y,z): |Q| ≤ N^2
Q(x,y,z,u) = R1(x,y), R2(y,z), R3(z,u): |Q| ≤ N^2
Q(x,y,z,u,v) = R1(x,y), …, R4(u,v): |Q| ≤ N^3
For any edge cover of Q of size ρ: |Q| ≤ N^ρ
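These integral edge covers can be checked mechanically. A small brute-force sketch (my own illustration, not from the talk): enumerate subsets of atoms and return the smallest one whose variables cover the query, giving the exponent ρ in |Q| ≤ N^ρ.

```python
from itertools import combinations

def min_edge_cover(edges, variables):
    """Smallest (integral) edge cover of a query's hypergraph: the fewest
    atoms whose variables jointly cover every variable of Q."""
    for k in range(1, len(edges) + 1):
        for subset in combinations(edges, k):
            if set().union(*subset) >= set(variables):
                return k
    return None  # atoms do not cover Q (cannot happen for a full CQ)

# Q(x,y,z,u) = R1(x,y), R2(y,z), R3(z,u): cover {R1, R3}, so rho = 2, |Q| <= N^2
print(min_edge_cover([{"x", "y"}, {"y", "z"}, {"z", "u"}], "xyzu"))  # -> 2
```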

13-16 AGM Bound [Atserias, Grohe, Marx 2011]
Theorem. If |Ri(D)| ≤ N for i = 1, 2, …, then max_D |Q(D)| = N^ρ*, where ρ* is the optimal fractional edge cover of Q's hypergraph.

Example: Q(x,y,z) = R(x,y), S(y,z), T(z,x)

Fractional edge cover LP: min (wR + wS + wT) subject to
x: wR + wT ≥ 1
y: wR + wS ≥ 1
z: wS + wT ≥ 1
Here ρ* = 3/2, so |Q| ≤ N^(3/2).
Thm (upper bound). For any feasible wR, wS, wT: |Q| ≤ N^(wR + wS + wT). Proof: Shearer's lemma (next).

Dual LP (fractional vertex packing): max (vx + vy + vz) subject to
R: vx + vy ≤ 1
S: vy + vz ≤ 1
T: vx + vz ≤ 1
Thm (lower bound). For any feasible vx, vy, vz there is an instance with |Q| = N^(vx + vy + vz). Proof: the "free" instance R(x,y) = [N^vx] × [N^vy], S(y,z) = [N^vy] × [N^vz], T(z,x) = [N^vz] × [N^vx].
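The cover LP above is small enough to solve directly. A minimal sketch, assuming scipy is available (linprog minimizes over ≤ constraints, so each ≥ row is negated):

```python
from scipy.optimize import linprog

# Fractional edge cover LP for the triangle Q(x,y,z) = R(x,y), S(y,z), T(z,x):
# minimize wR + wS + wT subject to
#   x: wR + wT >= 1,   y: wR + wS >= 1,   z: wS + wT >= 1,   w >= 0
res = linprog(c=[1, 1, 1],
              A_ub=[[-1,  0, -1],   # -(wR + wT) <= -1
                    [-1, -1,  0],   # -(wR + wS) <= -1
                    [ 0, -1, -1]],  # -(wS + wT) <= -1
              b_ub=[-1, -1, -1],
              bounds=[(0, None)] * 3)
print(res.x, res.fun)  # [0.5 0.5 0.5], 1.5  =>  rho* = 3/2, |Q| <= N^(3/2)
```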

17-24 AGM Bound (Upper Bound) [Atserias, Grohe, Marx 2011]
Consider any instance D and the uniform probability space whose outcomes are the tuples of Q(D), for Q(x,y,z) = R(x,y), S(y,z), T(z,x).
[Slide tables omitted: an example instance where each of the 5 tuples of Q(D) has probability 1/5, inducing marginal distributions on R(x,y), S(y,z), and T(z,x).]

Three things to know about entropy:
H(X) ≤ log |Ω|, with equality for the uniform distribution
H(X) ≤ H(X ∪ Y)
H(X ∪ Y) + H(X ∩ Y) ≤ H(X) + H(Y)

Shearer's lemma: if X1, X2, … k-cover X, then H(X1) + H(X2) + … ≥ k·H(X).

3 log N ≥ log|R| + log|S| + log|T|
≥ H(xy) + H(yz) + H(xz)
≥ H(xyz) + H(y) + H(xz)
≥ H(xyz) + H(xyz) + H(∅)
= 2 H(xyz) = 2 log|Q|
Therefore |Q| ≤ N^(3/2).
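For reference, the same chain written as one LaTeX display (nothing beyond the slide's argument; H is entropy under the uniform distribution on Q(D)):

```latex
\begin{align*}
3\log N &\ge \log|R| + \log|S| + \log|T| \\
        &\ge H(xy) + H(yz) + H(xz)
            && H(X) \le \log|\Omega| \text{ per marginal} \\
        &\ge H(xyz) + H(y) + H(xz)
            && \text{submodularity on } xy,\ yz \\
        &\ge H(xyz) + H(xyz) + H(\emptyset)
            && \text{submodularity on } y,\ xz \\
        &= 2\,H(xyz) = 2\log|Q(D)|
            && \text{uniformity on } Q(D)
\end{align*}
```

Hence |Q(D)| ≤ N^(3/2), matching ρ* = 3/2 from the LP.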

25 Worst-Case Optimal Algorithm [Ngo, Ré, Rudra]
GenericJoin(Q, D):
  if Q is a ground atom: return "true" iff Q is in D
  choose any variable x
  compute A = Πx(R1) ∩ Πx(R2) ∩ … (over the atoms that contain x)
  for all a ∈ A: GenericJoin(Q[a/x], D)
  return the union of the results
Theorem. GenericJoin runs in time O(N^ρ*).
Note: every traditional query plan (= one join at a time) is suboptimal!
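A minimal executable sketch of the recursion in Python (representation and names are mine, not from the talk; relations are sets of tuples with an explicit variable schema). For readability it omits the index structures needed to actually achieve the O(N^ρ*) bound:

```python
from functools import reduce

def generic_join(variables, atoms):
    """GenericJoin sketch. `atoms` is a list of (schema, tuples) pairs,
    where schema is a tuple of variable names and tuples is a set of
    value tuples. Yields every assignment to `variables` satisfying all
    atoms (assumes the atoms jointly mention every variable)."""
    def recurse(remaining, partial):
        if not remaining:
            yield tuple(partial[v] for v in variables)
            return
        x, rest = remaining[0], remaining[1:]
        # A = intersection of Pi_x(R_i) over the atoms mentioning x,
        # restricted to tuples consistent with the partial assignment.
        candidate_sets = []
        for schema, tuples in atoms:
            if x not in schema:
                continue
            i_x = schema.index(x)
            proj = {t[i_x] for t in tuples
                    if all(partial.get(v, t[i]) == t[i]
                           for i, v in enumerate(schema))}
            candidate_sets.append(proj)
        for a in reduce(set.intersection, candidate_sets):
            partial[x] = a
            yield from recurse(rest, partial)
            del partial[x]
    yield from recurse(list(variables), {})

# Triangle query Q(x,y,z) = R(x,y), S(y,z), T(z,x) on a toy edge set:
E = {(1, 2), (2, 3), (3, 1), (1, 3)}
print(list(generic_join(("x", "y", "z"),
                        [(("x", "y"), E), (("y", "z"), E), (("z", "x"), E)])))
# -> [(1, 2, 3), (2, 3, 1), (3, 1, 2)]
```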

26-33 Massively Parallel Communication [Beame, Koutris, Suciu]
Input data of size m (m plays the role of N from before); p servers, each initially holding O(m/p) of the input.
One round = compute & communicate; an algorithm = several rounds.
Max communication load per round per server = L.
Note: in algorithms, L is the maximum load; in lower bounds, L is the average load.

Cost:
Ideal: load L = m/p, rounds r = 1
Practical: L = m/p^(1-ε) for ε ∈ (0,1), r = O(1)
Naïve 1: L = m, r = 1
Naïve 2: L = m/p, r = p

34 Speedup
[Chart omitted: speed vs. number of processors p]
L = m/p: linear speedup
L = m/p^(1-ε): sub-linear speedup

35 Example: Cartesian product
Q(x,y) = R(x) × S(y)
L = 2m/p^(1/2) = O(m/p^(1/2))
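A sketch of the grid scheme behind this load (my own illustration, assuming p is a perfect square): arrange the p servers in a √p × √p grid, replicate each R-tuple along one row and each S-tuple along one column, so every output pair meets at exactly one server and each server receives about 2m/√p tuples.

```python
import math

def shard_cartesian(R, S, p):
    """Grid scheme for Q(x,y) = R(x) x S(y) on p servers (p a perfect
    square). Returns per-server fragments; server (i, j) can emit every
    pair (r, s) with hash(r) % k == i and hash(s) % k == j."""
    k = math.isqrt(p)
    servers = [([], []) for _ in range(k * k)]
    for r in R:
        i = hash(r) % k                # row for this R-tuple
        for j in range(k):             # replicate across the row
            servers[i * k + j][0].append(r)
    for s in S:
        j = hash(s) % k                # column for this S-tuple
        for i in range(k):             # replicate down the column
            servers[i * k + j][1].append(s)
    return servers  # load per server ~ |R|/k + |S|/k = 2m/sqrt(p)
```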

36-39 Full Conjunctive Queries [Beame, Koutris, Suciu]

One round, no skew (|S1| = |S2| = …):
Algorithm: Product L = m/p^(1/2) [Afrati & Ullman]; Join L = m/p; Triangle L = m/p^(2/3); Multiway L = m/p^(1/τ*)
Lower bound: same as above

One round, skew (heavy hitters with frequencies m1, m2, …): increased load, decreased speedup.
Algorithm: formula based on fractional edge packing; L = m/p^(1/2), L = m/p^(1/ψ*)
Lower bound: L ≥ m/p^(1/ρ*)

Multiple rounds (#rounds = r): speedup may improve (e.g. 1/p^(2/3) → 1/p); mitigates skew for some queries.
L = m/p^(1/…) for r = 1; L = m/p^(2/…) for r = 2; L = ? (open). Is 1/r optimal?

τ* = fractional edge packing / vertex cover
ρ* = fractional edge cover / vertex packing
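The one-round m/p^(2/3) triangle load in the table comes from the HyperCube (shares) algorithm. A minimal sketch under my own naming, assuming skew-free input and that p is a perfect cube:

```python
def hypercube_triangle(R, S, T, p):
    """One-round HyperCube sharding for Q(x,y,z) = R(x,y), S(y,z), T(z,x).
    Servers form a k x k x k cube with k = p^(1/3); each variable hashes
    to one coordinate and each relation is replicated along its single
    missing dimension. Skew-free load per server: O(m / p^(2/3))."""
    k = round(p ** (1 / 3))
    cube = {}  # (i, j, l) -> ([R-frag], [S-frag], [T-frag])
    def srv(i, j, l):
        return cube.setdefault((i, j, l), ([], [], []))
    for (x, y) in R:                       # R(x,y) misses z: replicate over l
        i, j = hash(x) % k, hash(y) % k
        for l in range(k):
            srv(i, j, l)[0].append((x, y))
    for (y, z) in S:                       # S(y,z) misses x: replicate over i
        j, l = hash(y) % k, hash(z) % k
        for i in range(k):
            srv(i, j, l)[1].append((y, z))
    for (z, x) in T:                       # T(z,x) misses y: replicate over j
        l, i = hash(z) % k, hash(x) % k
        for j in range(k):
            srv(i, j, l)[2].append((z, x))
    return cube  # each server then joins its three fragments locally
```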

40 Challenges, Open Problems
Single server: deepen the connection to information theory (FDs, statistics)
Shared memory: GenericJoin on a PRAM?
Shared nothing: O(1) rounds, beyond CQs
Explain why τ* governs the skew-free case and ρ* the skewed case

41 I/O-Optimal Algorithms
(presented by Ke)

42 Key Techniques
Convex optimization meets finite model theory meets information theory!
Algorithms: novel "all-joins-at-once"
Concentration bounds (Chernoff) for hash function guarantees
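As a concrete instance of the last bullet (my own illustration, not from the talk): the multiplicative Chernoff bound Pr[X ≥ c·μ] ≤ (e^(c-1)/c^c)^μ caps the probability that one hash bucket receives far more than its expected share μ, which is the kind of guarantee skew-free load analyses rest on.

```python
import math

def bucket_overload_prob(n, buckets, c):
    """Chernoff bound on the probability that a fixed bucket of a fully
    random hash function gets at least c times its expectation
    mu = n / buckets:  Pr[X >= c*mu] <= (e**(c - 1) / c**c) ** mu."""
    mu = n / buckets
    return (math.exp(c - 1) / c ** c) ** mu

# Hashing n = 10**6 tuples into 1000 buckets (mu = 1000): even a 20%
# overload is vanishingly unlikely, so the max load concentrates at m/p.
print(bucket_overload_prob(10**6, 1000, 1.2))  # ~7e-9
```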

43 Some Tough Problems…
…that no one is addressing in PODS/ICDT! (Why??)
Transactions!!! Consistent, globally distributed data:
Eventual consistency: efficient but wrong (NoSQL)
Strong consistency: correct but slow (Spanner, Postgres-XL)
Optimistic models: Parallel Snapshot Isolation (PSI)
Design the "right" DSL for ML + data transformations: SystemML (IBM)? TensorFlow (Google)? Expressive power / complexity / hierarchy?
Design the "right" theoretical model for the architectures to come: SMP, NVM, dark silicon

44 Thomas' Questions
What are the key benefits from theory?
What are the key/tough problems?
What are the key math/FDM techniques?
How does this extend DB theory?
What should we teach our students?
What is a good foundational paper?

45 What Alice is Missing
Convex optimization
Information theory
Models of computation for query processing (beyond relational machines)
Chernoff/Hoeffding bounds and beyond

46 Thomas' Questions
What are the key benefits from theory?
What are the key/tough problems?
What are the key math/FDM techniques?
How does this extend DB theory?
What should we teach our students?
What is a good foundational paper?

47 Recipe for the Best PODS Paper
What do I know???? If I knew, I would write it myself…
I only have thoughts about "types" of best papers:
Best first paper
Best technical paper
Best last paper

