Skew in Parallel Query Processing. Paraschos Koutris, Paul Beame, Dan Suciu. University of Washington. PODS 2014.

Motivation
Understand the complexity of parallel query processing on big data
– on shared-nothing architectures (e.g. MapReduce)
– even in the presence of data skew
Dominating parameters of computation:
– Communication cost
– Number of communication rounds

The MPC Model
Computation proceeds in synchronous rounds
– Local Computation
– Global Communication
[Figure: the input (size = M) is spread over p servers, which run rounds 1 to r and produce the output; #bits received at each round ≤ L]

The MPC Model
What is the minimum load L of an MPC algorithm that computes a Conjunctive Query Q in one round?
[Beame, Koutris, Suciu, PODS 2013] Tight upper and lower bounds for relations of equal size (M bits) and no skew
[Figure: the range of the maximum load. At one end the data is evenly distributed, which maximizes parallelism; at the other end the computation is equivalent to sequential computation, with no parallelism]

Results
Computing a Conjunctive Query Q in the MPC model in one round for relations with different sizes and skew
Matching upper and lower bounds for any skew-free input database and different relation sizes
Almost matching upper and lower bounds in the presence of skew
– Matching bounds in the case of simple joins

Conjunctive Queries
Full Conjunctive Queries without self-joins:
– Q(x, y, z) = R(x, y), S(y, z), T(z, x) [the triangle query]
The hypergraph of the query Q:
– Variables as vertices
– Atoms as hyperedges
[Figure: the hypergraph of the triangle query, with vertices x, y, z and hyperedges R, S, T]

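Example: Cartesian Product
The cartesian product: $Q(x, y) = S_1(x), S_2(y)$ with cardinalities $m_1, m_2$
ALGORITHM
– Organize the p servers in a $p_1 \times p_2$ rectangle
– Send $S_1(x) \to (h_1(x), *)$ and $S_2(y) \to (*, h_2(y))$
– The load will be $L = m_1/p_1 + m_2/p_2$
– To minimize L choose $p_1 = \sqrt{p\, m_1/m_2}$ and $p_2 = \sqrt{p\, m_2/m_1}$, which gives $L = 2\sqrt{m_1 m_2 / p}$
The algorithm is optimal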
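The choice of rectangle in the cartesian-product example can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, the shares are left as real numbers (a real deployment would round them to integers), and the load model m_1/p_1 + m_2/p_2 assumes perfectly uniform hash functions.

```python
import math

def rectangle_shares(m1, m2, p):
    """Real-valued shares (p1, p2) with p1 * p2 = p that balance m1/p1 and m2/p2."""
    p1 = math.sqrt(p * m1 / m2)
    p2 = math.sqrt(p * m2 / m1)
    return p1, p2

def rectangle_load(m1, m2, p1, p2):
    # Assumption: a server at (i, j) receives the slice of S1 hashed to row i
    # and the slice of S2 hashed to column j, and both hashes are uniform.
    return m1 / p1 + m2 / p2

# Example with assumed sizes: 4000 and 1000 tuples on 64 servers.
p1, p2 = rectangle_shares(4000, 1000, 64)      # 16 x 4 rectangle
balanced = rectangle_load(4000, 1000, p1, p2)  # 500 tuples per server
square = rectangle_load(4000, 1000, 8, 8)      # 625 tuples per server
```

With these sizes the balanced 16 × 4 rectangle gives a lower load than the naive 8 × 8 square, because the larger relation is split across more rows.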

Lower Bounds (1)
For a cartesian product $Q = S_1 \times S_2 \times \dots \times S_u$ the lower bound for the load is, up to a constant factor, $L \ge (M_1 M_2 \cdots M_u / p)^{1/u}$
For a Conjunctive Query $Q(x_1, \dots, x_k) = S_1(\dots), \dots, S_l(\dots)$, any subset of relations $S_{j_1}, S_{j_2}, \dots, S_{j_u}$ without shared variables (an edge packing for the hypergraph of Q) gives a lower bound for the load
The lower bound also holds with any fractional edge packing

Lower Bounds (2)
Theorem. For a Conjunctive Query Q, where relation $S_j$ has size $M_j$ (in bits), any MPC algorithm that computes Q in one round with maximum load L must satisfy, for some constant c and for any fractional edge packing $u = (u_1, \dots, u_l)$:
$$L \ge c \cdot \left( \frac{\prod_j M_j^{u_j}}{p} \right)^{1 / \sum_j u_j}$$
Proof techniques:
– Using entropy to bound knowledge
– Friedgut's inequality to bound the maximum size of a query

HyperCube Algorithm
$Q(x_1, \dots, x_k) = S_1(\dots), \dots, S_l(\dots)$
For each variable $x_i$ define the share to be an integer $p_i$ such that $p = p_1 \times \dots \times p_k$
Assign each of the p servers to a point on the k-dimensional hypercube: $[p] = [p_1] \times \dots \times [p_k]$
Hash each tuple to the appropriate subcube, e.g. $S_3(x_3, x_4) \to (*, *, h_3(x_3), h_4(x_4), *, \dots)$

Example: The Triangle Query
Algorithm: [Ganguly '92, Afrati '10, Suri '11]
– The p servers form a cube: $[p^{1/3}] \times [p^{1/3}] \times [p^{1/3}]$
– Send each tuple to servers (each tuple is replicated $p^{1/3}$ times):
  $R(a, b) \to (h_x(a), h_y(b), -)$
  $S(b, c) \to (-, h_y(b), h_z(c))$
  $T(c, a) \to (h_x(a), -, h_z(c))$
[Figure: the cube of servers, with the point $(h_x(a), h_y(b), h_z(c))$ marked]
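The routing rule for the triangle query can be sketched as a small single-process simulation. Everything here is illustrative: the function names are hypothetical, Python's built-in `hash` on integer keys stands in for the independent hash functions h_x, h_y, h_z, and the "servers" are just dictionary entries.

```python
from itertools import product

def make_hash(salt, buckets):
    # Assumption: hash((salt, v)) % buckets plays the role of one of h_x, h_y, h_z.
    return lambda v: hash((salt, v)) % buckets

def hypercube_triangle(R, S, T, side):
    """Route R(a,b), S(b,c), T(c,a) to a side x side x side cube of servers."""
    hx, hy, hz = make_hash(0, side), make_hash(1, side), make_hash(2, side)
    servers = {cell: {"R": [], "S": [], "T": []}
               for cell in product(range(side), repeat=3)}
    for a, b in R:                       # R fixes (x, y); replicate along z
        for k in range(side):
            servers[(hx(a), hy(b), k)]["R"].append((a, b))
    for b, c in S:                       # S fixes (y, z); replicate along x
        for i in range(side):
            servers[(i, hy(b), hz(c))]["S"].append((b, c))
    for c, a in T:                       # T fixes (x, z); replicate along y
        for j in range(side):
            servers[(hx(a), j, hz(c))]["T"].append((c, a))
    return servers

def local_triangles(fragment):
    """Join the three local fragments held by one server."""
    t_set = set(fragment["T"])
    found = set()
    for a, b in fragment["R"]:
        for b2, c in fragment["S"]:
            if b2 == b and (c, a) in t_set:
                found.add((a, b, c))
    return found
```

Each output tuple (a, b, c) is produced at the server with coordinates (h_x(a), h_y(b), h_z(c)), so taking the union of `local_triangles` over all servers yields the full answer without a second communication round.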

Analysis of HyperCube (1)
For a vector of shares $p = (p_1, \dots, p_k)$, how is relation $S_j$ distributed to the servers?
Ideally, each server receives a $1 / \prod_{i : x_i \in \mathrm{vars}(S_j)} p_i$ fraction of the tuples of $S_j$
Example: relation R(x, y) of the triangle query
– Ideal load $L = M / \#\text{cells} = M/p^{2/3}$
– If R has a single value in the x-column, the load will instead be $M/p^{1/3}$
– The load will be $O(M/p^{2/3})$ if each value appears in the x and y columns at most $M/p^{1/3}$ times
[Figure: grid of cells with side $p^{1/3}$]

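Analysis of HyperCube (2)
In general, a relation $S_j$ is skew-free w.r.t. p if, for any subset of variables $\mathbf{x} \subseteq \mathrm{vars}(S_j)$, every value appears at most $M_j / \prod_{i : x_i \in \mathbf{x}} p_i$ times
If every relation is skew-free w.r.t. p, then the maximum load of the HYPERCUBE algorithm is $O\left( \max_j \, M_j / \prod_{i : x_i \in \mathrm{vars}(S_j)} p_i \right)$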
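A skew-freeness test of this kind can be sketched in a few lines. The threshold used below, (number of tuples) / (product of the shares of the chosen columns), is an assumption chosen to agree with the triangle example, where each value may appear at most M/p^{1/3} times in a column; it counts tuples rather than bits, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import prod

def is_skew_free(tuples, shares):
    """tuples: list of equal-length tuples; shares: one share per column.

    Returns True if, for every non-empty subset of columns, no combination
    of values occurs more often than len(tuples) / (product of their shares).
    """
    if not tuples:
        return True
    m = len(tuples)
    columns = range(len(shares))
    for size in range(1, len(shares) + 1):
        for subset in combinations(columns, size):
            limit = m / prod(shares[i] for i in subset)
            counts = Counter(tuple(t[i] for i in subset) for t in tuples)
            if max(counts.values()) > limit:
                return False
    return True

# A 4 x 4 grid of values is skew-free for shares (4, 4);
# a relation with a single x-value is not.
uniform = [(i, j) for i in range(4) for j in range(4)]
skewed = [(0, j) for j in range(16)]
```

The check enumerates all column subsets, so it is exponential in the arity of the relation; that is acceptable for the small arities typical of query atoms.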

Analysis of HyperCube (3)
The maximum load of the HYPERCUBE algorithm is always bounded by $O\left( \max_j \, M_j / \min_{i : x_i \in \mathrm{vars}(S_j)} p_i \right)$
Join with shares $p_x = p_y = p_z = p^{1/3}$
– For a skew-free database, the load is $O(M/p^{2/3})$
– Otherwise, the load is always bounded by $O(M/p^{1/3})$

Computing the Shares
The optimal shares are computed by solving a Linear Program (LP)
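The LP itself did not survive in the transcript. One plausible way to write it, offered as a reconstruction under stated assumptions rather than as the authors' exact formulation, is to work with exponents: write each share as $p_i = p^{e_i}$, each relation size as $M_j = p^{\mu_j}$, and the target load as $L = p^{\lambda}$. The constraint $\prod_i p_i \le p$ and the skew-free load $M_j / \prod_{i : x_i \in \mathrm{vars}(S_j)} p_i \le L$ then become linear.

```latex
% Assumed notation (not in the slides): p_i = p^{e_i}, \mu_j = \log_p M_j, L = p^{\lambda}
\begin{align*}
\text{minimize}   \quad & \lambda \\
\text{subject to} \quad & \sum_{i=1}^{k} e_i \le 1, \\
& \lambda + \sum_{i \,:\, x_i \in \mathrm{vars}(S_j)} e_i \ge \mu_j
  \quad \text{for every relation } S_j, \\
& e_i \ge 0, \quad \lambda \ge 0.
\end{align*}
```

As a sanity check, for the triangle query with three relations of equal size M, summing the three relation constraints gives $3\lambda \ge 3\mu - 2$, and $e_x = e_y = e_z = 1/3$ attains $\lambda = \mu - 2/3$, that is $L = M/p^{2/3}$. The sketch ignores the rounding needed to make the shares integers.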

Analysis of HyperCube
Theorem. For a conjunctive query Q, where relation $S_j$ has size $M_j$ and is skew-free, there exist shares such that the HYPERCUBE algorithm runs with maximum load
$$O\left( \max_{u \in pk(Q)} \left( \frac{\prod_j M_j^{u_j}}{p} \right)^{1 / \sum_j u_j} \right)$$
where pk(Q) is the set of all fractional edge packings of Q
By using an LP duality argument, we can prove that the load matches the lower bound

Edge Packings for the Triangle
Q(x, y, z) = R(x, y), S(y, z), T(z, x)
Edge packing u       Load (asymptotic)
(1/2, 1/2, 1/2)      $(M_R M_S M_T)^{1/3} / p^{2/3}$
(1, 0, 0)            $M_R / p$
(0, 1, 0)            $M_S / p$
(0, 0, 1)            $M_T / p$
[Figure: the hypergraph of the triangle query, with vertices x, y, z and hyperedges R, S, T]
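The four rows of the table are all instances of one expression, (∏ M_j^{u_j} / p)^{1/Σ u_j}, and the largest of them is the binding one. The sketch below evaluates them for given sizes; the function names and the example sizes are hypothetical, and constants are ignored.

```python
def packing_load(sizes, u, p):
    """(prod_j M_j^{u_j} / p) ** (1 / sum_j u_j), ignoring constant factors."""
    total = sum(u)
    if total <= 0:
        raise ValueError("the packing must have positive total weight")
    weighted = 1.0
    for size, weight in zip(sizes, u):
        weighted *= size ** weight
    return (weighted / p) ** (1.0 / total)

# The four packings listed in the table for the triangle query.
TRIANGLE_PACKINGS = [(0.5, 0.5, 0.5), (1, 0, 0), (0, 1, 0), (0, 0, 1)]

def triangle_load(m_r, m_s, m_t, p):
    """Largest load over the packings in the table."""
    return max(packing_load((m_r, m_s, m_t), u, p) for u in TRIANGLE_PACKINGS)

equal_sizes = triangle_load(1e6, 1e6, 1e6, 64)   # dominated by (1/2, 1/2, 1/2)
one_large = triangle_load(1e8, 1e4, 1e4, 64)     # dominated by (1, 0, 0)
```

With equal sizes the packing (1/2, 1/2, 1/2) dominates and gives M/p^{2/3}; when one relation is much larger than the other two, the single-relation packing M_R/p takes over.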

The Presence of Skew
A simple join $Q(x, y, z) = S_1(x, z), S_2(y, z)$
Optimal shares $p_x = p_y = 1$, $p_z = p$
– Standard parallel hash-join
– If the database has no skew, $L = O(\max\{M_1, M_2\} / p)$
– If it is skewed, the load can be as bad as O(M) (all tuples are sent to the same server)
For any value h of z, $m_j(h)$ = frequency of h in $S_j$

Skew-Aware Join (1)
$Q(x, y, z) = S_1(x, z), S_2(y, z)$
Idea: identify the heavy hitters and treat them differently
– h is a heavy hitter in $S_j$ if $m_j(h) > M_j / p$
– h is light otherwise
CASE 1 (LIGHT) For all light values h, run the HyperCube algorithm (hash-join on z) on all p servers

Skew-Aware Join (2)
CASE 2 (HEAVY) For any heavy hitter h (either in $S_1$ or $S_2$), compute the residual query (a cartesian product) $Q[z \backslash h] = S_1(x, h), S_2(y, h)$ using $p_h$ exclusive servers. Choose $p_h$ such that
– The sum of the $p_h$ is O(p)
– The load for every residual query $Q[z \backslash h]$ is the same
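The heavy-hitter split and the server allocation can be sketched as follows. This is a simplified illustration, not the authors' exact procedure: it counts tuples instead of bits, the function names are hypothetical, and it assumes that a residual cartesian product on p_h servers has load proportional to sqrt(m_1(h) m_2(h) / p_h), so that equal loads mean p_h proportional to m_1(h) m_2(h).

```python
from collections import Counter
from math import ceil

def heavy_hitters(S1, S2, p):
    """S1: list of (x, z); S2: list of (y, z).

    Returns {h: (m1(h), m2(h))} for every z-value h that is heavy in S1 or S2,
    where heavy means frequency above (relation size) / p.
    """
    m1 = Counter(z for _, z in S1)
    m2 = Counter(z for _, z in S2)
    heavy = {h for h, count in m1.items() if count > len(S1) / p}
    heavy |= {h for h, count in m2.items() if count > len(S2) / p}
    return {h: (m1[h], m2[h]) for h in heavy}

def allocate_servers(heavy, p):
    """Give each heavy hitter a share of servers proportional to m1(h) * m2(h)."""
    total = sum(a * b for a, b in heavy.values())
    if total == 0:
        return {h: 0 for h in heavy}   # no heavy value produces any output
    return {h: ceil(p * a * b / total) for h, (a, b) in heavy.items()}
```

Rounding up adds at most one server per heavy hitter, and each relation has fewer than p heavy hitters (each accounts for more than a 1/p fraction of it), so the total stays O(p). The light values are handled separately by the ordinary hash-join on z.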

Skew: Simple Join
Theorem. Any MPC algorithm that computes the join query in one round must satisfy a lower bound on its maximum load L
The skew-aware join achieves this optimal load

Skew in Conjunctive Queries
For any conjunctive query Q, our algorithm computes the light values using HYPERCUBE
– Since there is no skew, this part is optimal
For the heavy hitters, it considers the residual queries and assigns to each an appropriate number of exclusive servers
– The values of the heavy hitters and their frequency must be known to the algorithm

Conclusion
Summary
– Upper and lower bounds for computing Conjunctive Queries in the MPC model in the presence of skew
Open Problems
– What is the load L when we consider more rounds?
– How do other classes of queries behave?

Thank you!

Duality: Edge Packing
Fractional edge packing: assign a weight $u_j$ to each $S_j$ such that, for each variable $x_i$, the sum of the weights of the edges that contain it is at most 1
[Figure: the hypergraph of q(x, y, z) = R(x, y), S(y, z), T(z, x), with vertices x, y, z, hyperedges R, S, T and an edge weight of 1/2]
By duality, the minimum value of the LP is equal to a maximum taken over all edge packings in pk(q)
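The packing condition is easy to check mechanically. The sketch below is a minimal illustration with hypothetical names: a query is given as a mapping from atom names to their variables, and a candidate packing as a mapping from atom names to weights.

```python
def is_fractional_edge_packing(atoms, u, tol=1e-9):
    """atoms: dict atom name -> tuple of variables; u: dict atom name -> weight.

    Returns True if all weights are non-negative and, for every variable,
    the weights of the atoms containing it sum to at most 1.
    """
    if any(weight < 0 for weight in u.values()):
        return False
    per_variable = {}
    for name, variables in atoms.items():
        for var in variables:
            per_variable[var] = per_variable.get(var, 0.0) + u.get(name, 0.0)
    return all(total <= 1 + tol for total in per_variable.values())

# The triangle query: every variable occurs in exactly two atoms.
TRIANGLE = {"R": ("x", "y"), "S": ("y", "z"), "T": ("z", "x")}
```

For the triangle, weights of 1/2 on every atom saturate each variable exactly, while putting weight 1 on both R and S violates the constraint at y.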