Published by Milton Hubbard. Modified over 9 years ago.
1
Skew in Parallel Query Processing
Paraschos Koutris, Paul Beame, Dan Suciu
University of Washington
PODS 2014
2
Motivation
Understand the complexity of parallel query processing on big data
– on shared-nothing architectures (e.g. MapReduce)
– even in the presence of data skew
Dominating parameters of computation:
– Communication cost
– Number of communication rounds
3
The MPC Model
Computation proceeds in synchronous rounds, each consisting of:
– Local computation
– Global communication
[Figure: an input of size M is distributed over servers 1, …, p, which proceed through rounds 1, …, r and produce the output]
The number of bits received by each server in each round is at most L
4
The MPC Model
What is the minimum load L of an MPC algorithm that computes a Conjunctive Query Q in one round?
[Beame, Koutris, Suciu, PODS 2013]: tight upper and lower bounds for relations of equal size (M bits) and no skew
[Figure: the range of the maximum load, from "the data is evenly distributed" (maximizes parallelism) to "equivalent to sequential computation" (no parallelism)]
5
Results
Computing a Conjunctive Query Q in the MPC model in one round, for relations with different sizes and skew
Matching upper and lower bounds for any skew-free input database and different relation sizes
Almost matching upper and lower bounds in the presence of skew
– Matching bounds in the case of simple joins
6
Conjunctive Queries
Full Conjunctive Queries w/o self-joins:
– Q(x, y, z) = R(x, y), S(y, z), T(z, x) [the triangle query]
The hypergraph of the query Q:
– Variables as vertices
– Atoms as hyperedges
[Figure: hypergraph of the triangle query, with vertices x, y, z and edges R, S, T]
7
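Example: Cartesian Product
The cartesian product: Q(x, y) = S_1(x), S_2(y) with cardinalities m_1, m_2
ALGORITHM
– Organize the p servers in a p_1 × p_2 rectangle
– Send S_1(x) → (h_1(x), *) and S_2(y) → (*, h_2(y))
– The load will be $L = m_1/p_1 + m_2/p_2$
– To minimize L choose $p_1 = \sqrt{p\, m_1/m_2}$ and $p_2 = \sqrt{p\, m_2/m_1}$, which gives $L = 2\sqrt{m_1 m_2 / p}$
The algorithm is optimal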
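The choice of shares for the rectangle can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes perfectly uniform hashing, so each server receives m_1/p_1 tuples of S_1 and m_2/p_2 tuples of S_2, and it returns real-valued shares without rounding them to integers. The function names are hypothetical.

```python
import math

def cartesian_shares(m1, m2, p):
    """Real-valued shares (p1, p2) with p1 * p2 = p that minimize m1/p1 + m2/p2.

    Assumption: rounding to integers and the requirement p1, p2 >= 1 are ignored.
    """
    p1 = math.sqrt(p * m1 / m2)
    p2 = math.sqrt(p * m2 / m1)
    return p1, p2

def cartesian_load(m1, m2, p1, p2):
    """Tuples received per server when S1 is split p1 ways and S2 is split p2 ways."""
    return m1 / p1 + m2 / p2

p1, p2 = cartesian_shares(400, 100, 16)
load = cartesian_load(400, 100, p1, p2)
```

With these shares both terms of the load are equal, which is why the larger relation gets the larger share.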
8
Lower Bounds (1)
For a cartesian product Q = S_1 × S_2 × … × S_u, where S_j has size M_j, the lower bound for the load is, up to a constant factor, $L \ge (M_1 M_2 \cdots M_u / p)^{1/u}$
For a Conjunctive Query Q(x_1, …, x_k) = S_1(…), …, S_l(…), any subset of relations S_{j1}, S_{j2}, …, S_{ju} without shared variables (an edge packing for the hypergraph of Q) gives a lower bound for the load
The lower bound also holds with any fractional edge packing
9
Lower Bounds (2)
Theorem: For a Conjunctive Query Q, where relation S_j has size M_j (in bits), any MPC algorithm that computes Q in one round with maximum load L must satisfy, for some constant c and for any fractional edge packing u = (u_1, …, u_l):
$L \ge c \cdot \left( \prod_j M_j^{u_j} / p \right)^{1 / \sum_j u_j}$
Proof techniques:
– Using entropy to bound knowledge
– Friedgut’s inequality to bound the maximum size of a query
10
HyperCube Algorithm
Q(x_1, …, x_k) = S_1(…), …, S_l(…)
For each variable x_i define the share to be an integer p_i such that p = p_1 × … × p_k
Assign each of the p servers to a point on the k-dimensional hypercube: [p] = [p_1] × … × [p_k]
Hash each tuple to the appropriate subcube, e.g. S_3(x_3, x_4) → (*, *, h_3(x_3), h_4(x_4), *, …)
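The routing step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the hash function `h` (CRC32 of the variable name and value) and the function names are hypothetical choices, not taken from the slides. A tuple fixes the coordinates of the variables it contains and is replicated along every other dimension.

```python
import itertools
import zlib

def h(var, value, buckets):
    """Hypothetical deterministic hash of a value of variable `var` into [0, buckets)."""
    return zlib.crc32(f"{var}:{value}".encode()) % buckets

def destinations(atom_vars, tup, all_vars, shares):
    """All hypercube points (servers) that must receive tuple `tup` of an atom.

    atom_vars: the variables of the atom, in column order
    all_vars:  all query variables, fixing the order of hypercube dimensions
    shares:    dict mapping each variable to its share p_i
    """
    bound = dict(zip(atom_vars, tup))
    axes = []
    for v in all_vars:
        if v in bound:
            axes.append([h(v, bound[v], shares[v])])  # coordinate fixed by hashing
        else:
            axes.append(range(shares[v]))             # "*": replicate along this dimension
    return list(itertools.product(*axes))

# Triangle query with shares 2 x 2 x 2 (p = 8 servers).
all_vars = ("x", "y", "z")
shares = {"x": 2, "y": 2, "z": 2}
r = destinations(("x", "y"), (1, 2), all_vars, shares)
s = destinations(("y", "z"), (2, 3), all_vars, shares)
t = destinations(("z", "x"), (3, 1), all_vars, shares)
meeting_point = set(r) & set(s) & set(t)
```

The three tuples of a triangle meet at exactly one server, so every output tuple is produced once and no second round is needed.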
11
Example: The Triangle Query
Algorithm: [Ganguly ’92, Afrati ’10, Suri ’11]
– The p servers form a cube: [p^{1/3}] × [p^{1/3}] × [p^{1/3}]
– Send each tuple to servers:
  R(a, b) → (h_x(a), h_y(b), -)
  S(b, c) → (-, h_y(b), h_z(c))
  T(c, a) → (h_x(a), -, h_z(c))
– Each tuple is replicated p^{1/3} times
[Figure: the cube of servers, with the point (h_x(a), h_y(b), h_z(c)) marked]
12
Analysis of HyperCube (1)
For a vector of shares p = (p_1, …, p_k), how is relation S_j distributed to the servers?
Ideally, each server receives the same share of S_j: $M_j / \prod_{i : x_i \in vars(S_j)} p_i$
Example: relation R(x, y) of the triangle query
– Ideal load L = M / #cells = M/p^{2/3}
– If R has a single value in the x-column, the load will instead be M/p^{1/3}
– The load will be O(M/p^{2/3}) if each value appears in the x and y columns at most M/p^{1/3} times
[Figure: grid of cells with side p^{1/3}]
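The effect of a skewed x-column can be checked with a small simulation. This is a sketch under assumptions: modular hashing on integer values stands in for h_x and h_y, and M is counted in tuples rather than bits. Replication along the z dimension does not change how many tuples of R a single server sees, so only the (x, y) cells are counted.

```python
from collections import Counter

def max_cell_load(tuples, px, py):
    """Largest number of R-tuples landing in one (x, y) cell, using value % share as the hash."""
    cells = Counter((a % px, b % py) for a, b in tuples)
    return max(cells.values())

side = 4  # plays the role of p^(1/3), so there are 16 cells

# Skew-free: 256 tuples with 16 distinct x-values and 16 distinct y-values.
skew_free = [(i, j) for i in range(16) for j in range(16)]

# Skewed: 256 tuples that all share the single x-value 0.
skewed = [(0, j) for j in range(256)]

ideal = max_cell_load(skew_free, side, side)  # M / p^(2/3)
worst = max_cell_load(skewed, side, side)     # M / p^(1/3)
```

In the skewed case only one row of cells is used, which is where the extra factor of p^{1/3} in the load comes from.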
13
Analysis of HyperCube (2)
In general, a relation S_j is skew-free w.r.t. p if, for any subset of variables x of vars(S_j), every value of x appears at most $M_j / \prod_{i : x_i \in x} p_i$ times
If every relation is skew-free w.r.t. p, then the maximum load of the HyperCube algorithm is: $O\left( \max_j M_j / \prod_{i : x_i \in vars(S_j)} p_i \right)$
14
Analysis of HyperCube (3)
The maximum load of the HyperCube algorithm is always bounded by $O\left( \max_j M_j / \min_{i : x_i \in vars(S_j)} p_i \right)$
Example: triangle join with shares p_x = p_y = p_z = p^{1/3}
– For a skew-free database, the load is O(M/p^{2/3})
– Otherwise, the load is always bounded by O(M/p^{1/3})
15
Computing the Shares
The optimal shares are computed by solving a Linear Program (LP)
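The LP itself is not preserved in this transcript. One way to write such a program, offered here as a reconstruction under assumptions rather than the slide's exact formulation, is to work in exponents of p: the symbols $e_i$, $\mu_j$ and $\lambda$ below are introduced for illustration, with $p_i = p^{e_i}$, $M_j = p^{\mu_j}$ and $L = p^{\lambda}$. The constraints say that the shares use at most p servers and that every relation's skew-free load stays below L.

```latex
\begin{align*}
\text{minimize}   \quad & \lambda \\
\text{subject to} \quad & \sum_{i=1}^{k} e_i \le 1 \\
                        & \lambda + \sum_{i \,:\, x_i \in vars(S_j)} e_i \ge \mu_j
                          \quad \text{for every relation } S_j \\
                        & e_i \ge 0, \quad \lambda \ge 0
\end{align*}
```

For the triangle query with equal sizes $M = p^{\mu}$, the solution $e_x = e_y = e_z = 1/3$, $\lambda = \mu - 2/3$ gives back the load $M/p^{2/3}$.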
16
Analysis of HyperCube
Theorem: For a conjunctive query Q, where relation S_j has size M_j and is skew-free, there exist shares such that the HyperCube algorithm runs with maximum load
$O\left( \max_{u \in pk(Q)} \left( \prod_j M_j^{u_j} / p \right)^{1 / \sum_j u_j} \right)$
where pk(Q) = set of all fractional edge packings
By using an LP duality argument, we can prove that the load matches the lower bound
17
Edge Packings for the Triangle
Q(x, y, z) = R(x, y), S(y, z), T(z, x)
Edge packing u      Load (asymptotic)
(1/2, 1/2, 1/2)     (M_R M_S M_T)^{1/3} / p^{2/3}
(1, 0, 0)           M_R / p
(0, 1, 0)           M_S / p
(0, 0, 1)           M_T / p
[Figure: hypergraph of the triangle query, with vertices x, y, z and edges R, S, T]
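The table entries are all instances of one expression, (∏_j M_j^{u_j} / p)^{1/Σ_j u_j}, evaluated at a particular packing u. The sketch below evaluates it for the four packings listed; the function name and the sample sizes are assumptions made for illustration, and constant factors are ignored.

```python
import math

def packing_load(u, sizes, p):
    """Evaluate (prod_j M_j^{u_j} / p)^(1 / sum_j u_j) for an edge packing u."""
    product = math.prod(m ** w for m, w in zip(sizes, u))
    return (product / p) ** (1 / sum(u))

sizes = (1000, 1000, 1000)  # hypothetical M_R, M_S, M_T
p = 8
packings = [(0.5, 0.5, 0.5), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
loads = {u: packing_load(u, sizes, p) for u in packings}
bound = max(loads.values())
```

With equal sizes the packing (1/2, 1/2, 1/2) dominates; when one relation is much larger than the others, its unit packing can take over.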
18
The Presence of Skew
A simple join Q(x, y, z) = S_1(x, z), S_2(y, z)
Optimal shares p_x = p_y = 1, p_z = p
– Standard parallel hash-join
– If the database has no skew, L = O(max{M_1, M_2} / p)
– If it is skewed, the load can be as bad as O(M) (all tuples are sent to the same server)
For any value h of z, m_j(h) = frequency of h in S_j
19
Skew-Aware Join (1)
Q(x, y, z) = S_1(x, z), S_2(y, z)
Idea: identify the heavy hitters and treat them differently
– h is a heavy hitter in S_j if m_j(h) > M_j/p
– h is light otherwise
CASE 1 (LIGHT)
For all light values h, run the HyperCube algorithm (hash-join on z) on all p servers
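The heavy/light split can be sketched as a frequency count against the threshold M_j/p. This is an illustration under one labeled assumption: M_j is measured here as the number of tuples of S_j, not in bits. The function name is hypothetical.

```python
from collections import Counter

def heavy_hitters(relation, z_index, p):
    """Map each heavy value h of the join column to its frequency m_j(h).

    A value is heavy when m_j(h) > M_j / p, with M_j taken as len(relation)
    (an assumption: the slides measure sizes in bits).
    """
    freq = Counter(t[z_index] for t in relation)
    threshold = len(relation) / p
    return {h: c for h, c in freq.items() if c > threshold}

# S_1(x, z): the value z = 0 appears 6 times, four other z-values appear once each.
s1 = [(i, 0) for i in range(6)] + [(i, i + 1) for i in range(4)]
heavy = heavy_hitters(s1, z_index=1, p=5)
```

Since each heavy hitter accounts for more than a 1/p fraction of its relation, there are fewer than p of them per relation.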
20
Skew-Aware Join (2)
CASE 2 (HEAVY)
For any heavy hitter h (either in S_1 or S_2), compute the residual query (a cartesian product) Q[z\h] = S_1(x, h), S_2(y, h) using p_h exclusive servers
Choose p_h such that
– The sum of the p_h is O(p)
– The load for every residual query Q[z\h] is the same
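One allocation that meets both conditions gives each heavy hitter a number of servers proportional to m_1(h) · m_2(h), since a cartesian product of sizes m_1(h) and m_2(h) on p_h servers has load on the order of sqrt(m_1(h) m_2(h) / p_h). The sketch below is a simplification and not necessarily the authors' exact rule: it ignores the m_j(h)/p_h terms that matter when h is heavy in only one relation, and its function names are hypothetical.

```python
import math

def allocate_servers(m1, m2, heavy, p):
    """Assign p_h servers to each heavy hitter h, proportional to m1(h) * m2(h).

    m1, m2: dicts from value to frequency in S_1 and S_2.
    Rounding up adds at most one server per heavy hitter, so the total stays O(p).
    """
    weights = {h: m1.get(h, 0) * m2.get(h, 0) for h in heavy}
    total = sum(weights.values())
    return {h: max(1, math.ceil(p * w / total)) if total else 1
            for h, w in weights.items()}

def residual_load(m1h, m2h, ph):
    """Load of the cartesian product S_1(x, h) x S_2(y, h) on ph servers with balanced shares."""
    return 2 * math.sqrt(m1h * m2h / ph)

m1 = {"a": 100, "b": 100}
m2 = {"a": 100, "b": 25}
alloc = allocate_servers(m1, m2, ["a", "b"], p=10)
loads = {h: residual_load(m1[h], m2[h], alloc[h]) for h in alloc}
```

Setting every residual load to the same value L and summing p_h = m_1(h) m_2(h) / L^2 up to p gives L on the order of sqrt(Σ_h m_1(h) m_2(h) / p).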
21
Skew: Simple Join
Theorem: Any MPC algorithm that computes the join query in one round must satisfy:
[Formula: lower bound on the load L]
The skew-aware join achieves the above optimal load
22
Skew in Conjunctive Queries
For any conjunctive query Q, our algorithm computes the light values using HyperCube
– Since there is no skew, this part is optimal
For the heavy hitters, it considers the residual queries and assigns each of them an appropriate number of exclusive servers
– The values of the heavy hitters and their frequencies must be known to the algorithm
23
Conclusion
Summary
– Upper and lower bounds for computing Conjunctive Queries in the MPC model in the presence of skew
Open Problems
– What is the load L when we consider more rounds?
– How do other classes of queries behave?
24
Thank you!
25
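Duality: Edge Packing
q(x, y, z) = R(x, y), S(y, z), T(z, x)
Fractional edge packing: assign a weight u_j to each S_j such that, for each variable x_i, the sum of the weights of the edges that contain it is at most 1
[Figure: hypergraph of the triangle query, with vertices x, y, z, edges R, S, T, and edge weight 1/2]
By duality, the minimum load given by the LP is equal to the maximum value, over all edge packings u in pk(q), of $\left( \prod_j M_j^{u_j} / p \right)^{1 / \sum_j u_j}$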
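The packing condition is easy to check mechanically: for every variable, add up the weights of the atoms that contain it and compare with 1. The sketch below does this for the triangle query; the function name and the dictionary encoding of the hypergraph are assumptions made for illustration.

```python
def is_fractional_edge_packing(atoms, u, tol=1e-9):
    """Check that weights u form a fractional edge packing of the query hypergraph.

    atoms: dict mapping each atom name to the tuple of variables it contains
    u:     dict mapping each atom name to its non-negative weight u_j
    """
    if any(w < -tol for w in u.values()):
        return False
    variables = {v for vs in atoms.values() for v in vs}
    return all(
        sum(u[a] for a, vs in atoms.items() if v in vs) <= 1 + tol
        for v in variables
    )

triangle = {"R": ("x", "y"), "S": ("y", "z"), "T": ("z", "x")}
half = is_fractional_edge_packing(triangle, {"R": 0.5, "S": 0.5, "T": 0.5})
unit = is_fractional_edge_packing(triangle, {"R": 1, "S": 0, "T": 0})
too_much = is_fractional_edge_packing(triangle, {"R": 1, "S": 1, "T": 0})
```

The last assignment fails because the variable y lies in both R and S, so its total weight would be 2.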