Parallel Evaluation of Conjunctive Queries. Paraschos Koutris and Dan Suciu, University of Washington. PODS 2011, Athens.
Motivation. Massive parallelism is now necessary for handling huge amounts of data. Parallelism has been popularized in various forms: the MapReduce architecture; languages on top of MapReduce (PigLatin, Hive); systems for data analytics (Dremel, SCOPE). What is a good theoretical model to capture computation in such massively parallel systems?
Today’s Parallel Models. Classic models for parallelism: circuit complexity, PRAM (Parallel Random Access Machines). The BSP (Bulk-Synchronous Parallel) model [Valiant, ’90]. The LogP model [Culler et al., ’93]. The main bottlenecks: Communication + Synchronization + Data Skew.
Model | Communication | Synchronization | Data Skew
[Afrati and Ullman, EDBT’10] | minimize | 1 step | n/a
[Karloff et al., SODA’10] | implicit restriction | minimize | memory O(n^ε), ε < 1
[Hellerstein, SIGMOD’10] (Coordination Complexity) | n/a | minimize | n/a
Our Approach | O(n) | minimize | load balancing
Our Approach. Strict bounds on communication and data skew. Minimize synchronization. Parallel complexity = number of synchronization steps. Example: Algorithms A and B process the same amount of data, yet Algorithm B is more efficient than Algorithm A. [Figure: Algorithm A vs. Algorithm B]
The Massively Parallel Model. A universe U, a relational schema, and a database instance D. P servers: each relation R is partitioned into R1, R2, …, RP. A value a from U is generic: it can be copied, tested for equality (is a = b?), or fed to a hash function: h(a), h’(a, b). Hash functions can be chosen randomly at the beginning. Computation proceeds in parallel steps, each with 3 phases. Broadcast Phase: the P servers exchange some data B globally, shared among all servers; we require size(B) = O(n^ε), ε < 1. Communication Phase: each server sends data to other servers. Computation Phase: local computation. An algorithm for a query Q is load balanced if the expected maximum load is O(n / P), where n = size of input + output data.
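As a rough illustration of the communication phase and the load-balance condition, the following Python sketch routes every tuple to the server picked by a hash function and reports the maximum load. The names `h`, `communication_phase`, and `max_load` are illustrative assumptions, and CRC32 is only a deterministic stand-in for the randomly chosen hash function of the model.

```python
import zlib


def h(value, num_servers):
    """Deterministic stand-in (assumption) for the model's random hash function."""
    return zlib.crc32(repr(value).encode()) % num_servers


def communication_phase(fragments, key, num_servers):
    """Send every tuple of every fragment R_s to server h(key(tuple))."""
    received = [[] for _ in range(num_servers)]
    for fragment in fragments:
        for t in fragment:
            received[h(key(t), num_servers)].append(t)
    return received


def max_load(fragments):
    """Largest number of tuples held by any single server."""
    return max(len(fragment) for fragment in fragments)


P = 4
# Hypothetical input: 1000 unary tuples spread round-robin over P servers.
R = [[(i,) for i in range(s, 1000, P)] for s in range(P)]
routed = communication_phase(R, key=lambda t: t[0], num_servers=P)
print(max_load(routed), "tuples at the busiest server; n / P =", 1000 // P)
```

With distinct values the busiest server should stay close to n / P; repeated values are what push the maximum load up.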
Datalog Notation for MP the fragment of relation R stored at server s Broadcasting to all servers: :- Point-to-point communication using a hash function h: :- Local computation at server s: :- Communication Phase :- :- Computation Phase :- Intersection Q(x):-R(x),S(x)
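The intersection query Q(x) :- R(x), S(x) can be sketched in Python as one step with a communication phase followed by a computation phase. This is a minimal sketch, not the slide's Datalog rules: the function names and the CRC32-based hash are assumptions made for illustration.

```python
import zlib


def h(value, num_servers):
    """Deterministic stand-in (assumption) for the hash function h."""
    return zlib.crc32(repr(value).encode()) % num_servers


def intersect_one_step(r_fragments, s_fragments, num_servers):
    r_at = [set() for _ in range(num_servers)]
    s_at = [set() for _ in range(num_servers)]
    # Communication phase: every server sends each value a to server h(a).
    for fragment in r_fragments:
        for a in fragment:
            r_at[h(a, num_servers)].add(a)
    for fragment in s_fragments:
        for a in fragment:
            s_at[h(a, num_servers)].add(a)
    # Computation phase: each server intersects what it received.
    return [r_at[s] & s_at[s] for s in range(num_servers)]


# Hypothetical fragments of R and S held by 3 servers.
R = [[1, 2, 3], [4, 5], [6]]
S = [[2, 4], [6, 8], [10]]
answers = intersect_one_step(R, S, num_servers=3)
print(sorted(set().union(*answers)))  # [2, 4, 6]
```

Because equal values always hash to the same server, every common value is found locally and no further step is needed.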
The Main Result. We study relational queries which are conjunctive (a conjunction of atoms) and full (every variable must appear in the head of the query). Q: Which full conjunctive queries can be answered by a load balanced algorithm in one MP step? Main Theorem: Every tall-flat conjunctive query can be evaluated in one MP step by a load balanced algorithm. Conversely, if a query is not tall-flat, then no algorithm consisting of one MP step can be load balanced.
Tall-Flat Queries. Tall Queries: Q(x,y,z) :- R(x), S(x,y), T(x,y,z). Flat Queries: Q(x,y,z,w) :- R(x,y), S(x,z), T(x,w). Combine them to get the tall-flat queries: L(x1,x2,x3,x4,y1,y2,y3) :- R1(x1), R2(x1,x2), R3(x1,x2,x3), R4(x1,x2,x3,x4), S1(x1,x2,x3,x4,y1), S2(x1,x2,x3,x4,y2), S3(x1,x2,x3,x4,y3). The atoms R1 to R4 form the tall part; the atoms S1 to S3 form the flat part.
Outline. Algorithms for: Semijoin; Flat Queries; Tall Queries. Combine for Tall-Flat Queries. Impossible Queries.
Semijoin: a naïve approach. Semijoin operator: Q(x,y) :- R(x), S(x,y). Communication Phase: send tuples S(a,b), R(a) to server h(a). Computation Phase: locally perform the semijoin. Hashing: load balanced? ✔ ✗ Example tuples from the slide: S(5,a), S(2,b), S(2,a), S(4,a), S(1,c), S(3,d), S(3,f), S(0,a), S(0,d), S(0,e), S(0,w), S(0,c).
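The effect of skew on the naïve approach can be seen by running it on the S tuples listed on this slide. The Python sketch below is an illustration under assumptions: the relation R, the number of servers, and the modular hash are hypothetical choices, picked only so the routing is easy to follow by hand.

```python
def naive_semijoin(r_values, s_tuples, num_servers):
    """One MP step with no broadcast phase: hash R(a) and S(a, b) on a."""
    def h(a):
        # Assumption: a simple modular hash, not the model's random hash.
        return a % num_servers

    r_at = [set() for _ in range(num_servers)]
    s_at = [[] for _ in range(num_servers)]
    # Communication phase.
    for a in r_values:
        r_at[h(a)].add(a)
    for a, b in s_tuples:
        s_at[h(a)].append((a, b))
    s_loads = [len(fragment) for fragment in s_at]
    # Computation phase: local semijoin at each server.
    answers = [[t for t in s_at[s] if t[0] in r_at[s]]
               for s in range(num_servers)]
    return answers, s_loads


S = [(5, "a"), (2, "b"), (2, "a"), (4, "a"), (1, "c"), (3, "d"), (3, "f"),
     (0, "a"), (0, "d"), (0, "e"), (0, "w"), (0, "c")]
R = [0, 1, 2, 3]  # hypothetical: the slide does not list R
answers, s_loads = naive_semijoin(R, S, num_servers=4)
print(s_loads)  # [6, 2, 2, 2]
```

All five tuples with x = 0 land on the same server, so that server receives half of S while an even split would give each server 3 tuples.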
A better approach. Same approach as SkewJoin in PigLatin. Computing frequent elements: given a relation R(x,…), find the values of x with frequency more than a threshold τ. Two methods: sampling, local counting. Broadcast Phase: compute the frequent values, F = frequent(S). Communication Phase :- not :- y), not :- Computation Phase :- Semijoin
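One way to read the three phases for the semijoin Q(x,y) :- R(x), S(x,y) is sketched below in Python. This is an assumption-laden sketch rather than the slide's rules: frequent values are counted exactly instead of by sampling or local counting, tuples of S with a frequent x are spread by hashing the whole tuple, and the matching R(x) is replicated to every server. All function names and the CRC32 hash are hypothetical.

```python
import zlib
from collections import Counter


def h(key, num_servers):
    """Deterministic stand-in (assumption) for the hash functions h and h'."""
    return zlib.crc32(repr(key).encode()) % num_servers


def frequent(s_tuples, tau):
    """Values of x occurring more than tau times in S (exact counting)."""
    counts = Counter(x for x, _ in s_tuples)
    return {x for x, c in counts.items() if c > tau}


def skew_aware_semijoin(r_values, s_tuples, num_servers, tau):
    # Broadcast phase: every server learns the set F of frequent values.
    F = frequent(s_tuples, tau)
    r_at = [set() for _ in range(num_servers)]
    s_at = [[] for _ in range(num_servers)]
    # Communication phase.
    for a in r_values:
        if a in F:
            for s in range(num_servers):   # replicate R(a) everywhere
                r_at[s].add(a)
        else:
            r_at[h(a, num_servers)].add(a)
    for a, b in s_tuples:
        target = h((a, b), num_servers) if a in F else h(a, num_servers)
        s_at[target].append((a, b))
    # Computation phase: local semijoin.
    return [[t for t in s_at[s] if t[0] in r_at[s]]
            for s in range(num_servers)]


S = [(5, "a"), (2, "b"), (2, "a"), (4, "a"), (1, "c"), (3, "d"), (3, "f"),
     (0, "a"), (0, "d"), (0, "e"), (0, "w"), (0, "c")]
R = [0, 1, 2, 3]  # hypothetical relation R
result = skew_aware_semijoin(R, S, num_servers=4, tau=3)
print(sorted(t for part in result for t in part))
```

The design choice is that only R-values in F are replicated, so the extra communication is bounded by the number of frequent values times the number of servers.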
The Broadcast Phase. Do we really need a broadcast phase before distributing the data to the servers? Theorem: Any algorithm computing a semijoin in 1 MP step without a broadcast phase is not load balanced. The purpose of the broadcast phase is to extract information on the data distribution (e.g. identify the frequent values).
Full Join. Join: Q(x,y,z) :- R(x,y), S(y,z). Similar idea to [Xu et al., SIGMOD ’08]. Communication Phase. CASE frequent(R): :- RF(y) :- RF(y). CASE frequent(S), not frequent(R): SF(y), not RF(y) :- SF(y), not RF(y). CASE not frequent(R), not frequent(S): not RF(y), not SF(y) not RF(y), not SF(y). Computation Phase: :- DS(y,z) :- DR(x,y), :- :-
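The three-way case split on the join value y can be sketched in Python as follows. This is a hedged reading of the cases, not the slide's exact rules: when y is frequent in R, the R tuples are spread by hashing the whole tuple and the matching S tuples are replicated; when y is frequent only in S, the roles are swapped; otherwise both sides are hashed on y. Relations are assumed to be duplicate-free, frequencies are counted exactly, and all names are hypothetical.

```python
import zlib
from collections import Counter


def h(key, num_servers):
    """Deterministic stand-in (assumption) for the hash functions."""
    return zlib.crc32(repr(key).encode()) % num_servers


def frequent(values, tau):
    return {v for v, c in Counter(values).items() if c > tau}


def skew_join(r_tuples, s_tuples, num_servers, tau):
    # Broadcast phase: frequent join values of R and of S.
    rf = frequent((y for _, y in r_tuples), tau)
    sf = frequent((y for y, _ in s_tuples), tau)
    everyone = range(num_servers)
    r_at = [[] for _ in everyone]
    s_at = [[] for _ in everyone]
    # Communication phase.
    for x, y in r_tuples:
        if y in rf:
            targets = [h((x, y), num_servers)]   # spread the heavy R tuples
        elif y in sf:
            targets = everyone                   # replicate R next to heavy S
        else:
            targets = [h(y, num_servers)]
        for s in targets:
            r_at[s].append((x, y))
    for y, z in s_tuples:
        if y in rf:
            targets = everyone                   # replicate S next to heavy R
        elif y in sf:
            targets = [h((y, z), num_servers)]   # spread the heavy S tuples
        else:
            targets = [h(y, num_servers)]
        for s in targets:
            s_at[s].append((y, z))
    # Computation phase: local join at each server.
    return [[(x, y, z) for x, y in r_at[s] for y2, z in s_at[s] if y == y2]
            for s in everyone]


# Hypothetical data: y = 0 is frequent in R, y = 1 is frequent in S.
R = [(i, 0) for i in range(6)] + [(7, 1), (8, 2)]
S = [(0, 10)] + [(1, 20 + j) for j in range(5)] + [(3, 30)]
parts = skew_join(R, S, num_servers=4, tau=3)
print(sum(len(p) for p in parts), "result tuples")
```

In every case exactly one side of a matching pair is sent to a single server, so each result tuple is produced once.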
Flat Queries. How can we extend the above ideas to compute flat queries such as Q(x,y,z,w) :- R(x,y), S(x,z), T(x,w)? We introduce a second step in the broadcast phase to find the frequent values that definitely appear in the final result. Why is this needed? Suppose a is frequent in R and S but does not exist in T: the cost of replicating the a-tuples would not be justified by the output size. The idea generalizes to any flat query, with only 2 broadcast steps.
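For the flat query Q(x,y,z,w) :- R(x,y), S(x,z), T(x,w), the two broadcast steps might be sketched in Python as below. This is an assumption about how the second step could work: a frequent value of x is kept only if it occurs in every relation, since otherwise it contributes nothing to the output. The function names, the threshold, and the sample data are hypothetical.

```python
from collections import Counter


def frequent(tuples, tau):
    """Values of the first attribute occurring more than tau times."""
    counts = Counter(t[0] for t in tuples)
    return {a for a, c in counts.items() if c > tau}


def confirmed_frequent(relations, tau):
    # Broadcast step 1: values that are frequent in at least one relation.
    candidates = set().union(*(frequent(rel, tau) for rel in relations))
    # Broadcast step 2: keep a candidate only if it occurs in every relation.
    present = [{t[0] for t in rel} for rel in relations]
    return {a for a in candidates if all(a in p for p in present)}


# Hypothetical data: "a" is frequent in R and S but does not exist in T.
R = [("a", i) for i in range(5)] + [("b", i) for i in range(5)]
S = [("a", i) for i in range(5)] + [("b", 0)]
T = [("b", 0)]
print(confirmed_frequent([R, S, T], tau=3))  # {'b'}
```

Only the confirmed values would then be treated as frequent in the communication phase, so the a-tuples are never replicated.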
Tall Queries. Compute a tall query Q(x,y,z) :- R(x), S(x,y), T(x,y,z). Construct a decision tree to decide whether a tuple will be hashed (and how) or broadcast. [Figure: decision tree for an example tuple t = S(a,b); node labels: x in frequent(T), x in frequent(S), x,y; outcomes: Yes: broadcast t; No: send t to h(a,b)]
The Main Algorithm. Reminder: a tall-flat query consists of a tall part and a flat part. Tall-query techniques (decision tree) handle the tall part; flat-query techniques handle the flat part. We can thus design an algorithm which computes any tall-flat query in 1 MP step (with a 2-step broadcast phase). Main Theorem (Part 1): Every tall-flat conjunctive query can be evaluated in one MP step by a load balanced algorithm.
Impossibility Theorems. Lemma 1: The query RST(x,y) :- R(x), S(x,y), T(y) cannot be computed in 1 MP step by a load balanced algorithm. Lemma 2: The query J(x,y) :- R(x), S(x), T(y) cannot be computed in 1 MP step by a load balanced algorithm. Main Theorem (Part 2): A query that is not tall-flat cannot be computed in 1 MP step by a load balanced algorithm.
Open Questions. How can we leverage data statistics (e.g. relation sizes, value distributions) to design better MP algorithms? What is the minimum number of parallel steps for any query? What is the parallel complexity of other classes of queries (e.g. with union, projections)? At what point does it become more expensive in practice to have a broadcast phase instead of 2 steps?
Questions?