-Shourie Boddupalli
Data Parallelism Data parallelism is a form of parallel computing in which the data is partitioned across multiple processors and each processor applies the same operation to its own partition. A data-parallel framework is attractive for large-scale data processing because it lets an application process a huge amount of data on commodity machines.
Data Warehouse A data warehouse is an online repository for decision support applications that answer business queries in a short time. Where can data parallelism be used in a warehouse? - Star Schema - Star-Join Query
Approaches to process Star-Join Data Parallel Framework (Ex: Hive, CloudBase) - No need for up-to-date hardware & software - Fault tolerance is provided, with its complexity hidden from the application. However, the computational efficiency of join-query processing in these frameworks is still at an immature stage.
Warehouse Example
Example Query SELECT D_YEAR, S_NATION, P_CATEGORY FROM DATE, CUSTOMER, SUPPLIER, PART, LINEORDER WHERE LO_CUSTKEY = C_CUSTKEY AND LO_SUPKEY = S_SUPKEY AND LO_PARTKEY = P_PARTKEY AND C_REGION = 'AMERICA' GROUP BY D_YEAR, S_NATION, P_CATEGORY;
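A minimal sketch of this star-join on an in-memory SQLite database may help make the query concrete. The rows are invented for illustration, the tables carry only the columns the query names, and the join condition LO_ORDERDATE = D_DATEKEY is an assumption that does not appear in the slide (it is added here so that DATE is actually joined rather than cross-multiplied). An ORDER BY is also added so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, heavily trimmed schema: one fact table and four dimension tables.
conn.executescript("""
CREATE TABLE "DATE"    (D_DATEKEY INTEGER PRIMARY KEY, D_YEAR INTEGER);
CREATE TABLE CUSTOMER  (C_CUSTKEY INTEGER PRIMARY KEY, C_REGION TEXT);
CREATE TABLE SUPPLIER  (S_SUPKEY  INTEGER PRIMARY KEY, S_NATION TEXT);
CREATE TABLE PART      (P_PARTKEY INTEGER PRIMARY KEY, P_CATEGORY TEXT);
CREATE TABLE LINEORDER (LO_ORDERDATE INTEGER, LO_CUSTKEY INTEGER,
                        LO_SUPKEY INTEGER, LO_PARTKEY INTEGER);
""")

# Invented sample rows.
conn.executemany('INSERT INTO "DATE" VALUES (?, ?)',
                 [(19970101, 1997), (19980101, 1998)])
conn.executemany("INSERT INTO CUSTOMER VALUES (?, ?)",
                 [(1, "AMERICA"), (2, "ASIA")])
conn.executemany("INSERT INTO SUPPLIER VALUES (?, ?)",
                 [(10, "UNITED STATES"), (11, "CHINA")])
conn.executemany("INSERT INTO PART VALUES (?, ?)",
                 [(100, "MFGR#12"), (101, "MFGR#13")])
conn.executemany("INSERT INTO LINEORDER VALUES (?, ?, ?, ?)",
                 [(19970101, 1, 10, 100),
                  (19980101, 1, 11, 101),
                  (19970101, 2, 10, 100)])  # customer 2 is not in AMERICA

rows = conn.execute("""
SELECT D_YEAR, S_NATION, P_CATEGORY
FROM "DATE", CUSTOMER, SUPPLIER, PART, LINEORDER
WHERE LO_ORDERDATE = D_DATEKEY      -- assumed join condition, not in the slide
  AND LO_CUSTKEY = C_CUSTKEY
  AND LO_SUPKEY = S_SUPKEY
  AND LO_PARTKEY = P_PARTKEY
  AND C_REGION = 'AMERICA'
GROUP BY D_YEAR, S_NATION, P_CATEGORY
ORDER BY D_YEAR
""").fetchall()

for row in rows:
    print(row)
```

Only the two line orders placed by the customer in AMERICA survive the restriction; the third is dropped by the C_REGION predicate.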
Execution plan for the query
Scatter-Gather-Merge This algorithm (as the name indicates) has 3 phases: - Scatter - Gather - Merge Key Manipulation Technique: The basic idea is to join the fact table with n dimension tables within 3 computational phases.
Example of Database and Star-Join Query
Contd. During the Scatter phase: 1) If the input is a tuple of the fact table (FT), the tuple is transformed into one key-value pair per foreign key (two pairs when the query joins two dimension tables). 2) If the input is a tuple of a dimension table, the tuple is transformed into a single new key-value pair. The Gather phase groups the key-value pairs by key and matches fact records with dimension records. The Merge phase produces the final results of the star-join query.
Algorithm
Algorithm 1 (Key manipulation algorithm of Scatter-Gather-Merge)
Scatter(r)
Input r is a record.
  if (r is a record of the fact table F) then
    for each fki do
      Turn input tuple (fk1, fk2,..., fkn, rF) into key-value pair ((fki, i), (fk1, fk2,..., fkn, rF)).
      Store ((fki, i), (fk1, fk2,..., fkn, rF)).
    endfor
  endif
  if (r is a record of dimension table Di) then
    Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
    Store and Distribute ((pki, i), rDi).
  endif
Gather(k, v)
Input k is a key (join key). v is a set of records that have the same join key.
  Match all ((pki, i), rDi) with all ((fki, i), (fk1, fk2,..., fkn, rF)).
  Make an output ((fk1, fk2,..., fkn), (rDi, rF)).
  Store and Distribute ((fk1, fk2,..., fkn), (rDi, rF)).
Merge(k, v)
Input k is a key (fk1, fk2,..., fkn). v is a set of records that have the key.
  Aggregate every record with all ((fk1, fk2,..., fkn), (rDi, rF)) where 1 ≤ i ≤ n.
  Make an output ((fk1, fk2,..., fkn), (rD1, rD2,..., rDn, rF)).
  Store ((fk1, fk2,..., fkn), (rD1, rD2,..., rDn, rF)). // final output
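The three phases of Algorithm 1 can be sketched in plain Python on in-memory lists. This is an illustrative sketch, not the authors' implementation: the tuple layouts, the "F"/"D" tags, the helper group_by_key, and the sample data are all assumptions. The check in merge that drops a fact tuple lacking a match in some dimension is likewise an assumption, added to give inner-join behaviour. The sketch relies on the vector (fk1,..., fkn) being unique in the fact table.

```python
from collections import defaultdict

def scatter(record, table_id, n):
    """table_id 0 is the fact table; table_id i >= 1 is dimension table D_i."""
    if table_id == 0:
        fks, r_f = record                      # ((fk1, ..., fkn), rF)
        return [((fks[i - 1], i), ("F", fks, r_f)) for i in range(1, n + 1)]
    pk, r_d = record                           # (pk_i, rD_i)
    return [((pk, table_id), ("D", r_d))]

def group_by_key(pairs):
    """Stand-in for the 'distribute' step: collect values sharing a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def gather(key, values):
    """Match dimension records with fact records that share the join key."""
    _, i = key
    dims = [v[1] for v in values if v[0] == "D"]
    facts = [v for v in values if v[0] == "F"]
    return [(fks, (i, r_d, r_f)) for r_d in dims for _, fks, r_f in facts]

def merge(fks, values, n):
    """Combine the n partial results of one fact tuple into the final record."""
    by_dim = {i: r_d for i, r_d, _ in values}
    if len(by_dim) < n:                        # assumption: drop incomplete joins
        return []
    r_f = values[0][2]
    return [(fks, tuple(by_dim[i] for i in range(1, n + 1)) + (r_f,))]

def star_join(fact, dims):
    n = len(dims)
    pairs = [p for rec in fact for p in scatter(rec, 0, n)]
    for i, table in enumerate(dims, start=1):
        pairs += [p for rec in table for p in scatter(rec, i, n)]
    gathered = [out for k, vs in group_by_key(pairs).items()
                for out in gather(k, vs)]
    return sorted(out for k, vs in group_by_key(gathered).items()
                  for out in merge(k, vs, n))

# Invented sample data: two dimension tables and two fact tuples.
fact = [((1, 10), "order-a"), ((2, 11), "order-b")]
dims = [[(1, "AMERICA"), (2, "ASIA")], [(10, "US"), (11, "CHINA")]]
result = star_join(fact, dims)
print(result)
```

Tagging each key with the dimension number i keeps equal key values from different dimension tables apart, which is what lets a single grouping step serve all n joins at once.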
Notation Used Di has the primary key PKi, which is associated with the foreign key FKi of F, where i is the dimension identification number of Di. Each tuple of Di is (pki, rDi), where pki is the value of the primary key PKi and rDi is a vector that contains the other attribute values. Each tuple of F is (fk1, fk2,..., fkn, rF), where fki is the value of the foreign key FKi and rF is a vector that contains the other attribute values. Either the vector (fk1, fk2,..., fkn) is unique in the fact table or rF contains the primary key.
IO Reduction Technique With the key manipulation technique, each fact tuple yields n intermediate results before the final query result is produced, and this volume needs to be reduced. Bloom filters are introduced to cut down the intermediate results.
Algorithm for IO Reduction
Algorithm 2 (Scatter-Gather-Merge algorithm)
Filter-Construction(r)
Input r is a record. BFi is a Bloom filter of Di.
  if (r is a record of dimension table Di and r satisfies the condition CDi) then
    Store and Distribute r.
    Add pki to BFi.
  endif
Scatter(r)
Input r is a record.
  if (r is a record of the fact table F) then
    for each fki do
      if fki is not contained in the corresponding BFi then return
      endif
    endfor
    for each fki do
      Turn input tuple (fk1, fk2,..., fkn, rF) into key-value pair ((fki, i), (fk1, fk2,..., fkn, rF)).
      Store and Distribute ((fki, i), (fk1, fk2,..., fkn, rF)).
    endfor
  endif
  if (r is a record of dimension table Di) then
    Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
    Store and Distribute ((pki, i), rDi).
  endif
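The filter construction and the filtered scatter step can be sketched as follows. The BloomFilter class is a deliberately simple illustration (its bit-array size, its number of hash functions, and its use of SHA-256 are all assumptions, not taken from the slides), and build_filter and scatter_fact are hypothetical helper names.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)        # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def build_filter(dimension, condition):
    """Filter-Construction: keep qualifying tuples and record their keys."""
    bloom = BloomFilter()
    kept = []
    for pk, r_d in dimension:
        if condition(r_d):
            kept.append((pk, r_d))
            bloom.add(pk)
    return kept, bloom

def scatter_fact(fks, r_f, filters):
    """Scatter for a fact tuple: emit nothing if any foreign key is filtered out."""
    if any(fk not in bloom for fk, bloom in zip(fks, filters)):
        return []
    return [((fk, i), (fks, r_f)) for i, fk in enumerate(fks, start=1)]

# Invented sample data: only customers in AMERICA qualify.
customers = [(1, "AMERICA"), (2, "ASIA")]
suppliers = [(10, "US"), (11, "CHINA")]
kept_c, bf_c = build_filter(customers, lambda region: region == "AMERICA")
kept_s, bf_s = build_filter(suppliers, lambda nation: True)

pairs_hit = scatter_fact((1, 10), "order-a", [bf_c, bf_s])
pairs_miss = scatter_fact((2, 11), "order-b", [bf_c, bf_s])
print(len(pairs_hit), len(pairs_miss))
```

A fact tuple whose customer key was never added to the filter is discarded before any key-value pair is written, which is where the IO saving comes from. Because a Bloom filter can report false positives, a few non-qualifying fact tuples may still pass the check; the exact key match in the Gather phase is what ultimately decides the join.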
Map-Reduce based Scatter-Gather-Merge Algorithm Three Phases - Filter Construction - Scatter & Gather - Merge
Contd.
Map-Reduce based Scatter-Gather-Merge Algorithm
Map(k, v) // filter construction
Input k is a key. v is a record of each participating dimension table that the star-join query has restrictions on.
  if (v is a record of dimension table Di and v satisfies the condition CDi) then
    Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
    Emit ((pki, i), rDi).
  endif
Reduce(k, v)
Input (k, v) is a filtered record of each dimension table. BF(i,j) is a Bloom filter of Di for the j-th Reduce process.
  Emit ((pki, i), rDi).
  Add pki to BF(i,j).
Map(k, v) // scatter function
Input k is a key. v is a record of the fact table and every participating dimension table.
  if (v is a record of the fact table F) then
    for each fki do
      if fki is not contained in the corresponding BF(i,j) then return
      endif
    endfor
    for each fki do
      Turn input tuple (fk1, fk2,..., fkn, rF) into key-value pair ((fki, i), (fk1, fk2,..., fkn, rF)).
      Emit ((fki, i), (fk1, fk2,..., fkn, rF)).
    endfor
  endif
  if (v is a record of dimension table Di) then
    if (there are restrictions on Di) then
      Emit ((pki, i), rDi).
    else
      Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
      Emit ((pki, i), rDi).
    endif
  endif
Reduce(k, v) // gather function
Input k is a key (join key). v is a set of records that have the same join key.
  Match all ((pki, i), rDi) with all ((fki, i), (fk1, fk2,..., fkn, rF)) where pki = fki.
  Make an output ((fk1, fk2,..., fkn), (rDi, rF)).
  Emit ((fk1, fk2,..., fkn), (rDi, rF)).
Map(k, v) // merge phase
Input k is a key (fk1, fk2,..., fkn) and v is a value (rDi, rF).
  Emit ((fk1, fk2,..., fkn), (rDi, rF)).
Reduce(k, v) // merge function
Input k is a key (fk1, fk2,..., fkn). v is a set of records that have the key.
  Aggregate every record with all ((fk1, fk2,..., fkn), (rDi, rF)) where 1 ≤ i ≤ n.
  Make an output ((fk1, fk2,..., fkn), (rD1, rD2,..., rDn, rF)).
  Emit ((fk1, fk2,..., fkn), (rD1, rD2,..., rDn, rF)).
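To show how the three MapReduce jobs chain together, here is a single-process simulation. Everything in it is a sketch under stated assumptions: run_job is a hypothetical stand-in for a real MapReduce runtime, Python sets stand in for the Bloom filters BF(i,j), raw input records are keyed by a table id (0 for the fact table, i for Di), and only restricted dimension tables are assumed to have a filter.

```python
from collections import defaultdict

def run_job(map_fn, reduce_fn, records):
    """Simulate one MapReduce job: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return [out for k, vs in groups.items() for out in reduce_fn(k, vs)]

N = 2                                               # number of dimension tables
CONDITIONS = {1: lambda r_d: r_d == "AMERICA"}      # restriction on D1 only
filters = {i: set() for i in CONDITIONS}            # sets stand in for Bloom filters

# Job 1: filter construction (restricted dimension tables only).
def map_filter(table_id, rec):
    pk, r_d = rec
    if CONDITIONS[table_id](r_d):
        yield (pk, table_id), r_d

def reduce_filter(key, values):
    pk, i = key
    filters[i].add(pk)
    for r_d in values:
        yield key, r_d

# Job 2: scatter (map) and gather (reduce).
def map_scatter(key, value):
    if key == 0:                                    # fact table record
        fks, r_f = value
        if any(fks[i - 1] not in bf for i, bf in filters.items()):
            return                                  # filtered out, emit nothing
        for i, fk in enumerate(fks, start=1):
            yield (fk, i), ("F", fks, r_f)
    elif isinstance(key, tuple):                    # restricted Di, keyed by job 1
        yield key, ("D", value)
    else:                                           # unrestricted Di, raw record
        pk, r_d = value
        yield (pk, key), ("D", r_d)

def reduce_gather(key, values):
    _, i = key
    dims = [v[1] for v in values if v[0] == "D"]
    facts = [v for v in values if v[0] == "F"]
    for r_d in dims:
        for _, fks, r_f in facts:
            yield fks, (i, r_d, r_f)

# Job 3: merge (identity map, aggregating reduce).
def map_identity(key, value):
    yield key, value

def reduce_merge(fks, values):
    by_dim = {i: r_d for i, r_d, _ in values}
    if len(by_dim) == N:                            # assumption: skip incomplete joins
        yield fks, tuple(by_dim[i] for i in range(1, N + 1)) + (values[0][2],)

# Invented sample input.
restricted_dims = [(1, (1, "AMERICA")), (1, (2, "ASIA"))]
free_dims = [(2, (10, "US")), (2, (11, "CHINA"))]
fact = [(0, ((1, 10), "order-a")), (0, ((2, 11), "order-b"))]

job1 = run_job(map_filter, reduce_filter, restricted_dims)
job2 = run_job(map_scatter, reduce_gather, fact + job1 + free_dims)
result = run_job(map_identity, reduce_merge, job2)
print(result)
```

The second fact tuple never reaches the shuffle of job 2 because its customer key is absent from the filter built in job 1, so only one joined record comes out of the merge job.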
Experimental Results The experiments show that the Scatter-Gather-Merge algorithm achieved better query performance with Bloom filters than without them. The same result held when the warehouse size was increased.