Published by Jesse Ferguson. Modified over 9 years ago.
1 Experimental Evidence on Partitioning in Parallel Data Warehouses
Pedro Furtado
Prof. at Univ. of Coimbra & Researcher at CISUC
DEI/CISUC-Universidade de Coimbra, Portugal
2 Pedro Furtado, DOLAP 2004
Context
Parallelism is used for major performance improvement in large data warehouses
Using a simple, low-cost shared-nothing architecture
–Without any efficiency requirements on network or nodes
NODE PARTITIONED DATA WAREHOUSE
Minimize inter-node data exchange requirements
–Fully partition the facts (the largest relations) horizontally; the rest of the relations are replicated
Hope to obtain near-linear speedup
3 … to run it n times faster
“Divide to conquer”
- Horizontally partition large facts (randomly) into n nodes
- Replicate the other relations (small dimensions?)
[Figure: Nodes 1, 2 and 3, each holding Sales and dimensions D1, D2, D3 and D4]
4 Why Replicate Dimensions?
We replicated dimensions so that we would not need to repartition
This wouldn't work with partitioned dimensions:
…and you can do other operations independently as well
5 Query processing
[Figure: each node computes SUM(X) over 1/n FACT, Ds GROUP BY dims; the partial results are merged with SUM(SUMs), which is equivalent to SUM(X) over FACT, dims GROUP BY dims]
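The rewrite on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the author's implementation: the helper names (`partial_sums`, `merge`, `split_round_robin`) and the toy data are assumptions, and in-memory lists stand in for the nodes.

```python
from collections import defaultdict

def partial_sums(fragment, dims):
    """Node-local step: SUM(x) GROUP BY dimension attribute over one fact fragment."""
    out = defaultdict(float)
    for dim_key, x in fragment:
        group = dims[dim_key]      # local join with the replicated dimension
        out[group] += x
    return dict(out)

def merge(partials):
    """Submitter step: SUM(SUMs) over the partial results of all nodes."""
    out = defaultdict(float)
    for partial in partials:
        for group, subtotal in partial.items():
            out[group] += subtotal
    return dict(out)

def split_round_robin(rows, n):
    """Hypothetical placement: spread the fact rows over n nodes."""
    return [rows[i::n] for i in range(n)]

# Toy fact rows (dimension key, measure) and a replicated dimension (made-up data).
facts = [(1, 10.0), (2, 5.0), (1, 2.5), (3, 4.0), (2, 1.0), (3, 7.5)]
dims = {1: "north", 2: "south", 3: "north"}

fragments = split_round_robin(facts, 3)
result = merge(partial_sums(f, dims) for f in fragments)
```

Because SUM is decomposable, `result` equals what a single node would compute over the whole fact table; aggregates such as AVG would need the rewrite to carry both a sum and a count.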
6 Query Processing Steps
[Figure: numbered steps between the submitter node and the computing nodes]
1. Rewrite Query
2. Send Query
3. Compute Partial Result
5. Send Partial Results
6. Apply Merge Query
Steps 4 and 7: Repartition and Redistribute
7 Problem (TPC-H case study)
[Figure: TPC-H schema with relations PartSupp, Supplier, Customer, Orders, Lineitem and Part, labelled by size (Very large, Large, Medium, ?)]
Many typical schemas are “complex”: many large relations may exist
8 Problem Statement
Divide by N … would expect N times faster - Linear Speedup (LS)
However, we don't get the LS
9 Our Major Contributions
Show these problems experimentally
–performance evaluation benchmark TPC-H
We EXPLAIN AND ILLUSTRATE the LARGE RELATIONS problem
Identify simple modifications to improve results
Analyze the modifications experimentally
10 Partitioning Facts (Largest)
LI + PS partitioned
[Figure: Node 1 to Node N, each showing the relations PS, S, C, O, Li and P]
11 Generated TPC-H 50GB into 1 and 25 nodes
Used PCs (Pentium III 866 MHz CPU) with 512MB RAM
Oracle 9i, tuned initial setting
TPC-H 22-query set
Measured response time: 1 node against 25 nodes
We show that the speedup underachievement is explained mostly by the size of replicated dimensions
12 Experimental Results
LS Speedup: 25-30
Medium Speedup: 6-15
Low Speedup: 2-6
Very Low Speedup: 0.4-1.9
Only a few queries exhibited near-to-LS!
13 Some had Linear Speedup…
LS Speedup: 25-30
Q1, Q6: access only fragments (Li/N)
Q15: S is reasonably small relative to Li/N
[Figure: schema diagrams (S, C, O, Li, P, PS) showing the relations accessed by each query group]
14 Others had smaller speedup…
Medium Speedup: 6-15
Q14, Q19: P is not small relative to the fragment (Li/N)
Q11: S is not small relative to PS/N
[Figure: schema diagrams (S, C, O, Li, P, PS) showing the relations accessed by each query group]
15 What Happened…
With N nodes we would like to:
–process 1/N of the data and have about N times speedup
However, we have replicated relations…
The amount of speedup degradation depends on the size of the replicated relation R2 relative to the fragment R1/N
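A back-of-the-envelope model makes the dependence on R2 versus R1/N concrete. The slides give no formula here, so the following is an assumption: cost is taken to be proportional to the amount of data a node must process, R1 is the partitioned relation, R2 the replicated one, and `estimated_speedup` is a hypothetical helper name.

```python
def estimated_speedup(r1, r2, n):
    """Rough speedup estimate under a linear cost assumption.

    r1: size of the partitioned relation (each node processes r1 / n)
    r2: size of the replicated relation (each node processes all of it)
    n:  number of nodes
    """
    single_node_cost = r1 + r2
    per_node_cost = r1 / n + r2
    return single_node_cost / per_node_cost

N = 25
no_replica = estimated_speedup(100, 0, N)       # only fragments are accessed
equal_to_fragment = estimated_speedup(100, 4, N)  # R2 as large as one fragment (R1/N = 4)
large_replica = estimated_speedup(100, 25, N)   # R2 a quarter of R1
```

Under this simplified model, a replicated relation that is only as large as one fragment already halves the per-node work reduction (13x instead of 25x), and a replicated relation a quarter the size of R1 brings the estimate down to roughly 4x. Real response times also depend on indexes, join methods and memory, which the model ignores.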
16 Low Speedup Queries: Speedup 2-5.5
Q3, Q5, Q7, Q10, Q12: O is large relative to Li/N
Q16: P is large relative to PS/N
[Figure: schema diagrams (S, C, O, Li, P, PS) showing the relations accessed by each query group, including Q9]
17 Very Low or No Speedup Queries: Speedup 0.4-2
Q13, Q22: process only replicated relations
Q8: includes all replicated relations
Q4, Q21, Q2: scenarios similar to “Slow Queries”
[Figure: schema diagrams (S, C, O, Li, P, PS) showing the relations accessed by each query group]
18 What Happened…
These queries not only include replicated relations…
…but the replicated relations they include are also very large in comparison to the fragments!
19 The same in pictures…
[Figure: four schema diagrams (S, C, O, Li, P, PS) labelled Large speedup, Medium speedup, Low speedup and No speedup at all; two of the panels carry the caption “O is large relative to Li/N”]
20 Back to Partitioning Alternatives…
Placement alternatives: relation in a single node vs replicated (all nodes) vs partitioned
Partitioning function (round-robin/random, range, hash)
Choice of partitioning attributes
[Figure: relations Product Supply History (PS), Orders (O), Lineitem (LI) and Customer (C) with candidate partitioning keys PS_key, O key and C key; several choices are marked “?”]
Repartitioning = re-hash by exchanging rows between nodes
When you partition more than one relation, you will probably need to repartition
e.g.: if you partition LI and O by O_KEY, they are “equi-partitioned”
… LI join PS needs repartitioning of LI
… O join C needs repartitioning of O
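The equi-partitioning example can be checked with a small simulation. This is a sketch under stated assumptions: integer keys with a modulo function standing in for the hash, made-up rows, and hypothetical helper names (`node_of`, `hash_partition`, `rows_to_ship`).

```python
def node_of(key, n):
    """Stand-in hash function for integer keys (an assumption for this sketch)."""
    return key % n

def hash_partition(rows, key_of, n):
    """Place each row on the node selected by hashing its partitioning attribute."""
    nodes = [[] for _ in range(n)]
    for row in rows:
        nodes[node_of(key_of(row), n)].append(row)
    return nodes

def rows_to_ship(nodes, join_key_of, n):
    """Count rows whose join key hashes to a node other than the one holding them."""
    return sum(
        1
        for node, rows in enumerate(nodes)
        for row in rows
        if node_of(join_key_of(row), n) != node
    )

N = 4
orders = [{"o_key": k, "c_key": (k * 7) % 5} for k in range(1, 9)]
lineitems = [{"o_key": k, "line": i} for k in range(1, 9) for i in range(2)]

# Partition both O and LI by O_KEY: they become equi-partitioned.
o_nodes = hash_partition(orders, lambda r: r["o_key"], N)
li_nodes = hash_partition(lineitems, lambda r: r["o_key"], N)

li_join_o = rows_to_ship(li_nodes, lambda r: r["o_key"], N)  # LI join O on O_KEY
o_join_c = rows_to_ship(o_nodes, lambda r: r["c_key"], N)    # O join C on C_KEY
```

`li_join_o` is 0 because every LI row already sits on the node that holds its order, while `o_join_c` is positive: most O rows would have to move before O can be joined locally with a C that is partitioned by C_KEY.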
21 Let's Review Related Work…
Replicate all but one relation – PRS [Yu et al., TKDE89]
–Similar to what we did: replicated all except LI
[Yu et al., TKDE89]: “Partition strategy for distributed query processing in fast local networks”
Partition using dependencies - PLACEMENT DEPENDENCY [Liu et al, ICDE96]
–e.g. partition ORDERS and co-locate its LINEITEM rows (LI is the dependent relation)
[Liu et al, ICDE96]: “A Distributed Query Processing Strategy Using Placement Dependency”
[Chen et al, ICPADS 2000]: “An Efficient Algorithm for Distributed Queries Using Partition Dependency”
Parallel Hash Join and Optimization - PHJ
–Relations are hash-partitioned; repartitioning is required to re-hash in order to JOIN
[DeWitt et al., VLDB11]: “Multiprocessor Hash-Based Join Algorithms”
[Liu et al, EDBT96]: “A Hash Partition Strategy for Distributed Query Processing”
[Kitsuregawa et al., 1983]: “Application of hash to database machine and its architecture”
[Shasha et al., TODS91]: “Optimizing Equijoin Queries In Distributed Databases … Hash Partitioned”
Workload-based Partitioning and Placement
–Determine best partitioning attributes automatically, based on the workload
[Daniel Zilio et al. 1994]: “Partitioning Key Selection for a Shared-Nothing Parallel Database System”
[Rao et al., SIGMOD 2000]: “Automating physical database design in a parallel database”
22 Local Replicated Join
Join fragment to replicated relation locally, no data exchanged
One relation must be replicated
–E.g. LI(O_KEY), O()
Cost local replicated join = [formula not recoverable], with N nodes, relations R, constant
23 Local Partitioned Join
Join fragments locally, no data exchanged
Relations must be equi-partitioned
–E.g. LI(O_KEY), O(O_KEY)
Cost local join = [formula not recoverable], with N nodes, relations R, constant
24 Repartition Join
Re-hash with data exchange, then join locally
Relation partitions are not co-located
–E.g. O(O_KEY), C(C_KEY)
Cost repartition join = [formula not recoverable], with constant weight factors
Depends on network configuration
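The two phases of a repartition join (re-hash with data exchange, then join locally) can be illustrated as follows. This is a single-process sketch, not a distributed implementation: lists stand in for nodes, modulo stands in for the hash function, and all names and rows are made up for the example.

```python
def hash_partition(rows, key, n):
    """Initial placement: hash each row on `key` (modulo as a stand-in hash)."""
    nodes = [[] for _ in range(n)]
    for row in rows:
        nodes[row[key] % n].append(row)
    return nodes

def repartition(nodes, key, n):
    """Re-hash existing fragments on `key`; also count rows that change node."""
    new_nodes = [[] for _ in range(n)]
    moved = 0
    for src, rows in enumerate(nodes):
        for row in rows:
            dst = row[key] % n
            new_nodes[dst].append(row)
            if dst != src:
                moved += 1          # this row would cross the network
    return new_nodes, moved

def local_join(left, right, key):
    """Simple in-memory hash join of two row lists on `key`."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

N = 3
orders = [{"o_key": k, "c_key": (k * 2) % 5} for k in range(10)]
customers = [{"c_key": c, "name": f"cust{c}"} for c in range(5)]

o_nodes = hash_partition(orders, "o_key", N)     # O(O_KEY)
c_nodes = hash_partition(customers, "c_key", N)  # C(C_KEY)

o_by_c, moved = repartition(o_nodes, "c_key", N)  # phase 1: re-hash O on C_KEY
parallel = [row for i in range(N)                 # phase 2: join locally per node
            for row in local_join(o_by_c[i], c_nodes[i], "c_key")]
central = local_join(orders, customers, "c_key")  # reference: single-node join
```

After repartitioning, the union of the node-local joins matches the single-node join, and `moved` gives the number of rows exchanged, which is the quantity a repartition cost term would weigh against the network configuration.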
25 Proposed Solution
“Very Small” Dimensions
–Replicate
–“Very small” depends on relation sizes and number of nodes
Non-small Dimensions
–Hash-partition by PRIMARY KEY, because they “always” join with the facts on the PK
–As in placement dependency, we take advantage of this invariant
Facts
–Find the hash-partitioning attribute that minimizes repartitioning costs
–Reasonable approximation: the most frequent equi-join attribute
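The three rules above can be turned into a small placement function. This is a hedged sketch rather than the author's procedure: the slides do not define “very small”, so the threshold used here (a dimension no larger than `small_fraction` of one fragment of the largest relation) is an assumption, as are the function name, the row counts and the join counts.

```python
from collections import Counter

def choose_placement(relations, joins, n_nodes, small_fraction=0.1):
    """Pick a placement per relation.

    relations: name -> {"rows": int, "kind": "fact" or "dimension", "pk": str}
    joins:     list of (relation, attribute), one entry per equi-join in the workload
    """
    largest_fragment = max(r["rows"] for r in relations.values()) / n_nodes
    plan = {}
    for name, rel in relations.items():
        if rel["kind"] == "dimension":
            if rel["rows"] <= small_fraction * largest_fragment:
                plan[name] = ("replicate", None)      # very small dimension
            else:
                plan[name] = ("hash", rel["pk"])      # non-small dimension: by PK
        else:
            counts = Counter(attr for r, attr in joins if r == name)
            # Approximation from the slide: most frequent equi-join attribute.
            attr = counts.most_common(1)[0][0] if counts else rel["pk"]
            plan[name] = ("hash", attr)
    return plan

# Illustrative, made-up sizes and workload using TPC-H-like names.
relations = {
    "LI": {"rows": 6_000_000, "kind": "fact", "pk": "o_key"},
    "PS": {"rows": 800_000, "kind": "fact", "pk": "p_key"},
    "O": {"rows": 1_500_000, "kind": "dimension", "pk": "o_key"},
    "P": {"rows": 200_000, "kind": "dimension", "pk": "p_key"},
    "S": {"rows": 10_000, "kind": "dimension", "pk": "s_key"},
}
joins = ([("LI", "o_key")] * 5 + [("LI", "p_key")] * 2 + [("LI", "s_key")]
         + [("PS", "p_key")] * 3 + [("PS", "s_key")] * 2)

plan = choose_placement(relations, joins, n_nodes=25)
```

With these inputs the sketch replicates S, hash-partitions O and P by their primary keys, and partitions LI by `o_key` and PS by `p_key`. Counting join frequency is only the approximation named on the slide; a fuller version would weigh each candidate attribute by its estimated repartitioning cost.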
26 Result of Partitioning (TPC-H)
[Figure: TPC-H relations O, Li, P, PS, S and C with partitioning keys O_KEY and P_KEY; legend: Local Join (equi-partitioned), Replicated Join (with small dimension), Repartitioned Join]
27 Experimental Results
Ship only selected rows from LI
… LI join P
28 Repartition vs Total Runtime
TC = total runtime
RC = repartition time
Repartition time is reasonably small…
Depends on: number of nodes + selectivities
–(can be very dependent on the selection conditions of the specific query)
29 Conclusions
We have analyzed a basic partitioning strategy (PRS-like)
–The largest relation is partitioned, the others are replicated
–The speedup is totally unsatisfactory for many queries
We analyzed why this happens: it is explained by access patterns to replicated relations
We tried a very simple partitioning alternative
–Only very small relations are replicated
–Dimensions are partitioned by primary key
–Facts are hash-partitioned, partitioning key = most frequent join attribute
We have shown that it works well
–prevents very low speedup
–provides near-linear speedup for most queries
30 Thank You! Questions?
www.eden.dei.uc.pt/~pnf
pnf@dei.uc.pt