
1 Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems. Martina Albutiu, Alfons Kemper, and Thomas Neumann, Technische Universität München

2 Hardware trends …
– Huge main memory
– Massive processing parallelism
– Non-uniform memory access (NUMA)
Our server: 4 CPUs, 32 cores, 1 TB RAM, 4 NUMA partitions

3 Main memory database systems
– VoltDB, Hana, MonetDB
– HyPer: real-time business intelligence queries on transactional data*
* http://www-db.in.tum.de/research/projects/HyPer/

4 How to exploit these hardware trends?
– Parallelize algorithms
– Exploit fast main memory access
  Kim, Sedlar, Chhugani: Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. VLDB 2009
  Blanas, Li, Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD 2011
– AND be aware of fast local vs. slow remote NUMA access

5 Ignoring NUMA [figure: cores 1–8 spread across NUMA partitions 1–4 all write into one shared hash table that spans the partitions]

6 How much difference does NUMA make? [chart: 100% scaled execution times]
– sort: 22756 ms remote vs. 7440 ms local
– partitioning: 417344 ms synchronized vs. 12946 ms sequential
– merge join (sequential read): 1000 ms remote vs. 837 ms local

7 The three NUMA commandments
C1 Thou shalt not write thy neighbor's memory randomly -- chunk the data, redistribute, and then sort/work on your data locally.
C2 Thou shalt read thy neighbor's memory only sequentially -- let the prefetcher hide the remote access latency.
C3 Thou shalt not wait for thy neighbors -- don't use fine-grained latching or locking and avoid synchronization points of parallel threads.
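As one concrete way to follow C1 on Linux, the sketch below pins a worker thread to each NUMA node and hands it node-local memory to chunk and sort; libnuma is assumed to be available, and the chunk size, worker body, and one-worker-per-node mapping are illustrative choices, not taken from the talk.

```cpp
// Sketch: one worker thread per NUMA node, each pinned to its node and
// working on node-local memory (commandment C1). Assumes Linux with
// libnuma (link with -lnuma); the 64 MiB chunk size is purely illustrative.
#include <numa.h>

#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

static void worker(int node, std::size_t chunk_bytes) {
    numa_run_on_node(node);                              // run on this node's cores
    void* chunk = numa_alloc_onnode(chunk_bytes, node);  // node-local allocation
    if (chunk == nullptr) return;
    // ... chunk / sort / merge-join the local data here, no remote random writes ...
    numa_free(chunk, chunk_bytes);
}

int main() {
    if (numa_available() < 0) return EXIT_FAILURE;       // no NUMA support on this box
    const int nodes = numa_num_configured_nodes();
    std::vector<std::thread> workers;
    for (int n = 0; n < nodes; ++n)
        workers.emplace_back(worker, n, std::size_t{1} << 26);
    for (std::thread& t : workers) t.join();
    return 0;
}
```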

8 Basic idea of MPSM [figure: R and S are each split into per-worker chunks (R chunks, S chunks)]

9 Basic idea of MPSM
– C1: Work locally: sort
– C3: Work independently: sort and merge join
– C2: Access neighbors' data only sequentially
[figure: sort R chunks locally, sort S chunks locally, then merge join (MJ) the chunks]
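A minimal C++ sketch of these phases (placeholder types and function names, not the authors' implementation): each worker sorts its own R and S chunk locally, then merge-joins its R chunk against every S chunk.

```cpp
// Illustrative outline of the MPSM phases: each worker sorts its own R and
// S chunk locally (C1, C3), then merge-joins its R chunk against every S
// chunk; remote S chunks are only read sequentially (C2).
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Tuple { std::uint64_t key; std::uint64_t payload; };
using Chunk  = std::vector<Tuple>;
using Result = std::vector<std::pair<Tuple, Tuple>>;

inline bool by_key(const Tuple& a, const Tuple& b) { return a.key < b.key; }

// Classic merge join of two sorted runs; emits all matching (r, s) pairs.
void merge_join(const Chunk& r, const Chunk& s, Result& out) {
    std::size_t i = 0, j = 0;
    while (i < r.size() && j < s.size()) {
        if (r[i].key < s[j].key)      ++i;
        else if (s[j].key < r[i].key) ++j;
        else {
            for (std::size_t j2 = j; j2 < s.size() && s[j2].key == r[i].key; ++j2)
                out.emplace_back(r[i], s[j2]);
            ++i;                                  // j stays: next r may match too
        }
    }
}

// One worker's view: myR is its R chunk, allS holds the S chunks of all
// workers, myIdx is its own position in allS.
void mpsm_worker(Chunk& myR, std::vector<Chunk>& allS, std::size_t myIdx,
                 Result& out) {
    std::sort(myR.begin(), myR.end(), by_key);                  // local sort of R
    std::sort(allS[myIdx].begin(), allS[myIdx].end(), by_key);  // local sort of S
    // (barrier: wait until every worker has sorted its S chunk)
    for (const Chunk& s : allS)
        merge_join(myR, s, out);       // remote chunks are scanned sequentially
}
```

In this sketch the only coordination point is the barrier between the sort and the join phase; everything else runs independently, in line with C3.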

10 Range partitioning of private input R
– to constrain merge join work
– to provide scalability in the number of parallel workers

11 Range partitioning of private input R (cont.) [figure: the R chunks are range-partitioned, yielding range-partitioned R chunks]

12 Range partitioning of private input R (cont.): S is implicitly partitioned [figure: the range-partitioned R chunks and the S chunks are sorted]

13 Range partitioning of private input R (cont.): S is implicitly partitioned [figure: each worker merge joins (MJ) only the relevant parts of the sorted S chunks]
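Because each worker's R partition covers a contiguous key range, the relevant part of every sorted S chunk can be found with two binary searches before the merge join. A short sketch, reusing the placeholder Tuple/Chunk/Result/merge_join definitions from the outline above and assuming hypothetical partition bounds lo and hi:

```cpp
// Sketch: the worker owning key range [lo, hi) only needs that range from
// each sorted S chunk. Reuses Tuple/Chunk/Result/merge_join from the
// earlier outline; lo and hi are this worker's partition bounds.
#include <algorithm>
#include <cstdint>

void join_relevant_part(const Chunk& myR, const Chunk& s,
                        std::uint64_t lo, std::uint64_t hi, Result& out) {
    auto key_less = [](const Tuple& t, std::uint64_t k) { return t.key < k; };
    auto first = std::lower_bound(s.begin(), s.end(), lo, key_less); // first key >= lo
    auto last  = std::lower_bound(s.begin(), s.end(), hi, key_less); // first key >= hi
    Chunk relevant(first, last);   // copied for brevity; iterators would avoid the copy
    merge_join(myR, relevant, out);
}
```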

14 Range partitioning of private input R is
– time efficient: branch-free, comparison-free, synchronization-free
– space efficient: densely packed, in-place
by radix-clustering the keys and scattering the data into precomputed target partitions.

15 Range partitioning of private input R: example
– chunk of worker W1: 9 19 7 3 21 1 17; chunk of worker W2: 2 23 4 31 8 20 26
– radix histograms over the partition bit (split at 16; e.g. 7 = 00111 → <16, 17 = 10001 → ≥16, 19 = 10011 → ≥16, 2 = 00010 → <16): W1: 4 / 3, W2: 3 / 4
– prefix sums (write offsets per target partition): W1: 0 / 0, W2: 4 / 3

16 Range partitioning of private input R: example (cont.) [figure: using the prefix sums as write offsets, both workers scatter their tuples into the <16 and ≥16 target partitions without any synchronization; e.g. 19 = 10011 is written to the ≥16 partition]
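The example above follows the usual histogram / prefix-sum / scatter pattern. Below is a sketch of that pattern, assuming power-of-two partition bounds so that the target partition is just a key shift; the names and the shift parameter are illustrative.

```cpp
// Sketch of the branch-free, synchronization-free partitioning step:
// radix-cluster on the high key bits, build per-worker histograms, turn
// them into write offsets with a prefix sum, then scatter.
#include <cstddef>
#include <cstdint>
#include <vector>

using Key = std::uint64_t;

// Target partition = high-order radix bits (shift = 4 splits 5-bit keys at
// 16, exactly as in the example with 7 = 00111 and 17 = 10001).
inline std::size_t partition_of(Key k, unsigned shift) { return k >> shift; }

// Phase 1: each worker counts its keys per target partition (no comparisons).
std::vector<std::size_t> histogram(const std::vector<Key>& chunk,
                                   unsigned shift, std::size_t partitions) {
    std::vector<std::size_t> h(partitions, 0);
    for (Key k : chunk) ++h[partition_of(k, shift)];
    return h;
}

// Phase 2: a prefix sum over all histograms gives every worker a private,
// disjoint write offset inside each target partition -- no latches needed.
std::vector<std::vector<std::size_t>>
prefix_sums(const std::vector<std::vector<std::size_t>>& hist) {
    const std::size_t workers = hist.size(), parts = hist[0].size();
    std::vector<std::vector<std::size_t>> offsets(
        workers, std::vector<std::size_t>(parts, 0));
    for (std::size_t p = 0; p < parts; ++p) {
        std::size_t running = 0;
        for (std::size_t w = 0; w < workers; ++w) {
            offsets[w][p] = running;
            running += hist[w][p];
        }
    }
    return offsets;
}

// Phase 3: scatter to the precomputed positions (partitions[p] must already
// be sized to the total count of partition p across all workers).
void scatter(const std::vector<Key>& chunk, unsigned shift,
             std::vector<std::size_t> my_offsets,
             std::vector<std::vector<Key>>& partitions) {
    for (Key k : chunk) {
        const std::size_t p = partition_of(k, shift);
        partitions[p][my_offsets[p]++] = k;
    }
}
```

Plugging in the slide's example (shift = 4): W1's histogram is {4, 3} and W2's is {3, 4}, so the prefix sums give W1 the write offsets {0, 0} and W2 the offsets {4, 3}, matching the figure.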

17 Real C hacker at work …

18 Skew resilience of MPSM
– Location skew is implicitly handled
– Distribution skew:
  – dynamically computed partition bounds
  – determined based on the global data distributions of R and S
  – cost balancing for sorting R and joining R and S

19 Skew resilience, step 1: global S data distribution
– local equi-height histograms (obtained for free)
– combined into a CDF
[figure: histogram bounds of S1 (7 1 10 15 22 31 66) and S2 (2 12 17 25 33 42 78 81 90) combined into a CDF of # tuples over key value]
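A sketch of how such local equi-height histogram bounds could be merged into an approximate global CDF; the CdfPoint representation and the bucket_height parameter are illustrative assumptions, not the paper's exact data structure.

```cpp
// Sketch: combine local equi-height histograms (every bucket holds the same
// number of tuples, so only the bucket boundary keys need to be kept) into
// an approximate global CDF of S.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct CdfPoint { std::uint64_t key; std::size_t tuples_up_to_key; };

std::vector<CdfPoint>
combine_to_cdf(const std::vector<std::vector<std::uint64_t>>& local_bounds,
               std::size_t bucket_height) {
    // Gather every local bucket boundary; each boundary stands for
    // bucket_height tuples of its worker.
    std::vector<std::uint64_t> bounds;
    for (const auto& worker_bounds : local_bounds)
        bounds.insert(bounds.end(), worker_bounds.begin(), worker_bounds.end());
    std::sort(bounds.begin(), bounds.end());

    std::vector<CdfPoint> cdf;
    std::size_t cumulative = 0;
    for (std::uint64_t b : bounds) {
        cumulative += bucket_height;       // one full bucket ends at key b
        cdf.push_back({b, cumulative});
    }
    return cdf;                            // step function: key -> #tuples <= key
}
```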

20 Skew resilience, step 2: global R data distribution
– local equi-width histograms as before, but more fine-grained
[figure: R1 chunk (13 4 2 31 20 8 6) counted into buckets <8, [8,16), [16,24), ≥24, giving 3 2 1 1; the bucket is read off the high-order key bits, e.g. 8 = 01000, 2 = 00010]

21 Skew resilience, step 3: compute splitters so that the overall workloads are balanced*: greedily combine buckets such that the cost of each thread for sorting R and for joining R and S is balanced. [figure: the R histogram (3 2 1 1) is combined with the S CDF]
* Ross and Cieslewicz: Optimal Splitters for Database Partitioning with Size Bounds. ICDT 2009
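A simplified greedy sketch of such a splitter computation; the per-bucket cost model (sorting R_i plus merge-joining R_i with S_i) and the bucket layout are illustrative assumptions rather than the paper's exact formulation.

```cpp
// Simplified greedy splitter sketch: walk over the fine-grained R histogram
// buckets, look up the matching S cardinality (from the CDF), and cut a new
// partition whenever the accumulated cost reaches its fair share.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Bucket { std::uint64_t upper_key; std::size_t r_tuples, s_tuples; };

// Illustrative cost estimate: sort R_i plus merge-join R_i with S_i.
double bucket_cost(const Bucket& b) {
    const double r = static_cast<double>(b.r_tuples);
    const double s = static_cast<double>(b.s_tuples);
    return r * std::log2(r + 1.0) + r + s;
}

std::vector<std::uint64_t>
greedy_splitters(const std::vector<Bucket>& buckets, std::size_t workers) {
    double total = 0.0;
    for (const Bucket& b : buckets) total += bucket_cost(b);
    const double share = total / static_cast<double>(workers);

    std::vector<std::uint64_t> splitters;
    double acc = 0.0;
    for (const Bucket& b : buckets) {
        acc += bucket_cost(b);
        if (acc >= share && splitters.size() + 1 < workers) {
            splitters.push_back(b.upper_key);   // close the current partition here
            acc = 0.0;
        }
    }
    return splitters;                           // workers-1 partition bounds
}
```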

22 Performance evaluation
MPSM performance in a nutshell:
– 160 million tuples joined per second
– 27 billion tuples joined in less than 3 minutes
– scales linearly with the number of cores
Platform HyPer1:
– Linux server, 1 TB RAM
– 4 CPUs with 8 physical cores each
Benchmark:
– join tables R and S with schema {[joinkey: 64 bit, payload: 64 bit]}
– dataset sizes ranging from 50 GB to 400 GB

23 Execution time comparison: MPSM, Vectorwise (VW), and the hash join of Blanas et al.*
32 workers, |R| = 1600 million tuples (25 GB), varying size of S
* S. Blanas, Y. Li, and J. M. Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD 2011

24 Scalability in the number of cores: MPSM and Vectorwise (VW)
|R| = 1600 million tuples (25 GB), |S| = 4·|R|

25 Location skew
– Location skew in R has no effect because of repartitioning
– Location skew in S: in the extreme case all join partners of Ri are found in only one Sj (either local or remote)

26 Distribution skew: anti-correlated data [charts: execution times without vs. with balanced partitioning]

27 Distribution skew: anti-correlated data (cont.)

28 Conclusions
– MPSM is a sort-based parallel join algorithm
– MPSM is NUMA-aware & NUMA-oblivious
– MPSM is space efficient (works in-place)
– MPSM scales linearly in the number of cores
– MPSM is skew resilient
– MPSM outperforms Vectorwise (4X) and Blanas et al.'s hash join (18X)
– MPSM is adaptable for disk-based processing (see details in the paper)

29 Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems. Martina Albutiu, Alfons Kemper, and Thomas Neumann, Technische Universität München. THANK YOU FOR YOUR ATTENTION!

