Technische Universität München: Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems. Martina Albutiu, Alfons Kemper, and Thomas Neumann.


Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems
Martina Albutiu, Alfons Kemper, and Thomas Neumann
Technische Universität München

Hardware trends …
– Massive processing parallelism
– Huge main memory
– Non-uniform Memory Access (NUMA)
Our server:
– 4 CPUs
– 32 cores
– 1 TB RAM
– 4 NUMA partitions

Main memory database systems
– VoltDB, SAP HANA, MonetDB
– HyPer*: real-time business intelligence queries on transactional data
– MPSM for efficient OLAP query processing
* A. Kemper, T. Neumann. ICDE 2011

How to exploit these hardware trends?
– Parallelize algorithms
– Exploit fast main memory access
  Kim, Sedlar, Chhugani et al.: Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. VLDB '09
  Blanas, Li, Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD '11
– AND be aware of fast local vs. slow remote NUMA access

How much difference does NUMA make?
[Chart: 100% scaled execution times of sort, partitioning, and merge join (sequential read), each measured with local vs. remote memory access and synchronized vs. sequential access; measured times include 837 ms, 1000 ms, and 7440 ms, showing that remote and synchronized access are several times slower.]

Ignoring NUMA …
[Figure: a single hash table spanning NUMA partitions 1-4, accessed by cores 1-8 across partition boundaries.]

The three NUMA commandments
C1 Thou shalt not write thy neighbor's memory randomly -- chunk the data, redistribute, and then sort/work on your data locally.
C2 Thou shalt read thy neighbor's memory only sequentially -- let the prefetcher hide the remote access latency.
C3 Thou shalt not wait for thy neighbors -- don't use fine-grained latching or locking and avoid synchronization points of parallel threads.

Basic idea of MPSM
[Figure: the private input R and the public input S are each chunked per worker, yielding R chunks and S chunks.]

Basic idea of MPSM
– C1: Work locally: sort
– C2: Access neighbors' data only sequentially
– C3: Work independently: sort and merge join
[Figure: each worker sorts its R chunk and S chunk locally, then merge joins (MJ) its R chunk against the sorted S chunks.]
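The sort-then-merge-join idea can be sketched as follows. This is a single-threaded simulation with hypothetical helper names (`mpsm_join`, `sort_merge_join`), not the authors' implementation; it corresponds to the basic variant in which every sorted R run is merge joined against every sorted S run.

```python
def sort_merge_join(r_sorted, s_sorted):
    """Merge join two sorted lists of (key, payload) tuples on key."""
    out, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        if r_sorted[i][0] < s_sorted[j][0]:
            i += 1
        elif r_sorted[i][0] > s_sorted[j][0]:
            j += 1
        else:
            key = r_sorted[i][0]
            # find the runs of equal keys on both sides
            i2 = i
            while i2 < len(r_sorted) and r_sorted[i2][0] == key:
                i2 += 1
            j2 = j
            while j2 < len(s_sorted) and s_sorted[j2][0] == key:
                j2 += 1
            # emit the cross product of the two equal-key runs
            for a in range(i, i2):
                for b in range(j, j2):
                    out.append((key, r_sorted[a][1], s_sorted[b][1]))
            i, j = i2, j2
    return out

def mpsm_join(r_chunks, s_chunks):
    """Each 'worker' sorts its own chunks locally (C1), then merge joins
    its R run against every sorted S run, scanning each run only
    sequentially (C2).  Workers are simulated one after another here; in
    MPSM they run as parallel threads without synchronization (C3)."""
    sorted_r = [sorted(c) for c in r_chunks]  # local sorts
    sorted_s = [sorted(c) for c in s_chunks]
    result = []
    for r_run in sorted_r:
        for s_run in sorted_s:
            result.extend(sort_merge_join(r_run, s_run))
    return result
```

Because every S run is read front to back, remote NUMA accesses stay sequential and can be hidden by the hardware prefetcher.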

Range partitioning of private input R
– To constrain merge join work
– To provide scalability in the number of parallel workers
– S is implicitly partitioned
[Figure build-up: the R chunks are range partitioned into range-partitioned R chunks; the R chunks and S chunks are sorted locally; each merge join (MJ) then touches only the relevant parts of S.]

Range partitioning of private input R
Time efficient:
– branch-free
– comparison-free
– synchronization-free
and space efficient:
– densely packed
– in-place
by using radix-clustering and precomputed target partitions to scatter the data to.

Range partitioning of private input
[Figure: workers W1 and W2 each build a histogram over their chunk by radix-clustering the keys (e.g., counting keys < 16 and >= 16); the histograms are combined into per-worker prefix sums that give every worker exclusive write offsets into the target partitions, so all workers scatter their tuples without synchronization.]
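The histogram / prefix-sum / scatter scheme in the figure can be sketched as follows. This is a simplification: `part_of` compares against explicit boundaries, whereas MPSM's radix-clustering reads key bits and is comparison-free; all names are hypothetical.

```python
def partition_with_histograms(chunks, boundaries):
    """Range-partition per-worker chunks into one densely packed array.
    boundaries[p] is the exclusive upper bound of partition p.  Phase 1
    counts per worker (histograms), phase 2 turns the counts into
    exclusive write offsets via a prefix sum in (partition, worker)
    order, phase 3 scatters: each worker writes only into its own
    offset ranges, so no synchronization is needed."""
    def part_of(key):
        for p, b in enumerate(boundaries):
            if key < b:
                return p
        return len(boundaries)

    n_parts = len(boundaries) + 1
    # Phase 1: per-worker histograms.
    hists = []
    for chunk in chunks:
        h = [0] * n_parts
        for key in chunk:
            h[part_of(key)] += 1
        hists.append(h)
    # Phase 2: prefix sums give each worker its write cursor per partition.
    offsets, total = [], 0
    for p in range(n_parts):
        row = []
        for w in range(len(chunks)):
            row.append(total)
            total += hists[w][p]
        offsets.append(row)
    # Phase 3: scatter into the preallocated target array.
    out = [None] * total
    cursors = [row[:] for row in offsets]
    for w, chunk in enumerate(chunks):
        for key in chunk:
            p = part_of(key)
            out[cursors[p][w]] = key
            cursors[p][w] += 1
    part_starts = [offsets[p][0] for p in range(n_parts)]
    return out, part_starts
```

With chunks `[7, 1, 20]` and `[3, 16]` and boundaries `[8, 16]`, the scatter yields one contiguous array in which each partition's tuples are densely packed.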

Skew resilience of MPSM
– Location skew is implicitly handled
– Distribution skew:
  – Dynamically computed partition bounds
  – Determined based on the global data distributions of R and S
  – Cost balancing for sorting R and joining R and S
[Figure: skewed chunks S1, S2 and R1, R2.]

Skew resilience
1. Global S data distribution
– Local equi-height histograms (for free)
– Combined into a Cumulative Distribution Function (CDF)
[Figure: per-chunk histograms of S1 and S2 merged into a CDF of # tuples over the key value domain.]
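A minimal sketch of combining the local equi-height histograms into a global CDF, under the assumption that each local histogram is just the sorted list of its bucket upper-bound keys (the function name and representation are hypothetical, not from the paper):

```python
from bisect import bisect_right

def build_cdf(local_hists, bucket_size):
    """Combine per-worker equi-height histograms into a global CDF
    estimate.  Each local histogram is a sorted list of bucket
    upper-bound keys, and every bucket holds exactly bucket_size
    tuples.  cdf(key) estimates how many S tuples across all chunks
    have a join key <= key."""
    def cdf(key):
        total = 0
        for bounds in local_hists:
            # every bucket whose upper bound is <= key lies fully below key
            total += bisect_right(bounds, key) * bucket_size
        return total
    return cdf
```

Equi-height histograms come "for free" because each worker can record every `bucket_size`-th key while scanning its sorted chunk.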

Skew resilience
2. Global R data distribution
– Local equi-width histograms as before
– More fine-grained histograms
[Figure: R1's histogram with buckets <8, [8,16), [16,24), …]

Skew resilience
3. Compute splitters so that the overall workloads are balanced*: greedily combine buckets so that each thread's costs for sorting R and joining R and S are balanced.
[Figure: the R histogram and the S CDF combined to derive the splitters.]
* Ross and Cieslewicz: Optimal Splitters for Database Partitioning with Size Bounds. ICDT '09
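The greedy splitter computation might look as follows. The linear cost model is a deliberate simplification (the paper balances the actual sorting and joining costs, cf. Ross and Cieslewicz), and all names are hypothetical.

```python
def compute_splitters(r_counts, s_counts, n_workers):
    """Greedily combine fine-grained histogram buckets into n_workers
    key ranges of roughly equal cost.  r_counts[i] and s_counts[i] are
    the R and S tuple counts falling into bucket i (from the R
    histograms and the S CDF).  Cost model (an assumption here): linear
    in the number of tuples each thread must sort and join."""
    total = sum(r_counts) + sum(s_counts)
    target = total / n_workers
    splitters, acc = [], 0
    for i in range(len(r_counts)):
        acc += r_counts[i] + s_counts[i]
        if acc >= target and len(splitters) < n_workers - 1:
            splitters.append(i)  # partition boundary after bucket i
            acc = 0
    return splitters
```

For uniform buckets the splitters fall at the midpoints; for a heavily skewed first bucket the first range shrinks to that single bucket, keeping the per-thread work balanced.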

Performance evaluation
MPSM performance in a nutshell:
– 160 million records joined per second
– 27 billion records joined in less than 3 minutes
– scales linearly with the number of cores
Platform:
– Linux server
– 1 TB RAM
– 32 cores
Benchmark:
– Join tables R and S with schema {[joinkey: 64 bit, payload: 64 bit]}
– Dataset sizes ranging from 50 GB to 400 GB

Execution time comparison
– MPSM, Vectorwise (VW), and Blanas' hash join*
– 32 workers
– |R| = 1600 million (25 GB), varying size of S
* S. Blanas, Y. Li, and J. M. Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD '11

Scalability in the number of cores
– MPSM and Vectorwise (VW)
– |R| = 1600 million (25 GB), |S| = 4·|R|

Location skew
– Location skew in R has no effect because of the repartitioning
– Location skew in S: in the extreme case, all join partners of Ri are found in only one Sj (either local or remote)

Distribution skew: anti-correlated data
– R: 80% of the join keys are within the 20% high end of the key domain
– S: 80% of the join keys are within the 20% low end of the key domain
[Charts: execution times without vs. with balanced partitioning.]

Distribution skew: anti-correlated data
[Chart: execution times under anti-correlated distribution skew.]

Conclusions
– MPSM is a sort-based parallel join algorithm
– MPSM is NUMA-aware & NUMA-oblivious
– MPSM is space efficient (works in-place)
– MPSM scales linearly in the number of cores
– MPSM is skew resilient
– MPSM outperforms Vectorwise (4X) and Blanas et al.'s hash join (18X)
– MPSM is adaptable for disk-based processing (see details in the paper)

Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems
Martina Albutiu, Alfons Kemper, and Thomas Neumann
Technische Universität München
THANK YOU FOR YOUR ATTENTION!