Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture
Jiong He, Mian Lu, Bingsheng He
School of Computer Engineering, Nanyang Technological University
27th Aug 2013

Outline
– Motivations
– System Design
– Evaluations
– Conclusions

Importance of Hash Joins
In-memory databases
– Enable GBs or even TBs of data to reside in main memory (e.g., on large-memory commodity servers)
– Are a hot research topic recently
Hash joins
– The most efficient join algorithm in main-memory databases
– Focus: simple hash joins (SHJ, ICDE 2004) and partitioned hash joins (PHJ, VLDB 1999); a sketch of SHJ follows below
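To make the SHJ idea concrete, here is a minimal single-threaded sketch of its two phases (build a hash table on R, then probe it with S) over the (key, record-ID) schema used later in the evaluation. It is illustrative only, not the paper's OpenCL implementation.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// One tuple: (key, record-ID), matching the two-attribute schema
// used in the evaluation section.
struct Tuple { std::int32_t key; std::int32_t rid; };

// Simple hash join (SHJ): build a hash table on R, then probe it with S,
// emitting matching (R.rid, S.rid) pairs. Single-threaded CPU sketch only.
std::vector<std::pair<std::int32_t, std::int32_t>>
simple_hash_join(const std::vector<Tuple>& R, const std::vector<Tuple>& S) {
    std::unordered_multimap<std::int32_t, std::int32_t> table;
    table.reserve(R.size());
    for (const Tuple& r : R)                       // build phase
        table.emplace(r.key, r.rid);

    std::vector<std::pair<std::int32_t, std::int32_t>> out;
    for (const Tuple& s : S) {                     // probe phase
        auto range = table.equal_range(s.key);
        for (auto it = range.first; it != range.second; ++it)
            out.emplace_back(it->second, s.rid);
    }
    return out;
}
```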

Hash Joins on New Architectures
Emerging hardware
– Multi-core CPUs (8-core, 16-core, even many-core)
– Massively parallel GPUs (NVIDIA, AMD, Intel, etc.)
Query co-processing on new hardware
– On multi-core CPUs: SIGMOD'11 (S. Blanas), …
– On GPUs: SIGMOD'08 (B. He), VLDB'09 (C. Kim), …
– On Cell: ICDE'07 (K. Ross), …

Bottlenecks
Conventional query co-processing is inefficient
– Data transfer overhead via PCI-e
– Imbalanced workload distribution
[Figure: discrete architecture — CPU with its cache on main memory, GPU with its cache on device memory, linked by PCI-e]
– CPU: light-weight workload (create context, send and receive data, launch the GPU program, post-processing)
– GPU: heavy-weight workload (all real computation)

The Coupled Architecture
[Figure: coupled architecture — CPU and GPU share the cache and main memory]
Coupled CPU-GPU architecture
– Intel Sandy Bridge, AMD Fusion APU, etc.
New opportunities
– Remove the data transfer overhead (see the zero-copy sketch below)
– Enable fine-grained workload scheduling
– Increase cache reuse
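As an illustration of why the transfer overhead disappears, here is a hedged OpenCL host-side sketch: on a coupled (APU-style) device, a buffer created with CL_MEM_ALLOC_HOST_PTR is allocated in the shared main memory, so both processors can operate on it without a PCI-e copy. The helper name is ours, and error handling is omitted for brevity.

```cpp
#include <CL/cl.h>
#include <cstddef>

// Hypothetical helper: allocate one zero-copy buffer that CPU and GPU
// kernels on a coupled (APU-style) device can both access directly.
cl_mem make_shared_buffer(cl_context ctx, std::size_t bytes) {
    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, /*host_ptr=*/nullptr, &err);
    // For CPU-side access, map it with clEnqueueMapBuffer rather than
    // copying through clEnqueueReadBuffer/clEnqueueWriteBuffer.
    return err == CL_SUCCESS ? buf : nullptr;
}
```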

Challenges Come with Opportunities
Efficient data sharing
– Share main memory
– Share the last-level cache (LLC)
Keep both processors busy
– The GPU alone cannot dominate the performance
– Assign suitable tasks to each device for maximum speedup

Outline
– Motivations
– System Design
– Evaluations
– Conclusions

Fine-Grained Definition of Steps for Co-Processing
Hash join consists of three stages (partition, build, and probe)
Each stage consists of multiple steps; taking build as an example (a sketch follows below):
– b1: compute the hash bucket number
– b2: access the hash bucket header
– b3: search the key list
– b4: insert the tuple
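A sketch of how the build stage can be decomposed into those four steps over a chained hash table. The layout (bucket headers plus a node pool) and all names are our assumptions for illustration; the point is that each step has a distinct compute and memory-access pattern, which is what lets each step be costed and scheduled separately.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative decomposition of the build stage into the slide's four steps,
// over a chained hash table. In the paper's design each step runs as its own
// kernel so the CPU/GPU split can be tuned per step.
struct Node { std::int32_t key; std::int32_t rid; std::int32_t next; };  // next = -1 ends a chain

struct HashTable {
    std::vector<std::int32_t> heads;   // one header per bucket, -1 = empty
    std::vector<Node>         nodes;   // node pool shared by all buckets
    explicit HashTable(std::size_t nBuckets) : heads(nBuckets, -1) {}
};

void build_one(HashTable& ht, std::int32_t key, std::int32_t rid) {
    std::size_t b =
        static_cast<std::uint32_t>(key) % ht.heads.size();   // b1: compute hash bucket
    std::int32_t head = ht.heads[b];                         // b2: access bucket header
    std::int32_t cur = head;                                 // b3: search the key list
    while (cur != -1 && ht.nodes[cur].key != key)            //     (list-traversal pattern)
        cur = ht.nodes[cur].next;
    ht.nodes.push_back({key, rid, head});                    // b4: insert at chain head
    ht.heads[b] = static_cast<std::int32_t>(ht.nodes.size() - 1);
}
```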

Co-Processing Mechanisms
We study three co-processing mechanisms:
– Off-loading (OL)
– Data-dividing (DD)
– Pipeline (PL)
With the fine-grained step definition of hash joins, we can easily implement algorithms under any of these mechanisms

Off-loading (OL)
Method: offload a whole step to one device
Advantage: easy to schedule
Disadvantage: workload imbalance
[Figure: each step assigned wholly to either the CPU or the GPU]

Data-dividing (DD)
Method: partition the input at the stage level
Advantages: easy to schedule, no imbalance
Disadvantage: devices are underutilized
[Figure: each stage's input divided between the CPU and the GPU]

Pipeline (PL)
Method: partition the input at the step level (see the split sketch after this slide)
Advantages: balanced, devices are fully utilized
Disadvantage: hard to schedule
[Figure: each step's input divided between the CPU and the GPU]
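The difference between DD and PL boils down to where the input is split. A minimal sketch of the ratio-based split (the function name is ours): under DD the ratio r is fixed for a whole stage, while under PL each step gets its own ratio chosen by the cost model.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper for the ratio-based split behind DD and PL: of n input
// tuples, the first r*n go to the CPU and the rest to the GPU.
std::pair<std::size_t, std::size_t> split_workload(std::size_t n, double r) {
    std::size_t cpu_share = static_cast<std::size_t>(n * r);
    return {cpu_share, n - cpu_share};   // (tuples for CPU, tuples for GPU)
}
```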

Determining Suitable Ratios for PL is Challenging
– Workload preferences of the CPU and GPU vary
– Computation type and amount of memory access differ across steps
– Delays across steps should be minimized to achieve a global optimum

Cost Model
Abstract model for the CPU and GPU
– Estimates data transfer costs, memory access costs, and execution costs
With the cost model, we can
– Estimate the elapsed time
– Choose the optimal workload ratios (a simplified sketch follows)
More details can be found in our paper.
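A deliberately simplified version of the ratio-selection idea, under the assumption that a step processing n tuples has per-tuple unit costs c_cpu and c_gpu (measured or estimated): the step finishes when the slower device finishes, so the best ratio balances the two finishing times. The paper's actual model also accounts for data transfer and memory access costs, which this sketch omits.

```cpp
#include <algorithm>
#include <cstddef>

// Simplified ratio selection for one step: the elapsed time is
// max(c_cpu*r*n, c_gpu*(1-r)*n), so setting the two sides equal
// gives the balancing ratio below.
double optimal_ratio(double c_cpu, double c_gpu) {
    // c_cpu * r = c_gpu * (1 - r)  =>  r = c_gpu / (c_cpu + c_gpu)
    return c_gpu / (c_cpu + c_gpu);
}

// Elapsed time of one step when a fraction r of n tuples goes to the CPU.
double step_time(double c_cpu, double c_gpu, std::size_t n, double r) {
    return std::max(c_cpu * r * n, c_gpu * (1.0 - r) * n);
}
```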

Outline
– Motivations
– System Design
– Evaluations
– Conclusions

System Setup
System configuration
[Table: CPU and GPU configurations — # cores, core frequency (GHz), zero-copy buffer size (MB), local memory (KB), cache (MB); values lost in the transcript]
Data sets
– R and S relations with 16M tuples each
– Two attributes per tuple: (key, record-ID)
– Data skew: uniform, low skew, and high skew

Discrete vs. Coupled Architecture
On the discrete architecture:
– data transfer takes 4%~10% of the time
– merging takes 14%~18%
The coupled architecture outperforms the discrete one by 5%~21% across all variants
[Chart: per-variant improvements, including 5.1%, 6.2%, 15.3%, and 21.5%]

Fine-grained vs. Coarse-grained
– For SHJ, PL outperforms OL and DD by 38% and 27%, respectively
– For PHJ, PL outperforms OL and DD by 39% and 23%, respectively
[Chart: speedups of PL over OL and DD for SHJ and PHJ]

Unit Costs in Different Steps
– Unit cost: the average processing time of one tuple on one device in one step
– Costs vary heavily across steps and between the two devices
[Chart: unit costs of the partition, build, and probe stages]

Ratios Derived from the Cost Model
Ratios differ across steps
– In the first step of all three stages (i.e., hashing), the GPU should take most of the work
Workload division is fine-grained at the step level

Other Findings
– Results on skewed data
– Results on inputs of varying sizes
– Evaluations of some design tradeoffs, etc.
More details can be found in our paper.

Outline
– Motivations
– System Design
– Evaluations
– Conclusions

Conclusions
– Implemented hash joins on both the discrete and the coupled CPU-GPU architectures
– Proposed a generic cost model to guide the fine-grained tuning for optimal performance
– Evaluated several design tradeoffs so that hash joins better exploit the hardware power
– The first systematic study of hash join co-processing on the emerging coupled CPU-GPU architecture

Future Work
– Design a full-fledged query processor
– Extend the fine-grained design methodology to other applications on the coupled CPU-GPU architecture

Acknowledgements
We thank Dr. Qiong Luo and Ong Zhong Liang for their valuable comments. This work is partly supported by a MoE AcRF Tier 2 grant (MOE2012-T ) in Singapore and an Interdisciplinary Strategic Competitive Fund of Nanyang Technological University 2011 for "C3: Cloud-Assisted Green Computing at NTU Campus".

Questions?

Discrete vs. Coupled Architecture
For fairness, the discrete system is emulated. On the discrete architecture:
– the hash tables on the two devices must be merged after the build stage
– data must be transferred via PCI-e; the overhead is estimated from the data size and bandwidth
PL is omitted on the discrete architecture because it would incur too much overhead

Discrete vs. Coupled Architecture (Cont.)
On the discrete architecture:
– data transfer takes 4%~10% of the time
– merging takes 14%~18%
The coupled architecture outperforms the discrete one by 5%~21% across all variants
[Chart: per-variant improvements, including 5.1%, 6.2%, 15.3%, and 21.5%]

Cost Model Accuracy
Evaluated over 1000 different configurations
– The elapsed time under the estimated ratio settings is very close to the true optimum