Presentation transcript:

Optimizing MapReduce for GPUs with Effective Shared Memory Usage
Linchuan Chen and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University

Outline
- Introduction
- Background
- System Design
- Experimental Results
- Related Work
- Conclusions and Future Work

Introduction
- Motivations
  - GPUs
    - Suitable for extreme-scale computing
    - Cost-effective and power-efficient
  - MapReduce programming model
    - Emerged with the development of data-intensive computing
  - GPUs have been shown to be suitable for implementing MapReduce
  - Utilizing the fast but small shared memory for MapReduce is challenging
    - Storing (key, value) pairs leads to high memory overhead, prohibiting the use of shared memory

Introduction
- Our approach
  - Reduction-based method
    - Reduce each (key, value) pair into the reduction object immediately after it is generated by the map function
    - Very suitable for reduction-intensive applications
  - A general and efficient MapReduce framework
    - Dynamic memory allocation within a reduction object
    - Maintaining a memory hierarchy
    - Multi-group mechanism
    - Overflow handling

Outline
- Introduction
- Background
- System Design
- Experimental Results
- Related Work
- Conclusions and Future Work

MapReduce
[Figure: map tasks (M) process input splits and emit (key, value) pairs; the pairs are grouped by key (e.g., K1: v, v, v, v; K2: v; K3: v, v; K4: v, v, v; K5: v) and each group is processed by a reduce task (R)]

MapReduce
- Programming model
  - Map(): generates a large number of (key, value) pairs
  - Reduce(): merges the values associated with the same key
- Efficient runtime system
  - Parallelization
  - Concurrency control
  - Resource management
  - Fault tolerance
  - ...
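
To make the model concrete, here is word count expressed in this style; a minimal sketch, assuming a runtime-provided pair-emission hook (the function names are illustrative, not from the slides):

    #include <string.h>

    void emit_pair(const char *key, int value);   /* hypothetical runtime hook */

    /* map(): emit (word, 1) for every word in the input chunk */
    void map(char *chunk) {
        for (char *w = strtok(chunk, " \t\n"); w; w = strtok(NULL, " \t\n"))
            emit_pair(w, 1);
    }

    /* reduce(): sum all the counts grouped under one word */
    void reduce(const char *key, const int *values, int n) {
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += values[i];
        emit_pair(key, sum);
    }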

GPUs
[Figure: CUDA architecture — the host launches kernels (Kernel 1, Kernel 2) onto the device; each kernel runs as a grid of thread blocks, and each block contains many threads. The memory hierarchy comprises per-thread registers and local memory, per-block shared memory, and device-wide device (global), constant, and texture memory]
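
As a minimal illustration of the shared-memory tier (not part of the framework), a CUDA kernel that stages data in per-block shared memory and reduces it there might look like this:

    // Assumes a launch with 256 threads per block.
    __global__ void blockSum(const int *in, int *out, int n) {
        __shared__ int buf[256];                 // fast, per-block shared memory
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        buf[tid] = (gid < n) ? in[gid] : 0;      // stage data into shared memory
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in the block
            if (tid < s) buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = buf[0];  // one partial result per block
    }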

Outline
- Introduction
- Background
- System Design
- Experimental Results
- Related Work
- Conclusions and Future Work

System Design
- Traditional MapReduce

    map(input) {
        (key, value) = process(input);
        emit(key, value);
    }

    // grouping of the key-value pairs is done by the runtime system

    reduce(key, iterator) {
        for each value in iterator
            result = operation(result, value);
        emit(key, result);
    }

System Design
- Reduction-based approach

    map(input, reduction_object) {
        (key, value) = process(input);
        reduction_object->insert(key, value);
    }

    reduce(value1, value2) {
        value1 = operation(value1, value2);
    }

- Reduces the memory overhead of storing key-value pairs
- Makes it possible to effectively utilize shared memory on a GPU
- Eliminates the need for grouping
- Especially suitable for reduction-intensive applications
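
For contrast with the traditional version, word count under the reduction-based API collapses to roughly the following (a sketch; the insert() signature and names are assumptions):

    struct ReductionObject;                       /* framework-managed; layout shown later */
    __device__ void ro_insert(ReductionObject *ro,
                              const void *key, int key_size,
                              const void *val, int val_size);  /* assumed API */

    __device__ void map(const char *word, int len, ReductionObject *ro) {
        int one = 1;
        /* the pair is reduced into the object immediately; it is never stored */
        ro_insert(ro, word, len, &one, sizeof(int));
    }

    __device__ void reduce(void *val1, const void *val2) {
        /* user-defined merge of two values that share a key */
        *(int *)val1 += *(const int *)val2;
    }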

Challenges
- Result collection and overflow handling
  - Maintain a memory hierarchy
- Trading off space requirement against locking overhead
  - A multi-group scheme
- Keeping the framework general and efficient
  - A well-defined data structure for the reduction object

Memory Hierarchy
[Figure: each thread block keeps its reduction objects (Reduction Object 0, Reduction Object 1) in its shared memory; the full shared-memory objects are merged into a single device-memory reduction object, whose contents are finally copied into a result array in host memory]
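
In code, the two device-side tiers of this hierarchy might be declared roughly as follows (sizes and names are assumptions):

    #define NUM_GROUPS 2                          // sub-groups per block (see later slides)
    #define SM_RO_BYTES 16384                     // assumed shared-memory budget per block

    __device__ char dm_ro[1 << 26];               // single device-memory reduction object

    __global__ void mr_kernel(const char *input, size_t n) {
        // per-block shared-memory reduction objects, one per thread sub-group
        __shared__ char sm_ro[NUM_GROUPS][SM_RO_BYTES / NUM_GROUPS];
        // ... map tasks update sm_ro; full objects are merged into dm_ro,
        //     and dm_ro is finally copied into the host result array ...
    }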

Reduction Object
- Updating the reduction object
  - Use locks to synchronize
- Memory allocation in the reduction object
  - Dynamic memory allocation
  - Multiple offsets in the device memory reduction object
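
A plausible shape for the locked update and the in-object allocator, using standard CUDA atomics (names and layout are assumptions, not the paper's code):

    #define NUM_BUCKETS 1024

    __device__ int      bucket_lock[NUM_BUCKETS];   // 0 = free, 1 = held
    __device__ unsigned alloc_offset;               // bump pointer into the data area

    __device__ unsigned ro_alloc(unsigned bytes) {
        // dynamic allocation inside the reduction object:
        // atomically advance the offset and return the old value
        return atomicAdd(&alloc_offset, bytes);
    }

    __device__ void ro_update(int bucket, int value, int *bucket_val) {
        // spin until this bucket's lock is acquired
        // (intra-warp spinning needs care on pre-Volta GPUs)
        while (atomicCAS(&bucket_lock[bucket], 0, 1) != 0) { }
        *bucket_val += value;                       // user reduce(): an integer sum here
        __threadfence();                            // publish before releasing the lock
        atomicExch(&bucket_lock[bucket], 0);        // release
    }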

Reduction Object
[Figure: layout of a reduction object — an index area of KeyIdx[i] and ValIdx[i] entries points into a data area managed by the memory allocator; each entry stores key size, value size, key data, and value data]
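
A C view of this layout (field names follow the figure; the widths and bucket count are assumptions):

    #define NUM_BUCKETS 1024

    struct ReductionObjectLayout {
        unsigned key_idx[NUM_BUCKETS];   // KeyIdx[i]: offset of bucket i's key entry
        unsigned val_idx[NUM_BUCKETS];   // ValIdx[i]: offset of bucket i's value entry
        unsigned alloc_offset;           // memory allocator: next free byte in data[]
        char     data[1];                // packed entries: key size, value size,
                                         // key data, value data
    };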

Multi-group Scheme
- Locks are used for synchronization
  - The large number of threads in each thread block leads to severe contention on the shared memory RO
- One solution: full replication
  - Every thread owns a shared memory RO
  - Leads to memory overhead and combination overhead
- Trade-off: the multi-group scheme
  - Divide the threads in each thread block into multiple sub-groups
  - Each sub-group owns a shared memory RO
- Choice of the number of groups
  - Contention overhead vs. combination overhead
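
The sub-group assignment itself is a one-line index computation; a sketch with an assumed group count:

    #define NUM_GROUPS 4                          // tunable: 1, 2, or 4 in the evaluation

    __global__ void mr_kernel_groups(void) {
        __shared__ char sm_ro[NUM_GROUPS][4096];  // one reduction object per sub-group

        int group = threadIdx.x % NUM_GROUPS;     // assign each thread to a sub-group
        char *my_ro = sm_ro[group];               // threads contend only within a group
        // ... map tasks update my_ro; the groups' objects are combined afterwards ...
    }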

Overflow Handling
- Swapping
  - Merge the full shared memory ROs into the device memory RO
  - Empty the full shared memory ROs
- In-object sorting
  - Sort the buckets in the reduction object and delete the useless data
  - Users define the way of comparing two buckets
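
Swapping can be expressed as a block-wide merge followed by a reset; a sketch with hypothetical helpers (the capacity check must be uniform across the block for the barriers to be safe):

    __device__ bool ro_is_full(const char *ro);            // hypothetical helpers
    __device__ void merge_ro(char *dst, const char *src);  // fold src's buckets into dst
    __device__ void ro_reset(char *ro);                    // mark the object empty

    __device__ void swap_if_full(char *sm_ro, char *dm_ro) {
        if (ro_is_full(sm_ro)) {
            __syncthreads();                        // no thread may still be mid-update
            merge_ro(dm_ro, sm_ro);                 // merge into the device-memory RO
            if (threadIdx.x == 0) ro_reset(sm_ro);  // empty the shared-memory RO
            __syncthreads();
        }
    }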

Discussion
- Reduction-intensive applications
  - Our framework has a significant advantage
- Applications with little or no reduction
  - No need to use shared memory
- Users need to set system parameters
  - Developing auto-tuning techniques is future work

Extension for Multi-GPU
- Shared memory usage speeds up single-node execution
  - Potentially benefits the overall performance
- Reduction objects avoid the global shuffling overhead
  - Can also reduce communication overhead
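
On the host side, the per-GPU partial results could then be combined directly, with no global shuffle; a minimal sketch using a std::map stand-in for the copied-back binary reduction object:

    #include <map>
    #include <string>
    #include <vector>

    using HostRO = std::map<std::string, int>;   // stand-in for a copied-back RO

    HostRO combine_all(const std::vector<HostRO> &per_gpu) {
        HostRO result;
        for (const HostRO &ro : per_gpu)
            for (const auto &kv : ro)
                result[kv.first] += kv.second;   // apply the user reduce(): a sum here
        return result;
    }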

Outline
- Introduction
- Background
- System Design
- Experimental Results
- Related Work
- Conclusions and Future Work

Experimental Results
- Applications used
  - 5 reduction-intensive
  - 2 map computation-intensive
  - Each tested with small, medium, and large datasets
- Evaluation of the multi-group scheme
  - 1, 2, and 4 groups
- Comparison with other implementations
  - Sequential implementations
  - MapCG
  - Ji et al.'s work
- Evaluation of the swapping mechanism
  - Tested with a large number of distinct keys

Evaluation of the Multi-group Scheme

Comparison with Sequential Implementations

Comparison with MapCG
- With reduction-intensive applications

Comparison with MapCG
- With other applications

Comparison with Ji et al.'s work

Evaluation of the Swapping Mechanism
- vs. MapCG and Ji et al.'s work

Evaluation of the Swapping Mechanism
- vs. MapCG

Evaluation of the Swapping Mechanism
- swap_frequency = num_swaps / num_tasks

Outline
- Introduction
- Background
- System Design
- Experimental Results
- Related Work
- Conclusions and Future Work

Related Work
- MapReduce for multi-core systems
  - Phoenix, Phoenix Rebirth
- MapReduce on GPUs
  - Mars, MapCG
- MapReduce-like framework on GPUs for SVM
  - Catanzaro et al.
- MapReduce in heterogeneous environments
  - MITHRA, IDAV
- Utilizing shared memory of GPUs for specific applications
  - Nyland et al., Gutierrez et al.
- Compiler optimizations for utilizing shared memory
  - Baskaran et al. (PPoPP '08), Moazeni et al. (SASP '09)

Conclusions and Future Work
- Reduction-based MapReduce
- Storing the reduction object in the memory hierarchy of the GPU
- A multi-group scheme
- Improved performance compared with previous implementations
- Future work: extend the framework to support new architectures

Thank you!