Optimizing MapReduce for GPUs with Effective Shared Memory Usage


Optimizing MapReduce for GPUs with Effective Shared Memory Usage Linchuan Chen and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University

Outline
- Introduction
- Background
- System Design
- Experiment Results
- Related Work
- Conclusions and Future Work

Introduction: Motivations
- GPUs
  - Suitable for extreme-scale computing
  - Cost-effective and power-efficient
- MapReduce programming model
  - Emerged with the development of data-intensive computing
- GPUs have been shown to be suitable for implementing MapReduce
- Utilizing the fast but small shared memory for MapReduce is challenging
  - Storing (key, value) pairs leads to high memory overhead, prohibiting the use of shared memory

Introduction: Our approach
- Reduction-based method
  - Reduce each (key, value) pair into the reduction object immediately after the map function generates it
  - Very suitable for reduction-intensive applications
- A general and efficient MapReduce framework
  - Dynamic memory allocation within a reduction object
  - Maintaining a memory hierarchy
  - Multi-group mechanism
  - Overflow handling

Before going deeper, let me first cover the background on MapReduce and GPU architecture.

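MapReduce
[Figure: MapReduce data flow. Map tasks (M) emit (key, value) pairs such as K1:v and K2:v; the pairs are grouped by key (K1: v, v, v, v; K2: v; K3: v, v; K4: v, v, v; K5: v) and each group is passed to a reduce task (R).]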

MapReduce
- Programming model
  - Map(): generates a large number of (key, value) pairs
  - Reduce(): merges the values associated with the same key
- Efficient runtime system
  - Parallelization
  - Concurrency control
  - Resource management
  - Fault tolerance
  - ...

GPUs
[Figure: GPU architecture in two panels. Processing component: the host launches Kernel 1 and Kernel 2 on the device as Grid 1 and Grid 2; each grid is a 2-D arrangement of thread blocks, and each block is a 2-D arrangement of threads. Memory component: each thread has registers and local memory, each block has shared memory, and the whole grid shares device memory, constant memory, and texture memory alongside the host.]

System Design: Traditional MapReduce

    map(input) {
        (key, value) = process(input);
        emit(key, value);
    }

    grouping the key-value pairs (by runtime system)

    reduce(key, iterator) {
        for each value in iterator
            result = operation(result, value);
        emit(key, result);
    }
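The three phases above can be sketched on the CPU as follows. This is only an illustration, assuming a word-count workload; the names `map_fn`, `group_by_key`, `reduce_fn`, and `traditional_mapreduce` are hypothetical and not part of any framework named in the slides. The point is that every intermediate pair is stored before any reduction starts.

```python
from collections import defaultdict


def map_fn(record):
    """Map: emit one (key, value) pair per word in the input record."""
    return [(word, 1) for word in record.split()]


def group_by_key(pairs):
    """Runtime step: collect every value emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: fold all values of one key into a single result."""
    result = 0
    for value in values:
        result = result + value
    return key, result


def traditional_mapreduce(records):
    # All intermediate pairs are materialized before grouping and reducing.
    pairs = [pair for record in records for pair in map_fn(record)]
    groups = group_by_key(pairs)
    results = dict(reduce_fn(key, values) for key, values in groups.items())
    return results, len(pairs)


results, stored_pairs = traditional_mapreduce(["a b a", "b a c"])
print(results, stored_pairs)
```

Here six pairs are stored to produce three results, which is the memory overhead the slides identify as the obstacle to using shared memory.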

System Design: Reduction-based approach

    map(input) {
        (key, value) = process(input);
        reductionobject->insert(key, value);
    }

    reduce(value1, value2) {
        value1 = operation(value1, value2);
    }

- Reduces the memory overhead of storing key-value pairs
- Makes it possible to effectively utilize shared memory on a GPU
- Eliminates the need for grouping
- Especially suitable for reduction-intensive applications
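A minimal CPU-side sketch of the same idea, again assuming a word-count workload. The `ReductionObject` class here is a hypothetical stand-in built on a Python dict, not the framework's actual data structure; it only shows that `insert` applies the user's `reduce` immediately, so no list of intermediate pairs is kept.

```python
class ReductionObject:
    """Hypothetical stand-in: one running value per distinct key."""

    def __init__(self, reduce_fn):
        self.reduce_fn = reduce_fn
        self.buckets = {}

    def insert(self, key, value):
        # Reduce the new value into the existing one as soon as it arrives.
        if key in self.buckets:
            self.buckets[key] = self.reduce_fn(self.buckets[key], value)
        else:
            self.buckets[key] = value


def reduce_fn(value1, value2):
    """User-defined combination of two values for the same key."""
    return value1 + value2


def map_fn(record, ro):
    """Map: insert each pair straight into the reduction object."""
    for word in record.split():
        ro.insert(word, 1)


ro = ReductionObject(reduce_fn)
for record in ["a b a", "b a c"]:
    map_fn(record, ro)
print(ro.buckets)
```

Storage grows with the number of distinct keys (three here) rather than with the number of emitted pairs (six), and no grouping step is needed.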

Challenges
- Result collection and overflow handling
  - Maintain a memory hierarchy
- Trade off space requirement and locking overhead
  - A multi-group scheme
- Keep the framework general and efficient
  - A well-defined data structure for the reduction object

Memory Hierarchy
[Figure: memory hierarchy of reduction objects. On the GPU, each thread block's shared memory holds shared memory reduction objects (Reduction Object 0, Reduction Object 1, ...); these feed a device memory reduction object, whose contents go to a result array in device memory and from there to host memory on the CPU.]

Reduction Object
- Updating the reduction object
  - Use locks to synchronize
- Memory allocation in the reduction object
  - Dynamic memory allocation
  - Multiple offsets in the device memory reduction object

Reduction Object
[Figure: reduction object layout. A memory allocator manages the data area; index arrays (KeyIdx[0], ValIdx[0], ...) point to each bucket's entry, which stores the key size, value size, key data, and value data.]
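One way to picture such a layout is a flat byte pool with a bump allocator and per-bucket index arrays. The sketch below is an assumption-heavy illustration: the class name `FlatReductionObject`, the toy hash, linear probing, and the fixed 8-byte integer values are all choices made here for brevity, not details taken from the slides.

```python
import struct


class FlatReductionObject:
    """Hypothetical layout: index arrays pointing into one flat byte pool."""

    def __init__(self, num_buckets, pool_bytes):
        self.pool = bytearray(pool_bytes)
        self.next_free = 0  # bump-allocator offset into the pool
        self.key_idx = [-1] * num_buckets  # offset of each bucket's key data
        self.val_idx = [-1] * num_buckets  # offset of each bucket's value data
        self.key_size = [0] * num_buckets
        self.val_size = [0] * num_buckets

    def _alloc(self, size):
        # Dynamic allocation inside the object: hand out the next free bytes.
        if self.next_free + size > len(self.pool):
            raise MemoryError("reduction object is full")
        offset = self.next_free
        self.next_free += size
        return offset

    def _key_at(self, bucket):
        start = self.key_idx[bucket]
        return bytes(self.pool[start:start + self.key_size[bucket]])

    def _find_bucket(self, key):
        n = len(self.key_idx)
        start = sum(key) % n  # toy hash, chosen only to be deterministic
        for step in range(n):  # linear probing
            bucket = (start + step) % n
            if self.key_idx[bucket] == -1 or self._key_at(bucket) == key:
                return bucket
        raise MemoryError("no free bucket")

    def insert(self, key, value):
        """Insert a bytes key with an int value; equal keys are summed."""
        bucket = self._find_bucket(key)
        if self.key_idx[bucket] == -1:
            base = self._alloc(len(key) + 8)
            self.key_idx[bucket], self.key_size[bucket] = base, len(key)
            self.val_idx[bucket], self.val_size[bucket] = base + len(key), 8
            self.pool[base:base + len(key)] = key
            struct.pack_into("<q", self.pool, self.val_idx[bucket], value)
        else:
            (old,) = struct.unpack_from("<q", self.pool, self.val_idx[bucket])
            struct.pack_into("<q", self.pool, self.val_idx[bucket], old + value)

    def get(self, key):
        bucket = self._find_bucket(key)
        if self.key_idx[bucket] == -1:
            return None
        return struct.unpack_from("<q", self.pool, self.val_idx[bucket])[0]


flat_ro = FlatReductionObject(num_buckets=8, pool_bytes=64)
for word in [b"gpu", b"cpu", b"gpu"]:
    flat_ro.insert(word, 1)
print(flat_ro.get(b"gpu"), flat_ro.get(b"cpu"), flat_ro.next_free)
```

Variable-size keys cost only the bytes they need, which matters when the pool is as small as a GPU's shared memory. On a real GPU the allocator offset and the bucket updates would need atomic operations or locks, which this single-threaded sketch omits.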

Multi-group Scheme
- Locks are used for synchronization
- Large number of threads in each thread block
  - Leads to severe contention on the shared memory reduction object (RO)
- One solution: full replication
  - Every thread owns a shared memory RO
  - Leads to memory overhead and combination overhead
- Trade-off: multi-group scheme
  - Divide the threads in each thread block into multiple sub-groups
  - Each sub-group owns a shared memory RO
- Choice of the number of groups
  - Contention overhead
  - Combination overhead
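The trade-off can be imitated on the CPU with Python threads. This is a rough sketch under stated assumptions: `run_block` is a hypothetical helper, threads are assigned to groups by `thread_id % num_groups`, and each group's RO is guarded by a single lock, whereas the slides do not specify the assignment rule or the locking granularity.

```python
import threading


def run_block(items_per_thread, num_groups):
    """Simulate one thread block whose threads share num_groups ROs."""
    group_ros = [{} for _ in range(num_groups)]
    group_locks = [threading.Lock() for _ in range(num_groups)]

    def worker(thread_id, items):
        group = thread_id % num_groups  # assumed assignment rule
        for key, value in items:
            # Only threads of the same group contend for this lock.
            with group_locks[group]:
                ro = group_ros[group]
                ro[key] = ro.get(key, 0) + value

    threads = [
        threading.Thread(target=worker, args=(tid, items))
        for tid, items in enumerate(items_per_thread)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Combination step: its cost grows with the number of groups.
    combined = {}
    for ro in group_ros:
        for key, value in ro.items():
            combined[key] = combined.get(key, 0) + value
    return combined, group_ros


# 8 simulated threads, each emitting 30 pairs over 3 distinct keys.
work = [[("k%d" % (i % 3), 1) for i in range(30)] for _ in range(8)]
outcomes = {groups: run_block(work, groups) for groups in (1, 2, 4)}
for groups, (combined, ros) in outcomes.items():
    print(groups, combined, len(ros))
```

With one group, all eight threads compete for a single lock; with four groups, only two threads share each lock, but four partial ROs must be stored and merged. The final result is the same in every configuration.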

Overflow Handling
- Swapping
  - Merge the full shared memory ROs into the device memory RO
  - Empty the full shared memory ROs
- In-object sorting
  - Sort the buckets in the reduction object and delete the data that is no longer needed
  - Users define how two buckets are compared
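The swapping path can be sketched as a small, capacity-limited object backed by a larger one. The class `SwappingHierarchy`, its capacity measured in distinct keys, and the summing reduction are assumptions made for this illustration; in-object sorting is not shown.

```python
class SwappingHierarchy:
    """Hypothetical two-level hierarchy: small shared RO, large device RO."""

    def __init__(self, shared_capacity):
        self.shared_capacity = shared_capacity  # max distinct keys in the small RO
        self.shared_ro = {}
        self.device_ro = {}
        self.num_swaps = 0

    def _merge_shared_into_device(self):
        for key, value in self.shared_ro.items():
            self.device_ro[key] = self.device_ro.get(key, 0) + value
        self.shared_ro.clear()  # empty the shared RO so it can be reused

    def insert(self, key, value):
        is_new_key = key not in self.shared_ro
        if is_new_key and len(self.shared_ro) >= self.shared_capacity:
            # Overflow: swap the full shared RO out before inserting.
            self._merge_shared_into_device()
            self.num_swaps += 1
        self.shared_ro[key] = self.shared_ro.get(key, 0) + value

    def finish(self):
        """Flush whatever remains in the shared RO and return the result."""
        self._merge_shared_into_device()
        return self.device_ro


hierarchy = SwappingHierarchy(shared_capacity=2)
for key in ["a", "b", "a", "c", "d", "a"]:
    hierarchy.insert(key, 1)
final_result = hierarchy.finish()
print(final_result, hierarchy.num_swaps)
```

In this run the shared RO overflows twice, once when "c" arrives and once when "a" returns after being swapped out, which shows why workloads with many distinct keys swap more often.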

Discussion
- Reduction-intensive applications
  - Our framework has a big advantage
- Applications with few or no reductions
  - No need to use shared memory
- Users need to set up system parameters
  - Develop auto-tuning techniques in future work

Extension for Multi-GPU
- Shared memory usage can speed up single-node execution
  - Potentially benefits the overall performance
- Reduction objects can avoid global shuffling overhead
  - Can also reduce communication overhead

Experiment Results
- Applications used
  - 5 reduction-intensive
  - 2 map computation-intensive
  - Tested with small, medium, and large datasets
- Evaluation of the multi-group scheme
  - 1, 2, and 4 groups
- Comparison with other implementations
  - Sequential implementations
  - MapCG
  - Ji et al.'s work
- Evaluating the swapping mechanism
  - Tested with a large number of distinct keys

Evaluation of the Multi-group Scheme

Comparison with Sequential Implementations

Comparison with MapCG: reduction-intensive applications

Comparison with MapCG: other applications

Comparison with Ji et al.'s work

Evaluation of the Swapping Mechanism: vs. MapCG and Ji et al.'s work

Evaluation of the Swapping Mechanism: vs. MapCG

Evaluation of the Swapping Mechanism: swap_frequency = num_swaps / num_tasks

Related Work
- MapReduce for multi-core systems
  - Phoenix, Phoenix Rebirth
- MapReduce on GPUs
  - Mars, MapCG
- MapReduce-like framework on GPUs for SVM
  - Catanzaro et al.
- MapReduce in heterogeneous environments
  - MITHRA, IDAV
- Utilizing shared memory of GPUs for specific applications
  - Nyland et al., Gutierrez et al.
- Compiler optimizations for utilizing shared memory
  - Baskaran et al. (PPoPP '08), Moazeni et al. (SASP '09)

Conclusions and Future Work
- Reduction-based MapReduce
- Storing the reduction object on the memory hierarchy of the GPU
- A multi-group scheme
- Improved performance compared with previous implementations
- Future work: extend our framework to support new architectures

Thank you!