Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations. Vignesh Ravi, Wenjing Ma, David Chiu, and Gagan Agrawal.

Presentation transcript:

Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations Vignesh Ravi, Wenjing Ma, David Chiu and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus, Ohio

Outline
– Motivation
– Challenges Involved
– Generalized Reductions
– Overall System Design
– Code Generation Module
– Dynamic Work Distribution
– Experimental Results
– Conclusions

Motivation
– Heterogeneous architectures are common
  – E.g., today's desktops and notebooks: a multi-core CPU plus a graphics card on PCI-E
– 3 of the top 7 systems in the latest Top500 list are heterogeneous
  – Nebulae: No. 1 in peak performance and No. 2 in Linpack performance
  – Uses multi-core CPUs and a GPU (Tesla C2050) on each node
– Multi-core CPU and GPU usage is still independent
  – Resources may be under-utilized
– Can a multi-core CPU and a GPU be used simultaneously for a single computation?

Challenges Involved
– Programmability
  – Separate code for the multi-core CPU and the GPU must be orchestrated manually
  – E.g., Pthreads/OpenMP + CUDA
– Work Distribution
  – CPUs and GPUs have different characteristics
  – They vary in compute power, memory sizes, and latencies
  – Distributed, non-coherent memories
– Performance
  – Is there a benefit (against CPU-only and GPU-only)?
  – Not much prior work is available; a deeper study is needed

Contributions
– We target generalized reduction computations for heterogeneous architectures
  – Ongoing work considers other dwarfs
– We focus on the combination of a multi-core CPU and a GPU
– A compiler system automatically generates middleware/CUDA code from high-level (sequential) C code
– Efficient dynamic work distribution at runtime
– We show significant performance gains from the use of heterogeneous configurations
  – K-Means: 60% improvement
  – PCA: 63% improvement

Generalized Reduction Computations

    {* Outer sequential loop *}
    While (unfinished) {
        {* Reduction loop *}
        Foreach (element e) {
            (i, val) = compute(e)
            RObj(i) = Reduc(RObj(i), val)
        }
    }

– Similar to the Map-Reduce model, but with only one stage: reduction
– The reduction object, RObj, is exposed to the programmer
  – Large intermediate results are avoided
– The reduction object lives in shared memory [race conditions]
– The reduction operation, Reduc, is associative or commutative
  – The order of processing can therefore be arbitrary
– We target this particular class of applications in this work
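To make the structure concrete, here is a minimal sketch (ours, not from the slides) of K-Means cluster assignment, one of the paper's benchmark applications, written as a generalized reduction in C. The reduction object holds per-cluster coordinate sums and counts; its updates are associative and commutative, so elements can be processed in any order.

    /* Illustrative sketch: K-Means assignment as a generalized reduction.
     * DIM is an assumed dimensionality, chosen only for illustration. */
    #include <float.h>
    #include <stddef.h>

    #define DIM 3

    typedef struct {
        double sum[DIM];   /* per-cluster coordinate sums */
        long   count;      /* per-cluster point count     */
    } RObj;

    void kmeans_reduction_loop(const double *points, size_t n,
                               const double *centers, int k, RObj *robj)
    {
        for (size_t e = 0; e < n; e++) {               /* Foreach (element e)   */
            const double *p = &points[e * DIM];
            int best = 0;
            double best_dist = DBL_MAX;
            for (int c = 0; c < k; c++) {              /* (i, val) = compute(e) */
                double d = 0.0;
                for (int j = 0; j < DIM; j++) {
                    double diff = p[j] - centers[c * DIM + j];
                    d += diff * diff;
                }
                if (d < best_dist) { best_dist = d; best = c; }
            }
            for (int j = 0; j < DIM; j++)              /* RObj(i) = Reduc(...)  */
                robj[best].sum[j] += p[j];
            robj[best].count++;
        }
    }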

Overall System Design

– User input: simple C code with annotations, written by the application developer
– Compilation phase: a code generator produces code for the multi-core middleware API and GPU code for CUDA
– Run-time system, with three key components:
  – Worker thread creation and management
  – Mapping the computation to the CPU and GPU
  – Dynamic work distribution

User Input: Format & Example

Simple sequential C code with annotations, consisting of:
– Variable information
– Sequential reduction functions

Variable information examples:
– "K int" [K is an integer variable of size sizeof(int)]
– "Centers float* 5 K" [Centers is a pointer to float of size 5*K*sizeof(float)]

Sequential reduction functions:
– Identify the reduction structure
– Represent it as one or more reduction functions [with unique labels]
– Use only variables declared in the variable information
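As an illustration, a hypothetical user input for a simple histogram (our own example; the variable lines follow the slide's format, but the exact placement of the annotations in the source file is our assumption):

    /* Variable Information (the slide's "name type [dims]" format):
     *   NumBins int
     *   Hist    int* NumBins   [Hist is a pointer to int of size NumBins*sizeof(int)]
     */

    /* Sequential reduction function, label: hist_update */
    void hist_update(float element, int *Hist, int NumBins)
    {
        int i = (int)(element * NumBins);   /* compute(e): map element to a bin */
        if (i >= NumBins) i = NumBins - 1;  /* clamp to the valid bin range     */
        if (i < 0) i = 0;
        Hist[i] += 1;                       /* Reduc: associative & commutative */
    }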

Program Analysis & Code Generation

– User input: variable information and the sequential reduction functions
– Program analyzer: a variable analyzer plus a code analyzer
– Code generator: emits the host program and the kernel function(s)
– Output: an executable CUDA program
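For the histogram input above, the generated GPU code might resemble the following kernel. This is a sketch under our assumptions; the slides do not show the generated code. One common strategy for race-free reductions, assumed here, is to give each thread a private copy of the reduction object and combine the copies afterwards.

    /* Illustrative CUDA kernel sketch for the hist_update example. */
    __global__ void hist_update_kernel(const float *data, size_t n,
                                       int *hist_copies, int NumBins)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        int *my_hist = &hist_copies[tid * NumBins];   /* private reduction object */

        for (size_t e = tid; e < n; e += nthreads) {  /* strided over elements    */
            int i = (int)(data[e] * NumBins);         /* compute(e)               */
            if (i >= NumBins) i = NumBins - 1;
            if (i < 0) i = 0;
            my_hist[i] += 1;                          /* Reduc(...)               */
        }
        /* The host (or a follow-up kernel) combines the per-thread copies
         * into the final reduction object. */
    }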

Dynamic Work Distribution

– A key component of the runtime system
– The relative performance of the CPU and GPU varies for every application
– Two general approaches:
  1. Work sharing
     – Centralized, shared work queue
     – Worker threads consume from the shared queue when idle
  2. Work stealing
     – A private work queue for each worker process
     – An idle process steals work from a busy process

Dynamic Work Distribution (contd.)

Our approach is centralized work sharing, for the following reasons:
– GPU memory size is limited
– Memory transfers between the CPU and GPU have high latency
– With work stealing, if the GPU is much faster than the CPU, the GPU has to poll the CPU frequently for data
  – This leads to high data transfer overhead

Based on centralized work sharing, we propose two work distribution schemes:
– Uniform Chunk Size scheme (UCS)
– Non-Uniform Chunk Size scheme (NUCS)

Uniform Chunk Size Scheme

– A master/job scheduler serves a global work queue of equal-sized chunks
– An idle worker consumes work from the queue, FCFS
– A fast worker ends up processing more data than a slow worker
– A slow worker still processes a reasonable portion of the data

[Diagram: master/job scheduler feeding both fast workers (1..n) and slow workers (1..n) from the global queue]

A minimal sketch of this scheme follows.
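The sketch below (ours; all names are illustrative) shows the core of a centralized work-sharing queue in C. Workers, whether CPU threads or the thread driving the GPU, grab the next chunk index under a lock, so faster workers naturally claim more chunks.

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        long next_chunk;   /* index of the next unprocessed chunk */
        long num_chunks;   /* total number of uniform chunks      */
    } work_queue;

    /* Returns the next chunk index, or -1 once the queue is drained. */
    long get_next_chunk(work_queue *q)
    {
        long c = -1;
        pthread_mutex_lock(&q->lock);
        if (q->next_chunk < q->num_chunks)
            c = q->next_chunk++;
        pthread_mutex_unlock(&q->lock);
        return c;
    }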

Key Observations

Important observations based on the architecture:
– Each CPU core is slower than the GPU
– CPU memory latency is relatively small
– GPU memory (transfer) latency is very high

Resulting optimizations:
– Minimize GPU memory transfer overheads
– Take advantage of the CPU's small memory latency

Thus, the CPU benefits from small chunks while the GPU benefits from larger chunks, which leads to non-uniform chunk size distribution.

Non-Uniform Chunk Size Scheme

– Start with an initial data division in which every chunk is small
– If a CPU worker requests work, a chunk of the initial size is forwarded
– If the GPU requests work, a larger chunk is formed by merging smaller chunks
– This minimizes GPU data transfer and device invocation overhead
– Idle time at the end of processing is also minimized

[Diagram: initial data division into chunks 1..K; the job scheduler forwards small chunks to CPU workers and merges chunks into large ones for GPU workers]

A sketch of the merging step follows.
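Building on the work_queue sketch above, the NUCS dispatch might look like this (ours; GPU_MERGE_FACTOR is an assumed value, not one from the slides). A GPU request claims a run of consecutive small chunks as one large chunk, amortizing transfer and kernel-launch costs.

    #define GPU_MERGE_FACTOR 16   /* assumption: small chunks merged per GPU request */

    typedef struct { long first_chunk; long n_chunks; } work_item;

    /* is_gpu: nonzero when the requesting worker drives the GPU. */
    work_item get_work(work_queue *q, int is_gpu)
    {
        work_item w = { -1, 0 };
        long want = is_gpu ? GPU_MERGE_FACTOR : 1;
        pthread_mutex_lock(&q->lock);
        long left = q->num_chunks - q->next_chunk;
        if (left > 0) {                       /* claim up to 'want' chunks */
            w.first_chunk = q->next_chunk;
            w.n_chunks = (want < left) ? want : left;
            q->next_chunk += w.n_chunks;
        }
        pthread_mutex_unlock(&q->lock);
        return w;
    }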

Experimental Setup

Hardware:
– AMD Opteron 8350 processors, 8 CPU cores
– 16 GB main memory
– Nvidia GeForce 9800 GTX with 512 MB device memory

Applications:
– K-Means clustering [6.4 GB dataset]
– Principal Component Analysis (PCA) [8.5 GB dataset]
– Both follow the generalized reduction structure

Experimental Goals

– Evaluate the performance of the multi-core CPU and the GPU independently
– Study how the chunk size impacts their individual performance
– Study the performance gain from simultaneously exploiting both the multi-core CPU and the GPU
– Evaluate the two dynamic distribution schemes in the heterogeneous setting

Performance of K-Means: CPU-only & GPU-only
[Chart from the slide; the data are not recoverable from the transcript]

Performance of PCA: CPU-only & GPU-only
[Chart from the slide; the data are not recoverable from the transcript]
– Contrary to K-Means, the CPU is faster than the GPU (by up to ~4x)

Performance of K-Means (Heterogeneous, UCS)
[Chart from the slide comparing GPU and CPU cores; labeled improvement: 23.9%]

Performance of K-Means (Heterogeneous, NUCS)
[Chart from the slide; labeled improvement: 60%]

Performance of PCA (Heterogeneous, UCS)
[Chart from the slide; labeled improvement: 24.3%]

Performance of PCA (Heterogeneous, NUCS)
[Chart from the slide; labeled improvement: 63.8%]

Work Distribution in K-Means
[Chart from the slide; the data are not recoverable from the transcript]

Work Distribution in PCA
[Chart from the slide; the data are not recoverable from the transcript]

Conclusions
– Compiler and run-time support for generalized reduction computations on heterogeneous architectures
– The compiler system automatically generates middleware/CUDA code from high-level sequential C code
– The run-time system supports efficient work distribution schemes
– The non-uniform work distribution scheme minimizes GPU data transfer overheads
– We achieve a 60% performance benefit for K-Means and 63% for PCA on heterogeneous configurations

Ongoing Work
– Other dwarfs
  – Stencil computations
  – Indirection-array-based reductions
– Clusters of heterogeneous nodes
– Support for fault tolerance
– Exploiting advanced GPU features

Thank You! Questions?

Contacts: Vignesh Ravi, Wenjing Ma, David Chiu, Gagan Agrawal