Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations


Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations
Xin Huo, Vignesh T. Ravi, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH
International Conference on High Performance Computing (HiPC)
Presented by Po-Ting Liu, 2013/02/21

Outline
– Introduction
– Irregular Reductions
– Single-Level Partitioning
– Multi-level Partitioning Framework
– Experimental Results
– Conclusions

Introduction
Trend of heterogeneous architectures (figure)

Introduction
Challenges:
– Irregular applications
– Dividing work between the CPU and GPU

Irregular Reductions
Regular Reduction vs. Irregular Reduction (figure)

Irregular Reductions
Codes from many scientific and engineering domains contain loops with irregular reductions.
Applications:
– Computational Fluid Dynamics (CFD)
– Molecular Dynamics (MD)

Irregular Reductions
Irregular → indirection access (figure: Input, Index, and Output arrays)
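The indirection pattern on the slide can be sketched as a short edge loop. This is a minimal illustration, not the paper's code; the names IA, X, and RA follow the slides' arrays, and the edge computation is a placeholder:

```cpp
#include <cstddef>
#include <vector>

// Sketch of an irregular reduction: each edge reads two node values
// through the indirection array IA and accumulates contributions into
// the reduction array RA. The write targets are only known at runtime.
void irregular_reduction(const std::vector<int>& IA,    // 2 node ids per edge
                         const std::vector<double>& X,  // per-node input
                         std::vector<double>& RA) {     // per-node output
    for (std::size_t e = 0; e < IA.size() / 2; ++e) {
        int u = IA[2 * e], v = IA[2 * e + 1];
        double f = X[u] - X[v];  // placeholder edge computation
        RA[u] += f;              // indirect, data-dependent writes:
        RA[v] -= f;              // two edges may update the same node
    }
}
```

Because two edges can name the same node, the writes to RA conflict under naive parallelization, which is what makes these loops hard to port.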

Single-Level Partitioning
Computation space (edges)
– Coalesced accesses
– No data reuse
– Ex: IA, Y
Reduction space (nodes)
– Data reuse
– No coalesced accesses
– Ex: RA, X

Single-Level Partitioning
Two partitioning choices:
– Computation Space: partition on edges
– Reduction Space: partition on nodes

Single-Level Partitioning
Computation Space Partitioning (CSP) (figure: edge-balanced partitions covering unequal numbers of nodes)

Single-Level Partitioning
CSP viewed as a scatter (figure: Partition 1, Partition 2, ..., with In and Out arrays)
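As a scatter, CSP can be sketched serially as follows. This is a hypothetical illustration (in practice each partition would run on a different device or thread): edges are split evenly, each partition scatters into a private replica of the output, and the replicas are combined at the end, which is the combination cost the slides mention:

```cpp
#include <algorithm>
#include <vector>

// CSP sketch: balanced edge ranges per partition, private output
// replicas, and a final combination pass. All names are illustrative.
std::vector<double> csp_reduce(const std::vector<int>& IA,
                               const std::vector<double>& X,
                               int n_nodes, int n_parts) {
    int n_edges = static_cast<int>(IA.size()) / 2;
    std::vector<std::vector<double>> replica(
        n_parts, std::vector<double>(n_nodes, 0.0));
    int chunk = (n_edges + n_parts - 1) / n_parts;
    for (int p = 0; p < n_parts; ++p) {
        int lo = p * chunk, hi = std::min(n_edges, lo + chunk);
        for (int e = lo; e < hi; ++e) {   // balanced computation
            int u = IA[2 * e], v = IA[2 * e + 1];
            double f = X[u] - X[v];       // placeholder edge computation
            replica[p][u] += f;           // private, conflict-free writes
            replica[p][v] -= f;
        }
    }
    std::vector<double> RA(n_nodes, 0.0); // combination step
    for (int p = 0; p < n_parts; ++p)
        for (int i = 0; i < n_nodes; ++i)
            RA[i] += replica[p][i];
    return RA;
}
```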

Single-Level Partitioning
Reduction Space Partitioning (RSP) (figure: white nodes = output, black nodes = input; partitions contain 16 vs. 25 edges)

Single-Level Partitioning
RSP viewed as a gather (figure: Partition 2, Partition 4, ..., with In and Out arrays)
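As a gather, RSP can be sketched the same way. Again a hypothetical serial illustration: each partition owns a contiguous node range, scans the edges, and keeps only the contributions to nodes it owns, so no combination pass is needed, but an edge whose endpoints fall in different partitions is recomputed by both owners, which is the replicated work the slides point out:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// RSP sketch: node ranges per partition, conflict-free gathers into
// owned output entries, no combination pass. Names are illustrative.
std::vector<double> rsp_reduce(const std::vector<int>& IA,
                               const std::vector<double>& X,
                               int n_nodes, int n_parts) {
    std::vector<double> RA(n_nodes, 0.0);
    int chunk = (n_nodes + n_parts - 1) / n_parts;
    for (int p = 0; p < n_parts; ++p) {
        int lo = p * chunk, hi = std::min(n_nodes, lo + chunk);
        for (std::size_t e = 0; e < IA.size() / 2; ++e) {
            int u = IA[2 * e], v = IA[2 * e + 1];
            double f = X[u] - X[v];   // placeholder edge computation
            // Only update endpoints this partition owns; a
            // cross-partition edge is recomputed by the other owner.
            if (u >= lo && u < hi) RA[u] += f;
            if (v >= lo && v < hi) RA[v] -= f;
        }
    }
    return RA;
}
```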

Single-Level Partitioning
CSP
Advantages:
– Load balance in computation
Disadvantages:
– Unequal output size in each partition
– Replicated elements
– Combination cost
RSP
Advantages:
– Balanced output elements
– Partitions independent of each other
– No combination cost
Disadvantages:
– Imbalance in computation
– Replicated work

Multi-level Partitioning Framework (figure: the first level applies RSP)

Multi-level Partitioning Framework
Detailed work at each partitioning level (figure)

Runtime Support and Schemes (figure: task scheduling, second-level partitioning, computation, and output stages)
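One way to picture the runtime's pipeline: a host thread prepares (second-level partitions) block i+1 while block i is being processed, so the partitioning cost overlaps with computation. The sketch below only simulates this with placeholder prepare/compute steps and is not the paper's actual runtime:

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Pipeline sketch: overlap preparation of the next block with
// processing of the current one. "Prepare" and "compute" are
// placeholders; the returned vector records the processing order.
std::vector<int> run_pipeline(int n_blocks) {
    std::vector<int> prepared(n_blocks, 0), order;
    auto prepare = [&](int b) { prepared[b] = 1; }; // 2nd-level partition
    prepare(0);                                     // prime the pipeline
    for (int i = 0; i < n_blocks; ++i) {
        std::thread next;
        if (i + 1 < n_blocks)
            next = std::thread(prepare, i + 1);     // overlaps with compute
        assert(prepared[i]);                        // block i must be ready
        order.push_back(i);                         // "compute" block i
        if (next.joinable()) next.join();
    }
    return order;
}
```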

Experimental Results
Experimental environment:
– CPU: two Intel 2.27 GHz quad-core Xeon E5520 CPUs (8 cores, 8 threads)
– GPU: NVIDIA Tesla C2050 (Fermi), 1.15 GHz, 448 cores (14 SMs x 32 cores)
– Applications: Euler (EU), based on Computational Fluid Dynamics (CFD), and Molecular Dynamics (MD)

Experimental Results
Scalability of irregular applications: Molecular Dynamics (MD) and Euler (EU) (figures; dataset sizes range from 1.8 GB to 5.3 GB)

Experimental Results
Trade-offs between CSP and RSP: MD on CPUs (figure)

Experimental Results
Trade-offs between CSP and RSP: MD on GPU (figure)

Experimental Results
Benefits from pipelining: MD on CPUs + GPU (figure)

Experimental Results
Benefits from pipelining: EU on CPUs + GPU (figure)

Experimental Results
Benefits from the work stealing strategy (figure)
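The idea behind the work stealing strategy can be sketched with two workers standing in for the CPU and the GPU. The chunk ranges and the summing "work" are made up for illustration; each chunk id is handed out exactly once through an atomic cursor, so a fast worker drains its own queue and then takes the slow worker's leftovers, and the total is deterministic regardless of interleaving:

```cpp
#include <atomic>
#include <thread>

// Work-stealing sketch: each worker owns a range of chunk ids behind
// an atomic cursor; a worker that exhausts its own range claims the
// unprocessed chunks of the other. Not the paper's actual scheduler.
struct ChunkQueue {
    std::atomic<int> next{0};  // next chunk id to claim
    int end = 0;               // one past the last owned chunk
};

void worker(ChunkQueue& mine, ChunkQueue& other, std::atomic<long>& sum) {
    // Drain the worker's own queue; fetch-add hands out each id once.
    for (int c = mine.next++; c < mine.end; c = mine.next++)
        sum += c;  // "process" chunk c
    // Own queue empty: steal whatever the other worker has not claimed.
    for (int c = other.next++; c < other.end; c = other.next++)
        sum += c;
}
```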

Experimental Results
Performance benefits from using the CPU and GPU simultaneously (figure)

Conclusions
Porting irregular reduction applications to heterogeneous architectures
Multi-level Partitioning Framework:
– Reduction space partitioning
– Pipeline scheme
– Work stealing
An efficient framework with good scalability