Autonomic scheduling of tasks from data parallel patterns to CPU/GPU core mixes Published in: High Performance Computing and Simulation (HPCS), 2013 International.


Autonomic scheduling of tasks from data parallel patterns to CPU/GPU core mixes. Published in: High Performance Computing and Simulation (HPCS), 2013 International Conference on. 2013/12/19

Outline: Introduction; Autonomic Management of Data Parallel Computations Targeting CPU/GPU Mixes; Evaluating The CPU/GPU Tradeoff; Experimental Results; Conclusions

Introduction: multiple cores are now ubiquitous, and most machines also host at least one GPU; the challenge is to exploit this parallelism efficiently.

Introduction, motivation: determine how to divide the workload between CPU and GPU. The paper proposes an analytical performance model for scheduling tasks among CPU and GPU cores such that the global execution time of the overall data parallel pattern is optimized.

Autonomic Management of Data Parallel Computations Targeting CPU/GPU Mixes: deciding whether the parallelism exhibited by the application is suitable for GPUs can be settled by considering only those parallel patterns that fit the GPU execution model, i.e., data parallel patterns. The remaining question is how to use the CPU while the GPU is computing.

Autonomic Management of Data Parallel Computations Targeting CPU/GPU Mixes: two sub-problems must be solved: figuring out whether it is beneficial to split a data parallel computation between CPU and GPU cores, and figuring out the percentage of tasks to run on CPU cores versus GPU cores.

Autonomic Management of Data Parallel Computations Targeting CPU/GPU Mixes: the autonomic manager follows a MAPE loop: Monitor, Analyze, Plan, Execute.
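The MAPE loop on this slide can be sketched as a small control skeleton. The callback names and the idea of a returned schedule are illustrative assumptions, not the paper's actual interface:

```python
def mape_loop(monitor, analyze, plan, execute, rounds=3):
    """Generic Monitor-Analyze-Plan-Execute loop (illustrative skeleton,
    not the paper's implementation)."""
    schedule = None
    for _ in range(rounds):
        obs = monitor()            # Monitor: sample current performance
        if analyze(obs):           # Analyze: does behavior deviate from the target?
            schedule = plan(obs)   # Plan: compute a new CPU/GPU task split
            execute(schedule)      # Execute: apply the new schedule
    return schedule
```

In this setting the plan step would typically consult the analytical cost model discussed in the next section to pick a new CPU/GPU split.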

Evaluating The CPU/GPU Tradeoff: the platform is modeled as two nodes, the CPU with its main memory and the GPU with its device memory. The first node owns the data, part of which must be sent to the second. Copying data between main memory and GPU memory incurs a setup cost plus a data-transmission cost; on the CPU side, computation scales from one core up to K cores.

Evaluating The CPU/GPU Tradeoff (slides 12-13: cost-model figure and equations, not captured in this transcript)

Evaluating The CPU/GPU Tradeoff: CPU processing time, GPU processing time, and total execution time (equations not captured in this transcript)
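As a rough illustration of such a cost model, the sketch below assumes a simple linear structure: k CPU cores process their share of tasks while the GPU processes the rest, and GPU-bound data pays a one-off setup cost plus a per-byte transfer cost. All parameter names (t_cpu, t_gpu, t_setup, t_byte, task_bytes, k) are assumptions for illustration, not the paper's symbols:

```python
def total_time(m, f, t_cpu, t_gpu, t_setup, t_byte, task_bytes, k):
    """Estimate completion time when a fraction f of m tasks runs on the GPU.
    t_cpu: time per task on one CPU core (k cores work in parallel)
    t_gpu: time per task on the GPU
    t_setup, t_byte, task_bytes: transfer-cost model for GPU-bound data"""
    gpu_tasks = f * m
    cpu_tasks = (1 - f) * m
    cpu_time = cpu_tasks * t_cpu / k
    gpu_time = t_setup + gpu_tasks * task_bytes * t_byte + gpu_tasks * t_gpu
    return max(cpu_time, gpu_time)   # CPU and GPU compute concurrently

def best_fraction(m, **costs):
    """Brute-force search for the GPU fraction minimizing total time."""
    return min((i / 100 for i in range(101)),
               key=lambda f: total_time(m, f, **costs))
```

A closed-form alternative equates the CPU and GPU finishing times; the brute-force search is just the simplest way to make the tradeoff visible.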

Evaluating The CPU/GPU Tradeoff, per-task input → computation → output sizes. Element-wise benchmark: N → O(N) → N. Matrix multiplication: 2N² → O(N³) → N².

Evaluating The CPU/GPU Tradeoff (slides 16-17: per-benchmark CPU and GPU processing-time equations, not captured in this transcript)

Experimental Results: experiment platform (hardware table not captured in this transcript)

Experimental Results, benchmarks. Benchmark b1: computes the matrix whose elements are the squares of the corresponding elements of the input matrix (N → O(N) → N). Benchmark b2: the simplest matrix multiplication algorithm, three nested loops with no blocking and no further optimization (2N² → O(N³) → N²).
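The two benchmarks can be sketched in plain Python purely to make their computational shapes concrete; the paper's actual kernels run on CPU and GPU cores, so this is not their implementation:

```python
def b1_square(a):
    """Benchmark b1: element-wise square of a matrix (list of lists).
    Input N elements, O(N) work, output N elements."""
    return [[x * x for x in row] for row in a]

def b2_matmul(a, b):
    """Benchmark b2: naive triple-loop matrix multiplication,
    no blocking, no further optimization. 2N^2 input, O(N^3) work."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c
```

The contrast matters for the split decision: b1 is transfer-bound (O(N) work on N data), while b2 amortizes transfers over O(N³) work.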

Experimental Results (slides 21-22: performance plots for b1 and b2, not captured in this transcript)

Experimental Results, reduce patterns: reduce (sum) and reduce (min).
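A reduce pattern splits the input, reduces each chunk independently (e.g., some chunks on CPU cores and some on the GPU), then combines the partial results. A minimal sequential sketch of that structure, assuming only that the operator is associative:

```python
from functools import reduce

def chunked_reduce(op, data, chunks=4):
    """Reduce each chunk to a partial result, as independent workers
    would, then combine the partials; op must be associative."""
    n = len(data)
    bounds = [round(i * n / chunks) for i in range(chunks + 1)]
    partials = [reduce(op, data[bounds[i]:bounds[i + 1]])
                for i in range(chunks) if bounds[i] < bounds[i + 1]]
    return reduce(op, partials)
```

With `op` as addition this is the reduce-sum benchmark's shape; with `min` it is reduce-min.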

Experimental Results (slides 24-25: reduce performance plots, not captured in this transcript)

Experimental Results: runs repeated with parameter values P, 0.8P, 0.9P, 1.1P, and 1.2P (plot not captured in this transcript)

Conclusions. The main contributions of this work: a model computing the ratio between the numbers of tasks to be executed on CPU and GPU cores so as to optimize completion time, and versions of the classical map and reduce patterns that combine execution of tasks on GPU and CPU cores according to the ratio computed by the model.

Q&A

Thank you for listening