synergy.cs.vt.edu

Online Performance Projection for Clusters with Heterogeneous GPUs

Lokendra S. Panwar, Ashwin M. Aji, Wu-chun Feng (Virginia Tech, USA)
Jiayuan Meng, Pavan Balaji (Argonne National Laboratory, USA)
Diversity in Accelerators

[Figure: performance share of accelerators in Top500 systems, Nov 2008 to Nov 2013. Source: top500.org]
Heterogeneity “Among” Nodes

Clusters are deploying different accelerators:
– Different accelerators for different tasks

Example clusters:
– “Shadowfax” at VT: NVIDIA GPUs, FPGAs
– “Darwin” at LANL: NVIDIA GPUs, AMD GPUs
– “Dirac” at NERSC: NVIDIA Tesla and Fermi GPUs
Heterogeneity “Among” Nodes

However … there is a unified programming model for “all” accelerators: OpenCL
– CPUs, GPUs, FPGAs, DSPs
Affinity of Tasks to Processors

Peak performance doesn’t necessarily translate into actual device performance.

[Table: peak GFLOPS, peak global memory bandwidth (GB/s), and actual Reduction time (ms) for an NVIDIA GPU vs. an AMD GPU]
Affinity of Tasks to Processors

Given an OpenCL program, which device should it run on? Peak performance doesn’t necessarily translate into actual device performance.

[Table: peak GFLOPS, peak global memory bandwidth (GB/s), and actual Reduction time (ms) for an NVIDIA GPU vs. an AMD GPU]
Challenges for Runtime Systems

It is crucial for heterogeneous runtime systems to embrace the different accelerators in a cluster w.r.t. performance and power.

Examples of OpenCL runtime systems:
– SnuCL
– VOCL
– SOCL

Challenges:
– Efficiently choose the right device for the right task
– Keep the decision-making overhead minimal
Our Contributions

– An online workload characterization technique for OpenCL kernels
– A model that projects the relative ranking of different devices with little overhead
– An end-to-end evaluation of our technique on multiple architectural families of AMD and NVIDIA GPUs
Outline: Introduction, Motivation, Contributions, Design, Evaluation, Conclusion
Design

Goal: rank accelerators for a given OpenCL workload accurately AND efficiently
– Decision making with minimal overhead
Design

Choices:
– Static code analysis: fast, but inaccurate, as it does not account for dynamic properties:
  – input-data dependence, memory access patterns, dynamic instruction counts
Design

Choices:
– Dynamic code analysis: higher accuracy; execute either on the actual device or through an emulator
  – Not always feasible to run on actual devices: data-transfer costs, and clusters are “busy”
  – Emulators are very slow
Design – Workload Profiling

[Figure: an emulator processes an OpenCL kernel and extracts memory patterns, bank conflicts, and the instruction mix]
Design – Workload Profiling

“Mini-emulation”: emulate a single workgroup and collect its dynamic characteristics:
– Instruction traces
– Global- and local-memory transactions and access patterns

In typical data-parallel workloads, workgroups exhibit similar runtime characteristics, so one workgroup’s profile extrapolates to the full kernel at asymptotically lower overhead.

[Figure: the mini-emulator processes an OpenCL kernel and extracts memory patterns, bank conflicts, and the instruction mix]
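As a sketch, the mini-emulation idea can be illustrated in Python. This is a hypothetical illustration only: `emulate_workgroup`, the trace format, and all counts below are invented and are not the actual emulator’s interface.

```python
# Hypothetical sketch of mini-emulation: profile ONE workgroup's dynamic
# instruction mix, then extrapolate to the full kernel, relying on the
# observation that data-parallel workgroups behave alike.
from collections import Counter

def emulate_workgroup(trace):
    """Tally the dynamic instruction mix from a single workgroup's trace."""
    return Counter(trace)

def scale_to_kernel(workgroup_profile, num_workgroups):
    """Extrapolate one workgroup's counts to the whole kernel."""
    return {op: n * num_workgroups for op, n in workgroup_profile.items()}

# Invented trace for one workgroup: compute ops plus global/local memory ops.
trace = ["fmul", "fadd", "gmem_load", "lmem_load", "fmul", "gmem_store"]
kernel_profile = scale_to_kernel(emulate_workgroup(trace), num_workgroups=1024)
print(kernel_profile["fmul"])  # 2 per workgroup x 1024 workgroups = 2048
```

Because only one workgroup is emulated, the cost is independent of the launch size, which is where the asymptotic savings over full-kernel emulation come from.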
Design – Device Profiling

[Figure: instruction and memory microbenchmarks run on GPU 1 … GPU N to produce device throughput profiles]
Design – Device Profiling

Build device throughput profiles:
– Modified SHOC microbenchmarks to obtain hardware throughput at varying occupancy
– Collect throughputs for instructions, global memory, and local memory
– Profiles are built only once per device

[Figure: global- and local-memory throughput profile of the AMD 7970]
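A device throughput profile can be viewed as a lookup from occupancy to sustained throughput. The sketch below interpolates between microbenchmark samples; the `DeviceProfile` class and every number are invented stand-ins for the real profiles built from the modified SHOC benchmarks.

```python
# Hypothetical device profile: maps occupancy -> sustained throughput,
# built once per device from microbenchmark samples (numbers invented).
import bisect

class DeviceProfile:
    def __init__(self, name, points):
        # points: sorted (occupancy, throughput) pairs from microbenchmarks
        self.name = name
        self.occ = [p[0] for p in points]
        self.thr = [p[1] for p in points]

    def throughput_at(self, occupancy):
        """Linearly interpolate sustained throughput at a given occupancy,
        clamping outside the measured range."""
        i = bisect.bisect_left(self.occ, occupancy)
        if i == 0:
            return self.thr[0]
        if i >= len(self.occ):
            return self.thr[-1]
        x0, x1 = self.occ[i - 1], self.occ[i]
        y0, y1 = self.thr[i - 1], self.thr[i]
        return y0 + (y1 - y0) * (occupancy - x0) / (x1 - x0)

# Invented global-memory samples for one device, in GB/s.
gmem = DeviceProfile("gpu_gmem", [(0.25, 80.0), (0.5, 160.0), (1.0, 240.0)])
print(gmem.throughput_at(0.75))  # -> 200.0 (midway between 160 and 240)
```

Since the profiles are device properties, not workload properties, this one-time cost is amortized over every subsequent projection.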
Design – Find Performance Limiter

[Figure: the workload profile (memory patterns, bank conflicts, instruction mix) is combined with the device profile]
Design – Find Performance Limiter

Scale the single-workgroup dynamic characteristics to full-kernel characteristics, using device occupancy as the scaling factor.

Compute projected theoretical times for:
– Instructions
– Global memory
– Local memory

GPUs aggressively try to hide the latencies of these components, so:
Performance limiter = max(t_local, t_global, t_compute)*

Compare the normalized projected times and choose the best device.

*Zhang et al., “A Quantitative Performance Analysis Model for GPU Architectures,” HPCA 2011
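The projection step above can be sketched as: divide each component’s scaled dynamic count by the device’s effective throughput for that component, take the maximum as the projected time, and rank devices by it. The device names and all counts/throughputs below are invented for illustration.

```python
# Sketch of the performance-limiter projection: per-component projected
# time = scaled dynamic count / effective throughput; since GPUs overlap
# compute with memory, the slowest component dominates.

def projected_time(counts, throughputs):
    """Return (limiter time, limiter name) for one device."""
    times = {k: counts[k] / throughputs[k] for k in counts}
    limiter = max(times, key=times.get)
    return times[limiter], limiter

# Invented full-kernel counts (ops / transactions), scaled from one workgroup.
kernel_counts = {"compute": 4.0e9, "gmem": 1.2e9, "lmem": 0.3e9}

# Invented effective throughputs (per second) at this kernel's occupancy.
device_throughputs = {
    "gpu_a": {"compute": 2.0e9, "gmem": 0.15e9, "lmem": 1.0e9},
    "gpu_b": {"compute": 1.0e9, "gmem": 0.30e9, "lmem": 0.8e9},
}

ranking = sorted(
    device_throughputs,
    key=lambda d: projected_time(kernel_counts, device_throughputs[d])[0],
)
print(ranking)  # gpu_b's limiter (4.0) beats gpu_a's gmem limiter (8.0)
```

Note that only the relative ordering of the projected times matters for device selection, which is why the model can tolerate absolute-time error.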
Design

[Figure, static profiling: instruction and memory benchmarks run on GPU 1 … GPU N to build the device profiles]
Design

[Figure, static + dynamic profiling: in addition to the device profiles, the mini-emulator runs a single workgroup of the GPU kernel to extract memory patterns, bank conflicts, and the instruction mix]
Design

[Figure, full pipeline: the workload and device profiles yield effective instruction throughput and effective global-/local-memory bandwidths; the performance limiter among them determines the relative GPU performances]
Outline: Introduction, Motivation, Contributions, Design, Evaluation, Conclusion
Experimental Setup

Accelerators:
– AMD 7970: scalar ALUs, cache hierarchy
– AMD 5870: VLIW ALUs
– NVIDIA C2050: Fermi architecture, cache hierarchy
– NVIDIA C1060: Tesla architecture

Simulators:
– Multi2Sim v4.1 for AMD devices and GPGPU-Sim v3.0 for NVIDIA devices
– Methodology is agnostic to the specific emulator

Applications:
– FloydWarshall (num nodes = 192)
– FastWalshTransform
– MatrixMul, global (matrix size = [1024,1024])
– MatrixMul, local (matrix size = [1024,1024])
– Reduction
– NBody (num particles = 32768)
– AESEncryptDecrypt (width = 1536, height = 512)
– MatrixTranspose (matrix size = [1024,1024])
Application Boundedness: AMD GPUs

[Figure: projected time (normalized) per application, broken into gmem, compute, and lmem components, for the AMD GPUs]
Application Boundedness Summary

Performance limiter per application (across AMD 5870, AMD 7970, NVIDIA C1060, NVIDIA C2050):
– FloydWarshall: gmem
– FastWalshTransform: gmem
– MatrixTranspose: gmem
– MatMul (global): gmem
– MatMul (local): local / gmem / compute, varying by device
– Reduction: gmem / compute, varying by device
– NBody: compute
– AESEncryptDecrypt: local / compute, varying by device
Accuracy of Performance Projection
Accuracy of Performance Projection

[Table: actual vs. projected best device for FastWalsh, FloydWarshall, MatMul (global), NBody, AESEncryptDecrypt, Reduction, MatMul (local), and MatrixTranspose]
Emulation Overhead – Reduction Kernel
Outline: Introduction, Motivation, Contributions, Design, Evaluation, Conclusion
90/10 Paradigm -> 10x10 Paradigm

Simple, specialized tools (“accelerators”) customized for different purposes (“applications”):
– Narrower focus on applications (10% each)
– A simplified, specialized accelerator for each classification

Why? 10x lower power and 10x faster -> 100x more energy-efficient

Figure credit: A. Chien, Salishan Conference 2010
Conclusion

We presented a “mini-emulation” technique for online workload characterization of OpenCL kernels:
– Shown to be sufficiently accurate for relative performance projection
– Asymptotically lower overhead than projection via full-kernel emulation

Our technique is shown to work well across multiple architectural families of AMD and NVIDIA GPUs.

With the increasing diversity in accelerators (toward 10x10*), our methodology only becomes more relevant.

*S. Borkar and A. Chien, “The future of microprocessors,” Communications of the ACM, 2011
Thank You
Backup
Evolution of Microprocessors: the 90/10 Paradigm

Derive common cases from applications (90%):
– Broad focus on application workloads

Architectural improvements for the 90% case:
– Design one aggregated, generic “core”
– Less customizability for applications

Figure credit: A. Chien, Salishan Conference 2010
90/10 Paradigm -> 10x10 Paradigm

Simple, specialized tools (“accelerators”) customized for different purposes (“applications”):
– Narrower focus on applications (10% each)
– A simplified, specialized accelerator for each classification

Why? 10x lower power and 10x faster -> 100x more energy-efficient

Figure credit: A. Chien, Salishan Conference 2010
Application Boundedness: NVIDIA GPUs

[Figure: projected time (normalized) per application, broken into gmem, compute, and lmem components, for the NVIDIA GPUs]
Evaluation: Projection Accuracy (Relative to C1060)
Evaluation: Projection Overhead vs. Actual Kernel Execution of Matrix Multiplication
Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Matrix Multiplication
Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Reduction