Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs — Allen D. Malony, Scott Biersdorff, Sameer Shende, Heike Jagode, Stanimire Tomov, et al.


Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs
Allen D. Malony, Scott Biersdorff, Sameer Shende, Heike Jagode†, Stanimire Tomov†, Guido Juckeland‡, Robert Dietrich‡, Duncan Poole§ and Christopher Lamb§
University of Oregon, Eugene, Department of Computer and Information Science
†University of Tennessee, Knoxville, Innovative Computing Laboratory (ICL)
‡Technische Universität Dresden, Center for Information Services and High Performance Computing (ZIH), Germany
§NVIDIA Corporation, Santa Clara, CA
ICPP 2011, Sep 13-16, 2011
Reporter: Shih-Meng Teng

Outline
Introduction
Heterogeneous Computation Model
CPU-GPU Operational Semantics
Heterogeneous Execution and Performance
CPU-GPU Measurement Approaches
▫ Synchronous method
▫ Event queue method
▫ Callback method
Heterogeneous Performance Tools
▫ Tool interoperability
Experiment
Conclusion

Introduction
The power of GPUs is giving rise to heterogeneous parallel computing, with new demands on programming environments, runtime systems, and tools to deliver high-performing applications.
This study focuses on a heterogeneous computation model and alternative CPU-GPU measurement approaches, realized in three tools:
▫ PAPI (Performance API)
▫ VampirTrace
▫ TAU (Tuning and Analysis Utilities)

Introduction (cont.)
Basis of the measurement approaches:
▫ Synchronous method
▫ Event queue method
▫ Callback method
Three experiments:
▫ Multiple GPU Test
▫ Symmetric Matrix Vector Product (SYMV)
▫ SHOC Benchmarks - Stencil2D

Heterogeneous Computation Model

CPU-GPU Operational Semantics
1. The controlling process (or thread) binds to one available GPU device.
2. Transfer the necessary input data into device memory.
3. Launch one or more kernels.
4. Copy the results back to the host.
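The four steps above can be sketched with the standard CUDA runtime API. This is a minimal illustration, not code from the paper; the kernel `vecScale` and the problem size are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecScale(float *d, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main(void) {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    cudaSetDevice(0);                                  // 1. bind to a device
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float),
               cudaMemcpyHostToDevice);                // 2. transfer input data
    vecScale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);    // 3. launch a kernel
    cudaMemcpy(h, d, n * sizeof(float),
               cudaMemcpyDeviceToHost);                // 4. copy results back
    printf("h[0] = %f\n", h[0]);
    cudaFree(d); free(h);
    return 0;
}
```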


Heterogeneous Execution and Performance
Two difficulties:
1. In a multi-threaded program, each thread can bind to the same GPU device, but each receives a different context, limiting interaction.
2. Communication of data between GPU devices on different physical hosts now requires three steps:
   1. move the data from device memory to host memory on the sending host,
   2. send the data to the receiving host,
   3. move the data from host memory to device memory on the receiving host.
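The three-step inter-node transfer can be sketched with CUDA and MPI host code. This is a hedged illustration under assumed buffer names and a simple rank-0-sends layout, not the paper's implementation; error handling is omitted.

```cuda
#include <cuda_runtime.h>
#include <mpi.h>
#include <stdlib.h>

// Transfer the contents of rank 0's device buffer to the peer's device buffer.
void gpu_to_gpu(float *d_buf, int n, int peer, int rank) {
    float *h_buf = (float *)malloc(n * sizeof(float));
    if (rank == 0) {
        // 1. device memory -> host memory on the sending host
        cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        // 2. send the data to the receiving host
        MPI_Send(h_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(h_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        // 3. host memory -> device memory on the receiving host
        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    }
    free(h_buf);
}
```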

Evaluating heterogeneous execution performance raises several concerns:
▫ GPU kernel execution
▫ CPU-GPU interactions
▫ Intra-node execution
▫ Inter-node communication

CPU-GPU Measurement Approaches
Three assumptions:
1. User code is executed on a GPU device in the form of kernels, which run without access to direct performance information.
2. A given device can be logically divided into streams of execution, and within each stream kernels execute sequentially in a pre-determined order.
3. Each kernel is executed after an associated kernel launch that runs on the CPU. These launches execute in the same sequence as the kernels (though not necessarily at the same time).

Synchronous method
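In broad terms, the synchronous method brackets each kernel launch with host-side timers and forces the host to wait for the kernel to finish, which perturbs the program by serializing otherwise-asynchronous launches. A minimal sketch, assuming a hypothetical kernel `k`:

```cuda
#include <cuda_runtime.h>
#include <sys/time.h>
#include <stdio.h>

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

extern __global__ void k(float *d, int n);  // hypothetical kernel

void timed_launch(float *d, int n) {
    double t0 = now();
    k<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();        // block the host until the kernel finishes
    double t1 = now();
    printf("kernel time: %f s\n", t1 - t0);
}
```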

Event queue method
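The event queue method instead enqueues timing events on the stream around the kernel; the tool can read their device timestamps later, without blocking every launch. A hedged sketch using CUDA events, again with a hypothetical kernel `k`:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

extern __global__ void k(float *d, int n);  // hypothetical kernel

void event_timed_launch(float *d, int n, cudaStream_t s) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, s);               // enqueued before the kernel
    k<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaEventRecord(stop, s);                // enqueued after the kernel

    // A tool would typically read the events later, at a flush point,
    // rather than synchronizing here at every launch.
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // device-side elapsed time
    printf("kernel time: %f ms\n", ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```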

Callback method
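With the callback method the runtime notifies the measurement tool when a kernel completes, and profiling data is read inside the callback. Since (as noted later) this method is supported only by OpenCL, here is a hedged OpenCL host-code sketch; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and omits error checking.

```c
#include <CL/cl.h>
#include <stdio.h>

// Invoked by the OpenCL runtime once the kernel's event reaches CL_COMPLETE.
static void CL_CALLBACK on_complete(cl_event ev, cl_int status, void *user) {
    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    printf("kernel time: %f ms\n", (end - start) * 1e-6);  // ns -> ms
}

// After enqueueing a kernel, register the callback on its event.
void launch_with_callback(cl_command_queue q, cl_kernel k, size_t global) {
    cl_event ev;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, &ev);
    clSetEventCallback(ev, CL_COMPLETE, on_complete, NULL);
}
```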

Method support and implementation
1) Synchronous method ▫ CUDA and OpenCL
2) Event queue method ▫ CUDA and OpenCL
3) Callback method ▫ only OpenCL
4) CUPTI (CUDA Performance Tool Interface) ▫ CUPTI provides two APIs, the Callback API and the Event API.

Heterogeneous Performance Tools
▫ PAPI CUDA Component
▫ Vampir/VampirTrace
▫ TAU Performance System


Tool interoperability

Experiment - Multiple GPU Test
Multiple GPUs are used by an application running on a single node: a main thread spawns multiple solverThreads (CUDA SDK simpleMultiGPU).
The test runs on a Keeneland node, using three C2070 GPUs, to create the TAU profile shown in Fig. 7.


Experiment - Symmetric Matrix Vector Product (SYMV)
Use PAPI to measure the CUBLAS and MAGMA libraries on SYMV.
▫ MAGMA: Matrix Algebra on GPU and Multicore Architectures
▫ CUBLAS Library (CUDA Toolkit 3.2), NVIDIA
SYMV is a memory-bound kernel.
A "symmetry" method reduces memory reads, at the cost of extra shared-memory traffic and bank conflicts:
▫ Although N²/2 element reads are eliminated, N²/64 writes (and N²/64 reads) are introduced.
Array padding then completely eliminates the shared-memory bank conflicts.
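The array-padding idea mentioned above can be illustrated with a standard CUDA shared-memory tile: padding the tile by one column shifts each row's starting bank, so column-wise accesses no longer collide on the same bank. This is a generic sketch (here applied to a transpose, with an assumed tile size), not the MAGMA SYMV kernel itself.

```cuda
#define TILE 32

__global__ void transpose_padded(const float *in, float *out, int n) {
    // TILE x (TILE + 1): the extra column removes shared-memory bank
    // conflicts on the column-wise reads below.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Read the tile by columns; without the padding, all threads of a
    // warp would hit the same bank here.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}
```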


Experiment - SHOC Benchmarks
The SHOC (Scalable HeterOgeneous Computing) benchmarks provide tests for heterogeneous performance tools; here the Stencil2D application is used.
▫ CUDA version (Fig. 12, left), measured with VampirTrace: a 2-dimensional, 9-point stencil on 2 Keeneland nodes, with MPI processes on each node (one GPU per process).
▫ OpenCL version (Fig. 12, right), measured with TAU: 8 Keeneland nodes running 24 MPI processes in total, each MPI process attached to a single GPU device.

Fig. 12 (left): Vampir trace display of Stencil2D execution on 4 MPI processes with 4 GPUs. Time-synchronized GPU counter rates convey important performance characteristics of the kernel execution.

Fig. 12 (right): TAU profile of the OpenCL version of the Stencil2D application run on the Keeneland platform with 24 MPI processes and GPUs. The kernel execution times are generally well-balanced across the GPUs.

Conclusion
Understanding the performance of scalable heterogeneous parallel systems and applications poses new challenges in:
▫ Instrumentation
▫ Measurement
▫ Analysis of heterogeneous components

The research presented here demonstrates support for GPU performance measurement with CUDA and OpenCL in three well-known performance tools: PAPI, VampirTrace, and the TAU Performance System.

Thanks for listening, and have a nice day.
Q & A