
Validation October 13, 2017

Validation: GPU vs SNB (Broadwell)
http://mlefebvr.web.cern.ch/mlefebvr/20171010_gpu_val_test/
https://github.com/mpbl/mictest/tree/gpu-validation
The three other plots lie along the abscissa with a null ordinate.

Bottleneck Hunting October 13, 2017

Single Event Profiling

To fill the GPU and hide data transfers, it helps to use a number of streams to schedule event computations concurrently on the GPU. Can we get a better understanding of all the bottlenecks in the GPU code, and can we rank them from most to least limiting? If we can significantly improve the performance of computing a single event, we should also improve the time-to-solution of multi-event reconstruction.

Note: we will be using CUDA 9 and the associated nvvp. While still not up to par with VTune, this version provides new features that make it easier to see what is going on.

Let's start with the clone engine, a priori the kernel taking the most serious performance hit.
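As a reminder of the multi-stream idea, here is a minimal sketch; process_event, the stream count, and the buffer size are placeholders, not the mictest code:

```cpp
// Minimal sketch of overlapping per-event transfers and kernels with CUDA
// streams. process_event, kStreams, and kN are illustrative stand-ins.
#include <cuda_runtime.h>

__global__ void process_event(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;  // stand-in for the real per-event kernel
}

int main() {
  const int kStreams = 4;   // one stream per in-flight event
  const int kN = 1 << 20;   // elements per event (illustrative)
  cudaStream_t streams[kStreams];
  float *h[kStreams], *d[kStreams];

  for (int s = 0; s < kStreams; ++s) {
    cudaStreamCreate(&streams[s]);
    cudaMallocHost((void**)&h[s], kN * sizeof(float));  // pinned: enables async copies
    cudaMalloc((void**)&d[s], kN * sizeof(float));
  }
  for (int s = 0; s < kStreams; ++s) {
    // Within one stream the three operations are ordered; across streams the
    // copies and kernels overlap, hiding the transfer latency.
    cudaMemcpyAsync(d[s], h[s], kN * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
    process_event<<<(kN + 255) / 256, 256, 0, streams[s]>>>(d[s], kN);
    cudaMemcpyAsync(h[s], d[s], kN * sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
  }
  cudaDeviceSynchronize();
  for (int s = 0; s < kStreams; ++s) {
    cudaStreamDestroy(streams[s]);
    cudaFreeHost(h[s]);
    cudaFree(d[s]);
  }
  return 0;
}
```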

Occupancy: CE Seed-based

Achieved occupancy is half of the theoretical value, and we are limited by registers. With a GPlex size of 10,000 and a block size of 256, we have 40 blocks; each block is scheduled on one of the P100's 56 SMs, so some SMs remain idle. This explains the difference between achieved and theoretical occupancy. It is most likely mitigated by having concurrent events, since additional events get scheduled on the free SMs. Note that it does not make sense to use a larger GPlex size: the seed-based parallelization has a concurrency factor of at most the number of seeds.
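To make the register limit visible at runtime, a small sketch using the occupancy API; ce_kernel is a stand-in for the real clone-engine kernel, and 256 is the block size quoted above:

```cpp
// Sketch: query how many blocks per SM the register usage allows.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void ce_kernel(float* /*gplex*/) { /* stand-in body */ }

int main() {
  const int blockSize = 256;
  // ceil(10000 / 256) = 40 blocks, fewer than the P100's 56 SMs, so some
  // SMs sit idle: achieved occupancy falls below the theoretical figure.
  const int gridSize = (10000 + blockSize - 1) / blockSize;

  int blocksPerSM = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, ce_kernel, blockSize, 0);

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("%d blocks over %d SMs, register limit allows %d blocks/SM\n",
         gridSize, prop.multiProcessorCount, blocksPerSM);
  return 0;
}
```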

Occupancy: CE Track-based parallelization

With a GPlex size of 10,000 (thus equal to the number of seeds), we are in a very similar situation: imbalance between SMs due to a number of blocks that is too small.

Occupancy: CE Track-based parallelization

With the track-based parallelization, we can raise the number of threads up to the maximum number of track candidates. Using a GPlex size of 30,000:
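The block-count arithmetic behind the two configurations, written as compile-time checks; a sketch assuming the 256-thread blocks quoted on the seed-based slide, not code taken from mictest:

```cpp
// Block counts for the seed-based (10,000) and track-based (30,000) sizes.
constexpr int kBlock = 256;  // threads per block (from the text)
constexpr int kSMs   = 56;   // SMs on a P100

constexpr int blocksFor(int nWorkItems) {
  return (nWorkItems + kBlock - 1) / kBlock;  // ceiling division
}

static_assert(blocksFor(10000) == 40 && blocksFor(10000) < kSMs,
              "seed-based: fewer blocks than SMs, so some SMs stay idle");
static_assert(blocksFor(30000) == 118 && blocksFor(30000) >= 2 * kSMs,
              "track-based: roughly two blocks per SM, much better balance");
```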

Real Issues: Memory Dependencies

What "memory dependency" means: a load or store cannot be performed. Keep in mind that global-memory traffic covers both arrays actually placed in global memory AND spilled registers (said to be in "local memory", which is in fact a part of global memory).

Why it happens:
- the resource is not available;
- the resource is fully utilized;
- too many requests of that type are outstanding.

Where it comes from (first suspects):
- the reorganization into GPlex (aka SlurpInIdx_fn);
- updating candidate arrays with the new best candidates;
- some dependencies on GPlex stores (the loads are fine);
- MatrixRepresentationStatic.h (Tracks): accesses are neither coalesced nor aligned (see the sketch after this list).

What should we start with? Hard to say: the profiler shows neither hotspots nor the time spent per function or per line of code.
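To illustrate the coalescing point from the last suspect, a sketch contrasting an array-of-structures track layout with a GPlex-style structure-of-arrays layout; the types and field counts are illustrative, not the mictest definitions:

```cpp
// Why the AoS -> GPlex (SoA) gather matters for memory dependencies.
struct TrackAoS { float par[6]; float cov[21]; int label; };  // illustrative

// AoS: thread i reads tracks[i].par[0]; consecutive threads touch addresses
// sizeof(TrackAoS) bytes apart, so each warp needs many memory transactions.
__global__ void load_aos(const TrackAoS* tracks, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = tracks[i].par[0];  // strided, uncoalesced
}

// SoA (GPlex-style): parameter 0 of all tracks is contiguous, so consecutive
// threads read consecutive floats and the warp's loads coalesce.
__global__ void load_soa(const float* par0, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = par0[i];  // unit-stride, coalesced
}
```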

Updating the candidate list with the new best candidates

Compiled with --keep: register spilling in local memory.

    Track tmp_track;                    // initialize rd1 to 0 … in local memory
    Track& base_track = candidates[…];
    tmp_track = base_track;             // load the track from global memory
                                        // and store it right back to local memory
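One possible mitigation, sketched with hypothetical types (Track and its fields are stand-ins, not the mictest definitions): operate through the reference instead of materializing the temporary, so nothing has to round-trip through spilled local memory:

```cpp
// Hypothetical sketch: avoid the local-memory round trip by updating the
// candidate in place through a reference to global memory.
struct Track { float par[6]; float cov[21]; float chi2; };  // stand-in type

__device__ void update_best_candidate(Track* candidates, int idx, float new_chi2) {
  Track& t = candidates[idx];  // reference: no full-struct copy, no spill
  t.chi2 = new_chi2;           // touch only the fields that actually change
}
```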