
Validation October 13, 2017

Validation: GPU vs SNB (Broadwell)
http://mlefebvr.web.cern.ch/mlefebvr/20171010_gpu_val_test/
https://github.com/mpbl/mictest/tree/gpu-validation
The three other plots lie along the abscissa with a null ordinate.

Bottleneck Hunting October 13, 2017

Single Event Profiling

To fill the GPU and hide data transfers, it helps to use a number of streams to schedule event computations concurrently on the GPU. Can we get a better understanding of all the bottlenecks in the GPU code, and can we rank them from most to least limiting? If we can significantly improve the performance of computing a single event, we should also improve the time-to-solution of multi-event reconstruction.

Note: we will be using CUDA 9 and the associated nvvp. While still not up to par with VTune, this version provides new features that make it easier to see what is going on.

Let's start with the clone engine, a priori the kernel taking the most serious performance hit.
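As a reminder of the multi-stream idea, here is a minimal sketch; process_event, the stream count, and the buffer size are placeholders, not the mictest code:

```cpp
// Minimal sketch of overlapping per-event transfers and kernels with CUDA
// streams. process_event, kStreams, and kN are illustrative stand-ins.
#include <cuda_runtime.h>

__global__ void process_event(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;  // stand-in for the real per-event kernel
}

int main() {
  const int kStreams = 4;   // one stream per in-flight event
  const int kN = 1 << 20;   // elements per event (illustrative)
  cudaStream_t streams[kStreams];
  float *h[kStreams], *d[kStreams];

  for (int s = 0; s < kStreams; ++s) {
    cudaStreamCreate(&streams[s]);
    cudaMallocHost((void**)&h[s], kN * sizeof(float));  // pinned: enables async copies
    cudaMalloc((void**)&d[s], kN * sizeof(float));
  }
  for (int s = 0; s < kStreams; ++s) {
    // Within one stream the three operations are ordered; across streams the
    // copies and kernels overlap, hiding the transfer latency.
    cudaMemcpyAsync(d[s], h[s], kN * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
    process_event<<<(kN + 255) / 256, 256, 0, streams[s]>>>(d[s], kN);
    cudaMemcpyAsync(h[s], d[s], kN * sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
  }
  cudaDeviceSynchronize();
  for (int s = 0; s < kStreams; ++s) {
    cudaStreamDestroy(streams[s]);
    cudaFreeHost(h[s]);
    cudaFree(d[s]);
  }
  return 0;
}
```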

Occupancy: CE Seed-based

Achieved occupancy is half of the theoretical value, and we are limited by registers. With a GPlex size of 10,000 and a block size of 256, we have 40 blocks; each block is scheduled on one of the P100's 56 SMs, so some SMs remain idle. This explains the difference between achieved and theoretical occupancy. It is most likely mitigated by having concurrent events, since additional events get scheduled on the free SMs. Note that it does not make sense to use a larger GPlex size: the seed-based parallelization has a concurrency factor of at most the number of seeds.
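To make the register limit visible at runtime, a small sketch using the occupancy API; ce_kernel is a stand-in for the real clone-engine kernel, and 256 is the block size quoted above:

```cpp
// Sketch: query how many blocks per SM the register usage allows.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void ce_kernel(float* /*gplex*/) { /* stand-in body */ }

int main() {
  const int blockSize = 256;
  // ceil(10000 / 256) = 40 blocks, fewer than the P100's 56 SMs, so some
  // SMs sit idle: achieved occupancy falls below the theoretical figure.
  const int gridSize = (10000 + blockSize - 1) / blockSize;

  int blocksPerSM = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, ce_kernel, blockSize, 0);

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("%d blocks over %d SMs, register limit allows %d blocks/SM\n",
         gridSize, prop.multiProcessorCount, blocksPerSM);
  return 0;
}
```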

Occupancy: CE Track-based parallelization

With a GPlex size of 10,000 (thus equal to the number of seeds), we are in a very similar situation: imbalance between SMs due to a number of blocks that is too small.

Occupancy: CE Track-based parallelization

With the track-based parallelization, we can raise the number of threads up to the maximum number of track candidates. Using a GPlex size of 30,000:
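The block-count arithmetic behind the two configurations, written as compile-time checks; a sketch assuming the 256-thread blocks quoted on the seed-based slide, not code taken from mictest:

```cpp
// Block counts for the seed-based (10,000) and track-based (30,000) sizes.
constexpr int kBlock = 256;  // threads per block (from the text)
constexpr int kSMs   = 56;   // SMs on a P100

constexpr int blocksFor(int nWorkItems) {
  return (nWorkItems + kBlock - 1) / kBlock;  // ceiling division
}

static_assert(blocksFor(10000) == 40 && blocksFor(10000) < kSMs,
              "seed-based: fewer blocks than SMs, so some SMs stay idle");
static_assert(blocksFor(30000) == 118 && blocksFor(30000) >= 2 * kSMs,
              "track-based: roughly two blocks per SM, much better balance");
```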

Real Issues: Memory Dependencies

What "memory dependency" means: a load or store cannot be performed. Keep in mind that global-memory traffic covers both arrays actually placed in global memory AND spilled registers (said to be in "local memory", which is in fact a part of global memory).

Why it happens:
- the resource is not available;
- the resource is fully utilized;
- too many requests of that type are outstanding.

Where it comes from (first suspects):
- the reorganization into GPlex (aka SlurpInIdx_fn);
- updating candidate arrays with the new best candidates;
- some dependencies on GPlex stores (the loads are fine);
- MatrixRepresentationStatic.h (Tracks): accesses are neither coalesced nor aligned (see the sketch after this list).

What should we start with? Hard to say: the profiler shows neither hotspots nor the time spent per function or per line of code.
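To illustrate the coalescing point from the last suspect, a sketch contrasting an array-of-structures track layout with a GPlex-style structure-of-arrays layout; the types and field counts are illustrative, not the mictest definitions:

```cpp
// Why the AoS -> GPlex (SoA) gather matters for memory dependencies.
struct TrackAoS { float par[6]; float cov[21]; int label; };  // illustrative

// AoS: thread i reads tracks[i].par[0]; consecutive threads touch addresses
// sizeof(TrackAoS) bytes apart, so each warp needs many memory transactions.
__global__ void load_aos(const TrackAoS* tracks, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = tracks[i].par[0];  // strided, uncoalesced
}

// SoA (GPlex-style): parameter 0 of all tracks is contiguous, so consecutive
// threads read consecutive floats and the warp's loads coalesce.
__global__ void load_soa(const float* par0, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = par0[i];  // unit-stride, coalesced
}
```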

Updating the candidate list with the new best candidates

Compiled with --keep: register spilling in local memory.

    Track tmp_track;                    // initialize rd1 to 0 … in local memory
    Track& base_track = candidates[…];
    tmp_track = base_track;             // load the track from global memory
                                        // and store it right back to local memory
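One possible mitigation, sketched with hypothetical types (Track and its fields are stand-ins, not the mictest definitions): operate through the reference instead of materializing the temporary, so nothing has to round-trip through spilled local memory:

```cpp
// Hypothetical sketch: avoid the local-memory round trip by updating the
// candidate in place through a reference to global memory.
struct Track { float par[6]; float cov[21]; float chi2; };  // stand-in type

__device__ void update_best_candidate(Track* candidates, int idx, float new_chi2) {
  Track& t = candidates[idx];  // reference: no full-struct copy, no spill
  t.chi2 = new_chi2;           // touch only the fields that actually change
}
```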