Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, Andreas Moshovos, University of Toronto. Demystifying GPU Microarchitecture through Microbenchmarking.

Presentation transcript:

Demystifying GPU Microarchitecture through Microbenchmarking
Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, Andreas Moshovos, University of Toronto
Presented by Henry Wong

2 GPUs
Graphics Processing Units: increasingly programmable commodity hardware with 10x the arithmetic and memory bandwidth of a CPU, but requiring on the order of 1000-way parallelization.
NVIDIA CUDA: GPU threads are written in a C variant, compiled into native instructions, and run in parallel on the GPU.

3 How Well Do We Know the GPU?
How much do we know? Enough to write programs and do basic performance tuning: we know that caches exist, approximate memory speeds, and instruction throughput.
Why do we want to know more? Detailed optimization, compiler optimizations (e.g., Ocelot), and performance modeling (e.g., GPGPU-Sim).

4 Outline
Background on CUDA; pipeline latency and throughput; branch divergence; syncthreads barrier synchronization; memory hierarchies (cache and TLB organization, cache sharing); ...more in the paper.

5 CUDA Software Model
Hierarchy: Grid > Block > Thread. Threads branch independently; threads inside a Block may interact; Blocks are independent of one another.
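To make the Grid/Block/Thread hierarchy concrete, here is a minimal CUDA kernel sketch (not from the original slides; the kernel and variable names are illustrative):

// Minimal kernel sketch: each thread handles one array element.
__global__ void scale(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index within the grid
    if (idx < n)
        data[idx] = data[idx] * factor;               // threads may branch independently
}

// Host-side launch: a Grid of Blocks, each Block a group of Threads.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);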

6 CUDA Blocks and SMs
Blocks are assigned to SMs. (Figure: mapping of software Blocks onto hardware SMs.)

7 CUDA Hardware Model
SM: Streaming Multiprocessor. Warp: a group of 32 threads executed like a vector. TPC: Thread Processing Cluster. The SMs connect through an interconnect to the memory hierarchy.
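A small illustrative sketch (not from the slides) of how a thread's warp and lane follow from its thread index, given that warps are fixed groups of 32 threads:

__global__ void warp_ids(int *warp_id, int *lane_id)
{
    // Threads are grouped into warps of 32 consecutive thread IDs within a block
    // (assumes a 1-D block for simplicity).
    int tid = threadIdx.x;
    warp_id[tid] = tid / 32;   // which warp of the block this thread belongs to
    lane_id[tid] = tid % 32;   // the thread's lane within its warp
}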

8 Goal and Methodology
Aim: discover microarchitecture beyond what is documented. Approach: microbenchmarks; measure code timing and behaviour, then infer the microarchitecture. Time is measured using the clock() function, which has two-clock-cycle resolution.

9 Arithmetic Pipeline: Methodology
Objective: measure instruction latency and throughput. Latency is measured with one thread, throughput with many threads. Measure the runtime of the microbenchmark core, avoid letting the compiler optimize the code away, and discard the first iteration (cold instruction cache).

for (i = 0; i < 2; i++) {
    start_time = clock();
    t1 += t2;
    t2 += t1;
    ...
    t1 += t2;
    stop_time = clock();
}
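A hedged sketch of what the full microbenchmark kernel might look like (hypothetical names; the real code unrolls a long chain of dependent operations, abbreviated here with a small repeat macro):

#define REPEAT8(X) X X X X X X X X          // unroll helper for the dependent chain

__global__ void add_pipeline(unsigned int *result, unsigned int *cycles)
{
    unsigned int t1 = threadIdx.x, t2 = blockIdx.x;
    clock_t start_time = 0, stop_time = 0;

    for (int i = 0; i < 2; i++) {           // iteration 0 warms the instruction cache
        start_time = clock();
        REPEAT8(t1 += t2; t2 += t1;)        // dependent adds expose the SP latency
        stop_time = clock();
    }

    result[threadIdx.x] = t1 + t2;          // the used result keeps the adds alive
    cycles[threadIdx.x] = (unsigned int)(stop_time - start_time);
}

// Latency: launch with a single thread, e.g. add_pipeline<<<1, 1>>>(r, c);
// Throughput: launch with many threads, e.g. add_pipeline<<<1, 512>>>(r, c);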

10 Arithmetic Pipeline: Results
Three types of arithmetic units: SP (24-cycle latency, 8 ops/clk), SFU (28 cycles, 2 ops/clk), DPU (48 cycles, 1 op/clk). Peak SM throughput: 11.2 ops/clk, achieved with MUL or MAD + MUL.

11 SIMT Control Flow
Warps run in lock-step, but threads can still branch independently, causing branch divergence: the taken and fall-through paths are serialized, with the taken path (usually the "else" side) executed first and the fall-through path pushed onto a stack.

__shared__ int sharedvar;          // initialized to 0
while (!(sharedvar == tid))
    ;                              // spin until it is this thread's turn
sharedvar++;
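As an illustration of the serialization (a hedged sketch, not the paper's exact benchmark; names are hypothetical), a single-warp kernel can record a clock() timestamp on each side of a divergent branch: because the two sides run one after the other, one group's timestamps all follow the other's. Very short branches may be predicated by the compiler, so each side does a little extra work to force a real branch:

__global__ void record_branch_times(unsigned int *t, int *sink)
{
    int tid = threadIdx.x;   // launch with a single warp of 32 threads
    int x = tid;
    if (tid < 16) {
        for (int i = 0; i < 64; i++) x = x * 3 + 1;   // fall-through side
        t[tid] = (unsigned int)clock();
    } else {
        for (int i = 0; i < 64; i++) x = x * 5 + 7;   // taken side
        t[tid] = (unsigned int)clock();
    }
    sink[tid] = x;   // keep the loops from being optimized away
}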

12 Barrier Synchronization
__syncthreads() synchronizes warps, not individual threads: it does not resynchronize the diverged halves of a warp, but it will synchronize with other warps.
Example from the slide: Warp 0 diverges and reaches __syncthreads() separately on each path, while Warp 1 executes a single __syncthreads().

Warp 0:
if (tid < 16) {
    shared_array[tid] = tid;
    __syncthreads();                      // (1)
} else {
    __syncthreads();                      // (2)
    out[tid] = shared_array[tid % 16];
}

Warp 1:
__syncthreads();

Because the barrier counts warps, the __syncthreads() reached by the first half of Warp 0 together with Warp 1's __syncthreads() already satisfies a barrier; the two __syncthreads() calls inside the diverged warp do not synchronize with each other.

13 Texture Memory
(Plot: average access latency vs. memory footprint for stride accesses.) The results reveal a 5 KB, 20-way L1 texture cache and a 256 KB, 8-way L2 texture cache.
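The latency-vs-footprint curves come from stride-access microbenchmarks. A hedged sketch of the general technique (shown here as a dependent-load chase through an ordinary array; the texture experiments issue the same pattern through the texture fetch path, e.g. tex1Dfetch with the legacy texture-reference API):

// Each load depends on the previous one, so the average latency per access can be
// computed from the total elapsed cycles. Varying the array footprint and the
// stride exposes cache capacity, line size, and associativity.
__global__ void chase(unsigned int *array, unsigned int *out, int iterations)
{
    unsigned int j = 0;
    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        j = array[j];                            // array[k] holds the index of the next element
    clock_t stop = clock();
    out[0] = (unsigned int)(stop - start);       // total cycles for 'iterations' dependent loads
    out[1] = j;                                  // keep the chase from being optimized away
}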

14 Texture L1
5 KB total, 32-byte lines, 8 cache sets, 20-way set associative (8 sets x 20 ways x 32 bytes = 5 KB). The L2 is located across a non-uniform interconnect. (Plot: latency for L1 hits vs. L1 misses.)

15 Constant Cache Sharing
(Plot: two Blocks accessing constant memory, showing L1, L2, and L3 behaviour.) There are three levels of constant cache: L1 is per-SM, L2 is per-TPC, and L3 is global.
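The sharing experiments measure constant-memory latency while Blocks placed on the same or different SMs/TPCs touch the same data. A hedged sketch of a constant-memory latency probe (hypothetical names; the placement of Blocks onto SMs is controlled by the launch configuration, not by the kernel itself):

#define CONST_SIZE 4096                               // elements; footprint chosen per experiment

__constant__ unsigned int const_array[CONST_SIZE];    // holds a pre-built index chain (cudaMemcpyToSymbol)

__global__ void const_chase(unsigned int *out, int iterations)
{
    unsigned int j = 0;
    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        j = const_array[j];                           // dependent loads through the constant caches
    clock_t stop = clock();
    if (threadIdx.x == 0) {
        out[2 * blockIdx.x]     = (unsigned int)(stop - start);
        out[2 * blockIdx.x + 1] = j;                  // prevent dead-code elimination
    }
}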

16 Constant and Instruction Cache Sharing
By mixing two different types of accesses (constant data and instruction fetches), the experiments show that the L2 and L3 levels serve as combined constant and instruction caches.

17 Global Memory TLBs
L1 TLB: 8 MB coverage, 512 KB "line", 16-way. L2 TLB: 32 MB coverage, 4 KB line, 8-way. The L2 TLB is non-trivial to measure. (Plot: latency steps at the 8 MB L1 and 32 MB L2 coverage points.)
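TLB structure is exposed with the same dependent-load chase, but using large strides so that successive accesses fall on different mappings. A hedged sketch of the host-side setup (hypothetical names) that builds an index chain with a given stride and footprint; sweeping the footprint past the TLB coverage produces the latency steps in the plot:

#include <stddef.h>

// Build a chain where element k points stride_elems further on, wrapping at
// footprint_elems. Each dependent load then lands on a new region of memory;
// with a large enough stride, every access needs a distinct TLB entry.
static void build_chain(unsigned int *host_buf, size_t footprint_elems, size_t stride_elems)
{
    for (size_t k = 0; k < footprint_elems; k++)
        host_buf[k] = (unsigned int)((k + stride_elems) % footprint_elems);
}

// Usage sketch: allocate a device buffer of 'footprint' bytes, fill it with
// build_chain(), copy it to the device, and run the chase kernel from the
// texture/global-memory experiments on it, sweeping the footprint from a few MB
// to beyond the suspected TLB coverage.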

18 Conclusions
Microbenchmarking reveals undocumented microarchitecture features: the latency and throughput of the three arithmetic unit types, branch divergence handled with a stack, barrier synchronization operating on warps, and the memory hierarchy (cache organization and sharing, TLBs). Applications: measuring other architectures and microarchitectural features, and supporting GPU code optimization and performance modeling.

19 Questions...

20 Filling the SP Pipeline
With a 24-cycle SP pipeline latency and each warp issuing over 4 cycles on the 8 SPs, 6 warps (24 clocks, 192 "threads") should fill the pipeline; 2 warps suffice if there is enough instruction-level independence (not shown). Scheduling among warps is fair on average.

21 Instruction Fetch?
Methodology: capture a burst of timings at each iteration, then average; one thread measures timing while other warps thrash one line of the L1 instruction cache. Finding: instructions are fetched from L1 in groups of 64 bytes.