Understanding Outstanding Memory Request Handling Resources in GPGPUs

Understanding Outstanding Memory Request Handling Resources in GPGPUs
Ahmad Lashgar, Ebad Salehi, and Amirali Baniasadi
ECE Department, University of Victoria
June 1, 2015
The Sixth International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2015)

This Work
- GPUs are energy-efficient, high-throughput processors, so understanding them is critical.
- We uncover GPU microarchitecture details. How? We develop resource-targeting micro-benchmarks.
[Figure: the micro-benchmarks apply memory pressure to the GPU, measure the resulting delay, and analyze it.]

Outline
- Overview of GPU micro-architecture design
- Outstanding memory access handling unit: MSHR or PRT (alternative designs for handling outstanding memory accesses)
- Thread-latency plot
- Micro-benchmarking Fermi and Kepler GPUs
- Conclusions

NVIDIA-like GPU Micro-architecture
Each SM is a multi-threaded SIMD core.
[Figure: the GPU comprises SMs SM1..SMN connected through an interconnect to L2 cache slices and memory controllers MC1..MCN backed by DRAM. Each SM holds a warp pool (e.g., Warp 1 | PC | T1 T2 T3 T4), a scoreboard, int/float and special-function units, a load/store unit, and an L1 cache with tag and data arrays; the outstanding memory access handling unit sits beside the L1 cache.]

Outstanding Memory Access Handling Unit
- MSHR: Miss Status/Information Holding Register, used in GPGPU-Sim [Bakhoda2009]
- PRT: Pending Request Table, introduced in an NVIDIA patent [Nyland2013]

MSHR
The first candidate structure for handling outstanding memory accesses: the Miss Status/Information Holding Register.
- A table of N entries
- Each entry tracks one missing memory block
- Each entry can be shared among M memory requests

MSHR Sample Operation
[Figure: a warp load issues and misses in the L1 cache. The load/store unit allocates one MSHR entry per missing 128-byte block (e.g., 0x0A4, 0x0A5, 0x0A6), merging up to M requests per entry together with each thread's byte offset; Warp 1's T1/T2 and T3/T4 share entries #1 and #2, while Warp 2's T5/T6/T8 share entry #3. The block address and MSHR ID are sent to the lower memory level, and returning data (DATA | #1, ...) is matched back through the MSHR ID.]
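To make the bookkeeping concrete, here is a minimal host-side C++ sketch of an MSHR-style table with the merge step shown in the figure. N and M values, field names, and layout are illustrative assumptions, not taken from GPGPU-Sim or the hardware:

    #include <cstdint>

    constexpr int N_ENTRIES = 128; // table size (hypothetical)
    constexpr int M_MERGERS = 8;   // merge slots per entry (hypothetical)

    struct MshrEntry {
        bool     valid     = false;
        uint64_t blockAddr = 0;          // 128-byte-aligned block address
        int      numMerged = 0;          // requests merged so far
        int      threadId[M_MERGERS];    // which thread asked
        uint8_t  offset[M_MERGERS];      // byte offset within the block
    };

    // Try to record a missing load; returns false when the table is full,
    // which is exactly the stall the micro-benchmarks aim to provoke.
    bool mshrAllocate(MshrEntry table[N_ENTRIES], uint64_t addr, int tid) {
        uint64_t block = addr & ~0x7Full;        // 128-byte block address
        // 1) Merge with an in-flight miss to the same block, if a slot remains.
        for (int i = 0; i < N_ENTRIES; ++i) {
            if (table[i].valid && table[i].blockAddr == block &&
                table[i].numMerged < M_MERGERS) {
                table[i].threadId[table[i].numMerged] = tid;
                table[i].offset[table[i].numMerged]   = uint8_t(addr & 0x7F);
                table[i].numMerged++;
                return true;                     // merged; no new request sent
            }
        }
        // 2) Otherwise allocate a fresh entry and send one request downstream.
        for (int i = 0; i < N_ENTRIES; ++i) {
            if (!table[i].valid) {
                table[i] = MshrEntry{};
                table[i].valid       = true;
                table[i].blockAddr   = block;
                table[i].threadId[0] = tid;
                table[i].offset[0]   = uint8_t(addr & 0x7F);
                table[i].numMerged   = 1;
                return true;
            }
        }
        return false; // table full: the load/store unit must stall
    }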

PRT
The second candidate structure for handling outstanding memory accesses: the Pending Request Table.
- A table of N entries
- Each entry tracks one warp memory instruction
- Each entry may generate multiple memory requests
- A thread mask specifies the threads associated with each request

PRT Sample Operation
[Figure: a warp load issues and misses in the L1 cache. The load/store unit allocates one PRT entry per warp memory instruction, recording the warp ID, the number of pending requests (Warp 1: 2, Warp 2: 1), and each thread's offset within its memory block. Every request sent to the lower memory level carries the block address, PRT ID, and a thread mask (e.g., 0x0A4 | #1 | 1100); returning data (DATA | #1 | 1100, ...) uses the PRT ID and mask to wake exactly the waiting threads.]
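By contrast, a PRT-style table is indexed per warp instruction rather than per block. A minimal sketch under the same caveat (names and layout are illustrative, only loosely following the description in [Nyland2013]):

    #include <cstdint>

    constexpr int WARP_SIZE = 32;

    struct PrtEntry {
        bool    valid       = false;
        int     warpId      = -1;
        int     pendingReqs = 0;          // memory requests still in flight
        uint8_t offset[WARP_SIZE];        // per-thread offset within its block
    };

    // Each request sent downstream carries (prtId, threadMask) so the reply
    // can be routed back to exactly the lanes that wanted that block.
    struct MemRequest {
        uint64_t blockAddr;
        int      prtId;
        uint32_t threadMask;  // bit i set => lane i reads from this block
    };

    // Sketch of coalescing one warp load into per-block requests: a single
    // PRT entry covers the whole instruction, however many blocks it touches.
    int prtIssue(PrtEntry& e, int prtId, int warpId,
                 const uint64_t addr[WARP_SIZE], MemRequest out[WARP_SIZE]) {
        e.valid = true;
        e.warpId = warpId;
        int nReq = 0;
        uint32_t done = 0;
        for (int i = 0; i < WARP_SIZE; ++i) {
            if (done & (1u << i)) continue;
            uint64_t block = addr[i] & ~0x7Full;
            uint32_t mask = 0;
            for (int j = i; j < WARP_SIZE; ++j) {
                if (!(done & (1u << j)) && (addr[j] & ~0x7Full) == block) {
                    mask |= 1u << j;
                    e.offset[j] = uint8_t(addr[j] & 0x7F);
                }
            }
            done |= mask;
            out[nReq++] = MemRequest{block, prtId, mask};
        }
        e.pendingReqs = nReq;
        return nReq;
    }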

Stressing the Memory Access Handling Unit
Single-thread multiple-loads: kernel<<<1,1>>>(...)

    tic = clock();
    char Ld1 = A[128*1];
    char Ld2 = A[128*2];
    ...
    char LdN = A[128*N];
    toc = clock();
    Time = toc - tic;

Which resource is saturated here? The scoreboard.
[Plot: time vs. number of loads N (1 to 8); latency rises in steps as N grows.]

Stressing the Memory Access Handling Unit (2)
Parallel-threads single-load: kernel<<<1,M>>>(...)

    tic = clock();
    char Ld1 = A[128*1];
    __syncthreads();
    toc = clock();
    Time = toc - tic;

Which resource is saturated here? The outstanding memory access handling unit.
[Plot: time vs. number of parallel threads M (8 to 96); latency jumps once the unit fills.]

Stressing the Memory Access Handling Unit (3)
Parallel-threads multiple-loads: kernel<<<1,M>>>(...)

    tic = clock();
    char Ld1 = A[128*1];
    char Ld2 = A[128*2];
    ...
    char LdN = A[128*N];
    __syncthreads();
    toc = clock();
    Time = toc - tic;

[Plot: time vs. parallel threads M (8 to 96), one curve each for 1, 2, and 4 loads per thread.]
This is the way to occupy the resources: parallel threads with multiple loads each. A runnable sketch follows.
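A minimal CUDA sketch of this parallel-threads multiple-loads micro-benchmark; the authors' actual harness is not shown in the slides, so the array name, indexing, and sweep granularity here are assumptions built from the pseudo-code above:

    #include <cstdio>
    #include <cuda_runtime.h>

    #define STRIDE 128  // one access per 128-byte memory block

    // Each of the M threads issues nLoads independent loads to unique blocks
    // and then synchronizes; clock64() brackets the region, so a jump in the
    // measured time reveals when the outstanding-request structure fills up.
    __global__ void stress(const char *__restrict__ A, int nLoads,
                           long long *time, int *sink) {
        int tid = threadIdx.x;
        long long tic = clock64();
        int acc = 0;
        #pragma unroll
        for (int i = 0; i < nLoads; ++i)
            acc += A[(tid * nLoads + i) * STRIDE];  // unique block per load
        __syncthreads();
        long long toc = clock64();
        if (tid == 0) *time = toc - tic;
        if (acc == 123456) *sink = acc;  // keep the loads past the optimizer
    }

    int main() {
        const int maxThreads = 1024, nLoads = 4;
        char *dA; long long *dT; int *dS;
        cudaMalloc(&dA, (size_t)maxThreads * nLoads * STRIDE);
        cudaMalloc(&dT, sizeof(long long));
        cudaMalloc(&dS, sizeof(int));
        for (int m = 8; m <= maxThreads; m += 8) {   // sweep thread count M
            stress<<<1, m>>>(dA, nLoads, dT, dS);    // one thread block, one SM
            long long t;
            cudaMemcpy(&t, dT, sizeof(t), cudaMemcpyDeviceToHost);
            printf("%4d threads: %lld cycles\n", m, t);
        }
        cudaFree(dA); cudaFree(dT); cudaFree(dS);
        return 0;
    }

Sweeping M for several fixed load counts reproduces the family of curves on the slide; the thread count at which the latency jumps marks the saturated resource.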

MSHR vs. PRT Test
How can we tell whether the structure is an MSHR or a PRT? Run the parallel-threads single-load test with different memory patterns:
- All Unique: every thread fetches a unique memory block. Expected spike: MSHR at X threads, PRT at Y threads.
- Two-Coalesced: every two neighbor threads fetch a common unique memory block. Expected spike: MSHR at 2X threads (each thread now consumes half an entry), but PRT still at Y threads (the number of warp instructions is unchanged).

Finding MSHR Parameters
Memory patterns over 128-byte blocks:
- All Unique
- Two-coalesced
- Four-coalesced
- Eight-coalesced
[Figure: thread-to-block mappings in the global memory address space; under All Unique, threads T1, T2, T3, ... map to blocks #1, #2, #3, ..., while under Two-coalesced T1 and T2 share block #1, T3 and T4 share block #2, and so on.]
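All of these patterns reduce to one index computation. A device-side sketch, where k is the number of neighbor threads sharing one 128-byte block (k = 1 gives All Unique, k = 2 Two-coalesced, and so on); the function name and the byte offsets within the block are illustrative:

    // Hypothetical address generation for the k-coalesced pattern:
    // threads grouped k at a time land in the same 128-byte block.
    __device__ const char* patternAddr(const char* A, int tid, int k) {
        return A + (tid / k) * 128   // shared block for each group of k threads
                 + (tid % k);        // distinct byte offset within the block
    }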

Thread-Latency Plot
Loads/thread: 1; pattern: All Unique.
[Plot: latency vs. thread count, with a spike at the saturation point.]
- Spikes in the variance of the latency indicate a large change in latency.
- The lack of data near the start of the X axis (e.g., zero threads) causes distortion there.
- A window of three neighbors proved large enough to find the spike yet small enough to avoid smoothing it away.
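The spike-finding itself can run offline over the measured latencies. A plausible host-side sketch of the three-neighbor test described above (the threshold factor is an assumption, not a value from the paper):

    #include <vector>

    // Flag the first thread count whose latency jumps well above the local
    // trend, averaging the three preceding neighbors to damp noise without
    // smoothing away the spike itself.
    int findSpike(const std::vector<double>& latency, double factor = 1.5) {
        for (size_t i = 3; i < latency.size(); ++i) {
            double local = (latency[i-1] + latency[i-2] + latency[i-3]) / 3.0;
            if (latency[i] > factor * local)
                return (int)i;  // index (thread count) of the first spike
        }
        return -1;  // no spike observed
    }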

Methodology
Two NVIDIA GPUs, both with CUDA Toolkit v5.5:
- Tesla C2070 (Fermi architecture)
- Tesla K20c (Kepler architecture)
Micro-benchmarking each GPU answers two questions: is the structure an MSHR or a PRT, and what are its parameters?

#1 Fermi: MSHR vs. PRT Test
Test: multiple-threads single-load, under two memory patterns:
- All Unique: every thread fetches a unique memory block
- Two-Coalesced: every two neighbor threads fetch a common unique memory block
Result: the spike moves with the number of unique blocks, so the structure behaves like an MSHR and may not be a PRT.

#1 Fermi: MSHR Parameters
Configurations tested (see the thread-to-block mapping figures above):
- Loads/thread: 1, pattern: All Unique
- Loads/thread: 1, pattern: Two-Coalesced
- Loads/thread: 2, pattern: Four-Coalesced
- Loads/thread: 4, pattern: Eight-Coalesced

LdpT | Pattern | Threads at spike | # warps (Threads/32) | Unique blocks (LdpT x Threads / Pattern) | Finding
  1  |    1    | 128 | 4 | 128 | 128 entries, >= 1 merger
  1  |    2    | 256 | 8 | 128 | 128 entries, >= 2 mergers
  2  |    4    | 256 | 8 | 128 | 128 entries, >= 4 mergers
  4  |    8    | 256 | 8 | 128 | 128 entries, >= 8 mergers
(Pattern: 1 = All Unique, 2 = Two-coalesced, etc. The spike pins the maximum number of MSHR entries to the unique-block count and the minimum number of mergers to the pattern width.)

We did not observe a spike beyond the eight-merger pattern.
Conclusion: 128 entries, with at least 8 mergers per entry.
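Reading the last row as a worked check: with 4 loads per thread and the Eight-coalesced pattern, the spike at 256 threads means 4 x 256 / 8 = 128 unique blocks were in flight, so the table holds 128 entries and each entry must have merged at least 8 requests.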

#2 Kepler: MSHR vs. PRT Test
Test: multiple-threads two-loads, under two memory patterns:
- All Unique: every thread fetches a unique memory block
- Two-Coalesced: every two neighbor threads fetch a common unique memory block
Result: the spike tracks the number of warp instructions rather than unique blocks, so the structure behaves like a PRT and may not be an MSHR.

#2 Kepler: PRT Parameters

LdpT | Pattern    | Threads at spike | # warps | Unique blocks | Max # PRT entries (# warps x LdpT)
  1  | All Unique | no spike         |   -     |   -           |   -
  2  | All Unique | 678              |  ~22    | 1356          |  ~44
  3  | All Unique | 454              |  ~15    | 1362          |  ~45

We observed scoreboard saturation beyond 3 loads per thread.
Conclusion: roughly 44 entries.
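As a worked check of the formula: at 2 loads per thread the spike at 678 threads is ~22 warps, giving 22 x 2 = 44 in-flight warp instructions; at 3 loads per thread, 454 threads is ~15 warps, giving 15 x 3 = 45. Both estimates converge on roughly 44 PRT entries.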

Conclusion
Findings:
- Tesla C2070: MSHR with at most 128 entries and at least 8 mergers per entry. MLP per SM: 1024 (128 x 8) for coalesced accesses, 128 for un-coalesced.
- Tesla K20c: PRT with at most 44 entries and 32 requests per entry (the warp size). MLP per SM: 1408 (44 x 32), coalesced or not.
Optimization takeaway for programmers: diminishing returns from memory coalescing on recent architectures.
- We observed only slight sensitivity to coalescing on the K20c, which shows how much memory-level parallelism that architecture extracts; e.g., we compared a naive matrix-matrix multiplication against a tiling-optimized version on both architectures.

Thank You! Questions?

References
[Nyland2013] L. Nyland et al. Systems and methods for coalescing memory accesses of parallel threads. US Patent 8,392,669, Mar. 5, 2013.
[Bakhoda2009] A. Bakhoda et al. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS 2009.

Micro-benchmark Skeleton
Modify the UNQ address-generation procedure to get the different memory patterns:
- Each access is unique, or
- Every two, four, eight, 16, or 32 neighbor threads access a common block
nThreads: 1 to 1024

#1 Tesla C2070: MSHR or PRT Test?
[Figure: thread-to-block mappings for the merger1 and merger2 patterns.]

LdpT | Pattern | Threads at spike | # warps | Max # PRT entries (# warps x LdpT) | Max # MSHR entries (unique blocks)
  1  | merger1 | 128 | 4 |  4 | 128
  2  | merger1 |  64 | 2 |  4 | 128
  1  | merger2 | 256 | 8 |  8 | 128

The implied MSHR size is consistent (128 in every row) while the implied PRT size is not (4, 4, 8): the structure is an MSHR and may not be a PRT.

#2 Tesla K20c: MSHR or PRT Test?

LdpT | Pattern | Threads at spike | # warps | Max # PRT entries (# warps x LdpT) | Max # MSHR entries (unique blocks)
  2  | merger1 | 678 | ~22 | 44 | 1356
  2  | merger2 | 682 | ~22 | 44 |  682
  2  | merger4 | 682 | ~22 | 44 |  341

The implied PRT size is consistent (~44 in every row) while the implied MSHR size is not (1356, 682, 341): the structure is a PRT and may not be an MSHR.

Stressing Outstanding Memory Access Resources
- Launch a single thread block of 1 to 1024 threads
- Each thread fetches a single unique 128-byte memory block per load
- Multiple loads per thread: 1 to 4
[Figure: global memory address space at 128-byte granularity (0x0000, 0x0080, 0x0100, 0x0180, 0x0200, 0x0280, ...); each thread in the block has a distinct address for its first and second loads.]
- Goal: occupy the outstanding memory access resources of a single SM
- Stop before saturating the scoreboard