RegLess: Just-in-Time Operand Staging for GPUs

Presentation transcript:

RegLess: Just-in-Time Operand Staging for GPUs John Kloosterman, Jonathan Beaumont, D. Anoushe Jamshidi, Jonathan Bailey, Trevor Mudge, Scott Mahlke University of Michigan Electrical Engineering and Computer Science

Baseline execution model
- The GPU keeps many threads ready to issue on each SM
- State for all of these threads must always be resident
- The largest part of that state is the register file (RF)
[SM diagram: warp scheduler over warps 0-3, RF read, ALUs and LSU, RF writeback, L1; multiple SMs share the L2 and DRAM]

GPU RF energy
- GTX 980 register file: 256 KB per SM, so 256 KB x 16 SMs = 4 MB across the chip
- Consumes a large proportion of GPU energy (around 13.4% of chip energy on the GTX 480), and as GPUs provision more concurrency, register file energy will scale with it
- Not only large, but also very high bandwidth

Register working set
- In any given execution window, most of the RF is not accessed: roughly 10% of the RF covers the working set
- Could we swap registers in and out of a much smaller structure?

Register file caching
[Diagram: baseline places the full RF between the execution units (EUs) and the L1; prior work adds a small RF cache in front of the full RF; RegLess removes the full RF entirely and backs the small structure directly with the L1]

Just-in-time operand staging
- The compiler tells the hardware which registers (e.g. r1, r2, r3) will be accessed soon
- Before a warp can execute, a setup process stages storage for those registers in an Operand Staging Unit (OSU); the OSU does not simply fetch every register from the L1
- When a region finishes executing, registers that will be read later stay cached and the rest are erased
- Every register an instruction needs is guaranteed to be present
[SM diagram, simplified: warp scheduler, OSU replacing the RF read/writeback stages, ALUs and LSU, L1 out to L2/DRAM]
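A minimal sketch of this per-warp lifecycle, in C++. All names here (Region, OperandStagingUnit) are illustrative stand-ins, not the paper's hardware interface, and interior-space reservation is not modeled:

```cpp
// Sketch of the per-region staging lifecycle described above (illustrative only).
#include <set>
#include <vector>

struct Region {
    std::vector<int> inputs;     // registers produced by earlier regions
    int num_interiors;           // registers that live only inside this region (space reserved, not fetched)
    std::vector<int> live_outs;  // registers read again by later regions
};

class OperandStagingUnit {
    std::set<int> present;       // registers currently staged for this warp
public:
    // Setup: fetch inputs (from the OSU itself or the L1); interiors only need space reserved.
    void stage(const Region& r) {
        for (int reg : r.inputs) present.insert(reg);
    }
    // Teardown: keep registers that are used later, erase everything else.
    void retire(const Region& r) {
        present.clear();
        for (int reg : r.live_outs) present.insert(reg);
    }
    // By construction, every register the active region touches is present.
    bool holds(int reg) const { return present.count(reg) != 0; }
};
```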

High-level idea: regions
- Today, registers are allocated for the entire kernel, but not all registers are live at once, and some live registers go unaccessed for long stretches
- RegLess breaks the kernel into regions and allocates register storage only for the next region
[Timeline figure: lifetimes of r0, r1, r2 across regions 1, 2, 3]

Regions
- A GPU kernel can be interpreted as a control flow graph (CFG) made of basic blocks (BBs)
- Inside each BB is a dataflow graph
[Figure: CFG with BB 0 through BB 3; each block contains a small dataflow graph of adds, multiplies, and a load]

Dataflow graphs in BBs
- The dataflow graph has different kinds of register edges:
  - Inputs: generated before the BB
  - Outputs: generated in one BB, read by another
  - Interior: lifetime begins and ends in the same BB
[Figure: dataflow graph with input r5 feeding a chain of adds, multiplies, and a load]
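A rough sketch of how a compiler pass might classify a basic block's registers into these three classes. Instruction and BasicBlock are simplified stand-ins for a real IR, and the handling of redefined inputs is glossed over:

```cpp
// Classify registers of one basic block into inputs, outputs, and interiors (sketch).
#include <set>
#include <vector>

struct Instruction { std::vector<int> srcs; int dst; };
struct BasicBlock  { std::vector<Instruction> insts; };

struct RegClasses { std::set<int> inputs, outputs, interiors; };

// 'read_later' = registers read by some other basic block after this one.
RegClasses classify(const BasicBlock& bb, const std::set<int>& read_later) {
    RegClasses c;
    std::set<int> defined;                         // registers written so far in this BB
    for (const Instruction& I : bb.insts) {
        for (int s : I.srcs)
            if (!defined.count(s)) c.inputs.insert(s);      // value produced before the BB
        defined.insert(I.dst);
    }
    for (int d : defined) {
        if (read_later.count(d)) c.outputs.insert(d);       // read by another BB
        else if (!c.inputs.count(d)) c.interiors.insert(d); // lives and dies in this BB
    }
    return c;
}
```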

Just-in-time setup process
- The setup process guarantees a region has enough resources to run without stalling
- Inputs: fetch their values into the staging unit
- Interiors: only reserve space; interiors are preferable to inputs because they never have to be fetched
- Hardware allocates this space in the staging unit before the region runs
[Figure: dataflow graph with input r5 fetched and interior slots reserved]

Creating regions
- BBs are divided into regions; operand space is granted at region granularity
- Considerations:
  - Smaller regions are more flexible to schedule
  - Maximize the number of interior registers
  - Separate loads from their uses
- Full algorithm in the paper
[Figure: CFG from the previous slide, with BB 1 split into regions]
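To make one of these considerations concrete, the sketch below greedily cuts a basic block so that no region's register footprint exceeds a fixed staging budget. This is not the paper's algorithm (which also maximizes interiors and separates loads from uses); it only illustrates the kind of decision involved:

```cpp
// Greedy region splitter keeping each region's register footprint under a budget (sketch).
#include <set>
#include <vector>

struct Inst { std::vector<int> srcs; int dst; };

std::vector<std::vector<Inst>> split_into_regions(const std::vector<Inst>& bb, int budget) {
    std::vector<std::vector<Inst>> regions(1);
    std::set<int> footprint;                       // registers touched by the current region
    for (const Inst& I : bb) {
        std::set<int> trial = footprint;
        trial.insert(I.srcs.begin(), I.srcs.end());
        trial.insert(I.dst);
        if (!regions.back().empty() && (int)trial.size() > budget) {
            regions.emplace_back();                // start a new, smaller region
            footprint.clear();
            footprint.insert(I.srcs.begin(), I.srcs.end());
            footprint.insert(I.dst);
        } else {
            footprint = trial;
        }
        regions.back().push_back(I);
    }
    return regions;
}
```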

Region graph
- Regions carry annotations, delivered as metadata instructions: which input registers they need, how many registers are live at once, and register lifetimes
- A region is ready to execute when its input registers are available and space has been allocated for its interiors
- Example annotation: "Inputs: r1, Max live regs: 3" for the region r2 = r1 + 1; r3 = r2 * 3; r1 = r3
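One way these annotations could be encoded, purely as an illustration; the actual metadata-instruction format in the paper may differ:

```cpp
// Hypothetical encoding of the per-region annotations named on this slide.
#include <cstdint>
#include <vector>

struct RegionAnnotation {
    std::vector<uint8_t> input_regs;   // e.g. {r1}: must be staged before the region issues
    uint8_t max_live_regs;             // e.g. 3: peak register demand of the region
    // The hardware holds the warp until all inputs are present and
    // max_live_regs slots have been granted in the staging unit.
};

// The example region from the slide:
//   Inputs: r1, Max live regs: 3
//   r2 = r1 + 1; r3 = r2 * 3; r1 = r3
static const RegionAnnotation example = { {1}, 3 };
```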

Execution walkthrough
- New components in the SM: the capacity manager and the OSU (operand staging unit)
[SM diagram: capacity manager and OSU added alongside the warp scheduler, ALUs, LSU, and L1]

Execution walkthrough
- The capacity manager reads the metadata for each new region (here: "Inputs: r1, Max live regs: 3" for r2 = r1 + 1; r3 = r2 * 3; r1 = r3)
- The warp is not eligible to issue until the setup process completes
[SM diagram: metadata flowing to the capacity manager]

Execution walkthrough
- Using the metadata, the capacity manager preloads the region's inputs and reserves space for its interiors (here: fetch r1, reserve space for 2 interiors)
- The warp remains ineligible to issue until this setup completes
[SM diagram: capacity manager issuing the fetch/reserve requests to the OSU]

Execution walkthrough
- Interior registers only need space reserved; nothing is fetched for them
- If an input is not already in the OSU, it is fetched from the L1 (here: r1's data returns from the L1 for warp 3)
[SM diagram: fetch of r1 from the L1 into the OSU]

Execution walkthrough
- Once the reservation process is done, the warp can begin executing its instructions (r2 = r1 + 1; r3 = r2 * 3; r1 = r3)
- Unlike a cache, no misses are possible: everything the region needs is already staged
[SM diagram: warp 3 issuing, operands read from the OSU]

Capacity manager
- Responsible for allocating staging-unit capacity to regions
- Keeps counters for allocations and free space: per-warp stack state and prefetches remaining, plus per-bank counts of remaining and active registers
- Activates a region's warp when all of its inputs are ready
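A sketch of that bookkeeping, using the counters named on this slide. Bank count, per-bank capacity, and the allocation policy are assumptions for illustration, not the paper's configuration:

```cpp
// Capacity manager bookkeeping (sketch; sizes are assumed values).
#include <array>
#include <cstdint>

constexpr int kBanks = 4;          // assumed number of OSU banks
constexpr int kRegsPerBank = 128;  // assumed per-bank capacity in register slots

struct WarpState {
    uint16_t prefetches_remaining = 0;  // inputs still being fetched
    bool     region_active        = false;
};

class CapacityManager {
    std::array<uint16_t, kBanks> free_regs;    // "# remaining registers per bank"
    std::array<uint16_t, kBanks> active_regs;  // "# active registers per bank"
public:
    CapacityManager() { free_regs.fill(kRegsPerBank); active_regs.fill(0); }

    // Try to grant a region 'needed' register slots in one bank.
    bool try_allocate(int bank, uint16_t needed) {
        if (free_regs[bank] < needed) return false;   // not enough space yet
        free_regs[bank]   -= needed;
        active_regs[bank] += needed;
        return true;
    }

    // A warp becomes eligible to issue only once all of its input prefetches arrive.
    void input_arrived(WarpState& w) {
        if (w.prefetches_remaining > 0 && --w.prefetches_remaining == 0)
            w.region_active = true;                   // activate: all inputs ready
    }
};
```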

Operand staging unit
- Holds the registers for executing warps; storage is allocated by the capacity manager
- Structured like a banked cache (tag and data arrays per bank), so it can also cache some long-lived registers
- Evictions pass through a compressor on their way to the L1 (an optimization, next slide)
[Figure: four tag/data banks connected to the capacity manager, the read/writeback units, the compressor, and the L1]
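A sketch of one OSU bank organized like a small tagged, direct-mapped store. The geometry (lines per bank, one 32-bit value per lane) is an assumption for illustration; the key point is that lookups cannot miss:

```cpp
// One OSU bank, cache-like but miss-free by construction (sketch).
#include <array>
#include <cassert>
#include <cstdint>

struct OsuLine {
    bool valid = false;
    uint32_t tag = 0;                 // (warp id, register id) packed into a tag
    std::array<uint32_t, 32> data{};  // one value per lane of the warp
};

class OsuBank {
    std::array<OsuLine, 128> lines;   // assumed 128 lines per bank
public:
    // Unlike a normal cache, a read cannot miss: the capacity manager has already
    // fetched or reserved every register the active region touches.
    OsuLine& read(uint32_t tag) {
        OsuLine& l = lines[tag % lines.size()];   // direct-mapped index for simplicity
        assert(l.valid && l.tag == tag && "unreachable: operand was staged");
        return l;
    }
};
```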

Compressor
- Evicted registers pass through a small compressed cache before the L1, reducing pressure on the L1
- Values are matched against fixed patterns such as 0, 0, 0, 0, ... and 0, 4, 8, 12, ...
- ~75% of evicted registers are compressible
[Figure: preloads and evictions flow between the OSU, the compressor, and the L1]
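A sketch of the fixed-pattern test applied to an evicted register (one 32-bit value per lane). Only a base-plus-stride check covering the two patterns named on the slide is shown; the real compressor may recognize other patterns:

```cpp
// Base + stride pattern check for an evicted register (sketch).
#include <array>
#include <cstdint>
#include <optional>

struct CompressedReg { uint32_t base; uint32_t stride; };  // lane i holds base + i*stride

std::optional<CompressedReg> try_compress(const std::array<uint32_t, 32>& lanes) {
    uint32_t base   = lanes[0];
    uint32_t stride = lanes[1] - lanes[0];   // stride 0 covers 0,0,0,0,...; stride 4 covers 0,4,8,12,...
    for (uint32_t i = 0; i < lanes.size(); ++i)
        if (lanes[i] != base + stride * i)
            return std::nullopt;             // not compressible: spill to L1 uncompressed
    return CompressedReg{base, stride};
}
```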

Methodology
- Performance model: GPGPU-sim with a GTX 980 configuration; L1 accesses bypassed; 512 OSU lines per SM (vs. 2048 in the baseline); 48 compressed lines per SM
- Power model: Verilog model of RegLess and the baseline RF; GPUWattch for the other GPU components
- Benchmarks: all of Rodinia

Register file energy
- RegLess: 75.3% RF energy reduction, vs. 45.2% for RFV [2] and 62.0% for RFH [1]
- 11% overall GPU energy savings when factoring in metadata instructions, L1 loads, etc.
[1] "A compile-time managed multi-level register file hierarchy", Gebhart et al.
[2] "GPU Register File Virtualization", Jeon et al.

Performance
[Chart: normalized execution time per benchmark, lower is better; kmeans, nn, and hybridsort are discussed as outliers]

Input preload location
- Only 0.9% of input preloads come from the L1, and less than 0.02% from the rest of the memory system; the remainder are already resident in the OSU

Conclusion
- Instead of storing all registers or reactively caching them, proactively reserve space in a small staging unit
- Regions are annotated with their register needs; the hardware follows these compiler directives
- 75.3% RF energy savings with little performance impact
[Figure: RegLess staging unit between the execution units and the L1]

RegLess: Just-in-Time Operand Staging for GPUs John Kloosterman, Jonathan Beaumont, D. Anoushe Jamshidi, Jonathan Bailey, Trevor Mudge, Scott Mahlke University of Michigan Electrical Engineering and Computer Science