Presentation on theme: "RegLess: Just-in-Time Operand Staging for GPUs"— Presentation transcript:

1 RegLess: Just-in-Time Operand Staging for GPUs
John Kloosterman, Jonathan Beaumont, D. Anoushe Jamshidi, Jonathan Bailey, Trevor Mudge, Scott Mahlke University of Michigan Electrical Engineering and Computer Science

2 Baseline execution model
A GPU has many ready threads, and the state for those threads must always be kept active; the largest part of that state is the registers. [SM diagram: warp scheduler over warps 0-3, RF read, ALUs, LSU, RF write-back, L1; the GPU has multiple SMs connected to L2/DRAM.]

3 GPU RF energy
The GTX 980 register file is 256 KB per SM, 4 MB across all SMs. It consumes a large proportion of GPU energy, and it is not only large but also high-bandwidth. For the GTX 480, register file energy was around 13.4% of GPU energy, and as GPUs provision more concurrency, register energy will scale with it.

4 Register working set
The first observation is that we do not use all of the RF: most of the RF is not used in any given window, with the working set around 10% of the RF. Could we swap registers in and out of a smaller structure?

5 Register file caching
[Diagram comparing register storage organizations: the baseline RF feeding the EUs, a register file cache placed between the EUs and the full RF, and RegLess, where a small staging unit backed by the L1 takes the place of the full RF.]

6 Just-in-time operand staging
Before a warp can execute a region, the hardware sets up register storage for it: for example, if r1, r2, and r3 will be accessed soon, space for them is prepared in the Operand Staging Unit (OSU). When the region finishes executing, registers used later are cached and the others are erased, so every register a region needs is guaranteed to be present. This is a simplified view; the OSU performs a setup process rather than simply fetching all registers from the L1. [SM diagram: warp scheduler, Operand Staging Unit, RF read, ALUs, LSU, RF write-back, L1 to L2/DRAM.]

7 High-level idea: regions
In the baseline, registers are allocated for the entire kernel, but not all registers are live at all times, and some live registers go unaccessed for long stretches. RegLess breaks the kernel into regions and allocates operand space only for the next region. [Timeline figure: lifetimes of r0, r1, r2 across regions 1, 2, 3.]

8 Regions
A GPU kernel can be interpreted as a control flow graph (CFG) made of basic blocks; inside each basic block is a dataflow graph. [CFG figure: BB 0 through BB 3, with dataflow nodes such as ld, +, and × inside the blocks.]

9 Dataflow graphs in BBs
The dataflow graph has different kinds of edges. Inputs are generated before the BB; outputs are generated in one BB and read by another; interior registers have lifetimes that begin and end in the same BB. [Dataflow figure: ld, +, and × nodes, with r5 as an example register.]
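As a rough sketch of this classification (not the paper's compiler pass; the Instr record and the classify_registers helper are hypothetical), one pass over a basic block's def/use sets is enough:

```python
from collections import namedtuple

# A hypothetical instruction: one destination register and its source registers.
Instr = namedtuple("Instr", ["dst", "srcs"])

def classify_registers(block, live_in, live_out):
    """Split the registers touched by one basic block into the three
    edge kinds from the slide: inputs, outputs, and interiors."""
    defined = {i.dst for i in block if i.dst is not None}
    used = {s for i in block for s in i.srcs}
    inputs = used & live_in                          # produced before this BB
    outputs = defined & live_out                     # produced here, read by another BB
    interiors = (defined | used) - inputs - outputs  # lifetime stays inside the BB
    return inputs, outputs, interiors

# Example: r5 = ld; r6 = r5 + r1; r7 = r6 * r2, with r1, r2 live-in and r7 live-out.
block = [Instr("r5", []), Instr("r6", ["r5", "r1"]), Instr("r7", ["r6", "r2"])]
print(classify_registers(block, live_in={"r1", "r2"}, live_out={"r7"}))
```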

10 Just-in-time setup process
The setup process guarantees a region has enough resources to run without stalling: input registers have their values fetched, while interior registers only need space reserved. The hardware allocates space in the staging unit for each region. Interiors are always preferable to inputs because they never have to be fetched. [Dataflow figure repeated, highlighting r5.]
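Continuing the toy model from the previous sketch, the per-region setup work could be pictured like this (setup_actions and the action names are illustrative, not the hardware's interface):

```python
def setup_actions(inputs, interiors, osu_resident):
    """Build the staging-unit work needed before a region may start.
    osu_resident holds input registers whose values are already staged."""
    actions = []
    for r in sorted(inputs):
        if r in osu_resident:
            actions.append(("pin", r))            # already present, just keep it
        else:
            actions.append(("fetch_from_L1", r))  # value must be brought in
    for r in sorted(interiors):
        actions.append(("reserve", r))            # space only; never fetched
    return actions

print(setup_actions(inputs={"r1", "r2"}, interiors={"r5", "r6"}, osu_resident={"r2"}))
```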

11 Creating regions
Basic blocks are divided into regions, and operand space is granted at region granularity. Considerations: smaller regions are more flexible to schedule, the number of interior registers should be maximized, and loads should be separated from their uses. The full algorithm is in the paper. [CFG figure: BB 0 through BB 3 split into regions.]
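The toy splitter below only hints at the flavor of these considerations; the real pass (see the paper) also reasons about interior registers and load/use separation. The max_live cap and the live_after_instr input are assumptions for illustration:

```python
def split_block_into_regions(block, live_after_instr, max_live=4):
    """Toy splitter: cut a new region whenever the number of registers live
    at an instruction boundary reaches max_live. The real compiler pass also
    maximizes interior registers and separates loads from their uses."""
    regions, current = [], []
    for instr, live_count in zip(block, live_after_instr):
        current.append(instr)
        if live_count >= max_live:
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return regions

# Example: six instructions whose live-register count peaks at the third one.
print(split_block_into_regions(list("abcdef"), [1, 2, 4, 2, 3, 1], max_live=4))
```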

12 Region graph
Regions are annotated with the number of registers live at once, which input registers they need, and register lifetimes; the annotations are delivered through metadata instructions. A region is ready to execute when its input registers are available and space has been allocated for its interiors. Example region: Inputs: r1; Max live regs: 3; code: r2 = r1 + 1; r3 = r2 * 3; r1 = r3.
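A minimal way to picture the annotations and the readiness check, with illustrative field names rather than the paper's actual metadata encoding:

```python
from dataclasses import dataclass

@dataclass
class RegionMetadata:
    inputs: set     # registers that must be staged before the region runs
    max_live: int   # peak number of simultaneously live registers
    # (the real metadata also encodes per-register lifetime information)

def region_ready(meta, staged_inputs, free_slots):
    """A region may start once its inputs are staged and there is room
    for its peak register demand."""
    return meta.inputs <= staged_inputs and free_slots >= meta.max_live

meta = RegionMetadata(inputs={"r1"}, max_live=3)
print(region_ready(meta, staged_inputs={"r1"}, free_slots=3))  # True
```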

13 Execution walkthrough
New components in the SM: the capacity manager and the OSU (operand staging unit). [SM diagram: warp scheduler, capacity manager, OSU, ALUs, LSU, L1.]

14 Execution walkthrough
The capacity manager uses the metadata for new regions (e.g., Inputs: r1; Max live regs: 3). A warp is not eligible to issue until its setup process is complete. [SM diagram with the example region r2 = r1 + 1; r3 = r2 * 3; r1 = r3.]
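In scheduler terms the gate is just an extra readiness condition per warp; the eligible_warps helper below is an illustrative model, not the actual scheduler logic:

```python
def eligible_warps(warps, setup_done):
    """Only warps whose region setup has finished are visible to the
    warp scheduler; the rest simply wait, without stalling the SM."""
    return [w for w in warps if setup_done[w]]

setup_done = {"w0": True, "w1": False, "w2": True, "w3": False}
print(eligible_warps(["w0", "w1", "w2", "w3"], setup_done))  # ['w0', 'w2']
```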

15 Execution walkthrough
Using the metadata, the capacity manager preloads the region's inputs and reserves space for its interiors; here it fetches r1 and reserves space for two interior registers. The warp remains ineligible to issue until this setup completes. [SM diagram: capacity manager directing the OSU.]

16 Execution walkthrough
Interior registers only need space reserved. If an input is not already in the OSU, it is fetched from the L1; here r1 is fetched and its data returned to the OSU. [SM diagram: fetch request to the L1 and the returning r1 data.]

17 Execution walkthrough
Once the reservation process is done, the warp can begin executing the region's instructions (r2 = r1 + 1; r3 = r2 * 3; r1 = r3). Unlike a cache, no misses are possible: every register the region touches was staged or reserved up front. [SM diagram: warp w3 reading r1 from the OSU.]

18 Capacity manager
The capacity manager is responsible for allocating staging-unit capacity to regions. It keeps counters for allocations and free space, and activates regions when all of their inputs are ready. [Table figure: per-warp state (warp 0 … warp n) tracking warp stack state and # prefetches remaining, plus per-bank counters for # remaining registers and # active registers.]
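The counters in the figure can be sketched as a small bookkeeping structure (the names and the try_allocate interface are assumptions, not the hardware design):

```python
from dataclasses import dataclass

@dataclass
class BankState:
    free_registers: int    # "# remaining registers per bank"
    active_registers: int  # "# active registers per bank"

@dataclass
class WarpState:
    prefetches_remaining: int = 0  # inputs still in flight for the next region
    # (the real capacity manager also tracks warp stack state)

class CapacityManager:
    def __init__(self, banks, warps):
        self.banks = banks  # list of BankState
        self.warps = warps  # dict of warp id -> WarpState

    def try_allocate(self, warp_id, bank_id, n_regs, n_prefetches):
        """Grant a region's reservation only if the bank has room; the region
        becomes active once all of its prefetches have returned."""
        bank = self.banks[bank_id]
        if bank.free_registers < n_regs:
            return False
        bank.free_registers -= n_regs
        bank.active_registers += n_regs
        self.warps[warp_id].prefetches_remaining = n_prefetches
        return True

cm = CapacityManager([BankState(8, 0)], {"w3": WarpState()})
print(cm.try_allocate("w3", 0, n_regs=3, n_prefetches=1))  # True
```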

19 Operand staging unit
The OSU holds registers for executing warps. Its storage is allocated by the capacity manager, it is structured like a banked cache (tag and data arrays per bank), and it can cache some long-lived registers. As an optimization, evicted registers pass through a compressor on their way to the L1 (next slide). [Diagram: capacity manager and read/write-back units accessing four tag/data banks, with the compressor between the OSU and the L1.]
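Because space is granted before a region runs, a read from the staging unit can never miss. The toy banked structure below captures that contract; the bank count, hashing, and method names are assumptions for illustration:

```python
class OperandStagingUnit:
    """Toy model of a banked, tagged staging structure with guaranteed hits."""

    def __init__(self, num_banks=4, lines_per_bank=128):
        self.banks = [dict() for _ in range(num_banks)]  # tag -> value per bank
        self.lines_per_bank = lines_per_bank

    def _bank(self, warp_id, reg):
        return self.banks[hash((warp_id, reg)) % len(self.banks)]

    def allocate(self, warp_id, reg, value=None):
        """Called during setup: reserve a line, optionally filled with a
        value preloaded from the L1."""
        bank = self._bank(warp_id, reg)
        assert len(bank) < self.lines_per_bank, "capacity manager over-committed"
        bank[(warp_id, reg)] = value

    def read(self, warp_id, reg):
        # Unlike a cache, this lookup can never miss: setup guaranteed presence.
        return self._bank(warp_id, reg)[(warp_id, reg)]

osu = OperandStagingUnit()
osu.allocate("w3", "r1", value=42)
print(osu.read("w3", "r1"))  # 42
```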

20 Compressor
Evicted registers pass through a small compressed cache before reaching the L1, reducing pressure on the L1. The compressor matches fixed patterns across lanes such as 0, 0, 0, 0, ... and 0, 4, 8, 12, ...; about 75% of evicted registers are compressible. [Diagram: OSU preloads/evictions flowing through the compressor to the L1.]
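These fixed patterns amount to the per-lane values forming a base-plus-stride sequence; a hedged sketch of such a check (is_compressible is a hypothetical helper, not the hardware's encoder):

```python
def is_compressible(lane_values):
    """Return (True, base, stride) if the per-lane values of a register match
    a base + lane * stride pattern, e.g. 0,0,0,0,... or 0,4,8,12,..."""
    base = lane_values[0]
    stride = lane_values[1] - base if len(lane_values) > 1 else 0
    if all(v == base + i * stride for i, v in enumerate(lane_values)):
        return True, base, stride
    return False, None, None

print(is_compressible([0, 4, 8, 12]))  # (True, 0, 4)
print(is_compressible([3, 1, 4, 1]))   # (False, None, None)
```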

21 Methodology
Performance model: GPGPU-Sim with a GTX 980 configuration; L1 accesses bypassed; 512 OSU lines per SM (vs. … in the baseline); 48 compressed lines per SM. Power model: Verilog model of RegLess and the baseline RF, with GPUWattch for the other GPU components. Workloads: all Rodinia benchmarks.

22 Register file energy
RegLess reduces register file energy by 75.3%, compared with 45.2% for RFV [2] and 62.0% for RFH [1]. Factoring in metadata instructions, L1 loads, and other overheads, RegLess saves 11% of overall GPU energy. [1] "A compile-time managed multi-level register file hierarchy", Gebhart et al. [2] "GPU Register File Virtualization", Jeon et al.

23 Performance
[Performance chart, normalized execution time (lower is better); kmeans, nn, and hybridsort are called out for explanation.]

24 Input preload location
Only 0.9% of input preloads come from the L1, and less than 0.02% from the rest of the memory system.

25 Conclusion
Instead of storing all registers or caching them, RegLess proactively reserves space in a small staging unit. Regions are annotated with their register needs, and the hardware follows the compiler's directives. The result is a 75.3% register file energy saving with little performance impact. [Diagram: EUs, RegLess staging unit, L1.]

26 RegLess: Just-in-Time Operand Staging for GPUs
John Kloosterman, Jonathan Beaumont, D. Anoushe Jamshidi, Jonathan Bailey, Trevor Mudge, Scott Mahlke University of Michigan Electrical Engineering and Computer Science

