Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Ankit Sethia*, D. Anoushe Jamshidi, Scott Mahlke
University of Michigan
GPU usage is expanding
Graphics | Simulation | Linear Algebra | Data Analytics | Machine Learning | Computer Vision
GPUs are taking center stage in HPC. Traditional applications such as graphics, linear algebra, and simulation are growing more complex and use the GPU as their computational workhorse. Modern applications such as data analytics and machine learning want to exploit not only the computational power of GPUs but also the high memory bandwidth provided by GDDR5. As a result, all kinds of data-parallel applications, both compute intensive and memory intensive, are targeting GPUs. Next, look at the performance achieved by these different kinds of kernels.
Performance variation of kernels
[Plot: performance of compute-intensive vs. memory-intensive kernels]
Memory-intensive kernels saturate memory bandwidth and achieve lower performance.
Impact of memory saturation - I
[Illustration: multiple SMs, each with an L1, FPUs, and a load-store unit (LSU), sharing the memory system]
When the memory system is saturated, a warp can issue a new memory request only after an earlier one returns. Memory-intensive kernels therefore serialize memory requests, so it is critical to prioritize the order of requests coming from the SMs.
Impact of memory saturation
[Plot: earlier compute-intensive vs. memory-intensive performance data, with the fraction of cycles the LSU is stalled overlaid]
To quantify the impact of memory saturation, the earlier plot is extended with the fraction of cycles the LSU is stalled. As memory intensity increases, LSU stall cycles grow as high as 90%. Significant LSU stalls correspond to low performance in memory-intensive kernels: saturation keeps the core from receiving the data it needs to begin processing.
Impact of memory saturation - II
[Illustration: warp W1's data is present in the L1 cache blocks, but the cache cannot accept new requests, so the LSU cannot access it]
While the memory system is saturated, the cache cannot accept new requests. Even data already present in the cache cannot be read, so the SM is unable to feed enough data to its processing elements.
Increasing memory resources
Large # of MSHRs + full associativity + 20% bandwidth boost: UNBUILDABLE
Mascar = Mas + Car
During memory saturation:
- Serialization of memory requests causes little overlap between memory accesses and compute: Memory Aware Scheduling (Mas)
- Data present in the cache cannot be reused because the data cache cannot accept any request: Cache Access Re-execution (Car)
Memory Aware Scheduling
[Illustration: warps 0, 1, and 2 each issue memory requests until the memory system saturates]
This slide shows the shortcoming of current schedulers, and how MAS improves the overlap of compute and memory, with an analogy: serving one request and then switching to another warp (round-robin) leaves no warp ready to make forward progress.
Memory Aware Scheduling
GTO issues instructions from another warp whenever:
- there is no instruction in the i-buffer for the current warp, or
- there is a dependency between instructions.
Under saturation this behaves like round-robin, since multiple warps may still issue memory requests.
- Serving one request and switching to another warp (RR): no warp is ready to make forward progress.
- Serving all requests from one warp before switching to another (MAS): one warp is ready to begin computation early.
MAS operation
1. Check whether the kernel is memory intensive (MSHRs or miss queue almost full).
2. If not, schedule in Equal Priority (EP) mode.
3. If so, schedule in Memory Priority (MP) mode: assign an owner warp; only the owner's requests can go beyond the L1.
4. In MP mode, execute memory instructions only from the owner; other warps can execute compute instructions.
5. If the owner's next instruction depends on an already issued load, assign a new owner warp.
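The MP/EP decision above can be sketched in software. This is a hedged, simplified model, not the paper's hardware: the function name, the warp-record fields, and the single-instruction-per-cycle view are our own illustrative assumptions.

```python
def pick_instruction(warps, owner, saturated):
    """Pick (warp_id, inst_type) to issue this cycle, or None if nothing is ready.

    Each warp is a dict with (illustrative fields, not the paper's RTL):
      'id'          - warp id
      'next'        - 'mem' or 'compute': type of its next ready instruction
      'dep_on_load' - True if that instruction depends on an outstanding load
    """
    if not saturated:
        # EP mode: any warp not blocked on a load may issue either type.
        for w in warps:
            if not w['dep_on_load']:
                return (w['id'], w['next'])
        return None
    # MP mode: only the owner warp may issue memory instructions;
    # other warps are still free to issue compute instructions.
    for w in warps:
        if w['dep_on_load']:
            continue                      # blocked behind an outstanding load
        if w['next'] == 'mem' and w['id'] != owner:
            continue                      # non-owner memory requests held back
        return (w['id'], w['next'])
    return None
```

In MP mode the owner drains all of its memory requests back to back, so it becomes ready to compute sooner than under round-robin, while the other warps keep the FPUs busy with any independent compute work.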
Implementation of MAS
[Block diagram: decode, I-buffer, scoreboard, and scheduler, extended with the Warp Status Table (WST) and Warp Readiness Checker (WRC); the WRC tracks a per-warp stall bit, the memory saturation flag, and the op type; the WST holds ordered warps with a memory queue head (Mem_Q Head) and a compute queue head (Comp_Q Head), feeding the issued warp]
- Warp Status Table (WST): divides the ordered warps into memory warps and compute warps; decides whether the scheduler should schedule from a warp.
- Warp Readiness Checker (WRC): tests whether a warp should be allowed to issue memory instructions.
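A minimal sketch of what these two structures compute, under our own simplified model: the field names, the compute-before-memory ordering, and the exact stall condition are illustrative assumptions, not the paper's hardware design.

```python
def order_warps(wst):
    """WST sketch: split ordered warps into a compute queue and a memory queue.

    Each WST entry is a dict with illustrative fields 'warp_id' and 'optype'
    ('compute' or 'mem'). One plausible issue order is compute-ready warps
    first, then warps waiting to issue memory instructions.
    """
    compute_q = [e['warp_id'] for e in wst if e['optype'] == 'compute']
    mem_q = [e['warp_id'] for e in wst if e['optype'] == 'mem']
    return compute_q + mem_q


def wrc_stall_bits(wst, owner, saturation_flag):
    """WRC sketch: per-warp stall bit.

    While the memory saturation flag is set, a warp whose next instruction is
    a memory instruction stalls unless it is the owner warp.
    """
    return {e['warp_id']: (saturation_flag
                           and e['optype'] == 'mem'
                           and e['warp_id'] != owner)
            for e in wst}
```

When the saturation flag clears, every stall bit drops and the scheduler falls back to treating all warps equally.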
Mascar = Mas + Car
During memory saturation:
- Serialization of memory requests causes little overlap between memory accesses and compute: Memory Aware Scheduling (Mas)
- Data present in the cache cannot be reused because the data cache cannot accept any request: Cache Access Re-execution (Car)
Cache access re-execution
[Illustration: warp W1's access would hit in the L1 cache blocks, but the load-store unit is blocked by W0's and W2's pending requests; a re-execution queue lets the blocked accesses be set aside so the hit can be served]
One could argue this problem can be solved by adding more MSHRs. A re-execution queue is better than adding more MSHRs:
- more MSHRs cause faster saturation of the memory system, and
- they cause faster thrashing of the data cache.
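The re-execution idea can be sketched with a toy cache model. This is a simplified illustration under our own assumptions (single cache level, tags instead of addresses, no eviction, invented class and method names), not the paper's microarchitecture.

```python
from collections import deque

class L1WithReexecution:
    """Toy L1 sketch: when all MSHRs are busy, a missing access is pushed onto
    a re-execution queue instead of stalling the LSU, so later accesses that
    hit can still be served (hit-under-miss)."""

    def __init__(self, num_mshrs):
        self.lines = set()        # tags currently present in the cache
        self.mshrs = set()        # tags with outstanding misses
        self.num_mshrs = num_mshrs
        self.reexec_q = deque()   # accesses waiting to be retried

    def access(self, tag):
        """Return 'hit', 'miss' (MSHR allocated or merged), or 'queued'."""
        if tag in self.lines:
            return 'hit'
        if tag in self.mshrs:
            return 'miss'                 # merge with the outstanding miss
        if len(self.mshrs) < self.num_mshrs:
            self.mshrs.add(tag)           # allocate an MSHR, send request onward
            return 'miss'
        self.reexec_q.append(tag)         # MSHRs full: queue instead of stalling
        return 'queued'

    def fill(self, tag):
        """Miss data returned: install the line, free the MSHR, retry the queue."""
        self.mshrs.discard(tag)
        self.lines.add(tag)
        retries = list(self.reexec_q)
        self.reexec_q.clear()
        return [(t, self.access(t)) for t in retries]
```

The key property is in `access`: a full set of MSHRs no longer blocks the LSU's head, so an access whose line is already cached still returns a hit while misses wait in the queue.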
Experimental methodology
GPGPU-Sim 3.2.2 - GTX 480 architecture
SMs - 15, 32 PEs/SM
Schedulers - LRR, GTO, OWL and CCWS
L1 cache - 32kB, 64 sets, 4-way, 64 MSHRs
L2 cache - 768kB, 8-way, 6 partitions, 200 core-cycle latency
DRAM - 32 requests/partition, 440 core-cycle latency
Performance of compute intensive kernels
[Plot: speedup of compute-intensive kernels under the evaluated schedulers]
Performance of compute-intensive kernels is insensitive to scheduling policies.
Performance of memory intensive kernels
[Plot: speedup of memory-intensive kernels, grouped into bandwidth-intensive and cache-sensitive categories; outlier bars at 3.0, 4.8, and 4.24]
Speedup by category, comparing Mascar against the GTO, OWL, and CCWS schedulers:

Category              GTO    OWL    CCWS   Mascar
Bandwidth Intensive   4%     -      -      17%
Cache Sensitive       24%    4%     55%    56%
Overall               13%    4%     24%    34%
Conclusion
During memory saturation:
- Serialization of memory requests causes little overlap between memory accesses and compute: Memory Aware Scheduling (Mas) allows one warp to issue all its requests and begin computation early.
- Data present in the cache cannot be reused because the data cache cannot accept any request: Cache Access Re-execution (Car) exploits more hit-under-miss opportunities through a re-execution queue.
34% speedup, 12% energy savings
Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Questions?