Mascar: Speeding up GPU Warps by Reducing Memory Pitstops


1 Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Ankit Sethia*  D. Anoushe Jamshidi  Scott Mahlke
University of Michigan

2 GPU usage is expanding
Graphics, Data Analytics, Linear Algebra, Machine Learning, Simulation, Computer Vision. GPUs are taking center stage in HPC. Traditional applications such as graphics, linear algebra and simulations are growing more complex and use the GPU as their computational workhorse, while modern applications such as data analytics and machine learning want to exploit not only the computational power of GPUs but also the high memory bandwidth provided by GDDR5. As a result, all kinds of data-parallel applications, both compute and memory intensive, are targeting GPUs; next, we look at the performance achieved by the different kinds of kernels.

3 Performance variation of kernels
[Chart: performance of compute-intensive vs. memory-intensive kernels] Memory-intensive kernels saturate bandwidth and achieve lower performance. The next slides examine the problems associated with bandwidth (memory) saturation.

4 Impact of memory saturation - I
[Illustration: several SMs, each with an L1, FPUs and a load/store unit (LSU), sharing the memory system] The LSU is the load/store unit; once the memory system is saturated, a warp can issue a new request only when an earlier one returns. Memory-intensive kernels therefore serialize memory requests, and it becomes critical to prioritize the order of requests coming from the SMs.
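As a rough illustration of this serialization, here is a minimal sketch (my own names and structure, not the paper's hardware) of an MSHR-limited load/store unit that stalls whenever every miss slot is occupied:

```cpp
// Minimal sketch of why an SM's load/store unit stalls under memory
// saturation: once every MSHR tracks an outstanding miss, no new request
// can leave the core until a reply returns.
#include <cstdint>

struct Mshr {                 // hypothetical miss-status holding register pool
    unsigned capacity;
    unsigned in_flight = 0;
    bool full() const { return in_flight >= capacity; }
};

struct Lsu {
    Mshr& mshrs;
    uint64_t stall_cycles = 0;

    // Called once per cycle with the warp's pending memory request (if any).
    // Returns true if the request was sent to the memory system.
    bool try_issue(bool has_request) {
        if (!has_request) return false;
        if (mshrs.full()) {       // memory system saturated:
            ++stall_cycles;       // the whole LSU stalls, and with it every
            return false;         // warp that reaches its memory instruction
        }
        ++mshrs.in_flight;
        return true;
    }
    void on_reply() { --mshrs.in_flight; }  // a warp can issue again only now
};
```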

5 Impact of memory saturation
To quantify the impact of memory saturation, the earlier plot is extended with the fraction of cycles the LSU is stalled. As memory intensity increases, the fraction of stalled LSU cycles grows to as high as 90%, showing that saturation significantly delays the core from receiving the data it needs to begin processing. Significant LSU stalls correspond to low performance in memory-intensive kernels.

6 Impact of memory saturation - II
[Illustration: an SM whose L1 holds W1's data, but the LSU cannot access it because the cache is blocked] This slide shows why the cache cannot accept new requests during saturation: even though W1's data is present in the L1, the LSU cannot access it, so the core is unable to feed enough data to the functional units for processing.
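A small sketch of this baseline blocking condition (the names below are illustrative, not the simulator's):

```cpp
// When the L1 has no free MSHR or miss-queue entry it rejects *every*
// access, so even data that is already resident in the cache cannot be read.
struct L1State {
    unsigned mshrs_in_flight, mshr_capacity;
    unsigned miss_queue_size, miss_queue_capacity;
};

bool l1_can_accept(const L1State& l1) {
    return l1.mshrs_in_flight < l1.mshr_capacity &&
           l1.miss_queue_size < l1.miss_queue_capacity;
}
// A request that would hit (e.g. W1's cached data) is still turned away
// whenever l1_can_accept() is false -- the problem Car addresses later.
```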

7 Increasing memory resources
A memory system with a large number of MSHRs, full associativity, and a 20% bandwidth boost would avoid the problem, but such a design is UNBUILDABLE.

8 Mascar: Mas + Car
During memory saturation:
Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (Mas)
Data present in the cache cannot be reused because the data cache cannot accept any request: Cache Access Re-execution (Car)
Mas + Car = Mascar

9 Memory Aware Scheduling
[Illustration: under saturation, Warps 0, 1 and 2 each have outstanding memory requests] This slide shows the shortcoming of current schedulers and how MAS improves the overlap of compute and memory. Serving one request and then switching to another warp (round-robin) means no warp is ready to make forward progress.

10 Memory Aware Scheduling
GTO issues instructions from another warp whenever there is no instruction in the i-buffer for the current warp or there is a dependency between its instructions, so it behaves much like round-robin: multiple warps still interleave their memory requests. Serving one request and switching to another warp (RR) leaves no warp ready to make forward progress; serving all requests from one warp and then switching to another (MAS) makes one warp ready to begin computation early, as the sketch below illustrates.
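A toy back-of-the-envelope model (numbers and code are illustrative, not from the paper) of when the first warp has all of its data under the two issue orders, assuming a saturated memory system that serves one request at a time:

```cpp
// Why issuing all of one warp's requests before switching helps: with a
// saturated memory system that serves one request every LAT cycles,
// round-robin spreads each warp's requests out, so no warp has all its
// data until near the end.
#include <cstdio>

int main() {
    const int WARPS = 3, REQS = 3, LAT = 100;   // illustrative numbers

    // Round-robin: W0's requests are the 1st, 4th and 7th to be served.
    int rr_first_ready  = ((REQS - 1) * WARPS + 1) * LAT;   // 700 cycles
    // MAS-style grouping: W0's requests are the 1st, 2nd and 3rd served.
    int mas_first_ready = REQS * LAT;                       // 300 cycles

    std::printf("first warp can start computing: RR=%d cycles, MAS=%d cycles\n",
                rr_first_ready, mas_first_ready);
    // With MAS, W0 computes while W1/W2's requests are still being serviced,
    // restoring overlap between compute and memory.
    return 0;
}
```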

11 MAS operation
Check whether the kernel is memory intensive (MSHRs or miss queue almost full):
No: schedule in Equal Priority (EP) mode.
Yes: schedule in Memory Priority (MP) mode:
Assign a new owner warp (only the owner's requests can go beyond the L1).
Execute memory instructions only from the owner; other warps can execute compute instructions.
If the owner's next instruction depends on an already issued load, assign a new owner; otherwise remain in MP mode.
A sketch of this decision flow follows.
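The sketch below is a simplified software rendering of this flow; the types, thresholds and function names are my own assumptions, not the hardware's:

```cpp
// Each scheduling step first checks for memory saturation, then restricts
// memory issue to a single "owner" warp while other warps keep issuing
// compute instructions.
#include <vector>

enum class Mode { EqualPriority, MemoryPriority };

struct WarpState {
    int  id;
    bool next_is_mem;          // next instruction is a load/store
    bool next_depends_on_load; // next instruction waits on an issued load
};

struct MasScheduler {
    Mode mode = Mode::EqualPriority;
    int  owner = -1;           // only this warp's requests may go past the L1

    bool memory_saturated(unsigned mshrs_used, unsigned mshr_cap,
                          unsigned missq_used, unsigned missq_cap) const {
        // "almost full" thresholds are illustrative
        return mshrs_used + 1 >= mshr_cap || missq_used + 1 >= missq_cap;
    }

    // Decide whether a given warp may issue its next instruction this cycle.
    bool may_issue(const WarpState& w, bool saturated,
                   const std::vector<WarpState>& warps) {
        if (!saturated) { mode = Mode::EqualPriority; return true; }

        mode = Mode::MemoryPriority;
        if (owner == -1) owner = w.id;             // assign a new owner warp

        if (w.next_is_mem) return w.id == owner;   // memory inst: owner only

        // An owner whose next instruction depends on an outstanding load
        // hands ownership to another warp so requests keep flowing.
        if (w.id == owner && w.next_depends_on_load) {
            for (const auto& other : warps)
                if (other.id != owner && other.next_is_mem) { owner = other.id; break; }
        }
        return true;                               // compute inst: any warp
    }
};
```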

12 Implementation of MAS
[Block diagram: Decode, I-Buffer, Scoreboard and Scheduler, extended with a Warp Status Table (WST) holding ordered warps, a Warp Readiness Checker (WRC) with a stall bit and the memory saturation flag, the OPtype, and Mem_Q / Comp_Q head pointers feeding the issued warp from the RF]
Warps are divided into memory warps and compute warps within the ordered warp list.
Warp Readiness Checker (WRC): tests whether a warp should be allowed to issue memory instructions.
Warp Status Table (WST): decides whether the scheduler should schedule from a warp.
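A structural sketch of this bookkeeping, under my own naming assumptions (the field names below are not taken from the paper):

```cpp
// Warp Status Table (WST) entry per warp, plus a Warp Readiness Checker
// (WRC) that gates memory issue using the memory saturation flag.
#include <array>

constexpr int MAX_WARPS = 48;

struct WstEntry {              // Warp Status Table: one row per warp
    bool is_memory_warp;       // next instruction is a memory operation
    bool stall;                // scheduler should skip this warp
};

struct Wrc {                   // Warp Readiness Checker
    bool memory_saturation_flag = false;
    int  owner_warp = -1;

    // A warp may issue a memory instruction if the memory system is not
    // saturated, or if it is the current owner warp.
    bool memory_issue_allowed(int warp_id) const {
        return !memory_saturation_flag || warp_id == owner_warp;
    }
};

struct MasFrontEnd {
    std::array<WstEntry, MAX_WARPS> wst{};
    Wrc wrc;
    // Warps are kept in two ordered groups so the scheduler can prefer
    // compute warps (Comp_Q) over memory warps (Mem_Q) during saturation.
    std::array<int, MAX_WARPS> mem_q{}, comp_q{};
    int mem_q_head = 0, comp_q_head = 0;
};
```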

13 Mascar: Mas + Car (recap)
During memory saturation: serialization of memory requests causes less overlap between memory accesses and compute, addressed by Memory Aware Scheduling (Mas); data present in the cache cannot be reused because the data cache cannot accept any request, addressed by Cache Access Re-execution (Car), described next.

14 Cache access re-execution
[Illustration: the load/store unit with a re-execution queue in front of the L1 cache; W1's request hits in the cache while blocked requests wait in the queue] One could argue that this problem can be solved by adding more MSHRs; re-execution is better than adding more MSHRs, because more MSHRs cause faster saturation of the memory system and faster thrashing of the data cache.
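A sketch of the re-execution idea (the interface and names are my own simplifications, not the paper's hardware):

```cpp
// Cache access re-execution (Car): when the L1 cannot accept a request, the
// LSU parks it in a small re-execution queue instead of stalling, then
// replays it later; replayed requests that hit are served from the cache
// even while all MSHRs are busy with misses (hit-under-miss).
#include <cstdint>
#include <deque>

struct MemRequest { int warp_id; uint64_t addr; };

struct L1Model {
    // Placeholder L1: behaviour left abstract; only the interface matters here.
    bool can_accept() const { return free_mshrs > 0; }
    bool lookup_hit(uint64_t) const { return false; }   // stub tag check
    void access(const MemRequest&) { /* hit/miss handling elided */ }
    int free_mshrs = 0;
};

struct CarLsu {
    L1Model& l1;
    std::deque<MemRequest> reexec_q;         // the re-execution queue

    void issue(const MemRequest& r) {
        if (l1.can_accept()) l1.access(r);
        else reexec_q.push_back(r);          // don't stall the whole LSU
    }

    // Called each cycle: replay the oldest queued request. A hit completes
    // immediately; a miss waits until the L1 can accept it.
    void replay_one() {
        if (reexec_q.empty()) return;
        const MemRequest& r = reexec_q.front();
        if (l1.lookup_hit(r.addr) || l1.can_accept()) {
            l1.access(r);
            reexec_q.pop_front();
        }
    }
};
```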

15 Experimental methodology
GPGPU-Sim, GTX 480 architecture
SMs: 15, with 32 PEs/SM
Schedulers: LRR, GTO, OWL and CCWS
L1 cache: 32 kB, 64 sets, 4-way, 64 MSHRs
L2 cache: 768 kB, 8-way, 6 partitions, 200 core-cycle latency
DRAM: 32 requests/partition, 440 core-cycle latency

16 Performance of compute intensive kernels
Performance of compute intensive kernels is insensitive to scheduling policies

17 Performance of memory intensive kernels
Memory-intensive kernels are split into bandwidth-intensive and cache-sensitive categories, with a comparison against GTO, OWL and CCWS. [Chart: speedups of memory-intensive kernels, y-axis up to 3.0]

Scheduler             GTO    OWL    CCWS   Mascar
Bandwidth Intensive   4%     -      -      17%
Cache Sensitive       24%    4%     55%    56%
Overall               13%    4%     24%    34%

18 Conclusion
During memory saturation:
Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (Mas) allows one warp to issue all its requests and begin computation early.
Data present in the cache cannot be reused because the data cache cannot accept any request: Cache Access Re-execution (Car) exploits more hit-under-miss opportunities through the re-execution queue.
Overall: 34% speedup and 12% energy savings.

19 Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Questions??

