Equalizer: Dynamically Tuning GPU Resources for Efficient Execution
Ankit Sethia*, Scott Mahlke
University of Michigan

GPU usage is expanding
Application domains: graphics, simulation, linear algebra, data analytics, machine learning, computer vision.
Resource requirements of GPU applications are diverging.

Motivation: Imbalanced GPU resource utilization
[Figure label: Memory]

GPUs use SIMT execution model
– When 1 thread is imbalanced, 30270 threads are imbalanced.
– When 30270 threads want a resource, the resource is saturated.

Imbalanced GPU resource utilization
A large number of threads causes early saturation of some resources and under-utilization of others.
[Figure panels: Compute Intensive, Memory Intensive; label: Memory]

Kernels saturate some resources much faster than others
– Opportunity 1: Boost the bottleneck resource for performance improvement
– Opportunity 2: Throttle under-utilized resources for energy savings

Modulating hardware resources
Resource (components): Parameter, Variation
– Compute (core: FPU, IALU, etc.): (1) frequency, ±15%
– Memory (L2, DRAM, etc.): (2) frequency, ±15%
– L1 data cache: (3) # of thread blocks, 1 to max

Boosting bottleneck resources
Actions for performance improvement
[Table: Kernel Type (Compute, Memory, Cache) vs. Core Frequency, Memory Frequency, Number of Threads]
Example: Increasing core frequency
[Legend: Memory Intensive, Cache Sensitive, Compute Intensive]

Throttling under-utilized resources
Actions for energy savings
[Table: Kernel Type (Compute, Memory, Cache) vs. Core Frequency, Memory Frequency, Number of Threads]
Example: Decreasing core frequency
[Legend: Memory Intensive, Cache Sensitive, Compute Intensive]

Dynamic vs. Static Decisions
[Figures: inter-invocation performance variation (x-axis: invocation number); intra-invocation variation]
Dynamic decisions are needed to fully utilize resource modulation.

Objectives
Modulate 3 key parameters
– Core-side frequency
– Memory-side frequency
– Number of thread blocks
2 modes of operation
– Performance mode (boosting resources)
– Energy mode (throttling resources)
Make dynamic decisions

Equalizer Overview
[Diagram labels: Block Scheduler; SM 0, SM 1, ..., SM N; inside an SM: Inst. Buffer, Warp Scheduler, Equalizer, Counters; Frequency Manager; VF Regulator; arrows: Requests, New Parameters, Blocks, New Frequency]
– Samples of warp state are taken over a window of cycles
– New decisions are made every window
– The frequency manager arbitrates the decisions of all cores
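The overview says only that the frequency manager arbitrates the decisions of all cores; it does not say how. One simple possibility is a plurality vote over per-SM requests, sketched below. The function name `arbitrate`, the -1/0/+1 request encoding, and the tie rule (keep the current frequency) are assumptions made for illustration, not details taken from the slides.

```python
from collections import Counter


def arbitrate(requests):
    """Pick one chip-wide frequency step from per-SM requests.

    Each request is -1 (lower), 0 (keep) or +1 (raise). Choosing the most
    common request is an assumed policy; ties and empty input return 0.
    """
    if not requests:
        return 0
    ranked = Counter(requests).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return 0  # no single most popular request: keep the current frequency
    return ranked[0][0]


# Four SMs: two ask to raise, one to keep, one to lower.
print(arbitrate([+1, +1, 0, -1]))
```

A vote keeps the shared regulator from following a single outlier SM, at the cost of ignoring minority requests.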

State of warps: system heartbeat
State of warps:
– Waiting (W): waiting for data to be ready
– Excess Mem (X_mem): ready to issue to memory pipeline
– Excess ALU (X_alu): ready to issue to arithmetic pipeline
[Figure panels: Compute, Memory, Cache, Unsaturated]
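Since warp state is sampled repeatedly over a window, one way to turn the samples into per-window inputs is to average the count of warps in each state. The sketch below assumes that averaging is the aggregation used; the field names `waiting`, `x_mem` and `x_alu` and the function name are hypothetical.

```python
def average_warp_states(samples):
    """Average per-sample warp counts over one observation window.

    `samples` is a non-empty list of dicts, one per sampling point, each
    giving the number of warps in the Waiting, X_mem and X_alu states.
    """
    if not samples:
        raise ValueError("need at least one sample")
    keys = ("waiting", "x_mem", "x_alu")
    return {k: sum(s[k] for s in samples) / len(samples) for k in keys}


window = [
    {"waiting": 4, "x_mem": 2, "x_alu": 0},
    {"waiting": 6, "x_mem": 0, "x_alu": 2},
]
print(average_warp_states(window))
# {'waiting': 5.0, 'x_mem': 1.0, 'x_alu': 1.0}
```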

Distinguishing memory and cache requirements
Assumption:
i) Run maximum threads for memory intensive kernels
ii) Reduce threads for cache sensitive kernels
Reducing threads for memory kernels won’t hurt as long as bandwidth is fully utilized.
[Figure, memory intensive kernels: performance vs. # of threads; regions: under-utilization, bandwidth saturated]
[Figure, cache sensitive kernels: performance vs. # of threads; regions: under-utilization, optimal, cache thrashing]

Equalizer Algorithm
Input: X_mem, X_alu, Waiting, Active, W_CTA (W_CTA is the number of warps in a thread block)
Checks, in flowchart order, each with a Y/N branch:
– Check if highly compute intensive (X_alu > W_CTA)
– Check if highly memory intensive (X_mem > W_CTA)
– Check if memory intensive (X_mem > 2)
– Check if majority warps are idle (Waiting < Active/2)
– Check more compute or more memory (X_mem > X_alu)
Action boxes: take cache sensitive actions, take compute intensive actions, take memory intensive actions, take compute intensive actions
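The transcript keeps the flowchart's checks but not the arrows between them, so the sketch below is only one plausible reading, not the authors' definitive algorithm. Its assumptions are: the checks run top to bottom; "highly memory intensive" leads to the cache sensitive actions; the fourth check follows its label (more than half of the active warps are waiting) rather than the printed inequality; and no action is taken when no check fires. The function and action names are hypothetical.

```python
def equalizer_decide(x_mem, x_alu, waiting, active, w_cta):
    """Return one of "compute", "memory", "cache" or "none".

    Inputs are per-window warp counts; w_cta is the number of warps in a
    thread block. The branch-to-action mapping is an assumed reading.
    """
    if x_alu > w_cta:          # highly compute intensive
        return "compute"
    if x_mem > w_cta:          # highly memory intensive (assumed: cache actions)
        return "cache"
    if x_mem > 2:              # memory intensive
        return "memory"
    if waiting > active / 2:   # majority of warps idle (assumed direction)
        return "memory" if x_mem > x_alu else "compute"
    return "none"              # assumed: unsaturated, leave settings alone


print(equalizer_decide(x_mem=3, x_alu=1, waiting=2, active=16, w_cta=8))
# memory
```

Putting the two `W_CTA` checks first means the strongest signals win before the weaker `X_mem > 2` threshold is consulted.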

Experimental Setup
– Simulator: GPGPU-Sim 3.2.2
– Kernels: 27, from Rodinia and Parboil
– Power modelling: GPUWattch
– Voltage regulation: on-chip, 512 cycles latency
– SM/memory frequency: f - 15%, f, f + 15%
– Observation window: 4096 cycles
– Sampling rate: 128 cycles
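A little arithmetic on these settings shows how the window, the sampling rate and the regulator latency relate. The sketch assumes "sampling rate 128 cycles" means one sample every 128 cycles; the constant and function names are hypothetical.

```python
WINDOW_CYCLES = 4096   # observation window
SAMPLE_PERIOD = 128    # assumed: one warp-state sample every 128 cycles
VR_LATENCY = 512       # on-chip voltage regulator latency, in cycles

samples_per_window = WINDOW_CYCLES // SAMPLE_PERIOD
latency_fraction = VR_LATENCY / WINDOW_CYCLES


def frequency_levels(f_nominal):
    """Return the three operating points f - 15%, f, f + 15%."""
    return (f_nominal * 85 / 100, f_nominal, f_nominal * 115 / 100)


print(samples_per_window)  # 32
print(latency_fraction)    # 0.125
```

Under that reading, each decision rests on 32 samples, and a frequency change takes one eighth of a window to take effect.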

Results: Performance mode
[Figure: per-kernel results grouped as Compute, Memory, Cache, Unsaturated; bar annotations: 2.84, -67%, -19%, -11%]
Technique: Performance, Energy
– Equalizer: 22%, 7%
– SM Boost: 6%, 12%
– Mem Boost: 7%, 8%

Equalizer Dynamism
[Figures: inter-invocation adaptiveness; intra-invocation adaptiveness]

Conclusion
– Critical to match hardware’s abilities to kernel’s requirements
– Equalizer understands kernel’s requirement by watching state of warps
– By modulating hardware dynamically:
  – 22% performance at 6% energy overhead
  – 15% energy savings at 5% performance gain

Questions?
