Divergence-Aware Warp Scheduling


1 Divergence-Aware Warp Scheduling
Timothy G. Rogers (1), Mike O'Connor (2), Tor M. Aamodt (1). (1) The University of British Columbia, (2) NVIDIA Research. MICRO 2013, Davis, CA.

2 Baseline GPU
A GPU runs tens of thousands of concurrent threads, grouped into warps. Each streaming multiprocessor contains a warp scheduler, which picks a warp to issue each cycle, along with a memory unit and an L1 data cache (L1D). The streaming multiprocessors share an L2 cache and main memory.

3 Two Types of Divergence
Branch divergence: threads within a warp take different paths at an if statement, which reduces functional-unit utilization. Memory divergence: threads within a warp access scattered locations in main memory, which can waste memory bandwidth. This work is aware of both branch divergence and memory divergence, and focuses on improving performance.

4 Motivation
Improve the performance of programs with memory divergence: parallel irregular applications that are economically important (server computing, big data). Goal: transfer locality management from software to hardware. Software solutions complicate programming, are not always performance portable, are not guaranteed to improve performance, and are sometimes impossible.

5 Programmability Case Study: Sparse Vector-Matrix Multiply
Two versions from the SHOC benchmark suite. Divergent version: each thread has locality, but its loads diverge. GPU-optimized version: avoids the divergence through explicit scratchpad use and a parallel reduction, which adds complication and makes the code dependent on warp size.

6 Previous Work
Scheduling has been used to capture intra-thread locality: Cache-Conscious Wavefront Scheduling (MICRO 2012). CCWS is reactive: it detects lost locality (cache interference), then throttles, telling some warps to stop issuing while others go. It is unaware of branch divergence, treats all warps equally, and is outperformed by profiled static throttling; on the divergent-code case study it suffers a 50% slowdown. Divergence-Aware Warp Scheduling is proactive: it predicts footprints and throttles before locality is lost, adapts to branch divergence, outperforms the static solution, and keeps the divergent-code case study within a 4% slowdown.

7 Divergence-Aware Warp Scheduling: How to Be Proactive
Identify where locality exists, and limit the number of warps executing in high-locality regions. Adapt to branch divergence: create a cache-footprint prediction in high-locality regions, account for the number of active lanes to create a per-warp footprint prediction, and change the prediction as branch divergence occurs.

8 Where Is the Locality?
Examine every static load instruction in the program. [Figure: hits and misses per kilo-instruction for the 13 static load instructions in the GC workload; the loads with cache hits sit inside a loop.] Locality is concentrated in loops.

9 Locality in Loops: Limit Study
How much data should we keep around? [Figure: fraction of cache hits inside loops, split into hits on lines accessed this iteration, hits on lines accessed last iteration, and other.] Most hits in loops are on data accessed in the immediately previous loop trip.

10 DAWS Objectives Predict the amount of data accessed by each warp in a loop iteration. Schedule warps in loops so that their aggregate predicted footprint does not exceed the L1D capacity.

11 Observations That Enable Prediction
Memory divergence in static load instructions is predictable: a load that diverged on previous executions tends to diverge again. The data touched by a divergent load depends on the active mask: in the slide's example, the divergent load makes 4 memory accesses when all lanes of the warp are active, but only 2 accesses when half the lanes are active. Both observations are used to create the cache-footprint prediction.

12 Online Characterization to Create the Cache-Footprint Prediction
1. Detect loops with locality: some loops have locality and some do not; multithreading is limited only in loops with locality. 2. Classify the loads in the loop as diverged or not diverged. 3. Compute the footprint from the active mask. Example: in a loop with locality containing a diverged load (4 accesses with 4 active lanes) and a non-diverged load (1 access), Warp 0's predicted footprint is 4 + 1 = 5 cache lines.

13 DAWS Operation Example
Example: a compressed sparse row (CSR) kernel in which each thread sums one row of a sparse matrix:

    int C[] = {0, 64, 96, 128, 160, 160, 192, 224, 256};
    void sum_row_csr(float* A, …) {
        float sum = 0;
        int i = C[tid];
        while (i < C[tid + 1]) {
            sum += A[i];
            ++i;
        }
        …
    }

Time 0: Warp 0 enters the loop on its 1st iteration with 4 active threads. A diverged load is detected (memory divergence across A[0], A[64], A[96], A[128]), so Warp 0's cache footprint is predicted as 4×1 lines. Early warps profile the loop for later warps; until locality is detected there is no footprint. Time 1: Warp 1 wants to enter the loop on its 1st iteration. It already has branch divergence (its footprint is 3×1, since one of its rows is empty: C[4] == C[5] == 160), but the combined footprints would exceed the cache, so the scheduler tells Warp 1 to stop while Warp 0 goes, capturing its spatial locality across iterations (e.g., repeated hits on the line holding A[32]). Time 2: by Warp 0's 33rd iteration it has branch divergence of its own, because its shorter rows have finished; its footprint decreases, Warp 1 is allowed to go (loading A[160], A[192], A[224]), and both warps capture spatial locality together.

14 Methodology
GPGPU-Sim (version 3.1.0): 30 streaming multiprocessors, 32 warp contexts (1024 threads) per SM, 32 KB L1D per streaming multiprocessor, 1 MB unified L2 cache. Compared schedulers: Cache-Conscious Wavefront Scheduling (CCWS), profile-based Best-SWL (static warp limiting), and Divergence-Aware Warp Scheduling (DAWS). More schedulers appear in the paper.

15 Sparse MM Case Study: Results
Execution time, normalized to the optimized version. Under DAWS, the divergent code runs within 4% of the optimized version with no programmer input.

16 Sparse MM Case Study: Results
Properties, normalized to the optimized version. The divergent code makes fewer than 20% more off-chip accesses and issues 2.8× fewer instructions than the optimized version, so the divergent version now has potential energy advantages.

17 Cache-Sensitive Applications
Breadth-First Search (BFS), Memcached-GPU (MEMC), Sparse Matrix-Vector Multiply (SPMV-Scalar), Garbage Collector (GC), K-Means Clustering (KMN). Cache-insensitive applications are covered in the paper.

18 Results
Normalized speedup across BFS, MEMC, SPMV-Scalar, GC, and KMN (with their harmonic mean): DAWS achieves a 26% overall improvement over CCWS and outperforms Best-SWL on highly branch-divergent workloads.

19 Summary
Problem: divergent loads in GPU programs; software solutions complicate programming. Solution: DAWS, which captures locality opportunities by accounting for divergence. Result: a 26% overall performance improvement over CCWS; in the case study, divergent code performs within 4% of code optimized to minimize divergence. Questions?

