Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation
Ping Xiang, Yi Yang, Huiyang Zhou
The 20th IEEE International Symposium on High Performance Computer Architecture, Orlando, Florida, USA

Outline
– Background
– Motivation
– Mitigation: WarpMan
– Experiments
– Conclusions

Overview of GPU Architecture
[Diagram: a GPU with control logic, ALUs, caches, a register file, and shared memory; threads are grouped into warps and warps into thread blocks (TBs), with off-chip DRAM.]

Motivation
– Typically large TB size (e.g., 512 threads)
  – More efficient data sharing/communication within a TB
  – Limited total number of TBs
[Diagram: a register file partitioned among TBs, with unused registers left over, illustrating resource fragmentation.]

Motivation
Warp-Level Divergence: warps within the same TB do not finish at the same time, so the resources held by early-finishing warps cannot be released promptly.
[Diagram: a TB with Warp1 through Warp4; warps that finish early leave their resources unused until the last warp of the TB completes.]
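To make this concrete, here is a minimal hypothetical kernel (not from the paper; the name and shapes are assumptions) in which warps of the same TB run input-dependent trip counts and therefore finish at different times:

// Hypothetical illustration of warp-level divergence: each warp's trip
// count comes from the input, so warps of one TB finish at different times.
__global__ void imbalanced_kernel(const int *trip_counts, float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_in_tb = threadIdx.x / 32;          // warp index within the TB
    float v = 0.0f;
    for (int i = 0; i < trip_counts[warp_in_tb]; ++i)
        v += i * 0.5f;                          // dummy work
    out[tid] = v;
    // Under TB-level resource management, an early-finishing warp still
    // occupies its registers and warp slot until every warp of this TB completes.
}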

Outline
– Background
– Motivation
  – Characterization
– Mitigation: WarpMan
– Experiments
– Conclusions

Characterization
[Diagram: a TB's register-file allocation; registers that are never used illustrate spatial resource underutilization, and registers still held by finished warps illustrate temporal resource underutilization.]

Spatial Resource Underutilization
– Register resource as an example
[Chart: per-benchmark register underutilization; values shown include 17%, 28%, and 46%.]

Temporal Resource Underutilization
Case Study: Ray Tracing
– 6 warps per TB
– Study TB0 as an example
– RTRU = 49.7% (RTRU: ratio of temporal resource underutilization)
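The slide quotes RTRU without a formula; one plausible formalization (an assumption, not necessarily the paper's exact definition) is the fraction of the TB's resource-time product that sits idle because warps finish before their TB does:

\[
\mathrm{RTRU} = \frac{\sum_{i=1}^{N} \left(t_{\mathrm{TB}} - t_i\right) r_i}{t_{\mathrm{TB}} \sum_{i=1}^{N} r_i},
\qquad t_{\mathrm{TB}} = \max_{1 \le i \le N} t_i,
\]

where the TB contains $N$ warps, $t_i$ is warp $i$'s finish time measured from TB dispatch, and $r_i$ is the amount of resources warp $i$ holds. With equal per-warp resources this reduces to the average of $(t_{\mathrm{TB}} - t_i)/t_{\mathrm{TB}}$ over all warps, so an RTRU of 49.7% means roughly half of TB0's resource-time is wasted.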

Why Is There Temporal Resource Underutilization?
– Input-dependent workload imbalance
  – Same code, different input: "if(a < 123)"
– Program-dependent workload imbalance
  – Code like "if(tid < 32)"
– Memory divergence
  – Some warps experience more cache hits than others
– Warp scheduling policy
  – The scheduler prioritizes certain warps over others

Characterization: RTRU
[Chart: RTRU measured across the evaluated benchmarks.]

Outline
– Background
– Motivation
  – Characterization
  – Micro-benchmarking
– Mitigation: WarpMan
– Experiments
– Conclusions

Micro-benchmark
Code runs on both GTX480 and GTX …

__global__ void TB_resource_kernel(…, bool call = false){
    if(call) bloatOccupancy(start, size);
    clock_t start_clock = clock();
    if(tid < 32){ // tid is the thread id within a TB
        clock_offset = 0;
        while( clock_offset < clock_count ) {
            clock_offset = clock() - start_clock;
        }
    }
    clock_t end_clock = clock();
    d_o[index] = start_clock; // index is the global thread id
    d_e[index] = end_clock;
}
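For context, a hedged host-side sketch of how the kernel's timestamps could be gathered; the launch shape, buffer handling, and printing below are assumptions, and the kernel's remaining arguments are elided in the slide, so the call is schematic:

// Hypothetical driver for TB_resource_kernel: copy back the per-thread
// start/end clocks and print warp 0 of each TB (cf. the output that follows).
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>

int main() {
    const int num_tbs = 512, tb_size = 192;        // assumed launch shape
    const int n = num_tbs * tb_size;
    clock_t *d_o, *d_e;
    cudaMalloc(&d_o, n * sizeof(clock_t));
    cudaMalloc(&d_e, n * sizeof(clock_t));
    TB_resource_kernel<<<num_tbs, tb_size>>>(/* elided args, plus d_o, d_e */);
    cudaDeviceSynchronize();
    clock_t *h_o = (clock_t *)malloc(n * sizeof(clock_t));
    clock_t *h_e = (clock_t *)malloc(n * sizeof(clock_t));
    cudaMemcpy(h_o, d_o, n * sizeof(clock_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_e, d_e, n * sizeof(clock_t), cudaMemcpyDeviceToHost);
    for (int tb = 0; tb < num_tbs; ++tb)           // thread 0 of a TB is in warp 0
        printf("CTA %d Warp 0: start %ld, end %ld\n",
               tb, (long)h_o[tb * tb_size], (long)h_e[tb * tb_size]);
    return 0;
}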

Micro-benchmarking Results

> Using CUDA device [0]: GeForce GTX 480
> Detected Compute SM 3.0 hardware with 8 multi-processors.
…
CTA 250 Warp 0: start 80, end 81
CTA 269 Warp 0: start 80, end 81
CTA 272 Warp 0: start 80, end 81
CTA 283 Warp 0: start 80, end 81
CTA 322 Warp 0: start 80, end 81
CTA 329 Warp 0: start 80, end 81
…

Outline
– Background
– Motivation
– Mitigation: WarpMan
– Experiments
– Conclusions

WarpMan
[Diagram: under TB-level resource management, resources freed by finished warps of TB0 and TB1 sit unused until each TB fully completes, so TB2 must wait. With WarpMan's warp-level resource management, Warp0, Warp1, and Warp2 of TB2 are dispatched individually as earlier warps finish and release their resources, saving cycles on the execution timeline.]

WarpMan Design
– Dispatch logic
  – Traditional TB-level dispatching logic
  – Add partial-TB dispatch logic
– Workload buffer
  – Store the dispatched but not running partial TBs

Dispatching
[Flowchart: workload to be dispatched goes through a TB-level resource check (resources required for a full TB: registers, shared memory, warp entries, TB entries) and, failing that, a warp-level resource check (resources required for a single warp), yielding either a full TB or a partial TB.]
Note: the shared memory is still allocated at the TB level.
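To spell out the flow in code form, here is a minimal C-style sketch of the two-level check; every structure, field, and function name below is an assumption for illustration, not the actual hardware design:

// Hypothetical sketch of WarpMan's two-level dispatch check.
struct SMResources   { int regs; int warp_entries; int tb_entries; int shared_mem; };
struct TBRequirements { int regs_per_warp; int warps; int shared_mem; };

// Returns how many warps of the next TB can be dispatched (0 = none).
int dispatch_check(const SMResources &free_res, const TBRequirements &tb) {
    // A TB entry and the TB's full shared-memory footprint are needed
    // either way: shared memory stays allocated at the TB level.
    if (free_res.tb_entries < 1 || free_res.shared_mem < tb.shared_mem)
        return 0;
    // TB-level check: can a full TB be dispatched?
    if (free_res.regs >= tb.regs_per_warp * tb.warps &&
        free_res.warp_entries >= tb.warps)
        return tb.warps;
    // Warp-level check: dispatch as many warps as currently fit (a partial TB).
    int fit = free_res.regs / tb.regs_per_warp;
    if (fit > free_res.warp_entries) fit = free_res.warp_entries;
    return fit < tb.warps ? fit : tb.warps;
}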

Workload Buffer
Store the dispatched but not running TB:
– Hardware TB id (assigned by the hardware)
– Software TB id (defined by the software)
– Start warp id
– End warp id
– Valid bit
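One possible encoding of a buffer entry is sketched below; the slide names these fields, but the bit widths here are guesses:

// Hypothetical workload-buffer entry layout; field widths are assumptions.
struct WorkloadBufferEntry {
    unsigned hw_tb_id   : 6;   // hardware TB id (assigned by the hardware)
    unsigned sw_tb_id   : 16;  // software TB id (defined by the software)
    unsigned start_warp : 6;   // first warp still waiting to run
    unsigned end_warp   : 6;   // last warp of the TB
    unsigned valid      : 1;   // entry holds a dispatched, not-yet-running partial TB
};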

Workload Buffer
Store the dispatched but not running TB
[Diagram: TB120 is partially dispatched onto an SM already running TB117 and TB118; Warp0 and Warp1 of TB120 occupy the freed resources, while a workload-buffer entry records TB120's number, start warp id, end warp id, and valid bit for the still-waiting Warp2.]

Outline
– Background
– Motivation
– Mitigation: WarpMan
– Experiments
– Conclusions

Methodology
– Use GPUWattch for both timing and energy evaluation
– Baseline architecture (GTX480):
  – 15 SMs with a SIMD width of 32, running at 1.4 GHz
  – Max 8 TBs per SM; max 1536 threads per SM
  – Scheduling policy: round robin / two-level
  – 16 KB L1 cache, 48 KB shared memory, 128 KB register file
– Applications from: NVIDIA CUDA SDK, Rodinia benchmark suite, GPGPU-Sim

Performance Results
– temp: allow early-finished warps to release resources for new warps
– temp + spatial: resources are allocated/released at the warp level
– The performance improvements can be as high as 71%/76%
– On average, 15.3% improvement

Energy Results
– The energy savings can exceed 20%, and average 6%

A Software Alternative
– Change the software to use a smaller TB size
– Change the hardware to enable more concurrent TBs
– Drawbacks of a smaller TB size:
  – Inefficient shared memory usage / synchronization
  – Decreased data locality
– More as we proceed to the experimental results…
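As a hypothetical illustration (the kernel name, grid size, and TB sizes are assumptions), the software alternative simply launches more, smaller TBs:

// Illustrative only: trading one 512-thread TB for four 128-thread TBs
// frees resources at a finer granularity, but shrinks the scope of
// __shared__ data reuse and __syncthreads() coordination to 128 threads.
my_kernel<<<num_tbs,     512>>>(args);   // original: few large TBs
my_kernel<<<num_tbs * 4, 128>>>(args);   // alternative: more, smaller TBs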

Comparing to the Software Alternative
– CT and ST: the software approach decreases L1 locality
– NN and BT: reduced total number of threads
– On average: 25% improvement vs. 48% degradation
[Chart: per-benchmark comparison; values shown include 125% and 52%.]

Related Work
– Resource underutilization due to branch divergence or thread-level divergence has been well studied.
– Yi Yang et al. [PACT-21] target shared memory resource management; their scheme is complementary to our proposed WarpMan.
– D. Tarjan et al. [US Patent, 2009] propose using a virtual register table to manage the physical register file and enable more concurrent TBs.

Conclusions
– We highlight the limitations of TB-level resource management.
– We characterize warp-level divergence and reveal the fundamental reasons for such divergent behavior.
– We propose WarpMan and show that it can be implemented with minor hardware changes.
– We show that our proposed solution is highly effective, achieving significant performance improvements and energy savings.
Questions?