Divergence-Aware Warp Scheduling

Presentation transcript:

Divergence-Aware Warp Scheduling
Timothy G. Rogers (1), Mike O'Connor (2), Tor M. Aamodt (1)
(1) The University of British Columbia, (2) NVIDIA Research
MICRO 2013, Davis, CA

A GPU runs 10,000s of concurrent threads, grouped into warps. Each streaming multiprocessor holds many warps (W1, W2, …); its warp scheduler picks a warp to issue each cycle, and its memory unit feeds a per-SM L1 data cache (L1D). The streaming multiprocessors share an L2 cache and main memory.
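As a point of reference (not from the talk), the warp grouping the scheduler operates on can be observed from inside a kernel; the kernel name and output arrays here are illustrative:

```cuda
// Minimal sketch: threads are grouped into warps of warpSize (32) consecutive
// lanes. The hardware warp scheduler, not the programmer, decides which warp
// issues each cycle; a kernel can only observe its own warp/lane position.
__global__ void whoami(int* warp_id, int* lane_id) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    warp_id[tid] = threadIdx.x / warpSize;  // warp index within the block
    lane_id[tid] = threadIdx.x % warpSize;  // lane index within the warp
}
```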

Two types of divergence. Branch divergence: threads within a warp take different paths through an if (…) { … }, which affects functional unit utilization. Memory divergence: a warp's threads access scattered locations in main memory, which can waste memory bandwidth. Prior schedulers are aware of one or the other; DAWS is aware of both branch AND memory divergence, and focuses on improving performance.
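A hedged illustration, using a hypothetical kernel not taken from the talk, of how both kinds of divergence arise inside a single warp:

```cuda
// Hypothetical kernel illustrating both divergence types within one warp.
__global__ void divergence_demo(const float* in, float* out, const int* idx) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Branch divergence: lanes of the same warp take different paths, so the
    // SIMD pipeline runs each path with the other lanes masked off.
    if (tid % 2 == 0)
        out[tid] = in[tid] * 2.0f;
    else
        out[tid] = in[tid] + 1.0f;

    // Memory divergence: if idx[] scatters across cache lines, one warp-level
    // load becomes many separate memory accesses and can waste bandwidth.
    out[tid] += in[idx[tid]];
}
```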

Motivation: improve the performance of programs with memory divergence. These parallel irregular applications are economically important (server computing, big data). Goal: transfer locality management from software to hardware, because software solutions complicate programming, are not always performance portable, are not guaranteed to improve performance, and are sometimes impossible.

Programmability case study: sparse vector-matrix multiply, two versions from SHOC. The divergent version is simple, and each thread has locality, but it suffers memory divergence. The GPU-optimized version adds complication: explicit scratchpad use, a parallel reduction, and code that is dependent on the warp size. A sketch of the divergent style follows.
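A minimal sketch of that divergent, one-thread-per-row CSR style; the names are illustrative rather than SHOC's actual source:

```cuda
// One thread per matrix row: simple to write, but rows of different lengths
// cause branch divergence in the loop, and neighboring threads read
// unrelated addresses, causing memory divergence.
__global__ void spmv_csr_scalar(const float* vals, const int* cols,
                                const int* row_ptr, const float* x,
                                float* y, int num_rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        for (int i = row_ptr[row]; i < row_ptr[row + 1]; ++i)
            sum += vals[i] * x[cols[i]];
        y[row] = sum;
    }
}
```

In the SHOC-style optimized version, a warp instead cooperates on each row and combines partial sums with a parallel reduction through the scratchpad, which is where the added complexity and the warp-size dependence come from.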

Previous work: Cache-Conscious Wavefront Scheduling (MICRO 2012) used the warp scheduler to capture intra-thread locality, stopping some warps (W1 … WN) and letting others go when lost locality is detected in the L1D. Comparing it with Divergence-Aware Warp Scheduling:
- CCWS is reactive: it detects interference and then throttles. It is unaware of branch divergence, treats all warps equally, and is outperformed by profiled static throttling (case study: divergent code, 50% slowdown).
- DAWS is proactive: it predicts footprints and throttles before interference occurs. It adapts to branch divergence and outperforms the static solution (case study: divergent code, <4% slowdown).

Divergence-Aware Warp Scheduling: how to be proactive. Identify where locality exists, and limit the number of warps executing in high-locality regions. Adapt to branch divergence: create a cache footprint prediction in high-locality regions, account for the number of active lanes to create a per-warp footprint prediction, and change the prediction as branch divergence occurs.

Where is the locality? Examine every load instruction in the program. [Figure: hits and misses per kilo-instruction (PKI) for the 13 static load instructions in the GC workload; the locality is concentrated in the loads inside a loop.]

Locality in loops, limit study: how much data should we keep around? [Figure: fraction of cache hits in loops, split into hits on a line accessed this iteration, hits on a line accessed in the immediately previous trip, and other.] The hits are dominated by data accessed in the immediately previous loop trip.

DAWS objectives: predict the amount of data accessed by each warp in a loop iteration, and schedule warps in loops so that the aggregate predicted footprint does not exceed the L1D capacity. A sketch of this rule follows.
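A minimal host-side model of that scheduling rule, with assumed structure and names (the real mechanism lives in the SM's warp scheduler hardware):

```cuda
const int NUM_WARPS = 32;       // warp contexts per SM (matches methodology)
const int L1D_LINES = 256;      // e.g. 32 KB L1D with 128 B lines

int  predicted_footprint[NUM_WARPS];  // cache lines per loop iteration
bool in_loop[NUM_WARPS];              // warp currently inside the hot loop

// A warp may enter (or continue in) a high-locality loop only if the
// aggregate predicted footprint of all warps in the loop, including its
// own, still fits in the L1 data cache; otherwise it is throttled.
bool may_enter_loop(int warp_id) {
    int total = predicted_footprint[warp_id];
    for (int w = 0; w < NUM_WARPS; ++w)
        if (in_loop[w] && w != warp_id)
            total += predicted_footprint[w];
    return total <= L1D_LINES;
}
```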

Observations that enable prediction: memory divergence in static instructions is predictable, and the data touched by divergent loads is dependent on the active mask. Both are used to create the cache footprint prediction. [Figure: the same divergent load generates 4 memory accesses for a warp with 4 active lanes but only 2 accesses for a warp with 2 active lanes.]

Online characterization creates the cache footprint prediction: detect loops with locality (some loops have locality, some don't; limit multithreading only there), classify the loads in the loop as diverged or not diverged, and compute the footprint from the active mask. Example: a loop with locality, while (…) { load 1 … load 2 }, where load 1 is diverged and load 2 is not. For Warp 0 with 4 active lanes, the diverged load contributes 4 accesses and the non-diverged load contributes 1 access, so Warp 0's footprint = 5 cache lines. The sketch below spells out this computation.
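A small sketch of that per-warp footprint computation; the encoding is assumed, not the paper's exact hardware:

```cuda
#include <cstdint>

// Count the set bits in the warp's 32-bit active mask (one bit per lane).
static int popcount32(uint32_t x) {
    int n = 0;
    for (; x; x &= x - 1) ++n;
    return n;
}

// A load classified as diverged is predicted to touch one cache line per
// active lane; a non-diverged load touches one line for the whole warp.
int predict_footprint(uint32_t active_mask, const bool* load_is_diverged,
                      int num_loads_in_loop) {
    int lines = 0;
    for (int l = 0; l < num_loads_in_loop; ++l)
        lines += load_is_diverged[l] ? popcount32(active_mask) : 1;
    return lines;  // predicted cache lines per loop iteration
}
```

For the slide's Warp 0 (mask 0b1111, one diverged and one non-diverged load) this gives 4 + 1 = 5 cache lines, matching the example above.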

DAWS operation example: a compressed sparse row (CSR) kernel in which each thread sums one row of A, using row offsets C:

    int C[] = {0, 64, 96, 128, 160, 160, 192, 224, 256};
    void sum_row_csr(float* A, …) {
        float sum = 0;
        int i = C[tid];
        while (i < C[tid+1]) {
            sum += A[i];
            ++i;
        }
        …
    }

Time 0: on Warp 0's 1st iteration a diverged load is detected (memory divergence, 4 active threads), so Warp 0's footprint = 4x1 cache lines; Warp 1 has no footprint yet and is stopped. Early warps profile the loop for later warps. Time 1: Warp 0 (2nd iteration) and Warp 1 (1st iteration) both go; Warp 1 enters the loop with a divergent branch, so its footprint = 3x1. DAWS wants to capture spatial locality, and both warps capture it together in the cache: lines at A[0], A[64], A[96], A[128], A[160], A[192], A[224] are loaded once and then hit repeatedly (e.g. x30 on the line at A[32]) as threads walk their rows. Time 2: by Warp 0's 33rd iteration it has branch divergence of its own (shorter rows have finished), so its footprint is decreased; where no locality is detected, no footprint is assigned.

Methodology: GPGPU-Sim (version 3.1.0); 30 streaming multiprocessors; 32 warp contexts per SM (1024 threads total); 32 KB L1D per streaming multiprocessor; 1 MB unified L2 cache. Compared schedulers: Cache-Conscious Wavefront Scheduling (CCWS), the profile-based Best-SWL, and Divergence-Aware Warp Scheduling (DAWS); more schedulers are evaluated in the paper.

Sparse MM case study results, performance: [Figure: execution time of the divergent code under each scheduler, normalized to the optimized version.] Under DAWS, the divergent code runs within 4% of the optimized version with no programmer input.

Sparse MM case study results, properties (normalized to the optimized version): [Figure: divergent-code off-chip accesses and instruction counts; one bar reaches 13.3.] Under DAWS, the divergent code sees a <20% increase in off-chip accesses and issues 2.8x fewer instructions, so the divergent version now has potential energy advantages.

Cache-sensitive applications evaluated: Breadth-First Search (BFS), Memcached-GPU (MEMC), Sparse Matrix-Vector Multiply (SPMV-Scalar), Garbage Collector (GC), and K-Means Clustering (KMN). Cache-insensitive applications are covered in the paper.

Results: [Figure: normalized speedup for BFS, MEMC, SPMV-Scalar, GC, KMN, and their harmonic mean (HMean).] Overall, DAWS achieves a 26% improvement over CCWS and outperforms Best-SWL on the highly branch-divergent workloads.

Summary. Problem: divergent loads in GPU programs; software solutions complicate programming. Solution: DAWS captures locality opportunities by accounting for divergence. Result: a 26% overall performance improvement over CCWS, and in the case study the divergent code performs within 4% of code optimized to minimize divergence. Questions?