Improving GPU Performance via Large Warps and Two-Level Warp Scheduling
Veynu Narasiman (The University of Texas at Austin), Michael Shebanow (NVIDIA), Chang Joo Lee (Intel), Rustam Miftakhutdinov (The University of Texas at Austin), Onur Mutlu (Carnegie Mellon University), Yale N. Patt (The University of Texas at Austin)
MICRO-44, December 6th, 2011, Porto Alegre, Brazil

Rise of GPU Computing
Over the past few years, GPUs have become a popular platform for general-purpose applications.
- New programming models (CUDA, ATI Stream Technology, OpenCL) allow the programmer to create thousands of threads, each of which executes the same code.
- Previous work has successfully ported several applications to the GPU, in some cases achieving an order of magnitude speedup over a single-threaded CPU.
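To make that programming model concrete, here is a minimal CUDA sketch (not part of the original talk; the kernel name, launch configuration, and data size are made up for illustration) in which thousands of threads all execute the same kernel code:

// Minimal CUDA sketch: every thread runs the same kernel, differing only
// in its computed thread index. Data is left uninitialized; this only
// illustrates how thousands of threads are created by one launch.
__global__ void scale(float *data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (idx < n)
        data[idx] *= factor;     // every thread executes this same statement
}

int main() {
    const int n = 1 << 20;       // ~1M elements -> thousands of threads
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);  // 256 threads per block
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}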

How GPUs Exploit Parallelism
The GPU is a massively parallel chip consisting of multiple GPU cores (i.e., Streaming Multiprocessors). This talk focuses on a single GPU core, which exploits parallelism in two major ways:
- SIMD execution: since all threads execute the same kernel, threads are grouped into warps (32 threads per warp on NVIDIA hardware) sharing a single PC. When a warp executes, all threads within the warp execute the same instruction in parallel across the SIMD resources of the core, implying that the SIMD width equals the warp size.
- Multithreading: many warps (48 on NVIDIA hardware) are concurrently executed and are scheduled in the fetch stage using round-robin scheduling. Having so many warps in flight helps hide long latencies, since when one warp is stalled, another can issue.

The Problem
Despite these techniques, computational resources can still be underutilized, for two reasons:
- Branch divergence
- Long latency operations

Branch Divergence
[Figure: control-flow example with basic blocks A → B (taken) / C (not taken) → D. Block A executes with active mask 1111; after the branch, B executes with mask 1001 and C with mask 0110, and the paths reconverge at D with mask 1111. A divergence stack tracks Reconverge PC, Active Mask, and Execute PC; the current PC (B) and current active mask (1001) select which threads execute.]
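Divergence of the kind shown in the figure arises whenever threads of the same warp take different sides of a branch. A hypothetical CUDA kernel (not from the talk) that would produce such an active-mask split:

// Hypothetical CUDA sketch of a divergent branch (illustrative only).
__global__ void divergent(int *out, const int *in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (in[tid] & 1) {          // "block B": only threads with odd inputs active
        out[tid] = in[tid] * 3;
    } else {                    // "block C": the remaining threads active
        out[tid] = in[tid] / 2;
    }
    out[tid] += 1;              // "block D": all threads reconverge here
}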

Long Latency Operations
[Figure: execution timeline for round-robin scheduling with 16 total warps. The core computes with all warps, then all of their memory requests (Req Warp 0, Req Warp 1, ..., Req Warp 15) are outstanding in the memory system at roughly the same time, leaving the core idle until the responses return and all warps compute again.]

[Figure: computational resource utilization for the baseline configuration (32 warps, 32 threads per warp, SIMD width = 32, round-robin scheduling), with periods labeled "Good" (resources fully utilized) and "Bad" (resources underutilized or idle).]

Large Warp Microarchitecture (LWM)
Alleviates branch divergence:
- Fewer, but larger warps: warp size much greater than the SIMD width; total thread count and SIMD width stay the same.
- Dynamically breaks a large warp down into sub-warps that can be executed on the existing SIMD pipeline.
- The active mask is rearranged as a 2D structure whose number of columns equals the SIMD width; each column is searched for an active thread to create a new sub-warp (see the sketch below).
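A minimal host-side sketch of the column-search step, written as plain C++ (valid in a .cu file). It is an illustration under scaled-down, assumed parameters, not the authors' hardware design:

// Sketch of LWM sub-warp creation: the large warp's active mask is viewed as a
// 2D grid with SIMD_WIDTH columns, and each step one active thread per column
// is packed into a new sub-warp until no active threads remain.
#include <array>
#include <cstdio>
#include <vector>

constexpr int SIMD_WIDTH = 4;   // illustrative; the paper's SIMD width is 32
constexpr int ROWS = 4;         // large warp size = ROWS * SIMD_WIDTH threads

using Mask2D = std::array<std::array<int, SIMD_WIDTH>, ROWS>;

std::vector<std::array<int, SIMD_WIDTH>> make_subwarps(Mask2D mask) {
    std::vector<std::array<int, SIMD_WIDTH>> subwarps;
    while (true) {
        std::array<int, SIMD_WIDTH> lanes;
        lanes.fill(-1);                      // -1 = lane idle in this sub-warp
        bool found = false;
        for (int col = 0; col < SIMD_WIDTH; ++col) {
            for (int row = 0; row < ROWS; ++row) {
                if (mask[row][col]) {        // first active thread in this column
                    lanes[col] = row * SIMD_WIDTH + col;  // its thread id
                    mask[row][col] = 0;      // consume it
                    found = true;
                    break;
                }
            }
        }
        if (!found) break;                   // no active threads left
        subwarps.push_back(lanes);
    }
    return subwarps;
}

int main() {
    // Example: 9 of 16 threads active after a divergent branch.
    Mask2D m = {{ {{1,0,1,1}}, {{1,1,0,0}}, {{0,1,1,0}}, {{1,0,0,1}} }};
    int s = 0;
    for (const auto& sub : make_subwarps(m)) {
        std::printf("sub-warp %d:", s++);
        for (int t : sub) std::printf(" %3d", t);   // -1 marks an idle SIMD lane
        std::printf("\n");
    }
    return 0;
}

With this mask the sketch produces three sub-warps, the first two fully packed and the last mostly idle, mirroring how LWM converts a divergent large warp into dense SIMD work.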

Large Warp Microarchitecture Example
[Figure: animation of the decode stage packing a large warp's 2D active mask into successive sub-warp masks (sub-warp 0, sub-warp 1, sub-warp 2), selecting one active thread per column for each sub-warp.]

More Large Warp Microarchitecture
- The divergence stack is still used, handled at the large warp level.
- How large should we make the warps? More threads per warp means more potential for sub-warp creation, but too large a warp size can degrade performance.
- Re-fetch policy for conditional branches: must wait until the last sub-warp finishes.
- Optimization for unconditional branch instructions: don't create multiple sub-warps; sub-warping always completes in a single cycle.

Two-Level Round-Robin Scheduling
- Split warps into equal-sized fetch groups and create an initial priority among the fetch groups.
- Use round-robin scheduling among warps in the same fetch group.
- When all warps in the highest-priority fetch group are stalled, rotate the fetch group priorities: the highest-priority fetch group becomes the least.
- Warps therefore arrive at a stalling point at slightly different points in time, giving better overlap of computation and memory latency (see the scheduler sketch below).
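A minimal sketch of this policy in plain C++ (valid in a .cu file). The grouping of warp IDs and the stalled() predicate are assumed interfaces supplied by the pipeline, not the authors' fetch-stage logic:

// Sketch of two-level round-robin warp selection (illustrative only).
#include <cstdio>
#include <functional>
#include <vector>

struct TwoLevelScheduler {
    std::vector<std::vector<int>> groups;  // warp IDs, split into fetch groups
    std::vector<int> rr_pos;               // round-robin position within each group
    int top_group = 0;                     // index of the highest-priority fetch group

    explicit TwoLevelScheduler(std::vector<std::vector<int>> g)
        : groups(std::move(g)), rr_pos(groups.size(), 0) {}

    // Pick the next warp to fetch this cycle, or -1 if every warp is stalled.
    int next_warp(const std::function<bool(int)>& stalled) {
        const int ngroups = static_cast<int>(groups.size());
        for (int tried = 0; tried < ngroups; ++tried) {
            int g = (top_group + tried) % ngroups;
            const std::vector<int>& warps = groups[g];
            const int nwarps = static_cast<int>(warps.size());
            // Plain round-robin among warps in the same fetch group.
            for (int i = 0; i < nwarps; ++i) {
                int idx = (rr_pos[g] + i) % nwarps;
                if (!stalled(warps[idx])) {
                    rr_pos[g] = (idx + 1) % nwarps;
                    return warps[idx];
                }
            }
            // All warps in the current highest-priority group are stalled:
            // rotate priorities so that group becomes the least important.
            if (g == top_group)
                top_group = (top_group + 1) % ngroups;
        }
        return -1;
    }
};

int main() {
    // Two fetch groups of two warps each; warp 0 is stalled this cycle.
    TwoLevelScheduler sched({{0, 1}, {2, 3}});
    auto stalled = [](int w) { return w == 0; };
    std::printf("fetch warp %d\n", sched.next_warp(stalled));  // prints 1
    return 0;
}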

Round Robin vs Two-Level Round Robin
[Figure: two execution timelines. Top, round-robin scheduling with 16 total warps: the core computes with all warps, then all 16 memory requests (Req Warp 0 ... Req Warp 15) are outstanding at once and the core sits idle until they return. Bottom, two-level round-robin scheduling with 2 fetch groups of 8 warps each: group 0 issues its requests (Req Warp 0 ... Req Warp 7) while group 1 computes, then group 1's requests (Req Warp 8 ... Req Warp 15) overlap with group 0's next compute phase, yielding saved cycles.]

More on Two-Level Scheduling
What should the fetch group size be? Enough warps to keep the pipeline busy in the absence of long latency stalls.
- Too small: uneven progression of warps in the same fetch group; destroys data locality among warps.
- Too large: reduces the benefits of two-level scheduling; more warps stall at the same time.
Not just for hiding memory latency: for complex instructions (e.g., sine, cosine, sqrt), two-level scheduling also allows warps to arrive at such instructions at slightly different points in time.

Combining LWM and Two-Level Scheduling
- 4 large warps of 256 threads each; fetch group size = 1 large warp.
- Problematic for applications with few long latency stalls: no stalls means no fetch group priority changes, so a single large warp can be starved, and the branch re-fetch policy for large warps creates bubbles in the pipeline.
- Solution: timeout-invoked fetch group priority change with a 32K-instruction timeout period, which alleviates starvation.
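The timeout can be layered on top of the scheduler sketch above; a hedged illustration in which only the 32K-instruction period comes from the slide and everything else is assumed:

// Illustrative extension of the earlier scheduler sketch: rotate fetch group
// priorities after a fixed number of fetched instructions even if no stall
// ever forces a rotation, so a single large warp cannot be starved.
struct TimeoutRotation {
    int instructions_since_rotation = 0;
    static constexpr int kTimeout = 32 * 1024;  // 32K-instruction period (from the slide)

    // Call once per fetched instruction; returns true when priorities should rotate.
    bool tick() {
        if (++instructions_since_rotation >= kTimeout) {
            instructions_since_rotation = 0;
            return true;   // caller advances top_group as in the two-level sketch
        }
        return false;
    }
};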

Methodology
Simulated a single GPU core with 1024 thread contexts divided into 32 warps of 32 threads each.
- Scalar front end: 1-wide fetch and decode; 4KB single-ported I-cache; round-robin scheduling.
- SIMD back end: in order, 5 stages, 32 parallel SIMD lanes.
- Register file and on-chip memories: 64KB register file; 128KB, 4-way D-cache with 128B line size; 128KB, 32-banked private memory.
- Memory system: open-row, first-come first-serve scheduling; 8 banks with a 4KB row buffer per bank; 100-cycle row hit latency, 300-cycle row conflict latency; 32 GB/s memory bandwidth.

Overall IPC Results
LWM+2Lev improves performance by 19.1% over the baseline and by 11.5% over TBC (Thread Block Compaction).

IPC and Computational Resource Utilization

Conclusion
For maximum performance, the computational resources on GPUs must be effectively utilized; branch divergence and long latency operations cause them to be underutilized or unused.
We proposed two mechanisms to alleviate this:
- Large Warp Microarchitecture for branch divergence
- Two-level scheduling for long latency operations
Together they improve performance by 19.1% over traditional GPU cores and increase the scope of applications that can run efficiently on a GPU.
Questions?