Inter-Warp Instruction Temporal Locality in Deep-Multithreaded GPUs
Ahmad Lashgar, Amirali Baniasadi, Ahmad Khonsari
ECE, University of Tehran; ECE, University of Victoria

This Work
- Accelerators:
  - Designed to maximize throughput
  - ITL: warps fetch and decode the same instructions over and over
  - The repeated fetch/decode work is wasted energy
- Our solution:
  - Keep recently fetched instructions in a small buffer and serve repeats from it, saving energy
- Key result: 19% front-end energy reduction

Outline
- Background
- Instruction Locality
- Exploiting Instruction Locality
  - Decoded-Instruction Buffer
  - Row Buffer
  - Filter Cache
- Case Study: Filter Cache
  - Organization
  - Experimental Setup
  - Experimental Results

Heterogeneous Systems
- Heterogeneous systems pair two kinds of cores to reach the best performance/watt:
  - A superscalar, speculative, out-of-order processor for latency-sensitive serial workloads
  - An accelerator (a multi-threaded, in-order SIMD processor) for high-throughput parallel workloads
- 6 of the 10 fastest Top500.org supercomputers today employ accelerators:
  - IBM Power BQC 16C 1.60 GHz (1st, 3rd, 8th, and 9th)
  - NVIDIA Tesla (6th and 7th)

GPUs as Accelerators
- GPUs are the most widely available accelerators
  - A class of general-purpose processors known as SIMT
  - Integrated on the same die as the CPU (Sandy Bridge, etc.)
- High energy efficiency [Dally'2010]:
  - A GPU spends about 200 pJ per instruction
  - A CPU spends about 2 nJ per instruction
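Taking the two figures above at face value, the gap works out to an order of magnitude; this is just the ratio of the numbers on the slide, not an additional result:

\[
\frac{2\,\mathrm{nJ/instruction}}{200\,\mathrm{pJ/instruction}}
  = \frac{2000\,\mathrm{pJ}}{200\,\mathrm{pJ}} = 10\times
\]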

SIMT Accelerator
- SIMT: Single-Instruction, Multiple-Thread
- Goal is throughput
- Deeply multithreaded, designed for latency hiding
- 8- to 32-lane SIMD

Streaming Multiprocessor (SM), CTAs & Warps
- Threads of the same thread block (CTA):
  - Communicate through fast shared memory
  - Synchronize through a fast hardware barrier
- A CTA is assigned to one SM
- SMs execute at warp granularity (groups of 8-32 threads); see the kernel sketch below
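As a concrete illustration of the CTA model (a hypothetical kernel, not taken from the talk), the minimal CUDA kernel below has each 256-thread block stage data in shared memory, synchronize at a barrier, and then read an element written by a different thread of the same CTA:

// Hypothetical kernel: threads of one CTA communicate through fast on-chip
// shared memory and synchronize with __syncthreads(). Launch with 256-thread blocks.
__global__ void reverse_within_cta(const int *in, int *out)
{
    __shared__ int buf[256];              // memory shared by all threads of the CTA
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    buf[tid] = in[gid];                   // each thread stores one element
    __syncthreads();                      // fast intra-CTA barrier

    out[gid] = buf[blockDim.x - 1 - tid]; // read a value another thread wrote
}

The hardware transparently splits each such block into warps (for example, eight 32-thread warps) and schedules them on the SM that the CTA was assigned to.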

Warping Benefits
- Thousands of threads are scheduled with zero overhead
  - The context of every thread is kept on the core
- Concurrent threads are grouped into warps, which:
  - Share control-flow tracking overhead
  - Reduce scheduling overhead
  - Improve utilization of the execution units (SIMD efficiency)

Energy Reduction Potential in GPUs
- Huge amount of on-chip context:
  - Caches
  - Shared memory
  - Register file
  - Execution units
- Many inactive threads, due to:
  - Synchronization
  - Branch/memory divergence
- High locality:
  - Different threads behave similarly

Baseline Pipeline Front-end
- Modeled according to NVIDIA patents
- 3-stage front-end:
  - Instruction Fetch (IF)
  - Instruction Buffer (IB)
  - Instruction Dispatch (ID)
- Energy breakdown: the I-Cache is the second most energy-consuming front-end structure
- A simplified flow through the three stages is sketched below
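The host-side sketch below (plain C++ with hypothetical types and latching, not the patented NVIDIA design) shows instructions from two warps flowing IF -> IB -> ID, with a round-robin warp scheduler driving the fetch stage:

#include <cstdint>
#include <cstdio>
#include <queue>

struct Insn { int warp_id; uint32_t pc; uint32_t bits; };

int main() {
    std::queue<Insn> if_ib;                // IF -> IB pipeline latch
    std::queue<Insn> ib_id;                // IB -> ID pipeline latch
    uint32_t pc[2] = {0x100, 0x100};       // two warps executing the same kernel code

    for (int cycle = 0; cycle < 8; ++cycle) {
        // ID: dispatch one buffered instruction to the SIMD back-end.
        if (!ib_id.empty()) {
            Insn d = ib_id.front(); ib_id.pop();
            std::printf("cycle %d: dispatch warp %d pc=0x%x\n", cycle, d.warp_id, (unsigned)d.pc);
        }
        // IB: decode (details omitted) and buffer the instruction fetched last cycle.
        if (!if_ib.empty()) { ib_id.push(if_ib.front()); if_ib.pop(); }
        // IF: the round-robin warp scheduler picks a warp and reads the I-Cache
        // (the instruction bits here are just a stand-in value).
        int w = cycle % 2;
        if_ib.push(Insn{w, pc[w], pc[w]});
        pc[w] += 8;                        // advance that warp to its next instruction
    }
    return 0;
}

Note how both warps fetch the same PCs a few cycles apart; that repetition is exactly the locality the rest of the talk exploits.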

SM Pipeline Front-end Example (figure)
- The warp scheduler selects a warp and its PC; the instruction is fetched from the I-Cache into the per-warp instruction buffer (insn, src1, src2, dest fields); the scoreboard tracks outstanding register writes; the instruction scheduler then reads the operands (e.g., r0 and r1 across all lanes) from the per-warp register file through the operand buffering stage before issue to the SIMD back-end.
- Example code sequence: 1: add r2 <- r0, r1   2: ld r3 <- [r2]

Inter-Thread Instruction Locality (ITL)
- Warps are likely to fetch and decode the same instructions
- Measured as the percentage of fetched instructions that were recently fetched by other currently active warps (figure)
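One way to compute such a metric offline (a sketch under our reading of the definition; the window size, the trace format, and approximating "currently active" by "appearing in the recent window" are all assumptions, not the talk's exact setup):

#include <cstdint>
#include <deque>
#include <vector>

struct Fetch { int warp_id; uint64_t pc; };

// Fraction of fetches whose PC was fetched within the last `window` fetches
// by a *different* warp -- a proxy for inter-warp instruction temporal locality.
double itl_fraction(const std::vector<Fetch> &trace, size_t window) {
    std::deque<Fetch> recent;
    size_t hits = 0;
    for (const Fetch &f : trace) {
        for (const Fetch &r : recent)
            if (r.pc == f.pc && r.warp_id != f.warp_id) { ++hits; break; }
        recent.push_back(f);
        if (recent.size() > window) recent.pop_front();
    }
    return trace.empty() ? 0.0 : double(hits) / double(trace.size());
}

Applied to the per-SM fetch trace of a deeply multithreaded kernel, this fraction approximates the quantity plotted in the slide's figure.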

Exploiting ITL
- Toward performance improvement:
  - Only a minor gain, from reducing the latency of the arithmetic pipeline
- Toward energy saving:
  - Fetch/decode bypassing, similar to loop buffering
  - Reducing accesses to the I-Cache:
    - Row buffer
    - Filter cache (our case study)

Decoded-Instruction Buffer: Fetch/Decode Bypassing (figure)
- A small buffer of recently decoded instructions, indexed by PC, sits beside the instruction buffer; when it hits, there is no need to access the I-Cache or the decode logic.
- The buffer can bypass 42% of instruction fetches.

Row Buffer (figure)
- Buffers the last accessed I-Cache line; a multiplexer returns the instruction from the row buffer on a hit and from the I-Cache data array otherwise.

Filter Cache (Our Case Study) (figure)
- Buffers the last fetched instructions in a small set-associative table in front of the I-Cache; the PC probes the filter cache first and falls back to the I-Cache tag and data arrays on a miss.

Filter Cache Enhanced Front-end
- Bypasses I-Cache accesses to save dynamic power
- 32-entry (256-byte) FC:
  - FC hit rate: up to ~100%
  - Front-end energy saving: up to 19%
  - Front-end area overhead: 4.7%
  - Front-end leakage overhead: 0.7%
- A minimal structural sketch follows
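A minimal sketch of such a filter cache (assumed parameters: 32 entries of 8 bytes each to total 256 bytes, organized as 8 sets x 4 ways with LRU replacement; the slides do not state the associativity or replacement policy):

#include <array>
#include <cstdint>

// Hypothetical 32-entry filter cache: 8 sets x 4 ways, one instruction per entry.
// A hit lets the front-end skip the larger, more power-hungry I-Cache access.
class FilterCache {
    struct Entry { bool valid = false; uint64_t tag = 0; uint64_t insn = 0; unsigned lru = 0; };
    static constexpr unsigned kSets = 8, kWays = 4;
    std::array<std::array<Entry, kWays>, kSets> sets_;
    unsigned tick_ = 0;

public:
    // Returns true on a hit and serves the instruction without touching the I-Cache.
    bool lookup(uint64_t pc, uint64_t &insn) {
        unsigned set = (pc >> 3) % kSets;   // 8-byte entries: drop the low 3 bits
        uint64_t tag = pc >> 6;             // remaining bits above the set index
        for (Entry &e : sets_[set])
            if (e.valid && e.tag == tag) { e.lru = ++tick_; insn = e.insn; return true; }
        return false;
    }

    // On a miss, fill the least recently used way with the instruction read from the I-Cache.
    void fill(uint64_t pc, uint64_t insn) {
        unsigned set = (pc >> 3) % kSets;
        uint64_t tag = pc >> 6;
        Entry *victim = &sets_[set][0];
        for (Entry &e : sets_[set])
            if (e.lru < victim->lru) victim = &e;   // invalid entries have lru == 0
        *victim = {true, tag, insn, ++tick_};
    }
};

On a hit the I-Cache tag and data arrays stay idle; on a miss the front-end performs the normal I-Cache access and then calls fill() with the fetched instruction.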

Methodology
- Cycle-accurate simulation of CUDA workloads with GPGPU-sim
  - Configured to model the NVIDIA Tesla architecture
  - 16 8-wide SMs
  - 1024 threads/SM
  - 48 KB D-L1$/SM
  - 4 KB I-L1$/SM (256-byte lines)
- 21 workloads from:
  - RODINIA (Backprop, ...)
  - CUDA SDK (Matrix Multiply, ...)
  - GPGPU-sim (RAY, ...)
  - Parboil (CP)
  - Third-party sequence alignment (MUMmerGPU++)

Methodology (2)
- Energy evaluated at the 32-nm technology node using CACTI
- The CACTI table reports area (μm²), leakage (mW), energy per read/write (pJ), and delay (ps) for the I-Cache tag and data arrays, the instruction buffer, the scoreboard, the operand buffer, and the 32- and 16-entry FC tag and data arrays; the FC tag is modeled as a wide tag array and the FC data as a data array.

Experimental Results
- FC hit rate and energy saving measured with:
  - A 32-entry FC
  - 1024 threads per SM
  - A round-robin warp scheduler
- Sensitivity analysis over:
  - FC size
  - Threads per SM
  - Warp scheduler

FC Hit Rate and Energy Saving
- For each benchmark the table reports the FC hit rate, the baseline I-Cache energy (nJ), the I-Cache + FC energy (nJ), and the front-end energy saving using the FC.
- FC hit rates: CP 100%, HSPT 89%, LPS 83%, MP 30%, MTM 95%, NN 99%, RAY 76%, SCN 97%
- MP (30% hit rate) runs few concurrent warps with divergent branches; CP (100% hit rate) runs many concurrent warps with coherent branches.

Sensitivity Analysis
- Filter cache size:
  - A larger FC provides a higher hit rate but costs more static/dynamic energy
- Threads per SM:
  - The more threads per SM, the higher the chance that an instruction is re-fetched
- Warp scheduling:
  - Advanced warp schedulers (latency-hiding or data-cache-locality boosters) may keep warps at different paces, spreading their fetches apart; a sketch contrasting round-robin and two-level scheduling follows
  - (Figure: round-robin vs. two-level scheduling timelines for warps W0 and W1 across compute, memory, and pending states)
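To make the contrast concrete, here is a rough sketch (hypothetical structures, not GPGPU-sim code) of the two policies: round-robin hands the fetch slot to every ready warp in turn, so co-resident warps stay at nearly the same PC, while a two-level scheduler fetches only from a small active set and parks warps that stall on memory, letting warps drift apart:

#include <deque>
#include <vector>

struct Warp { int id; bool waiting_on_memory; };

// Round-robin: every ready warp gets the fetch slot in turn, so warps advance in lockstep.
int pick_round_robin(const std::vector<Warp> &warps, int &next) {
    for (size_t tried = 0; tried < warps.size(); ++tried) {
        int w = next;
        next = (next + 1) % (int)warps.size();
        if (!warps[w].waiting_on_memory) return w;
    }
    return -1;                                   // nothing ready this cycle
}

// Two-level: fetch only from a small active set; stalled warps are demoted to a
// pending set and replaced by ready ones, so active warps can run far ahead.
struct TwoLevelScheduler {
    std::deque<int> active;                      // warps eligible for fetch
    std::deque<int> pending;                     // warps parked on long-latency misses
    static constexpr size_t kActiveSize = 2;

    int pick(const std::vector<Warp> &warps) {
        // Demote stalled warps from the active set.
        for (size_t i = 0; i < active.size(); ) {
            if (warps[active[i]].waiting_on_memory) {
                pending.push_back(active[i]);
                active.erase(active.begin() + i);
            } else {
                ++i;
            }
        }
        // Refill the active set with ready warps from the pending set (one pass).
        for (size_t scans = pending.size(); active.size() < kActiveSize && scans > 0; --scans) {
            int w = pending.front(); pending.pop_front();
            if (!warps[w].waiting_on_memory) active.push_back(w);
            else pending.push_back(w);
        }
        if (active.empty()) return -1;
        int w = active.front();                  // fetch from the head of the active set
        active.pop_front(); active.push_back(w); // round-robin within the active set
        return w;
    }
};

Under round-robin, the filter cache tends to see the same PC again almost immediately (from the next warp); under two-level scheduling, the interval between re-fetches of a PC can grow, which is consistent with the small hit-rate drop reported in the results that follow.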

Sensitivity to Multithreading-Depth
- Threads per SM: about 1% hit-rate reduction and about 1% reduction in energy savings.

Sensitivity to Warp Scheduling
- Warp scheduler (round-robin vs. two-level): about 1% hit-rate reduction and about 1% reduction in energy savings.

Sensitivity to Filter Cache Size
- Number of FC entries (32 vs. 16): the 16-entry FC loses about 5% hit rate, costing up to about 1% of the savings due to the lower hit rate, but overall savings increase by about 2% because the FC itself is smaller and cheaper.

Conclusion & Future Works  We have evaluated instruction locality among concurrent warps under deep-multithreaded GPU  The locality can be exploited for performance or energy- saving  Case Study: Filter cache provides 1%-19% energy-saving for the pipeline  Future Works: o Evaluating the fetch/decode bypassing o Evaluating concurrent kernel GPUs Inter-Warp Instruction Temporal Locality in Deep-Multithreaded GPUs 26

Thank you! Questions?

Backup Slides

References
[Dally'2010] W. J. Dally, "GPU Computing: To ExaScale and Beyond," SC 2010.

Workloads

Abbr. | Name and Suite          | Grid Size              | Block Size       | #Insn | CTA/SM
BFS   | BFS Graph [2]           | 16x(8)                 | 16x(512)         | 1.4M  | 1
BKP   | Back Propagation [2]    | 2x(1,64)               | 2x(16,16)        | 2.9M  | 4
CP    | Coulomb Poten. [19]     | (8,32)                 | (16,8)           | 113M  | 8
DYN   | Dyn_Proc [2]            | 13x(35)                | 13x(256)         | 64M   | 4
FWAL  | Fast Wal. Trans. [18]   | 6x(32), 3x(16), (128)  | 7x(256), 3x(512) | 11M   | 2, 4
GAS   | Gaussian Elimin. [2]    | 48x(3,3)               | 48x(16,16)       | 9M    | 1
HSPT  | Hotspot [2]             | (43,43)                | (16,16)          | 76M   | 2
LPS   | Laplace 3D [1]          | (4,25)                 | (32,4)           | 81M   | 6
MP2   | MUMmer-GPU++ [8] big    | (196)                  | (256)            | 139M  | 2
MP    | MUMmer-GPU++ [8] small  | (1)                    | (256)            | 0.3M  | 1
MTM   | Matrix Multiply [18]    | (5,8)                  | (16,16)          | 2.4M  | 4
MU2   | MUMmer-GPU [2] big      | (196)                  | (256)            | 75M   | 4

Workloads (2)

Abbr. | Name and Suite             | Grid Size                          | Block Size            | #Insn | CTA/SM
MU    | MUMmer-GPU [2] small       | (1)                                | (100)                 | 0.2M  | 1
NNC   | Nearest Neighbor [2]       | 4x(938)                            | 4x(16)                | 5.9M  | 8
NN    | Neural Network [1]         | (6,28), (25,28), (100,28), (10,28) | (13,13), (5,5), 2x(1) | 68M   | 5, 8
NQU   | N-Queen [1]                | (256)                              | (96)                  | 1.2M  | 1
NW    | Needleman-Wun. [2]         | 2x(1) ... 2x(31), (32)             | 63x(16)               | 12M   | 2
RAY   | Ray Tracing [1]            | (16,32)                            | (16,8)                | 64M   | 3
SCN   | Scan [18]                  | (64)                               | (256)                 | 3.6M  | 4
SR1   | Speckle Reducing [2] big   | 4x(8,8)                            | 4x(16,16)             | 9.5M  | 2, 3
SR2   | Speckle Reducing [2] small | 4x(4,4)                            | 4x(16,16)             | 2.4M  | 1