
Towards Green GPUs: Warp Size Impact Analysis
Ahmad Lashgar, Amirali Baniasadi, Ahmad Khonsari
ECE, University of Tehran; ECE, University of Victoria

This Work
 Accelerators
   o Control flow is amortized over tens of threads, called a warp
   o Warp size impacts branch/memory divergence and memory access coalescing
   o Small warp: low branch/memory divergence (+), low memory coalescing (-)
   o Large warp: high branch/memory divergence (-), high memory coalescing (+)
 Key question: which processor provides higher energy efficiency?
   o Small-warp, coalescing-enhanced
   o Large-warp, control-flow-enhanced
 Key result: the enhanced small-warp processor outperforms the enhanced large-warp processor

Outline
 Branch/Memory Divergence
 Memory Access Coalescing
 Warp Size Impact on Divergence and Coalescing
 Warp Size: Large or Small?
   o Use machine models to find the answer:
   o Small-Warp Coalescing-Enhanced Machine (SW+)
   o Large-Warp Control-flow-Enhanced Machine (LW+)
 Experimental Results
 Conclusion

Warping
 Opportunities
   o Reduce scheduling overhead
   o Improve utilization of execution units (SIMD efficiency)
   o Exploit inter-thread data locality
 Challenges
   o Memory divergence
   o Branch divergence

Memory Divergence
 Threads of a warp may hit or miss in the L1 cache on the same access:
    J = A[S]; // L1 cache access
    L = K * J;
(Figure: within a warp of threads T0-T3, some accesses hit and some miss; the warp stalls until the misses return.)
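
A minimal CUDA sketch of this pattern, assuming hypothetical device arrays A, S, and C (variable names follow the snippet above): the load address depends on per-thread data, so lanes of the same warp can split into hits and misses on the same instruction.

    __global__ void gather(int *C, const int *A, const int *S, int K) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int J = A[S[tid]];   // data-dependent address: some lanes hit in L1, others miss
        C[tid] = K * J;      // the whole warp waits for its slowest lane before using J
    }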

Branch Divergence
 A branch instruction can diverge to two different paths, dividing the warp into two groups:
   1. Threads with the taken outcome
   2. Threads with the not-taken outcome
    if (J == K) {
        C[tid] = A[tid] * B[tid];
    } else if (J > K) {
        C[tid] = 0;
    }
(Figure: for a warp of threads T0-T3, the two paths execute one after the other, each with only part of the warp active.)
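
A self-contained CUDA version of the fragment above, under the assumption that J is a per-thread value loaded from a hypothetical array Jarr while K is uniform; lanes that disagree on the comparison force the warp to execute both paths serially.

    __global__ void divergent(float *C, const float *A, const float *B,
                              const int *Jarr, int K) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int J = Jarr[tid];             // per-thread value, so lanes can disagree
        if (J == K) {
            C[tid] = A[tid] * B[tid];  // executed with the "taken" lanes active
        } else if (J > K) {
            C[tid] = 0.0f;             // executed with the "not-taken" lanes active
        }
    }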

Memory Access Coalescing
 Memory accesses by neighboring threads to a common block are coalesced into one transaction.
(Figure: three 4-thread warps generate requests to blocks A-E; accesses by a warp to the same block are merged into a single memory request.)
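
A small CUDA sketch (not taken from the slides) contrasting an access pattern that coalesces well with one that does not; the array names and stride parameter are illustrative.

    __global__ void copyCoalesced(float *dst, const float *src) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        dst[tid] = src[tid];           // neighboring lanes touch neighboring words,
                                       // so the warp's accesses merge into few wide transactions
    }

    __global__ void copyStrided(float *dst, const float *src, int stride) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        dst[tid] = src[tid * stride];  // neighboring lanes touch distant words,
                                       // up to one transaction per lane in the worst case
    }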

Coalescing Width
 The range of threads in a warp considered for memory access coalescing:
   o NVIDIA G80 -> over a sub-warp
   o NVIDIA GT200 -> over a half-warp
   o NVIDIA GF100 -> over the entire warp
 When the coalescing width covers the entire warp, the optimal warp size depends on the workload.

Warp Size
 Warp size is the number of threads in a warp.
 Why a small warp? (not smaller than the SIMD width)
   o Less branch/memory divergence
   o Less synchronization overhead at every instruction
 Why a large warp?
   o Greater opportunity for memory access coalescing
 We study the impact of warp size on performance.

Warp Size and Branch Divergence
 The smaller the warp, the lower the branch divergence:
    if (J > K) {
        C[tid] = A[tid] * B[tid];
    } else {
        C[tid] = 0;
    }
(Figure: for threads T1-T8, 2-thread warps see no branch divergence on this outcome pattern, while 4-thread warps diverge.)
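
A host-side sketch (hypothetical, compiled as CUDA host code) that makes the same point by counting issued SIMD slots: a warp whose lanes all agree issues only one path, while a mixed warp must issue both.

    #include <cstdio>
    #include <vector>

    // Count the SIMD slots issued when threads with the given branch outcomes
    // are grouped into warps of warpSize threads.
    static int issuedSlots(const std::vector<bool> &taken, int warpSize) {
        int slots = 0;
        for (size_t base = 0; base < taken.size(); base += warpSize) {
            bool anyTaken = false, anyNotTaken = false;
            for (size_t i = base; i < base + warpSize && i < taken.size(); ++i) {
                if (taken[i]) anyTaken = true; else anyNotTaken = true;
            }
            slots += warpSize * ((anyTaken ? 1 : 0) + (anyNotTaken ? 1 : 0));
        }
        return slots;
    }

    int main() {
        // Example outcomes for T1..T8 (illustrative only).
        std::vector<bool> taken = {true, true, false, false, true, true, true, true};
        std::printf("2-thread warps: %d slots\n", issuedSlots(taken, 2));  // 8: no mixed warps
        std::printf("4-thread warps: %d slots\n", issuedSlots(taken, 4));  // 12: one mixed warp issues both paths
        return 0;
    }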

Warp Size and Branch Divergence (continued)
(Figure: after a divergent branch, small warps whose threads all follow one path skip the other path entirely, saving some idle cycles compared to a large warp that keeps fully masked groups in flight.)

Warp Size and Memory Divergence
(Figure: with small warps, the warps that hit in the cache keep executing while the missing warp stalls, improving latency hiding; with a large warp, a single miss stalls all of its threads.)

Warp Size and Memory Access Coalescing
(Figure: the same access pattern produces 5 memory requests with small warps but only 2 with a large warp; wider coalescing reduces the number of memory accesses. See the sketch below.)
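
A host-side sketch (hypothetical, CUDA host code) of how such request counts can be estimated: each warp issues one transaction per distinct memory block its lanes touch, so grouping more threads per warp can merge requests. The addresses and block size in main are illustrative, not the slide's exact example.

    #include <cstdio>
    #include <set>
    #include <vector>

    // One transaction per distinct block touched by each warp.
    static int transactions(const std::vector<unsigned> &addr, size_t warpSize, unsigned blockBytes) {
        int total = 0;
        for (size_t base = 0; base < addr.size(); base += warpSize) {
            std::set<unsigned> blocks;
            for (size_t i = base; i < base + warpSize && i < addr.size(); ++i)
                blocks.insert(addr[i] / blockBytes);
            total += (int)blocks.size();
        }
        return total;
    }

    int main() {
        std::vector<unsigned> addr;                        // 12 threads reading consecutive 4-byte words
        for (unsigned t = 0; t < 12; ++t) addr.push_back(t * 4);
        std::printf("4-thread warps: %d transactions\n", transactions(addr, 4, 128));   // 3
        std::printf("12-thread warp: %d transactions\n", transactions(addr, 12, 128));  // 1
        return 0;
    }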

Warp Size Impact on Coalescing
 The larger the warp, the higher the coalescing rate.

Warp Size Impact on Idle Cycles
 The larger the warp, the higher the divergence and the more idle cycles
   o but the coalescing gain may reduce idle cycles.

Warp Size Impact on Energy
 Larger warps reduce energy if the coalescing gain dominates the exacerbated divergence.

Warp Size Impact on Performance
 Larger warps improve performance if the coalescing gain dominates the exacerbated divergence.

Warp Size Impact on Energy Efficiency
 Larger warps improve energy efficiency if the coalescing gain dominates the exacerbated divergence.

Approach
 Baseline machine
 Small-Warp Enhanced (SW+):
   - Ideal MSHRs to compensate for lost coalescing
 Large-Warp Enhanced (LW+):
   - MIMD lanes to compensate for branch divergence

SW+
 Warps as wide as the SIMD width
   o Minimize branch/memory divergence
   o Improve latency hiding
 Compensating the deficiency (lost memory access coalescing) -> ideal MSHRs
   o To merge inter-warp memory transactions, the ideal MSHR tags per-warp outstanding MSHR entries (see the sketch below).
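
A conceptual, host-side sketch of that merging idea, written as CUDA host code; it is not the authors' hardware design, and all names are hypothetical. Outstanding misses are tracked per memory block, and a later miss to a block that already has an entry, even from a different warp, is merged rather than sent to memory again.

    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    struct MshrEntry {
        std::vector<int> waitingWarps;   // warps to wake up when the fill returns
    };

    struct IdealMshr {
        std::unordered_map<unsigned, MshrEntry> entries;  // keyed by block address
        int requestsSent = 0;

        void onMiss(int warpId, unsigned addr, unsigned blockBytes) {
            unsigned block = addr / blockBytes;
            auto it = entries.find(block);
            if (it == entries.end()) {
                entries[block].waitingWarps.push_back(warpId);  // first miss to this block:
                ++requestsSent;                                 // one request goes to memory
            } else {
                it->second.waitingWarps.push_back(warpId);      // merged: no extra request
            }
        }
    };

    int main() {
        IdealMshr mshr;
        mshr.onMiss(0, 0x1000, 128);  // warp 0 misses
        mshr.onMiss(1, 0x1040, 128);  // warp 1, same 128-byte block: merged
        mshr.onMiss(2, 0x2000, 128);  // warp 2, different block: new request
        std::printf("memory requests sent: %d\n", mshr.requestsSent);  // prints 2
        return 0;
    }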

LW+
 Warps 8x larger than the SIMD width
   o Improve memory access coalescing
 Compensating the deficiency (branch/memory divergence) -> lock-step MIMD execution
   o Parallel fetch/decode unit per lane

Methodology
 Performance simulation with GPGPU-sim and power simulation with McPAT
   o Six memory controllers (76 GB/s)
   o 16 8-wide SMs (332.8 GFLOPS)
   o 1024 threads per core
   o Warp size: 8, 16, 32, and 64
 Workloads
   o RODINIA
   o CUDA SDK
   o GPGPU-sim

Coalescing Rate
 SW+: 103%, 67%, and 40% higher coalescing than 16, 32, and 64 threads/warp
 LW+: 47%, 21%, and 1% higher coalescing than 16, 32, and 64 threads/warp

Idle Cycles
 SW+: 12%, 8%, and 10% fewer idle cycles than 8, 16, and 32 threads/warp
 LW+: 4%, 1%, and 3% fewer idle cycles than 8, 16, and 32 threads/warp

Energy
 SW+: outperforms 8 threads/warp (26%)
 LW+: outperforms SW+ (19%), 8 threads/warp (51%), and 16 threads/warp (3%)

Performance
 SW+: outperforms LW+ (7%), 8 (18%), 16 (15%), and 32 (25%) threads/warp
 LW+: outperforms 8 (11%), 16 (8%), 32 (17%), and 64 (30%) threads/warp

Energy Efficiency
 SW+: outperforms LW+ (62%), 8 (136%), 16 (13%), and 32 (4%) threads/warp
 LW+: outperforms 8 (46%) and 64 (8%) threads/warp

Conclusion & Future Work
 Warp size impacts coalescing rate, idle cycles, performance, and energy.
 We use machine models to explore the design question: investing in enhancing the small-warp machine returns a higher gain than enhancing the large-warp machine.
 Future work: evaluating wider machine models (including an LWM-enhanced large-warp machine).

Thank you! Questions?

Backup Slides

Warping
 Thousands of threads are scheduled with zero overhead
   o The contexts of all threads are kept on-core
 Tens of threads are grouped into a warp
   o They execute the same instruction in lock-step (see the sketch below)
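
A minimal CUDA sketch, assuming a 1D launch and hypothetical output arrays, of how thread IDs map onto warps: consecutive threads within a block share a warp and fetch the same instruction.

    __global__ void whoAmI(int *warpOf, int *laneOf) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        warpOf[tid] = threadIdx.x / warpSize;  // warp index within the block (warpSize is a CUDA built-in)
        laneOf[tid] = threadIdx.x % warpSize;  // this thread's lane within its warp
    }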

Key Question
 Which warp size should be chosen as the baseline?
   o The processor is then augmented to remove the deficiency associated with that warp size.
 We use machine models to find the answer.

GPGPU-sim Configuration
NoC
  #SMs / #memory controllers: 16 / 6
  SMs sharing a network interface: 2
SM
  Threads per SM / SIMD width: 1024 / 32
  Maximum CTAs per SM: 8
  Shared memory / register file size: 16 KB / 64 KB
  Warp size: 8 / 16 / 32 / 64
  L1 data / texture / constant cache: 64 KB / 16 KB / 16 KB
Clocking
  Core / interconnect / DRAM: 1300 / 650 / 800 MHz
Memory
  Banks per memory controller / DRAM scheduling policy: 8 / FCFS

Workloads

Name | Grid Size | Block Size | #Insn
BFS: BFS Graph [3] | 16x(8,1,1) | 16x(512,1) | 1.4M
BKP: Back Propagation [3] | 2x(1,64,1) | 2x(16,16) | 2.9M
CP: Distance-Cutoff Coulomb Potential [1] | (8,32,1) | (16,8,1) | 113M
GAS: Gaussian Elimination [3] | 48x(3,3,1) | 48x(16,16) | 8.8M
HSPT: Hotspot [3] | (43,43,1) | (16,16,1) | 76.2M
LPS: Laplace equation on regular 3D grid [1] | (4,25) | (32,4) | 81.7M
MP: MUMmer-GPU++ [6] | (1,1,1) | (256,1,1) | 0.3M
MU: MUMmer-GPU [1] | (1,1,1) | (100,1,1) | 0.15M
NN: Neural Network [1] | (6,28) (50,28) (100,28) (10,28) | (13,13) (5,5) 2x(1,1) | 68.1M
NNC: Nearest Neighbor [3] | 4x(938,1,1) | 4x(16,1,1) | 5.9M
NQU: N-Queen [1] | (256,1,1) | (96,1,1) | 1.2M
RAY: Ray-tracing [1] | (16,32) | (16,8) | 64.9M
SC: Scan [18] | (64,1,1) | (256,1,1) | 3.6M
SR1: SRAD [3] (large dataset) | 3x(8,8,1) | 3x(16,16) | 9.1M
SR2: SRAD [3] (small dataset) | 4x(4,4,1) | 4x(16,16) | 2.4M