Performance in GPU Architectures: Potentials and Distances


Performance in GPU Architectures: Potentials and Distances
Ahmad Lashgar, ECE, University of Tehran
Amirali Baniasadi, ECE, University of Victoria
WDDD-9, June 5, 2011

This Work
Goal: investigate GPU performance for general-purpose workloads.
How: study the isolated impact of:
- Memory divergence
- Branch divergence
- Context-keeping resources
Key finding: memory has the biggest impact; branch divergence solutions need to take memory behavior into account.

Outline
- Background
- Performance Impacting Parameters
- Machine Models
- Performance Potentials
- Performance Distances
- Sensitivity Analysis
- Conclusion

GPU Architecture
[Figure: GPU architecture as configured in this paper. Ten TPCs (three SMs each) and six memory controllers with attached DRAM channels communicate through an interconnection network. Each SM holds a thread pool (per-thread PC, CTA ID, TID), a register file, shared memory, L1 data/constant/texture caches, and 32 PEs.]
- The GPU is a scalable array of SMs and memory controllers communicating through an interconnection network.
- For a specific workload, the number of concurrent CTAs per SM is limited by the size of three shared resources: the thread pool, the register file, and the shared memory.
- Note that real GPUs do not store the PC and CTA ID per thread; storing these per warp is enough.
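To make the CTA limit concrete, here is a minimal sketch (ours, not from the paper) of the occupancy calculation, using the per-SM limits from the paper's configuration and a hypothetical kernel's per-CTA requirements:

```cuda
#include <stdio.h>

/* Per-SM resource limits from the paper's configuration. */
#define THREAD_POOL 1024   /* threads per SM           */
#define REG_FILE    16384  /* 32-bit registers per SM  */
#define SHARED_MEM  16384  /* bytes of shared memory   */

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Concurrent CTAs per SM = the tightest of the three resource limits. */
int concurrent_ctas(int threads_per_cta, int regs_per_thread, int shmem_per_cta) {
    return min3(THREAD_POOL / threads_per_cta,
                REG_FILE / (regs_per_thread * threads_per_cta),
                SHARED_MEM / shmem_per_cta);
}

int main(void) {
    /* Hypothetical kernel: 256 threads/CTA, 16 regs/thread, 4 KB shared memory/CTA. */
    printf("%d concurrent CTAs per SM\n", concurrent_ctas(256, 16, 4096));  /* prints 4 */
    return 0;
}
```

Whichever resource runs out first caps the number of CTAs the SM can keep in flight, and therefore its latency-hiding capability.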

Branch Divergence
- The SM is a SIMD processor: a group of threads (a warp) executes the same instruction across the lanes.
- A branch instruction can diverge a warp into two groups: threads with the taken outcome and threads with the not-taken outcome.

A: // pre-divergence
if (CONDITION) {
    B: // not-taken path
} else {
    C: // taken path
}
D: // reconvergence point

[Figure: control-flow graph of blocks A-D with the warp's active mask at each block.]
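To make the pattern concrete, here is a minimal CUDA kernel (ours, not from the paper) in which threads of the same warp take different paths depending on their lane:

```cuda
// Illustrative divergence example. Within each 32-thread warp, even and
// odd lanes take different paths, so the warp serializes: one group
// executes B while the other is masked, then vice versa for C, and the
// full warp resumes together at the reconvergence point D.
__global__ void divergent(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // A: pre-divergence
    if (threadIdx.x % 2 == 0) {
        out[tid] = tid * 2;                           // B: not-taken path
    } else {
        out[tid] = tid + 1;                           // C: taken path
    }
    out[tid] += 1;                                    // D: reconvergence point
}
```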

Control-Flow Mechanism
Control-flow mechanisms address branch divergence. Previous solutions:
- Postdominator Reconvergence (PDOM): masking and serializing the diverging paths, finally reconverging all paths.
- Dynamic Warp Formation (DWF): regrouping the threads on the same diverging path into new warps.

PDOM
SIMD utilization over time:
[Figure: warps W0 and W1 traverse blocks A-D under PDOM. Each warp keeps a reconvergence stack of (RPC, PC, mask vector) entries; at the divergent branch, W0 splits into masks 0110 (B) and 1001 (C), W1 into 1110 and 0001, and both return to full 1111 masks at D (TOS = top of stack). Color on the CFG marks an active warp; gray marks a masked, inactive warp.]
Note: dynamically regrouping diverged threads on the same path would increase utilization (the motivation for DWF).
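As a sketch of the structure in the figure (ours; the exact bookkeeping in the simulator may differ), a PDOM-style reconvergence stack can be modeled as follows:

```cuda
/* Each warp keeps a stack of (RPC, PC, mask) entries; RPC is the
 * reconvergence PC, i.e. the branch's immediate postdominator. */
#include <stdint.h>

typedef struct {
    uint32_t rpc;   /* reconvergence PC for this entry          */
    uint32_t pc;    /* next PC the entry's active lanes execute */
    uint32_t mask;  /* active-lane mask, one bit per thread     */
} StackEntry;

/* On a divergent branch: retarget the top entry to the reconvergence
 * point, then push one entry per path. The paths execute serially,
 * each popping when its PC reaches the RPC. */
void diverge(StackEntry *stack, int *top, uint32_t rpc,
             uint32_t taken_pc, uint32_t taken_mask,
             uint32_t ntaken_pc, uint32_t ntaken_mask)
{
    stack[*top].pc = rpc;                   /* reconvergence entry */
    ++*top;
    stack[*top].rpc  = rpc;                 /* taken-path entry    */
    stack[*top].pc   = taken_pc;
    stack[*top].mask = taken_mask;
    ++*top;
    stack[*top].rpc  = rpc;                 /* not-taken entry     */
    stack[*top].pc   = ntaken_pc;
    stack[*top].mask = ntaken_mask;
}
```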

DWF
SIMD utilization over time:
[Figure: under DWF, a warp pool tracks (PC, mask vector) per warp. After the divergent branch, diverged groups at the same PC (e.g., W0 0110 and W2 1001) present a merge possibility and are regrouped into new warps, raising SIMD utilization at B, C, and D.]
Notes:
- The warp pool needs to keep TIDs instead of a mask vector.
- The colors of the warps show the potential of different thread placements.
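A minimal sketch (ours) of DWF's merge test: two warp-pool entries at the same PC can be merged into one warp when their active-lane masks do not overlap, since each thread must keep its home SIMD lane (e.g., masks 0110 and 1001 in the figure merge into 1111):

```cuda
#include <stdint.h>

/* Returns nonzero when two warp-pool entries can be merged:
 * same PC, and no two threads compete for the same lane. */
int can_merge(uint32_t pc_a, uint32_t mask_a,
              uint32_t pc_b, uint32_t mask_b)
{
    return pc_a == pc_b && (mask_a & mask_b) == 0;
}
```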

Performance Impacting Parameters
- Memory divergence: uncoalesced memory accesses increase memory pressure.
- Branch divergence: intra-warp diverging branches decrease SIMD efficiency.
- Workload parallelism: CTA-limiting resources bound the memory-latency-hiding capability. Concurrent CTAs share three CTA-limiting resources: shared memory, register file, and thread pool.
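As an illustration of the memory-divergence parameter (example ours, not from the paper), compare a coalesced and an uncoalesced access pattern:

```cuda
// Coalesced: consecutive lanes touch consecutive words, so a warp's
// 32 accesses merge into a few wide memory transactions.
__global__ void coalesced(float *out, const float *in)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];
}

// Uncoalesced: a large stride scatters the lanes' addresses, so the
// warp generates many separate transactions (memory divergence).
__global__ void uncoalesced(float *out, const float *in, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid * stride];
}
```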

Machine Models
The machine models isolate the impact of each parameter. A model is named X-Y-Z, where:
- X (resources): LR = Limited Resources, UR = Unlimited Resources
- Y (control flow): DC = DWF Control-flow, PC = PDOM Control-flow, IC = Ideal Control-flow (MIMD)
- Z (memory): M = Real Memory, IM = Ideal Memory

Machine Models continued…
Limited per-SM resources (LR):
- Real memory: LR-DC-M, LR-PC-M, LR-IC-M
- Ideal memory: LR-DC-IM, LR-PC-IM, LR-IC-IM
Unlimited per-SM resources (UR):
- Real memory: UR-DC-M, UR-PC-M, UR-IC-M
- Ideal memory: UR-DC-IM, UR-PC-IM, UR-IC-IM

Methodology
- GPGPU-Sim v2.1.1b
- 13 benchmarks from the RODINIA benchmark suite and CUDA SDK 2.3

Configuration:
- NoC: 30 SMs in total; 6 memory controllers; 3 SMs share an interconnect port
- SM: warp size 32; 1024 threads per SM; 16384 32-bit registers per SM; 32 PEs per SM; 16 KB shared memory; 32 KB L1 data cache
- Clocking: core 325 MHz; interconnect 650 MHz; DRAM 800 MHz
- Control-flow mechanisms: DWF issue heuristic = Majority; PDOM warp scheduling = round-robin

Performance Potentials
The speedup that can be reached if the impacting parameter is idealized. Three potentials (per control-flow mechanism):
- Memory potential: speedup due to ideal memory
- Control potential: speedup due to a free-of-divergence architecture
- Resource potential: speedup due to infinite CTA-limiting resources per SM
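Concretely, a potential compares a machine model against its counterpart with one parameter idealized. The sketch below is our reading of that definition; the exact model pairing is our assumption, not spelled out on this slide:

```cuda
/* Potential = speedup from idealizing one parameter, all else fixed.
 * E.g., PDOM's memory potential would compare LR-PC-IM to LR-PC-M. */
double potential(double ipc_idealized, double ipc_baseline)
{
    return ipc_idealized / ipc_baseline - 1.0;   /* 0.59 reads as "59%" */
}
```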

Performance Potentials continued…
[Chart.] In this work, all performance numbers are normalized to LR-DC-M.

Memory Potentials
[Chart.] PDOM: 59%; DWF: 61%.

Resource Potentials
[Chart.] PDOM: 9.4%; DWF: 8.6%.

Control Potentials
[Chart.] PDOM: -7%; DWF: 2%.

Performance Distances
How far an otherwise-ideal GPU is from ideal due to one parameter. Three distances:
- Memory distance: distance from the ideal GPU due to real memory
- Resource distance: distance from the ideal GPU due to limited resources
- Control distance: distance from the ideal GPU due to branch divergence
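Mirroring the potential sketch above, a distance keeps one parameter real while the other two are ideal. Again, the model pairing is our assumption:

```cuda
/* Distance = performance lost versus the fully ideal machine when one
 * parameter stays real. E.g., the memory distance would compare
 * UR-IC-M against the fully ideal UR-IC-IM. */
double distance(double ipc_one_param_real, double ipc_all_ideal)
{
    return 1.0 - ipc_one_param_real / ipc_all_ideal;   /* 0.40 reads as "40%" */
}
```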

Performance Distances continued…
[Chart.]

Memory Distance
[Chart.] 40%.

Resource Distance
[Chart.] 2%.

Control Distances
[Chart.] PDOM: 8%; DWF: 15%.

Sensitivity Analysis
Validating the findings under aggressive configurations:
- Aggressive-memory: 2x L1 caches; 2x the number of memory controllers
- Aggressive-resource: 2x the CTA-limiting resources
The analysis is limited to the performance potentials.

Aggressive-Memory: Memory Potentials
[Chart.] PDOM memory potential: 28%; DWF memory potential: 28%.

Aggressive-Memory continued…: Control Potentials
[Chart.] PDOM control potential: -8%; DWF control potential: -0.4%.

Aggressive-Memory continued…: Resource Potentials
[Chart.] PDOM resource potential: 8%; DWF resource potential: ~0%.

Aggressive-Resource: Memory Potentials
[Chart.] PDOM memory potential: 51%; DWF memory potential: 52%.

Aggressive-Resource continued…: Control Potentials
[Chart.] PDOM control potential: -8%; DWF control potential: 2%.

Aggressive-Resource continued…: Resource Potentials
[Chart.] PDOM resource potential: 4%; DWF resource potential: 3%.

Conclusion
Performance in GPUs:
- Potentials (improvement from idealizing one parameter):
  - Memory: 59% (PDOM) and 61% (DWF)
  - Control: -7% (PDOM) and 2% (DWF)
  - Resource: 9.4% (PDOM) and 8.6% (DWF)
- Distances (shortfall from the ideal system due to one non-ideal factor):
  - Memory: 40%
  - Control: 8% (PDOM) and 15% (DWF)
  - Resource: 2%
Findings:
- Memory has the biggest impact among the three factors.
- Improving the control-flow mechanism has to take memory pressure into account.
- The same trends hold under aggressive memory and aggressive context-keeping resources.

Thank you. Questions?

Why 32 PEs per SM (backup)
- GPGPU-Sim v2.1.1b coalesces memory accesses over SIMD-width slices of a warp separately, similar to pre-Fermi GPUs.
- Example: with warp size = 32 and 8 PEs per SM, a warp has 4 independent coalescing domains (lanes 0-7, 8-15, 16-23, 24-31).
- We used 32 PEs per SM at 1/4 the clock rate to model coalescing over the full warp (lanes 0-31), similar to Fermi GPUs.
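A one-line sketch (ours) of the coalescing-domain arithmetic above: with D-lane coalescing domains, even a perfectly coalesced 32-lane warp access issues at least one memory transaction per domain.

```cuda
/* Lower bound on transactions per warp access under D-lane domains. */
int min_transactions_per_warp(int warp_size, int domain_width)
{
    return warp_size / domain_width;  /* 32/8 = 4 (pre-Fermi slices); 32/32 = 1 (Fermi-style) */
}
```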