
Microarchitectural Performance Characterization of Irregular GPU Kernels
Molly A. O’Neil and Martin Burtscher
Department of Computer Science

Introduction
- GPUs as general-purpose accelerators
  - Ubiquitous in high-performance computing
  - Spreading into PCs and mobile devices
- Performance and energy-efficiency benefits…
  - …but only when the code is well suited to the hardware
- Regular (input-independent) vs. irregular (input determines control flow and memory accesses)
  - Many important algorithms are irregular
  - They are more difficult to parallelize and map less intuitively to GPUs

Outline
- Impact on GPU performance characteristics of…
  - Branch divergence
  - Memory coalescing
  - Cache and memory latency
  - Cache and memory bandwidth
  - Cache size
- First, a review of GPU coding best practices for good performance

Best Practice #1: No Divergence
- To execute in parallel, the threads in a warp must share identical control flow
- If they do not, execution is serialized into smaller groups of threads that do share a control-flow path (branch divergence)
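A minimal CUDA sketch of how divergence arises (a hypothetical kernel, not one of the studied benchmarks): the branch condition depends on loaded data, so lanes within a warp can disagree and the two paths execute one after the other.

```cuda
// Branch divergence: the condition is data dependent, so threads of the
// same warp may take different paths and the hardware serializes the groups.
__global__ void divergent_update(const int *flags, float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (flags[i] != 0) {       // lanes in a warp may disagree here
        data[i] *= 2.0f;       // executed first by the "taken" lanes
    } else {
        data[i] += 1.0f;       // then by the "not-taken" lanes
    }
}
```

If every lane of a warp happens to load the same flag value, the warp stays converged and pays no penalty.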

Best Practice #2: Coalescing
- Memory accesses within a warp must be coalesced
  - Within a warp, memory references should fall within the same cache line
- If not, the accesses to additional lines are serialized
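A hedged CUDA sketch (illustrative kernels, not the paper's code) contrasting an access pattern in which a warp touches few cache lines with a strided pattern that scatters the warp's references across many lines.

```cuda
// Coalesced: consecutive lanes read consecutive 4-byte words, so a warp's
// 32 references fall into a small number of cache lines.
__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: with a large stride, each lane of a warp touches a different
// cache line, and the extra lines must be fetched one after another.
__global__ void strided_copy(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride % n];
}
```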

Best Practice #3: Load Balance
- Balance the work across threads, warps, and thread blocks
- All three best practices are difficult to follow for irregular codes
  - Data-dependent behavior makes it hard to assign work to threads so as to achieve coalescing, identical control flow, and load balance
  - Very different from the considerations for CPU code
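To make the problem concrete, here is a hypothetical CSR-style graph kernel (names and data layout assumed, not taken from LonestarGPU): each thread handles one node, so its work is proportional to that node's degree, and the neighbor values are gathered from data-dependent addresses.

```cuda
// One thread per node of a CSR graph. Per-thread work equals the node's
// degree, so warps and blocks finish at very different times, and the
// gathers through 'neighbors' are rarely coalesced.
__global__ void sum_neighbor_values(const int *row_starts, const int *neighbors,
                                    const float *value, float *result,
                                    int num_nodes)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;
    if (node >= num_nodes) return;

    float acc = 0.0f;
    for (int e = row_starts[node]; e < row_starts[node + 1]; e++) {
        acc += value[neighbors[e]];   // data-dependent, scattered loads
    }
    result[node] = acc;
}
```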

Simulation Study
- Goal: better understand the specific demands that irregular applications place on GPU hardware
  - To help software developers optimize irregular codes
  - As a baseline for exploring hardware support for broader classes of codes
- GPGPU-Sim with a few extra performance counters added
  - GTX 480 (Fermi) configuration
  - Added configuration variants to scale latency, bandwidth, cache size, etc.

Applications from the LonestarGPU Suite
- Breadth-First Search (BFS): label each node in the graph with its minimum level (hop count) from the start node
- Barnes-Hut (BH): n-body algorithm that uses an octree to decompose the space around the bodies
- Delaunay Mesh Refinement (DMR): iteratively transform ‘bad’ triangles by re-triangulating the surrounding cavity
- Minimum Spanning Tree (MST): repeatedly contract the minimum edge until only a single node remains
- Single-Source Shortest Paths (SSSP): find the shortest path to each node from the source
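To illustrate why such kernels are irregular, below is a simplified level-synchronous BFS step (a toy sketch under assumed data structures, not the LonestarGPU implementation): control flow, memory addresses, and per-thread work all depend on the input graph.

```cuda
#include <climits>

// Toy BFS step: threads whose node lies on the current frontier label their
// unvisited neighbors with the next level. 'level' is assumed to be INT_MAX
// for all nodes except the source (level 0) before the first launch.
__global__ void bfs_step(const int *row_starts, const int *neighbors,
                         int *level, int cur_level, int num_nodes,
                         int *changed)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;
    if (node >= num_nodes || level[node] != cur_level) return;  // divergence

    for (int e = row_starts[node]; e < row_starts[node + 1]; e++) {
        int nbr = neighbors[e];            // scattered, mostly uncoalesced gather
        if (level[nbr] == INT_MAX) {       // unvisited neighbor
            level[nbr] = cur_level + 1;    // benign race: every writer stores
            *changed = 1;                  //   the same value
        }
    }
}
```

The host would relaunch this kernel with an incremented cur_level until no thread sets the changed flag.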

Applications from Other Sources
- Semi-regular
  - FP Compression (FPC): lossless data compression for double-precision floating-point values; irregular control flow
  - Traveling Salesman (TSP): find a minimal tour in a graph using iterative hill climbing; irregular memory accesses
- Regular
  - N-Body (NB): n-body algorithm using an all-to-all force calculation
  - Monte Carlo (MC): evaluates the fair call price for a set of options (from the CUDA SDK)
- Inputs chosen so that the working set is ≥5 times the default L2 size

Application Performance
- Peak = 480 IPC
- As expected, regular mostly means better performing
  - BH is the exception: its primary kernel has been regularized
- Clear tendency toward lower IPC for the irregular codes
  - But no simple rule delineates regular from irregular
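For reference, the 480-IPC peak follows from the simulated GTX 480's organization, assuming IPC is counted in scalar thread instructions across the whole chip:

```latex
\text{peak IPC} = 15~\text{SMs} \times 32~\text{lanes per SM} = 480
```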

Branch Divergence
- Active instructions at warp issue
  - 32 = no divergence
  - Only one code is <50% occupied
- Theoretical speedup
  - Assumes each issue had 32 active instructions
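One way to form such a bound (an assumption based on the slide's premise; the exact formula is not given): if every issued warp instruction had carried 32 active threads, the number of issues would shrink proportionally, so

```latex
\text{speedup}_{\text{no divergence}} \le \frac{32}{\overline{a}},
\qquad \overline{a} = \text{average active threads per issued warp instruction}
```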

Memory Coalescing
- Average number of memory accesses generated by each global/local load/store
  - >1 = uncoalesced
- Percentage of stalls due to uncoalesced accesses
  - Provides an upper bound on the attainable speedup
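One common way to turn such a stall percentage into a speedup bound (an assumption; the slide does not state a formula): if a fraction f of all cycles are stalls caused by uncoalesced accesses, then eliminating them entirely helps by at most

```latex
\text{speedup} \le \frac{1}{1 - f}
```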

Memory Coalescing
- New configuration that artificially removes the pipeline stall penalty of non-coalesced accesses
  - Evaluated both with no further improvements to the memory pipeline and with increased-capacity miss queues and MSHRs
  - Not intended to model a realistic improvement

L2 and DRAM Latency
- Scaled the L2 hit and DRAM access latencies
  - Doubled, halved, zeroed
- Most benchmarks are more sensitive to L2 latency
  - Even with input sizes several times the L2 capacity

Interconnect and DRAM Bandwidth
- Halved/doubled the interconnect (L2) bandwidth and the DRAM bus width
- Benchmark sensitivities are similar to the latency results
  - The L2 is large enough to keep sufficient warps ready

Cache Behavior
- Very high miss ratios (generally >50% in the L1)
- Irregular codes have much higher MPKI (misses per kilo-instruction)
  - BFS & SSSP: lots of pointer chasing, little spatial locality
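For clarity, MPKI here is the standard misses-per-kilo-instruction metric:

```latex
\text{MPKI} = 1000 \times \frac{\text{cache misses}}{\text{instructions executed}}
```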

Cache Size Scaling
- Halved and doubled both (data) cache sizes
- Codes sensitive to interconnect bandwidth are also sensitive to L1D size
- BH tree prefixes: the L2 is better at exploiting the locality in the traversals
- Most codes are hurt more by a smaller L2 than by a smaller L1D

Individual Application Analysis
- Large memory access penalty in the irregular apps
- Divergence penalty less than we expected
- Synchronization penalty also below expectation
- Regular codes have mostly fully-occupied cycles
  - Their stalls are computation pipeline hazards (rather than load/store)

Conclusions
- Irregular codes
  - Exhibit more load imbalance, branch divergence, and uncoalesced memory accesses than regular codes
  - But incur less branch divergence, synchronization, and atomics penalty than we expected
    - Software designers are successfully addressing these issues
- To support irregular codes, architects should focus on reducing memory-related slowdowns
  - Improving L2 latency/bandwidth is more important than improving DRAM latency/bandwidth

Questions? Acknowledgments
- NSF Graduate Research Fellowship grant
- NSF grants
- Grants and gifts from NVIDIA Corporation

Related Work
- Simulator-based characterization studies
  - Bakhoda et al. (ISPASS’09), Goswami et al. (IISWC’10), Blem et al. (EAMA’11), Che et al. (IISWC’10), Lee and Wu (ISPASS’14)
    - CUDA SDK, Rodinia, Parboil (no focus on irregularity)
  - Meng et al. (ISCA’10) – dynamic warp hardware modification
- PTX emulator studies (also SDK, Rodinia, Parboil)
  - Kerr et al. (IISWC’09) – GPU Ocelot, Wu et al. (CACHES’11)
- Hardware performance counters
  - Burtscher et al. (IISWC’12) – LonestarGPU, Che et al. (IISWC’13)

Input Sizes

Code | Inputs
BFS  | NYC road network (~264K nodes, ~734K edges) (working set = 3898 kB = 5.08x L2 size); RMAT graph (250K nodes, 500K edges)
BH   | 494K bodies, 1 time step (working set = 7718 kB = 10.05x L2 size)
DMR  | 50.4K nodes, ~100.3K triangles, maxfactor = 10 (working set w/ maxfactor 10 = 7840 kB = 10.2x L2 size); 30K nodes, 60K triangles
MST  | NYC road network (~264K nodes, ~734K edges) (working set = 3898 kB = 5.08x L2 size); RMAT graph (250K nodes, 500K edges)
SSSP | NYC road network (~264K nodes, ~734K edges) (working set = 3898 kB = 5.08x L2 size); RMAT graph (250K nodes, 500K edges)
FPC  | obs_error dataset (60 MB), 30 blocks, 24 warps/block; num_plasma dataset (34 MB), 30 blocks, 24 warps/block
TSP  | att48 (48 cities, 15K climbers); eil51 (51 cities, 15K climbers)
NB   | 23,040 bodies, 1 time step
MC   | 256 options

Secondary Inputs

GPGPU-Sim Configurations

Simulated variants relative to the default GTX 480 configuration (ROP/L2-hit latency 240 cycles, DRAM latency 200 cycles, 768 kB L2):
- 1/2x ROP latency (120 cycles) and 2x ROP latency (480 cycles)
- 1/2x DRAM latency (100 cycles) and 2x DRAM latency (400 cycles)
- No latency (both set to 0)
- 1/2x and 2x L1D cache
- 1/2x L2 cache (384 kB) and 2x L2 cache (1536 kB)
- 1/2x and 2x DRAM bandwidth (halved/doubled DRAM bus width)
- 1/2x and 2x interconnect + DRAM bandwidth
- No coalesce penalty (NCP)
- NCP + improved L1 miss handling (larger miss queues and MSHRs)
- NCP + improved L1 and L2 miss handling

Latencies are given in shader core cycles; cache sizes in kB. ROP = Raster Operations Pipeline (models L2 hit latency), Ict = interconnect (flit size), CP = coalesce penalty, PS = prefer shared memory, PL = prefer L1, MQ = miss queue entries, MS = miss status holding register entries, MM = max MSHR merges.

Issue Bin Priority