Spare Register Aware Prefetching for Graph Algorithms on GPUs


Spare Register Aware Prefetching for Graph Algorithms on GPUs
Nagesh B Lakshminarayana, Hyesoon Kim

Graph Algorithms
A rich set of applications and domains would benefit from speeding up the processing of graphs:
- Web Search
- Networks
- Data Analytics
- Chip Design
- Network Security
- Bioinformatics

Speeding Up Graph Algorithms with GPUs
- GPUs have been used to speed up many applications: they provide high throughput and memory bandwidth, and are power- and energy-efficient
- Recently, there have been efforts to speed up graph algorithms using GPUs: parallel graph algorithms have been implemented to take advantage of the capabilities of GPUs

Efficient Graph Algorithms on GPUs – Challenges
- Data-dependent memory accesses
- Uncoalesced accesses and memory divergence
- Idle cores, due to the inability to hide memory access latency completely

Potential Solution: Prefetching for Graph Algorithms
Prefetch the data-dependent memory accesses found commonly in graph algorithms

Effective Prefetching for Graph Algorithms – Challenges
- Identifying the pattern to be prefetched
- Prefetching can be less effective due to eviction of prefetched data from the cache: the large number of threads and the small cache space per thread can cause prefetched data to be evicted
- Evictions force additional accesses to lower levels of the memory hierarchy, which can degrade performance
- Prefetching can be made more effective by using spare registers

Spare Registers in GPUs
CPUs use all available registers; GPUs may have unused registers
[Figure: register file (RF) usage on a CPU vs. a GPU]

Spare Registers in GPUs
Considerable spare registers are often available:
- Threads are allocated to GPU cores at thread block granularity
- Multiple resources must all be available to allocate a new thread block: insufficient registers, or other resource bottlenecks, can prevent allocation of an additional thread block, leaving part of the register file (RF) unused
[Figure: thread blocks and the register file, showing registers left unused when no further thread block fits]
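
To see how registers get stranded, here is a back-of-the-envelope sketch in plain C; all the numbers (registers per thread, block size, shared-memory use) are hypothetical, chosen only to illustrate a non-register bottleneck capping the block count:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical per-core resources, loosely Fermi-like. */
        const int rf_regs         = 32 * 1024;  /* 128KB RF = 32K 32-bit regs */
        const int smem_bytes      = 48 * 1024;  /* shared memory per core     */
        const int threads_per_tb  = 256;        /* thread block size          */
        const int regs_per_thread = 24;
        const int smem_per_tb     = 12 * 1024;

        /* A new thread block needs ALL of its resources to be available. */
        int tb_by_regs = rf_regs / (threads_per_tb * regs_per_thread); /* 5 */
        int tb_by_smem = smem_bytes / smem_per_tb;                     /* 4 */
        int tb = tb_by_regs < tb_by_smem ? tb_by_regs : tb_by_smem;

        int used_regs  = tb * threads_per_tb * regs_per_thread;
        int spare_regs = rf_regs - used_regs;
        /* Prints: 4 blocks fit, 8192 spare registers (25% of the RF). */
        printf("%d blocks fit, %d spare registers (%.0f%% of the RF)\n",
               tb, spare_regs, 100.0 * spare_regs / rf_regs);
        return 0;
    }

With these made-up numbers, shared memory (not the register file) limits the core to 4 thread blocks, so a quarter of the register file sits idle even though every allocatable thread block has been launched.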

Spare Registers in GPUs
[Figure: fraction of unused register space out of a 128KB register file]
Spare registers can be utilized by the threads already assigned

Solution: Prefetching for Graph Algorithms
- Prefetch a data-dependent memory access pattern found commonly in graph algorithms
- Use spare registers to make prefetches more effective
[Figure: thread blocks and the register file, with spare registers used for prefetching]

Outline
- Introduction and Motivation
- SRAP: Spare Register Aware Prefetching
- Evaluation
- Conclusion

SRAP – Idea
- Prefetch a data-dependent access pattern commonly found in graph algorithms
- Use spare registers to hold prefetched data; prefetched data is inserted into the cache as well
- Overwrite a spare register only after its prefetched data has been used, so the prefetched data remains available in registers even if it is evicted from the cache

Background – Graph Representation on GPUs
- The Compact Adjacency List representation is used
- A graph G(V, E) is represented by two arrays: Va (vertex list) and Ea (edge list)
- There can be additional arrays for different node and edge attributes
- Vertex list size is O(#vertices); edge list size is O(#edges)
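
As a concrete illustration, here is a minimal C sketch of this layout with a made-up 5-vertex, 5-edge graph (the values are illustrative, not the example from the paper's figure):

    #include <stdio.h>

    #define NUM_VERTICES 5
    #define NUM_EDGES    5

    /* Va[v] is the index in Ea where vertex v's neighbor list starts;
       a sentinel entry Va[NUM_VERTICES] marks the end of Ea, so
       Va[v+1] - Va[v] is the degree of v. */
    int Va[NUM_VERTICES + 1] = {0, 2, 3, 3, 4, 5};
    int Ea[NUM_EDGES]        = {1, 3, 4, 0, 2};  /* neighbor vertex ids   */
    int attr[NUM_VERTICES]   = {7, 1, 4, 2, 9};  /* per-vertex attributes */

    int main(void) {
        for (int v = 0; v < NUM_VERTICES; ++v) {
            printf("vertex %d:", v);
            for (int i = Va[v]; i < Va[v + 1]; ++i)
                printf(" neighbor %d (attr %d)", Ea[i], attr[Ea[i]]);
            printf("\n");
        }
        return 0;
    }

Note how reaching a neighbor's attribute takes two dependent array lookups, Ea[i] and then attr[Ea[i]]; this is exactly the access pair SRAP targets.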

Background – Graph Representation
[Figure: example graph with 5 vertices and 5 edges, showing the vertex list Va (for each vertex id, the index in Ea where its edges start), the edge list Ea (neighbor ids), and the vertex attribute list (one attribute per vertex id)]

SRAP – Target Access Pattern
Neighbor list iteration: access a neighboring vertex and then the attribute(s) of that neighbor
[Figure: processing vertex 1 – its entry in the vertex list points into the edge list, and each neighbor id read from the edge list indexes the vertex attribute list]

SRAP – Example with Target Access Pattern
- Skeleton code from the kernel that expands the BFS tree
- Each thread is assigned one vertex; if the vertex is active, the thread iterates over that vertex's neighbor list
- For each neighbor, it examines whether the neighbor has already been visited
- The attribute-load (visited) is dependent on the neighbor-load (id)

    // neighbor list begins at "start" and ends at "(end - 1)"
    for (int i = start; i < end; ++i) {
        int id = g_graph_edges[i];           // load neighbor
        bool visited = g_graph_visited[id];  // load neighbor attribute
        if (!visited) { ... }
    }

SRAP – BFS Example
[Figure: for the loop above, successive N-Loads (id0, id1, id2) fall into the same edge-list cache block, while the dependent A-Loads (visited0, visited1, visited2) scatter across different cache blocks of the visited array]

SRAP – Prefetching the Target Access Pattern
- N-Load is independent and strided; A-Load is dependent on N-Load and is random
- Prefetching only N-Load would not be very beneficial: N-Load has a very small stride and would access the same cache block most of the time
- Prefetching A-Load would be more beneficial, but to prefetch A-Load we have to prefetch N-Load first

    for (int i = start; i < end; ++i) {
        int id = g_graph_edges[i];           // N-Load
        bool visited = g_graph_visited[id];  // A-Load
        if (!visited) { ... }
    }
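
To make the dependence chain concrete, here is a software analogue in CUDA of prefetching both loads one iteration ahead. This is a minimal sketch, not the SRAP mechanism: SRAP injects these prefetches in hardware into spare registers, while this version keeps them in ordinary local variables. The array names follow the slides; the kernel signature is an assumption for illustration:

    __global__ void bfs_like(const int *g_graph_edges,
                             const bool *g_graph_visited,
                             int start, int end)
    {
        if (start >= end) return;
        // Software "spare registers": next iteration's neighbor and attribute.
        int  next_id      = g_graph_edges[start];      // N-Prefetch
        bool next_visited = g_graph_visited[next_id];  // A-Prefetch

        for (int i = start; i < end; ++i) {
            int  id      = next_id;       // use N-Prefetch ("mov r2, spare1")
            bool visited = next_visited;  // use A-Prefetch ("mov r4, spare2")
            if (i + 1 < end) {
                next_id      = g_graph_edges[i + 1];      // next N-Prefetch
                next_visited = g_graph_visited[next_id];  // next A-Prefetch
            }
            if (!visited) { /* ... process neighbor ... */ }
        }
    }

One caveat, echoing the observation in the results below: a software version like this can return a stale visited value if another thread updates the array between the prefetch and its use, which is part of why prefetching the A-Load is hard to do safely without hardware support.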

SRAP – Prefetching for Graph Algorithms
- Prefetch both N-Load and A-Load
- Use a tag-based mechanism in hardware to identify target loads:
  - one set of tags for identifying the access patterns of loads
  - another set for detecting dependences between loads
- Enhance the pipeline:
  - inject instructions into the pipeline to prefetch into spare registers
  - manage the available spare registers
- A compiler-assisted mechanism, and a mechanism with dynamic prefetch distance, are described in the paper
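
The slide leaves the table organization unspecified; the following C sketch shows one plausible shape for the two structures it names. All field names, widths, and sizes here are assumptions for illustration, not the paper's design:

    #include <stdint.h>

    /* One entry per tracked load PC: learns whether the load is strided
       (a candidate N-Load). */
    typedef struct {
        uint64_t pc;          /* PC of the tracked load           */
        uint64_t last_addr;   /* last address it accessed         */
        int64_t  stride;      /* observed stride between accesses */
        uint8_t  confidence;  /* saturating counter for stability */
    } PatternEntry;

    /* One entry per detected producer->consumer pair: an N-Load whose
       destination register feeds a dependent A-Load's address. */
    typedef struct {
        uint64_t producer_pc; /* N-Load PC                      */
        uint64_t consumer_pc; /* A-Load PC                      */
        uint8_t  dest_reg;    /* register written by the N-Load */
    } DependenceEntry;

    /* Called on each committed load: refresh that load's pattern entry. */
    static void observe_load(PatternEntry *e, uint64_t pc, uint64_t addr)
    {
        if (e->pc != pc) {  /* a new load is mapped to this entry */
            e->pc = pc; e->last_addr = addr;
            e->stride = 0;  e->confidence = 0;
            return;
        }
        int64_t s = (int64_t)(addr - e->last_addr);
        if (s == e->stride) {
            if (e->confidence < 3) e->confidence++;  /* stride confirmed */
        } else {
            e->stride = s; e->confidence = 0;        /* retrain */
        }
        e->last_addr = addr;
    }

Once a pattern entry is confident and a dependence entry links it to an A-Load, the pipeline can inject the prefetch pair shown on the next slides.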

SRAP – Inserting Prefetches
- Example loop with the identified access pattern
- Two loads, into r2 and r4; the second load is dependent on the first
- The first load has a stride of 4

    loop:
        ...
        ld  r2, [r1]    // N-Load
        add r3, r0, r2  // A-Load address calculation
        ld  r4, [r3]    // A-Load
        ...
        @p2 bra loop

SRAP – Inserting Prefetches
Instruction stream without prefetching and with SRAP:

    // No prefetching
    iterN:
        ...
        ld  r2, [r1]            // N-Load
        add r3, r0, r2
        ld  r4, [r3]            // A-Load
        ...
        @p2 bra loop
    iterN+1:
        ...

    // SRAP
    iterN:
        ...
        ld  r2, [r1]            // N-Load
        ld  spare1, [r1+4]      // N-Prefetch (stride of 4)
        add r3, r0, r2
        ld  r4, [r3]            // A-Load
        add spare2, r0, spare1  // A-Prefetch address calculation
        ld  spare2, [spare2]    // A-Prefetch
        ...
        @p2 bra loop
    iterN+1:
        ...
        mov r2, spare1          // use N-Prefetch instead of reloading
        mov r4, spare2          // use A-Prefetch instead of reloading
        ...

Outline
- Introduction and Motivation
- SRAP: Spare Register Aware Prefetching
- Evaluation
- Conclusion

Evaluation Methodology
- MacSim simulator: trace-driven, cycle-level; simulates a GTX580 (Fermi) GPU
- Kernels from 9 graph algorithm implementations are evaluated: IIIT-Hyd, LonestarGPU, and some implemented based on previous work
- Comparison against Stride, Stream, and GHB prefetchers for 3 different inputs
- Since the algorithms are iterative, kernels are invoked multiple times for each input
- Results with a prefetch distance of 1 are shown; other results (compiler-assisted SRAP, SRAP with dynamic prefetch distance) are in the paper

IPC Improvements
- Since the algorithms are iterative, each kernel is invoked multiple times for a given input
- SRAP over all kernels and all inputs: 8% average improvement, 45% maximum
- Good benefit with SRAP; no benefit with the prior prefetchers, since the A-Load cannot be prefetched by them – it can be updated by other threads

Benefit From Inserting Into Spare Registers
- Caching – similar to SRAP, but attribute-loads are inserted into the cache only (results shown only for H-BFS with one input; other results are in the paper)
- For H-BFS with the r4_2e20 input: SRAP – 9% avg and 17% max; Caching – 5% avg and 17% max
- Over all kernels, inputs, and invocations: SRAP – 8% avg; Caching – 6% avg
- Caching alone incurs additional DRAM requests

Conclusion
- For graph algorithms, prefetching can reduce idle cycles due to memory accesses, but graph algorithms have data-dependent accesses, and the effectiveness of prefetching is reduced by eviction of prefetched data from the cache
- Solution: SRAP prefetches a data-dependent access pattern, uses spare registers to hold prefetched data, and eliminates the additional memory accesses caused by eviction of prefetched data
- The best-performing variant of SRAP shows a performance improvement of 10% on average and up to 51%

Thank you!