Efficient and Easily Programmable Accelerator Architectures. Tor Aamodt, University of British Columbia. PPL Retreat, 31 May 2013.

Advancing Computer Systems without Technology Progress. DARPA/ISAT Workshop, March 26-27, 2012, Mark Hill & Christos Kozyrakis. Figure: decreasing cost per unit computation (1971: Intel; 1981: IBM; iPhone; datacenter).

Figure: Ease of Programming vs. Hardware Efficiency. Design points shown: single-core OoO superscalar CPU, brawny (OoO) multicore, wimpy (in-order) multicore, 16K-thread SIMT accelerator, ASIC. An arrow marks "Better", with the question "how to get here?"

Figure: Ease of Programming vs. Hardware Efficiency. Heterogeneity helps…

Review: Amdahl's Law. Figure: the fraction of execution that is easy to accelerate vs. the fraction that is hard to accelerate.

What defines the division between hard and easy? Fraction_hard = f(problem, programming model, SW budget).
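For reference (not spelled out on the slide), Amdahl's law in this notation, writing S for the speedup achieved on the easy-to-accelerate portion:

    \[ \text{Speedup}_{\text{overall}} = \frac{1}{\text{Fraction}_{\text{hard}} + \dfrac{1 - \text{Fraction}_{\text{hard}}}{S}} \]

Even with an ideal accelerator (S approaching infinity) the overall speedup is capped at 1 / Fraction_hard, which is why shrinking the hard fraction matters more than further speeding up the easy part.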

Goal. Figure: portions of the program labelled "easy to accelerate (Acc. Arch 1)" and "easy to accelerate (Acc. Arch 2)".

Figure: Ease of Programming vs. Hardware Efficiency, with an arrow labelled "Better" and a "?" marking the desired design point.

Increase accelerator efficiency (x-axis): control flow, memory. Improve accelerator programmability (y-axis): easier coding, fewer bugs, easier debugging.

SIMT Execution (MIMD on SIMD) (Levinthal SIGGRAPH'84). Example code: A: n = foo[tid.x]; B: if (n > 10) C: …; else D: …; E: …; with foo[] = {4, 8, 12, 16}. Figure: branch divergence over time; all four threads execute A and B, then the warp serializes the two sides of the branch (threads 3-4 execute C, threads 1-2 execute D) using SIMT-stack entries that pair a PC with an active mask (1100 and 0011 for the two paths), and reconverges at E.
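To make the slide's fragment concrete, here is a minimal CUDA sketch of the same divergent control flow; the kernel name, array names, and the work done in C and D are illustrative, not from the talk.

    __global__ void divergent_kernel(const int *foo, int *out) {
        int tid = threadIdx.x;       // A: each lane reads its own element
        int n = foo[tid];
        if (n > 10) {                // B: lanes of one warp disagree on the branch
            out[tid] = n * 2;        // C: runs under a partial active mask
        } else {
            out[tid] = n + 1;        // D: runs under the complementary mask
        }
        out[tid] += 1;               // E: the warp has reconverged here
    }

With foo[] = {4, 8, 12, 16} and a toy 4-lane warp as on the slide, two lanes take C and two take D, so the SIMD hardware executes the two paths back to back rather than in parallel.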

Dynamic Warp Formation (Fung: MICRO 2007, HPCA 2011). Figure: timelines for three warps executing blocks A, B, C, D, E; after divergence, threads from different warps that have reached the same block (C) are packed into a new warp instead of waiting on reissue/memory latency. SIMD efficiency improves from 58% to 88%; 22% average improvement [HPCA'11].

Memory

Scheduler affects access pattern. Figure: two warps issue streams of loads (Warp 0: ld A,B,C,D…; Warp 1: ld Z,Y,X,W…). With a round-robin scheduler the two warps' accesses are interleaved at the cache, while a greedy-then-oldest scheduler issues one warp's accesses together before switching, producing a very different reference stream for the same code.
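A small host-side C++ sketch of the two policies named on the slide; the Warp struct and the assumption that warps are stored oldest-first are illustrative simplifications of the issue-stage hardware.

    #include <vector>
    #include <optional>

    struct Warp { bool ready = false; };   // ready: has an instruction it could issue

    // Round-robin: start from the warp after the last one issued, so the warps'
    // memory accesses end up interleaved at the cache.
    std::optional<int> pick_round_robin(const std::vector<Warp>& warps, int& next) {
        for (size_t i = 0; i < warps.size(); ++i) {
            int w = (next + static_cast<int>(i)) % static_cast<int>(warps.size());
            if (warps[w].ready) { next = (w + 1) % static_cast<int>(warps.size()); return w; }
        }
        return std::nullopt;               // nothing ready this cycle
    }

    // Greedy-then-oldest: stay on the current warp while it can still issue,
    // otherwise fall back to the oldest ready warp (index 0 is oldest here),
    // so one warp's accesses reach the cache together.
    std::optional<int> pick_greedy_then_oldest(const std::vector<Warp>& warps, int& current) {
        if (current >= 0 && warps[current].ready) return current;
        for (size_t i = 0; i < warps.size(); ++i)
            if (warps[i].ready) { current = static_cast<int>(i); return current; }
        return std::nullopt;
    }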

Use the scheduler to shape the access pattern: Cache-Conscious Wavefront Scheduling (Rogers: MICRO 2012, Top Picks 2013). Figure: a greedy-then-oldest scheduler, driven by feedback about the working set size per warp, limits how many warps it actively schedules so that each warp's loads (Warp 0: ld A,B,C,D; Warp 1: ld Z,Y,X,W) stay resident in the cache; 63% performance improvement.
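The following is a highly simplified C++ sketch of the throttling idea: give each warp a lost-locality score and let only the highest-scoring warps issue loads until a cutoff is reached. The class name, the scoring rule, and the cutoff are illustrative; the actual CCWS mechanism is a hardware scheduler driven by victim-tag feedback, not this software model.

    #include <vector>
    #include <numeric>
    #include <algorithm>
    #include <cstdint>

    class LostLocalityThrottle {
    public:
        LostLocalityThrottle(int num_warps, uint64_t cutoff)
            : score_(num_warps, 0), cutoff_(cutoff) {}

        // Called when warp `w` misses on data it recently had cached, i.e. the
        // warp lost intra-warp locality because other warps evicted its lines.
        void on_lost_locality(int w) { score_[w] += 100; }   // illustrative bump

        // Warps allowed to issue loads this cycle: highest score first, stopping
        // once the cumulative score reaches the cutoff (no throttling when scores are low).
        std::vector<int> schedulable() const {
            std::vector<int> order(score_.size());
            std::iota(order.begin(), order.end(), 0);
            std::sort(order.begin(), order.end(),
                      [&](int a, int b) { return score_[a] > score_[b]; });
            std::vector<int> allowed;
            uint64_t total = 0;
            for (int w : order) {
                allowed.push_back(w);
                total += score_[w];
                if (total >= cutoff_) break;   // remaining warps are held back
            }
            return allowed;
        }

    private:
        std::vector<uint64_t> score_;
        uint64_t cutoff_;
    };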

Easier coding

Accelerator coherence challenges. Introducing coherence messages on a GPU raises three problems: 1. Traffic: transferring the messages. 2. Storage: tracking the messages. 3. Complexity: managing races between messages. Can we have GPU cache coherence without coherence messages? Yes: using global time.

Temporal Coherence (Singh: HPCA 2013). Figure: cores with L1 data caches connected over an interconnect to L2 banks, all observing a shared global time. An L1 copy is valid while its local timestamp > global time; once a line's global timestamp < global time, no L1 copies remain. Related: Library Cache Coherence.

Temporal Coherence example. Figure (timeline at T=0, T=11, T=15): one core loads A at T=0 and its L1 caches A=0 with a local timestamp of 10; at T=11 another core stores A=1 at the L2 bank, and because the line's global timestamp (10) has already expired, the store completes with no coherence messages.
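A minimal host-side C++ sketch of the two timestamp checks, assuming the lease-based scheme described on the previous slide; the struct and function names are illustrative, and the real checks live in the L1/L2 cache controllers.

    #include <cstdint>

    struct L1Line { uint32_t value; uint64_t localTimestamp;  };  // lease expiry time
    struct L2Line { uint32_t value; uint64_t globalTimestamp; };  // latest lease handed out

    // L1 lookup: a cached copy may be used only while its lease has not expired.
    bool l1_copy_valid(const L1Line& line, uint64_t globalTime) {
        return line.localTimestamp > globalTime;      // local timestamp > global time => VALID
    }

    // L2 store: once every lease handed out for this line has expired, no L1
    // copies can remain, so the write needs no invalidation messages.
    bool l2_safe_to_write(const L2Line& line, uint64_t globalTime) {
        return line.globalTimestamp < globalTime;     // global timestamp < global time => NO L1 COPIES
    }

    // When L2 services an L1 load it hands out a lease and remembers the latest
    // expiry, so it knows when all copies are gone.
    uint64_t grant_lease(L2Line& line, uint64_t globalTime, uint64_t leaseLength) {
        uint64_t expiry = globalTime + leaseLength;
        if (expiry > line.globalTimestamp) line.globalTimestamp = expiry;
        return expiry;    // stored as the L1 line's localTimestamp
    }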

Complexity. Figure: L1 and L2 cache state machines for a non-coherent design, for MESI, and for TC-Weak.

Interconnect traffic. Figure: traffic for NO-COH, MESI, GPU-VI, and TC-Weak, with applications that do not require coherence shown separately. TC-Weak reduces traffic by 53% over MESI and 23% over GPU-VI, and produces less traffic than a 16x-sized 32-way directory.

Performance. Figure: speedup of MESI, GPU-VI, and TC-Weak relative to NO-L1 on applications that require coherence. TC-Weak with a simple predictor performs 85% better than disabling the L1 caches.

Fewer bugs

Lifetime of accelerator application development. Figure: functionality and performance achieved over development time, comparing fine-grained locking with transactional memory.

Are TM and GPUs incompatible? The GPU microarchitecture is very different from a multicore CPU. KILO TM [Fung MICRO'11, Top Picks'12]: hardware TM for GPUs, achieving half the performance of fine-grained locking with a chip area overhead of 0.5%.

Hardware TM for GPUs, Challenge #1: SIMD hardware. On GPUs, scalar threads in a warp/wavefront execute in lockstep. Figure: a warp with 4 scalar threads (T0-T3) executes a transaction (TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit); some threads commit while others abort, causing branch divergence.

KILO TM solution to Challenge #1 (SIMD hardware): treat a transaction abort like a loop and extend the SIMT stack, so aborted threads branch back to TxBegin and re-execute the transaction (TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit).
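Viewed from software, the retry behaviour looks like the loop below; tx_begin() and tx_commit() are hypothetical intrinsics used only to illustrate the control flow, not real CUDA APIs.

    // Hypothetical intrinsics standing in for the hardware TM interface.
    __device__ void tx_begin();
    __device__ bool tx_commit();

    __device__ int A, B;

    __device__ void tx_example() {
        bool committed = false;
        while (!committed) {           // aborted lanes take this back-edge, exactly
            tx_begin();                // like a divergent loop on the SIMT stack
            int r2 = B;                // LD  r2,[B]
            r2 = r2 + 2;               // ADD r2,r2,2
            A = r2;                    // ST  r2,[A]
            committed = tx_commit();   // false => this lane aborted and retries
        }
    }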

Hardware TM for GPUs, Challenge #2: transaction rollback. A CPU core has tens of registers, so checkpointing them before a transaction and restoring them on abort is cheap. A GPU core (SM) has a 32K-entry register file, roughly 2 MB of on-chip register storage in total. Checkpoint?

KILO TM solution to Challenge #2 (transaction rollback): software register checkpointing. In most transactions a register is overwritten at its first appearance (idempotent), so its old value need not be saved; the transaction in Barnes Hut checkpoints only 2 registers. In the example (TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit), r2 is overwritten by the load before being read, so its pre-transaction value is not needed on abort.

Hardware TM for GPUs, Challenge #3: conflict detection. Existing HTMs rely on the cache coherence protocol, which is not available on (current) GPUs, and there is no private data cache per thread. Signatures? At 1024 bits per thread, that is 3.8 MB for 30K threads.

KILO TM: value-based conflict detection. Figure: two transactions, TX1 atomic {B=A+1} (TxBegin; LD r1,[A]; ADD r1,r1,1; ST r1,[B]; TxCommit) and TX2 atomic {A=B+2} (TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit), each keeping a read-log and a write-log in private memory while A and B live in global memory. At commit, a transaction self-validates by re-reading its read set from global memory and comparing against the logged values, aborting on a mismatch. Self-validation plus abort only detects the existence of a conflict, not its identity.
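A minimal software model of value-based conflict detection, written as host-side C++ for clarity. The map-based logs and the single commit lock are illustrative; in KILO TM validation and commit are performed by hardware commit units, not by a mutex.

    #include <unordered_map>
    #include <mutex>
    #include <cstdint>

    using Addr = uint64_t;
    std::unordered_map<Addr, int> globalMemory;   // stand-in for GPU global memory
    std::mutex commitLock;                        // models the serialized validate+commit step

    struct Transaction {
        std::unordered_map<Addr, int> readLog;    // address -> value observed during the transaction
        std::unordered_map<Addr, int> writeLog;   // address -> value to publish at commit

        int load(Addr a) {
            if (writeLog.count(a)) return writeLog[a];   // read own speculative write
            int v = globalMemory[a];
            readLog.emplace(a, v);                       // remember the first-read value
            return v;
        }
        void store(Addr a, int v) { writeLog[a] = v; }   // buffered until commit

        // Self-validation: succeed only if every logged read still matches global memory.
        bool try_commit() {
            std::lock_guard<std::mutex> g(commitLock);
            for (const auto& [a, v] : readLog)
                if (globalMemory[a] != v) return false;  // some conflict happened (identity unknown)
            for (const auto& [a, v] : writeLog)
                globalMemory[a] = v;                     // publish writes
            return true;
        }
    };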

Easier debugging

// BFS step kernel from the GPU BFS algorithm published in HiPC 2007, as shown
// (simplified) on the slide. The CSR arrays row_offset/columns, the bound check,
// and the thread-index computation are illustrative stand-ins for the slide's
// "foreach neighbour" pseudocode.
__global__ void BFS_step_kernel(bool *active, bool *visited, int *level,
                                const int *row_offset, const int *columns,
                                int num_nodes) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < num_nodes && active[tid]) {
        active[tid] = false;
        visited[tid] = true;
        for (int e = row_offset[tid]; e < row_offset[tid + 1]; ++e) {
            int id = columns[e];                 // a neighbour of node tid
            if (visited[id] == false) {
                level[id] = level[tid] + 1;      // data race: several threads may update level[id]
                active[id] = true;
                // …
            }
        }
    }
}
Figure: a small example graph (V0, V1, V2) annotated with "level = 1, active = 1" and "level = 2, active = 1", suggesting the same node can receive different levels depending on execution order.

GPUDet: a deterministic GPU architecture (Jooybar: ASPLOS 2013). Figure: during a quantum, wavefronts treat global memory as read-only and buffer their writes in store buffers in local memory; buffered writes and atomic operations are committed deterministically at the quantum boundary. A quantum ends on: 1. instruction count, 2. atomic operations, 3. memory fences, 4. workgroup barriers, 5. execution complete. Overall cost: about a 2x slowdown.
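An illustrative C++ sketch of the five quantum-ending conditions listed on the slide; the struct, field names, and the instruction limit are made up for illustration, since GPUDet itself is a hardware proposal.

    #include <cstdint>

    struct WavefrontStatus {
        uint64_t instructionsInQuantum = 0;
        bool     executedAtomic        = false;
        bool     executedMemoryFence   = false;
        bool     reachedWorkgroupBarrier = false;
        bool     finishedKernel        = false;
    };

    constexpr uint64_t kQuantumInstructionLimit = 1000;   // illustrative value

    // A quantum ends (and buffered stores are committed in a deterministic order)
    // when any of the five conditions from the slide holds.
    bool quantum_ends(const WavefrontStatus& w) {
        return w.instructionsInQuantum >= kQuantumInstructionLimit  // 1. instruction count
            || w.executedAtomic                                     // 2. atomic operations
            || w.executedMemoryFence                                // 3. memory fences
            || w.reachedWorkgroupBarrier                            // 4. workgroup barriers
            || w.finishedKernel;                                    // 5. execution complete
    }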

Summary. Start from an efficient architecture and try to improve its programmability: get efficiency while keeping programmers reasonably happy.

Thanks! Questions?