Crossing Guard: Mediating Host-Accelerator Coherence Interactions

Presentation transcript:

Crossing Guard: Mediating Host-Accelerator Coherence Interactions
Lena E. Olson*, Mark D. Hill, David A. Wood
University of Wisconsin-Madison (* now at Google)
ASPLOS 2017, April 10th, 2017

Accelerators are here!
- Complex, programmable accelerators increasingly prevalent
- Many applications: graphics, scientific computing, video encoding, machine learning, etc.
- Accelerators may benefit from cache-coherent shared memory
- May be designed by third parties

However…
- Host coherence protocols may be proprietary and complex
- Bugs in accelerator implementations might crash the host system!
- Crossing Guard: a coherence interface to safely translate accelerator ↔ host protocols
[Slide art: the accelerator cache says "Hello!", the host cache says "你好!" ("Hello!"), and the Crossing Guard (XG) translates between them.]

Outline
- Goals
- Design
- Guarantees
- Evaluation

Crossing Guard Goals
When adding accelerators to a host coherence protocol:
1. Allow accelerators customized caches
2. Provide a simple, standardized accelerator coherence interface
3. Guarantee safety for the host system

1. Why Customize Caches?
- CPU caches have to work with most types of workloads
- Accelerators may only run some workloads!
- Optimize caches for likely data access patterns: number of levels, writeback vs. writethrough, MSI vs. VI, etc.
[Diagram: accelerators with per-core L1s and a shared L2, next to accelerators with simple VI L1 caches.]
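A minimal illustration of that last point (my example, not from the talk): an accelerator that streams data and writes through can make do with a two-state Valid/Invalid protocol, while a general-purpose writeback CPU cache needs at least MSI.

```cpp
#include <cstdint>

// Stable states for a simple Valid/Invalid (VI) protocol: enough for an
// accelerator that never needs exclusive ownership of a dirty line.
enum class ViState : uint8_t { I, V };

// Stable states for MSI: a writeback CPU cache must also track Modified
// (dirty + exclusive) and Shared; a real controller layers many transient
// states on top of these.
enum class MsiState : uint8_t { I, S, M };
```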

2. Why a Simple, Standardized Interface?
- Host systems speak different protocols: Intel, AMD, ARM, IBM, Oracle…
- Expensive to redesign the accelerator for each one!
- CCIX shows industry cares!
[Slide art: the accelerator L1 greeting the host directory with "Hello? 你好? Bonjour? Hej?"]

2. Why a Simple, Standardized Interface?
[Figure: the L1 controller from gem5's MOESI_hammer, shown as a transition table of states × events in the style of Sorin et al., illustrating how large a host L1 controller is.]

3. Why Host Safety?
[Animation, baseline state: the host directory tracks address A in state SS with sharers {1, 2}; the accelerator cache (#0) holds A in I, CPU cache #1 holds A in S, CPU cache #2 holds A in I.]

3. Why Host Safety?
[Animation: the buggy accelerator sends a spurious Ack for address A. The directory, still in SS with no outstanding request, has no defined transition for it (question marks).]

3. Why Host Safety?
[Animation: the accelerator wrongly believes it holds A in M. As the directory moves through transient states MT and MT_I (owner 0) and sends an Invalidate (Req: dir), the mismatch between accelerator and host state can wedge the host protocol.]

Outline
- Goals
- Design
- Guarantees
- Evaluation

Crossing Guard
- Hardware translating between host and accelerator protocols
- A set of accelerator ↔ host coherence messages (like an API)
[Slide art: the accelerator cache and host cache again speaking "Hello!" and "你好!", with the XG translating in between.]

Crossing Guard Interface
- Accelerator → Host requests: GetS, GetM, PutS, PutE, PutM
- Host → Accelerator responses: DataS, DataE, DataM, Writeback Ack
- Host → Accelerator requests: Invalidate
- Accelerator → Host responses: InvAck, Clean Writeback, Dirty Writeback
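Expressed as C++ enums (a sketch: the message names come from the slide, the type layout is mine), the entire interface fits in a few lines:

```cpp
// The complete Crossing Guard message vocabulary, one enum per direction
// and kind. Compare with the dozens of message types in a full host protocol.
enum class AccelToHostRequest  { GetS, GetM, PutS, PutE, PutM };
enum class HostToAccelResponse { DataS, DataE, DataM, WritebackAck };
enum class HostToAccelRequest  { Invalidate };
enum class AccelToHostResponse { InvAck, CleanWriteback, DirtyWriteback };
```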

Crossing Guard
- Hides implementation details of the host protocol: no counting acks, sending unblocks, handling races, etc.
- Moves protocol complexity into the Crossing Guard hardware
- Implemented only once per host system, by experts!
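To see how little the accelerator side has to do, here is a hypothetical accelerator-side handler for the one host-to-accelerator request, Invalidate (my own naming and cache model, not the paper's code):

```cpp
#include <cstdint>
#include <unordered_map>

using Addr = uint64_t;

enum class LineState { Invalid, Valid, Dirty };
enum class AccelToHostResponse { InvAck, CleanWriteback, DirtyWriteback };

struct AccelCache {
    std::unordered_map<Addr, LineState> lines;  // stands in for the tag array

    // Handling Invalidate is a purely local decision: no ack counting,
    // no unblock messages, no races to resolve.
    AccelToHostResponse onInvalidate(Addr a) {
        auto it = lines.find(a);
        LineState s = (it == lines.end()) ? LineState::Invalid : it->second;
        lines.erase(a);
        switch (s) {
        case LineState::Dirty: return AccelToHostResponse::DirtyWriteback;
        case LineState::Valid: return AccelToHostResponse::CleanWriteback;
        default:               return AccelToHostResponse::InvAck;
        }
    }
};
```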

Experimental Implementation
- Coherence controllers / protocols implemented in SLICC
- Simulations using gem5
- Code and transition tables available online: http://research.cs.wisc.edu/multifacet/xguard/

Outline
- Goals
- Design
- Guarantees
- Evaluation

1. Customize Caches
Designed + implemented two sample systems.
Private per-core L1 at the accelerator:
[Diagram: three accelerator L1s, each with its own Crossing Guard, beside two CPU L1s, all attached to the host directory / L2.]

1. Customize Caches
Designed + implemented two sample systems.
Private L1s + shared L2 at the accelerator:
[Diagram: three accelerator L1s behind a shared accelerator L2, with a single Crossing Guard between the accelerator L2 and the host directory / L2.]

2. Simple, Standardized Interface
Single-level accelerator cache using the Crossing Guard interface:

Controller                               | States | Transitions
AMD Hammer-like private caches           |   24   |    148
Crossing Guard single-level private L1   |    5   |     20

2. Simple, Standardized Interface
Implemented the Crossing Guard controller for two host protocols:
- AMD Hammer-like (exclusive MOESI)
- Two-level MESI (inclusive)
Modularity: the host and accelerator protocol choices are completely independent.

2. Simple, Standardized Interface
[Animation, two frames: the accelerator cache (#0) issues a GetM for address A. The host directory (A in SS, sharers {1, 2}) invalidates caches #1 and #2 and moves through transient state SM_MB to M with owner 0. The Crossing Guard tracks the transaction in an entry with Acks / Reqs / Timer fields: the running ack count dips to -1, then -2 as Acks arrive ahead of the data, and returns to 0 when the Data message arrives carrying its ack count ("Data Acks: -2"). Only then does the XG deliver a single DataM to the accelerator and send UnblockM to the directory.]
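A C++ sketch of the bookkeeping the Crossing Guard performs in this walkthrough (my reconstruction of the slide's Acks / Reqs / Timer entry; the actual controllers are written in SLICC):

```cpp
#include <cstdint>
#include <unordered_map>

using Addr = uint64_t;

// One in-flight transaction, echoing the slide's Acks / Reqs / Timer entry.
struct XgEntry {
    int  acks = 0;        // may go negative if Acks beat the Data message
    bool haveData = false;
};

struct CrossingGuard {
    std::unordered_map<Addr, XgEntry> inflight;

    // Invalidation ack from another cache: count it. The slide shows the
    // running count dipping to -1, then -2.
    void onHostAck(Addr a) {
        auto &e = inflight[a];
        --e.acks;
        maybeFinish(a, e);
    }

    // Data response; in Hammer-like protocols it carries how many acks to
    // expect, restoring the count to 0.
    void onHostData(Addr a, int acksExpected) {
        auto &e = inflight[a];
        e.acks += acksExpected;
        e.haveData = true;
        maybeFinish(a, e);
    }

    void maybeFinish(Addr a, XgEntry &e) {
        if (e.haveData && e.acks == 0) {
            // All acks collected: hand one clean DataM to the accelerator,
            // send UnblockM to the host directory, retire the entry.
            inflight.erase(a);
        }
    }
};
```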

3. Host Safety
[Animation: the accelerator sends a spurious Ack for address A. The Crossing Guard's entry shows A in I with no transaction outstanding, so the XG can recognize the Ack as illegal and squash it; the host directory (A in SS, sharers {1, 2}) is never confused.]

3. Host Safety
[Animation: the directory holds A in transient state MT (owner 0) and sends an Invalidate (Req: dir) to the accelerator, which holds A in M. The Crossing Guard forwards the Invalidate and arms a timer (entry A in MI, timer expiring around cycle 1210). If the accelerator never responds, the timer fires and the XG sends the Data to the directory on the accelerator's behalf, letting the host pass through MT_I / WB and complete.]
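A sketch of the timeout mechanism suggested by the slide's Timer field (the threshold, tick interface, and helper names are my inventions):

```cpp
#include <cstdint>
#include <vector>

using Addr = uint64_t;

constexpr uint64_t kTimeoutCycles = 1000;  // hypothetical threshold

struct PendingInv { Addr addr; uint64_t issuedAt; };

struct CrossingGuard {
    std::vector<PendingInv> pending;

    void onHostInvalidate(Addr a, uint64_t now) {
        forwardInvalidateToAccel(a);   // normal path
        pending.push_back({a, now});   // arm the safety timer
    }

    void onAccelResponse(Addr a) {
        // Normal completion cancels the timer.
        std::erase_if(pending, [a](const PendingInv &p) { return p.addr == a; });
    }

    void tick(uint64_t now) {
        for (auto it = pending.begin(); it != pending.end();) {
            if (now - it->issuedAt >= kTimeoutCycles) {
                // The accelerator never answered: respond to the host
                // directory on its behalf so the host protocol completes.
                respondToHostOnBehalf(it->addr);
                it = pending.erase(it);
            } else {
                ++it;
            }
        }
    }

    // Stubs standing in for real message ports.
    void forwardInvalidateToAccel(Addr) {}
    void respondToHostOnBehalf(Addr) {}
};
```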

Outline
- Goals
- Design
- Guarantees
- Evaluation

Evaluation
- Does it provide coherence to a correct accelerator?
- Does it provide safety to the host?
- Does it allow high performance?

I. Correctness Testing
Are coherence invariants maintained when the accelerator is acting correctly?
How? Random tester:
- Store-load pairs to random addresses
- Check integrity of data
Ran for 160 billion load/store pairs.
Local coverage: 100% of states, 100% of events, > 99% of transitions.
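A sketch of the random tester's core loop (the real tester drives gem5's simulated caches; here a hypothetical `Memory` wrapper stands in for the system under test, and the iteration count is scaled way down from the paper's 160 billion):

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <unordered_map>

// Hypothetical handle to the simulated memory system under test.
struct Memory {
    std::unordered_map<uint64_t, uint8_t> m;
    void store(uint64_t a, uint8_t v) { m[a] = v; }
    uint8_t load(uint64_t a) { return m[a]; }
};

int main() {
    std::mt19937_64 rng(42);
    Memory sys;                                    // system under test
    std::unordered_map<uint64_t, uint8_t> oracle;  // last value stored

    for (long i = 0; i < 1'000'000; ++i) {
        uint64_t addr = rng() % 4096;  // small address space -> many conflicts
        uint8_t val = static_cast<uint8_t>(rng());
        sys.store(addr, val);          // store through the cache hierarchy
        oracle[addr] = val;
        // Load back through the caches and check data integrity.
        assert(sys.load(addr) == oracle[addr] && "coherence violation");
    }
    return 0;
}
```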

II. Fuzz Testing
Is host safety maintained when the accelerator misbehaves?
How? Replace the accelerator cache with an evil controller:
- Generates random coherence messages to random addresses
- Desired outcome: no deadlocks / crashes
Ran for 7 billion load/store pairs.
Local coverage: 100% of states, 100% of events, > 99% of transitions.
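And a sketch of the evil controller's inner loop (the message set comes from the interface slide; the injection port is a made-up stub):

```cpp
#include <cstdint>
#include <random>

using Addr = uint64_t;

// Every message an accelerator could emit, legal or not.
enum class Msg { GetS, GetM, PutS, PutE, PutM,
                 InvAck, CleanWriteback, DirtyWriteback };

// Hypothetical stub for the port into the Crossing Guard.
void sendToCrossingGuard(Addr, Msg) {}

int main() {
    std::mt19937_64 rng(1337);
    for (long i = 0; i < 1'000'000; ++i) {    // paper: 7 billion
        Addr addr = rng() % 4096;             // random block
        Msg m = static_cast<Msg>(rng() % 8);  // random message, ignoring
                                              // whatever state the block
                                              // is actually in
        sendToCrossingGuard(addr, m);
    }
    return 0;
}
```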

III. Performance Testing
- gem5-gpu, Rodinia workloads
- MESI Inclusive host protocol
[Chart: normalized accelerator execution time per benchmark.]

Crossing Guard Summary
- Provides a simple, standardized interface to ease accelerator development
- Correctness when the accelerator is correct
- Host safety when the accelerator is incorrect
- Low performance overhead

Questions?

Backup Follows

Two-Level Accelerator Protocol (1)
Private L1s + shared L2 at the accelerator:
[Diagram: three accelerator L1s behind a shared accelerator L2 and a single Crossing Guard, beside two CPU L1s, attached to the host directory / L2.]

Two-Level Accelerator Protocol (2)
[Figure: L1 controller transition table; the M state carries a dirty/clean bit.]

Two-Level Accelerator Protocol (3)
[Figure: L2 controller transition table; the L2 coordinates sharing among the accelerator L1s.]

Crossing Guard Invariants
Crossing Guard guarantees to the host:
- Accelerator requests must be correct: consistent with the block's stable state at the accelerator, and consistent with the block's transient state at the accelerator
- Accelerator responses must be correct and received within a reasonable time
(+ Border Control protections!)
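A sketch of the first guarantee as a legality check (the state and request names are mine, and a real Crossing Guard checks transient shadow state too, not just the stable states shown here):

```cpp
// Shadow of the accelerator's stable state for one block, as tracked by
// the Crossing Guard.
enum class ShadowState { I, S, M };
enum class AccelRequest { GetS, GetM, PutS, PutM };

// An accelerator request is forwarded to the host only if it is
// consistent with the state the block can actually be in.
bool requestIsLegal(ShadowState s, AccelRequest r) {
    switch (r) {
    case AccelRequest::GetS: return s == ShadowState::I;  // fetch, not re-fetch
    case AccelRequest::GetM: return s != ShadowState::M;  // fetch or upgrade
    case AccelRequest::PutS: return s == ShadowState::S;  // evict a shared line
    case AccelRequest::PutM: return s == ShadowState::M;  // write back owned data
    }
    return false;
}
```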

Crossing Guard Variants
Full State Crossing Guard: an inclusive directory of accelerator state
  + Places few restrictions on the host protocol
  + Can hide all errors
  - Requires tag + metadata storage for all blocks
Transactional Crossing Guard: stores only data for in-flight transactions
  + Small storage
  + Provides most safety properties
  - Requires some host tolerance
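The storage trade-off in one picture (illustrative field choices, not the paper's exact layout):

```cpp
#include <cstdint>
#include <unordered_map>

using Addr = uint64_t;

// Full State XG: an inclusive shadow entry for every block the accelerator
// may cache -- tags plus metadata for all blocks.
struct FullStateEntry { uint8_t state; uint16_t sharers; };
std::unordered_map<Addr, FullStateEntry> fullShadow;  // sized like a cache

// Transactional XG: state only for blocks with an in-flight transaction,
// so the table stays tiny but the host must tolerate some errors.
struct Transaction { uint8_t state; int acks; uint64_t timer; };
std::unordered_map<Addr, Transaction> inflight;       // a handful of entries
```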

Single-Level Cache

Simulation Parameters

Time Spent Simulating (Random)

Configuration                      | Time
XG Full + Hammer + 1 Level         | 5.28 years
XG Full + Hammer + 2 Level         | 2.51 years
XG Full + MESI Inc + 1 Level       | 133 days
XG Full + MESI Inc + 2 Level       | 223 days
XG Trans. + Hammer + 1 Level       | 3.17 years
XG Trans. + Hammer + 2 Level       | 1.38 years
XG Trans. + MESI Inc + 1 Level     | 90 days
XG Trans. + MESI Inc + 2 Level     | 103 days
TOTAL                              | 13.9 years

Full Coverage %s (Random)

Full State XG:
- Hammer-like: 99 (single-level), 99.8 (two-level)
- MESI Inclusive: 100 (single-level), 99.4 (two-level)
Transactional XG: 99.3, 99.5, 99.7

Time Spent Simulating (Fuzz)

Configuration                       | Time
XG Full + Hammer-like               | 1.62 years
XG Full + MESI Inclusive            | 287 days
XG Transactional + Hammer-like      | 5.3 years
XG Transactional + MESI Inclusive   | 41 days
Total                               | 7.82 years

Full Coverage %s (Fuzz)

Full State Crossing Guard:
- Hammer-like: 99.3
- MESI Inclusive: 99.7
Transactional Crossing Guard: 100

PutS Accelerator Messages
Why?
- Some host protocols use them
- Simplify management of the Full State Crossing Guard
- Cannot implement the Transactional Crossing Guard with a PutS-based host protocol without them
Bandwidth impact:
- Carry no data
- Travel only between the accelerator cache and the Crossing Guard, not into the host system
- ~1-4% of that bandwidth in experiments; could be reduced by setting a flag at the Crossing Guard

Why Not Model Checking?
Model checking is useful! An industrial implementation of Crossing Guard would use it.
Academic tools have limitations:
- They benefit from symmetry, but the Crossing Guard system is asymmetric
- They may only work with one block in the system
- Substantial implementation overhead
This work was a proof of concept: random / fuzz testing is not perfect, but the results are suggestive. Even models can have mistakes!

Performance: Hammer-like

Performance: MESI Inclusive

Performance (Hammer-like)

Template
[Blank animation frame: the directory plus caches #0-#2, each with an Addr/State table, used as the template for the protocol walkthroughs.]

Old Slides

3. Why Host Safety?
[Old slide: the accelerator cache, holding A with RW permission, sends an Ack for A; the directory believes A is not present in any cache and has no defined response (question marks).]

3. Why Host Safety?
[Old slide: the accelerator cache holds A in M with RW permission; the directory records A as M, owned by the accelerator, and sends Fwd-GetM for A.]

Crossing Guard Example
[Old slide: the directory holds A in M, owned by the accelerator, and sends Fwd-GetM for A. The Crossing Guard translates it into an Invalidate for A, records "A: waiting for WB", and the accelerator cache (A: M, permissions RW) answers with a Writeback for A.]
