5/8/2015 slide 1. PCOD: MIMD II Lecture (Coherence). Per Stenström (c) 2008, Sally A. McKee (c) 2011.

Slide 1: Snoop-Based Multiprocessor Design

Correctness issues:
- semantic model: coherence and memory consistency
- deadlock, livelock, and starvation

Design issues, from simplistic to realistic, one by one:
- a single-level cache and an atomic bus
- multi-level cache design issues
- split-transaction bus design issues

Scalable snoop-based design techniques
More architectural support for MIMD

Slide 2: Key Goals

- Correctness
- Design simplicity (verification is costly)
- High performance

Design simplicity and performance are often at odds.
Goal: get a picture of the bus-based coherence organization, dual tags, and the processor-side and bus-side controllers.

Slide 3: Correctness Requirements

Semantic model: a contract between hardware and software
- cache coherence -> write serialization
- sequential consistency -> program order, write atomicity

Deadlock: no forward progress and no system activity
- resources held in a cyclic relationship

Livelock: no forward progress but ongoing system activity
- allocation and de-allocation of resources with no progress

Starvation: some processes are denied service
- often temporary

Slide 4: Single-Level Cache and Atomic Bus

Single-level caches and an atomic bus:
- tag and cache controller design issues
  - snoop protocol design
  - race conditions: non-atomic state transitions
- correctness issues: serialization; deadlock, livelock, and starvation
- atomic (synchronization) operations

Slide 5: Cache Controller Design

Extension for snoop support: bus requests also access the cache
- processor-side controller
- bus-side controller

Recall the actions on a cache access:
1. Index the cache and check the tag
2. Get/request the data
3. Update the state bits

Performance issue: simultaneous tag accesses from the processor and the bus.
Solution: duplicate the tags, but keep the two copies consistent.

[Figure: cache with tags and cached data, accessed both by processor requests and by bus requests, with a duplicated tag array.]
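The dual-tag idea above can be illustrated with a small software model (hypothetical names; a sketch, not a hardware description): the processor side and the snoop side each read their own tag copy, so a snoop never steals a tag-lookup cycle from the processor; only state or tag changes must touch both copies.

```python
class DualTagCache:
    """Software sketch of a cache with duplicated tags for concurrent snooping."""

    def __init__(self, num_sets):
        # One entry per set: (tag, state); state in {"I", "S", "M"}
        self.proc_tags = [(None, "I")] * num_sets
        self.snoop_tags = [(None, "I")] * num_sets

    def proc_lookup(self, index, tag):
        t, state = self.proc_tags[index]      # processor-side copy
        return state != "I" and t == tag

    def snoop_lookup(self, index, tag):
        t, state = self.snoop_tags[index]     # bus-side copy
        return state != "I" and t == tag

    def update(self, index, tag, state):
        # The only point where both copies must be kept consistent
        self.proc_tags[index] = (tag, state)
        self.snoop_tags[index] = (tag, state)

cache = DualTagCache(num_sets=4)
cache.update(index=1, tag=0x2A, state="S")
assert cache.proc_lookup(1, 0x2A) and cache.snoop_lookup(1, 0x2A)
```

In real designs the two tag arrays are physically separate RAMs; the model only shows the invariant (identical contents, independent read ports) that makes simultaneous access safe.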

Slide 6: Reporting Snoop Results

Where to read from (memory or a cache), and what state transition to make?
- supported by wired-AND/OR bus lines

When is the snoop result available? The main alternatives:
- synchronous: requires dual tags, and must adapt to the worst case because of state-bit updates caused by the processor
- asynchronous (variable-delay snoop): assume the minimum delay, but add cycles if necessary
- a memory state bit to distinguish between valid and invalid memory blocks
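As a minimal sketch of the wired-OR combining mentioned above (function name and tuple encoding are my own): each cache asserts a "shared" and/or "dirty" line, the bus effectively ORs the assertions, and the requester uses the result to decide who supplies the data and which state to install.

```python
def combine_snoops(results):
    """Model of wired-OR snoop-result lines.

    results: per-cache (shared, dirty) assertions; the bus ORs them together.
    Returns (data source, shared flag); the shared flag decides the
    state the requester installs.
    """
    shared = any(s for s, _ in results)
    dirty = any(d for _, d in results)
    source = "cache" if dirty else "memory"   # a dirty cache supplies the data
    return source, shared

# One cache holds the block dirty: it supplies the data, block is shared.
assert combine_snoops([(False, False), (True, True)]) == ("cache", True)
# No cache has a copy: memory supplies the data.
assert combine_snoops([(False, False)]) == ("memory", False)
```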

Slide 7: Dealing with Write-backs

One would like to service the miss before writing back the replaced block.
Two implications:
- add a write-back buffer
- bus snoops must also look into the write-back buffer
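The second implication can be made concrete with a sketch (hypothetical, heavily simplified): once the miss is serviced first, the only valid copy of the replaced dirty block may sit in the write-back buffer, so an incoming snoop must match against it and, on a hit, supply the data from there.

```python
class WriteBackBuffer:
    """Sketch of a snooped write-back buffer."""

    def __init__(self):
        self.entries = {}                 # address -> dirty data awaiting write-back

    def push(self, addr, data):
        self.entries[addr] = data         # replaced dirty block parked here

    def snoop(self, addr):
        # A bus request that hits here must be answered from the buffer;
        # the pending memory write-back for that block can then be cancelled.
        return self.entries.pop(addr, None)

wbb = WriteBackBuffer()
wbb.push(0x100, "dirty-data")
assert wbb.snoop(0x100) == "dirty-data"   # snoop supplies the only valid copy
assert wbb.snoop(0x100) is None           # entry retired after servicing the snoop
```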

Slide 8: Baseline Architecture

[Figure: baseline bus-based architecture, including the write-back buffer.]

Slide 9: State Transitions Must Appear Atomic

Assume a block is in shared state in both Cache 1 and Cache 2, and both issue an Upgrade:
1. Cache 1 awaits use of the bus.
2. Cache 2 gets access to the bus first.
3. The Upgrade from Cache 2 updates the state of the block in Cache 1 to invalid.
4. The Upgrade from Cache 1 is then performed. However, an Upgrade is no longer appropriate, since Cache 1's copy is now invalid.

Slide 10: Non-Atomic State Transitions

There is a time window between issuing a bus operation and performing it.
- Problem: another transaction may change what the correct action is.
- Solution: extend the protocol with transient (non-atomic) states.
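The transient-state solution for the Upgrade race on slide 9 can be sketched as follows (state and request names are my own shorthand, not a specific protocol specification): a cache that has requested an Upgrade sits in a transient S->M state, and if a competing invalidation arrives before the bus is granted, the pending Upgrade is converted into a full read-exclusive.

```python
class Line:
    """One cache line in a simplified MSI protocol with transient states."""

    def __init__(self):
        self.state = "S"
        self.pending = None               # outstanding bus request, if any

    def proc_write(self):
        if self.state == "S":
            self.state = "S->M"           # transient: waiting for the bus
            self.pending = "Upgr"

    def snoop_invalidate(self):
        if self.state == "S->M":
            self.state = "I->M"           # our copy was lost mid-transition:
            self.pending = "BusRdX"       # must fetch data, not just upgrade
        else:
            self.state = "I"

    def bus_granted(self):
        req, self.pending = self.pending, None
        self.state = "M"
        return req

line = Line()
line.proc_write()                         # wants S -> M via an Upgrade
line.snoop_invalidate()                   # the competitor's Upgrade wins the bus
assert line.bus_granted() == "BusRdX"     # the stale Upgrade was converted
assert line.state == "M"
```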

Slide 11: Correctness Issues

Write serialization: ownership acquisition and cache block modification should appear atomic.
- The processor may not write data into the cache until its read-exclusive request is on the bus, i.e., committed.

Deadlock: two cache controllers may end up in a circular dependence if one locks its cache while waiting for the bus (fetch deadlock).

Livelock: several controllers may issue read-exclusive requests for the same block at the same time.
- Let each one complete before taking care of the next.

Starvation: bus arbitration is unfair to some nodes.

Slide 12: A Fetch-Deadlock Situation

Cache 1 holds block A and issues ReadX B; Cache 2 issues BusRd A.
1. Cache 1 awaits use of the bus, keeping its cache locked.
2. Cache 2 gets access to the bus and issues BusRd A.
3. Cache 2 waits for Cache 1 to respond with block A, while Cache 1 waits for Cache 2 to release the bus. Deadlock!

Slide 13: A Livelock Situation

A read-exclusive operation involves:
1. acquisition of an exclusive copy of the block
2. reattempting the write in the local cache

Now suppose Cache 1 and Cache 2 both issue ReadX A:
1. Both try to get the bus.
2. Cache 1's ReadX makes Cache 2's copy invalid.
3. Before Cache 1 performs its write, Cache 2's ReadX makes Cache 1's copy invalid.
Etc. Livelock!

Slide 14: Remedies to the Correctness Issues

- Do not update the cache until the Upgrade is on the bus.
- Service incoming snoops while waiting for the bus.
- Complete each transaction without interruption.

[Figure: Cache 1 and Cache 2 each issuing an Upgrade.]

Slide 15: Implementation of Atomic Memory Operations

Test&set should result in an atomic read-modify-write.
Cacheable test&set vs. a memory-based implementation:
- lower latency and bandwidth for spinning and self-acquisition
- longer time to transfer the lock to another node
- the memory-based variant requires the bus to be locked down

Load-linked (LL) and store-conditional (SC) implementation:
- a lock flag and a lock-address register at each processor
- LL reads the block, sets the lock flag, and puts the block address in the register
- incoming invalidations are checked against the address: on a match, reset the flag
- SC checks the lock flag as an indicator of an intervening conflicting write: if it has been reset, fail; otherwise, succeed
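The LL/SC mechanism in the bullet list above can be sketched as a software model (class and method names are my own; real LL/SC is an ISA feature, e.g. in MIPS or Alpha): LL records the address and sets the flag, an incoming invalidation to that address clears the flag, and SC succeeds only if the flag survived.

```python
class LLSCUnit:
    """Per-processor lock flag and lock-address register for LL/SC."""

    def __init__(self, memory):
        self.memory = memory
        self.lock_flag = False
        self.lock_addr = None

    def ll(self, addr):
        # Load-linked: read, set the flag, remember the address
        self.lock_flag = True
        self.lock_addr = addr
        return self.memory[addr]

    def incoming_invalidate(self, addr):
        # A conflicting write by another processor resets the flag
        if self.lock_addr == addr:
            self.lock_flag = False

    def sc(self, addr, value):
        # Store-conditional: succeeds only if no conflicting write intervened
        if self.lock_flag and self.lock_addr == addr:
            self.memory[addr] = value
            self.lock_flag = False
            return True
        return False                      # fail: software retries the LL/SC pair

mem = {0x40: 0}
p = LLSCUnit(mem)
v = p.ll(0x40)
assert p.sc(0x40, v + 1)                  # no intervening write: SC succeeds
v = p.ll(0x40)
p.incoming_invalidate(0x40)               # conflicting write observed on the bus
assert not p.sc(0x40, v + 1)              # SC fails, memory is unchanged
assert mem[0x40] == 1
```

An atomic increment is then built by looping the LL/SC pair until SC reports success, which is exactly why livelock-freedom of the underlying protocol matters.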

Slide 16: Multi-Level Cache Designs

Coherence needs to be extended across L1 and L2. L1 is on-chip, and snoop support in L1 is expensive. Is snoop support needed in L1?

[Figure: processor P with L1 and L2 caches, connected to memory M.]

Definition: L1 is included in L2 iff all blocks in L1 are also in L2.
If inclusion is maintained, then snoop support is needed only at L2 (which must be able to invalidate blocks in L1).
Consequence: a block in owned state in L1 (M in MSI) must also be marked modified in L2.

Slide 17: Maintaining Inclusion

Violations of the inclusion property arise from:
- a set-associative L1 with a history-based replacement algorithm
- split I- and D-caches at L1 with a unified L2
- different cache block sizes in L1 and L2

Techniques to maintain inclusion:
- a direct-mapped L1 with an L2 of any associativity, given some additional constraints on block size, fetch policy, etc.

Note: one can always displace a block from L1 on a replacement in L2 to maintain inclusion.
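The note above, forcibly displacing the L1 copy whenever L2 loses a block, is the key mechanism, and a tiny model makes the invariant explicit (hypothetical names; block addresses stand in for full lines): only L2 sees bus events, and every L2 eviction or invalidation back-invalidates L1.

```python
class TwoLevelHierarchy:
    """Sketch of an inclusive L1/L2 pair where only L2 snoops the bus."""

    def __init__(self):
        self.l1 = set()                   # block addresses present in L1
        self.l2 = set()

    def fill(self, addr):
        self.l2.add(addr)                 # inclusion: L2 is filled first
        self.l1.add(addr)

    def l2_evict_or_invalidate(self, addr):
        # Triggered by an L2 replacement or by a bus invalidation seen at L2
        self.l2.discard(addr)
        self.l1.discard(addr)             # back-invalidate the L1 copy

    def inclusion_holds(self):
        return self.l1 <= self.l2         # every L1 block is also in L2

h = TwoLevelHierarchy()
h.fill(0x80)
h.fill(0xC0)
h.l2_evict_or_invalidate(0x80)            # bus invalidation handled at L2 only
assert 0x80 not in h.l1                   # L1 copy was removed as well
assert h.inclusion_holds()
```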

Slide 18: Split-Transaction Buses

Separate request and response phases improve bus utilization.

Challenging issues:
- avoiding conflicting requests in progress simultaneously
- buffers are needed => flow control
- correctness issues (coherence, SC, deadlock, livelock, ...)

[Figure: pipelined bus timeline in which the arbitration, address/command, memory-access-delay, and data phases of successive transactions overlap.]

Slide 19: Example of a Conflict Situation

With an atomic bus, an Upgrade is committed when the bus is granted.
Here, two Upgrades can be on the bus at the same time and may invalidate both copies.

[Figure: Cache 1 and Cache 2 each issuing an Upgrade.]

Slide 20: Some Real Examples

- Details can be interesting
- Supports the historical emphasis of the course
- SGI Power Challenge

Slide 21: SGI Challenge (1/4)

High-level design decisions:
- Avoid conflicts: allow a fixed number of requests, to different blocks, in progress at a time
- Flow control: limited buffers, so NACK when full and retry
- Ordering: allow out-of-order responses (to cope with non-uniform delays)

Slide 22: SGI Challenge (2/4)

Separate request and response buses.
Request phase (on the address/request bus):
- present the address and initiate snooping
- report the snoop result (prolong or NACK if necessary)
Response phase (on the data bus):
- send the data back

Slide 23: Design of the SGI Challenge (3/4)

- At most 8 outstanding requests; a 3-bit tag separates them.
- A request table in each node keeps track of the outstanding requests.
- Writes are committed when the request is granted the bus.
- Flow control: NACK and retry when buffers are full.
- Conflict resolution: before an address request is issued, the request table is checked; memory and the caches check each request independently.
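The request-table check can be sketched as follows (a simplified model with names of my own; the actual SGI Challenge logic is distributed across the nodes): up to 8 outstanding requests are tracked, each identified by a 3-bit tag that later matches the response, and a new request to a block already in the table is held back or NACKed.

```python
class RequestTable:
    """Sketch of a per-node table of outstanding split-transaction requests."""

    MAX_OUTSTANDING = 8                   # a 3-bit tag identifies 8 entries

    def __init__(self):
        self.entries = {}                 # tag -> block address in flight
        self.free_tags = list(range(self.MAX_OUTSTANDING))

    def try_issue(self, block_addr):
        if block_addr in self.entries.values():
            return None                   # conflict with an outstanding request
        if not self.free_tags:
            return None                   # table full: NACK, caller retries
        tag = self.free_tags.pop()
        self.entries[tag] = block_addr    # the tag travels with the response
        return tag

    def retire(self, tag):
        # Called when the matching response arrives on the data bus
        del self.entries[tag]
        self.free_tags.append(tag)

rt = RequestTable()
t = rt.try_issue(0x1000)
assert t is not None
assert rt.try_issue(0x1000) is None       # same block: conflict detected
rt.retire(t)
assert rt.try_issue(0x1000) is not None   # retirement frees the block again
```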

Slide 24: Serialization and SC (4/4)

Serialization to a single location is guaranteed because:
1. only a single outstanding request to each block is allowed
2. a request is committed when it appears on the bus

Problems in guaranteeing SC:
- SC requires serialization across writes to different locations
- requests can be reordered in buffers, so being committed is not the same as being performed

A solution: servicing incoming requests before the processor's own requests guarantees write atomicity.

Slide 25: Multiple Outstanding Processor Requests

Modern processors allow multiple outstanding memory operations.
Problem: this may violate sequential consistency.
Solution:
- buffer all outstanding requests
- do not make writes visible to anyone until they are committed
- do not perform reads before previously issued requests are committed

Lockup-free caches implement the buffering capability needed to enforce the ordering of uncommitted memory operations.

Slide 26: Commercial Machines

- SGI Challenge: 36 MIPS R8000 processors with a 1.2 GB/s bus; peak 5.4 GFLOPS
- Sun Enterprise 6000: 30 UltraSPARC processors with a 2.67 GB/s bus; peak 9 GFLOPS

Look these up on the net.