Cache Coherence in Scalable Machines (IV). 99-6-72 Dealing with Correctness Issues Serialization of operations Deadlock Livelock Starvation.


Cache Coherence in Scalable Machines (IV)

Dealing with Correctness Issues
- Serialization of operations
- Deadlock
- Livelock
- Starvation

Serialization of Operations
- Need a serializing agent
  - home memory is a good candidate, since all misses go there first
- Possible mechanism: FIFO buffering of requests at the home
  - hold each request until previous requests forwarded from the home have returned replies to it
  - but the input-buffer problem becomes acute at the home
- Possible solutions:
  - let the input buffer overflow into main memory (MIT Alewife)

Serialization of Operations
- don't buffer at the home, but forward to the owner node when the directory is in a busy state (Stanford DASH)
  - serialization determined by the home when clean, by the owner when exclusive
  - if the request cannot be satisfied at the "owner" (e.g. the block was written back or ownership was given up), it is NACKed back to the requestor without being serialized
  - serialized when retried
- don't buffer at the home; use a busy state to NACK (Origin)
  - serialization order is the order in which requests are accepted (not NACKed)
- maintain the FIFO buffer in a distributed way (SCI)
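To make the Origin-style option concrete, here is a minimal sketch (class and method names are invented for illustration, not Origin's actual structures) of a home directory that keeps a per-line busy state: while a transaction is in flight, later requests for the same line are NACKed, so the serialization order is simply the order in which requests are accepted.

```python
class HomeDirectory:
    """Toy model of busy-state serialization at the home node."""

    def __init__(self):
        self.busy = set()    # lines with a transaction currently in flight
        self.order = []      # accepted requests, i.e. the serialization order

    def request(self, node, line):
        """Return True if the request is accepted, False if NACKed."""
        if line in self.busy:
            return False     # NACK: requestor must retry later
        self.busy.add(line)
        self.order.append((node, line))
        return True

    def complete(self, line):
        """All replies for the transaction collected; clear the busy state."""
        self.busy.discard(line)


home = HomeDirectory()
assert home.request("P1", 0xA0) is True
assert home.request("P2", 0xA0) is False   # NACKed while the line is busy
home.complete(0xA0)
assert home.request("P2", 0xA0) is True    # serialized after P1's request
```

The point of the sketch is that no buffering is needed at the home: rejected requests live in the network and the requestors, and the home's accept order alone defines serialization.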

Serialization to a Location (cont'd)
- Having a single entity determine order is not enough
- Example:

    P1              P2
    rd A (i)        wr A
    barrier         barrier
    rd A (ii)

- The second read of A should return the value written by P2
  - legal orders: rd A (i) -> wr A -> rd A (ii), or wr A -> rd A (i) -> rd A (ii)

Serialization to a Location (cont'd)
- Having a single entity determine order is not enough: it may not know when all transactions for an operation are done everywhere
- Example (Home, P1, P2):
  1. P1 issues a read request to the home node for A.
  2. P2 issues a read-exclusive request to the home, corresponding to its write of A. The home won't process it until it is done with the read.
  3. The home receives 1 and, in response, sends a reply to P1 (and sets the directory presence bit). The home now thinks the read is complete. Unfortunately, the reply does not get to P1 right away.
  4. In response to 2, the home sends an invalidate to P1; it reaches P1 before transaction 3 (there is no point-to-point order among requests and replies).
  5. P1 receives and applies the invalidate, and sends an ack to the home.
  6. The home sends the data reply to P2 corresponding to request 2. Finally, transaction 3 (the read reply) reaches P1.

Possible Solutions
- Solution 1: have read replies themselves be acknowledged; the home goes on to process the next request only after it receives this ack
- Solution 2: the requestor does not allow another request, such as an invalidation, to be applied to the block until its own outstanding request completes (SGI Origin)
- Solution 3: apply the invalidation even before the read reply is received; consider the reply invalid and retry the read (DASH)
- The resulting serialization order may differ between solutions
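Solution 2 can be sketched as follows (a toy model with invented names, not Origin's actual logic): the requestor tracks its outstanding reads and holds off any invalidate to such a line, applying it only after the read reply has arrived and its value has been consumed.

```python
class Requestor:
    """Toy model of a node that defers invalidates to lines with an
    outstanding read (Solution 2 in the text)."""

    def __init__(self):
        self.outstanding = set()   # lines with a read request in flight
        self.deferred = []         # invalidates held back until the read completes
        self.cache = {}

    def issue_read(self, line):
        self.outstanding.add(line)

    def recv_invalidate(self, line):
        if line in self.outstanding:
            self.deferred.append(line)   # do not apply yet
        else:
            self.cache.pop(line, None)   # apply immediately

    def recv_read_reply(self, line, value):
        self.outstanding.discard(line)
        self.cache[line] = value
        result = value                   # the reply's value may still be used
        for l in self.deferred:          # now apply the held-back invalidates
            self.cache.pop(l, None)
        self.deferred.clear()
        return result


p1 = Requestor()
p1.issue_read("A")
p1.recv_invalidate("A")            # the race: invalidate beats the read reply
old = p1.recv_read_reply("A", 0)   # reply finally arrives with the old value
assert old == 0
assert "A" not in p1.cache         # stale copy does not survive the invalidate
```

This preserves the home's intended order: the read is serialized before the write, so the reply's (old) value is legal, but the copy is gone by the time the write's effects are visible.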

Deadlock
- Two networks are not enough when the protocol is not request-reply
- Additional networks are expensive and underutilized
- Use two, but detect potential deadlock and circumvent it
  - e.g. when the input-request and output-request buffers fill beyond a threshold, and the request at the head of the input queue is one that generates more requests
  - or when the output-request buffer is full and has had no relief for T cycles
- Two major techniques:
  - take requests out of the queue and NACK them, until the one at the head will not generate further requests or the output-request queue has eased up (DASH)
  - fall back to strict request-reply (Origin): instead of a NACK, send a reply telling the requestor to request directly from the owner
    - better, because NACKs can lead to many retries, and even livelock
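The detection heuristic above can be sketched as a simple decision rule (the threshold value and all names here are invented for illustration; real machines tune these conditions in hardware): under pressure, the home stops turning requests into more requests and answers with a plain reply instead.

```python
THRESHOLD = 4   # illustrative queue-depth trigger, not a real machine's value

def handle_head_request(in_req_queue_depth, out_req_space):
    """Decide how to serve the request at the head of the input queue.

    Normal case: forward an intervention to the owner (request begets
    request).  Under back-pressure: fall back to strict request-reply,
    i.e. answer with a reply that tells the requestor to go to the
    owner itself (the Origin-style escape)."""
    if in_req_queue_depth > THRESHOLD or out_req_space == 0:
        return "reply-with-owner-hint"   # strict request-reply fallback
    return "forward-intervention"        # normal handling


assert handle_head_request(8, out_req_space=2) == "reply-with-owner-hint"
assert handle_head_request(2, out_req_space=0) == "reply-with-owner-hint"
assert handle_head_request(2, out_req_space=2) == "forward-intervention"
```

The key invariant the fallback restores is that every incoming message generates at most one outgoing reply, which the two-network scheme can always sink.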

Livelock
- Classical problem: two processors trying to write a block
  - Origin solves it with busy states and NACKs: the first to get there makes progress, the others are NACKed
- Problem with NACKs:
  - useful for resolving race conditions (as above)
  - not so good when used to ease contention in deadlock-prone situations: can cause livelock
    - e.g. DASH NACKs may cause all requests to be retried immediately, regenerating the problem continually
    - the DASH implementation avoids this by using a large enough input buffer
- No livelock when backing off to strict request-reply

Starvation
- Not a problem with FIFO buffering, but that has the earlier (input-buffer) problems
- NACKs can cause starvation
- Possible solutions:
  - do nothing; starvation shouldn't happen often (DASH)
  - random delay between request retries
  - priorities (Origin)
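The "random delay between retries" remedy is the same idea as randomized backoff in networking. A minimal sketch (function name and cap are invented): after each NACK, a requestor waits a random number of slots from a window that grows with the attempt count, so competing retries spread out instead of colliding in lockstep.

```python
import random

def retry_delay(attempt, max_slots=32):
    """Pick a random backoff delay, capped exponential in the attempt count."""
    slots = min(2 ** attempt, max_slots)
    return random.randrange(slots)   # uniform over [0, slots)


random.seed(0)                       # deterministic for the demo
delays = [retry_delay(a) for a in range(6)]
assert all(0 <= d < 32 for d in delays)
assert retry_delay(0) == 0           # first retry: window has only slot 0
```

Randomization breaks the symmetry that causes livelock, and the growing window also reduces retry traffic under sustained contention; it does not, by itself, guarantee freedom from starvation the way priorities do.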

Synchronization
- R10000 provides load-linked / store-conditional
- Hub provides uncached fetch&op
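Load-linked / store-conditional can be modeled in a few lines (this is a behavioral sketch with invented names, not the R10000 mechanism, which tracks the link via the cache line): the conditional store succeeds only if no other write to the location intervened since the linked load.

```python
class Memory:
    """Toy memory with LL/SC semantics; a version counter stands in for
    the hardware's link/snoop tracking."""

    def __init__(self):
        self.data = {}
        self.version = {}            # bumped on every successful store

    def ll(self, addr):
        """Load-linked: return the value and a link token."""
        return self.data.get(addr, 0), self.version.get(addr, 0)

    def sc(self, addr, value, linked_version):
        """Store-conditional: fails if another store intervened."""
        if self.version.get(addr, 0) != linked_version:
            return False             # link broken
        self.data[addr] = value
        self.version[addr] = linked_version + 1
        return True


def atomic_inc(mem, addr):
    """The classic LL/SC retry loop, as compilers emit for fetch-and-add."""
    while True:
        v, ver = mem.ll(addr)
        if mem.sc(addr, v + 1, ver):
            return v + 1


mem = Memory()
assert atomic_inc(mem, "ctr") == 1
_, ver = mem.ll("ctr")
mem.sc("ctr", 99, ver)                 # a competing writer gets in first...
assert mem.sc("ctr", 7, ver) is False  # ...so our store-conditional fails
```

The Hub's uncached fetch&op serves the same purpose as the loop above but performs the read-modify-write at the memory itself, avoiding ping-ponging the cache line among contending processors.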

Back-to-back Latencies (unowned)
- measured by pointer chasing, since the R10000 does not stall on a read miss

    Satisfied in    back-to-back latency (ns)    hops
    L1 cache        5.5                          0
    L2 cache        —                            0
    local mem       472                          0
    4P mem          690                          1
    8P mem          —                            2
    16P mem         990                          3

Protocol latencies (ns)

    Home      Owner     Unowned    Clean-Exclusive    Modified
    Local     Local     —          —                  1,036
    Remote    Local     —          —                  1,272
    Local     Remote    472*       930                1,159
    Remote    Remote    704*       917                1,097

Application Speedups

Summary
- In a directory protocol there is substantial implementation complexity below the logical state diagram:
  - directory vs. cache states
  - transient states
  - race conditions
  - conditional actions
  - speculation
- Real systems reflect the interplay of design issues at several levels
- Origin philosophy:
  - memory-less: a node reacts to incoming events using only local state
  - an operation does not hold shared resources while requesting others

Hardware/Software Trade-Offs

HW/SW Trade-offs
- Potential limitations of directory-based, cache-coherent systems:
  - High waiting time at memory operations
    - SC and HW optimization: SGI Origin2000
    - relaxing the consistency model
  - Limited capacity for replication
    - caching data in main memory and keeping this data coherent: COMA
  - High design and implementation cost
    - HW solutions: separate communication assist (CA)
    - SW solutions

Memory Consistency Models

Memory Consistency Model
- A formal specification of memory semantics: how the memory system will appear to the programmer
- Sequential consistency greatly restricts the use of many performance optimizations commonly used by uniprocessor hardware and compiler designers
- Relaxed consistency models alleviate this problem

Memory Consistency Models - Who Should Care?
- The model affects programmability
- The model affects the performance of the system
- The model affects portability, due to a lack of consensus on a single model
- The memory model influences the writing of parallel programs from the programmer's perspective, and virtually all aspects of designing a parallel system (the processor, memory, interconnection network, compiler, and programming language) from a system designer's perspective

Memory Semantics in Uniprocessor Systems
- Most high-level uniprocessor languages present simple sequential semantics for memory operations:
  - all memory operations appear to occur one at a time, in the sequential order specified by the program
- This still allows a wide range of efficient system designs:
  - register allocation, code motion, loop transformation
  - pipelining, multiple issue, write-buffer bypassing and forwarding, lockup-free caches

Sequential Consistency
- Sequential consistency (Lamport): "The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
- A simple and intuitive programming model
- Disallows many hardware and compiler optimizations that are possible in uniprocessors
- Conditions for SC?

Reasoning with Sequential Consistency

    initial: A, flag, x, y == 0

    P1                  P2
    (a) A := 1;         (c) x := flag;
    (b) flag := 1;      (d) y := A;

- program order: (a) -> (b) and (c) -> (d)
- claim: (x, y) == (1, 0) cannot occur
  - x == 1  =>  (b) -> (c)
  - y == 0  =>  (d) -> (a)
  - thus (a) -> (b) -> (c) -> (d) -> (a), so (a) -> (a): a contradiction
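The argument above can be checked exhaustively: under SC, every execution is some interleaving of the two threads that preserves each program order. The sketch below enumerates all such interleavings of (a),(b),(c),(d) and confirms that (x, y) == (1, 0) never occurs.

```python
from itertools import permutations

def run(order):
    """Execute one total order of the four operations and return (x, y)."""
    A = flag = x = y = 0
    for op in order:
        if op == 'a':
            A = 1
        elif op == 'b':
            flag = 1
        elif op == 'c':
            x = flag
        elif op == 'd':
            y = A
    return x, y


results = set()
for order in set(permutations('abcd')):
    # SC: keep each processor's program order, (a)->(b) and (c)->(d)
    if order.index('a') < order.index('b') and order.index('c') < order.index('d'):
        results.add(run(order))

assert (1, 0) not in results               # the claimed outcome is impossible
assert (0, 0) in results and (1, 1) in results and (0, 1) in results
```

Three of the four outcomes are reachable; only (1, 0) is ruled out, exactly as the cycle argument on the slide predicts.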

Then again, ...
- Many variables are not used to affect the flow of control, but only to share data
  - synchronizing variables
  - non-synchronizing variables

    initial: A, flag, x, y == 0

    P1                  P2
    (a) A := 1;         (c) x := flag;
        B := C := 2.78;
    (b) flag := 1;      (d) y := A + B + C;

Sequential Consistency
- Conditions for SC:
  - maintaining program order among operations from each individual processor
  - maintaining a single sequential order among operations from all processors: memory operations must be atomic
- (figure: processors P1, P2, P3, ..., Pn all issuing operations to a single shared memory)

Lamport's Requirement for SC
- Each processor issues memory requests in the order specified by its program.
- Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. Issuing a memory request consists of entering the request on this queue.
- Assumes stores execute atomically:
  - the newly written value becomes visible to all processors at the same time it is inserted into the FIFO queue
  - not so with caches and a general interconnect

Memory Operations
- Issuing: the request has left the processor environment
- Performing:
  - A LOAD is considered performed at a point in time when the issuing of a STORE to the same address cannot affect the value returned by the LOAD.
  - A STORE to X by processor i is considered performed at a point in time when a subsequently issued LOAD to the same address returns the value defined by a STORE in the sequence {S_i(X)}+.
- Atomic: memory accesses are atomic in a system if the value stored by a WRITE becomes readable at the same time for all processors.

Memory Operations (cont'd)
- Performing with respect to a processor:
  - A STORE by processor i is considered performed with respect to processor k at a point in time when a subsequently issued LOAD to the same address by processor k returns the value defined by a STORE in the sequence {S_i(X)/k}+.
  - A LOAD by processor i ...
- Performing an access globally:
  - A STORE is globally performed when it is performed with respect to all processors.
  - A LOAD is globally performed if it is performed with respect to all processors, and the STORE that is the source of the returned value has been globally performed.

Requirements for SC (Dubois & Scheurich)
- Each processor issues memory requests in the order specified by its program.
- After a store operation is issued, the issuing processor should wait for the store to complete before issuing its next operation. (The STORE is globally performed.)
- After a load operation is issued, the issuing processor should wait for the load to complete, and for the store whose value is being returned by the load to complete, before issuing its next operation. (The LOAD is globally performed.)
  - the last point ensures that stores appear atomic to loads
  - note: in an invalidation-based protocol, if a processor has a copy of a block in the dirty state, then a store to the block can complete immediately, since no other processor could access an older value

Architecture Implications
- Need write completion for atomicity and access ordering
  - without caches: ack writes
  - with caches: ack all invalidates
- Atomicity: delay access to the new value until all invalidates are acked
- Access ordering: delay each access until the previous one completes
- (figure: nodes of processor + memory with a communication assist (CA), connected by an interconnection network)
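The "ack all invalidates" rule can be sketched as a tiny model (names invented; real hardware tracks this in miss-handling registers): the issuing processor counts one expected ack per sharer and may not proceed past the store until the count drains to zero, i.e. the store is globally performed.

```python
class Processor:
    """Toy model of a processor stalling until a store is globally performed."""

    def __init__(self, sharers):
        self.sharers = sharers       # other processors caching the block
        self.pending_acks = 0

    def store(self, addr, value):
        # One invalidate goes to each sharer; expect one ack back from each.
        self.pending_acks = len(self.sharers)

    def recv_inval_ack(self):
        self.pending_acks -= 1

    def may_issue_next(self):
        # Dubois & Scheurich: next operation only after the store completes.
        return self.pending_acks == 0


p = Processor(sharers=["P2", "P3"])
p.store("A", 1)
assert not p.may_issue_next()        # invalidates outstanding: stall
p.recv_inval_ack()
p.recv_inval_ack()
assert p.may_issue_next()            # store globally performed: proceed
```

Waiting for every ack is what makes SC expensive: the full latency of the slowest invalidate sits on the processor's critical path, which is precisely what relaxed models try to hide.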

Implementing Sequential Consistency
- Architectures with caches raise three additional issues:
  - multiple copies --> need a cache coherence protocol
  - detecting when a write is complete --> more transactions
  - propagating changes --> non-atomic operation

Architectures with Caches: Cache Coherence and SC
- Several definitions exist for cache coherence
  - one: a synonym for SC
- A set of conditions commonly associated with a cache coherence protocol:
  1) a write is eventually made visible to all processors
  2) writes to the same location appear to be seen in the same order by all processors (also referred to as serialization of writes to the same location)
- Are the above conditions sufficient for satisfying SC?
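They are not, and the flag example from earlier shows why: the two conditions only constrain writes to the *same* location. The sketch below (reusing the earlier four-operation program) lets the system reorder P1's stores to the distinct locations A and flag, as a write buffer or network might, while leaving every per-location write order trivially intact, and the "impossible" outcome (x, y) == (1, 0) appears.

```python
from itertools import permutations

def run(order):
    """Execute one total order of the four operations and return (x, y)."""
    A = flag = x = y = 0
    for op in order:
        if op == 'a':
            A = 1
        elif op == 'b':
            flag = 1
        elif op == 'c':
            x = flag
        elif op == 'd':
            y = A
    return x, y


relaxed = set()
for order in set(permutations('abcd')):
    # P2's loads stay in program order, but P1's stores to the two
    # different locations A and flag are allowed to swap.
    if order.index('c') < order.index('d'):
        relaxed.add(run(order))

assert (1, 0) in relaxed   # coherence alone does not forbid this outcome
```

Since each location is written at most once here, both coherence conditions hold vacuously in every execution, yet SC is violated: SC additionally needs program order across *all* locations and a single global order of writes.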

Architectures with Caches: Cache Coherence and SC (cont'd)
- SC requires:
  1) writes to all locations to be seen in the same order by all processors
  2) the operations of a single processor to appear to execute in program order
- With this view, a cache coherence protocol can simply be defined as the mechanism that propagates a newly written value, by invalidating or updating copies