Published by Stanley King. Modified over 9 years ago.
1
Cache Coherence in Scalable Machines (IV)
2
99-6-7
Dealing with Correctness Issues
- Serialization of operations
- Deadlock
- Livelock
- Starvation
3
Serialization of Operations
- Need a serializing agent
  - home memory is a good candidate, since all misses go there first
- Possible mechanism: FIFO buffering of requests at the home
  - requests wait until previous requests forwarded from the home have returned replies to it
  - but the input buffer problem becomes acute at the home
- Possible solutions:
  - let the input buffer overflow into main memory (MIT Alewife)
4
Serialization of Operations
- don’t buffer at home, but forward to the owner node when the directory is in a busy state (Stanford DASH)
  - serialization is determined by the home when the block is clean, by the owner when it is exclusive
  - if the request cannot be satisfied at the “owner”, e.g. the block was written back or ownership was given up, it is NACKed back to the requestor without being serialized
  - it is serialized when retried
- don’t buffer at home, use the busy state to NACK (Origin)
  - serialization order is that in which requests are accepted (not NACKed)
- maintain the FIFO buffer in a distributed way (SCI)
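The busy-state/NACK approach can be sketched as a toy model. This is only an illustration under stated assumptions, not the Origin protocol itself: the `Home` class, the `run` driver, and the `busy_for` knob (how many later arrivals an accepted transaction outlasts) are hypothetical names introduced here.

```python
from collections import deque


class Home:
    """Hypothetical home node for one block; accepts one request at a time."""

    def __init__(self):
        self.busy = False
        self.accepted = []  # serialization order = order of acceptance

    def request(self, requester):
        if self.busy:
            return "NACK"  # not serialized; the requester must retry
        self.busy = True
        self.accepted.append(requester)
        return "ACCEPT"

    def complete(self):
        self.busy = False  # current transaction finished


def run(arrivals, busy_for=1):
    """Feed requests to the home; NACKed requesters rejoin the back of the line.

    busy_for is an assumed knob: how many later arrivals each accepted
    transaction outlasts before the home leaves its busy state.
    """
    home = Home()
    pending = deque(arrivals)
    nacks = 0
    remaining = 0
    while pending:
        requester = pending.popleft()
        if home.request(requester) == "ACCEPT":
            remaining = busy_for
        else:
            nacks += 1
            pending.append(requester)  # retry later
            remaining -= 1
        if remaining == 0:
            home.complete()
    return home.accepted, nacks
```

With arrivals P1, P2, P3 and the default `busy_for`, P2 is NACKed while P1's transaction is in flight, so P3 is accepted ahead of P2's retry: the serialization order follows acceptance, not arrival.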
5
Serialization to a Location (cont’d)
- Having a single entity determine the order is not enough
- Example:

      P1              P2
      rd A (i)        wr A
      barrier         barrier
      rd A (ii)

- The second read of A should return the value written by P2
- Possible orders:
  - rd A (i) -> wr A -> rd A (ii)
  - wr A -> rd A (i) -> rd A (ii)
6
Serialization to a Location (cont’d)
- Having a single entity determine the order is not enough
  - it may not know when all transactions for that operation are done everywhere

[Figure: message flow among Home, P1, and P2, with arrows numbered 1 to 6]

1. P1 issues a read request to the home node for A.
2. P2 issues a read-exclusive request to the home corresponding to a write of A. But the home won’t process it until it is done with the read.
3. Home receives 1, and in response sends a reply to P1 (and sets the directory presence bit). Home now thinks the read is complete. Unfortunately, the reply does not get to P1 right away.
4. In response to 2, home sends an invalidate to P1; it reaches P1 before transaction 3 (no point-to-point order among requests and replies).
5. P1 receives and applies the invalidate, and sends an ack to home.
6. Home sends the data reply to P2 corresponding to request 2. Finally, transaction 3 (the read reply) reaches P1.
7
Possible solutions
- Solution 1: have read replies themselves be acknowledged
  - the home goes on to process the next request only after it receives this ack
- Solution 2 (SGI Origin): the requestor does not allow another request, such as an invalidation, to be applied to that block until its outstanding request completes
- Solution 3 (DASH): apply the invalidation even before the read reply is received, consider the reply invalid, and retry the read
- The resulting order may differ between Solutions 2 and 3
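The race and the two requestor-side fixes can be replayed in a few lines. This is a rough sketch, assuming a single block A whose old value is 0; the function `p1_receives` and the policy names `"naive"`, `"defer"`, and `"retry"` are labels invented for this illustration and do not come from either machine.

```python
OLD = 0  # value of A before P2's write


def p1_receives(order, policy):
    """Replay messages arriving at P1 while its read of A is outstanding.

    order:  sequence of "reply" (read reply carrying OLD) and "inval"
    policy: "naive", "defer" (Solution 2 style) or "retry" (Solution 3 style)
    Returns (value the read returned, value left in P1's cache, must_retry);
    None stands for "no value" / "invalid".
    """
    cached = None
    read_value = None
    outstanding = True
    deferred = False
    poisoned = False
    must_retry = False
    for msg in order:
        if msg == "inval":
            if outstanding and policy == "defer":
                deferred = True  # hold the invalidation until the reply arrives
            elif outstanding and policy == "retry":
                poisoned = True  # the reply in flight is now considered invalid
            else:
                cached = None  # ordinary invalidation (no-op if nothing cached)
        elif msg == "reply":
            outstanding = False
            if poisoned:
                must_retry = True  # drop the stale reply and reissue the read
            else:
                read_value = OLD
                cached = None if deferred else OLD
    return read_value, cached, must_retry
```

When the invalidation overtakes the reply, the naive policy leaves the old value cached at P1, the deferring policy lets the read return the old value and then drops the block (read ordered before the write), and the retrying policy discards the reply so the reissued read is ordered after the write.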
8
Deadlock
- Two networks are not enough when the protocol is not request-reply
  - additional networks are expensive and underutilized
- Use two, but detect potential deadlock and circumvent it
  - e.g. when the input request and output request buffers fill beyond a threshold, and the request at the head of the input queue is one that generates more requests
  - or when the output request buffer is full and has had no relief for T cycles
- Two major techniques:
  - take requests out of the queue and NACK them, until the one at the head will not generate further requests or the output request queue has eased up (DASH)
  - fall back to strict request-reply (Origin)
    - instead of a NACK, send a reply telling the requestor to request directly from the owner
    - better because NACKs can lead to many retries, and even livelock
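The two detection conditions on this slide can be written as a single predicate. This is a minimal sketch; the function name and all parameter names are hypothetical, and real hardware would evaluate this in the network interface rather than in software.

```python
def potential_deadlock(in_fill, out_fill, threshold, head_generates_requests,
                       out_full, cycles_without_relief, t_cycles):
    """Return True when either heuristic from the slide fires.

    in_fill / out_fill: occupancy of the input and output request buffers
    head_generates_requests: the request at the head of the input queue
        would itself generate further requests
    out_full / cycles_without_relief: the output request buffer is full and
        has stayed full for this many cycles
    """
    both_filling = in_fill > threshold and out_fill > threshold
    stuck_output = out_full and cycles_without_relief >= t_cycles
    return (both_filling and head_generates_requests) or stuck_output
```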
9
Livelock
- Classical problem of two processors trying to write a block
  - Origin solves it with busy states and NACKs
    - the first to get there makes progress, the others are NACKed
- Problem with NACKs
  - useful for resolving race conditions (as above)
  - not so good when used to ease contention in deadlock-prone situations
    - can cause livelock
    - e.g. DASH NACKs may cause all requests to be retried immediately, regenerating the problem continually
    - the DASH implementation avoids this by using a large enough input buffer
- No livelock when backing off to strict request-reply
10
Starvation
- Not a problem with FIFO buffering
  - but FIFO buffering has the problems noted earlier
- NACKs can cause starvation
- Possible solutions:
  - do nothing; starvation shouldn’t happen often (DASH)
  - random delay between request retries
  - priorities (Origin)
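The "random delay between request retries" option might look like the sketch below. The doubling window is an assumption added here (the slide only says the delay is random), and `retry_delays`, `base`, and `cap` are hypothetical names.

```python
import random


def retry_delays(attempts, base=1, cap=64, seed=None):
    """Pick a random delay (in cycles) before each retry of a NACKed request.

    Assumption: the window doubles after every retry, up to cap, so that
    requesters which keep colliding spread out over time.
    """
    rng = random.Random(seed)
    delays = []
    window = base
    for _ in range(attempts):
        delays.append(rng.randint(0, window))  # inclusive on both ends
        window = min(cap, window * 2)
    return delays
```

Because each requester draws its own delays, two nodes that were NACKed together are unlikely to retry in lockstep, which reduces (but does not eliminate) the chance that one of them is starved.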
11
Synchronization
- R10000 load-locked / store-conditional
- Hub provides uncached fetch&op
12
Back-to-back Latencies (unowned)
- measured by pointer chasing, since the R10000 does not stall on a read miss

Satisfied in    Back-to-back latency (ns)    Hops
L1 cache        5.5                          0
L2 cache        56.9                         0
local mem       472                          0
4P mem          690                          1
8P mem          890                          2
16P mem         990                          3
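Pointer chasing makes every load depend on the result of the previous one, so misses cannot overlap and the time per step approximates the back-to-back latency. The sketch below shows only the structure of such a benchmark; the helper names are hypothetical, and Python's interpreter overhead makes its timings unsuitable for measuring real memory latency (a real measurement would use C or assembly over a buffer larger than the cache level of interest).

```python
import random
import time


def single_cycle(n, seed=0):
    """Sattolo's algorithm: a random permutation consisting of one n-cycle."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain; each load's address depends on the previous load."""
    idx = start
    t0 = time.perf_counter()
    for _ in range(steps):
        idx = nxt[idx]
    elapsed = time.perf_counter() - t0
    return idx, elapsed / steps  # final index, average seconds per step
```

Using a single cycle guarantees that the chase touches every element before repeating, so the working-set size is exactly the array size.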
13
Protocol latencies

Home      Owner     Unowned    Clean-Exclusive    Modified
Local     Local     472        707                1,036
Remote    Local     704        930                1,272
Local     Remote    472*       930                1,159
Remote    Remote    704*       917                1,097
14
Application Speedups
15
Summary
- In a directory protocol there is substantial implementation complexity below the logical state diagram
  - directory vs. cache states
  - transient states
  - race conditions
  - conditional actions
  - speculation
- Real systems reflect the interplay of design issues at several levels
- Origin philosophy:
  - memory-less: a node reacts to incoming events using only local state
  - an operation does not hold shared resources while requesting others
16
Hardware/Software Trade-Offs
17
HW/SW Trade-offs
- Potential limitations of directory-based CC systems
  - High waiting time at memory operations
    - SC and HW optimization - SGI Origin2000
    - Relaxing the consistency model
  - Limited capacity for replication
    - caching data in main memory and keeping this data coherent - COMA
  - High design and implementation cost
    - HW solutions - separate CA
    - SW solutions
18
Memory Consistency Models
19
Memory Consistency Model
- A formal specification of memory semantics
  - how the memory system will appear to the programmer
- Sequential consistency
  - greatly restricts the use of many performance optimizations commonly used by uniprocessor hardware and compiler designers
- Relaxed consistency models
  - to alleviate the above problem
20
Memory Consistency Models - Who Should Care?
- The model affects programmability
- The model affects the performance of the system
- The model affects portability, due to a lack of consensus on a single model
- The memory model influences the writing of parallel programs from the programmer’s perspective, and virtually all aspects of designing a parallel system (the processor, memory, interconnect network, compiler, and programming language) from a system designer’s perspective.
21
Memory Semantics in Uniprocessor Systems
- Most high-level uniprocessor languages: simple sequential semantics for memory operations
  - all memory operations occur one at a time in the sequential order specified by the program
- but they allow a wide range of efficient system designs
  - register allocation, code motion, loop transformation
  - pipelining, multiple issue, write buffer bypassing and forwarding, lockup-free caches
22
Sequential Consistency
- Sequential consistency (Lamport)
  - The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program
- simple and intuitive programming model
- disallows many hardware and compiler optimizations that are possible in uniprocessors
- Condition: SC ?
23
Reasoning with Sequential Consistency

initial: A, flag, x, y == 0

    p1                  p2
    (a) A := 1;         (c) x := flag;
    (b) flag := 1;      (d) y := A

- program order: (a) -> (b) and (c) -> (d)
- claim: (x,y) == (1,0) cannot occur
  - x == 1 => (b) -> (c)
  - y == 0 => (d) -> (a)
  - thus, (a) -> (b) -> (c) -> (d) -> (a)
  - so (a) -> (a), a contradiction
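The claim can also be checked by brute force: enumerate every interleaving of the two programs that preserves each processor's program order, and collect the resulting (x, y) pairs. This is a small sketch for this one example; the tuple encoding of the operations and the name `sc_outcomes` are assumptions made here.

```python
from itertools import combinations

P1 = [("store", "A", 1), ("store", "flag", 1)]      # (a), (b)
P2 = [("load", "flag", "x"), ("load", "A", "y")]    # (c), (d)


def sc_outcomes():
    """All (x, y) results over interleavings that keep each program's order."""
    outcomes = set()
    n = len(P1) + len(P2)
    # Choose which positions of the global order belong to P1; the rest go to P2.
    for slots in combinations(range(n), len(P1)):
        it1, it2 = iter(P1), iter(P2)
        mem = {"A": 0, "flag": 0}
        regs = {}
        for pos in range(n):
            kind, loc, arg = next(it1) if pos in slots else next(it2)
            if kind == "store":
                mem[loc] = arg
            else:
                regs[arg] = mem[loc]
        outcomes.add((regs["x"], regs["y"]))
    return outcomes
```

Of the six legal interleavings, none produces (1, 0), which matches the cycle argument above.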
24
Then again, ...
- Many variables are not used to affect the flow of control, but only to share data
  - synchronizing variables
  - non-synchronizing variables

initial: A, flag, x, y == 0

    p1                  p2
    (a) A := 1;         (c) x := flag;
        B := 3.1415
        C := 2.78
    (b) flag := 1;      (d) y := A+B+C
25
Sequential Consistency
- Condition: SC ?
  - maintaining program order among operations from individual processors
  - maintaining a single sequential order among operations from all processors - atomic memory operation

[Figure: processors P1, P2, P3, ...., P4 sharing a single Memory]
26
Lamport’s Requirement for SC
- Each processor issues memory requests in the order specified by its program.
- Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. Issuing a memory request consists of entering the request on this queue.
- Assumes stores execute atomically
  - the newly written value becomes visible to all processors at the same time (when inserted into the FIFO queue)
  - not so with caches and a general interconnect
27
Memory Operations
- Issuing: the request has left the processor environment
- Performing:
  - A LOAD is considered performed at a point in time when the issuing of a STORE to the same address cannot affect the value returned by the LOAD.
  - A STORE on X by processor i is considered performed at a point in time when a subsequently issued LOAD to the same address returns the value defined by a STORE in the sequence {S_i(X)}+
- Atomic: memory accesses are atomic in a system if the value stored by a WRITE op becomes readable at the same time for all processors
28
Memory Operations
- Performing wrt a processor
  - A STORE by processor i is considered performed wrt processor k at a point in time when a subsequently issued LOAD to the same address by processor k returns the value defined by a STORE in the sequence {S_i(X)/k}+
  - A LOAD by processor I.....
- Performing an access globally
  - A STORE is globally performed when it is performed wrt all processors
  - A LOAD is globally performed if it is performed wrt all processors and if the STORE which is the source of the returned value has been globally performed
29
Requirements for SC (Dubois & Scheurich)
- Each processor issues memory requests in the order specified by the program.
- After a store operation is issued, the issuing processor should wait for the store to complete before issuing its next operation. (The STORE is globally performed.)
- After a load operation is issued, the issuing processor should wait for the load to complete, and for the store whose value is being returned by the load to complete, before issuing its next operation. (The LOAD is globally performed.)
- the last point ensures that stores appear atomic to loads
- note: in an invalidation-based protocol, if a processor has a copy of a block in the dirty state, then a store to the block can complete immediately, since no other processor could access an older value
30
Architecture Implications
- need write completion for atomicity and access ordering
  - w/o caches, ack writes
  - w/ caches, ack all invalidates
- atomicity
  - delay access to the new value till all inv. are acked
- access ordering
  - delay each access till the previous one completes

[Figure: nodes, each with P, M, and CA, ...]
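Write completion with caches amounts to counting invalidation acknowledgements. The sketch below is a simplified illustration; `PendingStore` and `may_issue_next` are hypothetical names, and a real implementation would track this state in the cache controller or directory rather than in software.

```python
class PendingStore:
    """A store to a shared block, waiting for acks from all current sharers."""

    def __init__(self, sharers):
        self.awaiting = set(sharers)  # nodes whose invalidation ack is outstanding

    def ack(self, node):
        self.awaiting.discard(node)  # duplicate or unknown acks are ignored

    def globally_performed(self):
        return not self.awaiting  # complete once every sharer has acked


def may_issue_next(store):
    """Access ordering: the next access waits for the previous store to complete."""
    return store.globally_performed()
```

A store to a block with no other sharers (for example, one already held dirty by the writer) starts with an empty set and completes immediately, which corresponds to the dirty-state note in the Dubois and Scheurich requirements.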
31
Implementing Sequential Consistency
- Architectures with caches: three additional issues
  - multiple copies --> cache coherence protocol
  - detecting when a write is complete --> more transactions
  - propagating changes --> non-atomic operation
32
Architectures with Caches: Cache Coherence and SC
- Several definitions for cache coherence
  - a synonym for SC
  - a set of conditions commonly associated with a cache coherence protocol
    1) a write is eventually made visible to all processors
    2) writes to the same location appear to be seen in the same order by all processors (also referred to as serialization of writes to the same location)
- Are the above conditions sufficient for satisfying SC?
33
Architectures with Caches: Cache Coherence and SC
- SC requires
  1) writes to all locations to be seen in the same order by all processors
  2) operations of a single processor appear to execute in the program order
- With this view, a CC protocol can be simply defined as the mechanism that propagates a newly written value
  - invalidating and updating
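One way to see why per-location conditions fall short of SC is to drop the program-order requirement across different locations and observe what becomes possible. The sketch below assumes the two-processor flag example (p1: A := 1; flag := 1 and p2: x := flag; y := A) and models memory as a single copy, so writes to each location are trivially serialized; the encoding and the `outcomes` function are assumptions of this illustration.

```python
from itertools import permutations

OPS = {
    "a": ("store", "A", 1),     # p1: A := 1
    "b": ("store", "flag", 1),  # p1: flag := 1
    "c": ("load", "flag", "x"), # p2: x := flag
    "d": ("load", "A", "y"),    # p2: y := A
}


def outcomes(respect_program_order):
    """Collect (x, y) over all global orders, optionally enforcing a<b and c<d."""
    results = set()
    for order in permutations("abcd"):
        if respect_program_order and not (
            order.index("a") < order.index("b")
            and order.index("c") < order.index("d")
        ):
            continue
        mem = {"A": 0, "flag": 0}
        regs = {}
        for name in order:
            kind, loc, arg = OPS[name]
            if kind == "store":
                mem[loc] = arg
            else:
                regs[arg] = mem[loc]
        results.add((regs["x"], regs["y"]))
    return results
```

With program order enforced, (1, 0) never appears; once operations to different locations may be reordered, it does, even though every write still becomes visible and writes to each location are still seen in one order.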