Computer Architecture II 1 Computer architecture II Lecture 8.

Computer Architecture II 1 Computer architecture II Lecture 8

Computer Architecture II 2 Shared Memory Multiprocessors

Computer Architecture II 3 Outline: NOW: Bus based shared memory (chapters 5, 6 Culler) –Cache Coherence –Snooping Cache Coherence Protocols –Consistency –Synchronization LATER: NUMA machines (chapters 7, 8) –Directory based coherency protocols

Computer Architecture II 4 Shared Memory Multiprocessors Symmetric Multiprocessors (SMPs) –Symmetric access to the whole main memory from any processor Dominate the server market Attractive as throughput servers and for parallel programs –Efficient resource sharing (processors, memory) –Uniform access via loads/stores –Automatic data movement and coherent replication in cache

Computer Architecture II 5 Supporting Programming Models –Address translation and protection in hardware (hardware SAS) –Message passing using shared memory buffers can be very high performance since no OS involvement necessary –We focus today on supporting coherent shared address space Multiprogramming Shared address space Message passing Programming models Communication abstraction User/system boundary Compilation or library Operating systems support Communication hardware Physical communication medium Hardware/software boundary

Computer Architecture II 6 Memory Systems in Multiprocessors 80s, UMA, 2-8 processors Actual SMPs,UMA, 20-30 processors Scalable, UMA, memory far Most scalable, NUMA –b, c, d : private caches –b: small scale configurati on –d: scalable machines –a, c: rarely used

Computer Architecture II 7 Compo- nent Access Random Random Random Direct Sequential Capa- 64-1024+8KB-8MB 64MB-2GB 8GB 1TB city, bytes Latency.4-10ns.4-20ns10-50ns 10ms 10ms-10s Block 4 B 64 B64 B 4KB 4KB size Band- SystemSystem10-4000 50MB/s 1MB/s width clock ClockMB/s Raterate-80MB/s Cost/MB High$10$.25 $0.002 $0.01 The Memory Hierarchy CPU CacheMain Memory Disk Memory Tape Memory Some Typical Values: † † As of 2003-4. They go out of date immediately.

Computer Architecture II 8 Caches Caches play a key role in all cases –Address temporal locality –Reduce average data access time to memory Some data in cache, some only in memory –Reduce bandwidth demands placed on shared interconnect Cache types based on write access –Write-through: the value is also written to memory –Write-back: the value is written to memory only at block replacement Cache Memory X:0 X=5 X:0 5 5 WRITE-THROUGH CacheX:0 X=5 X:0 5 WRITE-BACK

Computer Architecture II 9 Write-through vs. write-back Write-though –Every write from every processor goes to shared bus and memory –Easy protocol –Write-through unpopular for SMPs Write-back –absorb most writes as cache hits –Write hits don’t go on bus –But now how do we ensure write propagation and serialization? –Need more sophisticated protocols Cache Memory X:0 X=5 X:0 5 5 WRITE-THROUGH CacheX:0 X=5 X:0 5 WRITE-BACK

Computer Architecture II 10 Caches and Cache Coherence Private processor caches create a problem –Copies of a variable can be present in multiple caches –A write by one processor may not become visible to others Other processors may access stale value in their caches –Cache coherence problem

Computer Architecture II 11 Example Cache Coherence Problem –How can this problem be solved? 1.P 1 reads u 2.P 3 reads u 3.P 3 : u=7 4.P 1 reads u 5.P 2 reads u What do P 1 and P 2 see in steps 4 and 5?

Computer Architecture II 12 A Coherent Memory System: Intuition Reading a location should return latest value written (by any process) Easy in uniprocessors –Except for I/O: coherence between I/O devices and processors –Problem: DMA and the processor write to main memory, but the processor through the cache –Infrequent => software solutions work fine uncacheable operations, uncached loads/store for I/O memory regions, flush pages, pass I/O data through caches The coherence problem is more performance-critical in multiprocessors –Fast hardware solutions desired

Computer Architecture II 13 Problems with the Intuition Recall: –Value returned by read should be last value written But “last” is not well-defined! Even in sequential case: –“last” is defined in terms of program order, not time Order of operations in the machine language presented to processor “Subsequent” defined in analogous way, and well defined In parallel case: –program order defined within a process, but need to make sense of orders across processes Must define a meaningful semantics –the answer involves both “cache coherence” (TODAY) and an appropriate “memory consistency model” (NEXT TIME)

Computer Architecture II 14 Program order Sequential program order – P 1 : 1a->1b – P 2 : 2a->2b Parallel program order: a arbitrary interleaving of sequential orders of P 1 and P 2 –1a->1b->2a->2b –1a->2a->1b->2b –2a->1a->1b->2b –2a->2b->1a->1b P 1 P 2 (1a) A = 1;(2a) print B; (1b) B = 2;(2b) print A;

Computer Architecture II 15 Formal Definition of Coherence Results of a program: values returned by its read operations A memory system is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical serial order of all operations to the given location that is consistent with the results of the execution and in which: 1. operations issued by any particular process occur in the order issued by that process, and 2. the value returned by a read is the value written by the last write to that location in the serial order Two necessary features: –Write propagation: value written must become visible to others –Write serialization: writes to location seen in same order by all if I see w1 after w2, you should not see w2 before w1 no need for read serialization since reads not visible to others

Computer Architecture II 16 Cache Coherence Solutions Software Based: –often used in clusters of workstations or PCs (e.g., “Treadmarks DSM”) –extend virtual memory system to perform more work on page faults send messages to remote machines if necessary Hardware Based: –two most common variations: “snoopy” schemes –rely on broadcast to observe all coherence traffic –well suited for buses and small-scale systems –example: Almost all dual-processor SMPs directory schemes –uses centralized information to avoid broadcast –scales well to larger numbers of processors –example: SGI Origin 2000 Today Later

Computer Architecture II 17 Ex: Cache Coherence Problem Solution INVALIDATE write-through: 1)before completing P 3 :u=7 1)send a signal on the bus to invalidate all the copies of the block cache where u is stored ( cache of P 1 ) 2)Write through the new value to the memory 2)P 1 and P 2 will read 7 from main memory (the copy of P 1 has been invalidated) 1.P 1 reads u 2.P 3 reads u 3.P 3 : u=7 4.P 1 reads u 5.P 2 reads u What do P 1 and P 2 see in steps 4 and 5? invalidate Write-through 7

Computer Architecture II 18 Ex: Cache Coherence Problem Solution UPDATE write-through: 1)before completing P 3 :u=7 1)send a signal on the bus to update all the copies of the block cache where u is stored ( cache of P 1 ) 2)Write through the new value to the memory 2)P 1 reads 7 from its cache (the copy has been updated) and P 2 will read 7 from main memory 1.P 1 reads u 2.P 3 reads u 3.P 3 : u=7 4.P 1 reads u 5.P 2 reads u What do P 1 and P 2 see in steps 4 and 5? update Write-through 7 7

Computer Architecture II 19 Outline: NOW: Bus based shared memory (chapters 5, 6 Culler) –Cache Coherence –Snooping Cache Coherence Protocols –Consistency –Synchronization LATER: NUMA machines (chapters 7, 8) –Directory based coherency protocols

Computer Architecture II 20 Memory Systems in Multiprocessors 80s, UMA, 2-8 processors Actual SMPs,UMA, 20-30 processors Scalable, UMA, memory far away Most scalable, NUMA WE WILL DISCUSS NOW b !

Computer Architecture II 21 Bus-based Cache Coherence Built on top of two fundamentals of uni-processor systems –Bus transactions –State transition diagram in cache Uni-processor bus transaction: –Three phases Arbitration: decide who will get control of the bus command/address data transfer –All devices observe addresses, one is responsible for answering Uni-processor cache states: –A finite-state machine for every block –Write-through: two states: valid, invalid –Write-back: one more state: modified (“dirty”)

Computer Architecture II 22 Snooping-based Coherence Basic Idea –Transactions on bus are visible to all processors –Processors or their representatives can snoop (monitor) bus and take action on relevant events (e.g. change state) Implementing a Protocol –Cache controller receives inputs: Requests from processor Bus requests/responses from snooper –Cache controller actions on input Updates state, responds with data, generates new bus transactions –Protocol is a distributed algorithm: cooperating state machines Set of states, state transition diagram, actions –Granularity of coherence is typically cache block (e.g.: 32 bytes) The same with granularity of allocation in cache and of transfer to/from cache

Computer Architecture II 23 Coherence with Write-through Caches –Key extensions to uniprocessor: snooping, invalidating/updating caches no new states or bus transactions in this case invalidation- versus update-based protocols –Write propagation: even in invalidation case, later reads will see new value invalidation causes miss on later access, and memory up-to-date via write-through

Computer Architecture II 24 Invalidation-based write-through State Transition Diagram –One transition diagram per cache block –Block states V:valid, I:invalid V: block contains a correct copy of main memory I: block is not in the cache –Notation a/b a : event b: action taken on event –Events Processor can Rd/Wr from/to block Bus can Rd/Wr from/to block V I PrRd/- PrWr/BusWr BusWr/- PrRd/BusRd PrWr/BusWProcessor initiated action Snooper initiated action

Computer Architecture II 25 Invalidation-based write-through State Transition Diagram –Write will invalidate all other caches (no local change of state) can have multiple simultaneous readers of block, but write invalidates them –Implementation Hardware state bits associated only with blocks that are in the cache other blocks can be seen as being in invalid (not-present) state in that cache V I PrRd/- PrWr/BusWr BusWr/- PrRd/BusRd PrWr/BusWrProcessor initiated action Snooper initiated action

Computer Architecture II 26 Problems with Write Through High bandwidth requirements –Every write from every processor goes to shared bus and memory –Write-through unpopular for SMPs Write-back caches absorb most writes as cache hits –Write hits don’t go on bus –But now how do we ensure write propagation and serialization? –Need more sophisticated protocols: large design space

Computer Architecture II 27 Write-Back Snoopy Protocols No need to change processor, main memory, cache … –Extend cache controller (not only V and I states) and exploit bus (provides serialization) Dirty state now also indicates exclusive ownership –Exclusive: only cache with a valid copy (main memory may be too) –Owner: responsible for supplying block upon a request for it 2 types of protocols –Invalidation based –Update based

Computer Architecture II 28 Basic MSI Write-back Invalidation Protocol States –Invalid (I) –Shared (S): one or more –Dirty or Modified (M): one only Processor Events: –PrRd (read) –PrWr (write) Bus Transactions –BusRd (read): asks for copy with no intent to modify –BusRdX (read exclusive): asks for copy with intent to modify –BusWB (write back): updates memory Actions –Update state, perform bus transaction, flush value onto bus

Computer Architecture II 29 State Transition Diagram –Replacement and write backs are not shown –Rd/Wr in M and Rd in S state do not cause bus transaction –But: Rd/Wr in I state cause 2 bus transactions –Wr in S state 2 bus transactions And data sent at RdX Can spare the data transfer, because already have latest data: can use upgrade (BusUpgr) instead of BusRdX PrRd/— PrWr/BusRdX BusRd/— PrWr/- S M I BusRdX/Flush BusRdX/— BusRd/Flush PrWr/ BusRdX PrRd/BusRd

Computer Architecture II 30 Example: Write-Back Protocol I/O devices Memory u :5 P1P1 P2P2 P3P3 PrRd U US5 BusRd U PrRd U BusRd U US5 PrWr U 7 UM7 BusRdx U PrRd U US7US7 7 BusRd Flush PrRd/— PrWr/BusRdX BusRd/— PrWr/— S M I BusRdX/Flush BusRdX/— BusRd/Flush PrWr/BusRdX PrRd/BusRd 1.P 1 reads u 2.P 3 reads u 3.P 3 : u=7 4.P 1 reads u 5.P 2 reads u

Computer Architecture II 31 MESI (4-state) Invalidation Protocol Problem with MSI protocol –Reading and modifying data is 2 bus transactions, even if no sharing! even in a sequential program! BusRd (I->S) followed by BusRdX or BusUpgr (S->M) Add exclusive state: not modified block resides only in local cache => write locally without bus transaction States –invalid –exclusive or exclusive-clean (only this cache has copy, but not modified) –shared (two or more caches may have copies) –modified (dirty)

Computer Architecture II 32 MESI State Transition Diagram –Goal: no bus transaction on E When does I go to E and when to S? I -> E on PrRd if no other processor has a copy I -> S otherwise –S additional signal on bus –BusRd(S): on BusRd signal, if one processor holds the block it asserts S (makes it 1) PrWr/- BusRd/Flush PrRd/ BusRdX/Flush PrWr/ BusRdX PrWr/- PrRd/— BusRd/Flush E M I S PrRd/- BusRd(S) BusRdX/Flush BusRd/ Flush PrWr/ BusRdX PrRd/ BusRd(S) )

Computer Architecture II 33 Dragon Write-Back Update Protocol 4 states –Exclusive-clean or exclusive (E): locally and memory have it –Shared clean (Sc): locally, others, and maybe memory, but I’m not owner –Shared modified (Sm): locally and others but not memory, and I’m the owner Sm and Sc can coexist in different caches, with only one Sm –Modified or dirty (D): locally and nowhere else No invalid state –If in cache, cannot be invalid –If not present in cache, can view as being in not-present or invalid state New processor events: PrRdMiss, PrWrMiss –Introduced to specify actions when block not present in cache New bus transaction: BusUpd –Broadcasts single word written on bus; updates caches that hold a copy

Computer Architecture II 34 Dragon State Transition Diagram E Sc Sm M PrWr/— PrRd/— PrRdMiss/BusRd(S) PrWr/— PrWrMiss/(BusRd(S); BusUpd) PrWrMiss/BusRd(S) PrWr/BusUpd(S) PrWr/BusUpd(S) BusRd/— BusRd/Flush PrRd/— BusUpd/Update BusRd/Flush PrWr/BusUpd(S) PrWr/BusUpd(S)

Computer Architecture II 35 Invalidate versus Update Basic question of program behavior –Is a block written by one processor read by others before it is rewritten? Invalidation: –Yes => readers will take a miss –No => multiple writes without additional traffic Update: –Yes => readers will not miss if they had a copy previously single bus transaction to update all copies –No => multiple useless updates, even to dead copies Invalidate or update may be better depending on the application Invalidation protocols much more popular –Some systems provide both, or even hybrid

Computer Architecture II 1 Computer architecture II Lecture 8.

Similar presentations

Presentation on theme: "Computer Architecture II 1 Computer architecture II Lecture 8."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Computer Architecture II 1 Computer architecture II Lecture 8.

Similar presentations

Presentation on theme: "Computer Architecture II 1 Computer architecture II Lecture 8."— Presentation transcript:

Similar presentations

About project

Feedback