Lecture 9, ECE/CSC 506, Spring 2007. E. F. Gehringer, based on slides by Yan Solihin.

Lecture 9 Outline
• MESI protocol
• Dragon update-based protocol

Presentation transcript:

Slide 1: Lecture 9 Outline
• MESI protocol
• Dragon update-based protocol
• Impact of protocol optimizations

Slide 2: Lower-Level Protocol Choices
• BusRd observed in M state: what transition to make?
  - Change to S: assume I'll read again soon
    - Good for mostly-read data
    - But what about "migratory" data? Thus:
  - Change to I: assume another processor will write to it (Synapse)
    - Migratory pattern: I read and write, then you read and write, then X reads and writes, ...
  - Sequent Symmetry and MIT Alewife use adaptive protocols

Slide 3: MESI (4-state) Invalidation Protocol
• Problem with the MSI protocol: a Rd, Wr sequence incurs 2 transactions
  - Even when no one is sharing (e.g., a serial program!)
  - BusRd (I → S) followed by BusRdX or BusUpgr (S → M)
  - In general, coherence traffic from serial programs is unacceptable
• Add an exclusive state:
  - Invalid
  - Modified (dirty)
  - Shared (two or more caches may have copies)
  - Exclusive (only this cache has a clean copy, same value as in memory)
• How to decide I → E or I → S?
  - Need to check whether someone else has a copy
  - "Shared" signal on bus: wired-OR line asserted in response to BusRd
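
A minimal Python sketch of the Rd, Wr argument above: it counts the bus transactions for a read miss followed by a write to the same block. The function name `rd_wr_transactions` and its parameters are illustrative assumptions, not part of the lecture. BusUpgr is chosen for the S → M step; BusRdX would give the same count.

```python
def rd_wr_transactions(protocol, others_have_copy=False):
    """Bus transactions for a read miss followed by a write to the same block."""
    if protocol not in ("MSI", "MESI"):
        raise ValueError("protocol must be 'MSI' or 'MESI'")
    transactions = ["BusRd"]  # the read miss always goes to the bus
    # MSI has no E state; MESI loads E only when the shared line is not asserted.
    if protocol == "MESI" and not others_have_copy:
        state = "E"
    else:
        state = "S"
    if state == "S":
        transactions.append("BusUpgr")  # S -> M must invalidate other copies
    # state == "E": the E -> M transition is silent, so nothing is added
    return transactions


print(rd_wr_transactions("MSI"))   # ['BusRd', 'BusUpgr']
print(rd_wr_transactions("MESI"))  # ['BusRd']
```

With `others_have_copy=True`, MESI also needs two transactions, which is why the saving shows up mainly for unshared (for example, serial-program) data.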

Slide 4: MESI: Processor-Initiated Transactions
[State diagram over states M, E, S, I. Edge labels: PrRd/– (read hits), PrWr/– (write hits that need no bus transaction), PrRd/BusRd(~S) (I → E), PrRd/BusRd(S) (I → S), PrWr/BusRdX.]

Slide 5: MESI: Bus-Initiated Transactions
[State diagram over states M, E, S, I. Edge labels: BusRd/–, BusRdX/–, BusRd/Flush, BusRdX/Flush, BusRd/Flush', BusRdX/Flush'.]

Slide 6: MESI State Transition Diagram
[Figure: combined state transition diagram.]
BusRd(S) means the shared line is asserted on a BusRd transaction.

Slide 7: Flush vs. Flush'
• Flush: mandatory
• Flush' happens only when cache-to-cache sharing is used, and only one cache flushes the data

Slide 8: MESI Visualization
[Figure: processors P1, P2, P3, each with a cache and a snooper, share a bus with the memory controller and main memory. Main memory holds X=1.]

Slide 9: MESI Visualization
[Figure: a processor issues rd &X; BusRd is placed on the bus. Memory holds X=1.]

Slide 10: MESI Visualization
[Figure: the requesting cache now holds X=1 in state E. Memory holds X=1.]

Slide 11: MESI Visualization
[Figure: the processor whose cache holds X=1 in E issues wr &X (X=2); the line goes from E to M with value 2. Memory still holds X=1.]
One fewer bus request thanks to the Exclusive state, especially for serial programs.

Slide 12: MESI Visualization
[Figure: one cache holds X=2 in M; another processor issues rd &X and BusRd is placed on the bus. Memory holds X=1.]

Slide 13: MESI Visualization
[Figure: the cache holding X=2 in M flushes the block and goes from M to S; the requesting cache receives the value 2 in state S.]

Slide 14: MESI Visualization
[Figure: memory holds X=2 and two caches hold X=2 in S. One sharer issues wr &X (X=3) and places BusUpgr on the bus; its line goes to M with value 3 and the other copy goes to I.]
Note: BusUpgr instead of BusRdX.

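Slide 15: MESI Visualization
[Figure: memory holds X=2; one cache holds X=2 in I and another holds X=3 in M. The processor with the invalid copy issues rd &X (BusRd); the M-state cache flushes the block and goes from M to S, and the reader receives the value 3 in state S.]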

Slide 16: MESI Visualization
[Figure: memory holds X=3; two caches hold X=3 in S. A processor issues rd &X.]

Slide 17: MESI Visualization
[Figure: memory holds X=3; two caches hold X=3 in S. The third processor issues rd &X (BusRd); a sharing cache supplies the block (Flush') and the third cache loads X=3 in state S.]
Referred to as cache-to-cache transfer in the Illinois MESI protocol.

Slide 18: MESI Example (Cache-to-Cache Transfer)

Proc Action   State P1   State P2   State P3   Bus Action      Data From
R1            E          –          –          BusRd           Mem
W1            M          –          –          –               Own cache
R3            S          –          S          BusRd/Flush     P1 cache
W3            I          –          M          BusRdX          Mem
R1            S          –          S          BusRd/Flush     P3 cache
R3            S          –          S          –               Own cache
R2            S          S          S          BusRd/Flush'    P1/P3 cache*

* Data comes from memory if there is no cache-to-cache transfer (BusRd/–).

Slide 19: MESI Example (Cache-to-Cache Transfer + BusUpgr)

Proc Action   State P1   State P2   State P3   Bus Action      Data From
R1            E          –          –          BusRd           Mem
W1            M          –          –          –               Own cache
R3            S          –          S          BusRd/Flush     P1 cache
W3            I          –          M          BusUpgr         Own cache
R1            S          –          S          BusRd/Flush     P3 cache
R3            S          –          S          –               Own cache
R2            S          S          S          BusRd/Flush'    P1/P3 cache*

* Data comes from memory if there is no cache-to-cache transfer (BusRd/–).
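
The two MESI example tables can be replayed with a small state tracker. This is a sketch under stated assumptions, not the lecture's code: the function `mesi_trace` and its parameters are hypothetical names, a block that was never loaded is shown as "I" rather than "–", and the bus column records only the transaction issued by the requester (the Flush or Flush' response and the data source are not modelled).

```python
def mesi_trace(ops, n_procs=3, use_upgr=True):
    """Replay ops such as "R1" or "W3" on one block; return (op, states, bus) rows."""
    state = {p: "I" for p in range(1, n_procs + 1)}
    rows = []
    for op in ops:
        kind, p = op[0], int(op[1:])
        if kind not in ("R", "W"):
            raise ValueError("operation must start with R or W")
        others = [q for q in state if q != p]
        bus = "-"
        if kind == "R":
            if state[p] == "I":  # read miss
                bus = "BusRd"
                shared = any(state[q] != "I" for q in others)
                for q in others:
                    if state[q] != "I":
                        state[q] = "S"  # M, E and S all end in S after a snooped BusRd
                state[p] = "S" if shared else "E"
            # read hit in M, E or S: no bus transaction, no state change
        else:
            if state[p] == "I":
                bus = "BusRdX"
            elif state[p] == "S":
                bus = "BusUpgr" if use_upgr else "BusRdX"
            # write hit in E or M is silent
            if bus != "-":
                for q in others:
                    state[q] = "I"  # other copies are invalidated
            state[p] = "M"
        rows.append((op, tuple(state[q] for q in sorted(state)), bus))
    return rows


for row in mesi_trace(["R1", "W1", "R3", "W3", "R1", "R3", "R2"]):
    print(row)
```

Passing `use_upgr=False` reproduces the first table, where W3 issues BusRdX; the default reproduces the second, where it issues BusUpgr.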

Slide 20: Lower-Level Protocol Choices
• Who supplies data on a miss when the block is not in M state: memory or cache?
• Original Illinois MESI: cache (cache-to-cache transfer)
  - Assumes cache is faster than memory
  - Not necessarily true
• Adds complexity
  - How does memory know it should supply data? (It must wait for the caches)
  - A selection algorithm is needed if multiple caches have valid data
• Valuable for distributed memory
  - May be cheaper to obtain data from a nearby cache than from distant memory
  - Especially when constructed out of SMP nodes (Stanford DASH)
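
One way to picture the supplier decision on a snooped BusRd is the sketch below. It is only an illustration: `choose_supplier` is a hypothetical name, treating every clean copy (E or S) as an optional Flush' responder is an assumption, and the lowest-name tie-break stands in for whatever selection algorithm a real bus would use.

```python
def choose_supplier(other_states, cache_to_cache=True):
    """Decide who supplies a block on a snooped BusRd.

    other_states maps a cache name to its MESI state for the block,
    covering the caches other than the requester.
    """
    dirty = [c for c, s in other_states.items() if s == "M"]
    if dirty:
        return dirty[0], "Flush"  # mandatory: memory is stale
    clean = sorted(c for c, s in other_states.items() if s in ("E", "S"))
    if cache_to_cache and clean:
        # ASSUMED tie-break: lowest cache name; only one cache flushes
        return clean[0], "Flush'"
    return "memory", None  # no cache-to-cache transfer: memory supplies


print(choose_supplier({"P1": "S", "P3": "S"}))         # ('P1', "Flush'")
print(choose_supplier({"P1": "S", "P3": "S"}, False))  # ('memory', None)
```

The second call shows why memory "must wait for caches": it can only respond once it knows no cache will supply the block.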

Slide 21: Lecture 9 Outline
• MESI protocol
• Dragon update-based protocol
• Impact of protocol optimizations

Slide 22: Dragon Writeback Update Protocol
• Four states
  - Exclusive-clean (E): I and memory have it
  - Shared-clean (Sc): I, others, and maybe memory have it, but I'm not the owner
  - Shared-modified (Sm): I and others have it but memory does not, and I'm the owner
    - Sm and Sc can coexist in different caches, with at most one Sm
  - Modified or dirty (M): I have it and no one else does
  - On replacement: Sc can silently drop, Sm has to flush
• No invalid state
  - If in cache, a block cannot be invalid
  - If not present in cache, it can be viewed as being in a not-present or invalid state
• New processor events: PrRdMiss, PrWrMiss
  - Introduced to specify actions when the block is not present in the cache
• New bus transaction: BusUpd
  - Broadcasts the single word written on the bus; updates other relevant caches

Slide 23: Dragon: Processor-Initiated Transactions
[State diagram over states E, Sc, Sm, M. Edge labels: PrRdMiss/BusRd(~S), PrRdMiss/BusRd(S), PrWrMiss/(BusRd(S);BusUpd), PrRd/–, PrWr/–, PrWr/BusUpd(S), PrWr/BusUpd(~S).]

Slide 24: Dragon: Bus-Initiated Transactions
[State diagram over states E, Sc, Sm, M. Edge labels: BusRd/–, BusRd/Flush, BusUpd/Update.]

Slide 25: Dragon State Transition Diagram
[Combined state diagram over states E, Sc, Sm, M. Edge labels: PrRd/–, PrWr/–, PrRdMiss/BusRd(S), PrWrMiss/BusRd(S), PrWrMiss/(BusRd(S); BusUpd), PrWr/BusUpd(S), BusRd/–, BusRd/Flush, BusUpd/Update.]

Slide 26: Dragon Visualization
[Figure: processors P1, P2, P3, each with a cache and a snooper, share a bus with the memory controller and main memory. Main memory holds X=1.]

Slide 27: Dragon Visualization
[Figure: a processor issues rd &X; BusRd is placed on the bus. Memory holds X=1.]

Slide 28: Dragon Visualization
[Figure: the requesting cache now holds X=1 in state E. Memory holds X=1.]

Slide 29: Dragon Visualization
[Figure: the processor whose cache holds X=1 in E issues wr &X (X=2); the line goes from E to M with value 2. Memory still holds X=1.]
One fewer bus request thanks to the Exclusive state, especially for serial programs.

Slide 30: Dragon Visualization
[Figure: one cache holds X=2 in M; another processor issues rd &X and BusRd is placed on the bus. Memory holds X=1.]

Slide 31: Dragon Visualization
[Figure: the cache that held X=2 in M goes to Sm; the requesting cache loads the block in Sc. Memory still holds X=1.]

Slide 32: Dragon Visualization
[Figure: memory holds X=1; one cache holds X=2 in Sm and another holds X=2 in Sc. A processor issues wr &X (X=3) and places BusUpd on the bus; afterwards one copy holds 3 in Sm and the other holds 3 in Sc.]
Note: BusUpd instead of BusUpgr (no invalidation is performed).

Slide 33: Dragon Visualization
[Figure: memory holds X=1; one cache holds X=3 in Sc and another holds X=3 in Sm. A processor issues rd &X.]
This read would be a miss in the MESI and MSI protocols.

Slide 34: Dragon Visualization
[Figure: memory holds X=1; one cache holds X=3 in Sc and another holds X=3 in Sm. A processor issues rd &X.]

Slide 35: Dragon Visualization
[Figure: memory holds X=1; two caches hold X=3, one in Sc and one in Sm. The third processor issues rd &X (BusRd) and its cache loads X=3 in Sc.]
Note: only the cache in state Sm is responsible for the cache-to-cache transfer.

Slide 36: Dragon Visualization
[Figure: memory holds X=1; the three caches hold X=3 in states Sc, Sm, and Sc.]
P1 replaces X.

Slide 37: Dragon Visualization
[Figure: memory holds X=1; the caches hold X=3 in states Sc, Sm, and Sc. The value 3 is written back to memory.]
P3 replaces X. The owner is responsible for writing the block back to memory, versus MSI or MESI, where a write-back happens only when the line is in M state.
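
The replacement rule can be written as a small lookup, sketched here in Python. The table name `WRITE_BACK_ON_EVICT` and the function `must_write_back` are illustrative assumptions; the Dragon entry includes M as well as Sm because a modified block is dirty with respect to memory.

```python
# States whose eviction requires a write-back to memory, per protocol.
WRITE_BACK_ON_EVICT = {
    "MSI": {"M"},
    "MESI": {"M"},
    "Dragon": {"M", "Sm"},  # the owner (Sm) or sole dirty holder (M) must flush
}


def must_write_back(protocol, state):
    """Return True if evicting a block in `state` requires a write-back."""
    return state in WRITE_BACK_ON_EVICT[protocol]


print(must_write_back("Dragon", "Sc"))  # False: Sc can silently drop
print(must_write_back("Dragon", "Sm"))  # True: the owner has to flush
```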

Slide 38: Dragon Example

Proc Action   State P1   State P2   State P3   Bus Action      Data From
R1            E          –          –          BusRd           Mem
W1            M          –          –          –               Own cache
R3            Sm         –          Sc         BusRd/Flush     P1 cache
W3            Sc         –          Sm         BusUpd/Upd      Own cache
R1            Sc         –          Sm         –               Own cache
R3            Sc         –          Sm         –               Own cache
R2            Sc         Sc         Sm         BusRd/Flush     P3 cache
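
The Dragon example can be replayed the same way as the MESI one. This is a sketch under stated assumptions: `dragon_trace` is a hypothetical name, "-" stands for a block that is not present, and the bus column lists only the transactions issued by the requester (Flush and Update responses are not modelled).

```python
def dragon_trace(ops, n_procs=3):
    """Replay ops such as "R1" or "W3" on one block; return (op, states, bus) rows."""
    state = {p: "-" for p in range(1, n_procs + 1)}  # "-" means not present
    rows = []

    def snoop_bus_rd(others):
        for q in others:
            if state[q] == "E":
                state[q] = "Sc"
            elif state[q] == "M":
                state[q] = "Sm"  # the owner flushes and keeps ownership

    def snoop_bus_upd(others):
        for q in others:
            if state[q] == "Sm":
                state[q] = "Sc"  # ownership moves to the writer

    for op in ops:
        kind, p = op[0], int(op[1:])
        if kind not in ("R", "W"):
            raise ValueError("operation must start with R or W")
        others = [q for q in state if q != p]
        shared = any(state[q] != "-" for q in others)  # shared line on the bus
        bus = []
        if state[p] == "-":  # PrRdMiss or PrWrMiss: fetch the block first
            bus.append("BusRd")
            snoop_bus_rd(others)
            state[p] = "Sc" if shared else "E"
        if kind == "W":
            if state[p] in ("Sc", "Sm"):
                bus.append("BusUpd")  # broadcast the written word
                snoop_bus_upd(others)
                state[p] = "Sm" if shared else "M"
            else:
                state[p] = "M"  # write in E or M needs no bus transaction
        rows.append((op, tuple(state[q] for q in sorted(state)), "+".join(bus) or "-"))
    return rows


for row in dragon_trace(["R1", "W1", "R3", "W3", "R1", "R3", "R2"]):
    print(row)
```

Unlike the MESI replay, the R1 after W3 is a hit here, because the update kept P1's copy valid.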

Slide 39: Lower-Level Protocol Choices
• Can the shared-modified state be eliminated?
  - Yes, if memory is updated as well on BusUpd transactions (DEC Firefly)
  - The Dragon protocol doesn't do this (it assumes DRAM memory is slow to update)
• Should replacement of an Sc block be broadcast?
  - Would allow the last copy to go to Exclusive state and not generate updates
  - The replacement bus transaction is not in the critical path; a later update may be
• Shouldn't update the local copy on a write hit before the controller gets the bus
  - Can mess up serialization
• Coherence and consistency considerations are much like the write-through case
• In general, there are many subtle race conditions in protocols
• But first, let's illustrate quantitative assessment at the logical level

Slide 40: Lecture 9 Outline
• MESI protocol
• Dragon update-based protocol
• Impact of protocol optimizations

Slide 41: Assessing Protocol Tradeoffs
• Methodology:
  - Use a simulator; choose parameters per the earlier methodology (default: 1 MB, 4-way cache, 64-byte block, 16 processors; 64 KB cache for some)
  - Focus on frequencies, not end performance, for now
    - Transcends architectural details, but is not what we're really after
  - Use an idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
    - Cheap simulation: no need to model contention

Slide 42: Impact of Protocol Optimizations
[Figure: chart comparing MESI vs. MSI (w/ BusUpgr) vs. MSI (w/ BusRdX).]
• MSI = MESI
• Upgrades instead of read-exclusive helps
• Same story when working sets don't fit for Ocean, Radix, Raytrace

Slide 43: Impact of Cache-Block Size
• Multiprocessors add a new kind of miss to cold, capacity, and conflict misses
  - Coherence misses: due to invalidations
    - True sharing: write to the same word
    - False sharing: write to different words
• Reducing misses architecturally in an invalidation protocol
  - Capacity: enlarge cache; increase block size (if spatial locality)
  - Conflict: increase associativity
  - Cold and coherence: only block size
• Increasing block size has advantages and disadvantages
  - Can reduce misses if spatial locality is good
  - Can hurt too:
    - Increases misses due to false sharing if spatial locality is not good
    - Increases misses due to conflicts in a fixed-size cache
    - Increases traffic due to fetching unnecessary data and due to false sharing
    - Can increase miss penalty and perhaps hit cost
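
The true-versus-false sharing distinction depends only on block and word boundaries, as in this Python sketch. The function `classify_sharing`, the 4-byte word size, and the example addresses are assumptions made for illustration.

```python
def classify_sharing(write_addr, read_addr, block_size=64, word_size=4):
    """Classify the miss a reader takes after another processor's write."""
    if write_addr // block_size != read_addr // block_size:
        return "no coherence miss"  # different blocks: the write invalidates nothing here
    if write_addr // word_size == read_addr // word_size:
        return "true sharing"       # same word: the reader needs the new value
    return "false sharing"          # same block, different word


print(classify_sharing(0x100, 0x100))                # true sharing
print(classify_sharing(0x100, 0x104))                # false sharing
print(classify_sharing(0x100, 0x104, block_size=4))  # no coherence miss
```

The last call shows the block-size effect: with a block as small as one word, the same pair of accesses no longer interferes.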

Slide 44: Impact of Block Size on Miss Rate
• For the default problem size: vary the block/line size in bytes
  - Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)
  - Increases with larger lines: false sharing
• Working set doesn't fit: impact of capacity misses is large (Ocean, Radix)

Slide 45: Impact of Block Size on Traffic
• Results differ from those for miss rate: traffic almost always increases
• When working sets fit, overall traffic is still small, except for Radix
• Fixed overhead is a significant component
  - So total traffic is often minimized at a block size above the smallest
• Working set doesn't fit: even 128-byte blocks are good for Ocean due to capacity
• Address bus traffic behaves in the opposite way to data bus traffic
• Traffic (bytes/inst) affects performance indirectly through contention
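
The role of the fixed overhead can be seen with a little arithmetic, sketched in Python below. The function `traffic_per_instruction` is a hypothetical helper, and every number in the example (miss rates, block sizes, an 8-byte per-transaction overhead) is made up purely to show the calculation; none of it is measured data from the lecture.

```python
def traffic_per_instruction(misses_per_instr, block_bytes, overhead_bytes):
    """Bus traffic in bytes per instruction: each miss moves a block plus a fixed overhead."""
    return misses_per_instr * (block_bytes + overhead_bytes)


# HYPOTHETICAL inputs: halving the block size doubles the miss rate.
small_block = traffic_per_instruction(0.02, block_bytes=8, overhead_bytes=8)
large_block = traffic_per_instruction(0.01, block_bytes=16, overhead_bytes=8)

# The smaller block moves the same data bytes but pays the overhead twice as often.
print(small_block > large_block)  # True
```

With zero overhead the two cases would tie, so in this made-up example it is the fixed per-transaction cost alone that penalizes the smaller block.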