A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories Also known as "Snoopy cache" Paper by: Mark S. Papamarcos and Janak H. Patel


A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories Also known as "Snoopy cache" Paper by: Mark S. Papamarcos and Janak H. Patel Presented by: Cameron Mott, 3/25/2005

Outline
- Goals
- Examples
- Solutions
- Details on this method
- Results
- Analysis
- Success
- Comments/Questions

Goals
- Reduce bus traffic
- Reduce bus wait
- Increase the possible number of processors before the bus saturates
- Increase processor utilization
- Low cost
- Extensible
- Long useful life for the strategy

Structure The typical layout for a multiprocessor machine: several processors, each with a private cache, connected by a single shared bus to main memory.

Difficulties Bus speed and saturation limit processor utilization (there is a single time-shared bus with an arbitration mechanism). The scheme also suffers from the well-known data-consistency or "cache coherence" problem, which arises when two processors hold the same writable data block in their private caches.

Coherence example Process communication in shared-memory multiprocessors can be implemented by exchanging information through shared variables. This sharing can result in several copies of a shared block in one or more caches at the same time.

Time  Event                    Cache A  Cache B  Memory (X)
0     -                        -        -        1
1     CPU A reads X            1        -        1
2     CPU B reads X            1        1        1
3     CPU A stores 0 into X    0        1        0
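The table above can be reproduced with a toy model (all names are illustrative, not from the paper): two private caches over one shared memory, write-through on stores but no invalidation, so CPU B's cached copy of X goes stale.

```python
# Toy coherence-failure demo: write-through updates memory on a store,
# but nothing tells the other cache, so its copy becomes stale.

memory = {"X": 1}
cache_a, cache_b = {}, {}

def read(cache, addr):
    # Fill the private cache from memory on a miss.
    if addr not in cache:
        cache[addr] = memory[addr]
    return cache[addr]

def write_through(cache, addr, value):
    # Update the local copy and memory, but notify no other cache.
    cache[addr] = value
    memory[addr] = value

read(cache_a, "X")              # time 1: CPU A reads X -> 1
read(cache_b, "X")              # time 2: CPU B reads X -> 1
write_through(cache_a, "X", 0)  # time 3: CPU A stores 0 into X

print(cache_a["X"], cache_b["X"], memory["X"])  # 0 1 0 -- B is stale
```

The final line reproduces row 3 of the table: cache A and memory hold 0, while cache B still holds the stale 1.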

Enforcing Coherence Styles
- Hardware based: use a global table that keeps track of what memory is held and where.
- "Snoopy" cache: no need for centralized hardware; all processors share the same cache bus; each cache "snoops" on (listens to) cache transactions from other processors. Used in centralized shared-memory (CSM) machines built around a bus.

Snoopy caches To solve coherence, each processor can broadcast the address of the block being written in its cache; every other processor that holds that block then invalidates its local copy (a "broadcast invalidate").
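A minimal sketch of broadcast invalidation (again with illustrative names): every write broadcasts the block address, and every other cache purges its copy, so a later read re-fetches the fresh value.

```python
# Broadcast-invalidate sketch: one private cache per processor; a write
# purges the block from every other cache before updating memory.

memory = {"X": 1}
caches = [dict(), dict()]  # one private cache per processor

def read(cpu, addr):
    if addr not in caches[cpu]:
        caches[cpu][addr] = memory[addr]  # miss: fetch from memory
    return caches[cpu][addr]

def write(cpu, addr, value):
    for other, cache in enumerate(caches):
        if other != cpu:
            cache.pop(addr, None)  # broadcast invalidate: purge other copies
    caches[cpu][addr] = value
    memory[addr] = value           # write-through, as in the schemes above

read(0, "X")
read(1, "X")
write(0, "X", 0)
print(read(1, "X"))  # 0 -- CPU 1 misses and re-fetches the fresh value
```

Contrast with the previous demo: the invalidation step is exactly what turns CPU 1's stale hit into a miss.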

Other Snoopy Methods
- Broadcast-invalidate: any write to cache transmits the address throughout the system. Other caches check their directories and purge the block if it exists locally. This requires no extra status bits but consumes a lot of bus time.
- Improvement: introduce a bias filter, a small associative memory that stores the most frequently invalidated blocks.

Goodman’s Strategy Goodman proposed a strategy for multiple-processor systems with independent caches on a shared bus. An invalidate is broadcast only the first time a block is written in cache (hence "write-once"); that write is also written through to main memory. If a cached block is written more than once, it must be written back to memory before being replaced.

Write-Once A combination of write-through and write-back.
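Goodman's scheme can be sketched as a small state machine. The state names (Invalid, Valid, Reserved, Dirty) follow Goodman's paper, but the transition table below is a simplification for one cache line, processor side only; the bus-action labels are mine.

```python
# Write-once sketch: the FIRST write to a Valid line is written through to
# memory and broadcast (Valid -> Reserved, the "write once"); later writes
# stay local (-> Dirty) and are written back only on replacement.

INVALID, VALID, RESERVED, DIRTY = "I", "V", "R", "D"

def on_write(state):
    """Return (next_state, bus_action) for a processor write."""
    if state == INVALID:
        return DIRTY, "read-with-invalidate"         # fetch, invalidate others
    if state == VALID:
        return RESERVED, "write-through+invalidate"  # the one "write once"
    # RESERVED or DIRTY: we already own the block; no bus traffic needed.
    return DIRTY, None

state, action = on_write(VALID)
print(state, action)   # R write-through+invalidate
state, action = on_write(state)
print(state, action)   # D None -- the second write is purely local
```

This is where the "combination of write-through and write-back" shows up: the first write behaves like write-through, all later writes like write-back.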

Example Online example (a Vivio animation of the write-once protocol; see the references). Note that the only browser that displayed this on my computer was IE…

Details Two bits in each block in the cache keep track of the status of that block, giving four states:
1. Invalid = The data in this line is not present or is not valid.
2. Exclusive-Unmodified (Excl-Unmod) = This is an exclusive cache line. The line is coherent with memory and is held unmodified in only one cache. The cache owns the line and can modify it without having to notify the rest of the system. No other cache in the system may have a copy of this line.
3. Shared-Unmodified (Shared-Unmod) = This is a shared cache line. The line is coherent with memory and may be present in several caches. Caches must notify the rest of the system about any changes to this line. Main memory owns this cache line.
4. Exclusive-Modified (Excl-Mod) = There is modified data in this cache line. The line is incoherent with memory, so the cache is said to own the line. No other cache in the system may have a copy of this line.
Other papers discuss MESI caches. How does this fit with Papamarcos and Patel's work? M: Exclusive-Modified, E: Exclusive-Unmodified, S: Shared-Unmodified, I: Invalid.

Details (cont.) Snoopy cache actions:
- Read With Intent to Modify (the "write" cycle): if the address on the bus matches a Shared or Exclusive line, the line is invalidated. If the line is Modified, the cache must cause the bus action to abort, write the modified line back to memory, invalidate the line, and then allow the bus read to retry. Alternatively, the owning cache can supply the line directly to the requestor across the bus.
- Read: if the address on the bus matches a Shared line, there is no change. If the line is Exclusive, the state changes to Shared. If the line is Modified, the cache must cause the bus action to abort, write the modified line back to memory, change the line to Shared, and then allow the bus read to retry. Alternatively, the owning cache can supply the line to the requestor directly and change its state to Shared.
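The snoop side of these bus actions can be sketched the same way (a simplification of the rules above; "flush" stands for the write-back-or-supply-directly alternative the slide describes):

```python
# Snoop-side MESI transitions: given a line's state and a snooped bus
# transaction, return (next_state, response). "flush" means the owning
# cache writes the modified line back or supplies it to the requestor.

def snoop(state, bus_op):
    if bus_op == "ReadWithIntentToModify":   # another CPU is writing
        if state == "M":
            return "I", "flush"   # write back modified data, then invalidate
        if state in ("E", "S"):
            return "I", None      # just invalidate the clean copy
    elif bus_op == "Read":                   # another CPU is reading
        if state == "M":
            return "S", "flush"   # supply/write back data, keep a shared copy
        if state == "E":
            return "S", None      # the line is now shared by two caches
    return state, None            # S on Read, or I: nothing to do

print(snoop("M", "Read"))                    # ('S', 'flush')
print(snoop("E", "ReadWithIntentToModify"))  # ('I', None)
```

Together with the processor-side transitions, this covers both halves of the protocol: what a cache does for its own CPU, and what it does when it snoops another cache's traffic.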

Flow diagrams

On a read miss, another cache can provide the requested block; the status bit then changes to Shared-Unmod. The block is also written back to memory if another cache had an Excl-Mod entry for it; that block's status is then changed to Shared-Unmod after being written back and shared with the other processor. A write causes any other cache to set its corresponding entry to Invalid. If memory provided the block, the status becomes Exclusive-Unmod. No bus signal is necessary if the status is not Shared-Unmod.

Problems What if a block is Shared-Unmodified and two caches attempt to change it at the same time? Depending on the implementation, the bus provides the "sync" mechanism: only one processor can have control of the bus at any one time, which gives a contention mechanism to determine which processor wins. This requires that the operation be indivisible.

Results The results were analyzed using an approximation algorithm. Is this appropriate? Can an approximation be used to justify the algorithm? Accuracy of the approximation: an error rate of less than 5% in certain circumstances.

Parameters

Variable  Value assumed  Description
N         -              Number of processors
a         90%            Processor memory-reference rate (cache requests)
m         5%             Miss ratio
w         20%            Fraction of memory references that are writes
d         50%            Probability that a block in cache has been locally modified ("dirty")
u         30%            Fraction of write requests that reference unmodified blocks
s         5%             Fraction of write requests that reference shared blocks
A         1              Number of cycles required for bus arbitration
T         2              Number of cycles for a block transfer
I         2              Number of cycles for a block invalidate
W         -              Average waiting time per bus request
b         (derived)      Average number of bus requests per unit of useful processor activity
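To get a feel for how these parameters combine, here is a back-of-the-envelope estimate of bus demand per processor. The formula below is my own illustration, not the approximation actually derived in the paper (which uses a more careful queueing model), and it ignores waiting time W.

```python
# Illustrative bus demand per processor cycle, using the table's values.
# NOT the paper's model: misses cost a fetch (plus a dirty write-back with
# probability d), and write hits to shared blocks cost an invalidate.

a, m, w, d, u, s = 0.90, 0.05, 0.20, 0.50, 0.30, 0.05
A, T, I = 1, 2, 2  # arbitration, block transfer, invalidate (cycles)
# u (writes to unmodified blocks) is listed for completeness but is not
# used in this toy formula: such writes are silent in this protocol.

miss_cost = A + T + d * T  # fetch the block; maybe write back a dirty victim
invalidate_cost = A + I    # write hit on a shared block

b = a * (m * miss_cost + (1 - m) * w * s * invalidate_cost)
print(b)  # average bus cycles demanded per processor per cycle
```

Even this crude estimate shows the structure of the analysis: total bus demand grows linearly with N, so utilization flattens once the aggregate demand approaches the bus's capacity, which is the saturation behavior the results section examines.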

Miss Ratio

Miss Ratio (Cont)

Degree of Sharing

Write Back Probability

Block Transfer Time

Cost of Implementing

Note This algorithm and structure have a finite limit on the number of supported processors: diminishing returns in performance appear as the number of processors increases. Thus, this strategy should not be used in systems of roughly 30 processors or more (as an estimate). The exact limit depends on the system parameters, of course, but it is bounded by these factors. For a system with a modest number of processors, this strategy is very effective, and it is in use today.

References
- Mark S. Papamarcos and Janak H. Patel, "A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories."
- Srini Devadas, "Cache Coherence" (slides), ect23/sld001.htm
- Tu Phan, "Dynamic Decentralized Cache Schemes for MIMD Parallel Processors" (slides), tions/Dynamic%20Decentralized%20Cache%20Schemes.ppt
- Hennessy & Patterson, 3rd edition; Mark Smotherman.
- Jeremy Jones, Vivio "Write Once" cache-coherency protocol animation, tm