A Study on Snoop-Based Cache Coherence Protocols

A Study on Snoop-Based Cache Coherence Protocols
Linda Bigelow, Veynu Narasiman, Aater Suleman

Shared Memory Machine
(Diagram: processors P1 … PN, each with its own cache ($), all connected to a shared MEMORY.)

Invalidation-Based Protocols: The MSI Protocol
Possible bus operations: Bus Read (BR), Get Permission (GP), Read With Intent To Modify (RWITM), Write-Back (WB).
(State diagram, with PR/PW = processor read/write and an S prefix marking snooped bus events:
- I → S on a processor read (PR / BR)
- I → M on a processor write (PW / RWITM)
- S → M on a processor write (PW / GP)
- S → I on a snooped GP or RWITM (SGP, SRWITM / -)
- M → S on a snooped Bus Read (SBR / WB)
- M → I on a snooped RWITM (SRWITM / WB))
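
The transitions above fit in a few lines of code; the following C++ sketch (our illustration, not taken from the paper or from M5) drives one cache block through the protocol using the slide's event names.

```cpp
// Minimal MSI state machine for one cache block (illustrative sketch).
// PR/PW = processor read/write; SBR/SGP/SRWITM = snooped BR/GP/RWITM.
#include <cstdio>

enum State { I, S, M };
enum Event { PR, PW, SBR, SGP, SRWITM };

// Returns the next state; prints any bus action the transition requires.
State msi_next(State st, Event ev) {
    switch (st) {
    case I:
        if (ev == PR) { std::puts("issue BR");    return S; }
        if (ev == PW) { std::puts("issue RWITM"); return M; }
        return I;                                   // snoops miss in I
    case S:
        if (ev == PW) { std::puts("issue GP"); return M; }
        if (ev == SGP || ev == SRWITM) return I;    // invalidated by a remote writer
        return S;                                   // PR hit or SBR: no change
    case M:
        if (ev == SBR)    { std::puts("write back (WB)"); return S; }
        if (ev == SRWITM) { std::puts("write back (WB)"); return I; }
        return M;                                   // PR/PW hit silently
    }
    return st;
}

int main() {
    State st = I;
    Event trace[] = { PR, PW, SRWITM };  // read miss, write upgrade, remote RWITM
    for (Event e : trace) st = msi_next(st, e);
    std::printf("final state: %d (I)\n", st);
}
```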

Example Invalidation-Based Protocols: Berkeley-Ownership (MOSI)
New state: Owner.
On a snooped Bus Read in the Modified state:
- Supply the data to the requesting processor
- Transfer to the Owner state
- Do NOT write back
The Owner must supply the data for all future Bus Reads. On eviction, the block must be written back if it is in the Modified or Owner state.
Advantages: avoids going to memory for the data; potentially fewer write backs.
Disadvantage: the Owner may be kept busy constantly supplying data.
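
As a rough sketch (ours; supply_data() is a made-up helper standing in for driving the data lines and inhibiting memory), the Owner state only changes the snoop-read and eviction paths:

```cpp
// Berkeley-Ownership (MOSI) snoop handling, sketched.  In Modified or Owner
// the cache supplies the data itself and memory is NOT updated; a write back
// happens only when an M or O block is evicted.
#include <cstdio>

enum State { I, S, O, M };

// Hypothetical stand-in for driving the data bus and inhibiting memory.
void supply_data() { std::puts("cache supplies data, memory inhibited"); }

State on_snoop_bus_read(State st) {
    switch (st) {
    case M: supply_data(); return O;   // become Owner, no write back
    case O: supply_data(); return O;   // Owner keeps serving future Bus Reads
    default: return st;                // S stays S, I stays I
    }
}

State on_evict(State st) {
    if (st == M || st == O) std::puts("write back dirty block");
    return I;
}

int main() {
    State st = M;
    st = on_snoop_bus_read(st);   // M -> O, data supplied from the cache
    st = on_snoop_bus_read(st);   // O -> O, supplied again
    st = on_evict(st);            // only now does the dirty data reach memory
}
```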

Example Invalidation-Based Protocols: Illinois (MESI)
New state: Exclusive.
On a read miss: transfer to Exclusive if the block is private (no other cache holds it); otherwise, transfer to Shared.
Once in Exclusive, a processor write can go through without any bus transaction.
Advantage: fewer Get Permissions (less bus traffic).
Disadvantage: increased bus complexity (a shared line is required).
Question: Can regular MSI outperform MESI?
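
A small illustrative sketch (assumed function names, not the protocol's actual implementation) of the two places where the Exclusive state matters: sampling the shared line on a read miss, and the silent E-to-M upgrade on a write hit.

```cpp
// Illinois/MESI sketch: the shared line tells a read miss whether any other
// cache holds the block; Exclusive lets a later write skip the GP transaction.
#include <cstdio>

enum State { I, S, E, M };

// Read miss: the BR goes on the bus; other caches assert the shared line if
// they hold a copy, and the BIU samples it.
State on_read_miss(bool shared_line_asserted) {
    std::puts("issue BR");
    return shared_line_asserted ? S : E;   // private copy -> Exclusive
}

// Processor write to a block already present in the cache.
State on_write_hit(State st) {
    switch (st) {
    case E: return M;                             // silent upgrade, no bus traffic
    case S: std::puts("issue GP"); return M;      // must still invalidate others
    case M: return M;
    default: return st;                           // I would be a write miss instead
    }
}

int main() {
    State st = on_read_miss(/*shared_line_asserted=*/false);  // -> E
    st = on_write_hit(st);                                    // E -> M, no GP
    std::printf("state after write: %d (M)\n", st);
}
```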

Example Invalidation-Based Protocols: MOESI
Two new states: Exclusive and Owner.
Gets both the advantages and the disadvantages of MESI and MOSI.
Additional disadvantage: an extra bit is required to keep track of the state.

Update-Based Protocols
Replace Get Permission with a Bus Update.
On snooping a Bus Update: grab the data off the bus and update your copy; remain in the Shared state (do not invalidate!).
Examples:
- Firefly: uses write-throughs instead of Bus Updates
- Dragon: uses Bus Updates; requires a new state, Shared Dirty
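
A minimal sketch of the update-side snoop handler (ours; it assumes word-sized updates and a simple three-state line rather than the full Firefly or Dragon state set):

```cpp
// Update-based snoop sketch: on a snooped Bus Update the cache overwrites its
// copy with the value on the bus and stays in Shared -- nothing is invalidated.
#include <cstdint>
#include <cstdio>

enum State { INVALID, SHARED, MODIFIED };

struct Line {
    State    state = INVALID;
    uint32_t data  = 0;
};

// Called when the BIU snoops a Bus Update that hits this line.
void on_snoop_bus_update(Line& line, uint32_t new_data) {
    if (line.state == INVALID) return;   // we do not hold the block
    line.data  = new_data;               // grab the data off the bus
    line.state = SHARED;                 // remain readable; no later invalidation miss
}

int main() {
    Line line{SHARED, 42};
    on_snoop_bus_update(line, 99);       // a remote write propagates the new value
    std::printf("state=%d data=%u\n", line.state, line.data);   // SHARED, 99
}
```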

Update or Invalidate?
Invalidation-based protocols are good for sequential sharing, but suffer from invalidation misses; the problem gets worse as block size increases.
Update-based protocols are good when reads and writes are interleaved among many different processors, but suffer from unnecessary updates; the problem gets exacerbated as cache size increases.

Improving Invalidation-Based Protocols: Read Broadcasting
Aims to reduce invalidation misses. On snooping a Bus Read while in Invalid: grab the data off the bus and transition to Shared.
Scenario: many processors hold a block in Shared and one writes. The writing processor issues a GP and goes to Modified; all the others go to Invalid (the tag still stays the same). When one invalidated processor later reads, all the other invalidated processors snoop that read, grab the data, and transition to Shared, so their own subsequent reads are hits (no Bus Read required).
Reduces invalidation misses, but increases processor lockout.
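
A sketch of the read-broadcast fill (our illustration; the essential parts are that the tag is kept after invalidation and that an Invalid line with a matching tag refills itself from the bus):

```cpp
// Read-broadcast sketch: a cache sitting in Invalid (tag still matching)
// snoops another processor's Bus Read, grabs the data as it goes by, and
// moves to Shared -- turning its own later read into a hit instead of an
// invalidation miss.
#include <cstdint>
#include <cstdio>

enum State { INVALID, SHARED, MODIFIED };

struct Line {
    State    state = INVALID;
    uint32_t tag   = 0;      // the tag is kept even after an invalidation
    uint32_t data  = 0;
};

void on_snoop_bus_read(Line& line, uint32_t addr_tag, uint32_t bus_data) {
    if (line.state == INVALID && line.tag == addr_tag) {
        line.data  = bus_data;    // costs a tag lookup + fill (processor lockout)
        line.state = SHARED;
    }
}

int main() {
    Line line{INVALID, 0x7a, 0};          // invalidated earlier by a remote GP
    on_snoop_bus_read(line, 0x7a, 1234);  // another core's BR refills us too
    std::printf("state=%d data=%u\n", line.state, line.data);  // SHARED, 1234
}
```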

Improving Update-Based Protocols: Hybrid Protocols
Competitive Snooping:
- Each cache block has a counter associated with it, initialized to a threshold value when the block is loaded.
- Decrement the counter when you snoop an update; set it back to the threshold on a local read or write.
- If the counter reaches zero, invalidate the block.
- The writing processor can detect when everyone else has invalidated, and transitions to Modified.
Archibald protocol: do not invalidate as soon as a counter reaches zero; wait until all counters reach zero.
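
A per-block counter sketch of competitive snooping (ours; the threshold value of 4 is an arbitrary choice for illustration):

```cpp
// Competitive-snooping sketch: each block carries a small counter.  Local
// references reset it to the threshold; every snooped update decrements it;
// at zero the copy is invalidated, so a block that is updated remotely but
// never read locally stops generating update traffic.
#include <cstdio>

constexpr int kThreshold = 4;     // tuning parameter (assumed value)

struct Line {
    bool sharing = true;          // still holding an updatable copy?
    int  counter = kThreshold;
};

void on_local_access(Line& line) { line.counter = kThreshold; }

void on_snoop_update(Line& line) {
    if (!line.sharing) return;
    if (--line.counter == 0) {
        line.sharing = false;     // invalidate; the writer can then move to Modified
        std::puts("block invalidated: updates are no longer useful here");
    }
}

int main() {
    Line line;
    for (int i = 0; i < kThreshold; ++i)
        on_snoop_update(line);    // remote writes with no local reads in between
    std::printf("still sharing? %s\n", line.sharing ? "yes" : "no");   // no
}
```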

Bus Interface Unit (BIU) Overview
Provides communication between the processor and the external world via the system bus.
Responsibilities:
- Interfacing with the bus: arbitrating for the bus, driving address and control lines, supplying/receiving data, controlling the flow of transactions
- Request buffer to hold data that the processor needs to put on the bus
- Response buffer to hold data that memory sends back to the processor
- Snooping the bus: tag look-up, state update, and the appropriate response (assert the shared line, write back, update, etc.)
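
A rough structural sketch of such a BIU (types and field names are ours, not from any real BIU or from M5): a request buffer, a response buffer, and a snoop path that consults the tag store and picks a response.

```cpp
// BIU sketch: buffers toward the bus plus an MSI-style snoop path.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

enum class State { Invalid, Shared, Modified };
enum class SnoopAction { None, AssertShared, WriteBackAndSupply };

struct BusInterfaceUnit {
    uint32_t request_buffer  = 0;   // block address the processor wants on the bus
    uint32_t response_buffer = 0;   // block address whose data memory has returned
    std::unordered_map<uint32_t, State> tags;   // tag/state store (shared with the
                                                // cache controller in real hardware)

    // Snoop path: tag look-up, state update, appropriate bus response.
    SnoopAction snoop(uint32_t addr, bool rwitm) {
        auto it = tags.find(addr);
        if (it == tags.end() || it->second == State::Invalid)
            return SnoopAction::None;
        if (it->second == State::Modified) {
            it->second = rwitm ? State::Invalid : State::Shared;
            return SnoopAction::WriteBackAndSupply;
        }
        if (rwitm) { it->second = State::Invalid; return SnoopAction::None; }
        return SnoopAction::AssertShared;       // Shared copy, plain Bus Read
    }
};

int main() {
    BusInterfaceUnit biu;
    biu.tags[0x40] = State::Modified;
    SnoopAction a = biu.snoop(0x40, /*rwitm=*/false);
    std::printf("action=%d new state=%d\n", (int)a, (int)biu.tags[0x40]);
}
```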

Simple BIU: Single-Level Cache, Single-Ported Tag Store
(Diagram: the cache controller and the BIU share one tags & state store and the cache data; the BIU contains a request buffer and a response buffer driving the bus command/address lines.)

Tag Store Access Problem
Both the cache controller (for the processor) and the BIU (for snooping) need the single-ported tag store: who gets priority?
(Same diagram as the simple BIU.)

Processor Lockout
(Diagram: snoop-side tag-store accesses by the BIU block the processor's own cache accesses.)

BIU Lockout
(Diagram: processor-side tag-store accesses block the BIU from snooping the bus.)

Duplicate Tag Store
(Diagram: separate copies of the tags & state, one for the snoop path and one for the processor, so the BIU and the cache controller no longer contend.)

Multilevel Cache Hierarchy
The BIU looks up tags in the L2 tag store; the cache controller looks up tags in the L1 tag store. L2 must be inclusive of L1.
- L2 acts as a filter for L1 on bus transactions: add a bit to indicate whether or not the block is also in L1 (reduces processor lockout).
- L1 acts as a filter for L2 on processor requests: write through L1, or add a bit to indicate that a block in L2 is modified-but-stale (reduces BIU lockout).
Figures taken from Parallel Computer Architecture: A Hardware/Software Approach.
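
A sketch of the L2-side filter bit described in the first point above (field names are ours): the BIU checks only the L2 tags, and forwards a snoop up to L1 only when the per-block bit says L1 also holds the block.

```cpp
// Inclusion-filter sketch: an extra "also in L1" bit per L2 block decides
// whether a bus snoop must be forwarded to L1 and steal a tag-store cycle
// from the processor.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct L2Entry {
    bool in_l1 = false;   // set when L1 fills the block, cleared on L1 eviction
    bool dirty = false;
};

struct L2TagStore {
    std::unordered_map<uint32_t, L2Entry> entries;

    // Returns true if the snoop must also be sent to the L1 controller.
    bool snoop_needs_l1(uint32_t block_addr) const {
        auto it = entries.find(block_addr);
        return it != entries.end() && it->second.in_l1;
    }
};

int main() {
    L2TagStore l2;
    l2.entries[0x100] = L2Entry{/*in_l1=*/false, /*dirty=*/true};
    l2.entries[0x200] = L2Entry{/*in_l1=*/true,  /*dirty=*/false};
    std::printf("0x100 -> L1? %d\n", (int)l2.snoop_needs_l1(0x100));  // 0: L2 filters it
    std::printf("0x200 -> L1? %d\n", (int)l2.snoop_needs_l1(0x200));  // 1: forward to L1
}
```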

Write-Back Buffer
A write back due to snooping a RWITM may generate two bus transactions: the dirty block is written back to memory, and memory then supplies the block to the requestor. To satisfy the request faster:
- Delay the write back by putting the block in a buffer
- Supply the data to the requestor first
- Write back to memory afterwards
Issues: the BIU needs to snoop against the write-back buffer (as well as the tag store); on a hit in the write-back buffer, it must supply the data and possibly cancel the pending write back.
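
A one-entry sketch of the buffer and its snoop check (ours; cancelling only when the requestor will modify the block is one reasonable reading of "possibly cancel", not something the slides specify):

```cpp
// Write-back-buffer sketch: a dirty block evicted by a snooped RWITM sits in
// a small buffer so the requestor can be served first; later snoops must also
// check this buffer, and a hit supplies the data and may cancel the pending
// write back.
#include <cstdint>
#include <cstdio>
#include <optional>

struct PendingWriteBack {
    uint32_t addr;
    uint8_t  data[64];
};

struct WriteBackBuffer {
    std::optional<PendingWriteBack> slot;   // one entry, for simplicity

    // Snoop check: returns true if the buffer supplied the data.
    bool snoop(uint32_t addr, bool requestor_will_modify) {
        if (!slot || slot->addr != addr) return false;
        std::printf("supply block 0x%x from write-back buffer\n", (unsigned)addr);
        if (requestor_will_modify)
            slot.reset();   // requestor will write back its own copy later
        return true;
    }
};

int main() {
    WriteBackBuffer wbb;
    wbb.slot = PendingWriteBack{0x80, {}};
    wbb.snoop(0x80, /*requestor_will_modify=*/true);
    std::printf("write back still pending? %s\n", wbb.slot ? "yes" : "no");  // no
}
```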

Cache-to-Cache Transfers
It is faster to transfer data between two caches than between a cache and memory.
What does the BIU do? It snoops the request, indicates whether its cache can supply the data, and indicates whether the data is in a modified state.
Which cache supplies the data if multiple caches have it? Either a predetermined priority, or all caches put the same value on the bus at the same time.
What about memory? It should be inhibited from supplying the data, and may need to be written to if the data is dirty.
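
A sketch of the fixed-priority option (the first of the two choices above; ours). The dirty case, where memory must additionally be updated or ownership retained, is left out.

```cpp
// Cache-to-cache sketch: every BIU reports whether it can supply the block and
// whether its copy is modified; a fixed priority picks one supplier and memory
// is inhibited whenever any cache can provide the data.
#include <cstdio>
#include <vector>

struct SnoopReply {
    bool has_block;
    bool modified;
};

// Returns the index of the supplying cache, or -1 if memory must respond.
int pick_supplier(const std::vector<SnoopReply>& replies, bool& inhibit_memory) {
    inhibit_memory = false;
    for (size_t i = 0; i < replies.size(); ++i) {   // lowest index = highest priority
        if (replies[i].has_block) {
            inhibit_memory = true;                  // memory must not also drive the bus
            return static_cast<int>(i);
        }
    }
    return -1;
}

int main() {
    // Two caches hold a clean Shared copy; the fixed priority picks cache 1.
    std::vector<SnoopReply> replies = {{false, false}, {true, false}, {true, false}};
    bool inhibit = false;
    int who = pick_supplier(replies, inhibit);
    std::printf("cache %d supplies, memory inhibited=%d\n", who, (int)inhibit);
}
```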

M5 Simulator
Simple processor model: functional CPU model; no IPC statistics generated; faster.
Detailed processor model: cycle-accurate simulator; models an out-of-order processor; ~10-20x slower than the simple model.
Experiments were conducted using both the simple model and the detailed model.

SPLASH-2 Benchmarks
Simulated benchmarks from SPLASH-2, using the PARMACS macros from UPC. Condition variables were not padded. Created reduced data sets for some benchmarks.
Were able to successfully set up and run 7 benchmarks on the simple processor and 5 benchmarks on the detailed processor.

Simulated System
- 2-processor system
- 64 KB L1 data cache: 3-cycle latency, 64 B blocks, 2-way associative, 32 outstanding misses
- 64 KB L1 instruction cache
- 2 MB L2 cache: 10-cycle latency, 32-way associative
- 16-byte-wide bus to memory
- Main memory: 100-cycle latency

Number of Bus Invalidates (GPs)
All of the benchmarks show a reduction when the Exclusive state is added (some more than others).

E to S Transitions
The Exclusive state is only beneficial if the reduction in the number of GPs outweighs the number of E to S transitions.

Write Backs to Memory
Not much difference for Cholesky and FFT; differences in FMM, WaterNsq, and WaterSpa.

Throughput (IPC)
(Chart of IPC results.)

Owner Protocols
Less performance benefit than expected. Reasons:
- Minimal reduction in write backs
- More replacement write backs
- Load-balancing problems
- After the Owner is evicted, data must be fetched from memory

MONOESI
When an Owner is evicted, ownership of the block gets transferred to the Next Owner. This introduces a new state, Next Owner (NO), into the MOESI protocol.
When the Owner is evicted: the Next Owner snoops the write back, transitions to the Owner state, and the memory write back is inhibited.
Overhead: added support for snooping write backs, and two extra bus lines (Owner and Next Owner).
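
A sketch of the one new snoop path MONOESI adds (following the slides' SWB / !Mem transition; code is ours): the Next Owner watches for the Owner's eviction write back and claims ownership instead of letting memory take it.

```cpp
// MONOESI sketch: when the Owner's eviction write back appears on the bus,
// the Next Owner snoops it, promotes itself to Owner, and inhibits the memory
// write -- ownership migrates instead of falling back to memory.
#include <cstdio>

enum State { I, S, NO, O, E, M };

// Snooped Write Back (SWB) of a block this cache tracks as Next Owner.
State on_snoop_write_back(State st, bool& inhibit_memory) {
    inhibit_memory = false;
    if (st == NO) {
        inhibit_memory = true;   // !Mem: the data stays cached, memory is not updated
        return O;                // Next Owner becomes the Owner
    }
    return st;
}

int main() {
    bool inhibit = false;
    State st = on_snoop_write_back(NO, inhibit);
    std::printf("new state=%d (O), inhibit memory=%d\n", st, (int)inhibit);
}
```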

(MOESI state-transition diagram. Legend: !Mem+ = inhibit memory & supply data; shd = shared line.)

(MONOESI state-transition diagram. Legend: !Mem+ = inhibit memory & supply data; !Mem = inhibit memory; shd = shared line; O = owner; NO = next owner.)

Future Work
- Use MONOESI to solve the load-balancing problem in MOESI
- A coherence-aware cache replacement policy
- Use a lower priority for the BIU on a Read Broadcast

Questions?

Example Invalidation-Based Protocols: Goodman's Write Once
GP is replaced by a Write Through. New state: Reserved.
Advantage: may lead to fewer write backs.
Disadvantage: increased memory traffic due to write throughs.