CS 152 Computer Architecture and Engineering
Lecture 19: Directory-Based Cache Protocols
Dr. George Michelogiannakis
EECS, University of California at Berkeley
CRD, Lawrence Berkeley National Laboratory
http://inst.eecs.berkeley.edu/~cs152

Administrivia
- PS 5 due on Wednesday
- Wednesday lecture on data centers
- Quiz 5 on Wednesday next week
- Please show up on Monday April 21st (last lecture):
  - Neuromorphic, quantum
  - Parting thoughts that have nothing to do with architecture
  - Class evaluation

Recap: Snoopy Cache Protocols
[Figure: processors M1, M2, M3, each with a snoopy cache, attached to a shared memory bus along with physical memory, DMA, and disks.]
Use a snoopy mechanism to keep all processors' view of memory coherent.

Recap: MESI: An Enhanced MSI Protocol
Increased performance for private data.
- M: Modified Exclusive
- E: Exclusive but unmodified
- S: Shared
- I: Invalid
Each cache line has an address tag plus state bits.
[Figure: MESI state-transition diagram for the cache state in processor P1, with transitions such as write miss; read miss (shared vs. not shared); P1 intent to write; other processor reads (P1 writes back from M); and other processor intent to write (invalidate, with writeback from M).]

Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of:
- Uniprocessor cache miss traffic
- Miss traffic caused by communication, which results in invalidations and subsequent cache misses
Coherence misses (sometimes called communication misses) are cache misses that result from a remote core:
- Read miss: remote core wrote
- Write miss: remote core wrote or read
The 4th C of cache misses, along with Compulsory, Capacity, and Conflict.

Coherence Misses
True sharing misses arise from the communication of data through the cache coherence mechanism:
- Invalidates due to 1st write to a shared line
- Reads by another CPU of a modified line in a different cache
- Miss would still occur if line size were 1 word
False sharing misses arise when a line is invalidated because some word in the line, other than the one being read, is written into:
- Invalidation does not cause a new value to be communicated, but only causes an extra cache miss
- Line is shared, but no word in the line is actually shared → miss would not occur if line size were 1 word
[Figure: layout of a cache line — state, line address, data0, data1, ..., dataN.]
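To make false sharing concrete, here is a minimal pthreads sketch in C (the 64-byte line size and all names are illustrative assumptions, not from the slides): two threads increment disjoint counters, and padding decides whether those counters share a cache line.

```c
/* false_sharing.c -- compile with: cc -O2 -pthread false_sharing.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define ITERS 100000000L

/* Padded: each counter gets its own 64-byte line, so the two threads
   never touch the same line. Remove the padding (e.g., use a bare
   `long c[2]`) and both counters share one line: each write then
   invalidates the other core's copy -- a false sharing miss. */
struct counter { long v; char pad[56]; };
static struct counter c[2];

static void *worker(void *arg) {
    long id = (long)(intptr_t)arg;
    for (long i = 0; i < ITERS; i++)
        c[id].v++;   /* disjoint data; only the line would be shared */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)(intptr_t)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    printf("%ld %ld\n", c[0].v, c[1].v);
    return 0;
}
```

Timing the two variants should show the unpadded version running markedly slower, even though no word is ever logically shared.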

Example: True v. False Sharing v. Hit?
Assume x1 and x2 are in the same cache line. P1 and P2 have both read x1 and x2 before.

Time  P1        P2        True, False, Hit? Why?
1     Write x1            True miss; invalidate x1 in P2
2               Read x2   False miss; x1 irrelevant to P2
3     Write x1            False miss; x1 irrelevant to P2
4               Write x2  True miss; x2 not writeable
5     Read x2             True miss; invalidate x2 in P1

MP Performance, 4 Processors
Commercial workload: OLTP, Decision Support (Database), Search Engine
- Uniprocessor cache misses improve as cache size increases (Instruction, Capacity/Conflict, Compulsory)
- True sharing and false sharing are unchanged going from 1 MB to 8 MB (L3 cache)

MP Performance, 2 MB Cache
Commercial workload: OLTP, Decision Support (Database), Search Engine
- True sharing and false sharing increase going from 1 to 8 CPUs

What if We Had 128 Cores?
[Figure: the same bus-based organization — processors with snoopy caches, physical memory, DMA, and disks on one shared memory bus — scaled up to many more cores.]

Bus Is a Synchronization Point
So far, any message that enters the bus reaches all cores and gets replies before another message can enter the bus (instantaneous actions). Therefore, the bus forces ordering.
[Figure: the bus-based snoopy-cache diagram — M1, M2, M3 with snoopy caches, physical memory, DMA, and disks on the memory bus.]

Scaling Snoopy/Broadcast Coherence
- When any processor gets a miss, it must probe every other cache
- Scaling up to more processors is limited by:
  - Communication bandwidth over the bus
  - Snoop bandwidth into the tags
- Can improve bandwidth by using multiple interleaved buses with interleaved tag banks. E.g., two bits of the address pick which of four buses and four tag banks to use: bits 7:6 of the address pick the bus/tag bank, bits 5:0 pick the byte in a 64-byte line (see the sketch below)
- Buses don't scale to a large number of connections
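A sketch of the interleaving arithmetic above, using the slide's parameters (the function names are illustrative, not from the slides):

```c
#include <stdint.h>

/* Address slicing from the example above: 64-byte lines, four buses
   and four tag banks. Bits 5:0 select the byte within the line;
   bits 7:6 select which bus (and tag bank) services the line. */
static inline unsigned bus_for_addr(uint64_t addr) {
    return (unsigned)((addr >> 6) & 0x3);   /* bits 7:6 */
}

static inline unsigned byte_in_line(uint64_t addr) {
    return (unsigned)(addr & 0x3F);         /* bits 5:0 */
}
```

Consecutive cache lines thus rotate across the four buses, spreading snoop traffic evenly.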

This Scales Better

What if the Bus Does Not Synchronize?
What if a cache broadcasts invalidations to transition to modified (M), and before that completes it receives an invalidation from another core's transition to modified?
[Figure: the MESI state-transition diagram for the cache state in processor P1, repeated from the recap slide.]

What if the Bus Does Not Synchronize? (Time view)
[Figure: a timeline of P1 and P2 with their messages in flight.]

Scalable Approach: Directories
- Can use a point-to-point network for a larger number of nodes, but then we are limited by tag bandwidth when broadcasting snoop requests
- Insight: most snoops fail to find a match!
- Every memory line has associated directory information:
  - Keeps track of copies of cached lines and their states
  - On a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
  - In scalable networks, communication with the directory and the copies is through network transactions
- Many alternatives for organizing directory information

Directory Cache Protocol (Lab 5 Handout)
[Figure: several CPU–cache pairs connected through an interconnection network to multiple directory controllers, each with its own DRAM bank. Each line in a cache has a state field plus a tag; each line in memory has a state field plus a bit-vector directory with one bit per processor.]
- Delay is a fact of life
- Serialization is by the directory, no longer the bus
- SC not guaranteed
Assumptions: reliable network, FIFO message delivery between any given source-destination pair

Vector Directory Bit Mask
[Figure: a directory entry with state, directory bit vector 0110, and data.]
With four cores, a bit vector of 0110 means that cores 2 and 3 have that line in their local cache. Can also keep a list of core IDs instead.
With MESI, how do we know what state each cache has the line in? Think of the case with one sharer. (A sketch of such an entry follows.)
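A minimal C sketch of such an entry, assuming four cores and the slide's bit-vector layout (all names are hypothetical):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical directory entry for a 4-core system: a state field plus
   one sharer bit per core. The slide numbers cores from 1, so the
   vector 0110 in the figure sets the bits for cores 2 and 3. */
typedef struct {
    uint8_t state;    /* home directory state (R, W, TR, TW; see below) */
    uint8_t sharers;  /* one bit per core: set => that core caches the line */
} dir_entry_t;

static inline void add_sharer(dir_entry_t *e, unsigned bit)   { e->sharers |= (uint8_t)(1u << bit); }
static inline void clear_sharer(dir_entry_t *e, unsigned bit) { e->sharers &= (uint8_t)~(1u << bit); }
static inline bool is_sharer(const dir_entry_t *e, unsigned bit) { return (e->sharers >> bit) & 1u; }
```

Note the bit vector alone cannot distinguish a single sharer holding the line in E from one holding it in S; the directory's own state field has to carry that information.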

Cache States
For each cache line, there are 4 possible states (based on MSI):
- C-invalid (= Nothing): The accessed data is not resident in the cache.
- C-shared (= Sh): The accessed data is resident in the cache, and possibly also cached at other sites. The data in memory is valid.
- C-modified (= Ex): The accessed data is exclusively resident in this cache, and has been modified. Memory does not have the most up-to-date data.
- C-transient (= Pending): The accessed data is in a transient state (for example, the site has just issued a protocol request, but has not received the corresponding protocol reply).
Common cases:
- Read miss to private (victim clean or dirty)
- Write miss to private
- Read miss to shared
- Write miss to shared

Network Has No Ordering Guarantees
[Figure: the same directory organization as before — CPU–cache pairs, interconnection network, directory controllers with DRAM banks — where messages may be reordered in the network.]

Home Directory States
For each memory line, there are 4 possible states (different states in the directory than in the caches):
- R(dir): The memory line is shared by the sites specified in dir (dir is a set of sites). The data in memory is valid in this state. If dir is empty (i.e., dir = ε), the memory line is not cached by any site.
- W(id): The memory line is exclusively cached at site id, and has been modified at that site. Memory does not have the most up-to-date data.
- TR(dir): The memory line is in a transient state, waiting for the acknowledgements to the invalidation requests that the home site has issued.
- TW(id): The memory line is in a transient state, waiting for a line exclusively cached at site id (i.e., in C-modified state) to make the memory line at the home site up-to-date.
(A sketch of these states as a C enum follows.)
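These four states can be written down as a small C enum (names are illustrative; the sharer set dir and owner id would live alongside it in the directory entry sketched earlier):

```c
/* Home directory states from the slide above. */
typedef enum {
    DIR_R,   /* shared by the sites in dir; memory valid; empty dir => uncached */
    DIR_W,   /* exclusively cached and modified at site id; memory stale */
    DIR_TR,  /* transient: waiting for invalidation acknowledgements */
    DIR_TW   /* transient: waiting for the writeback from site id */
} dir_state_t;
```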

Read miss, to uncached or shared line
1. Load request at head of CPU->Cache queue.
2. Load misses in cache.
3. Send ShReq message to directory.
4. Message received at directory controller.
5. Access state and directory for line. Line's state is R, with zero or more sharers.
6. Update directory by setting bit for new processor sharer.
7. Send ShRep message with contents of cache line.
8. ShRep arrives at cache.
9. Update cache tag and data and return load data to CPU.
(Delay is a fact of life; serialization is by the directory, no longer the bus; SC not guaranteed.)

Write miss, to read shared line (multiple sharers)
1. Store request at head of CPU->Cache queue.
2. Store misses in cache.
3. Send ExReq message to directory.
4. ExReq message received at directory controller.
5. Access state and directory for line. Line's state is R, with some set of sharers.
6. Send one InvReq message to each sharer.
7. InvReq arrives at cache.
8. Invalidate cache line. Send InvRep to directory.
9. InvRep received. Clear down sharer bit.
10. When no more sharers, send ExRep to cache.
11. ExRep arrives at cache.
12. Update cache tag and data, then store data from CPU.
(Delay is a fact of life; serialization is by the directory, no longer the bus; SC not guaranteed.) A sketch of the directory side of these steps follows.
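Here is a simplified, self-contained C sketch of the directory side of steps 4–10. It is not the Lab 5 code: send_msg() is a hypothetical stub standing in for the interconnection network, and the state names follow the home-directory states defined earlier.

```c
#include <stdio.h>
#include <stdint.h>

#define NCORES 4

enum msg    { MSG_InvReq, MSG_ExRep };
enum dstate { DIR_R, DIR_W, DIR_TR };

typedef struct {
    enum dstate state;
    uint8_t     sharers;           /* bit i set => core i caches the line */
    int         pending_requester; /* who gets ExRep once sharers drain   */
} dir_entry_t;

static void send_msg(int core, enum msg m) {   /* network stand-in */
    printf("send %s to core %d\n", m == MSG_InvReq ? "InvReq" : "ExRep", core);
}

/* Steps 4-6: an ExReq arrives; invalidate every other sharer. */
static void handle_ExReq(dir_entry_t *e, int req) {
    uint8_t others = e->sharers & (uint8_t)~(1u << req);
    if (others == 0) {                   /* nothing to invalidate:       */
        e->state   = DIR_W;              /* grant exclusive access,      */
        e->sharers = (uint8_t)(1u << req);
        send_msg(req, MSG_ExRep);        /* step 10 happens immediately  */
        return;
    }
    e->state   = DIR_TR;                 /* transient: await the InvReps */
    e->sharers = others;                 /* track copies left to kill    */
    e->pending_requester = req;
    for (int c = 0; c < NCORES; c++)
        if (others & (1u << c))
            send_msg(c, MSG_InvReq);     /* step 6 */
}

/* Step 9: an InvRep arrives; clear that sharer's bit.
   Step 10: when no sharers remain, send ExRep to the requester. */
static void handle_InvRep(dir_entry_t *e, int sender) {
    e->sharers &= (uint8_t)~(1u << sender);
    if (e->sharers == 0) {
        e->state   = DIR_W;
        e->sharers = (uint8_t)(1u << e->pending_requester);
        send_msg(e->pending_requester, MSG_ExRep);
    }
}

int main(void) {
    dir_entry_t e = { DIR_R, 0x6, -1 };  /* cores 1 and 2 share the line */
    handle_ExReq(&e, 3);                 /* core 3 wants to write        */
    handle_InvRep(&e, 1);
    handle_InvRep(&e, 2);                /* last InvRep triggers ExRep   */
    return 0;
}
```

The transient DIR_TR state is exactly where the next slide's complexity comes from: while the directory waits for InvReps, other requests for the same line can arrive.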

Concurrency Management
- The protocol would be easy to design if only one transaction were in flight across the entire system
- But we want greater throughput and don't want to have to coordinate across the entire system
- Great complexity lies in managing multiple outstanding concurrent transactions to cache lines
- Can have multiple requests in flight to the same cache line!

This Is the Standard MESI
[Figure: the MESI state-transition diagram for the cache state in processor P1, repeated for reference.]

With Transients (Cache) Based on MESI

With Transients (Directory): 17 states with MESI! If you are curious: http://www.m5sim.org/MESI_Two_Level (Figure based on MSI: 13 states)

More Complex Coherence Protocols
The original protocol for this had 6 states: MOETSI.

Protocol Messages (MESI)
There are 10 different protocol messages:

Category                    Messages
Cache to Memory Requests    ShReq, ExReq
Memory to Cache Requests    WbReq, InvReq, FlushReq
Cache to Memory Responses   WbRep(v), InvRep, FlushRep(v)
Memory to Cache Responses   ShRep(v), ExRep(v)

LL-SC
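For reference, the ten messages could be encoded as a C enum (names are illustrative; on the slide, a (v) suffix marks messages that carry the line's data):

```c
enum proto_msg {
    /* cache -> memory requests */
    MSG_ShReq, MSG_ExReq,
    /* memory -> cache requests */
    MSG_WbReq, MSG_InvReq, MSG_FlushReq,
    /* cache -> memory responses (WbRep and FlushRep carry data) */
    MSG_WbRep, MSG_InvRep, MSG_FlushRep,
    /* memory -> cache responses (both carry data) */
    MSG_ShRep, MSG_ExRep
};
```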

Cache State Transitions (from invalid state)

Cache State Transitions (from shared state)

Cache State Transitions (from exclusive state)

Cache Transitions (from pending)

Home Directory State Transitions — messages sent from site id (four slides of transition-table figures, not transcribed)

Acknowledgements
These slides contain material developed and copyright by:
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
- Mark Hill (U. Wisconsin-Madison)
- Dana Vantrase, Mikko Lipasti, Nathan Binkert (U. Wisconsin-Madison & HP)
MIT material derived from course 6.823. UCB material derived from course CS252.