The Cache-Coherence Problem: Intro to Chapter 5
Lecture 7, ECE/CSC 506, Spring 2007. E. F. Gehringer, based on slides by Yan Solihin.

Shared Memory vs. No Shared Memory

Advantages of shared-memory machines:
- Naturally support shared-memory programs. Clusters can also support them via software virtual shared memory, but with much coarser granularity and higher overheads.
- Single OS image.
- Can fit larger problems: total mem = mem_1 + … + mem_n = n × mem_1. Large programs cause thrashing in clusters.
- Allow fine-grained sharing. Messages are expensive, so large messages work well, but small messages are dominated by overheads. Fine-grained synchronization fits better.

Disadvantages of shared-memory machines:
- Cost of providing the shared-memory abstraction.

Multiprocessor Usage

- On-chip: shared memory (and shared cache).
- Small scale (up to 2–30 processors): shared memory. Most processors have MP support out of the box; most of these systems are bus-based. Popular in commercial as well as HPC markets.
- Medium scale (64–256): shared memory and clusters. Clusters are cheaper; often they are clusters of SMPs.
- Large scale (> 256): few shared-memory machines, many clusters. SGI Altix 3300: 512-processor shared memory. Large variety of custom/off-the-shelf components, such as interconnection networks: Beowulf clusters use fast Ethernet, Myrinet uses fiber optics, the IBM SP2 is custom.

Goals

So far, we have learned:
- How to write parallel programs
- How caches and buses work

Let's now construct a bus-based multiprocessor. Will it work?

The Cache-Coherence Problem

An illustration: "A joint checking account is owned by A, B, and C. Each has a checkbook, and each withdraws and deposits funds several times a day. The bank imposes a large penalty if a check is issued without sufficient funds in the account. The only mode of communication is sending the current account balance by mail. A, B, and C work in the same building, but the bank is in a different building. It takes only 1 minute to send mail within a building, but 1 hour to another building."

- Goal 1: A, B, and C need an up-to-date and coherent view of their account balance.
- Goal 2: Achieve goal 1 with the minimum number of messages.

At the Minimum

At the minimum, they need a protocol to support:
- Write propagation: passing on the information that the balance has been updated.
- But they also need write serialization: updates (X, Y) made in that order are seen in the same order by everybody.

Reaching goal 2 depends on the access patterns:
- Mostly reads and a few writes?
- Many successive writes?

Will Parallel Code Work Correctly?

sum = 0;
begin parallel
for (i=0; i<2; i++) {
  lock(id, myLock);
  sum = sum + a[i];
  unlock(id, myLock);
}
end parallel
print sum;

Suppose a[0] = 3 and a[1] = 7. Two issues:
- Will it print sum = 10?
- How can it support locking correctly?

Example 1: The Cache-Coherence Problem

(Same code, now run on a bus-based machine with processors P1, P2, …, Pn, each with a private cache.)

sum = 0;
begin parallel
for (i=0; i<2; i++) {
  lock(id, myLock);
  sum = sum + a[i];
  unlock(id, myLock);
}
end parallel
print sum;

Suppose a[0] = 3 and a[1] = 7. Will it print sum = 10?
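For concreteness, here is a minimal runnable POSIX-threads version of this example (our sketch, not from the slides; the thread scaffolding and names are illustrative). On coherent hardware, with a real mutex, it always prints 10:

#include <pthread.h>
#include <stdio.h>

static int a[2] = {3, 7};
static int sum = 0;
static pthread_mutex_t myLock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int i = *(int *)arg;              // which element this thread adds
    pthread_mutex_lock(&myLock);
    sum = sum + a[i];                 // critical section: read-modify-write of sum
    pthread_mutex_unlock(&myLock);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int id[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &id[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("sum = %d\n", sum);        // prints sum = 10
    return 0;
}

The slides' question is what the hardware must do so that the mutex and the shared sum behave this way when each thread runs on a different processor with its own cache.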

Cache-Coherence Problem Illustration

[Figure: processors P1, P2, P3, each with a private write-back cache, connected by a bus to a memory controller and main memory; sum = 0 in memory.]

The animation steps through the following trace:
1. P1 reads &sum: a miss loads sum = 0 into P1's cache, in state V.
2. P2 reads &sum: a miss loads sum = 0 into P2's cache; both copies are V.
3. P1 writes sum (sum = 3): P1's copy becomes 3, in state D (dirty); memory still holds 0.
4. P2 writes &sum (sum = 7): P2 updates its own stale copy (0 + 7 = 7), now D; P1 still caches 3.
5. P1 reads &sum to print it: a cache hit returns its own copy, and it prints sum = 3 instead of 10.

Cache-Coherence Problem

- Do P1 and P2 see the same sum?
- Does it matter if we use a write-through (WT) cache?
- What if we do not have caches, or sum is uncacheable? Will it work?

Write-Through Cache Does Not Work

[Figure: the same trace with write-through caches. After P2's write, memory holds sum = 7, but P1's cache still holds sum = 3; P1's read of &sum hits in its own cache, and it still prints sum = 3.]

Write-through keeps memory up to date, but it does nothing about stale copies in other processors' caches.

Software Lock Using a Flag

Will this guarantee mutual exclusion?

void lock (int process, volatile int *lvar) {  // process is 0 or 1; *lvar is the shared flag, initially 0
  while (*lvar == 1) {}   // spin while the lock appears held
  *lvar = 1;              // then claim it
}

void unlock (int process, volatile int *lvar) {
  *lvar = 0;
}
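No. The read of the flag and the write that claims it are separate memory operations, so two processes can interleave between them. An illustrative interleaving (ours, not on the slide):

// P0: while (*lvar == 1) {}    -- reads 0, falls through
// P1: while (*lvar == 1) {}    -- also reads 0, falls through
// P0: *lvar = 1;               -- P0 believes it holds the lock
// P1: *lvar = 1;               -- so does P1: mutual exclusion is violated

Peterson's algorithm, next, avoids the need for an atomic read-modify-write.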

Peterson's Algorithm

int turn;
int interested[n];  // initialized to FALSE

void lock (int process, int lvar) {  // process is 0 or 1
  int other = 1 - process;
  interested[process] = TRUE;
  turn = process;
  while (turn == process && interested[other] == TRUE) {}
}
// Post: turn != process or interested[other] == FALSE

void unlock (int process, int lvar) {
  interested[process] = FALSE;
}

Exit from lock() happens only if:
- interested[other] == FALSE: either the other process has not competed for the lock, or it has just called unlock(); or
- turn != process: the other process is competing, has set turn to itself, and will be blocked in the while() loop.

Will It Work?

A: interested[process] = TRUE;
B: turn = process;

Correctness depends on the global order of A and B. Thus, it will not work if:
- The compiler reorders the operations. There is no data dependence between A and B, so unless the compiler is notified, it may well reorder them. (This prevents the compiler from using the aggressive optimizations it applies to serial programs.)
- The architecture reorders the operations: write buffers, the memory controller, or network delay for statement A. If turn and interested[] are cacheable, A may miss in the cache while B hits, so B becomes globally visible first.

This is called the memory-consistency problem.
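As a concrete illustration (ours, using C11 atomics rather than the course's pseudocode), <stdatomic.h> lets Peterson's algorithm state its ordering requirements explicitly. With the default sequentially consistent atomics, neither the compiler nor the hardware may perform the reorderings described above:

#include <stdatomic.h>
#include <stdbool.h>

atomic_int turn;
atomic_bool interested[2];            // both start out false

void lock (int process) {             // process is 0 or 1
    int other = 1 - process;
    atomic_store(&interested[process], true);   // A (seq_cst by default)
    atomic_store(&turn, process);               // B (seq_cst by default)
    // seq_cst keeps A and B globally ordered before the loads below
    while (atomic_load(&turn) == process &&
           atomic_load(&interested[other]))
        ;                             // spin
}

void unlock (int process) {
    atomic_store(&interested[process], false);
}

Demoting these variables to plain ints (or to memory_order_relaxed) reintroduces exactly the failure shown on the "Race on a Non-Sequentially-Consistent Machine" slide below.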

No Race

// Proc 0
interested[0] = TRUE;
turn = 0;
while (turn==0 && interested[1]==TRUE) {};
// since interested[1] == FALSE,
// Proc 0 enters critical section
...
// unlock
interested[0] = FALSE;

// Proc 1
interested[1] = TRUE;
turn = 1;
while (turn==1 && interested[0]==TRUE) {};
// since turn==1 && interested[0]==TRUE,
// Proc 1 waits in the loop until Proc 0
// releases the lock (interested[0] = FALSE);
// it can then exit the loop and acquire the lock

Race

Both processes compete at the same time; the last write to turn decides who waits:

// Proc 0
interested[0] = TRUE;
turn = 0;
while (turn==0 && interested[1]==TRUE) {};
// since turn == 1 (Proc 1 wrote turn last),
// Proc 0 enters critical section
...
// unlock
interested[0] = FALSE;

// Proc 1
interested[1] = TRUE;
turn = 1;
while (turn==1 && interested[0]==TRUE) {};
// since turn==1 && interested[0]==TRUE,
// Proc 1 waits in the loop until Proc 0
// releases the lock; it can then exit the
// loop and acquire the lock

Race on a Non-Sequentially-Consistent Machine

// Proc 0
interested[0] = TRUE;
turn = 0;
while (turn==0 && interested[1]==TRUE) {};
// since interested[1] == FALSE,
// Proc 0 enters critical section

// Proc 1 -- the two statements below are reordered
// (by the compiler or the hardware)
turn = 1;
interested[1] = TRUE;
while (turn==1 && interested[0]==TRUE) {};
// since turn==0,
// Proc 1 also enters the critical section:
// mutual exclusion is violated

Two Fundamental Problems

1. The cache-coherence problem
- Tackled in hardware with cache-coherence protocols.
- Correctness is guaranteed by the protocols, but with varying performance.

2. The memory-consistency problem
- Tackled by various memory-consistency models, which differ in which operations can be reordered and which cannot, and in the guarantee of completion of a write.
- Compilers and programs have to conform to the model for correctness!
- Two approaches (see the litmus test below):
  - Sequential consistency: multithreaded code written for uniprocessors automatically runs correctly. How? Every shared read/write completes globally in program order. Most intuitive, but worst performance.
  - Relaxed consistency models: multithreaded code written for uniprocessors must be ported to run correctly; additional instructions (memory fences) enforce a global order between two operations.
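The classic litmus test for the difference between these models (our example, not from the slides): with x and y initially 0, two processors run

// P0          // P1
x = 1;         y = 1;
r0 = y;        r1 = x;

Under sequential consistency, some interleaving of the four statements must explain the outcome, so r0 == 0 && r1 == 0 is impossible. On a machine with store buffers (one common relaxation), each processor's load can complete while its own store is still sitting in the store buffer, so both registers can read 0. A memory fence between the store and the load on each processor rules this out.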

Caches and Cache Coherence

Caches are important:
- They reduce average data-access time.
- They reduce bandwidth demand on the bus/interconnect.

So let's use caches, but solve the coherence problem. Sufficient conditions for coherence (notation: Request_proc(data)):
- Write propagation: Rd_i(X) must return the value of the "latest" Wr_j(X).
- Write serialization: Wr_i(X) and Wr_j(X) are seen in the same order by everybody; i.e., if I see w1 after w2, you should not see w2 before w1.

In essence, this is a global ordering of memory operations to a single location. There is no need for read serialization, since a read is local.

A Coherent Memory System: Intuition

In uniprocessors, coherence arises between I/O devices and the processor. It is infrequent, so software solutions work: uncacheable memory, uncacheable operations, flushing pages, passing I/O data through the caches.

But the coherence problem is much more critical in multiprocessors:
- Pervasive
- Performance-critical
- Necessitates a hardware solution

Note that "latest" is ambiguous. Ultimately, what we care about is that any write is propagated, as a building block for synchronization mechanisms; synchronization defines what "latest" means.

Natural Extensions of a Memory System

[Figure: the design space of memory organizations: shared cache, bus-based shared memory, dancehall, and distributed memory, as discussed on the next slide.]

Focus: Bus-Based, Centralized Memory

Shared cache: popular in CMPs (Power4, SPARC, Pentium, Itanium)
- Low-latency sharing and prefetching across processors
- Sharing of working sets
- No coherence problem (if sharing the L1 cache)
- But contention and negative interference: thread starvation, priority inversion, thread-mix-dependent throughput
- Higher hit and miss latency due to extra ports and cache size
- Mid-80s: used to connect a couple of processors on a board (Encore, Sequent); today: CMPs (e.g., Power4)

Dancehall
- No longer popular: everything is uniformly far away

Distributed memory
- Popular way to build large systems; discussed later

Assume a Bus-Based SMP

Built on top of two fundamentals of uniprocessor systems:
- Bus transactions
- The cache-line finite-state machine

Uniprocessor bus transaction:
- Three phases: arbitration, command/address, data transfer
- All devices observe the address; one is responsible for responding

Uniprocessor cache-line states:
- Every cache line has a finite-state machine
- Write-through, write-no-allocate: Valid and Invalid states
- Write-back: Valid, Invalid, and Modified ("Dirty")

Multiprocessors extend both of these somewhat to implement coherence.

Snoop-Based Coherence

Basic idea:
- Attach a snooper to each processor so that all bus transactions are visible to all processors (snooping).
- Processors (via their cache controllers) change line states on relevant events.

Implementing a protocol:
- Each cache controller reacts to processor and bus events, taking actions when necessary: updating state, responding with data, generating new bus transactions.
- The memory controller also snoops bus transactions and returns data only when needed.
- The granularity of coherence is typically a cache line/block, the same granularity as transfers to/from the cache.

Coherence with Write-Through Caches

This was the bus-based SMP implementation choice of the mid-80s.

[Figure: processors P0 … Pn, each with a cache and a snooper attached to the shared bus.]

What happens when we snoop a write? Invalidate or update our copy of the block.
- No new states or bus transactions in this case.
- Write-update protocol: the write is immediately propagated to other copies.
- Write-invalidate protocol: causes a miss on a later access; memory is kept up to date via the write-through.

(Running example, as before:)
sum = 0;
begin parallel
for (i=0; i<2; i++) {
  lock(id, myLock);
  sum = sum + a[i];
  unlock(id, myLock);
}
end parallel
print sum;
Suppose a[0] = 3 and a[1] = 7.

Snooper Assumptions

- Atomic bus
- Writes appear on the bus in program order

Transactions

Processor transactions:
- PrRd (processor read)
- PrWr (processor write)

Snooped bus transactions:
- BusRd (bus read)
- BusWr (bus write)

Write-Through State-Transition Diagram

[Figure: two-state (Valid/Invalid) diagram for a write-through, write-no-allocate, write-invalidate protocol; a code sketch follows below.]

Key: a write invalidates all other caches. Therefore:
- A modified line exists as V in only one cache.
- A clean line exists as V in at least one cache.
- The Invalid state represents a line that has been invalidated or is not present in the cache.
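A sketch of that two-state machine in C (our reconstruction of the standard write-through invalidate protocol; the state and event names follow the slides, the rest is illustrative):

#include <stdio.h>

typedef enum { INVALID, VALID } LineState;
typedef enum { PrRd, PrWr, BusRd, BusWr } Event;  // Pr* = local processor, Bus* = snooped

// Next state of one cache line; prints the bus transaction, if any,
// that the protocol generates. Bus* events come from other processors.
LineState next_state(LineState s, Event e) {
    switch (e) {
    case PrRd:
        if (s == INVALID) puts("BusRd");  // read miss: fetch the block
        return VALID;                     // read hit: no transaction
    case PrWr:
        puts("BusWr");                    // write-through: every write goes on the bus
        return s;                         // write-no-allocate: no state change
    case BusRd:
        return s;                         // another cache reads: no action needed
    case BusWr:
        return INVALID;                   // another processor writes: invalidate our copy
    }
    return s;
}

Replaying the earlier sum trace through this machine, P2's BusWr is snooped by P1 and moves P1's copy of sum from V to I, so P1's next read misses and fetches the up-to-date value: exactly the write propagation that the plain write-through system lacked.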

Is It Coherent?

Write propagation: through invalidation, followed by a cache miss that loads the new value.

Write serialization:
- Assume invalidation happens instantaneously, and assume an atomic bus (assumed throughout Chap. 5, relaxed in Chap. 6).
- Writes are serialized by the order in which they appear on the bus (bus order), and so are invalidations.

Do reads see the latest writes?
- Read misses generate bus transactions, so they will get the last write.
- Read hits do not appear on the bus, but each is preceded by the most recent write by this processor (itself), or by this processor's most recent read miss.
- Thus, read hits also see the latest written values (according to bus order).

Determining Orders More Generally

- A memory operation M2 is subsequent to a memory operation M1 if the operations are issued by the same processor and M2 follows M1 in program order.
- A read is subsequent to a write W if the read generates a bus transaction that follows the transaction for W.
- A write is subsequent to a read or write M if M generates a bus transaction and the transaction for the write follows the one for M.
- A write is subsequent to a read if the read does not generate a bus transaction and is not already separated from the write by another bus transaction.

Writes establish a partial order. This doesn't constrain the ordering of reads, though the bus will order read misses too; any order among reads between writes is fine, as long as it respects program order.

Problem with Write-Through

High bandwidth requirements:
- Every write goes to the shared bus and memory.
- Example: a 200-MHz, 1-CPI processor in which 15% of instructions are 8-byte stores. Each processor generates 200M × 0.15 = 30M stores, or 240 MB of data, per second. A 1-GB/s bus can support only about four such processors without saturating.
- Thus, write-through is unpopular for SMPs.

Write-back caches:
- Write hits do not go to the bus, eliminating most write bus transactions.
- But now how do we ensure write propagation and serialization?

Summary

- Shared memory with caches raises the problem of cache coherence: writes to the same location must be seen in the same order everywhere.
- But this is not the only problem. Writes to different locations must also be kept in order if they are being depended upon for synchronizing tasks. This is called the memory-consistency problem.
- One solution for small-scale multiprocessors is a shared bus.
- State-transition diagrams can be used to show how a cache-coherence protocol operates.
- The simplest protocol is write-through, but it has performance problems.