Multiprocessors— Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Alvin R. Lebeck Computer Science 220 Fall 2001.



2 © Alvin R. Lebeck 1999
What is Parallel Computer Architecture?
A parallel computer is a collection of processing elements that cooperate to solve large problems fast.
–How large a collection?
–How powerful are the elements?
–How does it scale up?
–How do they cooperate and communicate?
–How is data transmitted between processors?
–What are the primitive abstractions?
–How does it all translate to performance?

Parallel Computation: Why and Why Not?
Pros:
–Performance
–Cost-effectiveness (commodity parts)
–Smooth upgrade path
–Fault tolerance
Cons:
–Difficult to parallelize applications
–Requires automatic parallelization or parallel program development
–Software! AAHHHH!

Flynn Categories
SISD (Single Instruction, Single Data)
–Uniprocessors
MISD (Multiple Instruction, Single Data)
–???
SIMD (Single Instruction, Multiple Data)
–Examples: Illiac-IV, CM-2, Intel MMX
»Simple programming model
»Low overhead
»Flexibility
»All custom processors
MIMD (Multiple Instruction, Multiple Data)
–Examples: Intel 4-way SMP, SUN ES3000, SGI Origin, Cray T3D
»Flexible
»Use off-the-shelf microprocessors

Communication Models
Shared memory
–Processors communicate through a shared address space
–Easy on small-scale machines
–Advantages:
»Model of choice for uniprocessors and small-scale MPs
»Ease of programming
»Lower latency
»Easier to use hardware-controlled caching
Message passing
–Processors have private memories and communicate via messages
–Advantages:
»Less hardware, easier to design
»Focuses attention on costly non-local operations
Either model can be supported on either hardware base.

Simple Problem

for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]

How do you make this loop parallel to run on many processors?

Simple Problem
Split the loops:

// Independent iterations: run on up to N processors
for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]

// One last loop to run on one processor
for i = 1 to N
  sum = sum + A[i]
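The split can be sketched in Python (a sketch of mine, not from the slides): the element-wise loop has independent iterations and can be farmed out to a pool of workers, while the final sum is a reduction that is left sequential, as on the slide. `ThreadPoolExecutor` and the helper `update` are illustrative stand-ins for "up to N processors".

```python
from concurrent.futures import ThreadPoolExecutor

N = 8
A = [float(i) for i in range(N)]
B = [2.0] * N
C = [3.0] * N

# Independent iterations: each index touches a distinct element of A,
# so the updates can run concurrently without synchronization.
def update(i):
    A[i] = (A[i] + B[i]) * C[i]

with ThreadPoolExecutor() as pool:
    list(pool.map(update, range(N)))

# One last loop on "one processor": a sequential reduction.
total = 0.0
for i in range(N):
    total += A[i]
```

(Python threads share one interpreter lock, so this shows the decomposition rather than real speedup; the same split maps directly onto processes or onto the shared-memory machines discussed next.)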

Small-Scale Shared Memory Multiprocessors
Small number of processors connected to one shared memory
Memory is equidistant from all processors (UMA)
Kernel can run on any processor (symmetric MP)
Intel dual/quad Pentium, IBM, SUN, Compaq; almost everyone builds one
Some are moving on-chip (e.g., IBM)
[Figure: processors, each with cache(s) and TLB, sharing a bus to one main memory, addresses 0..N-1]

Large Scale Shared Memory Multiprocessors
100s to 1000s of nodes (processors) with a single shared physical address space
Use a general-purpose interconnection network
–Still have a cache coherence protocol
–Use messages instead of bus transactions
–No hardware broadcast
Communication assist (controller / network interface) at each node
Examples: Cray T3D, T3E, Compaq EV7, SUN ES3000
[Figure: nodes of {processor + cache, memory, controller/NI} connected by an interconnect]

Message Passing Architectures
Cannot directly access memory on another node
IBM SP-2, Intel Paragon
Clusters of workstations
[Figure: nodes of {processor + cache, memory, communication assist}, each with its own private address space 0..N-1, connected by an interconnect]

Important Communication Properties
Bandwidth
–Need high bandwidth in communication
–Aggregate bandwidth cannot scale perfectly with machine size, but should stay close
–Limits may be in the network, memory, or processor
–Overhead to communicate is a problem in many machines
Latency
–Affects performance, since the processor may have to wait
–Affects ease of programming, since overlapping communication and computation requires more thought
Latency hiding
–How can a mechanism help hide latency?
–Examples: overlap message send with computation, prefetch

Small-Scale Shared Memory
Caches serve to:
–Increase bandwidth versus bus/memory
–Reduce latency of access
–Valuable for both private data and shared data
What about cache coherence?
[Figure: same bus-based SMP as before: processors with cache(s) and TLB sharing main memory]

Cache Coherence Problem
Two processors, P1 and P2, each with a private cache, share main memory over an interconnection network / bus. Location x starts in memory only.
–Step 1: P1 executes ld r2, x; a copy of x is loaded into P1's cache.
–Step 2: P2 executes ld r2, x; a second copy of x is loaded into P2's cache.
–Step 3: one processor then executes add r1, r2, r4 and st x, r1, updating its own cached copy of x; the other processor's cached copy is now stale.

The Problem of Cache Coherence (4)
The same problem over two locations, x and y:

P1: ld r2, x             P2: ld r2, x
    add r1, r3, r4           add r1, r2, r3
    st x, r1                 st y, r1
    ld r4, y                 ld r5, x

Without coherence, P2's final ld r5, x may miss P1's update to x.
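The stale-copy sequence above can be replayed with a toy model (names `Cache`, `load`, `store` are mine): two private caches over one memory, with write-through stores but no invalidation. One processor, modeled here as P2, updates x; the other keeps serving its old cached copy.

```python
# Shared main memory; location x starts at 0, as in the slides' sequence.
memory = {"x": 0}

class Cache:
    """Private per-processor cache with NO coherence mechanism."""
    def __init__(self):
        self.data = {}

    def load(self, addr):
        if addr not in self.data:          # miss: fetch from memory
            self.data[addr] = memory[addr]
        return self.data[addr]             # hit: possibly stale copy

    def store(self, addr, value):
        self.data[addr] = value            # update own copy
        memory[addr] = value               # write-through to memory
        # ...but nobody invalidates the other caches!

p1, p2 = Cache(), Cache()
r2_p1 = p1.load("x")        # Step 1: P1 loads x -> 0
r2_p2 = p2.load("x")        # Step 2: P2 loads x -> 0
p2.store("x", r2_p2 + 5)    # Step 3: P2 computes and stores x = 5
stale = p1.load("x")        # P1 hits its own cache and still sees 0
```

Memory now holds 5 while P1 reads 0: exactly the incoherence the protocols below are designed to prevent.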

Coherence vs. Consistency
Intuition says loads should return the latest value
–but what is "latest"?
Coherence concerns only one memory location
Consistency concerns the apparent ordering of all locations
A memory system is coherent if
–all operations to a given location can be serialized such that
–operations performed by any processor appear in program order
»program order = the order defined by the program text or assembly code
–the value returned by a read is the value written by the last store to that location

Why Coherence != Consistency

/* initially A = B = flag = 0 */
P1:                      P2:
A = 1;                   while (flag == 0); /* spin */
B = 1;                   print A;
flag = 1;                print B;

Intuition says this prints A = B = 1
Coherence doesn't say anything about it. Why?
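One way to see why coherence alone is not enough: coherence orders operations per location, so P2 may observe P1's three stores (to three different locations) in any order. A small sketch (my simplification: P2 snapshots A and B at the moment it first sees flag == 1) enumerates the orders:

```python
from itertools import permutations

# P1's three stores from the slide, each to a different location.
stores = [("A", 1), ("B", 1), ("flag", 1)]

outcomes = set()
for order in permutations(stores):
    mem = {"A": 0, "B": 0, "flag": 0}
    for var, val in order:
        mem[var] = val
        if mem["flag"] == 1:               # P2's spin loop exits here
            outcomes.add((mem["A"], mem["B"]))
            break
# If flag's store becomes visible before A's or B's, P2 can print 0s.
```

The intuitive result (1, 1) is one possible outcome, but so are (0, 1), (1, 0), and (0, 0). Ruling those out requires a consistency model ordering operations across locations, which is what sequential consistency (next slide) provides.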

The Sequential Consistency Memory Model
[Figure: processors P1, P2, P3 connected to memory through a switch that is randomly set after each memory op; "sequential" processors issue memory ops in program order]

Sufficient Conditions for Sequential Consistency
–Every processor issues memory ops in program order
–A processor must wait for a store to complete before issuing its next memory operation
–After a load, the issuing processor waits for the load to complete, and for the store that produced the value to complete, before issuing its next op
Easily implemented with a shared bus.

Potential Solutions
Snooping solution (snoopy bus):
–Send all requests for data to all processors
–Processors snoop to see if they have a copy and respond accordingly
–Requires broadcast, since caching information is at the processors
–Works well with a bus (natural broadcast medium)
–Dominates for small-scale machines (most of the market)
»probably won't scale beyond 2-4 processors
Directory-based schemes:
–Keep track of what is being shared in one centralized place
–Distributed memory => distributed directory (avoids bottlenecks)
–Send point-to-point requests to processors
–Scales better than snooping
–Actually existed BEFORE snoop-based schemes

Basic Snoopy Protocols
Write-invalidate protocol:
–Multiple readers, single writer
–Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
–Read miss:
»Write-through: memory is always up to date
»Write-back: snoop in caches to find the most recent copy
Write-broadcast (update) protocol:
–Write to shared data: broadcast on the bus; processors snoop and update their copies
–Read miss: memory is always up to date
Write serialization: the bus serializes requests
–Bus is the single point of arbitration

Snoopy Cache-Coherence Protocols
The bus provides a serialization point for consistency
–but what about write buffers? Later in the semester...
Each cache controller "snoops" all bus transactions
–a transaction is relevant if it is for a block the cache contains
–take action to ensure coherence
»invalidate
»update
»supply value
–the action depends on the state of the block and the protocol
Simultaneous operation of independent controllers

Snoopy Design Choices
The controller updates the state of blocks in response to processor (ld/st) and snoop (observed bus transaction) events, and generates bus transactions
Often have duplicate cache tags
Snoopy protocol:
–set of states
–state-transition diagram
–actions
Basic choices:
–write-through vs. write-back
–invalidate vs. update
[Figure: cache with per-block state, tag, and data, observed by both the processor and the bus snooper]

The Simple Invalidate Snoopy Protocol
Write-through, no-write-allocate cache
Actions: PrRd, PrWr (processor); BusRd, BusWr (bus)
Transitions:
–Invalid --PrRd / BusRd--> Valid
–Invalid --PrWr / BusWr--> Invalid (no-write-allocate)
–Valid --PrRd--> Valid (hit)
–Valid --PrWr / BusWr--> Valid (write-through)
–Valid --observed BusWr--> Invalid
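These transitions are small enough to execute directly. A minimal sketch (class and method names are mine): each cache holds one block in state Valid or Invalid; a processor write goes through to memory (BusWr), and every other cache that observes the BusWr invalidates its copy.

```python
class SnoopyCache:
    """One block, two states, write-through, no-write-allocate."""
    def __init__(self):
        self.state = "Invalid"
        self.value = None

    def pr_rd(self, memory):
        if self.state == "Invalid":        # PrRd miss: BusRd, then Valid
            self.value = memory[0]
            self.state = "Valid"
        return self.value                  # PrRd hit: no bus traffic

    def pr_wr(self, memory, value, caches):
        memory[0] = value                  # PrWr: BusWr (write-through)
        if self.state == "Valid":
            self.value = value             # update own copy on a hit
        # no-write-allocate: an Invalid cache stays Invalid on a write miss
        for other in caches:
            if other is not self:
                other.snoop_bus_wr()

    def snoop_bus_wr(self):
        self.state = "Invalid"             # observed BusWr: invalidate

memory = [0]
c0, c1 = SnoopyCache(), SnoopyCache()
caches = [c0, c1]
v = c0.pr_rd(memory)               # c0: Invalid -> Valid, reads 0
c1.pr_wr(memory, 7, caches)        # BusWr invalidates c0's copy
v2 = c0.pr_rd(memory)              # c0 misses again, refetches 7
```

The stale-read problem from the earlier slides disappears, at the cost of one bus transaction per write, which motivates the 3-state write-back protocol next.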

A 3-State Write-Back Invalidation Protocol
2-state protocol:
+Simple hardware and protocol
–Bandwidth (every write goes on the bus!)
3-state protocol (MSI):
–Modified
»one cache has the valid/latest copy
»memory is stale
–Shared
»one or more caches have a valid copy
–Invalid
Must invalidate all other copies before entering the Modified state
–Requires a bus transaction (to order and invalidate)

MSI Processor and Bus Actions
Processor:
–PrRd
–PrWr
–Writeback on replacement of a modified block
Bus:
–Bus Read (BusRd): read without intent to modify; data could come from memory or another cache
–Bus Read-Exclusive (BusRdX): read with intent to modify; must invalidate all other caches' copies
–Writeback (BusWB): cache controller puts contents on the bus and memory is updated
–Definition: a cache-to-cache transfer occurs when another cache satisfies a BusRd or BusRdX request
Let's draw it!

MSI State Diagram
–Invalid --PrRd / BusRd--> Shared
–Invalid --PrWr / BusRdX--> Modified
–Shared --PrRd--> Shared (hit)
–Shared --PrWr / BusRdX--> Modified
–Shared --observed BusRd--> Shared
–Shared --observed BusRdX--> Invalid
–Modified --PrRd, PrWr--> Modified (hits)
–Modified --observed BusRd / BusWB--> Shared
–Modified --observed BusRdX / BusWB--> Invalid

An Example

Proc action     P1   P2   P3   Bus act   Data from
1. P1 read u    S    --   --   BusRd     memory
2. P3 read u    S    --   S    BusRd     memory
3. P3 write u   I    --   M    BusRdX    memory or not
4. P1 read u    S    --   S    BusRd     P3's cache
5. P2 read u    S    S    S    BusRd     memory

Single-writer, multiple-reader protocol
Why Modified to Shared (in step 4)?
What if the block is not in any cache?
–A read followed by a write produces 2 bus transactions!
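The five-step trace can be replayed with a minimal MSI simulator for a single block u (a sketch; class and function names are mine, not from the lecture). BusRd is served cache-to-cache when some cache holds the block Modified, forcing that cache to write back and drop to Shared; BusRdX invalidates every other copy.

```python
class MSICache:
    def __init__(self, name):
        self.name, self.state, self.value = name, "I", None

def bus_rd(requester, caches, memory):
    """BusRd: data from an M cache (cache-to-cache, via BusWB) or memory."""
    source = "memory"
    for c in caches:
        if c is not requester and c.state == "M":
            memory["u"] = c.value          # BusWB: flush modified copy
            c.state = "S"                  # Modified -> Shared
            source = c.name + "'s cache"
    requester.value = memory["u"]
    requester.state = "S"
    return source

def bus_rdx(requester, caches, memory):
    """BusRdX: read with intent to modify; invalidate all other copies."""
    for c in caches:
        if c is not requester and c.state != "I":
            if c.state == "M":
                memory["u"] = c.value      # write back before invalidating
            c.state = "I"
    requester.state = "M"

memory = {"u": 5}
p1, p2, p3 = (MSICache(n) for n in ("P1", "P2", "P3"))
caches = [p1, p2, p3]

s1 = bus_rd(p1, caches, memory)    # 1. P1 read u   -> P1: S, from memory
s2 = bus_rd(p3, caches, memory)    # 2. P3 read u   -> P3: S, from memory
bus_rdx(p3, caches, memory)        # 3. P3 write u  -> P1: I, P3: M
p3.value = 7                       #    P3 updates its (dirty) copy
s4 = bus_rd(p1, caches, memory)    # 4. P1 read u   -> from P3's cache
s5 = bus_rd(p2, caches, memory)    # 5. P2 read u   -> memory, now up to date
```

Step 4 shows why Modified drops to Shared: after the flush, memory is up to date again, so step 5 can be served from memory with all three caches Shared.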

4-State (MESI) Invalidation Protocol
Often called the Illinois protocol
States:
–Modified (dirty)
–Exclusive (clean, unshared): only copy, not dirty
–Shared
–Invalid
Requires a shared signal to detect if other caches have a copy of the block
Cache flush for cache-to-cache transfers
–Only one cache can do it, though
What does the state diagram look like?

4-State Write-Back Update Protocol
Dragon (Xerox PARC)
States:
–Exclusive (E): one copy, clean, memory is up to date
–Shared-Clean (SC): could be two or more copies, memory state unknown
–Shared-Modified (SM): could be two or more copies, memory stale
–Modified (M)
Adds a Bus Update transaction
Adds a cache controller Update operation
Must obtain the bus before updating the local copy
What does the state diagram look like?
–let's look at the actions first

Basic Snoopy Protocols
Write-invalidate versus broadcast:
–Invalidate requires one transaction per write-run
–Invalidate exploits spatial locality: one transaction per block
–Broadcast has lower latency between write and read
–Broadcast: bandwidth (increased) vs. latency (decreased) tradeoff

Name         Protocol type      Memory-write policy                        Machines using
Write Once   Write invalidate   Write back after first write               First snoopy protocol
Synapse N+1  Write invalidate   Write back                                 First cache-coherent MPs
Berkeley     Write invalidate   Write back                                 Berkeley SPUR
Illinois     Write invalidate   Write back                                 SGI Power and Challenge
"Firefly"    Write broadcast    Write back private, write through shared   SPARCCenter 2000

Larger MPs
Separate memory per processor
Local or remote access via the memory controller
One cache coherency solution: non-cached pages
Alternative: a directory that tracks the state of every block in every cache
–which caches have copies of the block, dirty vs. clean, ...
Info per memory block vs. per cache block?
–In memory => simpler protocol (centralized / one location)
–In memory => directory size is f(memory size) vs. f(cache size)
To prevent the directory from becoming a bottleneck: distribute directory entries with memory, each node keeping track of which processors have copies of its blocks

Directory Protocol
Similar to the snoopy protocol: 3 states
–Shared: 1 or more processors have the data; memory is up to date
–Uncached: no processor has a copy
–Exclusive: 1 processor (the owner) has the data; memory is out of date
In addition to the cache state, must track which processors have the data when in the Shared state
Terms:
–Local node: the node where a request originates
–Home node: the node where the memory location of an address resides
–Remote node: a node that has a copy of a cache block, whether exclusive or shared

Example Directory Protocol
A message sent to the directory causes 2 actions:
–update the directory
–more messages to satisfy the request
Block is in the Uncached state (the copy in memory is the current value); the only possible requests for that block are:
–Read miss: the requesting processor is sent the data from memory, and the requestor becomes the only sharing node. The state of the block is made Shared.
–Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
Block is Shared (the memory value is up to date):
–Read miss: the requesting processor is sent the data from memory, and the requesting processor is added to the sharing set.
–Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.

Example Directory Protocol (continued)
Block is Exclusive (the current value of the block is held in the cache of the processor identified by Sharers, the owner); 3 possible directory requests:
–Read miss: the owner processor is sent a data-fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
–Data write-back: the owner processor is replacing the block and hence must write it back. This makes the memory copy up to date (the home directory essentially becomes the owner), the block is now Uncached, and the Sharers set is empty.
–Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
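The home node's side of these transitions can be sketched as a small state machine for one block (a sketch of mine; the class, method names, and the `caches` dict standing in for per-processor copies are all illustrative). It covers the Uncached, Shared, and Exclusive cases described in the two slides above.

```python
class Directory:
    """Home-node directory entry for one memory block."""
    def __init__(self, value):
        self.state = "Uncached"
        self.sharers = set()               # processor ids holding copies
        self.memory = value                # the home memory copy

    def read_miss(self, proc, caches):
        if self.state == "Exclusive":      # fetch from owner, write back,
            owner = next(iter(self.sharers))   # owner keeps a readable copy
            self.memory = caches[owner]
        self.state = "Shared"
        self.sharers.add(proc)
        return self.memory                 # data sent to the requester

    def write_miss(self, proc, caches):
        if self.state == "Exclusive":
            owner = next(iter(self.sharers))
            self.memory = caches[owner]    # old owner supplies the block
        # (conceptually: invalidates are sent to all other sharers here)
        self.sharers = {proc}              # requester becomes sole owner
        self.state = "Exclusive"
        return self.memory

    def write_back(self, proc):
        self.state = "Uncached"            # home becomes the owner again
        self.sharers = set()

caches = {}
d = Directory(10)
caches[1] = d.read_miss(1, caches)   # P1 read: Uncached -> Shared, {1}
caches[2] = d.write_miss(2, caches)  # P2 write: -> Exclusive, {2}
caches[2] = 42                       # owner dirties its copy
caches[3] = d.read_miss(3, caches)   # P3 read: owner flushed, -> Shared
```

After the last step the directory is Shared with sharers {2, 3} and memory holds 42, matching the slide's read-miss-on-Exclusive case: the old owner's dirty value reaches both memory and the requester.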