Computer Architecture, Fall 2006 – Lecture 28: Bus Connected Multiprocessors
Adapted from Mary Jane Irwin (www.cse.psu.edu/~mji)


[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, plus slides from Chapter 18, Parallel Processing, by Stallings]

Review: Where are We Now?

[Figure: two systems, each with a processor (control + datapath), memory, and input/output]

- Multiprocessor – multiple processors with a single shared address space
- Cluster – multiple computers (each with its own address space) connected over a local area network (LAN), functioning as a single system

Multiprocessor Basics

- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors?

  Communication model                  # of processors
    Message passing                    8 to 2048
    Shared address, NUMA               8 to 256
    Shared address, UMA                2 to 64
  Physical connection
    Network                            8 to 256
    Bus                                2 to 36

Single Bus (Shared Address UMA) Multi's

- Caches are used to reduce latency and to lower bus traffic
  - Write-back caches are used to keep bus traffic at a minimum
- Must provide hardware to ensure that caches and memory are consistent (cache coherency)
- Must provide a hardware mechanism to support process synchronization

[Figure: Proc1–Proc4, each with caches, connected by a single bus to memory and I/O]

Multiprocessor Cache Coherency

- Cache coherency protocols
  - Bus snooping – cache controllers monitor shared bus traffic with duplicate address tag hardware (so they don't interfere with the processor's access to the cache)

[Figure: Proc1–ProcN, each with a data cache and a snoop unit, connected by a single bus to memory and I/O]

Bus Snooping Protocols

- Multiple copies are not a problem when reading
- A processor must have exclusive access to write a word
  - What happens if two processors try to write to the same shared data word in the same clock cycle? The bus arbiter decides which processor gets the bus first, and that processor gets the first exclusive access; the second processor then gets exclusive access. Thus, bus arbitration forces sequential behavior.
  - This sequential consistency is the most conservative of the memory consistency models. With it, the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved.
- All other processors sharing that data must be informed of writes

Handling Writes

Ensuring that all other processors sharing data are informed of writes can be handled in two ways:

1. Write-update (write-broadcast) – the writing processor broadcasts the new data over the bus, and all copies are updated
   - All writes go to the bus, so bus traffic is higher
   - Since new values appear in caches sooner, latency can be reduced
2. Write-invalidate – the writing processor issues an invalidation signal on the bus; cache snoops check whether they have a copy of the data and, if so, invalidate their cache block containing the word (this allows multiple readers but only one writer)
   - Uses the bus only on the first write, so bus traffic is lower and bus bandwidth is used better

A Write-Invalidate CC Protocol

[State diagram of the write-back caching protocol (shown in black), with states Invalid, Shared (clean), and Modified (dirty):
- Invalid → Shared on a read (miss)
- Invalid → Modified on a write (miss)
- Shared → Shared on a read (hit or miss)
- Shared → Modified on a write (hit or miss)
- Modified → Modified on a read (hit) or write (hit or miss)]

A Write-Invalidate CC Protocol

[Same state diagram with the coherence additions – signals from the processor in red, signals from the bus in blue:
- Invalid → Shared on a read (miss)
- Invalid → Modified on a write (miss)
- Shared → Shared on a read (hit or miss)
- Shared → Modified on a write (hit or miss), sending an invalidate signal on the bus
- Modified → Modified on a read (hit) or write (hit)
- Shared → Invalid on receiving an invalidate (a write by another processor to this block)
- Modified → Shared on a write-back due to a read (miss) by another processor to this block
- Modified → Invalid on a write (miss) by another processor to this block]

Write-Invalidate CC Examples

- I = invalid (many copies allowed), S = shared (many), M = modified (only one)

Example 1 – read miss, clean copy elsewhere (Proc 1 holds A in state S, Proc 2 in state I):
1. Proc 2 has a read miss for A
2. Proc 2 issues a read request for A on the bus
3. Proc 1's snoop sees the read request for A and lets main memory supply A
4. Proc 2 gets A from main memory and changes its state to S

Example 2 – write miss, clean copy elsewhere (Proc 1 holds A in state S, Proc 2 in state I):
1. Proc 2 has a write miss for A
2. Proc 2 writes A and changes its state to M
3. Proc 2 sends an invalidate for A
4. Proc 1 changes its copy of A to state I

Example 3 – read miss, dirty copy elsewhere (Proc 1 holds A in state M, Proc 2 in state I):
1. Proc 2 has a read miss for A
2. Proc 2 issues a read request for A on the bus
3. Proc 1's snoop sees the read request for A, writes A back to main memory, and changes its state to S
4. Proc 2 gets A from main memory and changes its state to S

Example 4 – write miss, dirty copy elsewhere (Proc 1 holds A in state M, Proc 2 in state I):
1. Proc 2 has a write miss for A
2. Proc 2 writes A and changes its state to M
3. Proc 2 sends an invalidate for A
4. Proc 1 changes its copy of A to state I

SMP Data Miss Rates

- Shared data has lower spatial and temporal locality
  - Shared data misses often dominate cache behavior even though they may be only 10% to 40% of the data accesses

[Chart: miss rates for a 64KB, 2-way set associative data cache with 32B blocks; from Hennessy & Patterson, Computer Architecture: A Quantitative Approach]

Block Size Effects

- Writes to one word in a multi-word block mean that
  - either the full block is invalidated (write-invalidate)
  - or the full block is exchanged between processors (write-update)
    - Alternatively, only the written word could be broadcast
- Multi-word blocks can also result in false sharing: two processors writing to two different variables that happen to lie in the same cache block
  - With write-invalidate, false sharing increases cache miss rates
- Compilers can help reduce false sharing by allocating highly correlated data to the same cache block

[Figure: variables A and B in the same 4-word cache block, written by Proc1 and Proc2 respectively]

Other Coherence Protocols

- There are many variations on cache coherence protocols
- Another write-invalidate protocol, used in the Pentium 4 (and many other microprocessors), is MESI, with four states:
  - Modified – (same as before) only the modified cache copy is up-to-date; the memory copy and all other cache copies are out-of-date
  - Exclusive – only one copy of the shared data is allowed to be cached; memory has an up-to-date copy
    - Since there is only one copy of the block, write hits don't need to send an invalidate signal
  - Shared – multiple copies of the shared data may be cached (i.e., the data is permitted to be cached by more than one processor); memory has an up-to-date copy
  - Invalid – (same as before)

MESI Cache Coherency Protocol

[State diagram with states Invalid (not valid block), Shared (clean), Exclusive (clean), and Modified (dirty); key transitions:
- Invalid → Shared on a processor shared read miss
- Invalid → Exclusive on a processor exclusive read miss
- Invalid → Modified on a processor write miss
- Shared → Shared on a processor shared read
- Shared → Modified on a processor write [send invalidate signal]
- Shared → Invalid on an invalidate for this block
- Exclusive → Exclusive on a processor exclusive read
- Exclusive → Shared on a processor shared read
- Exclusive → Modified on a processor write (no invalidate signal needed)
- Modified → Modified on a processor write or read hit
- Modified → Shared or Invalid when another processor has a read/write miss for this block [write back block]
- Modified → Exclusive on a processor exclusive read miss [write back block]]

Process Synchronization

- Need to be able to coordinate processes working on a common task
- Lock variables (semaphores) are used to coordinate or synchronize processes
- Need an architecture-supported arbitration mechanism to decide which processor gets access to the lock variable
  - A single bus provides the arbitration mechanism, since the bus is the only path to memory – the processor that gets the bus wins
- Need an architecture-supported operation that locks the variable
  - Locking can be done via an atomic swap operation (the processor can both read a location and set it to the locked state – test-and-set – in the same bus operation)

Spin Lock Synchronization

[Flowchart of the spin-lock loop:
1. Read the lock variable; spin while it reads locked (≠ 0)
2. When it reads unlocked (= 0), try to lock it using a swap: atomically read the lock variable and set it to the locked value (1)
3. If the swap read a 0, the lock succeeded – begin the update of the shared data; otherwise go back to step 1
4. Finish the update of the shared data, then unlock: set the lock variable to 0]

The single winning processor will read a 0; all other processors will read the 1 set by the winning processor.

Review: Summing Numbers on an SMP

Pn is the processor's number; vectors A and sum are shared variables; i is a private variable; half is a private variable initialized to the number of processors. Each processor first sums its subset of vector A:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];       /* each processor sums its subset of vector A */

Then the partial sums are added together:

    repeat                              /* adding together the partial sums */
        synch();                        /* synchronize first */
        if (half % 2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
        half = half / 2;
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);                  /* final sum in sum[0] */

An Example with 10 Processors

[Figure: ten partial sums sum[P0]–sum[P9] held by processors P0–P9; in the first reduction step only P0–P4 remain active, each adding in a partner's partial sum]

- synch(): processors must synchronize before the "consumer" processor tries to read the results from the memory location written by the "producer" processor
  - Barrier synchronization – a synchronization scheme where processors wait at the barrier, not proceeding until every processor has reached it

Barrier Implemented with Spin-Locks

n is a shared variable initialized to the number of processors; count is a shared variable initialized to 0; arrive and depart are shared spin-lock variables, where arrive is initially unlocked and depart is initially locked:

    procedure synch()
        lock(arrive);
        count := count + 1;         /* count the processors as they arrive at barrier */
        if count < n
            then unlock(arrive)
            else unlock(depart);
        lock(depart);
        count := count - 1;         /* count the processors as they leave barrier */
        if count > 0
            then unlock(depart)
            else unlock(arrive);

Spin-Locks on Bus Connected ccUMAs

- With bus-based cache coherency, spin-locks allow processors to wait on a local copy of the lock in their caches
  - This reduces bus traffic – once the processor holding the lock releases it (writes a 0), all other caches see that write and invalidate their old copy of the lock variable. Unlocking restarts the race to get the lock. The winner gets the bus and writes the lock back to 1. The other caches then invalidate their copy of the lock and, on the next lock read, fetch the new lock value (1) from memory.
- This scheme has problems scaling up to many processors because of the communication traffic when the lock is released and contested

Cache Coherence Bus Traffic

Step | Proc P0           | Proc P1    | Proc P2        | Bus activity                   | Memory
-----|-------------------|------------|----------------|--------------------------------|-------------------------
1    | Has lock          | Spins      | Spins          | None                           |
2    | Releases lock (0) | Spins      | Spins          | Bus services P0's invalidate   |
3    |                   | Cache miss | Cache miss     | Bus services P2's cache miss   |
4    | Sends lock (0)    | Waits      | Reads lock (0) | Response to P2's cache miss    | Update memory from P0
5    |                   | Reads lock (0) | Swaps lock | Bus services P1's cache miss   |
6    |                   | Swaps lock | Swap succeeds  | Response to P1's cache miss    | Sends lock variable to P1
7    |                   | Swap fails | Has lock       | Bus services P2's invalidate   |
8    |                   | Spins      | Has lock       | Bus services P1's cache miss   |

Commercial Single Backplane Multiprocessors

System       | Processor   | # proc. | MHz | BW/system
-------------|-------------|---------|-----|----------
Compaq PL    | Pentium Pro |         |     |
IBM R40      | PowerPC     |         |     |
AlphaServer  | Alpha       |         |     |
SGI Pow Chal | MIPS R      |         |     |
Sun 6000     | UltraSPARC  |         |     |

[Processor counts, clock rates, and bandwidths did not survive transcription]

Summary

- Key questions
  - Q1 – How do processors share data?
  - Q2 – How do processors coordinate their activity?
  - Q3 – How scalable is the architecture (what is the maximum number of processors)?
- Bus connected (shared address UMA (SMP)) multiprocessors
  - Cache coherency hardware ensures data consistency
  - Synchronization primitives support process synchronization
  - Scalability of bus connected UMAs is limited (to fewer than about 36 processors) because the three desirable bus characteristics – high bandwidth, low latency, and long length – are incompatible
- Network connected NUMAs are more scalable

Next Lecture and Reminders

- Reminders
  - Final: Tuesday, December 12, from 8 – 9:50 AM
- Next lecture