Computer Systems Spring 2008 Lecture 25: Intro. to Multiprocessors

Computer Systems Spring 2008 Lecture 25: Intro. to Multiprocessors
Adapted from Mary Jane Irwin ( www.cse.psu.edu/~mji )
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
Plus slides from Chapter 18, Parallel Processing, by Stallings

The Big Picture: Where are We Now?
[Figure: two processor systems side by side, each with control, datapath, memory, input, and output]
- Multiprocessor – multiple processors with a single shared address space
- Cluster – multiple computers (each with its own address space) connected over a local area network (LAN), functioning as a single system

Applications Needing “Supercomputing”
- Energy: plasma physics (simulating fusion reactions), geophysical (petroleum) exploration
- DoE stockpile stewardship (ensuring the safety and reliability of the nation’s stockpile of nuclear weapons)
- Earth and climate: climate and weather prediction; earthquake and tsunami prediction and mitigation of risks
- Transportation: improving vehicles’ airflow dynamics, fuel consumption, crashworthiness, and noise reduction
- Bioinformatics and computational biology: genomics, protein folding, designer drugs
- Societal health and safety: pollution reduction, disaster planning, terrorist action detection
http://www.nap.edu/books/0309095026/html/

Encountering Amdahl’s Law
Speedup due to enhancement E is

  Speedup w/ E = Exec time w/o E / Exec time w/ E

Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected:

  ExTime w/ E = ExTime w/o E x ((1-F) + F/S)

  Speedup w/ E = 1 / ((1-F) + F/S)

Examples: Amdahl’s Law

  Speedup w/ E = 1 / ((1-F) + F/S)

- Consider an enhancement that runs 20 times faster but is usable only 25% of the time:
  Speedup w/ E = 1/(.75 + .25/20) = 1.31
- What if it’s usable only 15% of the time?
  Speedup w/ E = 1/(.85 + .15/20) = 1.17
- Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
- To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less
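
As a quick sanity check of the arithmetic above, here is a small C sketch of the speedup formula (the function name and test values are just illustrative):

  #include <stdio.h>

  /* Amdahl's Law: speedup = 1 / ((1-F) + F/S) */
  static double speedup(double F, double S)
  {
      return 1.0 / ((1.0 - F) + F / S);
  }

  int main(void)
  {
      printf("%.2f\n", speedup(0.25, 20.0));    /* 1.31: 20x faster, usable 25% of the time */
      printf("%.2f\n", speedup(0.15, 20.0));    /* 1.17: 20x faster, usable 15% of the time */
      printf("%.2f\n", speedup(0.9999, 100.0)); /* ~99: 100 processors, 0.01% scalar */
      return 0;
  }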

Supercomputer Style Migration (Top500)
http://www.top500.org/lists/2005/11/ (November 2005 data)
- Cluster – whole computers interconnected using their I/O bus
- Constellation – a cluster that uses an SMP multiprocessor as the building block (here SMP covers both UMA and NUMA)
- In the last 8 years uniprocessors and SIMDs disappeared from the list while clusters and constellations grew from 3% to 80%

Multiprocessor/Clusters Key Questions
- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors can be supported?

Flynn’s Classification Scheme
- SISD – single instruction, single data stream
  aka uniprocessor – what we have been talking about all semester
- SIMD – single instruction, multiple data streams
  a single control unit broadcasting operations to multiple datapaths
  (now obsolete except for . . .)
- MISD – multiple instruction, single data
  no such machine (although some people put vector machines in this category)
- MIMD – multiple instructions, multiple data streams
  aka multiprocessors (SMPs, MPPs, clusters, NOWs)

Taxonomy of Parallel Processor Architectures
[Figure from Stallings, Chapter 18: taxonomy tree of parallel processor organizations]

SIMD Processors
[Figure: one control unit driving an array of processing elements (PEs)]
- A single control unit with multiple datapaths (processing elements – PEs) running in parallel
- Q1 – PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
- Q2 – Each PE performs the same operation on its own local data

Example SIMD Machines

  System     Maker              Year   # PEs    # b/PE   Max memory (MB)   PE clock (MHz)   System BW (MB/s)
  Illiac IV  UIUC               1972       64      64           1               13               2,560
  DAP        ICL                1980    4,096       1           2                5               2,560
  MPP        Goodyear           1982   16,384       1           2               10              20,480
  CM-2       Thinking Machines  1987   65,536       1         512                7              16,384
  MP-1216    MasPar             1989   16,384       4       1,024               25              23,000

Multiprocessor Basic Organizations
- Processors connected by a single bus
- Processors connected by a network

  Communication model                # of Proc
    Message passing                  8 to 2048
    Shared address   NUMA            8 to 256
                     UMA             2 to 64
  Physical connection
    Network                          8 to 256
    Bus                              2 to 36

Shared Address (Shared Memory) Multi’s
- Q1 – A single address space shared by all the processors
- Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores)
  Use of shared data must be coordinated via synchronization primitives (locks), as in the sketch below
- UMAs (uniform memory access) – aka SMPs (symmetric multiprocessors)
  All accesses to main memory take the same amount of time no matter which processor makes the request or which location is requested
- NUMAs (nonuniform memory access)
  Some main memory accesses are faster than others depending on the processor making the request and which location is requested
  Can scale to larger sizes than UMAs, so are potentially higher performance
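
To make Q1 and Q2 concrete, here is a minimal shared-address sketch using POSIX threads. The slide does not prescribe an API; pthreads is just one common choice, and the names (worker, counter) are illustrative:

  #include <pthread.h>
  #include <stdio.h>

  static long counter = 0;                          /* shared variable: one address space */
  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg)
  {
      for (int i = 0; i < 100000; i++) {
          pthread_mutex_lock(&m);                   /* synchronization primitive (lock) */
          counter++;                                /* communicate via ordinary loads/stores */
          pthread_mutex_unlock(&m);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("counter = %ld\n", counter);           /* 200000 with the lock; racy without it */
      return 0;
  }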

Message Passing Multiprocessors
- Each processor has its own private address space
- Q1 – Processors share data by explicitly sending and receiving information (messages)
- Q2 – Coordination is built into the message passing primitives (send and receive)
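
For contrast with the shared-address sketch, here is a minimal message-passing sketch using MPI (an assumption: the slide names only the send/receive primitives, not a particular library):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, value;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;
          /* Q1: data is shared only by explicitly sending it */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          /* Q2: coordination is built in - receive blocks until the message arrives */
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("processor 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }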

N/UMA Remote Memory Access Times (RMAT)

  System                  Year   Type   Max Proc   Interconnection Network      RMAT (ns)
  Sun Starfire            1996   SMP        64     Address buses, data switch     500
  Cray T3E                1996   NUMA     2048     2-way 3D torus                 300
  HP V                    1998   SMP        32     8 x 8 crossbar                1000
  SGI Origin 3000         1999   NUMA      512     Fat tree                       500
  Compaq AlphaServer GS   1999   SMP        32     Switched bus                   400
  Sun V880                2002   SMP         8     Switched bus                   240
  HP Superdome 9000       2003   SMP        64     Switched bus                   275
  NASA Columbia           2004   NUMA    10240     Fat tree                       ???

Multiprocessor Basics
- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors?

  Communication model                # of Proc
    Message passing                  8 to 2048
    Shared address   NUMA            8 to 256
                     UMA             2 to 64
  Physical connection
    Network                          8 to 256
    Bus                              2 to 36

Single Bus (Shared Address UMA) Multi’s
[Figure: Proc1–Proc4, each with caches, connected by a single bus to memory and I/O]
- Caches are used to reduce latency and to lower bus traffic
- Write-back caches are used to keep bus traffic at a minimum
- Must provide hardware to ensure that caches and memory are consistent (cache coherency)
- Must provide a hardware mechanism to support process synchronization

Multiprocessor Cache Coherency
- Cache coherency protocols
- Bus snooping – cache controllers monitor shared bus traffic with duplicate address tag hardware (so they don’t interfere with the processor’s access to the cache)
[Figure: Proc1–ProcN, each with a data cache and a snoop tag unit, on a single bus with memory and I/O]

Bus Snooping Protocols
- Multiple copies are not a problem when reading, but a processor must have exclusive access to write a word
- All other processors sharing that data must be informed of writes
- What happens if two processors try to write to the same shared data word in the same clock cycle? The bus arbiter decides which processor gets the bus first, and that processor gets the first exclusive access; then the second processor gets exclusive access. Thus, bus arbitration forces sequential behavior.
- This sequential consistency is the most conservative of the memory consistency models. With it, the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved.

Handling Writes
Ensuring that all other processors sharing data are informed of writes can be handled two ways:
- Write-update (write-broadcast) – the writing processor broadcasts the new data over the bus and all copies are updated
  All writes go to the bus, so bus traffic is higher
  Since new values appear in caches sooner, it can reduce latency
- Write-invalidate – the writing processor issues an invalidation signal on the bus; cache snoops check to see if they have a copy of the data and, if so, invalidate their cache block containing the word (this allows multiple readers but only one writer)
  Uses the bus only on the first write, so bus traffic is lower and bus bandwidth is used better
- All commercial machines use write-invalidate as their standard coherence protocol

A Write-Invalidate CC Protocol
[State diagram with three states – Invalid, Shared (clean), and Modified (dirty). A read (miss) takes Invalid to Shared; a write (miss) takes Invalid to Modified, sending an invalidate; a write (hit or miss) takes Shared to Modified, sending an invalidate; receiving an invalidate (a write by another processor to this block) takes Shared to Invalid; a write-back due to a read (miss) by another processor to this block takes Modified to Shared; a read (hit or miss) loops on Shared; a read (hit) or write (hit) loops on Modified. The write-back caching protocol is shown in black, coherence additions for signals from the processor in red, and coherence additions for signals from the bus in blue.]

- A read miss causes the cache to acquire the bus and write back the victim block (if it was in the Modified (dirty) state). All the other caches monitor the read miss to see if the block is in their cache. If one has a copy in the Modified state, the block is written back and its state is changed to Invalid. The read miss is then satisfied by reading the block from memory, and the state of the block is set to Shared. Read hits do not change the cache state.
- A write miss to an Invalid block causes the cache to acquire the bus, read the block, modify the portion of the block being written, and change the block’s state to Modified.
- A write miss to a Shared block causes the cache to acquire the bus, send an invalidate signal to invalidate any other existing copies in other caches, modify the portion of the block being written, and change the block’s state to Modified.
- A write hit to a Modified block causes no bus action. A write hit to a Shared block causes the cache to acquire the bus, send an invalidate signal to invalidate any other existing copies in other caches, modify the part of the block being written, and change the state to Modified.
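
The state machine above can be summarized in code. This is a behavioral sketch of the three-state write-invalidate protocol as described on the slide, not any real coherence controller; all names are illustrative:

  /* states of one cache block */
  typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

  /* events seen by one cache controller */
  typedef enum {
      PROC_READ, PROC_WRITE,         /* from this cache's own processor */
      BUS_READ_MISS, BUS_INVALIDATE  /* snooped from the bus */
  } event_t;

  line_state_t next_state(line_state_t s, event_t e)
  {
      switch (s) {
      case INVALID:
          if (e == PROC_READ)  return SHARED;      /* read miss: fetch block */
          if (e == PROC_WRITE) return MODIFIED;    /* write miss: fetch block, send invalidate */
          return INVALID;
      case SHARED:
          if (e == PROC_WRITE)     return MODIFIED; /* send invalidate on the bus */
          if (e == BUS_INVALIDATE) return INVALID;  /* another processor wrote this block */
          return SHARED;                            /* read hits and misses stay Shared */
      case MODIFIED:
          if (e == BUS_READ_MISS)  return SHARED;   /* write back, then share */
          if (e == BUS_INVALIDATE) return INVALID;  /* write back, other writer wins */
          return MODIFIED;                          /* read/write hits stay Modified */
      }
      return s;
  }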

Write-Invalidate CC Examples
I = invalid (many), S = shared (many), M = modified (only one)

- Example 1 – read miss when Proc 1 has A Shared: 1. Proc 2 has a read miss for A; 2. it puts a read request for A on the bus; 3. Proc 1’s snoop sees the read request for A and lets main memory supply A; 4. Proc 2 gets A from main memory and changes its state to S.
- Example 2 – write miss when Proc 1 has A Shared: 1. Proc 2 has a write miss for A; 2. it writes A and changes its state to M; 3. Proc 2 sends an invalidate for A; 4. Proc 1 changes its copy of A to state I.
- Example 3 – read miss when Proc 1 has A Modified: 1. Proc 2 has a read miss for A; 2. it puts a read request for A on the bus; 3. Proc 1’s snoop sees the read request and writes A back to main memory; 4. Proc 2 gets A from main memory and changes its state to S.
- Example 4 – write miss when Proc 1 has A Modified: 1. Proc 2 has a write miss for A; 2. it puts a read request for A on the bus; 3. Proc 1’s snoop sees the read request and writes A back to main memory; 4. Proc 2 gets A from main memory, writes it, and changes its state to M; 5. Proc 2 sends an invalidate for A and Proc 1 changes its copy of A to state I.

Notes: in the 1st example, supplying A from Proc 1’s cache would be faster than getting it from main memory; the 2nd and 4th examples assume a write-allocate cache policy.

Other Coherence Protocols
There are many variations on cache coherence protocols. Another write-invalidate protocol, used in the Pentium 4 (and many other micros), is MESI with four states:
- Modified – (same as before) only the modified cache copy is up-to-date; the memory copy and all other cache copies are out-of-date
- Exclusive – only one copy of the shared data is allowed to be cached; memory has an up-to-date copy
  Since there is only one copy of the block, write hits don’t need to send an invalidate signal
- Shared – multiple copies of the shared data may be cached (i.e., the data is permitted to be cached by more than one processor); memory has an up-to-date copy
- Invalid – (same as before)

MESI Cache Coherency Protocol
[State diagram with four states – Invalid (not valid block), Shared (clean), Exclusive (clean), and Modified (dirty). Transition labels include: processor shared read and shared read miss, processor exclusive read and exclusive read miss, processor write and write miss, processor write or read hit, invalidate for this block, another processor has a read/write miss for this block, [write back block], and [send invalidate signal].]

Process Synchronization
- Need to be able to coordinate processes working on a common task
- Lock variables (semaphores) are used to coordinate or synchronize processes
- Need an architecture-supported arbitration mechanism to decide which processor gets access to the lock variable
  The single bus provides the arbitration mechanism, since the bus is the only path to memory – the processor that gets the bus wins
- Need an architecture-supported operation that locks the variable
  Locking can be done via an atomic swap operation (the processor can both read a location and set it to the locked state – test-and-set – in the same bus operation)

Spin Lock Synchronization

  1. Read the lock variable. If it is unlocked (= 0), continue; otherwise keep spinning.
  2. Try to lock the variable using swap: read the lock variable and set it to the locked value (1) – an atomic operation.
  3. If the swap succeeded (read a 0), begin the update of the shared data; otherwise go back to step 1.
  4. When the update is finished, unlock: set the lock variable to 0.

The single winning processor will read a 0 – all other processors will read the 1 set by the winning processor. The losers will continue to write the variable with the locked value of 1, but that is okay!
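
In C11 the slide’s atomic swap maps onto atomic_exchange, so the spin lock above can be sketched as follows (a sketch only; the names acquire/release/lk are illustrative):

  #include <stdatomic.h>

  static atomic_int lk = 0;                 /* 0 = unlocked, 1 = locked */

  void acquire(void)
  {
      while (atomic_exchange(&lk, 1) != 0)  /* swap: read the old value, set to 1 */
          ;                                 /* losers read 1 and keep spinning */
  }

  void release(void)
  {
      atomic_store(&lk, 0);                 /* spin unlock: set lock variable to 0 */
  }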

Summing 100,000 Numbers on 10 Processors
Processors start by running a loop that sums their subset of vector A numbers (vector A and sum are shared variables, Pn is the processor’s number, i is a private variable):

  sum[Pn] = 0;
  for (i = 10000*Pn; i < 10000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];

The processors then coordinate in adding together the partial sums (half is a private variable initialized to 10, the number of processors):

  repeat
    synch();                            /* synchronize first */
    if (half % 2 != 0 && Pn == 0)
      sum[0] = sum[0] + sum[half-1];    /* if half is odd, P0 picks up the straggler */
    half = half / 2;
    if (Pn < half)
      sum[Pn] = sum[Pn] + sum[Pn+half];
  until (half == 1);                    /* final sum in sum[0] */

Divide and conquer summing – half of the processors add pairs of partial sums, then a quarter add pairs of the new partial sums, and so on.

An Example with 10 Processors
- synch(): processors must synchronize before the “consumer” processor tries to read the result from the memory location written by the “producer” processor
- Barrier synchronization – a synchronization scheme where processors wait at the barrier, not proceeding until every processor has reached it
[Figure: processors P0–P9, each holding its partial sum sum[P0]–sum[P9]]

An Example with 10 Processors (continued)
[Figure: the reduction tree. With half = 10, P0–P4 each add a pair of partial sums; with half = 5, P0 also picks up the odd straggler and P0–P1 add pairs at half = 2; finally P0 forms the total at half = 1.]

Barrier Implemented with Spin-Locks
n is a shared variable initialized to the number of processors, count is a shared variable initialized to 0, and arrive and depart are shared spin-lock variables, where arrive is initially unlocked and depart is initially locked:

  procedure synch()
    lock(arrive);
    count := count + 1;       /* count the processors as they arrive at barrier */
    if count < n
      then unlock(arrive)
      else unlock(depart);

    lock(depart);
    count := count - 1;       /* count the processors as they leave the barrier */
    if count > 0
      then unlock(depart)
      else unlock(arrive);

As each arriving processor executes the barrier procedure, it increments the global counting variable. If the count has not yet reached n, the processor suspends (spins) within the procedure. As the processors arrive at the barrier, each becomes suspended and count grows. Finally, the last processor to arrive finds that the count has reached n and releases all of the suspended processors.
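
A direct C11 translation of this two-lock barrier, reusing spin locks built from atomic_exchange (a sketch under the assumption of a fixed processor count N; all names are illustrative):

  #include <stdatomic.h>

  #define N 10                          /* number of processors (assumed fixed here) */

  static atomic_int arrive = 0;         /* spin lock, initially unlocked */
  static atomic_int depart = 1;         /* spin lock, initially locked */
  static int count = 0;                 /* shared, always accessed under a lock */

  static void lock(atomic_int *l)   { while (atomic_exchange(l, 1) != 0) ; }
  static void unlock(atomic_int *l) { atomic_store(l, 0); }

  void synch(void)
  {
      lock(&arrive);
      count = count + 1;                /* count the processors as they arrive */
      if (count < N)
          unlock(&arrive);              /* let the next arrival in */
      else
          unlock(&depart);              /* last arrival opens the departure gate */

      lock(&depart);
      count = count - 1;                /* count the processors as they leave */
      if (count > 0)
          unlock(&depart);              /* let the next departure out */
      else
          unlock(&arrive);              /* last departure re-arms the barrier */
  }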

Spin-Locks on Bus Connected ccUMAs
- With bus-based cache coherency, spin-locks allow processors to wait on a local copy of the lock in their caches, which reduces bus traffic. Once the processor holding the lock releases it (writes a 0), all other caches see that write and invalidate their old copy of the lock variable. Unlocking restarts the race to get the lock. The winner gets the bus and writes the lock back to 1. The other caches then invalidate their copy of the lock, and on the next read of the lock they fetch the new value (1) from memory.
- This scheme has problems scaling up to many processors because of the communication traffic when the lock is released and contested
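
“Waiting on a local copy” is exactly what a test-and-test-and-set lock does: spin with ordinary reads (which hit in the cache and generate no bus traffic) and attempt the bus-visible swap only when the lock looks free. A hedged C11 sketch (the name lock_tts is illustrative):

  #include <stdatomic.h>

  void lock_tts(atomic_int *l)
  {
      for (;;) {
          /* test: spin on the locally cached copy - plain loads, no bus traffic */
          while (atomic_load(l) != 0)
              ;
          /* test-and-set: the lock looked free, so race for it with one atomic swap */
          if (atomic_exchange(l, 1) == 0)
              return;                   /* we read the 0, so we hold the lock */
      }
  }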

Cache Coherence Bus Traffic

  Step  Proc P0            Proc P1      Proc P2         Bus activity                   Memory
  1     Has lock           Spins        Spins           None
  2     Releases lock (0)  Spins        Spins           Bus services P0’s invalidate
  3                        Cache miss   Cache miss      Bus services P2’s cache miss
  4     Sends lock (0)     Waits        Reads lock (0)  Response to P2’s cache miss    Update memory from P0
  5                        Waits        Swaps lock      Bus services P1’s cache miss
  6                        Reads lock   Swap succeeds   Response to P1’s cache miss    Sends lock variable to P1
  7                        Swap fails   Has lock        Bus services P2’s invalidate
  8                        Spins                        None

Commercial Single Backplane Multiprocessors

  System        Processor      # Proc   MHz   BW/system (MB/s)
  Compaq PL     Pentium Pro       4     200        540
  IBM R40       PowerPC           8     112       1800
  AlphaServer   Alpha 21164      12     440       2150
  SGI Pow Chal  MIPS R10000      36     195       1200
  Sun 6000      UltraSPARC       30     167       2600

Summary
- Flynn’s classification of processors – SISD, SIMD, MIMD
  Q1 – How do processors share data?
  Q2 – How do processors coordinate their activity?
  Q3 – How scalable is the architecture (what is the maximum number of processors)?
- Bus connected (shared address UMA, i.e., SMP) multi’s
  Cache coherency hardware to ensure data consistency
  Synchronization primitives for synchronization
  Scalability of bus connected UMAs is limited (< ~36 processors) because the three desirable bus characteristics – high bandwidth, low latency, and long length – are incompatible
- Network connected NUMAs are more scalable