.1 Intro to Multiprocessors [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

.2 The Big Picture: Where are We Now?
(Figure: two processor/memory/I/O blocks, each processor containing a control unit and a datapath)
- Multiprocessor – multiple processors with a single shared address space
- Cluster – multiple computers (each with its own address space) connected over a local area network (LAN) and functioning as a single system

.3 Applications Needing "Supercomputing"
- Energy: plasma physics (simulating fusion reactions), geophysical (petroleum) exploration
- DoE stockpile stewardship (ensuring the safety and reliability of the nation's stockpile of nuclear weapons)
- Earth and climate: climate and weather prediction, earthquake and tsunami prediction and mitigation of risks
- Transportation: improving vehicles' airflow dynamics, fuel consumption, crashworthiness, noise reduction
- Bioinformatics and computational biology: genomics, protein folding, designer drugs
- Societal health and safety: pollution reduction, disaster planning, terrorist action detection

.4 Encountering Amdahl's Law
- Speedup due to enhancement E is
  Speedup w/ E = Exec time w/o E / Exec time w/ E
- Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S and that the remainder of the task is unaffected
  ExTime w/ E = ? (worked out on the next slide)
  Speedup w/ E = ?

.5 Encountering Amdahl's Law
- Speedup due to enhancement E is
  Speedup w/ E = Exec time w/o E / Exec time w/ E
- Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S and that the remainder of the task is unaffected
  ExTime w/ E = ExTime w/o E × ((1-F) + F/S)
  Speedup w/ E = 1 / ((1-F) + F/S)
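
For readers who want the algebra spelled out, the same relation in display math (no symbols beyond the slide's F and S):

    \[
    \text{Speedup}_{w/E}
      = \frac{\text{ExTime}_{w/o\,E}}{\text{ExTime}_{w/\,E}}
      = \frac{\text{ExTime}_{w/o\,E}}{\text{ExTime}_{w/o\,E}\,\bigl((1-F) + F/S\bigr)}
      = \frac{1}{(1-F) + F/S}
    \]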

.6 Examples: Amdahl's Law
- Consider an enhancement that runs 20 times faster but is only usable 25% of the time.
  Speedup w/ E = ? (answer on the next slide)
- What if it is usable only 15% of the time?
  Speedup w/ E = ?
- Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
- To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less
  Speedup w/ E = ?

.7 Examples: Amdahl's Law
- Consider an enhancement that runs 20 times faster but is only usable 25% of the time.
  Speedup w/ E = 1 / (0.75 + 0.25/20) = 1.31
- What if it is usable only 15% of the time?
  Speedup w/ E = 1 / (0.85 + 0.15/20) = 1.17
- Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
- To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less
  Speedup w/ E = 1 / ((1-F) + F/S)
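
A quick way to check these numbers is to plug them into the formula; the sketch below does so in C (the function name speedup and the 0.9999 fraction used for the 100-processor case are mine, chosen to match the slide's 0.01% claim):

    #include <stdio.h>

    /* Amdahl's Law: speedup when a fraction f of the work is sped up by factor s. */
    static double speedup(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        printf("F=0.25,   S=20  -> %.2f\n", speedup(0.25, 20.0));     /* ~1.31 */
        printf("F=0.15,   S=20  -> %.2f\n", speedup(0.15, 20.0));     /* ~1.17 */
        printf("F=0.9999, S=100 -> %.2f\n", speedup(0.9999, 100.0));  /* ~99 when only 0.01% is scalar */
        return 0;
    }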

.8 Supercomputer Style Migration (Top500)
(Figure: Top500 architecture share over time, November data)
- In the last 8 years uniprocessors and SIMDs disappeared while clusters and constellations grew from 3% to 80%
- Cluster – whole computers interconnected using their I/O bus
- Constellation – a cluster that uses an SMP multiprocessor as the building block

.9 Multiprocessor/Cluster Key Questions
- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors can be supported?

.10 Flynn's Classification Scheme
- Now obsolete except for...
- SISD – single instruction stream, single data stream
  - aka uniprocessor – what we have been talking about all semester
- SIMD – single instruction stream, multiple data streams
  - single control unit broadcasting operations to multiple datapaths
- MISD – multiple instruction streams, single data stream
  - no such machine (although some people put vector machines in this category)
- MIMD – multiple instruction streams, multiple data streams
  - aka multiprocessors (SMPs, MPPs, clusters, NOWs)

.11 SIMD Processors
- Single control unit
- Multiple datapaths (processing elements – PEs) running in parallel
  - Q1 – PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
  - Q2 – Each PE performs the same operation on its own local data
(Figure: one control unit broadcasting to an array of PEs)
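
To make "one control unit, same operation on many local data elements" concrete, here is a tiny data-parallel loop (illustrative only, not from the deck); on a SIMD machine, PE i would own element i of each array and all PEs would execute the add in lockstep:

    /* The data-parallel pattern a SIMD machine exploits: conceptually,
       PE i holds a[i], b[i], c[i] and every PE executes the same add
       under a single control unit. */
    void vector_add(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];   /* identical operation on every element */
        }
    }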

.12 Example SIMD Machines

  System     Maker              Year   # PEs    # b/PE   Max memory (MB)   PE clock (MHz)   System BW (MB/s)
  Illiac IV  UIUC               1972   64       64       –                 –                2,560
  DAP        ICL                1980   4,096    1        –                 –                2,560
  MPP        Goodyear           1982   16,384   1        –                 –                20,480
  CM-2       Thinking Machines  1987   65,536   1        –                 –                16,384
  MP-1216    MasPar             1989   16,384   4        –                 –                23,000

.13 Multiprocessor Basic Organizations
- Processors connected by a single bus
- Processors connected by a network

  Category              Choice                # of processors
  Communication model   Message passing       8 to 2048
                        Shared address NUMA   8 to 256
                        Shared address UMA    2 to 64
  Physical connection   Network               8 to 256
                        Bus                   2 to 36

.14 Shared Address (Shared Memory) Multi's
- UMAs (uniform memory access) – aka SMPs (symmetric multiprocessors)
  - all accesses to main memory take the same amount of time no matter which processor makes the request or which location is requested
- NUMAs (nonuniform memory access)
  - some main memory accesses are faster than others depending on the processor making the request and which location is requested
  - can scale to larger sizes than UMAs, so are potentially higher performance
- Q1 – Single address space shared by all the processors
- Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - use of shared data must be coordinated via synchronization primitives (locks)
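
As a software-level illustration of Q2 (coordination through shared variables plus a lock), here is a minimal POSIX-threads sketch; the deck does not name any particular threading API, so pthreads is just one convenient choice:

    #include <pthread.h>
    #include <stdio.h>

    /* Shared variable in the single address space, protected by a lock. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* synchronization primitive */
            counter++;                    /* ordinary load/store to shared memory */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* 400000 with the lock in place */
        return 0;
    }

Without the lock, the four threads' read-modify-write sequences interleave and updates are lost, which is exactly why the slide insists on synchronization primitives.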

.15 Single Bus (Shared Address UMA) Multi's
- Caches are used to reduce latency and to lower bus traffic
- Must provide hardware to ensure that caches and memory are consistent (cache coherency)
- Must provide a hardware mechanism to support process synchronization
(Figure: processors with private caches sharing a single bus to memory and I/O)

.16 Message Passing Multiprocessors
- Each processor has its own private address space
- Q1 – Processors share data by explicitly sending and receiving information (messages)
- Q2 – Coordination is built into the message passing primitives (send and receive)
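
A minimal sketch of explicit send and receive using MPI (the deck does not name MPI; it is simply one common message-passing library with send/receive primitives):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int value = 42;   /* lives in rank 0's private address space */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int value;
            /* Receive blocks until the message arrives: coordination is built into the primitive. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2 ./a.out.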

.17 Summary
- Flynn's classification of processors – SISD, SIMD, MIMD
- Key questions
  - Q1 – How do processors share data?
  - Q2 – How do processors coordinate their activity?
  - Q3 – How scalable is the architecture (what is the maximum number of processors)?
- Shared address multis – UMAs and NUMAs
- Bus connected (shared address UMA) multis
  - cache coherency hardware to ensure data consistency
  - synchronization primitives for process synchronization
  - bus traffic limits the scalability of the architecture (< ~36 processors)
- Message passing multis

.18 Review: Where are We Now?
(Figure: two processor/memory/I/O blocks, each processor containing a control unit and a datapath)
- Multiprocessor – multiple processors with a single shared address space
- Cluster – multiple computers (each with its own address space) connected over a local area network (LAN) and functioning as a single system

.19 Multiprocessor Basics
- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors?

  Category              Choice                # of processors
  Communication model   Message passing       8 to 2048
                        Shared address NUMA   8 to 256
                        Shared address UMA    2 to 64
  Physical connection   Network               8 to 256
                        Bus                   2 to 36

.20 Single Bus (Shared Address UMA) Multi's
- Caches are used to reduce latency and to lower bus traffic
  - write-back caches are used to keep bus traffic at a minimum
- Must provide hardware to ensure that caches and memory are consistent (cache coherency)
- Must provide a hardware mechanism to support process synchronization
(Figure: four processors, each with its own caches, sharing a single bus to memory and I/O)

.21 Multiprocessor Cache Coherency
- Cache coherency protocols
  - Bus snooping – cache controllers monitor shared bus traffic with duplicate address tag hardware (so they don't interfere with the processor's access to the cache)
(Figure: N processors, each with a data cache and a snoop unit, on a single bus to memory and I/O)

.22 Bus Snooping Protocols
- Multiple copies are not a problem when reading
- A processor must have exclusive access to write a word
  - What happens if two processors try to write to the same shared data word in the same clock cycle? The bus arbiter decides which processor gets the bus first (and this will be the processor with the first exclusive access); then the second processor gets exclusive access. Thus, bus arbitration forces sequential behavior.
  - This sequential consistency is the most conservative of the memory consistency models. With it, the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved.
- All other processors sharing that data must be informed of writes

.23 Handling Writes
Ensuring that all other processors sharing data are informed of writes can be handled in two ways:
1. Write-update (write-broadcast) – the writing processor broadcasts the new data over the bus and all copies are updated
   - all writes go to the bus, so bus traffic is higher
   - since new values appear in caches sooner, latency can be reduced
2. Write-invalidate – the writing processor issues an invalidation signal on the bus; cache snoops check whether they have a copy of the data and, if so, invalidate their cache block containing the word (this allows multiple readers but only one writer)
   - uses the bus only on the first write, so bus traffic is lower and bus bandwidth is used better

.24 A Write-Invalidate CC Protocol
(Figure: state machine for CPU requests for a cache block in a write-back caching protocol, with states Invalid, Shared (clean), and Modified (dirty). A read takes the block to Shared; a write, whether a hit or a miss, takes it to Modified.)

.25 A Write-Invalidate CC Protocol
(Figure: the same write-back state machine with coherence additions. Signals from the processor, the coherence additions in red: a write sends an invalidate on the bus. Signals from the bus, the coherence additions in blue: receiving an invalidate, i.e. a write by another processor to this block, takes the block to Invalid; a write-back occurs when another processor read-misses on a Modified block. One state machine handles CPU requests for a cache block and one handles bus requests for each cache block.)
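
The protocol on these two slides can be summarized as a three-state (MSI) transition function. The sketch below is illustrative only: the event names are my own labels, and the bus actions (send invalidate, write back) appear only as comments, since real controllers implement them in hardware alongside the snoop tags:

    /* Illustrative MSI write-invalidate controller for one cache block.
       State names follow the slides (Invalid, Shared, Modified); the
       event names are hypothetical labels for this sketch. */

    typedef enum { INVALID, SHARED, MODIFIED } BlockState;

    typedef enum {
        CPU_READ, CPU_WRITE,          /* requests from this processor  */
        BUS_READ_MISS, BUS_INVALIDATE /* requests snooped from the bus */
    } Event;

    BlockState next_state(BlockState s, Event e) {
        switch (s) {
        case INVALID:
            if (e == CPU_READ)  return SHARED;     /* read miss: fetch block, become Shared      */
            if (e == CPU_WRITE) return MODIFIED;   /* write miss: send invalidate, become M      */
            return INVALID;
        case SHARED:
            if (e == CPU_WRITE)      return MODIFIED; /* write hit or miss: send invalidate      */
            if (e == BUS_INVALIDATE) return INVALID;  /* another processor is writing this block */
            return SHARED;                            /* reads stay Shared                       */
        case MODIFIED:
            if (e == BUS_READ_MISS)  return SHARED;   /* write back the block, then share it     */
            if (e == BUS_INVALIDATE) return INVALID;  /* another processor writes: give it up    */
            return MODIFIED;                          /* local reads and writes hit              */
        }
        return s;
    }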

.26 Write-Invalidate CC Examples
- I = invalid (many copies allowed), S = shared (many copies allowed), M = modified (only one copy)
(Figure: four two-processor examples, each showing Proc 1 and Proc 2 caches holding block A plus main memory; they are worked through on the next slide)

.27 Write-Invalidate CC Examples
- I = invalid (many copies allowed), S = shared (many copies allowed), M = modified (only one copy)
- Example 1 – Proc 1 holds A in state S, Proc 2 holds A in state I:
  1. Proc 2 has a read miss for A
  2. Proc 2 puts a read request for A on the bus
  3. Proc 1's snoop sees the read request for A and lets main memory supply A
  4. Proc 2 gets A from main memory and changes its state to S
- Example 2 – Proc 1 holds A in state S, Proc 2 holds A in state I:
  1. Proc 2 has a write miss for A
  2. Proc 2 writes A and changes its state to M
  3. Proc 2 sends an invalidate for A
  4. Proc 1 changes its state for A to I
- Example 3 – Proc 1 holds A in state M, Proc 2 holds A in state I:
  1. Proc 2 has a read miss for A
  2. Proc 2 puts a read request for A on the bus
  3. Proc 1's snoop sees the read request for A and writes back A to main memory
  4. Proc 2 gets A from main memory and changes its state to S
  5. Proc 1 changes its state for A to S
- Example 4 – Proc 1 holds A in state M, Proc 2 holds A in state I:
  1. Proc 2 has a write miss for A
  2. Proc 2 writes A and changes its state to M
  3. Proc 2 sends an invalidate for A
  4. Proc 1 changes its state for A to I

.28 Block Size Effects
- Writes to one word in a multi-word block mean
  - either the full block is invalidated (write-invalidate)
  - or the full block is exchanged between processors (write-update)
    - alternatively, only the written word could be broadcast
- Multi-word blocks can also result in false sharing: two processors writing to two different variables that happen to lie in the same cache block
  - with write-invalidate, false sharing increases cache miss rates
- Compilers can help reduce false sharing by allocating highly correlated data to the same cache block
(Figure: Proc1 writes variable A and Proc2 writes variable B, both in the same 4-word cache block)
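
A generic sketch of false sharing in C (not from the deck): two threads update different variables that share a cache block, so each write invalidates the other processor's copy. The padded variant shows the usual fix; the 64-byte size is an assumption, since the real block size is machine-specific:

    #include <pthread.h>

    /* Two independent counters that land in the same cache block. */
    struct {
        long a;   /* written only by thread 1 */
        long b;   /* written only by thread 2, but in the same block as a */
    } shared;

    /* Common fix: pad so each counter sits in its own block (64 bytes assumed). */
    struct {
        long a;
        char pad[64 - sizeof(long)];
        long b;
    } padded;   /* declared for illustration; swap it in for `shared` to remove the effect */

    static void *bump_a(void *arg) {
        (void)arg;
        for (long i = 0; i < 10000000; i++) shared.a++;
        return NULL;
    }

    static void *bump_b(void *arg) {
        (void)arg;
        for (long i = 0; i < 10000000; i++) shared.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

With the two counters in one block, the block ping-pongs between the two caches even though no data is logically shared; padding them into separate blocks removes the coherence misses.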

.29 Other Coherence Protocols
- There are many variations on cache coherence protocols
- Another write-invalidate protocol used in many modern multicores is MESI, with four states:
  - Modified – same as before
  - Exclusive – only one cached copy of the data is allowed; memory has an up-to-date copy
    - since there is only one copy of the block, write hits don't need to send an invalidate signal
  - Shared – multiple copies of the data may be cached (i.e., the data is permitted to be cached by more than one processor); memory has an up-to-date copy
  - Invalid – same as before

.30 MESI Cache Coherency Protocol
(Figure: MESI state machine with states Invalid (not a valid block), Shared (clean), Exclusive (clean), and Modified (dirty). A processor shared read miss takes a block to Shared; a processor exclusive read miss takes it to Exclusive; a processor write takes it to Modified, sending an invalidate signal when other copies may exist. An invalidate for this block takes it to Invalid; a read/write miss by another processor for a Modified block forces a write-back.)

.31 Summary
- Key questions
  - Q1 – How do processors share data?
  - Q2 – How do processors coordinate their activity?
  - Q3 – How scalable is the architecture (what is the maximum number of processors)?
- Bus connected (shared address UMA (SMP)) multi's
  - cache coherency hardware to ensure data consistency
  - synchronization primitives for process synchronization
  - scalability of bus connected UMAs is limited (< ~36 processors) because the three desirable bus characteristics are incompatible:
    - high bandwidth
    - low latency
    - long length
- Network connected NUMAs are more scalable