
MultiIntro.1 The Big Picture: Where Are We Now?
[Figure: two computers, each showing Processor (Control, Datapath), Memory, Input, Output]
• Multiprocessor – multiple processors with a single shared address space
• Cluster – multiple computers (each with their own address space) connected over a local area network (LAN) functioning as a single system

MultiIntro.2 Applications Needing “Supercomputing”
• Energy (plasma physics (simulating fusion reactions), geophysical (petroleum) exploration)
• DoE stockpile stewardship (to ensure the safety and reliability of the nation’s stockpile of nuclear weapons)
• Earth and climate (climate and weather prediction, earthquake, tsunami prediction and mitigation of risks)
• Transportation (improving vehicles’ airflow dynamics, fuel consumption, crashworthiness, noise reduction)
• Bioinformatics and computational biology (genomics, protein folding, designer drugs)
• Societal health and safety (pollution reduction, disaster planning, terrorist action detection)


MultiIntro.4 Encountering Amdahl’s Law
• Speedup due to enhancement E is
  \[ \text{Speedup w/ E} = \frac{\text{Exec time w/o E}}{\text{Exec time w/ E}} \]
• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected
  \[ \text{ExTime w/ E} = \text{ExTime w/o E} \times \left( (1-F) + \frac{F}{S} \right) \]
  \[ \text{Speedup w/ E} = \frac{1}{(1-F) + F/S} \]
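
The speedup formula follows from splitting the original execution time into the part the enhancement touches and the part it does not. A brief derivation, where F is the accelerated fraction, S is the factor by which it is accelerated, and T is shorthand introduced here (not in the slides) for ExTime w/o E:

\[
\begin{aligned}
\text{ExTime w/ E} &= (1-F)\,T + \frac{F}{S}\,T = T\left((1-F) + \frac{F}{S}\right) \\
\text{Speedup w/ E} &= \frac{T}{\text{ExTime w/ E}} = \frac{1}{(1-F) + F/S} \\
\lim_{S \to \infty} \text{Speedup w/ E} &= \frac{1}{1-F}
\end{aligned}
\]

The last line shows that the unaffected fraction (1-F) bounds the speedup no matter how large S becomes.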


MultiIntro.6 Examples: Amdahl’s Law
  \[ \text{Speedup w/ E} = \frac{1}{(1-F) + F/S} \]
• Consider an enhancement that runs 20 times faster but is usable only 25% of the time.
  \[ \text{Speedup w/ E} = \frac{1}{0.75 + 0.25/20} = 1.31 \]
• What if it’s usable only 15% of the time?
  \[ \text{Speedup w/ E} = \frac{1}{0.85 + 0.15/20} = 1.17 \]
• Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less
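
The arithmetic in these examples can be checked with a few lines of Python. This is a minimal sketch; the function names `amdahl_speedup` and `max_scalar_fraction` are illustrative and do not come from the slides.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the work is accelerated by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


def max_scalar_fraction(target: float, processors: int) -> float:
    """Largest unaccelerated (scalar) fraction that still reaches `target` speedup.

    Solves target = 1 / ((1 - f) + f / processors) for f and returns 1 - f.
    """
    f = (1.0 - 1.0 / target) / (1.0 - 1.0 / processors)
    return 1.0 - f


if __name__ == "__main__":
    print(f"{amdahl_speedup(0.25, 20):.2f}")          # 1.31
    print(f"{amdahl_speedup(0.15, 20):.2f}")          # 1.17
    print(f"{max_scalar_fraction(99, 100):.4%}")      # 0.0102%
```

The last call confirms the slide's claim: a speedup of 99 on 100 processors leaves room for only about 0.01% scalar code.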

MultiIntro.7 Supercomputer Style Migration (Top500)
• In the last 8 years uniprocessors and SIMDs disappeared while clusters and constellations grew from 3% to 80% (Nov data)
• Cluster – whole computers interconnected using their I/O bus
• Constellation – a cluster that uses an SMP multiprocessor as the building block

MultiIntro.8 Multiprocessor/Clusters Key Questions
• Q1 – How do they share data?
• Q2 – How do they coordinate?
• Q3 – How scalable is the architecture? How many processors can be supported?

MultiIntro.9 Flynn’s Classification Scheme
• SISD – single instruction stream, single data stream
  - aka uniprocessor: what we have been talking about all semester
• SIMD – single instruction stream, multiple data streams
  - single control unit broadcasting operations to multiple datapaths
• MISD – multiple instruction streams, single data stream
  - no such machine (although some people put vector machines in this category)
• MIMD – multiple instruction streams, multiple data streams
  - aka multiprocessors (SMPs, MPPs, clusters, NOWs)

MultiIntro.10 SIMD Processors
• Single control unit
• Multiple datapaths (processing elements – PEs) running in parallel
  - Q1 – PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
  - Q2 – Each PE performs the same operation on its own local data
[Figure: control unit and PEs]
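
The SIMD organization described here (one control unit, many PEs applying the same operation to their own local data) can be modeled in a few lines of Python. This is only a sketch: the class names `ProcessingElement` and `ControlUnit` are hypothetical, and the sequential loop merely stands in for PEs that would run in lockstep in hardware.

```python
from typing import Callable, List


class ProcessingElement:
    """One datapath with its own local data."""

    def __init__(self, local_data: List[int]) -> None:
        self.local_data = local_data

    def execute(self, op: Callable[[int], int]) -> None:
        # Apply the broadcast operation to this PE's local data only.
        self.local_data = [op(x) for x in self.local_data]


class ControlUnit:
    """Single control unit that broadcasts one operation to every PE."""

    def __init__(self, pes: List[ProcessingElement]) -> None:
        self.pes = pes

    def broadcast(self, op: Callable[[int], int]) -> None:
        # Hardware PEs would all execute at once; this loop only models that.
        for pe in self.pes:
            pe.execute(op)


if __name__ == "__main__":
    pes = [ProcessingElement([1, 2]), ProcessingElement([3, 4])]
    ControlUnit(pes).broadcast(lambda x: x * 2)
    print([pe.local_data for pe in pes])  # [[2, 4], [6, 8]]
```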

MultiIntro.11 Example SIMD Machines
           Maker               Year   # PEs   # b/PE   Max memory (MB)   PE clock (MHz)   System BW (MB/s)
Illiac IV  UIUC                …      …       …        …                 …                …,560
DAP        ICL                 1980   4,…     …        …                 …                …,560
MPP        Goodyear            1982   16,…    …        …                 …                …,480
CM-2       Thinking Machines   …      …,…     …        …                 …                …,384
MP-1216    MasPar              1989   16,…    …        …                 …                …,000

MultiIntro.12 Multiprocessor Basic Organizations
• Processors connected by a single bus
• Processors connected by a network
                                               # of Proc
Communication model   Message passing          8 to 2048
                      Shared address, NUMA     8 to 256
                      Shared address, UMA      2 to 64
Physical connection   Network                  8 to 256
                      Bus                      2 to 36

MultiIntro.13 Shared Address (Shared Memory) Multi’s
• UMAs (uniform memory access) – aka SMP (symmetric multiprocessors)
  - all accesses to main memory take the same amount of time no matter which processor makes the request or which location is requested
• NUMAs (nonuniform memory access)
  - some main memory accesses are faster than others depending on the processor making the request and which location is requested
  - can scale to larger sizes than UMAs so are potentially higher performance
• Q1 – Single address space shared by all the processors
• Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - Use of shared data must be coordinated via synchronization primitives (locks)
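
Coordinating access to a shared variable with a lock can be illustrated with Python threads. This is a minimal sketch under stated assumptions: threads stand in for processors, a dictionary entry stands in for a shared memory location, and the name `shared_counter_demo` is made up for this example.

```python
import threading


def shared_counter_demo(num_threads: int = 4, increments: int = 10_000) -> int:
    """Have several threads increment one shared counter under a lock."""
    counter = {"value": 0}     # shared variable, visible to every thread
    lock = threading.Lock()    # synchronization primitive guarding it

    def worker() -> None:
        for _ in range(increments):
            with lock:         # acquire before the read-modify-write, release after
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


if __name__ == "__main__":
    print(shared_counter_demo())  # 40000
```

Without the lock, two threads could read the same old value and one update would be lost, which is the hazard the synchronization primitive exists to prevent.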

MultiIntro.14 N/UMA Remote Memory Access Times (RMAT)
                        Year   Type   Max Proc   Interconnection Network       RMAT (ns)
Sun Starfire            1996   SMP    64         Address buses, data switch    500
Cray T3E                1996   NUMA   2048       2-way 3D torus                300
HP V                    1998   SMP    32         8 x 8 crossbar                1000
SGI Origin              …      NUMA   512        Fat tree                      500
Compaq AlphaServer GS   1999   SMP    32         Switched bus                  400
Sun V                   …      SMP    8          Switched bus                  240
HP Superdome            …      SMP    64         Switched bus                  275
NASA Columbia           2004   NUMA   10240      Fat tree                      ???

MultiIntro.15 Single Bus (Shared Address UMA) Multi’s
• Caches are used to reduce latency and to lower bus traffic
• Must provide hardware to ensure that caches and memory are consistent (cache coherency)
• Must provide a hardware mechanism to support process synchronization
[Figure: processors, each with a cache, connected by a single bus to memory and I/O]

MultiIntro.16 Summing 100,000 Numbers on 100 Processors
• Processors start by running a loop that sums their subset of vector A numbers (vectors A and sum are shared variables, Pn is the processor’s number, i is a private variable)
    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];
• The processors then coordinate in adding together the partial sums (half is a private variable initialized to 100, the number of processors)
    repeat
      synch();                            /* synchronize first */
      if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];    /* odd count: P0 adds the unpaired partial sum */
      half = half/2;
      if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);                    /* final sum in sum[0] */
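
The two phases on this slide can be tried out with Python threads. This is a sketch under stated assumptions: threads stand in for processors, `threading.Barrier` plays the role of `synch()`, and the function name `parallel_sum` is made up for this example. A `while half > 1` loop replaces repeat/until so the single-processor case does not run the reduction body at all.

```python
import threading
from typing import List


def parallel_sum(a: List[int], num_procs: int) -> int:
    """Sum `a` with num_procs threads: local partial sums, then a tree reduction."""
    if num_procs < 1 or len(a) % num_procs != 0:
        raise ValueError("len(a) must be a multiple of a positive num_procs")
    chunk = len(a) // num_procs
    sums = [0] * num_procs                   # shared array of partial sums
    barrier = threading.Barrier(num_procs)   # stands in for synch()

    def processor(pn: int) -> None:
        # Phase 1: each processor sums its own slice of the shared vector.
        for i in range(chunk * pn, chunk * (pn + 1)):
            sums[pn] += a[i]
        # Phase 2: pairwise reduction; `half` is private to each processor.
        half = num_procs
        while half > 1:
            barrier.wait()                   # all earlier writes are complete
            if half % 2 != 0 and pn == 0:
                sums[0] += sums[half - 1]    # odd count: P0 adds the unpaired sum
            half //= 2
            if pn < half:
                sums[pn] += sums[pn + half]

    threads = [threading.Thread(target=processor, args=(pn,)) for pn in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sums[0]                           # final sum ends up in sums[0]


if __name__ == "__main__":
    print(parallel_sum(list(range(100)), 10))  # 4950
```

Within one round the elements being written (indices below the new `half`) never overlap the elements being read, so a single barrier per round is enough.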


MultiIntro.18 An Example with 10 Processors
[Figure: reduction tree over sum[P0] … sum[P9], one level per value of half]
  half = 10: P0 P1 P2 P3 P4 P5 P6 P7 P8 P9
  half = 5:  P0 P1 P2 P3 P4
  half = 2:  P0 P1
  half = 1:  P0

MultiIntro.19 Message Passing Multiprocessors
• Each processor has its own private address space
• Q1 – Processors share data by explicitly sending and receiving information (messages)
• Q2 – Coordination is built into message passing primitives (send and receive)
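
Sharing data only through explicit send and receive can be sketched with a queue in Python. The assumptions are significant: Python threads actually share one address space, so each worker is handed a private copy of its data to imitate a private address space, and `queue.Queue` stands in for the interconnect. The name `message_passing_sum` is hypothetical.

```python
import queue
import threading
from typing import List


def message_passing_sum(chunks: List[List[int]]) -> int:
    """Each worker sums its private chunk and sends the result as a message."""
    mailbox: "queue.Queue[int]" = queue.Queue()    # channel to the coordinator

    def worker(private_data: List[int]) -> None:
        mailbox.put(sum(private_data))             # send

    # list(c) gives every worker its own copy, imitating a private address space.
    threads = [threading.Thread(target=worker, args=(list(c),)) for c in chunks]
    for t in threads:
        t.start()

    total = 0
    for _ in chunks:
        total += mailbox.get()                     # receive: blocks until a message arrives
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(message_passing_sum([[1, 2], [3, 4], [5]]))  # 15
```

No lock or barrier appears here: the blocking receive is what coordinates the coordinator with the workers, which is the point of Q2 on the slide.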

MultiIntro.20 Summary
• Flynn’s classification of processors – SISD, SIMD, MIMD
  - Q1 – How do processors share data?
  - Q2 – How do processors coordinate their activity?
  - Q3 – How scalable is the architecture (what is the maximum number of processors)?
• Shared address multis – UMAs and NUMAs
• Bus connected (shared address UMAs) multis
  - Cache coherency hardware to ensure data consistency
  - Synchronization primitives for synchronization
  - Bus traffic limits scalability of architecture (< ~36 processors)
• Message passing multis