MBG 1 CIS501, Fall 99 Lecture 23: Intro to Multi-processors
Michael B. Greenwald
Computer Architecture, CIS 501, Fall 1999

MBG 2 CIS501, Fall 99 Administrative stuff
Final exam will be in room Moore 216, 8:30-10:30am on Thursday, December 16th.
HW #6 delayed until Thursday, Dec. 9th.
Project extension: no penalty if I get it by the time I show up tomorrow morning (Friday, 9am-ish).
Final: open book? Vote.
Penn CISter’s women’s luncheon on Wednesday, December 8th, 12:30-2:30
–Polar Bear Lounge (129 Pender)
–Hosted by Professors Martha Palmer & Susan Davidson
–Questions?

MBG 3 CIS501, Fall 99 Why multiprocessors?
Exploit parallelism (duplicate every resource, so no structural hazards).
Increase availability (a single processor may fail but the system remains robust).
Simplify parallelization.
Goal: increase performance by a factor of N if there are N processors. Pay more money, increase speedup!
–Rarely achievable.

MBG 4 CIS501, Fall 99 Barriers to factor-of-N speedup
Not all resources are duplicated (structural hazards)
–High cost or low utilization
–Need to maintain identity, or used for sharing information
Data dependencies:
–A depends upon the result of B: true dependencies
–Name dependencies: false sharing
Synchronization
–x := 25; x := x+1; x := x+1 => individual reads and writes interleave (see the sketch below)
–Timing, barriers
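
A minimal sketch of the synchronization bullet above, assuming POSIX threads (the lecture does not prescribe a mechanism): each x := x+1 is really a separate load, add, and store, so two processors can interleave them and lose an update unless they coordinate.

    #include <pthread.h>
    #include <stdio.h>

    static int x = 25;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *inc(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);    /* remove the lock/unlock pair to allow lost updates */
        x = x + 1;                    /* really: load x; add 1; store x */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, inc, NULL);
        pthread_create(&b, NULL, inc, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("x = %d\n", x);        /* always 27 with the lock; 26 or 27 without it */
        return 0;
    }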

MBG 5 CIS501, Fall 99 Impact of barriers: lack of duplication/structural hazards
Well understood in CIS501:
–Stalls
–Bottlenecks (e.g. a shared bus)
–Cost of arbitration

MBG 6 CIS501, Fall 99 Impact of barriers: data dependencies
Increased memory costs
–Cache misses as data moves from cache 1 to cache 2
Processor A stalls waiting for B to finish (lack of parallelism)
Communication costs between subtasks
–Stalls waiting for data to be transmitted
–Increased memory costs (more misses)
False sharing
–Example: 2 objects in 1 cache line (see the sketch below)
–Increases memory costs
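
A small sketch of the false-sharing example, assuming a 64-byte cache line (the line size is an assumption, not something the slide specifies): two counters that are each written by only one processor but happen to share a line ping-pong between the two caches, while padding them onto separate lines removes those coherence misses.

    #define CACHE_LINE 64                        /* assumed cache-line size */

    struct counters_bad {                        /* a and b share one cache line            */
        long a;                                  /* written only by processor A             */
        long b;                                  /* written only by processor B, yet every  */
    };                                           /* write invalidates the other's copy      */

    struct counters_good {                       /* one counter per cache line              */
        long a;
        char pad[CACHE_LINE - sizeof(long)];     /* forces b onto the next line             */
        long b;
    };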

MBG 7 CIS501, Fall 99 Impact of barriers: synchronization
Hotspot/bottleneck (leads to data dependencies on the lock)
Increased communication
Lack of parallelism (mutual exclusion)

MBG 8 CIS501, Fall 99 Structure of Multiprocessors
A multiprocessor has N processors, with some manner of shared memory or communication.
In what sense do they “run the same program”? (How do they process instructions/data?)
Memory hierarchy: how is the memory organized?
Memory/communication interface: how is state shared?

MBG 9 CIS501, Fall 99 Popular Flynn Categories
SISD (Single Instruction, Single Data)
–Uniprocessors
MISD (Multiple Instruction, Single Data)
–??? (Image processing? Cellular automata?)
SIMD (Single Instruction, Multiple Data)
–Examples: Illiac-IV, CM-2 (early multiprocessors, special purpose)
»Simple programming model
»Low overhead
»Flexibility
»All custom integrated circuits
MIMD (Multiple Instruction, Multiple Data)
–Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
»Flexible
»Economy of scale (each processor is the same as a commodity off-the-shelf uniprocessor)
»Independent tasks can operate independently

MBG 10 CIS501, Fall 99 Memory Organization
Centralized shared-memory architecture, also known as UMA (Uniform Memory Access):
–Shared bus (low latency, high throughput)
–Shared physical memory (shared L3 cache?)
–Shared I/O system
–Separate L1 (and L2?) caches
Distributed-memory architecture; NUMA, “cluster”:
–Independent I/O, memory, and caches per processor
–Scales memory bandwidth and I/O bandwidth; fast access to local memory
–Large spectrum of interconnection networks (each node may itself be a UMA multiprocessor)

MBG 11 CIS501, Fall 99 Memory Architecture, Communication Models
Distributed shared memory vs. message passing
DSM
–Load/Store
–Addressing:
»One physical address space
»One virtual address space
Message passing
–Synchronous (RPC)
–Asynchronous (pure message passing)
»(Null RPC makes this distinction less important.)

MBG 12 CIS501, Fall 99 Communication Models
Shared memory
–Processors communicate through a shared address space
–Easy on small-scale machines
–Advantages:
»Model of choice for uniprocessors and small-scale MPs
»Ease of programming
»Lower latency
»Easier to use hardware-controlled caching
Message passing
–Processors have private memories and communicate via messages
–Advantages:
»Less hardware, easier to design
»Focuses attention on costly non-local operations
Either SW model can be supported on either HW base (see the contrast sketch below)
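
As a contrast sketch of the two models, here is processor 0 combining partial results either through a shared array or through explicit messages. send_msg/recv_msg are hypothetical stand-ins for whatever primitives the machine provides, not a real library API, and the synchronization telling processor 0 that the partials are ready is omitted (that is Fundamental Issue #2 below).

    #define NPROC 4

    /* Shared memory: results are communicated implicitly through a shared array. */
    double partial[NPROC];                    /* one slot written by each processor */

    double combine_shared(void)               /* runs on processor 0 */
    {
        double total = 0.0;
        for (int p = 0; p < NPROC; p++)
            total += partial[p];              /* ordinary loads of shared addresses */
        return total;
    }

    /* Message passing: private memories; data moves only via explicit messages. */
    void send_msg(int dest, const void *buf, int len);   /* hypothetical primitive */
    void recv_msg(int src, void *buf, int len);          /* hypothetical primitive */

    double combine_messages(void)             /* runs on processor 0 */
    {
        double total = 0.0, v;
        for (int p = 1; p < NPROC; p++) {
            recv_msg(p, &v, (int)sizeof v);   /* each worker p sent its partial sum */
            total += v;
        }
        return total;
    }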

MBG 13 CIS501, Fall 99 Parallel Applications: What programs can usefully use a multiprocessor?
What applications can we make parallel?
Need independent computations.
SPLASH benchmark suite (Stanford parallel applications for shared memory)

MBG 14 CIS501, Fall 99 Structure of parallel programs (Amdahl’s Law): never faster than setup + cleanup
[Figure: a serial execution (Setup, Loop body 1 ... Loop body n, Cleanup) shown next to a parallel execution in which the loop bodies run concurrently across processors while Setup and Cleanup remain serial.]

MBG 15 CIS501, Fall 99 Structure of parallel programs (Amdahl’s Law): never faster than setup + cleanup
[Figure: the same serial vs. parallel diagram as the previous slide, annotated “Too Simple!”]
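
A small worked sketch of the bound these two slides state (the 10% serial fraction is an illustrative assumption): if setup and cleanup stay serial, Amdahl’s Law caps speedup at 1/serial_fraction no matter how many processors are added. Even this is optimistic; the next slides add the communication costs the picture ignores.

    #include <stdio.h>

    /* Amdahl's Law with T(1) normalized to 1:
       T(P) = serial_frac + (1 - serial_frac)/P, so speedup = 1/T(P). */
    static double speedup(double serial_frac, int p)
    {
        return 1.0 / (serial_frac + (1.0 - serial_frac) / p);
    }

    int main(void)
    {
        int procs[] = { 1, 2, 8, 64, 1024 };
        for (int i = 0; i < 5; i++)                    /* with serial_frac = 0.10,  */
            printf("P = %4d  speedup = %5.2f\n",       /* speedup never exceeds 10x */
                   procs[i], speedup(0.10, procs[i]));
        return 0;
    }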

MBG 16 CIS501, Fall 99 Effect of parallelization
If you divide a program block that takes time T(n) into P blocks, will each block take T(n)/P?
The simple answer is “yes”, but...
The reality is no: data dependencies mean results must be communicated from one sub-computation to another
–Must spend time transmitting the data (throughput)
–Must wait for the data to arrive (latency)

MBG 17 CIS501, Fall 99 Effect of parallelization
If you divide a program block that takes time T(n) into P blocks, will each block take T(n)/P?
Computation cost scales as 1/P.
Communication cost scales in an algorithm-specific way.
Example: particle simulation (see the sketch below)
–2-D grid; communication cost is O(1/sqrt(P)) per processor, so aggregate communication cost increases as we add processors and the performance increase is sublinear.
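
A sketch of the particle-simulation arithmetic (grid size and per-item costs are illustrative assumptions): each processor owns a square sub-grid, so its computation is proportional to the sub-grid's area (n*n/P) while its communication is proportional to the perimeter (about 4n/sqrt(P)). The area term shrinks like 1/P but the perimeter term only like 1/sqrt(P), so communication's share of the per-processor time grows with P.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double n      = 1024.0;   /* grid edge length (assumed)                     */
        double t_cell = 1.0;      /* assumed cost to update one cell                */
        double t_edge = 50.0;     /* assumed cost to communicate one boundary value */

        for (int p = 1; p <= 1024; p *= 4) {
            double comp = n * n * t_cell / p;                          /* area term      */
            double comm = (p == 1) ? 0.0
                                   : 4.0 * n * t_edge / sqrt((double)p); /* perimeter term */
            printf("P = %4d  time/proc = %9.1f  comm share = %4.1f%%\n",
                   p, comp + comm,
                   (comp + comm) > 0.0 ? 100.0 * comm / (comp + comm) : 0.0);
        }
        return 0;
    }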

MBG 18 CIS501, Fall 99 Effect of parallelization (continued)
Inter-processor communication is expensive:
–Inter-processor communication costs (the computation/communication ratio is only a 1st-order effect)
–Memory costs (locality)
–Redundant computation
Trade off computation for communication.
Change the memory layout (more cache misses on a uniprocessor, but fewer on a multiprocessor).

MBG 19 CIS501, Fall 99 Fundamental Issues
4 issues characterize parallel machines/systems:
1) Naming
2) Synchronization
3) Latency and Bandwidth
4) Consistency

MBG 20 CIS501, Fall 99 Fundamental Issue #1: Naming
Naming: how to solve a large problem fast
–What data is shared
–How it is addressed
–What operations can access the data
–How processes refer to each other
The choice of naming affects the code produced by a compiler: a plain load that just remembers an address, vs. keeping track of a processor number and a local virtual address for message passing.
The choice of naming affects replication of data: via loads in a cached memory hierarchy, or via SW replication and consistency.

MBG 21 CIS501, Fall 99 Fundamental Issue #1: Naming
Global physical address space: any processor can generate an address and access it in a single operation
–Memory can be anywhere: virtual address translation handles it
Global virtual address space: the address space of each process can be configured to contain all shared data of the parallel program
Segmented shared address space: locations are named uniformly for all processes of the parallel program

MBG 22 CIS501, Fall 99 Fundamental Issue #2: Synchronization
To cooperate, processes must coordinate.
Message passing is implicit coordination with the transmission or arrival of data.
Shared address space => additional operations to coordinate explicitly: e.g., write a flag, awaken a thread, interrupt a processor, use an atomic operation (see the sketch below)
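
A sketch of the “write a flag” case, assuming C11 atomics are available (the slide names the operations, not this mechanism): the release/acquire pair makes the data visible before the flag, an ordering plain loads and stores alone do not guarantee. This is also where Fundamental Issue #4 (consistency) shows up.

    #include <stdatomic.h>

    int data;                         /* the shared result                  */
    atomic_int flag = 0;              /* 0 = not ready, 1 = ready           */

    void producer(void)
    {
        data = 42;
        atomic_store_explicit(&flag, 1, memory_order_release);  /* publish  */
    }

    int consumer(void)
    {
        while (!atomic_load_explicit(&flag, memory_order_acquire))
            ;                         /* spin until the producer flags us   */
        return data;                  /* guaranteed to see data = 42        */
    }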

MBG 23 CIS501, Fall 99 Fundamental Issue #3: Latency and Bandwidth
Bandwidth
–Need high bandwidth in communication
–Cannot scale, but stay close
–Match limits in network, memory, and processor
–The overhead to communicate is a problem in many machines
Latency
–Affects performance, since the processor may have to wait
–Affects ease of programming, since it requires more thought to overlap communication and computation
Latency hiding
–How can a mechanism help hide latency?
–Examples: overlap a message send with computation, prefetch data, switch to other tasks (see the sketch below)
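
One concrete way to overlap a message send with computation, using MPI's non-blocking send purely as an illustration (the slide names the technique, not this library): start the send, do independent work while the message is in flight, and only then wait.

    #include <mpi.h>

    void exchange_and_compute(double *boundary, int nb, int neighbor,
                              double *interior, int ni)
    {
        MPI_Request req;

        MPI_Isend(boundary, nb, MPI_DOUBLE, neighbor, /*tag=*/0,
                  MPI_COMM_WORLD, &req);           /* non-blocking: returns at once  */

        for (int i = 0; i < ni; i++)               /* overlap: work that does not    */
            interior[i] *= 0.5;                    /* depend on the message          */

        MPI_Wait(&req, MPI_STATUS_IGNORE);         /* pay only the remaining latency */
    }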

MBG 24 CIS501, Fall 99 SMP Interconnect
Connects processors to memory AND to I/O.
Bus based: all memory locations have equal access time, so SMP = “Symmetric MP”
–Sharing limits BW as we add processors and I/O
–(see Chapter 1, Figs 1-18/19 of [CSG96])
Crossbar: expensive to expand.
Multistage network: less expensive to expand than a crossbar, with more BW than a bus.
“Dance Hall” designs: all processors on the left, all memories on the right.