740: Computer Architecture Programming Models and Architectures Prof. Onur Mutlu Carnegie Mellon University.

Slides:



Advertisements
Similar presentations
Multiple Processor Systems
Advertisements

1 Uniform memory access (UMA) Each processor has uniform access time to memory - also known as symmetric multiprocessors (SMPs) (example: SUN ES1000) Non-uniform.
CSE431 Chapter 7A.1Irwin, PSU, 2008 CSE 431 Computer Architecture Fall 2008 Chapter 7A: Intro to Multiprocessor Systems Mary Jane Irwin (
SE-292 High Performance Computing
ECE669 L3: Design Issues February 5, 2004 ECE 669 Parallel Computer Architecture Lecture 3 Design Issues.
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
CS 213: Parallel Processing Architectures Laxmi Narayan Bhuyan Lecture3.
1 Parallel Scientific Computing: Algorithms and Tools Lecture #3 APMA 2821A, Spring 2008 Instructors: George Em Karniadakis Leopold Grinberg.
Multiple Processor Systems
CSCI-455/522 Introduction to High Performance Computing Lecture 2.
CSCI 8150 Advanced Computer Architecture Hwang, Chapter 1 Parallel Computer Models 1.2 Multiprocessors and Multicomputers.
Multiprocessors CSE 4711 Multiprocessors - Flynn’s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) –Conventional uniprocessor –Although.
CS 284a, 7 October 97Copyright (c) , John Thornley1 CS 284a Lecture Tuesday, 7 October 1997.
ECE669 L2: Architectural Perspective February 3, 2004 ECE 669 Parallel Computer Architecture Lecture 2 Architectural Perspective.

1 Lecture 1: Parallel Architecture Intro Course organization:  ~5 lectures based on Culler-Singh textbook  ~5 lectures based on Larus-Rajwar textbook.
Multiprocessors CSE 471 Aut 011 Multiprocessors - Flynn’s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) –Conventional uniprocessor.
Parallel Processing Architectures Laxmi Narayan Bhuyan
Arquitectura de Sistemas Paralelos e Distribuídos Paulo Marques Dep. Eng. Informática – Universidade de Coimbra Ago/ Machine.
1 CSE SUNY New Paltz Chapter Nine Multiprocessors.
Parallel Computer Architectures
1 Computer Science, University of Warwick Architecture Classifications A taxonomy of parallel architectures: in 1972, Flynn categorised HPC architectures.
Computer architecture II
Lecture 37: Chapter 7: Multiprocessors Today’s topic –Introduction to multiprocessors –Parallelism in software –Memory organization –Cache coherence 1.
Parallel Architectures
Computer System Architectures Computer System Software
18-447: Computer Architecture Lecture 30B: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013.
Multiple Processor Systems. Multiprocessor Systems Continuous need for faster and powerful computers –shared memory model ( access nsec) –message passing.
1 Introduction to Parallel Computing. 2 Multiprocessor Architectures Message-Passing Architectures –Separate address space for each processor. –Processors.
1 Outline for Intro to Multiprocessing Motivation & Applications Theory, History, & A Generic Parallel Machine Programming Models –Shared Memory, Message.
August 15, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 12: Multiprocessors: Non-Uniform Memory Access * Jeremy R. Johnson.
Chapter 6 Multiprocessor System. Introduction  Each processor in a multiprocessor system can be executing a different instruction at any time.  The.
Case Study in Computational Science & Engineering - Lecture 2 1 Parallel Architecture Models Shared Memory –Dual/Quad Pentium, Cray T90, IBM Power3 Node.
Multiple Processor Systems. Multiprocessor Systems Continuous need for faster computers –shared memory model ( access nsec) –message passing multiprocessor.
Memory/Storage Architecture Lab Computer Architecture Multiprocessors.
Spring 2003CSE P5481 Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing.
MODERN OPERATING SYSTEMS Third Edition ANDREW S. TANENBAUM Chapter 8 Multiple Processor Systems Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall,
Multiprocessor Systems. 2 Old CW: Power is free, Transistors expensive New CW: “Power wall” Power expensive, Xtors free (Can put more on chip than can.
Outline Why this subject? What is High Performance Computing?
Lecture 3: Computer Architectures
1 Lecture 1: Parallel Architecture Intro Course organization:  ~18 parallel architecture lectures (based on text)  ~10 (recent) paper presentations 
Computer Organization CS224 Fall 2012 Lesson 52. Introduction  Goal: connecting multiple computers to get higher performance l Multiprocessors l Scalability,
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 1.
3/12/2013Computer Engg, IIT(BHU)1 INTRODUCTION-2.
Multiprocessor So far, we have spoken at length microprocessors. We will now study the multiprocessor, how they work, what are the specific problems that.
Diversity and Convergence of Parallel Architectures Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals.
CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.
ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Introduction Copyright 2004 Daniel J. Sorin Duke University Slides.
Background Computer System Architectures Computer System Software.
Introduction Goal: connecting multiple computers to get higher performance – Multiprocessors – Scalability, availability, power efficiency Job-level (process-level)
Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015.
740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University.
Lecture 13 Parallel Processing. 2 What is Parallel Computing? Traditionally software has been written for serial computation. Parallel computing is the.
Spring 2011 Parallel Computer Architecture Lecture 3: Programming Models and Architectures Prof. Onur Mutlu Carnegie Mellon University © 2010 Onur.
COMP7330/7336 Advanced Parallel and Distributed Computing Data Parallelism Dr. Xiao Qin Auburn University
Group Members Hamza Zahid (131391) Fahad Nadeem khan Abdual Hannan AIR UNIVERSITY MULTAN CAMPUS.
18-447: Computer Architecture Lecture 30B: Multiprocessors
Prof. Onur Mutlu Carnegie Mellon University
CS5102 High Performance Computer Systems Thread-Level Parallelism
Parallel Computers Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
CMSC 611: Advanced Computer Architecture
Lecture 1: Parallel Architecture Intro
Convergence of Parallel Architectures
Multiple Processor Systems
CS 213: Parallel Processing Architectures
Parallel Processing Architectures
Distributed Systems CS
Chapter 4 Multiprocessors
Prof. Onur Mutlu Carnegie Mellon University 9/12/2012
Presentation transcript:

740: Computer Architecture Programming Models and Architectures Prof. Onur Mutlu Carnegie Mellon University

What Will We Cover in This Lecture? Hill, Jouppi, Sohi, “Multiprocessors and Multicomputers,” pp , in Readings in Computer Architecture. Culler, Singh, Gupta, Chapter 1 (Introduction) in “Parallel Computer Architecture: A Hardware/Software Approach.” 2

Programming Models vs. Architectures Five major models  (Sequential)  Shared memory  Message passing  Data parallel (SIMD)  Dataflow  Systolic Hybrid models? 3

Shared Memory vs. Message Passing Are these programming models or execution models supported by the hardware architecture? Does a multiprocessor that is programmed by “shared memory programming model” have to support a shared address space processors? Does a multiprocessor that is programmed by “message passing programming model” have to have no shared address space between processors? 4

Programming Models: Message Passing vs. Shared Memory Difference: how communication is achieved between tasks Message passing programming model  Explicit communication via messages  Loose coupling of program components  Analogy: telephone call or letter, no shared location accessible to all Shared memory programming model  Implicit communication via memory operations (load/store)  Tight coupling of program components  Analogy: bulletin board, post information at a shared space Suitability of the programming model depends on the problem to be solved. Issues affected by the model include:  Overhead, scalability, ease of programming, bugs, match to underlying hardware, … 5

Message Passing vs. Shared Memory Hardware Difference: how task communication is supported in hardware Shared memory hardware (or machine model)  All processors see a global shared address space Ability to access all memory from each processor  A write to a location is visible to the reads of other processors Message passing hardware (machine model)  No global shared address space  Send and receive variants are the only method of communication between processors (much like networks of workstations today, i.e. clusters) Suitability of the hardware depends on the problem to be solved as well as the programming model. 6

Message Passing vs. Shared Memory Hardware P M IO P M P M I/O (Network) Message Passing P M IO P M P M Memory Shared Memory P M IO P M P M Processor (Dataflow/Systolic), Single-Instruction Multiple-Data (SIMD) ==> Data Parallel Join At: Program With:

Programming Model vs. Hardware Most of parallel computing history, there was no separation between programming model and hardware  Message passing: Caltech Cosmic Cube, Intel Hypercube, Intel Paragon  Shared memory: CMU C.mmp, Sequent Balance, SGI Origin.  SIMD: ILLIAC IV, CM-1 However, any hardware can really support any programming model Why?  Application  compiler/library  OS services  hardware 8

Layers of Abstraction Compiler/library/OS map the communication abstraction at the programming model layer to the communication primitives available at the hardware layer 9

Programming Model vs. Architecture Machine  Programming Model  Join at network, so program with message passing model  Join at memory, so program with shared memory model  Join at processor, so program with SIMD or data parallel Programming Model  Machine  Message-passing programs on message-passing machine  Shared-memory programs on shared-memory machine  SIMD/data-parallel programs on SIMD/data-parallel machine Isn’t hardware basically the same?  Processors, memory, interconnect (I/O)  Why not have generic parallel machine and program with model that fits the problem? 10

A Generic Parallel Machine Separation of programming models from architectures All models require communication Node with processor(s), memory, communication assist Interconnect CA Mem P $ CA Mem P $ CA Mem P $ CA Mem P $ Node 0Node 1 Node 2 Node 3

Simple Problem for i = 1 to N A[i] = (A[i] + B[i]) * C[i] sum = sum + A[i] How do I make this parallel?

Simple Problem for i = 1 to N A[i] = (A[i] + B[i]) * C[i] sum = sum + A[i] Split the loops  Independent iterations for i = 1 to N A[i] = (A[i] + B[i]) * C[i] for i = 1 to N sum = sum + A[i] Data flow graph?

Data Flow Graph A[0] B[0] + C[0] * + A[1] B[1] + C[1] * A[2] B[2] + C[2] * + A[3] B[3] + C[3] * N-1 cycles to execute on N processors what assumptions?

Partitioning of Data Flow Graph A[0] B[0] + C[0] * + A[1] B[1] + C[1] * A[2] B[2] + C[2] * + A[3] B[3] + C[3] * + global synch

Shared (Physical) Memory Communication, sharing, and synchronization with store / load on shared variables Must map virtual pages to physical page frames Consider OS support for good mapping PnPn P0P0 load store Private Portion of Address Space Shared Portion of Address Space Common Physical Addresses Pn Private P0 Private P1 Private P2 Private Machine Physical Address Space

Shared (Physical) Memory on Generic MP Interconnect CA Mem P $ CA Mem P $ CA Mem P $ CA Mem P $ Node 0 0,N-1 (Addresses) Node 1 N,2N-1 Node 2 2N,3N-1 Node 3 3N,4N-1 Keep private data and frequently used shared data on same node as computation

Return of The Simple Problem private int i, my_start, my_end, mynode; shared float A[N], B[N], C[N], sum; for i = my_start to my_end A[i] = (A[i] + B[i]) * C[i] GLOBAL_SYNCH; if (mynode == 0) for i = 1 to N sum = sum + A[i] Can run this on any shared memory machine

Message Passing Architectures Cannot directly access memory on another node IBM SP-2, Intel Paragon Cluster of workstations Interconnect CA Mem P $ CA Mem P $ Node 0 0,N-1 Node 1 0,N-1 Node 2 0,N-1 Node 3 0,N-1 CA Mem P $ CA Mem P $

Message Passing Programming Model User level send/receive abstraction  local buffer (x,y), process (Q,P) and tag (t)  naming and synchronization Local Process Address Space address x address y match Process P Process Q Local Process Address Space Send x, Q, t Recv y, P, t

The Simple Problem Again int i, my_start, my_end, mynode; float A[N/P], B[N/P], C[N/P], sum; for i = 1 to N/P A[i] = (A[i] + B[i]) * C[i] sum = sum + A[i] if (mynode != 0) send (sum,0); if (mynode == 0) for i = 1 to P-1 recv(tmp,i) sum = sum + tmp Send/Recv communicates and synchronizes P processors

Separation of Architecture from Model At the lowest level shared memory model is all about sending and receiving messages  HW is specialized to expedite read/write messages using load and store instructions What programming model/abstraction is supported at user level? Can I have shared-memory abstraction on message passing HW? How efficient? Can I have message passing abstraction on shared memory HW? How efficient?

Challenges in Mixing and Matching Assume prog. model same as ABI (compiler/library  OS  hardware) Shared memory prog model on shared memory HW  How do you design a scalable runtime system/OS? Message passing prog model on message passing HW  How do you get good messaging performance? Shared memory prog model on message passing HW  How do you reduce the cost of messaging when there are frequent operations on shared data?  Li and Hudak, “Memory Coherence in Shared Virtual Memory Systems,” ACM TOCS Message passing prog model on shared memory HW  Convert send/receives to load/stores on shared buffers  How do you design scalable HW? 23

Data Parallel Programming Model Programming Model  Operations are performed on each element of a large (regular) data structure (array, vector, matrix)  Program is logically a single thread of control, carrying out a sequence of either sequential or parallel steps The Simple Problem Strikes Back A = (A + B) * C sum = global_sum (A) Language supports array assignment

Data Parallel Hardware Architectures (I) Early architectures directly mirrored programming model Single control processor (broadcast each instruction to an array/grid of processing elements)  Consolidates control Many processing elements controlled by the master Examples: Connection Machine, MPP  Batcher, “Architecture of a massively parallel processor,” ISCA K bit-serial processing elements  Tucker and Robertson, “Architecture and Applications of the Connection Machine,” IEEE Computer K bit-serial processing elements 25

Connection Machine 26

Data Parallel Hardware Architectures (II) Later data parallel architectures  Higher integration  SIMD units on chip along with caches  More generic  multiple cooperating multiprocessors with vector units  Specialized hardware support for global synchronization E.g. barrier synchronization Example: Connection Machine 5  Hillis and Tucker, “The CM-5 Connection Machine: a scalable supercomputer,” CACM  Consists of 32-bit SPARC processors  Supports Message Passing and Data Parallel models  Special control network for global synchronization 27

Review: Separation of Model and Architecture Shared Memory  Single shared address space  Communicate, synchronize using load / store  Can support message passing Message Passing  Send / Receive  Communication + synchronization  Can support shared memory Data Parallel  Lock-step execution on regular data structures  Often requires global operations (sum, max, min...)  Can be supported on either SM or MP

Review: A Generic Parallel Machine Separation of programming models from architectures All models require communication Node with processor(s), memory, communication assist Interconnect CA Mem P $ CA Mem P $ CA Mem P $ CA Mem P $ Node 0Node 1 Node 2 Node 3

Data Flow Programming Models and Architectures A program consists of data flow nodes A data flow node fires (fetched and executed) when all its inputs are ready  i.e. when all inputs have tokens No artificial constraints, like sequencing instructions How do we know when operands are ready?  Matching store for operands (remember OoO execution?)  large associative search! Later machines moved to coarser grained dataflow (threads + dataflow across threads)  allowed registers and cache for local computation  introduced messages (with operations and operands) 30

Scalability, Convergence, and Some Terminology 31

Scaling Shared Memory Architectures 32

Interconnection Schemes for Shared Memory Scalability dependent on interconnect 33

UMA/UCA: Uniform Memory or Cache Access All processors have the same uncontended latency to memory Latencies get worse as system grows Symmetric multiprocessing (SMP) ~ UMA with bus interconnect

Uniform Memory/Cache Access + Data placement unimportant/less important (easier to optimize code and make use of available memory space) - Scaling the system increases all latencies - Contention could restrict bandwidth and increase latency

Example SMP Quad-pack Intel Pentium Pro 36

How to Scale Shared Memory Machines? Two general approaches Maintain UMA  Provide a scalable interconnect to memory  Downside: Every memory access incurs the round-trip network latency Interconnect complete processors with local memory  NUMA (Non-uniform memory access) Local memory faster than remote memory  Still needs a scalable interconnect for accessing remote memory Not on the critical path of local memory access 37

NUMA/NUCA: NonUniform Memory/Cache Access Shared memory as local versus remote memory + Low latency to local memory - Much higher latency to remote memories + Bandwidth to local memory may be higher - Performance very sensitive to data placement

Example NUMA Machines (I) – CM5 CM-5 Hillis and Tucker, “The CM-5 Connection Machine: a scalable supercomputer,” CACM

Example NUMA Machines (I) – CM5 40

Example NUMA Machines (II) Sun Enterprise Server Cray T3E 41

Convergence of Parallel Architectures Scalable shared memory architecture is similar to scalable message passing architecture Main difference: is remote memory accessible with loads/stores? 42

Historical Evolution: 1960s & 70s Early MPs –Mainframes –Small number of processors –crossbar interconnect –UMA

Historical Evolution: 1980s Bus-Based MPs –enabler: processor-on-a-board –economical scaling –precursor of today’s SMPs –UMA

Historical Evolution: Late 80s, mid 90s Large Scale MPs (Massively Parallel Processors) –multi-dimensional interconnects –each node a computer (proc + cache + memory) –both shared memory and message passing versions –NUMA –still used for “supercomputing”

Historical Evolution: Current Chip multiprocessors (multi-core) Small to Mid-Scale multi-socket CMPs  One module type: processor + caches + memory Clusters/Datacenters  Use high performance LAN to connect SMP blades, racks Driven by economics and cost  Smaller systems => higher volumes  Off-the-shelf components Driven by applications  Many more throughput applications (web servers)  … than parallel applications (weather prediction)  Cloud computing

Historical Evolution: Future Cluster/datacenter on a chip? Heterogeneous multi-core? Bounce back to small-scale multi-core? ??? 47

740: Computer Architecture Programming Models and Architectures Prof. Onur Mutlu Carnegie Mellon University

Backup slides 49

Referenced Readings Hill, Jouppi, Sohi, “Multiprocessors and Multicomputers,” pp , in Readings in Computer Architecture. Culler, Singh, Gupta, Chapter 1 (Introduction) in “Parallel Computer Architecture: A Hardware/Software Approach.”  Li and Hudak, “Memory Coherence in Shared Virtual Memory Systems,” ACM TOCS  Hillis and Tucker, “The CM-5 Connection Machine: a scalable supercomputer,” CACM Batcher, “Architecture of a massively parallel processor,” ISCA Tucker and Robertson, “Architecture and Applications of the Connection Machine,” IEEE Computer Seitz, “The Cosmic Cube,” CACM

Related Videos Programming Models and Architectures (This Lecture)  mCYX8&list=PLSEZzvupP7hNjq3Tuv2hiE5VvR- WRYoW4&index=3 mCYX8&list=PLSEZzvupP7hNjq3Tuv2hiE5VvR- WRYoW4&index=3 Multiprocessor Correctness and Cache Coherence  VZKMgItDM&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&i ndex=32 VZKMgItDM&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&i ndex=32 51