1 Outline for Intro to Multiprocessing
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
–Shared Memory, Message Passing, Data Parallel
Issues in Programming Models
–Function: naming, operations, & ordering
–Performance: latency, bandwidth, etc.
Examples of Commercial Machines

2 Applications: Science and Engineering
Examples
–Weather prediction
–Evolution of galaxies
–Oil reservoir simulation
–Automobile crash tests
–Drug development
–VLSI CAD
–Nuclear bomb simulation
These applications typically model physical systems or phenomena
Problems are 2D or 3D
Usually require “number crunching”
Involve “true” parallelism

3 So Why Is MP Research Disappearing?
Where is MP research presented?
–ISCA = International Symposium on Computer Architecture
–ASPLOS = Architectural Support for Programming Languages and Operating Systems
–HPCA = High Performance Computer Architecture
–SPAA = Symposium on Parallel Algorithms and Architectures
–ICS = International Conference on Supercomputing
–PACT = Parallel Architectures and Compilation Techniques
–Etc.
Why is the number of MP papers at ISCA falling?
Read “The Rise and Fall of MP Research” by Mark D. Hill and Ravi Rajwar

4 Outline
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
Issues in Programming Models
Examples of Commercial Machines

5 In Theory
Sequential
–Time to sum n numbers? O(n)
–Time to sort n numbers? O(n log n)
–What model? RAM
Parallel
–Time to sum? A tree of pairwise additions gives O(log n) (sketch below)
–Time to sort? Non-trivially O(log n)
–What model?
»PRAM [Fortune & Wyllie, STOC ’78]
»P processors in lock-step
»One memory (e.g., CREW: concurrent read, exclusive write)
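To make the O(log n) tree sum concrete, here is a minimal C sketch (the array contents and size are ours, purely for illustration). On a PRAM, each pass of the inner loop would execute as one parallel step across processors; the sequential outer loop stands in for the O(log n) rounds.

    #include <stdio.h>

    /* Tree sum: in each round, element i absorbs element i + stride.
     * After ceil(log2(n)) rounds, a[0] holds the total. On a PRAM,
     * every iteration of the inner loop runs as one parallel step. */
    int main(void) {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        int n = 8;
        for (int stride = 1; stride < n; stride *= 2)        /* O(log n) rounds */
            for (int i = 0; i + stride < n; i += 2 * stride)
                a[i] += a[i + stride];                       /* parallel on a PRAM */
        printf("sum = %d\n", a[0]);                          /* prints 36 */
        return 0;
    }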

6 But in Practice, How Do You
Name a datum (e.g., all say a[2])?
Communicate values?
Coordinate and synchronize?
Select processing node size (from a few-bit ALU to a PC)?
Select the number of nodes in the system?

7 Historical View
(Figure: three machine organizations, each built from processor–memory (P–M) nodes)
Join at / program with:
–Join at I/O (network): program with Message Passing
–Join at memory: program with Shared Memory
–Join at processor: program with Data Parallel / Single-Instruction Multiple-Data (SIMD), (Dataflow/Systolic)

8 Historical View, cont.
Machine → Programming Model
–Join at network, so program with message passing model
–Join at memory, so program with shared memory model
–Join at processor, so program with SIMD or data parallel
Programming Model → Machine
–Message-passing programs on message-passing machine
–Shared-memory programs on shared-memory machine
–SIMD/data-parallel programs on SIMD/data-parallel machine
But
–Isn’t hardware basically the same? Processors, memory, & I/O?
–Convergence! Why not have a generic parallel machine and program with the model that fits the problem?

9 A Generic Parallel Machine
Separation of programming models from architectures
All models require communication
Node with processor(s), memory, and communication assist (CA)
(Figure: Nodes 0–3, each with processor P, cache $, memory, and CA, connected by an interconnect)

10 Outline
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
–Shared Memory
–Message Passing
–Data Parallel
Issues in Programming Models
Examples of Commercial Machines

11 Programming Model
How does the programmer view the system?
–Shared memory: processors execute instructions and communicate by reading/writing a globally shared memory
–Message passing: processors execute instructions and communicate by explicitly sending messages
–Data parallel: processors execute the same instructions at the same time, but on different data

12 Today’s Parallel Computer Architecture
Extension of traditional computer architecture to support communication and cooperation
–Communications architecture
(Figure: layered view)
–User level: programming models (Multiprogramming, Shared Memory, Message Passing, Data Parallel), libraries and compilers, communication abstraction
–System level: operating system support, then the hardware/software boundary, communication hardware, and the physical communication medium

13 Simple Problem
    for i = 1 to N
        A[i] = (A[i] + B[i]) * C[i]
        sum = sum + A[i]
How do I make this parallel?

14 Simple Problem
    for i = 1 to N
        A[i] = (A[i] + B[i]) * C[i]
        sum = sum + A[i]
Split the loops (a C sketch follows)
»Independent iterations
    for i = 1 to N
        A[i] = (A[i] + B[i]) * C[i]
    for i = 1 to N
        sum = sum + A[i]
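One way to express the split in real code is the minimal C sketch below, using OpenMP (OpenMP is our choice here; the slide names no particular API). The first loop's iterations are independent, so they parallelize directly; the second loop is a reduction, which the reduction clause handles by combining per-thread partial sums.

    #include <stdio.h>
    #define N 1024

    float A[N], B[N], C[N];

    int main(void) {
        float sum = 0.0f;
        for (int i = 0; i < N; i++)
            A[i] = B[i] = C[i] = 1.0f;               /* sample data, ours */

        #pragma omp parallel for                     /* independent iterations */
        for (int i = 0; i < N; i++)
            A[i] = (A[i] + B[i]) * C[i];

        #pragma omp parallel for reduction(+:sum)    /* safe combining of partial sums */
        for (int i = 0; i < N; i++)
            sum += A[i];

        printf("sum = %f\n", sum);                   /* (1+1)*1 summed N times = 2048 */
        return 0;
    }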

15 Programming Model
Historically, Programming Model == Architecture
Thread(s) of control that operate on data
Provides a communication abstraction that is a contract between hardware and software (à la the ISA)
Current Programming Models
1) Shared Memory (Shared Address Space)
2) Message Passing
3) Data Parallel

16 Global Shared Physical Address Space
Communication, sharing, and synchronization with loads/stores on shared variables
Must map virtual pages to physical page frames
Consider OS support for a good mapping
(Figure: processors P0…Pn issue loads/stores into a machine physical address space with a shared portion, mapped to common physical addresses, plus per-processor private portions)

17 Page Mapping in Shared Memory MP
(Figure: Nodes 0–3 on an interconnect, each with processor P, cache $, memory, and network interface NI; Node 0 holds physical addresses 0 to N-1, Node 1 holds N to 2N-1, Node 2 holds 2N to 3N-1, Node 3 holds 3N to 4N-1)
Keep private data and frequently used shared data on the same node as the computation
A load by Processor 0 to address N+3 goes to Node 1 (see the helper below)
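This works because, with N physical addresses homed on each node in a block distribution, the home node of an address is just the address divided by N. A one-line C helper (the function name is ours):

    /* Home node under block distribution: node k holds addresses
     * [k*N, (k+1)*N - 1], so the home of addr is addr / N.
     * For addr = N + 3 this returns 1, matching the slide. */
    long home_node(long addr, long N) { return addr / N; }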

18 Return of The Simple Problem
    private int i, my_start, my_end, mynode;
    shared float A[N], B[N], C[N], sum;
    for i = my_start to my_end
        A[i] = (A[i] + B[i]) * C[i]
    GLOBAL_SYNCH;
    if (mynode == 0)
        for i = 1 to N
            sum = sum + A[i]
Can run this on any shared memory machine (a runnable pthreads sketch follows)
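A runnable version of the same idea, sketched with POSIX threads and a barrier standing in for GLOBAL_SYNCH (pthreads, P = 4, and the sample data are our assumptions):

    #include <pthread.h>
    #include <stdio.h>
    #define N 1024
    #define P 4                                /* number of threads */

    float A[N], B[N], C[N], sum;               /* shared data */
    pthread_barrier_t barrier;                 /* plays the role of GLOBAL_SYNCH */

    void *work(void *arg) {
        long mynode = (long)arg;
        long my_start = mynode * (N / P), my_end = my_start + N / P;
        for (long i = my_start; i < my_end; i++)
            A[i] = (A[i] + B[i]) * C[i];       /* each thread updates its slice */
        pthread_barrier_wait(&barrier);        /* wait until all slices are done */
        if (mynode == 0)                       /* thread 0 serially accumulates */
            for (long i = 0; i < N; i++)
                sum += A[i];
        return NULL;
    }

    int main(void) {
        pthread_t t[P];
        for (long i = 0; i < N; i++) A[i] = B[i] = C[i] = 1.0f;
        pthread_barrier_init(&barrier, NULL, P);
        for (long p = 0; p < P; p++) pthread_create(&t[p], NULL, work, (void *)p);
        for (long p = 0; p < P; p++) pthread_join(t[p], NULL);
        printf("sum = %f\n", sum);             /* (1+1)*1 summed N times = 2048 */
        return 0;
    }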

19 Message Passing Architectures
Cannot directly access memory on another node
Examples: IBM SP-2, Intel Paragon, clusters of workstations
(Figure: Nodes 0–3 on an interconnect, each with processor P, cache $, memory, and CA; each node has its own local addresses 0 to N-1)

20 Message Passing Programming Model
User-level send/receive abstraction
–Local buffer (x, y), process (P, Q), and tag (t)
–Naming and synchronization
(Figure: Process P executes Send x, Q, t; Process Q executes Recv y, P, t; the message is matched on process and tag, copying from address x in P’s local address space to address y in Q’s)

21 The Simple Problem Again
    int i, my_start, my_end, mynode;
    float A[N/P], B[N/P], C[N/P], sum;
    for i = 1 to N/P
        A[i] = (A[i] + B[i]) * C[i]
        sum = sum + A[i]
    if (mynode != 0)
        send (sum, 0);
    if (mynode == 0)
        for i = 1 to P-1
            recv(tmp, i)
            sum = sum + tmp
Send/Recv communicates and synchronizes the P processors (an MPI version follows)
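The same program in runnable form, sketched with MPI (MPI is our choice of message-passing library; the slide's send/recv is generic, and CHUNK stands in for N/P):

    #include <mpi.h>
    #include <stdio.h>
    #define CHUNK 256                          /* N/P elements per node */

    int main(int argc, char **argv) {
        int mynode, P;
        float A[CHUNK], B[CHUNK], C[CHUNK], sum = 0.0f, tmp;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        for (int i = 0; i < CHUNK; i++) A[i] = B[i] = C[i] = 1.0f;
        for (int i = 0; i < CHUNK; i++) {      /* local slice only */
            A[i] = (A[i] + B[i]) * C[i];
            sum += A[i];
        }
        if (mynode != 0) {                     /* everyone else sends a partial sum */
            MPI_Send(&sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        } else {                               /* node 0 gathers and accumulates */
            for (int i = 1; i < P; i++) {
                MPI_Recv(&tmp, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum += tmp;
            }
            printf("sum = %f\n", sum);
        }
        MPI_Finalize();
        return 0;
    }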

22 Separation of Architecture from Model
At the lowest level, shared memory sends messages
–Hardware is specialized to expedite read/write messages
What programming model / abstraction is supported at user level?
Can I have a shared-memory abstraction on message-passing hardware?
Can I have a message-passing abstraction on shared-memory hardware?
Some 1990s research machines integrated both
–MIT Alewife, Wisconsin Tempest/Typhoon, Stanford FLASH

23 Data Parallel
Programming model
–Operations are performed on each element of a large (regular) data structure in a single step
–Arithmetic, global data transfer
A processor is logically associated with each data element
–SIMD architecture: Single Instruction, Multiple Data
Early architectures mirrored the programming model
–Many bit-serial processors
Today we have FP units and caches on microprocessors
Can support the data parallel model on a shared memory or message passing architecture

24 The Simple Problem Strikes Back
Assuming we have N processors
    A = (A + B) * C
    sum = global_sum (A)
The language supports array assignment (a software lowering is sketched below)
Special HW support for global operations
–CM-2: bit-serial processors
–CM-5: 32-bit SPARC processors
»Message Passing and Data Parallel models
»Special control network
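Without such hardware, a compiler (or programmer) can lower the array statement to a per-node loop plus one collective operation; a sketch using MPI_Allreduce as the global_sum (MPI and CHUNK are our assumptions, as before):

    #include <mpi.h>
    #define CHUNK 256                          /* this node's slice of A, B, C */

    /* Data-parallel A = (A + B) * C; sum = global_sum(A), lowered to
     * an element-wise local loop plus a single collective. */
    float simple_problem(float *A, float *B, float *C) {
        float local = 0.0f, global = 0.0f;
        for (int i = 0; i < CHUNK; i++) {
            A[i] = (A[i] + B[i]) * C[i];       /* one logical processor per element */
            local += A[i];
        }
        /* Every node contributes its partial sum and receives the total. */
        MPI_Allreduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        return global;
    }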

25 Aside: Single Program, Multiple Data (SPMD)
SPMD programming model
–Each processor executes the same program on different data
–Many SPLASH-2 benchmarks are written in the SPMD model
    for each molecule at this processor {
        simulate interactions with myproc+1 and myproc-1;
    }
Not connected to SIMD architectures
–Not lockstep instructions
–Could execute different instructions on different processors
»Data-dependent branches cause divergent paths

26 Review: Separation of Model and Architecture
Shared Memory
–Single shared address space
–Communicate, synchronize using load / store
–Can support message passing
Message Passing
–Send / Receive
–Communication + synchronization
–Can support shared memory
Data Parallel
–Lock-step execution on regular data structures
–Often requires global operations (sum, max, min, ...)
–Can support on either shared memory or message passing

27 Review: A Generic Parallel Machine
Separation of programming models from architectures
All models require communication
Node with processor(s), memory, and communication assist (CA)
(Figure: as on slide 9 — Nodes 0–3, each with processor P, cache $, memory, and CA, connected by an interconnect)

28 Outline
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
Issues in Programming Models
–Function: naming, operations, & ordering
–Performance: latency, bandwidth, etc.
Examples of Commercial Machines

29 Programming Model Design Issues
Naming: How is communicated data and/or the partner node referenced?
Operations: What operations are allowed on named data?
Ordering: How can producers and consumers of data coordinate their activities?
Performance
–Latency: How long does it take to communicate in a protected fashion?
–Bandwidth: How much data can be communicated per second? How many operations per second?

30 Issue: Naming
Single global linear address space (shared memory)
Multiple local address/name spaces (message passing)
The naming strategy affects
–Programmer / software
–Performance
–Design complexity

31 Issue: Operations
Uniprocessor RISC
–Ld/St and atomic operations on memory
–Arithmetic on registers
Shared Memory Multiprocessor
–Ld/St and atomic operations on local/global memory
–Arithmetic on registers
Message Passing Multiprocessor
–Send/receive on local memory
–Broadcast
Data Parallel
–Ld/St
–Global operations (add, max, etc.)

32 Issue: Ordering
Uniprocessor
–Programmer sees order as program order
–Out-of-order execution
–Write buffers
–Important to maintain dependencies
Multiprocessor
–What is the order among several threads accessing shared data?
–What effect does this have on performance?
–What if the implicit order is insufficient?
–A memory consistency model specifies the rules for ordering (example below)
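A concrete illustration of why the ordering rules matter, written with C11 atomics (the flag/data hand-off is a standard textbook example, not from the slides; release/acquire is one possible fix):

    #include <stdatomic.h>

    int data;                        /* the payload */
    atomic_int flag;                 /* 1 means "data is ready" */

    /* Thread 1: if the two stores may be reordered (by the compiler, a
     * write buffer, or an out-of-order core), thread 2 can observe
     * flag == 1 while data is still stale. memory_order_release forbids
     * moving the data store below the flag store. */
    void producer(void) {
        data = 42;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    /* Thread 2: memory_order_acquire pairs with the release above, so
     * once flag == 1 is seen, data == 42 is guaranteed to be visible. */
    int consumer(void) {
        while (!atomic_load_explicit(&flag, memory_order_acquire))
            ;                        /* spin until the producer publishes */
        return data;
    }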

33 Issue: Order/Synchronization
Coordination mainly takes three forms:
–Mutual exclusion (e.g., spin-locks)
–Event notification
»Point-to-point (e.g., producer-consumer)
»Global (e.g., end-of-phase indication, to all or a subset of processes)
–Global operations (e.g., sum)
Issues:
–Synchronization name space (entire address space or a portion)
–Granularity (per byte, per word, ... => overhead)
–Low latency, low serialization (hot spots)
–Variety of approaches (a spin-lock sketch follows)
»Test&set, compare&swap, load-locked/store-conditional
»Full/empty bits and traps
»Queue-based locks, fetch&op with combining
»Scans
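As one concrete instance of the test&set approach, a minimal spin-lock sketch using C11's atomic_flag, the standard's portable test-and-set (the names are ours):

    #include <stdatomic.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;     /* clear == unlocked */

    /* Test&set: atomically set the flag and return its previous value;
     * if it was already set, another processor holds the lock, so spin.
     * Under contention every waiter hammers the same location (a hot
     * spot), which is what the queue-based locks above are designed
     * to avoid. */
    void acquire(void) {
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;                                /* busy-wait */
    }

    void release(void) {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }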

34 Performance Issue: Latency
Must deal with latency when using fast processors
Options:
–Reduce the frequency of long-latency events
»Algorithmic changes, computation and data distribution
–Reduce latency
»Cache shared data, network interface design, network design
–Tolerate latency
»Message passing overlaps computation with communication (program controlled)
»Shared memory overlaps access completion and computation using the consistency model and prefetching (sketch below)
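As a small instance of the third option, a software-prefetch sketch using the GCC/Clang __builtin_prefetch builtin (the loop body and the prefetch distance of 16 are illustrative choices, not from the slides):

    /* Tolerate latency by issuing a non-binding prefetch a few iterations
     * ahead, so the memory access overlaps with useful computation. */
    void scale(float *a, const float *b, long n) {
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&b[i + 16], 0, 1);  /* read, low reuse */
            a[i] = 2.0f * b[i];
        }
    }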

35 Performance Issue: Bandwidth
Private and global bandwidth requirements
Private bandwidth requirements can be supported by:
–Distributing main memory among PEs
–Application changes, local caches, memory system design
Global bandwidth requirements can be supported by:
–Scalable interconnection network technology
–Distributed main memory and caches
–Efficient network interfaces
–Avoiding contention (hot spots) through application changes

36 Cost of Communication
Cost = Frequency x (Overhead + Latency + Transfer size / Bandwidth - Overlap) (worked example below)
Frequency = number of communications per unit work
–Algorithm, placement, replication, bulk data transfer
Overhead = processor cycles spent initiating or handling
–Protection checks, status, buffer management, copies, events
Latency = time to move bits from source to destination
–Communication assist, topology, routing, congestion
Transfer time = time through the bottleneck
–Communication assist, links, congestion
Overlap = portion overlapped with useful work
–Communication assist, communication operations, processor design
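A quick worked instance of the model (every number below is made up for illustration):

    #include <stdio.h>

    /* Cost = Frequency x (Overhead + Latency + Transfer size/Bandwidth - Overlap).
     * Hypothetical numbers: 1000 messages per unit of work, 1 us overhead,
     * 2 us latency, 8 KB per message over a 1 GB/s link, 3 us overlapped. */
    int main(void) {
        double freq = 1000.0;
        double overhead_us = 1.0, latency_us = 2.0, overlap_us = 3.0;
        double xfer_us = 8192.0 / 1e9 * 1e6;   /* 8 KB at 1 GB/s = 8.192 us */
        double cost_us = freq * (overhead_us + latency_us + xfer_us - overlap_us);
        printf("communication cost = %.1f us\n", cost_us);   /* 8192.0 us */
        return 0;
    }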

37 Outline
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
Issues in Programming Models
Examples of Commercial Machines

38 Example: Intel Pentium Pro Quad
–All coherence and multiprocessing glue in the processor module
–Highly integrated, targeted at high volume
–Low latency and bandwidth

39 Example: SUN Enterprise (pre-E10K)
–16 cards of either type: processors + memory, or I/O
–All memory is accessed over the bus, so the machine is symmetric
–Higher bandwidth, higher latency bus

40 Example: Cray T3E
–Scales up to 1024 processors with 480 MB/s links
–The memory controller generates communication requests for non-local references
–No hardware mechanism for coherence (SGI Origin and similar machines provide this)

41 Example: Intel Paragon
(Figure: an Intel Paragon node — two i860 processors with L1 caches on a 64-bit, 50 MHz memory bus, plus a network interface, DMA engine, driver, memory controller, and 4-way interleaved DRAM; network links are 8 bits wide at 175 MHz, bidirectional)
–2D grid network with a processing node attached to every switch
–Photo: Sandia’s Intel Paragon XP/S-based supercomputer

42 Example: IBM SP-2
–Made out of essentially complete RS/6000 workstations
–Network interface integrated into the I/O bus (bandwidth limited by the I/O bus)

43 Summary
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
Issues in Programming Models
Examples of Commercial Machines