Parallel Computing: Basics of Parallel Computers, Shared Memory (SMP / NUMA Architectures), Message Passing (Clusters)

Presentation transcript:

1 Parallel Computing
Basics of Parallel Computers
Shared Memory: SMP / NUMA Architectures
Message Passing: Clusters

2 Why Parallel Computing
No matter how effective ILP/Moore's Law is, more is better:
- Most systems run multiple applications simultaneously
- Overlapping downloads with other work
- Web browser (overlaps image retrieval with display)
- Total cost of ownership favors fewer systems with multiple processors rather than more systems with fewer processors
[Figure: price vs. performance for the P+M, 2P+M, and 2P+2M configurations]
- Peak performance increases linearly with more processors
- Adding a processor/memory is much cheaper than a second complete system

3 What about Sequential Code?
Sequential programs get no benefit from multiple processors; they must be parallelized.
- The key property is how much communication occurs per unit of computation: the less communication per unit of computation, the better the scaling properties of the algorithm.
- Sometimes a multi-threaded design is good on both uni- and multiprocessors, e.g., throughput for a web server that uses system multi-threading.
Speedup is limited by Amdahl's Law:
- Speedup <= 1 / (seq + (1 - seq) / proc)
- As proc -> infinity, speedup is limited to 1/seq.
Many applications can be (re)designed/coded/compiled to generate cooperating, parallel instruction streams, specifically to enable improved responsiveness/throughput with multiple processors.
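To make the limit concrete, here is a small C sketch (not from the slides; the 10% sequential fraction and the processor counts are arbitrary illustrative values) that evaluates Amdahl's Law:

    #include <stdio.h>

    /* Amdahl's Law: speedup(p) = 1 / (seq + (1 - seq) / p),
     * where seq is the fraction of the program that stays sequential. */
    static double amdahl_speedup(double seq, int procs)
    {
        return 1.0 / (seq + (1.0 - seq) / procs);
    }

    int main(void)
    {
        double seq = 0.10;                 /* assume 10% of the work is sequential */
        int procs[] = { 2, 8, 64, 1024 };
        for (int i = 0; i < 4; i++)
            printf("p = %4d  speedup = %.2f\n", procs[i], amdahl_speedup(seq, procs[i]));
        /* As p grows, the speedup approaches 1/seq = 10, no matter how many processors. */
        return 0;
    }

Even with 1024 processors the speedup stays below 10, which is why the sequential fraction dominates scaling.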

4 Performance of parallel algorithms is NOT limited by which factor?
1. The need to synchronize work done on different processors.
2. The portion of code that remains sequential.
3. The need to redesign algorithms to be more parallel.
4. The increased cache area due to multiple processors.

5 Parallel Programming Models
Parallel programming involves:
- Decomposing an algorithm into parts
- Distributing the parts as tasks that are worked on by multiple processors simultaneously
- Coordinating the work and communication of those processors
- Synchronization
Parallel programming considerations:
- Type of parallel architecture being used
- Type of processor communication used
No compiler or language exists that automates this "parallelization" process. Two programming models exist:
- Shared Memory
- Message Passing

6 Process Coordination: Shared Memory vs. Message Passing
Shared memory:
- Efficient, familiar
- Not always available
- Potentially insecure

    global int x

    process foo              process bar
    begin                    begin
      :                        :
      x := ...                 y := x
      :                        :
    end foo                  end bar

Message passing:
- Extensible to communication in distributed systems
Canonical syntax:

    send(process : process_id, message : string)
    receive(process : process_id, var message : string)

7 Shared Memory Programming Model
Programs/threads communicate and cooperate via loads/stores to memory locations they share.
Communication is therefore at memory-access speed (very fast), and is implicit.
Cooperating pieces must all execute on the same system (computer).
OS services and/or libraries are used for creating tasks (processes/threads) and for coordination (semaphores/barriers/locks).
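As a concrete illustration of this model (a minimal sketch, not part of the original slides), two POSIX threads communicate through a shared counter and coordinate with a lock:

    #include <pthread.h>
    #include <stdio.h>

    /* Shared state: communication happens through ordinary loads/stores. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);    /* coordination via a lock */
            counter++;                    /* implicit communication via shared memory */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* prints 2000000 */
        return 0;
    }

Both threads must run on the same machine, and nothing in the code says "send the counter": the sharing is implicit in the memory accesses.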

8 Shared Memory Code

    fork N processes; each process has a number, p, and computes
    its own istart[p], iend[p], jstart[p], jend[p]

    for (s = 0; s < STEPS; s++) {
        k = s & 1; m = k ^ 1;                       // ping-pong between the two copies of a[][][]
        forall (i = istart[p]; i <= iend[p]; i++) {
            forall (j = jstart[p]; j <= jend[p]; j++) {
                a[k][i][j] = c1*a[m][i][j] + c2*a[m][i-1][j] + c3*a[m][i+1][j]
                           + c4*a[m][i][j-1] + c5*a[m][i][j+1];   // implicit comm
            }
        }
        barrier();                                  // all processes finish step s before starting s+1
    }
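For a runnable analogue (a sketch under assumed grid size, step count, and coefficients, not the slide's original code), the same five-point stencil can be written with OpenMP, where the parallel loop performs the index partitioning and ends with an implicit barrier:

    #include <omp.h>

    #define N     256
    #define STEPS 100

    static double a[2][N][N];
    static const double c1 = 0.5, c2 = 0.125, c3 = 0.125, c4 = 0.125, c5 = 0.125;

    void stencil(void)
    {
        for (int s = 0; s < STEPS; s++) {
            int k = s & 1, m = k ^ 1;
            /* The parallel for splits the i range across threads and ends with
             * an implicit barrier, playing the role of barrier() on the slide. */
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    a[k][i][j] = c1*a[m][i][j]
                               + c2*a[m][i-1][j] + c3*a[m][i+1][j]
                               + c4*a[m][i][j-1] + c5*a[m][i][j+1];
        }
    }

Compile with an OpenMP-capable compiler (e.g., gcc -fopenmp); each thread writes only its own rows of a[k] and reads a[m], so no explicit locking is needed.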

9 Symmetric Multiprocessors
Several processors share one address space: conceptually a shared memory.
Communication is implicit: read and write accesses to shared memory locations.
Synchronization:
- via shared memory locations: spin waiting for non-zero
- atomic instructions (test&set, compare&swap, load-linked/store-conditional)
- barriers
[Figure: conceptual model with processors P connected to a shared memory M through a network]
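To show how spin waiting is built on an atomic instruction (an illustrative sketch, not from the slides), a test-and-set spin lock in C11 atomics:

    #include <stdatomic.h>

    /* A spin lock: a processor spins until it observes the flag clear
     * and manages to set it atomically. Initialize with ATOMIC_FLAG_INIT. */
    typedef struct { atomic_flag flag; } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        /* test-and-set returns the previous value; keep spinning while
         * some other processor already holds the lock. */
        while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
            ;   /* spin wait */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->flag, memory_order_release);
    }

Usage: declare spinlock_t lock = { ATOMIC_FLAG_INIT }; and bracket critical sections with spin_lock(&lock) and spin_unlock(&lock). Compare&swap or load-linked/store-conditional can implement the same primitive on other ISAs.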

10 Non-Uniform Memory Access (NUMA)
CPU/memory buses cannot support more than roughly 4-8 CPUs before bus bandwidth is exceeded (the SMP "sweet spot").
Providing shared-memory multiprocessors beyond these limits requires some memory to be "closer to" some processors than to others.
The "interconnect" usually includes:
- a cache directory to reduce snoop traffic
- a remote cache to reduce access latency (think of it as an L3)
Cache-coherent NUMA systems (CC-NUMA): SGI Origin, Stanford DASH, Sequent NUMA-Q, HP Superdome
Non-cache-coherent NUMA (NCC-NUMA): Cray T3E
[Figure: NUMA nodes, each with processors (P), caches ($), and local memory (M), joined by an interconnect]

11 Message Passing Programming Model
"Shared" data is communicated using "send"/"receive" services (across an external network).
Unlike the shared-memory model, shared data must be formatted into message chunks for distribution (the shared-memory model works no matter how the data is intermixed).
Coordination is via sending/receiving messages.
Program components can be run on the same or different systems, so 1,000s of processors can be used.
"Standard" libraries exist to encapsulate messages:
- Parasoft's Express (commercial)
- PVM (Parallel Virtual Machine, non-commercial)
- MPI (Message Passing Interface, also non-commercial)
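MPI is named on the slide; the particular program below, its buffer size, and its message tag are illustrative assumptions. Rank 0 packs data into an explicit message and sends it; rank 1 receives it:

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        char buf[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            strcpy(buf, "hello from rank 0");
            /* Shared data must be packed into an explicit message. */
            MPI_Send(buf, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received: %s\n", buf);
        }

        MPI_Finalize();
        return 0;
    }

Launched as, for example, mpirun -np 2 ./a.out, the two ranks may sit on the same machine or on different cluster nodes; the code is the same either way.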

12 Message Passing Issues
Synchronization semantics: when does a send/receive operation terminate?
- Blocking (aka synchronous): the sender waits until its message is received; the receiver waits if no message is available.
- Non-blocking (aka asynchronous): the send operation returns "immediately"; a receive returns even if no message is available (polling).
- Partially blocking/non-blocking: send()/receive() with a timeout.
[Figure: sender and receiver communicating through OS kernel buffers. How many buffers?]
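To show the two semantics side by side (a hedged sketch, not from the slides; data, n, peer, and the tags are placeholder parameters), MPI offers both a blocking and a non-blocking send:

    #include <mpi.h>

    void send_examples(double *data, int n, int peer)
    {
        MPI_Request req;

        /* Blocking: returns only once 'data' may safely be reused
         * (the message has been buffered or delivered). */
        MPI_Send(data, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);

        /* Non-blocking: returns immediately; the transfer proceeds in the
         * background and must be completed before 'data' is reused. */
        MPI_Isend(data, n, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &req);
        /* ... overlap useful computation here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

Where the message sits in the meantime (sender buffer, kernel/NIC buffers, receiver buffer) is exactly the "how many buffers?" question on the slide.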

13 Clustered Computers Designed for Message Passing
A collection of computers (nodes) connected by a network:
- computers augmented with a fast network interface
- send, receive, barrier
- user-level, memory mapped
- otherwise indistinguishable from a conventional PC or workstation
One approach is to network workstations with a very fast network; such machines are often called "cluster computers":
- Berkeley NOW
- IBM SP2 (remember Deep Blue?)
[Figure: cluster of nodes, each with a processor (P) and memory (M), connected by a network]

14 Which is easier to program? 1. Shared memory 2. Message passing