1 Introduction to Parallel Architectures and Programming Models.

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

1 Introduction to Parallel Architectures and Programming Models

2 A generic parallel architecture
[Figure: processors (P) and memory modules (M) connected by an interconnection network]
Key question: where is the memory physically located?

3 Classifying computer architectures
Computers are often classified using two different measures:
- Memory organization: shared memory vs. distributed memory.
- Instruction stream and data stream (Flynn's taxonomy):
    SISD: single instruction, single data
    SIMD: single instruction, multiple data
    MISD: multiple instruction, single data
    MIMD: multiple instruction, multiple data

4 Parallel Programming Models
- Control: How is parallelism created? What orderings exist between operations? How do different threads of control synchronize?
- Data: What data is private vs. shared? How is logically shared data accessed or communicated?
- Operations: What are the atomic operations?
- Cost: How do we account for the cost of each of the above?

5 Simple Example
Consider computing the sum of a function applied to an array: s = f(A[1]) + ... + f(A[n]).
Parallel decomposition:
- Each evaluation of f and each partial sum is a task.
- Assign n/p numbers to each of p processors.
- Each processor computes independent "private" results and a partial sum.
- One (or all) collects the p partial sums and computes the global sum.
Two classes of data:
- Logically shared: the original n numbers, the global sum.
- Logically private: the individual function evaluations. What about the individual partial sums?
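
As a point of reference, a minimal serial baseline for this running example might look like the sketch below (illustration only; f() and the array contents are placeholders, not part of the original slides):

    /* Serial baseline: s = f(A[0]) + ... + f(A[n-1]). */
    #include <stddef.h>

    static double f(double x) { return x * x; }   /* stand-in for the real f */

    double sum_f(const double *A, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += f(A[i]);                         /* each evaluation is a potential task */
        return s;
    }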

6 Programming Model 1: Shared Memory
- Program is a collection of threads of control. Many languages allow threads to be created dynamically, i.e., mid-execution.
- Each thread has a set of private variables (e.g., local variables on its stack) together with a set of shared variables (e.g., static variables, shared common blocks, the global heap).
- Threads communicate implicitly by writing and reading shared variables.
- Threads coordinate using synchronization operations on shared variables.
[Figure: a shared address space (s, x, y) visible to all processors, plus per-thread private variables (i, res)]
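
A minimal shared-memory sketch using POSIX threads (illustration only; the names and the placeholder computation are made up for this example). The array partial is shared by all threads; id and my_result are private to each thread:

    #include <pthread.h>
    #include <stdio.h>
    #define NTHREADS 4

    static double partial[NTHREADS];        /* shared: visible to every thread */

    static void *worker(void *arg) {
        int id = *(int *)arg;               /* private: each thread has its own copy */
        double my_result = id * 10.0;       /* private: placeholder computation */
        partial[id] = my_result;            /* write to this thread's own shared slot */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        double s = 0.0;
        for (int k = 0; k < NTHREADS; k++) {
            ids[k] = k;
            pthread_create(&t[k], NULL, worker, &ids[k]);
        }
        for (int k = 0; k < NTHREADS; k++)
            pthread_join(t[k], NULL);
        for (int k = 0; k < NTHREADS; k++)  /* combine after all threads finish */
            s += partial[k];
        printf("s = %g\n", s);
        return 0;
    }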

7 Machine Model 1a: Shared Memory
[Figure: processors P1..Pn, each with a cache ($), connected by a network to a shared memory]
- Processors are all connected to a large shared memory. Typically called Symmetric Multiprocessors (SMPs): Sun, DEC, Intel, IBM SMPs.
- "Local" memory is not (usually) part of the hardware abstraction.
- Cost: much cheaper to access data in cache than in main memory.
- Difficulty scaling to large numbers of processors; fewer than 10 processors is typical.

8 Machine Model 1b: Distributed Shared Memory
- Memory is logically shared but physically distributed.
- Any processor can access any address in memory.
- Cache lines (or pages) are passed around the machine.
- SGI Origin is the canonical example (plus research machines).
- Scales to hundreds of processors.
- The limitation is the cache consistency protocol: cached copies of the same address must be kept consistent.
[Figure: processors P1..Pn with caches, each with a piece of the memory, connected by a network]

9 Shared Memory Code for Computing a Sum
    static int s = 0;

    Thread 1                              Thread 2
    local_s1 = 0                          local_s2 = 0
    for i = 0, n/2-1                      for i = n/2, n-1
        local_s1 = local_s1 + f(A[i])         local_s2 = local_s2 + f(A[i])
    s = s + local_s1                      s = s + local_s2

What is the problem? A race condition (data race) occurs when:
- two processors (or two threads) access the same variable, and at least one does a write, and
- the accesses are concurrent (not synchronized).
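
A runnable C/pthreads sketch of the same unsynchronized update (illustration only; f() and the data are placeholders). Because the read-modify-write of s is not atomic, the program may print a value smaller than expected:

    #include <pthread.h>
    #include <stdio.h>
    #define N 1000000

    static double A[N];
    static double s = 0.0;                   /* shared, updated without synchronization */

    static double f(double x) { return x; }  /* placeholder for the real f */

    static void *half_sum(void *arg) {
        long lo = *(long *)arg;
        long hi = lo + N / 2;
        double local = 0.0;
        for (long i = lo; i < hi; i++)
            local += f(A[i]);
        s = s + local;                       /* DATA RACE: unsynchronized read-modify-write */
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) A[i] = 1.0;
        long starts[2] = {0, N / 2};
        pthread_t t1, t2;
        pthread_create(&t1, NULL, half_sum, &starts[0]);
        pthread_create(&t2, NULL, half_sum, &starts[1]);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %g (expected %d)\n", s, N);  /* one partial sum may be lost */
        return 0;
    }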

10 Pitfalls and Solution via Synchronization
Pitfall in computing the global sum s = s + local_si: instructions from different threads can be interleaved arbitrarily. One bad interleaving (initially s = 0):
    Thread 1: load s              [from mem to reg; reads 0]
    Thread 2: load s              [from mem to reg; also reads 0]
    Thread 1: s = s + local_s1    [= local_s1, in reg]
    Thread 2: s = s + local_s2    [= local_s2, in reg]
    Thread 1: store s             [from reg to mem]
    Thread 2: store s             [from reg to mem; overwrites Thread 1's result]
One of the additions may be lost.
Possible solution: mutual exclusion with locks. Each thread executes
    lock
    load s
    s = s + local_si
    store s
    unlock
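
A hedged sketch of the locked update in C, using a pthread mutex; it assumes the same N, A, s, and f as the race example above and only the shared update changes:

    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *half_sum_locked(void *arg) {
        long lo = *(long *)arg;
        long hi = lo + N / 2;
        double local = 0.0;
        for (long i = lo; i < hi; i++)
            local += f(A[i]);
        pthread_mutex_lock(&s_lock);     /* lock  */
        s = s + local;                   /* load s, add, store s -- now atomic w.r.t. other threads */
        pthread_mutex_unlock(&s_lock);   /* unlock */
        return NULL;
    }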

11 Programming Model 2: Message Passing
- Program consists of a collection of named processes, usually fixed at program startup time.
- Each process is a thread of control plus a local address space: NO shared data.
- Logically shared data is partitioned over the local processes.
- Processes communicate by explicit send/receive pairs; coordination is implicit in every communication event.
- MPI is the most common example.
[Figure: processes with private variables (i, res) and their pieces of the data, exchanging values via explicit send/receive]
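
A minimal message-passing sketch in C with MPI (illustrative only; error handling omitted, and the "partial result" is a placeholder). Each process owns its data, and partial results are combined with explicit messages:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local = rank + 1.0;   /* this process's private partial result (placeholder) */
        double s = 0.0;

        if (rank == 0) {             /* process 0 collects everyone's partial result */
            s = local;
            for (int p = 1; p < nprocs; p++) {
                double tmp;
                MPI_Recv(&tmp, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                s += tmp;
            }
            printf("global sum = %g\n", s);
        } else {
            MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }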

12 Machine Model 2: Distributed Memory
- Examples: Cray T3E, IBM SP.
- Each processor is connected to its own memory and cache but cannot directly access another processor's memory.
- Each "node" has a network interface (NI) for all communication and synchronization.
[Figure: nodes (processor + memory + NI) connected by an interconnect]

13 Computing s = x(1) + x(2) on Each Processor
First possible solution:
    Processor 1 (xlocal = x(1)):      Processor 2 (xlocal = x(2)):
        send xlocal, proc2                receive xremote, proc1
        receive xremote, proc2            send xlocal, proc1
        s = xlocal + xremote              s = xlocal + xremote

Second possible solution -- what could go wrong?
    Processor 1 (xlocal = x(1)):      Processor 2 (xlocal = x(2)):
        send xlocal, proc2                send xlocal, proc1
        receive xremote, proc2            receive xremote, proc1
        s = xlocal + xremote              s = xlocal + xremote

What if send/receive acts like the telephone system? The post office?
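
If send behaves like the telephone system (synchronous: both parties must be on the line), the second solution deadlocks: both processors block in send, each waiting for the other to reach its receive. With post-office (buffered) semantics it works. A hedged MPI sketch of a deadlock-free exchange using MPI_Sendrecv, which pairs the send and receive in one call (assumes exactly 2 processes; values are placeholders for x(1) and x(2)):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double xlocal = (rank == 0) ? 1.0 : 2.0;   /* stands in for x(1), x(2) */
        double xremote;
        int other = 1 - rank;

        /* Send our value and receive the partner's in one combined call,
           so neither process can block forever waiting for the other. */
        MPI_Sendrecv(&xlocal, 1, MPI_DOUBLE, other, 0,
                     &xremote, 1, MPI_DOUBLE, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        double s = xlocal + xremote;
        printf("rank %d: s = %g\n", rank, s);
        MPI_Finalize();
        return 0;
    }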

14 Programming Model 2b: Global Address Space
- Program consists of a collection of named processes, usually fixed at program startup time.
- Local and shared data, as in the shared memory model, but shared data is partitioned over the processes; remote data stays remote on distributed memory machines.
- Processes communicate by writes to shared variables; explicit synchronization is needed to coordinate.
- UPC, Titanium, and Split-C are some examples.
- Global address space programming is an intermediate point between message passing and shared memory.
- Most common on the Cray T3E, which had some hardware support for remote reads/writes.
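
A hedged sketch of the style in UPC (Unified Parallel C), one of the languages named above; details such as blocking factors are glossed over and the computation is a placeholder:

    #include <upc.h>
    #include <stdio.h>
    #define N 8                           /* elements per thread (illustrative) */

    shared double A[N * THREADS];         /* logically shared, physically distributed */

    int main(void) {
        int i;
        /* Each thread initializes only the elements it has affinity to. */
        upc_forall (i = 0; i < N * THREADS; i++; &A[i])
            A[i] = 1.0;
        upc_barrier;                      /* explicit synchronization */

        if (MYTHREAD == 0) {              /* thread 0 reads remote elements directly */
            double s = 0.0;
            for (i = 0; i < N * THREADS; i++)
                s += A[i];
            printf("sum = %g\n", s);
        }
        return 0;
    }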

15 Programming Model 3: Data Parallel
- A single thread of control consisting of parallel operations.
- Parallel operations are applied to all (or a defined subset) of a data structure, usually an array.
- Communication is implicit in the parallel operators.
- Elegant, and easy to understand and reason about.
- Coordination is implicit: statements execute synchronously.
- Drawbacks: not all problems fit this model, and it is difficult to map onto coarse-grained machines.
Example:
    A  = array of all data
    fA = f(A)
    s  = sum(fA)

16 Machine Model 3a: SIMD System
- A large number of (usually) small processors.
- A single "control processor" issues each instruction; each processor executes the same instruction. Some processors may be turned off on some instructions.
- The machines are not popular (e.g., CM-2), but the programming model is.
- Implemented by mapping n-fold parallelism onto p processors, mostly in the compiler (e.g., HPF).
[Figure: control processor broadcasting instructions to nodes (processor + memory + NI) on an interconnect]

17 Machine Model 3b: Vector Machines
- Vector architectures are based on a single processor with multiple functional units, all performing the same operation.
- Instructions may specify large amounts of parallelism (e.g., 64-way), but the hardware executes only a subset in parallel.
- Historically important, but overtaken by MPPs in the 1990s.
- Still visible as a processor architecture within an SMP.

18 Machine Model 4: Clusters of SMPs
- SMPs are the fastest commodity machines, so use them as building blocks for a larger machine with a network.
- Common names: CLUMP = Cluster of SMPs; hierarchical machines; constellations.
- Most modern machines look like this: IBM SPs, Compaq Alpha clusters (but not the T3E)...
- What is an appropriate programming model #4?
    - Treat the machine as "flat" and always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy).
    - Use shared memory within one SMP, but message passing outside of an SMP (a hybrid sketch follows below).
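
A hedged sketch of the second option as hybrid MPI + OpenMP in C (illustrative only; the summed quantity is a placeholder): message passing between SMP nodes, shared-memory threads within each node.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #define N 1000000

    int main(int argc, char **argv) {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each MPI process sums its block of the work using OpenMP threads. */
        long lo = (long)rank * N / nprocs, hi = (long)(rank + 1) * N / nprocs;
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (long i = lo; i < hi; i++)
            local += 1.0;                  /* placeholder for f(A[i]) */

        double global = 0.0;               /* message passing between nodes */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("global sum = %g\n", global);

        MPI_Finalize();
        return 0;
    }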

19 Summary So Far
- Historically, each parallel machine was unique, along with its programming model and programming language; you had to throw away your software and start over with each new kind of machine (ugh).
- Now we distinguish the programming model from the underlying machine, so we can write portably correct code that runs on many machines. MPI is now the most portable option, but it can be tedious.
- Writing portably fast code still requires tuning for the architecture. The algorithm design challenge is to make this process easy: for example, picking a block size rather than rewriting the whole algorithm.

20 Steps in Writing Parallel Programs

21 Creating a Parallel Program
Pieces of the job:
- Identify work that can be done in parallel.
- Partition work and perhaps data among processes (= threads).
- Manage data access, communication, and synchronization.
Goal: maximize the speedup due to parallelism.
    Speedup(P) = (time to solve the problem with the "best" sequential solution) / (time to solve it in parallel on P processors)
    Speedup(P) <= P (Brent's Theorem)
    Efficiency(P) = Speedup(P) / P <= 1

22 Steps in the Process
- Task: an arbitrarily defined piece of work that forms the basic unit of concurrency.
- Process/Thread: an abstract entity that performs tasks. Tasks are assigned to threads via an assignment mechanism, and threads must coordinate to accomplish their collective tasks.
- Processor: the physical entity that executes a thread.
The steps: overall computation --(decomposition)--> grains of work --(assignment)--> processes/threads --(orchestration)--> coordinated processes/threads --(mapping)--> processors.

23 Decomposition
Break the overall computation into grains of work (tasks).
- Identify concurrency and decide at what level to exploit it.
- Concurrency may be statically identifiable or may vary dynamically.
- It may depend only on problem size, or it may depend on the particular input data.
Goal: enough tasks to keep the target range of processors busy, but not too many. The decomposition establishes an upper limit on the number of useful processors (i.e., scaling).

24 Assignment
Determine the mechanism for dividing work among threads:
- Functional partitioning: assign logically distinct aspects of the work to different threads (e.g., pipelining).
- Structural mechanisms: assign iterations of a "parallel loop" according to a simple rule (e.g., processor j gets iterations j*n/p through (j+1)*n/p - 1; a small sketch follows below), or throw tasks into a bowl (task queue) and let threads feed.
- Data/domain decomposition: the data describing the problem has a natural decomposition, so break up the data and assign the work associated with each region (e.g., parts of the physical system being simulated).
Goals: balance the workload to keep everyone busy (all the time), and allow efficient orchestration.
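
A hedged C sketch of the block-assignment rule quoted above ("processor j gets iterations j*n/p through (j+1)*n/p - 1"), including the case where p does not divide n; the function and variable names are made up for this example:

    #include <stdio.h>

    /* Compute the range of iterations owned by thread j out of p, for n iterations. */
    static void block_range(long n, int p, int j, long *lo, long *hi) {
        *lo = (long)j * n / p;             /* first iteration owned by thread j */
        *hi = (long)(j + 1) * n / p - 1;   /* last iteration owned by thread j  */
    }

    int main(void) {
        long n = 10, lo, hi;
        int p = 4;
        for (int j = 0; j < p; j++) {
            block_range(n, p, j, &lo, &hi);
            printf("thread %d: iterations %ld..%ld\n", j, lo, hi);  /* block sizes differ by at most 1 */
        }
        return 0;
    }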

25 Orchestration
Provide a means of naming and accessing shared data, and of communication and coordination among threads of control.
Goals:
- Correctness of the parallel solution: respect the inherent dependencies within the algorithm; avoid serialization.
- Reduce the cost of communication, synchronization, and management.
- Preserve locality of data reference.

26 Mapping
Binding processes to physical processors.
- The time to reach a processor across the network does not (roughly) depend on which processor it is; there is lots of old literature on "network topology", but it is no longer so important.
- The basic issue is how many remote accesses are made.
[Figure: processor/cache/memory nodes on a network; cache access is fast, local memory is slow, remote memory is really slow]

27 Example
s = f(A[1]) + ... + f(A[n])
Decomposition:
- Computing each f(A[j]).
- n-fold parallelism, where n may be >> p.
- Computing the sum s.
Assignment:
- Thread k sums sk = f(A[k*n/p]) + ... + f(A[(k+1)*n/p - 1]).
- Thread 1 sums s = s1 + ... + sp (for simplicity of this example).
- Thread 1 communicates s to the other threads.
Orchestration:
- Starting up threads; communicating and synchronizing with thread 1.
Mapping:
- Processor j runs thread j.

28 Cost Modeling and Performance Tradeoffs

29 Identifying Enough Concurrency
Amdahl's law: let s be the fraction of the total work done sequentially.
Simple decomposition: each f(A[i]) is a parallel task; the sum is sequential.
[Figure: parallelism profile over time; the area under the profile is the total work. Before mapping: concurrency n for n x time(f), then 1 for time(sum(n)). After mapping to p processors: concurrency p for n/p x time(f), then 1 for the sum.]
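
For reference, the standard statement of Amdahl's law behind this slide (not spelled out on the slide itself): with serial fraction s, the speedup on p processors is bounded by

    Speedup(p) <= 1 / ( s + (1 - s)/p ) <= 1/s

so even with unlimited processors the speedup can never exceed 1/s.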

30 Algorithmic Trade-offs
Parallelize the partial sums of the f's:
- What fraction of the computation is now "sequential"?
- What does this do for communication? Locality?
- What if you sum only what you "own"?
[Profile: concurrency p for n/p x time(f), then p for time(sum(n/p)), then 1 for time(sum(p)).]
Parallelize the final summation as well (tree sum; see the sketch below):
[Profile: concurrency p for n/p x time(f), then p for time(sum(n/p)), then 1 for time(sum(log_2 p)).]
Generalize Amdahl's law for an arbitrary "ideal" parallelism profile.
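
A hedged MPI sketch of the tree sum over p partial sums (MPI_Reduce performs an equivalent reduction internally; the explicit loop is shown only to make the log_2 p depth visible). It assumes one partial sum per process and, for simplicity of the comments, a power-of-two process count; the partial-sum value is a placeholder:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double s = rank + 1.0;   /* this process's partial sum (placeholder) */

        /* About log2(p) rounds: in each round, half of the remaining
           processes send their value to a partner and drop out. */
        for (int stride = 1; stride < nprocs; stride *= 2) {
            if (rank % (2 * stride) == stride) {
                MPI_Send(&s, 1, MPI_DOUBLE, rank - stride, 0, MPI_COMM_WORLD);
                break;                       /* this process is done */
            } else if (rank % (2 * stride) == 0 && rank + stride < nprocs) {
                double tmp;
                MPI_Recv(&tmp, 1, MPI_DOUBLE, rank + stride, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                s += tmp;
            }
        }
        if (rank == 0) printf("tree sum = %g\n", s);

        MPI_Finalize();
        return 0;
    }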

31 Problem Size is Critical
Suppose the total work is n + P, with serial work P and parallel work n.
Then the serial fraction is s = P / (n + P), so the Amdahl bound improves as n grows relative to P.
[Figure: Amdahl's-law speedup bound as a function of problem size n.]
In general we seek to exploit a fraction of the peak parallelism in the problem.
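
A worked instance with illustrative numbers: with P = 100 processors (and hence serial work 100 in this model) and parallel work n = 10^4, the serial fraction is s = 100/10100, roughly 0.01, and Amdahl's bound gives a speedup of about 50. Growing the problem to n = 10^6 drops s to about 0.0001 and raises the bound to about 99, so the same 100 processors become nearly twice as effective on the larger problem.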

32 Load Balance
- Insufficient concurrency appears as load imbalance.
- Using a coarser grain tends to increase load imbalance.
- Poor assignment of tasks can cause load imbalance.
- Synchronization waits are instantaneous load imbalance.
[Figure: concurrency profile showing idle time when P does not divide n, and idle time due to serialization.]
    Speedup(P) <= Work(1) / max_p ( Work(p) + idle(p) )

33 Extra Work
There is always some amount of extra work needed to manage parallelism, e.g., deciding who is to do what.
    Speedup(P) <= Work(1) / max_p ( Work(p) + idle(p) + extra(p) )

34 Communication and Synchronization
There are many ways to reduce communication costs.
- Coordinating action (synchronization) requires communication.
- Getting data from where it is produced to where it is used does too.

35 Reducing Communication Costs
- Coordinate the placement of work and data to eliminate unnecessary communication.
- Replicate data.
- Perform redundant work.
- Perform the required communication efficiently: e.g., transfer size, contention, machine-specific optimizations.

36 The Tension
- Fine-grained decomposition and flexible assignment tend to minimize load imbalance, at the cost of increased communication.
- In many problems, communication goes like the surface-to-volume ratio of the subdomains.
- Larger grain => larger transfers, fewer synchronization events.
- Simple static assignment reduces extra work, but may yield load imbalance.
- Minimizing one cost tends to increase the others.

37 The Good News
The basic work component in the parallel program may be more efficient than in the sequential case:
- Only a small fraction of the problem fits in cache, so the problem must be chopped into pieces anyway to get good cache performance.
- Indeed, the best sequential program may emulate the parallel one.
Communication can be hidden behind computation:
- This may lead to better algorithms for memory hierarchies.
Parallel algorithms may lead to better serial ones:
- For example, parallel search may explore the space more effectively.