CS 267: Applications of Parallel Computers
Lecture 3: Introduction to Parallel Architectures and Programming Models
Jim Demmel, Spring 1999

Recap of Last Lecture
° The actual performance of even a simple program can be a complicated function of the architecture.
° Slight changes in the architecture or the program may change performance significantly.
° Since we want to write fast programs, we must take the architecture into account, even on uniprocessors.
° Since the actual performance is so complicated, we need simple models to help us design efficient algorithms.
° We illustrated this with a common technique for improving cache performance, called blocking, applied to matrix multiplication.
  - Blocking works for many architectures, but choosing the block size depends on the architecture.

Outline
° Parallel machines and programming models
° Steps in writing a parallel program
° Cost modeling and performance trade-offs

Parallel Machines and Programming Models

A generic parallel architecture
[Diagram: processors (P) and memory modules (M) connected by an interconnection network.]
° Key question: where does the memory go?

Parallel Programming Models
° Control
  - how parallelism is created
  - what orderings exist between operations
  - how different threads of control synchronize
° Naming
  - what data is private vs. shared
  - how logically shared data is accessed or communicated
° Set of operations
  - what the basic operations are
  - what operations are considered atomic
° Cost
  - how we account for the cost of each of the above

Trivial Example: s = f(A[1]) + ... + f(A[n])
° Parallel decomposition:
  - each function evaluation and each partial sum is a task.
° Assign n/p numbers to each of p processors:
  - each computes independent "private" results and a partial sum
  - one (or all) collects the p partial sums and computes the global sum.
° Classes of data:
  - logically shared: the original n numbers, the global sum
  - logically private: the individual function evaluations
  - what about the individual partial sums?

Programming Model 1: Shared Address Space
° The program consists of a collection of threads of control,
  - each with a set of private variables (e.g., local variables on the stack),
  - collectively with a set of shared variables (e.g., static variables, shared common blocks, the global heap).
° Threads communicate implicitly by writing and reading shared variables.
° Threads coordinate explicitly by synchronization operations on shared variables:
  - writing and reading flags
  - locks, semaphores.
° Like concurrent programming on a uniprocessor.
[Diagram: several threads, each with private variables, all reading and writing a shared array A.]

Machine Model 1: Shared Memory
[Diagram: processors P1..Pn, each with a cache ($), connected by a network to a common memory.]
° A shared memory machine: all processors are connected to a large shared memory.
° "Local" memory is not (usually) part of the hardware.
  - Examples: Sun, DEC, and Intel "SMPs" (symmetric multiprocessors) in Millennium; SGI Origin.
° Cost: accessing the cache is much cheaper than accessing main memory.
° Machine Model 1a: a shared address space machine (e.g., Cray T3E)
  - replace the caches by local memories (in the abstract machine model)
  - this affects the cost model: repeatedly accessed data should be copied.

Shared Memory code for computing a sum
[s = 0 initially]

Thread 1:
    local_s1 = 0
    for i = 0, n/2-1
        local_s1 = local_s1 + f(A[i])
    s = s + local_s1

Thread 2:
    local_s2 = 0
    for i = n/2, n-1
        local_s2 = local_s2 + f(A[i])
    s = s + local_s2

What could go wrong?

Pitfall and solution via synchronization
° Pitfall in computing the global sum s = local_s1 + local_s2:

Thread 1 (initially s = 0)              Thread 2 (initially s = 0)
    load s   [from mem to reg]              load s   [from mem to reg; still 0]
    s = s + local_s1  [= local_s1, in reg]  s = s + local_s2  [= local_s2, in reg]
    store s  [from reg to mem]              store s  [from reg to mem]

° Instructions from different threads can be interleaved arbitrarily.
° What can the final result s stored in memory be? This is a race condition.
° Possible solution: mutual exclusion with locks.

Thread 1                                Thread 2
    lock                                    lock
    load s                                  load s
    s = s + local_s1                        s = s + local_s2
    store s                                 store s
    unlock                                  unlock

° Locks must be atomic (execute completely without interruption).
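For concreteness, here is one way the locked version of this sum might look in a real shared address space program. This is a minimal sketch using POSIX threads (the lecture does not prescribe a particular threads library); the array contents, f, N, and NTHREADS are illustrative choices, not part of the original slide.

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 2

    static double A[N];
    static double s = 0.0;                          /* logically shared global sum */
    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

    static double f(double x) { return x * x; }     /* placeholder for f */

    static void *worker(void *arg) {
        long k = (long)arg;                         /* thread index */
        long lo = k * (N / NTHREADS);
        long hi = (k + 1) * (N / NTHREADS);
        double local_s = 0.0;                       /* logically private partial sum */
        for (long i = lo; i < hi; i++)
            local_s += f(A[i]);
        pthread_mutex_lock(&s_lock);                /* lock: load/add/store of s is now atomic */
        s += local_s;
        pthread_mutex_unlock(&s_lock);              /* unlock */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < N; i++) A[i] = 1.0;
        for (long k = 0; k < NTHREADS; k++)
            pthread_create(&tid[k], NULL, worker, (void *)k);
        for (long k = 0; k < NTHREADS; k++)
            pthread_join(tid[k], NULL);
        printf("s = %g\n", s);                      /* without the lock, a partial sum could be lost */
        return 0;
    }

Removing the mutex calls reproduces the race described above: both threads can load the same old value of s, and one of the partial sums is silently lost.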

Programming Model 2: Message Passing
° The program consists of a collection of named processes:
  - a thread of control plus a local address space (local variables, static variables, common blocks, heap).
° Processes communicate by explicit data transfers:
  - a matching pair of send and receive by the source and destination processes.
° Coordination is implicit in every communication event.
° Logically shared data is partitioned over the local processes.
° Like distributed programming.
[Diagram: processes with private address spaces exchanging data, e.g., send X to P0, receive Y from Pn.]
° Programs are written with standard libraries: MPI, PVM.

Machine Model 2: Distributed Memory
° A distributed memory machine:
  - examples: Cray T3E (again!), IBM SP2, NOW, Millennium.
° Each processor is connected to its own memory (and caches) and cannot directly access another processor's memory.
° Each "node" has a network interface (NI); all communication and synchronization is done through it.
[Diagram: nodes, each containing a processor, memory, and network interface, connected by an interconnect.]

Computing s = x(1) + x(2) on each processor
° First possible solution:

Processor 1                              Processor 2
    send xlocal, proc2  [xlocal = x(1)]      receive xremote, proc1
    receive xremote, proc2                   send xlocal, proc1  [xlocal = x(2)]
    s = xlocal + xremote                     s = xlocal + xremote

° Second possible solution (what could go wrong?):

Processor 1                              Processor 2
    send xlocal, proc2  [xlocal = x(1)]      send xlocal, proc1  [xlocal = x(2)]
    receive xremote, proc2                   receive xremote, proc1
    s = xlocal + xremote                     s = xlocal + xremote

° What if send/receive acts like the telephone system? Like the post office?
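In MPI terms (an assumed rendering; the slide uses generic send/receive pseudocode), the "what could go wrong" case is a deadlock: if sends behave like the telephone system (synchronous), both processes block in send waiting for a matching receive. A minimal two-process sketch that sidesteps the problem by pairing each send with a receive:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* run with exactly 2 processes */
        int other = 1 - rank;
        double xlocal = (rank == 0) ? 1.0 : 2.0;     /* stand-ins for x(1) and x(2) */
        double xremote;

        /* Combined send+receive: cannot deadlock even if plain sends are synchronous. */
        MPI_Sendrecv(&xlocal,  1, MPI_DOUBLE, other, 0,
                     &xremote, 1, MPI_DOUBLE, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        double s = xlocal + xremote;                 /* both processes now hold s = x(1) + x(2) */
        printf("rank %d: s = %g\n", rank, s);
        MPI_Finalize();
        return 0;
    }

The "post office" behavior corresponds to buffered sends, under which the slide's second solution also completes.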

Programming Model 3: Data Parallel
° A single sequential thread of control consisting of parallel operations.
° Parallel operations are applied to all (or a defined subset) of a data structure.
° Communication is implicit in parallel operators and "shifted" data structures.
° Elegant and easy to understand and reason about, but not all problems fit this model.
° Like marching in a regiment. Think of Matlab:

    A  = array of all data
    fA = f(A)
    s  = sum(fA)
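The slide's example is Matlab-style array notation. As a rough C stand-in (an assumption for illustration, not what the lecture used), the same computation can be written as a parallel loop whose n-fold parallelism the runtime maps onto the available threads; N and f are made-up names here.

    #include <stdio.h>

    #define N 1000000

    static double f(double x) { return x * x; }      /* illustrative f */

    int main(void) {
        static double A[N], fA[N];
        for (int i = 0; i < N; i++) A[i] = 1.0;

        double s = 0.0;
        /* One logical thread of control; the loop itself is the parallel operator. */
        #pragma omp parallel for reduction(+:s)
        for (int i = 0; i < N; i++) {
            fA[i] = f(A[i]);                         /* fA = f(A) */
            s += fA[i];                              /* s = sum(fA) */
        }
        printf("s = %g\n", s);                       /* expect N * f(1.0) = 1e6 */
        return 0;
    }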

Machine Model 3: SIMD
° A SIMD (Single Instruction Multiple Data) machine: a large number of small processors.
° A single "control processor" issues each instruction:
  - each processor executes the same instruction
  - some processors may be turned off on any given instruction.
[Diagram: a control processor driving nodes, each with a processor, memory, and network interface, over an interconnect.]
° The machines are no longer popular (e.g., CM-2), but the programming model lives on: it is implemented by mapping the n-fold parallelism onto p processors, mostly in compilers (HPF = High Performance Fortran).

Machine Model 4: Clusters of SMPs
° Since small shared memory machines (SMPs) are the fastest commodity machines, why not build a larger machine by connecting many of them with a network?
° CLUMP = Cluster of SMPs.
° Shared memory within one SMP, message passing outside.
° Examples: Millennium, ASCI Red (Intel), ...
° Programming model?
  - Treat the machine as "flat" and always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy).
  - Expose two layers: shared memory and message passing (higher performance, but ugly to program); see the sketch below.
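The "two layers" option might look like the following sketch (an assumed hybrid MPI + OpenMP formulation; the slide names the trade-off, not this code). OpenMP handles shared memory within one SMP, MPI handles message passing between SMPs; N_PER_PROC and f are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    #define N_PER_PROC 1000000

    static double f(double x) { return x * x; }      /* illustrative f */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                      /* one MPI process per SMP node */
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local_s = 0.0;
        /* Shared memory layer: threads within the SMP combine into local_s via a reduction. */
        #pragma omp parallel for reduction(+:local_s)
        for (int i = 0; i < N_PER_PROC; i++)
            local_s += f(1.0);                       /* stand-in for f(A[i]) on local data */

        /* Message passing layer: combine the per-node partial sums across the cluster. */
        double s = 0.0;
        MPI_Reduce(&local_s, &s, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("s = %g\n", s);

        MPI_Finalize();
        return 0;
    }

Only the master thread makes MPI calls here, so plain MPI_Init suffices; a hybrid code in which threads call MPI would need MPI_Init_thread with an appropriate threading level.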

Programming Model 5: Bulk Synchronous
° Used within the message passing or shared memory models as a programming convention.
° Phases are separated by global barriers:
  - compute phases: all processes operate on local data (in distributed memory), or have read access to global data (in shared memory)
  - communication phases: all processes participate in a rearrangement or reduction of global data.
° Generally all processes do the "same thing" in a phase: all apply f, but may do different things within f.
° The simplicity of data parallelism without its restrictions.
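One bulk synchronous "superstep" might be written as below (a sketch in MPI terms; the slide describes the convention, not specific code, and NLOCAL, NSTEPS, and f are made up). Each compute phase touches only local data, and the collective reduction both communicates and acts as the global barrier separating phases.

    #include <mpi.h>
    #include <stdio.h>

    #define NLOCAL 1000
    #define NSTEPS 10

    static double f(double x) { return 0.5 * x; }    /* illustrative local update */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local[NLOCAL], global_sum = 0.0;
        for (int i = 0; i < NLOCAL; i++) local[i] = 1.0;

        for (int step = 0; step < NSTEPS; step++) {
            /* Compute phase: every process operates only on its local data. */
            double local_sum = 0.0;
            for (int i = 0; i < NLOCAL; i++) {
                local[i] = f(local[i]);
                local_sum += local[i];
            }

            /* Communication phase: all processes take part in a reduction of global data;
               the collective also serves as the global barrier between phases. */
            MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        }

        if (rank == 0) printf("final global sum = %g\n", global_sum);
        MPI_Finalize();
        return 0;
    }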

Summary so far
° Historically, each parallel machine was unique, along with its programming model and programming language.
° You had to throw away your software and start over with each new kind of machine. Ugh.
° Now we distinguish the programming model from the underlying machine, so we can write portably correct code that runs on many machines.
  - MPI is now the most portable option, but it can be tedious.
° Writing portably fast code still requires tuning for the architecture.
  - The algorithm design challenge is to make this tuning easy.
  - Example: picking a block size, not rewriting the whole algorithm.

Steps in Writing Parallel Programs

Creating a Parallel Program
° Pieces of the job:
  - identify work that can be done in parallel
  - partition the work, and perhaps the data, among processes (= threads)
  - manage data access, communication, and synchronization.
° Goal: maximize the speedup due to parallelism.

    Speedup(P procs) = (time to solve the problem with the "best" sequential solution)
                       / (time to solve the problem in parallel on P processors)
                     <= P        (Brent's Theorem)
    Efficiency(P)    = Speedup(P) / P <= 1

  - For example, if the best sequential time is 100 s and the parallel time on 8 processors is 20 s, then Speedup(8) = 5 and Efficiency(8) = 0.625.
° Key question: when can you solve each piece?
  - statically, if the information is known in advance
  - dynamically, otherwise.

Steps in the Process
° Task: an arbitrarily defined piece of work that forms the basic unit of concurrency.
° Process/Thread: an abstract entity that performs tasks.
  - Tasks are assigned to threads via an assignment mechanism.
  - Threads must coordinate to accomplish their collective tasks.
° Processor: the physical entity that executes a thread.

    Overall computation --Decomposition--> grains of work --Assignment--> processes/threads
      --Orchestration--> processes/threads --Mapping--> processors

Decomposition
° Break the overall computation into grains of work (tasks):
  - identify concurrency and decide at what level to exploit it
  - concurrency may be statically identifiable or may vary dynamically
  - it may depend only on the problem size, or on the particular input data.
° Goal: enough tasks to keep the target range of processors busy, but not too many.
  - The decomposition establishes an upper limit on the number of useful processors (i.e., on scaling).

Assignment
° Determine a mechanism for dividing the work among threads:
  - functional partitioning: assign logically distinct aspects of the work to different threads (e.g., pipelining)
  - structural mechanisms: assign iterations of a "parallel loop" according to a simple rule, e.g., thread j gets iterations j*n/p through (j+1)*n/p - 1 (see the sketch below), or throw tasks into a bowl (task queue) and let threads feed
  - data/domain decomposition: the data describing the problem has a natural decomposition; break up the data and assign the work associated with each region (e.g., parts of the physical system being simulated).
° Goals: balance the workload to keep everyone busy (all the time), and allow efficient orchestration.
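As a small illustration of the structural rule quoted above (thread j gets iterations j*n/p through (j+1)*n/p - 1), here is how the index arithmetic plays out; the values of n and p are arbitrary.

    #include <stdio.h>

    int main(void) {
        int n = 10, p = 4;                          /* illustrative problem size and thread count */
        for (int j = 0; j < p; j++) {
            int lo = j * n / p;                     /* integer division spreads any remainder */
            int hi = (j + 1) * n / p - 1;
            printf("thread %d: iterations %d..%d (%d of them)\n", j, lo, hi, hi - lo + 1);
        }
        return 0;
    }

When n is not divisible by p, the blocks produced by this rule differ in size by at most one iteration, which is the kind of mild imbalance quantified on the Load Balance slide later.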

Orchestration
° Provide a means of naming and accessing shared data, and of communication and coordination among threads of control.
° Goals:
  - correctness of the parallel solution: respect the inherent dependencies within the algorithm
  - avoid serialization
  - reduce the cost of communication, synchronization, and management
  - preserve locality of data reference.

Mapping
° Binding processes to physical processors.
° The time to reach a processor across the network does not (roughly) depend on which processor:
  - there is a lot of old literature on "network topology", but it is no longer so important.
° The basic issue is how many remote accesses there are.
[Diagram: processor -> cache (fast) -> local memory (slow) -> network to a remote node's memory (really slow).]

Example: s = f(A[1]) + ... + f(A[n])
° Decomposition:
  - computing each f(A[j])
  - n-fold parallelism, where n may be >> p
  - computing the sum s.
° Assignment:
  - thread k sums sk = f(A[k*n/p]) + ... + f(A[(k+1)*n/p - 1])
  - thread 1 sums s = s1 + ... + sp (for simplicity of this example)
  - thread 1 communicates s to the other threads.
° Orchestration:
  - starting up threads
  - communicating and synchronizing with thread 1.
° Mapping: processor j runs thread j.

Administrative Issues
° Assignment 2 will be on the home page later today:
  - Matrix Multiply contest
  - find a partner (outside of your own department)
  - due in 2 weeks.
° Lab/discussion section will be 5-6pm Tuesdays.
° Reading assignment (optional):
  - Chapter 1 of the Culler/Singh book
  - Chapters 1 and 2 of ...

Cost Modeling and Performance Trade-offs

Identifying enough Concurrency
° Amdahl's Law: let s be the fraction of the total work done sequentially; then

    Speedup(P) <= 1 / (s + (1 - s)/P)

° Simple decomposition: each f(A[i]) is a parallel task; the sum is sequential.
[Parallelism profile: concurrency n during the n x time(f) phase, then concurrency 1 during the 1 x time(sum(n)) phase; the area under the profile is the total work done. After mapping onto p processors, the parallel phase runs at concurrency p for p x (n/p) x time(f).]
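The bound above is easy to evaluate; the following sketch tabulates it for an illustrative serial fraction (the value of s is made up, not taken from the lecture).

    #include <stdio.h>

    /* Amdahl's Law: with serial fraction s, Speedup(P) <= 1 / (s + (1 - s)/P). */
    static double amdahl_bound(double s, double P) {
        return 1.0 / (s + (1.0 - s) / P);
    }

    int main(void) {
        double s = 0.01;                            /* suppose 1% of the work is sequential */
        for (int P = 1; P <= 1024; P *= 4)
            printf("P = %4d   speedup <= %6.1f\n", P, amdahl_bound(s, (double)P));
        return 0;                                   /* the bound approaches 1/s = 100 as P grows */
    }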

Algorithmic Trade-offs
° Parallelize the partial sums of the f's:
  - what fraction of the computation is "sequential"?
  - what does this do for communication? for locality?
  - what if you sum what you "own"?
[Parallelism profile: p x (n/p) x time(f), then p x time(sum(n/p)), then 1 x time(sum(p)).]
° Parallelize the final summation (tree sum):
[Parallelism profile: p x (n/p) x time(f), then p x time(sum(n/p)), then 1 x time(sum(log2 p)).]
° Generalize Amdahl's Law for an arbitrary "ideal" parallelism profile.
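To make the last profile concrete: the tree sum combines the p partial sums in about log2(p) rounds rather than p - 1 sequential additions. A minimal sketch of the combining pattern (a serial simulation of it; the partial sum values and P are illustrative):

    #include <stdio.h>

    #define P 8                                      /* number of partial sums s1..sP */

    int main(void) {
        double sk[P];
        for (int k = 0; k < P; k++) sk[k] = k + 1.0; /* illustrative partial sums */

        /* Tree summation: log2(P) rounds; in each round, thread k combines with thread k+stride. */
        for (int stride = 1; stride < P; stride *= 2)
            for (int k = 0; k + stride < P; k += 2 * stride)
                sk[k] += sk[k + stride];

        printf("s = %g\n", sk[0]);                   /* expect P*(P+1)/2 = 36 */
        return 0;
    }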

Problem Size is Critical
° Suppose the total work = n + P, with serial work P and parallel work n.
° Then s = serial fraction = P / (n + P), so the serial fraction shrinks as n grows relative to P.
  - For example, with P = 100 processors and n = 10^6, s is about 10^-4.
[Plot: Amdahl's Law speedup bounds as a function of n.]
° In general we seek to exploit only a fraction of the peak parallelism in the problem.

Load Balance
° Insufficient concurrency will appear as load imbalance.
° Use of a coarser grain tends to increase load imbalance.
° Poor assignment of tasks can cause load imbalance.
° Synchronization waits are instantaneous load imbalance.
[Concurrency profile: idle time when n does not divide evenly by P, and idle time due to serialization.]

    Speedup(P) <= Work(1) / max over processors p of ( Work(p) + idle(p) )

Extra Work
° There is always some amount of extra work to manage parallelism, e.g., to decide who is to do what.

    Speedup(P) <= Work(1) / max over processors p of ( Work(p) + idle(p) + extra(p) )

Communication and Synchronization
° Coordinating action (synchronization) requires communication.
° Getting data from where it is produced to where it is used does too.
° There are many ways to reduce communication costs.

Reducing Communication Costs
° Coordinating the placement of work and data to eliminate unnecessary communication.
° Replicating data.
° Redundant work.
° Performing the required communication efficiently (e.g., transfer size, contention, machine-specific optimizations).

The Tension
° Fine-grained decomposition and flexible assignment tend to minimize load imbalance at the cost of increased communication.
  - In many problems, communication goes like the surface-to-volume ratio.
  - Larger grain => larger transfers, fewer synchronization events.
° Simple static assignment reduces extra work, but may yield load imbalance.
° Minimizing one cost tends to increase the others.

The Good News
° The basic work component in the parallel program may be more efficient than in the sequential case:
  - only a small fraction of the problem fits in cache
  - you need to chop the problem up into pieces and concentrate on them to get good cache performance, similar to the parallel case
  - indeed, the best sequential program may emulate the parallel one.
° Communication can be hidden behind computation.
  - This may lead to better algorithms for memory hierarchies.
° Parallel algorithms may lead to better serial ones:
  - parallel search may explore the space more effectively.