1/46 PARALLEL SOFTWARE (SECTION 2.4)

2/46 The burden is on software
From now on…
In shared memory programs: start a single process and fork threads. Threads carry out tasks.
In distributed memory programs: start multiple processes. Processes carry out tasks.
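As one concrete picture of the shared-memory side, here is a minimal fork/join sketch in Pthreads (the thread count and Thread_work function are placeholders for illustration, not from the slides):

/* Fork/join sketch: the main (master) thread forks workers and joins them. */
#include <pthread.h>
#include <stdio.h>

#define THREAD_COUNT 4

void* Thread_work(void* rank) {
   long my_rank = (long) rank;
   printf("Thread %ld carrying out its task\n", my_rank);
   return NULL;
}

int main(void) {
   pthread_t handles[THREAD_COUNT];

   /* Fork the worker threads. */
   for (long t = 0; t < THREAD_COUNT; t++)
      pthread_create(&handles[t], NULL, Thread_work, (void*) t);

   /* Wait for the workers to finish their tasks. */
   for (long t = 0; t < THREAD_COUNT; t++)
      pthread_join(handles[t], NULL);

   return 0;
}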

3/46 SPMD – single program multiple data
An SPMD program consists of a single executable that can behave as if it were multiple different programs through the use of conditional branches.

if (I'm thread/process i)
   do this;
else
   do that;
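One concrete way to instantiate this pattern is with OpenMP, where every thread executes the same block and branches on its rank (a sketch; compile with -fopenmp):

/* SPMD: one executable, behavior selected by each thread's rank. */
#include <stdio.h>
#include <omp.h>

int main(void) {
#  pragma omp parallel
   {
      int my_rank = omp_get_thread_num();

      if (my_rank == 0)
         printf("Thread %d > doing this\n", my_rank);
      else
         printf("Thread %d > doing that\n", my_rank);
   }
   return 0;
}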

4/46 Writing Parallel Programs

double x[n], y[n];
…
for (i = 0; i < n; i++)
   x[i] += y[i];

1. Divide the work among the processes/threads
   (a) so each process/thread gets roughly the same amount of work
   (b) and communication is minimized.
2. Arrange for the processes/threads to synchronize.
3. Arrange for communication among processes/threads.
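For step 1, a common way to divide this particular loop is block partitioning. A sketch, assuming my_rank and thread_count identify the calling thread (these names are not defined on the slide):

/* Assign a contiguous block of iterations to each thread. */
void Add_my_block(double x[], double y[], int n,
                  int my_rank, int thread_count) {
   int quotient  = n / thread_count;
   int remainder = n % thread_count;
   int my_count, my_first;

   /* Give the first 'remainder' threads one extra iteration each. */
   if (my_rank < remainder) {
      my_count = quotient + 1;
      my_first = my_rank * my_count;
   } else {
      my_count = quotient;
      my_first = my_rank * my_count + remainder;
   }

   for (int i = my_first; i < my_first + my_count; i++)
      x[i] += y[i];
}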

5/46 Shared Memory
Dynamic threads: a master thread waits for work and forks new threads; when the threads are done, they terminate. Efficient use of resources, but thread creation and termination are time consuming.
Static threads: a pool of threads is created and allocated work; the threads do not terminate until cleanup. Better performance, but potential waste of system resources.

6/46 Nondeterminism

...
printf("Thread %d > my_val = %d\n", my_rank, my_x);
...

One run might print
Thread 0 > my_val = 7
Thread 1 > my_val = 19
and another run
Thread 1 > my_val = 19
Thread 0 > my_val = 7

7/46 Nondeterminism

my_val = Compute_val(my_rank);
x += my_val;

8/46 Nondeterminism
Race condition. Critical section. Mutually exclusive. Mutual exclusion lock (mutex, or simply lock).

my_val = Compute_val(my_rank);
Lock(&add_my_val_lock);
x += my_val;
Unlock(&add_my_val_lock);
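A concrete version of this critical section written with a Pthreads mutex (a sketch; Compute_val is the slide's function, assumed to be defined elsewhere):

#include <pthread.h>

double Compute_val(int my_rank);   /* from the slide; defined elsewhere */

double x = 0.0;                    /* shared variable being updated */
pthread_mutex_t add_my_val_lock = PTHREAD_MUTEX_INITIALIZER;

void Add_my_val(int my_rank) {
   double my_val = Compute_val(my_rank);    /* local work: no race here */

   pthread_mutex_lock(&add_my_val_lock);    /* enter critical section */
   x += my_val;                             /* only one thread at a time */
   pthread_mutex_unlock(&add_my_val_lock);  /* leave critical section */
}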

9/46 Busy-waiting

my_val = Compute_val(my_rank);
if (my_rank == 1)
   while (!ok_for_1);   /* Busy-wait loop */
x += my_val;            /* Critical section */
if (my_rank == 0)
   ok_for_1 = true;     /* Let thread 1 update x */

10/46 Message-passing

char message[100];
...
my_rank = Get_rank();
if (my_rank == 1) {
   sprintf(message, "Greetings from process 1");
   Send(message, MSG_CHAR, 100, 0);
} else if (my_rank == 0) {
   Receive(message, MSG_CHAR, 100, 1);
   printf("Process 0 > Received: %s\n", message);
}
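For comparison, here is the same exchange written with actual MPI calls (a minimal sketch; compile with mpicc and run with at least two processes):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[]) {
   char message[100];
   int  my_rank;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   if (my_rank == 1) {
      sprintf(message, "Greetings from process 1");
      MPI_Send(message, strlen(message) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
   } else if (my_rank == 0) {
      MPI_Recv(message, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("Process 0 > Received: %s\n", message);
   }

   MPI_Finalize();
   return 0;
}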

11/46 Partitioned Global Address Space Languages

shared int n = ... ;
shared double x[n], y[n];
private int i, my_first_element, my_last_element;

my_first_element = ... ;
my_last_element = ... ;

/* Initialize x and y */
...

for (i = my_first_element; i <= my_last_element; i++)
   x[i] += y[i];

12/46 Input and Output In distributed memory programs, only process 0 will access stdin. In shared memory programs, only the master thread or thread 0 will access stdin. In both distributed memory and shared memory programs all the processes/threads can access stdout and stderr.

13/46 Input and Output However, because of the indeterminacy of the order of output to stdout, in most cases only a single process/thread will be used for all output to stdout other than debugging output. Debug output should always include the rank or id of the process/thread that’s generating the output.

14/46 Input and Output Only a single process/thread will attempt to access any single file other than stdin, stdout, or stderr. So, for example, each process/thread can open its own, private file for reading or writing, but no two processes/threads will open the same file.

15/46 PERFORMANCE

16/46 Speedup
Number of cores = p
Serial run-time = T_serial
Parallel run-time = T_parallel

T_parallel = T_serial / p   (linear speedup)

17/46 Speedup of a parallel program

S = T_serial / T_parallel

18/46 Efficiency of a parallel program

E = S / p = (T_serial / T_parallel) / p = T_serial / (p · T_parallel)
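The tables behind the next few slides are not reproduced in the transcript; as a stand-in, the definitions can be evaluated directly. A sketch with placeholder run-times (not measurements from the slides):

#include <stdio.h>

int main(void) {
   double T_serial   = 20.0;  /* assumed serial run-time, in seconds */
   double T_parallel = 6.5;   /* assumed parallel run-time on p cores */
   int    p          = 4;

   double S = T_serial / T_parallel;  /* speedup */
   double E = S / p;                  /* efficiency = T_serial / (p * T_parallel) */

   printf("S = %.2f, E = %.2f\n", S, E);
   return 0;
}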

19/46 Speedups and efficiencies of a parallel program

20/46 Speedups and efficiencies of a parallel program on different problem sizes

21/46 Speedup

22/46 Efficiency

23/46 Effect of overhead

T_parallel = T_serial / p + T_overhead

24/46 Amdahl’s Law Unless virtually all of a serial program is parallelized, the possible speedup is going to be very limited — regardless of the number of cores available.

25/46 Example
We can parallelize 90% of a serial program. Parallelization is “perfect” regardless of the number of cores p we use. T_serial = 20 seconds.

Runtime of the parallelizable part is 0.9 × T_serial / p = 18 / p

26/46 Example (cont.)
Runtime of the “unparallelizable” part is 0.1 × T_serial = 2

Overall parallel run-time is T_parallel = 0.9 × T_serial / p + 0.1 × T_serial = 18 / p + 2

27/46 Example (cont.)
Speedup:

S = T_serial / (0.9 × T_serial / p + 0.1 × T_serial) = 20 / (18 / p + 2)
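A quick way to see Amdahl's bound in this example is to tabulate the slide's formula S = 20 / (18/p + 2) for growing p (a sketch; the numbers come only from the formula above):

#include <stdio.h>

int main(void) {
   for (int p = 1; p <= 1024; p *= 2) {
      double S = 20.0 / (18.0 / p + 2.0);
      printf("p = %4d  S = %.2f\n", p, S);
   }
   /* As p grows, S approaches 20/2 = 10: the serial 10% caps the speedup. */
   return 0;
}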

28/46 Scalability
In general, a problem is scalable if it can handle ever-increasing problem sizes. If we can increase the number of processes/threads and keep the efficiency fixed without increasing the problem size, the program is strongly scalable. If we can keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of processes/threads, the program is weakly scalable.

29/46 Taking Timings What is time? Start to finish? A program segment of interest? CPU time? Wall clock time?

30/46 Taking Timings
Theoretical: a timer function
MPI: MPI_Wtime
OpenMP: omp_get_wtime

31/46 Taking Timings

32/46 Taking Timings
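The timing figures for slides 31 and 32 are not reproduced in the transcript. As a concrete illustration, the wall-clock time of a code segment can be measured with omp_get_wtime (a sketch; the loop is just a placeholder workload, and the program needs -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
   double sum = 0.0;

   double start = omp_get_wtime();
   for (int i = 0; i < 100000000; i++)   /* segment of interest */
      sum += 1.0 / (i + 1.0);
   double finish = omp_get_wtime();

   printf("sum = %f, elapsed = %e seconds\n", sum, finish - start);
   return 0;
}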

33/46 PARALLEL PROGRAM DESIGN

34/46 Foster’s methodology
1. Partitioning: divide the computation to be performed and the data operated on by the computation into small tasks. The focus here should be on identifying tasks that can be executed in parallel.

35/46 Foster’s methodology
2. Communication: determine what communication needs to be carried out among the tasks identified in the previous step.

36/46 Foster’s methodology
3. Agglomeration or aggregation: combine tasks and communications identified in the first step into larger tasks. For example, if task A must be executed before task B can be executed, it may make sense to aggregate them into a single composite task.

37/46 Foster’s methodology
4. Mapping: assign the composite tasks identified in the previous step to processes/threads. This should be done so that communication is minimized, and each process/thread gets roughly the same amount of work.

38/46 Example - histogram
1.3, 2.9, 0.4, 0.3, 1.3, 4.4, 1.7, 0.4, 3.2, 0.3, 4.9, 2.4, 3.1, 4.4, 3.9, 0.4, 4.2, 4.5, 4.9, 0.9

39/46 Serial program - input
1. The number of measurements: data_count
2. An array of data_count floats: data
3. The minimum value for the bin containing the smallest values: min_meas
4. The maximum value for the bin containing the largest values: max_meas
5. The number of bins: bin_count

40/46 Serial program - output
1. bin_maxes: an array of bin_count floats
2. bin_counts: an array of bin_count ints
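A minimal serial sketch built around these names (the data are slide 38's measurements; the particular values of min_meas, max_meas, and bin_count, and the bin-search strategy, are assumptions for illustration, not the book's exact code):

#include <stdio.h>

#define MAX_BINS 100
#define MAX_DATA 1000

int main(void) {
   int   data_count = 20, bin_count = 5;
   float min_meas = 0.0f, max_meas = 5.0f;       /* assumed bin range */
   float data[MAX_DATA] = {1.3f, 2.9f, 0.4f, 0.3f, 1.3f, 4.4f, 1.7f, 0.4f,
                           3.2f, 0.3f, 4.9f, 2.4f, 3.1f, 4.4f, 3.9f, 0.4f,
                           4.2f, 4.5f, 4.9f, 0.9f};
   float bin_maxes[MAX_BINS];
   int   bin_counts[MAX_BINS] = {0};

   /* Compute the upper bound of each bin. */
   float bin_width = (max_meas - min_meas) / bin_count;
   for (int b = 0; b < bin_count; b++)
      bin_maxes[b] = min_meas + bin_width * (b + 1);

   /* Place each measurement in the first bin whose upper bound exceeds it. */
   for (int i = 0; i < data_count; i++)
      for (int b = 0; b < bin_count; b++)
         if (data[i] < bin_maxes[b] || b == bin_count - 1) {
            bin_counts[b]++;
            break;
         }

   for (int b = 0; b < bin_count; b++)
      printf("bin %d (max %.1f): %d\n", b, bin_maxes[b], bin_counts[b]);
   return 0;
}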

41/46 First two stages of Foster’s Methodology

42/46 Alternative definition of tasks and communication

43/46 Adding the local arrays
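The figure for this slide is not reproduced in the transcript; it shows the per-thread/process local bin counts being summed into the global bin_counts array. A minimal sketch of that final step, with assumed names (loc_bin_cts holds one local histogram per thread):

/* Sum the per-thread local histograms into the global bin_counts. */
void Sum_local_counts(int bin_counts[], int* loc_bin_cts[],
                      int bin_count, int thread_count) {
   for (int b = 0; b < bin_count; b++) {
      bin_counts[b] = 0;
      for (int t = 0; t < thread_count; t++)
         bin_counts[b] += loc_bin_cts[t][b];
   }
}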

44/46 Concluding Remarks (1)
Serial systems: the standard model of computer hardware has been the von Neumann architecture.
Parallel hardware: Flynn’s taxonomy.
Parallel software: we focus on software for homogeneous MIMD systems, consisting of a single program that obtains parallelism by branching (SPMD programs).

45/46 Concluding Remarks (2) Input and Output We’ll write programs in which one process or thread can access stdin, and all processes can access stdout and stderr. However, because of nondeterminism, except for debug output we’ll usually have a single process or thread accessing stdout.

46/46 Concluding Remarks (3) Performance Speedup Efficiency Amdahl’s law Scalability Parallel Program Design Foster’s methodology