Presentation transcript:

1 Friday, October 06, 2006 Measure twice, cut once. -Carpenter’s Motto

2 Sources of overhead
- Inter-process communication
- Idling
- Replicated computation

4
- Ts: The original single-processor serial time.
- Tis: The additional serial time spent on average for inter-processor communications and setup; depends on N.
- Tp: The original single-processor parallelizable time.
- Tip: The additional time spent on average by each processor on setup and idle time.

5 Simplified expression:
S(N) = (Ts + Tp) / (Ts + N*Tis + Tp/N + Tip)

6 Ts=10, Tip=1, Tis=0. Communication time is negligible compared to computation; this is what you would expect from Amdahl's law alone. The straight line is the reference for linear speedup.

7 Ts=10, Tip=1, Tis=10. Adding a small serial time: now adding more processors results in lower speedup.

8 Ts=10, Tip=1, Tis=1. Quadratic N dependence, e.g. when every processor speaks to all the others.
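
To make the model concrete, here is a small sketch (mine, not from the slides) that evaluates the speedup expression above for the three Tis values just shown; the value Tp = 1000 is an assumption, since the slides do not give Tp.

#include <stdio.h>

/* Speedup model from the slides: Ts = serial time, Tp = parallelizable time,
   Tis = additional serial communication/setup time, Tip = per-processor
   setup/idle overhead, N = number of processors. */
static double speedup(double Ts, double Tp, double Tis, double Tip, double N) {
    return (Ts + Tp) / (Ts + N * Tis + Tp / N + Tip);
}

int main(void) {
    const double Ts = 10.0, Tip = 1.0;
    const double Tp = 1000.0;                  /* assumed; not given on the slides */
    const double TisValues[] = {0.0, 10.0, 1.0};
    for (int c = 0; c < 3; c++)
        for (int N = 1; N <= 64; N *= 2)
            printf("Tis=%4.1f  N=%2d  S=%6.2f\n",
                   TisValues[c], N, speedup(Ts, Tp, TisValues[c], Tip, (double)N));
    return 0;
}

With Tis=10, for example, the printed speedup peaks at a modest N and then falls, matching the behavior described above.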

9
- Adding processors won't provide additional speedup unless the problem is scaled up as well.
- Should not distribute calculations with small Tp/Tis over a large number of processors.

10 Scaling a problem
- Does the number of tasks scale with the problem size?
- An increase in problem size should increase the number of tasks rather than the size of individual tasks.
  - We should then be able to solve larger problems when more processors are available.

11 What can we tell from our observations?
- We implemented an algorithm on parallel computer X and achieved a speedup of 10.8 on 12 processors with problem size N=100.

12 What can we tell from our observations?
- We implemented an algorithm on parallel computer X and achieved a speedup of 10.8 on 12 processors with problem size N=100.
- The region of observation is too narrow.
- What if N=10 or N=1000?

13 What can we tell from our observations?
- T is the execution time, P is the number of processors, and N is the problem size.
- Algorithm 1: T = N + N^2/P
- Algorithm 2: T = (N + N^2)/P + 100
- Algorithm 3: T = (N + N^2)/P + 0.6*P^2
- All of these algorithms achieve a speedup of about 10.8 when P=12 and N=100.
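
A small sketch (mine, not from the slides) that plugs these models in at P=12 and at larger processor counts shows why the narrow observation above is misleading: the three algorithms agree at P=12, N=100 but diverge sharply as P grows. The constants 100 and 0.6 in algorithms 2 and 3 are reconstructed values and should be treated as assumptions.

#include <stdio.h>

int main(void) {
    const double N = 100.0;
    const double Tserial = N + N * N;                 /* corresponding serial time */
    const double P[] = {12.0, 100.0, 1000.0};
    for (int i = 0; i < 3; i++) {
        double p = P[i];
        double T1 = N + N * N / p;                    /* algorithm 1 */
        double T2 = (N + N * N) / p + 100.0;          /* algorithm 2 (constant assumed) */
        double T3 = (N + N * N) / p + 0.6 * p * p;    /* algorithm 3 (constant assumed) */
        printf("P=%4.0f  S1=%6.2f  S2=%6.2f  S3=%6.2f\n",
               p, Tserial / T1, Tserial / T2, Tserial / T3);
    }
    return 0;
}

At P=12 all three print a speedup near 10.8; at P=1000 the first two reach about 92 while the third collapses below 1.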

15 Addition example

16 Addition example. Speedup:
- The ratio of the time taken to solve a problem on a single processor to the time required to solve it on a parallel computer with p identical processing elements.
- Speedup for the addition example?

17 Speedup: the comparison should be with the best known serial algorithm.

18 Efficiency: the fraction of time that a processing element spends doing useful work. E = S/p
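
As a concrete check using the observation quoted earlier: a speedup of S = 10.8 on p = 12 processors corresponds to an efficiency of E = 10.8 / 12 = 0.9, i.e. the processing elements spend 90% of their time doing useful work.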

19 Cost: the product of the parallel runtime and the number of processing elements, i.e. cost = p*Tp. (Note: Tp here stands for the parallel runtime: the time from the moment the parallel computation starts to the moment the last processing element finishes execution.)

20 Cost-optimal: a parallel system is cost-optimal if the cost of solving a problem on the parallel computer has the same asymptotic growth, as a function of input size, as the fastest known sequential algorithm on a single processor. Cost for the addition example: O(n log n).

21 Cost-optimal: a parallel system is cost-optimal if the cost of solving a problem on the parallel computer has the same asymptotic growth, as a function of input size, as the fastest known sequential algorithm on a single processor. Cost for the addition example: O(n log n), so it is not cost-optimal.
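
A worked check, assuming the standard form of the addition example (n numbers summed by a binary-tree reduction on p = n processing elements; the slides state the resulting cost but not this setup): the serial time is Theta(n) and the parallel time is Theta(log n), so the cost is p * Tp = Theta(n log n). Since Theta(n log n) grows faster than the serial Theta(n), the formulation is not cost-optimal; its speedup is Theta(n / log n) and its efficiency is Theta(1 / log n).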

22 Effect of non-cost-optimality

23.

24.

25.

26 If the overhead increases sub-linearly with respect to problem size, we can keep efficiency fixed by increasing both the problem size and the number of processors.

27 Keep efficiency fixed by increasing both the problem size and the number of processors. Scalable parallel systems are those with the ability to utilize an increasing number of processing elements effectively.

28 Scalability and cost-optimality are related: a scalable system can always be made cost-optimal if the number of processing elements and the size of the problem are chosen carefully.
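
A worked illustration of that claim, again using the addition example and assuming fewer processing elements than numbers (a setup not shown on the slides): with p < n processors, each processor first adds its own block of n/p numbers and the p partial sums are then combined in a tree, so Tp = Theta(n/p + log p) and the cost is p * Tp = Theta(n + p log p). Choosing p so that p log p = O(n) keeps the cost at Theta(n), the same asymptotic growth as the serial algorithm, which is the sense in which the number of processing elements and the problem size must be chosen carefully.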

30 Speedup Anomalies: speedup that is greater than linear is called super-linear speedup.

31 Speedup Anomalies: cache effects
- Each processor has a small amount of cache.
- When a problem is executed on a greater number of processors, more of its data can be placed in cache and, as a result, total computation time will tend to decrease.
- If the reduction in computation time due to this cache effect offsets increases in communication and idle time from the use of additional processors, then super-linearity results.
- Similarly, the increased physical memory available in a multiprocessor may reduce the cost of memory accesses by avoiding the need for virtual memory paging.

32 Speedup Anomalies: search anomalies. If a search tree contains solutions at varying depths, then multiple depth-first searches will, on average, explore fewer tree nodes before finding a solution than will a sequential depth-first search.

33 Message Passing
- Partitioned address space
- Data explicitly decomposed and placed by the programmer
- Locality of access
- Cooperation for send/receive operations
- Structured and static requirements

34 Message Passing: most message-passing programs are written using the SPMD (single program, multiple data) model.

35 Message Passing: the need for a standard.

36
- The Message Passing Interface (MPI) standard is the de facto industry standard for parallel applications.
  - Designed by leading industry and academic researchers.
- MPI is a library that is widely used to parallelize scientific and compute-intensive programs.

37 LAM (Indiana University) and MPICH (Argonne National Laboratory, Chicago) are popular open-source implementations of the MPI library.

38
- Implementations of MPI (such as LAM and MPICH) provide an API of library calls that allow users to pass messages between the nodes of a parallel application.
- They run on a wide variety of systems, from desktop workstations and clusters to large supercomputers (and everything in between).

39 MPI: the Message Passing Interface. The minimal set of MPI routines:
MPI_Init        Initializes MPI.
MPI_Finalize    Terminates MPI.
MPI_Comm_size   Determines the number of processes.
MPI_Comm_rank   Determines the label of the calling process.
MPI_Send        Sends a message.
MPI_Recv        Receives a message.

40 Starting and Terminating the MPI Library
- MPI_Init is called prior to any calls to other MPI routines. Its purpose is to initialize the MPI environment.
- MPI_Finalize is called at the end of the computation, and it performs various clean-up tasks to terminate the MPI environment.
- The prototypes of these two functions are:
    int MPI_Init(int *argc, char ***argv)
    int MPI_Finalize()
- MPI_Init also strips off any MPI-related command-line arguments.
- All MPI routines, data types, and constants are prefixed by "MPI_" and are declared in mpi.h. The return code for successful completion is MPI_SUCCESS.

41 Hello World MPI Program

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                    /* initialize the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */
    printf("Hello, world! I am %d of %d\n", rank, size);
    MPI_Finalize();                            /* clean up the MPI environment */
    return 0;
}
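
The minimal routine set above lists MPI_Send and MPI_Recv, which the hello world program does not exercise. A minimal point-to-point sketch (mine, not taken from the slides) in which rank 0 sends one integer to rank 1 could look like this; it needs at least two processes (e.g. mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;   /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);            /* dest=1, tag=0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                   /* src=0, tag=0 */
        printf("Rank 1 received %d from rank 0\n", value);
    }
    MPI_Finalize();
    return 0;
}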

42 LAM: before any MPI programs can be executed, the LAM run-time environment must be launched. This is typically called "booting LAM."

43 LAM
- Before any MPI programs can be executed, the LAM run-time environment must be launched. This is typically called "booting LAM."
- A text file is required that lists the hosts on which to launch the LAM run-time environment. This file is typically referred to as a "boot schema", "hostfile", or "machinefile."

44 Sample machinefile
hpcc.lums.edu.pk
compute-0-0.local
compute-0-1.local
compute-0-2.local
compute-0-3.local
compute-0-4.local
compute-0-5.local
compute-0-6.local

45 LAM settings have been configured on your accounts and the following files have been copied into your home directory:
- ssh_script
- machinefile
- hellompi.c

46 First-time commands (log out of all old sessions and log in again):
source ssh_script

47 First-time commands
source ssh_script
Warning: Permanently added 'compute-0-0.local' (RSA) to the list of known hosts.
/bin/bash
Warning: Permanently added 'compute-0-1.local' (RSA) to the list of known hosts.
/bin/bash
Warning: Permanently added 'compute-0-2.local' (RSA) to the list of known hosts.
/bin/bash
Warning: Permanently added 'compute-0-3.local' (RSA) to the list of known hosts.
/bin/bash
Warning: Permanently added 'compute-0-4.local' (RSA) to the list of known hosts.
/bin/bash
Warning: Permanently added 'compute-0-5.local' (RSA) to the list of known hosts.
/bin/bash
Warning: Permanently added 'compute-0-6.local' (RSA) to the list of known hosts.
/bin/bash

48 First-time commands
source ssh_script
/bin/bash

49 First-time commands
lamboot -v machinefile
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1 ssi:boot:base:linear: booting n0 (hpcc.lums.edu.pk)
n-1 ssi:boot:base:linear: booting n1 (compute-0-0.local)
n-1 ssi:boot:base:linear: booting n2 (compute-0-1.local)
n-1 ssi:boot:base:linear: booting n3 (compute-0-2.local)
n-1 ssi:boot:base:linear: booting n4 (compute-0-3.local)
n-1 ssi:boot:base:linear: booting n5 (compute-0-4.local)
n-1 ssi:boot:base:linear: booting n6 (compute-0-5.local)
n-1 ssi:boot:base:linear: booting n7 (compute-0-6.local)
n-1 ssi:boot:base:linear: finished

50 First-time commands
lamnodes
n0 hpcc.lums.edu.pk:1:origin,this_node
n1 compute-0-0.local:1:
n2 compute-0-1.local:1:
n3 compute-0-2.local:1:
n4 compute-0-3.local:1:
n5 compute-0-4.local:1:
n6 compute-0-5.local:1:
n7 compute-0-6.local:1:

51 First-time commands
mpicc hellompi.c -o hello

52 First-time commands
mpirun -np 8 hello
Hello, world! I am 0 of 8
Hello, world! I am 4 of 8
Hello, world! I am 2 of 8
Hello, world! I am 6 of 8
Hello, world! I am 3 of 8
Hello, world! I am 5 of 8
Hello, world! I am 7 of 8
Hello, world! I am 1 of 8

53 First-time commands
lamhalt
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
lamwipe machinefile
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
lamnodes
It seems that there is no lamd running on the host hpcc.lums.edu.pk. This indicates that the LAM/MPI runtime environment is not operating. The LAM/MPI runtime environment is necessary for the "lamnodes" command. Please run the "lamboot" command to start the LAM/MPI runtime environment. See the LAM/MPI documentation for how to invoke "lamboot" across multiple machines.

54 Sequence whenever you want to run an MPI program:
1. Compile using mpicc.
2. Start the LAM runtime environment using lamboot.
3. Run the MPI program using mpirun.
4. When you are done, shut down the LAM universe using lamhalt and lamwipe.
5. lamclean can be useful if a parallel job crashes, to remove all running programs.