1 Parallel Algorithms Dr. Stephen Tse stephen_tse@qc.edu Lesson 9

2 Parallel Algorithms
Main strategy: divide-conquer-combine.
Main issues:
1. Problem decomposition (data, control, data + control)
2. Process scheduling
3. Communication handling (interconnect topology, size and number of messages)
4. Synchronization
5. Performance analysis and algorithm improvement

3 Problems of Various Complexity
Embarrassingly parallel: no communication and no load imbalance, so parallel efficiency is nearly 100%; such problems can possibly lead to full scalability.
Synchronized parallel: communication is needed and loads are usually balanced, but both are predictable (natural synchronization), e.g. solving simple PDEs; such problems can possibly lead to quasi-scalability.
Asynchronized parallel: communication is needed, loads are usually not balanced, and both are unpredictable (asynchronous), e.g. Monte Carlo methods to locate a global minimum; such problems rarely lead even to quasi-scalability.

4 Illustration of Parallel Complexity (figure)

5 Programming Paradigms
Master-slave
Domain decomposition: data, control
Data parallel
Single program multiple data (SPMD)
Virtual-shared-memory model

6 Illustration of Programming Paradigms (figure)

7 Comparison of Programming Paradigms
Explicit vs. implicit: men vs. machines.
Virtual-shared-memory vs. message passing: Shared-memory architectures tend to be very expensive (an m x n crossbar needs mn hardware switches), so they tend to be very small. Message passing is a powerful way of expressing parallelism. It has been called the "assembly language of parallel computing" because it forces the programmer to deal with the details, but programmers can design extremely sophisticated programs by calling sophisticated algorithms from the recent portable MPI libraries.
Data parallel vs. control parallel: Many parallel algorithms cannot be efficiently converted to the current generation of High Performance Fortran (HPF) data-parallel compilers.
Master-slave vs. the rest: In master-slave mode, one slow processor (one not doing what it is supposed to do) sets t_max and can hold up the entire team. This observation shows that the master-slave scheme is usually very inefficient because of this load-imbalance issue, for example a slow master processor; therefore the master-slave scheme is usually avoided.

8 Illustration of Virtual Shared Memory (figure)

9 Allocation of Partitions on Space- and Time-Sharing Systems (figure)

10 Men vs. Machines (figure)

11 Distributed-Memory MIMD
In this system, each processor has its own private memory. A vertex corresponding to a processor/memory pair is called a node. In a network, some vertices are nodes and some are switches.
A network that connects only nodes is called a static network (e.g. a mesh network, in which every node connects to at least two other nodes).
A network that connects both nodes and switches is called a dynamic network (e.g. a crossbar network, in which every node connects to nodes on the left and the right through switches).
Figure: a generic distributed-memory system, with CPU/memory pairs attached to an interconnection network.

12 Process
A process is a fundamental building block of parallel computing. A process is an instance of a program that is executing on a physical processor. A program is parallel if, at any time during its execution, it can comprise more than one process. In a parallel program, processes can be created, specified, and destroyed. There are also ways to coordinate inter-process interaction.

13 Mutual Exclusion and Binary Semaphore
To ensure that only one process at a time can execute a certain sequence of statements, we arrange for mutual exclusion; the protected sequence of statements is called a critical section.
One approach is to use a shared flag: a shared variable s whose value 1 indicates that the critical section is free and whose value 0 indicates that the region cannot be entered. The code may look like this:

    shared int s = 1;
    while (!s);              /* wait until s == 1 */
    s = 0;                   /* close down access */
    sum = sum + private_x;   /* critical section */
    s = 1;                   /* re-open access */

The problem is that, with multiple processes, the operations that manipulate s are not atomic actions (actions that indivisibly examine or change the program state). While one process is fetching s = 1 into a register to test whether it is OK to enter the critical region, another process can be storing s = 0.
Binary semaphore: in addition to the shared variable, we need functions such that once a process starts to access s, no other process can access it until the original process is done with the access, including the resetting of the value:

    void P(int* s);   /* await (s > 0); s = s - 1 */
    void V(int* s);   /* s = s + 1 */

The function P has the same effect as "while (!s); s = 0;" but it prevents other processes from accessing s once one process gets out of the loop. The function V sets s back to 1, but it does this atomically. A binary semaphore has value either 0 or 1, and a V operation on a binary semaphore is executed only when the semaphore has value 0.
The binary-semaphore idea is somewhat error-prone and forces serial execution of the critical region, so a number of alternatives have been devised.
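The P and V functions above are abstract. As a concrete illustration only, here is a minimal C sketch that protects the same kind of critical section using a POSIX binary semaphore (sem_wait plays the role of P, sem_post the role of V); the thread count and variable names are illustrative, not part of the slide.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define NTHREADS 4

    sem_t s;         /* binary semaphore guarding the critical section */
    int sum = 0;     /* shared variable updated inside the critical section */

    void *worker(void *arg)
    {
        int private_x = *(int *)arg;

        sem_wait(&s);             /* P(s): wait until s > 0, then decrement */
        sum = sum + private_x;    /* critical section */
        sem_post(&s);             /* V(s): increment s, re-open access */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int vals[NTHREADS] = {1, 2, 3, 4};

        sem_init(&s, 0, 1);       /* initial value 1: critical section is free */
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, &vals[i]);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        sem_destroy(&s);

        printf("sum = %d\n", sum);   /* always 10, regardless of interleaving */
        return 0;
    }

Compile with -pthread; the semaphore guarantees that the read-modify-write of sum is never interleaved between threads.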

14 Synchronous and Buffered Communication
The most commonly used method of programming distributed-memory MIMD systems is message passing. The basic idea is to coordinate the processes' activities by explicitly sending and receiving messages. The number of processes is set at the beginning of program execution, and each process is assigned a unique integer rank in the range 0, 1, ..., p-1, where p is the number of processes.
When processes run on different nodes, process 0 can send a "request to send" to process 1 and wait until it receives a "ready to receive" from process 1; only then does it begin transmission of the actual message. This is synchronous communication.
Alternatively, the system can buffer the message: the contents of the message are copied into a system-controlled block of memory, and the sending process can continue to do useful work. When the receiving process reaches the point where it is ready to receive the message, the system software simply copies the buffered message into the memory location controlled by the receiving process. This is called buffered communication.
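MPI exposes both modes directly: MPI_Ssend is a synchronous send and MPI_Bsend is a buffered send that copies the message into a user-attached buffer. The following is a minimal sketch of the difference (run with at least two processes); the values and tags are illustrative.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, x = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Synchronous send: does not complete until process 1 has
               started to receive ("request to send" / "ready to receive"). */
            MPI_Ssend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

            /* Buffered send: the message is copied into a user-supplied
               buffer and the call can return before the receive starts. */
            int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
            void *buf = malloc(bufsize);
            MPI_Buffer_attach(buf, bufsize);
            MPI_Bsend(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
            MPI_Buffer_detach(&buf, &bufsize);   /* waits until the buffered message is delivered */
            free(buf);
        } else if (rank == 1) {
            int y;
            MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&y, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d twice\n", y);
        }

        MPI_Finalize();
        return 0;
    }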

15 Single-Program Multiple-Data (SPMD)
Programming proceeds in the following style:
1. The user issues a directive to the operating system that has the effect of placing a copy of the executable program on each processor.
2. Each processor begins execution of its copy of the executable.
3. Different processes can execute different statements by branching within the program based on their process ranks.
This approach to programming a distributed-memory MIMD system is called the single-program multiple-data (SPMD) approach; a sketch follows below. The effect of running different programs is obtained by conditional branches within the source code. The Message Passing Interface (MPI) is a library of definitions and functions that can be used in C programs. All programs in our MPI studies use the SPMD paradigm.
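A minimal SPMD sketch: every process runs this same executable, and a conditional branch on the rank makes different processes do different work. The coordinator/worker roles are only illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Same program everywhere; behaviour differs by rank. */
        if (rank == 0)
            printf("rank 0: acting as coordinator\n");
        else
            printf("rank %d: acting as worker\n", rank);

        MPI_Finalize();
        return 0;
    }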

16 Communicator and Rank
A communicator is a collection of processes that can send messages to each other. The communicator MPI_COMM_WORLD is predefined in MPI and consists of all the processes running when program execution begins.
The flow of control of an SPMD program depends on the rank of a process. The function MPI_Comm_rank returns the rank of a process in its second parameter; the first parameter is the communicator.
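A short sketch of these calls. MPI_Comm_rank returns the rank through its second parameter; MPI_Comm_size (not mentioned on the slide) returns the number of processes in the communicator.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* first arg: communicator, second: rank (out) */
        MPI_Comm_size(MPI_COMM_WORLD, &p);     /* number of processes in the communicator */
        printf("process %d of %d in MPI_COMM_WORLD\n", rank, p);
        MPI_Finalize();
        return 0;
    }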

17 MPI Message
The actual message passing in the program is carried out by the MPI functions MPI_Send (which sends a message to a designated process) and MPI_Recv (which receives a message from a process).
Issues involved in message passing:
1. A message must be composed and put in a buffer.
2. The message must be "dropped in a mailbox"; to know where to deliver it, the message must be "enclosed in an envelope" carrying the destination address of the message.
3. The address alone is not enough. Since the physical message is a sequence of electrical signals, the system needs to know where the message ends, i.e. the size of the message.
4. To take appropriate action on the message, the receiver needs the return address, i.e. the address of the source process.
5. A message type, or tag, helps the receiver take the proper action on the message.
6. The receiver also needs to know which communicator the message comes from.
Therefore, the message envelope contains:
1. the rank of the receiver
2. the rank of the sender
3. a tag (message type)
4. a communicator
The actual message is stored in a block of memory. The system needs a count and a datatype to determine how much storage is needed for the message:
1. the count value
2. the MPI datatype
The message also needs a message pointer so the system knows where to get the data:
1. the message pointer

18 Sending Message
The parameters for MPI_Send and MPI_Recv are:

    int MPI_Send(
        void*         message    /* in  */,
        int           count      /* in  */,
        MPI_Datatype  datatype   /* in  */,
        int           dest       /* in  */,
        int           tag        /* in  */,
        MPI_Comm      comm       /* in  */)

    int MPI_Recv(
        void*         message    /* out */,
        int           count      /* in  */,
        MPI_Datatype  datatype   /* in  */,
        int           source     /* in  */,
        int           tag        /* in  */,
        MPI_Comm      comm       /* in  */,
        MPI_Status*   status     /* out */)
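A small usage sketch of this pair, sending a string from process 0 to process 1; the message text and buffer size are illustrative.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        int rank;
        char msg[64];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            strcpy(msg, "hello from rank 0");
            /* message, count, datatype describe the data;
               dest, tag, comm form the envelope */
            MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(msg, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received: %s\n", msg);
        }

        MPI_Finalize();
        return 0;
    }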

19 Send and Receive Pair
The status argument returns information on the data that was actually received. It references a struct with at least three members:

    status->MPI_SOURCE   /* the rank of the process that sent the message */
    status->MPI_TAG      /* the tag of the message */
    status->MPI_ERROR    /* an error code */

Diagram: a send with tag A is matched by a receive only if the receive specifies the same tag and sender, or uses the wildcards MPI_ANY_TAG and MPI_ANY_SOURCE to accept any tag or sender; the matched send and receive then carry an identical message.
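A sketch of a receive that accepts any sender and any tag, then inspects the status fields to see who sent what; the payload values and tags are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Rank 0 accepts messages from any sender with any tag and
               uses the status struct to identify the actual sender/tag. */
            for (int i = 1; i < size; i++) {
                int data;
                MPI_Status status;
                MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                printf("got %d from rank %d with tag %d\n",
                       data, status.MPI_SOURCE, status.MPI_TAG);
            }
        } else {
            int data = rank * 10;
            MPI_Send(&data, 1, MPI_INT, 0, rank /* tag */, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }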

20 In Summary
The count and datatype determine the size of the message. The tag and comm are used to make sure that messages don't get mixed up.
Each message consists of two parts: the data being transmitted and the envelope of information.
Data: 1. a pointer, 2. a count, 3. a datatype.
Envelope: 1. the rank of the receiver, 2. the rank of the sender, 3. a tag, 4. a communicator, 5. a status (for the receive).
