Lecture 6 CSS314 Parallel Computing
Book: “An Introduction to Parallel Programming” by Peter Pacheco
http://instructor.sdu.edu.kz/~moodle/
M.Sc. Bogdanchikov Andrey, Suleyman Demirel University

Content
- The trapezoidal rule
- Parallelizing the trapezoidal rule
- Dealing with I/O
- Collective communication
- MPI_Reduce
- MPI_Allreduce
- Broadcast
- Data distributions
- MPI_Scatter

The trapezoidal rule Recall that we can use the trapezoidal rule to approximate the area between the graph of a function y = f(x), two vertical lines, and the x-axis. The basic idea is to divide the interval on the x-axis into n equal subintervals. If the endpoints of a subinterval are x_i and x_{i+1}, then the length of the subinterval is h = x_{i+1} - x_i. Also, if the lengths of the two vertical segments are f(x_i) and f(x_{i+1}), then the area of the trapezoid is h[f(x_i) + f(x_{i+1})]/2. Self-study: learn the formula for multiple consecutive trapezoids. Chapter 3.2.1
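For reference, summing the areas of the n consecutive trapezoids with x_0 = a and x_n = b gives the composite formula the self-study item refers to:

\[
\int_a^b f(x)\,dx \;\approx\; h\left[\frac{f(x_0)}{2} + f(x_1) + f(x_2) + \cdots + f(x_{n-1}) + \frac{f(x_n)}{2}\right]
\]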

The trapezoidal rule: (a) area to be estimated and (b) approximate area using trapezoids

Parallelizing the trapezoidal rule Recall that we can design a parallel program using four basic steps:
1. Partition the problem solution into tasks.
2. Identify the communication channels between the tasks.
3. Aggregate the tasks into composite tasks.
4. Map the composite tasks to cores.

Pseudo-code Let’s make the simplifying assumption that comm_sz evenly divides n. Then pseudo-code for the program might look something like the following:

Get a, b, n;
h = (b-a)/n;
local_n = n/comm_sz;
local_a = a + my_rank*local_n*h;
local_b = local_a + local_n*h;
local_integral = Trap(local_a, local_b, local_n, h);

Pseudo-code (continued)

if (my_rank != 0) {
   Send local_integral to process 0;
} else {  /* my_rank == 0 */
   total_integral = local_integral;
   for (proc = 1; proc < comm_sz; proc++) {
      Receive local_integral from proc;
      total_integral += local_integral;
   }
}

if (my_rank == 0)
   print result;

Discussion Notice that in our choice of identifiers, we try to differentiate between local and global variables. Local variables are variables whose contents are significant only on the process that’s using them. Some examples from the trapezoidal rule program are local_a, local_b, and local_n. Variables whose contents are significant to all the processes are sometimes called global variables. Some examples from the trapezoidal rule are a, b, and n.

trapezoid.c part 1
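The code in the screenshot is not reproduced in the transcript. A minimal sketch of what the main program plausibly contains, written directly from the pseudo-code above (Pacheco’s actual trapezoid program may differ in details such as error checking and output formatting); together with part 2 below it forms a complete program:

#include <stdio.h>
#include <mpi.h>

double Trap(double left_endpt, double right_endpt, int trap_count,
            double base_len);   /* defined in part 2 */

int main(void) {
   int my_rank, comm_sz, n = 1024, local_n;
   double a = 0.0, b = 3.0, h, local_a, local_b;
   double local_int, total_int;

   MPI_Init(NULL, NULL);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

   h = (b - a) / n;         /* h is the same on every process          */
   local_n = n / comm_sz;   /* so is the number of trapezoids per rank */
   local_a = a + my_rank * local_n * h;
   local_b = local_a + local_n * h;
   local_int = Trap(local_a, local_b, local_n, h);

   if (my_rank != 0) {
      /* send the local estimate to process 0 */
      MPI_Send(&local_int, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
   } else {
      /* process 0 collects and adds all the estimates */
      total_int = local_int;
      for (int source = 1; source < comm_sz; source++) {
         MPI_Recv(&local_int, 1, MPI_DOUBLE, source, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         total_int += local_int;
      }
      printf("With n = %d trapezoids, our estimate\n", n);
      printf("of the integral from %f to %f = %.15e\n", a, b, total_int);
   }

   MPI_Finalize();
   return 0;
}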

trapezoid.c part 2
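Again the screenshot is missing. A sketch of the remaining pieces: the integrand f and the serial Trap function that each process applies to its own subinterval (x*x is only a placeholder integrand, not necessarily the one used in the course):

/* Placeholder integrand; any f can be substituted here. */
double f(double x) {
   return x * x;
}

/* Serial trapezoidal rule over [left_endpt, right_endpt] with
   trap_count trapezoids of width base_len. */
double Trap(double left_endpt, double right_endpt, int trap_count,
            double base_len) {
   double estimate = (f(left_endpt) + f(right_endpt)) / 2.0;
   for (int i = 1; i <= trap_count - 1; i++)
      estimate += f(left_endpt + i * base_len);
   return estimate * base_len;
}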

Dealing with I/O Of course, the current version of the parallel trapezoidal rule has a serious deficiency: it will only compute the integral over the interval [0, 3] using 1024 trapezoids. We can edit the code and recompile, but this is quite a bit of work compared to simply typing in three new numbers. We need to address the problem of getting input from the user. While we’re talking about input to parallel programs, it might be a good idea to also take a look at output.

Output In both the “greetings” program and the trapezoidal rule program we’ve assumed that process 0 can write to stdout, that is, its calls to printf behave as we might expect. Although the MPI standard doesn’t specify which processes have access to which I/O devices, virtually all MPI implementations allow all the processes in MPI_COMM_WORLD full access to stdout and stderr, so most MPI implementations allow all processes to execute printf and fprintf(stderr, ...).

Input Unlike output, most MPI implementations only allow process 0 in MPI_COMM_WORLD access to stdin. In order to write MPI programs that can use scanf, we need to branch on process rank, with process 0 reading in the data and then sending it to the other processes. For example, we might write the Get_input function shown on the next slide for our parallel trapezoidal rule program. In this function, process 0 simply reads in the values for a, b, and n and sends all three values to each process.

Get_input function
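The slide’s code is not in the transcript. A sketch of a Get_input that matches the description above, assuming process 0 reads a, b, and n with scanf and forwards them with point-to-point sends (the book’s version is essentially this, but details may differ):

void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
   if (my_rank == 0) {
      printf("Enter a, b, and n\n");
      scanf("%lf %lf %d", a_p, b_p, n_p);
      /* forward the three values to every other process */
      for (int dest = 1; dest < comm_sz; dest++) {
         MPI_Send(a_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
         MPI_Send(b_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
         MPI_Send(n_p, 1, MPI_INT,    dest, 0, MPI_COMM_WORLD);
      }
   } else {
      MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(b_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(n_p, 1, MPI_INT,    0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
   }
}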

Collective communication If we pause for a moment and think about our trapezoidal rule program, we can find several things that we might be able to improve on. One of the most obvious is the “global sum” after each process has computed its part of the integral. If we hire eight workers to, say, build a house, we might feel that we weren’t getting our money’s worth if seven of the workers told the first what to do, and then the seven collected their pay and went home. Sometimes it does happen that this is the best we can do in a parallel program, but as we have already seen, this step can be optimized.

Tree-structured communication

An alternative tree-structured global sum

MPI_Reduce With virtually limitless possibilities, it’s unreasonable to expect each MPI programmer to write an optimal global-sum function, so MPI specifically protects programmers against this trap of endless optimization by requiring that MPI implementations include implementations of global sums. This places the burden of optimization on the developer of the MPI implementation, rather than the application developer. The assumption here is that the developer of the MPI implementation should know enough about both the hardware and the system software so that she can make better decisions about implementation details.

Point-to-point communications Now, a “global-sum function” will obviously require communication. However, unlike the MPI_Send-MPI_Recv pair, the global-sum function may involve more than two processes. In fact, in our trapezoidal rule program it will involve all the processes in MPI_COMM_WORLD. In MPI parlance, communication functions that involve all the processes in a communicator are called collective communications. To distinguish between collective communications and functions such as MPI_Send and MPI_Recv, MPI_Send and MPI_Recv are often called point-to-point communications.

MPI_Reduce In fact, global sum is just a special case of an entire class of collective communications. For example, it might happen that instead of finding the sum of a collection of comm_sz numbers distributed among the processes, we want to find the maximum or the minimum or the product or any one of many other possibilities:
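The slide’s syntax box is missing from the transcript. MPI’s generalization of the global sum is MPI_Reduce; its prototype (parameter names here follow Pacheco’s convention, e.g. output_data_p, which the later slides also use) is:

int MPI_Reduce(
      void*         input_data_p,   /* in  */
      void*         output_data_p,  /* out */
      int           count,          /* in  */
      MPI_Datatype  datatype,       /* in  */
      MPI_Op        operator,       /* in  */
      int           dest_process,   /* in  */
      MPI_Comm      comm            /* in  */);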

Description The key to the generalization is the fifth argument, operator. It has type MPI_Op, which is a predefined MPI type like MPI_Datatype and MPI_Comm. There are a number of predefined values in this type. The global sum in our trapezoidal rule program can then be computed with the single call

MPI_Reduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
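For reference, the predefined reduction operators in MPI include: MPI_MAX (maximum), MPI_MIN (minimum), MPI_SUM (sum), MPI_PROD (product), MPI_LAND (logical and), MPI_BAND (bitwise and), MPI_LOR (logical or), MPI_BOR (bitwise or), MPI_LXOR (logical exclusive or), MPI_BXOR (bitwise exclusive or), MPI_MAXLOC (maximum and location of maximum), and MPI_MINLOC (minimum and location of minimum).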

Collective vs. point-to-point communications It’s important to remember that collective communications differ in several ways from point-to-point communications:
- All the processes in the communicator must call the same collective function.
- The arguments passed by each process to an MPI collective communication must be “compatible.”
- The output_data_p argument is only used on the dest process.
- Point-to-point communications are matched on the basis of tags and communicators. Collective communications don’t use tags, so they’re matched solely on the basis of the communicator and the order in which they’re called.
Self-Study: Chapter 3.4.3

MPI_Allreduce In our trapezoidal rule program, we just print the result, so it’s perfectly natural for only one process to get the result of the global sum. However, it’s not difficult to imagine a situation in which all of the processes need the result of a global sum in order to complete some larger computation. In this situation, we encounter some of the same problems we encountered with our original global sum. For example, if we use a tree to compute a global sum, we might “reverse” the branches to distribute the global sum.

Alternative Alternatively, we might have the processes exchange partial results instead of using one-way communications. Such a communication pattern is sometimes called a butterfly. Once again, we don’t want to have to decide on which structure to use, or how to code it for optimal performance.

A butterfly-structured global sum

MPI_Allreduce MPI provides a variant of MPI_Reduce that will store the result on all the processes in the communicator: The argument list is identical to that for MPI_Reduce except that there is no dest process since all the processes should get the result.
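The syntax box is not reproduced in the transcript; the prototype is:

int MPI_Allreduce(
      void*         input_data_p,   /* in  */
      void*         output_data_p,  /* out */
      int           count,          /* in  */
      MPI_Datatype  datatype,       /* in  */
      MPI_Op        operator,       /* in  */
      MPI_Comm      comm            /* in  */);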

Broadcast If we can improve the performance of the global sum in our trapezoidal rule program by replacing a loop of receives on process 0 with a tree-structured communication, we ought to be able to do something similar with the distribution of the input data. In fact, if we simply “reverse” the communications in the tree-structured global sum on the previous slides, we obtain the tree-structured communication shown on the next slide, and we can use this structure to distribute the input data.

A tree-structured broadcast

MPI_Bcast A collective communication in which data belonging to a single process is sent to all of the processes in the communicator is called a broadcast, and you’ve probably guessed that MPI provides a broadcast function: The process with rank source_proc sends the contents of the memory referenced by data_p to all the processes in the communicator comm.
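The prototype (again the slide’s syntax box is not in the transcript):

int MPI_Bcast(
      void*         data_p,       /* in/out */
      int           count,        /* in     */
      MPI_Datatype  datatype,     /* in     */
      int           source_proc,  /* in     */
      MPI_Comm      comm          /* in     */);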

A version of Get_input that uses MPI_Bcast
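The code itself is missing from the transcript. A sketch of the broadcast-based version: process 0 still does the reading, but every process, including 0, makes the same three MPI_Bcast calls (this closely follows the description; the book’s version may differ slightly):

void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
   /* comm_sz is kept for interface consistency; it is not needed here */
   if (my_rank == 0) {
      printf("Enter a, b, and n\n");
      scanf("%lf %lf %d", a_p, b_p, n_p);
   }
   /* all processes participate in the broadcasts rooted at process 0 */
   MPI_Bcast(a_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
   MPI_Bcast(b_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
   MPI_Bcast(n_p, 1, MPI_INT,    0, MPI_COMM_WORLD);
}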

Data distributions Suppose we want to write a function that computes a vector sum z = x + y, where x, y, and z are n-component vectors. Serial code for this vector sum might look like the following:
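The serial code shown on the slide is not in the transcript; a minimal sketch, assuming z = x + y for n-component arrays:

void Vector_sum(double x[], double y[], double z[], int n) {
   /* add corresponding components */
   for (int i = 0; i < n; i++)
      z[i] = x[i] + y[i];
}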

Data distribution How could we implement this using MPI? The work consists of adding the individual components of the vectors, so we might specify that the tasks are just the additions of corresponding components. Then there is no communication between the tasks, and the problem of parallelizing vector addition boils down to aggregating the tasks and assigning them to the cores.

Data distribution If the number of components is n and we have comm_sz cores or processes, let’s assume that comm_sz evenly divides n and define local_n = n / comm_sz. Then we can simply assign blocks of local_n consecutive components to each process. This is often called a block partition of the vector. An alternative to a block partition is a cyclic partition. In a cyclic partition, we assign the components in a round-robin fashion.

Different Partitions of a 12-Component Vector among Three Processes A third alternative is a block-cyclic partition. The idea here is that instead of using a cyclic distribution of individual components, we use a cyclic distribution of blocks of components.
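The slide’s table is not reproduced here. For a 12-component vector and three processes, the three partitions look like this (a blocksize of two is assumed for the block-cyclic column):

Process   Block          Cyclic          Block-cyclic (blocksize 2)
0         0 1 2 3        0 3 6 9         0 1 6 7
1         4 5 6 7        1 4 7 10        2 3 8 9
2         8 9 10 11      2 5 8 11        4 5 10 11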

A parallel implementation of vector addition Once we’ve decided how to partition the vectors, it’s easy to write a parallel vector addition function: each process simply adds its assigned components. Furthermore, regardless of the partition, each process will have local_n components of the vector, and, in order to save on storage, we can just store these on each process as an array of local_n elements.
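A sketch of such a function; each process only touches its own local_n components, regardless of which global components those are:

void Parallel_vector_sum(double local_x[], double local_y[],
                         double local_z[], int local_n) {
   /* identical to the serial loop, but over the local block only */
   for (int local_i = 0; local_i < local_n; local_i++)
      local_z[local_i] = local_x[local_i] + local_y[local_i];
}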

Scatter Now suppose we want to test our vector addition function. It would be convenient to be able to read in the dimension of the vectors and then read in the vectors x and y. If process 0 simply reads in the vectors and broadcasts them, then when there are 10 processes and the vectors have 10,000 components, each process will need to allocate storage for vectors with 10,000 components, even though it is only operating on subvectors with 1000 components. If, instead, process 0 reads in the entire vectors but sends each of the other processes only the block it needs, processes 1 to 9 would only need to allocate storage for the components they’re actually using.

MPI_Scatter For the communication MPI provides just such a function: Perhaps surprisingly, send_count should also be local_n—send_count is the amount of data going to each process; it’s not the amount of data in the memory referred to by send_buf_p.
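The prototype (the slide’s syntax box is missing from the transcript):

int MPI_Scatter(
      void*         send_buf_p,   /* in  */
      int           send_count,   /* in  */
      MPI_Datatype  send_type,    /* in  */
      void*         recv_buf_p,   /* out */
      int           recv_count,   /* in  */
      MPI_Datatype  recv_type,    /* in  */
      int           src_proc,     /* in  */
      MPI_Comm      comm          /* in  */);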

A function for reading and distributing a vector
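The slide’s code is not in the transcript. A sketch that reads the whole vector on process 0 and scatters blocks of local_n components, assuming a block distribution and omitting error checking (requires <stdio.h>, <stdlib.h>, and <mpi.h>):

void Read_vector(double local_a[], int local_n, int n, char vec_name[],
                 int my_rank, MPI_Comm comm) {
   double* a = NULL;

   if (my_rank == 0) {
      /* only process 0 allocates and reads the full vector */
      a = malloc(n * sizeof(double));
      printf("Enter the vector %s\n", vec_name);
      for (int i = 0; i < n; i++)
         scanf("%lf", &a[i]);
      MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE,
                  0, comm);
      free(a);
   } else {
      /* the other processes only supply their receive buffer */
      MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE,
                  0, comm);
   }
}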

Gather Of course, our test program will be useless unless we can see the result of our vector addition, so we need to write a function for printing out a distributed vector. Our function can collect all of the components of the vector onto process 0, and then process 0 can print all of the components. The communication in this function can be carried out by MPI_Gather:
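The prototype (syntax box not reproduced in the transcript); note that recv_count is the number of items received from each process, not the total:

int MPI_Gather(
      void*         send_buf_p,   /* in  */
      int           send_count,   /* in  */
      MPI_Datatype  send_type,    /* in  */
      void*         recv_buf_p,   /* out */
      int           recv_count,   /* in  */
      MPI_Datatype  recv_type,    /* in  */
      int           dest_proc,    /* in  */
      MPI_Comm      comm          /* in  */);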

MPI_Gather The data stored in the memory referred to by send_buf_p on process 0 is stored in the first block in recv_buf_p, the data stored in the memory referred to by send_buf_p on process 1 is stored in the second block referred to by recv_buf_p, and so on. So, if we’re using a block distribution, we can implement our distributed vector print function as shown on the next slide.

A function for printing a distributed vector
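The slide’s code is not in the transcript. A sketch that gathers the blocks onto process 0 and prints them there, again assuming a block distribution and omitting error checking:

void Print_vector(double local_b[], int local_n, int n, char title[],
                  int my_rank, MPI_Comm comm) {
   double* b = NULL;

   if (my_rank == 0) {
      /* only process 0 needs storage for the full vector */
      b = malloc(n * sizeof(double));
      MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE,
                 0, comm);
      printf("%s\n", title);
      for (int i = 0; i < n; i++)
         printf("%f ", b[i]);
      printf("\n");
      free(b);
   } else {
      /* the other processes only supply their send buffer */
      MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE,
                 0, comm);
   }
}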

Allgather This is for self-study: Chapter 3.4.9

Homework
Exercise 3.3
Exercise 3.6
Exercise 3.7
Exercise 3.13

THE END Thank you