Chapter 4: Partitioning and Divide-and-Conquer Strategies

Partitioning and Divide-and-Conquer
Partitioning simply divides the problem into parts. Divide and conquer is characterized by dividing a problem into sub-problems of the same form as the larger problem, with further divisions into still smaller sub-problems, usually done by recursion. Recursive divide and conquer is amenable to parallelization because separate processes can be used for the divided parts. Also, the data is usually naturally localized. 4.1

Partitioning/Divide-and-Conquer Examples
Many possibilities:
• Operations on sequences of numbers, such as simply adding them together
• Several sorting algorithms, which can often be partitioned or constructed in a recursive fashion
• Numerical integration
• N-body problem
4.2

Partitioning a sequence of numbers into parts and adding the parts 4.3

! Parallel Summation 1: master reads n, generates the random list,
! and broadcasts the sub-list size s
      program main
      include "mpif.h"
      double precision list(1000), slist(1000), rlist(1000), sum, part_sum
      real rnum
      integer s, k, j, n, inc, myid, numprocs, i, ierr
      integer status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

      if (myid .eq. 0) then
         read(*,*) n
         s = n/(numprocs-1)
      end if
      call MPI_BCAST(s, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

      if (myid .eq. 0) then
         call random_seed()
         do k = 1, n
            call random_number(rnum)
            print*, 'random=', rnum
            list(k) = rnum
         end do
      end if

! Master: send a sub-list to each slave (assumes n divisible by numprocs-1),
! then collect and accumulate the partial sums
      if (myid .eq. 0) then
         inc = 0
         do k = 1, numprocs-1
            do j = 1, s
               slist(j) = list(inc+j)
            end do
            call MPI_SEND(slist, s, MPI_DOUBLE_PRECISION, k, 0, MPI_COMM_WORLD, ierr)
            inc = inc + s
         end do
         sum = 0.0
         do i = 1, numprocs-1
            call MPI_RECV(part_sum, 1, MPI_DOUBLE_PRECISION, i, 0, MPI_COMM_WORLD, status, ierr)
            sum = sum + part_sum
         end do
         print*, 'Sum=', sum

! Slaves: receive a sub-list, compute the partial sum, and return it to the master
      else
         call MPI_RECV(rlist, s, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
         part_sum = 0.0
         do i = 1, s
            part_sum = part_sum + rlist(i)
         end do
         call MPI_SEND(part_sum, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
      end if

      call MPI_FINALIZE(ierr)
      stop
      end

Divide & Conquer
Divide the problem into sub-problems that are of the same form as the larger problem. Further subdivisions use “recursion”. A recursive divide and conquer in which the problem is divided into two parts at each step forms a binary tree.
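
As an illustration (not part of the original slides), here is a minimal sequential C sketch of a recursive divide-and-conquer summation that forms such a binary tree; the function name add and the two-element cut-off are assumptions made for the example.

    /* Minimal sketch of recursive divide-and-conquer summation (illustrative only). */
    #include <stdio.h>

    /* Sum s[lo..hi-1] by splitting the list in half until the parts are small enough. */
    static int add(const int *s, int lo, int hi)
    {
        if (hi - lo <= 2) {                 /* small enough: sum directly (a leaf of the tree) */
            int sum = 0;
            for (int i = lo; i < hi; i++)
                sum += s[i];
            return sum;
        }
        int mid = (lo + hi) / 2;            /* divide: two sub-problems of the same form */
        return add(s, lo, mid) + add(s, mid, hi);   /* combine: add the partial sums */
    }

    int main(void)
    {
        int x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("sum = %d\n", add(x, 0, 8));  /* prints 36 */
        return 0;
    }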

Tree construction 4.4

Dividing a list into parts 4.5

Partial summation 4.6

Bucket Sort
One “bucket” is assigned to hold the numbers that fall within each region of the interval. The numbers in each bucket are then sorted using a sequential sorting algorithm. Sequential sorting time complexity: O(n log(n/m)). Works well if the original numbers are uniformly distributed across a known interval, say 0 to a - 1. 4.9

Method
List of numbers in [0,19], m = 5 buckets:
  bucket 0: [0,3]   bucket 1: [4,7]   bucket 2: [8,11]   bucket 3: [12,15]   bucket 4: [16,19]
List = {12, 8, 4, 5, 3, 0, 19, 7, 11, 10, 14, 18}
  Bucket 0: 3, 0
  Bucket 1: 4, 5, 7
  Bucket 2: 8, 11, 10
  Bucket 3: 12, 14
  Bucket 4: 19, 18
Sort within each bucket using a simple algorithm, or let the bucket range be 1 so that there is one bucket per possible value (no sorting within buckets is then needed).

Sequential Algorithm
To place a number into a bucket by comparing it against bucket boundaries, up to m - 1 comparisons are required. Alternatively, divide the number by the bucket range a/m and use the integer result to identify the bucket. This requires one computational step per number. For example, with a = 20 and m = 5 (bucket range 4): 18/4 = 4, so 18 goes into bucket 4; 8/4 = 2, so 8 goes into bucket 2.
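
As an illustration (not part of the original slides), a minimal C sketch of this one-step bucket placement applied to the example list above; the fixed interval 0..19 and the array sizes are assumptions taken from the example.

    /* Illustrative sketch: place numbers into m buckets in one step per number. */
    #include <stdio.h>

    #define N 12
    #define M 5      /* number of buckets */
    #define A 20     /* numbers lie in 0..A-1, so each bucket covers A/M values */

    int main(void)
    {
        int list[N] = {12, 8, 4, 5, 3, 0, 19, 7, 11, 10, 14, 18};
        int bucket[M][N], count[M] = {0};

        for (int i = 0; i < N; i++) {
            int b = list[i] / (A / M);          /* one division identifies the bucket */
            bucket[b][count[b]++] = list[i];
        }
        for (int b = 0; b < M; b++) {           /* each bucket would now be sorted sequentially */
            printf("Bucket %d:", b);
            for (int j = 0; j < count[b]; j++)
                printf(" %d", bucket[b][j]);
            printf("\n");
        }
        return 0;
    }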

Sequential Complexity
n numbers, m buckets, about n/m numbers in each bucket.
Quicksort of one bucket: (n/m)log(n/m); for m buckets: m(n/m)log(n/m) = n log(n/m).
Inserting the n numbers into the m buckets: n operations.
ts = n + m(n/m)log(n/m) = O(n log(n/m))

Parallel version of bucket sort: simple approach. Assign one processor to each bucket (p = m). 4.10

Further Parallelization Partition sequence into m regions, one region for each processor. Each processor maintains p “small” buckets and separates numbers in its region into its own small buckets. Small buckets then emptied into p final buckets for sorting, which requires each processor to send one small bucket to each of the other processors (bucket i to processor i). 4.11

Another parallel version of bucket sort Introduces new message-passing operation - all-to-all broadcast. 4.12

Steps of Parallel Bucket Sort (n numbers, p processors)
1. Partition the n numbers among the processors
2. Sort into small buckets
3. Send to the large buckets
4. Sort the large buckets

Analysis
1. Partition the n numbers among the p processors: tcomm1 = tstartup + n tdata
2. Sort into small buckets: tcomp2 = n/p
3. Send to the large buckets (n/(p*p) numbers from each processor to each of the other p - 1 processors):
   tcomm3 = p(p-1)(tstartup + (n/(p*p)) tdata), or
   tcomm3 = (p-1)(tstartup + (n/(p*p)) tdata) if the communications can overlap
4. Sort the large buckets: tcomp4 = (n/p)log(n/p)

Overall execution time:
tp = tcomm1 + tcomp2 + tcomm3 + tcomp4
tp = tstartup + n tdata + n/p + (p-1)(tstartup + (n/(p*p)) tdata) + (n/p)log(n/p)
tp = (n/p)(1 + log(n/p)) + p tstartup + (n + (p-1)(n/(p*p))) tdata

“all-to-all” broadcast routine Sends data from each process to every other process 4.13
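
As an illustration (not part of the original slides), a minimal sketch of such an exchange using the standard MPI_Alltoall call, with one value per process pair; in the parallel bucket sort the small buckets generally have different lengths, so MPI_Alltoallv would typically be used instead.

    /* Illustrative all-to-all exchange: process i sends element j of sendbuf to process j. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int *sendbuf = malloc(p * sizeof(int));
        int *recvbuf = malloc(p * sizeof(int));
        for (int j = 0; j < p; j++)
            sendbuf[j] = rank * 100 + j;        /* stands in for "small bucket j" on this process */

        /* After the call, recvbuf[i] holds the element that process i sent to this process. */
        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        printf("process %d received:", rank);
        for (int i = 0; i < p; i++)
            printf(" %d", recvbuf[i]);
        printf("\n");

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }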

“all-to-all” routine actually transfers rows of an array to columns: Transposes a matrix. 4.14

Numerical integration using rectangles Each region calculated using an approximation given by rectangles: Aligning the rectangles: 4.15

Numerical integration using trapezoidal method May not be better! 4.16
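
As an illustration (not part of the original slides), a small C comparison of the two approximations on the same set of strips; the integrand 4/(1+x^2), which integrates to pi over [0,1], anticipates the pi program later in this chapter, and the strip count n = 16 is arbitrary.

    /* Illustrative comparison of rectangle (midpoint) and trapezoidal approximations. */
    #include <stdio.h>
    #include <math.h>

    static double f(double x) { return 4.0 / (1.0 + x * x); }   /* integral over [0,1] is pi */

    int main(void)
    {
        int n = 16;                     /* number of strips (regions) */
        double h = 1.0 / n, rect = 0.0, trap = 0.0;
        for (int i = 0; i < n; i++) {
            double a = i * h, b = a + h;
            rect += h * f(0.5 * (a + b));            /* rectangle aligned at the midpoint */
            trap += 0.5 * h * (f(a) + f(b));         /* trapezoid through the two endpoints */
        }
        printf("rectangles: %.8f  trapezoids: %.8f  pi: %.8f\n", rect, trap, acos(-1.0));
        return 0;
    }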

Adaptive Quadrature
The solution adapts to the shape of the curve. Use three areas, A, B, and C. The computation is terminated when the largest of A and B is sufficiently close to the sum of the remaining two areas. 4.17

Adaptive quadrature with false termination. Some care might be needed in choosing when to terminate. Might cause us to terminate early, as two large regions are the same (i.e., C = 0). 4.18
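
As an illustration (not part of the original slides), a minimal sequential C sketch of a common variant of the adaptive idea: each interval is halved, and the recursion stops when the two refined areas together are sufficiently close to the coarser estimate for the whole interval (the leftover difference playing the role of C above); the integrand, tolerance and function names are assumptions. The false-termination caution above applies here too, so real implementations often also enforce a minimum interval width.

    /* Illustrative recursive adaptive quadrature using the trapezoidal rule. */
    #include <stdio.h>
    #include <math.h>

    static double f(double x) { return 4.0 / (1.0 + x * x); }    /* example integrand */

    static double trap(double a, double b) {                     /* one trapezoid over [a,b] */
        return 0.5 * (b - a) * (f(a) + f(b));
    }

    /* C = coarse estimate for [a,b]; A and B = estimates for the two halves.
       Recurse until A + B is sufficiently close to C. */
    static double adapt(double a, double b, double C, double tol)
    {
        double m = 0.5 * (a + b);
        double A = trap(a, m), B = trap(m, b);
        if (fabs(A + B - C) < tol)
            return A + B;
        return adapt(a, m, A, tol / 2) + adapt(m, b, B, tol / 2);
    }

    int main(void)
    {
        double approx = adapt(0.0, 1.0, trap(0.0, 1.0), 1e-8);
        printf("integral of 4/(1+x^2) on [0,1] = %.10f (pi = %.10f)\n", approx, acos(-1.0));
        return 0;
    }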

/*********************************************************************
 pi_calc.cpp calculates the value of pi and compares it with the
 actual value (to 25 digits) of pi to give the error.
 Integrates the function f(x) = 4/(1+x^2).
 July 6, 2001 K. Spry CSCI3145
*********************************************************************/
#include <cmath>        // include files
#include <iostream>
#include "mpi.h"
using namespace std;

void printit();         // function prototypes

int main(int argc, char *argv[])
{
    double actual_pi = 3.141592653589793238462643;   // for comparison later
    int n, rank, num_proc, i;
    double temp_pi, calc_pi, int_size, part_sum, x;
    char response = 'y', resp1 = 'y';
    MPI::Init(argc, argv);                            // initiate MPI
4.20

    num_proc = MPI::COMM_WORLD.Get_size();
    rank = MPI::COMM_WORLD.Get_rank();
    if (rank == 0)
        printit();                        /* I am root node, print out welcome */
    while (response == 'y') {
        if (resp1 == 'y') {
            if (rank == 0) {              /* I am root node */
                cout << "__________________________________" << endl;
                cout << "\nEnter the number of intervals: (0 will exit)" << endl;
                cin >> n;
            }
        } else
            n = 0;
        MPI::COMM_WORLD.Bcast(&n, 1, MPI::INT, 0);    // broadcast n
        if (n == 0)
            break;                        // check for quit condition
4.21

        else {
            int_size = 1.0 / (double) n;              // calculate interval size
            part_sum = 0.0;
            for (i = rank + 1; i <= n; i += num_proc) {   // calculate partial sums
                x = int_size * ((double) i - 0.5);
                part_sum += (4.0 / (1.0 + x * x));
            }
            temp_pi = int_size * part_sum;
            // collect all partial sums and compute pi
            MPI::COMM_WORLD.Reduce(&temp_pi, &calc_pi, 1, MPI::DOUBLE, MPI::SUM, 0);
4.22

            if (rank == 0) {              /* I am server */
                cout << "pi is approximately " << calc_pi
                     << ". Error is " << fabs(calc_pi - actual_pi) << endl
                     << "_______________________________________" << endl;
            }
        }   // end else
        if (rank == 0) {                  /* I am root node */
            cout << "\nCompute with new intervals? (y/n)" << endl;
            cin >> resp1;
        }
    }   // end while
    MPI::Finalize();                      // terminate MPI
    return 0;
}   // end main
4.23

// functions
void printit()
{
    cout << "\n*********************************" << endl
         << "Welcome to the pi calculator!" << endl
         << "Programmer: K. Spry" << endl
         << "You set the number of divisions \nfor estimating the integral: \n\tf(x)=4/(1+x^2)" << endl
         << "*********************************" << endl;
}   // end printit
4.24

Gravitational N-Body Problem Finding positions and movements of bodies in space subject to gravitational forces from other bodies, using Newtonian laws of physics. 4.25

Gravitational N-Body Problem Equations
The gravitational force between two bodies of masses ma and mb is
    F = G ma mb / r²
where G is the gravitational constant and r is the distance between the bodies. Subject to forces, a body accelerates according to Newton's 2nd law:
    F = m a
where m is the mass of the body, F is the force it experiences, and a is the resultant acceleration. 4.26

Details
Let the time interval be Δt. For a body of mass m, the force is
    F = m (v^(t+1) - v^t) / Δt
so the new velocity is
    v^(t+1) = v^t + F Δt / m
where v^(t+1) is the velocity at time t + 1 and v^t is the velocity at time t. Over the time interval Δt, the position changes by
    x^(t+1) - x^t = v^(t+1) Δt
where x^t is its position at time t. Once the bodies move to new positions, the forces change and the computation has to be repeated. 4.27

Sequential Code
The overall gravitational N-body computation can be described by:

    for (t = 0; t < tmax; t++) {              /* for each time period */
        for (i = 0; i < N; i++) {             /* for each body */
            F = Force_routine(i);             /* compute force on ith body */
            vnew[i] = v[i] + F * dt / m;      /* compute new velocity */
            xnew[i] = x[i] + vnew[i] * dt;    /* and new position */
        }
        for (i = 0; i < N; i++) {             /* for each body */
            x[i] = xnew[i];                   /* update position and velocity */
            v[i] = vnew[i];
        }
    }
4.28

Parallel Code
The sequential algorithm is an O(N²) algorithm (for one iteration) as each of the N bodies is influenced by each of the other N - 1 bodies. Not feasible to use this direct algorithm for most interesting N-body problems where N is very large. 4.29

Time complexity can be reduced by approximating a cluster of distant bodies as a single distant body with the total mass sited at the center of mass of the cluster: 4.30

Barnes-Hut Algorithm
Start with whole space in which one cube contains the bodies (or particles).
• First, this cube is divided into eight subcubes.
• If a subcube contains no particles, subcube deleted from further consideration.
• If a subcube contains one body, subcube retained.
• If a subcube contains more than one body, it is recursively divided until every subcube contains one body.
4.31

Creates an octree: a tree with up to eight edges from each node. The leaves represent cells, each containing one body. After the tree has been constructed, the total mass and center of mass of its subcube are stored at each node. 4.32

The force on each body is obtained by traversing the tree starting at the root, stopping at a node when the clustering approximation can be used, e.g. when
    r ≥ d/θ
where r is the distance to the center of mass of the cluster, d is the dimension (length of a side) of the subcube, and θ is a constant, typically 1.0 or less. Constructing the tree requires a time of O(n log n), and so does computing all the forces, so that the overall time complexity of the method is O(n log n). 4.33
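
As an illustration (not part of the original slides), a hedged sketch of a tree node and a force traversal that applies this criterion; it uses a 2-D quadtree (four children) rather than the full 3-D octree for brevity, and the struct layout, field names and example values are assumptions.

    /* Illustrative Barnes-Hut tree node and force traversal (2-D quadtree sketch). */
    #include <stdio.h>
    #include <math.h>

    #define G 6.674e-11

    typedef struct Node {
        double mass;            /* total mass of the bodies in this cell */
        double cx, cy;          /* center of mass of the cell */
        double size;            /* length d of a side of the cell */
        struct Node *child[4];  /* quadtree children; an octree would have 8 */
    } Node;

    /* x-component of the force on a body of mass m at (x,y): descend the tree and
       stop when d/r < theta, i.e. r >= d/theta (the clustering approximation). */
    double force_x(const Node *n, double x, double y, double m, double theta)
    {
        if (n == NULL || n->mass == 0.0)
            return 0.0;
        double dx = n->cx - x, dy = n->cy - y;
        double r = sqrt(dx * dx + dy * dy);
        int is_leaf = n->child[0] == NULL && n->child[1] == NULL &&
                      n->child[2] == NULL && n->child[3] == NULL;
        if (is_leaf || n->size / r < theta)
            return G * m * n->mass * dx / (r * r * r);   /* treat the subtree as one body */
        double f = 0.0;
        for (int i = 0; i < 4; i++)
            f += force_x(n->child[i], x, y, m, theta);
        return f;
    }

    int main(void)
    {
        /* A single distant cell of mass 1e6 centered 100 units away, side length 10. */
        Node cluster = { 1.0e6, 100.0, 0.0, 10.0, { NULL, NULL, NULL, NULL } };
        printf("Fx = %g\n", force_x(&cluster, 0.0, 0.0, 1.0, 1.0));
        return 0;
    }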

Recursive division of 2-dimensional space 4.34

Orthogonal Recursive Bisection (for a 2-dimensional area)
First, a vertical line is found that divides the area into two areas, each with an equal number of bodies. For each area, a horizontal line is then found that divides it into two areas, each with an equal number of bodies. This is repeated as required. 4.35

Astrophysical N-body simulation by Scott Linssen (UNCC student, 1997) using the O(N²) algorithm. 4.36

Astrophysical N-body simulation By David Messager (UNCC student 1998) using Barnes-Hut algorithm. 4.37