L19: Putting it together: N-body (Ch. 6) November 22, 2011

Administrative
-Project sign-off due today; about a third of you are done (will accept it tomorrow, otherwise 5% loss on project grade)
-Next homework: CUDA, MPI (Ch. 3) and Apps (Ch. 6)
 -Assignment by Monday; goal is to prepare you for the final
 -We'll discuss it in class on Thursday
 -Solutions due on Friday, Dec. 1 (should be straightforward if you are in class)
-Poster dry run on Dec. 6, final presentations on Dec. 8
-Optional final report (4-6 pages) due on Dec. 14; can be used to improve your project grade if you need that

Dense and sparse linear algebra

Equivalent CSR matvec:
   for (i = 0; i < nr; i++) {
      for (j = ptr[i]; j < ptr[i+1]; j++)
         t[i] += data[j] * b[indices[j]];
   }

Dense matvec from L15:
   for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
         a[i] += c[j][i] * b[j];
      }
   }

Recall Dense CUDA code (Automatically Generated by our Research Compiler)

__global__ void mv_GPU(float* a, float* b, float** c) {
   int bx = blockIdx.x;
   int tx = threadIdx.x;
   __shared__ float bcpy[32];
   double acpy = a[tx + 32 * bx];
   for (int k = 0; k < 32; k++) {
      bcpy[tx] = b[32 * k + tx];
      __syncthreads();
      // this loop is actually fully unrolled
      for (int j = 32 * k; j < 32 * k + 32; j++) {
         acpy = acpy + c[j][32 * bx + tx] * bcpy[j - 32 * k];
      }
      __syncthreads();
   }
   a[tx + 32 * bx] = acpy;
}

Outline
-Go over programming assignment
-Chapter 6 shows two algorithms (N-body and Tree Search) written in the three programming models (OpenMP, Pthreads, MPI)
 -How to approach parallelization of an entire algorithm (Foster's methodology is used in the book)
 -What do you have to worry about that is different for each programming model?

Foster's methodology (Chapter 2)
1. Partitioning: divide the computation to be performed and the data operated on by the computation into small tasks. The focus here should be on identifying tasks that can be executed in parallel.
2. Communication: determine what communication needs to be carried out among the tasks identified in the previous step.
3. Agglomeration or aggregation: combine tasks and communications identified in the first step into larger tasks. For example, if task A must be executed before task B can be executed, it may make sense to aggregate them into a single composite task.
4. Mapping: assign the composite tasks identified in the previous step to processes/threads. This should be done so that communication is minimized, and each process/thread gets roughly the same amount of work.
Copyright © 2010, Elsevier Inc. All rights Reserved

The n-body problem Find the positions and velocities of a collection of interacting particles over a period of time. An n-body solver is a program that finds the solution to an n-body problem by simulating the behavior of the particles. Copyright © 2010, Elsevier Inc. All rights Reserved

[Figure: the N-body solver takes each particle's mass and its position and velocity at time 0, and produces positions and velocities at a later time x.]

Example N-Body: Material Point Method (MPM)
1. Lagrangian material points carry all state data (position, velocity, stress, etc.)
2. Overlying mesh defined
3. Particle state projected to mesh
4. Conservation of momentum solved on mesh, giving updated mesh velocity and (in principle) position. Stress at particles computed based on the gradient of the mesh velocity.
5. Particle positions/velocities updated from mesh solution.
6. Discard deformed mesh. Define new mesh and repeat.

Calculate Force as a function of mass and positions
Force between two particles q and k (Newtonian gravitation):
   f_qk = G * m_q * m_k * (s_k - s_q) / |s_k - s_q|^3
Total force on particle q:
   F_q = sum over all k != q of f_qk
Copyright © 2010, Elsevier Inc. All rights Reserved

Calculate acceleration
By Newton's second law, the acceleration of particle q is
   a_q = s_q''(t) = F_q / m_q
Copyright © 2010, Elsevier Inc. All rights Reserved

Serial pseudo-code: calculates position of particles at each time step Copyright © 2010, Elsevier Inc. All rights Reserved
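The serial pseudo-code on this slide appeared as a figure. Below is a C-flavored sketch of the same time-stepping structure; the names (n_steps, delta_t, Compute_force, Update_particle, Output_state) and the 2-D vect_t layout are illustrative assumptions rather than the book's exact code, and the later sketches reuse these declarations.

   #define DIM 2                    /* 2-D simulation, for simplicity */
   typedef double vect_t[DIM];      /* one position, velocity, or force vector */

   void Compute_force(int q, vect_t forces[], double masses[], vect_t pos[], int n);
   void Update_particle(int q, vect_t forces[], double masses[], vect_t pos[],
                        vect_t vel[], int n, double delta_t);
   void Output_state(double t, double masses[], vect_t pos[], vect_t vel[], int n);

   /* Outer time-stepping loop of the serial solver (sketch) */
   void Simulate(int n, int n_steps, double delta_t, int output_freq,
                 double masses[], vect_t pos[], vect_t vel[], vect_t forces[]) {
      for (int step = 1; step <= n_steps; step++) {
         for (int q = 0; q < n; q++)          /* total force on each particle */
            Compute_force(q, forces, masses, pos, n);
         for (int q = 0; q < n; q++)          /* then move every particle */
            Update_particle(q, forces, masses, pos, vel, n, delta_t);
         if (step % output_freq == 0)
            Output_state(step * delta_t, masses, pos, vel, n);
      }
   }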

Computation of the forces: Basic Algorithm
For all pairs (q, k) except q == k, compute f_qk and add it into the total force F_q
Copyright © 2010, Elsevier Inc. All rights Reserved
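A sketch of the basic force computation in C, using the vect_t declarations from the serial sketch above; the gravitational-constant name G and the helper name are illustrative.

   #include <math.h>
   #define G 6.673e-11              /* gravitational constant */

   /* Basic algorithm: total force on particle q from every other particle */
   void Compute_force(int q, vect_t forces[], double masses[], vect_t pos[], int n) {
      forces[q][0] = forces[q][1] = 0.0;
      for (int k = 0; k < n; k++) {
         if (k == q) continue;                      /* skip the q == k pair */
         double dx = pos[k][0] - pos[q][0];
         double dy = pos[k][1] - pos[q][1];
         double dist = sqrt(dx*dx + dy*dy);
         double coef = G * masses[q] * masses[k] / (dist*dist*dist);
         forces[q][0] += coef * dx;                 /* f_qk added into F_q */
         forces[q][1] += coef * dy;
      }
   }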

Computation of the forces: Reduced Algorithm
Reduce the calculation of the total force by capitalizing on f_qk = -f_kq: compute only half the pairs (k > q)
Copyright © 2010, Elsevier Inc. All rights Reserved
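A sketch of the reduced computation under the same illustrative assumptions: each pair (q, k) with k > q is visited once, f_qk is added to particle q's total and -f_qk to particle k's.

   /* Reduced algorithm: exploit f_qk = -f_kq so each pair is computed once.
      All of forces[] must be zeroed before calling this. */
   void Compute_forces_reduced(vect_t forces[], double masses[], vect_t pos[], int n) {
      for (int q = 0; q < n; q++)
         for (int k = q + 1; k < n; k++) {
            double dx = pos[k][0] - pos[q][0];
            double dy = pos[k][1] - pos[q][1];
            double dist = sqrt(dx*dx + dy*dy);
            double coef = G * masses[q] * masses[k] / (dist*dist*dist);
            double fx = coef * dx, fy = coef * dy;
            forces[q][0] += fx;   forces[q][1] += fy;    /*  f_qk added to q */
            forces[k][0] -= fx;   forces[k][1] -= fy;    /* -f_qk added to k */
         }
   }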

Compute position using Euler's Method
Use the tangent (the current velocity and acceleration) to step forward to the later time step:
   s_q(t + dt) ~ s_q(t) + dt * v_q(t)
   v_q(t + dt) ~ v_q(t) + dt * a_q(t)
Copyright © 2010, Elsevier Inc. All rights Reserved
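A sketch of the Euler update for one particle, with the same illustrative declarations as above:

   /* Euler's method: follow the tangent (current velocity and acceleration)
      for one time step of length delta_t. */
   void Update_particle(int q, vect_t forces[], double masses[], vect_t pos[],
                        vect_t vel[], int n, double delta_t) {
      for (int d = 0; d < DIM; d++) {
         pos[q][d] += delta_t * vel[q][d];                 /* s_q += dt * v_q */
         vel[q][d] += delta_t * forces[q][d] / masses[q];  /* v_q += dt * a_q */
      }
   }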

OpenMP parallelization: Basic Algorithm
For all pairs (q, k) except q == k, compute f_qk and add it into the total force F_q
Parallelization? Scheduling?
Copyright © 2010, Elsevier Inc. All rights Reserved
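One natural OpenMP parallelization of the basic algorithm: the loop over q is embarrassingly parallel because iteration q writes only forces[q], so a parallel for suffices and the schedule is a tuning choice. A sketch under the same assumptions:

   #include <omp.h>

   /* Basic algorithm with OpenMP: no synchronization needed inside the loop */
   void Compute_forces_omp(vect_t forces[], double masses[], vect_t pos[], int n) {
   #pragma omp parallel for schedule(static)
      for (int q = 0; q < n; q++)
         Compute_force(q, forces, masses, pos, n);
   }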

Computation of the forces: Reduced Algorithm
Reduce the calculation of the total force by capitalizing on f_qk = -f_kq: compute only half the pairs (k > q)
Parallelization? Scheduling?
Copyright © 2010, Elsevier Inc. All rights Reserved

Other topics on OpenMP parallelization
-Data structure choice: group fields together in a structure, or use separate arrays?
-Thread creation
-Alternative synchronization (reduced method) (see the sketch after this list)
-Private data (reduced method)
-Nowait
-I/O (single)
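For the reduced algorithm, iteration q also updates forces[k] for k > q, so different threads can write the same element. One alternative synchronization scheme is to give each thread a private copy of the force array and combine the copies afterwards; the sketch below shows one way to do this (an illustration using the declarations and G from the earlier sketches, not necessarily the book's exact code).

   #include <omp.h>
   #include <stdlib.h>
   #include <math.h>

   /* Reduced algorithm with per-thread (private) force arrays */
   void Compute_forces_reduced_omp(vect_t forces[], double masses[],
                                   vect_t pos[], int n) {
      int nthreads = omp_get_max_threads();
      vect_t *loc_forces = calloc((size_t)nthreads * n, sizeof(vect_t));

   #pragma omp parallel
      {
         vect_t *my_forces = loc_forces + (size_t)omp_get_thread_num() * n;
   #pragma omp for schedule(dynamic)            /* later rows have less work */
         for (int q = 0; q < n; q++)
            for (int k = q + 1; k < n; k++) {
               double dx = pos[k][0] - pos[q][0];
               double dy = pos[k][1] - pos[q][1];
               double dist = sqrt(dx*dx + dy*dy);
               double coef = G * masses[q] * masses[k] / (dist*dist*dist);
               my_forces[q][0] += coef * dx;   my_forces[q][1] += coef * dy;
               my_forces[k][0] -= coef * dx;   my_forces[k][1] -= coef * dy;
            }
   #pragma omp for                               /* combine the private copies */
         for (int q = 0; q < n; q++) {
            forces[q][0] = forces[q][1] = 0.0;
            for (int t = 0; t < nthreads; t++) {
               forces[q][0] += loc_forces[(size_t)t * n + q][0];
               forces[q][1] += loc_forces[(size_t)t * n + q][1];
            }
         }
      }
      free(loc_forces);
   }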

What's different with Pthreads
-Must write more code to do basic OpenMP things: create threads, pass parameters, determine the number and mapping of iterations, and initiate the "parallel loop" (see the sketch below)
-Explicit barriers between "loops"
-Explicit synchronization
-Explicit declaration of global data that is shared, or thread-local declaration of private data
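For example, the thread creation and explicit barrier that OpenMP provides implicitly look roughly like this with Pthreads (a sketch; the struct and function names are illustrative):

   #include <pthread.h>
   #include <stdlib.h>

   pthread_barrier_t barrier;            /* explicit barrier between "loops" */

   typedef struct { long rank; int n, thread_count; } thread_arg_t;

   void *Thread_work(void *arg) {
      thread_arg_t *a = arg;
      /* compute this thread's block of iterations by hand */
      int block = a->n / a->thread_count;
      int first = a->rank * block;
      int last  = (a->rank == a->thread_count - 1) ? a->n : first + block;
      /* ... compute forces for particles [first, last) ... */
      pthread_barrier_wait(&barrier);    /* all forces done before updates */
      /* ... update particles [first, last) ... */
      return NULL;
   }

   void Run_threads(int thread_count, int n) {
      pthread_t *handles = malloc(thread_count * sizeof(pthread_t));
      thread_arg_t *args = malloc(thread_count * sizeof(thread_arg_t));
      pthread_barrier_init(&barrier, NULL, thread_count);
      for (long t = 0; t < thread_count; t++) {
         args[t] = (thread_arg_t){ t, n, thread_count };
         pthread_create(&handles[t], NULL, Thread_work, &args[t]);
      }
      for (long t = 0; t < thread_count; t++)
         pthread_join(handles[t], NULL);
      pthread_barrier_destroy(&barrier);
      free(handles); free(args);
   }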

MPI Parallelization: Basic version
Choices with respect to the data structures:
-Each process stores the entire global array of particle masses.
-Each process only uses a single n-element array for the positions.
-Each process uses a pointer loc_pos that refers to the start of its block of pos.
-So on process 0, loc_pos = pos; on process 1, loc_pos = pos + loc_n; etc.
Copyright © 2010, Elsevier Inc. All rights Reserved

Detour: MPI_Scatter() Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
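The figure on this slide is not reproduced; for reference, MPI_Scatter (which sends a distinct block of the root's buffer to each process in the group) has the following interface, shown in the same style as the MPI_Gather and MPI_Allgather slides below:
MPI_Scatter: Sends data from one process to all other processes in a communicator
int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm)
Input Parameters
sendbuf starting address of send buffer (choice, significant only at root)
sendcnt number of elements sent to each process (integer, significant only at root)
sendtype data type of send buffer elements (significant only at root) (handle)
recvcnt number of elements in receive buffer (integer)
recvtype data type of receive buffer elements (handle)
root rank of sending process (integer)
comm communicator (handle)
Output Parameter
recvbuf address of receive buffer (choice)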

Distribute Data from input using a scatter operation Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
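For example, process 0 might distribute loc_n = n/comm_sz velocities to each process like this (a sketch; loc_vel and the derived type vect_mpi_t, which describes one particle's vector and appears on the I/O slide below, are the assumed names):

   /* Each process receives its own block of loc_n velocities from process 0 */
   int loc_n = n / comm_sz;                 /* assumes comm_sz evenly divides n */
   vect_t *loc_vel = malloc(loc_n * sizeof(vect_t));
   MPI_Scatter(vel,     loc_n, vect_mpi_t,  /* send buffer: significant only at root */
               loc_vel, loc_n, vect_mpi_t,
               0, comm);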

MPI_Gather is analogous
MPI_Gather: Gathers together values from a group of processes
int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm)
Input Parameters
sendbuf starting address of send buffer (choice)
sendcnt number of elements in send buffer (integer)
sendtype data type of send buffer elements (handle)
recvcnt number of elements for any single receive (integer, significant only at root)
recvtype data type of recv buffer elements (significant only at root) (handle)
root rank of receiving process (integer)
comm communicator (handle)
Output Parameter
recvbuf address of receive buffer (choice, significant only at root)

Detour: MPI_Allgather (a collective)
MPI_Allgather: Gathers data from all tasks and distributes the combined data to all tasks
int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
Input Parameters
sendbuf starting address of send buffer (choice)
sendcount number of elements in send buffer (integer)
sendtype data type of send buffer elements (handle)
recvcount number of elements received from any process (integer)
recvtype data type of receive buffer elements (handle)
comm communicator (handle)
Output Parameter
recvbuf address of receive buffer (choice)

I/O: Ch. 6, p. 291

   if (my_rank == 0) {
      for each particle, read masses[particle], pos[particle], vel[particle];
   }
   MPI_Bcast(masses, n, MPI_DOUBLE, 0, comm);
   MPI_Bcast(pos, n, vect_mpi_t, 0, comm);
   MPI_Scatter(vel, loc_n, vect_mpi_t, loc_vel, loc_n, vect_mpi_t, 0, comm);
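Here vect_mpi_t is an MPI derived datatype covering one particle's DIM-component vector; one way it can be built (a sketch, not necessarily the book's exact code):

   MPI_Datatype vect_mpi_t;
   MPI_Type_contiguous(DIM, MPI_DOUBLE, &vect_mpi_t);  /* DIM contiguous doubles */
   MPI_Type_commit(&vect_mpi_t);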

MPI Issues
-Communicating standard data types performs much better than derived data types like structures
-Separate fields (position, velocity, mass) into arrays that can be communicated independently
-Is it feasible for each process to have all of the mass data locally? The position data? Depends on how many bodies and on capacity constraints

MPI Parallelization: Basic version Copyright © 2010, Elsevier Inc. All rights Reserved
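The figure for the basic version is not reproduced here. The key point is that every process needs all n current positions to compute the forces on its own block, so each time step gathers the full position array everywhere. A sketch under the same illustrative assumptions as earlier (loc_n particles per process; pos, vel, and forces sized n on every process for simplicity):

   int my_first = my_rank * loc_n;            /* first global index owned here */
   for (int step = 1; step <= n_steps; step++) {
      /* each process's own block is already in place in pos (MPI_IN_PLACE);
         afterwards every process holds all n current positions */
      MPI_Allgather(MPI_IN_PLACE, loc_n, vect_mpi_t,
                    pos, loc_n, vect_mpi_t, comm);
      for (int q = my_first; q < my_first + loc_n; q++)
         Compute_force(q, forces, masses, pos, n);
      for (int q = my_first; q < my_first + loc_n; q++)
         Update_particle(q, forces, masses, pos, vel, n, delta_t);
   }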

Communication In A Possible MPI Implementation of the N-Body Solver (for a reduced solver) Copyright © 2010, Elsevier Inc. All rights Reserved

Ring Pass of Positions Copyright © 2010, Elsevier Inc. All rights Reserved
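The figure is not reproduced; the idea is that each process's block of positions visits every other process in comm_sz - 1 phases, moving one neighbor around a ring each phase (in the reduced solver, a buffer of partially accumulated forces travels along with it). A sketch of the position pass with MPI_Sendrecv_replace, under the same illustrative assumptions (the direction of travel and the names tmp_pos and loc_pos are arbitrary choices, not necessarily the book's):

   #include <string.h>   /* memcpy */

   int dest   = (my_rank + comm_sz - 1) % comm_sz;    /* send to "previous" rank */
   int source = (my_rank + 1) % comm_sz;              /* receive from "next" rank */

   memcpy(tmp_pos, loc_pos, loc_n * sizeof(vect_t));  /* phase 0: own block */
   /* interactions of the local block with itself go here */
   for (int phase = 1; phase < comm_sz; phase++) {
      MPI_Sendrecv_replace(tmp_pos, loc_n, vect_mpi_t,
                           dest, 0, source, 0, comm, MPI_STATUS_IGNORE);
      /* tmp_pos now holds the block that originated on rank
         (my_rank + phase) % comm_sz; compute the interactions between the
         local particles and this block here */
   }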

Pseudo-code for the MPI implementation of the reduced n-body solver Copyright © 2010, Elsevier Inc. All rights Reserved

Loops iterating through global particle indexes Copyright © 2010, Elsevier Inc. All rights Reserved