
Slide 1: Parallel Algorithms
Research Computing, UNC - Chapel Hill
Instructor: Mark Reed
Email: markreed@unc.edu

Slide 2: Overview
- Parallel Algorithms
- Parallel Random Numbers
- Application Scaling
- MPI Bandwidth

Slide 3: Domain Decomposition
- Partition data across processors
- Most widely used approach
- "Owner" computes
credit: George Karypis – Principles of Parallel Algorithm Design

Slide 4: Dense Matrix Multiply
- Data sharing for matrix multiplication under different partitionings
- The shaded regions of the input matrices (A, B) are required by the process that computes the shaded portion of the output matrix C.
credit: George Karypis – Principles of Parallel Algorithm Design

Slide 5: Dense Matrix Multiply (figure)

Slide 6: Parallel Sum
- Sum for Nprocs = 8
- Complete after log2(Nprocs) steps (3 steps for 8 processes)
credit: Designing and Building Parallel Programs – Ian Foster
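
A minimal sketch of this tree-style reduction in C with MPI, using recursive doubling so that every rank holds the global sum after log2(Nprocs) steps. It assumes the number of ranks is a power of two; in practice MPI_Allreduce does this (and more) for you.

```c
/* tree_sum.c: recursive-doubling global sum, a hand-rolled
 * stand-in for MPI_Allreduce. Assumes nprocs is a power of two. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double sum = (double)(rank + 1);      /* each rank's local value */

    /* At step s, the partner differs in bit s; exchange and add. */
    for (int mask = 1; mask < nprocs; mask <<= 1) {
        int partner = rank ^ mask;
        double recv;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &recv, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += recv;
    }

    if (rank == 0)   /* for nprocs = 8: 1 + 2 + ... + 8 = 36, in 3 steps */
        printf("global sum = %g\n", sum);
    MPI_Finalize();
    return 0;
}
```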

Slide 7: Master/Workers Model
- Often embarrassingly parallel
- Master:
  - decomposes the problem into small tasks
  - distributes tasks to workers
  - gathers partial results to produce the final result
- Workers:
  - do the work
  - pass results back to the master
  - request more work (optional)
- Mapping/Load Balancing: static or dynamic (see the sketch below)
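
A minimal master/worker sketch in C with MPI showing dynamic (self-scheduling) distribution. The task content is a placeholder for illustration: rank 0 hands out task indices and workers square them; it is not the work of any particular application.

```c
/* taskfarm.c: dynamic master/worker with MPI.
 * The master (rank 0) hands out one task at a time; TAG_STOP
 * tells a worker there is no more work. Needs >= 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   20
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                       /* master */
        int next = 0, total = 0;
        /* prime each worker with one task (or a stop if none left) */
        for (int w = 1; w < nprocs; w++) {
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* collect each result, then send that worker more work or a stop */
        for (int done = 0; done < NTASKS; done++) {
            int result; MPI_Status st;
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            total += result;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
            }
        }
        printf("sum of squares 0..%d = %d\n", NTASKS - 1, total);
    } else {                               /* worker */
        for (;;) {
            int task; MPI_Status st;
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            int result = task * task;      /* placeholder "work" */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```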

Slide 8: Master/Workers Load Balance
- Iterations may have different and unpredictable run times
  - systematic variance
  - algorithmic variance
- Goal: balance the load evenly while keeping scheduling overhead low
- Some schemes (chunk-size sketch below):
  - Block decomposition (static chunking)
  - Round-robin decomposition
  - Self-scheduling: assign one iteration at a time
  - Guided dynamic self-scheduling: assign 1/P of the remaining iterations (P = # procs)
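
A small sketch (plain C, no MPI) of how guided self-scheduling shrinks chunk sizes; N and P here are example values, not from the slides. Each grab takes 1/P of whatever remains, so early chunks are large (low overhead) and late chunks taper toward 1 (good balance).

```c
/* guided.c: print the chunk sizes guided self-scheduling would
 * hand out for N iterations on P workers. */
#include <stdio.h>

int main(void)
{
    int N = 100, P = 4;                /* example sizes (assumptions) */
    int remaining = N;
    while (remaining > 0) {
        int chunk = remaining / P;     /* 1/P of what is left */
        if (chunk < 1) chunk = 1;      /* never hand out zero work */
        printf("chunk of %2d (remaining before: %3d)\n", chunk, remaining);
        remaining -= chunk;
    }
    return 0;
}
```

For N = 100 and P = 4 this produces chunks of 25, 18, 14, 10, ... tapering down to single iterations at the end.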

Slide 9: Functional Parallelism
- Map tasks onto sets of processors
- Further decompose each function over its data domain
credit: Designing and Building Parallel Programs – Ian Foster

Slide 10: Recursive Bisection
- Orthogonal Recursive Bisection (ORB)
  - good for decomposing irregular grids with mostly local communication
  - partitions the domain by successively subdividing it into equal parts of work along orthogonal coordinate directions
  - the cutting direction is varied at each level of the recursion
  - ORB partitioning is restricted to p = 2^k processors
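
A toy ORB sketch in C on 2D points (illustrative only, not the partitioner from the paper on the next slide): sort the points along the current cutting axis, split at the median so each half carries equal work, and recurse with the axis alternating, stopping after k levels for p = 2^k partitions.

```c
/* orb.c: toy orthogonal recursive bisection of 2D points.
 * Splits at the median along alternating axes so each leaf
 * gets an equal share of points. Produces p = 2^k partitions. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y; } Point;

static int axis;   /* 0 = cut in x, 1 = cut in y; set before each qsort */
static int cmp(const void *a, const void *b)
{
    const Point *p = a, *q = b;
    double d = axis ? p->y - q->y : p->x - q->x;
    return (d > 0) - (d < 0);
}

/* Assign points [lo, hi) to partitions [part, part + 2^levels). */
static void orb(Point *pts, int lo, int hi, int levels, int part)
{
    if (levels == 0) {
        printf("partition %d gets %d points\n", part, hi - lo);
        return;
    }
    axis = levels % 2;                 /* alternate cut direction */
    qsort(pts + lo, hi - lo, sizeof(Point), cmp);
    int mid = lo + (hi - lo) / 2;      /* median split: equal work */
    orb(pts, lo, mid, levels - 1, part);
    orb(pts, mid, hi, levels - 1, part + (1 << (levels - 1)));
}

int main(void)
{
    enum { N = 1000 };
    Point pts[N];
    for (int i = 0; i < N; i++) {      /* random test points */
        pts[i].x = rand() / (double)RAND_MAX;
        pts[i].y = rand() / (double)RAND_MAX;
    }
    orb(pts, 0, N, 3, 0);              /* 2^3 = 8 partitions */
    return 0;
}
```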

Slide 11: ORB Example – Groundwater Modeling at UNC-CH
From "A high-performance lattice Boltzmann implementation to model flow in porous media" by Chongxun Pan, Jan F. Prins, and Cass T. Miller.
- Figure: geometry of the homogeneous sphere-packed medium: (a) 3D isosurface view; (b) 2D cross-section view. Blue and white areas stand for solid and fluid spaces, respectively.
- Figure: two-dimensional examples of the non-uniform domain decompositions on 16 processors: (left) rectilinear partitioning; (right) orthogonal recursive bisection (ORB) decomposition.

Slide 12: Parallel Random Numbers
- Example: Parallel Monte Carlo
- Additional requirements:
  - usable for an arbitrary (large) number of processors
  - pseudo-random across processors: streams are uncorrelated
  - streams generated independently, for efficiency
- Rule of thumb: the maximum usable sample size is at most the square root of the period
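
A minimal sketch (C with MPI and the POSIX erand48 generator) of giving each rank its own stream by seeding per-rank state. Note that naive seeding like this does not guarantee uncorrelated streams; that is exactly the problem libraries such as SPRNG (next slide) are designed to solve.

```c
/* mc_pi.c: embarrassingly parallel Monte Carlo estimate of pi.
 * Each rank seeds its own erand48() state from its rank, so ranks
 * draw from separate (though not provably uncorrelated) sequences. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* per-rank RNG state: mix the rank into the 48-bit seed */
    unsigned short state[3] = { 0x330E, (unsigned short)rank,
                                (unsigned short)(rank * 2654435761u) };

    long local_hits = 0, n = 1000000;      /* samples per rank */
    for (long i = 0; i < n; i++) {
        double x = erand48(state), y = erand48(state);
        if (x * x + y * y <= 1.0) local_hits++;
    }

    long hits;
    MPI_Reduce(&local_hits, &hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %f\n", 4.0 * hits / (double)(n * nprocs));
    MPI_Finalize();
    return 0;
}
```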

Slide 13: Parallel Random Numbers
- Scalable Parallel Random Number Generators Library (SPRNG)
  - free, with source available
  - collects 5 RNGs together in one package
  - http://sprng.cs.fsu.edu

Slide 14: QCD Application
- MILC (MIMD Lattice Computation)
- quarks and gluons formulated on a space-time lattice
- mostly asynchronous point-to-point communication, via MPI's persistent requests:
  - MPI_Send_init, MPI_Start, MPI_Startall
  - MPI_Recv_init, MPI_Wait, MPI_Waitall
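
A short sketch of that persistent point-to-point pattern in C (a generic ring exchange, not MILC's actual communication layer): set the send/recv up once with MPI_Send_init/MPI_Recv_init, then restart the same requests each iteration with MPI_Startall and complete them with MPI_Waitall.

```c
/* persistent.c: ring exchange using persistent requests.
 * The communication "channel" is created once, then re-started
 * every iteration, avoiding per-call setup cost. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int right = (rank + 1) % nprocs;
    int left  = (rank + nprocs - 1) % nprocs;

    double sendbuf[64], recvbuf[64];
    MPI_Request reqs[2];

    /* one-time setup of the persistent send/recv pair */
    MPI_Send_init(sendbuf, 64, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recvbuf, 64, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    for (int step = 0; step < 100; step++) {
        for (int i = 0; i < 64; i++)          /* refill the send buffer */
            sendbuf[i] = rank + step + i;
        MPI_Startall(2, reqs);                /* restart both requests */
        /* ... overlap independent computation here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}
```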

Slide 15: MILC – Strong Scaling (figure)

Slide 16: MILC – Strong Scaling (figure, continued)

Slide 17: UNC Capability Computing - Topsail
- Compute nodes: 520 dual-socket, quad-core Intel "Clovertown" nodes
  - 4 MB L2 cache per socket
  - 2.66 GHz processors
  - 4160 cores total
- 12 GB memory/node
- Shared disk: 39 TB IBRIX parallel file system
- Interconnect: InfiniBand
- 64-bit OS
cluster photos: Scott Sawyer, Dell

Slide 18: MPI PTP on baobab
- Need large messages to achieve high transfer rates
- Latency cost dominates for small messages
- MPI_Send crosses over from buffered to synchronous behavior as message size grows
- These numbers are instructional only, not a benchmark
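
The standard way to measure curves like these is a ping-pong test; here is a minimal sketch in C (not the actual code behind the baobab numbers). Rank 0 times round trips for each message size and reports one-way latency and bandwidth; run with exactly 2 ranks.

```c
/* pingpong.c: crude point-to-point latency/bandwidth probe. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 100;
    char *buf = malloc(1 << 22);                 /* up to 4 MB */

    for (int bytes = 1; bytes <= (1 << 22); bytes <<= 1) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);  /* one-way time */
        if (rank == 0)
            printf("%8d bytes: %8.2f us, %8.2f MB/s\n",
                   bytes, t * 1e6, bytes / t / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```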

Slide 19: MPI PTP on Topsail
- InfiniBand (IB) interconnect
- Note the higher bandwidth and lower latency
- Two modes of standard send are visible

Slide 20: Community Atmosphere Model (CAM)
- Global atmosphere model for the weather and climate research communities (from NCAR)
- Atmospheric component of the Community Climate System Model (CCSM)
- Hybrid MPI/OpenMP code, run here with MPI only
- Running the Eulerian dynamical core with spectral truncation of 31 or 42
  - T31: 48 x 96 x 26 (lat x lon x nlev)
  - T42: 64 x 128 x 26
- Spectral dynamical cores are domain-decomposed over latitude only
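
For reference, the usual skeleton of a hybrid MPI/OpenMP code (a generic sketch, not CAM's source): request thread support at initialization, then split loops among OpenMP threads within each MPI rank.

```c
/* hybrid.c: minimal MPI + OpenMP skeleton.
 * Compile with e.g. mpicc -fopenmp hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* FUNNELED: only the main thread makes MPI calls */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```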

Slide 21: CAM Performance (scaling plots for T31 and T42)

