
Slide 1: Parallel Algorithms
Research Computing, UNC - Chapel Hill
Instructor: Mark Reed
Email: markreed@unc.edu

Slide 2: Overview
- Parallel Algorithms
- Parallel Random Numbers
- Application Scaling
- MPI Bandwidth

Slide 3: Domain Decomposition
- Partition data across processors
- Most widely used approach
- "Owner" computes
credit: George Karypis – Principles of Parallel Algorithm Design

Slide 4: Dense Matrix Multiply
- Data sharing for matrix multiplication under different partitionings
- The shaded regions of the input matrices (A, B) are required by the process that computes the shaded portion of the output matrix C.
credit: George Karypis – Principles of Parallel Algorithm Design

Slide 5: Dense Matrix Multiply (figure)

Slide 6: Parallel Sum
- Sum for Nprocs = 8
- Complete after log2(Nprocs) steps (3 steps for 8 processes)
credit: Designing and Building Parallel Programs – Ian Foster
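
A minimal sketch of this tree-style reduction in C with MPI, using recursive doubling so that every rank holds the global sum after log2(Nprocs) steps. It assumes the number of ranks is a power of two; in practice MPI_Allreduce does this (and more) for you.

```c
/* tree_sum.c: recursive-doubling global sum, a hand-rolled
 * stand-in for MPI_Allreduce. Assumes nprocs is a power of two. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double sum = (double)(rank + 1);      /* each rank's local value */

    /* At step s, the partner differs in bit s; exchange and add. */
    for (int mask = 1; mask < nprocs; mask <<= 1) {
        int partner = rank ^ mask;
        double recv;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &recv, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += recv;
    }

    if (rank == 0)   /* for nprocs = 8: 1 + 2 + ... + 8 = 36, in 3 steps */
        printf("global sum = %g\n", sum);
    MPI_Finalize();
    return 0;
}
```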

Slide 7: Master/Workers Model
- Often embarrassingly parallel
- Master:
  - decomposes the problem into small tasks
  - distributes tasks to workers
  - gathers partial results to produce the final result
- Workers:
  - do the work
  - pass results back to the master
  - request more work (optional)
- Mapping/Load Balancing: static or dynamic (see the sketch below)
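
A minimal master/worker sketch in C with MPI showing dynamic (self-scheduling) distribution. The task content is a placeholder for illustration: rank 0 hands out task indices and workers square them; it is not the work of any particular application.

```c
/* taskfarm.c: dynamic master/worker with MPI.
 * The master (rank 0) hands out one task at a time; TAG_STOP
 * tells a worker there is no more work. Needs >= 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   20
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                       /* master */
        int next = 0, total = 0;
        /* prime each worker with one task (or a stop if none left) */
        for (int w = 1; w < nprocs; w++) {
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* collect each result, then send that worker more work or a stop */
        for (int done = 0; done < NTASKS; done++) {
            int result; MPI_Status st;
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            total += result;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
            }
        }
        printf("sum of squares 0..%d = %d\n", NTASKS - 1, total);
    } else {                               /* worker */
        for (;;) {
            int task; MPI_Status st;
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            int result = task * task;      /* placeholder "work" */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```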

Slide 8: Master/Workers Load Balance
- Iterations may have different and unpredictable run times
  - systematic variance
  - algorithmic variance
- Goal: balance the load evenly while keeping scheduling overhead low
- Some schemes (chunk-size sketch below):
  - Block decomposition (static chunking)
  - Round-robin decomposition
  - Self-scheduling: assign one iteration at a time
  - Guided dynamic self-scheduling: assign 1/P of the remaining iterations (P = # procs)
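
A small sketch (plain C, no MPI) of how guided self-scheduling shrinks chunk sizes; N and P here are example values, not from the slides. Each grab takes 1/P of whatever remains, so early chunks are large (low overhead) and late chunks taper toward 1 (good balance).

```c
/* guided.c: print the chunk sizes guided self-scheduling would
 * hand out for N iterations on P workers. */
#include <stdio.h>

int main(void)
{
    int N = 100, P = 4;                /* example sizes (assumptions) */
    int remaining = N;
    while (remaining > 0) {
        int chunk = remaining / P;     /* 1/P of what is left */
        if (chunk < 1) chunk = 1;      /* never hand out zero work */
        printf("chunk of %2d (remaining before: %3d)\n", chunk, remaining);
        remaining -= chunk;
    }
    return 0;
}
```

For N = 100 and P = 4 this produces chunks of 25, 18, 14, 10, ... tapering down to single iterations at the end.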

Slide 9: Functional Parallelism
- Map tasks onto sets of processors
- Further decompose each function over its data domain
credit: Designing and Building Parallel Programs – Ian Foster

Slide 10: Recursive Bisection
- Orthogonal Recursive Bisection (ORB)
  - good for decomposing irregular grids with mostly local communication
  - partitions the domain by successively subdividing it into equal parts of work along orthogonal coordinate directions
  - the cutting direction is varied at each level of the recursion
  - ORB partitioning is restricted to p = 2^k processors
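
A toy ORB sketch in C on 2D points (illustrative only, not the partitioner from the paper on the next slide): sort the points along the current cutting axis, split at the median so each half carries equal work, and recurse with the axis alternating, stopping after k levels for p = 2^k partitions.

```c
/* orb.c: toy orthogonal recursive bisection of 2D points.
 * Splits at the median along alternating axes so each leaf
 * gets an equal share of points. Produces p = 2^k partitions. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y; } Point;

static int axis;   /* 0 = cut in x, 1 = cut in y; set before each qsort */
static int cmp(const void *a, const void *b)
{
    const Point *p = a, *q = b;
    double d = axis ? p->y - q->y : p->x - q->x;
    return (d > 0) - (d < 0);
}

/* Assign points [lo, hi) to partitions [part, part + 2^levels). */
static void orb(Point *pts, int lo, int hi, int levels, int part)
{
    if (levels == 0) {
        printf("partition %d gets %d points\n", part, hi - lo);
        return;
    }
    axis = levels % 2;                 /* alternate cut direction */
    qsort(pts + lo, hi - lo, sizeof(Point), cmp);
    int mid = lo + (hi - lo) / 2;      /* median split: equal work */
    orb(pts, lo, mid, levels - 1, part);
    orb(pts, mid, hi, levels - 1, part + (1 << (levels - 1)));
}

int main(void)
{
    enum { N = 1000 };
    Point pts[N];
    for (int i = 0; i < N; i++) {      /* random test points */
        pts[i].x = rand() / (double)RAND_MAX;
        pts[i].y = rand() / (double)RAND_MAX;
    }
    orb(pts, 0, N, 3, 0);              /* 2^3 = 8 partitions */
    return 0;
}
```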

Slide 11: ORB Example – Groundwater Modeling at UNC-CH
From "A high-performance lattice Boltzmann implementation to model flow in porous media" by Chongxun Pan, Jan F. Prins, and Cass T. Miller.
- Figure: geometry of the homogeneous sphere-packed medium: (a) 3D isosurface view; (b) 2D cross-section view. Blue and white areas stand for solid and fluid spaces, respectively.
- Figure: two-dimensional examples of the non-uniform domain decompositions on 16 processors: (left) rectilinear partitioning; (right) orthogonal recursive bisection (ORB) decomposition.

Slide 12: Parallel Random Numbers
- Example: Parallel Monte Carlo
- Additional requirements:
  - usable for an arbitrary (large) number of processors
  - pseudo-random across processors: streams are uncorrelated
  - streams generated independently, for efficiency
- Rule of thumb: the maximum usable sample size is at most the square root of the period
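
A minimal sketch (C with MPI and the POSIX erand48 generator) of giving each rank its own stream by seeding per-rank state. Note that naive seeding like this does not guarantee uncorrelated streams; that is exactly the problem libraries such as SPRNG (next slide) are designed to solve.

```c
/* mc_pi.c: embarrassingly parallel Monte Carlo estimate of pi.
 * Each rank seeds its own erand48() state from its rank, so ranks
 * draw from separate (though not provably uncorrelated) sequences. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* per-rank RNG state: mix the rank into the 48-bit seed */
    unsigned short state[3] = { 0x330E, (unsigned short)rank,
                                (unsigned short)(rank * 2654435761u) };

    long local_hits = 0, n = 1000000;      /* samples per rank */
    for (long i = 0; i < n; i++) {
        double x = erand48(state), y = erand48(state);
        if (x * x + y * y <= 1.0) local_hits++;
    }

    long hits;
    MPI_Reduce(&local_hits, &hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %f\n", 4.0 * hits / (double)(n * nprocs));
    MPI_Finalize();
    return 0;
}
```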

Slide 13: Parallel Random Numbers
- Scalable Parallel Random Number Generators Library (SPRNG)
  - free, with source available
  - collects 5 RNGs together in one package
  - http://sprng.cs.fsu.edu

Slide 14: QCD Application
- MILC (MIMD Lattice Computation)
- quarks and gluons formulated on a space-time lattice
- mostly asynchronous point-to-point communication, via MPI's persistent requests:
  - MPI_Send_init, MPI_Start, MPI_Startall
  - MPI_Recv_init, MPI_Wait, MPI_Waitall
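
A short sketch of that persistent point-to-point pattern in C (a generic ring exchange, not MILC's actual communication layer): set the send/recv up once with MPI_Send_init/MPI_Recv_init, then restart the same requests each iteration with MPI_Startall and complete them with MPI_Waitall.

```c
/* persistent.c: ring exchange using persistent requests.
 * The communication "channel" is created once, then re-started
 * every iteration, avoiding per-call setup cost. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int right = (rank + 1) % nprocs;
    int left  = (rank + nprocs - 1) % nprocs;

    double sendbuf[64], recvbuf[64];
    MPI_Request reqs[2];

    /* one-time setup of the persistent send/recv pair */
    MPI_Send_init(sendbuf, 64, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recvbuf, 64, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    for (int step = 0; step < 100; step++) {
        for (int i = 0; i < 64; i++)          /* refill the send buffer */
            sendbuf[i] = rank + step + i;
        MPI_Startall(2, reqs);                /* restart both requests */
        /* ... overlap independent computation here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}
```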

Slide 15: MILC – Strong Scaling (figure)

Slide 16: MILC – Strong Scaling (figure, continued)

Slide 17: UNC Capability Computing - Topsail
- Compute nodes: 520 dual-socket, quad-core Intel "Clovertown" nodes
  - 4 MB L2 cache per socket
  - 2.66 GHz processors
  - 4160 cores total
- 12 GB memory/node
- Shared disk: 39 TB IBRIX parallel file system
- Interconnect: InfiniBand
- 64-bit OS
cluster photos: Scott Sawyer, Dell

Slide 18: MPI PTP on baobab
- Need large messages to achieve high transfer rates
- Latency cost dominates for small messages
- MPI_Send crosses over from buffered to synchronous behavior as message size grows
- These numbers are instructional only, not a benchmark
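
The standard way to measure curves like these is a ping-pong test; here is a minimal sketch in C (not the actual code behind the baobab numbers). Rank 0 times round trips for each message size and reports one-way latency and bandwidth; run with exactly 2 ranks.

```c
/* pingpong.c: crude point-to-point latency/bandwidth probe. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 100;
    char *buf = malloc(1 << 22);                 /* up to 4 MB */

    for (int bytes = 1; bytes <= (1 << 22); bytes <<= 1) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);  /* one-way time */
        if (rank == 0)
            printf("%8d bytes: %8.2f us, %8.2f MB/s\n",
                   bytes, t * 1e6, bytes / t / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```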

Slide 19: MPI PTP on Topsail
- InfiniBand (IB) interconnect
- Note the higher bandwidth and lower latency
- Two modes of standard send are visible

Slide 20: Community Atmosphere Model (CAM)
- Global atmosphere model for the weather and climate research communities (from NCAR)
- Atmospheric component of the Community Climate System Model (CCSM)
- Hybrid MPI/OpenMP code, run here with MPI only
- Running the Eulerian dynamical core with spectral truncation of 31 or 42
  - T31: 48 x 96 x 26 (lat x lon x nlev)
  - T42: 64 x 128 x 26
- Spectral dynamical cores are domain-decomposed over latitude only
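
For reference, the usual skeleton of a hybrid MPI/OpenMP code (a generic sketch, not CAM's source): request thread support at initialization, then split loops among OpenMP threads within each MPI rank.

```c
/* hybrid.c: minimal MPI + OpenMP skeleton.
 * Compile with e.g. mpicc -fopenmp hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* FUNNELED: only the main thread makes MPI calls */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```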

Slide 21: CAM Performance (scaling plots for T31 and T42)

