Slide 1: Parallel Algorithms
Research Computing, UNC-Chapel Hill
Instructor: Mark Reed
Email: markreed@unc.edu
Slide 2: Overview
- Parallel Algorithms
- Parallel Random Numbers
- Application Scaling
- MPI Bandwidth
Slide 3: Domain Decomposition
- Partition data across processors
- Most widely used approach
- "Owner computes" rule: the process that owns a piece of data performs the updates on it (sketched below)
(credit: George Karypis, Principles of Parallel Algorithm Design)
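A minimal sketch of the owner-computes rule in C with MPI (my illustration, not from the slides): each rank owns a contiguous block of a global array and updates only the elements it owns. The array size and the update itself are illustrative assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    /* Owner-computes sketch: each rank owns a contiguous block of a
     * global array of N elements and updates only that block. */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int N = 1000000;              /* illustrative global size */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Block decomposition: rank r owns global indices [lo, hi) */
        int chunk = (N + nprocs - 1) / nprocs;
        int lo = rank * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;

        double *local = malloc((hi - lo) * sizeof(double));
        for (int i = lo; i < hi; i++)
            local[i - lo] = 2.0 * i;        /* the owner computes its own elements */

        free(local);
        MPI_Finalize();
        return 0;
    }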
Slide 4: Dense Matrix Multiply
- Data sharing for matrix multiplication under different partitionings
- Figure: the shaded regions of the input matrices A and B are required by the process that computes the shaded portion of the output matrix C
(credit: George Karypis, Principles of Parallel Algorithm Design)
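As one illustration of that data-sharing requirement, here is a hedged sketch of a row-block partitioning in C with MPI: each process owns a block of rows of A, B, and C, and must gather all of B before it can form its rows of C. The matrix size and the assumption that the process count divides n are illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    /* Row-block matrix multiply sketch: rank r owns row blocks of A, B, and C,
     * but needs all of B, so the row blocks of B are allgathered first. */
    void block_matmul(int n, int nprocs)
    {
        int rows = n / nprocs;                      /* assume nprocs divides n */
        double *A    = calloc((size_t)rows * n, sizeof(double)); /* my rows of A */
        double *Bblk = calloc((size_t)rows * n, sizeof(double)); /* my rows of B */
        double *B    = calloc((size_t)n * n,    sizeof(double)); /* full B after exchange */
        double *C    = calloc((size_t)rows * n, sizeof(double)); /* my rows of C */

        /* ... fill A and Bblk with this rank's data ... */

        /* every process must see all of B to form its rows of C */
        MPI_Allgather(Bblk, rows * n, MPI_DOUBLE,
                      B,    rows * n, MPI_DOUBLE, MPI_COMM_WORLD);

        for (int i = 0; i < rows; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];

        free(A); free(Bblk); free(B); free(C);
    }

    int main(int argc, char **argv)
    {
        int nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        block_matmul(256, nprocs);                  /* illustrative size; assumes 256 % nprocs == 0 */
        MPI_Finalize();
        return 0;
    }

The allgather of B is the "all of B" shaded region for a 1-D row partitioning; a 2-D (checkerboard) partitioning shrinks the shared region to one block row of A and one block column of B per process.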
Slide 5: Dense Matrix Multiply (continued; figure slide)
Slide 6: Parallel Sum
- Sum for Nprocs = 8
- Complete after log2(Nprocs) steps (sketched below)
(credit: Ian Foster, Designing and Building Parallel Programs)
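A minimal sketch of that log2(Nprocs)-step pairwise summation in MPI (my illustration; in practice MPI_Reduce implements the same tree). The per-rank value is illustrative.

    #include <mpi.h>
    #include <stdio.h>

    /* Tree sum to rank 0 in log2(nprocs) steps: at step s, ranks whose
     * bit s is set send their partial sum to a partner s below them and
     * drop out, so the number of active ranks halves each step. */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double sum = (double)rank;              /* each rank's local value */

        for (int step = 1; step < nprocs; step <<= 1) {
            if (rank & step) {                  /* sender: partner is rank - step */
                MPI_Send(&sum, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
                break;                          /* this rank is done */
            } else if (rank + step < nprocs) {  /* receiver: partner is rank + step */
                double other;
                MPI_Recv(&other, 1, MPI_DOUBLE, rank + step, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum += other;
            }
        }
        if (rank == 0)
            printf("total = %g\n", sum);        /* 0 + 1 + ... + (nprocs - 1) */

        MPI_Finalize();
        return 0;
    }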
Slide 7: Master/Workers Model
- Often embarrassingly parallel
- Master:
  - decomposes the problem into small tasks
  - distributes tasks to workers
  - gathers partial results to produce the final result
- Workers:
  - do the assigned work
  - pass results back to the master
  - request more work (optional)
- Mapping / load balancing: static or dynamic
(figure: master rank connected to worker ranks; a sketch follows below)
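A hedged MPI sketch of the dynamic version of this pattern; the task count and the stand-in computation are illustrative assumptions.

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   100                    /* illustrative task count */
    #define TAG_WORK 1
    #define TAG_STOP 2

    /* Dynamic master/worker sketch: rank 0 hands out one task at a time;
     * returning a result doubles as a request for more work. */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {                    /* master */
            int next = 0, active = 0;
            double total = 0.0, result;
            MPI_Status st;

            /* seed each worker with one task (or tell it to stop right away) */
            for (int w = 1; w < nprocs; w++) {
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                    next++; active++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                }
            }
            /* gather results; hand out remaining tasks as workers free up */
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                total += result;
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    active--;
                }
            }
            printf("total = %g\n", total);
        } else {                            /* worker */
            int task;
            MPI_Status st;
            for (;;) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                double result = (double)task * task;  /* stand-in for real work */
                MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }

Because a worker only gets a new task after it finishes the previous one, faster workers naturally process more tasks, which is what gives the scheme its dynamic load balance.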
Slide 8: Master/Workers Load Balance
- Iterations may have different and unpredictable run times
  - systematic variance
  - algorithmic variance
- Goal: balance the load evenly without incurring too much scheduling overhead
- Some schemes (chunk-size sketch below):
  - Block decomposition (static chunking)
  - Round-robin decomposition
  - Self-scheduling: assign one iteration at a time
  - Guided dynamic self-scheduling: assign 1/P of the remaining iterations (P = number of processes)
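A small sketch of how guided self-scheduling chunk sizes shrink as the loop drains; the loop length and process count below are illustrative.

    #include <stdio.h>

    /* Guided self-scheduling sketch: each time a worker asks for work it
     * gets ceil(remaining / P) iterations, so chunks shrink as the loop
     * drains and stragglers are left with small pieces near the end. */
    int next_chunk(int remaining, int nprocs)
    {
        int chunk = (remaining + nprocs - 1) / nprocs;  /* ceil(remaining / P) */
        return chunk > 0 ? chunk : 0;
    }

    int main(void)
    {
        int remaining = 1000, nprocs = 8;   /* illustrative loop of 1000 iterations */
        while (remaining > 0) {
            int c = next_chunk(remaining, nprocs);
            printf("assign %d iterations (%d left)\n", c, remaining - c);
            remaining -= c;
        }
        return 0;
    }

The first chunks are large, which keeps scheduling overhead low; the last chunks are small, which lets the workers finish at nearly the same time.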
Slide 9: Functional Parallelism
- Map tasks (functions) onto sets of processors
- Further decompose each function over its data domain
(credit: Ian Foster, Designing and Building Parallel Programs)
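One common way to express this in MPI, shown as a hedged sketch (the three component names are illustrative, not from the slides): split MPI_COMM_WORLD so that each functional component gets its own sub-communicator and can then domain-decompose its data within it.

    #include <mpi.h>
    #include <stdio.h>

    /* Functional decomposition sketch: split the ranks into three groups,
     * one per model component; each group then does its own
     * domain-decomposed work inside its sub-communicator. */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int color = rank % 3;           /* 0 = atmosphere, 1 = ocean, 2 = land (illustrative) */
        MPI_Comm sub;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &sub);

        int subrank, subsize;
        MPI_Comm_rank(sub, &subrank);
        MPI_Comm_size(sub, &subsize);
        printf("world rank %d -> component %d, local rank %d of %d\n",
               rank, color, subrank, subsize);

        MPI_Comm_free(&sub);
        MPI_Finalize();
        return 0;
    }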
Slide 10: Recursive Bisection
- Orthogonal Recursive Bisection (ORB)
- Good for decomposing irregular grids with mostly local communication
- Partition the domain into equal parts of work by successively subdividing it along orthogonal coordinate directions
- The cutting direction is varied at each level of the recursion
- ORB partitioning is restricted to p = 2^k processors (sketch below)
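A minimal serial sketch of ORB on 2D points, where equal point counts stand in for equal work; the point count, partition count, and random coordinates are illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    /* ORB sketch: split 2D points into 2^k equal-work parts, alternating
     * the cut direction at each level of the recursion. Each point ends
     * up tagged with the id of the partition (processor) that owns it. */
    typedef struct { double x[2]; int part; } Point;

    static int cut_axis;                    /* axis used by the comparator */

    static int cmp(const void *a, const void *b)
    {
        double d = ((const Point *)a)->x[cut_axis] - ((const Point *)b)->x[cut_axis];
        return (d > 0) - (d < 0);
    }

    static void orb(Point *p, int n, int first_part, int nparts, int axis)
    {
        if (nparts == 1) {                  /* leaf: assign this block to one part */
            for (int i = 0; i < n; i++) p[i].part = first_part;
            return;
        }
        cut_axis = axis;
        qsort(p, n, sizeof(Point), cmp);    /* median cut along the current axis */
        int half = n / 2;
        orb(p,        half,     first_part,              nparts / 2, 1 - axis);
        orb(p + half, n - half, first_part + nparts / 2, nparts / 2, 1 - axis);
    }

    int main(void)
    {
        int n = 16, nparts = 4;             /* illustrative sizes; nparts = 2^k */
        Point *pts = malloc(n * sizeof(Point));
        for (int i = 0; i < n; i++) {
            pts[i].x[0] = rand() / (double)RAND_MAX;
            pts[i].x[1] = rand() / (double)RAND_MAX;
        }
        orb(pts, n, 0, nparts, 0);
        for (int i = 0; i < n; i++)
            printf("(%.2f, %.2f) -> part %d\n", pts[i].x[0], pts[i].x[1], pts[i].part);
        free(pts);
        return 0;
    }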
Slide 11: ORB Example: Groundwater Modeling at UNC-Chapel Hill
From "A high-performance lattice Boltzmann implementation to model flow in porous media" by Chongxun Pan, Jan F. Prins, and Cass T. Miller.
- Figure: geometry of the homogeneous sphere-packed medium: (a) 3D isosurface view; (b) 2D cross-section view. Blue and white areas denote solid and fluid spaces, respectively.
- Figure: two-dimensional examples of the non-uniform domain decompositions on 16 processors: (left) rectilinear partitioning; (right) orthogonal recursive bisection (ORB) decomposition.
Slide 12: Parallel Random Numbers
- Example: parallel Monte Carlo
- Additional requirements:
  - usable for an arbitrary (large) number of processors
  - pseudo-random across processors: the streams must be uncorrelated
  - generated independently, for efficiency
- Rule of thumb: the maximum usable sample size is at most the square root of the period
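As a minimal illustration of giving each processor its own stream, here is a leapfrogged 64-bit LCG of my own (not SPRNG and not from the slides; a purpose-built library is the better choice in practice): rank r consumes every P-th value of one global sequence, so the per-rank streams partition the sequence rather than overlap.

    #include <stdio.h>
    #include <stdint.h>

    /* Leapfrog sketch: one global LCG x_{n+1} = A*x_n + C (mod 2^64);
     * rank r takes values r, r+P, r+2P, ..., so the P per-rank streams
     * partition the global sequence instead of overlapping. */
    #define A 6364136223846793005ULL        /* Knuth MMIX multiplier */
    #define C 1442695040888963407ULL        /* Knuth MMIX increment  */

    typedef struct { uint64_t state, a_leap, c_leap; } LeapLCG;

    void leap_init(LeapLCG *g, uint64_t seed, int rank, int nprocs)
    {
        /* advance the seed 'rank' steps so each rank starts at its own offset */
        g->state = seed;
        for (int i = 0; i < rank; i++) g->state = A * g->state + C;

        /* compose the single-step map nprocs times: x_{n+P} = a_leap*x_n + c_leap */
        g->a_leap = 1; g->c_leap = 0;
        for (int i = 0; i < nprocs; i++) {
            g->c_leap = A * g->c_leap + C;
            g->a_leap = A * g->a_leap;
        }
    }

    double leap_next(LeapLCG *g)            /* uniform in [0, 1) */
    {
        g->state = g->a_leap * g->state + g->c_leap;
        return (g->state >> 11) * (1.0 / 9007199254740992.0);  /* top 53 bits / 2^53 */
    }

    int main(void)
    {
        LeapLCG g;
        leap_init(&g, 12345, 0, 8);         /* e.g., rank 0 of 8 */
        for (int i = 0; i < 4; i++) printf("%f\n", leap_next(&g));
        return 0;
    }

Leapfrogging only prevents overlap between streams; it does not by itself guarantee cross-stream independence, which is why the next slide points to a library built for the purpose.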
Slide 13: Parallel Random Numbers
- Scalable Parallel Random Number Generators (SPRNG) library
- free, with source available
- collects 5 RNGs together in one package
- http://sprng.cs.fsu.edu
Slide 14: QCD Application
- MILC (MIMD Lattice Computation)
- quarks and gluons formulated on a space-time lattice
- mostly asynchronous point-to-point communication, using persistent requests (sketched below):
  - MPI_Send_init, MPI_Start, MPI_Startall
  - MPI_Recv_init, MPI_Wait, MPI_Waitall
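A hedged sketch of the persistent point-to-point pattern those calls imply; the ring exchange, iteration count, and buffer size are illustrative, not MILC's actual communication. The requests are set up once and restarted every iteration.

    #include <mpi.h>
    #include <stdlib.h>

    /* Persistent point-to-point sketch: set up send/recv requests once,
     * then MPI_Startall / MPI_Waitall them every iteration (here a simple
     * ring exchange of a halo buffer). */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int N = 1024;                 /* illustrative halo size */
        double *sendbuf = malloc(N * sizeof(double));
        double *recvbuf = malloc(N * sizeof(double));
        int right = (rank + 1) % nprocs;
        int left  = (rank - 1 + nprocs) % nprocs;

        MPI_Request req[2];
        MPI_Send_init(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Recv_init(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[1]);

        for (int iter = 0; iter < 100; iter++) {
            /* ... fill sendbuf for this iteration ... */
            MPI_Startall(2, req);           /* restart both persistent requests */
            /* ... overlap local computation here ... */
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            /* ... use recvbuf ... */
        }

        MPI_Request_free(&req[0]);
        MPI_Request_free(&req[1]);
        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }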
Slides 15-16: MILC Strong Scaling (performance plots; figures not reproduced here)
Slide 17: UNC Capability Computing - Topsail
- Compute nodes: 520 dual-socket, quad-core Intel "Clovertown" nodes
  - 2.66 GHz processors
  - 4 MB L2 cache per socket
  - 12 GB memory per node
  - 4160 processors in total
- Shared disk: 39 TB IBRIX parallel file system
- Interconnect: InfiniBand
- 64-bit OS
(cluster photos: Scott Sawyer, Dell)
Slide 18: MPI Point-to-Point on baobab
- Large messages are needed to achieve high transfer rates
- Latency cost dominates for small messages
- The MPI_Send crossover from buffered to synchronous mode is visible
- These results are instructional only, not a benchmark
(a ping-pong measurement sketch follows below)
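A minimal ping-pong sketch of how such point-to-point rates are typically measured (my illustration, not the actual test run on baobab): ranks 0 and 1 bounce a message back and forth and the bandwidth is computed from the round-trip time. The message sizes and repetition count are illustrative.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Ping-pong sketch between ranks 0 and 1: time NREPS round trips for a
     * range of message sizes and report the achieved bandwidth. */
    #define NREPS 100

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int bytes = 8; bytes <= (1 << 22); bytes *= 2) {
            char *buf = malloc(bytes);
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < NREPS; i++) {
                if (rank == 0) {
                    MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = MPI_Wtime() - t0;
            if (rank == 0)                  /* two messages per round trip */
                printf("%8d bytes  %8.1f MB/s\n", bytes,
                       2.0 * NREPS * bytes / t / 1.0e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }

The measured rate approaches its asymptotic value only once the message is large enough to amortize the per-message latency, which is the behavior the slide describes.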
Slide 19: MPI Point-to-Point on Topsail
- InfiniBand (IB) interconnect
- Note the higher bandwidth and lower latency (compared with baobab)
- The two modes of the standard send are again visible
Slide 20: Community Atmosphere Model (CAM)
- Global atmosphere model for the weather and climate research communities (from NCAR)
- Atmospheric component of the Community Climate System Model (CCSM)
- Hybrid MPI/OpenMP; run here with MPI only
- Running the Eulerian dynamical core with spectral truncation of T31 or T42
  - T31: 48 x 96 x 26 (lat x lon x nlev)
  - T42: 64 x 128 x 26
- The spectral dynamical cores are domain decomposed over latitude only
Slide 21: CAM Performance (scaling plots for T31 and T42; figures not reproduced here)