Parallel Algorithms
Research Computing, UNC - Chapel Hill
Instructor: Mark Reed

Overview
- Parallel Algorithms
- Parallel Random Numbers
- Application Scaling
- MPI Bandwidth

Domain Decomposition
- Partition data across processors
- Most widely used approach
- "Owner" computes
credit: George Karypis – Principles of Parallel Algorithm Design
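A minimal MPI sketch of this idea, assuming a 1-D block decomposition of a global array; the array size and the local update are illustrative, not from the slides:

/* 1-D block domain decomposition: each rank owns a contiguous slice
 * of a global array and updates only the elements it owns. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int N = 1000000;                 /* global problem size (assumed) */
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* contiguous block owned by this rank; the last rank takes the remainder */
    int chunk = N / nprocs;
    int lo = rank * chunk;
    int hi = (rank == nprocs - 1) ? N : lo + chunk;

    double *u = malloc((hi - lo) * sizeof(double));

    /* "owner computes": each rank touches only its own elements */
    for (int i = lo; i < hi; i++)
        u[i - lo] = (double)i * i;         /* illustrative local work */

    free(u);
    MPI_Finalize();
    return 0;
}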

Dense Matrix Multiply
- Data sharing for matrix multiply with different partitionings
- The shaded regions of the input matrices (A, B) are required by the process that computes the shaded portion of the output matrix C.
credit: George Karypis – Principles of Parallel Algorithm Design
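For illustration, a hedged sketch of one simple partitioning (1-D row blocks, not necessarily the scheme in the figure): each process stores a block of rows of A and C plus a full copy of B, so its data requirement is exactly its rows of A and all of B. Matrix size and initialization values are assumptions.

/* 1-D row-block dense matrix multiply: each rank owns n/nprocs rows of A
 * and C and needs all of B (broadcast from rank 0 in this sketch). */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int n = 512;                       /* matrix dimension (assumed) */
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int myrows = n / nprocs;                 /* assume nprocs divides n */
    double *A = malloc(myrows * n * sizeof(double));  /* my rows of A */
    double *B = malloc(n * n * sizeof(double));       /* full copy of B */
    double *C = malloc(myrows * n * sizeof(double));  /* my rows of C */

    for (int i = 0; i < myrows * n; i++) A[i] = 1.0;  /* illustrative data */
    if (rank == 0)
        for (int i = 0; i < n * n; i++) B[i] = 2.0;
    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* local multiply over the owned rows only */
    for (int i = 0; i < myrows; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }

    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}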

Dense Matrix Multiply (partitioning diagrams)

Parallel Sum
- Sum for Nprocs = 8
- Completes after log2(Nprocs) steps
credit: Designing and Building Parallel Programs – Ian Foster
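A minimal sketch of the log2(Nprocs)-step pattern using recursive doubling; it assumes the number of ranks is a power of two, and each rank's contribution is made up for the example:

/* Parallel sum by recursive doubling: completes in log2(nprocs)
 * exchange steps (sketch assumes nprocs is a power of two). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = rank + 1.0;       /* each rank's contribution (assumed) */
    double sum = local;

    /* at step s, exchange partial sums with partner rank ^ s */
    for (int s = 1; s < nprocs; s <<= 1) {
        int partner = rank ^ s;
        double recv;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &recv, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += recv;
    }

    /* in practice a single collective does the same job:
     * MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD); */

    if (rank == 0) printf("global sum = %f\n", sum);
    MPI_Finalize();
    return 0;
}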

Master/Workers Model
- Often embarrassingly parallel
- Master:
  - decomposes the problem into small tasks
  - distributes tasks to the workers
  - gathers partial results to produce the final result
- Workers:
  - do the work
  - pass results back to the master
  - request more work (optional)
- Mapping/Load Balancing: static or dynamic
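A minimal MPI master/worker sketch with dynamic task hand-out; the task itself (square an integer), the tag values, and the task count are illustrative assumptions, and it should be run with at least two ranks:

/* Master/worker with dynamic load balancing: the master hands out one
 * task at a time; a worker asks for more by returning a result. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   100     /* assumed; must be >= number of workers */
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                          /* master */
        int next = 0, total = 0;
        for (int w = 1; w < nprocs; w++) {    /* prime each worker with one task */
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            next++;
        }
        for (int done = 0; done < NTASKS; done++) {
            int result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            total += result;
            if (next < NTASKS) {              /* more work: send the next task */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {                          /* nothing left: tell the worker to stop */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
            }
        }
        printf("sum of results = %d\n", total);
    } else {                                  /* worker */
        for (;;) {
            int task;
            MPI_Status st;
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            int result = task * task;         /* "do the work" */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}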

Master/Workers Load Balance
- Iterations may have different and unpredictable run times
  - systematic variance
  - algorithmic variance
- Goal is to trade off load balance against scheduling overhead
- Some schemes:
  - block decomposition (static chunking)
  - round-robin decomposition
  - self scheduling: assign one iteration at a time
  - guided dynamic self-scheduling: assign 1/P of the remaining iterations (P = # procs)
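A small serial sketch of how guided dynamic self-scheduling carves up an iteration space (the iteration count and worker count are made-up values): each request gets roughly 1/P of what remains, so chunks shrink as the loop drains, limiting the load imbalance from the last few chunks.

/* Guided self-scheduling: each time a worker asks for work it gets
 * about 1/P of the iterations that remain, so chunk sizes decrease. */
#include <stdio.h>

int main(void)
{
    int n = 1000;        /* total iterations (assumed) */
    int P = 4;           /* number of workers (assumed) */
    int start = 0;

    while (start < n) {
        int remaining = n - start;
        int chunk = remaining / P;
        if (chunk < 1) chunk = 1;        /* never hand out an empty chunk */
        printf("assign iterations [%d, %d)\n", start, start + chunk);
        start += chunk;
    }
    return 0;
}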

Functional Parallelism
- Map tasks onto sets of processors
- Further decompose each function over its data domain
credit: Designing and Building Parallel Programs – Ian Foster

Recursive Bisection
- Orthogonal Recursive Bisection (ORB)
  - good for decomposing irregular grids with mostly local communication
  - partitions the domain into equal parts of work by successively subdividing along orthogonal coordinate directions
  - the cutting direction is varied at each level of the recursion
  - ORB partitioning is restricted to p = 2^k processors
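A serial 2-D sketch of the ORB idea; equal point counts stand in for equal work, and the point cloud and recursion depth are illustrative assumptions:

/* Orthogonal Recursive Bisection in 2-D: split the point set in half
 * along alternating axes until there are 2^k parts (one per rank). */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x[2]; } Point;

static int g_axis;   /* axis used by the comparator for the current split */

static int cmp_point(const void *a, const void *b)
{
    double d = ((const Point *)a)->x[g_axis] - ((const Point *)b)->x[g_axis];
    return (d > 0) - (d < 0);
}

/* Assign ranks [rank_lo, rank_lo + 2^level) to pts[0..n) */
static void orb(Point *pts, int n, int level, int axis, int rank_lo)
{
    if (level == 0) {
        printf("rank %d gets %d points\n", rank_lo, n);
        return;
    }
    g_axis = axis;
    qsort(pts, n, sizeof(Point), cmp_point);   /* median split on this axis */
    int half = n / 2;
    orb(pts,        half,     level - 1, 1 - axis, rank_lo);
    orb(pts + half, n - half, level - 1, 1 - axis, rank_lo + (1 << (level - 1)));
}

int main(void)
{
    const int n = 1024, k = 3;      /* 2^3 = 8 partitions (assumed) */
    Point *pts = malloc(n * sizeof(Point));
    for (int i = 0; i < n; i++) {   /* illustrative random point cloud */
        pts[i].x[0] = rand() / (double)RAND_MAX;
        pts[i].x[1] = rand() / (double)RAND_MAX;
    }
    orb(pts, n, k, 0, 0);
    free(pts);
    return 0;
}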

ORB Example – Groundwater Modeling at UNC-Chapel Hill
- from "A high-performance lattice Boltzmann implementation to model flow in porous media" by Chongxun Pan, Jan F. Prins, and Cass T. Miller
- Figure: geometry of the homogeneous sphere-packed medium; (a) 3D isosurface view and (b) 2D cross-section view. Blue and white areas stand for the solid and fluid spaces, respectively.
- Figure: two-dimensional examples of the non-uniform domain decompositions on 16 processors; (left) rectilinear partitioning and (right) orthogonal recursive bisection (ORB) decomposition.

Parallel Random Numbers
- Example: parallel Monte Carlo
- Additional requirements:
  - usable for an arbitrary (large) number of processors
  - pseudo-random across processors: the streams must be uncorrelated
  - streams generated independently, for efficiency
- Rule of thumb: the maximum usable sample size is at most the square root of the period
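A hedged sketch of a parallel Monte Carlo estimate of pi; seeding rand() with the rank, as done here, does not guarantee uncorrelated streams, which is exactly why a parallel generator such as SPRNG (next slide) is preferred in practice:

/* Parallel Monte Carlo estimate of pi. Each rank draws its own samples;
 * the naive per-rank seeding of rand() is only a placeholder. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const long n_per_rank = 1000000;   /* samples per rank (assumed) */
    int rank, nprocs;
    long hits = 0, total_hits = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(12345 + rank);               /* naive per-rank seed (placeholder) */
    for (long i = 0; i < n_per_rank; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;
    }

    MPI_Reduce(&hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %f\n", 4.0 * total_hits / (n_per_rank * (double)nprocs));

    MPI_Finalize();
    return 0;
}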

Parallel Random Numbers
- Scalable Parallel Random Number Generators Library (SPRNG)
  - free, with source available
  - collects 5 RNGs together in one package

QCD Application
- MILC (MIMD Lattice Computation)
- quarks and gluons formulated on a space-time lattice
- mostly asynchronous point-to-point communication, using MPI persistent requests:
  - MPI_Send_init, MPI_Start, MPI_Startall
  - MPI_Recv_init, MPI_Wait, MPI_Waitall
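A minimal sketch of the persistent point-to-point calls listed above, applied to a simple ring exchange; the message size, neighbor pattern, and iteration count are assumptions, not MILC's actual communication layout:

/* Persistent point-to-point communication: set up the requests once,
 * then start and wait on them every iteration of the time-step loop. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double sendbuf[1024], recvbuf[1024];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;

    /* build the persistent requests once, outside the loop */
    MPI_Send_init(sendbuf, 1024, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recvbuf, 1024, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    for (int step = 0; step < 100; step++) {
        for (int i = 0; i < 1024; i++) sendbuf[i] = rank + step;  /* fill halo */
        MPI_Startall(2, reqs);                /* kick off the send and recv */
        /* ... overlap local computation with communication here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        /* ... use recvbuf ... */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}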

MILC – Strong Scaling (scaling plot)

MILC – Strong Scaling (scaling plot, continued)

UNC Capability Computing - Topsail
- Compute nodes: 520 dual-socket nodes with quad-core Intel "Clovertown" processors
  - 4 MB L2 cache per socket
  - 2.66 GHz processors
  - 4160 processor cores total
- 12 GB memory per node
- Shared disk: 39 TB IBRIX parallel file system
- Interconnect: InfiniBand
- 64-bit OS
cluster photos: Scott Sawyer, Dell

MPI PTP on baobab
- Large messages are needed to achieve high transfer rates
- Latency cost dominates for small messages
- MPI_Send crosses over from buffered to synchronous behavior as message size grows
- These results are instructional only, not a benchmark
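A simple ping-pong sketch of the kind used to produce such curves; the message size and repetition count are assumed, and the reported time is per one-way transfer:

/* Ping-pong between ranks 0 and 1: small messages expose latency,
 * large messages expose achievable bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20;        /* 1 MB message (assumed) */
    const int reps = 100;
    int rank;
    char *buf = malloc(nbytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */

    if (rank == 0)
        printf("one-way time %g s, bandwidth %g MB/s\n",
               t, nbytes / t / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}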

MPI PTP on Topsail
- InfiniBand (IB) interconnect
- Note the higher bandwidth and lower latency
- Two modes of standard send

Community Atmosphere Model (CAM)
- global atmosphere model for the weather and climate research communities (from NCAR)
- atmospheric component of the Community Climate System Model (CCSM)
- hybrid MPI/OpenMP; run here with MPI only
- running the Eulerian dynamical core with spectral truncation of 31 or 42
  - T31: 48 x 96 x 26 (lat x lon x nlev)
  - T42: 64 x 128 x 26
- the spectral dynamical cores are domain decomposed over latitude only

CAM Performance (scaling plots for the T31 and T42 resolutions)