
Data Structures and Algorithms in Parallel Computing Lecture 10

Numerical algorithms
Numerical algorithms use numerical approximation to solve mathematical problems. They do not seek exact solutions, because exact solutions are often impossible to obtain in practice. Much work has been done on parallelizing numerical algorithms, including:
– Matrix operations
– Particle physics
– Systems of linear equations
– …

Matrix operations
– Inner product: x^T y = Σ_i x_i y_i (a scalar)
– Outer product: x y^T, the n×n matrix with entries x_i y_j
– Matrix-vector product: y = A x, with y_i = Σ_j a_ij x_j
– Matrix-matrix product: C = A B, with c_ij = Σ_k a_ik b_kj
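
For reference, a minimal serial C sketch of the four operations on dense, row-major n×n matrices and length-n vectors (function and variable names are illustrative, not from the slides):

/* Serial reference implementations of the four dense operations. */
double inner(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];              /* x^T y */
    return s;
}
void outer(int n, const double *x, const double *y, double *A) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) A[i*n + j] = x[i] * y[j];  /* A = x y^T */
}
void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++) y[i] += A[i*n + j] * x[j]; /* y = A x */
    }
}
void matmat(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            C[i*n + j] = 0.0;
            for (int k = 0; k < n; k++) C[i*n + j] += A[i*n + k] * B[k*n + j];
        }
}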

Inner product
Assign (n/k)/p coarse-grain tasks to each of the p processes, for a total of n/p components of x and y per process.
Communication: sum reduction over the n/k coarse-grain tasks.
Isoefficiency:
– how the amount of computation performed must scale with the number of processors to keep efficiency constant
– 1D mesh: Θ(p²)
– 2D mesh: Θ(p^(3/2))
– Hypercube: Θ(p log p)
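
A minimal sketch of the coarse-grain inner product in C with MPI, assuming n is divisible by p and each process already holds its n/p components of x and y (names and sizes are illustrative); the single MPI_Allreduce is the sum-reduction step described above:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = 1 << 20;                  /* global problem size (assumed divisible by p) */
    int local_n = n / p;              /* each process owns n/p components */
    double *x = malloc(local_n * sizeof(double));
    double *y = malloc(local_n * sizeof(double));
    for (int i = 0; i < local_n; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Local partial sum over the components this process owns. */
    double local = 0.0;
    for (int i = 0; i < local_n; i++) local += x[i] * y[i];

    /* Sum reduction across all processes: the only communication step. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("x^T y = %f\n", global);
    free(x); free(y);
    MPI_Finalize();
    return 0;
}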

Outer product
At most 2n fine-grain tasks store components of x and y: either, for some j, task (i,j) stores x_i and task (j,i) stores y_i, or task (i,i) stores both x_i and y_i, for i = 1,...,n.
Communication:
– For i = 1,...,n, the task that stores x_i broadcasts it to all other tasks in the ith task row
– For j = 1,...,n, the task that stores y_j broadcasts it to all other tasks in the jth task column

1D mapping (column-wise or row-wise)
Each task holding components of x or y must broadcast them to its neighbors.
Isoefficiency: Θ(p²)

2D mapping
Isoefficiency: Θ(p²)
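
A hedged sketch of the 2D-mapped outer product in C with MPI, assuming p = q×q processes, n divisible by q, and the diagonal process of each row/column initially holding the corresponding blocks of x and y (all names and sizes are illustrative):

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* 2D-mapped outer product sketch: process (i,j) ends up with the (i,j) block of A = x y^T. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int q = (int)(sqrt((double)p) + 0.5);   /* grid is q x q (p assumed a perfect square) */
    int row = rank / q, col = rank % q;
    int n = 1024, nb = n / q;               /* block size per process */

    /* Row and column communicators for the broadcasts. */
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);   /* rank in row_comm == col */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);   /* rank in col_comm == row */

    double *xblk = malloc(nb * sizeof(double));
    double *yblk = malloc(nb * sizeof(double));
    double *Ablk = malloc(nb * nb * sizeof(double));
    if (row == col)                          /* diagonal owner initializes its blocks */
        for (int i = 0; i < nb; i++) { xblk[i] = 1.0; yblk[i] = 2.0; }

    /* Broadcast the x block along the task row and the y block along the task column. */
    MPI_Bcast(xblk, nb, MPI_DOUBLE, row, row_comm);   /* root: diagonal process of this row */
    MPI_Bcast(yblk, nb, MPI_DOUBLE, col, col_comm);   /* root: diagonal process of this column */

    /* Local outer product of the two blocks: no further communication needed. */
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            Ablk[i*nb + j] = xblk[i] * yblk[j];

    free(xblk); free(yblk); free(Ablk);
    MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}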

Matrix-vector product
At most 2n fine-grain tasks store components of x and y, say either:
– for some j, task (j,i) stores x_i and task (i,j) stores y_i, or
– task (i,i) stores both x_i and y_i, i = 1,...,n
Communication:
– For j = 1,...,n, the task that stores x_j broadcasts it to all other tasks in the jth task column
– For i = 1,...,n, a sum reduction over the ith task row gives y_i

Matrix-vector product: steps
1. Broadcast x_j to tasks (k,j), k = 1,...,n
2. Compute y_i = a_ij x_j
3. Reduce y_i across tasks (i,k), k = 1,...,n
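
A hedged C/MPI sketch of these three steps under a coarse-grain 2D mapping (p = q×q processes, block size nb = n/q, with the x and y blocks homed on the diagonal processes; names and sizes are illustrative):

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* 2D-mapped y = A x sketch: process (i,j) owns the nb x nb block A_ij;
   the diagonal process (j,j) initially owns block j of x; block i of y
   ends up on process (i,i). */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int q = (int)(sqrt((double)p) + 0.5);
    int row = rank / q, col = rank % q;
    int n = 1024, nb = n / q;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);   /* rank in row_comm == col */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);   /* rank in col_comm == row */

    double *A = malloc(nb * nb * sizeof(double));
    double *x = malloc(nb * sizeof(double));
    double *ypart = calloc(nb, sizeof(double));
    double *y = malloc(nb * sizeof(double));
    for (int i = 0; i < nb * nb; i++) A[i] = 1.0;
    if (row == col) for (int i = 0; i < nb; i++) x[i] = 1.0;

    /* Step 1: broadcast block j of x down task column j (root: diagonal process). */
    MPI_Bcast(x, nb, MPI_DOUBLE, col, col_comm);

    /* Step 2: local partial products y_i += A_ij x_j. */
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            ypart[i] += A[i*nb + j] * x[j];

    /* Step 3: sum reduction across task row i; the result lands on the diagonal process. */
    MPI_Reduce(ypart, y, nb, MPI_DOUBLE, MPI_SUM, row, row_comm);

    free(A); free(x); free(ypart); free(y);
    MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}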

2D mapping
Isoefficiency: Θ(p²)

1D column mapping
Isoefficiency: Θ(p²)

1D row mapping
Isoefficiency: Θ(p²)
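
For the 1D row mapping, a minimal C/MPI sketch: each process owns n/p consecutive rows of A and the matching piece of x, assembles the full x with an allgather, and computes its own piece of y (names and sizes are illustrative; n is assumed divisible by p):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = 1024;                 /* assumed divisible by p */
    int rows = n / p;

    double *A = malloc(rows * n * sizeof(double));   /* local row block of A */
    double *xloc = malloc(rows * sizeof(double));    /* local piece of x */
    double *x = malloc(n * sizeof(double));          /* full x after allgather */
    double *y = malloc(rows * sizeof(double));       /* local piece of y */
    for (int i = 0; i < rows * n; i++) A[i] = 1.0;
    for (int i = 0; i < rows; i++) xloc[i] = 1.0;

    /* Communication: every process needs all of x for its row block. */
    MPI_Allgather(xloc, rows, MPI_DOUBLE, x, rows, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Local computation: each process produces its own n/p entries of y. */
    for (int i = 0; i < rows; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++) y[i] += A[i*n + j] * x[j];
    }

    free(A); free(xloc); free(x); free(y);
    MPI_Finalize();
    return 0;
}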

Matrix-matrix product
The matrix-matrix product can be viewed as:
– n² inner products, or
– a sum of n outer products, or
– n matrix-vector products
Each viewpoint yields a different algorithm. One way to derive parallel algorithms for the matrix-matrix product is to apply the parallel algorithms already developed for the inner product, outer product, or matrix-vector product. Here we investigate parallel algorithms for this problem directly.

Matrix-matrix product
At most 3n² fine-grain tasks store entries of A, B, or C: say, task (i,j,j) stores a_ij, task (i,j,i) stores b_ij, and task (i,j,k) stores c_ij, for i,j = 1,...,n and some fixed k. Here (i,j,k) = (row, column, layer).
Communication:
– Broadcast entries of the jth column of A horizontally along each task row in the jth layer
– Broadcast entries of the ith row of B vertically along each task column in the ith layer
– For i,j = 1,...,n, the result c_ij is given by a sum reduction over tasks (i,j,k), k = 1,...,n

Matrix-matrix product: steps
1. Broadcast a_ik to tasks (i,q,k), q = 1,...,n
2. Broadcast b_kj to tasks (q,j,k), q = 1,...,n
3. Compute c_ij = a_ik b_kj
4. Reduce c_ij across tasks (i,j,q), q = 1,...,n
Task grouping (agglomeration) reduces the number of processors required.
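
The fine-grain scheme above uses n³ tasks. As a hedged illustration only, here is a much coarser 1D variant in C with MPI (row blocks of A and C, with B replicated on every process via an allgather), not the 3D algorithm itself; names and sizes are illustrative:

#include <mpi.h>
#include <stdlib.h>

/* Simplified 1D-blocked C = A * B: each process owns n/p rows of A, B, and C;
   the full B is assembled on every process before the local multiply. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = 512;                     /* assumed divisible by p */
    int rows = n / p;

    double *A = malloc(rows * n * sizeof(double));     /* local row block of A */
    double *Bloc = malloc(rows * n * sizeof(double));  /* local row block of B */
    double *B = malloc(n * n * sizeof(double));        /* full B after allgather */
    double *C = calloc(rows * n, sizeof(double));      /* local row block of C */
    for (int i = 0; i < rows * n; i++) { A[i] = 1.0; Bloc[i] = 1.0; }

    /* Communication: gather all row blocks of B onto every process. */
    MPI_Allgather(Bloc, rows * n, MPI_DOUBLE, B, rows * n, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Local computation: C_block = A_block * B. */
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];

    free(A); free(Bloc); free(B); free(C);
    MPI_Finalize();
    return 0;
}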

Particle systems
Many physical systems can be modeled as a collection of interacting particles:
– Atoms in a molecule
– Planets in a solar system
– Stars in a galaxy
– Galaxies in clusters
Particles exert mutual forces on each other:
– Gravitational
– Electrostatic

N-body model
Newton's second law: F = m a
Force between two particles (gravitational case): F_ij = G m_i m_j (x_j − x_i) / |x_j − x_i|³
Overall force on the ith particle: f_i = Σ_{j ≠ i} F_ij

Complexity
O(n²) due to pairwise particle-particle interactions.
Can be reduced to O(n log n) or O(n) through:
– Hierarchical trees
– Multipole methods
These methods pay a penalty in accuracy.

Trivial parallelism
High parallelism, but the total work is prohibitive and the memory requirements can be high.
Two steps:
– Broadcast the position of each particle along rows and columns
– Reduce forces diagonally (to the home task of each particle) and perform the time integration
Agglomeration can reduce communication in rows or columns.
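
As a hedged illustration, here is the O(n²) force computation with a simpler 1D particle decomposition (each process owns n/p particles and allgathers all positions) rather than the 2D row/column scheme described above; all names, sizes, and initial data are illustrative:

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const double G = 6.674e-11, eps = 1e-9;   /* gravitational constant, softening term */
    int n = 4096, local_n = n / p;            /* n assumed divisible by p */

    double *pos = malloc(3 * n * sizeof(double));          /* all positions (x,y,z) */
    double *my_pos = malloc(3 * local_n * sizeof(double)); /* positions of owned particles */
    double *mass = malloc(n * sizeof(double));
    double *force = calloc(3 * local_n, sizeof(double));
    for (int i = 0; i < local_n; i++) {
        my_pos[3*i] = rank + 0.1 * i; my_pos[3*i+1] = 0.0; my_pos[3*i+2] = 0.0;
    }
    for (int i = 0; i < n; i++) mass[i] = 1.0;

    /* Communication: every process needs every particle position. */
    MPI_Allgather(my_pos, 3 * local_n, MPI_DOUBLE,
                  pos, 3 * local_n, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Computation: all pairs (i owned locally, j global). */
    for (int i = 0; i < local_n; i++) {
        int gi = rank * local_n + i;                       /* global index of particle i */
        for (int j = 0; j < n; j++) {
            if (j == gi) continue;
            double dx = pos[3*j] - pos[3*gi];
            double dy = pos[3*j+1] - pos[3*gi+1];
            double dz = pos[3*j+2] - pos[3*gi+2];
            double r2 = dx*dx + dy*dy + dz*dz + eps;
            double s = G * mass[gi] * mass[j] / (r2 * sqrt(r2));
            force[3*i] += s * dx; force[3*i+1] += s * dy; force[3*i+2] += s * dz;
        }
    }

    free(pos); free(my_pos); free(mass); free(force);
    MPI_Finalize();
    return 0;
}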

Reducing complexity
Forces have infinite range, but their strength declines with distance.
Three major options:
– Perform the full computation at O(n²) cost
– Discard forces from particles beyond a certain range, introducing an error that is bounded away from zero
– Approximate long-range forces, exploiting the behavior of the force and/or features of the problem

Approach: monopole representation (tree code)
Method:
– Aggregate distant particles into cells and represent the effect of all particles in a cell by a monopole (the first term in the multipole expansion) evaluated at the center of the cell, i.e., replace the influence of far-away particles with an aggregate approximation of their force
– Use larger cells at greater distances
– The approximation is relatively crude
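
A hedged serial C sketch of the monopole idea: summarize a cell by its total mass and center of mass, accept the approximation when the cell subtends a small enough angle, and treat the whole cell as a single point mass (the Cell type, theta test, and function names are hypothetical, in the spirit of Barnes-Hut-style tree codes, not a specific library):

#include <math.h>

/* Hypothetical cell summary for a monopole (first-term) approximation:
   total mass and center of mass of all particles inside the cell. */
typedef struct {
    double mass;        /* total mass of particles in the cell */
    double com[3];      /* center of mass */
    double size;        /* cell edge length */
} Cell;

/* Accept the monopole approximation if the cell looks "small" from the
   evaluation point: size / distance < theta (opening-angle criterion). */
static int far_enough(const Cell *c, const double x[3], double theta) {
    double dx = c->com[0] - x[0], dy = c->com[1] - x[1], dz = c->com[2] - x[2];
    double dist = sqrt(dx*dx + dy*dy + dz*dz);
    return c->size / dist < theta;
}

/* Monopole force of the whole cell on a particle of mass m at x:
   the cell is treated as a single point mass at its center of mass. */
static void monopole_force(const Cell *c, const double x[3], double m,
                           double G, double f[3]) {
    double dx = c->com[0] - x[0], dy = c->com[1] - x[1], dz = c->com[2] - x[2];
    double r2 = dx*dx + dy*dy + dz*dz;
    double s = G * m * c->mass / (r2 * sqrt(r2));
    f[0] += s * dx; f[1] += s * dy; f[2] += s * dz;
}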

Parallel approach
– Divide the domain into patches, with each patch assigned to a process
– The tree code replaces communication with all processes by communication with fewer processes
– To avoid the accuracy problem of the monopole expansion, use the full multipole expansion

What's next?
Discuss some recent papers on parallel algorithms dealing with the classes of problems covered in this lecture.