
High Performance Parallel Programming Dirk van der Knijff Advanced Research Computing Information Division

High Performance Parallel Programming Lecture 5: Parallel Programming Methods and Matrix Multiplication

High Performance Parallel Programming Performance Remember Amdahl's Law –speedup is limited by the serial execution time Parallel Speedup: S(n, P) = T(n,1)/T(n,P) Parallel Efficiency: E(n, P) = S(n,P)/P = T(n,1)/(P*T(n,P)) These measures don't take the quality of the algorithm into account!
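As an illustrative example (the numbers are not from the lecture): if T(n,1) = 100 s and T(n,8) = 20 s, then S(n,8) = 100/20 = 5 and E(n,8) = 5/8 ≈ 0.63, i.e. about 63% efficiency on 8 processors.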

High Performance Parallel Programming Total Performance Numerical efficiency of the parallel algorithm: T_best(n)/T(n,1) Total Speedup: S'(n, P) = T_best(n)/T(n,P) Total Efficiency: E'(n, P) = S'(n,P)/P = T_best(n)/(P*T(n,P)) But the best serial algorithm may not be known, or may not be easily parallelizable, so in practice we compare against a good algorithm.

High Performance Parallel Programming Performance Inhibitors Inherently serial code Non-optimal Algorithms Algorithmic Overhead Software Overhead Load Imbalance Communication Overhead Synchronization Overhead

High Performance Parallel Programming Writing a parallel program Basic concept: First partition problem into smaller tasks. –(the smaller the better) –This can be based on either data or function. –Tasks may be similar or different in workload. Then analyse the dependencies between tasks. –Can be viewed as data-oriented or time-oriented. Group tasks into work-units. Distribute work-units onto processors

High Performance Parallel Programming Partitioning Partitioning is designed to expose opportunities for parallel execution. The focus is to obtain a fine-grained decomposition. A good partition divides both the computation and the data into small pieces –data first - domain decomposition –computation first - functional decomposition These are complementary –may be applied to different parts of the program –may be applied to the same program to yield alternative algorithms

High Performance Parallel Programming Decomposition Functional Decomposition –Work is divided into tasks which act consecutively on data –Usual example is pipelines –Not easily scalable –Can form part of hybrid schemes Data Decomposition –Data is distributed among processors –Results are collated in some manner –Data needs to be distributed to provide each processor with equal amounts of work –Scalability limited by size of data-set

High Performance Parallel Programming Dependency Analysis Determines the communication requirements of tasks Seek to optimize performance by –distributing communications over many tasks –organizing communications to allow concurrent execution Domain decomposition often leads to disjoint easy and hard bits –easy - many tasks operating on same structure –hard - changing structures, etc. Functional decomposition usually easy

High Performance Parallel Programming Trivial Parallelism No dependencies between tasks –Similar to parametric problems –Perfect (mostly) speedup –Can use optimal serial algorithms –Generally no load imbalance –Not often possible, but it's great when it is! We won't look at this again.

High Performance Parallel Programming Aside on Parametric Methods Parametric methods usually treat program as “black-box” No timing or other dependencies so easy to do on Grid May not be efficient! –Not all parts of program may need all parameters –May be substantial initialization –Algorithm may not be optimal –There may be a good parallel algorithm Always better to examine code/algorithm if possible.

High Performance Parallel Programming Group Tasks The first two stages produce an abstract algorithm. Now we move from the abstract to the concrete. –decide on class of parallel system fast/slow communication between processors interconnection type –may decide to combine tasks based on workunits based on number of processors based on communications –may decide to replicate data or computation

High Performance Parallel Programming Distribute tasks onto processors Also known as mapping Number of processors may be fixed or dynamic –MPI - fixed, PVM - dynamic Task allocation may be static (i.e. known at compile-time) or dynamic (determined at run-time) Actual placement may be the responsibility of the OS Large scale multiprocessing systems usually use space-sharing, where a subset of processors is exclusively allocated to a program, with the space possibly being time-shared

High Performance Parallel Programming Task Farms Basic model –matched early architectures –complete Model is made up of three different process types (see the sketch below) – Source - divides up initial tasks between workers. Allocates further tasks when initial tasks completed – Worker - receives a task from the Source, processes it and passes the result to the Sink – Sink - receives completed tasks from the Workers and collates the partial results. Tells the Source to send the next task.
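A minimal C/MPI sketch of this structure, assuming the source and sink are combined on rank 0 and the "work" is simply squaring an integer; the tags, task count and payload are illustrative choices, not from the lecture.

/* Task-farm sketch (illustrative): rank 0 acts as combined source and
   sink, all other ranks are workers. */
#include <mpi.h>

#define TASK_TAG   1
#define RESULT_TAG 2
#define STOP_TAG   3

int main(int argc, char **argv)
{
    int rank, size, ntasks = 100;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* source + sink */
        int sent = 0, received = 0, result;
        MPI_Status st;
        /* source: hand one initial task to every worker */
        for (int w = 1; w < size && sent < ntasks; w++, sent++)
            MPI_Send(&sent, 1, MPI_INT, w, TASK_TAG, MPI_COMM_WORLD);
        /* sink: collect results; source: refill whichever worker replied */
        while (received < sent) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, RESULT_TAG,
                     MPI_COMM_WORLD, &st);
            received++;
            if (sent < ntasks) {
                MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TASK_TAG,
                         MPI_COMM_WORLD);
                sent++;
            }
        }
        for (int w = 1; w < size; w++)     /* tell the workers to stop */
            MPI_Send(&sent, 0, MPI_INT, w, STOP_TAG, MPI_COMM_WORLD);
    } else {                               /* worker */
        int task, result;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == STOP_TAG) break;
            result = task * task;          /* do the work */
            MPI_Send(&result, 1, MPI_INT, 0, RESULT_TAG, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}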

High Performance Parallel Programming The basic task farm Note: the source and sink process may be located on the same processor. [Diagram: a Source process feeding several Worker processes, whose results flow to a Sink process.]

High Performance Parallel Programming Task Farms Variations –combine source and sink onto one processor –have multiple source and sink processors –buffered communication (latency hiding) –task queues Limitations –can involve a lot of communication relative to the work done –difficult to handle communications between workers –load-balancing

High Performance Parallel Programming Load balancing [Diagram: per-processor timelines for P1–P4, comparing an unbalanced schedule with a balanced one over time.]

High Performance Parallel Programming Load balancing Ideally we want all processors to finish at the same time If all tasks are the same size then this is easy... If we know the size of the tasks, we can allocate the largest first (see the sketch below) Not usually a problem if #tasks >> #processors and tasks are small Can interact with buffering –a task may be buffered while other processors are idle –this can be a particular problem for dynamic systems Task order may be important
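A minimal sketch of the "allocate largest first" idea, assuming the task sizes are known in advance; the array layout is illustrative, not from the lecture.

/* Greedy "largest task first" assignment sketch: sort tasks by
   decreasing size, then give each task to the currently least-loaded
   processor. */
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b)
{
    double da = *(const double *)a, db = *(const double *)b;
    return (da < db) - (da > db);   /* descending order */
}

void assign_largest_first(double *task_size, int ntasks,
                          double *load, int nprocs)
{
    qsort(task_size, ntasks, sizeof(double), cmp_desc);
    for (int p = 0; p < nprocs; p++) load[p] = 0.0;
    for (int t = 0; t < ntasks; t++) {
        int best = 0;                        /* find least-loaded processor */
        for (int p = 1; p < nprocs; p++)
            if (load[p] < load[best]) best = p;
        load[best] += task_size[t];          /* task t goes to 'best' */
    }
}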

High Performance Parallel Programming What do we do with Supercomputers? Weather - how many do we need Calculating π - once is enough etc. Most work is simulation or modelling Two types of system –discrete: particle oriented or space oriented –continuous: various methods of discretising

High Performance Parallel Programming discrete systems particle oriented –sparse systems –keep track of particles –calculate forces between particles –integrate equations of motion –e.g. galactic modelling space oriented –dense systems –particles passed between cells –usually simple interactions –e.g. grain hopper

High Performance Parallel Programming Continuous systems Fluid mechanics Compressible vs incompressible Usually solve conservation laws –like loop invariants Discretization –volumetric –structured or unstructured meshes e.g. simulated wind-tunnels

High Performance Parallel Programming Introduction Particle-Particle Methods are used when the number of particles is low enough to consider each particle and its interactions with the other particles Generally dynamic systems - i.e. we either watch them evolve or reach equilibrium Forces can be long or short range Numerical accuracy can be very important to prevent error accumulation Also non-numerical problems

High Performance Parallel Programming Molecular dynamics Many physical systems can be represented as collections of particles Physics used depends on the system being studied: there are different rules for different length scales –subatomic to atomic scales - Quantum Mechanics: Particle Physics, Chemistry –molecular to laboratory scales - Newtonian Mechanics: Biochemistry, Materials Science, Engineering, Astrophysics –astronomical scales - Relativistic Mechanics: Astrophysics

High Performance Parallel Programming Examples Galactic modelling –need large numbers of stars to model galaxies –gravity is a long range force - all stars involved –very long time scales (varying for universe..) Crack formation –complex short range forces –sudden state change so need short timesteps –need to simulate large samples Weather –some models use particle methods within cells

High Performance Parallel Programming General Picture Models are specified as a finite set of particles interacting via fields. The field is found by the pairwise addition of the potential energies (e.g. in an electric field). The force on a particle is then given by the field equations (see the equations below).
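As a hedged illustration (a standard form, not taken verbatim from the slide), the pairwise electrostatic potential on particle i and the resulting force can be written as:

U_i = \sum_{j \ne i} \frac{q_i q_j}{4\pi\varepsilon_0 \, r_{ij}}, \qquad r_{ij} = \lVert \mathbf{r}_i - \mathbf{r}_j \rVert

\mathbf{F}_i = -\nabla_{\mathbf{r}_i} U_i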

High Performance Parallel Programming Simulation Define the starting conditions, positions and velocities Choose a technique for integrating the equations of motion Choose a functional form for the forces and potential energies Sum forces over all interacting pairs, using neighbour-list or similar techniques if possible Allow for boundary conditions and incorporate long range forces if applicable Allow the system to reach equilibrium, then measure properties of the system as it evolves over time

High Performance Parallel Programming Starting conditions Choice of initial conditions depends on knowledge of the system Each case is different –may require fudge factors to account for unknowns –a good starting guess can save equilibration time –but many physical systems are chaotic.. Some overlap between choice of equations and choice of starting conditions

High Performance Parallel Programming Integrating the equations of motion This is an O(N) operation. For Newtonian dynamics there are a number of integration schemes Euler (direct) method - very unstable, poor conservation of energy (a sketch of one Euler step is shown below)
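A sketch of a single direct Euler step for Newtonian dynamics, here in one dimension; the array layout and time step dt are assumptions for illustration only.

/* One forward-Euler step for n particles in 1D (illustrative sketch).
   Euler is simple but, as noted above, conserves energy poorly. */
void euler_step(int n, double dt, const double *force,
                const double *mass, double *x, double *v)
{
    for (int i = 0; i < n; i++) {
        double a = force[i] / mass[i];   /* Newton: a = F/m */
        x[i] += dt * v[i];               /* position from old velocity */
        v[i] += dt * a;                  /* velocity from acceleration */
    }
}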

High Performance Parallel Programming Last Lecture Particle Methods solve problems using an iterative scheme: Initial Conditions -> Force Evaluation -> Integrate -> (repeat) -> Final. If the Force Evaluation phase becomes too expensive, approximation methods have to be used.

High Performance Parallel Programming e.g. Gravitational System To calculate the force on a body we need to perform N-1 pairwise operations, i.e. of the order of N^2 operations for all bodies For large N this operation count is too high (see the direct-sum sketch below)
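A direct particle-particle evaluation illustrating the O(N^2) cost; the 3D layout, the gravitational constant G and the softening parameter eps are illustrative choices, not from the lecture.

/* Direct O(N^2) gravitational force evaluation in 3D (sketch).
   A small softening eps avoids the singularity at r = 0; each pair is
   visited once and Newton's third law gives the reaction force. */
#include <math.h>

void gravity_forces(int n, const double *m,
                    const double (*pos)[3], double (*f)[3],
                    double G, double eps)
{
    for (int i = 0; i < n; i++)
        f[i][0] = f[i][1] = f[i][2] = 0.0;

    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = pos[j][0] - pos[i][0];
            double dy = pos[j][1] - pos[i][1];
            double dz = pos[j][2] - pos[i][2];
            double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            double s = G * m[i] * m[j] * inv_r3;
            f[i][0] += s * dx;  f[i][1] += s * dy;  f[i][2] += s * dz;
            f[j][0] -= s * dx;  f[j][1] -= s * dy;  f[j][2] -= s * dz;
        }
    }
}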

High Performance Parallel Programming An Alternative Calculating the force directly using PP methods is too expensive for large numbers of particles Instead of calculating the force at a point, the field equations can be used to mediate the force From the gradient of the potential field the force acting on a particle can be derived without having to calculate the force in a pairwise fashion

High Performance Parallel Programming Using the Field Equations Sample the field on a grid and use this to calculate the force on particles Approximation: –Continuous field - grid –Introduces coarse sampling, i.e. smooth below grid scale –Interpolation may also introduce errors

High Performance Parallel Programming What do we gain Faster: the number of grid points can be less than the number of particles –Solve field equations on grid –Particles contribute charge locally to grid –Field information is fed back from neighbouring grid points Operation count: O(N^2) -> O(N) or O(N log N) => we can model larger numbers of bodies...with an acceptable error tolerance

High Performance Parallel Programming Requirements Particle Mesh (PM) methods are best suited for problems which have: –A smoothly varying force field –Uniform particle distribution in relation to the resolution of the grid –Long range forces Although these properties are desirable they are not essential to profitably use a PM method e.g. Galaxies, Plasmas etc

High Performance Parallel Programming Procedure The basic Particle Mesh algorithm consists of the following steps (see the skeleton below): –Generate initial conditions –Overlay system with a covering grid –Assign charges to the mesh –Calculate the mesh-defined force field –Interpolate to find forces on the particles –Update particle positions –End
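A skeleton of those steps in C; every helper function here is a hypothetical placeholder for the corresponding stage, not a real library routine.

/* Particle-Mesh time-stepping skeleton (illustrative only).
   The helpers below stand for the stages listed above. */
void generate_initial_conditions(void);     /* particles + covering grid */
void assign_charge_to_mesh(void);           /* particles -> grid charge density */
void solve_field_on_mesh(void);             /* e.g. a grid Poisson solve */
void interpolate_forces_to_particles(void); /* grid field -> per-particle force */
void push_particles(void);                  /* integrate the equations of motion */

void pm_simulation(int nsteps)
{
    generate_initial_conditions();
    for (int step = 0; step < nsteps; step++) {
        assign_charge_to_mesh();
        solve_field_on_mesh();
        interpolate_forces_to_particles();
        push_particles();
    }
}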

High Performance Parallel Programming Matrix Multiplication

High Performance Parallel Programming Optimizing Matrix Multiply for Caches Several techniques for making this faster on modern processors –heavily studied Some optimizations are done automatically by the compiler, but we can do much better In general, you should use optimized libraries (often supplied by the vendor) for this and other very common linear algebra operations –BLAS = Basic Linear Algebra Subprograms Other algorithms you may want are not going to be supplied by the vendor, so you need to know these techniques

High Performance Parallel Programming Matrix-vector multiplication y = y + A*x
for i = 1:n
  for j = 1:n
    y(i) = y(i) + A(i,j)*x(j)
[Diagram: y(i) is updated from row A(i,:) times the vector x(:).]

High Performance Parallel Programming Matrix-vector multiplication y = y + A*x
{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
  {read row i of A into fast memory}
  for j = 1:n
    y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}
–m = number of slow memory refs = 3*n + n^2
–f = number of arithmetic operations = 2*n^2
–q = f/m ~= 2
–Matrix-vector multiplication is limited by slow memory speed

High Performance Parallel Programming Matrix Multiply C = C + A*B
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
[Diagram: C(i,j) is updated from row A(i,:) and column B(:,j).]
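The same triple loop written as a plain C routine, assuming row-major storage; this is a direct translation, not a tuned kernel.

/* Naive C = C + A*B for n x n matrices stored row-major. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}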

High Performance Parallel Programming Matrix Multiply C = C + A*B (unblocked, or untiled)
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}
[Diagram: C(i,j) is updated from row A(i,:) and column B(:,j).]

High Performance Parallel Programming Matrix Multiply aside
Classic dot-product form:
  do i = 1,n
    do j = 1,n
      c(i,j) = 0.0
      do k = 1,n
        c(i,j) = c(i,j) + a(i,k)*b(k,j)
      enddo
    enddo
  enddo
saxpy form:
  c = 0.0
  do k = 1,n
    do j = 1,n
      do i = 1,n
        c(i,j) = c(i,j) + a(i,k)*b(k,j)
      enddo
    enddo
  enddo

High Performance Parallel Programming Matrix Multiply (unblocked, or untiled)
Number of slow memory references on unblocked matrix multiply:
m = n^3 (read each column of B once for each i, i.e. n times each)
  + n^2 (read row i of A once for each i)
  + 2*n^2 (read and write each element of C once)
  = n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2) ~= 2 for large n - no improvement over matrix-vector multiply.

High Performance Parallel Programming Matrix Multiply (blocked, or tiled)
Consider A, B, C to be N by N matrices of b by b subblocks, where b = n/N is called the blocksize.
for i = 1 to N
  for j = 1 to N
    {read block C(i,j) into fast memory}
    for k = 1 to N
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}
[Diagram: block C(i,j) is updated from block A(i,k) and block B(k,j).]
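A C sketch of the tiled loop, assuming n is a multiple of the block size b and row-major storage; it mirrors the structure above rather than a tuned library kernel.

/* Blocked (tiled) C = C + A*B, assuming n % b == 0, row-major storage.
   The three innermost loops multiply one b x b block triple. */
void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
                for (int i = ii; i < ii + b; i++)
                    for (int j = jj; j < jj + b; j++) {
                        double cij = C[i*n + j];
                        for (int k = kk; k < kk + b; k++)
                            cij += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = cij;
                    }
}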

High Performance Parallel Programming Matrix Multiply (blocked or tiled)
Why is this algorithm correct?
Number of slow memory references on blocked matrix multiply:
m = N*n^2 (N^3 block reads of B in total: N^3 * (n/N) * (n/N) words)
  + N*n^2 (N^3 block reads of A in total)
  + 2*n^2 (read and write each block of C once)
  = (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ~= n/N = b for large n.

High Performance Parallel Programming [Figure: performance results on a PW600au - 600MHz, EV56]

High Performance Parallel Programming [Figure: performance results on a DS10L - 466MHz, EV6]

High Performance Parallel Programming Matrix Multiply (blocked or tiled) So we can improve performance by increasing the blocksize b Can be much faster than matrix-vector multiply (q = 2) Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3) Theorem (Hong, Kung, 1981): any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(M))
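As an illustrative calculation (the cache size here is an assumption, not from the lecture): a 32 KB data cache holds M = 32768/8 = 4096 doubles, so b <= sqrt(4096/3) ≈ 36, and a block size of around 32 would be a reasonable starting point.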

High Performance Parallel Programming Strassen's Matrix Multiply
The traditional algorithm (with or without tiling) has O(n^3) flops
Strassen discovered an algorithm with asymptotically lower flops –O(n^2.81)
Consider a 2x2 matrix multiply, normally 8 multiplies:
Let M = [m11 m12; m21 m22] = [a11 a12; a21 a22] * [b11 b12; b21 b22]
Let p1 = (a12 - a22) * (b21 + b22)
    p2 = (a11 + a22) * (b11 + b22)
    p3 = (a11 - a21) * (b11 + b12)
    p4 = (a11 + a12) * b22
    p5 = a11 * (b12 - b22)
    p6 = a22 * (b21 - b11)
    p7 = (a21 + a22) * b11
Then m11 = p1 + p2 - p4 + p6
     m12 = p4 + p5
     m21 = p6 + p7
     m22 = p2 - p3 + p5 - p7

High Performance Parallel Programming Strassen (continued)
T(n) = cost of multiplying n x n matrices = 7*T(n/2) + 18*(n/2)^2 = O(n^(log2 7)) = O(n^2.81)
–Available in several libraries
–Up to several times faster if n is large enough (100s)
–Needs more memory than the standard algorithm
–Can be less accurate because of roundoff error
–Current world's record is O(n^ )
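Expanding the recurrence shows where the exponent comes from (a standard derivation, written here in LaTeX for clarity):

T(n) = 7\,T(n/2) + O(n^2)
     = 7^{\log_2 n}\,T(1) + O(n^2)\sum_{k=0}^{\log_2 n - 1}\left(\tfrac{7}{4}\right)^k
     = O\!\left(7^{\log_2 n}\right)
     = O\!\left(n^{\log_2 7}\right) \approx O(n^{2.81})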

High Performance Parallel Programming Parallelizing Could use task farm with blocked algorithm Allows for any number of processors Usually doesn’t do optimal data distribution Scalability limited to n-1 (bad for small n) Requires tricky code in master to keep track of all the blocks Can be improved by double buffering

High Performance Parallel Programming Better algorithms Based on the block algorithm Distribute control to all processors Usually written for a fixed number of processors Can assign a block of the matrix to each node, then cycle the blocks of A and B (A row-wise, B column-wise) past each processor Better to assign "column blocks" to each processor, as this only requires cycling the B matrix (less communication) - see the sketch below
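One possible concrete arrangement, sketched in C with MPI (an illustration under assumed data layout, not necessarily the exact layout from the lecture): each rank owns a strip of A and C, and the matching strips of B are cycled around a ring so that only B is communicated.

/* Ring-cycling matrix multiply sketch (illustrative layout): with P ranks
   and nb = n/P, each rank owns rows [rank*nb, rank*nb+nb) of A and C,
   plus one nb x n strip of B that is shifted around a ring each step. */
#include <mpi.h>

void ring_matmul(int n, int nb, const double *Arows, double *Bstrip,
                 double *Crows, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);
    int left  = (rank - 1 + P) % P;
    int right = (rank + 1) % P;

    for (int step = 0; step < P; step++) {
        /* the strip currently held covers the rows of B owned by 'owner' */
        int owner = (rank - step + P) % P;
        for (int i = 0; i < nb; i++)             /* local rows of C */
            for (int k = 0; k < nb; k++)         /* rows of the held B strip */
                for (int j = 0; j < n; j++)
                    Crows[i*n + j] +=
                        Arows[i*n + owner*nb + k] * Bstrip[k*n + j];
        /* pass the B strip to the right, receive a new one from the left */
        MPI_Sendrecv_replace(Bstrip, nb*n, MPI_DOUBLE,
                             right, 0, left, 0, comm, MPI_STATUS_IGNORE);
    }
}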

High Performance Parallel Programming Thursday: back to Raj…