
CSE 160/Berman Programming Paradigms and Algorithms
Readings: W+A 3.1, 3.2, p. 178, 5.1, 5.3.3, Chapter 6, 9.2.8, 10.4.1; Kumar 12.1.3
Berman, F., Wolski, R., Figueira, S., Schopf, J. and Shao, G., "Application-Level Scheduling on Distributed Heterogeneous Networks," Proceedings of Supercomputing '96.

CSE 160/Berman Common Parallel Programming Paradigms
– Embarrassingly parallel programs
  – Workqueue
  – Master/Slave programs
  – Monte Carlo methods
– Regular, Iterative (Stencil) Computations
– Pipelined Computations
– Synchronous Computations

CSE 160/Berman Regular, Iterative Stencil Applications
Many scientific applications have the format:
  Loop until some condition is true
    Perform computation, which involves communicating with the N, E, W, S neighbors of a point (5-point stencil)
    [Convergence test?]

CSE 160/Berman Stencil Example: Jacobi2D
The Jacobi algorithm, also known as the method of simultaneous corrections, is an iterative method for approximating the solution to a system of linear equations.
Jacobi addresses the problem of solving n linear equations in n unknowns, Ax = b, where the ith equation is
  a_i1 x_1 + a_i2 x_2 + … + a_in x_n = b_i
or alternatively
  x_i = (b_i − Σ_{j≠i} a_ij x_j) / a_ii
The a's and b's are known; we want to solve for the x's.

CSE 160/Berman Jacobi 2D Strategy
The Jacobi strategy iterates until the computation converges to (an approximation of) the solution, i.e. at each iteration we compute
  x_i^(k) = (b_i − Σ_{j≠i} a_ij x_j^(k−1)) / a_ii
where the values from the (k−1)st iteration are used to compute the values for the kth iteration.
For important classes of problems, Jacobi converges to a "good" solution after O(log N) iterations [Leighton].
– Typically, the solution is approximated to a desired error threshold.
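For concreteness, here is a minimal sketch of this iteration in C for a small, dense A stored row-major. The function names, convergence handling, and dense storage are illustrative choices, not taken from the course materials.

#include <math.h>
#include <stdlib.h>

/* One Jacobi sweep: xnew[i] = (b[i] - sum_{j!=i} A[i][j]*x[j]) / A[i][i].
 * Returns the largest per-element change so the caller can test convergence. */
static double jacobi_sweep(int n, const double *A, const double *b,
                           const double *x, double *xnew)
{
    double maxdiff = 0.0;
    for (int i = 0; i < n; i++) {
        double sum = b[i];
        for (int j = 0; j < n; j++)
            if (j != i)
                sum -= A[i * n + j] * x[j];
        xnew[i] = sum / A[i * n + i];
        double diff = fabs(xnew[i] - x[i]);
        if (diff > maxdiff)
            maxdiff = diff;
    }
    return maxdiff;
}

/* Iterate until the largest change drops below tol (or max_iters is reached). */
void jacobi_solve(int n, const double *A, const double *b, double *x,
                  double tol, int max_iters)
{
    double *xnew = malloc((size_t)n * sizeof(double));
    for (int k = 0; k < max_iters; k++) {
        double maxdiff = jacobi_sweep(n, A, b, x, xnew);
        for (int i = 0; i < n; i++)          /* x <- xnew for the next iteration */
            x[i] = xnew[i];
        if (maxdiff < tol)
            break;
    }
    free(xnew);
}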

CSE 160/Berman Jacobi 2D
The equation is most efficient to solve when most of the a's are 0.
– When most of A's entries are non-zero, A is dense.
– When most of the a's are 0, A is sparse.
– Sparse matrices are regularly found in many scientific applications.

CSE 160/Berman Laplace's Equation
The Jacobi strategy can be used effectively to solve sparse linear equations. One such equation is Laplace's equation:
  ∂²f/∂x² + ∂²f/∂y² = 0
f is solved over a 2D space having coordinates x and y.
If the distance between points (Δ) is small enough, f can be approximated by
  ∂²f/∂x² ≈ [f(x+Δ, y) − 2 f(x, y) + f(x−Δ, y)] / Δ²
  ∂²f/∂y² ≈ [f(x, y+Δ) − 2 f(x, y) + f(x, y−Δ)] / Δ²
These equations reduce to
  f(x, y) = ¼ [f(x+Δ, y) + f(x−Δ, y) + f(x, y+Δ) + f(x, y−Δ)]

CSE 160/Berman Laplace's Equation
Note the relationship between the parameters: the update for (x, y) uses only (x+Δ, y), (x−Δ, y), (x, y+Δ), and (x, y−Δ). This forms a 4-point stencil.
Any update will involve only local communication!

CSE 160/Berman Solving Laplace's Equation Using the Jacobi Strategy
Note that in Laplace's equation we want to solve for all f(x, y), which has 2 parameters; in Jacobi we want to solve for x_i, which has only 1 index.
How do we convert f(x, y) into x_i? Associate the x_i's with the f(x, y)'s by distributing them over the f 2D matrix in row-major (natural) order.
For an n×n matrix there are then n×n x_i's, so the A matrix will need to be (n×n)×(n×n).
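One way to express the row-major association in code; these helper names are hypothetical, used only for illustration:

/* Map grid point (row, col) of the n x n interior grid to the unknown index i
 * in row-major (natural) order, and back. */
static inline int grid_to_index(int row, int col, int n) { return row * n + col; }
static inline int index_to_row(int i, int n)             { return i / n; }
static inline int index_to_col(int i, int n)             { return i % n; }
/* Example: with n = 3, the point (row = 1, col = 2) corresponds to x_5. */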

CSE 160/Berman Solving Laplace's Equation Using the Jacobi Strategy
When the x_i's are distributed in the f 2D matrix in row-major (natural) order,
  f(x, y) = ¼ [f(x+Δ, y) + f(x−Δ, y) + f(x, y+Δ) + f(x, y−Δ)]
becomes (for interior points)
  x_i = ¼ (x_{i−1} + x_{i+1} + x_{i−n} + x_{i+n})

CSE 160/Berman Working Backward
Now we want to work backward to find out what the A matrix and b vector will be for Jacobi.
Our solution to Laplace's equation gives us equations of this form:
  x_i = ¼ (x_{i−1} + x_{i+1} + x_{i−n} + x_{i+n})
Rewriting, we get
  4 x_i − x_{i−1} − x_{i+1} − x_{i−n} − x_{i+n} = 0
So the b_i are 0; what is the A matrix?

CSE 160/Berman Finding the A Matrix
Each row has at most 5 non-zero entries.
All entries on the diagonal are 4.
N = 9, n = 3: [figure omitted: the 9×9 matrix with 4's on the diagonal and −1's in the positions of each point's in-grid neighbors]
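A sketch of how this A matrix could be assembled (dense storage for clarity; the function name is illustrative, and neighbors that fall outside the grid are simply dropped, which corresponds to folding the known boundary values into b):

#include <string.h>

/* Fill the N x N matrix (N = n*n) for the Laplace/Jacobi system, row-major.
 * Row i gets 4 on the diagonal and -1 for each in-grid N/E/W/S neighbor,
 * giving at most 5 non-zero entries per row. */
void build_laplace_matrix(int n, double *A)
{
    int N = n * n;
    memset(A, 0, (size_t)N * N * sizeof(double));
    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            int i = row * n + col;                            /* row-major unknown index */
            A[(size_t)i * N + i] = 4.0;                       /* diagonal entry */
            if (col > 0)     A[(size_t)i * N + (i - 1)] = -1.0;   /* west  */
            if (col < n - 1) A[(size_t)i * N + (i + 1)] = -1.0;   /* east  */
            if (row > 0)     A[(size_t)i * N + (i - n)] = -1.0;   /* north */
            if (row < n - 1) A[(size_t)i * N + (i + n)] = -1.0;   /* south */
        }
    }
}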

CSE 160/Berman Jacobi Implementation Strategy
– An initial guess is made for all the unknowns, typically x_i = b_i.
– New values for the x_i's are calculated using the iteration equations.
– The updated values are substituted into the iteration equations and the process repeats.
– The user provides a "termination condition" to end the iteration.
  – An example termination condition is an error threshold.

CSE 160/Berman Data Parallel Jacobi 2D Pseudo-code

[Initialize ghost regions]
for (i=1; i<=N; i++) {
    x[0][i]   = north[i];
    x[N+1][i] = south[i];
    x[i][0]   = west[i];
    x[i][N+1] = east[i];
}

[Initialize matrix]
for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
        x[i][j] = initvalue;

[Iterative refinement of x until values converge]
maxdiff = CONVERG + 1;       /* force at least one iteration */
while (maxdiff > CONVERG) {
    [Update x array]
    for (i=1; i<=N; i++)
        for (j=1; j<=N; j++)
            newx[i][j] = ¼ (x[i-1][j] + x[i][j+1] + x[i+1][j] + x[i][j-1]);

    [Convergence test]
    maxdiff = 0;
    for (i=1; i<=N; i++)
        for (j=1; j<=N; j++) {
            maxdiff = max(maxdiff, |newx[i][j] - x[i][j]|);
            x[i][j] = newx[i][j];
        }
}

CSE 160/Berman Jacobi2D Programming Issues
Synchronization
– Should we synchronize between iterations? Between multiple iterations? (see the ghost-exchange sketch below)
– Should we tag information and let the application run asynchronously? (How bad can things get?)
How often should we test for convergence?
– How important is it to know when we're done?
– How expensive is it?
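To make the synchronization question concrete, here is a hedged sketch of a blocking, per-iteration ghost-row exchange for a strip decomposition using MPI. This is one common approach, not the course's actual code; the function name and data layout are assumptions.

#include <mpi.h>

/* Exchange ghost rows with the strip neighbors before each Jacobi update.
 * x is (local_rows + 2) x (N + 2), row-major; row 0 and row local_rows+1 are ghosts.
 * up/down are neighbor ranks, or MPI_PROC_NULL at the top/bottom of the grid.
 * MPI_Sendrecv makes neighbors synchronize implicitly every iteration; an
 * asynchronous variant would use MPI_Isend/MPI_Irecv and tolerate stale ghosts. */
void exchange_ghost_rows(double *x, int local_rows, int N,
                         int up, int down, MPI_Comm comm)
{
    int width = N + 2;   /* each row includes two ghost columns */
    /* Send my first interior row up; receive my top ghost row from above. */
    MPI_Sendrecv(&x[1 * width],                width, MPI_DOUBLE, up,   0,
                 &x[0 * width],                width, MPI_DOUBLE, up,   0,
                 comm, MPI_STATUS_IGNORE);
    /* Send my last interior row down; receive my bottom ghost row from below. */
    MPI_Sendrecv(&x[local_rows * width],       width, MPI_DOUBLE, down, 0,
                 &x[(local_rows + 1) * width], width, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
}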

CSE 160/Berman Jacobi2D Programming Issues
Block decomposition or strip decomposition?
– How big should the blocks or strips be?
– How should blocks/strips be allocated to processors?
[Figure: Block, Uniform Strip, and Non-uniform Strip decompositions]

CSE 160/Berman HPF-Style Data Decompositions
1D (processors P0 P1 P2 P3, tasks 0-15); the three mappings are sketched in code below:
– Block decomposition (task i allocated to processor floor(i / (n/p)); here, floor(i/4))
– Cyclic decomposition (task i allocated to processor i mod p)
– Block-cyclic decomposition (block i allocated to processor i mod p)
[Figure: Block, Cyclic, and Block-cyclic layouts]
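These mappings can be written as small owner functions; a sketch for n tasks on p processors, with illustrative names:

/* Owner of task i under a block decomposition (assumes p divides n;
 * with n = 16, p = 4 this is floor(i / 4)). */
int block_owner(int i, int n, int p)        { return i / (n / p); }

/* Owner of task i under a cyclic decomposition. */
int cyclic_owner(int i, int p)              { return i % p; }

/* Owner of task i under a block-cyclic decomposition with block size b:
 * block number i / b is dealt out cyclically. */
int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }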

CSE 160/Berman HPF-Style Data Decompositions
2D
– Each dimension is partitioned by block, cyclic, block-cyclic, or * (do nothing).
– A useful set of uniform decompositions can be constructed, e.g. [Block, Block], [Block, *], [*, Cyclic].

CSE 160/Berman Jacobi on a Cluster
If each partition of Jacobi is executed on a processor in a lab cluster, we can no longer assume we have dedicated processors and network.
In particular, the performance exhibited by the cluster will vary over time and with load.
How can we go about developing a performance-efficient implementation in a more dynamic environment?

CSE 160/Berman Jacobi AppLeS
We developed an AppLeS application scheduler.
– AppLeS = Application-Level Scheduler
– AppLeS is a scheduling agent that integrates with the application to form a "Grid-aware," adaptive, self-scheduling application.
– We targeted the Jacobi AppLeS to a distributed, clustered environment.

How Does AppLeS Work?
AppLeS + application = self-scheduling application
[Figure: AppLeS architecture — Resource Discovery yields accessible resources; Resource Selection yields feasible resource sets; Schedule Planning and Performance Modeling yields evaluated schedules; the Decision Model picks the "best" schedule for Schedule Deployment; the agent draws on the Grid Infrastructure, the NWS, and the Resources]

Network Weather Service (Wolski, U. Tenn.)
The NWS provides dynamic resource information for AppLeS.
NWS is a stand-alone system that
– monitors the current system state
– provides the best forecast of resource load from multiple models
[Figure: NWS components — Sensor Interface, Forecaster (Model 1, Model 2, Model 3), Reporting Interface]

Jacobi2D AppLeS Resource Selector
Feasible resources are determined according to an application-specific "distance" metric (see the sketch below):
– Choose the fastest machine as the locus.
– Compute the distance D from the locus based on a unit-sized, application-specific benchmark:
  D[locus, X] = |comp[unit, locus] − comp[unit, X]| + comm[W, E columns]
Resources are sorted according to distance from the locus, forming a desirability list.
– Feasible resource sets are formed from initial subsets of the sorted desirability list.
– Next step: plan a schedule for each feasible resource set.
– The scheduler will choose the schedule with the best predicted execution time.
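A hedged sketch of this selection step; the struct layout and function names are illustrative, and in the real AppLeS the comp and comm terms come from benchmarks and NWS forecasts:

#include <math.h>
#include <stdlib.h>

typedef struct {
    const char *name;
    double comp_unit;      /* benchmarked time for a unit-sized piece of work  */
    double comm_columns;   /* predicted time to exchange the W/E ghost columns */
    double distance;       /* distance from the locus, filled in below         */
} Resource;

static int by_distance(const void *a, const void *b)
{
    double da = ((const Resource *)a)->distance;
    double db = ((const Resource *)b)->distance;
    return (da > db) - (da < db);
}

/* Pick the fastest machine as the locus, compute
 *   D[locus, X] = |comp[unit, locus] - comp[unit, X]| + comm[W, E columns],
 * and sort resources by distance to form the desirability list. */
void build_desirability_list(Resource *r, int count)
{
    int locus = 0;
    for (int i = 1; i < count; i++)
        if (r[i].comp_unit < r[locus].comp_unit)
            locus = i;
    for (int i = 0; i < count; i++)
        r[i].distance = fabs(r[i].comp_unit - r[locus].comp_unit) + r[i].comm_columns;
    qsort(r, (size_t)count, sizeof(Resource), by_distance);
}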

Jacobi2D Performance Model and Schedule Planning
AppLeS uses time-balancing to determine the best partition on a given set of resources: solve for strip sizes such that the predicted execution time of each strip is the same (a simplified sketch follows below).
The execution time for the ith strip is modeled in terms of
– load = predicted percentage of CPU time available (NWS)
– comm = time to send and receive messages, factored by predicted BW (NWS)
[Figure: strips of the grid assigned to processors P1, P2, P3]
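As an illustration of time-balancing, here is a sketch under a deliberately simplified model T_i = rows_i × cost_i, where cost_i is the per-row compute time scaled by the NWS CPU-availability forecast. The model and names are assumptions; the original AppLeS formulation also folds in the comm terms.

/* Time-balancing sketch: split N rows of the grid among p processors so that
 * the predicted per-strip times rows[i] * cost[i] come out (roughly) equal.
 * cost[i] is an assumed per-row time on processor i, e.g. benchmarked compute
 * time divided by the NWS CPU-availability forecast. */
void balance_strips(int N, int p, const double *cost, int *rows)
{
    double inv_sum = 0.0;
    for (int i = 0; i < p; i++)
        inv_sum += 1.0 / cost[i];

    int assigned = 0;
    for (int i = 0; i < p; i++) {
        rows[i] = (int)(N * (1.0 / cost[i]) / inv_sum);   /* proportional share */
        assigned += rows[i];
    }

    /* Hand any rows lost to rounding to the fastest (cheapest) processor. */
    int fastest = 0;
    for (int i = 1; i < p; i++)
        if (cost[i] < cost[fastest])
            fastest = i;
    rows[fastest] += N - assigned;
}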

Jacobi2D Experiments
Experiments compare:
– Compile-time block [HPF] partitioning
– Compile-time irregular strip partitioning [no NWS forecasts, no resource selection]
– Run-time strip AppLeS partitioning
Runs for the different partitioning methods performed back-to-back on production systems; average execution time recorded.
Distributed UCSD/SDSC platform: Sparcs, RS6000, Alpha Farm, SP-2.

Jacobi2D AppLeS Experiments
[Figure: representative Jacobi 2D AppLeS experiment]
– Adaptive scheduling leverages the deliverable performance of a contended system.
– The spike occurs when a gateway between PCL and SDSC goes down.
– Subsequent AppLeS experiments avoid the slow link.