Programming with MPI – A Detailed Example. By: Llewellyn Yap. Wednesday, 2-23-2000. EE524 Lecture. How to Build a Beowulf, Chapter 9.


1 Programming with MPI – A Detailed Example. By: Llewellyn Yap. EE524 Lecture. How to Build a Beowulf, Chapter 9

2 Overview of MPI
MPI – a distributed address-space model
Decomposition can be static or dynamic
–Static: fixed once and for all
–Dynamic: changing in response to the simulation
When partitioning, consider minimizing communication

3 Implementing High-Performance, Parallel Applications for MPI
Choose an algorithm with sufficient parallelism
Optimize a sequential version of the algorithm
Use the simplest possible MPI operations
–usually blocking, standard-mode procedures (a minimal sketch follows)
Profiling and analysis
–find which operations take the most time
Attack the most time-consuming components
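
The "simplest possible MPI operations" recommended above are blocking, standard-mode point-to-point calls. As a minimal illustrative sketch (not part of the sort example; the token-passing pattern is chosen only to show the calls), the following C program moves a value around a ring using nothing but MPI_Send and MPI_Recv:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pass a token around a ring with blocking, standard-mode calls. */
    if (size > 1) {
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }
    }
    printf("rank %d saw token %d\n", rank, token);

    MPI_Finalize();
    return 0;
}
```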

4 Steps to Good Parallel Implementation
Understand strengths and weaknesses of sequential solutions
Choose a good sequential algorithm
Design a strategy for parallelization
Develop a rough semi-analytic model of how the parallel algorithm should perform

5 Steps to Good Parallel Implementation (cont'd)
Implement MPI for interprocessor communication
Carry out measurements to verify performance
Identify bottlenecks, sources of overhead, etc.; minimize their impact
Iterate to improve performance if possible

6 Investigate Sorting
Sorting is a multi-faceted, irregular, non-grid-based problem
Used widely in database servers
Issues of sorting (size of elements, form of result, storage, etc.)
Two approaches will be discussed
1. A fairly restricted domain
2. A more general approach

7 Sorting (Simple Approach) Assumptions
Elements are positive integers
Secondary storage (disk, tape) not used
Auxiliary primary storage is available
Input data uniformly distributed over range of integers

8 Sorting (Simple Approach) The Approach
Most problems for parallel computation have already been solved for traditional sequential architectures
Sequential solutions exist as libraries, system calls, or language constructs – these will be used as building blocks for a parallel solution

9 Sorting (Simple Approach) The Approach (cont'd)
Advantage of this approach: it leverages the design, debugging, and optimization already performed for the sequential case
Assume we have at our disposal an optimized, debugged, sequential function 'isort' that sorts arrays of integers in the memory of a single processor (a stand-in sketch follows)
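
The book treats isort as a given building block; its actual implementation may differ, but as a stand-in sketch a minimal isort can simply wrap the C library's qsort:

```c
#include <stdlib.h>

/* Comparison function for qsort on ints. */
static int int_cmp(const void *a, const void *b)
{
    int ia = *(const int *)a, ib = *(const int *)b;
    return (ia > ib) - (ia < ib);
}

/* One plausible stand-in for the sequential building block:
   sort an array of n ints in place. */
void isort(int *a, int n)
{
    qsort(a, (size_t)n, sizeof(int), int_cmp);
}
```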

10 Sorting (Simple Approach) The Algorithm
Partially pre-sort the elements so that all elements in processor p are less than all those in higher-numbered processors
–Recall that on Beowulf systems, the high latency of network communication favors transmission of large messages over small ones
–Assign a range of values to each processor: values between p*(INT_MAX/commsize) and (p+1)*(INT_MAX/commsize)-1, inclusive (see the sketch below)
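
A sketch of the destination rule just described, assuming keys uniformly distributed over [0, INT_MAX] and equal-width buckets; the function and variable names are illustrative, not taken from the book's code:

```c
#include <limits.h>

/* Destination rank for a key: rank p owns keys in
   [p*width, (p+1)*width - 1], where width = INT_MAX/commsize.
   The last rank absorbs the small remainder left by integer division. */
int dest_rank(int key, int commsize)
{
    int width = INT_MAX / commsize;
    int dest  = key / width;
    return dest < commsize ? dest : commsize - 1;
}
```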

11 Sorting (Simple Approach) The Algorithm (cont'd)
–Each processor scans its own list and labels every element with its destination processor
–Elements are placed in a buffer specific to that destination processor
Communicate data between processors

12 Sorting (Simple Approach) The Algorithm (cont'd)
MPI provides communication tools
–MPI_Alltoallv
–Requires each processor to know how much data is incoming from every partner, and exactly where it should go
First distribute the lengths with MPI_Alltoall (see the sketch below)
Allocate contiguous space for all outgoing elements (temporarily)
Pacreate – initialize the returned structure
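
A sketch of this exchange step (not the book's code; the buffer layout and names are assumptions): sendcounts[d] holds how many integers this rank will send to rank d, and sendbuf already holds the elements grouped by destination. The counts are exchanged first with MPI_Alltoall, then the elements follow with MPI_Alltoallv:

```c
#include <mpi.h>
#include <stdlib.h>

void exchange(int *sendbuf, int *sendcounts, int commsize,
              int **recvbuf_out, int *recvtotal_out)
{
    int *recvcounts = malloc(commsize * sizeof(int));
    int *sdispls    = malloc(commsize * sizeof(int));
    int *rdispls    = malloc(commsize * sizeof(int));
    int i, recvtotal = 0;

    /* Tell every rank how much data to expect from us. */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT,
                 MPI_COMM_WORLD);

    /* Build send/receive displacements from the counts. */
    sdispls[0] = rdispls[0] = 0;
    for (i = 1; i < commsize; i++) {
        sdispls[i] = sdispls[i - 1] + sendcounts[i - 1];
        rdispls[i] = rdispls[i - 1] + recvcounts[i - 1];
    }
    for (i = 0; i < commsize; i++)
        recvtotal += recvcounts[i];

    /* Move the elements themselves. */
    int *recvbuf = malloc(recvtotal * sizeof(int));
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    *recvbuf_out   = recvbuf;
    *recvtotal_out = recvtotal;
    free(recvcounts); free(sdispls); free(rdispls);
}
```

After the call, each rank holds only keys in its own range and can finish with the local sequential sort.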

13 Analysis of Integer Sort
Quality of a parallel implementation is assessed by measuring
speedup: s(P) = T(1)/T(P)
efficiency: ε(P) = T(1)/(P×T(P)) = s(P)/P
overhead: γ(P) = (P×T(P) − T(1))/T(1) = (1 − ε)/ε
–where T(1) = best available implementation on a single processor
Overhead is useful because it is additive
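
As a hypothetical worked example (numbers invented for illustration): if T(1) = 10 s and T(4) = 3 s on P = 4 processors, then s(4) = 10/3 ≈ 3.3, ε(4) = 10/12 ≈ 0.83, and γ(4) = (4×3 − 10)/10 = 0.2, which agrees with (1 − ε)/ε. Because overhead is additive, that 0.2 can then be apportioned among the sources listed on the next slide.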

14 Sources of Overhead
Communication
Redundancy
Extra Work
Load Imbalance
Waiting

15 Sources of Overhead: Communication
Time spent communicating in parallel code (exclude sequential implementation)
Easy to estimate; largest contribution to overhead (in sort example)
–examine MPI_Alltoall and MPI_Alltoallv calls
–for MPICH, calls are implemented as loops over point-to-point calls, hence
T_comm = 2 × P × t_latency + sizeof(local arrays)/bandwidth

16 Sources of Overhead: Redundancy
Performs same computation on many processors
P−1 processors not carrying out useful work
Negligible for sort1
Some O(1) operations (calling malloc to obtain temporary space) do not impact performance

17 Sources of Overhead: Extra Work
Parallel computation that does not take place in a sequential implementation
–e.g. for sort1: computing the processor destination for every input element

18 Sources of Overhead: Load Imbalance
Measures extra time spent by the slowest processor, in excess of the mean over all processors
With uniformly distributed input, the per-processor element count is approximately Gaussian, ~N(N/P, (N/P)×sqrt((P−1)/N)), so the imbalance
imbal = (n_largest − n_mean)/n_mean
should be of order O(1)×sqrt((P−1)/N)
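
A minimal sketch of how the imbalance could be measured at run time (variable names are assumptions, not the book's): reduce the local element count with MAX and SUM, then compare the largest count with the mean.

```c
#include <mpi.h>

/* Compute (n_largest - n_mean)/n_mean from each rank's local count. */
double imbalance(int n_local, MPI_Comm comm)
{
    int  commsize, n_largest;
    long n_local_l = n_local, n_total;

    MPI_Comm_size(comm, &commsize);
    MPI_Allreduce(&n_local,   &n_largest, 1, MPI_INT,  MPI_MAX, comm);
    MPI_Allreduce(&n_local_l, &n_total,   1, MPI_LONG, MPI_SUM, comm);

    double n_mean = (double)n_total / commsize;
    return (n_largest - n_mean) / n_mean;
}
```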

19 Sources of Overhead: Waiting
Fine-grained imbalance even though the overall load may be balanced
–e.g. frequent synchronization between short computations
For sort, synchronization occurs during calls to MPI_Alltoall and MPI_Alltoallv
–occurs immediately after initial decomposition; overhead negligible

20 Measurement of Integer Sort
upshot (distributed with MPICH)
–Graphical tool to render a visual representation of parallel program behavior
–Logs the time spent in different phases
–Goal is to improve the performance

21 More General Sorting: Performance Improvement
Faster sequential sort routines
–D.E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, 1973
Relax restrictions on input data
–sort more general objects
May no longer use an integer key
–use of a compar function (see the sketch below)
–choosing "fenceposts"
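
A sketch of what sorting "more general objects" with a compar function might look like, assuming a hypothetical 32-byte record keyed by an integer (the record layout is an assumption, chosen only to match the object size quoted later in the analysis):

```c
#include <stdlib.h>

/* Hypothetical 32-byte record: an integer key plus opaque payload. */
typedef struct {
    int  key;
    char payload[28];
} record_t;

/* qsort/bsearch-style comparison function on the key. */
static int record_cmp(const void *a, const void *b)
{
    const record_t *ra = a, *rb = b;
    return (ra->key > rb->key) - (ra->key < rb->key);
}

/* Local sort of n records. */
void sort_records(record_t *r, size_t n)
{
    qsort(r, n, sizeof(record_t), record_cmp);
}
```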

22 More General Sorting: Approach
Solution to choosing fenceposts: oversample
Select nfence objects
–a larger value of nfence gives better load balance
–but results in more work
–the nfence value is difficult to determine a priori
MPI_Allgather and bsearch divide the data into more bins than processors (see the sketch below)
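
A hedged sketch of the oversampling step: each rank contributes nfence sample keys, MPI_Allgather collects them on every rank, and the sorted samples become the fenceposts that define the bins. The book uses bsearch with a suitable comparison function to locate a key's bin; an explicit binary search is shown here instead for brevity. The sampling rule and all names are assumptions.

```c
#include <mpi.h>
#include <stdlib.h>

static int key_cmp(const void *a, const void *b)
{
    int ia = *(const int *)a, ib = *(const int *)b;
    return (ia > ib) - (ia < ib);
}

/* Gather nfence sample keys from every rank and sort them; the resulting
   commsize*nfence fenceposts define commsize*nfence + 1 bins. */
int *make_fenceposts(const int *local_keys, int n_local, int nfence,
                     MPI_Comm comm, int *nfence_total_out)
{
    int commsize;
    MPI_Comm_size(comm, &commsize);

    /* Evenly spaced local samples (one simple oversampling rule). */
    int *samples = malloc(nfence * sizeof(int));
    for (int i = 0; i < nfence; i++)
        samples[i] = local_keys[(long)i * n_local / nfence];

    int *fence = malloc((size_t)commsize * nfence * sizeof(int));
    MPI_Allgather(samples, nfence, MPI_INT, fence, nfence, MPI_INT, comm);
    qsort(fence, (size_t)commsize * nfence, sizeof(int), key_cmp);

    free(samples);
    *nfence_total_out = commsize * nfence;
    return fence;
}

/* Bin index of a key: the number of fenceposts <= key. */
int bin_of(int key, const int *fence, int nfence_total)
{
    int lo = 0, hi = nfence_total;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (fence[mid] <= key) lo = mid + 1;
        else                   hi = mid;
    }
    return lo;
}
```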

23 More General Sorting: Approach (cont'd)
Look at the population of each bin and assign bins to processors, achieving load balance (see the sketch below)
MPI_Allreduce computes the sum over all processors of the count in each bucket
Finally, MPI_Alltoallv delivers elements to the correct destinations, and a call to qsort completes the local sort in each processor
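
A sketch of one way the bin-to-processor assignment could work; the slide does not spell out the exact policy, so this greedy rule is an assumption. Per-bin counts are summed over all ranks with MPI_Allreduce, then the bin sequence is cut each time a rank has accumulated roughly its share.

```c
#include <mpi.h>
#include <stdlib.h>

/* Given local_counts[b] = number of local elements in bin b, compute
   owner[b] = rank assigned bin b, keeping bins contiguous and each
   rank's share close to the mean. */
void assign_bins(const int *local_counts, int nbins, int *owner, MPI_Comm comm)
{
    int commsize;
    MPI_Comm_size(comm, &commsize);

    /* Global population of every bin. */
    long *local  = malloc(nbins * sizeof(long));
    long *global = malloc(nbins * sizeof(long));
    long total = 0;
    for (int b = 0; b < nbins; b++) local[b] = local_counts[b];
    MPI_Allreduce(local, global, nbins, MPI_LONG, MPI_SUM, comm);
    for (int b = 0; b < nbins; b++) total += global[b];

    /* Greedy cut: move to the next rank once it holds ~total/commsize. */
    long target = (total + commsize - 1) / commsize, acc = 0;
    int rank = 0;
    for (int b = 0; b < nbins; b++) {
        owner[b] = rank;
        acc += global[b];
        if (acc >= target && rank < commsize - 1) { rank++; acc = 0; }
    }
    free(local); free(global);
}
```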

24 Analysis of General Sort
More complicated program, as it invokes more MPI routines
Tradeoff between the cost of selecting more fenceposts and the improved load balance from using more samples
Author chooses an intermediate case: P=12, N=1M, object size=32 bytes

25 Summary
Trust no one
A performance model
Instrumentation and graphs
Graphical tools
Superlinear speedup
Enough is enough