A PARALLEL BISECTION ALGORITHM (WITHOUT COMMUNICATION)


A PARALLEL BISECTION ALGORITHM (WITHOUT COMMUNICATION)
Rui Ralha, DMAT, CMAT, Univ. do Minho, Portugal. r_ralha@math.uminho.pt

Acknowledgements: CMAT, FCT POCTI (European Union contribution), Prof. B. Parlett

Outline
- Counting eigenvalues of symmetric tridiagonals
- The ScaLAPACK routine
- A parallel algorithm without communication
- An alternative algorithm
- Some conclusions

Counting eigenvalues
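At the core of bisection is the classical Sturm count: the number of negative pivots in the LDL^T factorization of T - xI equals the number of eigenvalues of the symmetric tridiagonal T that are smaller than x. A minimal Matlab sketch of such a counting function (the function name and the zero-pivot safeguard are ours, not the talk's code):

```matlab
function c = count_less(a, b, x)
% Sturm count for the symmetric tridiagonal matrix T with diagonal
% a(1:n) and off-diagonal b(1:n-1): returns the number of eigenvalues
% of T smaller than x, i.e. the number of negative pivots in the
% LDL' factorization of T - x*I.
n = length(a);
d = a(1) - x;
c = double(d < 0);
for i = 2:n
    if d == 0
        d = eps;                 % usual safeguard against division by zero
    end
    d = (a(i) - x) - b(i-1)^2 / d;
    c = c + (d < 0);
end
end
```

In exact arithmetic Count(x) is a nondecreasing function of x; in floating point it need not be, which is the nonmonotonicity addressed next.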

Nonmonotonicity of Count(x)

The ScaLAPACK implementation (1)

The ScaLAPACK implementation (2) In [1] the authors wrote: “…Ideally, we would like a bracketing algorithm that was simultaneously parallel, load balanced, devoid of communication, and correct in the face of nonmonotonicity. We still do not know how to achieve this completely; in the most general case, when different parallel processors do not even possess the same floating point format, we do not know how to implement a correct and reasonably fast algorithm at all. Even when floating point formats are the same, we do not know how to avoid some global communication…” They considered a bracketing algorithm to be correct if (1) every eigenvalue is computed exactly once, (2) the computed eigenvalues are correct to within the user-specified error tolerance, and (3) the computed eigenvalues are in sorted order.

The ScaLAPACK implementation (3)

The ScaLAPACK implementation (4)

Drawbacks of the ScaLAPACK implementation

A simple and incorrect parallel algorithm (without communication)
Partition the initial Gerschgorin interval into p subintervals of equal width and assign to processor i the task of finding all the eigenvalues in the i-th subinterval. But even when all processors use the same (nonmonotonic) arithmetic, the algorithm may be incorrect: with n = p = 3, it may happen [1] that the second eigenvalue is computed twice (by processors 1 and 3).
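A minimal sketch of this naive scheme, reusing count_less from above on the [-1 2 -1] test matrix; the serial loop stands in for the p independent processors, and all variable names are illustrative:

```matlab
% Naive, communication-free partitioning (the incorrect algorithm above).
% Each loop iteration stands in for one of the p processors; no processor
% communicates with any other.
n = 100; p = 4;
a = 2*ones(n,1); b = -ones(n-1,1);        % the [-1 2 -1] example
% Gerschgorin interval containing all eigenvalues
r = [abs(b); 0] + [0; abs(b)];            % row radii |b(i-1)| + |b(i)|
glo = min(a - r); ghi = max(a + r);
pts = linspace(glo, ghi, p+1);            % p subintervals of equal width
for i = 1:p
    clo = count_less(a, b, pts(i));
    chi = count_less(a, b, pts(i+1));
    % "Processor" i brackets eigenvalues clo+1 .. chi by further bisection.
    % With a nonmonotonic Count, the ranges of two processors may overlap
    % (the same eigenvalue computed twice) or miss an eigenvalue entirely.
end
```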

Parallel bisection for computing the eigenvalues of [-1 2 -1] with 100 processors

Our proposal (1)

Our proposal (2)

Our proposal (3)

Our proposal (4)

Sorting eigenvalues
For Wilkinson's matrix of order 21, single precision in Matlab and double precision yield different values for the closest eigenvalue pairs (the numerical values were shown on the slide). We assume that eigenvalues are to be gathered in a “master” processor (this is a standard feature of ScaLAPACK). Suppose that the “master” receives two consecutive eigenvalues out of order and knows that the processor that computed one of them has better accuracy. Then it keeps that value and, if required, corrects the other to be smaller than it.
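A possible sketch of the master's fix-up, under our own assumptions about the data layout: lam(1:n) holds the gathered eigenvalues in index order, and acc(k) flags the entries that came from higher-accuracy processors (both names are hypothetical, not part of the talk):

```matlab
% Hypothetical master-side correction: when two consecutive eigenvalues
% arrive out of order, keep the one from the more accurate processor and
% nudge the other to restore the ordering.
for k = 2:n
    if lam(k-1) > lam(k)                     % out of order
        if acc(k)                            % trust lam(k): lower lam(k-1)
            lam(k-1) = lam(k) - eps(lam(k));
        else                                 % trust lam(k-1): raise lam(k)
            lam(k) = lam(k-1) + eps(lam(k-1));
        end
    end
end
```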

An alternative algorithm (1)
- Phase 1 (equal for every processor): carry out a (not too large) number of bisection steps in a breadth-first search to get a “good picture” of the spectrum; this produces a number of intervals (at least p, the number of processors).
- Phase 2: distribute the intervals to the processors, trying to achieve load balance (the same number of eigenvalues for each processor), as in the sketch below.
- Phase 3: each processor computes its assigned eigenvalues to some prescribed accuracy.
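A minimal sketch of Phases 1 and 2, assuming count_less and the variables a, b, glo, ghi, n, p from the earlier sketches; the splitting loop is illustrative and omits the Phase 1 termination refinement discussed on “An alternative algorithm (3)”:

```matlab
% Phase 1 (run identically and redundantly by every processor, hence no
% communication): breadth-first bisection of the Gerschgorin interval until
% at least p nonempty intervals exist.
% Each row holds [left, right, Count(left), Count(right)].
ivs = [glo, ghi, 0, n];
while size(ivs, 1) < p
    nxt = zeros(0, 4);
    for k = 1:size(ivs, 1)
        lo = ivs(k,1); hi = ivs(k,2); clo = ivs(k,3); chi = ivs(k,4);
        if chi - clo <= 1                 % already brackets one eigenvalue
            nxt(end+1,:) = ivs(k,:);
            continue
        end
        mid = (lo + hi) / 2;
        cm = count_less(a, b, mid);
        if cm > clo, nxt(end+1,:) = [lo, mid, clo, cm]; end
        if chi > cm, nxt(end+1,:) = [mid, hi, cm, chi]; end
    end
    ivs = nxt;
end
% Phase 2: deal the rows of ivs out so that each processor brackets roughly
% n/p eigenvalues; Phase 3 then refines each bracket by ordinary serial
% bisection on its own processor, with no further communication.
```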

An alternative algorithm (2)

An alternative algorithm (3)
Preliminary implementation (in Matlab): Phase 1 finishes when enough intervals have been produced that, for each k = 1, …, p-1, an end point x of one of those intervals satisfies Count(x) = kn/p. This may affect the speedup by about 10%. This termination criterion for Phase 1 may be hard to satisfy (i.e., take too many bisection steps) in some cases.

Parallel bisection for computing the eigenvalues of [-1 2 -1] of order 10^4

Conclusions
- Parallel bracketing in ScaLAPACK requires global communication.
- We have proposed an algorithm that is communication-free and load balanced, in the sense that each processor computes the same number of eigenvalues (if p divides n).
- In homogeneous systems, our algorithm produces sorted eigenvalues even when the arithmetic is nonmonotonic.
- In heterogeneous systems, eigenvalues may be unsorted (they may be sorted by the “master” if required).