A PARALLEL BISECTION ALGORITHM (WITHOUT COMMUNICATION)


A PARALLEL BISECTION ALGORITHM (WITHOUT COMMUNICATION)
Rui Ralha, DMAT, CMAT, Univ. do Minho, Portugal. r_ralha@math.uminho.pt

Acknowledgements: CMAT, FCT POCTI (European Union contribution), Prof. B. Parlett

Outline
- Counting eigenvalues of symmetric tridiagonals
- The ScaLAPACK routine
- A parallel algorithm without communication
- An alternative algorithm
- Some conclusions

Counting eigenvalues
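At the core of bisection is the classical Sturm count: the number of negative pivots in the LDL^T factorization of T - xI equals the number of eigenvalues of the symmetric tridiagonal T that are smaller than x. A minimal Matlab sketch of such a counting function (the function name and the zero-pivot safeguard are ours, not the talk's code):

```matlab
function c = count_less(a, b, x)
% Sturm count for the symmetric tridiagonal matrix T with diagonal
% a(1:n) and off-diagonal b(1:n-1): returns the number of eigenvalues
% of T smaller than x, i.e. the number of negative pivots in the
% LDL' factorization of T - x*I.
n = length(a);
d = a(1) - x;
c = double(d < 0);
for i = 2:n
    if d == 0
        d = eps;                 % usual safeguard against division by zero
    end
    d = (a(i) - x) - b(i-1)^2 / d;
    c = c + (d < 0);
end
end
```

In exact arithmetic Count(x) is a nondecreasing function of x; in floating point it need not be, which is the nonmonotonicity addressed next.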

Nonmonotonicity of Count(x)

The ScaLAPACK implementation (1)

The ScaLAPACK implementation (2) In [1] the authors wrote: “…Ideally, we would like a bracketing algorithm that was simultaneously parallel, load balanced, devoid of communication, and correct in the face of nonmonotonicity. We still do not know how to achieve this completely; in the most general case, when different parallel processors do not even possess the same floating point format, we do not know how to implement a correct and reasonably fast algorithm at all. Even when floating point formats are the same, we do not know how to avoid some global communication…” They considered a bracketing algorithm to be correct if (1) every eigenvalue is computed exactly once, (2) the computed eigenvalues are correct to within the user-specified error tolerance, and (3) the computed eigenvalues are in sorted order.

The ScaLAPACK implementation (3)

The ScaLAPACK implementation (4)

Drawbacks of the ScaLAPACK implementation

A simple and incorrect parallel algorithm (without communication)
Partition the initial Gerschgorin interval into p subintervals of equal width and assign to processor i the task of finding all the eigenvalues in the i-th subinterval. But even when all processors use the same (nonmonotonic) arithmetic, the algorithm may be incorrect: with n = p = 3, it may happen [1] that the second eigenvalue is computed twice (by processors 1 and 3).
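A minimal sketch of this naive scheme, reusing count_less from above on the [-1 2 -1] test matrix; the serial loop stands in for the p independent processors, and all variable names are illustrative:

```matlab
% Naive, communication-free partitioning (the incorrect algorithm above).
% Each loop iteration stands in for one of the p processors; no processor
% communicates with any other.
n = 100; p = 4;
a = 2*ones(n,1); b = -ones(n-1,1);        % the [-1 2 -1] example
% Gerschgorin interval containing all eigenvalues
r = [abs(b); 0] + [0; abs(b)];            % row radii |b(i-1)| + |b(i)|
glo = min(a - r); ghi = max(a + r);
pts = linspace(glo, ghi, p+1);            % p subintervals of equal width
for i = 1:p
    clo = count_less(a, b, pts(i));
    chi = count_less(a, b, pts(i+1));
    % "Processor" i brackets eigenvalues clo+1 .. chi by further bisection.
    % With a nonmonotonic Count, the ranges of two processors may overlap
    % (the same eigenvalue computed twice) or miss an eigenvalue entirely.
end
```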

Parallel bisection for computing the eigenvalues of [-1 2 -1] with 100 processors

Our proposal (1)

Our proposal (2)

Our proposal (3)

Our proposal (4)

Sorting eigenvalues
For Wilkinson's matrix of order 21, single precision in Matlab and double precision yield different values for the closest eigenvalue pairs (the numerical values were shown on the slide). We assume that eigenvalues are to be gathered in a “master” processor (this is a standard feature of ScaLAPACK). Suppose that the “master” receives two consecutive eigenvalues out of order and knows that the processor that computed one of them has better accuracy. Then it keeps that value and, if required, corrects the other to be smaller than it.
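A possible sketch of the master's fix-up, under our own assumptions about the data layout: lam(1:n) holds the gathered eigenvalues in index order, and acc(k) flags the entries that came from higher-accuracy processors (both names are hypothetical, not part of the talk):

```matlab
% Hypothetical master-side correction: when two consecutive eigenvalues
% arrive out of order, keep the one from the more accurate processor and
% nudge the other to restore the ordering.
for k = 2:n
    if lam(k-1) > lam(k)                     % out of order
        if acc(k)                            % trust lam(k): lower lam(k-1)
            lam(k-1) = lam(k) - eps(lam(k));
        else                                 % trust lam(k-1): raise lam(k)
            lam(k) = lam(k-1) + eps(lam(k-1));
        end
    end
end
```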

An alternative algorithm (1)
- Phase 1 (equal for every processor): carry out a (not too large) number of bisection steps in a breadth-first search to get a “good picture” of the spectrum; this produces a number of intervals (at least p, the number of processors).
- Phase 2: distribute the intervals to the processors, trying to achieve load balance (the same number of eigenvalues for each processor), as in the sketch below.
- Phase 3: each processor computes its assigned eigenvalues to some prescribed accuracy.
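A minimal sketch of Phases 1 and 2, assuming count_less and the variables a, b, glo, ghi, n, p from the earlier sketches; the splitting loop is illustrative and omits the Phase 1 termination refinement discussed on “An alternative algorithm (3)”:

```matlab
% Phase 1 (run identically and redundantly by every processor, hence no
% communication): breadth-first bisection of the Gerschgorin interval until
% at least p nonempty intervals exist.
% Each row holds [left, right, Count(left), Count(right)].
ivs = [glo, ghi, 0, n];
while size(ivs, 1) < p
    nxt = zeros(0, 4);
    for k = 1:size(ivs, 1)
        lo = ivs(k,1); hi = ivs(k,2); clo = ivs(k,3); chi = ivs(k,4);
        if chi - clo <= 1                 % already brackets one eigenvalue
            nxt(end+1,:) = ivs(k,:);
            continue
        end
        mid = (lo + hi) / 2;
        cm = count_less(a, b, mid);
        if cm > clo, nxt(end+1,:) = [lo, mid, clo, cm]; end
        if chi > cm, nxt(end+1,:) = [mid, hi, cm, chi]; end
    end
    ivs = nxt;
end
% Phase 2: deal the rows of ivs out so that each processor brackets roughly
% n/p eigenvalues; Phase 3 then refines each bracket by ordinary serial
% bisection on its own processor, with no further communication.
```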

An alternative algorithm (2)

An alternative algorithm (3)
Preliminary implementation (in Matlab): Phase 1 finishes when enough intervals have been produced that, for each k = 1, …, p-1, an end point x of one of those intervals satisfies Count(x) = kn/p. This may affect the speedup by about 10%. This termination criterion for Phase 1 may be hard to satisfy (i.e., take too many bisection steps) in some cases.

Parallel bisection for computing the eigenvalues of [-1 2 -1] of order 10^4

Conclusions
- Parallel bracketing in ScaLAPACK requires global communication.
- We have proposed an algorithm that is communication-free and load balanced, in the sense that each processor computes the same number of eigenvalues (if p divides n).
- In homogeneous systems, our algorithm produces sorted eigenvalues even when the arithmetic is nonmonotonic.
- In heterogeneous systems, eigenvalues may be unsorted (they may be sorted by the “master” if required).