Parallel Inversion of Polynomial Matrices


Parallel Inversion of Polynomial Matrices Alina Solovyova-Vincent, Frederick C. Harris, Jr., and M. Sami Fadali

Overview Introduction Existing algorithms Busłowicz’s algorithm Parallel algorithm Results Conclusions and future work

Definitions A polynomial matrix is a matrix whose entries are all polynomials: H(s) = Hₙsⁿ + Hₙ₋₁sⁿ⁻¹ + Hₙ₋₂sⁿ⁻² + … + H₀, where the Hᵢ are constant r × r matrices, i = 0, …, n.

Definitions Example:

H(s) = [ s+2    s³+3s²+s ]
       [ s³     s²+1     ]

n = 3 – degree of the polynomial matrix
r = 2 – the size of the matrix H

H₀ = [ 2  0 ]    H₁ = [ 1  1 ]    …
     [ 0  1 ]         [ 0  0 ]
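This coefficient representation maps directly onto arrays of constant matrices. A minimal C sketch (illustrative only, not the thesis code) that encodes the example above, with H₂ and H₃ read off from H(s), and evaluates H(s) at a scalar point by Horner's rule:

#include <stdio.h>

#define N 3   /* degree of the polynomial matrix */
#define R 2   /* size of each coefficient matrix */

/* Coefficient matrices of the example: H(s) = H3*s^3 + H2*s^2 + H1*s + H0 */
static const double H[N + 1][R][R] = {
    {{2, 0}, {0, 1}},   /* H0 */
    {{1, 1}, {0, 0}},   /* H1 */
    {{0, 3}, {0, 1}},   /* H2 */
    {{0, 1}, {1, 0}},   /* H3 */
};

/* Evaluate H(s) entrywise by Horner's rule. */
static void eval_H(double s, double out[R][R]) {
    for (int i = 0; i < R; i++)
        for (int j = 0; j < R; j++) {
            double v = H[N][i][j];
            for (int d = N - 1; d >= 0; d--)
                v = v * s + H[d][i][j];
            out[i][j] = v;
        }
}

int main(void) {
    double M[R][R];
    eval_H(2.0, M);   /* expect H(2) = [4 22; 8 5] */
    printf("%g %g\n%g %g\n", M[0][0], M[0][1], M[1][0], M[1][1]);
    return 0;
}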

Definitions H⁻¹(s) – the inverse of the matrix H(s). One way to calculate it: H⁻¹(s) = adj H(s) / det H(s)
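As a worked instance (computed here from the example matrix above; it does not appear in the original slides):

\[
\operatorname{adj} H(s) = \begin{pmatrix} s^2+1 & -(s^3+3s^2+s) \\ -s^3 & s+2 \end{pmatrix},
\qquad
\det H(s) = (s+2)(s^2+1) - s^3(s^3+3s^2+s) = -s^6 - 3s^5 - s^4 + s^3 + 2s^2 + s + 2 .
\]

Dividing the adjugate by the determinant gives H⁻¹(s), whose entries are ratios of polynomials, which motivates the next definition.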

Definitions A rational matrix can be expressed as a ratio of a numerator polynomial matrix and a denominator scalar polynomial.

Who Needs It??? Multivariable control systems Analysis of power systems Robust stability analysis Design of linear decoupling controllers … and many more areas.

Existing Algorithms Exact algorithms: Leverrier’s algorithm (1840), built around the resolvent matrix [sI − H]⁻¹. Approximation methods.

The Selection of the Algorithm Before Busłowicz’s algorithm (1980): large degree of polynomial operations, lengthy calculations, not very general. After: some improvements, at the cost of increased computational complexity.

Busłowicz’s Algorithm Benefits: more general than the methods proposed earlier; requires only operations on constant matrices; suitable for computer programming. Drawback: the irreducible form cannot be ensured in general.

Details of the Algorithm Available upon request

Challenges Encountered (sequential) Several inconsistencies in the original paper had to be identified and resolved.

Challenges Encountered (parallel) Dependent loops – each outer iteration needs the results of the previous one:

for (i = 2; i < r + 1; i++) {
    for (k = 0; k < n * i + 1; k++) {
        /* calculations requiring R[i-1][k] */
    }
}

Cost of this loop nest: O(n²r⁴)

Challenges Encountered (parallel) Loops of variable length – the inner bound varies with k:

for (k = 0; k < n * i + 1; k++) {
    for (ll = 0; ll < min + 1; ll++) {
        /* main calculations */
    }
}

Since min varies with k, different iterations do unequal amounts of work.
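One natural shared-memory mapping of these two loop nests (a sketch under assumptions, not the thesis code; inner_bound and main_calculations are hypothetical stand-ins for Busłowicz’s recursions) keeps the dependent i loop sequential, parallelizes the independent k iterations, and uses dynamic scheduling to balance the variable inner-loop lengths:

#include <omp.h>

/* Hypothetical helpers standing in for the actual recursions. */
int  inner_bound(int i, int k);                 /* the "min" of the slide */
void main_calculations(int i, int k, int ll);   /* reads R[i-1][...] */

void compute_R(int n, int r) {
    for (int i = 2; i < r + 1; i++) {           /* dependent: needs R[i-1] */
        #pragma omp parallel for schedule(dynamic)
        for (int k = 0; k < n * i + 1; k++) {   /* independent iterations */
            int min = inner_bound(i, k);        /* varies with k */
            for (int ll = 0; ll < min + 1; ll++)
                main_calculations(i, k, ll);
        }
        /* implicit barrier here: all of R[i] is ready before i+1 starts */
    }
}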

Shared and Distributed Memory Main difference: synchronization of the processes – shared memory uses a barrier, distributed memory uses data exchange.

for (i = 2; i < r + 1; i++) {
    /* calculations requiring R[i-1] */
    /* synchronization point */
}
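In the distributed-memory version the barrier becomes an explicit exchange: each process computes its share of R[i], then all processes gather the pieces before moving on to i+1. A hedged MPI sketch (the block decomposition, equal block lengths, and helper names are assumptions; in the real algorithm the size of R[i] grows with i):

#include <mpi.h>

/* Hypothetical helper: compute this rank's slice of R[i] from R[i-1]. */
void compute_block(int i, int n, int r, const double *R_prev, double *my_block);

void parallel_step(int n, int r, double *R_prev, double *R_cur,
                   double *my_block, int block_len) {
    for (int i = 2; i < r + 1; i++) {
        compute_block(i, n, r, R_prev, my_block);     /* local work */

        /* Synchronization point: every rank contributes its slice and
           receives the complete R[i] before iteration i+1. */
        MPI_Allgather(my_block, block_len, MPI_DOUBLE,
                      R_cur, block_len, MPI_DOUBLE, MPI_COMM_WORLD);

        double *tmp = R_prev; R_prev = R_cur; R_cur = tmp;
    }
}

The cost of this exchange relative to the local work per iteration is what separates the distributed-memory results below from the shared-memory ones.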

Platforms Distributed memory platforms: SGI O2 NOW (MIPS R5000, 180 MHz); P IV NOW (1.8 GHz); P III Cluster (1 GHz); P IV Cluster (Xeon, 2.2 GHz)

Platforms Shared memory platforms: SGI Power Challenge 10000 (8 MIPS R10000 processors); SGI Origin 2000 (16 MIPS R12000 processors, 300 MHz)

Understanding the Results n – degree of the polynomial (≤ 25); r – size of the matrix (≤ 25); sequential algorithm – O(n²r⁵); reported times are averages of multiple runs on unloaded platforms.
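A quick order-of-magnitude check (a back-of-the-envelope estimate, not from the slides) explains the long sequential times at the largest size: with n = r = 25,

\[
n^2 r^5 = 25^2 \cdot 25^5 = 25^7 \approx 6.1 \times 10^9
\]

elementary operations, before constant factors.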

Sequential Run Times (n=25, r=25)

Platform              Time (sec)
SGI O2 NOW            2645.30
P IV NOW              22.94
P III Cluster         26.10
P IV Cluster          18.75
SGI Power Challenge   913.99
SGI Origin 2000       552.95

Results – Distributed Memory Speedup: SGI O2 NOW – slowdown; P IV NOW – minimal speedup

Speedup (P III & P IV Clusters)

Results – Shared Memory Excellent results!!!

Speedup (SGI Power Challenge)

Speedup (SGI Origin 2000) Superlinear speedup!

Run times (SGI Power Challenge) 8 processors

Run times (SGI Origin 2000) n = 25

Run times (SGI Power Challenge)

Efficiency (by number of processors)

Platform              2       4       6       8       16      24
P III Cluster         89.7%   76.5%   61.3%   58.5%   40.1%   25.0%
P IV                  88.3%   68.2%   49.9%   46.9%   26.1%   15.5%
SGI Power Challenge   99.7%   98.2%   97.9%   95.8%   n/a     n/a
SGI Origin 2000       99.9%   98.7%   99.0%   93.8%   n/a     n/a
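These percentages are consistent with the usual definition of parallel efficiency (assumed here; the slides do not state it):

\[
E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}
\]

so, for example, the P III Cluster entry of 89.7% at p = 2 corresponds to a speedup of about 1.79.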

Conclusions We have performed an extensive survey of the available algorithms; we have implemented the sequential version of Busłowicz’s algorithm; we have implemented two versions of the parallel algorithm; we have tested the parallel algorithm on 6 different platforms; we have obtained excellent speedup and efficiency in a shared-memory environment.

Future Work Study the behavior of the algorithm for larger problem sizes (distributed memory). Re-evaluate message passing in the distributed-memory implementation. Extend Busłowicz’s algorithm to the inversion of multivariable polynomial matrices H(s1, s2, …, sk).

Questions