NRL Hybrid QR Factorization Algorithm for High Performance Computing Architectures. Peter Vouras, Naval Research Laboratory, Radar Division; Professor G.G.L. Meyer, Johns Hopkins University.

Presentation transcript:

08/10/2001
Hybrid QR Factorization Algorithm for High Performance Computing Architectures
Peter Vouras, Naval Research Laboratory, Radar Division
Professor G.G.L. Meyer, Johns Hopkins University, Parallel Computing and Imaging Laboratory

Outline
- Background
- Problem Statement
- Givens Task
- Householder Task
- Paths Through Dependency Graph
- Parameterized Algorithms
- Parameters Used
- Results
- Conclusion

Background
- In many least squares problems, a QR decomposition is employed
  - Factor the matrix A into a unitary matrix Q and an upper triangular matrix R such that A = QR
- Two primary algorithms are available to compute the QR decomposition (both are sketched in code below)
  - Givens rotations: pre-multiplying rows i-1 and i of a matrix A by a 2x2 Givens rotation matrix zeroes the entry A(i, j)
  - Householder reflections: multiplying a column of A by an appropriate Householder reflection zeroes all the subdiagonal entries in that column
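The following minimal C sketch (not the NRL code; sizes and values are illustrative placeholders) shows the two kernels named above: a 2x2 Givens rotation of rows i-1 and i that zeroes A(i, j), and a single Householder reflection that zeroes every subdiagonal entry of one column.

```c
/* Illustrative sketch of the two QR kernels on a small row-major matrix. */
#include <math.h>
#include <stdio.h>

#define M 4
#define N 3

/* Zero A[i][j] by rotating rows i-1 and i with a 2x2 Givens rotation. */
static void givens_zero(double A[M][N], int i, int j)
{
    double a = A[i-1][j], b = A[i][j];
    double r = hypot(a, b);
    if (r == 0.0) return;                 /* entry is already zero        */
    double c = a / r, s = b / r;
    for (int k = 0; k < N; k++) {         /* apply the rotation to both rows */
        double t1 = A[i-1][k], t2 = A[i][k];
        A[i-1][k] =  c * t1 + s * t2;
        A[i][k]   = -s * t1 + c * t2;
    }
}

/* Zero all entries below the diagonal in column j with one Householder
 * reflection: A <- A - beta * v * (v^T A), restricted to rows j..M-1. */
static void householder_zero_column(double A[M][N], int j)
{
    double norm = 0.0, v[M];
    for (int i = j; i < M; i++) norm += A[i][j] * A[i][j];
    norm = sqrt(norm);
    if (norm == 0.0) return;
    double alpha = (A[j][j] >= 0.0) ? -norm : norm;   /* avoid cancellation */
    double vtv = 0.0;
    for (int i = j; i < M; i++) {
        v[i] = A[i][j];
        if (i == j) v[i] -= alpha;
        vtv += v[i] * v[i];
    }
    double beta = 2.0 / vtv;
    for (int k = j; k < N; k++) {         /* w = v^T A(:,k), then rank-1 update */
        double w = 0.0;
        for (int i = j; i < M; i++) w += v[i] * A[i][k];
        for (int i = j; i < M; i++) A[i][k] -= beta * v[i] * w;
    }
}

int main(void)
{
    double A[M][N] = {{4,1,2},{3,5,6},{1,2,3},{2,4,1}};
    givens_zero(A, 3, 0);                 /* zero A(3,0) using rows 2 and 3   */
    householder_zero_column(A, 1);        /* zero below the diagonal in col 1 */
    for (int i = 0; i < M; i++) {
        for (int k = 0; k < N; k++) printf("%8.4f ", A[i][k]);
        printf("\n");
    }
    return 0;
}
```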

Problem Statement
- The goal is to minimize the latency incurred when computing the QR decomposition of a matrix A while maintaining performance across different platforms
- The algorithm consists of a parallel Givens task and a serial Householder task
- Parallel Givens task
  - Allocate blocks of rows to different processors; each processor uses Givens rotations to zero all available entries within its block, such that A(i, j) = 0 only if A(i-1, j-1) = 0 and A(i, j-1) = 0
- Serial Householder task
  - Once the Givens task terminates, all distributed rows are sent to the root processor, which uses Householder reflections to zero the remaining entries

Givens Task
- Each processor uses Givens rotations to zero entries up to the topmost row in its assigned group of rows
- Once the task is complete, the rows are returned to the root processor
- Givens rotations are accumulated in a separate matrix before updating all of the columns in the array (the deferred update is sketched below)
  - Avoids updating columns that will not be used by an immediately following Givens rotation
  - Saves a significant fraction of the computational flops
(Figure: row blocks assigned to Processor 0, Processor 1, and Processor 2)
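The sketch below illustrates the accumulation idea in plain C, under the simplifying assumption (for illustration only) that one processor's block zeroes a single column with adjacent-row rotations. Each rotation updates only the column being zeroed plus a small accumulator matrix G; the remaining columns are updated once, at the end, with a single dense multiply, rather than once per rotation.

```c
/* Deferred-update sketch: accumulate rotations in G, update trailing columns once. */
#include <math.h>
#include <stdio.h>

#define B 4   /* rows in this processor's block */
#define NC 5  /* columns in the matrix          */

int main(void)
{
    double A[B][NC] = {
        {2, 1, 3, 0, 1},
        {4, 2, 1, 5, 2},
        {1, 3, 2, 1, 0},
        {3, 1, 4, 2, 2},
    };
    double G[B][B] = {0};                     /* accumulated rotations */
    for (int i = 0; i < B; i++) G[i][i] = 1.0;

    /* Zero A(i,0) for i = B-1 .. 1 with rotations of rows i-1 and i. */
    for (int i = B - 1; i >= 1; i--) {
        double a = A[i-1][0], b = A[i][0];
        double r = hypot(a, b);
        if (r == 0.0) continue;
        double c = a / r, s = b / r;
        A[i-1][0] = r;                        /* update column 0 only  */
        A[i][0]   = 0.0;
        for (int k = 0; k < B; k++) {         /* fold this rotation into G */
            double t1 = G[i-1][k], t2 = G[i][k];
            G[i-1][k] =  c * t1 + s * t2;
            G[i][k]   = -s * t1 + c * t2;
        }
    }

    /* Deferred update: columns 1..NC-1 receive all rotations in one pass. */
    double T[B][NC - 1];
    for (int i = 0; i < B; i++)
        for (int j = 1; j < NC; j++) {
            double sum = 0.0;
            for (int k = 0; k < B; k++) sum += G[i][k] * A[k][j];
            T[i][j-1] = sum;
        }
    for (int i = 0; i < B; i++)
        for (int j = 1; j < NC; j++) A[i][j] = T[i][j-1];

    for (int i = 0; i < B; i++) {
        for (int j = 0; j < NC; j++) printf("%8.4f ", A[i][j]);
        printf("\n");
    }
    return 0;
}
```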

Householder Task
- The root processor uses Householder reflections to zero the remaining entries in the Givens columns
- By computing a priori where the zeroes will be after each Givens task completes, the root processor can perform a sparse matrix multiply during each Householder update for additional speed-up
  - The Householder update is A = A - βvv^T A
- The Householder update involves a matrix-vector multiplication and an outer-product update (sketched below)
  - Makes extensive use of BLAS routines
(Figure: root processor, Processor 0)
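Below is a sketch of the update A = A - βvv^T A expressed as the two BLAS-2 operations this slide refers to: a matrix-vector multiply (DGEMV) forming w = A^T v, followed by a rank-1 outer-product update (DGER). It assumes a CBLAS interface (e.g., OpenBLAS) is available; the matrix, the vector v, and the sizes are placeholders, and the sparse-multiply optimization mentioned above is not shown.

```c
/* Householder rank-1 update via two BLAS-2 calls.
 * Compile with something like: cc hh_update.c -lopenblas -lm */
#include <cblas.h>
#include <stdio.h>

#define M 4
#define N 3

int main(void)
{
    double A[M * N] = { 4, 1, 2,
                        3, 5, 6,
                        1, 2, 3,
                        2, 4, 1 };           /* row-major M x N matrix      */
    double v[M] = { 1.0, 0.5, -0.25, 0.75 }; /* placeholder Householder vector */
    double w[N] = { 0 };

    /* beta = 2 / (v^T v) */
    double vtv = cblas_ddot(M, v, 1, v, 1);
    double beta = 2.0 / vtv;

    /* w = A^T v  (matrix-vector multiply, DGEMV) */
    cblas_dgemv(CblasRowMajor, CblasTrans, M, N, 1.0, A, N, v, 1, 0.0, w, 1);

    /* A = A - beta * v * w^T  (rank-1 outer-product update, DGER) */
    cblas_dger(CblasRowMajor, M, N, -beta, v, 1, w, 1, A, N);

    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) printf("%8.4f ", A[i * N + j]);
        printf("\n");
    }
    return 0;
}
```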

Dependency Graph - Path 1
- The algorithm must zero matrix entries in an order such that previously zeroed entries are not filled in
- This implies that A(i, j) can be zeroed only if A(i-1, j-1) and A(i, j-1) are already zero (this constraint is encoded in the checker below)
- More than one sequence exists that zeroes the entries while satisfying this constraint
- The choice of path through the dependency graph greatly affects performance
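The small program below does not reproduce the authors' path; it simply encodes the fill-in constraint as a checker and verifies one straightforward column-by-column elimination order against it. Any path through the dependency graph, including the zig-zag path on the next slide, must satisfy the same check.

```c
/* Checker for the ordering constraint: A(i,j) may be zeroed only if
 * A(i-1,j-1) and A(i,j-1) are already zero. */
#include <stdio.h>
#include <stdbool.h>

#define M 6   /* rows    */
#define N 4   /* columns */

/* Returns true if zeroing A(i,j) now would not violate the constraint. */
static bool can_zero(bool zeroed[M][N], int i, int j)
{
    if (j == 0) return true;                      /* no column to the left */
    return zeroed[i][j-1] && (i == 0 || zeroed[i-1][j-1]);
}

int main(void)
{
    bool zeroed[M][N] = {{false}};
    int order[M * N][2], count = 0;

    /* Candidate path: for each column, zero the subdiagonal entries from
     * the bottom row upward (not the zig-zag path on the next slide). */
    for (int j = 0; j < N; j++)
        for (int i = M - 1; i > j; i--) {
            order[count][0] = i;
            order[count][1] = j;
            count++;
        }

    for (int k = 0; k < count; k++) {
        int i = order[k][0], j = order[k][1];
        if (!can_zero(zeroed, i, j)) {
            printf("step %d: zeroing A(%d,%d) would fill in earlier zeroes\n",
                   k, i, j);
            return 1;
        }
        zeroed[i][j] = true;
    }
    printf("all %d eliminations satisfy the dependency constraint\n", count);
    return 0;
}
```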

Dependency Graph - Path 2
- Traversing the dependency graph in a zig-zag fashion maximizes cache-line reuse
  - Data from a row already in cache is used to zero several matrix entries before the row is expunged from the cache

Parameterized Algorithms: Memory Hierarchy
- Parameterized algorithms make effective use of the memory hierarchy
  - Improve spatial locality of memory references by grouping together data used at the same time
  - Improve temporal locality of memory references by using data retrieved from cache as many times as possible before the cache is flushed
- Portable performance is the primary objective

Givens Parameter
- The parameter c controls the number of columns processed in the Givens task
- It determines how many matrix entries can be zeroed before rows are flushed from cache
(Figure: the c-column panel of the matrix)

Householder Parameter
- The parameter h controls the number of columns zeroed by Householder reflections at the root processor
- If h is large, the root processor performs more serial work, avoiding the communication costs associated with the Givens task
- However, the other processors sit idle longer, decreasing the efficiency of the algorithm
(Figure: the c and h column parameters illustrated on the matrix)

Work Partition Parameters
- The parameters v and w allow the operator to assign rows to processors such that the work load is balanced and processor idle time is minimized
(Figure: row blocks of sizes governed by v and w assigned to Processor 0, Processor 1, and Processor 2)

Results

Server Computer (1)
- HP Superdome, SPAWAR in San Diego, CA
- PA-RISC 8600 CPUs
- 1.5 MB on-chip cache per CPU
- 1 GB RAM per processor

Server Computer (2)
- SGI O3000, NRL in Washington, D.C.
- 512 R12000 processors running at 400 MHz
- 8 MB on-chip cache
- Up to 2 GB RAM per processor

Embedded Computer
- Mercury system, JHU in Baltimore, MD
- 8 Motorola 7400 processors with AltiVec units
- 400 MHz clock
- 64 MB RAM per processor

Effect of c
(Figure: execution time in msec versus c on the Mercury, SGI O3000, and HP Superdome; 100 x 100 array, 4 processors, p = 12)

Effect of h
(Figure: execution time in msec versus h on the Mercury, SGI O3000, and HP Superdome; 100 x 100 array, 4 processors, c = 63, p = 12)

Effect of w
(Figure: execution time in msec versus w on the Mercury, SGI O3000, and HP Superdome; 100 x 100 array, 4 processors, h = 15, p = 10, c = 60)

Performance vs Matrix Size
(Figure: execution time in msec versus matrix size n = m on the Mercury, SGI O3000, and HP Superdome; 4 processors)

Scalability
(Figure: execution time in msec versus number of processors on the Mercury, SGI O3000, and HP Superdome; 500 x 500 array)

Comparison to SCALAPACK
- For matrix sizes on the order of 100 by 100, the Hybrid QR algorithm outperforms the SCALAPACK library routine PSGEQRF by 16% (7.9 ms versus 9.4 ms for PSGEQRF on 4 processors of the SGI O3000)
- Data is distributed in block-cyclic fashion before executing PSGEQRF

Conclusion
- A hybrid QR algorithm that combines Givens rotations and Householder reflections is an efficient way to compute the QR decomposition of small arrays on the order of 100 x 100
- The algorithm was implemented on SGI O3000 and HP Superdome servers as well as on a Mercury G4 embedded computer
- The Mercury implementation lacked optimized BLAS routines, and as a consequence its performance was significantly slower
- The algorithm has applications to signal processing problems, such as adaptive nulling, where strict latency targets must be satisfied