Presentation transcript:

Mixed Precision Iterative Refinement Solver Procedure Using High Performance Linpack (HPL)

H. Che (2), E. D'Azevedo (1), M. Sekachev (3), K. Wong (3)
(1) Oak Ridge National Laboratory, (2) Chinese University of Hong Kong, (3) National Institute for Computational Sciences

Motivation

The High Performance Linpack (HPL) benchmark often outperforms ScaLAPACK in solving a large dense system of equations, but it is commonly used only as a double precision performance benchmark. => Use HPL for the LU decomposition.

The Single Instruction Multiple Data (SIMD) capability in most processors can achieve higher performance in 32-bit than in 64-bit operations. => Use an iterative refinement procedure.

Goals

- Deliver a user library that extends HPL to perform single, double, complex, double complex, and mixed precision calculations.
- Deliver an interface compatible with the ScaLAPACK calling convention, so that simple code modifications achieve top performance with the modified HPL library (libmhpl.a).
- Use a mixed precision solver that performs the costly LU factorization in 32-bit arithmetic but achieves 64-bit accuracy through iterative refinement.

High Performance Linpack (HPL)

HPL is written in portable C to evaluate the parallel performance of Top500 computers by solving a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers. HPL uses a right-looking variant of the LU factorization of a random matrix with row partial pivoting, and a 2-D block-cyclic data distribution is employed. It has tunable parameters to implement multiple look-ahead depths, recursive panel factorization with pivot search and column broadcast combined, various virtual panel broadcast topologies, a bandwidth-reducing swap-broadcast algorithm, and a ring broadcast algorithm in the backward substitution.

ScaLAPACK Based Application Code Modifications

By performing simple modifications of ScaLAPACK-based source code, one can call the more efficient HPL functions by linking with the libmhpl.a library provided here. For example, the test program provided by ScaLAPACK (pdludriver.f) is modified as follows (pdhpl_driver.f):

1. Two extra integers are declared:
      integer hpl_lld, hpl_ineed
2. Extra HPL functions are called:
      call blacs_barrier(ictxt, 'A')
      call hpl_dblacsinit( ictxt )
      call hpl_dmatinit( n, NB, hpl_lld, hpl_ineed )
      call descinit( descA, n, n+1, NB, NB, 0, 0, ICTXT, hpl_lld, ierr(1) )
3. The original ScaLAPACK call
      CALL PDGETRF( M, N, MEM(IPA), 1, 1, DESCA, MEM(IPPIV), INFO )
   is replaced by the HPL call
      call hpl_pdgesv( n, mem(IPA), descA, mem(ippiv), info )
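To see how these modifications sit inside a complete driver, the following is a minimal sketch, not the actual pdhpl_driver.f. It assumes, based only on the calls shown above, that hpl_dmatinit returns the local leading dimension in hpl_lld and the required local storage in hpl_ineed, and that the pivot array has length n; the program name, grid shape, matrix generation, and error checking are illustrative placeholders.

      program mhpl_driver_sketch
!     Sketch of a ScaLAPACK-style driver that calls the modified HPL
!     solver instead of PDGETRF.  Assumes nprocs = nprow * npcol.
      implicit none
      integer :: n, nb, nprow, npcol
      integer :: ictxt, myrow, mycol, iam, nprocs
      integer :: hpl_lld, hpl_ineed, info
      integer :: descA(9), ierr(1)
      double precision, allocatable :: mem(:)
      integer,          allocatable :: ippiv(:)

      n = 5000; nb = 64; nprow = 2; npcol = 2

!     Standard BLACS process-grid setup.
      call blacs_pinfo( iam, nprocs )
      call blacs_get( -1, 0, ictxt )
      call blacs_gridinit( ictxt, 'Row-major', nprow, npcol )
      call blacs_gridinfo( ictxt, nprow, npcol, myrow, mycol )

!     Extra initialization required by the modified HPL library
!     (assumed meaning: hpl_lld = local leading dimension,
!      hpl_ineed = local storage needed for A and the right-hand side).
      call blacs_barrier( ictxt, 'A' )
      call hpl_dblacsinit( ictxt )
      call hpl_dmatinit( n, nb, hpl_lld, hpl_ineed )
      allocate( mem(hpl_ineed), ippiv(n) )

!     Descriptor for the n x (n+1) matrix [A | b], as in step 2 above.
      call descinit( descA, n, n+1, nb, nb, 0, 0, ictxt, hpl_lld, ierr(1) )

!     ... fill mem with the local pieces of A and b here ...

!     HPL-based factorization and solve, replacing CALL PDGETRF(...).
      call hpl_pdgesv( n, mem, descA, ippiv, info )

      call blacs_gridexit( ictxt )
      call blacs_exit( 0 )
      end program mhpl_driver_sketch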
Methodology

To obtain the three additional precisions, the original HPL source code is rewritten by modifying data types and function names, using the same convention as ScaLAPACK for naming files and functions:
   's' stands for SINGLE REAL
   'd' stands for DOUBLE REAL
   'c' stands for SINGLE COMPLEX
   'z' stands for DOUBLE COMPLEX

Data-type changes (for the complex versions):
   Variable data type:          double A -> double complex A
   MPI communication data type: MPI_Send(A, ..., MPI_DOUBLE, ...) -> MPI_Send(A, ..., MPI_DOUBLE_COMPLEX, ...)
   Function return type:        double HPL_rand -> double complex HPL_zrand

For example, HPL_pdlange.c becomes HPL_pslange.c with the following change in content:

   /*
    * Find norm_1( A ).
    */
   if( nq > 0 )
   {
      work = (double*)malloc( nq * sizeof( double ) );
      if( work == NULL )
      { HPL_pabort( __LINE__, "HPL_pdlange", "Memory allocation failed" ); }
      for( jj = 0; jj < nq; jj++ )
      {
         s = HPL_rzero;
         for( ii = 0; ii < mp; ii++ ) { s += Mabs( *A ); A++; }
         work[jj] = s;
         A += LDA - mp;
      }
   }

In HPL_pslange.c, double* becomes float*, the (double) casts become (float), the function name becomes HPL_pslange, and the constant HPL_rzero is reused.

Unchanged:
   Timing variables: /hpl/testing/ptimer, /timer
   Norm variable: Anorm1 in HPL_pztest.c
   Residual variable: resid0 in HPL_pztest.c
   Function return data type: double HPL_pzlange

Mixed Precision Using HPL

- HPL updates the upper triangular matrix only.
- The lower triangular matrix must also be updated to prepare for iterative refinement.
- The global pivot vector is recovered from HPL and used to swap the rows of the lower triangular matrix.
- The lower triangular factor is updated so that the result is compatible with ScaLAPACK.
- The data structure code was modified to assemble and return the pivot vector.

Iterative Refinement

Better performance is gained by performing the LU factorization in single precision (much faster than in double precision), then using ScaLAPACK routines for the triangular solves, matrix-vector multiplies, and iterative refinement to recover double precision accuracy.

Solving the matrix:
- use 'call hpl_psgesv(...)' to solve the system with HPL
- instead of 'call psgetrf(...)' (ScaLAPACK)

Major computational cost:
- solving the matrix is O(N^3), where N x N is the size of the matrix
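The refinement loop itself is compact. The following is a minimal serial sketch of the procedure using LAPACK/BLAS routines (sgetrf, sgetrs, dgemv) in place of the distributed hpl_psgesv and ScaLAPACK calls used on the poster; the routine name mixed_refine, the stopping tolerance, and the iteration limit are illustrative assumptions, not part of the poster's implementation.

      subroutine mixed_refine( n, a, lda, b, x, iter )
!     Serial illustration of mixed precision iterative refinement:
!     factor a single precision copy of A once, then refine the
!     solution with double precision residuals.
      implicit none
      integer,          intent(in)  :: n, lda
      double precision, intent(in)  :: a(lda,n), b(n)
      double precision, intent(out) :: x(n)
      integer,          intent(out) :: iter

      integer, parameter :: itmax = 30
      real               :: sa(n,n), sr(n)
      double precision   :: r(n), tol
      integer            :: ipiv(n), info

      tol = 1.0d-14 * maxval( abs( b ) )       ! illustrative tolerance

      sa = real( a(1:n,1:n) )                  ! 32-bit copy of A
      call sgetrf( n, n, sa, n, ipiv, info )   ! O(N^3) work in single

      sr = real( b )                           ! first solution estimate
      call sgetrs( 'N', n, 1, sa, n, ipiv, sr, n, info )
      x  = dble( sr )

      do iter = 1, itmax
!        Double precision residual r = b - A*x.
         r = b
         call dgemv( 'N', n, n, -1.0d0, a, lda, x, 1, 1.0d0, r, 1 )
         if( maxval( abs( r ) ) <= tol ) return
!        Correction solved with the existing single precision factors.
         sr = real( r )
         call sgetrs( 'N', n, 1, sa, n, ipiv, sr, n, info )
         x  = x + dble( sr )
      end do
      end subroutine mixed_refine

Each refinement step costs only O(N^2) work (one matrix-vector product and one forward/backward substitution), so the O(N^3) factorization dominates the run time and is the part performed in the faster 32-bit arithmetic.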
Numerical Experiments

The numerical experiments were performed on the athena Cray XT4 supercomputer at NICS. Athena nodes consist of a quad-core 2.3 GHz AMD Opteron processor with 4 GB of memory. Using the Streaming SIMD Extensions (SSE), each core has a peak performance of 9.2 Gflops in 64-bit arithmetic and 18.4 Gflops in 32-bit arithmetic. Peak rate of 16K nodes: ... TFLOPS (double precision), ... TFLOPS (single precision).

Results

Table 1: Performance comparison, HPL vs. ScaLAPACK (p?ludriver.f). Columns: precision ('s' single real, 'd' double real, 'c' single complex, 'z' double complex), processor grid (P x Q), matrix size (N), GFLOPS, and percentage of peak for both solvers.

Table 2: Performance of HPL mixed precision, real matrix. Columns: processor grid (P x Q), matrix size (N), and times in seconds for REAL*8 ScaLAPACK LU, REAL*4 ScaLAPACK LU, REAL*4 HPL LU, and the mixed precision solve.

Table 3: Performance of HPL mixed precision, complex matrix. Columns: processor grid (P x Q), matrix size (N), and times in seconds for COMPLEX*16 ScaLAPACK LU, COMPLEX*8 ScaLAPACK LU, COMPLEX*8 HPL LU, and the mixed precision solve.

(The numerical entries of Tables 1-3 are not legible in this transcript.)

Summary

- The HPL-based dense LU solver is more efficient than standard ScaLAPACK and achieved about 75% of peak performance.
- HPL performs the parallel LU factorization in double precision but uses a hybrid left/right-looking panel method and look-ahead algorithms.
- The application interface is compatible with ScaLAPACK.
- The solver has been integrated into the AORSA fusion INCITE application.

Acknowledgements

This research is partially sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy. This research used resources of the National Institute for Computational Sciences (NICS), which is supported by the National Science Foundation (NSF). Summer internships for H. Che, T. Chan, D. Lee, and R. Wong were supported by the Department of Mathematics, The Chinese University of Hong Kong (CUHK). Internship opportunities were provided by the Joint Institute for Computational Sciences (JICS), the University of Tennessee, and Oak Ridge National Laboratory.

Contact information

Eduardo F. D'Azevedo, ORNL
Kwai Wong, NICS