Presentation transcript:

Optimal Algorithm Selection of Parallel Sparse Matrix-Vector Multiplication Is Important
Makoto Kudoh*1, Hisayasu Kuroda*1, Takahiro Katagiri*2, Yasumasa Kanada*1
*1 The University of Tokyo   *2 PRESTO, Japan Science and Technology Corporation

Introduction
Sparse matrix-vector multiplication (SpMxV): y = Ax, where A is a sparse matrix and x is a dense vector.
- A basic computational kernel in scientific computing, e.g. iterative solvers for linear systems and eigenvalue problems.
- Large-scale SpMxV problems motivate parallel sparse matrix-vector multiplication.

Calculation of Parallel Sparse Matrix-Vector Multiplication
(Figure: the matrix A, the source vector x, and the result vector y are distributed by row blocks over PE0-PE3; each local block of A is stored in compressed sparse row format with rowptr, colind, and value arrays.)
- Row block distribution.
- Compressed sparse row (CSR) storage format.
- Two-phase computation: vector data communication of x, followed by local computation on the owned rows.
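
For reference, a minimal sketch in C of the local computation phase on one PE's row block, using the CSR arrays named on the slide (the function name and argument list are illustrative, not the authors' code):

    /* y[i] = sum over k in [rowptr[i], rowptr[i+1]) of value[k] * x[colind[k]] */
    void spmv_csr(int nrows, const int *rowptr, const int *colind,
                  const double *value, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += value[k] * x[colind[k]];   /* indirect access to x */
            y[i] = sum;
        }
    }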

Optimization of Parallel SpMxV
- Performance is poor compared with dense matrix kernels: indirect addressing increases the memory references to the matrix data, and the access pattern to the vector x is irregular.
- Many optimization algorithms for SpMxV have been proposed, BUT their effect depends highly on the non-zero structure of the matrix and on the machine's architecture.
- Therefore, optimal algorithm selection is important.

Related Work
- Library approach (PSPARSLIB, PETSc, ILIB, etc.): uses a fixed optimization algorithm; works on parallel systems.
- Compiler approach (SPARSITY, sparse compilers, etc.): generates code optimized for the matrix and machine; does not work on parallel systems.

The Purpose of Our Work
We compare the performance of the best algorithm for each matrix and machine with the performance of a fixed algorithm used for all matrices and machines.
Our program:
- includes several algorithms for local computation and data communication,
- measures the performance of each algorithm exhaustively, and
- selects the best algorithm for the given matrix and machine.
Algorithm selection time is not a concern here.

Optimization Algorithms of Our Program
Algorithms implemented in our routine:
- Local computation: Register Blocking, Diagonal Blocking, Unrolling
- Data communication: Allgather Communication, Range-Limited Communication, Minimum Data Size Communication

Register Blocking (Local Computation 1/3)
(Figure: original matrix = blocked matrix + remaining matrix.)
- Extract small dense blocks and form a blocked matrix.
- Reduces the number of load instructions and increases temporal locality of accesses to the source vector.
- An m x n register blocking is abbreviated Rmxn; the candidates are R1x2, R1x3, R1x4, R2x1, R2x2, R2x3, R2x4, R3x1, R3x2, R3x3, R3x4, R4x1, R4x2, R4x3, R4x4.
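
To illustrate what an R2x2 kernel looks like, here is a minimal sketch of a 2x2 block-CSR multiply. The block arrays browptr, bcolind, and bvalue are assumed names, and the remaining matrix is assumed to be handled by a separate CSR sweep that also initializes y:

    /* Each block is 2x2, stored row-major in bvalue[4*k .. 4*k+3]. */
    void spmv_bcsr_2x2(int nblockrows, const int *browptr, const int *bcolind,
                       const double *bvalue, const double *x, double *y)
    {
        for (int ib = 0; ib < nblockrows; ib++) {
            double y0 = 0.0, y1 = 0.0;            /* kept in registers */
            for (int k = browptr[ib]; k < browptr[ib + 1]; k++) {
                const double *b = &bvalue[4 * k];
                double x0 = x[2 * bcolind[k]];    /* reused for both rows */
                double x1 = x[2 * bcolind[k] + 1];
                y0 += b[0] * x0 + b[1] * x1;
                y1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * ib]     += y0;   /* add to the remaining-matrix result */
            y[2 * ib + 1] += y1;
        }
    }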

Diagonal Blocking (Local Computation 2/3)
(Figure: original matrix = blocked diagonal band + remaining matrix.)
- For matrices whose non-zero structure is dense around the diagonal: block the diagonal part and treat it as a dense band matrix.
- Reduces the number of load instructions and optimizes register and cache access.
- A diagonal blocking of band width n is abbreviated Dn; the candidates are D3, D5, D7, D9, D11, D13, D15, D17, D19.
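
A minimal sketch of the dense-band part of such a kernel, where Dn corresponds to a half-width hw with n = 2*hw + 1 (e.g. D9 is hw = 4). The row-major band storage and the function name are assumptions; off-band entries are assumed to be handled by a separate sparse sweep that initializes y:

    /* band holds 2*hw+1 entries per row, zero-padded at the matrix edges. */
    void spmv_diag_band(int n, int hw, const double *band,
                        const double *x, double *y)
    {
        int width = 2 * hw + 1;
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            int jlo = (i - hw > 0) ? i - hw : 0;
            int jhi = (i + hw < n - 1) ? i + hw : n - 1;
            for (int j = jlo; j <= jhi; j++)
                sum += band[i * width + (j - i + hw)] * x[j];
            y[i] += sum;   /* add to the remaining-matrix result */
        }
    }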

Unrolling (Local Computation 3/3)
- Simply unroll the inner loop.
- Reduces loop overhead and exploits instruction-level parallelism.
- Unrolling level n is abbreviated Un; the candidates are U1, U2, U3, U4, U5, U6, U7, U8.
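
For example, a U4 variant of the plain CSR kernel shown earlier might look like the following sketch, with a cleanup loop for the leftover elements of each row:

    void spmv_csr_u4(int nrows, const int *rowptr, const int *colind,
                     const double *value, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            int k = rowptr[i], kend = rowptr[i + 1];
            for (; k + 3 < kend; k += 4)          /* unrolled by four */
                sum += value[k]     * x[colind[k]]
                     + value[k + 1] * x[colind[k + 1]]
                     + value[k + 2] * x[colind[k + 2]]
                     + value[k + 3] * x[colind[k + 3]];
            for (; k < kend; k++)                 /* remainder */
                sum += value[k] * x[colind[k]];
            y[i] = sum;
        }
    }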

Allgather Communication (Data Communication 1/3)
- Each processor sends all of its vector data to all other processors.
- Easy to implement (with MPI_Allgather).
- The communication data size is very large.
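
A minimal sketch of this scheme in C with MPI; the slide names MPI_Allgather, and MPI_Allgatherv is used here only so that row blocks of unequal length are allowed (all variable names are illustrative):

    #include <mpi.h>

    /* counts[p] and displs[p] give the block length and starting index of
     * PE p's rows; after the call, x_full holds the whole source vector. */
    void gather_full_vector(double *x_local, int local_n,
                            int *counts, int *displs,
                            double *x_full, MPI_Comm comm)
    {
        MPI_Allgatherv(x_local, local_n, MPI_DOUBLE,
                       x_full, counts, displs, MPI_DOUBLE, comm);
    }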

Range-Limited Communication (Data Communication 2/3)
- Send only the minimum contiguous block of the vector that the receiver requires; processors that need nothing from each other do not communicate.
- Small CPU overhead, since no data rearrangement is necessary.
- The communication data size is not minimal for most matrices.
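
A sketch of how such an exchange could be posted with non-blocking MPI calls, in the Irecv-Isend ordering listed on a later slide. The per-peer ranges are assumed to be precomputed from the column indices of the local matrix; all names are illustrative:

    #include <mpi.h>
    #include <stdlib.h>

    /* For each peer p, send one contiguous range of my block of x starting
     * at send_lo[p], and receive one contiguous range of p's block into
     * x_full at offset recv_off[p]. */
    void exchange_ranges(double *x_local, double *x_full, int npeers,
                         const int *peer, const int *send_lo, const int *send_len,
                         const int *recv_off, const int *recv_len, MPI_Comm comm)
    {
        MPI_Request *req = malloc(2 * npeers * sizeof(MPI_Request));
        MPI_Status  *st  = malloc(2 * npeers * sizeof(MPI_Status));
        int nreq = 0;

        for (int i = 0; i < npeers; i++)   /* post all receives first */
            MPI_Irecv(x_full + recv_off[i], recv_len[i], MPI_DOUBLE,
                      peer[i], 0, comm, &req[nreq++]);
        for (int i = 0; i < npeers; i++)
            MPI_Isend(x_local + send_lo[i], send_len[i], MPI_DOUBLE,
                      peer[i], 0, comm, &req[nreq++]);

        MPI_Waitall(nreq, req, st);
        free(req);
        free(st);
    }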

Minimum Data Size Communication (Data Communication 3/3)
- Communicate only the elements that are actually required.
- Needs 'pack' and 'unpack' operations before and after the communication.
- The communication data size is minimal, but the pack and unpack operations add a small CPU overhead.
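
A sketch of the pack and unpack steps; the index lists send_idx and recv_idx are assumed to be precomputed from the local matrix, and the buffers themselves would be exchanged with the same non-blocking pattern as in the previous sketch:

    /* Copy only the referenced elements of my block into the send buffer. */
    void pack_send_buffer(const double *x_local, const int *send_idx,
                          int nsend, double *sendbuf)
    {
        for (int i = 0; i < nsend; i++)
            sendbuf[i] = x_local[send_idx[i]];
    }

    /* Scatter the received elements into their slots of the local copy of x. */
    void unpack_recv_buffer(const double *recvbuf, const int *recv_idx,
                            int nrecv, double *x_full)
    {
        for (int i = 0; i < nrecv; i++)
            x_full[recv_idx[i]] = recvbuf[i];
    }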

Implementation of Communication
- The MPI library is used.
- Three implementations of 1-to-1 communication: Send-Recv, Isend-Irecv, and Irecv-Isend.
- This yields three implementations each of range-limited and minimum data size communication, giving seven communication variants in total: Allgather; SendRecv-range, IsendIrecv-range, IrecvIsend-range; SendRecv-min, IsendIrecv-min, IrecvIsend-min.

Methodology of Selecting the Optimal Algorithm
- Selection is done at runtime, since the characteristics of the matrix cannot be detected until runtime.
- The times of local computation and data communication are measured independently; note that when the separately chosen parts are combined, the total time is not necessarily the fastest possible.
- Procedure: measure the time of each data communication algorithm and select the best one; then combine each local computation algorithm with that best data communication algorithm, measure the time, and select the best combination.
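
One way such exhaustive measurement could be organized is sketched below; run_kernel stands for one candidate variant, and the maximum over PEs is taken so that all PEs agree on the winner. This is an assumption about how the selection might be coded, not the authors' implementation:

    #include <mpi.h>

    /* Returns the per-call time of one candidate, averaged over reps
     * repetitions and maximized over all PEs. */
    double time_candidate(void (*run_kernel)(void), int reps, MPI_Comm comm)
    {
        double local, global;

        MPI_Barrier(comm);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++)
            run_kernel();
        local = (MPI_Wtime() - t0) / reps;

        /* the slowest PE determines the parallel time of this candidate */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, comm);
        return global;
    }

The candidate with the smallest returned time would then be selected, first among the communication algorithms and then among the combinations with the local computation algorithms.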

Numerical Experiment
- Default fixed algorithms
- Experimental environment and test matrices
- Results

Default Fixed Algorithms
Local computation: U1 and R2x2. Data communication: Allgather and IrecvIsend-min. The four defaults are the combinations of these:

No.  Local computation  Data communication
1    U1                 Allgather
2    R2x2               Allgather
3    U1                 IrecvIsend-min
4    R2x2               IrecvIsend-min

Experimental Environment
Language: C. Communication library: MPI (MPICH 1.2.1).

Name                     Processor              # of PEs  Network             Compiler                Version  Option
PC-Cluster               PentiumIII 800 MHz     8         100Base-T Ethernet  GCC                     2.95.2   -O3
SUN Enterprise 3500      UltraSPARC II 336 MHz            SMP                 WorkShop Compilers      5.0      -xO5
COMPAQ AlphaServer GS80  Alpha 21264 731 MHz              SMP                 Compaq C                6.3-027  -fast
SGI2100                  MIPS R12000 350 MHz              DSM                 MIPSpro C               7.30     -64 -O3
HITACHI HA8000-ex880     Intel Itanium 800 MHz            SMP                 Intel Itanium Compiler  5.0.1

Test Matrices
From Tim Davis' sparse matrix collection. (Figure: non-zero patterns of ct20stif, cfd1, and gearbox.)

No.  Name      Explanation                                Dimension  Non-zeros
1    3dtube    3-D pressure tube                          45,330     3,213,618
2    cfd1      Symmetric pressure matrix                  70,656     1,828,364
3    crystk03  FEM crystal vibration                      24,696     1,751,178
4    venkat01  Unstructured 2D Euler solver               62,424     1,717,792
5    bcsstk35  Automobile seat frame and body attachment  30,237     1,450,163
6    cfd2                                                 123,440    3,087,898
7    ct20stif  Stiffness matrix                           52,329     2,698,463
8    nasasrb   Shuttle rocket booster                     54,870     2,677,324
9    raefsky3  Fluid structure interaction turbulence     21,200     1,488,768
10   pwtk      Pressurized wind tunnel                    217,918    11,634,424
11   gearbox   Aircraft flap actuator                     153,746    9,080,404

Result of Matrix No. 2
(Charts: local computation time and data communication time in msec on PentiumIII-Ethernet, Alpha-SMP, MIPS-DSM, and Itanium-SMP. The selected local algorithm changes with the machine and the number of PEs, ranging over U1-U6, R1x3, R2x1, R2x2, R2x4, R3x1, D7, and D9. The selected communication algorithms are IrecvIsend-min and IrecvIsend-range on PentiumIII-Ethernet and Alpha-SMP, and IrecvIsend-range and IsendIrecv-range on MIPS-DSM and Itanium-SMP.)

Result of Matrix No. 7
(Charts: local computation time and data communication time in msec on PentiumIII-Ethernet, Alpha-SMP, MIPS-DSM, and Itanium-SMP. The selected local algorithms include U1, R2x3, R3x1, R3x2, R3x3, R4x1, R4x2, D7, D9, D11, and D15. The selected communication algorithms are SendRecv-min and IsendIrecv-min on PentiumIII-Ethernet and Alpha-SMP, and IrecvIsend-min and SendRecv-min on MIPS-DSM and Itanium-SMP.)

Result of Matrix No. 11
(Charts: local computation time and data communication time in msec on PentiumIII-Ethernet, Alpha-SMP, MIPS-DSM, and Itanium-SMP. The selected local algorithms are mostly R3x3, with R2x3, R4x3, D5, D7, D9, and D15 also appearing. The selected communication algorithms are IsendIrecv-min and SendRecv-min on PentiumIII-Ethernet and Alpha-SMP, and SendRecv-min on MIPS-DSM and Itanium-SMP.)

Summary of Experiment
Speed-up of the selected best algorithm over the four default fixed algorithms:

Machine              def 1  def 2  def 3  def 4
PC-Cluster           8.16   7.90   1.32   1.05
Sun Enterprise 3500  2.82   3.07   1.35   1.58
COMPAQ               3.56   3.10   1.59   1.44
SGI                  3.73   3.33   1.61   1.36
Hitachi              2.51   1.81   2.03   1.39

- The best algorithm depends highly on the characteristics of the matrix and the machine.
- A speed-up of at least 1.05 was obtained compared with the fixed default algorithms.

Conclusion and Future Work
- We compared the performance of the best algorithm with that of typical fixed algorithms and obtained a meaningful speed-up by selecting the best algorithm.
- Selecting the optimal algorithm according to the characteristics of the matrix and the machine is important.
- Future work: create a selection method with light overhead; currently, selecting the algorithm takes hundreds of times the cost of a single SpMxV.