Exploiting BLIS to Optimize LU with Pivoting
Xianyi Zhang

LU factorization
- A = LU
- Partial pivoting: PA = LU, reorders the rows of A
  - Numerically stable in practice

Why is LU important?
- Solving linear systems: Ax = b
  - PA = LU, so LUx = Pb
  - Solve Ly = Pb for y
  - Solve Ux = y for x
- How does LAPACK implement it?
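The solve steps above can be sketched in a few lines of pure Python. This is an illustrative model, not LAPACK code: factor with partial pivoting, then forward- and back-substitute.

```python
# Sketch: solve Ax = b via PA = LU with partial pivoting.
def lu_solve(A, b):
    n = len(A)
    A = [row[:] for row in A]        # work on a copy; overwritten by L\U
    piv = list(range(n))
    for k in range(n):
        # partial pivoting: bring the largest |A[i][k]| to the diagonal
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]       # multiplier (entry of L)
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    # Solve Ly = Pb (forward substitution, unit-diagonal L)
    y = [b[piv[i]] for i in range(n)]
    for i in range(n):
        for j in range(i):
            y[i] -= A[i][j] * y[j]
    # Solve Ux = y (back substitution)
    x = y[:]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= A[i][j] * x[j]
        x[i] /= A[i][i]
    return x
```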

LAPACK implementation
- getrf: blocked algorithm, BLAS-3 (gemm, trsm)
- getf2: unblocked, BLAS-2 and BLAS-1

LAPACK implementation: right-looking blocked algorithm (getrf)

Naive implementation
- Compile BLIS with the BLAS compatibility layer
- Compile LAPACK
- Link LAPACK with BLIS
- Done!
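The three steps above might look like the following build-configuration sketch. The directory layout, the `auto` configuration target, and the final link line are illustrative assumptions, not the talk's exact commands.

```shell
# 1. Compile BLIS with its BLAS compatibility layer enabled.
cd blis
./configure --enable-blas auto   # 'auto' lets BLIS pick a sub-configuration
make && make install

# 2. Compile reference (netlib) LAPACK.
cd ../lapack
cp make.inc.example make.inc     # adjust compilers/flags as needed
make lapacklib

# 3. Link an application against LAPACK and BLIS instead of another BLAS.
gfortran app.f -o app liblapack.a -L/usr/local/lib -lblis -lpthread -lm
```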

Thank you for your attention

How about performance?

Naïve implementation performance
- dgetrf, double-precision real, single thread
- Testbed
  - Intel Xeon E (2.7 GHz, Turbo 3.5 GHz)
  - Ubuntu, gcc and gfortran
- Input: square matrices, step 100
- Tuned NB block size: NB = 128

Naïve implementation performance

Naïve implementation performance: compared with LAPACK + OpenBLAS

Why?
- Profiling one step in LU:

             BLIS (seconds)   OpenBLAS (seconds)
  DTRSM
  DGEMM
  DGETF2

GETF2 panel factorization
- Right-looking unblocked
  - trsm -> scal
  - rank-b update -> rank-1 ger
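The right-looking unblocked panel factorization can be sketched as follows (an assumed pure-Python model, not the BLIS kernels): each step is a pivot search, a scal on the pivot column, and a rank-1 ger update of the trailing part — the ger being the kernel that was optimized.

```python
# Sketch: right-looking unblocked getf2 on an m x n panel A,
# overwriting A with L\U and returning the pivot rows.
def getf2(A, m, n):
    piv = []
    for k in range(min(m, n)):
        # pivot search down column k (idamax)
        p = max(range(k, m), key=lambda i: abs(A[i][k]))
        piv.append(p)
        A[k], A[p] = A[p], A[k]          # row swap
        inv = 1.0 / A[k][k]
        for i in range(k + 1, m):        # scal: scale the pivot column
            A[i][k] *= inv
        for i in range(k + 1, m):        # ger: rank-1 update of the trailing part
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return piv
```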

Performance: BLIS + optimized GER (NB = 188)

What's the best NB?
- Input: N x N matrix
- BLAS-3 operand sizes:

           m      n      k
  TRSM     NB     N-NB   NB
  GEMM     N-NB   N-NB   NB

The gemm algorithm [figure: C += A·B]

The gemm algorithm [figure: partition into column panels of width NC, e.g. 4096 on Sandy Bridge]

The gemm algorithm [figure: C += A·B]

The gemm algorithm [figure: partition the k dimension into panels of depth KC, e.g. 256 on Sandy Bridge]

The gemm algorithm

              m      n          k
  GEPP        N-NB   NC (4096)  KC (256)
  GEMM in LU  N-NB   N-NB       NB (188)

- Choose NB = n*KC (n = 1, 2, …) for better performance
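The partitioning sketched in the last few slides corresponds to the outer loops of the blocked gemm algorithm. The following is a minimal pure-Python sketch of that loop structure; real BLIS additionally blocks by MC, packs the panels, and dispatches to an assembly micro-kernel. The tiny default block sizes here are illustrative only.

```python
# Sketch: C += A @ B with the NC (column-panel) and KC (k-panel)
# cache blocking from the slides. Inner loops stand in for the
# packed macro-kernel.
def gemm_blocked(C, A, B, m, n, k, NC=4, KC=3):
    for jc in range(0, n, NC):            # loop over NC-wide column panels of B/C
        for pc in range(0, k, KC):        # loop over KC-deep panels (GEPP updates)
            for i in range(m):
                for j in range(jc, min(jc + NC, n)):
                    s = C[i][j]
                    for p in range(pc, min(pc + KC, k)):
                        s += A[i][p] * B[p][j]
                    C[i][j] = s
```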

Limits of panel factorization
- getf2(N, NB, …)
- BLAS-2/1
- Very slow compared to the blocked implementation
- Keep NB as small as possible

Recursive panel factorization: factor the N x NB panel with a blocked algorithm

Recursive panel factorization
- Factor the N x NB panel with a blocked algorithm
- Block size: NB/2

Recursive panel factorization: recursively factor the N x (NB/2) panel

Recursive panel factorization
- Recursively factor the N x (NB/2) panel
- Block size: NB/4

Recursive panel factorization
- Recurse until the block size is small
- Then call the unblocked algorithm (getf2)
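The recursion above can be sketched in a simple pure-Python model (an assumption for illustration, not the talk's actual code): factor the left half of the panel recursively, then apply a small trsm and a gemm-style update to the right half, and recurse on it. Swapping whole rows inside the base case applies the pivots to both halves, mimicking how dgetrf applies pivots across the full panel.

```python
# Unblocked base case: factor the m x nb sub-panel at (row0, col0) in place.
def getf2_unblocked(A, row0, col0, m, nb, piv):
    for k in range(nb):
        r, c = row0 + k, col0 + k
        p = max(range(r, row0 + m), key=lambda i: abs(A[i][c]))
        piv.append(p)
        A[r], A[p] = A[p], A[r]          # whole-row swap: pivots reach all columns
        inv = 1.0 / A[r][c]
        for i in range(r + 1, row0 + m):
            A[i][c] *= inv
            for j in range(c + 1, col0 + nb):
                A[i][j] -= A[i][c] * A[r][j]

# Recursive panel factorization: split the nb columns in half.
def rgetf2(A, row0, col0, m, nb, piv, cutoff=2):
    if nb <= cutoff:                     # small enough: call unblocked getf2
        getf2_unblocked(A, row0, col0, m, nb, piv)
        return
    h = nb // 2
    rgetf2(A, row0, col0, m, h, piv, cutoff)       # factor left half
    # trsm: A12 <- L11^{-1} A12 (L11 unit lower triangular)
    for j in range(col0 + h, col0 + nb):
        for i in range(1, h):
            for p in range(i):
                A[row0 + i][j] -= A[row0 + i][col0 + p] * A[row0 + p][j]
    # gemm: A22 -= L21 * U12
    for i in range(h, m):
        for j in range(col0 + h, col0 + nb):
            s = 0.0
            for p in range(h):
                s += A[row0 + i][col0 + p] * A[row0 + p][j]
            A[row0 + i][j] -= s
    rgetf2(A, row0 + h, col0 + h, m - h, nb - h, piv, cutoff)  # right half
```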

Performance: BLIS + optimized GER + recursive panel factorization

Revisit the getf2 implementation: left-looking, fewer memory accesses
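A minimal sketch of the left-looking variant (again an assumed pure-Python model): each column is updated once by all previously factored columns, then pivoted and scaled. The trailing matrix is never written on every step, which is why it touches memory less than the right-looking version.

```python
# Sketch: left-looking unblocked LU on an m x n panel (m >= n assumed).
def getf2_left(A, m, n):
    piv = []
    for j in range(min(m, n)):
        # update column j with every previously factored column (trsv/gemv pass)
        for i in range(m):
            for p in range(min(i, j)):
                A[i][j] -= A[i][p] * A[p][j]
        # then pivot and scale, as in the right-looking code
        p = max(range(j, m), key=lambda i: abs(A[i][j]))
        piv.append(p)
        A[j], A[p] = A[p], A[j]
        inv = 1.0 / A[j][j]
        for i in range(j + 1, m):
            A[i][j] *= inv
    return piv
```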

Performance: BLIS + recursive panel + left-looking getf2

Pivoting and trsm

The gemm algorithm [figure: C += A·B]

The gemm algorithm [figure: pack a row panel of B]

The gemm algorithm [figure: pack a row panel of B in micro-panels of width NR]

Merge the pivot row swaps with trsm packing

Merge trsm and gemm: reuse the packed panel of B

Performance: BLIS + recursive + getf2 + merge

Performance: compared with Intel MKL

Performance Gap Large matrix  ~10%  BLAS3 Trsm Gemm 38

Performance Gap Large matrix  ~10%  BLAS3 Trsm Gemm How about using OpenBLAS trsm and gemm 39

Performance gap

Summary
- Optimizations
  - ger
  - Recursive panel factorization
  - Left-looking panel factorization
  - Merged pivoting, trsm, and gemm
- Performance
  - vs. naïve implementation: x
  - vs. Intel MKL: 1.09x on small matrices; ~10% gap on large matrices

Further work
- Investigate trsm and gemm
- Recursive panel
- Parallelize
  - Paralyze

Thank you for your attention again