ScicomP 10, Aug 9-13, 2004: Parallel Out-of-Core LU and QR Factorization (Brian Gunter, Center for Space Research, The University of Texas at Austin)

1 ScicomP 10, Aug 9-13, 2004
Parallel Out-of-Core LU and QR Factorization
Brian Gunter, Center for Space Research, The University of Texas at Austin, Austin, TX, gunter@csr.utexas.edu
Enrique Quintana-Ortí, Depto. de Ingenieria y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain, quintana@icc.uji.es
Robert van de Geijn, Department of Computer Sciences, The University of Texas at Austin, Austin, TX, rvdg@cs.utexas.edu
Thierry Joffrain, Department of Computer Sciences, The University of Texas at Austin, Austin, TX, joffrain@cs.utexas.edu

2 Motivation
Traditional methods use a slab approach, where entire columns of the out-of-core matrix are brought into memory.
[Figure: m × n matrix with an in-core slab of columns]

5 Motivation
While the slab approach is effective for many applications, it is inherently unscalable.
As m grows much larger than n (m >> n), fewer columns can fit into memory.
[Figure: tall matrix with m >> n]
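
The scaling argument can be illustrated with a little arithmetic. The sketch below is not from the slides: the 1 GB memory budget, the function names, and the use of 8-byte doubles are all assumptions chosen for illustration. It compares how many full-height columns fit in a slab with the side length of a square tile that fits in the same memory.

```python
import math

BYTES_PER_DOUBLE = 8  # assumption: double-precision entries


def slab_columns(memory_bytes: int, m: int) -> int:
    """Number of full-height columns of an m-row matrix that fit in memory."""
    return memory_bytes // (m * BYTES_PER_DOUBLE)


def tile_side(memory_bytes: int, tiles_in_core: int = 1) -> int:
    """Largest t such that `tiles_in_core` square t-by-t tiles fit in memory."""
    return math.isqrt(memory_bytes // (tiles_in_core * BYTES_PER_DOUBLE))


memory = 10**9  # hypothetical 1 GB budget
for m in (10**5, 10**6, 10**7, 10**8):
    # The slab width shrinks as m grows; the tile side does not depend on m.
    print(m, slab_columns(memory, m), tile_side(memory))
```

With this budget the slab narrows from 1250 columns at m = 10^5 to a single column at m = 10^8, while the tile side stays fixed.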

6 Out-of-Core QR Factorization
Given the m×n matrix A, we wish to compute the factorization $A = QR$, where
Q is an orthogonal matrix
R is upper triangular
Compact WY representation: $Q = I + Y T Y^T$, where
Y is an m×r collection of Householder vectors, normalized to be unit lower triangular (trapezoidal)
T is r×r upper triangular
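
A minimal in-core sketch of the compact WY representation, assuming NumPy and an unblocked Householder QR; it is an illustration, not the PLAPACK implementation. The function names are hypothetical. Because the slide writes $Q = I + Y T Y^T$ with a plus sign, the diagonal of T here holds the negated Householder scalars.

```python
import numpy as np


def householder(x):
    """Return (v, tau) with v[0] = 1 so that (I - tau v v^T) x is a multiple of e1."""
    v = np.zeros_like(x, dtype=float)
    v[0] = 1.0
    alpha, normx = x[0], np.linalg.norm(x)
    if normx == 0.0:
        return v, 0.0
    beta = -np.copysign(normx, alpha)
    v[1:] = x[1:] / (alpha - beta)
    return v, (beta - alpha) / beta


def qr_compact_wy(A):
    """Factor A (m >= n) and return Y, T, R with A = (I + Y T Y^T)[:, :n] @ R."""
    A = np.array(A, dtype=float)  # work on a copy
    m, n = A.shape
    Y, T = np.zeros((m, n)), np.zeros((n, n))
    for j in range(n):
        v, tau = householder(A[j:, j])
        A[j:, j:] -= tau * np.outer(v, v @ A[j:, j:])  # apply the reflector
        Y[j:, j] = v
        # Grow T by one column so that H_1 ... H_j = I + Y T Y^T still holds.
        T[:j, j] = -tau * (T[:j, :j] @ (Y[:, :j].T @ Y[:, j]))
        T[j, j] = -tau
    return Y, T, np.triu(A[:n])


A = np.random.default_rng(0).standard_normal((7, 4))
Y, T, R = qr_compact_wy(A)
Q = np.eye(7) + Y @ T @ Y.T
print(np.allclose(Q.T @ Q, np.eye(7)), np.allclose(Q[:, :4] @ R, A))
```

Storing only Y (in place of the zeroed part of A) and the small matrix T is what lets Q be applied later with matrix-matrix products.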

7 QR Factorization: Out-of-Core Implementation
Step 1: Begin with an unfactored matrix which resides on disk.
[Figure: matrix; legend: stored on disk, in memory]

8 QR Factorization: Out-of-Core Implementation
Step 2: Divide the matrix into a mesh of t × t tiles, where each tile is stored as a separate file.
[Figure: tiled matrix with tile dimension t; legend: stored on disk, in memory]
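
The tile-per-file layout in Step 2 can be sketched as follows. This assumes NumPy's `.npy` format and a hypothetical `tile_<i>_<j>.npy` naming scheme; the slides do not say how POOCLAPACK names or encodes its files.

```python
import os
import tempfile

import numpy as np


def write_tiles(A, t, directory):
    """Save A as t-by-t tiles (edge tiles may be smaller), one file per tile."""
    m, n = A.shape
    names = {}
    for i in range(0, m, t):
        for j in range(0, n, t):
            path = os.path.join(directory, f"tile_{i // t}_{j // t}.npy")
            np.save(path, A[i:i + t, j:j + t])
            names[(i // t, j // t)] = path
    return names


def read_tile(names, i, j):
    """Bring a single tile back into memory."""
    return np.load(names[(i, j)])


A = np.arange(35.0).reshape(7, 5)
with tempfile.TemporaryDirectory() as d:
    names = write_tiles(A, 3, d)
    print(sorted(names), read_tile(names, 1, 1).shape)
```

Only the tiles a step needs are ever loaded, so the in-core footprint is set by t rather than by the matrix dimensions.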

9 QR Factorization: Out-of-Core Implementation
Step 3: Read in the first tiles and factor them, saving the T matrices and overwriting the lower tile with Y.
[Figure: tiled matrix; legend: stored on disk, in memory; labels T_i, Y_i]

13 QR Factorization: Out-of-Core Implementation
Step 4: Read in the remaining tiles in the row and apply $Q_i = I + Y_i T_i Y_i^T$, reading $Y_i$ in one panel at a time.
[Figure: tiled matrix; legend: stored on disk, in memory; labels T_i, Y_i]
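
One way to read Step 4 is sketched below, assuming NumPy. Two assumptions are mine, not the slides': the update applied to a tile B is $Q_i^T B$ (since $R = Q^T A$), and a "panel" is a block of columns of Y. The function name is hypothetical. Only one panel of Y is touched at a time, and T is used in a single small product.

```python
import numpy as np


def apply_qt_by_panels(Y, T, B, panel):
    """Return Q^T B for Q = I + Y T Y^T, reading Y one column panel at a time."""
    r = Y.shape[1]
    W = np.empty((r, B.shape[1]))
    for k in range(0, r, panel):      # pass 1: W = Y^T B, panel by panel
        W[k:k + panel] = Y[:, k:k + panel].T @ B
    W = T.T @ W                       # small product: W = T^T Y^T B
    out = B.copy()
    for k in range(0, r, panel):      # pass 2: out = B + Y W, panel by panel
        out += Y[:, k:k + panel] @ W[k:k + panel]
    return out


rng = np.random.default_rng(1)
Y = np.tril(rng.standard_normal((6, 4)))
T = np.triu(rng.standard_normal((4, 4)))
B = rng.standard_normal((6, 3))
Q = np.eye(6) + Y @ T @ Y.T
print(np.allclose(apply_qt_by_panels(Y, T, B, panel=2), Q.T @ B))
```

In an out-of-core setting each `Y[:, k:k + panel]` slice would be a read from disk, which is why the sketch makes two passes over Y instead of forming Q explicitly.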

21 QR Factorization: Out-of-Core Implementation
Step 5: Factor the next tile in the first column using the QR update algorithm.
[Figure: tiled matrix; legend: stored on disk, in memory; labels T_i, Y_i]
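
The idea behind a QR update can be sketched with NumPy: the triangular factor from the tiles processed so far is stacked on top of the next tile, and the stack is factored again. This is a simplification under my own assumptions; it calls `np.linalg.qr` on the whole stack, ignores the triangular structure of R, and does not save Y or T, all of which the actual algorithm would handle. The name `qr_update` is hypothetical.

```python
import numpy as np


def qr_update(R, B):
    """Return the triangular factor of the stacked matrix [R; B]."""
    return np.linalg.qr(np.vstack([R, B]), mode="r")


rng = np.random.default_rng(2)
tiles = [rng.standard_normal((4, 4)) for _ in range(3)]  # one column of tiles

R = np.linalg.qr(tiles[0], mode="r")  # factor the first tile
for B in tiles[1:]:                   # fold in one tile at a time
    R = qr_update(R, B)

A = np.vstack(tiles)
# R matches the triangular factor of the whole column up to row signs,
# so the Gram matrices agree.
print(np.allclose(R.T @ R, A.T @ A))
```

Only the current R and one tile are in memory at any time, regardless of how many tiles the column contains.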

25 QR Factorization: Out-of-Core Implementation
Step 6: Apply the transformations to the remaining tiles in the row.
[Figure: tiled matrix; legend: stored on disk, in memory; labels T_i, Y_i]

30 QR Factorization: Out-of-Core Implementation
Step 7: Repeat Steps 5 and 6 for any remaining rows of tiles.
[Figure: tiled matrix; legend: stored on disk, in memory; labels T_i, Y_i]

33 QR Factorization: Out-of-Core Implementation
Step 8: Repeat Steps 1-7 on the lower quadrant. Continue until the entire matrix has been factored.
[Figure: tiled matrix; legend: stored on disk, in memory]
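
Steps 3 through 8 can be put together in a short in-memory simulation, assuming NumPy. This is a sketch of the loop structure only: the tiles are slices of one array instead of files, explicit Q matrices from `np.linalg.qr` stand in for the stored Y and T, and the function name `tiled_qr_r` is hypothetical. It assumes m >= n and that t divides both dimensions.

```python
import numpy as np


def tiled_qr_r(A, t):
    """Return the n-by-n triangular factor of A, computed tile by tile."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    if m < n or m % t or n % t:
        raise ValueError("sketch assumes m >= n and t dividing m and n")
    rows, cols = m // t, n // t

    def tile(i, j):
        return slice(i * t, (i + 1) * t), slice(j * t, (j + 1) * t)

    for k in range(cols):                        # Step 8: move down the diagonal
        Q, R = np.linalg.qr(A[tile(k, k)], mode="complete")   # Step 3
        A[tile(k, k)] = R
        for j in range(k + 1, cols):             # Step 4: rest of tile row k
            A[tile(k, j)] = Q.T @ A[tile(k, j)]
        for i in range(k + 1, rows):             # Steps 5-7: tiles below
            stacked = np.vstack([A[tile(k, k)], A[tile(i, k)]])
            Q, R = np.linalg.qr(stacked, mode="complete")
            A[tile(k, k)], A[tile(i, k)] = R[:t], 0.0
            for j in range(k + 1, cols):         # Step 6: update both tile rows
                pair = Q.T @ np.vstack([A[tile(k, j)], A[tile(i, j)]])
                A[tile(k, j)], A[tile(i, j)] = pair[:t], pair[t:]
    return np.triu(A[:n])


A = np.random.default_rng(3).standard_normal((12, 8))
R = tiled_qr_r(A, 4)
print(np.allclose(R.T @ R, A.T @ A))
```

At no point does the loop touch more than two tile rows at once, which is the property that keeps the memory requirement independent of m.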

34 Out-of-Core LU Factorization
Given the m×n matrix A, we wish to compute the factorization $PA = LU$, where
P is a permutation matrix
L is lower trapezoidal
U is n×n upper triangular
The implementation is analogous to the out-of-core QR factorization.
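
For reference, an in-core version of $PA = LU$ with partial pivoting for an m×n matrix (m >= n) might look like the following NumPy sketch. It is an unblocked textbook formulation with a hypothetical function name, not the PLAPACK routine; P is returned as a row index vector `perm` so that `A[perm]` equals $PA$.

```python
import numpy as np


def lu_partial_pivot(A):
    """Return perm, L, U with A[perm] = L @ U; L is m-by-n, U is n-by-n."""
    A = np.array(A, dtype=float)  # work on a copy
    m, n = A.shape
    perm = np.arange(m)
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))  # row of the largest pivot candidate
        if A[p, k] == 0.0:
            raise ValueError("matrix is rank deficient")
        A[[k, p]] = A[[p, k]]                # swap rows k and p
        perm[[k, p]] = perm[[p, k]]
        A[k + 1:, k] /= A[k, k]              # multipliers, stored in place of zeros
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    L = np.tril(A, -1) + np.eye(m, n)        # unit lower trapezoidal
    U = np.triu(A[:n])
    return perm, L, U


A = np.random.default_rng(4).standard_normal((6, 4))
perm, L, U = lu_partial_pivot(A)
print(np.allclose(A[perm], L @ U))
```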

35 LU Factorization: Out-of-Core Implementation
Step 1: Factor the first tile, saving the permutation matrix.
[Figure: tiled matrix; legend: stored on disk, in memory; labels P_i, L_i, U_i]

36 LU Factorization: Out-of-Core Implementation
Step 2: Update the remaining tiles in the row using panels of L and the saved permutation matrices.
[Figure: tiled matrix; legend: stored on disk, in memory; labels P_i, L_i, U_i]
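
The row update in Step 2 amounts to permuting a tile B and then solving with the unit lower triangular L. The NumPy sketch below is an assumption about what that update looks like: it uses a hypothetical function name, represents the saved permutation as an index vector, and consumes L one column at a time (a panel of width 1), where a real implementation would use wider panels and BLAS-3 calls.

```python
import numpy as np


def update_row_tile(L, perm, B):
    """Solve L X = B[perm] for X, where L is unit lower triangular."""
    X = np.array(B[perm], dtype=float)  # apply the saved row permutation
    for k in range(L.shape[0]):
        # Row k of X is final; remove its contribution from the rows below.
        # The unit diagonal of L means no division is needed.
        X[k + 1:] -= np.outer(L[k + 1:, k], X[k])
    return X


rng = np.random.default_rng(5)
L = np.tril(rng.standard_normal((4, 4)), -1) + np.eye(4)
perm = np.array([2, 0, 3, 1])
B = rng.standard_normal((4, 3))
X = update_row_tile(L, perm, B)
print(np.allclose(L @ X, B[perm]))
```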

37 LU Factorization: Out-of-Core Implementation
Step 3: Factor the next tile in the first column using the LU update algorithm.
[Figure: tiled matrix; legend: stored on disk, in memory; labels P_i, L_i, U_i]

38 LU Factorization: Out-of-Core Implementation
Step 4: Update the remaining tiles in the row using panels of L and the stored permutation matrices.
[Figure: tiled matrix; legend: stored on disk, in memory; labels P_i, L_i, U_i]

39 Development Environment
Parallel Linear Algebra Package (PLAPACK)
  Optimized parallel routines (FORTRAN and C interfaces)
  View-based infrastructure
  Uses standard MPI and BLAS libraries
Parallel Out-of-Core Linear Algebra Package (POOCLAPACK)
  Out-of-core extension to PLAPACK
  Handles the complexity of the I/O operations (i.e., hidden from the user)
  Uses standard read/write functions for portability

40 Performance of Parallel OOC QR
IBM P690: 32 GB of memory, theoretical peak (T.P.) of 5.2 Gflops, DGEMM at 3.723 Gflops
[Figure: performance plot]

41 Performance for Sequential OOC LU
[Figure: performance plot]

42 Earth Science Application
Gravity Recovery And Climate Experiment (GRACE)
A collaborative effort between:
  The University of Texas Center for Space Research (CSR)
  The Jet Propulsion Laboratory (JPL)
  GeoForschungsZentrum (GFZ)
  Deutsches Zentrum für Luft- und Raumfahrt (DLR)
  National Aeronautics and Space Administration (NASA)

43 Earth Science Application
Goal was to compute a rigorous 360x360 gravity model
  No approximation techniques
  Translates to roughly 100 km² resolution
  Involves the least squares estimation of ~130,000 parameters
  Requires the combination of hundreds of millions of observations
    surface gravity data (land) – ½ TB
    altimetry-based mean sea surface data (ocean)
    GRACE data (satellite)
Using the new parallel OOC QR algorithm
  A 360x360 field was generated, complete with full covariance
  Largest rigorous gravity field model ever created
  Used a single IBM P690 node
  OOC QR required only 32 GB
  To do in-core would require 165 GB of memory
  Required ~6 days of wall clock time to compute (2326 CPU hours)
  A single-processor machine with sufficient memory would require 3.2 months
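
Some of the quoted figures can be cross-checked with simple arithmetic. The sketch below rests on assumptions the slide does not state: that "360x360" means a spherical harmonic expansion to degree and order 360, which has (N + 1)^2 coefficients, that "~6 days" is exactly 6 days, and that a month is 30 days.

```python
degree = 360                      # assumed meaning of "360x360"
coefficients = (degree + 1) ** 2  # full expansion to degree and order N

cpu_hours = 2326                  # from the slide
wall_days = 6                     # "~6 days", taken as exact here

# Average number of processors kept busy over the run.
average_cpus = cpu_hours / (wall_days * 24)

# Time for one processor to do the same work, in 30-day months.
single_cpu_months = cpu_hours / 24 / 30

print(coefficients, round(average_cpus, 1), round(single_cpu_months, 1))
```

Under these assumptions the coefficient count is consistent with the "~130,000 parameters" on the slide, and the single-processor estimate reproduces the quoted 3.2 months.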

44 Conclusion
Tile-based out-of-core algorithms provide scalability
  Size of the tile is based on the memory of the machine (i.e., fixed) and is independent of the problem size
Algorithms achieve excellent performance
  The large tile sizes mean the algorithm spends nearly all of its time in large, highly efficient matrix-matrix operations
  This helps to offset the I/O cost associated with moving the tiles to and from disk
Use of PLAPACK and POOCLAPACK greatly simplified the implementation
  Reduces complexity of code
  Makes code portable
Has already proven valuable to Earth science applications
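
Why large tiles offset the I/O cost can be seen from a rough operation count. The model below is my own simplification, not a measurement from the talk: it assumes a tile update of the form C += A B on t × t tiles of 8-byte doubles, where A, B, and C are each read once and C is written once.

```python
def gemm_flops_per_byte(t: int, bytes_per_element: int = 8) -> float:
    """Floating-point operations per byte of disk traffic for one tile update."""
    flops = 2 * t**3                         # one multiply and one add per term
    traffic = 4 * t * t * bytes_per_element  # read A, B, C; write C
    return flops / traffic


for t in (1000, 2000, 10000):
    # The ratio grows linearly with t: larger tiles do more work per byte moved.
    print(t, gemm_flops_per_byte(t))
```

Computation grows as t^3 while traffic grows as t^2, so choosing the largest tile that memory allows maximizes the work done per byte read or written.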

45 Conclusion
Broad spectrum of applications
  Large scale problems
  Small clusters
  Embedded systems
  Other small memory machines
Tile-based OOC approach can be extended to other dense linear algebra operations
  Cholesky, matrix inverse, BLAS-3, etc.
Goal is to provide a full suite of OOC utilities

46 For More Information
Visit the PLAPACK website: www.cs.utexas.edu/users/plapack
Visit the GRACE website: www.csr.utexas.edu/grace

