Slide 1: CS 267 Sparse Matrices: Sparse Matrix-Vector Multiply for Iterative Solvers
Kathy Yelick, www.cs.berkeley.edu/~yelick/cs267_sp07
03/09/2007, CS267 Lecture 16
Slide 2: Phillip Colella's "Seven Dwarfs"
High-end simulation in the physical sciences = 7 numerical methods:
1. Structured grids (including locally structured grids, e.g. AMR)
2. Unstructured grids
3. Fast Fourier Transform
4. Dense linear algebra
5. Sparse linear algebra
6. Particles
7. Monte Carlo
These are well-defined targets from the algorithmic, software, and architecture standpoints.
Add 4 for embedded computing: 8. Search/sort, 9. Finite state machines, 10. Filters, 11. Combinational logic. These then cover all 41 EEMBC benchmarks.
Revise 1 for SPEC: 7. Monte Carlo becomes "easily parallel" (to add ray tracing). These then cover 26 SPEC benchmarks.
Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004.
Slide 3: ODEs and Sparse Matrices
All these problems reduce to sparse matrix problems:
- Explicit methods: sparse matrix-vector multiplication (SpMV).
- Implicit methods: solve a sparse linear system, using either direct solvers (Gaussian elimination) or iterative solvers (which use sparse matrix-vector multiplication).
- Eigenvalue/eigenvector algorithms may also be explicit or implicit.
Conclusion: SpMV is key to many ODE problems. It is a relatively simple algorithm to study in detail, with two key problems: locality and load balance.
Slide 4: SpMV in Compressed Sparse Row (CSR) Format
Matrix-vector multiply kernel: y(i) <- y(i) + A(i,j)*x(j)
    for each row i
        for k = ptr[i] to ptr[i+1]-1 do
            y[i] = y[i] + val[k]*x[ind[k]]
Representation of A: CSR is one of many possible formats. (A runnable C version of this kernel follows below.)
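For reference, here is a minimal, self-contained C version of the CSR kernel above (0-based indexing; the ptr/ind/val names follow the slide, and the small matrix in the driver is made up purely for illustration):

    #include <stdio.h>

    /* y <- y + A*x, with A stored in CSR: row pointers ptr[0..n],
       column indices ind[0..nnz-1], values val[0..nnz-1]. */
    void spmv_csr(int n, const int *ptr, const int *ind,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                yi += val[k] * x[ind[k]];
            y[i] = yi;
        }
    }

    int main(void)
    {
        /* 3x3 example: [2 0 1; 0 3 0; 4 0 5] */
        int    ptr[] = {0, 2, 3, 5};
        int    ind[] = {0, 2, 1, 0, 2};
        double val[] = {2, 1, 3, 4, 5};
        double x[]   = {1, 1, 1};
        double y[]   = {0, 0, 0};

        spmv_csr(3, ptr, ind, val, x, y);
        printf("%g %g %g\n", y[0], y[1], y[2]);  /* prints 3 3 9 */
        return 0;
    }

Note that x is read indirectly through ind, which is exactly where the locality problems discussed in the rest of the lecture come from.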
Slide 5: Motivation for Automatic Performance Tuning of SpMV
Historical trend: sparse matrix-vector multiply (SpMV) runs at 10% of peak or less.
Performance depends on the machine, the kernel, and the matrix, and the matrix is often known only at run time.
The best data structure and implementation can be surprising.
Our approach: empirical performance modeling and algorithm search.
Slide 6: SpMV Historical Trends: Fraction of Peak
[Figure: SpMV performance as a fraction of machine peak over time.]
Slide 7: Example: The Difficulty of Tuning
n = 21,200; nnz = 1.5 M; kernel: SpMV.
Source: NASA structural analysis problem.
Slide 8: Example: The Difficulty of Tuning (continued)
Same matrix: n = 21,200; nnz = 1.5 M; kernel: SpMV; source: NASA structural analysis problem.
The matrix has an 8x8 dense substructure.
Slide 9: Taking Advantage of Block Structure in SpMV
The bottleneck is the time to get the matrix from memory: there are only 2 flops per nonzero.
Instead of storing each nonzero with its own index, store each nonzero r-by-c block with one index.
Storage drops by up to 2x if r*c >> 1 (all 32-bit quantities), so the time to fetch the matrix from memory decreases.
This changes both the data structure and the algorithm: we must pick r and c and change the inner loop accordingly.
In the example, is r = c = 8 the best choice? It minimizes storage, so it looks like a good idea... (see the blocked-CSR sketch below).
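To make the r-by-c idea concrete, here is a minimal sketch of SpMV in blocked CSR (BCSR) for a fixed 2x2 block size. The bptr/bind/bval names and the layout (blocks stored row-major, block columns indexed in block units) are assumptions of this sketch, not definitions from the lecture:

    /* y <- y + A*x with A in 2x2 BCSR: brows block rows,
       bptr[0..brows] block-row pointers, bind[] block column indices
       (in units of blocks), bval[] = 4 values per block, row-major. */
    void spmv_bcsr_2x2(int brows, const int *bptr, const int *bind,
                       const double *bval, const double *x, double *y)
    {
        for (int I = 0; I < brows; I++) {
            double y0 = y[2*I], y1 = y[2*I + 1];
            for (int k = bptr[I]; k < bptr[I+1]; k++) {
                const double *b  = &bval[4*k];
                const double *xx = &x[2*bind[k]];
                /* 2x2 block times a length-2 piece of x, fully unrolled */
                y0 += b[0]*xx[0] + b[1]*xx[1];
                y1 += b[2]*xx[0] + b[3]*xx[1];
            }
            y[2*I]     = y0;
            y[2*I + 1] = y1;
        }
    }

Each stored index now covers r*c values, which is where the up-to-2x storage reduction on the slide comes from; tuned implementations generate one such unrolled inner loop per candidate (r,c).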
Slide 10: Speedups on Itanium 2: The Need for Search
[Figure: Mflop/s for each r x c block size; the reference (unblocked) case and the best case (4x2) are marked.]
Slide 11: Register Profile: Itanium 2
[Figure: register-blocking profile; performance ranges from 190 Mflop/s to 1190 Mflop/s across block sizes.]
Slide 12: SpMV Performance (Matrix #2): Generation 2
[Figures: results on Ultra 2i (9% of peak), Ultra 3 (5%), Pentium III-M (15%), and Pentium III (19%); per-platform reference/best rates range from 35 to 120 Mflop/s.]
Slide 13: Register Profiles: Sun and Intel x86
[Figures: register profiles for Ultra 2i (11% of peak), Ultra 3 (5%), Pentium III-M (15%), and Pentium III (21%); per-platform rates range from 35 to 122 Mflop/s.]
Slide 14: SpMV Performance (Matrix #2): Generation 1
[Figures: results on Power3 (13% of peak), Power4 (14%), Itanium 1 (7%), and Itanium 2 (31%); per-platform rates range from 100 Mflop/s to 1.1 Gflop/s.]
Slide 15: Register Profiles: IBM and Intel IA-64
[Figures: register profiles for Power3 (17% of peak), Power4 (16%), Itanium 1 (8%), and Itanium 2 (33%); per-platform rates range from 107 Mflop/s to 1.2 Gflop/s.]
Slide 16: Another Example of Tuning Challenges
More complicated nonzero structure in general.
N = 16,614; NNZ = 1.1 M.
Slide 17: Zoom In to Top Corner
Same matrix (N = 16,614; NNZ = 1.1 M), with more complicated nonzero structure in general.
Slide 18: 3x3 Blocks Look Natural, but...
More complicated nonzero structure in general.
Example: 3x3 blocking on a logical grid of 3x3 cells.
But this would lead to lots of "fill-in".
Slide 19: Extra Work Can Improve Efficiency!
More complicated nonzero structure in general.
Example: 3x3 blocking on a logical grid of 3x3 cells.
Fill in explicit zeros and unroll the 3x3 block multiplies; the resulting "fill ratio" is 1.5.
On a Pentium III this gives a 1.5x speedup, so the measured Mflop rate is 1.5 x 1.5 = 2.25x higher, because the extra operations on explicit zeros are counted as flops. (The arithmetic is written out below.)
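Written out, the arithmetic behind that claim is just the slide's numbers rearranged: a 1.5x reduction in time combined with a 1.5x increase in flops performed gives

    \[
      \frac{P_{\mathrm{new}}}{P_{\mathrm{old}}}
      = \frac{\mathrm{flops}_{\mathrm{new}} / \mathrm{time}_{\mathrm{new}}}
             {\mathrm{flops}_{\mathrm{old}} / \mathrm{time}_{\mathrm{old}}}
      = 1.5 \times 1.5 = 2.25 .
    \]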
Slide 20: Automatic Register Block Size Selection
Selecting the r x c block size:
- Off-line benchmark (once per machine/architecture): precompute Mflops(r,c) using a dense matrix A stored in sparse r x c format, for each r x c.
- Run-time "search": sample A to estimate Fill(r,c) for each r x c.
- Run-time heuristic model: choose r and c to minimize estimated time ~ Fill(r,c) / Mflops(r,c). (A sketch of this selection loop follows below.)
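A minimal sketch of that selection step, assuming the two inputs the slide names are already available as tables mflops[r][c] (off-line dense benchmark) and fill[r][c] (run-time estimate); the 12x12 search range and the array names are illustrative only:

    #define RMAX 12
    #define CMAX 12

    /* Pick (r, c) minimizing estimated time ~ Fill(r,c) / Mflops(r,c).
       mflops: dense-matrix-in-sparse-format benchmark, measured off-line.
       fill:   estimated ratio of stored values (incl. explicit zeros) to true nnz. */
    void choose_block_size(const double mflops[RMAX][CMAX],
                           const double fill[RMAX][CMAX],
                           int *best_r, int *best_c)
    {
        double best_time = 1e300;
        *best_r = *best_c = 1;
        for (int r = 1; r <= RMAX; r++) {
            for (int c = 1; c <= CMAX; c++) {
                double t = fill[r-1][c-1] / mflops[r-1][c-1];
                if (t < best_time) {
                    best_time = t;
                    *best_r = r;
                    *best_c = c;
                }
            }
        }
    }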
Slide 21: Accurate and Efficient Adaptive Fill Estimation
Idea: sample the matrix.
- Fraction of the matrix to sample: s in [0,1]; cost ~ O(s * nnz), so cost is controlled by controlling s.
- Since the search happens at run time, the constant matters!
- Control s automatically by computing statistical confidence intervals; idea: monitor the variance.
Cost of tuning:
- Lower bound: converting the matrix costs 5 to 40 unblocked SpMVs.
- Heuristic: 1 to 11 SpMVs.
(A simplified sampling sketch follows below.)
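One simple way to estimate Fill(r,c) by sampling, in the spirit of this slide (a simplified sketch, not OSKI's actual estimator): visit a random fraction s of the block rows, count how many distinct r x c blocks their nonzeros touch, and compare the blocked storage to the true nonzero count.

    #include <stdlib.h>

    /* Estimate Fill(r,c) = (r*c*blocks) / nnz by sampling a fraction s of block rows.
       CSR inputs: n rows, ncols columns, ptr/ind as usual. */
    double estimate_fill(int n, int ncols, const int *ptr, const int *ind,
                         int r, int c, double s)
    {
        int nbrows = (n + r - 1) / r;
        int nbcols = (ncols + c - 1) / c;
        int *seen = malloc(nbcols * sizeof(int));  /* last block row touching each block col */
        for (int j = 0; j < nbcols; j++) seen[j] = -1;

        long sampled_nnz = 0, sampled_blocks = 0;
        for (int I = 0; I < nbrows; I++) {
            if ((double)rand() / RAND_MAX > s) continue;   /* keep block row with prob. s */
            int lo = I * r, hi = (I + 1) * r; if (hi > n) hi = n;
            for (int i = lo; i < hi; i++) {
                for (int k = ptr[i]; k < ptr[i+1]; k++) {
                    int J = ind[k] / c;
                    sampled_nnz++;
                    if (seen[J] != I) { seen[J] = I; sampled_blocks++; }
                }
            }
        }
        free(seen);
        if (sampled_nnz == 0) return 1.0;  /* nothing sampled; assume no fill */
        return (double)(sampled_blocks * r * c) / (double)sampled_nnz;
    }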
Slide 22: Accuracy of the Tuning Heuristics (1/4)
[Figure.]
NOTE: "fair" flops are used (operations on explicit zeros are not counted as work).
See p. 375 of Vuduc's thesis for the list of matrices.
Slide 23: Accuracy of the Tuning Heuristics (2/4)
[Figure.]
Slide 24: Accuracy of the Tuning Heuristics (3/4)
[Figure.]
Slide 25: Accuracy of the Tuning Heuristics (4/4)
[Figure; includes DGEMV for comparison.]
Slide 26: Upper Bounds on Performance for Blocked SpMV
P = (flops) / (time); flops = 2 * nnz(A).
Lower bound on time, under two main assumptions:
1. Count memory operations only (streaming).
2. Count only compulsory and capacity misses, i.e. ignore conflict misses.
Account for cache line sizes and for the matrix size and nnz.
Charge the minimum access "latency" alpha_i at each cache level L_i and alpha_mem at memory, e.g. as measured by the Saavedra-Barrera and PMaC MAPS benchmarks.
(One explicit form of the resulting bound is written out below.)
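One explicit way to write the resulting bound, assuming H_i counts the accesses satisfied at cache level L_i, M counts the accesses that go to memory, and the alphas are the minimum latencies the slide mentions (how H_i and M are derived from line sizes and compulsory/capacity misses is exactly what the full model specifies):

    \[
      T \;\ge\; T_{\min} \;=\; \sum_i H_i\,\alpha_i \;+\; M\,\alpha_{\mathrm{mem}},
      \qquad
      P_{\max} \;=\; \frac{2\,\mathrm{nnz}(A)}{T_{\min}} .
    \]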
Slide 27: Example: L2 Misses on Itanium 2
[Figure.] Misses were measured using PAPI [Browne '00].
Slide 28: Example: Bounds on Itanium 2
[Figure.]
Slide 29: Example: Bounds on Itanium 2 (continued)
[Figure.]
Slide 30: Example: Bounds on Itanium 2 (continued)
[Figure.]
Slide 31: Summary of Other Performance Optimizations
Optimizations for SpMV:
- Register blocking (RB): up to 4x over CSR
- Variable block splitting: 2.1x over CSR, 1.8x over RB
- Diagonals: 2x over CSR
- Reordering to create dense structure + splitting: 2x over CSR
- Symmetry: 2.8x over CSR, 2.6x over RB
- Cache blocking: 2.8x over CSR
- Multiple vectors (SpMM): 7x over CSR
- And combinations...
Sparse triangular solve:
- Hybrid sparse/dense data structure: 1.8x over CSR
Higher-level kernels:
- A*A^T*x and A^T*A*x: 4x over CSR, 1.8x over RB
- A^k*x: 2x over CSR, 1.5x over RB
Slide 32: SpMV for Shared Memory and Multicore
Data structure transformations: thread blocking, cache blocking, register blocking, format selection, index size reduction.
Kernel optimizations: prefetching, loop structure.
Slide 33: Thread Blocking
Load balancing: evenly divide the number of nonzeros among threads.
Exploit NUMA memory systems on multi-socket SMPs: must pin threads to cores AND pin data to sockets.
Slide 34: Naïve Approach
R x C processor grid; each processor covers the same number of rows and columns.
Potentially unbalanced.
Slide 35: Load Balanced Approach
R x C processor grid.
First, block into rows: the same number of nonzeros in each of the R blocked rows.
Second, block within each blocked row: not only should each block within a row have about the same number of nonzeros, but all blocks should have about the same number of nonzeros.
Third, prune unneeded rows and columns.
Fourth, re-encode the column indices to be relative to each thread block.
(A sketch of the first step, nonzero-balanced row partitioning, follows below.)
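A minimal sketch of that first step, splitting CSR rows into R row blocks with roughly equal nonzero counts using only the row-pointer array; the function and variable names are illustrative:

    /* Split n CSR rows into R row blocks with roughly equal nonzero counts.
       ptr is the CSR row-pointer array (ptr[n] == nnz).
       On return, rows start[t] .. start[t+1]-1 belong to block t. */
    void partition_rows_by_nnz(int n, const int *ptr, int R, int *start)
    {
        long nnz = ptr[n];
        start[0] = 0;
        int row = 0;
        for (int t = 1; t < R; t++) {
            long target = (nnz * (long)t) / R;     /* cumulative-nnz boundary for block t */
            while (row < n && ptr[row + 1] <= target)
                row++;
            start[t] = row;
        }
        start[R] = n;
    }

Because ptr[] is already a running sum of nonzeros, the boundaries can be found with a single scan (or a binary search per boundary).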
Slide 36: Memory Optimizations
Cache blocking: performed for each thread block; chop into blocks so the entire source vector fits in cache.
Prefetching: insert explicit prefetch operations to mask latency to memory; tune the prefetch distance/time using search.
Register blocking: as in OSKI, but done separately per cache block, with a simpler heuristic: choose the block size that minimizes total storage.
Index compression: use 16-bit integers for indices in blocks less than 64K wide.
(A hedged sketch of the prefetching idea follows below.)
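To make the prefetching item concrete, here is a sketch of the CSR inner loop with software prefetch inserted via the GCC/Clang __builtin_prefetch intrinsic. The prefetch distance PF_DIST is exactly the kind of parameter the slide says is tuned by search; the value 64 is only a placeholder, and a production version would pad or guard the end of the arrays:

    #define PF_DIST 64   /* prefetch distance in nonzeros; tuned by search in practice */

    void spmv_csr_prefetch(int n, const int *ptr, const int *ind,
                           const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];
            for (int k = ptr[i]; k < ptr[i+1]; k++) {
                /* Hint that the matrix values and indices PF_DIST nonzeros ahead
                   will be needed soon (a real code pads val/ind or guards the tail). */
                __builtin_prefetch(&val[k + PF_DIST], 0, 0);
                __builtin_prefetch(&ind[k + PF_DIST], 0, 0);
                yi += val[k] * x[ind[k]];
            }
            y[i] = yi;
        }
    }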
Slide 37: 1-Thread Performance (Preliminary)
[Bar charts: Mflop/s for naive CSR, + register blocking, and + software prefetch on two matrices (memplus.rua, raefsky3.rua) and two machines (dual-socket dual-core Opteron @ 2.2 GHz; quad-socket single-core Opteron @ 2.4 GHz). Reported speedups over naive range from about 1.4x to 3.2x.]
Slide 38: 2-Thread Performance (Preliminary)
[Bar charts for memplus.rua and raefsky3.rua on the two Opteron systems, comparing naive, thread/register blocked, and software-prefetch variants; reported 2-thread speedups are 0.96x, 1.2x, 1.6x, and 1.85x.]
Slide 39: 4-Thread Performance (Preliminary)
[Bar charts for memplus.rua and raefsky3.rua on the two Opteron systems; reported 4-thread speedups are 2.0x, 2.3x, 2.75x, and 3.0x.]
Slide 40: Speedup for the Best Combination of #Threads, Blocking, Prefetching, ...
[Bar charts: best combined speedups over the naive single-thread CSR baselines are 3.8x and 4.2x on the dual-socket dual-core Opteron @ 2.2 GHz and 6.4x and 7.3x on the quad-socket single-core Opteron @ 2.4 GHz, for memplus.rua and raefsky3.rua.]
Slide 41: Distributed Memory SpMV
y = A*x, where A is a sparse n x n matrix.
Questions:
- Which processors store y[i], x[i], and A[i,j]?
- Which processors compute y[i] = sum over j of A[i,j]*x[j] = (row i of A) * x ... a sparse dot product?
Partitioning:
- Partition the index set {1,...,n} = N1 U N2 U ... U Np.
- For all i in Nk, processor k stores y[i], x[i], and row i of A.
- For all i in Nk, processor k computes y[i] = (row i of A) * x: the "owner computes" rule (processor k computes the y[i]s it owns).
This may require communication. (A hedged MPI sketch follows below.)
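A minimal MPI sketch of the owner-computes rule above. To stay short it simply allgathers all of x, so every process has whatever remote entries its rows touch; a real implementation would exchange only the needed entries. The contiguous-row distribution and all names here are illustrative assumptions:

    #include <mpi.h>
    #include <stdlib.h>

    /* Each rank owns a contiguous block of nloc rows of A (local CSR arrays
       ptr/ind/val with *global* column indices), the matching slice of x
       (x_loc), and computes the matching slice of y (y_loc).
       counts/displs describe every rank's slice of x (as for MPI_Allgatherv). */
    void dist_spmv(int nloc, int n, const int *ptr, const int *ind,
                   const double *val, const double *x_loc, double *y_loc,
                   const int *counts, const int *displs, MPI_Comm comm)
    {
        double *x_full = malloc(n * sizeof(double));

        /* Communication: gather everyone's slice of x so local rows can be applied. */
        MPI_Allgatherv(x_loc, nloc, MPI_DOUBLE,
                       x_full, counts, displs, MPI_DOUBLE, comm);

        /* Owner computes: y[i] = (row i of A) * x for each locally owned row. */
        for (int i = 0; i < nloc; i++) {
            double yi = 0.0;
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                yi += val[k] * x_full[ind[k]];
            y_loc[i] = yi;
        }
        free(x_full);
    }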
Slide 42: Two Layouts
The partitions should be by nonzero counts, not by rows/columns alone.
1D partition: the most popular, but for algorithms (e.g. NAS CG) that do reductions on y, those reductions scale with log P.
2D partition: reductions scale with log sqrt(P), but the partitioner needs to keep roughly equal nonzero counts for load balance.
Slide 43: Summary
Sparse matrix-vector multiply is critical to many applications.
Performance is limited by the memory system (and perhaps the network).
Cache blocking, register blocking, and prefetching are all important.
Autotuning can be used, but it needs to know the matrix structure.
Slide 44: Extra Slides
Including: how to use OSKI.
Slide 45: Example: Sparse Triangular Factor
Raefsky4 (structural problem) + SuperLU + colmmd ordering.
N = 19,779; nnz = 12.6 M.
Dense trailing triangle: dim = 2268, 20% of total nonzeros (it can be as high as 90+%).
Result: 1.8x over CSR.
Slide 46: Cache Optimizations for A*A^T*x
Cache level: interleave the multiplications by A and A^T, so A is fetched from memory only once.
Register level: take a_i^T to be an r x c block row, or a diagonal row.
Each a_i contributes one dot product and one "axpy". (A sketch follows below.)
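A sketch of the cache-level interleaving. If B = A^T is stored in CSR (so the rows b_i of B are the columns of A), then y = A*A^T*x = sum_i b_i*(b_i^T * x), and each sparse row is used for one dot product and one axpy while it is still in cache, so A's data is fetched from memory only once. Names are illustrative:

    /* y <- (B^T * B) * x in one pass over B (CSR, m rows, n columns).
       With B = A^T this computes y = A * A^T * x: each sparse row b_i is used
       for a dot product t = b_i . x and then an axpy y += t * b_i. */
    void ata_interleaved(int m, int n, const int *ptr, const int *ind,
                         const double *val, const double *x, double *y)
    {
        for (int j = 0; j < n; j++) y[j] = 0.0;

        for (int i = 0; i < m; i++) {
            double t = 0.0;
            for (int k = ptr[i]; k < ptr[i+1]; k++)   /* dot product */
                t += val[k] * x[ind[k]];
            for (int k = ptr[i]; k < ptr[i+1]; k++)   /* axpy */
                y[ind[k]] += t * val[k];
        }
    }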
Slide 47: Example: Combining Optimizations
Register blocking, symmetry, and multiple (k) vectors.
Three low-level tuning parameters: r, c, v.
[Diagram: Y += A * X, annotated with the register block dimensions r and c and with v of the k vectors processed at a time.]
Slide 48: Example: Combining Optimizations (continued)
Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB].
Symmetric, blocked, 1 vector: up to 2.6x over nonsymmetric, blocked, 1 vector.
Symmetric, blocked, k vectors: up to 2.1x over nonsymmetric, blocked, k vectors; up to 7.3x over nonsymmetric, nonblocked, 1 vector.
Symmetric storage: up to 64.7% savings.
Slide 49: Potential Impact on Applications: T3P
Application: accelerator design [Ko]; 80% of time is spent in SpMV.
Relevant optimization techniques: symmetric storage, register blocking.
On a single-processor Itanium 2:
- 1.68x speedup: 532 Mflop/s, or 15% of the 3.6 Gflop/s peak.
- 4.4x speedup with multiple (8) vectors: 1380 Mflop/s, or 38% of peak.
Slide 50: Potential Impact on Applications: Omega3P
Application: accelerator cavity design [Ko].
Relevant optimization techniques: symmetric storage, register blocking, reordering.
Reordering:
- Reverse Cuthill-McKee ordering to reduce bandwidth.
- Traveling Salesman Problem-based ordering to create blocks: nodes = columns of A; weight(u,v) = number of nonzeros u and v have in common; a tour = an ordering of the columns; choose the maximum-weight tour. See [Pinar & Heath '97].
Result: 2.1x speedup on Power4, but SpMV is not dominant.
Slide 51: Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: matrix nonzero structure.]
Slide 52: 100x100 Submatrix Along the Diagonal
[Figure.]
Slide 53: Post-RCM Reordering
[Figure.]
Slide 54: "Microscopic" Effect of RCM Reordering
Before: green + red. After: green + blue. [Figure.]
Slide 55: "Microscopic" Effect of Combined RCM+TSP Reordering
Before: green + red. After: green + blue. [Figure.]
Slide 56: (Omega3P)
[Figure.]
Slide 57: Optimized Sparse Kernel Interface (OSKI)
Provides sparse kernels automatically tuned for the user's matrix and machine.
- BLAS-style functionality: SpMV (A*x and A^T*y), triangular solve (TrSV).
- Hides the complexity of run-time tuning.
- Includes new, faster locality-aware kernels: A^T*A*x, A^k*x.
Faster than standard implementations: up to 4x faster matvec, 1.8x trisolve, 4x A^T*A*x.
For "advanced" users and solver library writers:
- Available as a stand-alone library (OSKI 1.0.1b, 3/06).
- Available as a PETSc extension (OSKI-PETSc 0.1d, 3/06).
bebop.cs.berkeley.edu/oski
Slide 58: How OSKI Tunes (Overview)
Library install time (offline):
1. Build for the target architecture.
2. Benchmark, producing benchmark data and generated code variants.
Application run time:
1. Evaluate heuristic models against the user's matrix, the benchmark data, and a workload obtained from program monitoring and history.
2. Select the data structure and code; the user receives a matrix handle for kernel calls.
Extensibility: advanced users may write and dynamically add "code variants" and "heuristic models" to the system.
Slide 59: How OSKI Tunes (Overview, continued)
At library build/install time:
- Pre-generate and compile code variants into dynamic libraries.
- Collect benchmark data: measure and record the speed of possible sparse data structures and code variants on the target architecture.
- The installation process uses standard, portable GNU AutoTools.
At run time:
- The library "tunes" using heuristic models, which analyze the user's matrix and the benchmark data to choose an optimized data structure and code.
- The tuning cost is non-trivial: up to ~40 mat-vecs. The library limits the time it spends tuning based on an estimated workload, provided by the user or inferred by the library.
- The user may reduce this cost by saving tuning results for the application and reusing them on future runs with the same or a similar matrix.
Slide 60: Optimizations in the Initial OSKI Release
Fully automatic heuristics for:
- Sparse matrix-vector multiply: register-level blocking; register-level blocking + symmetry + multiple vectors; cache-level blocking.
- Sparse triangular solve with register-level blocking and the "switch-to-dense" optimization.
- Sparse A^T*A*x with register-level blocking.
The user may select other optimizations manually: diagonal storage optimizations, reordering, splitting, and the tiled matrix powers kernel (A^k*x). All are available in dynamic libraries and accessible via a high-level embedded script language.
"Plug-in" extensibility: very advanced users may write their own heuristics, create new data structures/code variants, and dynamically add them to the system.
Slide 61: How to Call OSKI: Basic Usage
You may gradually migrate existing apps: step 1, "wrap" existing data structures; step 2, make BLAS-like kernel calls.

    int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
    double* x = …, *y = …;  /* Let x and y be two dense vectors */

    /* Compute y = beta*y + alpha*A*x, 500 times */
    for( i = 0; i < 500; i++ )
        my_matmult( ptr, ind, val, alpha, x, beta, y );
Slide 62: How to Call OSKI: Basic Usage (Step 1: wrap the data)
You may gradually migrate existing apps: step 1, "wrap" existing data structures; step 2, make BLAS-like kernel calls.

    int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
    double* x = …, *y = …;  /* Let x and y be two dense vectors */

    /* Step 1: Create OSKI wrappers around this data */
    oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols,
                                                SHARE_INPUTMAT, …);
    oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
    oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

    /* Compute y = beta*y + alpha*A*x, 500 times */
    for( i = 0; i < 500; i++ )
        my_matmult( ptr, ind, val, alpha, x, beta, y );
Slide 63: How to Call OSKI: Basic Usage (Step 2: kernel calls)
You may gradually migrate existing apps: step 1, "wrap" existing data structures; step 2, make BLAS-like kernel calls.

    int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
    double* x = …, *y = …;  /* Let x and y be two dense vectors */

    /* Step 1: Create OSKI wrappers around this data */
    oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols,
                                                SHARE_INPUTMAT, …);
    oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
    oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

    /* Compute y = beta*y + alpha*A*x, 500 times */
    for( i = 0; i < 500; i++ )
        oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view); /* Step 2 */
Slide 64: How to Call OSKI: Tune with Explicit Hints
The user calls a "tune" routine and may provide explicit tuning hints (OPTIONAL).

    oski_matrix_t A_tunable = oski_CreateMatCSR( … );
    /* … */

    /* Tell OSKI we will call SpMV 500 times (workload hint) */
    oski_SetHintMatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view, 500);
    /* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
    oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

    oski_TuneMat(A_tunable); /* Ask OSKI to tune */

    for( i = 0; i < 500; i++ )
        oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view);
Slide 65: How the User Calls OSKI: Implicit Tuning
Ask the library to infer the workload: the library profiles all kernel calls and may periodically re-tune.

    oski_matrix_t A_tunable = oski_CreateMatCSR( … );
    /* … */

    for( i = 0; i < 500; i++ ) {
        oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view);
        oski_TuneMat(A_tunable); /* Ask OSKI to tune */
    }
Slide 66: Quick-and-Dirty Parallelism: OSKI-PETSc
Extend PETSc's distributed-memory SpMV (MATMPIAIJ).
PETSc: each process stores a diagonal (all-local) submatrix and an off-diagonal submatrix.
OSKI-PETSc: add OSKI wrappers; each submatrix is tuned independently.
Slide 67: OSKI-PETSc Proof-of-Concept Results
Matrix 1: accelerator cavity design (R. Lee @ SLAC); N ~ 1 M, ~40 M nonzeros; 2x2 dense block substructure; symmetric.
Matrix 2: linear programming (Italian Railways); short and fat: 4k x 1M, ~11 M nonzeros; highly unstructured; big speedup from cache blocking, for which there is no native PETSc format.
Evaluation machine: Xeon cluster; peak 4.8 Gflop/s per node.
Slide 68: Accelerator Cavity Matrix
[Figure: matrix nonzero structure.]
Slide 69: OSKI-PETSc Performance: Accelerator Cavity
[Figure: performance results.]
Slide 70: Linear Programming Matrix
[Figure: matrix nonzero structure.]
Slide 71: OSKI-PETSc Performance: LP Matrix
[Figure: performance results.]
Slide 72: Tuning Higher-Level Algorithms
So far we have tuned a single sparse matrix kernel, y = A^T*A*x, motivated by a higher-level algorithm (SVD). What can we do by extending tuning to a higher level?
Consider Krylov subspace methods for Ax = b and Ax = lambda*x: Conjugate Gradients (CG), GMRES, Lanczos, ...
- The inner loop does y = A*x, dot products, saxpys, and scalar operations.
- The inner loop costs at least O(1) messages, so k iterations cost at least O(k) messages.
Our goal: show how to do k iterations with O(1) messages.
- Possible payoff: make Krylov subspace methods much faster on machines with slow networks. There are memory bandwidth improvements too (not discussed).
- Obstacles: numerical stability, preconditioning, ...
(A sketch of the underlying matrix powers kernel appears below.)
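The building block behind the O(1)-messages goal is the matrix powers kernel, which produces the basis vectors x, A*x, A^2*x, ..., A^k*x that k Krylov iterations consume. Below is a plain sequential sketch of that kernel (not the communication-avoiding version, which computes the same vectors while fetching each processor's remote data only once per k steps); names are illustrative:

    #include <string.h>

    /* Compute V[j] = A^j * x for j = 0..k, with A in CSR.
       V holds k+1 vectors of length n, stored contiguously: V[j*n + i]. */
    void matrix_powers(int n, const int *ptr, const int *ind, const double *val,
                       const double *x, int k, double *V)
    {
        memcpy(V, x, n * sizeof(double));              /* V[0] = x */
        for (int j = 1; j <= k; j++) {
            const double *src = &V[(j - 1) * n];
            double *dst = &V[j * n];
            for (int i = 0; i < n; i++) {              /* dst = A * src (CSR SpMV) */
                double t = 0.0;
                for (int kk = ptr[i]; kk < ptr[i+1]; kk++)
                    t += val[kk] * src[ind[kk]];
                dst[i] = t;
            }
        }
    }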
Slide 73: Parallel Sparse Matrix-Vector Multiplication
y = A*x, where A is a sparse n x n matrix.
Questions:
- Which processors store y[i], x[i], and A[i,j]?
- Which processors compute y[i] = sum over j of A[i,j]*x[j] = (row i of A) * x ... a sparse dot product?
Partitioning:
- Partition the index set {1,...,n} = N1 U N2 U ... U Np.
- For all i in Nk, processor k stores y[i], x[i], and row i of A.
- For all i in Nk, processor k computes y[i] = (row i of A) * x: the "owner computes" rule (processor k computes the y[i]s it owns).
This may require communication.
Slide 74: Matrix Reordering via Graph Partitioning
The "ideal" matrix structure for parallelism is block diagonal:
- p (number of processors) blocks, each of which can be computed locally.
- If there are no nonzeros outside these blocks, no communication is needed.
Can we reorder the rows/columns to get close to this, with most nonzeros in the diagonal blocks and few outside?
Slide 75: Goals of Reordering
Performance goals:
- Balance load (how is load measured?): approximately equal numbers of nonzeros (not necessarily rows) per processor.
- Balance storage (how much does each processor store?): approximately equal numbers of nonzeros.
- Minimize communication (how much is communicated?): minimize nonzeros outside the diagonal blocks; a related optimization criterion is to move nonzeros near the diagonal.
- Improve register and cache reuse: group nonzeros in small vertical blocks so source (x) elements loaded into cache or registers may be reused (temporal locality); group nonzeros in small horizontal blocks so nearby source (x) elements in the cache may be used (spatial locality).
Other algorithms reorder for other reasons: to reduce the number of nonzeros in the matrix after Gaussian elimination, or to improve numerical stability.
Slide 76: Graph Partitioning and Sparse Matrices
[Figure: a 6x6 symmetric example matrix and the corresponding graph on nodes 1-6.]
Relationship between the matrix and the graph:
- Edges in the graph correspond to nonzeros in the matrix; here the matrix is symmetric (edges are unordered) and the weights are equal (1).
- If the matrix is divided over 3 processors, there are 14 nonzeros outside the diagonal blocks, which represent the 7 (bidirectional) crossing edges.
Slide 77: Graph Partitioning and Sparse Matrices (continued)
[Figure: the same 6x6 matrix and graph, partitioned.]
Relationship between the matrix and the graph: a "good" partition of the graph has
- an equal (weighted) number of nodes in each part (load and storage balance), and
- a minimum number of edges crossing between parts (minimizing communication).
Reorder the rows/columns by putting all nodes of one partition together. (A small sketch for counting the crossing nonzeros, i.e. the edge cut, appears below.)
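To make the communication metric concrete, here is a small sketch that, given the CSR structure and an assignment part[i] of each row/column to a partition, counts the nonzeros falling outside the diagonal blocks; for a symmetric matrix this is twice the number of crossing edges. Names are illustrative:

    /* Count nonzeros A[i][j] whose row and column live in different partitions.
       part[i] gives the partition of row/column i. */
    long count_offdiag_nonzeros(int n, const int *ptr, const int *ind, const int *part)
    {
        long cut = 0;
        for (int i = 0; i < n; i++)
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                if (part[i] != part[ind[k]])
                    cut++;
        return cut;
    }

On the slides' example, 3 parts of 2 nodes each give a count of 14, matching the 7 bidirectional crossing edges.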