Parallel Out-of-Core “Tall Skinny” QR
James Demmel, Mark Hoemmen, et al.
{demmel, mhoemmen}@eecs.berkeley.edu
Berkeley Par Lab (Parallel Computing Laboratory), EECS (Electrical Engineering and Computer Sciences)

Why out-of-core?
  - Can solve huge problems without huge machine
  - Workstation as cluster alternative
      - Cluster: Job queue wait counts as “run time”
      - Instant turnaround for compile-run-debug cycle
      - Full(-speed) access to graphical development tools
      - Easier to get grant for < $5000 workstation than for > $10M cluster!
  - Frees precious, power-hungry DRAM for other tasks
  - Great for embedded hardware: Cell phones, robots, …

Why parallel out-of-core?
  - Less compute time per block frees processor time for other tasks
  - Total amount of I/O ideally independent of # of cores
  - If enough flops per datum, can hide reads/writes with computation

“Tall Skinny” QR (TSQR) algorithm
  - For matrices with many more rows than columns (m >> n)
  - Algorithm:
      - Divide matrix into block rows
      - R: Reduction on the blocks with QR factorization as the operator
      - Q: Stored implicitly in reduction tree
  - Can use any reduction tree:
      - Binary tree minimizes # messages in parallel case
      - Linear tree minimizes bandwidth costs for sequential (out-of-core) case
      - Other trees optimize for different cases
  - Possible parallelism:
      - In the reduction tree (one processor per local factorization)
      - In the local factorizations (e.g., multithreaded BLAS)

Some TSQR reduction trees
  Top left: TSQR on a binary tree with 4 processors. Top right: TSQR on a linear tree with 1 processor. Bottom: TSQR on a “hybrid” tree with 4 processors. In all three diagrams, steps progress from left to right. The input matrix A is divided into blocks of rows. At each step, the blocks in core are highlighted in grey. Each grey box indicates a QR factorization; multiple blocks in a factorization are stacked. Arrows show data dependencies.

Applications of TSQR algorithm
  - Orthogonalization step in iterative methods for sparse systems
      - A bottleneck for new Krylov methods we are developing
  - Block column QR factorization
      - For QR of matrices stored in any 2-D block cyclic format
      - Currently a bottleneck in ScaLAPACK
  - Solving linear systems
      - More flops, but less data motion than LU
      - Works for sparse and dense systems
  - Solving least-squares optimization problems

Implementation and benchmark
  - Linear tree with multithreaded BLAS and LAPACK (Intel MKL)
  - Fixed # blocks at 10
  - Varied # threads (only 2 cores available here, Itanium 2)
  - Varied # rows (1000 .. 10000) and # columns (100 .. 1000) per block
  - Compared with standard (in-core) multithreaded QR
      - Read input matrix from disk using same blocks as TSQR
      - Wrote Q factor to disk, just like TSQR

Preliminary results
  [Figure]

What does out-of-core want?
  - For performance:
      - Non-blocking I/O operations
      - QoS guarantee on disk bandwidth per core
  - For tuning:
      - Know ideal disk bandwidth per core
      - Can measure actual disk bandwidth used per core
      - Awareness of I/O buffering implementation details
  - For ease of coding:
      - Automatic “(un)pickling” ((un)packing of structures)

Conclusions
  - OOC TSQR competes favorably with standard QR, if blocks are large enough and “square enough”
  - Some benefit from multithreaded BLAS, but not much
      - Suggests we should try a hybrid tree
  - Painful development process
      - Must serialize data manually
      - How do we measure disk bandwidth per core?
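
Sketch: linear-tree TSQR
  The linear-tree reduction described in the TSQR algorithm panel can be illustrated with a minimal sketch. This is not the benchmarked implementation: it uses a plain-Python Householder QR where the poster's code calls LAPACK (Intel MKL), it takes blocks from an in-memory list where an out-of-core code would read them from disk, and it computes only R (the implicit Q factors are discarded). The names householder_r and tsqr_linear_r are hypothetical, and every block is assumed to have at least as many rows as columns.

```python
import math


def householder_r(a):
    """Return the n-by-n upper-triangular R factor of an m-by-n matrix, m >= n.

    Plain-Python Householder QR, used here as a stand-in for a LAPACK call.
    The input is a list of rows and is not modified.
    """
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]  # work on a copy
    for k in range(n):
        # Norm of column k from the diagonal downward.
        norm = math.sqrt(sum(r[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            continue  # column is already zero; nothing to eliminate
        # Choose the sign that avoids cancellation in v[0].
        alpha = -norm if r[k][k] >= 0 else norm
        v = [r[i][k] for i in range(k, m)]
        v[0] -= alpha
        vtv = sum(x * x for x in v)
        # Apply H = I - 2 v v^T / (v^T v) to columns k..n-1.
        for j in range(k, n):
            s = sum(v[i] * r[k + i][j] for i in range(len(v)))
            f = 2.0 * s / vtv
            for i in range(len(v)):
                r[k + i][j] -= f * v[i]
    # Keep the top n rows and clear rounding noise below the diagonal.
    return [[r[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]


def tsqr_linear_r(blocks):
    """R factor of the matrix formed by stacking `blocks`, via a linear tree.

    `blocks` is any iterable of row blocks (lists of rows). Only the running
    R factor and one block are held at a time, which is the property that
    makes the linear tree suitable for out-of-core use.
    """
    it = iter(blocks)
    r = householder_r(next(it))
    for block in it:
        # Stack the current R on top of the next block and refactor.
        r = householder_r(r + block)
    return r
```

  Because blocks may be any iterable, a generator that reads one block of rows from disk per step could be passed in place of a list. The result can be checked against the input by comparing R^T R with A^T A, since R is determined only up to the signs of its rows.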