Presentation transcript:

Parallel Out-of-Core "Tall Skinny" QR
James Demmel, Mark Hoemmen, et al. {demmel, mhoemmen}@eecs.berkeley.edu
Parallel Computing Laboratory (Berkeley Par Lab), Electrical Engineering and Computer Sciences (EECS), UC Berkeley

Why out-of-core?
- Can solve huge problems without a huge machine
- Workstation as a cluster alternative:
  - On a cluster, job queue wait counts as "run time"
  - Instant turnaround for the compile-run-debug cycle
  - Full(-speed) access to graphical development tools
  - Easier to get a grant for a < $5000 workstation than for a > $10M cluster!
- Frees precious, power-hungry DRAM for other tasks
- Great for embedded hardware: cell phones, robots, ...

Applications of the TSQR algorithm
- Orthogonalization step in iterative methods for sparse systems
  - A bottleneck for new Krylov methods we are developing
- Block column QR factorization
  - For QR of matrices stored in any 2-D block cyclic format
  - Currently a bottleneck in ScaLAPACK
- Solving linear systems
  - More flops, but less data motion, than LU
  - Works for sparse and dense systems
- Solving least-squares optimization problems

Implementation and benchmark
- Linear tree with multithreaded BLAS and LAPACK (Intel MKL)
- Fixed the number of blocks at 10
- Varied the number of threads (only 2 cores available here, Itanium 2)
- Varied the number of rows (1000 .. 10000) and columns (100 .. 1000) per block
- Compared with standard (in-core) multithreaded QR:
  - Read the input matrix from disk using the same blocks as TSQR
  - Wrote the Q factor to disk, just like TSQR

Preliminary results

Why parallel out-of-core?
- Less compute time per block frees processor time for other tasks
- Total amount of I/O is ideally independent of the number of cores
- If there are enough flops per datum, reads and writes can be hidden behind computation

What does out-of-core want?
- For performance:
  - Non-blocking I/O operations
  - A QoS guarantee on disk bandwidth per core
- For tuning:
  - Knowledge of the ideal disk bandwidth per core
  - The ability to measure the actual disk bandwidth used per core
  - Awareness of I/O buffering implementation details
- For ease of coding:
  - Automatic "(un)pickling" ((un)packing of structures)

Some TSQR reduction trees
Top left: TSQR on a binary tree with 4 processors. Top right: TSQR on a linear tree with 1 processor. Bottom: TSQR on a "hybrid" tree with 4 processors. In all three diagrams, steps progress from left to right. The input matrix A is divided into blocks of rows. At each step, the blocks in core are highlighted in grey. Each grey box indicates a QR factorization; multiple blocks in the same factorization are stacked. Arrows show data dependencies.

"Tall Skinny" QR (TSQR) algorithm
- For matrices with many more rows than columns (m >> n)
- Algorithm (minimal code sketches follow the Conclusions):
  - Divide the matrix into block rows
  - R: a reduction over the blocks with QR factorization as the operator
  - Q: stored implicitly in the reduction tree
- Can use any reduction tree:
  - A binary tree minimizes the number of messages in the parallel case
  - A linear tree minimizes bandwidth costs in the sequential (out-of-core) case
  - Other trees optimize for other cases
- Possible parallelism:
  - In the reduction tree (one processor per local factorization)
  - In the local factorizations (e.g., multithreaded BLAS)

Conclusions
- Out-of-core TSQR competes favorably with standard QR if the blocks are large enough and "square enough"
- Some benefit from multithreaded BLAS, but not much
  - Suggests we should try a hybrid tree
- Painful development process:
  - Must serialize data manually
  - How do we measure disk bandwidth per core?
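
To make the linear-tree (sequential) variant concrete, here is a minimal NumPy sketch of the R-factor reduction. It assumes np.linalg.qr as the local factorization; the function name tsqr_linear_r, the block count, and the matrix sizes are illustrative choices, not details of the authors' implementation. In the out-of-core setting, the block iterator would read row blocks from disk (e.g., via a memory-mapped file) rather than from an in-memory array.

```python
import numpy as np

def tsqr_linear_r(blocks):
    """R factor of the matrix formed by stacking the row blocks in `blocks`.

    Each block must have at least as many rows as columns. The reduction
    operator stacks the running R on the next block and re-factors it, so at
    any moment only one row block plus a 2n x n stack is in core.
    """
    it = iter(blocks)
    _, r = np.linalg.qr(next(it))           # seed the reduction with the first block
    for block in it:
        _, r = np.linalg.qr(np.vstack([r, block]))
    return r

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((10000, 100))   # tall and skinny: m >> n
    blocks = np.array_split(a, 10, axis=0)  # 10 row blocks, as in the benchmark setup
    r_tsqr = tsqr_linear_r(blocks)
    r_ref = np.linalg.qr(a)[1]
    # R is unique only up to the signs of its rows, so compare absolute values.
    print("R factors agree:", np.allclose(np.abs(r_tsqr), np.abs(r_ref)))
```

Because the running R is only n x n, the streaming loop never holds more than one row block of A at a time, which is what allows the factorization to handle matrices much larger than DRAM.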
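
For the parallel case, the binary reduction tree in the figure pairs up R factors level by level. The sketch below shows that tree structure serially, with the inner loop standing in for what would be per-processor work and messages; the function name and problem sizes are again illustrative assumptions, not the poster's code.

```python
import numpy as np

def tsqr_binary_r(blocks):
    """R factor of the stacked `blocks` via a binary reduction tree."""
    rs = [np.linalg.qr(b)[1] for b in blocks]    # leaf QRs, one per "processor"
    while len(rs) > 1:
        nxt = []
        for i in range(0, len(rs) - 1, 2):
            # Reduction operator: stack a pair of R factors and re-factor.
            nxt.append(np.linalg.qr(np.vstack([rs[i], rs[i + 1]]))[1])
        if len(rs) % 2 == 1:                     # odd R out moves up unchanged
            nxt.append(rs[-1])
        rs = nxt
    return rs[0]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.standard_normal((8000, 64))
    r_tree = tsqr_binary_r(np.array_split(a, 4, axis=0))  # 4 row blocks, as in the figure
    print("R factors agree:",
          np.allclose(np.abs(r_tree), np.abs(np.linalg.qr(a)[1])))
```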