SuperLU_DIST on GPU Cluster
Sherry Li, FASTMath Meeting, Oct. 1-2, 2014
"A distributed CPU-GPU sparse direct solver", P. Sao, R. Vuduc and X.S. Li, Euro-Par 2014, LNCS, Porto, Portugal, August 25-29, 2014.

SuperLU_DIST: Algorithm and data distribution
- 2D data distribution
- Owner-update policy
- L and U are stored in different sparse formats
- Right-looking factorization
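
As a rough illustration of the 2D distribution, the owning process of a block can be computed from its block row and column indices on a Pr x Pc process grid. This is a hedged sketch, not SuperLU_DIST's actual indexing macros.

```c
/* Illustrative owner computation for a 2D block-cyclic layout on a
 * Pr x Pc process grid; SuperLU_DIST's real macros differ in detail. */
int block_owner(int ib, int jb, int Pr, int Pc)
{
    int prow = ib % Pr;        /* process row owning block row ib     */
    int pcol = jb % Pc;        /* process column owning block col jb  */
    return prow * Pc + pcol;   /* linear MPI rank in a row-major grid */
}
```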

Schur complement update
The Schur complement update is done in three steps:
1. Gather: pack the sparse operands into dense, BLAS-compliant buffers
2. GEMM: dense matrix-matrix multiplication of the packed L and U panels
3. Scatter: scatter the dense output back into the sparse format
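
The sketch below illustrates these three phases for a single update tile in plain C, assuming the L and U panels have already been packed into dense column-major buffers; the row/column index maps rind and cind stand in for SuperLU_DIST's block index structures and are illustrative only.

```c
/* Minimal sketch of one Schur-complement update tile (CPU only). */
#include <stdlib.h>
#include <cblas.h>

void schur_update_tile(int m, int n, int k,
                       const double *Lval,   /* packed m x k panel of L */
                       const double *Uval,   /* packed k x n panel of U */
                       const int *rind, const int *cind, /* index maps  */
                       double *A, int lda)   /* trailing matrix storage */
{
    /* 1. Gather: in the real code the sparse L and U blocks are packed
     *    here; this sketch assumes Lval and Uval are already dense.    */

    /* 2. GEMM: V = Lval * Uval, one dense (m x k)*(k x n) product.     */
    double *V = malloc((size_t)m * n * sizeof(double));
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, Lval, m, Uval, k, 0.0, V, m);

    /* 3. Scatter: subtract V from the destination blocks through the
     *    index maps (indirect addressing, hard to vectorize).          */
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            A[rind[i] + (size_t)cind[j] * lda] -= V[i + (size_t)j * m];

    free(V);
}
```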

Offloading BLAS calls to GPU
Considerations:
1. Small operand sizes for BLAS
2. PCI-e latency and transfer costs are high
3. Cost of the Scatter phase is significant
4. Device/PCI-e resource contention

Aggregating BLAS calls
- Aggregating BLAS calls increases the operand size
- Requires fewer transfers to the device and back
- May not increase arithmetic intensity
- Requires a buffer for the temporary product; GPU memory may be limited, in which case we slice the matrix so that each slice fits into GPU/CPU memory (see the sketch below)
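
A hedged sketch of such a slicing rule, assuming the L panel stays resident on the device while columns of U and of the temporary product stream through a bounded buffer; cudaMemGetInfo() supplies the available device memory, everything else is illustrative rather than SuperLU_DIST's actual heuristic.

```c
/* Choose how many column slices of U are needed so that the L panel,
 * one slice of U, and the temporary product all fit on the device.
 * buf_bytes is an optional user cap; 0 means use all free GPU memory. */
#include <cuda_runtime.h>

int num_slices(int m, int k, int n_cols, size_t buf_bytes)
{
    size_t free_b, total_b;
    cudaMemGetInfo(&free_b, &total_b);        /* available device memory */
    if (buf_bytes == 0 || buf_bytes > free_b)
        buf_bytes = free_b;

    /* Resident L panel is m x k; each column of U adds k doubles of U
     * and m doubles of the product V.                                   */
    size_t fixed    = (size_t)m * k * sizeof(double);
    size_t per_col  = (size_t)(m + k) * sizeof(double);
    size_t cols_fit = (buf_bytes > fixed) ? (buf_bytes - fixed) / per_col : 0;

    if (cols_fit == 0) return -1;             /* L panel alone does not fit */
    return (int)((n_cols + cols_fit - 1) / cols_fit);   /* ceil division    */
}
```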

Pipelining GEMM on GPU and (multithreaded) Scatter on CPU
Use CUDA's stream facility to pipeline GEMM calls on the GPU with Scatter on the CPU:
1. Slice U into n_s roughly equal partitions, each containing n_b columns, chosen greedily
2. Schedule each partition on its own CUDA stream
3. Assign the first block column to the CPU, which works on it while waiting for the GPU to finish its GEMMs
This further helps hide the offload cost of GEMM on the GPU.

Programming: code complexity
At each step of the Schur complement update:
  Gemm_division_cpu_gpu()
  Decide how many CUDA streams to use
  For each CUDA stream:
    cudaMemcpyAsync(…, HostToDevice)
    cublasDgemm()
    cudaMemcpyAsync(…, DeviceToHost)
    CPU performs Scatter to destination
Programming, productivity, portability: can a single programming model capture all the abstract machine models?
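
The sketch below shows the shape of this loop with the CUDA runtime and cuBLAS, assuming pinned host buffers and a device-resident copy of the L panel; the slice sizes, buffer layout, and the commented-out scatter helper are illustrative, not SuperLU_DIST's actual code.

```c
/* Sketch of the GEMM/Scatter pipeline using CUDA streams. */
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

void pipelined_update(cublasHandle_t handle, int n_streams,
                      int m, int k, int nb,              /* nb columns per slice */
                      const double *hL, const double *hU, double *hV,  /* pinned */
                      double *dL, double *dU, double *dV)              /* device */
{
    cudaStream_t *streams = malloc((size_t)n_streams * sizeof(cudaStream_t));
    for (int s = 0; s < n_streams; s++) cudaStreamCreate(&streams[s]);

    /* The L panel is needed by every slice: copy it to the device once. */
    cudaMemcpy(dL, hL, (size_t)m * k * sizeof(double), cudaMemcpyHostToDevice);

    const double one = 1.0, zero = 0.0;
    for (int s = 0; s < n_streams; s++) {
        size_t uoff = (size_t)s * k * nb;      /* this slice of U */
        size_t voff = (size_t)s * m * nb;      /* this slice of V */
        cublasSetStream(handle, streams[s]);

        cudaMemcpyAsync(dU + uoff, hU + uoff, (size_t)k * nb * sizeof(double),
                        cudaMemcpyHostToDevice, streams[s]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, nb, k,
                    &one, dL, m, dU + uoff, k, &zero, dV + voff, m);
        cudaMemcpyAsync(hV + voff, dV + voff, (size_t)m * nb * sizeof(double),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    /* As each stream's result arrives, CPU threads scatter it while the
     * remaining streams keep the GPU busy.                              */
    for (int s = 0; s < n_streams; s++) {
        cudaStreamSynchronize(streams[s]);
        /* scatter_slice(hV + (size_t)s * m * nb, m, nb, ...);  hypothetical */
        cudaStreamDestroy(streams[s]);
    }
    free(streams);
}
```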

Performance evaluation: matrices

Name                 Sym   Fill-in   Application
audikw_1*            yes   31.43     structural
bone010*             yes   43.52     model reduction
nd24k*               yes   22.49     2D/3D
RM07R*               no    78        fluid dynamics
dds.quad**           no    20.18     accelerator (Omega3P)
matrix211**          no    9.68      nuclear fusion (M3D-C1)
tdr190k**            no    20.43     accelerator (Omega3P)
Ga19As19H42*         yes   182.16    quantum chemistry
TSOPF_RS_b2383_c1*   no    3.44      power network
dielFilterV2real*    yes   22.39     electromagnetics

Comparison of different hybrid schemes
Baseline: SuperLU_DIST 3.3 with mkl (1 thread)
  - Default settings
  - METIS on A+A^T
  - Maximum supernode size = 144
Implicit parallelism:
  - Multithreaded BLAS: mkl (p threads)
  - CUDA BLAS: cuBLAS + scatter
Explicit parallelism:
  - omp + mkl (1 thread)
  - omp + mkl (1 thread) + cuBLAS
  - omp + mkl (1 thread) + cuBLAS + pipeline (SuperLU_DIST 4.0)

Performance on Dirac cluster at NERSC
Nodes with 2 x 4-core CPUs and Tesla C2040 GPUs; icc + mkl (11.1) + CUDA 5.5

Strong scaling on Dirac cluster

Memory footprint: MPI-only versus hybrid

Conclusions
- BLAS-only GPU acceleration can give up to 2-3x speedup on "denser" matrices
- Slowdown may occur for sparser matrices
- BLAS acceleration leaves Scatter as the bottleneck
- CPU-threaded BLAS (implicit parallelism) may not be sufficient: utilizing all resources is important
- The hybrid code always reduces the memory footprint, by up to 5x

Ongoing and future work
- Optimizing the Scatter phase on CPU and accelerators
  1. Utilizing the high bandwidth of the GPU
- Accelerating the Scatter phase of the computation using a hybrid data structure
  1. New algorithm tried on many-core Xeon Phi
  2. Same algorithm may work for GPUs
- Using accelerators to aggressively overlap computation with MPI communication