Sparse LU Factorization for Parallel Circuit Simulation on GPU Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Department of Electronic Engineering, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University DAC 2012

Outline Introduction Preliminaries Sparse LU factorization on GPU Experimental results Conclusions 1

Introduction Flowchart of a SPICE simulator 2

Introduction (cont.) SPICE takes several days or even weeks to simulate modern designs. –The sparse matrix solver based on LU factorization is invoked iteratively and is hence time-consuming. However, it is difficult to parallelize the sparse solver because of the high data dependency during numeric LU factorization and the irregular structure of circuit matrices. 3

Introduction (cont.) Previous works focus on dense matrices. –Factorizing a sparse matrix with a highly parallelized dense solver is still much slower than a sequential sparse solver. [8]~[13] compute dense blocks on GPU while the rest is done on CPU. –Still the dense idea. [15]~[17] apply the G/P left-looking algorithm on FPGA. –Scalability is limited by FPGA on-chip resources. [18] implements it on multi-core CPU. –Scalability is limited by the number of cores. 4

Introduction (cont.) The multi/many-core era has come. Graphics Processing Units (GPUs) can now be used to perform general-purpose computing. –They have become popular in parallel processing for their cost-effectiveness. State-of-the-art GPUs provide a possible solution to the limited scalability. For now, the latest NVIDIA GeForce GTX 690 has a large number of cores and a large memory. 5

GeForce GTX 690 official spec 6

Contributions Exposing more parallelism for many-core architecture. Ensuring timing order on GPU. Optimizing memory access pattern. 7

Preliminaries Sparse matrix LU factorization (decomposition) GPU architecture and CUDA. 8

Sparse matrix 9

LU factorization 10

CUDA programming Compute Unified Device Architecture. The CPU code does the sequential part. The highly parallel part is usually implemented in GPU code, called a kernel. Invoking a GPU function from CPU code is called a kernel launch. 11
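
As a minimal illustration of this split (the kernel and its launch below are a generic sketch, not code from the paper):

    #include <cuda_runtime.h>

    // GPU code (kernel): each thread scales one element of the vector.
    __global__ void scale(double *x, double alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= alpha;
    }

    // CPU code: sequential setup, followed by a kernel launch.
    void host_part(double *d_x, double alpha, int n) {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_x, alpha, n);   // kernel launch
        cudaDeviceSynchronize();                     // wait for the kernel to finish
    }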

Execution of GPU thread Threads are grouped into thread blocks. Each thread block is assigned to a streaming multiprocessor (SM), which contains multiple streaming processors (SPs), to be executed. The actual execution of threads on SPs is done in groups of 32 threads, called warps. SPs execute one warp at a time. 12
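
For example, inside a kernel a thread can derive which warp it belongs to from its index within the block (assuming the usual warp size of 32):

    __device__ void warp_position(int *warp_id, int *lane_id) {
        int tid  = threadIdx.x;   // index of this thread within its block
        *warp_id = tid / 32;      // which warp of the block the thread is in
        *lane_id = tid % 32;      // position (lane) of the thread within its warp
    }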

GPU architecture nVidia GeForce 9800 GTX 14

Sparse LU factorization on GPU Overall flow Preprocessing Exposing more parallelism Ensuring timing order Optimizing memory access pattern 15

Overall flow 16

Preprocessing The HSL_MC64 algorithm to improve numeric stability. –Finds a permutation matrix. The AMD (Approximate Minimum Degree) algorithm to reduce fill-ins. –Finds a permutation matrix. A G/P-algorithm-based pre-factorization (a complete numeric factorization with partial pivoting) to calculate the symbolic structure of the LU factors. –Also yields the total flop count. 17
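
A hedged host-side sketch of this preprocessing flow; the types and function names (SparseMatrix, mc64_scale_permute, amd_order, gp_prefactor) are placeholders for illustration, not the actual routines used in the paper:

    // Hypothetical preprocessing pipeline (names are illustrative only).
    void preprocess(SparseMatrix *A, SymbolicInfo *S, double *total_flops) {
        mc64_scale_permute(A);           // HSL_MC64: row permutation for numeric stability
        amd_order(A);                    // AMD: column ordering to reduce fill-ins
        gp_prefactor(A, S, total_flops); // G/P pre-factorization with partial pivoting:
                                         // fixes the symbolic structure of L and U and
                                         // yields the total flop count
    }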

Exposing more parallelism Sequential G/P left-looking algorithm 18
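
For reference, a compact dense-index sketch of what one column of the left-looking algorithm computes (no pivoting shown; the paper's version works on the sparse structure fixed in preprocessing):

    // Factor column j, assuming columns 0..j-1 of L are already finished.
    // A, L, U are n-by-n row-major arrays; x is a length-n working vector.
    void left_looking_column(const double *A, double *L, double *U,
                             double *x, int n, int j) {
        for (int i = 0; i < n; i++) x[i] = A[i * n + j];   // x = A(:, j)
        for (int k = 0; k < j; k++) {                      // "look left" at finished columns
            double xk = x[k];                              // this value becomes U(k, j)
            if (xk == 0.0) continue;                       // sparse case: skip zeros
            for (int i = k + 1; i < n; i++)
                x[i] -= xk * L[i * n + k];                 // x(k+1:n) -= U(k,j) * L(k+1:n, k)
        }
        for (int i = 0; i <= j; i++) U[i * n + j] = x[i];  // upper part of the column goes to U
        L[j * n + j] = 1.0;                                // unit diagonal of L
        for (int i = j + 1; i < n; i++)
            L[i * n + j] = x[i] / x[j];                    // lower part, scaled by the pivot
    }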

Exposing more parallelism Dependency graph and scheduler 20

Exposing more parallelism (cont.) Threads that process the same column are treated as a virtual group. In cluster mode –Columns are very sparse, so while ensuring enough threads in total, we make virtual groups as small as possible to minimize idle threads. In pipeline mode –Columns usually contain enough nonzeros for a warp or several warps, so the size of virtual groups matters little in the sense of reducing idle threads. We use one warp as one virtual group. 21

Ensuring timing order

Ensuring timing order (cont.) Suppose columns 8, 9 and 10 are being processed, and all other columns are finished. Column 9 can first be updated with columns 4, 6 and 7, corresponding to the solid green arrows. But column 9 cannot yet be updated with column 8; it must wait for column 8 to finish. The situation is similar for column 10.

Ensuring timing order (cont.) The key is to avoid deadlock. –Not all warps are active at the beginning. –If we activate warps in the wrong order in pipeline mode, deadlock will occur. –There is no inter-warp context switching. 24
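
A hedged sketch of this waiting scheme, using a per-column "finished" flag polled on the GPU (the flag array and the helper names are assumptions for illustration, not the paper's exact data structures):

    // Spin until the producing column is finished before consuming its results.
    // 'finished' has one entry per column and is read through a volatile pointer
    // so repeated polls are not cached in registers.
    __device__ void wait_for_column(volatile const int *finished, int col) {
        while (finished[col] == 0)
            ;   // busy-wait; safe only if the producer warp is guaranteed to be
                // resident, which is why the warp activation order matters
    }

    __device__ void mark_column_done(volatile int *finished, int col) {
        __threadfence();                 // make the column's results visible first
        if (threadIdx.x % 32 == 0)
            finished[col] = 1;           // one lane of the warp sets the flag
    }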

Optimizing memory access pattern The main difference between CPU and GPU parallel programming is memory access. Two alternative data formats for the intermediate vectors (x in Algorithm 2): –CSC (Compressed Sparse Column) sparse vectors. Save space and can be placed in shared memory. –Dense arrays. Have to reside in global memory. Two reasons to choose dense arrays: –The CSC format is inconvenient for indexed accesses. –Using too much shared memory would reduce the number of resident warps per SM, and hence degrade performance. 25

CSC format Specified by the three arrays {val, row_ind, col_ptr}, where val stores the nonzero values, row_ind stores the row index of each nonzero, and col_ptr stores the index in val at which each column of A starts. Example 26
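
A concrete instance for a small 3-by-3 matrix (illustrative values, not the slide's figure):

    // CSC storage of the matrix
    //     [ 1 0 2 ]
    // A = [ 0 3 0 ]
    //     [ 4 0 5 ]
    double val[]     = {1, 4, 3, 2, 5};   // nonzero values, column by column
    int    row_ind[] = {0, 2, 1, 0, 2};   // row index of each nonzero
    int    col_ptr[] = {0, 2, 3, 5};      // start of each column in val; last entry = nnz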

Improve data locality Memory access coalescing. –Several memory transactions can be coalesced into one when consecutive threads access consecutive memory locations. –Sort the nonzeros in L and U by their row indices to improve data locality. 27
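
The intended access pattern looks roughly like this: after sorting a column's nonzeros by row index, the 32 lanes of a warp read consecutive entries of val and row_ind, so their loads coalesce (a simplified sketch, not the paper's exact kernel):

    // Each lane of the warp (virtual group) handles a strided share of column k's nonzeros.
    __device__ void update_with_column(const double *val, const int *row_ind,
                                       int start, int end, double xk, double *x) {
        int lane = threadIdx.x % 32;
        for (int p = start + lane; p < end; p += 32) {
            // val[p] and row_ind[p] are consecutive across lanes -> coalesced loads;
            // the scattered writes to the dense vector x are what sorting cannot fix.
            x[row_ind[p]] -= xk * val[p];
        }
    }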

Effectiveness of sorting GPU bandwidth increase is about 2.4x on average. CPU sparse LU factorization also benefits from sorted nonzeros, but the bandwidth increase is only 1.15x. 28

Experimental results Environments –2 Xeon E5405 CPUs, 2 Xeon X5680 CPUs, an AMD Radeon 5870 GPU, and an NVIDIA GTX 580 GPU. –Experiments on CPU are implemented in C on a 64-bit Linux server. The Radeon 5870 is programmed using OpenCL v1.1; the GTX 580 is programmed with CUDA 4.0. Benchmarks are from the University of Florida Sparse Matrix Collection. 29

Devices specifications 30

Performance and speedup Group A contains cases under 200 Mflops. –Results are mostly worse than the CPU version. Group B contains cases over 200 Mflops. Group C contains cases with many denormal numbers during factorization. –The CPU cannot handle denormals at normal speed, so the GPU achieves great speedups. 31

Performance and speedup (cont.) We can see that the GPU bandwidth is positively related to Mflops, which indicates that in sparse LU factorization the high memory bandwidth of the GPU can be exploited only when the problem scale is large enough. 32

Scalability Analysis The average and detailed performance on the four devices are listed in the table and figure, respectively. 33

Scalability Analysis (cont.) The best performance is attained with about 24 resident warps per SM, rather than with the maximum number of resident warps. On the GTX 580, it achieves at most 74% of peak bandwidth. 34

Scalability Analysis (cont.) On the Radeon 5870, it achieves at most 45% of peak bandwidth (on xenon1). A primary reason is that there are too few active wavefronts on the Radeon 5870 to fully utilize the global memory bandwidth. On the two CPUs and the Radeon 5870 GPU, the bandwidth keeps increasing with the number of issued threads (wavefronts). 35

Hybrid solver for circuit simulation Matrices with few flops in factorization are not suitable for GPU acceleration. –Combine the CPU and GPU versions into a hybrid solver. –Choose one of them based on the flop count obtained in preprocessing. 36
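
A minimal host-side sketch of that selection; the 200 Mflops cut-off mirrors the grouping in the results, but the actual threshold and function names here are assumptions:

    // Pick the CPU or GPU factorization path from the flop count
    // estimated during pre-factorization.
    void factorize_hybrid(Solver *s, double mflops) {
        const double kThreshold = 200.0;   // Mflops; illustrative cut-off
        if (mflops < kThreshold)
            factorize_cpu(s);              // small problems: sequential CPU sparse LU
        else
            factorize_gpu(s);              // large problems: GPU sparse LU
    }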

Conclusions The sparse matrix solver is one of the runtime bottlenecks of SPICE. This is the first work on GPU-based sparse LU factorization intended for circuit simulation. Experiments demonstrate that the GPU outperforms the CPU on matrices with many floating-point operations in factorization. 37