1 SuperLU_DIST on GPU Cluster
Sherry Li, FASTMath Meeting, Oct. 1-2, 2014
"A distributed CPU-GPU sparse direct solver", P. Sao, R. Vuduc and X.S. Li, Euro-Par 2014, LNCS Vol. 8632, Porto, Portugal, August 25-29, 2014.

2 SuperLU_DIST: Algorithm and data distribution
2D block-cyclic data distribution
Owner-update policy
L and U are stored in different sparse formats
Right-looking factorization
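
A minimal sketch of the owner-update idea, assuming the usual 2D block-cyclic layout; block_owner is an illustrative helper, not SuperLU_DIST's actual routine:

    /* Block (i,j) lives on process (i mod Pr, j mod Pc) of the Pr x Pc
       process grid; only that owner applies updates to the block. */
    static inline int block_owner(int i, int j, int Pr, int Pc)
    {
        return (i % Pr) * Pc + (j % Pc);  /* row-major rank in the grid */
    }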

3 Schur complement update
The Schur complement update is done in three steps:
1. Gather: pack the sparse operands into a dense, BLAS-compliant format
2. GEMM: dense matrix-matrix multiply of the gathered L and U panels
3. Scatter: scatter the dense output back into the sparse format
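
The GEMM step can be made concrete with a small sketch; it assumes column-major dense buffers produced by Gather, and the names are illustrative rather than SuperLU_DIST's API:

    #include <cblas.h>

    /* One Schur-complement update: V = Lpanel * Upanel, computed densely.
       Gather has already packed Lpanel (m x k) and Upanel (k x n);
       Scatter then subtracts V from the sparse destination blocks. */
    void schur_gemm(int m, int n, int k,
                    const double *Lpanel, const double *Upanel, double *V)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, Lpanel, m, Upanel, k, 0.0, V, m);
    }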

4 Offloading BLAS calls to GPU
Considerations (see the cost-model sketch below):
1. Operand sizes for BLAS are small
2. PCIe latencies and transfer costs are high
3. The cost of the Scatter phase is significant
4. Device/PCIe resource contention
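
These considerations amount to a cost model. A hedged back-of-the-envelope version, where both the rule and its inputs are assumptions rather than the paper's exact criterion:

    /* Offload a GEMM only if the estimated GPU time, including PCIe
       transfer of operands and result, beats the CPU estimate. */
    int worth_offloading(long m, long n, long k,
                         double cpu_gflops, double gpu_gflops,
                         double pcie_gb_s, double pcie_latency_s)
    {
        double flops = 2.0 * m * n * k;
        double bytes = 8.0 * ((double)m*k + (double)k*n + (double)m*n);
        double t_cpu = flops / (cpu_gflops * 1e9);
        double t_gpu = flops / (gpu_gflops * 1e9)
                     + bytes / (pcie_gb_s * 1e9) + pcie_latency_s;
        return t_gpu < t_cpu;
    }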

5 Aggregating BLAS calls
Aggregating BLAS calls increases the operand size
Requires fewer transfers to the device and back
May not increase arithmetic intensity
Requires a buffer for the temporary product
GPU memory may be limited, in which case we slice the matrix so that it fits into GPU/CPU memory (see the sketch below)
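
A minimal sketch of the slicing logic, assuming double precision and using cudaMemGetInfo for the free-memory query; the safety margin and slicing rule are illustrative:

    #include <cuda_runtime.h>

    /* Pick the number of column slices so that the L panel (sent once)
       plus one slice of U and of the product buffer V fit in half of
       the currently free device memory. */
    int num_slices(long m, long n, long k)
    {
        size_t free_b, total_b;
        cudaMemGetInfo(&free_b, &total_b);
        size_t fixed   = sizeof(double) * (size_t)m * k;    /* L panel      */
        size_t per_col = sizeof(double) * (size_t)(k + m);  /* U + V column */
        int s = 1;
        while (s < n && fixed + per_col * (size_t)(n / s) > free_b / 2)
            ++s;
        return s;
    }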

6 Pipelining GEMM on GPU and (multithreaded) Scatter on CPU
Use CUDA's stream facility to pipeline GEMM calls on the GPU with Scatter on the CPU:
1. Slice U greedily into n_s "equal" partitions, each containing about n_b columns (a sketch of this slicing follows below)
2. Schedule each part on its own CUDA stream
3. Assign the first block column to the CPU, which works on it while waiting for the GPU to finish GEMM
This further helps hide the offload cost of GEMM on GPUs.
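
The greedy slicing in step 1 might look like the sketch below; the exact rule in SuperLU_DIST may differ, and col_width (the width of each block column of U) is an assumed input:

    #include <vector>

    /* Return the first block column of each slice: cut a new slice
       whenever the accumulated width reaches the per-slice target. */
    std::vector<int> slice_U(const std::vector<int>& col_width, int n_s)
    {
        long total = 0;
        for (int w : col_width) total += w;
        long target = (total + n_s - 1) / n_s;    /* ~ n_b columns each */
        std::vector<int> first;
        long acc = target;                        /* force a cut at j=0 */
        for (int j = 0; j < (int)col_width.size(); ++j) {
            if (acc >= target) { first.push_back(j); acc = 0; }
            acc += col_width[j];
        }
        return first;
    }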

7 Programming
Code complexity. At each step of the Schur complement update:

    Gemm_division_cpu_gpu()
    Decide how many CUDA streams to use
    For each CUDA stream:
        cudaMemcpyAsync(…, HostToDevice)
        cublasDgemm()
        cudaMemcpyAsync(…, DeviceToHost)
        CPU performs Scatter to destination

(A fuller sketch of this loop follows below.)
Programming, productivity, portability: can a single programming model capture all the abstract machine models?
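
The loop above, written out as a hedged CUDA/cuBLAS sketch; buffer management, pinned host memory, and the scatter call are simplified stand-ins for what the solver actually does:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    /* dL: L panel already resident on the GPU.  For slice s, hU/dU hold
       the k x ncol[s] piece of U and hV/dV the m x ncol[s] product;
       host buffers are assumed pinned (cudaHostAlloc) so the async
       copies overlap with the GEMMs running on other streams. */
    void gemm_offload(cublasHandle_t h, int n_s, cudaStream_t *st,
                      int m, int k, const double *dL, const int *ncol,
                      double **hU, double **dU, double **hV, double **dV)
    {
        const double one = 1.0, zero = 0.0;
        for (int s = 0; s < n_s; ++s) {
            int n = ncol[s];
            cudaMemcpyAsync(dU[s], hU[s], sizeof(double)*(size_t)k*n,
                            cudaMemcpyHostToDevice, st[s]);
            cublasSetStream(h, st[s]);
            cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &one, dL, m, dU[s], k, &zero, dV[s], m);
            cudaMemcpyAsync(hV[s], dV[s], sizeof(double)*(size_t)m*n,
                            cudaMemcpyDeviceToHost, st[s]);
        }
        for (int s = 0; s < n_s; ++s) {
            cudaStreamSynchronize(st[s]);
            /* CPU threads scatter hV[s] into the sparse L/U here. */
        }
    }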

8 Performance evaluation: Matrices

    Name                 N        Nnz        Nnz/n   Sym  Fill-in  Application
    audikw_1*            943695   77651847   82.28   yes  31.43    structural
    bone010*             986703   47851783   48.49   yes  43.52    model reduction
    nd24k*               72000    28715634   398.82  yes  22.49    2D/3D
    RM07R*               381689   37464962   98.15   no   78       fluid dynamics
    dds.quad**           380698   15844364   41.61   no   20.18    Accelerator (Omega3P)
    matrix211**          801378   129413052  161.48  no   9.68     Nuclear Fusion (M3D-C1)
    tdr190k**            1100242  43318292   39.37   no   20.43    Accelerator (Omega3P)
    Ga19As19H42*         133123   8884839    66.74   yes  182.16   quantum chemistry
    TSOPF_RS_b2383_c1*   38120    16171169   424.21  no   3.44     power network
    dielFilterV2real*    1157456  48538952   41.93   yes  22.39    electromagnetics

9 Comparison of different hybrid schemes
Baseline: SuperLU_DIST 3.3 + mkl (1 thread)
    Default settings
    METIS on A+A^T
    Maximum supernode size = 144
Implicit parallelism
    Multithreaded BLAS: mkl (p threads)
    CUDA BLAS (cuBLAS + scatter)
Explicit parallelism
    omp + mkl (1 thread)
    omp + mkl (1 thread) + cuBLAS
    omp + mkl (1 thread) + cuBLAS + pipeline (SuperLU_DIST 4.0)

10 Performance on the Dirac cluster at NERSC
Nodes: 2 x 4-core Intel "Nehalem" @ 2.4 GHz + 1 NVIDIA Tesla C2050
icc + mkl (11.1) + CUDA 5.5

11 Strong scaling on the Dirac cluster

12 Memory footprint: MPI-only versus hybrid

13 Conclusions
BLAS-only GPU acceleration can give up to 2-3x speedup on "denser" matrices
Slowdown may occur for sparser matrices
BLAS acceleration leaves Scatter as the bottleneck
CPU-threaded BLAS (implicit parallelism) may not be sufficient: utilizing all resources is important
The hybrid approach always reduces the memory footprint, by up to 5x

14 Ongoing and future work
Optimizing the Scatter phase on CPU and accelerators
1. Utilizing the high bandwidth of the GPU
Accelerating the Scatter phase of the computation using a hybrid data structure
1. New algorithm tried on many-core Xeon Phi
2. The same algorithm may work for GPUs
Using accelerators to aggressively overlap computation with MPI communication

