Isaac Lyngaas, John Paige. Advised by: Srinath Vadlamani & Doug Nychka. SIParCS, July 31, 2014
Why use HPC with R?
Accelerating mKrig & Krig
Parallel Cholesky
◦ Software Packages
Parallel Eigendecomposition
Conclusions & Future Work
Accelerate the ‘fields’ Krig and mKrig functions
Survey of parallel linear algebra software
◦ Multicore (shared memory)
◦ GPU
◦ Xeon Phi
Many developers & users in the field of statistics
◦ Readily available code base
Problem: R is slow for large problems
Bottleneck is in linear algebra operations
◦ mKrig – Cholesky decomposition
◦ Krig – eigendecomposition
R uses sequential algorithms
Strategy: use C-interoperable libraries to parallelize the linear algebra
◦ C functions callable through the R environment (see the sketch below)
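A minimal sketch of how such a wrapper is called from R. The names "parallel_chol.so" and "plasma_chol_wrap" are hypothetical stand-ins for the actual shared library and C entry point built around the parallel libraries.

# Load the compiled wrapper and expose it through R's .C interface
dyn.load("parallel_chol.so")                 # hypothetical shared library
parallelChol <- function(A) {
  n <- nrow(A)
  out <- .C("plasma_chol_wrap",              # hypothetical C entry point
            a = as.double(A),                # matrix passed as a flat double vector
            n = as.integer(n))
  matrix(out$a, n, n)                        # C routine overwrites 'a' with the factor
}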
Symmetric positive definite -> triangular
◦ A = LL^T
◦ Nice properties for determinant calculation
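The determinant property referred to above, demonstrated with base R's chol() (serial here, but the identity is the same for the parallel factorizations):

# If A = L L^T, then det(A) = prod(diag(L))^2, so
# log det(A) = 2 * sum(log(diag(L))) -- cheap and numerically stable
set.seed(1)
X <- matrix(rnorm(2000), 400, 5)
A <- crossprod(X)                            # symmetric positive definite 5 x 5
L <- t(chol(A))                              # chol() returns the upper factor; transpose to L
logdet_chol <- 2 * sum(log(diag(L)))
logdet_ref  <- as.numeric(determinant(A, logarithm = TRUE)$modulus)
all.equal(logdet_ref, logdet_chol)           # TRUE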
◦ PLASMA (multicore, shared memory)
◦ MAGMA (GPU & Xeon Phi)
◦ CULA (GPU)
Multicore (shared memory)
Block scheduling
◦ Determines which operations are done on which core
Block size optimization
◦ Dependent on cache memory
[Figure: speedup vs. 1 core for PLASMA on 1 node (# of observations = 25,000); x-axis: # of cores, y-axis: speedup, with measured and optimal speedup curves]
[Figure: PLASMA on a dual-socket Sandy Bridge node (# of observations = 15,000, cores = 16); time (sec) vs. block size]
[Figure: PLASMA optimal block size vs. # of observations (cores = 16)]
Utilizes GPUs or Xeon Phis for parallelization
◦ Multiple-GPU & multiple-Xeon Phi implementations available
◦ 1 CPU core drives 1 GPU
Block scheduling
◦ Similar to PLASMA
Block size dependent on accelerator architecture
Proprietary CUDA-based linear algebra package
Capable of performing LAPACK operations on 1 GPU
API written in C
Dense & sparse operations available
1 node of Caldera or Pronghorn
◦ 2 x 8-core Intel Xeon (Sandy Bridge) processors per node
  64 GB RAM (~59 GB available)
  Cache per core: L1 = 32 KB, L2 = 256 KB
  Cache per socket: L3 = 20 MB
◦ 2 x NVIDIA Tesla M2070Q GPUs (Caldera)
  ~5.2 GB RAM per device
  1 core drives 1 GPU
◦ 2 x Xeon Phi 5110P (Pronghorn)
  ~7.4 GB RAM per device
Serial R: ~3 GFLOP/sec
Theoretical peak performance:
◦ 16-core Xeon Sandy Bridge: ~333 GFLOP/sec
◦ 1 NVIDIA Tesla M2070Q: ~512 GFLOP/sec
◦ 1 Xeon Phi 5110P: ~1,011 GFLOP/sec
[Figure: "Accelerated Hardware has Room for Improvement" – achieved GFLOP/sec vs. # of observations for PLASMA (16 cores), MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 MIC, MAGMA 2 MICs, and CULA]
[Figure: "All Parallel Cholesky Implementations are Faster than Serial R" – time (sec) vs. # of observations for serial R, PLASMA (16 cores), CULA, MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 Xeon Phi, MAGMA 2 Xeon Phis]
>100x speedup over serial R when # of observations = 10k
[Figure: "Eigendecomposition also Faster on Accelerated Hardware" – time (sec) vs. # of observations for serial R, CULA, MAGMA 1 GPU, MAGMA 2 GPUs]
~6x speedup over serial R when # of observations = 10k
[Figure: ratio of eigendecomposition time to Cholesky time vs. # of observations; both times measured with MAGMA w/ 2 GPUs]
Can run ~30 Cholesky decompositions in the time of one eigendecomposition
If we want to do 16 Cholesky decompositions, parallel R can run one serial factorization per core; a parallel Cholesky using the whole node is guaranteed to do better once its speedup over one core exceeds 16.
[Figure: "Parallel Cholesky Beats Parallel R for Moderate to Large Matrices" – speedup vs. parallel R as a function of # of observations, for PLASMA and MAGMA 2 GPUs]
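A back-of-the-envelope check of that reasoning (assumed timing model, not measured data): with 16 jobs on 16 cores, parallel R finishes in roughly one serial factorization time, while running the jobs back-to-back with a parallel Cholesky takes 16/speedup of that.

# njobs serial factorizations, one per core, vs. running them sequentially
# with a parallel Cholesky that achieves the given speedup over one core
breakeven <- function(t_serial, speedup, njobs = 16) {
  c(parallel_R    = t_serial,                   # all jobs at once, one per core
    parallel_chol = njobs * t_serial / speedup) # jobs back-to-back on the whole node
}
breakeven(t_serial = 60, speedup = 100)         # parallel Cholesky wins whenever speedup > 16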
Using Caldera
◦ Single Cholesky decomposition
◦ Matrix size < 20k: use PLASMA (16 cores w/ optimal block size)
◦ Matrix size 20k – 35k: use MAGMA w/ 2 GPUs
◦ Matrix size > 35k: use PLASMA (16 cores w/ optimal block size)
Dependent on computing resources available (see the sketch below)
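A hypothetical dispatcher reflecting the Caldera recommendation above; cholPLASMA() and cholMAGMA() are placeholders for the R wrappers around the compiled PLASMA and MAGMA routines, not functions from the original code base.

# Pick the Cholesky backend by matrix size (single factorization, one Caldera node)
cholBest <- function(A) {
  n <- nrow(A)
  if (n < 20000) {
    cholPLASMA(A, cores = 16)        # multicore PLASMA with tuned block size
  } else if (n <= 35000) {
    cholMAGMA(A, ngpu = 2)           # MAGMA with 2 GPUs
  } else {
    cholPLASMA(A, cores = 16)        # back to PLASMA per the benchmark results
  }
}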
Explored implementations on accelerated hardware
◦ Multicore (shared memory)
◦ GPUs
◦ Xeon Phis
Installed third-party linear algebra packages & programmed wrappers that call these packages from R
◦ Installation instructions and programs available through a Bitbucket repo; for access contact Srinath Vadlamani
Future work
◦ Multicore, distributed memory
◦ Single precision
Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data. R package, 2014.
Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series, volume 180. IOP Publishing, 2009.
Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. of VECPAR'10, Berkeley, CA, June 22-25, 2010.
Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September 2013.
[Figure: steps of the tile Cholesky factorization, showing the kernel applied to each block at each step (xPOTRF, xTRSM, xSYRK, xGEMM) through to the final factored matrix]