1
Isaac Lyngaas (irlyngaas@gmail.com) and John Paige (paigejo@gmail.com)
Advised by: Srinath Vadlamani (srinathv@ucar.edu) & Doug Nychka (nychka@ucar.edu)
SIParCS, July 31, 2014
2
Why use HPC with R?
Accelerating mKrig & Krig
Parallel Cholesky
◦ Software Packages
Parallel Eigen Decomposition
Conclusions & Future Work
3
Accelerate the 'fields' Krig and mKrig functions
Survey of parallel linear algebra software:
◦ Multicore (shared memory)
◦ GPU
◦ Xeon Phi
4
Many developers & users in the field of statistics
◦ Readily available code base
Problem: R is slow for large problems
5
Bottleneck is in linear algebra operations:
◦ mKrig – Cholesky decomposition
◦ Krig – eigendecomposition
R uses sequential algorithms
Strategy: use C-interoperable libraries to parallelize the linear algebra
◦ C functions callable from the R environment (see the sketch below)
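As an illustration of the wrapper approach, here is a minimal sketch of how an R function could hand a matrix to a compiled parallel Cholesky routine via .C(); the shared library name parallel_chol.so and the C symbol par_chol are hypothetical placeholders, not the actual wrappers from the project's repository.

```r
# Minimal sketch: calling a compiled parallel Cholesky routine from R.
# "parallel_chol.so" and the symbol "par_chol" are hypothetical placeholders.
parChol <- function(A) {
  n <- nrow(A)
  if (!is.loaded("par_chol")) dyn.load("parallel_chol.so")
  out <- .C("par_chol",
            A = as.double(A),    # matrix passed column-major, overwritten with the factor
            n = as.integer(n))
  L <- matrix(out$A, n, n)
  L[upper.tri(L)] <- 0           # keep only the lower-triangular factor
  L
}
```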
6
Cholesky decomposition: symmetric positive definite -> triangular
◦ A = LL^T
◦ Nice properties for determinant calculation: log det(A) is twice the sum of the logs of the diagonal of L
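For example, the same factorization and determinant shortcut can be checked in base R (a small self-contained check, not part of the fields code):

```r
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
A <- crossprod(X) + diag(5)          # symmetric positive definite 5 x 5 matrix

U <- chol(A)                         # base R returns the upper factor, A = U^T U
logdet_chol <- 2 * sum(log(diag(U))) # log det(A) from the Cholesky diagonal
logdet_ref  <- determinant(A, logarithm = TRUE)$modulus

all.equal(as.numeric(logdet_ref), logdet_chol)  # TRUE
```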
7
PLASMA (multicore, shared memory)
◦ http://icl.cs.utk.edu/plasma/
MAGMA (GPU & Xeon Phi)
◦ http://icl.cs.utk.edu/magma/
CULA (GPU)
◦ http://www.culatools.com/
8
PLASMA: multicore (shared memory)
Block scheduling
◦ Determines what operations should be done on which core
Block size optimization
◦ Dependent on cache memory
9
[Figure: PLASMA Cholesky speedup using 1 node (# of observations = 25,000). x-axis: # of cores (1, 2, 4, 8, 12, 16); y-axis: speedup vs. 1 core (0–15); series: measured speedup and optimal speedup.]
10
[Figure: PLASMA on dual-socket Sandy Bridge (# of observations = 15,000, 16 cores). x-axis: block size (500–1500); y-axis: time (sec, roughly 3–7); reference marks at the 256 KB and 40 MB cache sizes.]
11
[Figure: PLASMA optimal block sizes (16 cores). x-axis: # of observations (0–40,000); y-axis: optimal block size (100–600).]
12
MAGMA: utilizes GPUs or Xeon Phi for parallelization
◦ Multiple-GPU & multiple-Xeon Phi implementations available
◦ 1 CPU core drives 1 GPU
Block scheduling
◦ Similar to PLASMA
Block size dependent on accelerator architecture
13
CULA: proprietary CUDA-based linear algebra package
Capable of doing LAPACK operations using 1 GPU
API written in C
Dense & sparse operations available
14
1 node of Caldera or Pronghorn
◦ 2 x 8-core Intel Xeon E5-2670 (Sandy Bridge) processors per node
  64 GB RAM (~59 GB available)
  Cache per core: L1 = 32 KB, L2 = 256 KB
  Cache per socket: L3 = 20 MB
◦ 2 x Nvidia Tesla M2070Q GPUs (Caldera)
  ~5.2 GB RAM per device
  1 CPU core drives 1 GPU
◦ 2 x Xeon Phi 5110P (Pronghorn)
  ~7.4 GB RAM per device
15
Serial R: ~3 GFLOP/sec
Theoretical peak performance:
◦ 16-core Xeon Sandy Bridge: ~333 GFLOP/sec
◦ 1 Nvidia Tesla M2070Q: ~512 GFLOP/sec
◦ 1 Xeon Phi 5110P: ~1,011 GFLOP/sec
[Figure: Accelerated Hardware Has Room for Improvement. Achieved GFLOP/sec (0–400) vs. # of observations (0–40,000); series: PLASMA (16 cores), MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 MIC, MAGMA 2 MICs, CULA.]
16
All parallel Cholesky implementations are faster than serial R
[Figure: Cholesky time (sec, log scale from 0.01 to 1000) vs. # of observations (0–40,000); series: serial R, PLASMA (16 cores), CULA, MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 Xeon Phi, MAGMA 2 Xeon Phis.]
>100x speedup over serial R when # of observations = 10k
17
Eigendecomposition is also faster on accelerated hardware
[Figure: eigendecomposition time (sec, 0–300) vs. # of observations (0–10,000); series: serial R, CULA, MAGMA 1 GPU, MAGMA 2 GPUs.]
~6x speedup over serial R when # of observations = 10k
18
Can run ~30 Cholesky decompositions per eigendecomposition
◦ Both times taken using MAGMA w/ 2 GPUs
[Figure: ratio of eigendecomposition time to Cholesky time (0–30) vs. # of observations (0–10,000).]
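To get a rough feel for the same ratio in serial R, a quick timing comparison like the one below can be run; the numbers will differ from the MAGMA 2-GPU measurements reported above.

```r
set.seed(1)
n <- 2000
X <- matrix(rnorm(n * n), n, n)
A <- crossprod(X) / n + diag(n)   # symmetric positive definite test matrix

t_chol  <- system.time(chol(A))["elapsed"]
t_eigen <- system.time(eigen(A, symmetric = TRUE))["elapsed"]

# Ratio of eigendecomposition time to Cholesky time in serial R;
# the slide reports roughly 30x for MAGMA with 2 GPUs.
t_eigen / t_chol
```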
19
Parallel Cholesky beats parallel R for moderate to large matrices
If we want to do 16 Cholesky decompositions in parallel, we are guaranteed better performance when speedup > 16 (see the sketch below)
[Figure: speedup vs. parallel R as a function of # of observations (0–20,000), y-axis 0–25; series: PLASMA, MAGMA 2 GPUs.]
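A minimal sketch of the parallel-R baseline assumed here: 16 independent serial Cholesky factorizations run concurrently, one per core, with parallel::mclapply(); a single accelerated factorization only wins if its speedup over one serial chol() call exceeds 16.

```r
# Assumed "parallel R" baseline: 16 independent serial Cholesky
# factorizations, one per core, run concurrently with mclapply().
library(parallel)

set.seed(1)
n  <- 2000
As <- replicate(16, {                        # 16 independent SPD test matrices
  X <- matrix(rnorm(n * n), n, n)
  crossprod(X) / n + diag(n)
}, simplify = FALSE)

# Wall-clock time for 16 factorizations spread over 16 cores.
t_parallel_R <- system.time(
  mclapply(As, chol, mc.cores = 16)
)["elapsed"]
t_parallel_R
```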
20
Recommendations using Caldera (single Cholesky decomposition):
◦ Matrix size < 20k: use PLASMA (16 cores w/ optimal block size)
◦ Matrix size 20k – 35k: use MAGMA w/ 2 GPUs
◦ Matrix size > 35k: use PLASMA (16 cores w/ optimal block size)
Recommendation is dependent on the computing resources available (see the sketch below)
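As a concrete translation of this decision rule, a dispatcher in R might look like the following; chol_plasma() and chol_magma_2gpu() are hypothetical placeholder names, not the actual wrappers from the project's repository.

```r
# Hypothetical dispatcher implementing the Caldera recommendation above.
# chol_plasma() and chol_magma_2gpu() are placeholder names for the wrappers.
cholAccelerated <- function(A) {
  n <- nrow(A)
  if (n >= 20000 && n <= 35000) {
    chol_magma_2gpu(A)              # 20k - 35k observations: MAGMA with 2 GPUs
  } else {
    chol_plasma(A, cores = 16)      # otherwise: PLASMA, 16 cores, tuned block size
  }
}
```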
21
Explored implementations on accelerated hardware:
◦ GPUs
◦ Multicore (shared memory)
◦ Xeon Phis
Installed third-party linear algebra packages & programmed wrappers that call these packages from R
◦ Installation instructions and programs available through a Bitbucket repository; for access contact Srinath Vadlamani
Future work:
◦ Multicore distributed memory
◦ Single precision
22
Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data, 2014. URL: http://CRAN.R-project.org/package=fields. R package version 7.1.
Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009.
Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. of VECPAR'10, Berkeley, CA, June 22–25, 2010.
Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September 2013.
23
[Backup figure: task DAG for the tiled Cholesky factorization, showing the xPOTRF, xTRSM, xSYRK, and xGEMM kernels and their dependencies. Source: http://www.netlib.org/lapack/lawnspdf/lawn223.pdf]