Performance Benchmarking of the R Programming Environment on Knight's Landing
Scott Michael, Indiana University
July 6, 2017
Who am I?
- Theoretical astrophysicist, NOT a statistician
- HPC application optimization and performance tuning
- Lead the Research Analytics team in Research Technologies at Indiana University
Contributors
- IU: Eric Wernert, Jefferson Davis, James McCombs, Esen Tuna
- TACC: Bill Barth, Tommy Minyard, David Walling
Talk Overview
Targeting productivity languages for Xeon Phi based architectures:
- Motivation and history
- Benchmark results and lessons learned
- The RHPCBenchmark package
- Future directions
- Conclusions
IU, the Stampede Supercomputer, and Xeon Phi
- IU Research Technologies partners with TACC, collaborating on systems and support
  - Stampede: largest XSEDE machine by core count
  - Wrangler: data-intensive computing, with 20 PB of out-of-region replication
  - Jetstream: XSEDE production science cloud
- IU supports data-intensive and "high productivity" languages on Stampede, including R, Python, and MATLAB
- Large transition between Stampede 1 and Stampede 2
Evolution of Xeon Phi
              Knight's Corner (KNC)         Knight's Landing (KNL)
Deployment    Coprocessor only              Coprocessor or self-hosted
Peak (DP)     1 TF                          3 TF
Memory        8GB device + system memory    16GB MCDRAM + system memory
R Support on Stampede 1 & 2
- Primary support for R on Stampede 1
- Supported several methods for distributed R (pbdR, Rmpi, snow, etc.)
- R built in offload mode
- R configured to use the GPUs in a portion of Stampede via HiPLAR
- However, much of the R workload on Stampede didn't rely on KNC
                Stampede 1                    Stampede 2
Nodes           6,400                         4,200
Interconnect    FDR IB                        OmniPath v1
Filesystem      14 PB Lustre                  -
Processor       Dual E5-2680 "SandyBridge"    Phi 7250
Phi             SE10P                         -
Memory          32GB DDR3 + 8GB GDDR5         16GB MCDRAM
R Performance on KNL
- KNL is the sole processor on Stampede 2
- KNL has shown good performance for large-scale HPC codes (MD, climate, astro, etc.)
- How does KNL perform with a language like R?
KNL Architecture
- Intel(R) Xeon Phi(TM) CPU 7250 @ 1.60GHz (68 physical cores)
- Features of note for KNL
  - Tiled architecture supporting 4 SMT threads per physical core
KNL Architecture (cont.)
- Features of note for KNL
  - 16GB of on-package MCDRAM acts as fast memory and can be configured in several modes (e.g. flat or cache)
Benchmarking Strategy
- Look at industry-standard performance benchmarks for R on KNL and compare to SandyBridge (SNB)
- Further explore some exemplar workflows in each language and compare to benchmark results
- Compare both single-node and multinode benchmarks
Benchmarking Strategy
- R standard benchmark: R-25
  - Very old, with fixed (small) problem sizes and report output that is challenging to parse
  - Reasonable mix of mini-kernels focused on dense matrix operations and linear solvers
- R benchmark for scalability, focused on kernels similar to those in R-25
  - Built for distribution and flexibility; currently available on CRAN as RHPCBenchmark
  - https://github.com/IUResearchAnalytics/RBenchmarking
R Benchmark Results
- R generally lacks multithreading (exceptions include fork-based parallelism such as mclapply), so we rely on the threading in MKL
- Standard profiling/tracing tools are challenging to employ
- Instrumenting the entire R interpreter creates too much overhead
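Since the threading here comes from the linear algebra library rather than from R itself, it can help to confirm what a given R session sees before benchmarking. The sketch below is illustrative only: it assumes an MKL-linked R build whose thread count is controlled through the MKL_NUM_THREADS or OMP_NUM_THREADS environment variables, normally set in the shell or job script before R starts.

```r
# Report the settings that govern threaded BLAS/LAPACK in this R session.
# Assumption: thread counts are controlled via these environment variables.
thread_env <- Sys.getenv(c("MKL_NUM_THREADS", "OMP_NUM_THREADS"), unset = NA)
print(thread_env)

# Logical CPUs visible to R; with 4-way SMT this can exceed the physical
# core count, so it is not necessarily the best thread count to use.
n_logical <- parallel::detectCores()
print(n_logical)

# Path of the LAPACK library in use (available in R >= 3.4.0). It may be an
# empty string on some builds.
lapack_path <- La_library()
print(lapack_path)
```

An NA entry means the variable is unset, in which case the library picks its own default thread count.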
R Benchmark Results
- Benchmarks include Cholesky decomposition, eigendecomposition, least-squares fit, linear solve, QR decomposition, matrix cross product, matrix determinant, matrix-matrix multiply, and matrix-vector multiply
- Multiple threads per core aren't useful, in contrast to KNC
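For readers unfamiliar with these kernels, each one maps onto a base R call. The following is a minimal sketch of timing them once with system.time(), not the benchmark code used for these results. The problem size, seed, and choice of base R functions are assumptions made for illustration.

```r
# Minimal sketch: time one run of each dense-matrix kernel in base R.
# n and the seed are illustrative assumptions, not values from the talk.
set.seed(42)
n <- 300
A <- matrix(rnorm(n * n), n, n)
b <- rnorm(n)
S <- crossprod(A) + diag(n)  # symmetric positive definite input

kernels <- list(
  cholesky    = function() chol(S),
  eigen       = function() eigen(S, symmetric = TRUE),
  ls_fit      = function() lm.fit(A, b),
  solve       = function() solve(A, b),
  qr          = function() qr(A),
  crossprod   = function() crossprod(A),
  determinant = function() det(A),
  mat_mat     = function() A %*% A,
  mat_vec     = function() A %*% b
)

# Elapsed wall-clock seconds for a single run of each kernel.
elapsed <- vapply(kernels,
                  function(f) system.time(f())[["elapsed"]],
                  numeric(1))
print(data.frame(kernel = names(elapsed), seconds = unname(elapsed)))
```

A single run at one size is noisy; a real benchmark would repeat each kernel several times and sweep the matrix dimension.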
R Benchmark Results
- For some benchmarks, single-core KNL outperforms SNB
R Benchmark Results
- Need large matrices to make full use of all 68 cores
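One way to see the dependence on matrix size is to convert timings into an achieved rate. The sketch below assumes the usual approximation of 2n^3 floating-point operations for an n x n matrix-matrix multiply. The sizes are deliberately small and illustrative; the matrices needed to saturate 68 cores would be much larger.

```r
# Sketch: achieved GFLOP/s for matrix-matrix multiply at several sizes.
# Assumption: an n x n multiply costs roughly 2 * n^3 floating-point ops.
set.seed(7)
sizes <- c(200, 400, 800)  # illustrative sizes only

gflops <- vapply(sizes, function(n) {
  A <- matrix(rnorm(n * n), n, n)
  B <- matrix(rnorm(n * n), n, n)
  secs <- system.time(A %*% B)[["elapsed"]]
  # Floor the time at 1 ms so a run below timer resolution does not divide
  # by zero; such entries are lower bounds on the true rate.
  (2 * n^3) / max(secs, 1e-3) / 1e9
}, numeric(1))

print(data.frame(n = sizes, gflops = gflops))
```

With a threaded BLAS, the rate typically rises with n until the working set is large enough to keep every core busy.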
R Benchmark Results
- For math-intensive kernels, R interpreter overhead is modest
RHPCBenchmark Package
- The initial release of RHPCBenchmark is available on CRAN
- Provides a variety of dense matrix, sparse matrix, and machine learning benchmarks
- Users can configure the set of benchmarks to run and the benchmark parameters
- Results are provided in .csv files and a data frame for further analysis
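As an illustration of the "further analysis" step, the sketch below builds a small results table from real timings, writes it to a .csv file, reads it back, and averages over trials. It does not call RHPCBenchmark: the column names (kernel, n, trial, seconds) are hypothetical, and the package's own output schema may differ.

```r
# Sketch of post-processing benchmark results held in a .csv file.
# Column names are hypothetical; they are not the RHPCBenchmark schema.
set.seed(1)
runs <- expand.grid(trial = 1:3, n = c(200, 400))
runs$kernel <- "crossprod"
runs$seconds <- vapply(runs$n, function(n) {
  A <- matrix(rnorm(n * n), n, n)
  system.time(crossprod(A))[["elapsed"]]
}, numeric(1))

# Round-trip through a temporary .csv file, as an external tool might.
csv_path <- tempfile(fileext = ".csv")
write.csv(runs, csv_path, row.names = FALSE)
results <- read.csv(csv_path)

# Mean elapsed time per kernel and matrix dimension.
summary_by_size <- aggregate(seconds ~ kernel + n, data = results, FUN = mean)
print(summary_by_size)
```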
Next Steps for R Performance
- Internode performance
- Higher-level functions
  - Many R packages don't rely on the building blocks tested (e.g. nnet, cluster)
- Other classes of functions
  - Sparse matrix operations
  - Data wrangling operations
Conclusions
- R performance on KNL is better for dense matrix operations (3x SNB) and close to native C performance
- Performance is best for large matrices
- SNB does perform better for small matrices
- The new RHPCBenchmark package offers flexibility in benchmarking your hardware and R build
Questions? Suggestions?
- Scott Michael, scamicha@iu.edu
- James McCombs, jmccombs@iu.edu
Backups: KNL Speedup in R
Backups: KNL vs. IvyBridge
Backups: KNL Flat vs. Cached