Performance Benchmarking of the R Programming Environment on Knight's Landing
Scott Michael, Indiana University
July 6, 2017

Who am I? Theoretical astrophysicist (not a statistician). HPC application optimization and performance tuning. Lead of the Research Analytics team in Research Technologies at Indiana University.

Contributors. IU: Eric Wernert, Jefferson Davis, James McCombs, Esen Tuna. TACC: Bill Barth, Tommy Minyard, David Walling.

Talk Overview. Targeting productivity languages for Xeon Phi-based architectures: motivation and history; benchmark results and lessons learned; the RHPCBenchmark package; future directions; conclusions.

IU, the Stampede Supercomputer, and Xeon Phi. IU Research Technologies has a partnership with TACC, collaborating on systems and support: Stampede, the largest XSEDE machine by core count; Wrangler, for data-intensive computing with 20 PB of out-of-region replication; and Jetstream, the XSEDE production science cloud. IU supports data-intensive and "high productivity" languages on Stampede, including R, Python, and MATLAB. There was a large transition between Stampede 1 and 2.

Evolution of Xeon Phi.
Knight's Corner (KNC): coprocessor only; 1 TF peak (DP); 8 GB device memory plus system memory.
Knight's Landing (KNL): coprocessor or self-hosted; 3 TF peak (DP); 16 GB MCDRAM plus system memory.

R Support on Stampede 1 & 2. Primary support for R was on Stampede 1: several methods for distributed R (pbdR, Rmpi, snow, etc.; a minimal sketch appears below), R built in offload mode, and R configured to use the GPUs in a portion of Stampede via HiPLAR. However, much of the R workload on Stampede didn't rely on the KNC.
Stampede 1: 6,400 nodes; FDR InfiniBand interconnect; 14 PB Lustre filesystem. Node configuration: dual Xeon E5-2680 "Sandy Bridge" processors plus a Xeon Phi SE10P; 32 GB DDR3 host memory and 8 GB GDDR5 on the coprocessor.
Stampede 2: 4,200 nodes; Omni-Path (v1) interconnect. Node configuration: Xeon Phi 7250; 16 GB MCDRAM plus system memory.
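As a minimal sketch of the snow-style interface (via the base parallel package; the worker count and toy workload here are illustrative, not what was run on Stampede):

  library(parallel)

  # Start a snow-style socket cluster with 4 workers; on Stampede the worker
  # count and host list would come from the batch scheduler, not be hard-coded.
  cl <- makeCluster(4)

  # Distribute a toy task: time a Cholesky factorization on each worker.
  timings <- parLapply(cl, c(500, 1000, 1500, 2000), function(n) {
    A <- crossprod(matrix(rnorm(n * n), n, n))  # symmetric positive definite
    system.time(chol(A))[["elapsed"]]
  })

  stopCluster(cl)
  unlist(timings)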

R Performance on KNL. KNL is the sole processor on Stampede 2 and has shown good performance for large-scale HPC codes (molecular dynamics, climate, astrophysics, etc.). How does KNL perform with a language like R?

KNL Architecture. Intel Xeon Phi CPU 7250 @ 1.60 GHz, 68 physical cores. Features of note: a tiled architecture supporting 4 SMT threads per physical core.
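A quick check of how those cores appear to an R session (a sketch; the expected counts below assume a Xeon Phi 7250 node):

  library(parallel)

  # Physical cores (expected: 68 on a Xeon Phi 7250)
  detectCores(logical = FALSE)

  # Hardware threads, with 4 SMT threads per core (expected: 272)
  detectCores(logical = TRUE)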

KNL Architecture (cont.). Features of note: 16 GB of on-chip MCDRAM acts as fast memory and can be configured into several modes (e.g., flat or cache).

Benchmarking Strategy. Look at industry-standard performance benchmarks for R on KNL and compare to Sandy Bridge (SNB). Further explore some exemplar workflows in each language and compare them to the benchmark results. Compare both single-node and multi-node benchmarks.

Benchmarking Strategy. The standard R benchmark is the R-25 benchmark: it is very old, uses fixed (small) problem sizes, and its report output is challenging to parse, but it offers a reasonable mix of mini-kernels focused on dense matrix operations and linear solvers. We built an R benchmark for scalability, focused on kernels similar to R-25 and designed for distribution and flexibility. It is currently available on CRAN as RHPCBenchmark and at https://github.com/IUResearchAnalytics/RBenchmarking.

Talk Overview (next: benchmark results and lessons learned).

R Benchmark Results. R generally lacks multithreading (some exceptions include mclapply), so we rely on the threading in MKL. Standard profiling and tracing tools are challenging to employ; instrumenting the entire R interpreter creates too much overhead.
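A minimal sketch of the two threading routes mentioned above (MKL environment control and mclapply), assuming an MKL-linked R build; the thread and worker counts are illustrative:

  # MKL threading is controlled from the environment, set before R starts
  # so the MKL-linked BLAS picks it up, e.g.:
  #   export MKL_NUM_THREADS=68

  library(parallel)

  # Explicit R-level parallelism with forked workers via mclapply,
  # one of the few multithreading routes available in base R.
  timings <- mclapply(1:4, function(i) {
    n <- 2000
    A <- matrix(rnorm(n * n), n, n)
    system.time(A %*% A)[["elapsed"]]
  }, mc.cores = 4)

  unlist(timings)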

R Benchmark Results. The benchmarks include Cholesky decomposition, eigendecomposition, least-squares fit, linear solve, QR decomposition, matrix cross-product, matrix determinant, matrix-matrix multiply, and matrix-vector multiply (a sketch of these kernels appears below). Multiple threads per core aren't useful, in contrast to KNC.
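A stripped-down sketch of that kernel mix (the matrix size here is illustrative; the actual benchmarks sweep much larger sizes and report timings across multiple runs):

  n <- 2000
  A <- matrix(rnorm(n * n), n, n)
  S <- crossprod(A)                # symmetric positive definite for chol/solve
  b <- rnorm(n)

  kernels <- list(
    cholesky      = function() chol(S),
    eigendecomp   = function() eigen(S, symmetric = TRUE),
    ls_fit        = function() lm.fit(A[, 1:50], b),
    linear_solve  = function() solve(S, b),
    qr_decomp     = function() qr(A),
    crossproduct  = function() crossprod(A),
    determinant   = function() determinant(S),
    matrix_matrix = function() A %*% A,
    matrix_vector = function() A %*% b
  )

  # Elapsed seconds per kernel; on KNL the heavy lifting lands in threaded MKL.
  sapply(kernels, function(f) system.time(f())[["elapsed"]])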

R Benchmark Results. For some benchmarks, single-core KNL outperforms SNB.

R Benchmark Results. Large matrices are needed to make full use of all 68 cores.

R Benchmark Results. For math-intensive kernels, the R interpreter overhead isn't bad.

Talk Overview (next: the RHPCBenchmark package).

RHPCBenchmark Package. The initial release of RHPCBenchmark is available on CRAN. It provides a variety of dense matrix, sparse matrix, and machine learning benchmarks. Users can configure the set of benchmarks to run and the benchmark parameters, and results are provided as .csv files and as a data frame for further analysis.
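A sketch of getting started with the package. The runner function named below (RunDenseMatrixBenchmark) and its arguments are assumptions recalled from the package documentation, not verified here; check the CRAN reference manual for the exact interface.

  # Install from CRAN and load; function and argument names below are
  # assumptions -- consult help(package = "RHPCBenchmark") for the real API.
  install.packages("RHPCBenchmark")
  library(RHPCBenchmark)

  # Assumed interface: a run identifier plus a directory where the .csv
  # result files are written; a data frame of timings is also returned.
  results <- RunDenseMatrixBenchmark(
    runIdentifier    = "knl_mcdram_cache",
    resultsDirectory = "./rhpc_results"
  )

  head(results)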

Talk Overview (next: future directions).

Next Steps for R Performance. Internode performance. Higher-level functions: many R packages don't rely on the building blocks tested (e.g., nnet, cluster). Other classes of functions: sparse matrix operations (see the sketch below) and data wrangling operations.
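As an illustration of the sparse-operation class (a sketch using the Matrix package, not the benchmark suite itself; sizes and density are arbitrary):

  library(Matrix)

  set.seed(1)
  n <- 20000
  A <- rsparsematrix(n, n, density = 0.001)   # ~0.1% nonzeros
  x <- rnorm(n)

  # Operations a sparse benchmark would exercise
  system.time(A %*% x)        # sparse matrix-vector product
  system.time(crossprod(A))   # sparse cross-product
  system.time(A + t(A))       # sparse transpose and add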

Talk Overview (next: conclusions).

Conclusions. R performance on KNL is better for dense matrix operations (3x SNB) and close to native C performance. Performance is best for large matrices; SNB performs better for small matrices. The new RHPCBenchmark package offers flexibility in benchmarking your hardware and R build.

Questions? Suggestions? Scott Michael (scamicha@iu.edu); James McCombs (jmccombs@iu.edu).

Backups: KNL Speedup in R

Backups: KNL vs. IvyBridge

Backups: KNL Flat vs. Cached