Scaling up R computation with high performance computing resources.


What to do if the computation is too big for a single desktop

A common user question:
–"I have an existing R solution for my research work, but the data is growing too big. Now my R program takes days to finish (or runs out of memory)."

Three strategies:
–Use automatic offloading with multicore/GPU/MIC.
–Break the big computation into multiple job submissions.
–Implement the code using parallel packages.

Hardware Acceleration with Computation Offloading

Hardware supported:
–Multiple cores on the CPU
–Intel Xeon Phi coprocessor (on Stampede)
–GPGPU (on Stampede/Maverick)

Libraries supporting automatic offloading:
–Intel Math Kernel Library (MKL): available on Stampede and Maverick for users
–HiPLARb: open source and freely available

MIC and MKL

MKL provides BLAS/LAPACK routines that can "offload" work to the Xeon Phi coprocessor, reducing the total time to solution.

[Figure: user R script/function → R interpreter → code execution with pre-built library → MKL BLAS]

MKL-MIC: Submission script to use the MIC

#!/bin/bash
#SBATCH -J benchmark25-R
#SBATCH -o slurm.out%j
#SBATCH -e slurm.err
#SBATCH -p vis
#SBATCH -t 01:30:00
#SBATCH -A TACC-DIC
#SBATCH -N 1
#SBATCH -n 10

# set environment
module purge
module load TACC
module load cuda
module load intel/
module load Rstats

# enable MKL MIC offloading
export MKL_MIC_ENABLE=1
# work division between host and MIC (from 0 to 1)
export MKL_HOST_WORKDIVISION=0.3
export MKL_MIC_WORKDIVISION=0.7
# make the offload report verbose so it is visible
export OFFLOAD_REPORT=2
# set the number of threads on the host
export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16
# set the number of threads on the MIC
export MIC_OMP_NUM_THREADS=240
export MIC_MKL_NUM_THREADS=240

# Run the script
Rscript r-benchmark-25-MIC.R
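The slides do not show the contents of r-benchmark-25-MIC.R. As a hedged sketch (the matrix size and script structure are assumptions, not the actual benchmark), the kind of BLAS-heavy workload that benefits from MKL's automatic MIC offload looks like this: a dense matrix multiply is dominated by DGEMM, which MKL can split between host and coprocessor when MKL_MIC_ENABLE=1 is set in the job script.

```r
# Hypothetical sketch of a BLAS-heavy workload (not the actual
# r-benchmark-25-MIC.R): a dense matrix multiply dominated by DGEMM.
n <- 1000
set.seed(42)
A <- matrix(rnorm(n * n), nrow = n)
B <- matrix(rnorm(n * n), nrow = n)

# When R is linked against MKL and MKL_MIC_ENABLE=1 is exported in the
# job script, this call is divided between host and MIC transparently;
# the R code itself needs no changes.
t <- system.time(C <- A %*% B)
cat("n =", n, "elapsed:", t[["elapsed"]], "seconds\n")
```

No R-level change is needed: the offload decision is made entirely by the MKL library underneath the %*% operator.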

HiPLAR

HiPLAR (High Performance Linear Algebra in R) uses the latest multi-core and GPU libraries to give substantial speed-ups to existing linear algebra functions in R. It builds on two projects that aim to achieve high performance and portability across a wide range of multi-core architectures and hybrid systems:
–Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA)
–Matrix Algebra on GPU and Multicore Architectures (MAGMA)

Needs additional package installation.

HiPLAR: example using regular matrix multiplication

A <- matrix(rnorm(2048 * 2048), nrow=2048, ncol=2048)
B <- matrix(rnorm(2048 * 2048), nrow=2048, ncol=2048)
system.time(C <- A %*% B)
   user  system elapsed
library(HiPLARb)
hiplarb_mode_magma()  # to use GPU mode
system.time(C <- A %*% B)
   user  system elapsed

R benchmark 2.5

Advantages:
–No code changes needed.
–Users can run their R solution as before, without knowledge of the parallel execution.

Limitations:
–Only supports a limited set of computational operations.

Break Big Computations with Multiple R Jobs

Run R in non-interactive sessions. Users can submit multiple R jobs with different command-line parameters:
–Similar to running R in batch mode
–Parameters are specified on the command line
–Good for repeated runs of the same computation, or for running parts of a script

Jobs can be submitted manually or via other tools.
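Each job in such a sweep can read its own parameters with commandArgs(). The script name, parameter meanings, and defaults below are illustrative assumptions, not part of the original slides:

```r
# Hypothetical analysis script (e.g. analyze.R), invoked per job as:
#   Rscript analyze.R 1000 42
# so many jobs can sweep over different sample sizes and seeds.
args <- commandArgs(trailingOnly = TRUE)
n    <- if (length(args) >= 1) as.integer(args[1]) else 100L  # sample size
seed <- if (length(args) >= 2) as.integer(args[2]) else 1L    # RNG seed

set.seed(seed)
x <- rnorm(n)
cat("n =", n, "seed =", seed, "mean =", mean(x), "\n")
```

A scheduler job array (or a simple loop of sbatch calls) can then launch one such invocation per parameter combination, with no changes to the R code between runs.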

Hadoop and Spark

Hadoop:
–an open-source project designed to support large-scale data processing across clusters of computers
–inspired by Google's MapReduce-based computational infrastructure

Spark:
–a fast and general processing engine compatible with Hadoop data; it can run in Hadoop clusters through YARN.

MapReduce - Word Count from Revolution Analytics’ Getting Started with RHadoop course
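The word-count pattern referenced above can be mimicked in plain base R (no Hadoop installation required) to make the map and reduce stages concrete. This is a didactic sketch only, not RHadoop's actual rmr2 API; the input lines are made up:

```r
# Didactic word count in base R, mirroring the two MapReduce stages.
lines <- c("big data big r", "r on hpc")

# Map stage: split every line into words; conceptually each word
# is emitted as a (word, 1) pair.
words <- unlist(lapply(lines, function(l) strsplit(l, " +")[[1]]))

# Shuffle/reduce stage: group the ones by word and sum them.
counts <- tapply(rep(1L, length(words)), words, sum)
print(counts)
```

In rmr2 the same two stages are supplied as map and reduce functions to mapreduce(), and the framework handles distributing them across the cluster.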

RHadoop and SparkR

RHadoop:
–an open-source project sponsored by Revolution Analytics
–packages:
 –rmr2: all MapReduce-related functions
 –rhdfs: interaction with Hadoop's HDFS file system
 –rhbase: access to the NoSQL HBase database

SparkR:
–an R package that provides a light-weight frontend to use Apache Spark from R
–exposes the Spark API through the RDD (Resilient Distributed Datasets) class

Text Analysis of HathiTrust Corpus (‘tm’ package, ~1M books) Guangchen Ruan, et al.

Advantages:
–Utilizes the efficiency of other data-intensive processing frameworks.
–Each job can use existing R code.

Limitations:
–A "data-parallel" solution that may not be suitable for simulation-based analysis.

Supported on Wrangler and Rustler.

Running R with parallel packages

Parallel packages

multicore:
–Utilizes multiple processing cores within the same node.
–Replaces several common functions with parallel implementations, e.g. lapply → mclapply:
 lapply(1:30, rnorm) → mclapply(1:30, rnorm)
–Scalability is limited by the number of cores and the memory available within a single node.

SNOW:
–Developed based on the Rmpi package.
–Simplifies the process of initializing parallel workers over a cluster.
–Need to create a "cluster" first, e.g.:
 cl <- makeCluster(4, type='SOCK')
 parSapply(cl, 1:20, get("+"), 3)
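Both styles are available in base R's parallel package, which merged the multicore and snow interfaces. A small runnable sketch of the two calls from the slide (using the parallel package's "PSOCK" cluster type, the socket-based equivalent of snow's 'SOCK'):

```r
library(parallel)

# multicore-style: fork-based mclapply (falls back to serial on Windows).
res1 <- mclapply(1:4, function(i) i^2, mc.cores = 2)

# SNOW-style: explicit socket cluster, as in the slide's makeCluster example.
cl <- makeCluster(2, type = "PSOCK")
res2 <- parSapply(cl, 1:20, get("+"), 3)  # adds 3 to each of 1:20
stopCluster(cl)

print(unlist(res1))
print(res2)
```

Note that mclapply needs no setup but is confined to one node, while the cluster object created by makeCluster can, with the Rmpi/snow backends, span multiple nodes.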

Advantages:
–Full control over the parallelization.
–Can achieve the best performance.

Limitations:
–Requires code development.
–In some cases, the analysis workflow may need to be changed.