Writing Better R Code WIM Oct 30, 2015 Hui Zhang Research Analytics

Slides:



Advertisements
Similar presentations
Data Analytics and Dynamic Languages Lee E. Edlefsen, Ph.D. VP of Engineering 1.
Advertisements

CIS 101: Computer Programming and Problem Solving Lecture 8 Usman Roshan Department of Computer Science NJIT.
Programming Introduction November 9 Unit 7. What is Programming? Besides being a huge industry? Programming is the process used to write computer programs.
CS 104 Introduction to Computer Science and Graphics Problems Software and Programming Language (2) Programming Languages 09/26/2008 Yang Song (Prepared.
Core 1b – Engineering Dynamic Coding a.k.a. Python in Slicer
Introduction to MATLAB Session 1 Prepared By: Dina El Kholy Ahmed Dalal Statistics Course – Biomedical Department -year 3.
Standard Grade Computing System Software & Operating Systems.
Programming Languages: Scratch Intro to Scratch. Lower level versus high level Clearly, lower level languages can be tedious Higher level languages quickly.
Guide to Programming with Python Chapter One Getting Started: The Game Over Program.
GPU Architecture and Programming
SAXS Scatter Performance Analysis CHRIS WILCOX 2/6/2008.
Siena Computational Crystallography School 2005
1 Writing Better R Code Hui Zhang Research Analytics.
CIS 601 Fall 2003 Introduction to MATLAB Longin Jan Latecki Based on the lectures of Rolf Lakaemper and David Young.
11 Computers, C#, XNA, and You Session 1.1. Session Overview  Find out what computers are all about ...and what makes a great programmer  Discover.
Improving Matlab Performance CS1114
CIS 595 MATLAB First Impressions. MATLAB This introduction will give Some basic ideas Main advantages and drawbacks compared to other languages.
Learning A Better Compiler Predicting Unroll Factors using Supervised Classification And Integrating CPU and L2 Cache Voltage Scaling using Machine Learning.
PYTHON FOR HIGH PERFORMANCE COMPUTING. OUTLINE  Compiling for performance  Native ways for performance  Generator  Examples.
Scaling up R computation with high performance computing resources.
Programming 2 Intro to Java Machine code Assembly languages Fortran Basic Pascal Scheme CC++ Java LISP Smalltalk Smalltalk-80.
Python – It's great By J J M Kilner. Introduction to Python.
Topic 2: Hardware and Software
Top 8 Best Programming Languages To Learn
Licenses and Interpreted Languages for DHTC Thursday morning, 10:45 am
Information Systems Development
CSC391/691 Intro to OpenCV Dr. Rongzhong Li Fall 2016
CS427 Multicore Architecture and Parallel Computing
Learning to Program D is for Digital.
Topics Introduction Hardware and Software How Computers Store Data
CMSC201 Computer Science I for Majors Lecture 22 – Binary (and More)
Topics Introduction to Repetition Structures
Spark Presentation.
PROJECT LIFE CYCLE AND EFFORT ESTIMATION
Existing Perl/Oracle Pipeline
CS101 Introduction to Computing Lecture 19 Programming Languages
A451 Theory – 7 Programming 7A, B - Algorithms.
Exploiting Parallelism
Lecture 1 Runtime environments.
Application Development Theory
Processing Framework Sytse van Geldermalsen
Vector Processing => Multimedia
Introduction to MATLAB
Mobile Development Workshop
Matlab as a Development Environment for FPGA Design
Compiler Back End Panel
Linux: A Product of the Internet
Compiler Back End Panel
Topics Introduction Hardware and Software How Computers Store Data
What's New in eCognition 9
Compiler Code Optimizations
The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.
Variables Title slide variables.
Java for IOI.
Introduction to Systems Analysis and Design Stefano Moshi Memorial University College System Analysis & Design BIT
By Brandon, Ben, and Lee Parallel Computing.
CSE 373 Data Structures and Algorithms
(Computer fundamental Lab)
Tonga Institute of Higher Education IT 141: Information Systems
Simulation And Modeling
Lecture 1 Runtime environments.
Tonga Institute of Higher Education IT 141: Information Systems
General Computer Science for Engineers CISC 106 Lecture 03
What is Programming Language
Lecture 20 Parallel Programming CSE /27/2019.
1.3.7 High- and low-level languages and their translators
What's New in eCognition 9
Lecture 20 Parallel Programming CSE /8/2019.
What's New in eCognition 9
Web Application Development Using PHP
Presentation transcript:

Writing Better R Code WIM Oct 30, 2015 Hui Zhang Research Analytics

“Introduction to R” by Jefferson, Oct 23 We all love R Interactive data analysis Data mining/Machine Learning Plotting/Interactive 3D graphics Data-intensive Computing

R is SLOW For the same reason any other interpret languages are

R is SLOW For the same reason any other interpret languages are slow R is optimized to make programmer efficient (instead of making machine efficient) Every single operation carries a lot of extra baggage

Loops

Write Better R Codes Approaches for improving the performance of R codes Some previous knowledge of R is recommended Some familiarity with C/C++ is also recommended. Topics Loops Ply Functions Vectorization Loop, Plys, and Vectorization Interfacing to Compiled Code

Loops Writing Better R Code Loops Ply Functions Vectorization Loop, Plys, and Vectorization Interfacing to Compiled Code

Loops Writing Better R Code Loops for while No goto’s or do while’s They are really slow Why? for the same reason any interpreted language is slow every single operation carries a lot of extra baggage particularly slow if objects grow inside Loops

Loops Writing Better R Code Loops Best Practices Mostly try to avoid Evaluate practicality of rewrite (plys, vectorization, compiled code) Always preallocate (avoid growing objects in loops): Vectors: numeric(n), integer(n), character(n) Lists: vector(mode=“list”, length=n) Dataframes: data.frame(col1=numeric(n), …) If you can’t, try something other than an array/list.

Loops

Loops Rbenchmark inspired by the Perl module Benchmark facilitate benchmarking of arbitrary R code benchmark(..., replications=100, …, relative = “elapsed”)

Ply Fucntions Writing Better R Code Loops Ply Functions Vectorization Loop, Plys, and Vectorization Interfacing to Compiled Code

Ply Functions Writing Better R Code Loops Ply Functions R has functions that apply other functions to data In a nutshell: loop sugar Typical *ply’s apply(): apply function over matrix “margin(s)” lapply(): apply function over list/vector mapply(): apply function over multiple lists/vectors sapply(): same as lapply(), but (possibly) nicer output Plus some other mostly irrelevant ones

Ply Functions

Ply Functions

Ply Functions Writing Better R Code Loops Ply Functions Transforming Loops into Ply’s

Ply Functions Writing Better R Code Loops Ply Functions Most Ply’s are just shorthand/higher expression of loops Generally not much faster (if at all), especially with the compiler Thinking in terms of lapply() can be useful however …

Vectorization Writing Better R Code Loops Ply Functions Vectorization Loop, Plys, and Vectorization Interfacing to Compiled Code

Vectorization Writing Better R Code Loops Ply Functions Vectorization In R everything is a vector. To quote Tim Smith in aRrgh: a newcomer’s (angry) guide to R “All naked numbers are double-width floating-point atomic vectors of length one. You’re welcome.” X + Y X[, 1] <- 0 Rnorm(1000)

Vectorization Writing Better R Code Loops Ply Functions Vectorization also true in other high-level languages (Matlab, Python, …) Idea: write vectorized code use pre-existing compiled kernels to avoid interpreter overhead Much faster than loops and plys X[, 1] <- 0 Rnorm(1000)

Vectorization Writing Better R Code Loops Ply Functions Vectorization

Vectorization Writing Better R Code Loops Ply Functions Vectorization Best Practices Vectorize if at all possible Note that this consumes potentially a lot of memory

Ply Fucntions Writing Better R Code Loops Ply Functions Vectorization Loop, Plys, and Vectorization Interfacing to Compiled Code

Putting It All Together Writing Better R Code Loops Ply Functions Vectorization Loop, Plys, and Vectorization Loops are slow apply() are just for loops Ply functions are not vectorized Vectorization is fastest, but often needs a lot of memory

Putting It All Together Writing Better R Code Loops Ply Functions Vectorization Loop, Plys, and Vectorization Example: let us compute the square of the number 1-100000, using for loop without preallocation for loop with preallocation sapply() vectorization

Putting It All Together

Putting It All Together

Rcpp Writing Better R Code Loops Ply Functions Vectorization Loop, Plys, and Vectorization Interfacing to Compiled Code

Rcpp Writing Better R Code Loops Ply Functions Vectorization Loop, Plys, and Vectorization Interfacing to Compiled Code R is mostly a C program R extensions are mostly R programs Rcpp is a API for you to access/extend/modify R object at C++ level

Rcpp Writing Better R Code Loops Ply Functions Vectorization Loop, Plys, and Vectorization Interfacing to Compiled Code Rcpp is: R interface to compiled code Package ecosystem Utilities to make writing C++ more convenient for R users A tool which requires C++ knowledge to effectively utilize GPL licensed (like R)

Rcpp Writing Better R Code Loops Ply Functions Vectorization Loop, Plys, and Vectorization Interfacing to Compiled Code Rcpp is not Magic Automatic R-to-C++ converter A way around having to learn C++ A tool to make existing R functionality faster (unless you rewrite it) As easy to use as R

Rcpp Writing Better R Code Loops Ply Functions Vectorization Loop, Plys, and Vectorization Interfacing to Compiled Code Rcpp’s advantage Compiled code is fast Easy to install Easy to use (comparatively) Better documented than alternatives Large, friendly, helpful community

Rcpp

Rcpp

Rcpp

Rcpp

𝜋≈4 #𝐼𝑛𝑠𝑖𝑑𝑒 𝐶𝑖𝑟𝑐𝑙𝑒 #𝑇𝑜𝑡𝑎𝑙 =4 #𝐵𝑙𝑢𝑒 #𝐵𝑙𝑢𝑒+#𝑅𝑒𝑑 Rcpp Example: Monte Carlo Simulation to Estimate 𝜋 Sample N uniform observation (xi, yi) in the unit square [0,1] X [0,1]. Then 𝜋≈4 #𝐼𝑛𝑠𝑖𝑑𝑒 𝐶𝑖𝑟𝑐𝑙𝑒 #𝑇𝑜𝑡𝑎𝑙 =4 #𝐵𝑙𝑢𝑒 #𝐵𝑙𝑢𝑒+#𝑅𝑒𝑑

Rcpp Example: Monte Carlo Simulation to Estimate 𝜋

Rcpp Example: Monte Carlo Simulation to Estimate 𝜋

Rcpp Example: Monte Carlo Simulation to Estimate 𝜋

Rcpp Example: Monte Carlo Simulation to Estimate 𝜋

More Tricks??? So far we only use one CPU core for R codes It is possible to parallelize the computation in LOOPs/PLYs

More Tricks??? So far we only use one CPU core for R codes It is possible to parallelize the computation in LOOPs/PLYs How many cores you have?

More Tricks??? So far we only use one CPU core for R codes It is possible to parallelize the computation in LOOPs/PLYs How many cores you have? Thinking of your codes in terms of PLYs can be useful

More Tricks??? So far we only use one CPU core for R codes It is possible to parallelize the computation in LOOPs/PLYs How many cores you have? Thinking of your codes in terms of PLYs can be useful library(parallel) let each core do the job independently for you collect the results from each slave core

More Tricks??? So far we only use one CPU core for R codes It is possible to parallelize the computation in LOOPs/PLYs How many cores you have? Thinking of your codes in terms of PLYs can be useful library(parallel) let each core do the job independently for you collect the results from each slave core Note that there is overhead due to data shipping

Summary Bad R often looks like good C/C++ Vectorize your code as you much as you can Interfacing with compiled code helps Parallelization can take your code to extreme More reading: “The R Inferno.” Patrick Burns “Rcpp: seamless R and C++ integration. “ Dirk Eddelbuettel

Reading “The R Inferno.” Patrick Burns “Rcpp: seamless R and C++ integration. “ Dirk Eddelbuettel “R and Data Mining: Examples and Case Studies.” Yanchang Zhao “Data Visualization using R and Javascript.” Tom Barker “Parallel R.” Ethan McCallum