The Art of R Programming Chapter 15 – Writing Fast R Code Chapter 16 – Interfacing R to Other Languages Chapter 17 – Parallel R

Use vectorized functions instead of loops
x <- runif( )
y <- runif( )
system.time(z <- x + y)  # user  system  elapsed
system.time(for (i in 1:length(x)) z[i] <- x[i] + y[i])  # user  system  elapsed
R functions are much slower than native code. The following are all functions: "for", ":" (range operator), "[" (vector indexing)
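A minimal, runnable version of the comparison above; the vector length (10^6 here) and the timings were lost from the slide, so the value below is an assumption:

n <- 1e6                      # assumed length; the slide's actual figure was lost
x <- runif(n)
y <- runif(n)

# vectorized addition: one call into native code
system.time(z1 <- x + y)

# explicit loop: every iteration pays R's interpretation overhead
z2 <- numeric(n)
system.time(for (i in 1:n) z2[i] <- x[i] + y[i])

identical(z1, z2)             # TRUE: same result, very different cost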

Another example of vectorization
oddCount <- function(x) return(length(which(x %% 2 == 1)))
x <- sample(1: , , replace=T)
system.time(oddCount(x))  # user  system  elapsed
system.time({
  cnt <- 0
  for (i in 1:length(x)) if (x[i] %% 2 == 1) cnt <- cnt + 1
  cnt
})  # user  system  elapsed  1.353
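For reference, a runnable sketch of both counts; the sample() arguments were lost from the slide, so the sizes below are assumptions, and oddCountSequential is simply the loop written as a named function so it can be reused on the bytecode-compiler slide below:

oddCount <- function(x) length(which(x %% 2 == 1))   # vectorized version

oddCountSequential <- function(x) {                   # explicit loop, for comparison
  cnt <- 0
  for (i in 1:length(x)) if (x[i] %% 2 == 1) cnt <- cnt + 1
  cnt
}

x <- sample(1:1000000, 1e7, replace = TRUE)           # assumed sizes
system.time(oddCount(x))
system.time(oddCountSequential(x))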

Vectorized functions
ifelse, which, any, all
rowSums, colSums
outer, lower.tri, upper.tri, expand.grid
*apply family, BUT not "apply" itself (apply is implemented in R)
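A few quick illustrations of the calling patterns of the functions listed above, on a small made-up matrix:

m <- matrix(1:6, nrow = 2)

ifelse(m %% 2 == 1, "odd", "even")      # elementwise conditional
which(m > 3)                            # indices of matching elements
any(m > 5); all(m > 0)                  # logical reductions
rowSums(m); colSums(m)                  # per-row / per-column sums
outer(1:3, 1:4)                         # 3 x 4 outer product
lower.tri(diag(3))                      # logical mask of the lower triangle
expand.grid(a = 1:2, b = c("x", "y"))   # all combinations as a data frame
sapply(1:3, function(i) i^2)            # *apply family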

Bytecode compiler
library(compiler)
oddCountCompiled <- cmpfun(oddCountSequential)   # oddCountSequential: the loop version of oddCount
system.time(oddCountCompiled(x))  # user  system  elapsed
Faster than the interpreted loop, but still not as fast as the vectorized version.
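On current R versions this comparison needs one extra step: since R 3.4.0 the JIT byte-compiles functions automatically, so the plain loop gets compiled on first use anyway. A hedged sketch, assuming oddCount, oddCountSequential and x from the previous slides:

library(compiler)
enableJIT(0)                               # switch the JIT off so the difference is visible
oddCountCompiled <- cmpfun(oddCountSequential)
system.time(oddCountSequential(x))         # interpreted loop
system.time(oddCountCompiled(x))           # byte-compiled loop
system.time(oddCount(x))                   # vectorized version, still the fastest
enableJIT(3)                               # restore the default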

Power Matrix example
powers <- function(x, degree) {
  pw <- matrix(x, nrow=length(x))
  prod <- x
  for (i in 2:degree) {
    prod <- prod * x
    pw <- cbind(pw, prod)   # build the matrix sequentially
  }
  return(pw)
}
x <- runif( )
system.time(powers(x, 8))  # user  system  elapsed  0.258

Power Matrix example
powers2 <- function(x, degree) {
  # allocate the matrix in one go
  pw <- matrix(nrow=length(x), ncol=degree)
  prod <- x
  pw[, 1] <- prod
  for (i in 2:degree) {
    prod <- prod * x
    pw[, i] <- prod
  }
  return(pw)
}
system.time(powers2(x, 8))  # user  system  elapsed  0.14

Power Matrix example
powers3 <- function(x, degree) {
  return(outer(x, 1:degree, "^"))
}
system.time(powers3(x, 8))  # user  system  elapsed

powers4 <- function(x, degree) {
  repx <- matrix(rep(x, degree), nrow=length(x))
  return(t(apply(repx, 1, cumprod)))
}
system.time(powers4(x, 8))  # user  system  elapsed  6.322
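Putting the four versions side by side; the input length below is an assumption, since the slide's runif() argument was lost:

x <- runif(1e6)                                      # assumed length
t_cbind    <- system.time(p1 <- powers(x, 8))[["elapsed"]]
t_prealloc <- system.time(p2 <- powers2(x, 8))[["elapsed"]]
t_outer    <- system.time(p3 <- powers3(x, 8))[["elapsed"]]
t_apply    <- system.time(p4 <- powers4(x, 8))[["elapsed"]]
c(cbind = t_cbind, prealloc = t_prealloc, outer = t_outer, apply = t_apply)

all.equal(p1, p3, check.attributes = FALSE)          # all four build the same matrix

Note the lesson in the slide's own timings: powers4 looks the most elegant but is by far the slowest, because apply() loops over the rows in R rather than in native code.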

Profiling
Rprof()
invisible(powers(x, 8))
Rprof(NULL)
summaryRprof()
# $by.self: "cbind" and "*" account for the largest self.time / self.pct
# $by.total: "powers" tops total.time, followed by "cbind" and "*"

Memory Allocation and Copying
# you have to run this block in one go
z <- runif(10)
tracemem(z)   # 0x76dc288
z[3] <- 8     # modifying an existing element: no copy
tracemem(z)   # 0x76dc288  (same address)
z[20] <- 100  # writing past the end forces a reallocation
tracemem(z)   # 0x4a52cf0  (new address: the vector was copied)

Memory issue
A single vector is limited to 2^31 - 1 elements, even on a 64-bit OS with huge physical memory.
Workarounds:
– Chunking (process the data piece by piece)
– ff, bigmemory packages (keep the data on disk or in shared memory)
– R on a 64-bit OS (larger address space, so more and bigger objects fit in total)
(Since R 3.0.0, "long vectors" lift the 2^31 - 1 limit for atomic vectors on 64-bit builds.)
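A minimal sketch of the chunking idea: read and summarise a large CSV a block at a time instead of loading it whole. The file name, column layout and chunk size here are made up for illustration:

con <- file("huge.csv", open = "r")
invisible(readLines(con, n = 1))                  # skip the header line
chunk_size <- 100000
total <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break                   # end of file
  chunk <- read.csv(text = lines, header = FALSE)
  total <- total + sum(chunk[[1]])                # accumulate a running statistic
}
close(con)
total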

Language Bindings
You can write code in other languages and call it from R, or vice versa.
RPy (to Python)
Rcpp (to C/C++)
etc.
I will not go into details; ask me if you are interested.
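For a flavour of Rcpp (not covered in detail on the slide), a hedged sketch of the odd-count loop written in C++ and called from R; it assumes the Rcpp package and a working C++ compiler are installed, and reuses the assumed test vector from earlier:

library(Rcpp)

cppFunction('
int oddCountC(IntegerVector x) {
  int cnt = 0;
  for (int i = 0; i < x.size(); i++)
    if (x[i] % 2 == 1) cnt++;
  return cnt;
}
')

x <- sample(1:1000000, 1e7, replace = TRUE)   # assumed sizes, as before
oddCountC(x)                                  # same answer as the R versions, at native speed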

Parallelization
Two (major) packages:
– Rmpi: interface to MPI (Message Passing Interface)
– snow: transparent parallelization

Snow Example
library(snow)
# create cluster
cl <- makeCluster(rep("localhost", 8), type="SOCK")
a <- matrix(rnorm( ), ncol=2)
# sequential execution
system.time(apply(a, 1, "%*%", c(1, 10)))
# parallel execution
system.time(parApply(cl, a, 1, "%*%", c(1, 10)))
# destroy cluster
stopCluster(cl)
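Since R 2.14 the snow interface has been absorbed into the base parallel package, so the same pattern runs without installing anything extra. A sketch; the matrix size is an assumption, as the slide's rnorm() argument was lost:

library(parallel)
cl <- makeCluster(8)                                  # 8 PSOCK workers on the local machine
a <- matrix(rnorm(2000000), ncol = 2)                 # assumed size
system.time(apply(a, 1, "%*%", c(1, 10)))             # sequential
system.time(parApply(cl, a, 1, "%*%", c(1, 10)))      # parallel
stopCluster(cl)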

Snow Functions
clusterExport
clusterEvalQ
clusterApply
clusterSetupRNG – necessary to generate different random number streams on each host
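A hedged sketch of how these functions fit together; the cluster size and the toy computation are made up, and clusterSetupRNG needs the rlecuyer package for its default stream type:

library(snow)
cl <- makeCluster(rep("localhost", 4), type = "SOCK")

base_rate <- 0.1
clusterExport(cl, "base_rate")          # copy a global variable to every worker
clusterEvalQ(cl, library(MASS))         # evaluate an expression (e.g. load a package) on every worker
clusterSetupRNG(cl)                     # give each worker an independent random number stream

# each worker draws its own sample and returns a summary statistic
res <- clusterApply(cl, 1:4, function(i) mean(rnorm(1000, mean = base_rate)))
unlist(res)

stopCluster(cl)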

Cautions!
[Figure: Amdahl's law speedup curves, by Daniels220, from Wikipedia]
Optimize only when necessary
– Parallel code is more complex, less predictable, and harder to debug
Consider Amdahl's law:
S = 1 / ((1 - P) + P / N)
S: speedup, N: number of processors, P: proportion of parallelizable code
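A tiny sketch that evaluates the formula, for instance for the 8-worker cluster used in the snow example:

amdahl <- function(P, N) 1 / ((1 - P) + P / N)

amdahl(P = 0.9, N = 8)     # about 4.7x: even 90%-parallel code falls well short of 8x
amdahl(P = 0.9, N = Inf)   # 10x: the ceiling when 10% of the work stays sequential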