Programming in R coding, debugging and optimizing Katia Oleinik koleinik@bu.edu Scientific Computing and Visualization Boston University http://www.bu.edu/tech/research/training/tutorials/list/
if Comparison operators: Logical operators: if (condition) { command(s) } else { } Comparison operators: == equal != not equal > (<) greater (less) >= (<=) greater (less) or equal Logical operators: & and | or ! not
if > # define x > x <- 7 > # simple if statement > if ( x < 0 ) print("Negative") > # simple if-else statement > if ( x < 0 ) print("Negative") else print("Non-negative") [1] "Non-negative" > # if statement may be used inside other constructions > y <- if ( x < 0 ) -1 else 0 > y [1] 0
if > # multiline if - else statement > if ( x < 0 ) { + x <- x+10 + print("x is negative: subtract 10") + } else if ( x == 0 ) { + print("x is equal zero") + } else { + print("x is positive: add 10") + } [1] positive Note: For multiline if-statements braces are necessary even for single statement bodies. The left and right braces must be on the same line with else keyword (in interactive session).
ifelse ifelse (test_condition, true_value, false_value) > # ifelse statement > y <- ifelse ( x < 0, -1, 0 ) > # nested ifelse statement > y <- ifelse ( x < 0, -1, ifelse (x > 0, 1, 0) )
ifelse Best of all – ifelse statement operates on vectors! > # ifelse statement on a vector > digits <- 0 : 9 > (odd <- ifelse( digits %% 2 > 0, TRUE, FALSE )) [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
ifelse Exercise: define a random vector ranging from -10 to 10: x<- as.integer( runif( 10, -10, 10 ) ) create vector y, such that its elements equal to absolute values of x Note: normally, you would use abs() function to achieve this result
switch switch (statement, list) > # simple switch statement > x <- 3 > switch( x, 2, 4, 6, 8 ) [1] 6 > switch( x, 2, 4 ) # returns NULL since there are only 2 elements in the list
switch switch (statement, name1 = str1, name2 = str2, … ) > # switch statement with named list > day <- "Tue" > switch( day, Sun = 0, Mon = 1, Tue = 2, Wed = 3, … ) [1] 2 > # switch statement with a “default” value > food <- "meet" > switch( food, banana="fruit", carrot="veggie", "neither") [1] "neither"
loops There are 3 statements that provide explicit looping: - repeat - for - while Built – in constructs to control the looping: - next - break Note: Use explicit loops only if it is absolutely necessary. R has other functions for implicit looping, which will run much faster: apply(), sapply(), tapply(), and lapply().
repeat repeat { } statement causes repeated evaluation of the body until break is requested. Be careful – infinite loop may occur! > # find the greatest odd divisor of an integer > x <- 84 > repeat{ + print(x) + if( x%%2 != 0) break + x <- x/2 + } [1] 84 [1] 42 [1] 21 >
for for (object in sequence) { command(s) } > # print all words in a vector > names <- c("Sam", "Paul", "Michael") > > for( j in names ){ + print(paste("My name is" , j)) + } [1] "My name is Sam" [1] "My name is Paul" [1] "My name is Michael"
for for (object in sequence) { command(s) if (…) next # return to the start of the loop if (…) break # exit from (innermost) loop }
while while (test_statement) { command(s) } > # find the largest odd divisor of a given number > x <- 84 > while (x %% 2 == 0){ + x <- x/2 + } > x [1] 21 >
loops Exercise: Using either loop statement print all the numbers from 0 to 30 divisible by 7. Use %% - modular arithmetic operator to check divisibility.
function myFun <- function (ARG, OPT_ARGs ){ statement(s) } ARG: vector, matrix, list or a data frame OPT_ARGs: optional arguments Functions are a powerful R elements. They allows you to expand on existing functions by writing your own custom functions.
function myFun <- function (ARG, OPT_ARGs ){ statement(s) } Naming: Variable naming rules apply. Avoid usage of existing (built-in) functions Arguments: Argument list can be empty. Some (or all) of the arguments can have a default value ( arg1 = TRUE ) The argument ‘…’ can be used to allow one function to pass on argument settings to another function. Return value: The value returned by the function is the last value computed, but you can also use return() statement.
function > # simple function: calculate (x+1)2 > myFun <- function (x) { + x^2 + 2*x + 1 + } > myFun(3) [1] 16 >
function > # function with optional arguments: calculate (x+a)2 > myFun <- function (x, a=1) { + x^2 + 2*x*a + a^2 + } > myFun(3) [1] 16 > myFun(3,2) [1] 25 > > # arguments can be called using their names ( and out of order!!!) > myFun( a = 2, x = 1) [1] 9
function > # Some optional arguments can be specified as ‘…’ to pass them to another function > myFun <- function (x, … ) { + plot (x, … ) + } > > # print all the words together in one sentence > myFun <- function ( … ) { + print(paste ( … ) ) > myFun("Hello", " R! ") [1] "Hello R! "
function Local and global variables: All variables appearing inside a function are treated as local, except their initial value will be of that of the global (if such variable exists). > # define a function > myFun <- function (x) { + cat ("u=", u, "\n") # this variable is local ! + u<-u+1 # this will not affect the value of variable outside f() + cat ("u=", u, "\n") + } > > u <- 2 # define a variable > myFun(5) #execute the function u= 2 u= 3 > cat ("u=", u, "\n") # print the value of the variable
function Local and global variables: If you want to access the global variable – you can use the super-assignment operator <<-. You should avoid doing this!!! > # define a function > myFun <- function (x) { + cat ("u=", u, "\n") # this variable is local ! + u <<- u+1 # this WILL affect the value of variable outside f() + cat ("u=", u, "\n") + } > > u <- 2 # define a variable > myFun(u) #execute the function u= 2 u= 3 > cat ("u=", u, "\n") # print the value of the variable
function Call vector variables: Functions do not change their arguments. > # define a function > myFun <- function (x) { + x <- 2 + print (x) + } > > x <- 3 # assign value to x > y <- myFun(x) # call the function [1] 2 > print(x) # print value of x [1] 3
function Call vector variables: If you want to change the value of the function’s argument, reassign the return value to the argument. > # define a function > myFun <- function (x) { + x <- 2 + print (x) + } > > x <- 3 # assign value to x > x <- myFun(x) # call the function [1] 2 > print(x) # print value of x
function Finding the source code: You can find the source code for any R function by printing its name without parentheses. > # get the source code of lm() function > lm function (formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...) { ret.x <- x ret.y <- y cl <- match.call() . . . z } <environment: namespace:stats> >
function Finding the source code: For generic functions there are many methods depending on the type of the argument. > # get the source code of mean() function > mean function (x, ...) UseMethod("mean") <environment: namespace:base> >
function Finding the source code: You can first explore different methods and then chose the one you need. > # get the source code of mean() function > methods("mean") [1] mean.Date mean.POSIXct mean.POSIXlt mean.data.frame [5] mean.default mean.difftime > > # get source code > mean.default function (x, trim = 0, na.rm = FALSE, ...) { if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) { . . . z } <environment: namespace:stats>
apply apply (OBJECT, MARGIN, FUNCTION, ARGs ) object: vector, matrix or a data frame margin: 1 – rows, 2 – columns, c(1,2) – both function: function to apply args: possible arguments Description: Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix
apply Example: Create matrix and apply different functions to its rows and columns. > # create 3x4 matrix > x <- matrix( 1:12, nrow = 3, ncol = 4) > x [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 >
apply Example: Create matrix and apply different functions to its rows and columns. > # create 3x4 matrix > x <- matrix( 1:12, nrow = 3, ncol = 4) > x [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 > # find median of each row > apply (x, 1, median) [1] 5.5 6.5 7.5 >
apply Example: Create matrix and apply different functions to its rows and columns. > # create 3x4 matrix > x <- matrix( 1:12, nrow = 3, ncol = 4) > x [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 > # find mean of each column > apply (x, 2, mean) [1] 2 5 8 11 >
apply Example: Create matrix and apply different functions to its rows and columns. > # create 3x4 matrix > x <- matrix( 1:12, nrow = 3, ncol = 4) > x [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 > # create a new matrix with values 0 or 1 for even and odd elements of x > apply (x, c(1,2), function (x) x%%2) [1,] 1 0 1 0 [2,] 0 1 0 1 [3,] 1 0 1 0 >
lapply llapply() function returns a list: lapply(X, FUN, ...) > # create a list > x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE)) > # compute the list mean for each list element > lapply (x, mean) $a [1] 5.5 $beta [1] 4.535125 $logic [1] 0.3333333 >
sapply lsapply() function returns a vector or a matrix: sapply(X, FUN, ... , simplify = TRUE, USE.NAMES = TRUE) > # create a list > x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE)) > # compute the list mean for each list element > sapply (x, mean) a beta logic 5.5000000 4.5351252 0.3333333 >
code sourcing source ("file", … ) file: file with a source code to load (usually with extension .r ) echo: if TRUE, each expression is printed after parsing, before evaluation.
code sourcing katana:~ % emacs foo_source.r & # dummy function Linux prompt katana:~ % emacs foo_source.r & Text editor # dummy function foo <- function(x){ x+1 } R session > # load foo.r source file > source ("foo_source.r") > # create a vector > x <- c(3,5,7) > # call function > foo(x) [1] 4 6 8
code sourcing > # load foo.r source file > source ("foo_source.r", echo = TRUE) > # dummy function > foo <- function(x){ + x+1; + } > # create a vector > x <- c(3,5,7) > # call function > foo(x) [1] 4 6 8
code sourcing Exercise: - write a function that computes a logarithm of inverse of a number log(1/x) - save it in the file with .r extension - load it into your workspace - execute it - try execute it with input vector ( 2, 1, 0, -1 ).
debugging R package includes debugging tools. cat () & print () – print out the values browser () – pause the code execution and “browse” the code debug (FUN) – execute function line by line undebug (FUN) – stop debugging the function
debugging # dummy function inv_log <- function(x){ y <- 1/x inv_log.r # dummy function inv_log <- function(x){ y <- 1/x browser() y <- log(y) } > # load foo.r source file > source ("inv_log.r", echo = TRUE) > # dummy function > inv_log <- function(x){ + y<-1/x; + browser(); + y<-log(y); + } > inv_log (x) # call function Called from: inv_log(x) Browse[1]> y # check the values of local variables [1] 0.3333333 0.5000000 1.0000000 Inf -1.0000000
debugging <RET> Go to the next statement if the function is being debugged. Continue execution if the browser was invoked. c or cont Continue execution without single stepping. n Execute the next statement in the function. This works from the browser as well. where Show the call stack. Q Halt execution and jump to the top-level immediately. To view the value of a variable whose name matches one of these commands, use the print() function, e.g. print(n).
debugging # dummy function inv_log <- function(x){ y <- 1/x inv_log.r # dummy function inv_log <- function(x){ y <- 1/x browser() y <- log(y) } > # load foo.r source file > source ("inv_log.r", echo = TRUE) > # dummy function > inv_log <- function(x){ + y<-1/x; + browser(); + y<-log(y); + } > inv_log (x) # call function Called from: inv_log(x) Browse[1]> y [1] 0.3333333 0.5000000 1.0000000 Inf -1.0000000 Browse[1]> n debug: y <- log(y) Browse[2]> Warning message: In log(y) : NaNs produced >
debugging # dummy function inv_log <- function(x){ y <- 1/x inv_log.r # dummy function inv_log <- function(x){ y <- 1/x y <- log(y) } > # load foo.r source file > source ("inv_log.r", echo = TRUE) > # dummy function > inv_log <- function(x){ + y<-1/x; + y<-log(y); + } > debug(inv_log) # debug mode > inv_log (x) # call function Called from: inv_log(x) debugging in: inv_log(x) debug: { y <- 1/x y <- log(y) } Browse[2]> . . . > undebug(inv_log) # exit debugging mode
timing Use system.time() functions to measure the time of execution. > # make a function > g <- function(n) { + y = vector(length=n) + for (i in 1:n) y[i]=i/(i+1) + y + }
timing Use system.time() functions to measure the time of execution. > # make a function > myFun <- function(x) { + y = vector(length=x) + for (i in 1:x) y[i]=i/(i+1) + y + } > # execute the function, measuring the time of the execution > system.time( myFun(100000) ) user system elapsed 0.107 0.002 0.109
optimization How to speed up the code?
optimization How to speed up the code? Use vectors !
optimization How to speed up the code? Use vectors ! > # using loops > g1 <- function(x) { + y = vector(length=x) + for (i in 1:x) y[i]=i/(i+1) + y + } > # using vectors > x <- (1:100000) > g2 <- function(x) { + x/(x+1) + } >
optimization How to speed up the code? Use vectors ! > # using loops > g1 <- function(x) { + y = vector(length=x) + for (i in 1:x) y[i]=i/(i+1) + y + } > # execute the function > system.time( g1(100000) ) user system elapsed 0.107 0.002 0.109 > # using vectors > x <- (1:100000) > g2 <- function(x) { + x/(x+1) + } > # execute the function > system.time( g2(x) ) user system elapsed 0.002 0.000 0.003
optimization How to speed up the code? Avoid dynamically expanding arrays
optimization How to speed up the code? Avoid dynamically expanding arrays > vec1<-NULL > vec2 <- vector( + mode="numeric",length=100000)
optimization How to speed up the code? Avoid dynamically expanding arrays > vec1<-NULL > # execute the command > system.time( + for(i in 1:100000) + vec1 <- c(vec1,mean(1:100))) user system elapsed 58.181 0.193 58.417 > vec2 <- vector( + mode=“numeric”,length=100000) > # execute the command > system.time( + for(i in 1:100000) + vec2[i] <- mean(1:100)) user system elapsed 2.324 0.063 2.388
optimization How to speed up the code? Avoid dynamically expanding arrays > f1<-function(x){ + vec1 <- NULL + for(i in 1:100000) + vec1 <- c(vec1,mean(1:10)) + } > # execute the command > system.time( f1(0) ) user system elapsed 57.035 0.209 57.280 > f2<-function(x){ + vec2 <- vector( + mode="numeric",length=100000) + for(i in 1:100000) + vec2[i] <- mean(1:10) + } > # execute the command > system.time( f2(0) ) user system elapsed 2.096 0.067 2.163
optimization How to speed up the code? Use optimized R-functions, i.e. rowSums(), rowMeans(), table(), etc. In some simple cases – it is worth it to write your own!
optimization How to speed up the code? Use optimized R-functions, i.e. rowSums(), rowMeans(), table(), etc. In some simple cases – it is worth it to write your own! > matx <- matrix + (rnorm(1000000),100000,10) > # execute the command > system.time(apply(matx,1,mean)) user system elapsed 2.686 0.057 2.748 > matx <- matrix + (rnorm(1000000),100000,10) > # execute the command > system.time(rowMeans(matx)) user system elapsed 0.013 0.000 0.014
optimization How to speed up the code? Use optimized R-functions, i.e. rowSums(), rowMeans(), table(), etc. In some simple cases – it is worth it to write your own! > system.time( + for(i in 1:100000)mean(1:100)) user system elapsed 1.862 0.052 1.914 > system.time( + for(i in 1:100000) + sum(1:100) / length(1:100) ) user system elapsed 0.485 0.013 0.498
optimization How to speed up the code? Use vectors Avoid dynamically expanding arrays Use optimized R-functions, i.e. rowSums(), rowMeans(), table(), etc. In some simple cases – it is worth it to write your own implementation!
optimization How to speed up the code? Use vectors Avoid dynamically expanding arrays Use optimized R-functions, i.e. rowSums(), rowMeans(), table(), etc. In some simple cases – it is worth it to write your own implementation! Use R - compiler or C/C++ code
compiling Use library(compiler): cmpfun() - compile existing function cmpfile() - compile source file loadcmp() - load compiled source file
compiling # dummy function fsum <- function(x){ s <- 0 for ( n in x) s <- s+n s }
compiling # dummy function fsum <- function(x){ s <- 0 for ( n in x) s <- s+n s } > # load compiler library > library (compiler) >
compiling # dummy function fsum <- function(x){ s <- 0 fsum.r # dummy function fsum <- function(x){ s <- 0 for ( n in x) s <- s+n s } > # load compiler library > library (compiler) > # load function from a source file (if necessary) > source ("fsum.r") >
compiling # dummy function fsum <- function(x){ s <- 0 fsum.r # dummy function fsum <- function(x){ s <- 0 for ( n in x) s <- s+n s } > # load compiler library > library (compiler) > # load function from a source file (if necessary) > source (“fsum.r”) > fsumcomp <- cmpfun(fsum)
compiling Using compiled functions decreases the time of computation. > # run non-compiled version > system.time(fsum(1:100000)) user system elapsed 0.071 0.000 0.071 > # run compiled version > system.time(fsumcomp(1:100000)) user system elapsed 0.025 0.001 0.026
profiling Profiling is a tool, which can be used to find out how much time is spent in each function. Code profiling can give a way to locate those parts of a program which will benefit most from optimization. Rprof() – turn profiling on Rprof(NULL) – turn profiling off summaryRprof("Rprof.out") – Summarize the output of the Rprof() function to show the amount of time used by different R functions.
profiling Brownian Motion simulation. Input: x - initial position, steps - number of steps bm.R # slow version of BM function bmslow <- function (x, steps){ BM <- matrix(x, nrow=length(x)) for (i in 1:steps){ # sample from normal distribution z <- rnorm(2) # attach a new column to the output matrix BM <- cbind (BM,z) } return(BM)
profiling Brownian Motion simulation. Input: x - initial position, steps - number of steps bm.R # a faster version of BM function bm <- function (x, steps){ # allocate enough space to hold the output matrix BM <- matrix(nrow = length(x), ncol=steps+1) # add initial point to the matrix BM[,1] = x # sample from normal distribution (delX, delY) z <- matrix(rnorm(steps*length(x)),nrow=length(x)) for (i in 1:steps) BM[,i+1] <- BM[,i] + z[,i] return(BM) }
profiling > # load compiler library (if you have not done it before) > require (compiler) > # compile function from a source file > cmpfun ("bm.R") > # load function from a compiled file > loadcmp ("bm.Rc")
profiling > # simulate 100 steps > BMsmall <- bm(c(0,0),100) > # plot the result > plot(BMsmall[1,],BMsmall[2,],…)
profiling > # start profiling slow function > Rprof("bmslow.out") # optional – provide output file name > # run function > BMS <- bmslow(c(0,0), 100000) > # finish profiling > Rprof(NULL)
profiling > # start profiling faster function > Rprof("bm.out") # optional – provide output file name > # run function > BM <- bm(c(0,0), 100000) > # finish profiling > Rprof(NULL)
profiling > summaryRprof("bmslow.out") $by.self self.time self.pct total.time total.pct "cbind" 400.52 99.39 400.52 99.39 "rnorm" 1.70 0.42 1.70 0.42 "bmslow" 0.74 0.18 402.96 100.00 … > summaryRprof("bm.out") $by.self self.time self.pct total.time total.pct "bm" 0.62 75.61 0.82 100.00 "rnorm" 0.08 9.76 0.08 9.76 "matrix" 0.04 4.88 0.12 14.63 "+" 0.04 4.88 0.04 4.88 ":" 0.04 4.88 0.04 4.88 …
This tutorial has been made possible by Scientific Computing and Visualization group at Boston University. Katia Oleinik koleinik@bu.edu