The R Language and its Dynamic Runtime
Carlos Ordonez
Acknowledgments
- AT&T Labs
- Simon Urbanek (AT&T Labs, R core team)
- Mike Stonebraker (MIT)
- Hadley Wickham (formerly at Rice U)
- Bryan Lewis (SciDB team)
- Divesh Srivastava (my "boss" at AT&T)
Outline
- History
- R features
- R runtime
- R programming
- Research: analyzing streams
History
- Originally the S language, invented at AT&T Bell Labs (Chambers received the 1998 ACM Software System Award for S)
- The core runtime subsystem is still based on S expressions
- 1st solid version 1979: ported to Unix and programmed in C
- Two branches: commercial = S-plus; open source = R (New Zealand)
Other analytic systems
- SAS: more of a scripting language, but with well-tested libraries and external tools
- Matlab: numerical analysis, optimization, mathematical modeling
- DBMSs interacting with math libraries: SQL is the #1 language for writing queries
- Spark: new generation of MapReduce
- Pure C or C++; Java; Python growing (flat files)
Features
- Interpreted
- Functional; recursion
- Object-oriented
- Lists, vectors and matrices
- Goal: statistical computing, but also numerical analysis and data pre-processing
- Garbage collector
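The features listed above can be illustrated with a minimal sketch. All names here (`v`, `m`, `l`, `fact`, `describe`, `acct`) are made up for illustration and do not come from the original slides.

```r
# Basic containers: vector, matrix, list.
v <- c(2, 4, 6)                      # numeric vector
m <- matrix(1:6, nrow = 2)           # 2 x 3 integer matrix, filled by column
l <- list(name = "x", values = v)    # list holding mixed types

# Functions are ordinary values and may be recursive.
fact <- function(n) if (n <= 1) 1 else n * fact(n - 1)

# Functional style: apply an anonymous function to every element.
squares <- sapply(v, function(x) x^2)

# Object-oriented style (S3): a generic dispatches on the class attribute.
describe <- function(x) UseMethod("describe")
describe.account <- function(x) paste("balance:", x$balance)
acct <- structure(list(balance = 10), class = "account")

print(dim(m))
print(fact(5))
print(squares)
print(describe(acct))
```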
Pros
- Robust core interpreter system; portable
- More RAM => easier; 64-bit memory addressing (but still 32-bit ints)
- Growing user population: expected to surpass SAS in 2015; already passed S-plus
- Machine learning now uses R instead of Matlab, but Julia (MIT) is growing
- Scalable systems and libraries exist: Revolution (bought by Microsoft), pbdR, snow, biglm
Drawbacks
- Syntax OK, but run-time R semantics are not formally specified: the GNU implementation is the de facto standard
- Can be slow, especially because there are many ways to program the same task
- Difficult to integrate other data structures (e.g. trees, hash tables, binary files)
- String manipulation acceptable, but sometimes cumbersome
- Dynamically typed: unexpected errors
- Highly variable quality of libraries on CRAN
- Does not scale well for large n; block-based processing is feasible, but needs to be reprogrammed per library (I/O tools)
R runtime
- Single-threaded
- Text file I/O
- Garbage collector
- Environments; variable generations
R internals
- S expressions
- Data types: integer (32-bit), real, string, POSIX timestamp
- Memory allocation: lists, vectors, matrices, data frames (most general)
- Memory deallocation: automatic, but calls to the garbage collector can be forced when embedded
- Bash script-based interpreter: easy integration into diverse Unix environments
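The data types and garbage collector mentioned above can be inspected from the R prompt. The following is a small sketch; the variable names are illustrative only.

```r
# Integers are 32-bit; the L suffix requests an integer literal.
print(typeof(1L))     # integer
print(typeof(1))      # double ("real")
print(typeof("a"))    # character ("string")

int_max <- .Machine$integer.max              # largest 32-bit integer
overflow <- suppressWarnings(int_max + 1L)   # integer overflow yields NA
as_double <- int_max + 1                     # mixing with a double promotes the result

# Timestamps are stored as POSIXct objects.
now <- Sys.time()
print(class(now))

# Deallocation is automatic, but a collection can be requested explicitly.
big <- numeric(1e6)
rm(big)
invisible(gc())
```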
Programming in R
- Examples
- Interactive debugging
- Reusable and maintainable code
- Faster processing
- Extending R
Examples
Debugging
- Tracking variable contents
- List, vector, matrix sizes
- Ranges
- Environments
Tracking variable content
- Initialization is commonly not needed
- Data type can change at any time with a new assignment
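A short sketch of how a variable's type follows its latest assignment, and how it can be tracked while debugging. The helper `check_numeric` is a hypothetical name used for illustration.

```r
x <- 42L
t1 <- class(x)        # integer

x <- "forty-two"      # same name, new type: no declaration needed
t2 <- class(x)        # character

x <- c(1.5, 2.5)
t3 <- class(x)        # numeric

str(x)                # compact view of type, length and first values

# A defensive check catches unexpected type changes early.
check_numeric <- function(v) {
  stopifnot(is.numeric(v))
  v
}
checked <- check_numeric(x)
```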
Sizes
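The code originally shown on this slide is not available; the following sketch shows the usual base R calls for checking sizes, ranges and environments. All object names are illustrative.

```r
v <- c(3, 9, 1)
m <- matrix(0, nrow = 4, ncol = 2)
l <- list(a = 1, b = "two", c = v)

print(length(v))        # number of elements in a vector
print(dim(m))           # rows and columns of a matrix
print(length(l))        # number of components in a list
print(range(v))         # minimum and maximum
print(object.size(m))   # approximate memory footprint in bytes

# Environments map names to values; ls() lists what an environment holds.
e <- new.env()
assign("counter", 0, envir = e)
print(ls(e))
print(get("counter", envir = e))
```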
Reusable and maintainable code
- Functions
- Closures
- Functionals
- Named arguments, defaults
- Libraries
- R embedded
- R embedded C
Functional
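The code originally shown on this slide is not available. As a stand-in, here is a minimal sketch of closures, functionals, and named arguments with defaults; `make_scaler` and the other names are hypothetical.

```r
# A closure: the returned function remembers `factor`.
# `factor` is a named argument with a default value.
make_scaler <- function(factor = 2) {
  function(x) x * factor
}
double_it <- make_scaler()             # uses the default
triple_it <- make_scaler(factor = 3)   # argument passed by name

# Functionals take functions as arguments.
col_means <- apply(matrix(1:6, nrow = 2), 2, mean)   # mean of each column
total <- Reduce(`+`, 1:4)                            # fold a vector with +
scaled <- Map(function(f, x) f(x), list(double_it, triple_it), list(10, 10))

print(col_means)
print(total)
print(unlist(scaled))
```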
Faster processing
- Profiling code
- Direct calls to the C math library
- Vectorized code
- Avoid type casting
- Chunk-based processing
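To illustrate the vectorization point, the sketch below times an interpreted loop against the equivalent vectorized expression using `system.time()`. The function names are made up for illustration, and actual timings depend on the machine. For finer-grained profiling, base R also provides the sampling profiler `Rprof()`.

```r
n <- 1e5
x <- runif(n)

# Interpreted loop: one iteration per element.
sum_sq_loop <- function(x) {
  s <- 0
  for (xi in x) s <- s + xi * xi
  s
}

# Vectorized: the loop runs inside compiled code.
sum_sq_vec <- function(x) sum(x * x)

t_loop <- system.time(a <- sum_sq_loop(x))
t_vec <- system.time(b <- sum_sq_vec(x))

cat("loop:", t_loop[["elapsed"]], "s; vectorized:", t_vec[["elapsed"]], "s\n")
```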
Faster processing
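The code originally shown on this slide is not available. One of the listed techniques, chunk-based processing, might look like the following sketch, which assumes a CSV file with a numeric `value` column. The file is generated in a temporary location so the example is self-contained, and `chunk_sum` is a hypothetical helper.

```r
# Build a small CSV file to read back in chunks.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 1.5), path, row.names = FALSE)

# Sum one column while holding at most `chunk_rows` rows in memory.
chunk_sum <- function(path, chunk_rows = 4) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # First line is the header; strip the quotes written by write.csv.
  header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
  total <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break          # end of file
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk$value)
  }
  total
}

result <- chunk_sum(path)
print(result)
```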
Extending R
- New functions
- Libraries
- Embedded code
Research goal: analyzing network data streams
- Stream data warehouse, constantly refreshed every 1-5 minutes from multiple streams
- Time windows
- Intermittent feeds
- Enable complex analytics for network monitoring
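As a rough illustration of time-window aggregation, the sketch below groups simulated events into 5-minute windows. The data, the window length, and the `bytes` measure are all assumptions made for the example; they do not describe the actual stream warehouse.

```r
set.seed(1)
start <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")

# Simulated feed: 50 events spread over 15 minutes (assumed data).
ts <- start + sort(sample(0:899, 50))
bytes <- rpois(50, lambda = 1000)

# Assign each event to the 5-minute (300-second) window it falls in.
offset <- as.numeric(difftime(ts, start, units = "secs"))
window_start <- start + 300 * (offset %/% 300)

# Aggregate per window.
per_window <- tapply(bytes, format(window_start, "%H:%M"), sum)
print(per_window)
```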
Embedded code
- Main motivation: bypass ODBC, JDBC, JSON
- Embedding R code inside C code
  - Vectors and matrices
  - Exploit existing R functions
  - May be faster than host language
- Embedding C code inside R code
  - Better performance
  - More flexibility
  - Algorithm already programmed in C or C++
Embedded R inside C
- Set up libraries
- Set up Unix environment
- Convert external data to list, vector or data frame: memcpy() when possible
- Retrieve results:
  - Transformed data set (most common)
  - Model (harder)
  - Associated statistical metrics (model-specific)
Embedded R inside C: main guidelines
- Avoid reprogramming an existing R function
- Consider tradeoffs between data set size and RAM
- Two subsystems will compete for RAM
- Single-threaded, but feasible to call R multiple times as different Unix processes
Embedded R
Embedded R generate time series
Embedded R create data frame
Embedded R final: call R from C
Embedded R direct binding to DBMS
Embedded R main
Embedded C code guidelines
- Identify bottlenecks
- Substitute nested interpreted loops
- Eliminate or reduce dynamic type checking
Embedded C code programming
- Understand data type manipulation, especially C arrays and ** pointers
- Memory management
- Function argument binding
- Linker
Improve R efficiency
- Alternative 1: built-in matrix ops
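The operator used on the original slide is not shown. As an assumed example, the sketch below computes the product of a matrix transpose with the matrix itself in three ways: nested interpreted loops, the `%*%` operator, and the built-in `crossprod()`. `gram_loop` is a hypothetical name.

```r
set.seed(42)
X <- matrix(rnorm(200 * 5), nrow = 200)   # 200 points, 5 dimensions

# Nested interpreted loops: slow, one scalar operation at a time.
gram_loop <- function(X) {
  d <- ncol(X)
  G <- matrix(0, d, d)
  for (a in seq_len(d)) {
    for (b in seq_len(d)) {
      s <- 0
      for (i in seq_len(nrow(X))) s <- s + X[i, a] * X[i, b]
      G[a, b] <- s
    }
  }
  G
}

G_loop <- gram_loop(X)
G_mult <- t(X) %*% X      # built-in matrix multiplication
G_cross <- crossprod(X)   # same result without forming t(X) explicitly

print(dim(G_cross))
```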
Improve R efficiency
- Alternative 2: C code for the operator: 10X faster