1
The R language and its Dynamic Runtime
Carlos Ordonez
2
Acknowledgments
ATT Labs
Simon Urbanek (ATT Labs, R core team)
Mike Stonebraker (MIT)
Hadley Wickham (formerly at Rice U)
Bryan Lewis (SciDB team)
Divesh Srivastava (my “boss” at ATT)
3
Outline
History
R features
R runtime
R programming
Research: analyzing streams
4
History
Originally the S language, invented at ATT Bell Labs (Chambers received the ACM Software System Award for it)
The core runtime subsystem is still based on S expressions
First solid version in 1979: ported to Unix and programmed in C
Two branches: commercial = S-PLUS; open source = R (NZ)
5
Other analytic systems
SAS: more of a scripting language, but with well-tested libraries and external tools
Matlab: numerical analysis, optimization, mathematical modeling
DBMSs interacting with math libraries: SQL is the #1 language for writing queries
Spark: new generation of MapReduce
Pure C or C++; Java; Python growing (flat files)
6
Features
Interpreted
Functional; recursion
Object-oriented
Lists, vectors and matrices
Goal: statistical computing, but also numerical analysis and data pre-processing
Garbage collector
7
Pros
Robust core interpreter system; portable
More RAM => easier; 64-bit memory addressing (but still 32-bit ints)
Growing user population: expected to surpass SAS in 2015; already passed S-PLUS
Machine learning now uses R instead of Matlab, but Julia (MIT) is growing
Scalable systems and libraries exist: Revolution (bought by Microsoft), pbdR, snow, biglm
8
Drawbacks
Syntax OK, but run-time R semantics are not formally specified: the GNU implementation is the current standard
Can be slow, especially because there are many ways to program the same task
Difficult to integrate data structures (e.g. trees, hash tables, binary files)
String manipulation acceptable, but sometimes cumbersome
Dynamically typed: unexpected errors
Highly variable quality of libraries in CRAN
Does not scale well for large n; block-based processing is feasible, but needs to be reprogrammed per library (IO tools)
9
R runtime
Single threaded
Text file I/O
Garbage collector
Environments; variable generations
10
R internals
S expressions
Data types: integer (32 bit), real, string, POSIX timestamp
Memory allocation: lists, vectors, matrices, data frames (most general)
Memory deallocation: automatic, but calls to the garbage collector can be forced in embedded code
Bash script-based interpreter: easy integration into diverse Unix environments
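The 32-bit integer limit, the timestamp type and the explicit garbage collector call can all be observed from the R prompt. A minimal sketch using only base R; the variable names and the sample date are illustrative assumptions.

```r
# R integers are 32-bit signed values.
max_int <- .Machine$integer.max

# Integer overflow does not wrap around; it yields NA (with a warning).
overflow <- suppressWarnings(max_int + 1L)

# Mixing in a double ("real") promotes the result, so no overflow occurs.
as_real <- max_int + 1

# Timestamps are POSIXct values: seconds since 1970-01-01 UTC.
stamp <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")

# Deallocation is automatic, but a collection can be requested explicitly.
big <- numeric(1e6)
rm(big)
invisible(gc())
```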
11
Programming in R
Examples
Interactive debugging
Reusable and maintainable code
Faster processing
Extending R
12
Examples
13
Debugging
Tracking variable contents
List, vector, matrix sizes
Ranges
Environments
14
Tracking variable content
Initialization is commonly not needed
The data type can change at any time with a new assignment
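A short sketch of how a variable's type can drift during a session, and how `class()` and `str()` help track it. The variable `x` and its values are made up for illustration.

```r
x <- 5L                 # no declaration needed; x starts as an integer
class_1 <- class(x)

x <- x / 2              # division returns a double
class_2 <- class(x)

x <- "five"             # the same name now holds a character string
class_3 <- class(x)

x <- c(x, 1)            # combining with a number silently coerces it to character
str(x)                  # compact view of the current type and contents
```

The last assignment is the kind of silent coercion behind many unexpected run-time errors, which is why checking `class()` at suspicious points is a common debugging step.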
15
Sizes
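Sizes and ranges of the main containers can be checked with a handful of base functions. A minimal sketch; the objects `v`, `l` and `m` are illustrative assumptions.

```r
v <- c(3.5, 1.2, 9.8)
l <- list(id = 1L, name = "a", values = v)
m <- matrix(1:6, nrow = 2)

n_v   <- length(v)   # number of elements in the vector
n_l   <- length(l)   # number of top-level list components
dim_m <- dim(m)      # rows and columns of the matrix
rng_v <- range(v)    # minimum and maximum of the values

str(l)               # one-line summary per component: type and size
```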
16
Reusable and maintainable code
Functions
Closures
Functionals
Named arguments, defaults
Libraries
R embedded
R embedded C
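Named arguments, defaults and closures can be sketched in a few lines. The function names `scale_by` and `make_counter` are hypothetical and chosen only for this example.

```r
# Function with a default argument; callers may pass arguments by name.
scale_by <- function(x, factor = 2) x * factor

# Closure: make_counter() returns a function that keeps private state.
make_counter <- function(start = 0) {
  count <- start
  function() {
    count <<- count + 1   # updates the variable in the enclosing environment
    count
  }
}

next_id <- make_counter(start = 10)
first   <- next_id()
second  <- next_id()

doubled <- scale_by(1:3)                  # uses the default factor
tripled <- scale_by(factor = 3, x = 1:3)  # argument order is free when named
```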
17
Functional
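A functional takes another function as an argument and applies it across a data structure, replacing an explicit loop. A minimal sketch with base R functionals; the list `windows` and its values are assumptions for illustration.

```r
windows <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5))

# Apply a function to every element; the result is a named numeric vector.
means <- sapply(windows, mean)

# Keep only the elements that satisfy a predicate.
large <- Filter(function(w) length(w) > 1, windows)

# Fold the per-element sums into a single value.
total <- Reduce(`+`, lapply(windows, sum))
```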
18
Faster processing
Profiling code
Direct calls to C math library
Vectorized code
Avoid type casting
Chunk-based processing
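The difference between an interpreted loop and vectorized code can be measured with `system.time()`. A minimal sketch; the sum-of-squares task and the function names are assumptions, and the timings depend on the machine.

```r
n <- 1e5
x <- seq_len(n) / n

# Interpreted loop: every iteration is executed by the interpreter.
sum_sq_loop <- function(x) {
  s <- 0
  for (xi in x) s <- s + xi * xi
  s
}

# Vectorized form: the loop runs inside compiled code.
sum_sq_vec <- function(x) sum(x * x)

t_loop <- system.time(r_loop <- sum_sq_loop(x))
t_vec  <- system.time(r_vec  <- sum_sq_vec(x))

print(rbind(loop = t_loop, vectorized = t_vec))
```

Both versions return the same value up to floating-point rounding; only the elapsed time differs.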
19
Faster processing
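Chunk-based processing keeps only a block of the input in RAM at a time. The sketch below computes a mean over a text file read in fixed-size chunks; the temporary file, its contents and the chunk size are assumptions for illustration.

```r
# Hypothetical input: a text file with one number per line.
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:1000), path)

chunk_size <- 128                    # lines held in RAM at a time
con <- file(path, open = "r")
total <- 0
n <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break      # end of file reached
  values <- as.numeric(lines)
  total <- total + sum(values)       # accumulate partial results per chunk
  n <- n + length(values)
}
close(con)
unlink(path)

chunk_mean <- total / n
```

Only statistics that can be accumulated incrementally (sums, counts, cross-products) fit this pattern directly, which is one reason chunking has to be reprogrammed per library.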
20
Extending R
New functions
Libraries
Embedded code
21
Research goal: analyzing network data streams
Stream data warehouse, constantly refreshed every 1-5 minutes from multiple streams
Time windows
Intermittent feeds
Enable complex analytics for network monitoring
22
Embedded code
Main motivation: bypass ODBC, JDBC, JSON
Embedding R code inside C code: vectors and matrices; exploit existing R functions; may be faster than the host language
Embedding C code inside R code: better performance; more flexibility; algorithm already programmed in C or C++
23
Embedded R inside C
Set up libraries
Set up Unix environment
Convert external data to list, vector or data frame: memcpy() when possible
Retrieve results: transformed data set (most common); model (harder); associated statistical metrics (model-specific)
24
Embedded R inside C: main guidelines
Avoid reprogramming an existing R function
Consider tradeoffs between data set size and RAM
Two subsystems will compete for RAM
Single threaded, but feasible to call R multiple times as different Unix processes
25
Embedded R
26
Embedded R: generate time series
27
Embedded R create data frame
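A minimal sketch, in plain R rather than embedded C, of generating a time series and wrapping it in a data frame. The 5-minute spacing, the column names and the synthetic values are all assumptions for illustration.

```r
# Twelve timestamps spaced 5 minutes apart (one hour of a hypothetical feed).
start  <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")
stamps <- seq(from = start, by = "5 min", length.out = 12)

# Synthetic measurements; a real feed would supply these values.
bytes <- 100 + 10 * sin(seq_along(stamps))

# The data frame is the most general container: one row per observation.
feed <- data.frame(time = stamps, bytes = bytes)

# A ts object for functions that expect a regular series (12 points per hour).
series <- ts(feed$bytes, frequency = 12)
```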
28
Embedded R final: call R from C
29
Embedded R direct binding to DBMS
30
Embedded R main
31
Embedded C code guidelines
Identify bottlenecks
Substitute nested interpreted loops
Eliminate or reduce dynamic type checking
32
Embedded C code programming
Understand data type manipulation, especially C arrays and ** pointers
Memory management
Function argument binding
Linker
33
Improve efficiency of R
Alternative 1: built-in matrix ops
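As one possible illustration of Alternative 1, the sketch below replaces nested interpreted loops with a single built-in matrix operation. The choice of operator (the Gram matrix t(X) %*% X) and the data are assumptions made for this example.

```r
set.seed(42)
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

# Nested interpreted loops computing t(X) %*% X entry by entry.
gram_loop <- function(X) {
  d <- ncol(X)
  G <- matrix(0, d, d)
  for (i in seq_len(d)) {
    for (j in seq_len(d)) {
      G[i, j] <- sum(X[, i] * X[, j])
    }
  }
  G
}

G_loop    <- gram_loop(X)
G_builtin <- crossprod(X)   # same matrix from one built-in call
```

The two results agree up to floating-point rounding; `crossprod()` does the work in compiled code instead of the interpreter.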
34
Improve R efficiency
Alternative 2: C code for the operator: 10X faster