Integrating the R Language Runtime System with a Data Stream Warehouse
Carlos Ordonez*, Ted Johnson, Simon Urbanek, Vlad Shkapenyuk, Divesh Srivastava
AT&T Labs Research, USA
* visiting researcher with AT&T
Talk Outline
- Motivation
- Past AT&T stream systems
- System architecture
- Integrating the R runtime with the query processor
- Bidirectional calls: R calls SQL, SQL calls R
- Benchmark of data mapping & transfer
Network Data Streams
- Feeds: devices, logs
- Timestamps
- Intermittent
- Arrival out of order
- Varying speed
- Varying schema
- Active processing, but not real-time: < 5 minutes
- Sliding time window
Motivation: SQL
- Expressive, standardized, well understood
- Efficient, parallel, tunable
- Extensible via UDFs
Motivation: Scaling R
- Remove the RAM limitation
- Go beyond single-threaded processing on one node
- Parallel processing on multiple nodes
- Best of both worlds:
  - Manage big data in a DBMS
  - Exploit R's math capabilities
Past AT&T systems
- Gigascope: ultra-fast stream processing in the NIC (packet level); restricted form of the SQL language; no historic tables
- DataDepot: stores summarized streams; band joins; POSIX file system; compiled SQL queries; integration with a feed management system; UDFs
- Tidalrace: Big Data trend; scale-out; "V"ariety
Tidalrace
- HDFS
- Large number of nodes
- Direct & fast copy of log files, no preprocessing
- Multiple asynchronous stream loading
- Eventual consistency: MVCC
- Time-varying schema
- Light DBMS for metadata
- Integration with the stream feed system
- Compiled SQL queries
Tidalrace Architecture
[Architecture diagram: data loading and update propagation, queries, and maintenance sit on top of the Tidalrace metadata (MySQL) and the storage manager (D3SM), which manages data partitions and indices over the file system (local, D3FS, HDFS)]
Temporal Partitioning
[Diagram: partitions laid out along the time axis, each with its index and data; new data appended at the newest partition]
- The primary partitioning field is the record timestamp (sketched below)
- Stream data is mostly sorted
- Most new data loads into a new partition
- Avoids rebuilding indices
- Simplified data expiration: roll off the oldest partitions
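A toy illustration of fixed-width temporal partitioning in R (Tidalrace implements this inside its storage manager; `part_width`, `partition_id`, and `expire` are hypothetical names for this sketch):

```r
# Fixed-width time partitions keyed by record timestamp: mostly-sorted
# stream data lands in the newest partition, and expiration simply
# drops the oldest partition ids (no index rebuild).
part_width <- 300                               # hypothetical 5-minute partitions, in seconds
partition_id <- function(ts) ts %/% part_width  # integer partition key

ts <- as.numeric(Sys.time()) + c(0, 10, 400)    # mostly-sorted arrivals
parts <- split(ts, partition_id(ts))            # one bucket per time partition

# roll off the oldest partitions: keep only ids >= a cutoff
expire <- function(parts, oldest_kept)
  parts[as.numeric(names(parts)) >= oldest_kept]
```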
R runtime: challenges
- Dynamic storage in RAM, variable generations
- Type checking at runtime
- RAM constraint when calling functions
- Data structures: data frames, matrices, vectors
- Functional and OO language
- Dynamic processing; garbage collector
- Runtime based on the S language, programmed in C
- Block-based processing requires refactoring R libraries
Applications
STAR: STream Analytics in R
- Separate Unix process
- 64-bit memory space
- 32-bit int indexing for arrays
- Packed binary file
- Pipes (see the reader sketch below)
- Embedded R in C
- Compiled query launched via exec()
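A minimal sketch of the R consumer side of this design, assuming a packed record layout of one double timestamp followed by d doubles; the command line, block size, and layout are illustrative, not STAR's actual format:

```r
# Read packed binary records from a pipe attached to the compiled query:
# no parsing, just readBin into fixed-size blocks that fit in RAM.
d <- 4                                                   # measures per record (assumed)
con <- pipe("./compiled_query --binary", open = "rb")    # hypothetical command
repeat {
  block <- readBin(con, what = "double", n = 1000 * (d + 1))
  if (length(block) == 0) break                          # end of stream
  m <- matrix(block, ncol = d + 1, byrow = TRUE)         # col 1 = timestamp
  # ... process the block here (aggregate, model update, etc.)
}
close(con)
```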
Assumptions
- Stream velocity handled at ingestion/loading time
- A 1-5 minute delay in stream load + analysis is acceptable
- Small materialized views: time range & aggregation
- Large RAM in the analytic server
- Unlimited-size historical tables
- Sliding time window: recent data
- Separate Unix processes: R runtime, compiled query
Mapping Data Types
- Atomic: time (POSIX), int, float (real), string
- Data structures (a challenge in SQL: not relational!): data frame, vector, matrix, list (mapping sketched below)
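One plausible instance of the mapping in R (illustrative; the exact STAR conversion rules may differ):

```r
# SQL timestamp -> POSIXct, int -> integer, float -> double,
# string -> character; a heterogeneous row set becomes a data frame.
sql_row <- list(ts = "2016-01-01 00:00:05", bytes = 512L,
                rate = 0.75, host = "a1")                # mock SQL values
r_row <- data.frame(
  ts    = as.POSIXct(sql_row$ts, tz = "UTC"),
  bytes = sql_row$bytes,
  rate  = sql_row$rate,
  host  = sql_row$host,
  stringsAsFactors = FALSE
)
str(r_row)   # one R atomic type per SQL column
```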
Data Transfer
- Bidirectional pipe: to transform streams into tables or to transfer a data set
- No parsing at all
- Packed binary file (varying-length strings)
- Block-based processing in R (requires reprogramming and wrapping R calls; sketched below)
- Programming: C vs R (speed vs abstraction)
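A minimal sketch of what "wrapping R calls" means for block-based processing, assuming a hypothetical read_block callback that returns the next block (and an empty vector at end of data):

```r
# A whole-vector call like mean(x) must be refactored to keep running
# sufficient statistics (sum, count) and combine them across blocks.
block_mean <- function(read_block) {
  s <- 0; n <- 0
  repeat {
    x <- read_block()
    if (length(x) == 0) break        # no more blocks
    s <- s + sum(x); n <- n + length(x)
  }
  s / n
}

# toy usage: three blocks arriving one at a time
blocks <- list(1:3, 4:6, numeric(0)); i <- 0
block_mean(function() { i <<- i + 1; blocks[[i]] })   # 3.5
```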
Complexity
- Space: data set O(dn); model O(d) or O(d^2) in RAM
- Time: O(dn) to transfer; O(d^2 n) to compute many models (see the sketch below)
- Transfer time complexity: lower than running queries, lower than computing a model, same as transforming the data set
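One way to see where the O(d^2 n) term comes from (a hedged illustration; the deck does not name a specific model): many models can be computed from the d x d summary X^T X, accumulated block by block while the transfer itself stays O(dn):

```r
# Accumulate Gamma = t(X) %*% X over blocks: O(d^2) work per row,
# O(d^2 n) total, while only O(d^2) stays resident in RAM.
d <- 3
Gamma <- matrix(0, d, d)
for (b in 1:5) {                        # pretend 5 blocks arrive
  X <- matrix(rnorm(100 * d), ncol = d) # one 100 x d block
  Gamma <- Gamma + crossprod(X)         # t(X) %*% X
}
```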
R calls SQL
- Queries always carry a time range, to reduce result size (sketched below)
- Block-based processing to fit in RAM
- The packed binary file resembles a packet: header + payload
- Log data sets always have timestamps
- Every SQL table can be processed in R, but not every R result can be sent back to the DBMS
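A hedged sketch of the R-calls-SQL direction; star_query and the ts column are hypothetical names, and only the mandatory time range is shown (the real system would ship the query to the compiled query process and read the packed binary reply, header + payload, over the pipe):

```r
# Every query is forced to carry a time range so the result is bounded
# and can be fetched block by block.
star_query <- function(sql, t_start, t_end) {
  sprintf("%s WHERE ts >= '%s' AND ts < '%s'", sql, t_start, t_end)
}
star_query("SELECT ts, bytes FROM flows",
           "2016-01-01 00:00:00", "2016-01-01 00:05:00")
```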
SQL calls R
- Via an aggregate UDF, which builds the data set
- Assumption: most math models take matrices as input; therefore both data set layouts are converted to matrix form (dense or sparse)
- Conversion: table rows of floats become vectors; most tables become matrices; in general, tables with diverse data types become data frames (sketch below)
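A minimal sketch of the conversion rule just described (to_model_input is a hypothetical helper, not the actual STAR UDF):

```r
# All-numeric tables become dense matrices, which most math models
# expect; mixed-type tables stay data frames.
to_model_input <- function(df) {
  if (all(vapply(df, is.numeric, logical(1)))) as.matrix(df) else df
}
tab <- data.frame(x1 = c(1, 2, 3), x2 = c(0.5, 0.7, 0.4))
cor(to_model_input(tab))   # e.g. the correlation-matrix use case below
```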
Examples: use cases
- R calls SQL: a statistical analyst needs specific data from the DBMS, extracted with a complex query, then computes descriptive statistics and math models
- SQL calls R: a BI person needs to call a mathematical function in R on a data frame (e.g., smooth a time series) or a matrix (e.g., get a correlation matrix)
Benchmark: low-end equipment
- Hardware: 4 cores @ 2 GHz, 4 GB RAM, 1 TB disk (the real server is much bigger)
- Software: Linux, R, HDFS, MySQL, GNU C++
- Compare read/transfer speed in C and R
- Compare text (CSV) vs binary files (sketched below)
- Measure throughput (10X faster than query processing)
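A rough sketch of the CSV-vs-binary read comparison in R (timings are machine-dependent; the next slide reports binary at roughly 100X over CSV):

```r
# Write the same n x d numeric data as text and as packed binary,
# then time the two reads: read.csv parses, readBin does not.
n <- 1e6; d <- 4
m <- matrix(runif(n * d), ncol = d)
write.csv(m, "data.csv", row.names = FALSE)
writeBin(as.vector(t(m)), "data.bin")              # row-major packed doubles
system.time(read.csv("data.csv"))                  # parsed text
system.time(readBin("data.bin", "double", n * d))  # raw binary, no parsing
```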
Discussion on performance
- Binary files are required for high performance: 100X faster than CSV files
- C is 1000X faster than R, but more difficult to debug
- Disk I/O does not matter for large files because access is sequential
- Data transfer is not the bottleneck (an SQL query or an R call takes > 5 seconds on a large data set)
Conclusions
- SQL queries and R functions combine seamlessly
- Data transfer at maximum speed: reaches streaming speed coming from a row DBMS
- R can process streams coming from the DBMS, and the DBMS can call R in a streaming fashion
- Function calls can be fully bidirectional
- Any table can be transferred in blocks to R, but only data frames can be transferred from R back to the DBMS (asymmetric)
Future work
- Portable interfacing program for other DBMSs; challenge: source code
- Consider alternative storage in the DBMS: column, array => data type mapping plus storage conversion
- Parallel processing with R on multiple nodes
- Evolving models on a stream (time window, visualization)
- Debugging dynamic R code inside a compiled SQL query