Integrating the R Language Runtime System with a Data Stream Warehouse

Carlos Ordonez*, Ted Johnson, Simon Urbanek, Vlad Shkapenyuk, Divesh Srivastava
AT&T Research Labs, USA
* visiting researcher with AT&T

Talk Outline
- Motivation
- Past AT&T stream systems
- System architecture
- Integrating R runtime with query processor
- Bidirectional calls: R calls SQL, SQL calls R
- Benchmark of data mapping & transfer

Network Data Streams
- Feeds: devices, logs
- Timestamps
- Intermittent
- Arrival out of order
- Varying speed
- Varying schema
- Active processing, but not real-time: <5 mins
- Sliding time window

Motivation: SQL
- Expressive, standardized, well understood
- Efficient, parallel, tunable
- Extensible via UDFs

Motivation: Scaling R
- Remove RAM limitation
- Go beyond single-threaded processing on one node
- Parallel processing on multiple nodes
- Both worlds
  - Manage big data in a DBMS
  - Exploit R math capabilities

Past AT&T systems
- Gigascope: ultra-fast stream processing in the NIC (packet level), restricted form of SQL, no historic tables
- DataDepot: stores summarized streams, band joins, POSIX file system, compiled SQL queries, integration with feed management system, UDFs
- TidalRace: Big Data trend, scale out, “V”ariety

TidalRace
- HDFS
- Large number of nodes
- Direct & fast copy of log files, no preprocessing
- Multiple asynchronous stream loading
- Eventual consistency: MVCC, time-varying schema
- Light DBMS for metadata
- Integration with stream feed system
- Compiled SQL queries

TidalRace Architecture
[Architecture diagram. Labeled components: data loading and update propagation; queries; maintenance; TidalRace metadata; Storage Manager (D3SM); MySQL; data partitions and indices; file system (local, D3FS, HDFS).]

Temporal Partitioning
[Diagram labels: Index, Data, New data, Time]
- The primary partitioning field is the record timestamp
- Stream data is mostly sorted
- Most new data loads into a new partition
- Avoid rebuilding indices
- Simplified data expiration: roll off oldest partitions
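The partitioning idea above can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: the class `TemporalPartitions`, its fixed bucket width, and the `max_partitions` limit are hypothetical names chosen for illustration, not TidalRace's actual design.

```python
class TemporalPartitions:
    """Toy store partitioned by fixed-width time buckets (hypothetical)."""

    def __init__(self, width_s, max_partitions):
        self.width_s = width_s                # bucket width in seconds
        self.max_partitions = max_partitions  # retention limit, in buckets
        self.partitions = {}                  # bucket start -> [(ts, record)]

    def load(self, ts, record):
        # The record timestamp alone decides the partition, so mostly
        # sorted input lands in the newest bucket and older buckets
        # (and any indices built on them) are left untouched.
        bucket = ts - ts % self.width_s
        self.partitions.setdefault(bucket, []).append((ts, record))
        self._expire()

    def _expire(self):
        # Expiration drops whole partitions, oldest first.
        while len(self.partitions) > self.max_partitions:
            del self.partitions[min(self.partitions)]

    def query(self, t_start, t_end):
        # Return records with t_start <= ts < t_end, skipping
        # partitions that cannot overlap the requested range.
        out = []
        for bucket in sorted(self.partitions):
            if bucket + self.width_s <= t_start or bucket >= t_end:
                continue
            out.extend(r for ts, r in self.partitions[bucket]
                       if t_start <= ts < t_end)
        return out
```

A slightly late record (one whose bucket still exists) is simply appended to its existing partition, which is one way to tolerate the out-of-order arrival mentioned earlier in the talk.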

R runtime: challenges
- Dynamic storage in RAM, variable generations
- Type checking at runtime
- RAM constraint to call functions
- Data structures: data frames, matrices, vectors
- Functional and OO language
- Dynamic processing; garbage collector
- Runtime based on S language, programmed in C
- Block-based processing requires refactoring R libs

Applications

STAR: STream Analytics in R
- Separate Unix process
- 64-bit memory space
- 32-bit int for arrays
- Packed binary file
- Pipes
- Embedded R in C
- Compiled query in exec()

Assumptions
- Stream velocity handled at ingestion/loading
- Acceptable 1-5 minute delay in stream load + analysis
- Small size materialized views: time range & aggregation
- Large RAM in analytic server
- Unlimited size historical tables
- Sliding time window: recent data
- Separate Unix process: R runtime, compiled query

Mapping Data Types
- Atomic
  - time (POSIX)
  - int
  - float (real)
  - string
- Data structures (challenge in SQL, not relational!)
  - data frame
  - vector
  - matrix
  - list

Data Transfer
- Bidirectional pipe: to transform streams into a table or to transfer a data set
- No parsing at all
- Packed binary file (varying length strings)
- Block-based processing in R (requires reprogramming and wrapping R calls)
- Programming: C vs R (speed vs abstraction)
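To make the "packed binary file with varying length strings" idea concrete, here is a minimal sketch of one possible block layout. The layout is an assumption made for illustration (the marker `BLK1`, the field order, and the functions `pack_block` and `unpack_block` are all hypothetical); the talk does not specify the actual file format.

```python
import struct

MAGIC = b"BLK1"          # hypothetical block marker
HEADER = "<4sII"         # marker, row count, payload size in bytes
ROW = "<qdI"             # timestamp (int64), value (float64), string length

def pack_block(rows):
    """Pack rows of (ts: int, value: float, label: str) into one block."""
    payload = bytearray()
    for ts, value, label in rows:
        raw = label.encode("utf-8")
        # Fixed-width fields first, then the string bytes, prefixed by
        # their length so the reader never has to scan for a delimiter.
        payload += struct.pack(ROW, ts, value, len(raw))
        payload += raw
    header = struct.pack(HEADER, MAGIC, len(rows), len(payload))
    return header + bytes(payload)

def unpack_block(buf):
    """Inverse of pack_block: read the header, then walk the payload."""
    magic, nrows, _nbytes = struct.unpack_from(HEADER, buf, 0)
    if magic != MAGIC:
        raise ValueError("not a packed block")
    off = struct.calcsize(HEADER)
    rows = []
    for _ in range(nrows):
        ts, value, n = struct.unpack_from(ROW, buf, off)
        off += struct.calcsize(ROW)
        rows.append((ts, value, buf[off:off + n].decode("utf-8")))
        off += n
    return rows
```

Because every field is either fixed-width or length-prefixed, the reader copies bytes straight into typed values with no text parsing, which is the property the slide emphasizes.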

Complexity
- Space
  - Data set: O(dn)
  - Model: O(d), O(d^2) in RAM
- Time
  - O(dn) to transfer
  - O(d^2 n) for many models
- Transfer time complexity: lower than queries, lower than computing a model, same as transforming the data set
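As a worked illustration of these bounds, the sketch below accumulates the row count, the per-column sums, and the d x d matrix of cross-products one block at a time. This particular summary is an illustrative choice, not something the slides specify, and the names `new_summary`, `update`, and `covariance` are hypothetical. Each row costs O(d^2) work, giving O(d^2 n) overall, while memory stays at O(d^2) no matter how many rows stream through.

```python
def new_summary(d):
    """Empty summary for d numeric columns: O(d^2) memory."""
    return {"n": 0,
            "L": [0.0] * d,                       # column sums
            "Q": [[0.0] * d for _ in range(d)]}   # sums of cross-products

def update(summary, block):
    """Fold one block of rows into the summary: O(d^2) work per row."""
    d = len(summary["L"])
    for x in block:
        summary["n"] += 1
        for i in range(d):
            summary["L"][i] += x[i]
            for j in range(d):
                summary["Q"][i][j] += x[i] * x[j]
    return summary

def covariance(summary):
    """Population covariance matrix derived from the summary alone."""
    n, L, Q = summary["n"], summary["L"], summary["Q"]
    d = len(L)
    return [[Q[i][j] / n - (L[i] / n) * (L[j] / n) for j in range(d)]
            for i in range(d)]

# Two blocks arrive separately; the full data set is never held at once.
s = new_summary(2)
update(s, [[1.0, 2.0], [3.0, 4.0]])
update(s, [[5.0, 6.0]])
cov = covariance(s)
```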

R calls SQL
- Query always has time range: reduce size
- Block-based processing to fit in RAM
- Packed binary file resembles a packet: header + payload
- The log data set always has timestamps
- Every table in SQL can be processed in R, but not every R result can be sent back to the DBMS
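The combination of a mandatory time range and block-based processing can be sketched as a generator. This is a simplified stand-in, assuming rows are tuples whose first field is the timestamp; the function name `blocks_in_range` and the `block_size` parameter are hypothetical and do not describe STAR's real interface.

```python
def blocks_in_range(rows, t_start, t_end, block_size):
    """Yield lists of at most block_size rows with t_start <= ts < t_end.

    rows is any iterable of tuples whose first element is the timestamp,
    so the caller never needs the whole result set in memory at once.
    """
    block = []
    for row in rows:
        if t_start <= row[0] < t_end:
            block.append(row)
            if len(block) == block_size:
                yield block
                block = []
    if block:
        # Emit the final, possibly shorter, block.
        yield block

# Consume a ten-row stream two rows at a time, restricted to ts in [2, 7).
stream = ((t, float(t)) for t in range(10))
blocks = list(blocks_in_range(stream, 2, 7, 2))
```

The time range bounds how much data leaves the DBMS, and the block size bounds how much of it the analytic side holds at any moment.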

SQL calls R
- Via an aggregate UDF, which builds the data set
- Assumption: most math models take matrices as input. Therefore, given two data set layouts, they are converted to matrix form (dense or sparse).
- Conversion: table rows with floats are converted to vectors, most tables are converted to matrices, and in general tables with diverse data types are converted to data frames
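The conversion rule above can be written as a small decision function. This is a sketch of one reading of the slide: the function `target_structure`, the type labels, and the choice to treat only a single all-float row as a vector are assumptions, not the system's documented behavior.

```python
NUMERIC = {"int", "float"}   # type labels are hypothetical

def target_structure(column_types, n_rows):
    """Pick the R-side structure for a table, following the slide's rule.

    column_types: list of type labels, one per column
    n_rows: number of rows in the table
    """
    if n_rows == 1 and all(t == "float" for t in column_types):
        return "vector"        # a single row of floats
    if all(t in NUMERIC for t in column_types):
        return "matrix"        # homogeneous numeric table
    return "data frame"        # diverse data types

examples = [
    target_structure(["float", "float", "float"], 1),
    target_structure(["float", "int"], 1000),
    target_structure(["time", "string", "float"], 1000),
]
```

Falling back to a data frame for mixed types matches the asymmetry noted elsewhere in the talk: the data frame is the one R structure that maps cleanly to and from a relational table.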

Examples: use cases
- R calls SQL: a statistical analyst needs some specific data from the DBMS, extracted with a complex query, and then computes descriptive statistics and math models
- SQL calls R: a BI person needs to call some mathematical function in R on a data frame (e.g. smooth a time series) or a matrix (e.g. get a correlation matrix)

Benchmark: low-end equipment
- Hardware: 4 cores at 2 GHz, 4 GB RAM, 1 TB disk (real server much bigger)
- Software: Linux, R, HDFS, MySQL, GNU C++
- Compare read/transfer speed in C and R
- Compare text (CSV) vs binary files
- Measure throughput (10X faster than query processing)

Discussion on performance
- Binary files required for high performance: 100X faster than CSV files
- C is 1000X faster than R, but difficult to debug
- Disk I/O does not matter for large files because access is sequential
- Data transfer is not a bottleneck (a SQL query or R call takes >5 seconds on a large data set)

Conclusions
- Combine SQL queries and R functions seamlessly
- Data transfer at maximum speed: reach streaming speed coming from a row DBMS
- R can process streams coming from the DBMS; the DBMS can call R in a streaming fashion
- Function calls can be fully bidirectional
- Any table can be transferred in blocks to R, but only data frames can be transferred from R to the DBMS (asymmetric)

Future work
- Portable interfacing program with other DBMSs; challenge: source code
- Consider alternative storage in DBMS: column, array => data type mapping plus storage conversion
- Parallel processing with R on multiple nodes
- Evolving models on a stream (time window, visualize)
- Debugging dynamic R code in a compiled SQL query
