Published by Tamsyn Preston. Modified over 6 years ago.
1
Integrating the R Language Runtime System with a Data Stream Warehouse
Carlos Ordonez*, Ted Johnson, Simon Urbanek, Vlad Shkapenyuk, Divesh Srivastava
ATT Research Labs, USA
* visiting researcher with ATT
2
Talk Outline
Motivation
Past stream ATT systems
System architecture
Integrating R runtime with query processor
Bidirectional calls: R calls SQL, SQL calls R
Benchmark of data mapping & transfer
3
Network Data Streams
Feeds: devices, logs
Timestamps
Intermittent
Arrival out of order
Varying speed
Varying schema
Active processing, but not real-time: <5 mins
Sliding time window
4
Motivation: SQL
Expressive, standardized, well understood
Efficient, parallel, tunable
Extensible via UDFs
5
Motivation: Scaling R
Remove RAM limitation
Go beyond 1-threaded processing in 1 node
Parallel processing on multiple nodes
Both worlds
  Manage big data in a DBMS
  Exploit R math capabilities
6
Past ATT systems
Gigascope: ultra fast stream processing in NIC (packet level), restricted form of SQL language, no historic tables
DataDepot: store summarized streams, band joins, POSIX file system, compiled SQL queries, integration with feed mgt system, UDFs
TidalRace: Big Data trend, scale out, “V”ariety
7
TidalRace
HDFS
Large number of nodes
Direct & fast copy of log files, no preprocessing
Multiple asynchronous stream loading
Eventual consistency: MVCC, time-varying schema
Light DBMS for metadata
Integration with stream feed system
Compiled SQL queries
8
TidalRace Architecture
[Figure: architecture diagram. Labels: data loading and update propagation; queries; maintenance; TidalRace metadata; Storage Manager (D3SM); MySQL; data partitions and indices; file system (local, D3FS, HDFS)]
9
Temporal Partitioning
[Figure labels: Index, Data, New data, Time]
The primary partitioning field is the record timestamp
Stream data is mostly sorted
Most new data loads into a new partition
Avoid rebuilding indices
Simplified data expiration: roll off oldest partitions
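The partitioning idea on this slide can be illustrated with a minimal sketch. It is not TidalRace code: the class name, the fixed partition width, and the in-memory dictionary are all assumptions made for illustration. It shows how keying partitions by timestamp lets late (out-of-order) records land in their own partition, and how expiration drops whole partitions rather than individual records.

```python
class TimePartitionedTable:
    """Toy table partitioned by record timestamp (hypothetical, in-memory)."""

    def __init__(self, partition_seconds, max_partitions):
        self.partition_seconds = partition_seconds
        self.max_partitions = max_partitions
        self.partitions = {}  # partition start time -> list of (timestamp, record)

    def insert(self, timestamp, record):
        # The partition key is derived from the record timestamp only.
        start = timestamp - timestamp % self.partition_seconds
        self.partitions.setdefault(start, []).append((timestamp, record))
        self._expire()

    def _expire(self):
        # Roll off the oldest partitions as whole units.
        while len(self.partitions) > self.max_partitions:
            del self.partitions[min(self.partitions)]

    def scan(self, t_from, t_to):
        """Return records with t_from <= timestamp < t_to,
        skipping partitions that cannot overlap the range."""
        out = []
        for start in sorted(self.partitions):
            if start + self.partition_seconds <= t_from or start >= t_to:
                continue
            out.extend(r for r in self.partitions[start] if t_from <= r[0] < t_to)
        return out


table = TimePartitionedTable(partition_seconds=60, max_partitions=3)
for ts in (0, 10, 65, 61, 130, 190):  # 61 arrives after 65 (out of order)
    table.insert(ts, f"rec{ts}")
```

After the last insert the table holds four partitions' worth of data, so the partition starting at time 0 is dropped in one step.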
10
R runtime: challenges
Dynamic storage in RAM, variable generations
Type checking at runtime
RAM constraint to call functions
Data structures: data frames, matrices, vectors
Functional and OO language
Dynamic processing; garbage collector
Runtime based on S language, programmed in C
Block-based processing requires refactoring R libs
11
Applications
12
STAR: STream Analytics in R
Separate Unix process
64 bit memory space
32 bit int for arrays
Packed binary file
Pipes
Embedded R in C
Compiled query in exec()
13
Assumptions
Stream velocity handled at ingestion/loading
Acceptable 1-5 minute delay in stream load + analysis
Small size materialized views: time range & aggregation
Large RAM in analytic server
Unlimited size historical tables
Sliding time window: recent data
Separate Unix process: R runtime, compiled query
14
Mapping Data Types
Atomic
  time (POSIX)
  int
  float (real)
  string
Data structures (challenge in SQL, not relational!)
  data frame
  vector
  matrix
  list
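One plausible way to express the atomic part of this mapping is a lookup table. The slide names only the four atomic types; the R-side names used below (`POSIXct`, `integer`, `numeric`, `character`) are standard R types chosen here as an assumption, and `map_schema` is a hypothetical helper, not part of the system described.

```python
# Assumed mapping from the slide's atomic types to standard R types.
ATOMIC_TYPE_MAP = {
    "time": "POSIXct",
    "int": "integer",
    "float": "numeric",
    "string": "character",
}


def map_schema(columns):
    """columns: list of (name, sql_type). Returns list of (name, r_type)."""
    mapped = []
    for name, sql_type in columns:
        if sql_type not in ATOMIC_TYPE_MAP:
            # Non-atomic types have no direct counterpart in this sketch.
            raise ValueError(f"no R mapping for type {sql_type!r} in column {name!r}")
        mapped.append((name, ATOMIC_TYPE_MAP[sql_type]))
    return mapped
```

For example, `map_schema([("ts", "time"), ("bytes", "int")])` yields `[("ts", "POSIXct"), ("bytes", "integer")]`.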
15
Data Transfer
Bidirectional pipe: to transform streams into table or to transfer data set
No parsing at all
Packed binary file (varying length strings)
Block-based processing in R (requires reprogramming and wrapping R calls)
Programming: C vs R (speed vs abstraction)
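To make the "packed binary file with varying length strings" idea concrete, here is a minimal sketch of one possible layout: a small header followed by records whose string field is length-prefixed, so the reader never parses text. The field layout, the header contents, and the function names are assumptions for illustration; the slides do not specify the actual format.

```python
import io
import struct

HEADER = struct.Struct("<II")  # assumed header: record count, field count
FIXED = struct.Struct("<qid")  # timestamp (int64), int (int32), float (float64)
N_FIELDS = 4


def write_block(stream, rows):
    """rows: list of (timestamp, int_value, float_value, text)."""
    stream.write(HEADER.pack(len(rows), N_FIELDS))
    for ts, i, x, text in rows:
        data = text.encode("utf-8")
        stream.write(FIXED.pack(ts, i, x))
        stream.write(struct.pack("<I", len(data)))  # length prefix for the string
        stream.write(data)


def read_block(stream):
    n_rows, n_fields = HEADER.unpack(stream.read(HEADER.size))
    if n_fields != N_FIELDS:
        raise ValueError("unexpected field count in header")
    rows = []
    for _ in range(n_rows):
        ts, i, x = FIXED.unpack(stream.read(FIXED.size))
        (length,) = struct.unpack("<I", stream.read(4))
        rows.append((ts, i, x, stream.read(length).decode("utf-8")))
    return rows


buf = io.BytesIO()
sample = [(1400000000, 7, 0.5, "eth0"), (1400000060, 8, 12.25, "a longer device name")]
write_block(buf, sample)
buf.seek(0)
decoded = read_block(buf)
```

The same two functions would work over a pipe instead of `io.BytesIO`, since they only rely on sequential `read` and `write`.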
16
Complexity
Space
  Data set: O(dn)
  Model: O(d), O(d^2) in RAM
Time
  O(dn) to transfer
  O(d^2 n) for many models
Time complexity of transfer: lower than queries, lower than computing a model, same as transforming the data set
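As an illustration of where O(d^2 n) time and O(d^2) model space can come from, the sketch below accumulates the count, the linear sums, and the sums of cross-products of a d-dimensional data set in one pass over blocks, then derives a (population) covariance matrix. This is a common one-pass technique chosen as an assumption; the slides do not say which models or algorithms were used, and the function names are hypothetical.

```python
def accumulate(blocks, d):
    """One pass over blocks of d-dimensional points.
    Time O(d^2 n); memory O(d^2) regardless of n."""
    n = 0
    linear = [0.0] * d                        # sum of x
    quad = [[0.0] * d for _ in range(d)]      # sum of x x^T
    for block in blocks:
        for x in block:
            n += 1
            for a in range(d):
                linear[a] += x[a]
                for b in range(d):
                    quad[a][b] += x[a] * x[b]
    return n, linear, quad


def covariance(n, linear, quad):
    """Population covariance from the accumulated sums."""
    d = len(linear)
    return [
        [quad[a][b] / n - (linear[a] / n) * (linear[b] / n) for b in range(d)]
        for a in range(d)
    ]


blocks = [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0)]]
n, linear, quad = accumulate(blocks, d=2)
cov = covariance(n, linear, quad)
```

Only one block needs to be in RAM at a time; the accumulated sums are the only state carried between blocks.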
17
R calls SQL
Query always has time range: reduce size
Block-based processing to fit in RAM
Packed binary file resembles a packet: header+payload
Always, log data set has timestamps
Every table in SQL can be processed in R, but not every R result can be sent back to DBMS
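The combination of a time-range query and block-based processing can be sketched as follows. Python's built-in `sqlite3` module stands in for the DBMS and Python stands in for the R side; the table name `log`, its columns, and `fetch_blocks` are assumptions made for illustration only.

```python
import sqlite3


def fetch_blocks(conn, t_from, t_to, block_rows):
    """Yield the result of a time-range query in blocks of at most block_rows rows."""
    cur = conn.execute(
        "SELECT ts, value FROM log WHERE ts >= ? AND ts < ? ORDER BY ts",
        (t_from, t_to),
    )
    while True:
        block = cur.fetchmany(block_rows)
        if not block:
            break
        yield block


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ts INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?)", [(t, t * 0.5) for t in range(10)]
)

# The consumer keeps only running totals, never the full result set.
total, count, block_sizes = 0.0, 0, []
for block in fetch_blocks(conn, t_from=2, t_to=9, block_rows=3):
    block_sizes.append(len(block))
    for ts, value in block:
        total += value
        count += 1
mean_value = total / count
```

The time range bounds how much data leaves the DBMS, and the block size bounds how much of it is held in memory at once.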
18
SQL calls R
Via aggregate UDF, which builds data set
Assumption: most math models take matrices as input. Therefore, given two data set layouts, they are converted to matrix form (dense or sparse).
Conversion:
  table rows with floats are converted to vectors
  most tables are converted to matrices
  in general, tables with diverse data types are converted to data frames
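The conversion rule on this slide can be read as a small decision procedure. The sketch below is one interpretation, with assumptions labeled: a single all-numeric row becomes a vector, an all-numeric table becomes a dense matrix (a list of row vectors), and anything with mixed types becomes a column-oriented data frame. The type names and `choose_structure` are hypothetical, and sparse matrices are not covered.

```python
def choose_structure(columns, rows):
    """columns: list of (name, type); rows: list of tuples.
    Returns (kind, data) following one reading of the slide's conversion rule."""
    types = {col_type for _, col_type in columns}
    all_numeric = types <= {"int", "float"}
    if all_numeric and len(rows) == 1:
        # Assumed rule: a single numeric row maps to a vector.
        return "vector", [float(v) for v in rows[0]]
    if all_numeric:
        # Assumed rule: a numeric table maps to a dense matrix.
        return "matrix", [[float(v) for v in row] for row in rows]
    # Mixed types: keep one typed column per name, like a data frame.
    frame = {name: [row[j] for row in rows] for j, (name, _) in enumerate(columns)}
    return "data_frame", frame
```

For example, a table with columns `("a", "float"), ("b", "int")` and two rows comes back as a 2 x 2 matrix, while adding a string column switches the result to a data frame.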
19
Examples: use cases
R calls SQL: a statistical analyst needs some specific data from the DBMS, extracted with a complex query, then computes descriptive statistics and math models
SQL calls R: a BI person needs to call some mathematical function in R on a data frame (e.g. smooth a time series) or on a matrix (e.g. get a correlation matrix)
20
Benchmark: low end equipment
Hardware: 4 cores at 2 GHz, 4 GB RAM, 1 TB disk (real server much bigger)
Software: Linux, R, HDFS, MySQL, GNU C++
Compare read/transfer speed in C and R
Compare text (csv) vs binary files
Measure throughput (10X faster than query processing)
21
Discussion on performance
Binary files required for high performance: 100X faster than csv files
C 1000X faster than R, but difficult to debug
Disk I/O does not matter for large files because access is sequential
Data transfer is not a bottleneck (SQL query or R call takes >5 seconds on a large data set)
22
Conclusions
Combine SQL queries and R functions seamlessly
Data transfer at maximum speed: reach streaming speed coming from a row DBMS
R can process streams coming from the DBMS, the DBMS can call R in a streaming fashion
Function calls can be fully bidirectional
Any table can be transferred in blocks to R, but only data frames can be transferred from R to DBMS (asymmetric)
23
Future work
Portable interfacing program with other DBMSs; challenge: source code
Consider alternative storage in DBMS: column, array => data type mapping plus storage conversion
Parallel processing with R on multiple nodes
Evolving models on a stream (time window, visualize)
Debugging dynamic R code in a compiled SQL query