Integrating the R Language Runtime System with a Data Stream Warehouse


Carlos Ordonez*, Ted Johnson, Simon Urbanek, Vladislav Shkapenyuk, Divesh Srivastava
AT&T Labs Research, USA
* visiting researcher with AT&T

Talk Outline
- Motivation
- Past AT&T stream systems
- System architecture
- Integrating the R runtime with the query processor
- Bidirectional calls: R calls SQL, SQL calls R
- Benchmark of data mapping & transfer

Network Data Streams
- Feeds: devices, logs
- Timestamps
- Intermittent
- Out-of-order arrival
- Varying speed
- Varying schema
- Active processing, but not real time: < 5 min
- Sliding time window

Motivation: SQL
- Expressive, standardized, well understood
- Efficient, parallel, tunable
- Extensible via UDFs

Motivation: Scaling R
- Remove the RAM limitation
- Go beyond single-threaded processing on one node
- Parallel processing on multiple nodes
- Best of both worlds: manage big data in a DBMS, exploit R's math capabilities

Past AT&T systems
- Gigascope: ultra-fast stream processing in the NIC (packet level), restricted form of SQL, no historical tables
- DataDepot: stores summarized streams, band joins, POSIX file system, compiled SQL queries, integration with a feed management system, UDFs
- TidalRace: Big Data trend, scale-out, "V"ariety

TidalRace
- HDFS, large number of nodes
- Direct, fast copy of log files; no preprocessing
- Multiple asynchronous stream loads
- Eventual consistency: MVCC, time-varying schema
- Lightweight DBMS for metadata
- Integration with the stream feed system
- Compiled SQL queries

Tidalrace Architecture
- Data loading and update propagation
- Queries
- Maintenance
- Tidalrace metadata in MySQL
- Storage Manager (D3SM)
- Data partitions and indices
- File system (local, D3FS, HDFS)

Temporal Partitioning
[Diagram: index over data partitions ordered by time; new data appended at the end]
- The primary partitioning field is the record timestamp
- Stream data is mostly sorted
- Most new data loads into a new partition
- Avoids rebuilding indices
- Simplified data expiration: roll off the oldest partitions

R runtime: challenges
- Dynamic storage in RAM, variable generations
- Type checking at runtime
- RAM constraint when calling functions
- Data structures: data frames, matrices, vectors
- Functional and object-oriented language
- Dynamic processing; garbage collector
- Runtime based on the S language, programmed in C
- Block-based processing requires refactoring R libraries

Applications

STAR: STream Analytics in R
- Separate Unix process
- 64-bit memory space, but 32-bit ints for array indexing
- Packed binary file
- Pipes
- R embedded in C
- Compiled query in exec()
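The separate-process layout above can be sketched with POSIX pipes and fork(). This is a minimal illustration under stated assumptions, not the STAR source: the child process here simply doubles each value it receives, standing in for the embedded R runtime, and the function name is hypothetical.

```c
/* Sketch: compiled query in the parent, analytics runtime in a separate Unix
 * process, connected by a pair of pipes (one per direction). The child is a
 * stand-in for the embedded R runtime: it doubles each double it receives. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Send n doubles to the child process and read back the processed values. */
int run_child_roundtrip(const double *in, double *out, int n) {
    int to_child[2], to_parent[2];
    if (pipe(to_child) < 0 || pipe(to_parent) < 0) return -1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                       /* child: stand-in for R runtime */
        close(to_child[1]); close(to_parent[0]);
        double v;
        while (read(to_child[0], &v, sizeof v) == (ssize_t)sizeof v) {
            v *= 2.0;                     /* "analytics" on the stream */
            (void)write(to_parent[1], &v, sizeof v);
        }
        _exit(0);
    }
    close(to_child[0]); close(to_parent[1]);
    for (int i = 0; i < n; i++)
        (void)write(to_child[1], &in[i], sizeof in[i]);
    close(to_child[1]);                   /* EOF signals end of stream */
    for (int i = 0; i < n; i++)
        (void)read(to_parent[0], &out[i], sizeof out[i]);
    close(to_parent[0]);
    waitpid(pid, NULL, 0);
    return 0;
}
```

Binary values cross the pipe unparsed, which is the property the slides emphasize: no text conversion on either side.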

Assumptions
- Stream velocity handled at ingestion/loading
- A 1-5 minute delay for stream load + analysis is acceptable
- Small materialized views: time range & aggregation
- Large RAM in the analytic server
- Unlimited-size historical tables
- Sliding time window: recent data
- Separate Unix processes: R runtime, compiled query

Mapping Data Types
- Atomic: time (POSIX), int, float (real), string
- Data structures (a challenge in SQL: not relational!): data frame, vector, matrix, list
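A sketch of the atomic type mapping: one table row with a POSIX timestamp, an int, a float, and a varying-length string packed into a binary record with a length prefix on the string. The field names and layout are illustrative assumptions, not the actual STAR wire format.

```c
/* One heterogeneous row packed into a flat binary record: fixed-width
 * numeric fields followed by a length-prefixed, varying-length string. */
#include <stdint.h>
#include <string.h>

typedef struct {
    int64_t ts;      /* POSIX time  -> R numeric/POSIXct */
    int32_t id;      /* SQL int     -> R integer         */
    double  x;       /* SQL float   -> R double          */
    int32_t slen;    /* string length prefix             */
    char    s[32];   /* string payload (bounded here for simplicity) */
} Row;

/* Pack a row into buf; returns the number of bytes written. */
size_t pack_row(const Row *r, unsigned char *buf) {
    size_t off = 0;
    memcpy(buf + off, &r->ts,   sizeof r->ts);   off += sizeof r->ts;
    memcpy(buf + off, &r->id,   sizeof r->id);   off += sizeof r->id;
    memcpy(buf + off, &r->x,    sizeof r->x);    off += sizeof r->x;
    memcpy(buf + off, &r->slen, sizeof r->slen); off += sizeof r->slen;
    memcpy(buf + off, r->s, (size_t)r->slen);    off += (size_t)r->slen;
    return off;
}

/* Unpack a record; returns the number of bytes consumed. */
size_t unpack_row(Row *r, const unsigned char *buf) {
    size_t off = 0;
    memcpy(&r->ts,   buf + off, sizeof r->ts);   off += sizeof r->ts;
    memcpy(&r->id,   buf + off, sizeof r->id);   off += sizeof r->id;
    memcpy(&r->x,    buf + off, sizeof r->x);    off += sizeof r->x;
    memcpy(&r->slen, buf + off, sizeof r->slen); off += sizeof r->slen;
    memcpy(r->s, buf + off, (size_t)r->slen);    off += (size_t)r->slen;
    r->s[r->slen] = '\0';
    return off;
}
```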

Data Transfer
- Bidirectional pipe: to transform streams into tables or to transfer data sets
- No parsing at all
- Packed binary file (varying-length strings)
- Block-based processing in R (requires reprogramming and wrapping R calls)
- Programming: C vs. R (speed vs. abstraction)
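The packed binary format can be sketched as a small header followed by a raw payload of doubles, written and read with bulk I/O so no parsing is needed. The Header struct is a hypothetical layout for illustration, not the actual STAR file format.

```c
/* Header + payload: one fwrite/fread per section, no per-value parsing. */
#include <stdint.h>
#include <stdio.h>

typedef struct { int32_t d; int64_t n; } Header;  /* d columns, n rows */

/* Write header followed by d*n doubles as one packed stream. */
int write_packed(FILE *f, const Header *h, const double *data) {
    if (fwrite(h, sizeof *h, 1, f) != 1) return -1;
    size_t cells = (size_t)h->d * (size_t)h->n;
    return fwrite(data, sizeof(double), cells, f) == cells ? 0 : -1;
}

/* Read it back; cap is the caller's buffer capacity in doubles. */
int read_packed(FILE *f, Header *h, double *data, size_t cap) {
    if (fread(h, sizeof *h, 1, f) != 1) return -1;
    size_t cells = (size_t)h->d * (size_t)h->n;
    if (cells > cap) return -1;
    return fread(data, sizeof(double), cells, f) == cells ? 0 : -1;
}
```

Because the payload is already in machine representation, reading it is a sequential memory copy; that is what makes the binary path far faster than CSV in the benchmark slides.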

Complexity
- Space: data set O(dn); model O(d) or O(d²) in RAM
- Time: O(dn) to transfer; O(d²n) for many models
- Time complexity: lower than queries, lower than computing a model, same as transforming the data set
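The O(d²n) time / O(d²) space figures can be made concrete with a one-pass summarization: accumulating the d×d matrix XᵀX while streaming the n rows, so the O(dn) data set never has to reside in RAM. This is an illustrative sketch of the complexity claim, with d fixed at 3 for brevity.

```c
/* One streaming pass: gamma += x x^T per row.
 * Cost per row is O(d^2), total O(d^2 n); resident state is O(d^2). */
#define D 3

void accumulate(double gamma[D][D], const double x[D]) {
    for (int i = 0; i < D; i++)
        for (int j = 0; j < D; j++)
            gamma[i][j] += x[i] * x[j];
}
```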

R calls SQL
- Queries always have a time range: reduces result size
- Block-based processing to fit in RAM
- Packed binary file resembles a network packet: header + payload
- Log data sets always have timestamps
- Every SQL table can be processed in R, but not every R result can be sent back to the DBMS
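Block-based processing can be sketched as a reader that consumes a stream in fixed-size blocks, so only one block is ever resident regardless of n. The block size and function name are assumptions for illustration.

```c
/* Consume a stream of doubles in fixed-size blocks: RAM use is O(BLOCK),
 * independent of the total stream length. */
#include <stdio.h>

#define BLOCK 4

double sum_stream_blocked(FILE *f) {
    double buf[BLOCK], total = 0.0;
    size_t got;
    while ((got = fread(buf, sizeof(double), BLOCK, f)) > 0)
        for (size_t i = 0; i < got; i++)
            total += buf[i];            /* per-block work stands in for R */
    return total;
}
```

Wrapping R library calls so they run per block rather than on the whole data set is the refactoring the R-runtime slide refers to.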

SQL calls R
- Via an aggregate UDF, which builds the data set
- Assumption: most math models take matrices as input; therefore, both data set layouts are converted to matrix form (dense or sparse)
- Conversion: table rows of floats become vectors; most tables become matrices; in general, tables with diverse data types become data frames
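The row-to-matrix conversion an aggregate UDF performs before handing data to R can be sketched as a layout change: n rows of d floats arrive row-major from the table scan, while R matrices are stored column-major. The function name is hypothetical.

```c
/* Convert n x d row-major table output to the column-major layout
 * R matrices use: mat[j*n + i] holds row i, column j. */
#include <stddef.h>

void rows_to_r_matrix(const double *rows, double *mat, int n, int d) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < d; j++)
            mat[(size_t)j * n + i] = rows[(size_t)i * d + j];
}
```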

Examples: use cases
- R calls SQL: a statistical analyst needs specific data from the DBMS, extracted with a complex query, then computes descriptive statistics and math models
- SQL calls R: a BI user needs to call a mathematical function in R on a data frame (e.g., smooth a time series) or a matrix (e.g., get the correlation matrix)

Benchmark: low-end equipment
- Hardware: 4 cores @ 2 GHz, 4 GB RAM, 1 TB disk (the real server is much bigger)
- Software: Linux, R, HDFS, MySQL, GNU C++
- Compare read/transfer speed in C and R
- Compare text (CSV) vs. binary files
- Measure throughput (10X faster than query processing)

Discussion on performance
- Binary files are required for high performance: 100X faster than CSV files
- C is up to 1000X faster than R, but harder to debug
- Disk I/O does not matter for large files because access is sequential
- Data transfer is not a bottleneck (an SQL query or R call takes > 5 seconds on a large data set)

Conclusions
- SQL queries and R functions combine seamlessly
- Data transfer at maximum speed: reaches streaming speed coming from a row DBMS
- R can process streams coming from the DBMS; the DBMS can call R in a streaming fashion
- Function calls are fully bidirectional
- Any table can be transferred in blocks to R, but only data frames can be transferred from R back to the DBMS (asymmetric)

Future work
- A portable interfacing program for other DBMSs; challenge: source code
- Alternative storage in the DBMS: column, array => data type mapping plus storage conversion
- Parallel processing with R on multiple nodes
- Evolving models on a stream (time window, visualization)
- Debugging dynamic R code inside a compiled SQL query