Continuous Analytics Over Discontinuous Streams Sailesh Krishnamurthy, Michael Franklin, Jeff Davis, Daniel Farina, Pasha Golovko, Alan Li, Neil Thombre.

Presentation transcript:

Continuous Analytics Over Discontinuous Streams
Sailesh Krishnamurthy, Michael Franklin, Jeff Davis, Daniel Farina, Pasha Golovko, Alan Li, Neil Thombre
June 10, 2010, SIGMOD, Indianapolis

Truviso: founded in 2005, with roots in the TelegraphCQ project from UC Berkeley. HQ in Foster City, CA. Focus on “Continuous Analytics”. Fortune 100 and web-based Big Data customers.

Stream Query Processing (Traditional View)
[Figure labels: Source Data, Data Records / “Events”, CQ Processor, Real-Time Analysis, Update Display.]

SQL Execution on Streaming Data
A stream is an unbounded sequence of records. A table is a set of records. Window operators convert streams to tables, and SQL queries apply to tables. Each window produces a set of records (a table).
Semantics: repeatedly apply generic SQL to the results of window operators. Results are continuously appended to the output stream.

Example: SQL Queries over Streams
Every 3 seconds, compute the revenue by advertiser based on impression data, over a 5-second “sliding window”:
    SELECT I.Advertiser, SUM(I.price * I.volume)
    FROM Impressions I, Campaigns C
    WHERE I.campaign_id = C.campaign_id AND C.type = 'CPM'
    GROUP BY I.Advertiser
The slide annotates the window clause with two requirements: “I want to look at 5 seconds worth of impressions” and “I want results every 3 seconds”.
[Figure labels: Impression Data Stream, Window Operator, Window Clause, Result(s).]
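The window semantics above can be sketched in plain Python. This is only an illustration, not the engine's implementation: the function name `sliding_window_revenue`, the sample data, and the choice that each window covers the half-open interval (end - 5, end] are assumptions, and the join with Campaigns is assumed to have been applied already.

```python
from collections import defaultdict

def sliding_window_revenue(impressions, visible=5, advance=3):
    """Sum price * volume per advertiser for each sliding window.

    impressions: (ts, advertiser, price, volume) tuples with ts > 0,
    assumed already joined with Campaigns and filtered to type 'CPM'.
    Windows close at advance, 2 * advance, ... and cover (end - visible, end].
    """
    if not impressions:
        return []
    last_ts = max(ts for ts, _, _, _ in impressions)
    results = []
    end = advance
    while True:
        start = end - visible
        totals = defaultdict(float)
        for ts, advertiser, price, volume in impressions:
            if start < ts <= end:  # the window turns this slice of the stream into a table
                totals[advertiser] += price * volume
        results.append((end, dict(totals)))  # one result set appended per window
        if end >= last_ts:
            break
        end += advance
    return results

# Hypothetical impression data: (timestamp, advertiser, price, volume)
impressions = [(1, "acme", 2.0, 1), (2, "bolt", 1.0, 3),
               (4, "acme", 1.0, 2), (6, "bolt", 2.0, 1)]
windows = sliding_window_revenue(impressions)
```

Because the window is wider than the advance, the impression at timestamp 2 contributes to both the window closing at 3 and the window closing at 6.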

Assumptions About Streams
Continuous sequences, arriving mostly in order.

The Reality
Data arrives minutes, hours, or days late. Multiple streams are out of sync, with gaps, …

Traditional (In-Order) Solution #1: “Slack”
[Figure columns: Tuple #, Time Stamp, 3-Second Slack Buffer, OUTPUT.]

Slack
Pros: Simple. Handles “jitter” (slightly out-of-order arrival).
Cons: Introduces delay. Permanently drops arrivals later than the buffer. Unbounded buffer size. Permanently drops arrivals if there are lulls in multiple input streams.
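A minimal sketch of a slack buffer, to make the trade-off concrete. It is one plausible reading of the technique rather than any particular system's code: the name `slack_reorder` and the release rule (a tuple is released once the largest timestamp seen is at least `slack` ahead of it) are assumptions.

```python
import heapq

def slack_reorder(arrivals, slack=3):
    """Reorder timestamps with a slack buffer.

    arrivals: timestamps in arrival order.
    Returns (emitted, dropped): emitted is in non-decreasing timestamp
    order; dropped holds tuples that arrived more than `slack` behind
    the largest timestamp seen so far.
    """
    heap, emitted, dropped = [], [], []
    max_seen = float("-inf")
    for ts in arrivals:
        max_seen = max(max_seen, ts)
        watermark = max_seen - slack
        if ts < watermark:
            dropped.append(ts)  # too late: earlier output cannot be reordered
            continue
        heapq.heappush(heap, ts)
        # Release everything old enough that no accepted tuple can precede it.
        while heap and heap[0] <= watermark:
            emitted.append(heapq.heappop(heap))
    while heap:  # end of input: flush what is still buffered
        emitted.append(heapq.heappop(heap))
    return emitted, dropped

emitted, dropped = slack_reorder([1, 2, 4, 3, 5, 9, 2, 10], slack=3)
```

In this run the tuple with timestamp 3 arrives after 4 and is still emitted in order (jitter is absorbed), while the second tuple with timestamp 2 arrives after 9 and is permanently dropped.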

Traditional (In-Order) Solution #2: “Drift”
[Figure columns: Source 1, Source 2, 2-Second Drift Buffer, OUTPUT. Tuple trace as extracted: (A,1) (a,2) (A,1) (B,2) (b,3) (a,2), (B,2) (C,3) (c,4) (b,3), (C,3) (G,4) (d,5) (c,4), (G,4) (D,6) (d,5) (E,7) (D,6),(E,7) (R,8) (E,7),(R,8) (D,6) (F,9) (x,5) (R,8),(F,9) (E,7) (z,10) (z,10) (R,8), (F,9)]

Drift
Pros: Simple. Handles multiple streams with short “lulls” in arrival.
Cons: Doesn’t handle streams with dramatically different arrival rates. Permanently drops data that arrives after the drift window has expired.
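The drift idea can be sketched as a merge that waits for the slowest source, but only up to a bound. This is an interpretation under stated assumptions, not the slide's exact algorithm: the name `drift_merge`, the `frontier` rule, and the sample arrivals are all hypothetical, and each source is assumed to be internally ordered.

```python
import heapq

def drift_merge(arrivals, sources, drift=2):
    """Merge several internally ordered sources into one ordered output.

    arrivals: (source, ts) pairs in arrival order.
    A tuple is released once every source has reached its timestamp, or
    once the fastest source is at least `drift` ahead of it.
    """
    latest = {s: float("-inf") for s in sources}
    heap, emitted, dropped = [], [], []
    frontier = float("-inf")  # everything at or below this has been released
    for src, ts in arrivals:
        if ts < frontier:
            dropped.append((src, ts))  # arrived after the drift window expired
            continue
        latest[src] = max(latest[src], ts)
        heapq.heappush(heap, (ts, src))
        # Safe point: the slowest source, but never more than `drift`
        # behind the fastest one.
        frontier = max(frontier, min(latest.values()), max(latest.values()) - drift)
        while heap and heap[0][0] <= frontier:
            t, s = heapq.heappop(heap)
            emitted.append((s, t))
    while heap:  # end of input: flush what is still buffered
        t, s = heapq.heappop(heap)
        emitted.append((s, t))
    return emitted, dropped

# A short lull in s2 is tolerated ...
ok, none_dropped = drift_merge(
    [("s1", 1), ("s2", 2), ("s1", 3), ("s1", 4), ("s1", 5), ("s2", 3), ("s2", 6)],
    sources=["s1", "s2"])
# ... but a source that lags by more than the drift bound loses data.
out, lost = drift_merge([("s1", 1), ("s1", 2), ("s1", 5), ("s2", 2)],
                        sources=["s1", "s2"])
```

The second call shows the con listed above: once s1 has run 3 seconds ahead, the s2 tuple with timestamp 2 falls behind the frontier and is dropped.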

Traditional Solution #3: Order-Agnostic Operators
Slack and Drift aim to order streams before presenting them to order-sensitive operators. Many operators don’t care about order:
    SELECT count(*), cq_close(*) ts FROM S

Out-of-Order Processing: Count Example
[Figure columns: Tuple #, Time Stamp, Heartbeat, Count State, OUTPUT. Outputs shown: (4,t=5), (3,t=10).]

Order-Agnostic Operators
Pros: No buffering. No extra delays. Handles out-of-order tuples that make it before the heartbeat.
Cons: Some operators do care about order. Permanently drops data that arrives after the heartbeat.
Note: Lost data also impacts bigger “roll-up queries”, e.g. with sharing.
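A small sketch of an order-agnostic count driven by heartbeats. The event encoding, the function name `count_with_heartbeats`, and the 5-second tumbling windows are assumptions made for illustration; the sample events are chosen so the outputs echo the (4,t=5) and (3,t=10) results of the count example.

```python
def count_with_heartbeats(events, width=5):
    """Count tuples per window without sorting them first.

    events: ("data", ts) or ("heartbeat", ts) pairs in arrival order,
    with integer timestamps. Window `end` covers (end - width, end].
    A heartbeat at time h closes every window with end <= h.
    """
    open_counts = {}    # window end -> running count (the only state kept)
    closed_through = 0  # windows ending at or before this are closed
    output, dropped = [], []
    for kind, ts in events:
        if kind == "data":
            end = -(-ts // width) * width  # round ts up to its window end
            if end <= closed_through:
                dropped.append(ts)  # window already emitted: data is lost
            else:
                open_counts[end] = open_counts.get(end, 0) + 1
        elif kind == "heartbeat":
            for end in sorted(e for e in open_counts if e <= ts):
                output.append((open_counts.pop(end), end))
            closed_through = max(closed_through, ts // width * width)
    return output, dropped

events = [("data", 1), ("data", 4), ("data", 2), ("data", 7), ("data", 3),
          ("heartbeat", 5),
          ("data", 4), ("data", 8), ("data", 9),
          ("heartbeat", 10)]
output, dropped = count_with_heartbeats(events)
```

Tuples 4, 2, and 3 arrive out of order but before the heartbeat, so they are counted with no buffering. The tuple with timestamp 4 that arrives after the heartbeat at 5 is dropped, and any roll-up built on the t=5 count is off by one as a result.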

So, how to handle very late data and discontinuous streams?

“Stream-Relational” Architecture [CIDR 09]
[Figure labels: Integration Framework; Shared Stream Query Processor; Persistent Data Store; SQL Interface; Raw Data; Aggregates; Data Warehouse; App Logic / UDFs; Other TruCQ’s. Interfaces listed: JDBC / JMS, XML, flat files, ETL tools, SOAP, APIs.]

Order-Independent Processing: Overview
Answers that have already been delivered can only be compensated. Need to preserve all arriving data.
Queries return answers based on all relevant data that has arrived. CQ’s: continuous queries. SQ’s: SQL queries on archived streams and answers.
Approach: leverage the benefits of SQL(!): data-parallel processing with on-demand consolidation, and powerful “View” mechanisms. Basically, create parallel partitions for late data and rewrite queries as views over partial results.

Out-of-Order Processing: Count Example
[Figure columns: Tuple #, Data TS, Control, Count State, Partitions, OUTPUT. Output shown: (6,t=5).]

Out-of-Order Processing: Count Example (continued)
[Figure columns: Tuple #, Data TS, Control, Count State, Partitions, OUTPUT. Control rows include flush-2 and flush-3. Outputs shown: (6,t=5), (3,t=10), (2,t=5), (1,t=15), (2,t=10), (2,t=5).]

Out-of-Order Processing: Count Example (result)
OUTPUT: (6,t=5) (3,t=10) (2,t=5) (1,t=15) (2,t=10) (2,t=5)
Treat the output as “Partial State Records” (PSRs). Rewrite queries using views over PSRs, i.e., consolidate on demand. The paper goes into substantial detail on how the rewrites work.
Same answer as Order-Insensitive, as a roll-up. The answer contains all data. Subsequent SQs over archived results and raw data contain all data too!
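The partition-and-consolidate idea can be sketched as follows. In the system described here the consolidation is expressed as SQL views over PSRs; the Python below is only a stand-in, and the names `count_into_partitions` and `consolidate`, the event encoding, and the single "late" partition are assumptions.

```python
from collections import defaultdict

def count_into_partitions(events, width=5):
    """Like a heartbeat-driven count, but late data is never dropped.

    events: ("data", ts), ("heartbeat", ts), or ("flush", None) pairs.
    Late tuples accumulate in a separate partition; a flush archives
    that partition's counts as additional partial state records.
    Returns the archived PSRs as (count, window_end) pairs.
    """
    psrs = []
    on_time, late = {}, {}  # window end -> partial count, per partition
    closed_through = 0
    for kind, ts in events:
        if kind == "data":
            end = -(-ts // width) * width  # round ts up to its window end
            partition = late if end <= closed_through else on_time
            partition[end] = partition.get(end, 0) + 1
        elif kind == "heartbeat":
            for end in sorted(e for e in on_time if e <= ts):
                psrs.append((on_time.pop(end), end))
            closed_through = max(closed_through, ts // width * width)
        elif kind == "flush":
            for end in sorted(late):
                psrs.append((late.pop(end), end))
    return psrs

def consolidate(psrs):
    """On-demand consolidation: roll partial counts up per window."""
    totals = defaultdict(int)
    for count, end in psrs:
        totals[end] += count
    return dict(totals)

# The six partial records listed in the OUTPUT above.
slide_psrs = [(6, 5), (3, 10), (2, 5), (1, 15), (2, 10), (2, 5)]
final = consolidate(slide_psrs)

events = [("data", 1), ("data", 2), ("data", 3), ("heartbeat", 5),
          ("data", 4), ("data", 7), ("flush", None),
          ("data", 2), ("heartbeat", 10), ("flush", None)]
partials = count_into_partitions(events)
```

Since count is rolled up by summation, consolidating the six records gives 10 for t=5, 5 for t=10, and 1 for t=15, with no arrival discarded.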

Handles Very Late Data, Plus You Get…
Parallel processing: multicore and cluster.
[Figure: a client and a high-bandwidth network interconnect linking nodes marked U and D. D = Distributed Processing Node, U = Unified Processing Node.]

Other Details in the Paper
Beyond late data and parallelism, the approach is also key to supporting fault tolerance using replication; high availability via fast restart; “nostalgic” continuous queries that start in the past and catch up to the present; and fast concurrent creation of archives for new CQs.
Algorithmic/systems details cover integration with the overall system architecture, interaction with the transaction mechanism, the need for a background Reducer task, and hybrid plans for non-parallelizable parts of queries.

Conclusions
Early stream processing systems were based on simplistic assumptions about ordering. Truviso’s 3.2 engine incorporates a new mechanism so that no data is permanently dropped.
The approach leverages the strengths of SQL: data-parallel processing models, and sophisticated and efficient view functionality.
The key is on-demand consolidation. Of course, you can only do it if you have an integrated stream-relational system.