Stream Processing Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems March 30, 2005.

2 Administrivia
• Thursday, L101, 3PM:
  • Muthian Sivathanu, U. Wisc., Semantically Smart Disk Systems
• Next readings:
  • Monday – read and review the Madden paper
  • Wednesday – read and summarize the Brin and Page paper

3 Today’s Trivia Question

4 Data Stream Management
• Basic idea: static queries, dynamic data
• Applications:
  • Publish-subscribe systems
  • Stock tickers, news headlines
  • Data acquisition, e.g., from sensors, traffic monitoring, …
• The two main projects that are purely “stream processors”:
  • Stanford STREAM
  • MIT/Brown/Brandeis Aurora/Medusa

5 Summary from Last Time
• Streams are time-varying data series
  • STREAM maps them into timestamped sets
  • (Aurora doesn’t seem to do this)
• Most operations on streams resemble normal DB queries:
  • Filtering, projection; grouping and aggregation; join
  • (Though the latter few are over windows)
• STREAM started with an SQL-like language called CQL
  • All stream operations go “through” relations
• Query plan operators have queues and synopses
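
As an illustration of "aggregation over windows", here is a minimal Python sketch of a time-based sliding-window average. It is not CQL or STREAM code: the function name `sliding_window_avg`, the `(timestamp, value)` tuple layout, and the assumption that timestamps arrive in non-decreasing order are all made up for the example. The deque plays the role of the operator's synopsis.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def sliding_window_avg(
    stream: Iterable[Tuple[float, float]], range_: float
) -> Iterator[Tuple[float, float]]:
    """For each arriving (ts, value), yield (ts, average of the values whose
    timestamp lies in the half-open window (ts - range_, ts]).

    Assumes timestamps are non-decreasing (hypothetical simplification).
    """
    window: Deque[Tuple[float, float]] = deque()  # the operator's synopsis
    total = 0.0
    for ts, value in stream:
        window.append((ts, value))
        total += value
        # Expire tuples that have slid out of the window.
        while window and window[0][0] <= ts - range_:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total / len(window)


if __name__ == "__main__":
    readings = [(1, 10.0), (2, 20.0), (3, 30.0), (5, 50.0)]
    for ts, avg in sliding_window_avg(readings, range_=2):
        print(ts, avg)
```

Keeping a running total means each tuple is added once and removed once, so the per-tuple cost stays constant regardless of window size.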

6 Some Tricks for Performance
• Sharing synopses across multiple operators
  • In a few cases, more than one operator may join with the same synopsis
• Can exploit punctuations or “k-constraints”
  • Analogous to interesting orders
  • Referential integrity k-constraint: bound of k between arrival of “many” element and its corresponding “one” element
  • Ordered-arrival k-constraint: need window of at most k to sort
  • Clustered-arrival k-constraint: bound on distance between items with same grouping attributes
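
The ordered-arrival k-constraint can be sketched as follows. This is an illustrative Python sketch, not STREAM's implementation: the name `sort_with_k_constraint` and the `(timestamp, payload)` layout are hypothetical, and the code assumes that every element arrives at most k positions later than its position in sorted order. Under that assumption a buffer of k elements is enough to emit a fully sorted stream.

```python
import heapq
from itertools import count
from typing import Any, Iterable, Iterator, List, Tuple


def sort_with_k_constraint(
    stream: Iterable[Tuple[float, Any]], k: int
) -> Iterator[Tuple[float, Any]]:
    """Emit (ts, payload) in timestamp order using a buffer of at most k
    elements, assuming no element arrives more than k positions late."""
    heap: List[Tuple[float, int, Any]] = []
    seq = count()  # tie-breaker so payloads are never compared directly
    for ts, payload in stream:
        heapq.heappush(heap, (ts, next(seq), payload))
        if len(heap) > k:
            # The smallest buffered element can no longer be overtaken.
            out_ts, _, out_payload = heapq.heappop(heap)
            yield out_ts, out_payload
    while heap:  # end of input: drain the buffer in order
        out_ts, _, out_payload = heapq.heappop(heap)
        yield out_ts, out_payload


if __name__ == "__main__":
    arrivals = [(2, "b"), (1, "a"), (3, "c"), (5, "e"), (4, "d")]
    print(list(sort_with_k_constraint(arrivals, k=1)))
```

Without such a constraint, a sort operator over an unbounded stream would have to buffer indefinitely; the bound k is what turns it into a bounded-memory operator.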

7 Query Processing – “Chain Scheduling”
• Similar in many ways to eddies
• Combination of locally greedy and FIFO scheduling
• Apply operator to data as follows:
  • Assume we know how many tuples can be processed in a time unit
  • Cluster groups of operators into “chains” that maximize reduction in queue size per unit time (i.e., most selective operators per time unit)
  • Greedily forward tuples into the most selective chain
  • Within a chain, process the data in FIFO order
• STREAM also does a form of join reordering
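
The chain-building and greedy-selection steps above can be sketched roughly as follows. This is a simplified reading of the slide, not the STREAM scheduler itself: the function names, the `(cost, selectivity)` operator description, and the example numbers are all assumptions. Each operator is taken to need `cost` time units per tuple and to pass on a `selectivity` fraction of its input; a chain is the run of consecutive operators that gives the steepest drop in remaining data per unit time.

```python
from typing import List, Optional, Sequence, Tuple


def build_chains(ops: Sequence[Tuple[float, float]]) -> List[Tuple[List[int], float]]:
    """Group a pipeline of (cost, selectivity) operators into chains.

    Returns a list of (operator indices, slope), where slope is the
    reduction in data size per unit of processing time. Costs must be > 0.
    """
    # Cumulative (time spent, fraction of data remaining) after each operator.
    points = [(0.0, 1.0)]
    for cost, selectivity in ops:
        t, size = points[-1]
        points.append((t + cost, size * selectivity))

    chains = []
    start = 0
    while start < len(ops):
        t0, s0 = points[start]
        best_end, best_slope = start + 1, float("-inf")
        for end in range(start + 1, len(points)):
            t1, s1 = points[end]
            slope = (s0 - s1) / (t1 - t0)
            if slope >= best_slope:  # on ties, prefer the longer chain
                best_end, best_slope = end, slope
        chains.append((list(range(start, best_end)), best_slope))
        start = best_end
    return chains


def pick_chain(chains, queues) -> Optional[int]:
    """Greedy step: index of the steepest chain with tuples waiting, else None.
    Within the chosen chain, tuples would be served in FIFO order."""
    ready = [i for i, q in enumerate(queues) if q]
    if not ready:
        return None
    return max(ready, key=lambda i: chains[i][1])


if __name__ == "__main__":
    pipeline = [(1.0, 0.9), (1.0, 0.1), (1.0, 0.5)]  # made-up numbers
    print(build_chains(pipeline))
```

In the example, the first operator alone is barely selective, but grouping it with the second gives a much steeper reduction, so the two form one chain and are scheduled as a unit.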

8 Scratching the Surface: Approximation
• They point out two areas where we might need to approximate output:
  • CPU is limited, and we need to drop some stream elements according to some probabilistic metric
    • Collect statistics via a profiler
    • Use Hoeffding inequality to derive a sampling rate in order to maintain a confidence interval
    • This is generally termed load shedding
  • May need to do similar things if memory usage is a constraint
• Are there other options? When might they be useful?
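
To make the Hoeffding step concrete, here is a small Python sketch. It uses the textbook two-sided Hoeffding bound for the mean of values bounded in a range of width R, which gives n ≥ R² ln(2/δ) / (2ε²) samples for an error of at most ε with probability at least 1 − δ. The slide does not give the exact formula STREAM uses, so the function names and the per-window framing are assumptions for illustration only.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float, value_range: float) -> int:
    """Samples needed so that the sample mean is within epsilon of the true
    mean with probability >= 1 - delta, for values spanning value_range."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def sampling_rate(tuples_per_window: int, epsilon: float, delta: float,
                  value_range: float) -> float:
    """Fraction of a window's tuples to keep; 1.0 means no shedding is possible
    at the requested accuracy."""
    needed = hoeffding_sample_size(epsilon, delta, value_range)
    return min(1.0, needed / tuples_per_window)


if __name__ == "__main__":
    # Hypothetical numbers: values in [0, 1], error 0.05, 95% confidence.
    print(hoeffding_sample_size(0.05, 0.05, 1.0))
    print(sampling_rate(10_000, 0.05, 0.05, 1.0))
```

Note that the required sample size does not depend on the window size, so the heavier the load, the larger the fraction of tuples that can be dropped for the same accuracy.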

9 STREAM in General
• “Logical semantics first”
• Starts with a basic data model: streams as timestamped sets
• Develops a language and semantics
  • Heavily based on SQL
• Proposes a relatively straightforward implementation
  • Interesting ideas like k-constraints
  • Interesting approaches like chain scheduling
• No real consideration of distributed processing

10 Aurora
• “Implementation first; mix and match operations from past literature”
• Basic philosophy: most of the ideas in streams existed in previous research
  • Sliding windows, load shedding, approximation, …
  • So let’s borrow those ideas and focus on how to build a real system with them!
• Emphasis is on building a scalable, robust system
  • Distributed implementation: Medusa

11 Queries in Aurora
• Oddly: no declarative query language!
  • Queries are workflows of physical query operators (SQuAl)
  • Many operators resemble relational algebra ops

12 Example Query

13 Some Interesting Aspects
• A relatively simple adaptive query optimizer
  • Can push filtering and mapping into many operators
  • Can reorder some operators (e.g., joins, unions)
• Need built-in error handling
  • If a data source fails to respond in a certain amount of time, create a special alarm tuple
  • This propagates through the query plan
• Incorporate built-in load shedding and real-time scheduling to support QoS
• Have a notion of combining a query over historical data with data from a stream
  • Switches from a pull-based mode (reading from disk) to a push-based mode (reading from network)

14 The Medusa Processor
• Distributed coordinator between many Aurora nodes
  • Scalability through federation and distribution
  • Fail-over
  • Load balancing

15 Main Components
• Lookup
  • Distributed catalog – schemas, where to find streams, where to find queries
• Brain
  • Query setup, load monitoring via I/O queues and stats
  • Load distribution and balancing scheme is used
    • Very reminiscent of Mariposa!

16 Load Balancing
• Migration – an operator can be moved from one node to another
  • Initial implementation didn’t support moving of state
    • The state is simply dropped, and operator processing resumes
    • Implications on semantics?
  • Plans to support state migration
• “Agoric system model to create incentives”
  • Clients pay nodes for processing queries
  • Nodes pay each other to handle load – pairwise contracts negotiated offline
  • Bounded-price mechanism – price for migration of load, spec for what a node will take on
• Does this address the weaknesses of the Mariposa model?

17 Some Applications They Tried
• Financial services (stock ticker)
  • Main issue is not volume, but problems with feeds
  • Two-level alarm system, where higher-level alarm helps diagnose problems
  • Shared computation among queries
  • User-defined aggregation and mapping
• Linear road (sensor monitoring)
  • Traffic sensors in a toll road – change toll depending on how many cars are on the road
  • Combination of historical and continuous queries
• Environmental monitoring
  • Sliding-window calculations

18 The Big Application?
• Military battalion monitoring
  • Positions & images of friends and foes
  • Load shedding is important
  • Randomly drop data vs. semantic, predicate-based dropping to maintain QoS
    • Based on a QoS utility function
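
The contrast between random and semantic load shedding can be sketched in a few lines of Python. This is only an illustration of the idea, not Aurora's load shedder: the function names, the dictionary-shaped tuples, and the toy utility function (which values "foe" sightings over "friend" sightings) are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def random_shed(tuples: Sequence[T], keep_fraction: float,
                rng: random.Random) -> List[T]:
    """Random dropping: keep each tuple independently with probability
    keep_fraction, ignoring its content."""
    return [t for t in tuples if rng.random() < keep_fraction]


def semantic_shed(tuples: Sequence[T], keep_fraction: float,
                  utility: Callable[[T], float]) -> List[T]:
    """Semantic dropping: keep the keep_fraction of tuples with the highest
    utility, preserving their original arrival order."""
    n_keep = int(len(tuples) * keep_fraction)
    ranked = sorted(range(len(tuples)),
                    key=lambda i: utility(tuples[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [t for i, t in enumerate(tuples) if i in keep]


if __name__ == "__main__":
    sightings = [
        {"id": 1, "kind": "friend"},
        {"id": 2, "kind": "foe"},
        {"id": 3, "kind": "friend"},
        {"id": 4, "kind": "foe"},
    ]
    # Toy QoS utility: foe positions matter more than friend positions.
    value = lambda t: 1.0 if t["kind"] == "foe" else 0.1
    print(semantic_shed(sightings, 0.5, value))
    print(random_shed(sightings, 0.5, random.Random(0)))
```

Both functions shed roughly the same volume; the difference is that the semantic variant consults the utility function, so the tuples that survive are the ones the application values most.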

19 Lessons Learned
• Historical data is important – not just stream data
  • (Summaries?)
• Sometimes need synchronization for consistency
  • “ACID for streams”?
• Streams can be out of order, bursty
  • “Stream cleaning”?
• Adaptors and XML are important
  • … But we already knew that!
• Performance is critical
  • They spent a great deal of time using microbenchmarks and optimizing

20 Borealis
• Aurora is now commercial
• Borealis follows up with some new directions:
  • Dynamic revision of results, i.e., corrections to stream data
  • Dynamic query modification – change on the fly
    • “Control lines”: change parameters
    • “Time travel”: support execution of multiple queries, starting from different points in time (past through future)
  • Distributed optimization
    • Combine stream and sensor processing ideas (we’ll talk about sensor nets next time)
    • Sensor-heavy vs. server-heavy optimization

21 Streams and Integration
• How do streams and data integration relate?
• Are streams the future, or just an interesting vista point on the side of the road?