S. Sudarshan CS632 Course, Mar 2004 IIT Bombay


Data Streams

Overview
- Two approaches:
  - Motwani et al. [CIDR03]: concentrate on query processing and approximations
  - Cherniack et al. [CIDR03]: concentrate on system architecture with data-flow style processing; DAG of operators to be created by users

Motwani et al.
- Query language (SQL extension)
- Semantics of stream queries
- Query plans with sharing and approximation
- Reducing memory requirements for query processing by exploiting constraints on data
- Techniques for static and dynamic approximation of query results
- Resource allocation to queries
- Techniques for data compression

Motwani: Language and Semantics
- Relations and streams
  - Streams are timestamped
  - Tuple s arriving at time t is represented as <t, s>
- Istream and Dstream operators
  - Create a stream from the inserts/deletes on a relation
- Query language implicitly converts from streams to relations and vice versa
  - Any query with a stream at the outer level gives stream output
  - Window operations convert streams to relations
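The stream/relation conversions on this slide can be illustrated with a minimal Python sketch. The function names, the integer timestamps, and the use of sets (which ignore duplicate tuples) are assumptions made for illustration; they are not the query language's actual syntax or implementation.

```python
from typing import Hashable, Iterable, List, Set, Tuple

Timestamped = Tuple[int, Hashable]  # <t, s>: tuple s tagged with arrival time t


def sliding_window(stream: Iterable[Timestamped], now: int, width: int) -> Set[Hashable]:
    """Stream -> relation: tuples whose timestamp lies in (now - width, now]."""
    return {s for t, s in stream if now - width < t <= now}


def istream(prev: Set[Hashable], curr: Set[Hashable], now: int) -> List[Timestamped]:
    """Relation -> stream: emit <now, s> for every s inserted since the last snapshot."""
    return sorted((now, s) for s in curr - prev)


def dstream(prev: Set[Hashable], curr: Set[Hashable], now: int) -> List[Timestamped]:
    """Relation -> stream: emit <now, s> for every s deleted since the last snapshot."""
    return sorted((now, s) for s in prev - curr)


# Hypothetical input stream and a window of width 2 evaluated at times 2 and 3.
stream = [(1, "a"), (2, "b"), (3, "c")]
rel_at_2 = sliding_window(stream, now=2, width=2)
rel_at_3 = sliding_window(stream, now=3, width=2)
inserted = istream(rel_at_2, rel_at_3, now=3)
deleted = dstream(rel_at_2, rel_at_3, now=3)
```

Here the window turns the stream into a relation at each instant, and Istream/Dstream turn the changes between two instants back into streams.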

Motwani: Operators and Queues
- Operators: create, changemem, run
- Synopsis
  - Summarizes tuples seen on a stream
  - Create, changemem, insert/delete, query
- Resource sharing
  - Sharing of synopses

Motwani: Constraints
- Constraints:
  - Many-one join and referential integrity constraints between two streams
  - Clustered/ordered arrival constraints
- Can be exploited to avoid storing/examining history
- Implicit addition of a "now" window

Motwani: Scheduling
- Greedy approach: schedule the operator that consumes the largest number of tuples per time unit and produces the fewest tuples
- Goal: minimize queue lengths
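A rough sketch of the greedy choice follows. The `Operator` fields and the scoring formula `rate * (1 - selectivity)` are assumptions chosen to combine "consumes many" and "produces few" into one number; the slide does not give the exact priority function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Operator:
    name: str
    queued: int         # tuples waiting in the operator's input queue
    rate: float         # tuples the operator can consume per time unit
    selectivity: float  # output tuples produced per input tuple


def net_reduction(op: Operator) -> float:
    """Assumed score: queued tuples removed per time unit, net of tuples produced."""
    return op.rate * (1.0 - op.selectivity)


def pick_next(ops: List[Operator]) -> Optional[Operator]:
    """Greedily pick the runnable operator that shrinks the queues fastest."""
    ready = [op for op in ops if op.queued > 0]
    if not ready:
        return None
    return max(ready, key=net_reduction)


# Hypothetical operators: the join has nothing queued, so it is not runnable.
ops = [
    Operator("filter", queued=10, rate=100.0, selectivity=0.1),
    Operator("project", queued=10, rate=200.0, selectivity=1.0),
    Operator("join", queued=0, rate=500.0, selectivity=0.0),
]
chosen = pick_next(ops)
```

In this example the filter is chosen: the projection consumes tuples faster but passes every tuple on, so it does not reduce the total number of queued tuples.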

Motwani: Approximation
- Static and dynamic approximation
- Static: modify the query
  - Window reduction
  - Sampling rate reduction
- Dynamic
  - Synopsis compression
  - Sampling
  - Load shedding
- Resource allocation to maximize precision
  - Simple precision metric: (FP, FN), i.e. false positives and false negatives
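To make the (FP, FN) metric concrete, here is a small sketch that applies random load shedding in front of a selection query and counts false positives and false negatives against the exact answer. The drop probability, the threshold query, and the function names are illustrative assumptions.

```python
import random
from typing import List, Set, Tuple


def shed_load(tuples: List[int], drop_prob: float, rng: random.Random) -> List[int]:
    """Dynamic approximation: drop each input tuple with probability drop_prob."""
    return [x for x in tuples if rng.random() >= drop_prob]


def fp_fn(exact: Set[int], approx: Set[int]) -> Tuple[int, int]:
    """Precision as a pair: (false positives, false negatives)."""
    return len(approx - exact), len(exact - approx)


rng = random.Random(0)          # seeded only so that runs are repeatable
data = list(range(100))         # hypothetical input tuples
kept = shed_load(data, drop_prob=0.5, rng=rng)

exact = {x for x in data if x >= 50}   # answer without shedding
approx = {x for x in kept if x >= 50}  # answer computed on the shed stream
fp, fn = fp_fn(exact, approx)
```

For a monotone query such as this selection, shedding can only lose answers, so FP stays at zero and FN counts the matching tuples that were dropped. Queries with negation or set difference could also produce false positives.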

Motwani: Implementation
- Entities: operators, queues, synopses
- Control tables: attribute-value pairs used to control the entity
- Query plan: network of entities

Cherniack et al.
- Query model
  - Set of relational operations
  - Push-based query processing
  - "tumble" aggregate operation outputs results whenever a window is complete
- Distribution model
  - Aurora* vs Medusa
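The push-based behaviour of a tumbling aggregate can be sketched as below. The sketch assumes that a window is a run of consecutive tuples sharing the same group key and that it closes when a tuple with a different key arrives; the function name and signature are hypothetical and are not Aurora's API.

```python
from typing import Any, Callable, Hashable, Iterable, Iterator, List, Tuple


def tumble(
    stream: Iterable[Any],
    key: Callable[[Any], Hashable],
    agg: Callable[[List[Any]], Any],
) -> Iterator[Tuple[Hashable, Any]]:
    """Emit (key, aggregate) each time a window of equal-key tuples is complete."""
    current_key = None
    window: List[Any] = []
    for tup in stream:
        k = key(tup)
        if window and k != current_key:
            # A tuple with a new key closes the previous window.
            yield current_key, agg(window)
            window = []
        current_key = k
        window.append(tup)
    # The last window is still open: nothing is emitted until a closing tuple arrives.


# Hypothetical input: (group, value) pairs arriving in order.
arrivals = [("a", 1), ("a", 2), ("b", 5), ("b", 1), ("c", 7)]
results = list(tumble(arrivals, key=lambda t: t[0], agg=lambda w: sum(v for _, v in w)))
```

Results appear only when a window completes, so the window for group "c" produces no output in this run.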

Cherniack: Communication Infrastructure
- Naming and discovery
  - Catalogs to find streams/queries
  - Each Aurora network binds its inputs and outputs to streams
- Tuples are routed based on who consumes the stream
  - Single copy of stream per participant
- Single TCP connection between sites
  - Streams multiplexed within the connection
  - Allows control over QoS
- Remote definition of streams

Cherniack: Economic Model
- Economic model of load management and sharing
- Agoric system
- Source is paid, sink pays

Cherniack: Load Management
- Redistribution of computation while the system is active
- Periodic repartitioning of computation

Cherniack: Re-partitioning
- Operator nodes can be reallocated across processor boundaries when required (box sliding)
  - Pairwise interaction
- Box splitting to parallelize an operation
  - Filters decide which tuples go where
  - Subsequent re-merging of streams
  - Merge has to handle groups partially aggregated before the split
  - Deciding on the split criterion
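Box splitting for an aggregate can be sketched as follows: a filter predicate routes each tuple to one of two copies of a counting box, and the merge step adds up the partial counts per group. The count aggregate, the predicate, and all names are assumptions for illustration only.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Tup = Tuple[str, int]  # (group, value)


def count_box(tuples: Iterable[Tup]) -> Counter:
    """The box being split: a per-group count."""
    return Counter(group for group, _ in tuples)


def split(tuples: Iterable[Tup], pred: Callable[[Tup], bool]) -> Tuple[List[Tup], List[Tup]]:
    """Filter in front of the split box: decides which copy receives each tuple."""
    left: List[Tup] = []
    right: List[Tup] = []
    for t in tuples:
        (left if pred(t) else right).append(t)
    return left, right


def merge(partials: Iterable[Counter]) -> Counter:
    """Re-merge: the same group may be partially aggregated on both sides."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return total


data = [("a", 1), ("a", 2), ("b", 3), ("a", 4), ("b", 5)]
left, right = split(data, pred=lambda t: t[1] % 2 == 0)  # hypothetical split criterion
merged = merge([count_box(left), count_box(right)])
```

Because the predicate here splits on the value rather than on the group, both copies see tuples from the same group, which is why the merge must combine partial aggregates instead of simply concatenating the two outputs.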

High Availability
- Basic idea: K-safe
  - Each server can act as backup for its downstream server
  - Low overhead
  - Tuples are discarded after ensuring they are saved elsewhere if they may be needed again
- K-safe
  - Data is safe as long as at most k sites fail
  - Keep copies of in-transit tuples at k upstream servers

Cherniack: Availability
- Remote queue truncation technique
  - Flow messages
  - Process pair
  - Sequence numbering of tuples
  - Earliest tuple that a box depends on
    - E.g. stateless (e.g. filter) and stateful (e.g. aggregates) boxes
    - Stored and sent back periodically to the upstream server
  - More complex with split operators
    - Split operator "merges" messages
    - When to truncate: min of all truncation msgs
- Failure detection and recovery
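The truncation rule "min of all truncation msgs" can be sketched with an upstream output queue that keeps sequence-numbered tuples until every downstream consumer has reported that it no longer depends on them. The class and method names are hypothetical, and the sketch assumes at least one registered consumer.

```python
from collections import deque
from typing import Any, Deque, Dict, Iterable, List, Tuple


class UpstreamQueue:
    """Output queue on an upstream server acting as backup for downstream servers."""

    def __init__(self, consumers: Iterable[str]) -> None:
        self.queue: Deque[Tuple[int, Any]] = deque()  # (sequence number, tuple)
        # Earliest sequence number each consumer still depends on.
        self.needed: Dict[str, int] = {c: 0 for c in consumers}
        self.next_seq = 0

    def send(self, payload: Any) -> int:
        """Assign a sequence number and retain the tuple for possible replay."""
        seq = self.next_seq
        self.queue.append((seq, payload))
        self.next_seq += 1
        return seq

    def on_truncation_msg(self, consumer: str, earliest_needed: int) -> None:
        """A consumer reports the earliest tuple it still depends on."""
        self.needed[consumer] = max(self.needed[consumer], earliest_needed)
        safe = min(self.needed.values())  # truncate only below the minimum
        while self.queue and self.queue[0][0] < safe:
            self.queue.popleft()

    def replay(self) -> List[Tuple[int, Any]]:
        """Tuples to resend if a downstream server fails."""
        return list(self.queue)


q = UpstreamQueue(["x", "y"])
for payload in "abcde":
    q.send(payload)
q.on_truncation_msg("x", 3)  # y has reported nothing yet, so nothing is dropped
after_first = q.replay()
q.on_truncation_msg("y", 2)  # minimum is now 2, so tuples 0 and 1 are dropped
after_second = q.replay()
```

A stateless box such as a filter would report the latest tuple it has processed, while a stateful box such as an aggregate would report the oldest tuple that still contributes to its state.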

Cherniack: Availability
- Process pair techniques
  - Traditional way in distributed systems
  - More msgs during operation, less work during recovery
- Further extension to virtual machines

Cherniack: Policy Specification and Guidelines
- QoS-based control in Aurora*
- Economic contract-based control in Medusa