The Design of the Borealis Stream Processing Engine Daniel J. Abadi1, Yanif Ahmad2, Magdalena Balazinska1, Ug ̆ur C ̧ etintemel2, Mitch Cherniack3, Jeong-Hyon.

Slides:



Advertisements
Similar presentations
Load Management and High Availability in Borealis Magdalena Balazinska, Jeong-Hyon Hwang, and the Borealis team MIT, Brown University, and Brandeis University.
Advertisements

Class-constrained Packing Problems with Application to Storage Management in Multimedia Systems Tami Tamir Department of Computer Science The Technion.
Resource Management §A resource can be a logical, such as a shared file, or physical, such as a CPU (a node of the distributed system). One of the functions.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
WSUS Presented by: Nada Abdullah Ahmed.
Adaptive Monitoring of Bursty Data Streams Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani.
® IBM Software Group © 2006 IBM Corporation Rational Software France Object-Oriented Analysis and Design with UML2 and Rational Software Modeler 04. Other.
The Design of the Borealis Stream Processing Engine Brandeis University, Brown University, MIT Magdalena BalazinskaNesime Tatbul MIT Brown.
The Design of the Borealis Stream Processing Engine CIDR 2005 Brandeis University, Brown University, MIT Kang, Seungwoo Ref.
Load Shedding in a Data Stream Manager Kevin Hoeschele Anurag Shakti Maskey.
Fault-Tolerance in the Borealis Distributed Stream Processing System Magdalena Balazinska, Hari Balakrishnan, Samuel Madden, and Michael Stonebraker MIT.
A Stratified Approach for Supporting High Throughput Event Processing Applications July 2009 Geetika T. LakshmananYuri G. RabinovichOpher Etzion IBM T.
Aurora Proponent Team Wei, Mingrui Liu, Mo Rebuttal Team Joshua M Lee Raghavan, Venkatesh.
Chapter 10: Stream-based Data Management Title: Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core Authors:
1 Load Shedding Algorithm Evaluation Step –When to shed load? Load Shedding Road Map (LSRM) –Where to shed load? –How much load to shed?
Scalable Distributed Stream System Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Don Carney, Uğur Çetintemel, Ying Xing, and Stan Zdonik Proceedings.
Scalable Distributed Stream Processing Presented by Ming Jiang.
MPDS 2003 San Diego 1 Reducing Execution Overhead in a Data Stream Manager Don Carney Brown University Uğur ÇetintemelBrown University Mitch Cherniack.
Monitoring Streams -- A New Class of Data Management Applications Don Carney Brown University Uğur ÇetintemelBrown University Mitch Cherniack Brandeis.
1 Load Shedding in a Data Stream Manager Slides edited from the original slides of Kevin Hoeschele Anurag Shakti Maskey.
Monitoring Streams -- A New Class of Data Management Applications Don Carney Brown University Uğur ÇetintemelBrown University Mitch Cherniack Brandeis.
Computer Measurement Group, India Reliable and Scalable Data Streaming in Multi-Hop Architecture Sudhir Sangra, BMC Software Lalit.
MONITORING STREAMS: A NEW CLASS OF DATA MANAGEMENT APPLICATIONS DON CARNEY, U Ğ UR ÇETINTEMEL, MITCH CHERNIACK, CHRISTIAN CONVEY, SANGDON LEE, GREG SEIDMAN,
Making Every Bit Count in Wide Area Analytics Ariel Rabkin Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai 1.
The Design of the Borealis Stream Processing Engine CIDR 2005 Brandeis University, Brown University, MIT Kang, Seungwoo Ref.
Providing Resiliency to Load Variations in Distributed Stream Processing Ying Xing, Jeong-Hyon Hwang, Ugur Cetintemel, Stan Zdonik Brown University.
Database Management 9. course. Execution of queries.
A new model and architecture for data stream management.
FI-WARE Points of Interest (POI) Data Provider Short Introduction Nonprofit educational material. Fair use of copyrighted content, if any, is assumed.
C-Store: Concurrency Control and Recovery Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun. 5, 2009.
Aurora – system architecture Pawel Jurczyk. Currently used DB systems Classical DBMS: –Passive repository storing data (HADP – human-active, DBMS- passive.
Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management Author: Raul Castro Fernandez, Matteo Migliavacca, et al.
REED: Robust, Efficient Filtering and Event Detection in Sensor Networks Daniel Abadi, Samuel Madden, Wolfgang Lindner MIT United States VLDB 2005.
1 REED: Robust, Efficient Filtering and Event Detection in Sensor Networks Daniel Abadi, Samuel Madden, Wolfgang Lindner MIT United States VLDB 2005.
CSC 600 Internetworking with TCP/IP Unit 7: IPv6 (ch. 33) Dr. Cheer-Sun Yang Spring 2001.
What is SAM-Grid? Job Handling Data Handling Monitoring and Information.
1/22/08 RTR Project Presentation to TPTF RTR Project Michael Daskalantonakis & Brian Cook.
FlexTable: Using a Dynamic Relation Model to Store RDF Data IDS Lab. Seungseok Kang.
P2P Group Meeting (ICS/FORTH) Monday, 28 March, 2005 A Scalable Content-Addressable Network Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp,
Aurora Group 19 : Chu Xuân Tình Trần Nhật Tuấn Huỳnh Thái Tâm Lec: Associate Professor Dr.techn. Dang Tran Khanh A new model and architecture for data.
Control-Based Load Shedding in Data Stream Management Yicheng Tu †, Song Liu ‡, Sunil Prabhakar †, Bin Yao ‡ † Indiana Center of Database Systems, Department.
A new model and architecture for data stream management.
Protocol Requirements draft-bryan-p2psip-requirements-00.txt D. Bryan/SIPeerior-editor S. Baset/Columbia University M. Matuszewski/Nokia H. Sinnreich/Adobe.
Dr Mohamed Menacer College of Computer Science and Engineering, Taibah University CE-321: Computer.
Pipelined and Parallel Computing Partition for 1 Hongtao Du AICIP Research Nov 3, 2005.
Aurora: a new model and architecture for data stream management Daniel J. Abadi 1, Don Carney 2, Ugur Cetintemel 2, Mitch Cherniack 1, Christian Convey.
Network-Aware Query Processing for Stream- based Application Yanif Ahmad, Ugur Cetintemel - Brown University VLDB 2004.
Control-Based Load Shedding in Data Stream Management Systems Yicheng Tu and Sunil Prabhakar Department of Computer Sciences, Purdue University April 3,
1 Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter Tucker This work.
Monitoring Streams -- A New Class of Data Management Applications based on paper and talk by authors below, slightly adapted for CS561: Don Carney Brown.
Fair and Efficient multihop Scheduling Algorithm for IEEE BWA Systems Daehyon Kim and Aura Ganz International Conference on Broadband Networks 2005.
Querying the Internet with PIER CS294-4 Paul Burstein 11/10/2003.
Control-based Quality Adaptation in Data Stream Management Systems (DSMS) Yicheng Tu†, Song Liu‡, Sunil Prabhakar†, and Bin Yao‡ † Department of Computer.
Control-Based Load Shedding in Data Stream Management Systems Yicheng Tu and Sunil Prabhakar Department of Computer Sciences, Purdue University April 3,
COMP7330/7336 Advanced Parallel and Distributed Computing Task Partitioning Dr. Xiao Qin Auburn University
September 2003, 7 th EDG Conference, Heidelberg – Roberta Faggian, CERN/IT CERN – European Organization for Nuclear Research The GRACE Project GRid enabled.
PATH DIVERSITY WITH FORWARD ERROR CORRECTION SYSTEM FOR PACKET SWITCHED NETWORKS Thinh Nguyen and Avideh Zakhor IEEE INFOCOM 2003.
S. Sudarshan CS632 Course, Mar 2004 IIT Bombay
Load Shedding CS240B notes.
Parallel Programming By J. H. Wang May 2, 2017.
Work-in-Progress: Wireless Network Reconfiguration for Control Systems
An overview of Data Streaming
Parallel Programming in C with MPI and OpenMP
Advanced Database Management System
Presenter Kyungho Jeon 11/17/2018.
Software models - Software Architecture Design Patterns
Load Shedding CS240B notes.
REED : Robust, Efficient Filtering and Event Detection
Parallel Programming in C with MPI and OpenMP
Stream-Lined Data Management
Presentation transcript:

The Design of the Borealis Stream Processing Engine Daniel J. Abadi1, Yanif Ahmad2, Magdalena Balazinska1, Ug ̆ur C ̧ etintemel2, Mitch Cherniack3, Jeong-Hyon Hwang2, Wolfgang Lindner1, Anurag S. Maskey3, Alexander Rasin2, Esther Ryvkina3, Nesime Tatbul2, Ying Xing2, and Stan Zdonik2 1 MIT Cambridge, MA 2 Brown University 3 Brandeis University Providence, RI Waltham, MA Presenter: Le Xu

Second Generation Stream Engine Developed from Aurora (first generation of stream processing engine) – sharing input format and similar system architecture New feature: – Dynamic modification of operator – Query revision

Aurora Processing Network Source: The Aurora and Borealis Stream Processing Engines:

System Architecture

A Borealis Query Processor

Data interface: - Stream Data Input Control interface - Control messages Box Processor - Main operation (Aggregate, Filter, Join, read, write, etc.)

A Borealis Query Processor Local Optimizer Load Shedder Priority Scheduler

Take a loser look at the Borealis node architecture

Borealis, after Aurora Dynamic Revision of Query Results -diagram history replay -stateless and stateful operation replay -challenges: Cost & Storage; proliferation Query Modification (dynamic!) -control line Dynamic System Optimization

Dynamic revising query results - Motivation: wrong/missing input, shed load… -Each box (operator) has a diagram history stored in the connection point of the input (has a history bound, of course) -Start revise while a revision message received (add, delete, replace) -Dynamic revision only generates the “delta” reflecting the change of result to save space

Stateless revision Stateless operator (e.g. Filter) only affects the revised message itself Dynamic revision only generates message of operation to revise the old result x> Replace: 4 Delete: 6

Stateful revision Stateful operator(e.g. Aggregation by window) revision require all messages involve in computation Dynamic revision only generates message of operation to revise the old result Aggregation M, T, W T, W, R W,R,F REVISED

Dynamic Revision Challenge Revision Proliferation (misalignment in size- based operation) Before:After insert: All messages (start from revision point to present) need to be revised! - Revision message need to be ignored sometimes

Dynamic Modification of Queries Control Lines Triggered while receiving control message specifying pair Timing: - Control message before data - Control message after data

Time Travel Connection Point (CP) View CP view has two operations to enable time travel: - replay - undo CP box 1 box 2 box 2

Borealis Optimization

1. Initial Diagram Distribution - Read/Write close to database site - Run correlation algorithm to find best operator/node match 2. Dynamic Optimization - Local Optimization load shedding/query delay, scheduling - Neighborhood Optimization Edge box sliding (limited bandwidth), correlation maximization, upstream load shedding

Edge box slide Before slide (left node overload) After slide Example of downstream slide: while network bandwidth is limited, this benefits the neighborhood while box 2 produce more output than input. And vice versa for upstream slide (e.g. box 2: join) Node 1 Node 2Node 1Node 2

Neighborhood load shedding Source: Presentation: The Design of the Borealis stream Processing Engine

Neighborhood load shedding (less total loss) Source: Presentation: The Design of the Borealis stream Processing Engine

Discussion and Open questions Progress on time travel and dynamic modification of query Possible high latency in the set up stage that reduce the flexibility of the system Revision-heavy application stall the processing Centralized global optimization Is it possible that the sharing load between the nodes never stops?

Dynamic Load Distribution in the Borealis Stream Processor * Ying Xing Brown University Stan Zdonik Brown University Jeong-Hyon Hwang Brown University

More on Load Balancing Pairwise Load Balancing Define score of Operator o while node 1 offload to node 2 Denotes the correlation coefficient between load of operator o and the load of all other operators in the node N.

Global Load Balancing 1. Settle non-removable node 2. (initial distribution)Greedy algorithm assigning the node with lowest node with operator with largest score of the node 3. Dynamic pairwise load balancing – Score:

Experiment Latency Ratio: end-to-end latency/end-to-end processing delay

Experiment