Fault-Tolerance in the Borealis Distributed Stream Processing System Magdalena Balazinska, Hari Balakrishnan, Samuel Madden, and Michael Stonebraker MIT.

Fault-Tolerance in the Borealis Distributed Stream Processing System Magdalena Balazinska, Hari Balakrishnan, Samuel Madden, and Michael Stonebraker, MIT Computer Science & Artificial Intelligence Lab. Original slides: Youngki Lee. Modified by: Bao Huy Ung.

Abstract Presents a replication-based approach to fault-tolerant distributed stream processing in the face of node failures, network failures, and network partitions. The approach aims to reduce the degree of inconsistency in the system while guaranteeing that available inputs are processed within a specified time threshold.

Time Threshold The user-defined delay constraint is X, and the data processing delay is P. A node cannot buffer inputs longer than αX, where αX < X − P.
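
The constraint above can be read as a simple per-node buffering budget. Below is a minimal sketch (not Borealis code; the values of X, P, and α are hypothetical) of how a node might compute and check that budget:

```python
# Sketch of the buffering bound from the slide above (illustrative only).
# X is the user-defined end-to-end delay constraint, P is the data
# processing delay, and alpha scales the per-node buffering budget.

def max_buffer_delay(X: float, P: float, alpha: float) -> float:
    """Return the per-node buffering budget alpha * X, requiring alpha * X < X - P."""
    budget = alpha * X
    if not budget < X - P:
        raise ValueError("alpha * X must be strictly less than X - P")
    return budget

# Example: a 3-second constraint, 0.5 s of processing delay, alpha = 0.5
print(max_buffer_delay(X=3.0, P=0.5, alpha=0.5))  # 1.5 seconds of allowed buffering
```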

Motivation scenario [figure: a chain of SPEs with an upstream neighbor, a failing SPE, and downstream neighbors, each client with its own delay constraint X (e.g., 1 second, 3 seconds, 60 seconds)]. Downstream neighbors want 1. new tuples to be processed within the time threshold X, and 2. to eventually get the correct result.

Fault-Tolerance Approach If an input stream fails, find another replica. If no replica is available, produce tentative tuples. Correct tentative results after the failure heals. [State diagram: STABLE → UPSTREAM FAILURE on missing or tentative inputs; UPSTREAM FAILURE → STABILIZATION when the failure heals; STABILIZATION → STABLE after reconciling state and producing corrected output; STABILIZATION → UPSTREAM FAILURE if another upstream failure is in progress.]
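
As an illustration only (state and event names are paraphrased from the diagram above; this is not the Borealis implementation), the three-state cycle can be written as a small transition table:

```python
# Illustrative sketch of the STABLE / UP_FAILURE / STABILIZATION cycle.
from enum import Enum, auto

class NodeState(Enum):
    STABLE = auto()
    UP_FAILURE = auto()       # "UPSTREAM FAILURE" in the slide
    STABILIZATION = auto()

TRANSITIONS = {
    (NodeState.STABLE, "missing_or_tentative_inputs"): NodeState.UP_FAILURE,
    (NodeState.UP_FAILURE, "failure_heals"): NodeState.STABILIZATION,
    (NodeState.STABILIZATION, "another_upstream_failure"): NodeState.UP_FAILURE,
    (NodeState.STABILIZATION, "reconciled_corrected_output"): NodeState.STABLE,
}

def step(state: NodeState, event: str) -> NodeState:
    """Advance the node's state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

# One failure-and-recovery cycle
s = NodeState.STABLE
for e in ["missing_or_tentative_inputs", "failure_heals", "reconciled_corrected_output"]:
    s = step(s, e)
print(s)  # NodeState.STABLE
```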

Fault-Tolerance Approach: STABLE Only need to keep consistency among replicas – Deterministic operators – SUnion. [Figure: input streams s1, s2, and s3 feed SUnion operators on Node 1 and its replica Node 1' over TCP connections.]
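
One way to picture why SUnion keeps replicas consistent: it groups tuples from its input streams into time intervals and emits each interval's tuples in a deterministic order, so every replica processes the same input sequence. The sketch below is a simplification under that assumption (field names and the interval length are illustrative; boundary tuples and other details of the real operator are omitted):

```python
# Simplified SUnion-style deterministic merge (illustrative, not Borealis code).
from typing import Iterable, List, Tuple

StreamTuple = Tuple[float, str]  # (timestamp, payload) -- assumed schema

def sunion(streams: Iterable[List[StreamTuple]], interval: float) -> List[StreamTuple]:
    """Merge tuples from all streams into a single deterministic order."""
    merged = [t for s in streams for t in s]
    # Ordering by (time bucket, timestamp, payload) gives every replica the same
    # total order regardless of arrival order on the TCP connections.
    merged.sort(key=lambda t: (int(t[0] // interval), t[0], t[1]))
    return merged

s1 = [(0.2, "a"), (1.4, "c")]
s2 = [(0.9, "b"), (1.1, "d")]
print(sunion([s1, s2], interval=1.0))  # [(0.2,'a'), (0.9,'b'), (1.1,'d'), (1.4,'c')]
```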

Fault-Tolerance Approach: UPSTREAM FAILURE If an upstream neighbor is no longer in the STABLE state or is unreachable – Switch to another STABLE replica – If no STABLE replica exists, continue with data from a replica in the UP_FAILURE state. The node can then suspend processing until the failure heals and stable data is produced by upstream neighbors, delay new tuples as long as possible (X − P) before processing them, or process them without any delay.
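
A hedged sketch of the upstream-switch decision described above (replica names and states are illustrative; the real protocol also tracks which output streams become tentative):

```python
# Illustrative choice of which upstream replica to read from after a failure.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Replica:
    name: str
    state: str  # "STABLE", "UP_FAILURE", or "UNREACHABLE" -- assumed labels

def choose_upstream(replicas: List[Replica]) -> Optional[Replica]:
    stable = [r for r in replicas if r.state == "STABLE"]
    if stable:
        return stable[0]        # keep producing stable results
    failed = [r for r in replicas if r.state == "UP_FAILURE"]
    if failed:
        return failed[0]        # continue, but downstream results become tentative
    return None                 # suspend, or delay up to X - P, or process anyway

print(choose_upstream([Replica("u1", "UNREACHABLE"), Replica("u1'", "UP_FAILURE")]))
```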

Fault-Tolerance Approach: STABILIZATION State reconciliation – Checkpoint/redo – Undo/redo. Stabilizing output streams. Processing new tuples during reconciliation – if the reconciliation time < X − P, then suspend; else delay, or process. Failed node recovery.
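
For the checkpoint/redo option, reconciliation amounts to restoring the last checkpointed operator state and replaying the corrected input that arrived since the checkpoint. A minimal sketch under that reading (toy counting operator; the real Borealis operators and checkpoint mechanism are not modeled):

```python
# Illustrative checkpoint/redo reconciliation (not Borealis code).
from copy import deepcopy

class CountOperator:
    """Toy stateful operator: counts tuples it has processed."""
    def __init__(self):
        self.count = 0
    def process(self, tup):
        self.count += 1

op = CountOperator()
checkpoint = deepcopy(op.__dict__)          # checkpoint taken before the failure

# During the failure the node processed tentative tuples (not shown). Once the
# failure heals, corrected input is available from a stable upstream replica.
corrected_inputs = ["t1", "t2", "t3", "t4"]

op.__dict__ = deepcopy(checkpoint)          # restore the checkpointed state
for tup in corrected_inputs:                # redo processing with stable data
    op.process(tup)
print(op.count)  # 4: the state now reflects only corrected tuples
```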

Experimental results

Experimental results: Reconciliation (performance & overhead)

Questions? What advantages could using a content-distribution stream network provide? Replicas communicate with each other in the event of long failures to reach a mutually consistent state; are there any benefits to having them always communicate with each other?