Modeling Stream Processing Applications for Dependability Evaluation


Modeling Stream Processing Applications for Dependability Evaluation Gabriela Jacques-Silva†‡, Zbigniew Kalbarczyk†, Bugra Gedik‡, Henrique Andrade‡, Kun-Lung Wu‡, Ravishankar K. Iyer† †University of Illinois at Urbana-Champaign ‡IBM Research – T. J. Watson Research Center

Outline: Streaming applications. Modeling a streaming application: stream operator, stream connections, and tuples; representation of faults and error propagation; extending the model to include fault tolerance techniques. Evaluation.

Extract knowledge from live data streams on-the-fly. [Figure: a graph of stream operators exchanging tuples. Example output, "Percentage of positive feedback": 9.57%, 5.42%, 3.16%, 2.52%, 1.28%.]

Different approaches to fault tolerance have different resource consumption and performance impact. Some techniques aim to guarantee no data loss and no data duplication. [Figure: example output, "Percentage of positive feedback": 9.57%, 5.42%, 3.16%, 2.52%, 1.28%.]

Different approaches to fault tolerance have different resource consumption and performance impact. Partial fault tolerance: decreases the time to achieve stable output compared to no recovery; achieves approximate results, which some streaming applications can tolerate. [Figure: example output, "Percentage of positive feedback": 9.32%, 5.11%, 2.84%, 2.27%, 1.09%.]

An evaluation framework helps to understand the relative merits of different techniques. Previous approaches focus on performance evaluation. Fault injection may be expensive, mainly when evaluating the application under different setups and parameters. [Figure labels: checkpoint; partial graph replication.]

Summary of contributions. Modeling framework for evaluating streaming applications under faults that lead to data loss and data corruption. Considers the consequences of error propagation. Based on generic models specified via Stochastic Activity Networks (SAN). Abstractions for stream operators, stream connections, and tuples. Modeled three fault tolerance techniques: checkpointing, partial replication, and full replication.

Modeling framework uses Stochastic Activity Network formalism. SANs can express the non-deterministic behavior and parallel execution of a streaming application. Nomenclature:
Place → container for a natural number
Activity → transition between places
Token → item in a place
Input gate → enforces a condition on an activity
Output gate → executes a function after an activity
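The nomenclature above can be made concrete with a small sketch. This is not the authors' SAN tooling; it is a minimal illustration that assumes exponentially distributed activity delays and a simple race between enabled activities. The names `Activity`, `step`, and `move` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict

Marking = Dict[str, int]  # place name -> number of tokens (a natural number)


@dataclass
class Activity:
    """A timed transition guarded by an input gate, followed by an output gate."""
    name: str
    rate: float                              # assumed exponential firing rate
    input_gate: Callable[[Marking], bool]    # condition that enables the activity
    output_gate: Callable[[Marking], None]   # function executed after firing


def step(marking, activities, rng):
    """Fire one enabled activity; return (name, delay), or None if none is enabled."""
    enabled = [a for a in activities if a.input_gate(marking)]
    if not enabled:
        return None
    # Race: sample a delay per enabled activity, the smallest delay fires.
    delay, winner = min(((rng.expovariate(a.rate), a) for a in enabled),
                        key=lambda pair: pair[0])
    winner.output_gate(marking)
    return winner.name, delay


def move(src, dst):
    """Output gate that moves one token from place `src` to place `dst`."""
    def gate(marking):
        marking[src] -= 1
        marking[dst] += 1
    return gate


marking = {"input": 3, "output": 0}
process = Activity("process", 2.0, lambda m: m["input"] > 0, move("input", "output"))
rng = random.Random(0)
elapsed = 0.0
while (fired := step(marking, [process], rng)) is not None:
    elapsed += fired[1]
```

After the loop, all three tokens have moved from the `input` place to the `output` place, and `elapsed` holds the simulated time.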

Framework is based on the abstraction of three key components of a SPA (stream processing application). Stream operator → state transition model. Captures arity, selectivity, and processing time. [Figure: operator F1 (filter, int < 9) with input gate IG1 and output gates OG1 and OG2. States: waiting for input, processing tuple, sending output. Also shown: input stream connections, output buffer, output stream connections.]
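The state transition model for a stream operator can be sketched as a three-state machine. This is an illustrative simplification, not the SAN model from the talk: it assumes one input and one output connection (arity 1/1), a deterministic processing time, and selectivity applied as a running fraction. `OperatorModel` and its fields are hypothetical names. Tuples are represented only by their sizes, with no attribute values.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    WAITING = "waiting for input"
    PROCESSING = "processing tuple"
    SENDING = "sending output"


@dataclass
class OperatorModel:
    selectivity: float       # output tuples per input tuple (0.5 = half pass)
    processing_time: float   # model time consumed per processed tuple
    inputs: deque = field(default_factory=deque)    # input stream connection
    outputs: deque = field(default_factory=deque)   # output stream connection
    output_buffer: list = field(default_factory=list)
    state: State = State.WAITING
    clock: float = 0.0
    _credit: float = 0.0     # fractional output carried between tuples
    _size: int = 0           # size of the tuple currently being processed

    def step(self):
        if self.state is State.WAITING:
            if self.inputs:                       # input gate: a tuple is available
                self._size = self.inputs.popleft()
                self.state = State.PROCESSING
        elif self.state is State.PROCESSING:
            self.clock += self.processing_time
            self._credit += self.selectivity
            produced = int(self._credit)          # whole tuples to emit now
            self._credit -= produced
            self.output_buffer = [self._size] * produced
            self.state = State.SENDING
        else:                                     # SENDING: flush the output buffer
            self.outputs.extend(self.output_buffer)
            self.output_buffer = []
            self.state = State.WAITING


op = OperatorModel(selectivity=0.5, processing_time=2.0)
op.inputs.extend([64, 64, 64, 64])                # four tuples of size 64
while op.inputs or op.state is not State.WAITING:
    op.step()
```

With a selectivity of 0.5, four input tuples yield two output tuples, and the operator's clock advances by four processing times.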

Framework is based on the abstraction of three key components of a SPA. Stream connections → state sharing between output and input streams.

Framework is based on the abstraction of three key components of a SPA. Tuples → tokens flowing through input and output streams. The model represents tuple sizes, but not attribute values.

Stream operator failure model considers crashes and SDCs. Crash → data loss for partial fault tolerance techniques. [Figure: example output shown with the crash case: 9.32%, 5.11%, 2.84%, 2.27%, 1.09%.] Silent data corruption (SDC) → corruption of attribute values. [Figure: example output shown with the SDC case: 9.53%, 5.42%, 3.14%, 2.52%, 1.28%.]

Base model is augmented to represent error propagation. Once a fault occurs, operators may generate inaccurate data. This is represented via tainted tuples and tainted stream connections. [Figure: the base operator model (input stream connection, waiting for input, processing tuple, sending output, output stream connection) extended with a tainted input stream connection, a "processing tainted tuple" state, an "is tainted" flag, and a tainted output stream connection.]
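As a rough illustration of how taint spreads along stream connections, the sketch below marks every connection downstream of a faulty operator. It is a reachability simplification, not the SAN model itself, which tracks individual tainted tuples over time. The function name and the example graph are hypothetical.

```python
from collections import deque


def tainted_connections(graph, faulty):
    """Return the stream connections (edges) reachable downstream of `faulty`.

    `graph` maps an operator name to the operators it sends tuples to.
    """
    tainted = set()
    seen = {faulty}
    queue = deque([faulty])
    while queue:
        op = queue.popleft()
        for downstream in graph.get(op, []):
            tainted.add((op, downstream))
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return tainted


# Hypothetical application graph, for illustration only.
graph = {
    "source": ["filter_a", "filter_b"],
    "filter_a": ["aggregate"],
    "filter_b": ["join"],
    "aggregate": ["join"],
    "join": ["sink"],
}
result = tainted_connections(graph, "filter_a")
```

Here a fault in `filter_a` taints the connections leading to `aggregate`, `join`, and `sink`, while the path through `filter_b` is untouched until it merges at `join`.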

Stateless operators do not generate tainted tuples after crash and restore. Once the operator recovers, the data is accurate. [Figure: filter F1 (int < 9), no-crash run compared with a crash run.]

Stateful operators generate tainted tuples after crash and restore. After recovery, the operator produces tainted tuples until its internal state refreshes. [Figure: stateful operator F1, no-crash run compared with a run after crash and restore.]

Stateful operators generate tainted data upon crash of any operator in the upstream set. With no crash, incoming tuples change the operator's internal state. While an upstream operator is down, the internal state is unchanged. After the crashed operator recovers, the stateful operator produces tainted tuples until its internal state refreshes. [Figure: upstream filter (int < 9) feeding stateful operator F1, no-crash run compared with a crash run.]
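One way to see why a stateful operator stays tainted until its state refreshes is a count-based sliding window. The sketch below assumes the operator is a sliding-window sum (an assumption made for illustration; the talk does not fix the operator type) and flags an output as tainted while any affected tuple is still inside the window.

```python
from collections import deque


def run_window_sum(values, window, tainted_inputs):
    """Sliding-window sum; returns a list of (sum, is_tainted) per output.

    `tainted_inputs` holds the indices of input tuples affected by a fault.
    """
    buf = deque(maxlen=window)          # internal state of the operator
    out = []
    for i, v in enumerate(values):
        buf.append((v, i in tainted_inputs))
        if len(buf) == window:
            total = sum(x for x, _ in buf)
            tainted = any(flag for _, flag in buf)
            out.append((total, tainted))
    return out


outputs = run_window_sum([1, 2, 3, 4, 5, 6, 7, 8], window=3, tainted_inputs={3})
flags = [tainted for _, tainted in outputs]
```

A single affected input taints as many outputs as the window is long; once the affected tuple slides out, the state has refreshed and the outputs are clean again. A stateless filter, by contrast, would carry no taint past the affected tuple itself.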

Checkpoint of Operator State. The model is parameterized to capture how long it takes to produce good results after a failure. [Figure: stateful operator F1, no-crash run compared with a run after crash and restore.] G. Jacques-Silva et al. “Language Level Checkpointing Support for Stream Processing Applications”. DSN 2009.

Partial Graph Replication. Replicated operators and stream connections on the composed application model. Extra logic in replicated operators performs replica failover. [Figure: replicas op1,A (active) and op1,B (backup), with failover and deactivate transitions.] G. Jacques-Silva et al. “Language Level Checkpointing Support for Stream Processing Applications”. DSN 2009.

Full graph replication. Extra logic for operators to perform de-duplication on tuples coming from redundant streams. Aims at no tuple loss and no duplicate delivery. J.-H. Hwang et al. “Fast and highly-available stream processing over wide area networks”. ICDE 2008.
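The de-duplication step can be sketched as follows. This assumes each tuple carries a sequence number that is identical across replicas; that identifier is a hypothetical choice for illustration, since the slides do not say how duplicates are recognized.

```python
def deduplicate(merged_stream):
    """Deliver each sequence number once, in arrival order.

    `merged_stream` is an iterable of (sequence_number, payload) pairs
    coming from redundant replicas. Sequence numbers are an assumption.
    """
    seen = set()   # grows without bound here; a real system would prune it
    delivered = []
    for seq, payload in merged_stream:
        if seq in seen:
            continue               # duplicate from the other replica
        seen.add(seq)
        delivered.append((seq, payload))
    return delivered


replica_a = [(1, "x"), (2, "y"), (3, "z")]
replica_b = [(1, "x"), (3, "z")]            # replica B lost tuple 2
merged = sorted(replica_a + replica_b)      # simple stand-in for interleaving
delivered = deduplicate(merged)
```

Tuple 2 survives because replica A still delivers it (no loss), and tuples 1 and 3 are delivered once despite arriving twice (no duplication).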

Checkpoint vs. Partial Replication Under Crashes. Target → Bargain Discovery. Stateless: source, sink, 4 filters. Stateful: aggregate and join. Operator MTTF: 30, 50, 70, and 90 min. Model parameters taken from the application executing in IBM System S. Configurations compared: checkpoint; partial replication + checkpoint. [Figure: application graphs for the two configurations.]

Evaluation Metrics.
Availability: all operators are alive and are not producing tainted data.
Total number of tainted tuples: total number of tainted tuples stored by the sink operator.
Percentage of tainted tuples: fraction of tainted tuples stored by the sink over the total number of tuples produced by the golden run.
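These metrics could be computed from a simulation trace roughly as follows. The sketch assumes availability is measured as the fraction of time in which no operator is down or producing tainted data, and that such periods are given as (start, end) intervals; the function names and the numbers in the example are illustrative, not results from the talk.

```python
def availability(total_time, bad_intervals):
    """Fraction of `total_time` outside every (start, end) interval.

    A "bad" interval is a period in which some operator is down or is
    producing tainted data. Overlapping intervals are counted once.
    """
    down = 0.0
    covered_until = 0.0
    for start, end in sorted(bad_intervals):
        start = max(start, covered_until)   # skip time already counted
        end = min(end, total_time)
        if end > start:
            down += end - start
            covered_until = end
    return 1.0 - down / total_time


def tainted_percentage(tainted_at_sink, golden_total):
    """Tainted tuples stored by the sink, as a percentage of the golden run."""
    return 100.0 * tainted_at_sink / golden_total


avail = availability(100.0, [(10, 20), (15, 30), (50, 55)])
pct = tainted_percentage(30, 1200)
```

In this made-up trace the intervals (10, 20) and (15, 30) overlap, so 25 of the 100 time units are unavailable.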

Partial replication provides better availability than checkpoint.

Partial replication produces fewer tainted tuples than checkpoint.

Impact of SDC on Full Replication Technique. Target → Bargain Discovery. Operator MTTF: 120 min. [Figure: fully replicated application graph.]

Impact of SDC on application availability is small. [Figure label: 120 min.]

Percentage of tainted tuples is small when compared to golden run. [Figure label: 120 min.]

SDC breaks the non-duplication guarantee promised by the full replication technique: tainted tuples + non-tainted tuples > non-tainted tuples of golden run + confidence interval. [Figure label: 120 min.]
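The inequality above can be read as a simple check on tuple counts at the sink. The sketch below interprets "confidence interval" as the half-width added to the golden-run count, which is an assumption; the function name and all numbers are illustrative, not measurements from the talk.

```python
def duplication_suspected(tainted, non_tainted, golden_non_tainted, ci_half_width):
    """True when the sink stored more tuples than the golden run can explain.

    `ci_half_width` is assumed to be the half-width of the confidence
    interval around the golden-run tuple count.
    """
    return tainted + non_tainted > golden_non_tainted + ci_half_width


with_sdc = duplication_suspected(tainted=12, non_tainted=1000,
                                 golden_non_tainted=1000, ci_half_width=5)
without_sdc = duplication_suspected(tainted=0, non_tainted=1000,
                                    golden_non_tainted=1000, ci_half_width=5)
```

When the total stored by the sink exceeds the golden-run count by more than the interval allows, some tuples must have been delivered more than once.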

Summary. Modeling framework to evaluate the dependability provided by different techniques. Applications are assembled by composing stream operators, stream connections, and tuples. Demonstrated the framework with three fault tolerance techniques. Validation by comparing results with real fault injections and the application executing in IBM System S. Future work: automatic model composition based on application source code and physical deployment.
