Enhanced Provenance Model (TAP): Time-aware Provenance for Distributed Systems Original Article: Wenchao Zhou, Ling Ding, Andreas Haeberlen, Zachary Ives,

Presentation transcript:

Enhanced Provenance Model (TAP): Time-aware Provenance for Distributed Systems
Original Article: Wenchao Zhou, Ling Ding, Andreas Haeberlen, Zachary Ives, Boon Thau Loo. TAP: Time-aware Provenance for Distributed Systems
Athiq Ahamed, ITIS, TU-Braunschweig
Supervised by: Dr. Lena Wiese, Georg-August-University Göttingen

Agenda
- Introduction to Provenance
- Existing Systems and their problems
- Motivation
- Challenges
- TAP
- Provenance maintenance and tradeoffs
- Provenance querying and optimizations
- Limitations and Future work

Introduction
- Provenance is the origin of something; it is a versatile concept.
- It is useful for analyzing execution dynamics in systems.
- It has been adapted to several different fields.
- Data provenance explains how a particular piece of data arrived in a database.
- Network provenance applies the concept to networking.
- It helps system administrators diagnose root causes, answer forensic queries, and diagnose failures.
- Examples of provenance systems: ExSPAN, TAP, PASS.

Existing System (PASS)
- Automatically records the complete history of a particular state.
- Very useful where system-level provenance is maintained.
- Avoids manual record keeping, which tends to fail.
- A complete provenance system to collect, store, manage, and search provenance.
- Maintains versions of provenance, so that versions can be compared.

Why and Where model (Data Provenance)
- Why is a piece of data in a particular output or location?
- Where did that data come from?
- Helps the operator understand how a particular piece of data arrived.
- Several syntactic approaches have been followed to present the results.

Existing System (ExSPAN)
- ExSPAN maintains network provenance at Internet scale.
- Answers queries for forensic analysis and failure diagnosis in a distributed environment.
- Extensible framework.
- Built on declarative networking (NDlog).
- Does not support the full range of functionality needed for analyzing distributed systems.

Diagram from original paper (network state at time t1 and at time t2, with t1 < t2)
Figure 1: A three-node example network. The best path between node a and node c (highlighted) changes in response to a topology change.

Example: calculating MINCOST
  mc1 cost(@S,D,C) :- link(@S,D,C).
  mc2 cost(@S,D,C) :- link(@Z,S,C1), mincost(@Z,D,C2), C=C1+C2.
  mc3 mincost(@S,D,MIN<C>) :- cost(@S,D,C).
- The protocol runs continuously, recalculating the minimum cost as links appear and disappear.
- @ is the location specifier in this Datalog-based language.
- The attribute that follows the location specifier names the node that holds the tuple.
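
To make the three rules concrete, here is a minimal Python sketch that evaluates them to a fixpoint on a three-node network. It is not NDlog and not the paper's implementation. The link costs are hypothetical, since the transcript does not give the values in Figure 1, and links are assumed to be symmetric with positive costs. The names `compute_mincost` and `symmetric` are invented for this illustration.

```python
from math import inf

def compute_mincost(links):
    """links maps (node, neighbor) -> cost, one entry per link(@S,D,C) tuple."""
    # mc1 + mc3: every direct link is a candidate cost; start from those.
    best = dict(links)
    changed = True
    while changed:
        changed = False
        # mc2: join link(@Z,S,C1) with mincost(@Z,D,C2) to get cost(@S,D,C1+C2).
        for (z, s), c1 in links.items():
            for (z2, d), c2 in list(best.items()):
                if z2 != z or d == s:
                    continue  # join on Z; skip trivial paths from S back to S
                c = c1 + c2
                # mc3: keep only the minimum cost per (S, D) pair.
                if c < best.get((s, d), inf):
                    best[(s, d)] = c
                    changed = True
    return best

def symmetric(edges):
    """Expand undirected edges into one link tuple per direction (assumption)."""
    out = {}
    for (x, y), c in edges.items():
        out[(x, y)] = c
        out[(y, x)] = c
    return out

# Hypothetical costs, chosen only so that the best a-c path changes.
before = compute_mincost(symmetric({("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 5}))
after = compute_mincost(symmetric({("a", "b"): 1, ("a", "c"): 5}))  # link b-c removed
```

With these made-up costs the best path from a to c goes through b before the change and uses the direct link afterwards, which is the kind of state change that Figure 1 illustrates.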

Problems in Existing System
- Use case: a network operator wants to analyze why the route to a large site such as eBay changed from r1 to r2 a minute or so ago.
- Existing systems cannot answer this type of question.
- The question is about a state change: it asks not about the current state but about state that existed in the past.
- Existing systems can answer queries only after the system has reached a stable state.

Motivation
- To capture causality explicitly.
- To develop a new provenance model.
- Maintenance strategies that capture the time, distribution, and causality of updates.
- New query processing and optimization techniques.
- A novel provenance query language with a declarative specification.
- To answer provenance queries even when the system is in a transient state.

Challenges
- Challenge #1: Transient and inconsistent state
- Challenge #2: Explanations for state changes
- Challenge #3: Security without trusted nodes

System Model
- The individual systems may be distributed across a geographic area or run as separate processes on the same machine.
- Each of these systems is referred to as a node.
- State is expressed as tuples with a fixed schema.
- Derivation rules specify how tuples are derived.
- Rules can be implicit or explicit; NDlog is an example of explicit rules.
- TAP can be applied to systems with either implicit or explicit rules.

TAP: 1
- Adds two new features that address the limitations of the existing systems.
- Remembers dependencies that existed at some point in the past.
- Answers provenance queries even when the system is in a transient state.
- Tuple changes are represented explicitly in TAP's model.
- The dependencies between those changes are represented as well.

TAP: 2
- The graph has four types of vertices: INSERT, DELETE, DERIVE, and UNDERIVE.
- Adds a time dimension to the provenance graph.
- The effects of state changes can be queried.
- TAP captures causality explicitly.
- TAP's provenance graph also contains edges, which represent data flow.
- TAP additionally captures causality flow, that is, the causal relationships between updates.
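
The vertex types and the time dimension can be pictured with a small in-memory model. This is a sketch under assumptions, not TAP's actual data model: the class names `Kind`, `Vertex`, and `ProvenanceGraph`, the integer timestamps, and the exact edges in the example are all made up for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum

class Kind(Enum):
    INSERT = "INSERT"      # a base tuple was inserted
    DELETE = "DELETE"      # a base tuple was deleted
    DERIVE = "DERIVE"      # a tuple was derived through a rule
    UNDERIVE = "UNDERIVE"  # a derivation stopped holding

@dataclass(frozen=True)
class Vertex:
    kind: Kind
    tup: str   # the tuple the update refers to, e.g. "link(@a,b,1)"
    time: int  # hypothetical logical timestamp of the update

@dataclass
class ProvenanceGraph:
    # Maps each vertex to the list of vertices that caused it (causality edges).
    causes: dict = field(default_factory=dict)

    def add(self, vertex, caused_by=()):
        self.causes[vertex] = list(caused_by)
        return vertex

    def history(self, tup):
        """All recorded updates to one tuple, ordered by time."""
        return sorted((v for v in self.causes if v.tup == tup), key=lambda v: v.time)

# A link appears, a mincost tuple is derived from it, then both go away.
g = ProvenanceGraph()
ins = g.add(Vertex(Kind.INSERT, "link(@a,b,1)", 1))
der = g.add(Vertex(Kind.DERIVE, "mincost(@a,b,1)", 2), caused_by=[ins])
dele = g.add(Vertex(Kind.DELETE, "link(@a,b,1)", 5))
und = g.add(Vertex(Kind.UNDERIVE, "mincost(@a,b,1)", 6), caused_by=[dele])
```

Because every update is its own timestamped vertex, `g.history("mincost(@a,b,1)")` returns both the derivation and the later underivation, so past state remains queryable after it has disappeared.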

Diagram from original paper Figure 2: Comparison between classical provenance (left) and time-aware provenance (right)

Provenance Maintenance
- The graph representation needs to be stored for provenance maintenance.
- The graph is stored as relational tables.
- In the relational table, each vertex is a tuple.
- Each tuple also has an additional attribute that stores pointers to its contributing vertices.
- When data dependencies change, an additional time dimension is maintained (provenance versioning).
- Limitation: storing these tables is costly in a distributed environment.
- Optimization techniques address this limitation.
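
As a rough illustration of the relational encoding, the following sketch stores each vertex as a row in an in-memory SQLite table and keeps the pointers to contributing vertices in one extra column. The table name `prov`, the column names, the vertex ids, and the JSON encoding of the pointer list are assumptions made for this example, and the intermediate cost tuples are left out for brevity.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prov (vid TEXT PRIMARY KEY, kind TEXT, tup TEXT, "
    "time INTEGER, preds TEXT)"
)

def add_vertex(vid, kind, tup, time, preds=()):
    # preds holds pointers (vertex ids) to the contributing vertices, as JSON.
    conn.execute(
        "INSERT INTO prov VALUES (?, ?, ?, ?, ?)",
        (vid, kind, tup, time, json.dumps(list(preds))),
    )

def provenance(vid):
    """Follow the pointer attribute back to every contributing vertex."""
    seen, stack = [], [vid]
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.append(cur)
        (preds,) = conn.execute(
            "SELECT preds FROM prov WHERE vid = ?", (cur,)
        ).fetchone()
        stack.extend(json.loads(preds))
    return seen

# Hypothetical vertices for the MINCOST example.
add_vertex("v1", "INSERT", "link(@b,a,1)", 1)
add_vertex("v2", "INSERT", "link(@b,c,1)", 1)
add_vertex("v3", "DERIVE", "mincost(@b,c,1)", 2, preds=["v2"])
add_vertex("v4", "DERIVE", "mincost(@a,c,2)", 3, preds=["v1", "v3"])
```

Calling `provenance("v4")` walks the pointers from the derived tuple back to the two link insertions. In a distributed deployment the rows would live on different nodes, which is where the storage cost mentioned above comes from.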

Optimization
- Three techniques: provenance deltas, system input logs, per-node input logs.
- Provenance deltas
  - Instead of maintaining the whole provenance information in every version, only the deltas between adjacent versions are recorded.
- System input logs
  - Only raw inputs are recorded, that is, only the base tuples for the entire system.
  - Checkpoints of the current state can be taken and later discarded to save space.
- Per-node input logs
  - Provenance is generated on the fly for each node through deterministic replay.
- There is no clear winner; choose the technique according to the use case (tradeoff).
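
The provenance-delta idea can be sketched in a few lines: keep the first version in full, store only what was added and removed between adjacent versions, and rebuild any version by replay. The functions `make_deltas` and `reconstruct` and the edge labels are hypothetical; the original system's on-disk format is not described in the slides.

```python
def make_deltas(versions):
    """versions: list of edge sets. Returns the first version plus
    one (added, removed) pair for each pair of adjacent versions."""
    base = set(versions[0])
    deltas = []
    for prev, cur in zip(versions, versions[1:]):
        deltas.append((cur - prev, prev - cur))
    return base, deltas

def reconstruct(base, deltas, k):
    """Rebuild version k (0-based) by replaying the first k deltas."""
    state = set(base)
    for added, removed in deltas[:k]:
        state = (state - removed) | added
    return state

# Hypothetical provenance edges (cause, effect) across three versions.
versions = [
    {("link_ab", "mincost_ab")},
    {("link_ab", "mincost_ab"), ("link_bc", "mincost_ac")},
    {("link_ab", "mincost_ab"), ("link_ac", "mincost_ac")},
]
base, deltas = make_deltas(versions)
```

The tradeoff is visible even here: the deltas are smaller than full copies, but answering a query about version k costs a replay of k deltas.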

Provenance Querying
- To query TAP's provenance graph, the authors developed a new query language, TapQL.
- The provenance query language ProQL uses a compact graph-based representation.
- ProQL can be translated to SQL queries over an RDBMS.
- ProQL addresses queries such as tuple-based provenance.
- TapQL is an extension of ProQL that takes time and changes into consideration.
- Snippet: effects of link insertion at a certain time
  FOR [-mincost $X] <-+ [+link $Y]
  WHERE $Y.time > t
  INCLUDE PATH $X <-+ $Y
  RETURN $X
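
One way to read the snippet is: return every mincost removal $X that is transitively caused by a link insertion $Y that happened after time t. The Python sketch below evaluates that reading over a tiny in-memory graph. It is an interpretation and not a TapQL implementation: treating "-" as a removal, "+" as an insertion, and "<-+" as a causal path of one or more steps are assumptions here, and the sketch returns only the matching $X vertices rather than the paths requested by INCLUDE PATH.

```python
def transitive_causes(causes, x):
    """All vertices reachable from x over one or more cause edges."""
    found, stack = set(), list(causes.get(x, []))
    while stack:
        v = stack.pop()
        if v not in found:
            found.add(v)
            stack.extend(causes.get(v, []))
    return found

def run_query(vertices, causes, t):
    """vertices: id -> (sign, relation, time); causes: id -> direct cause ids."""
    result = []
    for x, (sign, rel, _) in vertices.items():
        if (sign, rel) != ("-", "mincost"):
            continue  # FOR [-mincost $X]
        for y in transitive_causes(causes, x):  # $X <-+ $Y
            ysign, yrel, ytime = vertices[y]
            if (ysign, yrel) == ("+", "link") and ytime > t:  # WHERE $Y.time > t
                result.append(x)  # RETURN $X
                break
    return result

# Hypothetical update vertices and causal edges.
vertices = {
    "y1": ("+", "link", 10),
    "d1": ("+", "mincost", 11),
    "x1": ("-", "mincost", 11),
    "y0": ("+", "link", 3),
    "x0": ("-", "mincost", 4),
}
causes = {"x1": ["y1"], "d1": ["y1"], "x0": ["y0"]}
```

With t = 5 only x1 qualifies, because the link insertion behind x0 happened at time 3.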

Diagram from original paper Figure 3: The multi-staged querying strategy

Querying Strategy
- Optimized for replays of provenance deltas.
- Macroquery
  - Iterates from the top to the respective candidate tuples.
- Microquery
  - Gets the provenance of the respective tuple in a distributed, recursive way.
  - Stops when a base tuple is reached (or when constraints in the query are met).
- Vertex query
  - Returns the set of vertices that are adjacent to the given vertex.
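
The layering of the three query levels can be sketched as three small functions that call each other. This is a single-process approximation with hypothetical names. In the system described above a vertex query would be answered by the node that stores the vertex, and the slides do not spell out how candidates are selected, so the candidate list is simply passed in. The graph is assumed to be acyclic.

```python
def vertex_query(graph, v):
    """Lowest level: return the vertices adjacent to v (its direct causes)."""
    return graph.get(v, [])

def microquery(graph, v, is_base):
    """Middle level: recursively collect the provenance tree of one tuple,
    stopping as soon as a base tuple is reached."""
    if is_base(v):
        return {v: {}}
    children = {}
    for p in vertex_query(graph, v):
        children.update(microquery(graph, p, is_base))
    return {v: children}

def macroquery(graph, candidates, is_base):
    """Top level: iterate over the candidate tuples and run one
    microquery for each of them."""
    return [microquery(graph, c, is_base) for c in candidates]

# Hypothetical dependency graph: tuple -> tuples it was derived from.
graph = {
    "mincost(@a,c,2)": ["link(@b,a,1)", "mincost(@b,c,1)"],
    "mincost(@b,c,1)": ["link(@b,c,1)"],
}
is_base = lambda v: v.startswith("link")
trees = macroquery(graph, ["mincost(@a,c,2)"], is_base)
```

Each level uses only the level below it, so a replay-based backend would only need to reimplement `vertex_query`.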

Limitations and future work
- The authors started with security as a motivation but have not implemented it.
- They left security as future work and, moreover, described it as an impossible task.
- All tests assumed that every node is reliable, which is not always true.
- Reliable query results can be provided only with weaker guarantees.

Thank You!