Published by Steven Banks. Modified over 6 years ago.
1
Enhanced Provenance Model (TAP): Time-aware Provenance for Distributed Systems
Original article: Wenchao Zhou, Ling Ding, Andreas Haeberlen, Zachary Ives, Boon Thau Loo. TAP: Time-aware Provenance for Distributed Systems
Athiq Ahamed, ITIS, TU-Braunschweig
Supervised by: Dr. Lena Wiese, Georg-August-University Göttingen
2
Agenda
Introduction to provenance
Existing systems and their problems
Motivation
Challenges
TAP
Provenance maintenance and tradeoffs
Provenance querying and optimizations
Limitations and future work
3
Introduction
Provenance is the origin of something; it is a versatile concept.
It is useful for analyzing execution dynamics in systems.
It has been adapted to several different fields.
Data provenance explains how a particular piece of data arrived in a database.
Network provenance applies the same idea to networking.
Benefits: system administrators can use it for root-cause diagnosis, forensic queries, and failure diagnosis.
Examples of provenance systems: ExSPAN, TAP, PASS
4
Existing System (PASS)
Automatically records the complete history of a particular state
Very useful where system-level provenance is maintained
Avoids manual record keeping, which tends to be error-prone
A complete provenance system to collect, store, manage, and search provenance
Maintains versions of provenance so that versions can be compared
5
Why and Where Model of Data Provenance
Why: why is a piece of data in a particular output or location?
Where: where did that data come from?
Helps the operator understand how a particular piece of data arrived
Several syntactic approaches have been followed to present the results
6
ExSPAN
ExSPAN maintains network provenance at Internet scale
Answers queries for tasks such as forensic analysis and failure diagnosis in a distributed environment
Extensible framework
Declarative networking platform (NDlog)
Does not support the full range of functionality needed for analyzing distributed systems
7
Diagram from original paper
Figure 1: A three-node example network, shown at time t1 and at a later time t2 (t1 < t2). The best path between node a and node c (highlighted) changes in response to a topology change.
8
Example calculating MINCOST
mc1 :-
mc2 :- C=C1+C2.
mc3 :-
The protocol runs continuously and recalculates the minimum cost as links appear and disappear.
@ is the location specifier in a Datalog-style language.
The location specifier is followed by the node that holds the respective tuple.
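The rule bodies did not survive on this slide, so the following Python sketch is not the NDlog program itself. It assumes the usual shape of a min-cost computation that matches the surviving fragment C=C1+C2: a link gives a direct cost, a path cost is a link cost plus the neighbour's best cost, and the minimum is kept. The function names, node names, and link costs are hypothetical and only mimic the three-node example.

```python
def undirected(links):
    """Expand {(x, y): cost} so every link can be used in both directions."""
    both = dict(links)
    for (x, y), cost in links.items():
        both[(y, x)] = cost
    return both


def mincost(links):
    """Return {(src, dst): minimum path cost} for directed links with positive costs."""
    inf = float("inf")
    nodes = {n for edge in links for n in edge}
    best = dict(links)  # a direct link is the first candidate cost
    changed = True
    while changed:  # repeat until no cost improves (fixpoint)
        changed = False
        for (s, z), c1 in links.items():
            for d in nodes:
                c2 = best.get((z, d))
                if d == s or c2 is None:
                    continue
                if c1 + c2 < best.get((s, d), inf):  # C = C1 + C2
                    best[(s, d)] = c1 + c2
                    changed = True
    return best


# Hypothetical costs: at t1 the cheap route a-b-c exists, at t2 link b-c is gone.
links_t1 = undirected({("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 5})
links_t2 = undirected({("a", "b"): 1, ("a", "c"): 5})
print(mincost(links_t1)[("a", "c")], mincost(links_t2)[("a", "c")])
```

Recomputing from scratch after each topology change is a simplification; the slide describes a protocol that runs continuously and updates its results incrementally.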
9
Problems in Existing System
Use case: a network operator wants to analyze why the route to a large site such as eBay changed from r1 to r2 a minute or so ago.
Existing systems cannot answer this type of question.
The question concerns a state change.
It does not ask about the current state but about a state that existed in the past.
Existing systems can answer queries only after the system has reached a stable state.
10
Motivation
To explicitly capture causality
To develop a new provenance model
Maintenance strategies for capturing the time, distribution, and causality of updates
New query processing and optimization techniques
A novel provenance query language with declarative specification
To answer provenance queries even when the system is in a transient state
11
Challenges
Challenge #1: Transient and inconsistent state
Challenge #2: Explanations for state changes
Challenge #3: Security without trusted nodes
12
System Model
The individual systems may be distributed across geographical areas or may be separate processes in the same system.
Each of these systems is referred to as a node.
State is expressed as tuples with a fixed schema.
Derivation rules specify how tuples are derived.
Rules can be implicit or explicit (example: NDlog).
TAP can be applied in both the implicit and the explicit case.
13
TAP: 1
Adds two new features that address the limitations of the existing systems
Remembers dependencies that existed at some point in the past
Answers provenance queries even when the system is in a transient state
Explicitly represents tuple changes in TAP's model
Represents the dependencies between those changes
14
TAP: 2
Four types of vertices: INSERT, DELETE, DERIVE, and UNDERIVE
Adds a time dimension to the provenance graph
One can query the effects of different state changes
TAP captures causality explicitly
TAP's provenance graph also contains edges, which represent data flow
TAP additionally captures causality flow, the causal relationships between updates
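A minimal sketch of how such a graph could be represented, assuming each vertex carries one of the four types named above, the tuple it concerns, the node that holds it, and a timestamp. The class names, field names, and the example events are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class VertexType(Enum):
    INSERT = "INSERT"
    DELETE = "DELETE"
    DERIVE = "DERIVE"
    UNDERIVE = "UNDERIVE"


@dataclass(frozen=True)
class Vertex:
    kind: VertexType
    tuple_text: str  # the tuple this update concerns
    node: str        # the node that holds the tuple
    time: int        # when the update happened


class ProvenanceGraph:
    def __init__(self):
        self.causes = {}  # vertex -> list of vertices that caused it

    def add(self, vertex, caused_by=()):
        self.causes[vertex] = list(caused_by)
        return vertex

    def explain(self, vertex):
        """Return every update that transitively caused `vertex`, oldest first."""
        found = set()
        stack = list(self.causes.get(vertex, []))
        while stack:
            v = stack.pop()
            if v not in found:
                found.add(v)
                stack.extend(self.causes.get(v, []))
        return sorted(found, key=lambda v: v.time)


g = ProvenanceGraph()
ins = g.add(Vertex(VertexType.INSERT, "link(b,c,1)", "b", 1))
der = g.add(Vertex(VertexType.DERIVE, "mincost(a,c,2)", "a", 2), caused_by=[ins])
dele = g.add(Vertex(VertexType.DELETE, "link(b,c,1)", "b", 3))
und = g.add(Vertex(VertexType.UNDERIVE, "mincost(a,c,2)", "a", 4), caused_by=[dele])
new = g.add(Vertex(VertexType.DERIVE, "mincost(a,c,5)", "a", 5), caused_by=[und])

for v in g.explain(new):
    print(v.time, v.kind.value, v.tuple_text)
```

Because every vertex is an update with a timestamp rather than a tuple, the question "why did the cost change?" has an answer in the graph even after the old tuple no longer exists.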
15
Diagram from original paper
Figure 2: Comparison between classical provenance (left) and time-aware provenance (right)
16
Provenance Maintenance
The graph representation needs to be stored for provenance maintenance.
The graph is stored as relational tables.
In a relational table, each vertex is a tuple.
Each tuple also has an additional attribute that stores pointers to its contributing vertices.
When data dependencies change, an additional time dimension is maintained (provenance versioning).
Limitation: storing these tables is costly in a distributed environment.
Optimization techniques address this limitation.
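One way to picture the relational storage, as a sketch only: a single table in which each row is a vertex, one attribute holds pointers to the contributing vertices, and a time column provides the versioning. The table name, column names, the JSON encoding of the pointers, and the sample rows are assumptions made for illustration; the slide does not give the actual schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE prov (
           vid INTEGER,           -- vertex identifier
           version_time INTEGER,  -- time at which this version became valid
           tuple_text TEXT,       -- the tuple the vertex stands for
           node TEXT,             -- node that holds the tuple
           contributors TEXT,     -- JSON list of contributing vertex ids
           PRIMARY KEY (vid, version_time))"""
)


def store(vid, time, tuple_text, node, contributors):
    """Add a new version of a vertex; older versions are kept."""
    conn.execute(
        "INSERT INTO prov VALUES (?, ?, ?, ?, ?)",
        (vid, time, tuple_text, node, json.dumps(contributors)),
    )


def contributors_at(vid, time):
    """Return the contributor ids valid at `time`, or None if the vertex did not exist yet."""
    row = conn.execute(
        "SELECT contributors FROM prov WHERE vid = ? AND version_time <= ? "
        "ORDER BY version_time DESC LIMIT 1",
        (vid, time),
    ).fetchone()
    return json.loads(row[0]) if row else None


store(1, 1, "link(a,b,1)", "a", [])
store(2, 1, "link(b,c,1)", "b", [])
store(3, 1, "link(a,c,5)", "a", [])
store(4, 2, "mincost(a,c,2)", "a", [1, 2])  # derived via a-b-c
store(4, 4, "mincost(a,c,5)", "a", [3])     # dependency changed: new version

print(contributors_at(4, 3), contributors_at(4, 5))
```

Keeping a full row per version is what makes storage expensive, which is the limitation the optimizations on the next slide address.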
17
Optimization
Provenance deltas
Instead of maintaining the whole provenance information in every version, only the deltas between adjacent versions are recorded.
System input logs
Only raw inputs are recorded: the base tuples for the entire system.
Checkpoints of the current state are taken and later discarded to save space.
Per-node input logs
Provenance is generated on the fly for each node through deterministic replay.
There is no clear winner; choose the technique according to the use case (tradeoff).
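The provenance-delta idea can be sketched as follows, assuming the provenance of one tuple is modelled as a set of contributing tuples and that versions are recorded in time order. The class `DeltaLog` and its methods are hypothetical names, not part of TAP.

```python
class DeltaLog:
    """Stores only what changed between adjacent versions of a dependency set."""

    def __init__(self):
        self.entries = []      # (time, added, removed), in increasing time order
        self._current = set()  # latest full version, used to compute the next delta

    def record(self, time, deps):
        """Record a new version by storing its difference from the previous one."""
        deps = set(deps)
        self.entries.append((time, deps - self._current, self._current - deps))
        self._current = deps

    def at(self, time):
        """Rebuild the version valid at `time` by replaying deltas from the start."""
        state = set()
        for t, added, removed in self.entries:
            if t > time:
                break
            state = (state | added) - removed
        return state


log = DeltaLog()
log.record(2, {"link(a,b,1)", "link(b,c,1)"})  # mincost(a,c) derived via b
log.record(4, {"link(a,c,5)"})                 # after link b-c disappears

print(sorted(log.at(3)), sorted(log.at(4)))
```

The sketch also shows the tradeoff mentioned on the slide: less is stored per version, but answering a query about a past time requires replaying the deltas.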
18
Provenance Querying
To query TAP's provenance graph, the authors developed a new query language, TapQL.
ProQL is a provenance query language with a compact graph-based representation.
ProQL can be translated to SQL queries over an RDBMS.
TapQL is an extension of ProQL, which addresses queries such as tuple-based provenance.
As an extension of ProQL, TapQL takes time and changes into consideration.
Snippet: effects of a link insertion at a certain time
FOR [-mincost $X] <-+ [+link $Y]
WHERE $Y.time > t
INCLUDE PATH $X <-+ $Y
RETURN $X
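To make the intent of the snippet concrete, here is a rough Python analogue. It assumes that `<-+` means "reachable over one or more edges" and that `-` and `+` mark removal and insertion vertices; both readings are assumptions, and the data layout (`vertices`, `causes`) and all vertex ids are invented for the example.

```python
def effects_of_link_insertions(vertices, causes, t):
    """Return ids of '-mincost' vertices caused, directly or transitively,
    by a '+link' vertex whose time is greater than t.

    vertices: {vertex_id: (op, relation, time)}
    causes:   {vertex_id: [ids of the vertices that caused it]}
    """

    def caused_by_late_insertion(x):
        stack = list(causes.get(x, []))
        seen = set()
        while stack:
            y = stack.pop()
            if y in seen:
                continue
            seen.add(y)
            op, relation, time = vertices[y]
            if op == "+" and relation == "link" and time > t:
                return True
            stack.extend(causes.get(y, []))
        return False

    return {
        x
        for x, (op, relation, _) in vertices.items()
        if op == "-" and relation == "mincost" and caused_by_late_insertion(x)
    }


vertices = {
    "y0": ("+", "link", 1),
    "x0": ("-", "mincost", 2),
    "y1": ("+", "link", 5),
    "d1": ("+", "mincost", 6),
    "x1": ("-", "mincost", 6),
}
causes = {"x0": ["y0"], "d1": ["y1"], "x1": ["y1"]}

print(sorted(effects_of_link_insertions(vertices, causes, t=3)))
```

A declarative query states only the pattern and the time constraint; the traversal written out by hand here is what a query processor would have to plan and execute.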
19
Diagram from original paper
Figure 3: The multi-staged querying strategy
20
Querying Strategy
Optimized for replays of provenance deltas
Macroquery
Iterates from the top to the respective candidate tuples
Microquery
Gets the provenance of the respective tuple in a distributed, recursive way
Stops when a base tuple is reached (constraints in the query)
Vertex query
Returns the set of vertices that are adjacent to the given vertex
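The three stages can be sketched in a single process as shown below. This is an illustration under simplifying assumptions: the graph is a local dictionary rather than state spread across nodes, "adjacent" is read as "contributing", and the optional `max_depth` argument stands in for a query constraint. All function names and the sample graph are hypothetical.

```python
def vertex_query(graph, vertex):
    """Return the vertices adjacent to (here: contributing to) `vertex`."""
    return graph.get(vertex, [])


def microquery(graph, root, max_depth=None):
    """Recursively collect the provenance of `root`.

    Recursion stops at base tuples (no contributors) or at `max_depth`.
    Returns {vertex: [contributing vertices]} for everything visited.
    """
    tree = {}

    def visit(vertex, depth):
        if vertex in tree:
            return
        children = vertex_query(graph, vertex)
        if max_depth is not None and depth >= max_depth:
            children = []  # constraint reached: do not expand further
        tree[vertex] = list(children)
        for child in children:
            visit(child, depth + 1)

    visit(root, 0)
    return tree


def macroquery(graph, is_candidate, max_depth=None):
    """Select candidate tuples, then run a microquery for each of them."""
    return {v: microquery(graph, v, max_depth) for v in graph if is_candidate(v)}


graph = {
    "mincost(a,c,2)": ["link(a,b,1)", "mincost(b,c,1)"],
    "mincost(b,c,1)": ["link(b,c,1)"],
    "link(a,b,1)": [],
    "link(b,c,1)": [],
}

full = microquery(graph, "mincost(a,c,2)")
shallow = microquery(graph, "mincost(a,c,2)", max_depth=1)
result = macroquery(graph, lambda v: v.startswith("mincost(a,"))
print(sorted(full), sorted(shallow), sorted(result))
```

In a distributed setting each `vertex_query` call would be a message to the node that holds the vertex, which is why the slide describes the microquery as distributed and recursive.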
21
Limitations and future work
The authors started with security as a motivation but did not implement it.
They list security as future work and moreover describe it as an impossible task.
All tests assume that every node is reliable, which is not always true.
Reliable query results can be provided only with weaker guarantees.
22
Thank You!