A Generic Provenance Middleware for Database Queries, Updates, and Transactions
Bahareh Sadat Arab (1), Dieter Gawlick (2), Venkatesh Radhakrishnan (2), Hao Guo (1), Boris Glavic (1)
(1) IIT DBGroup, (2) Oracle

Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions

Introduction
Data Provenance
– Information about the origin and creation process of data
Provenance tracking for database operations
– Considerable interest from the database community in the last decade
The de-facto standard for database provenance [1,2,3,5,7]
– Model provenance as annotations on data (e.g., tuples)
– Compute the provenance by propagating annotations (query rewrite)
SELECT DISTINCT Owner FROM CanAcc;

Use Cases
Debugging data and transformations (queries) [1]
Probabilistic databases (queries) [5]
Auditing and compliance (transactions and update statements) [6]
Understanding data integration transformations (queries and transactions)
Assessing data quality and trust (queries and transactions) [7]
Computing provenance for updates and transactions is essential for many use cases.

Shortcomings of State-of-the-Art
No practical implementation for updates
No system or model supports transactions
Inflexible provenance storage
– Always on [2,3]
– On-demand only [1]
Query rewrites use atypical access patterns and operator sequences
– This leads to poor execution plans
Most systems support only one type of provenance

Objectives
1. Vision: Generic Provenance Database Middleware (GProM)
– Provenance for queries, updates, and transactions
– User decides when to compute and store provenance
– Supports multiple provenance models
– Database-independent
2. Tracking provenance of concurrent transactions
– Reenactment queries

Contributions
1. First solution for provenance of transactions
2. Retroactive on-demand provenance computation
– Using read-only reenactment
3. Only requires audit log + time travel
– Supported by most DBMS
– No additional storage and runtime overhead
4. Non-invasive provenance computation
– Query rewrite + annotation propagation

Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions

System Architecture
Database-independent middleware
– Pluggable parser and SQL code generator
Internal query representation
– Relational Algebra Graph Model (AGM)
Core driver: query rewrites
– Provenance computation
– Flexible storage policies for provenance
– Provenance import/export
– AGM optimizer (rewritten queries)
– Extensibility: Rewrite Specification Language (RSL)
Initial prototype built on top of Oracle

GProM Overview (figure)

Provenance Computation
Query rewrite
– Take the original query q and rewrite it into q+
– q+ computes the original results plus provenance
– Propagate provenance through operations

Example Rewrite
Input:
  SELECT DISTINCT u.Owner
  FROM USacc u, CanAcc c
  WHERE u.ID = c.ID;
Rewrite Parts:
  USacc:
    SELECT ID, Owner, Balance, Type, ID AS P1, Owner AS P2, Balance AS P3, Type AS P4 FROM USacc
  CanAcc:
    SELECT ID, Owner, Balance, Type, ID AS P5, Owner AS P6, Balance AS P7, Type AS P8 FROM CanAcc
  WHERE u.ID = c.ID
  SELECT DISTINCT Owner:
    SELECT Owner, P1, P2, P3, P4, P5, P6, P7, P8
Output:
  SELECT u.Owner, P1, P2, P3, P4, P5, P6, P7, P8
  FROM (SELECT ID, Owner, Balance, Type, ID AS P1, Owner AS P2, Balance AS P3, Type AS P4 FROM USacc) u,
       (SELECT ID, Owner, Balance, Type, ID AS P5, Owner AS P6, Balance AS P7, Type AS P8 FROM CanAcc) c
  WHERE u.ID = c.ID;

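
The rewrite on this slide can be tried out on a small database. The following is a minimal sketch, assuming SQLite (via Python's sqlite3 module) as a stand-in DBMS and a handful of made-up rows; neither the data nor the Python variable names come from the slides. It runs the input query and the rewritten output query side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE USacc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
CREATE TABLE CanAcc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
-- Hypothetical sample rows, chosen so one owner has two join matches.
INSERT INTO USacc VALUES (1, 'Alice Bright', 1500000, 'Premium'),
                         (4, 'Alice Bright', 10, 'Standard'),
                         (2, 'Mark Smith', 50, 'Standard');
INSERT INTO CanAcc VALUES (1, 'Alice Bright', 300, 'US_dollar'),
                          (4, 'Alice Bright', 70, 'US_dollar'),
                          (9, 'Jane Doe', 5, 'Standard');
""")

# Input query from the slide.
original = """
SELECT DISTINCT u.Owner
FROM USacc u, CanAcc c
WHERE u.ID = c.ID
"""

# Rewritten query from the slide: every base relation access duplicates
# its attributes as provenance columns P1..P8, and DISTINCT is dropped.
rewritten = """
SELECT u.Owner, P1, P2, P3, P4, P5, P6, P7, P8
FROM (SELECT ID, Owner, Balance, Type,
             ID AS P1, Owner AS P2, Balance AS P3, Type AS P4
      FROM USacc) u,
     (SELECT ID, Owner, Balance, Type,
             ID AS P5, Owner AS P6, Balance AS P7, Type AS P8
      FROM CanAcc) c
WHERE u.ID = c.ID
"""

original_rows = conn.execute(original).fetchall()
provenance_rows = sorted(conn.execute(rewritten).fetchall())

for row in provenance_rows:
    print(row[0], "<- USacc", row[1:5], "CanAcc", row[5:9])
```

With this sample data the original query returns a single owner, while the rewritten query returns one row per contributing pair of input tuples, so each result tuple is listed once for every derivation.
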
Provenance Computation
Operates on relational algebra representation of queries
– Fixed set of rewrite rules per provenance type: one per type of algebra operator
Recursive top-down rewrite
– For each relation access: duplicate attributes as provenance
– For each operator: replace with algebra graph that propagates provenance annotations
Composable

Supporting Past Queries, Updates, and Transactions
Only needs audit log and time travel
– Supported by most DBMS
Sufficient for provenance of past queries [4]
Our contribution
– Sufficient for updates and transactions

Provenance Generation and Storage Policies
GProM default
– Only compute provenance if explicitly requested
User can register storage policies
– When to store which type of provenance
POLICY storeOnR {
  FIRE ON Query, Insert q
  WHEN Root(q) +=> Table(R)
  COMPUTE PI-CS
  STORE AS NEW TABLE
  NAMING SCHEME Hash
}

Optimizing Rewritten Queries
Query rewrites use atypical access patterns and operator sequences, which leads to poor execution plans
Optimization for rewritten queries
– Heuristic
– Cost-based
  SELECT ID, Owner, Balance,
         CASE WHEN Balance > THEN 'Premium' ELSE Type END AS Type,
         prov_CanAcc_ID, prov_CanAcc_Owner, prov_CanAcc_Balance, prov_CanAcc_Type,
         prov_USacc_ID, prov_USacc_Owner, prov_USacc_Balance, prov_USacc_Type
  FROM u1 ...

  SELECT ID, Owner, Balance, 'Premium' AS Type,
         prov_CanAcc_ID, prov_CanAcc_Owner, prov_CanAcc_Balance, prov_CanAcc_Type,
         prov_USacc_ID, prov_USacc_Owner, prov_USacc_Balance, prov_USacc_Type
  FROM u1 WHERE Balance >
  UNION ALL
  SELECT * FROM u1 WHERE (Balance > ) IS NOT TRUE

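
The two query shapes on this slide (a single scan with CASE, and a UNION ALL of two filtered scans) return the same rows. The sketch below checks this on SQLite through Python's sqlite3 module. The threshold 1000000, the sample rows, and the omission of the provenance columns are assumptions made for illustration; the slide elides the constant. It needs SQLite 3.23 or newer for IS NOT TRUE.

```python
import sqlite3

THRESHOLD = 1000000  # assumed value; the slide elides the constant

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE u1 (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
-- Hypothetical rows, including a NULL balance to exercise IS NOT TRUE.
INSERT INTO u1 VALUES (3, 'Alice Bright', 1500000, 'Standard'),
                      (5, 'Mark Smith', 50, 'Standard'),
                      (7, 'Jane Doe', NULL, 'Standard');
""")

# Shape 1: one scan, the update is expressed as a CASE projection.
case_form = """
SELECT ID, Owner, Balance,
       CASE WHEN Balance > :t THEN 'Premium' ELSE Type END AS Type
FROM u1
"""

# Shape 2: two scans, updated and untouched tuples are unioned.
union_form = """
SELECT ID, Owner, Balance, 'Premium' AS Type FROM u1 WHERE Balance > :t
UNION ALL
SELECT ID, Owner, Balance, Type FROM u1 WHERE (Balance > :t) IS NOT TRUE
"""

def run(sql):
    # Sort by the unique ID so both result lists are comparable.
    rows = conn.execute(sql, {"t": THRESHOLD}).fetchall()
    return sorted(rows, key=lambda r: r[0])

case_rows = run(case_form)
union_rows = run(union_form)
print(case_rows == union_rows)
```

The row with a NULL balance shows why the second branch uses IS NOT TRUE rather than NOT: the comparison evaluates to NULL for that row, and the tuple must still appear unchanged in the result.
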
Rewrite Extensibility
Extensible using Rewrite Specification Language (RSL)
– Concise specification of rewrite rules
RULE mergeSelections {
  FOR q => c => g
  WHERE q->type = selection AND c->type = selection
  REWRITE INTO selection [pred = q->pred AND c->pred] => g
}

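
To make the effect of the mergeSelections rule concrete, here is a minimal Python sketch of the same rewrite on a toy operator tree. The Op class, its fields, and the merge_selections function are hypothetical illustrations and not part of RSL or GProM; the sketch only mirrors the pattern "selection over selection becomes one selection with the conjunction of both predicates".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    """Hypothetical single-child algebra operator (not GProM's AGM)."""
    type: str                      # e.g. "selection" or "table"
    pred: Optional[str] = None     # predicate text for selections
    child: Optional["Op"] = None   # input operator, None for tables
    name: Optional[str] = None     # table name for table accesses


def merge_selections(q: Op) -> Op:
    """Merge every chain of adjacent selections, working bottom-up."""
    if q.child is not None:
        # Rewrite the input first so it contains no adjacent selections.
        q = Op(q.type, q.pred, merge_selections(q.child), q.name)
    c = q.child
    if q.type == "selection" and c is not None and c.type == "selection":
        # q => c => g becomes selection[q.pred AND c.pred] => g
        return Op("selection", f"({q.pred}) AND ({c.pred})", c.child)
    return q


tree = Op("selection", "a > 1",
          Op("selection", "b < 5",
             Op("selection", "c = 2",
                Op("table", name="R"))))

merged = merge_selections(tree)
print(merged.pred, "over", merged.child.name)
```

Applying the rule bottom-up means a chain of any length collapses into a single selection in one pass, since each merged input is already free of adjacent selections.
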
Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions

Provenance of Transactions
  INSERT INTO USacc (SELECT ID, Owner, Balance, 'Standard' AS Type FROM CanAcc WHERE Type = 'US_dollar');
  UPDATE USacc SET Type = 'Premium' WHERE Balance > ;
  COMMIT;

Provenance of Transactions
  INSERT INTO USacc (SELECT ID, Owner, Balance, 'Standard' AS Type FROM CanAcc WHERE Type = 'US_dollar');
  UPDATE USacc SET Type = 'Premium' WHERE Balance > ;
(Figure labels: u1, u2)

Provenance of Transactions
Our approach: reenactment + provenance propagation
1. Gather Transaction Information
2. Construct Update Reenactment Query
3. Construct Transaction Reenactment Query
4. Rewrite For Provenance Computation
5. Execute Query
Currently supports
– Snapshot Isolation
– Statement-level Snapshot Isolation

1. Gather Transaction Information
Retrieve SQL statements of transaction from audit log
Update u1:
  INSERT INTO USacc (SELECT ID, Owner, Balance, 'Standard' AS Type FROM CanAcc WHERE Type = 'US_dollar');
Update u2:
  UPDATE USacc SET Type = 'Premium' WHERE Balance > ;

2. Translate Updates: Reenactment
An update reads a table version and outputs an updated table version
Multiple versions of the database
– Each modification of a tuple t causes a new version to be created
– Old tuple versions are kept (SI)
– Add version annotation τ to provenance of each updated row
Use semiring model
  UPDATE USacc SET Type = 'Premium' WHERE Balance > ;

2. Translate Updates
Construct update reenactment query
– Simulates effect of update
– Read DB version seen by update using time travel
– Query result = updated table (annotation-equivalent)
Insert:
  INSERT INTO USacc (SELECT ID, Owner, Balance, 'Standard' AS Type FROM CanAcc WHERE Type = 'US_dollar');
Reenactment query:
  SELECT ID, Owner, Balance, 'Standard' AS Type FROM CanAcc AS OF SCN 3652 WHERE Type = 'US_dollar'
  UNION ALL
  SELECT * FROM USacc AS OF SCN 3652;
Update:
  UPDATE USacc SET Type = 'Premium' WHERE Balance > ;
Reenactment query:
  SELECT ID, Owner, Balance, 'Premium' AS Type FROM USacc AS OF SCN 3652 WHERE Balance >
  UNION ALL
  SELECT * FROM USacc AS OF SCN 3652 WHERE (Balance > ) IS NOT TRUE;

3. Construct Reenactment Query
Simulates the whole transaction
– Annotation-equivalent to original transaction
Merge reenactment queries based on concurrency control protocol
– Each concurrency control protocol requires a different merge process
– SERIALIZABLE (snapshot isolation): sees modifications committed before the transaction started + previous updates of the transaction
– READ COMMITTED (statement-level snapshot isolation): each statement sees changes committed by concurrent transactions
  WITH u1 AS (
    SELECT ID, Owner, Balance, 'Standard' AS Type FROM CanAcc AS OF SCN 3652 WHERE Type = 'US_dollar'
    UNION ALL
    SELECT * FROM USacc AS OF SCN 3652)
  SELECT ID, Owner, Balance, 'Premium' AS Type FROM u1 WHERE Balance >
  UNION ALL
  SELECT * FROM u1 WHERE (Balance > ) IS NOT TRUE;

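
The claim that the reenactment query reproduces the table left behind by the transaction can be checked on a toy database. This is a minimal sketch under several assumptions: SQLite has no time travel, so the snapshot tables USacc_snap and CanAcc_snap (hypothetical names) stand in for "AS OF SCN 3652"; the threshold 1000000 and the sample rows are made up; and only a single, non-concurrent transaction is run.

```python
import sqlite3

THRESHOLD = 1000000  # assumed value; the slides elide the constant

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE USacc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
CREATE TABLE CanAcc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
-- Hypothetical rows.
INSERT INTO USacc VALUES (1, 'Peter Moss', 2000000, 'Standard');
INSERT INTO CanAcc VALUES (3, 'Alice Bright', 1500000, 'US_dollar'),
                          (5, 'Mark Smith', 50, 'US_dollar'),
                          (6, 'Jane Doe', 900, 'Standard');
-- Stand-ins for the table versions "AS OF SCN 3652".
CREATE TABLE USacc_snap AS SELECT * FROM USacc;
CREATE TABLE CanAcc_snap AS SELECT * FROM CanAcc;
""")

# Run the actual transaction (updates u1 and u2 from the slides).
conn.execute("""
INSERT INTO USacc
SELECT c.ID, c.Owner, c.Balance, 'Standard'
FROM CanAcc c WHERE c.Type = 'US_dollar'
""")
conn.execute("UPDATE USacc SET Type = 'Premium' WHERE Balance > ?",
             (THRESHOLD,))
conn.commit()

# Reenact the transaction as one read-only query over the snapshots:
# the reenactment of u2 reads the output of the reenactment of u1.
reenactment = """
WITH u1 AS (
  SELECT c.ID AS ID, c.Owner AS Owner, c.Balance AS Balance,
         'Standard' AS Type
  FROM CanAcc_snap c WHERE c.Type = 'US_dollar'
  UNION ALL
  SELECT ID, Owner, Balance, Type FROM USacc_snap
)
SELECT ID, Owner, Balance, 'Premium' AS Type FROM u1 WHERE Balance > :t
UNION ALL
SELECT ID, Owner, Balance, Type FROM u1 WHERE (Balance > :t) IS NOT TRUE
"""

actual = sorted(conn.execute("SELECT * FROM USacc").fetchall(),
                key=lambda r: r[0])
reenacted = sorted(conn.execute(reenactment, {"t": THRESHOLD}).fetchall(),
                   key=lambda r: r[0])
print(actual == reenacted)
```

The reenactment only reads the snapshots, so it can be run after the fact without touching the current table contents.
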
4. Rewrite For Provenance Computation
Rewrite reenactment query to compute provenance using annotation propagation
  WITH u1 AS (
    SELECT ID, Owner, Balance, 'Standard' AS Type,
           ID AS prov_CanAcc_ID, ...
           NULL AS prov_USacc_ID, ...
           1 AS updated
    FROM CanAcc AS OF SCN 3652
    WHERE Type = 'US_dollar'
    UNION ALL
    SELECT ID, Owner, Balance, Type,
           NULL AS prov_CanAcc_ID, ...
           ID AS prov_USacc_ID, ...
           0 AS updated
    FROM USacc AS OF SCN 3652),
  ...

5. Execute Query
Execute query to retrieve provenance
Result columns: updated USacc tuples (ID, Owner, Balance, Type), provenance from CanAcc, provenance from USacc (P1 ... P6)
  (3, Alice Bright, 1,500,000, Premium) | from CanAcc: (3, Alice Bright, 1,500,000) | from USacc: NULL
  (5, Mark Smith, 50, Standard) | from CanAcc: (5, Mark Smith, 50) | from USacc: NULL

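
Steps 4 and 5 can be sketched end to end on a toy database. The query below is one possible completion of the provenance-rewritten reenactment query, not the one GProM generates: the second CTE (u2), the way the updated flag is propagated, the final filter on updated = 1, the threshold 1000000, the snapshot table names, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

THRESHOLD = 1000000  # assumed value; the slides elide the constant

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snapshot tables stand in for "AS OF SCN 3652" (SQLite has no time travel).
CREATE TABLE USacc_snap (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
CREATE TABLE CanAcc_snap (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
INSERT INTO USacc_snap VALUES (1, 'Peter Moss', 20, 'Standard');
INSERT INTO CanAcc_snap VALUES (3, 'Alice Bright', 1500000, 'US_dollar'),
                               (5, 'Mark Smith', 50, 'US_dollar');
""")

provenance_query = """
WITH u1 AS (
  -- Reenactment of the insert: inserted tuples carry CanAcc provenance.
  SELECT c.ID AS ID, c.Owner AS Owner, c.Balance AS Balance,
         'Standard' AS Type,
         c.ID AS prov_CanAcc_ID, c.Owner AS prov_CanAcc_Owner,
         c.Balance AS prov_CanAcc_Balance, c.Type AS prov_CanAcc_Type,
         NULL AS prov_USacc_ID, NULL AS prov_USacc_Owner,
         NULL AS prov_USacc_Balance, NULL AS prov_USacc_Type,
         1 AS updated
  FROM CanAcc_snap c
  WHERE c.Type = 'US_dollar'
  UNION ALL
  -- Pre-existing tuples carry their own USacc version as provenance.
  SELECT u.ID, u.Owner, u.Balance, u.Type,
         NULL, NULL, NULL, NULL,
         u.ID, u.Owner, u.Balance, u.Type,
         0
  FROM USacc_snap u
),
u2 AS (
  -- Reenactment of the update; provenance columns are passed through.
  SELECT ID, Owner, Balance,
         CASE WHEN Balance > :t THEN 'Premium' ELSE Type END AS Type,
         prov_CanAcc_ID, prov_CanAcc_Owner,
         prov_CanAcc_Balance, prov_CanAcc_Type,
         prov_USacc_ID, prov_USacc_Owner,
         prov_USacc_Balance, prov_USacc_Type,
         CASE WHEN Balance > :t THEN 1 ELSE updated END AS updated
  FROM u1
)
SELECT ID, Owner, Balance, Type,
       prov_CanAcc_ID, prov_CanAcc_Owner,
       prov_CanAcc_Balance, prov_CanAcc_Type,
       prov_USacc_ID, prov_USacc_Owner,
       prov_USacc_Balance, prov_USacc_Type
FROM u2
WHERE updated = 1
ORDER BY ID
"""

rows = conn.execute(provenance_query, {"t": THRESHOLD}).fetchall()
for row in rows:
    print(row[:4], "| CanAcc:", row[4:8], "| USacc:", row[8:12])
```

With this data only the two tuples inserted from CanAcc are reported, each with its CanAcc source tuple and NULLs in the USacc provenance columns; the untouched USacc tuple is filtered out by the updated flag.
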
Conclusions
We present our vision for GProM
– Database-independent middleware for computing provenance of queries, updates, and transactions
First solution for provenance of transactions
Query rewrite techniques on steroids:
– Provenance computation
– Transaction reenactment
– Provenance translation
– Provenance storage
– Optimization
Extensible through RSL

Future Work
Implementing additional provenance types
Comprehensive study of heuristic and cost-based optimizations
Design and implementation of RSL
Implementing additional provenance formats
Study reenactment for other concurrency control mechanisms
– Locking protocols (2PL)
Investigate additional use cases for reenactment
– Transaction backout
– Retroactive what-if analysis

Questions?
Homepage:
Bahareh:
Boris:
DBGroup:
GProM Project (partially funded by Oracle)
Perm

References
[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, pages 291–320. Springer.
[2] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373–396.
[3] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 38(3):19.
[4] J. Zhang and H. Jagadish. Lost source provenance. In EDBT, pages 311–322.
[5] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, pages 1151–1154.
[6] D. Gawlick and V. Radhakrishnan. Fine grain provenance using temporal databases. In TaPP.
[7] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 41(3):5–14.

Q-Bomb
One pattern that arises from reenactment is long chains of SELECT clauses using CASE
– Each level references attributes from the next level multiple times
– Subquery pull-up creates expressions of size exponential in the number of SELECT clauses
– In practice: optimization never finishes
Minimal example using a one-row table
  SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
  FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
        …
        FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
              FROM R))

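
The exponential growth can be reproduced by performing the pull-up by hand. The sketch below is an illustration only: the helper names nested_query and inlined_expression are hypothetical, the one-row table content is made up, and SQLite is used merely to confirm that the nested and the inlined form compute the same value for a small number of levels.

```python
import sqlite3


def nested_query(levels):
    """Chain of SELECT clauses, one CASE per level, as on the slide."""
    source = "R"
    for _ in range(levels):
        source = ("(SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b "
                  f"FROM {source})")
    return f"SELECT a, b FROM {source}"


def inlined_expression(levels):
    """Result of pulling up all subqueries: each level substitutes the
    expression of the level below for both occurrences of a."""
    expr = "a"
    for _ in range(levels):
        expr = f"CASE WHEN b < 100 THEN {expr} ELSE ({expr}) + 2 END"
    return expr


# Number of CASE constructs after inlining 1..5 levels: 2^n - 1.
sizes = [inlined_expression(n).count("CASE") for n in range(1, 6)]
print(sizes)

# Both forms agree on a one-row table (hypothetical content).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (a INTEGER, b INTEGER)")
conn.execute("INSERT INTO R VALUES (1, 200)")
nested_result = conn.execute(nested_query(5)).fetchall()
inlined_result = conn.execute(
    f"SELECT {inlined_expression(5)} AS a, b FROM R").fetchall()
print(nested_result, inlined_result)
```

Each additional SELECT clause doubles the size of the inlined expression, so a chain of a few dozen levels already yields an expression far too large for an optimizer to process.
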
Example Provenance Computation
Example – Update Reenactment
Example – Transaction Reenactment
Rewrite Reenactment Query
Execute Rewritten Query

Types of Update Operations - Insert
Insert executed at time t
Updated version of R contains
1. All tuples from previous version
2. All newly inserted tuples
– Fixed tuple defined in VALUES clause
– Results of query over database version at t
Union these two sets
  INSERT INTO R VALUES (v1, ..., vn);
  Reenactment: (SELECT * FROM R AS OF t) UNION ALL (SELECT v1 AS a1, ..., vn AS an);
  INSERT INTO R (q);
  Reenactment: (SELECT * FROM R AS OF t) UNION ALL (q(t));

Types of Update Operations - Delete
Delete executed at time t
Tuples in updated version of R:
– All tuples from the previous version for which the condition is not fulfilled
  DELETE FROM R WHERE C;
  Reenactment: SELECT * FROM R AS OF t WHERE (C) IS NOT TRUE;

Types of Update Operations - Update
Update executed at time t
Find tuples where the condition holds and update the attribute values
Find tuples where the condition does not hold
Union these two sets
  UPDATE R SET A WHERE C;
  Reenactment: (SELECT A' FROM R AS OF t WHERE C) UNION ALL (SELECT * FROM R AS OF t WHERE (C) IS NOT TRUE)

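
The three translation patterns (insert, delete, update) can be compared against the real statements on a small table. This is a minimal sketch with assumptions: SQLite has no AS OF clause, so a copy of the table named R_at_t (a hypothetical name) plays the role of "R AS OF t", and the concrete statements and rows are invented for the test.

```python
import sqlite3


def fresh_db():
    """Table R plus a frozen copy standing in for 'R AS OF t'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE R (a INTEGER, b INTEGER);
    INSERT INTO R VALUES (1, 10), (2, NULL), (3, 30);
    CREATE TABLE R_at_t AS SELECT * FROM R;
    """)
    return conn


# statement -> reenactment query following the patterns on the slides
CASES = {
    "insert": (
        "INSERT INTO R VALUES (4, 40)",
        "SELECT * FROM R_at_t UNION ALL SELECT 4 AS a, 40 AS b",
    ),
    "delete": (
        "DELETE FROM R WHERE b > 15",
        "SELECT * FROM R_at_t WHERE (b > 15) IS NOT TRUE",
    ),
    "update": (
        "UPDATE R SET b = b + 1 WHERE b > 15",
        "SELECT a, b + 1 AS b FROM R_at_t WHERE b > 15 "
        "UNION ALL "
        "SELECT * FROM R_at_t WHERE (b > 15) IS NOT TRUE",
    ),
}


def check(kind):
    """Return (table after the real statement, reenactment result)."""
    statement, reenactment = CASES[kind]
    conn = fresh_db()
    conn.execute(statement)
    by_key = lambda row: row[0]  # a is unique in the sample data
    actual = sorted(conn.execute("SELECT * FROM R").fetchall(), key=by_key)
    reenacted = sorted(conn.execute(reenactment).fetchall(), key=by_key)
    return actual, reenacted


results = {kind: check(kind) for kind in CASES}
for kind, (actual, reenacted) in results.items():
    print(kind, actual == reenacted, reenacted)
```

The row with a NULL value for b is kept by the delete and left unchanged by the update, which is the behaviour the IS NOT TRUE condition is there to reproduce.
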
READ COMMITTED
A statement of a transaction T sees changes committed by concurrent transactions
For a given update we need to combine
– tuples produced by previous statements of the same transaction
– tuples produced by transactions that committed before the update
Observations
– Once a transaction T modifies a tuple t, no other transaction can access t until T commits
– Let u_i be the update of T, executed at time x, that first modifies t
– u_i will read the latest version of t committed before x
– If we know u_i, then updates of T before x do not have to look at t
Consider the database version one time unit before the commit of T (C-1)
– This contains all the tuple versions seen by the first update of T updating each individual tuple
– Let t be a tuple version in this database version with start time y
– We know that updates of T that executed before y cannot have updated t
– We can use version C-1 as input for reenactment as long as we hide a tuple version t with start time y from the reenactment of any update executed at x with x < y

READ COMMITTED
  WITH u1 AS (
    SELECT CASE WHEN Balance <= AND version <= 0 THEN 'Standard' ELSE Type END AS Type,
           ID, Owner, Balance,
           CASE WHEN Balance <= AND version <= 0 THEN -1 ELSE version END AS version
    FROM USacc AS OF SCN 3652),
  u2 AS (
    SELECT CASE WHEN Balance > AND version <= 1 THEN 'Premium' ELSE Type END AS Type,
           ID, Owner, Balance,
           CASE WHEN Balance > AND version <= 1 THEN -1 ELSE version END AS version
    FROM u1)
  SELECT ID, Owner, Balance, Type FROM u2 WHERE version = -1;

Database Independence
Encapsulate database-specific functionality in pluggable modules
What needs to be adapted:
1) Parser
2) SQL code generator
3) Metadata access
4) Audit log access
5) Time travel activation

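
One way to picture the five pluggable modules is as a single backend interface that the middleware programs against. The sketch below is purely illustrative: the Backend protocol, its method names, and the ToyBackend class are hypothetical and do not describe GProM's actual plugin API. Only the "AS OF SCN" syntax is taken from the slides.

```python
from typing import Dict, List, Protocol, runtime_checkable


@runtime_checkable
class Backend(Protocol):
    """Hypothetical interface bundling the five database-specific parts."""

    def parse(self, sql: str) -> object: ...                 # 1) parser
    def generate_sql(self, plan: object) -> str: ...         # 2) SQL code generator
    def table_columns(self, table: str) -> List[str]: ...    # 3) metadata access
    def audit_log_statements(self, xid: str) -> List[str]: ...  # 4) audit log access
    def time_travel(self, table: str, version: int) -> str: ...  # 5) time travel


class ToyBackend:
    """Placeholder backend backed by in-memory dictionaries."""

    def __init__(self, schema: Dict[str, List[str]],
                 log: Dict[str, List[str]]) -> None:
        self.schema = schema
        self.log = log

    def parse(self, sql: str) -> object:
        return sql.strip()          # a real backend would build a query tree

    def generate_sql(self, plan: object) -> str:
        return str(plan)            # a real backend would serialize the tree

    def table_columns(self, table: str) -> List[str]:
        return self.schema[table]

    def audit_log_statements(self, xid: str) -> List[str]:
        return self.log.get(xid, [])

    def time_travel(self, table: str, version: int) -> str:
        # Time travel syntax as it appears on the slides.
        return f"{table} AS OF SCN {version}"


backend = ToyBackend(
    schema={"USacc": ["ID", "Owner", "Balance", "Type"]},
    log={"T1": ["INSERT INTO USacc ...", "UPDATE USacc ..."]},
)
print(isinstance(backend, Backend))
print(backend.time_travel("USacc", 3652))
```

Keeping all database-specific behaviour behind one interface means the rewrite logic itself never needs to know which DBMS it is targeting.
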
Accessing Several Tables
Transactions accessing several tables
– We require the user to specify which table she is interested in
– Replace access to the table with the query for the last update that modified the table