An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew.


1 An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew Tan, Gaurav Vijayvargiya

2 Annotation Management System
A system that propagates meta-data along with the data as the data is moved around.
Main motivation: to trace the provenance and flow of data.
Many other uses.
[Figure: annotated values a1, a2, a3 and b1, b2, b3 flow through a chain of transformations. A transformation step is a query, an ETL rule, etc.]

3 Our Vision
All Restaurants
  Restaurant          Cost  Type
  Peacock Alley       $$$   French
  Bull & Bear         $$$   Seafood
  Pacifica            $     Chinese
  Soho Kitchen & Bar  $     American
Cheap Restaurants
  Restaurant          Cost  Type
  Pacifica            $     Chinese
  Soho Kitchen & Bar  $     American
NYRestaurants
  Restaurant          Cost  Type      Zip
  Peacock Alley       $$$   French    10022
  Bull & Bear         $$$   Seafood   10022
  Pacifica            $     Chinese   10013
  Soho Kitchen & Bar  $     American  10022
Annotations shown on the slide: "Yummy chicken curry!!", "Serves fine French Cuisine in elegant setting. Formal attire.", "Extensive wine list!"

4 Other Applications
Keep information that cannot otherwise be stored in the current database design.
Highlight wrong data: erroneous data may be copied around, but the comment that it is wrong goes along with it.
Security and quality metrics: annotate security or quality levels of data items.

5 Some Related Work
The idea is not new, though propagation of annotations was never explicitly stated as provenance-based:
  Wang & Madnick [VLDB 90]; Lee, Bressan & Madnick [WIDM 98]; Bernstein & Bergstraesser [IEEE Data Eng. 99]
Superimposed Information: Maier and Delcambre [WebDB 99]
Annotations of Web documents
Annotations on genomic sequences
Why-Provenance: Cui, Widom & Wiener [CWW00]

6 Outline
pSQL queries
Semantics
  CUSTOM propagation scheme
  DEFAULT propagation scheme
  DEFAULT-ALL propagation scheme
Implementation
  System architecture
  Experimental results

7 pSQL – an extension of SQL
A pSQL fragment:
  SELECT DISTINCT selectlist
  FROM fromlist
  WHERE wherelist
  PROPAGATE DEFAULT | DEFAULT-ALL | r1.A1 TO B1, …, rn.An TO Bn
A pSQL query is a union of pSQL fragments.

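8 The CUSTOM Scheme
Propagate annotations according to user specification.
  SELECT DISTINCT A FROM R r PROPAGATE r.A TO A
  UNION
  SELECT DISTINCT A FROM R r PROPAGATE r.B TO A
R(A, B) = {(1, 2), (2, 3), (3, 5)}, with annotations a, b, c, h attached to its values.
Result 1 (first fragment): values 1, 2, 3 carrying annotations a, b.
Result 2 (second fragment): values 1, 2, 3 carrying annotations c, h.
Result (annotation union of the two): values 1, 2, 3 carrying annotations a, b, c, h.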
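The CUSTOM scheme can be illustrated with a small in-memory simulation. This is only a sketch: the transcript does not preserve which cell each of a, b, c, h sits on, so the placement below is an assumption, and `select_distinct` and `annotated_union` are hypothetical helper names rather than part of the system.

```python
# Annotated relation R(A, B): each cell is (value, set_of_annotations).
# ASSUMPTION: the exact cells holding a, b, c, h are chosen for illustration.
R = [
    {"A": (1, {"a"}), "B": (2, {"c"})},
    {"A": (2, {"b"}), "B": (3, set())},
    {"A": (3, set()), "B": (5, {"h"})},
]


def select_distinct(rel, value_col, propagate_from):
    """Simulate SELECT DISTINCT value_col ... PROPAGATE <src> TO value_col.

    Returns {output value: annotations}, collecting annotations from every
    source column named in propagate_from.
    """
    result = {}
    for row in rel:
        value = row[value_col][0]
        anns = result.setdefault(value, set())
        for src in propagate_from:
            anns |= row[src][1]
    return result


def annotated_union(*results):
    """Union of annotated results: annotations of equal values are merged."""
    merged = {}
    for res in results:
        for value, anns in res.items():
            merged.setdefault(value, set()).update(anns)
    return merged


result1 = select_distinct(R, "A", ["A"])  # PROPAGATE r.A TO A
result2 = select_distinct(R, "A", ["B"])  # PROPAGATE r.B TO A
final = annotated_union(result1, result2)
```

With this placement, the first fragment carries only a and b, the second only c and h, and the union carries all four, which agrees with the three result tables on the slide.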

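9 The DEFAULT Scheme
Propagate annotations according to where data is copied from.
  SELECT DISTINCT B FROM R r PROPAGATE DEFAULT
  UNION
  SELECT DISTINCT B FROM S s PROPAGATE DEFAULT
Under DEFAULT, the two fragments propagate r.B TO B and s.B TO B, respectively.
R(A, B) = {(1, 2), (2, 3), (3, 5)}, annotated with a, b, c, h.
S(A, B) = {(4, 2), (5, 3), (6, 4)}, annotated with d, e, f, g.
Result: B values 2, 3, 4, 5 carrying annotations c, e, f, h.
This is a natural semantics for tracing the provenance of data.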

10 Annotation Propagation under the DEFAULT Scheme
Q1:
  SELECT DISTINCT r.A, r.B, s.C FROM R r, S s WHERE r.B = s.B PROPAGATE DEFAULT
versus
Q2:
  SELECT DISTINCT * FROM R NATURAL JOIN S PROPAGATE DEFAULT
R(A, B) = {(1, 2)} with annotation a; S(B, C) = {(2, 3)} with annotation b.
Ans 1 = (1, 2, 3) carrying annotation a.
Ans 2 = (1, 2, 3) carrying annotations a and b.
Q1 and Q2 are equivalent queries, but they produce different annotated output.
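The difference between the two equivalent queries can be reproduced with a small simulation. This is a sketch under assumptions: annotation a is taken to sit on R's B value and b on S's B value, and `join_default` is a hypothetical helper that models only where the output column B is copied from.

```python
# Cells are (value, set_of_annotations).
# ASSUMPTION: a is on R.B and b is on S.B.
R = [{"A": (1, set()), "B": (2, {"a"})}]
S = [{"B": (2, {"b"}), "C": (3, set())}]


def join_default(r_rel, s_rel, b_sources):
    """Join R and S on B under DEFAULT propagation.

    b_sources names the relations ("r", "s") that the output column B is
    copied from; A always comes from R and C from S.
    """
    out = {}
    for r in r_rel:
        for s in s_rel:
            if r["B"][0] != s["B"][0]:
                continue
            key = (r["A"][0], r["B"][0], s["C"][0])
            anns = out.setdefault(key, {"A": set(), "B": set(), "C": set()})
            anns["A"] |= r["A"][1]
            anns["C"] |= s["C"][1]
            if "r" in b_sources:
                anns["B"] |= r["B"][1]
            if "s" in b_sources:
                anns["B"] |= s["B"][1]
    return out


ans1 = join_default(R, S, {"r"})       # Q1 selects r.B explicitly
ans2 = join_default(R, S, {"r", "s"})  # Q2: NATURAL JOIN copies B from both
```

Both calls return the same tuple (1, 2, 3), but only the second attaches b to the B value, which is the kind of discrepancy that motivates a scheme that is invariant under equivalent formulations.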

11 The DEFAULT-ALL Scheme
Propagate annotations according to where data is copied from, according to all equivalent formulations of the given query.
User query Q:
  SELECT DISTINCT r.A, s.B, s.C FROM R r, S s WHERE r.B = s.B PROPAGATE DEFAULT-ALL
Computing the results of Q on a database D, idea:
  E(Q) denotes the set of all queries that are equivalent to Q (more precisely, to the SQL query corresponding to Q).
  Execute each query in E(Q) on the database D under the DEFAULT scheme, then combine the results under annotation union (∪a).

12 Computing the results of a DEFAULT-ALL query
Question: Given a pSQL query Q with the DEFAULT-ALL propagation scheme and a database D, can we compute the result of Q(D)?
Problem: There are infinitely many queries in E(Q), so it is impossible to execute every query in E(Q) to obtain the result of Q(D).
Solution: Compute a finite basis of E(Q) first.

13 A Query Basis of Q
A query basis of Q, denoted B(Q), is a finite set of pSQL queries (with default propagation scheme) such that:
  $\bigcup^{a}_{q \in B(Q)} q(D) \;=_{a}\; \bigcup^{a}_{q \in E(Q)} q(D)$
Given B(Q), we can execute each query in B(Q) and combine the results to obtain the result of Q(D).
Question: Given Q, does B(Q) always exist, and how can we compute B(Q)?

14 Generating a Query Basis of Q
Given R(A,B) and S(B,C).
User query Q:
  SELECT DISTINCT r.A, s.B, s.C FROM R r, S s WHERE r.B = s.B PROPAGATE DEFAULT-ALL
Representative query Q0:
  Ans(x,y,z) :- R(x,y), S(y,z).
  SELECT DISTINCT r.A, s.B, s.C FROM R r, S s WHERE r.B = s.B PROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B
  Propagations under the default propagation scheme: r.A TO A, s.B TO B, s.C TO C
  Additional propagation due to the equality r.B = s.B: r.B TO B
The representative query propagates annotations according to where data is copied from or equivalently copied from.
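The PROPAGATE clause of the representative query can be derived mechanically from the select list and the equalities in the WHERE clause. The sketch below is one way to do that with a union-find over column names; `representative_propagate` is a hypothetical helper, and it covers only the representative query Q0, not the auxiliary queries of the basis.

```python
def representative_propagate(select_list, equalities):
    """Build the PROPAGATE clause of a representative query.

    select_list: list of (source column, output attribute), e.g. ("r.A", "A").
    equalities:  list of (column, column) pairs from the WHERE clause.
    """
    parent = {}

    def find(col):
        # Union-find lookup with path halving.
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]
            col = parent[col]
        return col

    for left, right in equalities:
        parent[find(left)] = find(right)

    # Propagations of the DEFAULT scheme: each selected column to its output.
    clause = list(select_list)
    # Additional propagations: every column equated with a selected column.
    for src, out in select_list:
        for col in sorted(parent):
            if col != src and find(col) == find(src) and (col, out) not in clause:
                clause.append((col, out))
    return ", ".join(f"{src} TO {out}" for src, out in clause)


clause = representative_propagate(
    [("r.A", "A"), ("s.B", "B"), ("s.C", "C")],
    [("r.B", "s.B")],
)
```

For the slide's query this yields `r.A TO A, s.B TO B, s.C TO C, r.B TO B`, the same clause shown for Q0. With an empty list of equalities the function reduces to the DEFAULT scheme.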

15 Generating a Query Basis of Q
Auxiliary queries:
Q1:
  Ans(x,y,z) :- R(x,y), S(y,z), R(x,w).
  SELECT DISTINCT r.A, s.B, s.C FROM R r, S s, R r' WHERE r.B = s.B AND r.A = r'.A PROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, r'.A TO A
Q2:
  Ans(x,y,z) :- R(x,y), S(y,z), S(w,z).
  SELECT DISTINCT r.A, s.B, s.C FROM R r, S s, S s' WHERE r.B = s.B AND s.C = s'.C PROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, s'.C TO C

16 Generating a Query Basis of Q
Auxiliary queries:
Q3:
  Ans(x,y,z) :- R(x,y), S(y,z), R(w,y).
  SELECT DISTINCT r.A, s.B, s.C FROM R r, S s, R r' WHERE r.B = s.B AND r.B = r'.B PROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, r'.B TO B
Q4:
  Ans(x,y,z) :- R(x,y), S(y,z), S(y,w).
  SELECT DISTINCT r.A, s.B, s.C FROM R r, S s, S s' WHERE r.B = s.B AND s.B = s'.B PROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, s'.B TO B

17 Correctness of the Algorithm
For the example, a query basis of Q consists of Q0, Q1, Q2, Q3, and Q4.
Theorem: Given a pSQL query Q with the DEFAULT-ALL propagation scheme, the algorithm generates a query basis of Q.
Proof idea:
  Every query in B(Q) is an equivalent query of Q.
  Every equivalent query of Q is annotation-contained in $\bigcup^{a}_{q \in B(Q)} q(D)$.

18 Outline
pSQL queries
Semantics
  CUSTOM propagation scheme
  DEFAULT propagation scheme
  DEFAULT-ALL propagation scheme
Implementation
  System architecture
  Experimental results

19 System Architecture
Translator module
  Input: a pSQL query Q
  Output: an SQL query Q' written against the naïve storage scheme
  Q' is sent to the RDBMS and executed.
Postprocessor module
  Input: sorted tuples (returned by the RDBMS)
  Output: an annotated set of tuples
  Annotations for the same output location are collected together.
  Duplicate tuples are removed.
Data flow: USER -> pSQL query -> Translator -> SQL query -> RDBMS -> sorted tuples -> Postprocessor -> final result -> USER
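The postprocessing step described above could look roughly like the following. This is a sketch, not the system's actual module: it assumes each returned row interleaves a value column with one annotation column (value, annotation, value, annotation, ...), and `postprocess` is a hypothetical name.

```python
from itertools import groupby


def postprocess(sorted_rows, n_cols):
    """Merge sorted rows into annotated tuples.

    Each row is laid out as (v1, ann1, v2, ann2, ...), where an annotation of
    None means "no annotation". Rows must already be sorted by their values,
    so rows describing the same output tuple are adjacent.
    """

    def values(row):
        return tuple(row[2 * i] for i in range(n_cols))

    result = []
    for vals, group in groupby(sorted_rows, key=values):
        anns = [set() for _ in range(n_cols)]
        for row in group:
            for i in range(n_cols):
                ann = row[2 * i + 1]
                if ann is not None:
                    anns[i].add(ann)  # collect per output location
        result.append((vals, anns))   # one entry per distinct tuple
    return result


rows = [(1, "a", 2, "c"), (1, "d", 2, None), (3, "b", 4, None)]
annotated = postprocess(rows, n_cols=2)
```

Because the input is sorted, a single pass with `groupby` both removes duplicate tuples and gathers the annotations that belong to the same output location.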

20 A Naïve Storage Scheme
For every attribute of every relation there is an additional attribute for storing the annotations.
Conceivably, there are other possible storage schemes.
Annotated relation R(A, B): tuples (1, 2) and (3, 4), with annotations a, b, c, d.
Stored relation R:
  A  A'  B  B'
  1  a   2  c
  1  d   2  -
  3  b   4  -
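The stored layout shown on this slide can be produced from an annotated tuple as follows. This is a sketch: `to_naive_rows` is a hypothetical helper, sorting the annotations is only for deterministic output, and emitting one all-NULL row for a tuple with no annotations is an assumption the slide does not cover.

```python
from itertools import zip_longest


def to_naive_rows(values, annotations):
    """Lay out one annotated tuple under the naive storage scheme.

    values:      attribute values, e.g. (1, 2)
    annotations: one set of annotations per attribute, e.g. [{"a", "d"}, {"c"}]
    Returns rows of the form (v1, ann1, v2, ann2, ...); shorter annotation
    lists are padded with None (NULL).
    """
    columns = [sorted(anns) for anns in annotations]
    rows = []
    for anns in zip_longest(*columns, fillvalue=None):
        row = []
        for value, ann in zip(values, anns):
            row.extend([value, ann])
        rows.append(tuple(row))
    if not rows:
        # ASSUMPTION: an unannotated tuple is stored once with NULL annotations.
        row = []
        for value in values:
            row.extend([value, None])
        rows.append(tuple(row))
    return rows


stored = to_naive_rows((1, 2), [{"a", "d"}, {"c"}]) + to_naive_rows((3, 4), [{"b"}, set()])
```

A tuple is repeated as many times as the largest number of annotations on any one of its attributes, which is why (1, 2) occupies two stored rows in the example.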

21 The Translator Module
Translation pipeline:
  pSQL query (DEFAULT-ALL scheme) -> Generate a Query Basis -> set of pSQL queries with custom scheme
  pSQL query (DEFAULT scheme) -> Translate default pSQL to custom pSQL -> pSQL query (custom scheme)
  pSQL query (custom scheme) -> Translate custom pSQL to SQL -> SQL query
Example:
  default pSQL query: SELECT DISTINCT r.A AS A, r.B AS B FROM R r PROPAGATE DEFAULT
  custom pSQL query:  SELECT DISTINCT r.A AS A, r.B AS B FROM R r PROPAGATE r.A TO A, r.B TO B

22 Experiments
Goals:
  Compare the performance of pSQL queries under different propagation schemes (DEFAULT, DEFAULT-ALL, or no propagation scheme).
  Compare the performance of pSQL queries when the number of annotations in a database is varied.

23 Experimental setup
Implemented on top of Oracle 9i.
Datasets:
  100MB, 500MB, and 1GB TPC-H databases
  Unannotated database on the original schema
  30%, 60%, and 100% annotations on the naïve schema
  Buffer size: 256MB
Test queries:
  SPJ queries
  Varied the number of joins (0 to 4 joins)
  Varied the number of selected attributes (1, 3, or 5 attributes)

24 100MB dataset, 100% annotated
[Chart] Qi(j) denotes a query with i joins and j output attributes.

25 500MB dataset, 100% annotated
[Chart] Qi(j) denotes a query with i joins and j output attributes.

26 1GB dataset, 100% annotated
[Chart] Qi(j) denotes a query with i joins and j output attributes.

27 100MB dataset annotated in various degrees
[Chart] Qi(j) denotes a query with i joins and j output attributes.

28 Contributions
An annotation management system for carrying annotations along as data is being transformed, based on provenance.
pSQL query language for propagating annotations:
  CUSTOM: user defined
  DEFAULT: where was the data copied from?
  DEFAULT-ALL: invariant under equivalent queries
Generate-Query-Basis algorithm
An initial implementation

29 Future work
Performance of our annotation management system on other storage schemes
pSQL extensions:
  Aggregates
  Bag queries

30 END

31 The CUSTOM Scheme - Example
  SELECT DISTINCT B FROM R r PROPAGATE r.A TO B, r.B TO B
R(A, B) = {(1, 2), (2, 3), (3, 5)}, with annotations a, b, c, h.
Result: B values 2, 3, 5 carrying annotations a, b, c, h.

32 Terminology
A location is a triple (R, t, A).
Example: for R(A, B) = {(1, 2)} with annotation a on the B value, the annotation a is attached to the location (R, (1,2), B).
Definition: A query Q1 is annotation-contained in a query Q2 if:
  Q1 ⊆ Q2, and
  for every database D, the set of annotations attached to every output location in Q1(D) is a subset of the set of annotations associated with the same location in the output of Q2(D).

33 In a More Concise Notation
Ans(x,y,z) :- R(x,y), S(y',z), y = y'.   { x → 1, y → 2, y' → 2, z → 3 }
Ans(x,y,z) :- R(x,y), S(y,z).            { x → 1, y → 2, z → 3 }
Annotations of values that reside in different source locations but are bound to the same variable are unioned together.
Ans(y) :- R(x,y).
Ans(y) :- S(y,z).
Ans(2), carrying annotations a and b.
Annotations that belong to the same output location are unioned together.

34 Containment vs. annotation-containment
R(A, B, C) = {(1, 2, 3), (1, 4, 5), (1, 8, 4), (8, 9, 5)}, with annotations a, b, c, d.
Q1: Ans(x,v) :- R(x,y,u), R(x,z,v), R(t,w,z).
Q2: Ans(x,v) :- R(p,q,v), R(x,z,v), R(t,w,z).
Ans 1 = (1, 5) and Ans 2 = (1, 5), carrying different annotations.
Q1 ≡ Q2, but Q1 ⊈a Q2 and Q2 ⊈a Q1.
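The gap between containment and annotation-containment can be checked on this database with a brute-force evaluator for conjunctive queries. This is a sketch under assumptions: the cells carrying a, b, c, d are a guess (a on the A value of (1,2,3), b on the A value of (1,4,5), c and d on the C values of (1,4,5) and (8,9,5)), and `evaluate` and `annotation_contained` are hypothetical helpers. A check on a single database can refute annotation-containment but cannot prove it.

```python
from itertools import product

# R(A, B, C); each cell is (value, set_of_annotations).
# ASSUMPTION: the placement of a, b, c, d is chosen for illustration.
R = [
    ((1, {"a"}), (2, set()), (3, set())),
    ((1, {"b"}), (4, set()), (5, {"c"})),
    ((1, set()), (8, set()), (4, set())),
    ((8, set()), (9, set()), (5, {"d"})),
]


def evaluate(head, body, rel):
    """Evaluate Ans(head) :- body over rel, propagating annotations.

    body is a list of atoms, each a tuple of variable names. Annotations of
    all cells bound to the same variable are unioned, and annotations landing
    on the same output location are unioned as well.
    """
    out = {}
    for rows in product(rel, repeat=len(body)):
        binding, anns, consistent = {}, {}, True
        for atom, row in zip(body, rows):
            for var, (value, cell_anns) in zip(atom, row):
                if binding.setdefault(var, value) != value:
                    consistent = False
                anns.setdefault(var, set()).update(cell_anns)
        if not consistent:
            continue
        key = tuple(binding[v] for v in head)
        slots = out.setdefault(key, [set() for _ in head])
        for i, var in enumerate(head):
            slots[i] |= anns[var]
    return out


def annotation_contained(res1, res2):
    """True if every tuple of res1 is in res2 with a superset of annotations."""
    return all(
        key in res2 and all(a1 <= a2 for a1, a2 in zip(anns, res2[key]))
        for key, anns in res1.items()
    )


ans1 = evaluate(("x", "v"), [("x", "y", "u"), ("x", "z", "v"), ("t", "w", "z")], R)
ans2 = evaluate(("x", "v"), [("p", "q", "v"), ("x", "z", "v"), ("t", "w", "z")], R)
```

Both queries return the single tuple (1, 5), but Q1 gathers extra annotations on the first column and Q2 on the second, so neither result is annotation-contained in the other on this database.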

35 Translating a CUSTOM pSQL to SQL
Custom pSQL query:
  SELECT DISTINCT r.A, s.B, s.C FROM R r, S s WHERE r.B = s.B PROPAGATE s.B TO B, s.C TO C, r.B TO B
SQL query (primed columns hold the annotations under the naïve storage scheme):
  Q1: SELECT r.A, NULL, s.B, s.B', s.C, s.C' FROM R r, S s WHERE r.B = s.B
  Q2: SELECT r.A, NULL, s.B, r.B', s.C, NULL FROM R r, S s WHERE r.B = s.B
  SELECT DISTINCT * FROM ( Q1 UNION Q2 ) t ORDER BY t.A, t.B, t.C
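The translated SQL can be tried out on a small database. The sketch below uses SQLite through Python's `sqlite3` module rather than Oracle, and the annotation column names `A_ann`, `B_ann`, `C_ann` as well as the sample data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Naive storage scheme: every attribute has a companion annotation column.
# ASSUMPTION: the *_ann column names and the sample rows are illustrative.
conn.executescript("""
    CREATE TABLE R (A INTEGER, A_ann TEXT, B INTEGER, B_ann TEXT);
    CREATE TABLE S (B INTEGER, B_ann TEXT, C INTEGER, C_ann TEXT);
    INSERT INTO R VALUES (1, NULL, 2, 'a');
    INSERT INTO S VALUES (2, 'b', 3, NULL);
""")

# Translation of: PROPAGATE s.B TO B, s.C TO C, r.B TO B
sql = """
SELECT DISTINCT * FROM (
    -- Q1: annotations copied from s.B and s.C; nothing is propagated to A
    SELECT r.A AS A, NULL AS A_ann, s.B AS B, s.B_ann AS B_ann,
           s.C AS C, s.C_ann AS C_ann
    FROM R r, S s WHERE r.B = s.B
    UNION
    -- Q2: the additional propagation r.B TO B
    SELECT r.A, NULL, s.B, r.B_ann, s.C, NULL
    FROM R r, S s WHERE r.B = s.B
) t
ORDER BY t.A, t.B, t.C
"""
rows = conn.execute(sql).fetchall()

# Annotations that reach output column B of the tuple (1, 2, 3).
b_annotations = {row[3] for row in rows if row[3] is not None}
conn.close()
```

The query returns two rows for the same output tuple, one per annotation source; sorting by the value columns keeps them adjacent so a postprocessing pass can merge them into a single annotated tuple.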

