Computing Provenance and Annotations of Derived Data Wang-Chiew Tan UC Santa Cruz.


1 Computing Provenance and Annotations of Derived Data Wang-Chiew Tan UC Santa Cruz

2 Provenance of data
When you see some data on the Web, do you know
–where it came from?
–why it is there?
This information (provenance) is typically lost in the process of copying, transcribing, or transforming databases.
Loss of provenance is an acute problem in some scientific databases.

3 Complex interdependencies (example from scientific databases)
[Figure: flow of data among GenBank, Swissprot, TRRD, GERD, Transfac, EpoDB, EMBL, DDBJ, BEAD, and GAIA]
Various problems:
–Trace provenance of data
–Propagate annotations

4 Two kinds of provenance
NYRestaurants (source table)
Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022
NYHotels (source table)
Hotel            Zip    Rating
Waldorf Astoria  10022  4.5
Holiday Inn DT   10013  4.0
View (JOIN, PROJECT over the two source tables)
Columns: Hotel, Restaurant, Cost. The view lists the restaurants Peacock Alley, Bull & Bear, Pacifica, and Soho Kitchen & Bar against the hotels Waldorf Astoria and Holiday Inn DT.
Two questions about a value in the view:
–Why? (Why-provenance)
–Where? (Where-provenance)
[Figure callouts highlight the Rating values 4.5 and 4.0.]
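The two kinds of provenance can be made concrete with a small sketch. It assumes the view joins the two source tables on Zip and projects Hotel, Restaurant, and Cost (the slide only says "JOIN, PROJECT"). The function name and the representation of witnesses and locations are illustrative choices, not taken from the talk.

```python
from itertools import product

# Source tables from the slide, as lists of dicts.
NY_RESTAURANTS = [
    {"Restaurant": "Peacock Alley", "Cost": "$$$", "Type": "French", "Zip": "10022"},
    {"Restaurant": "Bull & Bear", "Cost": "$$$", "Type": "Seafood", "Zip": "10022"},
    {"Restaurant": "Pacifica", "Cost": "$", "Type": "Chinese", "Zip": "10013"},
    {"Restaurant": "Soho Kitchen & Bar", "Cost": "$", "Type": "American", "Zip": "10022"},
]
NY_HOTELS = [
    {"Hotel": "Waldorf Astoria", "Zip": "10022", "Rating": "4.5"},
    {"Hotel": "Holiday Inn DT", "Zip": "10013", "Rating": "4.0"},
]


def view_with_provenance(hotels, restaurants):
    """Join on Zip, project (Hotel, Restaurant, Cost), and record provenance.

    For every view tuple the result holds:
      'why'   : a list of witnesses, each a set of (table, row_index) pairs
                that together produce the tuple;
      'where' : for each view attribute, the set of source locations
                (table, row_index, attribute) the value was copied from.
    """
    view = {}
    for (hi, h), (ri, r) in product(enumerate(hotels), enumerate(restaurants)):
        if h["Zip"] != r["Zip"]:
            continue  # the pair does not satisfy the join condition
        key = (h["Hotel"], r["Restaurant"], r["Cost"])
        entry = view.setdefault(key, {"why": [], "where": {}})
        entry["why"].append(frozenset({("NYHotels", hi), ("NYRestaurants", ri)}))
        for attr, table, idx in (
            ("Hotel", "NYHotels", hi),
            ("Restaurant", "NYRestaurants", ri),
            ("Cost", "NYRestaurants", ri),
        ):
            entry["where"].setdefault(attr, set()).add((table, idx, attr))
    return view


VIEW = view_with_provenance(NY_HOTELS, NY_RESTAURANTS)
```

Looking up `VIEW[("Waldorf Astoria", "Peacock Alley", "$$$")]` gives the witness (why) and the per-attribute source locations (where) for that view tuple.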

5 SDSS - Sloan Digital Sky Survey (SkyServer)
SELECT Specobj.z, photoobj.g, photoobj.r
FROM Specobj, photoobj
WHERE Specobj.objid = photoobj.objid
  AND Specobj.specclass = 3
  AND Specobj.zconf > .95

6 Compute provenance
Question: Suppose a database is created by a query. Can we compute the why- and where-provenance of an element?
Answer: Computing provenance (both why and where) is NP-hard in general.

7 Annotations
Annotations add value to data
–knowledge sharing: annotations can be read and reviewed by independent parties
Annotations are loosely structured
–annotations on data at various levels of granularity, annotations on annotations
Source data:
–proprietary
–fixed schema
A system that overlays annotations on existing data is a useful tool for scientific databases.
Annotations should spread back to the source and forward to other databases.

8 Propagating annotations
NYRestaurants (source table)
Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022
All Restaurants (View 1)
Restaurant          Cost  Type
Peacock Alley       $$$   French
Bull & Bear         $$$   Seafood
Pacifica            $     Chinese
Soho Kitchen & Bar  $     American
Cheap Restaurants (View 2)
Restaurant          Cost  Type
Pacifica            $     Chinese
Soho Kitchen & Bar  $     American
Example annotations shown in the figure:
–"Yummy chicken curry!!"
–"Serves fine French Cuisine in elegant setting. Jackets required."
–"Extensive wine list!"

9 Location and Propagation Rules
A location is a triple (R, t, A), where R is a relation name, t is a tuple in R, and A is an attribute in the schema of R.
Propagation rules (each illustrated in the original slide by a diagram over relations R, R1, R2 with attributes A1, A2, A3):
–Select
–Project
–Join
–Union
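The diagrams for the four rules did not survive in this transcript, so the following is only one plausible forward-propagation scheme, written under stated assumptions: an annotation lives on a cell (a location), select keeps cells unchanged, project and union merge annotations of tuples that collapse into one, and join merges annotations on the shared attributes. The helper names (`cell`, `select`, `project`, `join`, `union`) are hypothetical.

```python
def cell(value, *notes):
    """An annotated cell: a value plus the set of annotations attached to it."""
    return (value, frozenset(notes))


def select(rows, pred):
    """Keep rows whose plain values satisfy pred; annotations travel unchanged."""
    return [row for row in rows if pred({a: v for a, (v, _) in row.items()})]


def project(rows, attrs):
    """Keep only attrs; rows that become equal are merged, unioning annotations."""
    merged = {}
    for row in rows:
        key = tuple(row[a][0] for a in attrs)
        if key not in merged:
            merged[key] = {a: row[a] for a in attrs}
        else:
            old = merged[key]
            merged[key] = {a: (old[a][0], old[a][1] | row[a][1]) for a in attrs}
    return list(merged.values())


def join(left, right):
    """Natural join; a shared attribute carries annotations from both sides."""
    out = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if all(l[a][0] == r[a][0] for a in shared):
                row = dict(l)
                for a, (value, notes) in r.items():
                    row[a] = (value, notes | l[a][1]) if a in shared else (value, notes)
                out.append(row)
    return out


def union(left, right):
    """Union of two relations with the same schema; equal tuples are merged."""
    rows = left + right
    return project(rows, list(rows[0])) if rows else []


# Hypothetical placement: the slide does not say which cell the note is on.
RESTAURANTS = [
    {"Restaurant": cell("Pacifica", "Yummy chicken curry!!"),
     "Cost": cell("$"), "Type": cell("Chinese")},
    {"Restaurant": cell("Peacock Alley"), "Cost": cell("$$$"), "Type": cell("French")},
]
CHEAP = project(select(RESTAURANTS, lambda t: t["Cost"] == "$"), ["Restaurant", "Cost"])
```

Here `CHEAP` contains the single Pacifica row, and the annotation placed on its Restaurant cell in the source is still attached in the view.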

10 Computing annotation propagation
Question: Suppose a database is created by a query over some source data. Can we compute how to propagate an annotation on a data element back to the source with minimum side effects?
Answer: Computing the minimum side-effect annotation is NP-hard in general.
Source: relational database
View: result of the query applied on the source
Model: Query
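To illustrate what "minimum side effects" means, here is a minimal sketch for a tiny instance in which the where-provenance of every view location is already given explicitly. The view (hotels joined with restaurants, projected on Hotel and Cost), the location encoding, and the function `best_placement` are assumptions made for illustration. It is an exhaustive check over candidates, not the algorithm from the talk.

```python
def best_placement(where, target):
    """Pick a source location for an annotation meant for view location `target`.

    `where` maps each view location to the set of source locations whose
    annotations would propagate to it. Annotating a source location has a
    side effect on every *other* view location that it also reaches.
    Returns (chosen source location, list of side-effected view locations).
    """
    def side_effects(src):
        return sorted(v for v, srcs in where.items() if src in srcs and v != target)

    best = min(sorted(where[target]), key=lambda s: len(side_effects(s)))
    return best, side_effects(best)


# Hypothetical view: (Hotel, Cost) over the hotel and restaurant tables.
WALDORF = ("NYHotels", "Waldorf Astoria", "Hotel")
WHERE = {
    ("V", ("Waldorf Astoria", "$$$"), "Hotel"): {WALDORF},
    ("V", ("Waldorf Astoria", "$$$"), "Cost"): {
        ("NYRestaurants", "Peacock Alley", "Cost"),
        ("NYRestaurants", "Bull & Bear", "Cost"),
    },
    ("V", ("Waldorf Astoria", "$"), "Hotel"): {WALDORF},
    ("V", ("Waldorf Astoria", "$"), "Cost"): {("NYRestaurants", "Soho Kitchen & Bar", "Cost")},
    ("V", ("Holiday Inn DT", "$"), "Hotel"): {("NYHotels", "Holiday Inn DT", "Hotel")},
    ("V", ("Holiday Inn DT", "$"), "Cost"): {("NYRestaurants", "Pacifica", "Cost")},
}

HOTEL_CHOICE = best_placement(WHERE, ("V", ("Waldorf Astoria", "$$$"), "Hotel"))
COST_CHOICE = best_placement(WHERE, ("V", ("Waldorf Astoria", "$$$"), "Cost"))
```

Annotating the Hotel cell unavoidably touches the other Waldorf Astoria row of the view, while the Cost cell can be annotated through either restaurant without any side effect.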

11 Related Work on Annotations (not exhaustive!)
Superimposed Information (D. Maier, L. Delcambre [WebDB'99])
–data "placed over" existing information, e.g. bookmark files, schema of a database
Annotation Systems
–Annotea (W3C): annotate web pages
–Multivalent Browser (R. Wilensky, T. A. Phelps, UC Berkeley DL Project): annotate PDF files, HTML, etc.
–BioDAS (Distributed Annotation Server) (L. Stein et al.): annotate genome sequences
No one has formally studied the annotation placement problem.

12 Provenance and Annotations
Where-provenance & annotation placement
–Where should the annotation be placed in the source in order to propagate the annotation to view data d?
–Annotate the source data in one of the source locations in the where-provenance of d.
Provenance & archiving
–Trace a piece of data to its correct source version.
Why-provenance & view deletion
–Which source data should be deleted in order to delete view data d?
–A combination of source data that altogether "disable" every witness for d.
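The link between why-provenance and view deletion can be sketched as a search for a set of source tuples that hits every witness of d. The brute-force version below additionally prefers deletions that remove the fewest other view tuples. The view (Hotel, Cost), the witness lists, and the function names are hypothetical, and exhaustive search is used only because the instance is tiny.

```python
from itertools import combinations


def deleted(witnesses, removed):
    """A view tuple disappears when every one of its witnesses loses a tuple."""
    return all(w & removed for w in witnesses)


def best_deletion(all_witnesses, target):
    """Find source tuples whose removal deletes `target` from the view.

    Candidates are ranked first by the number of other view tuples they
    also delete (side effects), then by how many source tuples they remove.
    Returns (set of source tuples to delete, list of side-effected view tuples).
    """
    pool = sorted(set().union(*all_witnesses[target]))
    best = None
    for k in range(1, len(pool) + 1):
        for combo in combinations(pool, k):
            removed = frozenset(combo)
            if not deleted(all_witnesses[target], removed):
                continue  # some witness for the target survives
            side = [v for v, ws in all_witnesses.items()
                    if v != target and deleted(ws, removed)]
            score = (len(side), k)
            if best is None or score < best[0]:
                best = (score, removed, side)
    return best[1], best[2]


H_WALDORF = ("NYHotels", "Waldorf Astoria")
H_HOLIDAY = ("NYHotels", "Holiday Inn DT")
R_PEACOCK = ("NYRestaurants", "Peacock Alley")
R_BULL = ("NYRestaurants", "Bull & Bear")
R_SOHO = ("NYRestaurants", "Soho Kitchen & Bar")
R_PACIFICA = ("NYRestaurants", "Pacifica")

# Witnesses for a hypothetical (Hotel, Cost) view over the two source tables.
WITNESSES = {
    ("Waldorf Astoria", "$$$"): [frozenset({H_WALDORF, R_PEACOCK}),
                                 frozenset({H_WALDORF, R_BULL})],
    ("Waldorf Astoria", "$"): [frozenset({H_WALDORF, R_SOHO})],
    ("Holiday Inn DT", "$"): [frozenset({H_HOLIDAY, R_PACIFICA})],
}

TO_DELETE, SIDE_EFFECTS = best_deletion(WITNESSES, ("Waldorf Astoria", "$$$"))
```

Deleting the hotel tuple alone would disable both witnesses but would also remove the other Waldorf Astoria row, so the search settles on deleting the two restaurant tuples instead.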

13 How do we attach annotations to data?
Relational tables: identify a particular column A of a particular tuple t of a particular relation R: (R, t, A)
Tree-like data: need a canonical path to the data element
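For tree-like data, one way to obtain a canonical path is to address each leaf by the sequence of keys and list positions leading to it, visiting keys in sorted order. This is a sketch under that assumption; the talk does not specify a path scheme, and `canonical_paths` and the sample document are made up for illustration.

```python
def canonical_paths(node, prefix=()):
    """Yield (path, leaf_value) pairs for a nested dict/list structure.

    Dict keys are visited in sorted order and list items by position, so the
    same tree always produces the same paths regardless of insertion order.
    """
    if isinstance(node, dict):
        for key in sorted(node):
            yield from canonical_paths(node[key], prefix + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from canonical_paths(child, prefix + (index,))
    else:
        yield prefix, node


DOC = {"Emp": [{"Sal": "50K", "Name": "Joe"}]}
PATHS = list(canonical_paths(DOC))

# Annotations are kept beside the data, keyed by canonical path.
ANNOTATIONS = {("Emp", 0, "Sal"): ["hypothetical note on Joe's salary"]}
```

Keeping annotations in a separate map keyed by path leaves the underlying data untouched, which matches the idea of overlaying annotations on existing data.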

14 Lots more to do!
Further study on provenance for queries that involve negation and aggregates, e.g.
select sum(sal) from Employee where sal > 50K
Handle "irregular" annotations and annotations on tree-like data.
How about databases which are manually constructed and annotated?
–Organize data with keys
Use of constraints and special cases to derive efficient algorithms for propagating annotations back.
Language-specific issues.

15 Inconsistencies in "annotation-aware" language(s)
Emp
Name  Sal  Dept
Joe   50K  Marketing
Department
Dept       Manager
Marketing  Jane
The same query in different languages, but different annotation behavior:
Relational Algebra: Emp JOIN Department
SQL: SELECT e.Name, e.Sal, e.Dept, d.Manager FROM Emp e, Department d WHERE e.Dept = d.Dept
Result: [Name:"Joe", Sal:50K, Dept:"Marketing", Manager:"Jane"]
Equivalent queries in the same language, but different annotation behavior:
Q1 = SELECT e.Name, e.Sal FROM Emp e WHERE e.Sal = '50K'
Q2 = SELECT e.Name, '50K' AS Sal FROM Emp e WHERE e.Sal = '50K'
Result: [Name:"Joe", Sal:50K]
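The difference between Q1 and Q2 can be sketched by modelling each cell as a value plus its annotations and assuming that annotations follow values only when they are copied from a source location. Under that assumption, Q2's constant has no source location to inherit from. The annotation texts and the function names are hypothetical.

```python
# Each cell is (value, set of annotations); the annotations are made up.
EMP = [
    {"Name": ("Joe", {"verified"}), "Sal": ("50K", {"pending raise"}),
     "Dept": ("Marketing", set())},
]


def q1(emp):
    """SELECT e.Name, e.Sal ... : Sal is copied from the source cell."""
    return [{"Name": e["Name"], "Sal": e["Sal"]} for e in emp if e["Sal"][0] == "50K"]


def q2(emp):
    """SELECT e.Name, '50K' AS Sal ... : Sal is a constant built by the query."""
    return [{"Name": e["Name"], "Sal": ("50K", set())} for e in emp if e["Sal"][0] == "50K"]


def values(rows):
    """Strip annotations, leaving plain tuples for comparing query results."""
    return [{attr: value for attr, (value, _) in row.items()} for row in rows]


R1, R2 = q1(EMP), q2(EMP)
```

Both queries return the same plain values, yet only `R1` carries the annotation on Sal, which is the kind of inconsistency the slide points out.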

16 Do we need an "annotation-aware" QL?
Relational algebra suggests a natural set of propagation rules.
SQL suggests another natural propagation rule
–based on variable bindings
Question: Can we extend or design the query language(s) so that
–equivalent queries have the same annotation behavior?
–translation of a query from one language (e.g. SQL) into another (e.g. relational algebra) yields the same annotation behavior?
Perhaps a more fundamental question...
–Should a query language be "annotation-aware"?
–Perhaps we should have language constructs that let the user explicitly control annotation propagation?

17 End

