Computing Provenance and Annotations of Derived Data Wang-Chiew Tan UC Santa Cruz.


2 Provenance of data
When you see some data on the Web, do you know
– where it came from?
– why it is there?
This information (provenance) is typically lost in the process of copying, transcribing, or transforming databases.
Loss of provenance is an acute problem in some scientific databases.

3 Complex interdependencies (example from scientific databases)
[Figure: flow of data among GenBank, Swissprot, TRRD, GERD, Transfac, EpoDB, EMBL, DDBJ, BEAD, and GAIA]
Various problems:
– Trace provenance of data
– Propagate annotations

4 Two kinds of provenance
NYRestaurants (Source table)
Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022
NYHotels (Source table): columns Hotel, Zip, Rating; rows for Waldorf Astoria and Holiday Inn DT (the cell values, apart from a 4.5 shown next to Rating, are not recoverable from the transcript)
View (JOIN, PROJECT over the two source tables): columns Hotel, Restaurant, Cost
Why? (Why-provenance)
Where? (Where-provenance)
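The difference between the two kinds of provenance can be made concrete with a small sketch. It assumes the view joins the two source tables on Zip and projects onto Hotel, Restaurant, and Cost. The hotel zip codes below are hypothetical, since they did not survive in the transcript, and the function name `view_with_provenance` is illustrative only.

```python
from collections import defaultdict

# Source table from the slide: (Restaurant, Cost, Type, Zip).
NY_RESTAURANTS = [
    ("Peacock Alley", "$$$", "French", "10022"),
    ("Bull & Bear", "$$$", "Seafood", "10022"),
    ("Pacifica", "$", "Chinese", "10013"),
    ("Soho Kitchen & Bar", "$", "American", "10022"),
]

# ASSUMPTION: these zip codes are hypothetical; the transcript lost the
# NYHotels cell values. They are chosen only so that the join is non-empty.
NY_HOTELS = [
    ("Waldorf Astoria", "10022"),
    ("Holiday Inn DT", "10013"),
]


def view_with_provenance(hotels, restaurants):
    """Join on Zip, project onto (Hotel, Restaurant, Cost), track provenance."""
    why = defaultdict(set)    # view tuple -> set of witnesses
    where = defaultdict(set)  # (view tuple, attribute) -> source locations
    for h_idx, (hotel, h_zip) in enumerate(hotels):
        for r_idx, (rest, cost, _type, r_zip) in enumerate(restaurants):
            if h_zip != r_zip:
                continue
            out = (hotel, rest, cost)
            # A witness: the source tuples that together produce `out`.
            why[out].add(
                frozenset({("NYHotels", h_idx), ("NYRestaurants", r_idx)})
            )
            # Where-provenance: the source location each value was copied from.
            where[(out, "Hotel")].add(("NYHotels", h_idx, "Hotel"))
            where[(out, "Restaurant")].add(("NYRestaurants", r_idx, "Restaurant"))
            where[(out, "Cost")].add(("NYRestaurants", r_idx, "Cost"))
    return dict(why), dict(where)


why, where = view_with_provenance(NY_HOTELS, NY_RESTAURANTS)
row = ("Waldorf Astoria", "Peacock Alley", "$$$")
print(why[row])              # the source tuples that explain why `row` exists
print(where[(row, "Cost")])  # the source cell the Cost value was copied from
```

Why-provenance here is recorded per view tuple, as a set of witnesses, while where-provenance is recorded per view cell, as a set of source locations.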

5 SDSS - Sloan Digital Sky Server
Select Specobj.z, photoobj.g, photoobj.r
From Specobj, photoobj
Where Specobj.objid = photoobj.objid
  and Specobj.specclass = 3
  and Specobj.zconf > .95

6 Compute provenance
Question: Suppose a database is created by a query. Can we compute the why- and where-provenance of an element?
Answer: Computing provenance (both why and where) is NP-hard in general.

7 Annotations
Adds value to data
– knowledge sharing: annotations can be read & reviewed by independent parties
Annotations are loosely structured
– Annotations on data at various levels of granularity, annotations on annotations
Source Data:
– proprietary
– fixed schema
A system that overlays annotations on existing data
Useful tool for scientific databases
Annotations should spread back to the source and forward to other databases

8 Propagating annotations
NYRestaurants (Source Table)
Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022
All Restaurants (View 1)
Restaurant          Cost  Type
Peacock Alley       $$$   French
Bull & Bear         $$$   Seafood
Pacifica            $     Chinese
Soho Kitchen & Bar  $     American
Cheap Restaurants (View 2)
Restaurant          Cost  Type
Pacifica            $     Chinese
Soho Kitchen & Bar  $     American
Annotations shown in the figure (the cells they attach to are not recoverable from the transcript):
– "Yummy chicken curry!!"
– "Serves fine French Cuisine in elegant setting. Jackets required."
– "Extensive wine list!"

9 Location and Propagation Rules
A location is a triple: (R, t, A)
– R is a relation name
– t is a tuple in R
– A is an attribute in the schema of R
Propagation Rules (the diagrams over attributes A1, A2, A3 and relations R, R1, R2 are not recoverable from the transcript):
– Select
– Project
– Join
– Union
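Since the rule diagrams are lost, the following is only a sketch of one plausible set of forward propagation rules, not necessarily the rules shown on the slide. It assumes that an annotation travels with the value it is attached to, that a join merges the annotations of the two matching join cells, and that duplicate tuples produced by project or union merge their annotations. All function names are hypothetical, and the cell that carries the "Yummy chicken curry!!" annotation is chosen arbitrarily.

```python
def cell(value, *notes):
    """A cell is a (value, annotations) pair; a tuple maps attribute -> cell."""
    return (value, frozenset(notes))


def plain(t):
    """Strip annotations so predicates see ordinary values."""
    return {a: v for a, (v, _) in t.items()}


def select(rel, pred):
    # Select: surviving tuples keep all of their annotations.
    return [t for t in rel if pred(plain(t))]


def project(rel, attrs):
    # Project: keep annotations on retained attributes; tuples that become
    # duplicates are collapsed and their annotations are merged.
    out = {}
    for t in rel:
        key = tuple(t[a][0] for a in attrs)
        prev = out.get(key)
        out[key] = {
            a: (t[a][0], t[a][1] | (prev[a][1] if prev else frozenset()))
            for a in attrs
        }
    return list(out.values())


def join(r1, r2):
    # Natural join: a shared attribute receives annotations from both sides.
    out = []
    for t1 in r1:
        for t2 in r2:
            shared = t1.keys() & t2.keys()
            if all(t1[a][0] == t2[a][0] for a in shared):
                merged = {**t1, **t2}
                for a in shared:
                    merged[a] = (t1[a][0], t1[a][1] | t2[a][1])
                out.append(merged)
    return out


def union(r1, r2):
    # Union: identical tuples from both inputs merge their annotations.
    rows = r1 + r2
    return project(rows, sorted(rows[0])) if rows else []


ny_restaurants = [
    {"Restaurant": cell("Peacock Alley"), "Cost": cell("$$$"),
     "Type": cell("French"), "Zip": cell("10022")},
    # ASSUMPTION: which cell carries this annotation is chosen arbitrarily.
    {"Restaurant": cell("Pacifica", "Yummy chicken curry!!"), "Cost": cell("$"),
     "Type": cell("Chinese"), "Zip": cell("10013")},
    {"Restaurant": cell("Soho Kitchen & Bar"), "Cost": cell("$"),
     "Type": cell("American"), "Zip": cell("10022")},
]

# "Cheap Restaurants": select on Cost, then project away Zip.
view2 = project(
    select(ny_restaurants, lambda t: t["Cost"] == "$"),
    ["Restaurant", "Cost", "Type"],
)
print(view2[0]["Restaurant"])
```

Under these assumptions the annotation on the source cell follows the Pacifica row into the view, while anything attached to the dropped Zip column would not.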

10 Computing annotation propagation
Model:
– Source: Relational Database
– Query
– View: result of query applied on source
Question: Suppose a database is created by a query over some source data. Can we compute how to propagate an annotation on a data element back to the source with minimum side-effects?
Answer: Computing the minimum side-effect annotation is NP-hard in general.
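To illustrate what "minimum side-effects" means, here is a small exhaustive search on a toy instance. It assumes a view that joins hotels and restaurants on Zip and keeps Hotel, Restaurant, and Zip, and it assumes that the view's Zip cell receives annotations from the Zip cells of both joined tuples. The hotel zip codes and all function names are hypothetical. Enumerating candidates is easy for this one fixed query; the sketch says nothing about the general hardness result stated on the slide.

```python
# ASSUMPTION: hotel zip codes are hypothetical illustration values.
HOTELS = [("Waldorf Astoria", "10022"), ("Holiday Inn DT", "10013")]
RESTAURANTS = [
    ("Peacock Alley", "10022"),
    ("Bull & Bear", "10022"),
    ("Pacifica", "10013"),
    ("Soho Kitchen & Bar", "10022"),
]


def forward_map(hotels, restaurants):
    """Map each source location to the set of view locations it reaches."""
    reach = {}
    for hi, (hotel, h_zip) in enumerate(hotels):
        for ri, (rest, r_zip) in enumerate(restaurants):
            if h_zip != r_zip:
                continue
            row = (hotel, rest, h_zip)
            links = [
                (("NYHotels", hi, "Hotel"), (row, "Hotel")),
                (("NYRestaurants", ri, "Restaurant"), (row, "Restaurant")),
                # The join attribute is fed by both source tables.
                (("NYHotels", hi, "Zip"), (row, "Zip")),
                (("NYRestaurants", ri, "Zip"), (row, "Zip")),
            ]
            for src, dst in links:
                reach.setdefault(src, set()).add(dst)
    return reach


def best_placement(reach, target):
    """Return (side_effects, source_location) with the fewest side effects."""
    candidates = [
        (len(dsts - {target}), src)
        for src, dsts in reach.items()
        if target in dsts
    ]
    return min(candidates) if candidates else None


reach = forward_map(HOTELS, RESTAURANTS)
target = (("Waldorf Astoria", "Peacock Alley", "10022"), "Zip")
print(best_placement(reach, target))
```

In this instance, annotating the hotel's Zip cell would also annotate the two other view rows for the same hotel, whereas annotating the restaurant's Zip cell reaches only the target, so the search prefers the latter.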

11 Related Work on Annotations (not exhaustive!)
Superimposed Information (D. Maier, L. Delcambre [WebDB’99])
– data “placed over” existing information, e.g. bookmark files, schema of a database
Annotation Systems
– Annotea (W3C): annotate web pages
– Multivalent Browser (R. Wilensky, T. A. Phelps, UC Berkeley DL Project): annotate PDF files, HTML, etc.
– BioDAS (Distributed Annotation Server) (L. Stein et al.): annotate genome sequences
No one has formally studied the annotation placement problem

12 Provenance and Annotations
Where-provenance & annotation placement
– Where should the annotation be placed in the source in order to propagate the annotation to view data d?
– Annotate the source data in one of the source locations in the where-provenance of d
Why-provenance & view deletion
– Which source data should be deleted in order to delete view data d?
– A combination of source data that altogether “disable” every witness for d
Provenance & Archiving
– trace a piece of data to its correct source version
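The link between why-provenance and view deletion can be sketched as a hitting-set search: to delete a view tuple d, pick source tuples so that every witness of d loses at least one member. The sketch below assumes a view that joins hotels and restaurants on Zip and projects onto (Hotel, Cost). The hotel zip codes and the function names are hypothetical, and the brute-force search minimizes the number of source deletions only, ignoring side effects on other view tuples.

```python
from itertools import combinations

# ASSUMPTION: hotel zip codes are hypothetical illustration values.
HOTELS = [("Waldorf Astoria", "10022"), ("Holiday Inn DT", "10013")]
RESTAURANTS = [  # (Restaurant, Cost, Zip)
    ("Peacock Alley", "$$$", "10022"),
    ("Bull & Bear", "$$$", "10022"),
    ("Pacifica", "$", "10013"),
    ("Soho Kitchen & Bar", "$", "10022"),
]


def witnesses(hotels, restaurants, target):
    """All witnesses of view tuple `target` = (Hotel, Cost)."""
    found = []
    for hi, (hotel, h_zip) in enumerate(hotels):
        for ri, (_rest, cost, r_zip) in enumerate(restaurants):
            if h_zip == r_zip and (hotel, cost) == target:
                found.append(
                    frozenset({("NYHotels", hi), ("NYRestaurants", ri)})
                )
    return found


def smallest_deletion(ws):
    """Smallest set of source tuples that intersects every witness."""
    universe = sorted(set().union(*ws))
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            chosen = set(combo)
            if all(w & chosen for w in ws):
                return chosen
    return set()  # no witnesses: nothing needs deleting


ws = witnesses(HOTELS, RESTAURANTS, ("Waldorf Astoria", "$$$"))
print(len(ws))                # two witnesses, one per $$$ restaurant
print(smallest_deletion(ws))  # deleting the hotel tuple disables both
```

Deleting the hotel tuple is the smallest choice here, but it would also remove the (Waldorf Astoria, $) view tuple, which is exactly the kind of side effect the deletion problem has to weigh.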

13 How do we attach annotations to data?
Relational tables: Identify a particular attribute (column) of a particular tuple of a particular relation: (R, t, A)
Tree-like data: Need a canonical path to the data element
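A minimal sketch of the two addressing schemes follows. The `Location` type, the `annotate` and `resolve` helpers, the use of a key value to identify the tuple t, and the nested-dictionary encoding of tree-like data are all assumptions made for illustration.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """A relational location (R, t, A)."""
    relation: str     # R: relation name
    tuple_key: tuple  # t: identifies the tuple (here by a key value)
    attribute: str    # A: attribute in the schema of R


annotations = {}  # address (Location or path) -> list of annotation strings


def annotate(address, note):
    """Attach a note to a relational location or a tree path."""
    annotations.setdefault(address, []).append(note)


def resolve(tree, path):
    """Follow a canonical path (a tuple of keys and indexes) into nested data."""
    node = tree
    for step in path:
        node = node[step]
    return node


# Relational case: annotate the Type cell of the Peacock Alley tuple.
loc = Location("NYRestaurants", ("Peacock Alley",), "Type")
annotate(loc, "Extensive wine list!")

# Tree-like case: the path itself serves as the address.
tree = {"restaurants": [{"name": "Pacifica", "cost": "$"}]}
path = ("restaurants", 0, "cost")
annotate(path, "cheap")
print(resolve(tree, path), annotations[path])
```

Both kinds of address are hashable, so a single dictionary can overlay annotations on the data without changing the source schema.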

14 Lots more to do!
Further study on provenance for queries that involve negation or aggregates, e.g.
select sum(sal) from Employee where sal > 50K
Handle “irregular” annotations and annotations on tree-like data
How about databases which are manually constructed and annotated?
– Organize data with keys
Use of constraints and special cases to derive efficient algorithms for propagating annotations back
Language specific issues

15 Inconsistencies in “annotation-aware” language(s)
The same query in different languages, but different annotation behavior
Relational Algebra: Emp JOIN Department
SQL: SELECT e.Name, e.Sal, e.Dept, d.Manager FROM Emp e, Department d WHERE e.Dept = d.Dept
Emp
Name  Sal  Dept
Joe   50K  Marketing
Department
Dept       Manager
Marketing  Jane
Result: [Name:"Joe", Sal:50K, Dept:"Marketing", Manager:"Jane"]
Equivalent queries in the same language, but different annotation behavior
Q1 = SELECT e.Name, e.Sal FROM Emp e WHERE e.Sal = "50K"
Q2 = SELECT e.Name, "50K" AS Sal FROM Emp e WHERE e.Sal = "50K"
Result of both: [Name:"Joe", Sal:50K]
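The divergence between the two equivalent queries can be sketched as follows, assuming a propagation rule in which an annotation follows a value that is copied from a source cell, while a constant written in the query carries no annotation. The annotation text and the function names are hypothetical.

```python
# Each cell is a (value, annotations) pair.
# ASSUMPTION: the annotation text on Joe's salary is invented for illustration.
EMP = [
    {
        "Name": ("Joe", frozenset()),
        "Sal": ("50K", frozenset({"confirmed by payroll"})),
        "Dept": ("Marketing", frozenset()),
    }
]


def q1(emp):
    """SELECT e.Name, e.Sal FROM Emp e WHERE e.Sal = "50K"."""
    # Sal is copied from the source cell, so its annotations come along.
    return [
        {"Name": t["Name"], "Sal": t["Sal"]}
        for t in emp
        if t["Sal"][0] == "50K"
    ]


def q2(emp):
    """SELECT e.Name, "50K" AS Sal FROM Emp e WHERE e.Sal = "50K"."""
    # Sal is a constant built by the query, so it starts with no annotations.
    return [
        {"Name": t["Name"], "Sal": ("50K", frozenset())}
        for t in emp
        if t["Sal"][0] == "50K"
    ]


def values(rel):
    """Drop annotations to compare ordinary query results."""
    return [{a: v for a, (v, _) in t.items()} for t in rel]


print(values(q1(EMP)) == values(q2(EMP)))  # same data
print(q1(EMP)[0]["Sal"][1])                # annotation propagated
print(q2(EMP)[0]["Sal"][1])                # annotation lost
```

The two queries return the same values on every database, yet only the first carries the source annotation into the result, which is the inconsistency the slide points out.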

16 Do we need an “annotation-aware” QL?
Relational algebra suggests a natural set of propagation rules
SQL suggests another natural propagation rule
– based on variable bindings
Question: Can we extend or design the query language(s) so that
– Equivalent queries have the same annotation behavior
– Translation of a query from one language (e.g. SQL) into another (e.g. relational algebra) yields the same annotation behavior
Perhaps a more fundamental question...
– Should a query language be “annotation-aware”?
– Perhaps we should have language constructs to allow the user to explicitly control annotation propagation?

17 End