Update Exchange with Mappings and Provenance Todd J. Green Grigoris Karvounarakis Zachary G. Ives Val Tannen University of Pennsylvania VLDB 2007 Vienna,

Presentation transcript:

Update Exchange with Mappings and Provenance
Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, Val Tannen
University of Pennsylvania
VLDB 2007, Vienna, Austria, September 26, 2007

Adoption of data integration tools
Structured information is pervasive in the Internet age, as is the need to access and integrate it…
– Need to collect, transform, aggregate information
– Need to import related data into an existing database instance
… but after many years of research, few users of data integration tools
Why?

Not because the problem is too hard!
People are doing it anyway! (Just without help from DB research)
– e.g., bioinformatics
Ad-hoc solutions (Perl scripts) developed for specific domains
– e.g., at Penn, a large staff of programmers maintains the Genomics Unified Schema (GUS)
Point-to-point exchange between peers / collaborating sites
To be adopted, data integration tools need to offer significant additional value...

Needs unmet by data integration tools
Previous data integration tools do not offer:
– Complete local control of data
    Decide which data is imported / integrated
    Ability to modify any data, even data from elsewhere!
– Support for different points of view
    Disagreements about data, mappings, schemas...
    Which sources are trusted / distrusted
– Tracking of data provenance
– Support for incremental updates
    Changes to data, mappings, schemas...
Our system, ORCHESTRA, addresses these needs

Requirements for ORCHESTRA, a Collaborative Data Sharing System (CDSS) [Ives+05]
Give peers full control using local instance
Support different needs / perspectives
Relate peers by mappings and trust policies
Support update exchange
Maintain data provenance
[Figure: Peer A, Peer B, and Peer C, each with a DBMS receiving queries and edits; a PUBLISH step carries each peer's updates to the others]

How ORCHESTRA addresses CDSS requirements
From one peer's perspective:
[Figure: update pipeline. Labels: updates from other peers (P other); translate through mappings with provenance (m); apply trust policies using provenance (σ); produce candidate updates (Pc); local curation (+); resolve conflicts [TaylorIves06]; apply final updates to peer (Pf). Part of the pipeline is marked "Contributions of this paper".]

Roadmap
Update exchange in a CDSS:
– Schema mappings
– Tracking of data provenance
– Incremental propagation of updates
– Provenance-based trust policies
– Local curation via insertions / deletions
Prototype implementation
Experimental evaluation

Mappings and updates
CDSS setting: set of peers; set of declarative mappings (tgds)
Given: setting, base data, updates
Goal: local instance at each peer
cf. data exchange paradigm [Fagin+03]
– Universal solution yields the certain answers to queries
– Can be computed using the chase
Our contribution: how to do it incrementally, with provenance...
[Figure: relations G, B, and U connected by mappings m1, m2, m3]
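The idea of computing each local instance by chasing the mappings can be sketched as follows. This is only an illustration: the three tgds in the docstring are assumed for the example (the transcript does not spell out m1, m2, m3), and `chase` and the `_N` naming of labeled nulls are hypothetical names.

```python
from itertools import count

def chase(G, B_base=()):
    """Naive chase over three assumed tgds (illustrative, not from the slides):
         m1: G(i, c, n) -> B(i, n)
         m2: G(i, c, n) -> U(n, c)
         m3: B(i, n)    -> exists c. U(n, c)
    Returns the instances of B and U; unknown values become labeled nulls.
    """
    B, U = set(B_base), set()
    fresh = count(1)
    changed = True
    while changed:
        changed = False
        # Full tgds first, so m3 only invents a null when no witness exists.
        for (i, c, n) in G:
            if (i, n) not in B:
                B.add((i, n))
                changed = True
            if (n, c) not in U:
                U.add((n, c))
                changed = True
        for (i, n) in sorted(B):
            if not any(u[0] == n for u in U):
                U.add((n, f"_N{next(fresh)}"))  # labeled null
                changed = True
    return B, U
```

Calling `chase({(3, 5, 2), (1, 3, 3)}, {(3, 5)})` derives B and U from G and creates a single labeled null for the B tuple that has no matching U tuple.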

Incremental insertion
[Figure: provenance graph over relations G, B, and U, with tuples (3, 5, 2), (1, 3, 3), (3, 5), (3, 2), (1, 3), (3, 3), (2, 5) linked through mappings m1, m2, m3; some nodes are marked +]
This graph represents the provenance information that ORCHESTRA maintains
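A minimal sketch of insertion that propagates only the newly inserted tuple and records a provenance edge for each derivation. The two mappings in `derive` are assumptions made for the example, and the dictionary layout (fact to set of `(mapping, sources)` pairs, with `"+"` marking a local insertion) is a hypothetical encoding, not ORCHESTRA's storage format.

```python
from collections import defaultdict

def derive(fact):
    """Assumed single-source mappings: m1: G(i,c,n) -> B(i,n); m2: G(i,c,n) -> U(n,c)."""
    rel, t = fact
    if rel == "G":
        i, c, n = t
        yield "m1", ("B", (i, n))
        yield "m2", ("U", (n, c))

def insert(prov, fact):
    """Insert a base fact, push only the delta through the mappings,
    and record how each derived fact was obtained."""
    is_new = fact not in prov
    prov[fact].add(("+", ()))          # "+" marks a local insertion
    frontier = [fact] if is_new else []
    while frontier:
        src = frontier.pop()
        for mapping, out in derive(src):
            out_is_new = out not in prov
            prov[out].add((mapping, (src,)))   # provenance edge
            if out_is_new:
                frontier.append(out)           # only new facts propagate further

prov = defaultdict(set)
insert(prov, ("G", (3, 5, 2)))
insert(prov, ("G", (1, 3, 3)))
insert(prov, ("B", (3, 5)))
```

Facts that already exist gain an extra derivation but are not propagated again, which is what keeps the work proportional to the update rather than to the database.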

Incremental deletion
[Figure: the same provenance graph over G, B, and U, with tuples (3, 5, 2), (1, 3, 3), (3, 5), (3, 2), (1, 3), (3, 3), (2, 5) and mappings m1, m2, m3]
Step 1: Use provenance graph to find derived tuples which can also be deleted
Step 2: Test other affected tuples for derivability, and delete any not derivable
Step 3: Repeat
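The three steps can be sketched over a provenance graph stored as a dictionary from each fact to its set of `(mapping, sources)` derivations. This is a simplified illustration under that assumed encoding (base facts carry the derivation `("+", ())`); the function name and data layout are hypothetical, and the real system works over relational provenance tables.

```python
def delete(prov, fact):
    """Delete a base fact and everything that is no longer derivable.
    prov: dict fact -> set of (mapping, sources); modified in place."""
    prov[fact].discard(("+", ()))
    # Step 1: follow provenance edges to collect every downstream fact.
    affected, frontier = {fact}, [fact]
    while frontier:
        src = frontier.pop()
        for out, derivs in prov.items():
            if out not in affected and any(src in srcs for _, srcs in derivs):
                affected.add(out)
                frontier.append(out)
    # Steps 2 and 3: repeatedly test affected facts for derivability.
    # A fact survives if some derivation uses only unaffected or surviving facts.
    alive = set()
    changed = True
    while changed:
        changed = False
        for f in affected - alive:
            if any(all(s not in affected or s in alive for s in srcs)
                   for _, srcs in prov[f]):
                alive.add(f)
                changed = True
    removed = affected - alive
    for f in removed:
        del prov[f]
    # Drop derivations that relied on a removed fact.
    for f in prov:
        prov[f] = {(m, srcs) for m, srcs in prov[f]
                   if not any(s in removed for s in srcs)}
    return removed
```

Because derivability is computed as a least fixpoint, two facts that only support each other through a mapping cycle are both removed, which is the case where count-based maintenance breaks down.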

Other approaches to incremental deletion
Many strategies (both research and commercial) for incremental deletions, but we have to support recursion
– Mappings can have cycles
– Count-based algorithms don't work (infinite counts)
Incremental maintenance for recursive datalog programs
– DRed [GuptaMumick95]
– DRed (delete and re-derive) computes superset of deletions, then corrects if needed
– We use provenance to compute exact set of deletions

Trust policies (not every update should be propagated)
Updates can be filtered automatically based on provenance and content
– Peer A distrusts any tuple U(i,n) if the data came from Peer B and n 3, and trusts any tuple from Peer C
– Peer A distrusts any tuple U(i,n) that came from mapping m4 if n 2
Local curation: user can also manually accept/reject updates, or introduce new ones...
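A sketch of how such policies might be evaluated against provenance. Several things here are assumptions: the comparison operators in the two policies are not legible in the slide text, so `>=` and `!=` are placeholders; the `Derivation` record is a hypothetical summary of one derivation; and the rule that a tuple is accepted when at least one of its derivations is trusted is the reading assumed for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Derivation:
    peers: frozenset      # peers whose data contributed to this derivation
    mappings: frozenset   # mappings used along this derivation

def distrusted_by_peer_a(tup, d):
    """Peer A's two example policies; comparison operators are assumed."""
    i, n = tup
    if "B" in d.peers and n >= 3:       # assumed operator: >=
        return True
    if "m4" in d.mappings and n != 2:   # assumed operator: !=
        return True
    return False

def accept(tup, derivations):
    """Accept a candidate tuple if some derivation is not distrusted."""
    return any(not distrusted_by_peer_a(tup, d) for d in derivations)

via_b = Derivation(frozenset({"B"}), frozenset({"m1"}))
via_c = Derivation(frozenset({"C"}), frozenset({"m2"}))
print(accept((1, 3), [via_b]))          # only a distrusted derivation
print(accept((1, 3), [via_b, via_c]))   # also derivable from Peer C
```

Keeping policies as predicates over content plus provenance means the same candidate tuple can be accepted by one peer and rejected by another.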

Local curations
Extra tables for local insertions and deletions:
[Figure: candidate updates (Pc), produced by mappings, trust policies, etc., pass through local curation (+) to become final updates (Pf)]
Contribution: conforms to data exchange paradigm by using internal mappings with local insertions/deletions
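One way to read the internal mappings is that the final instance keeps every candidate tuple that was not locally rejected and adds every locally inserted tuple. The sketch below assumes exactly that reading; the function and parameter names are hypothetical.

```python
def curate(candidates, local_inserts, local_rejects):
    """Final instance under the assumed internal mappings:
         Pf(t) :- Pc(t), not rejected(t)
         Pf(t) :- inserted(t)
    """
    return (set(candidates) - set(local_rejects)) | set(local_inserts)

final = curate(
    candidates={(3, 5), (1, 3)},
    local_inserts={(9, 9)},
    local_rejects={(1, 3)},
)
```

Expressing curation as ordinary rules over extra tables lets local edits be propagated by the same machinery as any other mapping.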

Prototype implementation
Middleware layer on top of relational DBMS
Mappings converted to datalog rules (as in Clio)
Separate tables for provenance info
Engine option 1: based on commercial DBMS (DB2)
– Datalog fixpoints in Java and SQL (only linear recursion in DB2)
– Labeled nulls supported via encoding scheme
Engine option 2: using in-house query engine (Tukwila)
– BerkeleyDB for auxiliary storage and indexes
– Custom operators for fixpoints, built-in labeled nulls
30,000 lines of Java and C++ code

Experimental evaluation
DB2-based and Tukwila-based implementations
Workload typical of bioinformatics setting (at most 10s of peers, GBs of data)
Synthetic update workload sampled from SWISS-PROT biological data set
– Randomly-generated schemas and mappings
Dual Xeon 5150 server, 8 GB RAM (2 GB for DB)
Variables: number of peers, complexity of mappings, volume of data, type of data, size of updates
Measured: time to join system, time to propagate updates, size of updated database

Incremental deletion algorithm yields significant speedup
Parameters: 5 peers, full acyclic mappings, string data, 1 GB database
[Chart: time to propagate deletions (sec); series: Non-incremental, Incremental, DRed]

System scales to realistic #s of peers
Parameters: full acyclic mappings, integer data, up to 1 GB database
[Chart: time to propagate insertions (sec); series: 10% insertions (DB2), 1% insertions (DB2), 10% insertions (Tukwila), 1% insertions (Tukwila)]

Contributions
ORCHESTRA innovatively performs update exchange (not just mediated/federated query answering)
Tracks data provenance across a network of schema mappings
Supports provenance-based trust policies
Features algorithms for incremental propagation of updates
Solutions have been validated by experimental prototype for typical bioinformatics settings

Related work
Peer data management systems: Piazza [Halevy+03, 04], Hyperion [Kementsietsidis+04], [Bernstein+02], [Calvanese+04],...
Data exchange [Haas+99, Miller+00, Popa+02, Fagin+03], peer data exchange [Fuxman+05]
Provenance / lineage [CuiWidom01], [Buneman+01], Trio [Widom+05], Spider [ChiticariuTan06], [Green+07],...
Incremental maintenance [GuptaMumick95], …

CDSS as a research platform: promising future directions
Ranking-based trust with provenance
– Numeric weights and accumulation of evidence
More expressive mappings
– e.g., looking inside attributes using regular expressions
Compact representations of provenance
Mixing virtual and materialized peers
– Related to view selection problem
Supporting key dependencies / egds
– Deletion propagation becomes challenging
Incorporating probabilistic mappings / data

Ongoing work at Penn
Deploying ORCHESTRA in the real world
– Pilot project with Penn Center for Bioinformatics
Bidirectional mappings
– Propagating updates in both directions
Mapping evolution problem
– Handling updates to mappings (not just data)
Fully distributed implementation
– Using P2P database engine

Bioinformatics mappings example

Delta rules for insertions
As in DRed [GuptaMumick95]:
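The rules shown on this slide are not part of the transcript. As a generic illustration of insertion delta rules in the semi-naive style, the sketch below maintains a hypothetical recursive view T (transitive closure of an edge relation E) when new edges arrive; the relation names and the function are assumptions made for the example.

```python
def insert_edges(E, T, new_edges):
    """Maintain T under insertions into E, where (hypothetical program)
         T(x, y) :- E(x, y)
         T(x, z) :- T(x, y), T(y, z)
    Delta rules applied each round, with T already including the delta:
         dT'(x, z) :- dT(x, y), T(y, z)
         dT'(x, z) :- T(x, y), dT(y, z)
    """
    new_edges = set(new_edges)
    E |= new_edges
    delta = new_edges - T
    while delta:
        T |= delta
        derived = {(x, z) for (x, y) in delta for (y2, z) in T if y == y2}
        derived |= {(x, z) for (x, y) in T for (y2, z) in delta if y == y2}
        delta = derived - T   # keep only genuinely new tuples
    return T

E = {(1, 2), (2, 3)}
T = {(1, 2), (2, 3), (1, 3)}
insert_edges(E, T, {(3, 4)})
```

Each round joins only the newly derived tuples against the view, so the loop stops as soon as an insertion produces nothing new.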