Data Integration and Exchange for Scientific Collaboration
DILS 2009, July 20, 2009
Zachary G. Ives, University of Pennsylvania
Funded by NSF IIS-0477972, …
With Todd Green, Grigoris Karvounarakis, Nicholas Taylor, Partha Pratim Talukdar, Marie Jacob, Val Tannen, Fernando Pereira, Sudipto Guha

A Pressing Need for Data Integration in the Life Sciences
- The ultimate goal: assemble all biological data into an integrated picture of living organisms
  - If feasible, this could revolutionize the sciences and medicine!
- Many efforts to compile databases (warehouses) for specific fields, organisms, communities, etc.
  - Genomics, proteomics, diseases (incl. epilepsy, diabetes), phylogenomics, …
- Perhaps "too successful": now hundreds of DBs holding portions of the data we need to tie together!

Basic Data Integration Makes the Wrong Assumptions
- Existing data sharing methods (scripts, FTP) are ad hoc and piecemeal, and don't preserve "fixes" made at local sites
- What about database-style integration (EII)? Unlike business or most Web data, science is in flux, with data that is subjective, based on hypotheses / diagnoses / analyses
  - What is the right target schema? A "clean" version? The set of sources?
- We need to re-think data integration architectures and solutions in response to this!
[Diagram: sources mapped via transformations and cleaning into a target schema with a consistent data instance, which answers queries]

Common Characteristics of Scientific Databases and Data Sharing
- A scientific database site is often not just a source, but a portal for a community:
  - Preferred terminologies and schemas
  - Differing conventions, hypotheses, curation standards
- Sites want to share data by "approximate synchronization": every site wants to import the latest data, then revise and query it
- Change is prevalent everywhere:
  - Updates to data: curation, annotation, addition, correction, cleaning
  - Evolving schemas, due to new kinds of data or new needs
  - New sources, new collaborations with other communities
- Different data sources have different levels of authority
  - This impacts how data should be shared and how it is queried

Collaborative Data Sharing System (CDSS) [Ives et al. CIDR05; SIGMOD Rec. 08]
- Logical P2P network of autonomous data portals
  - Peers keep control and updatability of their own DBs
  - Related by compositional mappings and trust policies
- Dataflow: occasional update exchange
  - Record data provenance to assess trust
  - Reconcile conflicts according to level of trust
- Global services:
  - Archived storage
  - Distributed data transformation
  - Keyword queries
  - Querying provenance and authority
[Diagram: peers A, B, and C, each with its own DBMS handling local queries and edits, exchanging update deltas (ΔA+/−, ΔB+/−, ΔC+/−) through a shared archive]

How the CDSS Addresses the Challenges of Scientific Data Sharing
- A scientific database site is often not just a source, but a portal for a community:
  - Preferred terminologies and schemas
  - Differing conventions, hypotheses, curation standards
- Sites want to share data by "approximate synchronization": every site wants to import the latest data, then revise and query it
- Change is prevalent everywhere:
  - Updates to data: curation, annotation, addition, correction, cleaning
  - Evolving schemas, due to new kinds of data or new needs
  - New sources, new collaborations with other communities
- Different data sources have different levels of authority
  - This impacts how data should be shared and how it is queried

Supporting Multiple Portals
- Suppose we have a site focused on phylogeny, uBio, with a relation U(nam, can) of organism names and canonical names
- We want to import data from another DB, GUS, with a relation G(id, can, nam): primarily about genes, but also containing organism common and canonical names

Supporting Multiple Portals / Peers (combines [Halevy, Ives+ 03], [Fagin+ 04])
- Tools exist to automatically find rough schema matches (Clio, LSD, COMA++, BizTalk Mapper, …) and to link entities
- We add a schema mapping between the sites (GUS's G(id, can, nam) and uBio's U(nam, can)), specifying a transformation:
  m: U(n, c) :- G(i, c, n)
- (Via correspondence tables, we can also map between identities)
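To make the transformation concrete, here is a minimal Python sketch of applying mapping m to a toy instance; the relation names U and G follow the slides, while the sample tuples are invented for illustration:

    # Applying mapping m: U(n, c) :- G(i, c, n) to a toy in-memory instance.
    # Relation names follow the slides; the data values are invented.
    G = {                                   # GUS: G(id, canonical, name)
        (1, "Rattus norvegicus", "brown rat"),
        (2, "Mus musculus", "house mouse"),
    }

    # Project away the id and reorder columns to match uBio's U(name, canonical).
    U = {(n, c) for (i, c, n) in G}

    print(sorted(U))
    # [('brown rat', 'Rattus norvegicus'), ('house mouse', 'Mus musculus')]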

Adding a Third Portal…
- Sharing data with another peer (BioSQL, with relation B(id, nam)) simply requires mapping data to it:
  m1: B(i, n) :- G(i, c, n)
  m2: U(n, c) :- G(i, c, n)
  m3: B(i, n) :- B(i, c), U(n, c)
- Peers: GUS with G(id, can, nam); BioSQL with B(id, nam); uBio with U(nam, can)

Suppose BioSQL Changes Schemas
- Schema evolution is simply another schema plus a mapping:
  m1: B(i, n) :- G(i, c, n)
  m2: U(n, c) :- G(i, c, n)
  m3: B(i, n) :- B(i, c), U(n, c)
  m4: B'(n) :- B(i, n)
- Peers: GUS, uBio, BioSQL, and the new BioSQL' with B'(nam)

A Challenge: Diverse Opinions, Different Curation Standards
- A down-side of compositionality: maybe we want data from our friends, but not from their friends
- Each site should be able to have its own policy about which data it will admit – trust conditions
  - Based on the site's evaluation of the "quality" of the mappings and sources used to produce a result – its provenance
- Each site can delegate authority to others
  - "I import data from Bob, and trust anything Bob does"
- By default, an "open" model – trust everyone unless otherwise stated

How the CDSS Addresses the Challenges of Scientific Data Sharing
- A scientific database site is often not just a source, but a portal
- Sites want to share data by "approximate synchronization"
- Change is prevalent everywhere
- Different data sources have different levels of authority

How a Peer Shares Data in the CDSS [Taylor & Ives 06], [Green+ 07], [Karvounarakis & Ives 08]
- Publish: updates from this peer (ΔP_pub) go to the CDSS archive – a permanent log kept using P2P replication [Taylor & Ives 09 sub]
- Import: updates from all other peers (ΔP_other) are
  - translated through the mappings, carrying provenance: update exchange
  - filtered by trust policies applied to the data plus its provenance
  - reconciled to resolve conflicts
  - combined with local curation to produce the updates for this peer (ΔP)

The ORCHESTRA CDSS and Update Exchange [Green, Karvounarakis, Ives, Tannen 07]
- Sites make updates offline that we want to propagate "downstream" (including deletions of data)
- Approach: encode the edit history in relations describing net effects on the data
  - Local contributions of new data to the system (e.g., U_l)
  - Local rejections of data imported from elsewhere (e.g., U_r)
- Schema mappings are extended to relate these relations
  m1: B(i, n) :- G(i, c, n)
  m2: U(n, c) :- G(i, c, n)
  m3: B(i, n) :- B(i, c), U(n, c)
- Annotations called trust conditions specify what data is trusted, and by whom
  - Example: uBio distrusts data from GUS along m2
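As a small illustration of this encoding (a sketch with invented data; the relation names U_l and U_r follow the slide), local edits are recorded in the contribution and rejection relations rather than applied destructively:

    # Recording edits in local-contribution / local-rejection relations.
    U_l = set()                      # tuples uBio contributed itself
    U_r = set()                      # imported tuples uBio has rejected

    def insert_local(t):
        U_l.add(t)                   # a local insertion is a local contribution

    def delete_local(t, was_imported):
        if was_imported:
            U_r.add(t)               # deleting imported data = rejecting it, so
        else:                        # the rejection survives later update exchange
            U_l.discard(t)

    insert_local(("house mouse", "Mus musculus"))
    delete_local(("brown rat", "Rattus norvegicus"), was_imported=True)
    print(U_l, U_r)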

Computing an Instance in Update Exchange
- Extend the mappings into a program that recomputes the target:
  G(i, c, n) :- G_l(i, c, n)
  B(i, n) :- B_l(i, n)
  m1: B(i, n) :- G(i, c, n), ¬B_r(i, n)
  m3: B(i, n) :- B(i, c), U(n, c), ¬B_r(i, n)
  U(n, c) :- U_l(n, c)
  m2: U(n, c) :- G(i, c, n), ¬U_r(n, c)
- Run the extended mappings recursively until fixpoint, to compute the target
- Without deletions, this yields the canonical universal solution [Fagin+ 04], as with the chase
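The following is a runnable sketch (plain Python, invented sample data; relation and rule names follow the slide) of a naive fixpoint evaluation of this extended program:

    # Naive fixpoint evaluation of the extended update-exchange program.
    G_l = {(3, "A", "Z")}                       # GUS local contributions
    B_l, B_r = set(), set()                     # BioSQL local contributions / rejections
    U_l, U_r = {("W", "Z")}, set()              # uBio  local contributions / rejections

    G, B, U = set(), set(), set()
    changed = True
    while changed:                              # iterate until nothing new is derived
        new_G = G | G_l
        new_U = (U | U_l | {(n, c) for (i, c, n) in new_G}) - U_r        # m2
        new_B = (B | B_l | {(i, n) for (i, c, n) in new_G}) - B_r        # m1
        # m3: join B(i, c) with U(n, c) to derive B(i, n)
        new_B |= {(i, n) for (i, c) in new_B for (n, c2) in new_U if c == c2} - B_r
        changed = (new_G, new_B, new_U) != (G, B, U)
        G, B, U = new_G, new_B, new_U

    print(sorted(B))                            # [(3, 'W'), (3, 'Z')]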

Beyond the Basic Update Exchange Program
- Can generalize to perform incremental propagation given new updates
  - Propagate updates downstream [Green+ 07]
  - Propagate updates back to the original "base" data [Karvounarakis & Ives 08]
  - Can involve a human in the loop – Youtopia [Kot & Koch 09]
- But what if not all data is equally useful? What if some sources are more authoritative than others?
  - We need a record of how we mapped the data (and the updates)

Provenance from Mappings
Given our mappings:
  (m1) G(i, c, n) → B(i, n)
  (m2) G(i, c, n) → U(n, c)
  (m3) B(i, c) ∧ U(n, c) → B(i, n)
and the local contributions:
  p1: B(3, A)    p2: U(Z, A)    p3: G(3, A, Z)

Provenance from Mappings
Given our mappings:
  (m1) G(i, c, n) → B(i, n)
  (m2) G(i, c, n) → U(n, c)
  (m3) B(i, c) ∧ U(n, c) → B(i, n)
we can record a graph of tuple derivations:
[Diagram: derivation graph over G, B, U relating p3: G(3, A, Z), p1: B(3, A), and p2: U(Z, A) to the derived tuples, via mappings m1, m2, and m3]
- This can be formalized as polynomial expressions in a semiring [Green+ 07]
- Note: U(Z, A) is true if p2 is correct, or if m2 is valid and p3 is correct
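As a hedged illustration (not ORCHESTRA's implementation), the derivation alternatives for U(Z, A) can be written as the polynomial p2 + m2·p3 and evaluated in the Boolean semiring to answer "is this tuple still justified?":

    # Provenance of U(Z, A) as a polynomial: p2 + m2*p3 (sum of alternative
    # derivations; product of the ingredients within one derivation).
    prov_U_ZA = [frozenset({"p2"}), frozenset({"m2", "p3"})]

    def holds(prov, valuation):
        """Evaluate in the Boolean semiring: + becomes OR, * becomes AND."""
        return any(all(valuation[token] for token in term) for term in prov)

    # p2 wrong, but m2 valid and p3 correct: the tuple is still justified.
    print(holds(prov_U_ZA, {"p2": False, "m2": True, "p3": True}))    # True
    # p2 wrong and m2 distrusted: the tuple loses all its justifications.
    print(holds(prov_U_ZA, {"p2": False, "m2": False, "p3": True}))   # False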

From Provenance (and Data), Trust
- Each peer's administrator assigns a priority to incoming updates, based on their provenance (and value)
- Examples of trust conditions for peer uBio:
  - Distrusts data that comes from GUS along mapping m2
  - Trusts data derived from m4 with id < 100, with priority 2
  - Trusts data directly inserted by BioSQL, with priority 1
- ORCHESTRA uses priorities to determine a consistent instance for the peer – higher priority is preferred
- But how does trust compose, along chains of mappings and when updates are batched into transactions?
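A small sketch (hypothetical helper, not ORCHESTRA's API) of how the trust conditions above could be evaluated against an update's value and a simplified one-step provenance record:

    # uBio's example trust conditions as priority rules (0 = distrusted).
    def ubio_priority(value, source, mapping):
        if source == "GUS" and mapping == "m2":
            return 0                 # distrust data from GUS along mapping m2
        if mapping == "m4" and value[0] < 100:
            return 2                 # trust data derived via m4 with id < 100
        if source == "BioSQL" and mapping is None:
            return 1                 # trust data directly inserted by BioSQL
        return 1                     # "open" default: trust unless stated otherwise

    print(ubio_priority(("rat", "Rattus norvegicus"), "GUS", "m2"))    # 0
    print(ubio_priority((42, "house mouse"), "BioSQL", "m4"))          # 2
    print(ubio_priority(("mouse", "Mus musculus"), "BioSQL", None))    # 1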

Trust across Compositions of Mappings
- An update receives the minimum trust along a sequence (path) of mappings, and the maximum trust across alternative derivation paths
- Example: uBio trusts GUS but distrusts mapping m2
[Diagram: derivation graph from p3: G(3, A, Z) and p1: B(3, A) through mappings m1, m2, m3]
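A minimal sketch of this composition rule (illustrative trust scores and derivation paths; a real tuple's paths come from its provenance graph):

    # Trust of a derived tuple: min along each derivation path, max across paths.
    paths = [                        # alternative derivations of some tuple t
        ["p1"],                      # asserted directly at a peer
        ["p3", "m1"],                # derived from p3 via mapping m1
        ["p3", "m2", "m3"],          # derived from p3 via m2 and then m3
    ]
    trust = {"p1": 1, "p3": 3, "m1": 2, "m2": 0, "m3": 2}   # uBio distrusts m2

    score = max(min(trust[step] for step in path) for path in paths)
    print(score)    # 2 -- the m1 path wins; the path through distrusted m2 scores 0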

Trust across Transactions [Taylor, Ives 06]
- Updates may occur in atomic "transactions": sets of updates to be considered atomically
  - e.g., insertion of a tree-structured item; replacement of an object
- Each peer individually reconciles among the conflicting transactions that it trusts
  - We assign a transaction the priority of its highest-priority update
  - A transaction may have read/write dependencies on previous transactions (antecedents)
  - The peer chooses transactions in decreasing order of priority
  - The effects of all antecedents must be applicable in order to accept a transaction
- This automatically resolves conflicts for portions of the data where a complete ordering can be given statically
- Each peer gets its own unique instance, due to local trust policies
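A sketch of the reconciliation loop (invented transactions and a simplistic write-write conflict test; ORCHESTRA's actual algorithm tracks read/write dependencies more carefully):

    # Per-peer reconciliation sketch.  Each transaction: (id, priority of its
    # highest-priority update, antecedent ids, set of keys it writes).
    txns = [
        ("t1", 3, [],     {"k1"}),
        ("t2", 2, ["t1"], {"k2"}),
        ("t3", 2, [],     {"k1"}),   # conflicts with t1 on key k1
        ("t4", 1, ["t3"], {"k3"}),   # antecedent t3 will be rejected
    ]

    accepted, written = [], set()
    for tid, prio, antecedents, writes in sorted(txns, key=lambda t: -t[1]):
        # accept only if every antecedent was applied and no write conflicts
        if all(a in accepted for a in antecedents) and not (writes & written):
            accepted.append(tid)
            written |= writes
    print(accepted)   # ['t1', 't2']: t3 loses the conflict, so t4 is also rejected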

ORCHESTRA Engine [Green+ 07, Karvounarakis & Ives 08, Taylor & Ives 09]
[Diagram: mappings are compiled into an (extended) Datalog program; a fixpoint layer issues SQL queries (plus recursion and sequencing) over data and provenance stored in RDBMS tables, on a single RDBMS or a distributed query processor; updates from users become updates to the data and provenance tables]

How the CDSS Addresses the Challenges of Scientific Data Sharing
- A scientific database site is often not just a source, but a portal
- Sites want to share data by "approximate synchronization"
- Change is prevalent everywhere
- Different data sources have different levels of authority

Change Is the Only Constant
As noted previously:
- Data changes: updates, annotations, cleaning, curation
- Schema changes: evolution to new concepts
- The set of sources and mappings changes

Change Is the Only Constant
As noted previously:
- Data changes: updates, annotations, cleaning, curation
  - Handled by update exchange and reconciliation
- Schema changes: evolution to new concepts
  - Handled by adding each schema version as a peer and mapping to it
- The set of sources and mappings changes
  - This may have a cascading effect on the contents of all peers!

The ORCHESTRA "Core" Enables Us to Consider Many New Questions
- To this point: the basic "core" of ORCHESTRA
  - Data and update transformations via update exchange
  - Provenance-based trust and conflict resolution
  - Handling of changes to the mappings
- Many new questions are motivated by building on this core:
  - How do we assess and exploit sites' authority?
  - How can we harness history and provenance?
  - How can we point users to the "right" data?

How the CDSS Addresses the Challenges of Scientific Data Sharing
- A scientific database site is often not just a source, but a portal
- Sites want to share data by "approximate synchronization"
- Change is prevalent everywhere
- Different data sources have different levels of authority

Authority Plays a Big Role in Science
- Some sites fundamentally have higher-quality data, or data that agrees more with "our" perspective
- We'd like to be able to determine:
  - Whom each peer should trust
  - Whom we should use to answer a user's "global" queries about information – i.e., queries where the user isn't looking through the lens of a single portal
- Our approach: learn authority from user queries, and potentially use that to determine trust levels

Querying When We Don't Have a Preferred Peer: The Q System [Talukdar+ 08]
- Users may want to query across peers, finding the relations most relevant to them
- Query model: familiar keyword search
  - Keywords → ranked integration (join) queries → answers
  - Learn the source rankings, based on feedback on the answers!

Q: Answering a Keyword Search with the Top Queries
- Given a schema graph:
  - Relations as nodes
  - Associations (mappings, references, etc.) as weighted edges
- And a set of keywords (e.g., a, e, f):
  - Compute the top-scoring trees matching the keywords
  - Execute the resulting queries Q1 ∪ Q2 as ranked join queries
[Diagram: a schema graph over nodes a–f; the top tree Q1 has rank 1 / cost 0.1, the next tree Q2 has rank 2 / cost 0.2]
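Here is a small brute-force sketch of the ranking step (edge weights and the toy schema graph are invented; Q computes these at scale with learned weights rather than exhaustive enumeration):

    # Find the cheapest connected subgraphs of the schema graph that cover
    # all relations matching the keywords; each one corresponds to a join query.
    from itertools import combinations

    edges = {("a", "b"): 0.1, ("b", "c"): 0.3, ("b", "d"): 0.2,
             ("d", "e"): 0.1, ("d", "f"): 0.1, ("c", "e"): 0.4}
    keyword_nodes = {"a", "e", "f"}          # relations that match the keywords

    def connected(node_set, edge_set):
        seen, stack = set(), [next(iter(node_set))]
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            stack += [v for (u, v) in edge_set if u == n]
            stack += [u for (u, v) in edge_set if v == n]
        return node_set <= seen

    candidates = []
    for r in range(1, len(edges) + 1):
        for chosen in combinations(edges, r):
            covered = {n for e in chosen for n in e}
            if keyword_nodes <= covered and connected(covered, chosen):
                candidates.append((sum(edges[e] for e in chosen), chosen))

    for cost, tree in sorted(candidates)[:2]:   # the two best-ranked join queries
        print(round(cost, 2), tree)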

Getting User Feedback
- The system determines "producer" queries using provenance
[Diagram: answer tuples labeled with their producer queries Q1, Q1,2 (both), and Q2]

Learning New Weights
- Change the edge weights so that Q2 becomes "cheaper" than Q1 – using the MIRA algorithm [Crammer+ 06]
[Diagram: the trees Q1 (rank 1, cost 0.1) and Q2 (rank 2, cost 0.2) from before; after the weight update, Q2 is ranked first]
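A rough sketch of the weight update in the style of MIRA (toy trees and weights; the real system learns over many features and answers, and MIRA additionally caps the step size):

    # Make the minimal change to edge weights so the user-preferred tree Q2
    # becomes cheaper than the currently top-ranked tree Q1.
    weights = {"a-b": 0.1, "b-c": 0.3, "b-d": 0.2, "d-e": 0.1, "d-f": 0.1}
    Q1 = ["a-b", "b-c"]                  # currently top-ranked query tree
    Q2 = ["a-b", "b-d", "d-e"]           # tree the user's feedback prefers

    feat = lambda tree: {e: tree.count(e) for e in weights}   # edge-count features
    f1, f2 = feat(Q1), feat(Q2)
    diff = {e: f2[e] - f1[e] for e in weights}

    margin = 0.1                          # how much cheaper Q2 should become
    loss = sum(weights[e] * diff[e] for e in weights) + margin
    norm = sum(v * v for v in diff.values())
    tau = max(0.0, loss / norm) if norm else 0.0   # smallest step fixing the ranking
    for e in weights:
        weights[e] -= tau * diff[e]       # lower Q2's extra edges, raise Q1's

    cost = lambda tree: sum(weights[e] for e in tree)
    print(round(cost(Q1), 3), round(cost(Q2), 3))   # 0.433 0.333 -> Q2 now ranks first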

Does It Work? Evaluation on Bioinformatics Schemas
- Can we learn to give the best answers, as determined by experts?
  - Series of 25 queries over 28 relations from BioGuide [Cohen-Boulakia+ 07]
  - After feedback on 40-60% of the queries, Q finds the top query for all remaining queries on its first try!
  - For each individual query, feedback on a single item is enough to learn the top query
- Can it scale?
  - Generated top queries at interactive rates for ~500 relations (the biggest real schemas we could get)
- Now: the goal is real user studies

Recap: The CDSS Paradigm
- Support loose, evolving confederations of sites, which each:
  - Freely determine their own schemas, curation, and updates
  - Exchange data they agree about; diverge where they disagree
  - Have policies about which data is "admitted," based on authority and trust
- Feedback and machine learning – and data-centric interactions with users – are key

A Diverse Body of Related Work
- Incomplete and uncertain information: [Imielinski & Lipski 84], [Sadri 98], [Dalvi & Suciu 04], [Widom 05], [Antova+ 07]
- Integrated data provenance: [Cui & Widom 01], [Buneman+ 01], [Bhagwat+ 04], [Widom+ 05], [Chiticariu & Tan 06], [Green+ 07]
- Mapping updates across schemas:
  - View update: [Dayal & Bernstein 82], [Keller 84, 85], Harmony, Boomerang, …
  - View maintenance: [Gupta & Mumick 95], [Blakeley 86, 89], …
  - Data exchange: [Miller et al. 01], [Fagin et al. 04, 05], …
- Peer data management: [Halevy+ 03, 04], [Kementsietsidis+ 04], [Bernstein+ 02], [Calvanese+ 04], [Fuxman+ 05]
- Search in DBs: [Bhalotia+ 02], [Kacholia+ 05], [Hristidis & Papakonstantinou 02], [Botev & Shanmugasundaram 05]
- Authority and rank: [Balmin+ 04], [Varadarajan+ 08], [Kasneci+ 08]
- Learning mashups: [Tuchinda & Knoblock 08]