ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data. Zachary Ives, Nitin Khandelwal, Aneesh Kapur (University of Pennsylvania); Murat Cakir (Drexel University)


ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data. Zachary Ives, Nitin Khandelwal, Aneesh Kapur (University of Pennsylvania); Murat Cakir (Drexel University). 2nd Conference on Innovative Data Systems Research (CIDR), January 5, 2005

Data Exchange among Bioinformatics Warehouses & Biologists
Different bioinformatics institutes and research groups store their data in separate warehouses with related, “overlapping” data:
 Each source is independently updated, curated locally
 Updates are published periodically in some “standard” schema
 Each site wants to import these changes, maintain a copy of all data
 Individual scientists also import the data and changes, and would like to share their derived results
 Caveat: not all sites agree on the facts! Often, no consensus on the “right” answer!

A Clear Need for a General Infrastructure for Data Exchange
Bioinformatics exchange is done with ad hoc, custom tools – or manually – or not at all!
 (NOT an instance of file sync, e.g., Intellisync, Harmony; or groupware)
It’s only one instance of managing the exchange of independently modified data, e.g.:
 Sharing subsets of contact lists (colleagues with different apps)
 Integrating and merging multiple authors’ BibTeX, EndNote files
 Distributed maintenance of sites like DBLP, SIGMOD Anthology
This problem has many similarities to traditional DBs/data integration:
 Structured or semi-structured data
 Schema heterogeneity, different data formats, autonomous sources
 Concurrent updates
 Transactional semantics

Challenges in Developing Collaborative Data Sharing “Middleware”
 How do we coordinate updates between conflicting collaborators?
 How do we support rapid & transient participation, as in the Web or P2P systems?
 How do we handle the issues of exchanging updates across different schemas?
 These issues are the focus of our work on the ORCHESTRA Collaborative Data Sharing System

Our Data Sharing Model
 Participants create & independently update local replicas of an instance of a particular schema
 Typically stored in a conventional DBMS
 Periodically reconcile changes with those of other participants
 Updates are accepted based on trust/authority – coordinated disagreement
 Changes may need to be translated across mappings between schemas
 Sometimes only part of the information is mapped

The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
 Coordinating updates between disagreeing collaborators
 Allow conflicts, but let each participant specify what data it trusts (based on origin or authority)
 Supporting rapid & transient participation
 Exchanging updates across different schemas

The Origins of Disagreements (Conflicts)
 Each source is individually consistent, but may disagree with others
 Conflicts are the results of mutually incompatible updates applied concurrently to different instances, e.g.:
 Participants A and B have replicas containing different tuples with the same key
 An item is removed from Participant A but modified in B
 A transaction results in a series of values in Participant B, one of which conflicts with a tuple in A

Multi-Viewpoint Tables (MVTs)
Allow unification of conflicting data instances:
 Within each relation, allow participants p, p′ their own viewpoints that may be inconsistent
 Add two special attributes:
 Origin set: set of participants whose data contributed to the tuple
 Viewpoint set: set of participants who accept the tuple (for trust delegation)
 A simple form of data provenance [Buneman+ 01] [Cui & Widom 01], similar in spirit to Information Source Tracking [Sadri 94]
After reconciliation, participant p receives a consistent subset of the tuples in the MVT that:
 Originate in viewpoint p
 Or originate in some viewpoint that participant p trusts
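The origin/viewpoint filtering above can be sketched in a few lines of Python. This is purely illustrative: the function and relation names are invented, the MVT is a plain list of (tuple, origin set, viewpoint set) triples, and the trust rule (accept a tuple if the participant is in its viewpoint, or if its origin intersects the participants it trusts) is one simple reading of the slide, not ORCHESTRA's actual semantics.

```python
# Hypothetical sketch of filtering a Multi-Viewpoint Table (MVT).
# Each MVT entry is (tuple, origin_set, viewpoint_set).

def replica_view(mvt, participant, trusted):
    """Return the tuples this participant's replica accepts: those already
    in its viewpoint, or originating at a participant it trusts."""
    accepted = []
    for t, origin, viewpoint in mvt:
        if participant in viewpoint or origin & trusted:
            accepted.append(t)
    return accepted

mvt = [
    ("a", {"Penn"}, {"Penn"}),        # Penn's own tuple
    ("b", {"ArrayExp"}, {"ArrayExp"}),
    ("c", {"systemsbio"}, set()),
]
# Penn trusts data originating at ArrayExp, so it sees a and b but not c.
print(replica_view(mvt, "Penn", {"Penn", "ArrayExp"}))  # ['a', 'b']
```

Conflicting tuples simply coexist in the MVT; each replica's filter decides which side of a disagreement it sees.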

MVTs Allow Coordinated Disagreement
 Each shared schema has an MVT instance
 Each individual replica holds a subset of the MVT
 An instance mapping filters from the MVT, based on viewpoint and/or origin sets
 Only non-conflicting data gets mapped

An Example MVT with 2 Replicas (Looking Purely at Data Instances)
Replica 1 = RAD:Study(t), contains(origin(t), ArrayExp)
Replica 2 = RAD:Study(t), contains(viewpoint(t), Penn)
MVT RAD:Study (t | origin | viewpoint): a | Penn | Penn
Replica 1: {a}; Replica 2: {a}

An Example MVT with 2 Replicas (insertions from elsewhere)
MVT RAD:Study (t | origin | viewpoint): a | Penn | Penn; b | ArrayExp | –; c | systemsbio | –  (b and c are insertions from elsewhere)
Replica 1: {a}; Replica 2: {a}

An Example MVT with 2 Replicas (Replica 1 reconciles)
MVT RAD:Study (t | origin | viewpoint): a | Penn | Penn; b | ArrayExp | –; c | systemsbio | –
The reconciling participant (Replica 1) imports b, since origin(b) contains ArrayExp. Replica 1: {a, b}; Replica 2: {a}

An Example MVT with 2 Replicas (accepted into viewpoint)
MVT RAD:Study (t | origin | viewpoint): a | Penn | Penn; b | ArrayExp | ArrayExp, Penn; c | systemsbio | –
b is accepted into Penn's viewpoint. Replica 1: {a, b}; Replica 2: {a}

An Example MVT with 2 Replicas (Replica 2 reconciles)
MVT RAD:Study (t | origin | viewpoint): a | Penn | Penn; b | ArrayExp | ArrayExp, Penn; c | systemsbio | –
The reconciling participant (Replica 2) now also imports b, since viewpoint(b) contains Penn. Replica 1: {a, b}; Replica 2: {a, b}

An Example MVT with 2 Replicas (Sanger also accepts b)
MVT RAD:Study (t | origin | viewpoint): a | Penn | Penn; b | ArrayExp | ArrayExp, Penn, Sanger; c | systemsbio | –
Replica 1: {a, b}; Replica 2: {a, b}

The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
 Coordinating updates between disagreeing collaborators
 Supporting rapid & transient participation
 Ensure data or updates, once published, are always available regardless of who’s connected
 Exchanging updates across different schemas

Participation in ORCHESTRA is Peer-to-Peer in Nature
Server and client roles for every participant p:
 Maintain a local replica of the data of interest at p
 Maintain a subset of every global MVT relation; perform part of every reconciliation
 Partition the global state and computation across all available participants
 Ensures reliability and availability, even with intermittent participation
Use peer-to-peer distributed hash tables (Pastry [Rowstron & Druschel 01]):
 Relations are partitioned by tuple
 The DHT dynamically reallocates MVT data as nodes join and leave
 Replicates the data so it’s available if nodes disappear
[Figure: local RAD instances at peers P1 and P2, each holding a partition (Study 1, Study 2) of the global RAD:Study MVT]
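A minimal sketch of the placement rule, assuming tuples are assigned to peers by hashing their key. Everything here (function names, the hash-modulo scheme) is illustrative: a real DHT such as Pastry uses prefix routing and handles node churn and replication, which this toy deliberately omits.

```python
import hashlib

def responsible_peer(key, peers):
    """Map a tuple key to one of the currently available peers."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return peers[int(digest, 16) % len(peers)]

def partition(tuples, peers):
    """Split a relation's (key, value) tuples across the peer list."""
    placement = {p: [] for p in peers}
    for key, value in tuples:
        placement[responsible_peer(key, peers)].append((key, value))
    return placement

peers = ["P1", "P2", "P3"]
tuples = [("study1", "..."), ("study2", "..."), ("study3", "...")]
placement = partition(tuples, peers)
# Every tuple lands on exactly one peer.
assert sum(len(v) for v in placement.values()) == len(tuples)
```

Because placement depends only on the key, any peer can locate the tuples (and the published updates) for a key without central coordination.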

Reconciliation of Deltas
Publish, compare, and apply delta sequences:
 Find the set of non-conflicting updates
 Apply them to a local replica to make it consistent with the instance mappings
 Similar to what’s done in incremental view maintenance [Blakeley 86]
Our notation for updates to relation r with tuple t:
 insert: +r(t)
 delete: -r(t)
 replace: Δr(t / t’)

Semantics of Reconciliation
Each peer p publishes its updates periodically:
 Reconciliation compares these with all updates published from elsewhere since the last time p reconciled
What should happen with update “chains”?
 Suppose p changes the tuple A → B → C and another system does D → B → E
 In many models this conflicts – but we assert that intermediate steps shouldn’t be visible to one another
 Hence we remove intermediate steps from consideration
 We compute and compare the unordered sets of tuples removed from, modified within, and inserted into relations
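The chain-collapsing idea above can be sketched as a tiny Python function. This is a simplified reading, not ORCHESTRA's code: a chain is a list of insert/delete/replace operations one peer applied in order to one key, and only its net effect survives, so A → B → C compares as a single replacement A → C.

```python
# Collapse one peer's update chain on a single key into its net effect.
# Ops: ("ins", value), ("del", value), ("rep", old_value, new_value).

def net_update(chain):
    """Return the net operation of an ordered chain, or None if it cancels."""
    first, last = chain[0], chain[-1]
    start = None if first[0] == "ins" else first[1]   # value before the chain
    end = None if last[0] == "del" else last[-1]      # value after the chain
    if start is None and end is None:
        return None                      # inserted then deleted: no-op
    if start is None:
        return ("ins", end)
    if end is None:
        return ("del", start)
    if start == end:
        return None                      # replaced back to the original value
    return ("rep", start, end)

assert net_update([("rep", "A", "B"), ("rep", "B", "C")]) == ("rep", "A", "C")
assert net_update([("ins", "X"), ("del", "X")]) is None
```

Reconciliation then compares only these net updates, so two peers that passed through the same intermediate value never see a spurious conflict.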

Distributed Reconciliation in ORCHESTRA
Initialization:
 Take every shared MVT relation, compute its contents, and partition its data across the DHT
At each participant p:
 Publish all p’s updates to the DHT, based on the key of the data being affected; attach to each update its transaction ID
 Each peer is given the complete set of updates applied to a key – it can compare to find conflicts at the level of the key, and of the transaction
 Updates are applied if there are no conflicts in a transaction
(More details in paper)
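The key-level and transaction-level conflict check can be illustrated as follows, under a deliberately crude assumption: an update is a (transaction ID, key, op) triple, and two transactions conflict whenever they touch the same key. Real conflict detection is finer-grained than this; the names are invented for the sketch.

```python
from collections import defaultdict

def find_conflicting_txns(updates):
    """Transactions that touch a key some other transaction also touches."""
    by_key = defaultdict(set)
    for txn, key, _op in updates:
        by_key[key].add(txn)
    conflicting = set()
    for txns in by_key.values():
        if len(txns) > 1:
            conflicting |= txns      # every txn touching a contested key
    return conflicting

def applicable(updates):
    """Keep only updates from transactions with no conflicts at all."""
    bad = find_conflicting_txns(updates)
    return [u for u in updates if u[0] not in bad]

ups = [("T1", "k1", "+"), ("T2", "k1", "-"), ("T3", "k2", "+")]
print(applicable(ups))  # [('T3', 'k2', '+')]
```

Because the DHT routes all updates for a key to the same peer, that peer sees the complete per-key update set and can run this comparison locally.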

The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
 Coordinating updates between disagreeing collaborators
 Supporting rapid & transient participation
 Exchanging updates across different schemas
 Leverage view maintenance and schema mediation techniques to maintain mapping constraints between schemas

Reconciling Between Schemas
We define update translation mappings in the form of views:
 Automatically derived (see paper) from data integration and peer data management-style schema mappings
 Both forward and “inverse” mapping rules, analogous to forward and inverse rules in data integration
 Define how to compute a set of deltas over a target relation that maintain the schema mapping, given deltas over the source
 Disambiguates among multiple ways of performing the inverse mapping
 Also user-overridable for custom behavior (see paper)

The Basic Approach (Many More Details in Paper)
 For each relation r(t), and each type of operation, define a delta relation containing the set of operations of the specified type to apply:
 deletion: -r(t)
 insertion: +r(t)
 replacement: Δr(t / t’)
 Create forward and inverse mapping rules in Datalog (similar to mapping & inverse rules in data integration) between these delta relations
 Based on view update [Dayal & Bernstein 82] [Keller 85] and view maintenance [Blakeley 86] algorithms, derive queries over deltas to compute updates in one schema from updates (and values) in the other
 The result is a schema mapping between delta relations (sometimes joining with standard relations)

Example Update Mappings
Schema mapping: r(a,b,c) :- s(a,b), t(b,c)
Deletion mapping rules for Schema 1, relation r (forward):
 -r(a,b,c) :- -s(a,b), t(b,c)
 -r(a,b,c) :- s(a,b), -t(b,c)
 -r(a,b,c) :- -s(a,b), -t(b,c)
Deletion mapping rule for Schema 2, relation t (inverse):
 -t(b,c) :- -r(_,b,c)
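To make the three forward deletion rules concrete, here is one executable reading of them in Python, with plain sets standing in for the relations and delta relations. The function name is invented and this is an illustration of the rules for the mapping r(a,b,c) :- s(a,b), t(b,c), not the system's evaluator; in particular it takes s and t to be the pre-deletion states.

```python
# Evaluate the forward deletion rules: r(a,b,c) is deleted when either
# (or both) of the joined source tuples s(a,b), t(b,c) is deleted.

def deleted_r(s, t, ds, dt):
    """s, t: source relations; ds, dt: their deletion delta relations."""
    minus_r = set()
    minus_r |= {(a, b, c) for (a, b) in ds for (b2, c) in t if b == b2}   # -s, t
    minus_r |= {(a, b, c) for (a, b) in s for (b2, c) in dt if b == b2}   # s, -t
    minus_r |= {(a, b, c) for (a, b) in ds for (b2, c) in dt if b == b2}  # -s, -t
    return minus_r

s = {(1, "x"), (2, "y")}
t = {("x", 10), ("y", 20)}
# Deleting s(1, "x") deletes the joined tuple r(1, "x", 10).
print(deleted_r(s, t, ds={(1, "x")}, dt=set()))  # {(1, 'x', 10)}
```

Note the number of forward rules grows with the number of joins in the mapping, since every nonempty subset of the joined relations may carry a deletion, which is the exponential blow-up mentioned in the experiments slide.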

Using Translation Mappings to Propagate Updates across Schemas
We leverage algorithms from Piazza [Tatarinov+ 03]:
 There: answer a query in one schema, given data in mapped sources
 Here: compute the set of updates to MVTs that need to be applied to a given schema, given mappings + changes over other schemas
Peer p reconciles as follows:
 For each relation r in p’s schema, compute the contents of -r, +r, Δr
 “Filter” the delta MVT relations according to the instance mapping rules
 Apply the deletions in -r, replacements in Δr, and insertions in +r
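The final application step can be sketched as below, modeling the replica as a plain set of tuples. This is a hypothetical helper, not the system's code; it applies deletions, then replacements, then insertions, matching the order in which the slide lists them.

```python
# Apply the computed delta relations -r, +r, Δr to a local replica.

def apply_deltas(replica, minus_r, plus_r, delta_r):
    """minus_r, plus_r: sets of tuples; delta_r: dict old tuple -> new tuple."""
    replica -= minus_r                   # deletions in -r
    for old, new in delta_r.items():     # replacements Δr(old / new)
        replica.discard(old)
        replica.add(new)
    replica |= plus_r                    # insertions in +r
    return replica

r = {("a", 1), ("b", 2)}
print(sorted(apply_deltas(r, {("a", 1)}, {("c", 3)}, {("b", 2): ("b", 5)})))
# [('b', 5), ('c', 3)]
```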

Translating the Updates across Schemas – with Transitivity
[Figure: schemas MADAM, TIGR, RAD, GO, SML, and MAGE-ML connected by chains of mappings, so updates propagate transitively through intermediate schemas]

Implementation Status and Early Experimental Results
 The architecture and basic model – as seen in this paper – are mostly set
 Have built several components that need to be integrated:
 Distributed P2P conflict detection substrate (single schema): provides an atomic reconciliation operation
 Update mapping “wizard”: preliminary support for converting “conjunctive XQuery” as well as relational mappings to update mappings
 Experiments with bioinformatics mappings (see paper):
 Generally a limited number of candidate inverse mappings (~1-3) for each relation – easy to choose one
 Number of “forward” rules is exponential in the number of joins
 Main focus: “tweaking” the query reformulation algorithms of Piazza
 Each reconciliation performs the same “queries” – can cache work
 May be able to do multi-query optimization of related queries

Conclusions and Future Work
ORCHESTRA focuses on coordinating disagreement, rather than enforcing agreement:
 Significantly different from prior data sharing and synchronization efforts
 Allows full autonomy of participants – offers scalability, flexibility
Central ideas:
 A new data model that supports “coordinated disagreement”
 Global reconciliation and support for transient membership via a P2P distributed hash substrate
 Update translation using extensions to peer data management and view update/maintenance
Currently working on an integrated system and performance optimization