1
Rapid, Collaborative Sharing of Dynamic Data
Zachary G. Ives, University of Pennsylvania
with Nicholas Taylor, T. J. Green, Grigoris Karvounarakis, and Val Tannen
North Carolina State University, October 6, 2006
Funded by NSF IIS grants
2
An Elusive Goal: Building a Web of Structured Data
A longtime goal of the computer science field: creating a "smarter" Web (e.g., Tim Berners-Lee's "Semantic Web", 15 years of Web data integration). Envisioned capabilities: link and correlate data from different sources to answer questions that need semantics; provide a convenient means of exchanging data with business partners, collaborators, etc.
3
Why Is This So Hard?
Semantics is a fuzzy concept: different terminology, units, or ways of representing things (e.g., in real estate, "full + half baths" vs. "bathrooms"), and it is difficult to determine and specify equivalences (e.g., conference paper vs. publication: how do they relate precisely?). Linking isn't simply a matter of hyperlinking and counting on a human; instead we need to develop and specify mappings (converters, synonyms). Real data is messy, uncertain, and inconsistent: typos, uncertain data, non-canonical names, data that doesn't fit into a standard form/schema. But (we believe) the data sharing architecture is the big bottleneck.
4
Data Sharing, DB-Style: One Instance to Rule Them All?
Data warehouse/exchange: one schema, one consistent instance. Data integration / peer data management systems: map heterogeneous data into one or a few virtual schemas and remove any data that's inconsistent [Arenas+]. We want to find a consistent instance of the mediated schema (a "virtual global data warehouse"), and often need to remove data to make it consistent. [Figure: queries and results flow through a data integration system, whose mediated schema is connected by a catalog of schema mappings to autonomous data sources.]
5
A Common Need: Partial, Peer-to-Peer Collaborative Data Exchange
Sometimes we need to exchange data in a less rigid fashion: a cell phone directory with our friend's (with different nicknames); citation DBs with different conference abbreviations; restaurant reviews and ratings; scientific databases, where inconsistency or uncertainty are common. "Peer to peer" in that no one DB is all-encompassing or authoritative; participation is totally voluntary and must not impede local work; each peer must be able to override or supplement data from elsewhere.
6
Target Domain: Data Exchange among Bioinformatics DBs & Biologists
Bioinformatics groups and biologists want to share data in their databases and warehouses. Data overlaps: some DBs are specialized, others general (but with data that is less validated). Each source is updated and curated locally; updates are published periodically. We are providing mechanisms to: support local queries and edits to the data in each DB; allow on-demand publishing of updates made to the local DB; import others' updates into each local DB despite different schemas; and accommodate the fact that not all sites agree on the edits! (Not probabilistic: sometimes there is no consensus on the "right" answer!) [Figure: CryptoDB, PlasmoDB, and EBI.]
7
Challenges
Multi-"everything": multiple schemas, multiple peers with instances, multiple possibilities for consistent overall instances. Voluntary participation: a group may publish infrequently, drop off the network, etc.; inconsistency with "the rest of the world" must not prevent the user from doing an operation (unlike CVS or distributed DBs, where consistency with everyone else is always enforced). Conflicts need to be captured at the right granularity: tuples aren't added independently; they are generally part of transactions that may have causal dependencies.
8
Collaborative Data Sharing
Philosophy: rather than enforcing a global instance, support many overlapping instances in many schemas (Conflicts are localized!) Collaborative Data Sharing System (CDSS): Accommodate disagreement with an extended data model Track provenance and support trust policies “Reconcile” databases by sharing transactions Detect conflicts via constraints and incompatible updates Define update translation mappings to get all data into the target schema Based on schema mappings and provenance We are implementing the ORCHESTRA CDSS
9
A Peer’s Perspective of the CDSS
[Figure: participant P_C, with its local RDBMS instance, issues queries and gets answers locally, publishes its updates (Δ_C) to the CDSS (ORCHESTRA), and imports a consistent, trusted subset of the other peers' published data, in P_C's schema.] The user interacts with a standard database; the CDSS coordinates with the other participants: it ensures availability of published updates and finds a consistent set of trusted updates (reconciliation). Updates may first need to be mapped into the target schema.
10
A CDSS Maps among Sources that Each Publish Updates in Transactions
[Figure: three peers, PlasmoDB (P_P, relation R_P in GUSv1), EBI (P_E, relation R_E in MIAME), and CryptoDB (P_C, relation R_C in GUSv3), each publishing sets of insertions and deletions (±Δ_P, ±Δ_E, ±Δ_C) and connected by schema mappings such as m_{E→P} and m_{E↔C}.]
11
Along with Schema Mappings, We Add Prioritized Trust Conditions
[Figure: the same three peers and mappings as before, now annotated with prioritized trust conditions, e.g., "Priority 5 if …" on the mapping m_{E→P} into PlasmoDB, "Priority 1 always" on another input, and "Priority 3 always" on the mapping m_{E↔C} into CryptoDB.]
12
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
Accommodate disagreement with an extended data model Reconcile updates at the transaction level Define update translation mappings to get all data into the target schema
13
Multi-Viewpoint Tables (MVTs): Specialized Conditional Tables + Provenance
Example MVT GUSv1:Study(A,B): the tuple (a, b) has provenance Peer1:tup1 and viewpoint set {Peer1, Peer2}; a tuple with B = c has provenance Peer1:tup5 and viewpoint set {Peer1}; a tuple with B = d has provenance Peer3:tup8 and viewpoint set {Peer3, Peer4}. Each peer's instance is the subset of tuples whose viewpoint set contains the peer's name. The reconciling peer's trust conditions assign priorities based on data, provenance, and viewpoint set, e.g. (datalog-style rule body; the trailing number is the priority):
Peer2:Study(A,B) :- { (GUSv1:Study(A,B; prv, _, _) & contains(prv, Peer1:*); 5),
                      (GUSv1:Study(A,B; _, vpt, _) & contains(vpt, Peer3); 2) }
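A minimal sketch (invented class and field names, not ORCHESTRA code) of an MVT tuple, a peer's local instance, and a trust condition in the spirit of the rule above:

```java
import java.util.*;

// Hypothetical sketch of a multi-viewpoint table (MVT): each tuple carries
// provenance and a viewpoint set; a peer's instance is the subset of tuples
// whose viewpoint set contains that peer.
public class MvtSketch {
    record MvtTuple(List<Object> values, String provenance, Set<String> viewpoints) {}

    // The peer's local instance: keep only the tuples this peer "sees".
    static List<MvtTuple> instanceFor(String peer, List<MvtTuple> mvt) {
        return mvt.stream().filter(t -> t.viewpoints().contains(peer)).toList();
    }

    // A trust condition in the spirit of the slide's rule for Peer2:
    // priority 5 if the provenance comes from Peer1, priority 2 if Peer3
    // appears in the viewpoint set, otherwise untrusted (priority 0).
    static int priorityForPeer2(MvtTuple t) {
        if (t.provenance().startsWith("Peer1:")) return 5;
        if (t.viewpoints().contains("Peer3")) return 2;
        return 0;
    }

    public static void main(String[] args) {
        List<MvtTuple> study = List.of(
            new MvtTuple(List.of("a", "b"), "Peer1:tup1", Set.of("Peer1", "Peer2")),
            new MvtTuple(List.of("a", "d"), "Peer3:tup8", Set.of("Peer3", "Peer4")));
        System.out.println(instanceFor("Peer2", study));        // only the first tuple
        study.forEach(t -> System.out.println(priorityForPeer2(t)));
    }
}
```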
14
Summary of MVTs
MVTs give us one representation for disagreeing data instances, which is necessary for expressing constraints among different data sources. Really, we focus on updates rather than data: relations of deltas (tuple edits), as opposed to the tuples themselves.
15
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
Accommodate disagreement with an extended data model Reconcile updates at the transaction level Define update translation mappings to get all data into the target schema
16
CDSS Reconciliation [Taylor+Ives SIGMOD06]
Operations are between one participant and "the system": publishing and reconciliation. During reconciliation the participant applies a consistent subset of updates and may get its own unique instance. [Figure: a participant, with its local instance and update log, publishes new updates to the ORCHESTRA system and issues reconciliation requests, receiving published updates in return.]
17
Challenges of Reconciliation
Updates occur in atomic transactions. Transactions have causal dependencies (antecedents). Peers may participate intermittently, which requires us to make maximal progress at each step.
18
Ground Rules of Reconciliation
Clearly, we must not: apply a transaction without having the data it depends on (i.e., we need its antecedents); apply a transaction chain that causes constraint violations; or apply two transaction chains that affect the same tuple in incompatible ways. Also, we believe we should: exhibit consistent, predictable behavior to the user; treat updates monotonically (transaction acceptances are final); always prefer higher-priority transactions; make progress despite conflicts with no clear winner; and allow user conflict resolutions to be deferred.
19
Reconciliation in ORCHESTRA
[Worked example (figure): relation R(X,Y) with functional dependency X → Y; transaction priorities high/medium/low; decisions: accept (✓), reject (✗), or defer. Over two reconciliations, the peer accepts the highest-priority transactions (and any necessary antecedents); the conflicting updates +(D,8) and +(D,9) have no clear winner, so key D is deferred (added to the deferred-keys set); in reconciliation 2, the low-priority transaction cannot be accepted by itself.]
20
Transaction Chains
A possible problem: transient conflicts.
We flatten chains of antecedent transactions: e.g., Peer 3's chain +(C,6) followed by (C,6) → (D,6) flattens to +(D,6), so it no longer conflicts with Peer 1's +(C,5). Since the updates were made independently, we assume we only want to check compatibility of the chain's end state (see the sketch below).
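A minimal sketch (invented names, not ORCHESTRA code) of flattening a chain of tuple-level updates to its net effect:

```java
import java.util.*;

// Hypothetical sketch: flatten a chain of tuple-level updates so that only
// its net effect is checked for conflicts, avoiding "transient" conflicts
// such as Peer 3's +(C,6) vs. Peer 1's +(C,5) once (C,6) becomes (D,6).
public class FlattenSketch {
    record Tuple(String key, int value) {}

    sealed interface Update permits Insert, Delete, Modify {}
    record Insert(Tuple t) implements Update {}
    record Delete(Tuple t) implements Update {}
    record Modify(Tuple from, Tuple to) implements Update {}

    // Returns the net set of tuples inserted by the chain
    // (deletions of pre-existing tuples would be handled analogously).
    static Set<Tuple> flatten(List<Update> chain) {
        Set<Tuple> inserted = new LinkedHashSet<>();
        for (Update u : chain) {
            if (u instanceof Insert i) inserted.add(i.t());
            else if (u instanceof Delete d) inserted.remove(d.t());
            else if (u instanceof Modify m) { inserted.remove(m.from()); inserted.add(m.to()); }
        }
        return inserted;
    }

    public static void main(String[] args) {
        // Peer 3's chain from the slide: +(C,6), then (C,6) -> (D,6).
        List<Update> peer3 = List.of(
            new Insert(new Tuple("C", 6)),
            new Modify(new Tuple("C", 6), new Tuple("D", 6)));
        System.out.println(flatten(peer3));   // net effect: only +(D,6)
    }
}
```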
21
Flattening and Antecedents
[Worked example (figure): over R(X,Y) with X → Y, chains of antecedent transactions (priorities high/medium/low) are flattened into their net updates before decisions are made; e.g., the chain +(A,1), +(B,3), +(F,4) followed by (B,3) → (B,4) and (C,5) → (E,5), with antecedent +(C,5), flattens to +(A,1), +(B,4), +(F,4), +(E,5). Decisions: accept ✓, reject ✗, defer.]
22
Reconciliation Algorithm: Greedy, Hence Efficient
Input: flattened, trusted, applicable transaction chains
Output: a set A of accepted transactions
For each priority p from pmax down to 1:
  Let C be the set of chains at priority p
  If some t in C conflicts with a non-subsumed u in A, REJECT t
  If some t in C uses a deferred value, or conflicts with a non-subsumed, non-rejected u in C, DEFER t
  Otherwise, ACCEPT t by adding it to A
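A minimal sketch of this greedy loop, with assumed helper predicates (conflictsWith, subsumedBy, usesDeferredValue) standing in for ORCHESTRA's actual checks:

```java
import java.util.*;

// Hypothetical sketch of the greedy reconciliation loop: process flattened,
// trusted, applicable transaction chains from highest to lowest priority.
public class GreedyReconcile {
    enum Decision { ACCEPT, REJECT, DEFER }

    interface Txn {
        int priority();
        boolean conflictsWith(Txn other);                 // incompatible effect on some tuple
        boolean subsumedBy(Txn other);                    // other already covers this effect
        boolean usesDeferredValue(Set<String> deferredKeys);
        Set<String> keys();
    }

    static Map<Txn, Decision> reconcile(List<Txn> chains, int pMax, Set<String> deferredKeys) {
        Map<Txn, Decision> result = new LinkedHashMap<>();
        List<Txn> accepted = new ArrayList<>();
        for (int p = pMax; p >= 1; p--) {
            final int prio = p;
            List<Txn> c = chains.stream().filter(t -> t.priority() == prio).toList();
            for (Txn t : c) {
                if (accepted.stream().anyMatch(u -> t.conflictsWith(u) && !u.subsumedBy(t))) {
                    result.put(t, Decision.REJECT);       // loses to an already-accepted chain
                } else if (t.usesDeferredValue(deferredKeys)
                        || c.stream().anyMatch(u -> u != t && t.conflictsWith(u)
                               && !u.subsumedBy(t) && result.get(u) != Decision.REJECT)) {
                    result.put(t, Decision.DEFER);        // no clear winner at this priority
                    deferredKeys.addAll(t.keys());
                } else {
                    result.put(t, Decision.ACCEPT);
                    accepted.add(t);
                }
            }
        }
        return result;
    }
}
```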
23
ORCHESTRA Reconciliation Module
Java reconciliation algorithm at each participant: poly-time in the size of the update load plus the antecedent chain length. Distributed update store built upon Pastry + BerkeleyDB: stores updates persistently and computes antecedent chains. [Figure: participants publish new updates to the distributed update store, send reconciliation requests, and receive published updates; each participant runs the reconciliation algorithm against its local RDBMS.]
24
Experimental Highlight: Performance Is Adequate for Periodic Reconciliation
Simulated (Zipfian-skewed) update distribution over a subset of SWISS-PROT at each peer (insert/replace workload); 10 peers each publish 500 single-update transactions. Infrequent reconciliation is more efficient; fetch times (i.e., network latency) dominate. [Chart: centralized vs. distributed implementation.]
25
Skewed Updates, Infrequent Changes Don’t Result in Huge Divergence
Effect of the reconciliation interval on synchronicity (synchronicity = average number of values per key). Ten peers each publish 500 single-update transactions; keys follow a Zipfian distribution. Infrequent reconciliation changes synchronicity only slowly. Intuitively, if someone doesn't reconcile often, we would expect his or her state to diverge; we wanted to examine how badly.
26
Summary of Reconciliation
The distributed implementation is practical: we don't really need "real-time" updates, and operation is reasonable (we are currently running 100s of virtual peers). Many opportunities for query processing research (caching, replication). Other experiments (in the SIGMOD06 paper): how much disagreement arises? Transactions with more than 2 updates have negligible impact; adding more peers has a sublinear effect. Performance with more peers: execution time increases linearly. Next, we need all of the data in one target schema…
27
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
Accommodate disagreement with an extended data model Reconcile updates at the transaction level Define update translation mappings to get all data into the target schema
28
Reconciling with Many Schemas
[Figure: as before, participant P_C publishes its updates to the CDSS (ORCHESTRA) and imports a consistent, trusted subset of the other peers' data, in P_C's schema, via its local RDBMS.] Reconciliation needs transactions over the target schema: break transactions into their constituent updates (deltas), tagged with transaction IDs; translate the deltas using the schema mappings; reassemble transactions by grouping deltas with the same transaction ID; then reconcile (see the sketch below).
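A minimal sketch (all class and method names are assumptions, not ORCHESTRA's API) of this pipeline:

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of preparing updates for reconciliation over the
// target schema: split transactions into deltas, translate, and regroup.
public class UpdateTranslationSketch {
    record Delta(String txnId, String relation, boolean isInsert, List<Object> values) {}
    record Transaction(String id, List<Delta> deltas) {}

    interface SchemaMapping {
        // Translate one source-schema delta into zero or more target-schema deltas.
        List<Delta> translate(Delta d);
    }

    static List<Transaction> prepareForReconciliation(List<Transaction> published,
                                                      List<SchemaMapping> mappings) {
        // 1. Break transactions into their constituent deltas (already tagged with txn IDs).
        List<Delta> deltas = published.stream().flatMap(t -> t.deltas().stream()).toList();

        // 2. Translate each delta through every applicable mapping.
        List<Delta> translated = deltas.stream()
            .flatMap(d -> mappings.stream().flatMap(m -> m.translate(d).stream()))
            .toList();

        // 3. Reassemble transactions by grouping deltas with the same txn ID;
        //    the result is then handed to reconciliation as before.
        return translated.stream()
            .collect(Collectors.groupingBy(Delta::txnId, LinkedHashMap::new, Collectors.toList()))
            .entrySet().stream()
            .map(e -> new Transaction(e.getKey(), e.getValue()))
            .toList();
    }
}
```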
29
Given a Set of Mappings, What Data Should be in Each Peer’s Instance?
[Figure: peers (e.g., CryptoDB) connected by schema mappings.] PDMS semantics [H+03]: each peer provides all certain answers.
30
Schema Mappings from Data Exchange: A Basic Foundation
Data exchange (Clio group at IBM, esp. Popa and Fagin): schema mappings are tuple-generating dependencies (TGDs), e.g., R(x,y), S(y,z) → ∃w T(x,w,z), U(z,w,y). Chase [PT99] over the sources and TGDs to compute target instances. The resulting instance is a canonical universal solution [FKMP03], and queries over it give all certain answers. Our setting adds some important twists…
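For example (illustrative values, not from the talk): given source tuples R(1,2) and S(2,3), chasing with the TGD above adds T(1, w1, 3) and U(3, w1, 2) to the target, where w1 is a fresh labeled null standing in for the existentially quantified w.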
31
Semantics of Consistency: Input, Edit, and Output Relations
[Figure: for each peer (P_E, P_P, P_C), a relation is represented by an input relation R^i (populated from other peers via schema mappings such as m_{E→P} and m_{E↔C}), an edit table of local insertions (+) and deletions (-), and an output relation R^o that combines the two.]
32
Incremental Reconciliation in a CDSS [Green, Karvounarakis, Tannen, Ives submission]
Re-compute each peer’s instance individually, in accordance with the input-edit-output model Don’t re-compute from scratch Translate all “new” updates into the target schema, maintaining transaction and sequencing info Then perform reconciliation as we described previously This problem requires new twists on view maintenance
33
Mapping Updates: Starting Point
Given schema mappings, e.g.: R(x,y), S(y,z) → ∃w T(x,z), U(z,w,y). Convert these into update translation mappings that convert "deltas" over relations (similar to the rules of the [GM95] count algorithm):
-R(x,y), S(y,z) → ∃w -T(x,z), -U(z,w,y)
R(x,y), -S(y,z) → ∃w -T(x,z), -U(z,w,y)
-R(x,y), -S(y,z) → ∃w -T(x,z), -U(z,w,y)
34
A Wrinkle: Incremental Deletion
Suppose our mapping is R(x,y) → S(x), and we are given R = {(1,2), (1,3), (2,4)}, so S = {1, 2}. Then:
35
A Wrinkle: Incremental Deletion
We want a deletion rule like -R(x,y) → -S(x), but this doesn't quite work: if we delete R(1,2), then S should be unaffected; if we map -R(1,2) to -S(1), we can't delete S(1) yet. Only if we also delete R(1,3) should we delete S(1). The source of the problem is that S(1) has several distinct derivations (similar to bag semantics). [R = {(1,2), (1,3), (2,4)}; S = {1, 2}.]
36
A First Try… Counting [GM95]
(Gupta and Mumick's counting algorithm.) When computing S, add a count of the number of derivations: for R = {(1,2), (1,3), (2,4)} we get S(1) with count 2 and S(2) with count 1. When we use -R(x,y) → -S(x), for each deletion decrement the count, and only remove the tuple when the count reaches 0.
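A minimal sketch (assumed class and method names, not the paper's code) of this counting approach for the view S(x) :- R(x,y):

```java
import java.util.*;

// Hypothetical sketch of [GM95]-style counting for the view S(x) :- R(x,y):
// S keeps a derivation count per value of x; deletions decrement the count
// and remove the tuple only when the count reaches 0.
public class CountingViewSketch {
    private final Map<Integer, Integer> sCounts = new HashMap<>();   // x -> #derivations

    void insertR(int x, int y) {
        sCounts.merge(x, 1, Integer::sum);        // +R(x,y) adds one derivation of S(x)
    }

    void deleteR(int x, int y) {
        sCounts.computeIfPresent(x, (k, c) -> c > 1 ? c - 1 : null);  // remove at count 0
    }

    Set<Integer> s() { return sCounts.keySet(); }

    public static void main(String[] args) {
        CountingViewSketch v = new CountingViewSketch();
        v.insertR(1, 2); v.insertR(1, 3); v.insertR(2, 4);
        v.deleteR(1, 2);
        System.out.println(v.s());   // still contains 1 (its count was 2) and 2
        v.deleteR(1, 3);
        System.out.println(v.s());   // now only 2
    }
}
```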
37
Where This Fails…
Suppose we have a cyclic definition (two peers want to exchange data): M1: R(x,y) → S(y,x); M2: S(x,y) → R(x,y). [Figure: starting from R = {(1,2), (2,4)}, applying M1 gives S = {(2,1), (4,2)}; applying M2 gives R = {(1,2), (2,4), (2,1), (4,2)}; applying M1 again gives S = {(2,1), (4,2), (1,2), (2,4)}; and so on.] How many times is each tuple derived? We need a finite fixpoint, or else this isn't implementable! What happens if R deletes a tuple? If S does?
38
Desiderata for a Solution
Record a trace of each distinct derivation of a tuple, w.r.t. its original relation and every mapping Different from, e.g., Cui & Widom’s provenance traces, which only maintain source info In cyclic cases, only count “identical loops” a finite number of times (say once) This gives us a least fixpoint, in terms of tuples and their derivations … It also requires a non-obvious solution, since we can’t use sets, trees, etc. to define provenance … An idea: think of the derivation as being analogous to a recurrence relation…
39
Our Approach: S-Tables
Trace tuple provenance as a semiring polynomial over (S, +, *, 0, 1), extended with mapping symbols M(…), satisfying:
x + 0 = x; x + x = x; x + y = y + x; (x + y) + z = x + (y + z)
x * 0 = 0; (x * y) * z = x * (y * z); x(y + z) = xy + xz
M(x + y) = M(x) + M(y)
A tuple whose provenance is 0 is considered not to be part of the instance.
Example with M1: R(x,y) → S(y,x) and M2: S(x,y) → R(x,y). [Figure: starting from R = {(1,2): p0 = t0, (2,4): p1 = t1}, mapping M1 adds S = {(2,1): p4 = M1(p0), (4,2): p5 = M1(p1)}; at the fixpoint, S also contains (1,2): p6 = M1(p2) and (2,4): p7 = M1(p3), and R contains (2,1): p2 = M2(p4) and (4,2): p3 = M2(p5), with p0 = t0 + M2(p6) and p1 = t1 + M2(p7).]
40
Incremental Insertion with S-tables
Inserting a tuple t: if there's already an identical tuple t', update the provenance of t' to be prov(t) + prov(t'), then simplify (note the result may be no change!). Otherwise, insert t with its provenance.
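Below is a minimal sketch, under simplifying assumptions (only the + operation and mapping application are modeled; all names are invented for illustration), of provenance expressions with the identities from the previous slide and the insertion rule described here:

```java
import java.util.*;

// Hypothetical sketch of semiring provenance expressions for S-tables:
// expressions are built from base tokens (t0, t1, ...), ZERO, +, and
// mapping applications M(...); insertion merges provenance with +.
public class STableSketch {
    sealed interface Prov permits Zero, Token, Sum, Mapped {}
    record Zero() implements Prov {}
    record Token(String name) implements Prov {}
    record Sum(Set<Prov> terms) implements Prov {}          // set => x + x = x, commutative
    record Mapped(String mapping, Prov arg) implements Prov {}

    static Prov plus(Prov a, Prov b) {
        // Apply x + 0 = x and x + x = x (via the set), flattening nested sums.
        Set<Prov> terms = new LinkedHashSet<>();
        for (Prov p : List.of(a, b)) {
            if (p instanceof Zero) continue;
            if (p instanceof Sum s) terms.addAll(s.terms()); else terms.add(p);
        }
        if (terms.isEmpty()) return new Zero();
        return terms.size() == 1 ? terms.iterator().next() : new Sum(terms);
    }

    // Insertion into an S-table keyed by the tuple's values: if an identical
    // tuple exists, its provenance becomes old + new; otherwise insert as-is.
    static void insert(Map<List<Object>, Prov> table, List<Object> tuple, Prov prov) {
        table.merge(tuple, prov, STableSketch::plus);
    }

    public static void main(String[] args) {
        Map<List<Object>, Prov> r = new LinkedHashMap<>();
        insert(r, List.of(1, 2), new Token("t0"));
        // R(1,2) re-derived through mapping M2 from an S tuple with provenance p6:
        insert(r, List.of(1, 2), new Mapped("M2", new Token("p6")));
        System.out.println(r.get(List.of(1, 2)));   // sum of t0 and M2(p6)
    }
}
```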
41
Deletion
With M1: R(x,y) → S(y,x) and M2: S(x,y) → R(x,y), suppose we are given -R(1,2) and -S(2,4). The delta rules M1: -R(x,y) → -S(y,x) and M2: -S(x,y) → -R(x,y) set p0 and p7 := 0; we then simplify (which may be nontrivial under mutual recursion). [Figure: from R = {(1,2): p0 = t0 + M2(p6), (2,4): p1 = t1 + M2(p7), (2,1): p2 = M2(p4), (4,2): p3 = M2(p5)} and S = {(2,1): p4 = M1(p0), (4,2): p5 = M1(p1), (1,2): p6 = M1(p2), (2,4): p7 = M1(p3)}, simplification leaves R = {(2,4): p1 = t1, (4,2): p3 = M2(p5)} and S = {(4,2): p5 = M1(p1)}.]
42
Summary: S-Tables and Provenance
More expressive than "why and where provenance" [Buneman+ 01], lineage tracing [Cui & Widom 01], and other formalisms. Similar in spirit to "mapping routes" [Chiticariu+ 06] and irrelevant-rule elimination [Levy+ 92]. If the set of mappings has a least fixpoint in datalog, it has one in our semantics. Our polynomial captures all possible derivation paths "through the mappings" (a form of "how provenance" (Tannen)). This gives us a means of performing incremental maintenance in a fully P2P model, even with cycles (that have least fixpoints).
43
Ongoing Work
Implementing the provenance-based maintenance algorithm: the procedure can be cast as a set of datalog rules, but it needs "slightly more" than SQL or stratified datalog semantics. Inverse mappings: we propagate updates "down" a mapping; what about upwards? This is necessary to support mirroring, and provenance makes it quite different from the existing view update literature. Performance: lots of opportunities for caching antecedents, reusing computations across reconciliations, answering queries using views, and multi-query optimization!
44
SHARQ [with Davidson, Tannen, Stoeckert, White]
ORCHESTRA is the core engine of a larger effort in bioinformatics information management: SHARQ (Sharing Heterogeneous, Autonomous Resources and Queries). The goal is to develop a network of database instances, views, query forms, etc. that: is incrementally extensible with new data, views, and query templates; supports search for "the right" query form to answer a question; accommodates a variety of different sub-communities; supports both browsing and searching modes of operation; and perhaps even supports text extraction and approximate matches.
45
Related Work
Incomplete information [Imielinski & Lipski 84], info source tracking [Sadri 98]. Inconsistency repair [Bry 97], [Arenas+ 99]. Provenance [Alagar+ 95], [Cui & Widom 01], [Buneman+ 01], [Widom+ 05]. Distributed concurrency control: optimistic CC [KR 81], version vectors [PPR+ 83], … View update [Dayal & Bernstein 82], [Keller 84, 85], … Incremental maintenance [Gupta & Mumick 95], [Blakeley 86, 89], … File synchronization and distributed filesystems: Harmony [Foster+ 04], Unison [Pierce+ 01]; CVS, Subversion, etc.; Ivy [MMGC 02], Coda [Braam 98, KS 95], Bayou [TTP+ 96], … Uncertain data: Trio [Widom+], MystiQ [Suciu+]. Peer data management systems: Piazza [Halevy+ 03, 04], Hyperion [Kementsietsidis+ 04], [Calvanese+ 04], peer data exchange [Fuxman+ 05], Trento/Toronto LRM [Bernstein+ 02].
46
Conclusions
ORCHESTRA focuses on coordinating disagreement rather than enforcing agreement: accommodate disagreement with an extended data model and trust policies; reconcile updates at the transaction level; define update translation mappings to get all data into the target schema. Ongoing work: implementing update mappings, caching, replication, and biological applications.