Change-Centric Management of Versions in an XML Warehouse
  Amélie Marian, Columbia University
  Serge Abiteboul, Grégory Cobéna, Laurent Mignet, INRIA-Rocquencourt
VLDB, Sept 2001 · Amélie Marian
Overview
  The Xyleme Project
  Change Management
  Version Management
    – XIDs
    – XML Diff
    – Deltas
    – Storage of XML document versions
    – Implementation and experiments
The Xyleme Project
  A dynamic XML data warehouse with high-level services:
    – User-friendly query engine
    – Semantic data integration
    – Version management
    – Query subscription and change monitoring services
  The Xyleme project is now finished
  There is a start-up, also called Xyleme
Change Management
  Version Management
  Learning about Changes
  Monitoring Changes: Query Subscription
  Querying the Past: Temporal Queries
Version Management
  Our requirements:
    – Obtain the current version
    – Get the modifications since time t
    – Subscribe to change notifications, query changes
    – Compute temporal queries
    – Rebuild the version V_i of a document at time t_i
Getting the Documents
  XML documents are fetched from the web
  We only have snapshots of the documents
  [Figure: two snapshots of a Catalog tree; each Pr node has an N child and a P child]
    – Version 1: (Camera, 300), (TV, 100), (VCR, 200)
    – Version 2: (TV, 100), (DVD, 500), (VCR, 150)
XIDs
  Unique identifiers are needed to track XML nodes through time:
    – Track changes on a specific node (e.g., a product in a catalog)
    – Reconstruct the history of a node
  But physically adding an ID attribute to each node is expensive in storage
  XIDs attach persistent IDs to every node in a storage-efficient manner
XIDs
  XIDs are stored separately from the document, as a list (the XID-map):
    – List of the node IDs in a postorder traversal of the tree
    – XIDnext: gives the next available XID
  Compact representation
  The document is not modified
  Example XID-map: (1-3,14-15,7-13|16), i.e. postorder XIDs 1-3, 14-15, 7-13 and XIDnext = 16
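The range notation of the XID-map can be expanded and recompressed mechanically. The following is a minimal sketch, assuming the textual form shown on the slide (comma-separated ranges, then `|`, then XIDnext); the function names `parse_xid_map` and `format_xid_map` are hypothetical, and the handling of a single XID written without a dash is an assumption.

```python
def parse_xid_map(text):
    """Parse '(1-3,14-15,7-13|16)' into (postorder XID list, next XID)."""
    body = text.strip()[1:-1]              # drop the surrounding parentheses
    ranges, next_xid = body.split("|")
    xids = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo                      # assumption: a lone XID has no dash
        xids.extend(range(int(lo), int(hi) + 1))
    return xids, int(next_xid)


def format_xid_map(xids, next_xid):
    """Compress a postorder XID list back into the range notation."""
    runs = []
    for x in xids:
        if runs and x == runs[-1][1] + 1:
            runs[-1][1] = x                # extend the current run
        else:
            runs.append([x, x])            # start a new run
    parts = [f"{a}-{b}" if a != b else str(a) for a, b in runs]
    return "(" + ",".join(parts) + "|" + str(next_xid) + ")"


xids, next_xid = parse_xid_map("(1-3,14-15,7-13|16)")
print(len(xids), next_xid)
print(format_xid_map(xids, next_xid))
```

Because the list follows the postorder traversal, the i-th XID belongs to the i-th node visited, so the document itself never has to carry the identifiers.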
XML Diff
  We implemented an XML diff algorithm to compute the changes between two versions of a document:
    – Uses the XML structure for matching
    – Content matching
  Linear in the size of the document
  XML diff has two roles:
    – Match nodes
    – Build the delta
  Ongoing work on improving the XML diff
Node Matching using a Diff Algorithm
  XID-map of Version 1: (1-16|17)
  Diff(V1, V2):
    – delete(5)
    – update(13,150)
    – insert(16,2,(17-21))
  New XID-map: (6-10,17-21,11-16|22)
  [Figure: the Catalog trees of Version 1 and Version 2, annotated with Delete, Update, and Insert]
    – Version 1: (Camera, 300), (TV, 100), (VCR, 200)
    – Version 2: (TV, 100), (DVD, 500), (VCR, 150)
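The XID-maps on this slide can be checked by hand. The sketch below rebuilds Version 1 as a small in-memory tree, applies the three operations, and lists the XIDs in postorder. It assumes that text nodes carry their own XIDs (which gives the 16 nodes implied by the map `(1-16|17)`); the `Node` class and the helper names are hypothetical and are not the Xyleme implementation.

```python
class Node:
    def __init__(self, xid, label, children=(), value=None):
        self.xid, self.label, self.value = xid, label, value
        self.children = list(children)


def product(first_xid, name, price):
    """Build a Pr subtree whose five nodes get consecutive postorder XIDs."""
    n = Node(first_xid + 1, "N", [Node(first_xid, "#text", value=name)])
    p = Node(first_xid + 3, "P", [Node(first_xid + 2, "#text", value=price)])
    return Node(first_xid + 4, "Pr", [n, p])


def find(node, xid, parent=None):
    """Return (node, parent) for the given XID, or (None, None)."""
    if node.xid == xid:
        return node, parent
    for child in node.children:
        hit, par = find(child, xid, node)
        if hit is not None:
            return hit, par
    return None, None


def postorder(node):
    for child in node.children:
        yield from postorder(child)
    yield node.xid


def delete(root, xid):
    node, parent = find(root, xid)
    parent.children.remove(node)


def update(root, xid, value):
    find(root, xid)[0].value = value


def insert(root, parent_xid, position, subtree):
    # Positions are 1-based, as on the slide.
    find(root, parent_xid)[0].children.insert(position - 1, subtree)


# Version 1: Catalog (XID 16) with three Pr subtrees (XIDs 1-5, 6-10, 11-15).
doc = Node(16, "Catalog", [product(1, "Camera", "300"),
                           product(6, "TV", "100"),
                           product(11, "VCR", "200")])

delete(doc, 5)                                  # delete(5)
update(doc, 13, "150")                          # update(13,150)
insert(doc, 16, 2, product(17, "DVD", "500"))   # insert(16,2,(17-21))

new_xids = list(postorder(doc))
print(new_xids)
```

The resulting postorder list is 6-10, 17-21, 11-16, which matches the new XID-map on the slide once the next free XID, 22, is appended.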
Edit-Scripts = SEQUENCE
  Sequences of basic operations over XML trees:
    – Delete(n)
    – Update(n, v)
    – Insert(m, k, T)
    – Move(n, k, m)
  An edit script can be applied to a document D if its operations are consistent with D
  An edit script applied to a document D results in a unique document D′
  Several edit scripts applied to a document D can result in the same document D′
Deltas (Δ) = SET
  We introduce an alternative way of representing changes: deltas
  A unit delta Δ_{i,j} contains the set of operations needed to go from V_i to V_j, i.e. Diff(V_i, V_j)
  A delta Δ over a document D is the sequence of unit deltas over D: Δ = {Δ_{1,2}, ..., Δ_{k-1,k}}
  There is an (almost) unique delta from V_i to V_j
  We represent deltas as XML documents
Shortcomings of Deltas
  Storage policies:
    a) V_1, Δ_{1,2}, …, Δ_{now-1,now}
    b) Δ_{2,1}, …, Δ_{now,now-1}, V_now
    c) V_1, Δ_{2,1}, …, Δ_{now,now-1}
    d) Δ_{1,2}, …, Δ_{now-1,now}, V_now
  Only a) and b) are lossless
  But we would like to have fast access to:
    – V_now
    – Δ_{i,now}
  Deltas are not reversible and cannot be composed (information on position is missing)
Completed Deltas (Δ+)
  Completed deltas contain more information:
    – Delete(m, k, T)
    – Update(n, ov, nv), with ov the old value and nv the new value
    – Insert(m, k, T)
    – Move(n, k, m, p, q)
  Completed deltas can be reversed and composed
  Completed deltas are in the spirit of some logs in DB systems
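The extra fields are what make reversal possible: each completed operation carries enough information to write down its opposite. The sketch below illustrates this under stated assumptions. The class and field names are hypothetical, subtrees are represented by a plain string label, and the `Move` fields are named by role (source and destination parent and position) rather than following the slide's argument order, which the slide does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delete:            # Delete(m, k, T): T was the k-th child of m
    parent: int
    pos: int
    subtree: str


@dataclass(frozen=True)
class Insert:            # Insert(m, k, T): T becomes the k-th child of m
    parent: int
    pos: int
    subtree: str


@dataclass(frozen=True)
class Update:            # Update(n, ov, nv)
    node: int
    old: str
    new: str


@dataclass(frozen=True)
class Move:              # assumption: both origin and destination are recorded
    node: int
    src_parent: int
    src_pos: int
    dst_parent: int
    dst_pos: int


def invert_op(op):
    """Return the operation that undoes a completed operation."""
    if isinstance(op, Delete):
        return Insert(op.parent, op.pos, op.subtree)
    if isinstance(op, Insert):
        return Delete(op.parent, op.pos, op.subtree)
    if isinstance(op, Update):
        return Update(op.node, old=op.new, new=op.old)
    if isinstance(op, Move):
        return Move(op.node, op.dst_parent, op.dst_pos,
                    op.src_parent, op.src_pos)
    raise TypeError(f"unknown operation: {op!r}")


def invert(delta):
    """Reverse a completed delta, seen as a set of operations."""
    return frozenset(invert_op(op) for op in delta)


# Completed form of the catalog example: Camera removed, VCR price 200 -> 150,
# DVD inserted. The subtree strings are placeholders.
forward = frozenset({Delete(16, 1, "Pr(1-5)"),
                     Update(13, "200", "150"),
                     Insert(16, 2, "Pr(17-21)")})
backward = invert(forward)
print(sorted(map(repr, backward)))
```

A plain `Delete(n)` or `Update(n, v)` could not be inverted this way, since the deleted subtree, its position, and the old value are not recorded.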
Example of XML Δ+
  [XML listing not recovered; surviving fragments: … Camera 300 DVD 500]
Operations on Deltas
  Compute with a version:
    – V_i ∘ Δ+_{i,j} = V_j
    – V_i ∘ Δ_{i,j} = V_j
  Reverse: (Δ+_{i,j})^{-1} = Δ+_{j,i}
  Compose: Δ+_{i,j} ; Δ+_{j,k} = Δ+_{i,k}
  Simplify: Δ+_{i,j} → Δ_{i,j}
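Of these operations, simplification is the easiest to illustrate: it only forgets the fields that a completed operation adds. The sketch below assumes a tuple encoding of operations and a subtree written as `(root_xid, description)`; both are hypothetical conventions chosen for the example, as is the rule that a plain delete names the root XID of the deleted subtree.

```python
def simplify(completed_delta):
    """Turn a completed delta (Δ+) into a plain delta (Δ) by dropping fields."""
    plain = set()
    for op in completed_delta:
        kind = op[0]
        if kind == "delete":
            # ("delete", m, k, T) -> ("delete", n), n being the root of T
            _, _parent, _pos, subtree = op
            plain.add(("delete", subtree[0]))
        elif kind == "update":
            # ("update", n, ov, nv) -> ("update", n, nv)
            _, node, _old, new = op
            plain.add(("update", node, new))
        elif kind == "insert":
            # ("insert", m, k, T) is the same in both forms
            plain.add(op)
        elif kind == "move":
            # ("move", n, k, m, p, q) -> ("move", n, k, m)
            plain.add(op[:4])
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return frozenset(plain)


completed = {
    ("delete", 16, 1, (5, "Pr Camera 300")),
    ("update", 13, "200", "150"),
    ("insert", 16, 2, (21, "Pr DVD 500")),
}
plain = simplify(completed)
print(sorted(plain, key=repr))
```

The mapping goes one way only: the dropped parent, position, and old value cannot be recovered from the plain delta.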
Storage of Versions
  For a document D (or a query result Q), we store:
    – The current version: V_k
    – The XID-map (as text) of V_k
    – The current Δ+ = {Δ+_{1,2}, ..., Δ+_{k-1,k}}
  When a new version k+1 arrives:
    – Compute the XML diff between versions k and k+1 to obtain Δ+_{k,k+1}
    – Replace the current version with V_{k+1}
    – Replace the XID-map
    – Append Δ+_{k,k+1} to Δ+
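The storage scheme above keeps only the latest version plus the forward completed deltas, and older versions are rebuilt by undoing deltas from newest to oldest. The sketch below shows that bookkeeping on a deliberately simplified model: a document is a flat dictionary from XID to text value, so there is no tree, no positions, and no separate XID-map. The `diff` function is a stand-in for the XML diff, and all names (`VersionStore`, `rebuild`, and so on) are hypothetical.

```python
def diff(old, new):
    """Stand-in diff on the flat model; returns a completed delta."""
    delta = set()
    for xid in old.keys() | new.keys():
        if xid not in new:
            delta.add(("delete", xid, old[xid], None))
        elif xid not in old:
            delta.add(("insert", xid, None, new[xid]))
        elif old[xid] != new[xid]:
            delta.add(("update", xid, old[xid], new[xid]))
    return frozenset(delta)


def apply(doc, delta):
    """Apply a completed delta and return the resulting document."""
    out = dict(doc)
    for kind, xid, _old, new in delta:
        if kind == "delete":
            del out[xid]
        else:                                # insert or update
            out[xid] = new
    return out


def invert(delta):
    """Swap inserts with deletes and old values with new values."""
    swap = {"delete": "insert", "insert": "delete", "update": "update"}
    return frozenset((swap[k], xid, new, old) for k, xid, old, new in delta)


class VersionStore:
    def __init__(self, first_version):
        self.current = dict(first_version)   # V_k
        self.deltas = []                     # [Δ+_{1,2}, ..., Δ+_{k-1,k}]

    def add_version(self, new_version):
        self.deltas.append(diff(self.current, new_version))
        self.current = dict(new_version)     # replace the current version

    def rebuild(self, i):
        """Rebuild V_i (1-based) by undoing deltas from newest to oldest."""
        doc = self.current
        for delta in reversed(self.deltas[i - 1:]):
            doc = apply(doc, invert(delta))
        return doc


v1 = {3: "300", 8: "100", 13: "200"}         # prices keyed by XID
v2 = {8: "100", 13: "150", 19: "500"}
v3 = {8: "120", 13: "150", 19: "500"}

store = VersionStore(v1)
store.add_version(v2)
store.add_version(v3)
print(store.rebuild(1))
```

Keeping the current version in full gives fast access to V_now, and reading old versions costs one delta application per step back in time.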
Levels of Versioning
  Full versioning is expensive, so we support different levels of versioning:
    – Full Versioning: V_now + Δ+
    – Partial Versioning: V_now + Δ
    – Last Version Update: V_now + Δ_{now-1,now}
    – Change Support: V_now + XML diff computed for Query Subscription
    – Not Versioned: V_now
Implementation
  Version Manager and XML diff implemented in C++
  A change simulator was implemented for tests
  A GUI was implemented
GUI Interface
  [Figure: screenshot not recovered]
Deltas Statistics
  Reasonable when there are not many modifications
  Relatively expensive for small documents
  Depends on the quality of the diff
Deltas Statistics (2)
  30% of modifications on the document
  From left to right:
    – Snapshots
    – Completed deltas
    – Deltas: composition and previous-version reconstruction are not possible
    – Composed completed deltas: advantages of completed deltas, but coarser granularity and higher cost
Conclusion
  Management of versions based on change representation:
    – Representation in tree data (XML)
    – Study of storage policies
    – Implementation of running prototypes
  Completed deltas: a set of modifications
    – Mathematical properties of completed deltas (algebraic group)
  Current work on query subscription, continuous queries, and changes over collections of documents
Example: Edit-Scripts vs. Deltas
  [Figure: Version 0 is P with child A; Version 1 is P with children C, B, A]
  A possible edit script: Insert(B,1,P), Insert(C,1,P)
  The delta: Insert(B,2,P), Insert(C,1,P)
  Edit scripts: relative position (at the time of the operation)
  Deltas: absolute position (final)
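The difference between relative and absolute positions can be replayed on the children of P. The sketch below is a minimal illustration that handles inserts only; the function names are hypothetical, and applying a delta by ascending final position is one simple strategy chosen for this example, not necessarily how the prototype does it.

```python
def apply_edit_script(children, script):
    """Apply inserts in order; each position refers to the list at that moment."""
    result = list(children)
    for label, position in script:
        result.insert(position - 1, label)    # positions are 1-based
    return result


def apply_delta(children, delta):
    """Apply a set of inserts whose positions are final positions."""
    result = list(children)
    # Inserting in ascending final position puts every node where it ends up.
    for label, position in sorted(delta, key=lambda op: op[1]):
        result.insert(position - 1, label)
    return result


version0 = ["A"]
edit_script = [("B", 1), ("C", 1)]            # Insert(B,1,P), Insert(C,1,P)
delta = {("B", 2), ("C", 1)}                  # Insert(B,2,P), Insert(C,1,P)

print(apply_edit_script(version0, edit_script))
print(apply_delta(version0, delta))
print(apply_edit_script(version0, list(reversed(edit_script))))
```

Both representations yield C, B, A. Reordering the edit script changes the result, whereas the delta has no order to change, which is why it can be treated as a set.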
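Example: Missing Information for Delta Composition (Δ(0,2))
  Deltas do not give information on the parents and positions of deleted elements
  Positions of inserted elements in the composition cannot be computed
  [Figure: Version 0 is P with children C, A; Version 1 is P with children C, B, A; Version 2 is P with children B, D, A]
  Δ(0,1): Insert(B,2,P)
  Δ(1,2): Delete(C), Insert(D,2,P)
  Δ+(1,2): Delete(C,1,P), Insert(D,2,P)
  For example, B is inserted at position 2 by Δ(0,1) but ends up at position 1 in Version 2; only Δ+(1,2) records that the deleted C was at position 1 under P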