Version Management for XML Documents: Copy-Based vs Edit-Based Schemes
Shu-Yao Chien, Computer Science Department, University of California, Los Angeles
Vassilis J. Tsotras, Department of Computer Science and Engineering, University of California, Riverside
Carlo Zaniolo, Computer Science Department, University of California, Los Angeles
The Problem
Managing (storing and querying) multiple versions of documents is important for content providers and for cooperative work.
Temporal DBs: transaction time; CAD/OO applications.
Web/XML changes and unifies everything.
Traditional schemes (RCS, SCCS) are not optimized for secondary storage: they provide no temporal clustering.
DB-oriented approaches are not optimized for retrieval of complete documents.
Transport level: the exchange and browser-side processing of multiversion documents is also critical, so the storage and exchange representations need to be reconciled.
Version Management: Approaches
Time stamping of objects.
Store all snapshots: fast retrieval, but excessive storage.
Edit-based schemes store the deltas: minimal storage, but slow retrieval. Traditionally based on line-oriented DIFF, but applied to semistructured objects in Lorel.
Our scheme: Usefulness Based Copy Control (UBCC)
- Separate edit scripts from the objects.
- Temporal clustering of objects using page usefulness.
Example: an Evolving XML Document
[Figure: VERSION 1 and VERSION 2 of the example XML document, each labeled "Order"; the markup itself is not reproduced here.]
Temporal Clustering by Page Usefulness
Usefulness: the percentage of a page occupied by objects from the current version. The rest is occupied by ‘dead’ objects from previous versions.
We set a minimum usefulness requirement, e.g. 50%.
When the usefulness of a page falls below this minimum, we copy its live objects to a new page.
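The usefulness test can be sketched in Python. This is a minimal illustration, not the authors' implementation: `StoredObject`, `Page` and `copy_if_needed` are hypothetical names, usefulness is assumed to be live bytes over occupied bytes, and the old copies are assumed to count as dead once their live objects have moved to the new page.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    name: str
    size: int           # bytes the object occupies on its page
    alive: bool = True  # False once the object is no longer current on this page


@dataclass
class Page:
    objects: list = field(default_factory=list)

    def usefulness(self):
        """Fraction of the occupied space taken by live objects (assumed definition)."""
        total = sum(o.size for o in self.objects)
        if total == 0:
            return 1.0  # an empty page holds no dead objects
        live = sum(o.size for o in self.objects if o.alive)
        return live / total


def copy_if_needed(page, u_min=0.5):
    """Copy the live objects to a new page when usefulness drops below u_min.

    Returns the new page, or None when the page is still useful enough.
    """
    if page.usefulness() >= u_min:
        return None
    new_page = Page([StoredObject(o.name, o.size) for o in page.objects if o.alive])
    for o in page.objects:
        o.alive = False  # assumption: the current copies now live on new_page
    return new_page
```

With four equal-sized objects, one deletion leaves the page at 75% and nothing is copied at a 50% minimum; three deletions leave it at 25% and the single survivor moves to a fresh page.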
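Maintaining Page Usefulness above 70% by Copying Alive Objects
[Figure: VERSION 1 stores objects O1 to O8 in pages P1 and P2. After the deletions (DEL) of VERSION 2, U(P1) = 75% and U(P2) = 50% < Umin = 70%. The live objects O5 and O6 are therefore copied to a new page P3, which also holds O9 and O10; U(P3) = 100%.]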
Usefulness Based Copy Control (UBCC)
VERSION 1 objects: root, ch A, sec D, sec E, ch B, sec F, sec G, sec H, stored in pages P1 and P2.
VERSION 2 edit script: INS(sec J), DEL, INS(sec G’), DEL, INS(ch K), INS(sec L)
STEP 1: Determine page usefulness for copying. U(P1) = 75%; U(P2) = 50% < Umin = 70%.
STEP 2: Append new/copied objects into new pages by their logical order. P3: sec J, ch B and sec F (copied), sec G’. P4: ch K, sec L. U(P3) = 100%, U(P4) = 100%.
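The two steps can be sketched in Python using the objects of this slide. This is a sketch under stated assumptions rather than the authors' code: objects are equal-sized, usefulness is live objects over stored objects, each page has four slots, new and copied objects always go to fresh pages, and `ubcc_commit`, `pages` and `location` are hypothetical names.

```python
def ubcc_commit(pages, location, new_version, u_min=0.7, slots=4):
    """Store a new version; return the ids of the pages written.

    pages:       page id -> object names stored on that page (append-only)
    location:    object name -> page holding its current copy
    new_version: object names of the new version in logical (document) order
    """
    alive = set(new_version)

    # STEP 1: find pages whose usefulness fell below u_min; their live
    # objects must be copied forward.
    to_copy = set()
    for pid, objs in pages.items():
        live = [o for o in objs if o in alive and location.get(o) == pid]
        if live and len(live) / len(objs) < u_min:
            to_copy.update(live)

    # STEP 2: append new and copied objects to new pages in logical order.
    pending = [o for o in new_version if o not in location or o in to_copy]
    written = []
    for i in range(0, len(pending), slots):
        pid = f"P{len(pages) + 1}"
        pages[pid] = pending[i:i + slots]
        for o in pages[pid]:
            location[o] = pid
        written.append(pid)
    return written


# The slide's example: sec D, sec G and sec H do not appear in VERSION 2.
pages = {
    "P1": ["root", "ch A", "sec D", "sec E"],
    "P2": ["ch B", "sec F", "sec G", "sec H"],
}
location = {o: pid for pid, objs in pages.items() for o in objs}
version2 = ["root", "ch A", "sec J", "sec E", "ch B",
            "sec F", "sec G'", "ch K", "sec L"]
written = ubcc_commit(pages, location, version2)
```

In this run P1 stays at 75% and is left alone, while P2 drops to 50%, so ch B and sec F are copied and interleaved with the inserted objects in document order, giving the P3 and P4 contents shown on the slide.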
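Document Object Order
[Figure: physical placement of the objects, with VERSION 2 objects tagged by their position in document order. P1: root 1, sec A 2, sec E 4, and the unnumbered sec D. P2: ch B, sec F, sec G, sec H, all unnumbered. P3: sec J 3, ch B 5, sec F 6, sec G’ 7. P4: ch K 8, sec L 9.]
Version 2 objects are not stored in sequence; hence, we use the edit script.
VERSION 2 = (root 1, sec A 2, sec J 3, sec E 4, ch B 5, sec F 6, sec G’ 7, ch K 8, sec L 9)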
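How an edit script restores the logical order can be sketched in Python. The script format here, `("INS", obj, after)` and `("DEL", obj)`, is hypothetical, since the slides do not give one. The deleted objects (sec D, sec G, sec H) are inferred by comparing the two versions, and the page assignments follow the figure.

```python
def apply_edit_script(prev_order, script):
    """Derive a version's logical order from the previous one.

    Hypothetical operation format:
      ("DEL", obj)         remove obj from the order
      ("INS", obj, after)  insert obj immediately after `after`
    """
    order = list(prev_order)
    for op in script:
        if op[0] == "DEL":
            order.remove(op[1])
        elif op[0] == "INS":
            _, obj, after = op
            order.insert(order.index(after) + 1, obj)
        else:
            raise ValueError(f"unknown operation {op[0]!r}")
    return order


def pages_to_read(order, location):
    """Distinct pages needed to rebuild a version, in order of first use."""
    seen = []
    for obj in order:
        if location[obj] not in seen:
            seen.append(location[obj])
    return seen


version1 = ["root", "sec A", "sec D", "sec E", "ch B", "sec F", "sec G", "sec H"]
script = [
    ("INS", "sec J", "sec A"), ("DEL", "sec D"),
    ("INS", "sec G'", "sec F"), ("DEL", "sec G"), ("DEL", "sec H"),
    ("INS", "ch K", "sec G'"), ("INS", "sec L", "ch K"),
]
version2 = apply_edit_script(version1, script)

# Current page of every VERSION 2 object, as placed in the figure.
location = {
    "root": "P1", "sec A": "P1", "sec E": "P1",
    "sec J": "P3", "ch B": "P3", "sec F": "P3", "sec G'": "P3",
    "ch K": "P4", "sec L": "P4",
}
needed = pages_to_read(version2, location)
```

Here `version2` comes out in the order listed on the slide, and only P1, P3 and P4 have to be read; P2 is skipped because its live objects were copied forward.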
Beyond Edit-Based Versioning
The UBCC scheme achieves good storage and retrieval efficiency, but it is not suitable at the transport level or for queries on content.
Thus, we propose a copy-based model which:
–explores shared elements
–needs no edit script
–yields a simple XML representation for the document history
The XML Version Model (XVM)
XVM is a list of version nodes.
Each version node is an ordered tree consisting of four types of nodes:
–element node
–attribute node
–text node
–copy record node
These are minimal extensions to the XPath data model: the copy record node is actually a link.
Copy-Based XML Version Model (XVM)
[Figure legend: V = version node, E = element node, T = text node, A = attribute node, C = copy record node. The figure shows two version trees; the copy records (C) in the second tree carry tree address references such as V1.2.1.]
XVM --- Example
V1: chapter “Intro”, chapter “Tutorial”, section “Scope”, section “Concepts”, section “Context”
Changes (V1 to V2):
1. DELETE chapter “Tutorial”
2. INSERT chapter “Second Ex”
V2: a copy record (C) referencing V1.1, chapter “Second Ex”, section “Test Data”
Changes (V2 to V3):
1. UPDATE the textual content of chapter “Second Ex”
2. COPY the “Concepts” section and insert it after section “Test Data”
V3: chapter “Second Ex” and copy records (C) referencing V2.1, V2.2.1 and V2.1.2
XVM Version Retrieval --- Example
[Figure: the V1, V2 and V3 trees of the previous example, with the copy records V1.1 (in V2) and V2.1, V2.2.1, V2.1.2 (in V3) followed to the elements they reference, including chapter “Intro” and sections “Scope” and “Concepts”.]
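Retrieval by following copy records can be sketched in Python. This is an illustrative sketch, not the authors' implementation: text and attribute nodes are omitted, `Element`, `CopyRecord`, `resolve` and `retrieve` are hypothetical names, tree addresses such as V2.1.2 are assumed to use 1-based child positions, and the grouping of sections under chapters is an assumption chosen to be consistent with the addresses in the example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    label: str
    children: list = field(default_factory=list)


@dataclass
class CopyRecord:
    ref: str  # tree address of the shared subtree, e.g. "V2.1.2"


def resolve(versions, ref):
    """Follow a tree address to the Element it designates.

    "V2.1.2" means: version 2, first child, then that node's second child.
    Copy records met along the way are followed as well.
    """
    head, *steps = ref.split(".")
    if not steps:
        raise ValueError(f"address {ref!r} names a version, not a node")
    children = versions[int(head[1:])]
    node = None
    for step in steps:
        node = children[int(step) - 1]
        if isinstance(node, CopyRecord):
            node = resolve(versions, node.ref)
        children = node.children
    return node


def materialize(versions, node):
    """Return a copy of node's subtree with every copy record expanded."""
    if isinstance(node, CopyRecord):
        node = resolve(versions, node.ref)
    return Element(node.label, [materialize(versions, c) for c in node.children])


def retrieve(versions, number):
    """Rebuild one version as a list of fully expanded top-level elements."""
    return [materialize(versions, child) for child in versions[number]]


def outline(nodes, depth=0):
    """Indented labels, one per line, for inspecting a retrieved version."""
    lines = []
    for n in nodes:
        lines.append("  " * depth + n.label)
        lines.extend(outline(n.children, depth + 1))
    return lines


versions = {
    1: [Element('chapter "Intro"', [Element('section "Scope"'),
                                    Element('section "Concepts"')]),
        Element('chapter "Tutorial"', [Element('section "Context"')])],
    2: [CopyRecord("V1.1"),
        Element('chapter "Second Ex"', [Element('section "Test Data"')])],
    3: [CopyRecord("V2.1"),
        Element('chapter "Second Ex"', [CopyRecord("V2.2.1"),
                                        CopyRecord("V2.1.2")])],
}
version3 = outline(retrieve(versions, 3))
```

Note that V2.1 is itself a copy record, so resolving V2.1.2 first follows it to V1.1 and only then takes the second child. No edit script is consulted at any point.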
XVM Benefits
Transport level: XVM is represented as an XML document, whose DTD is generated automatically from the document DTD.
Storage level: we extended the usefulness-based temporal clustering scheme to XVM.
XVM Implementation --- Use XML to Represent XVM
DTD transformation:
–Define three new elements: …, … and ….
–For each element in the original DTD, add a CopyRecord to its content model as an alternative.
Example: Original DTD ... Version DTD ...
Performance and Storage Cost
Conclusion
UBCC is efficient at the storage level.
The copy-based scheme is effective both as a storage representation and as a transport representation.
Our current research focuses on efficient evaluation of queries on versions:
–content queries,
–snapshot queries,
–history queries.