Published by Tobias Joseph. Modified over 9 years ago.
1
Storing and Querying Multi-version XML Documents using Durable Node Numbers
Shu-Yao Chien, Dept. of CS, UCLA, csy@cs.ucla.edu
Vassilis J. Tsotras, Dept. of CS&E, UC Riverside, tsotras@cs.ucr.edu
Carlo Zaniolo, Dept. of CS, UCLA, zaniolo@cs.ucla.edu
Donghui Zhang, Dept. of CS&E, UC Riverside, donghui@cs.ucr.edu
2
Document Version Management
Traditional applications migrating to the web:
–Software configuration management
–Cooperative work
–CAD
An array of web-based applications:
–Web content providers and trackers
–Link Permanence
–WebDAV
3
Problem Definition
A range of new and old applications look to XML for a shared technology and toolset to support their varied requirements.
Main requirements and research challenges:
–Efficient version retrieval
–Storage efficiency
–Complex query support
4
Traditional Versioning Schemes
The naive approach stores each version in its entirety: it minimizes retrieval cost but is very inefficient in storage.
RCS (Revision Control System):
–stores the latest version in its entirety, and
–represents old versions by deltas (reverse edit scripts)
–minimizes storage cost
–version retrieval cost grows linearly with version number
SCCS (Source Code Control System):
–objects are timestamped and stored by their document order
–version retrieval cost can be as high as the whole change history
These schemes are used by most current systems, but need improvements in storage management, retrieval, query, and support for complex objects.
5
Databases --- Temporal, OO, Semi-structured, XML DB, …
DBs for CAD and for semi-structured information paid much attention to version support.
Temporal DBs: efficient support for transaction time by various indexing schemes (Snapshot Index, Multi-Version B+-Trees, etc.).
But typical DBs do not support object ordering (since reconstruction of the complete document is not a critical query).
Numbering schemes are proposed to represent document structure and enhance efficiency in evaluating regular path expressions.
6
Storage Level Enhancement
UBCC [WebDB200] enhances RCS with page management:
–Flexibility of trading off storage and retrieval costs
–Uses the concept of Page Usefulness
–Captures the document order of objects in the (forward) edit script
7
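Page Usefulness – by Example
[Figure: a page holding objects A, B, C, D, shown at versions T1, T2, and T3 after deletions (DEL); the usefulness values shown are 100%, 75%, and 25%, and the page is labeled Useful or Useless accordingly.]
We set a minimum usefulness requirement U_min, e.g. 70% (0 < U_min <= 1).
A page is useful when its usefulness is above U_min and useless when it is below U_min.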
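The usefulness measure can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes usefulness is the fraction of a page's objects that are still valid at a given version, that each object carries an inclusive (start, end) lifespan, and that a page exactly at U_min counts as useful (the slide only covers the above/below cases). The function names and the sample lifespans are hypothetical.

```python
NOW = float("inf")  # stand-in for an open-ended lifespan


def page_usefulness(lifespans, version):
    """Fraction of a page's objects that are valid at `version`.

    `lifespans` is a list of inclusive (start, end) version pairs,
    one per object stored in the page (an assumed representation).
    """
    alive = sum(1 for start, end in lifespans if start <= version <= end)
    return alive / len(lifespans)


def is_useful(usefulness, u_min=0.7):
    """Assumption: a page exactly at U_min is treated as useful."""
    return usefulness >= u_min


# Hypothetical page with four objects A, B, C, D:
# B is deleted in version 2, C and D in version 3.
page = [(1, NOW), (1, 1), (1, 2), (1, 2)]
for v in (1, 2, 3):
    u = page_usefulness(page, v)
    print(v, u, "useful" if is_useful(u) else "useless")
```

With these sample lifespans the page goes from 100% to 75% to 25% usefulness, so it stops being useful at version 3 under a 70% threshold.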
8
Usefulness Based Copy Control (UBCC)
[Figure: VERSION 1 holds Root, Ch A, Fig D, Sec E, Ch B, Sec F, Fig G, Fig H in pages P1 and P2. The edit script for VERSION 2 is INS(Sec J), DEL, INS(Fig M), DEL, INS(Ch K), INS(Sec L).]
STEP 1: Determine page usefulness for copying. U(P1) = 75%, U(P2) = 50% < U_min = 70%.
STEP 2: Append new/copied objects into new pages by their logical order.
P3: Sec J, COPY Ch B, Sec F, Fig M
P4: Ch K, Sec L
U(P3) = 100%, U(P4) = 100%
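STEP 1 can be sketched as a small selection routine. This is an illustration under stated assumptions, not the UBCC implementation: the assignment of four objects to each of P1 and P2 and the particular deleted objects are chosen only so that the usefulness values match the slide (75% and 50%); the function name `ubcc_copy_plan` is hypothetical.

```python
def ubcc_copy_plan(pages, deleted, u_min):
    """Return {page: live objects to copy} for pages below `u_min`.

    `pages` maps a page name to its objects in logical order;
    `deleted` is the set of objects removed by the new version.
    """
    plan = {}
    for name, objects in pages.items():
        live = [obj for obj in objects if obj not in deleted]
        usefulness = len(live) / len(objects)
        if usefulness < u_min:
            # The page became useless: its surviving objects get
            # copied forward into the new version's pages.
            plan[name] = live
    return plan


# Assumed page layout and deletions (not recoverable from the figure).
pages = {
    "P1": ["Root", "Ch A", "Fig D", "Sec E"],
    "P2": ["Ch B", "Sec F", "Fig G", "Fig H"],
}
deleted = {"Sec E", "Fig G", "Fig H"}
print(ubcc_copy_plan(pages, deleted, u_min=0.7))
```

Under these assumptions only P2 falls below the threshold, so Ch B and Sec F are the objects copied forward, in line with the COPY entries on the slide. STEP 2 (interleaving copied and new objects by logical order) is not modeled here.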
9
New Support is Needed …
Complex Query Support:
–Temporal Selection
–Structural Projection
–Content-Based Selection
–Regular Path Expression
–Query on Diff
UBCC is not efficient in supporting version queries.
A new scheme is needed …
10
The SPaR Versioning Scheme
–SPaR numbering scheme
–Version model
–Complex query support
–Usefulness-based storage strategy
11
SPaR Numbering Scheme
XML document structure is represented by:
–a Durable Node Number (DNN), and
–a Range
DNN is a sparse numbering scheme that preserves element order.
Range preserves parent-child relationships.
Documents can be decomposed and stored as separate elements, then reconstructed (possibly only partially) when needed.
Indexes can be built upon DNN and Range for efficient XML query evaluation.
12
SPaR Numbering Scheme --- by Example
DNN is a sparse numbering scheme that preserves element order as pre-order traversal (the same as document order).
Range preserves the parent-child containment relationship such that: dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P).
[Figure: example tree, in document order with DNNs: Root (dnn=1), Ch A (dnn=5), Fig D (dnn=11), Sec E (dnn=21), Ch B (dnn=51), Sec F (dnn=55), Fig G (dnn=61), Fig H (dnn=71). Range labels shown in the figure: 100, 25, 30, 2, 5, 10, 2.]
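The containment inequality translates directly into a predicate. A minimal sketch: the function name `contains` is hypothetical, and the (dnn, range) pairs used below are illustrative values in the spirit of the slide's example rather than a faithful copy of the figure.

```python
def contains(parent, child):
    """True if `child` lies inside `parent` under the SPaR rule.

    Each argument is a (dnn, range) pair. The check mirrors
    dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P).
    """
    p_dnn, p_range = parent
    c_dnn, c_range = child
    return p_dnn < c_dnn < c_dnn + c_range < p_dnn + p_range


# Illustrative (dnn, range) values.
root, ch_a, ch_b, sec_f = (1, 100), (5, 25), (51, 30), (55, 10)

print(contains(root, ch_a))   # Ch A is inside Root
print(contains(ch_b, sec_f))  # Sec F is inside Ch B
print(contains(ch_a, sec_f))  # Sec F is not inside Ch A
```

Because the test uses only the two numbers stored with each element, ancestor/descendant checks need no tree traversal.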
13
Durability upon Updates
Unused ranges are saved between consecutive elements for future insertions.
When a new element Y is inserted between two consecutive elements X and Z, an unused SPaR range is assigned to Y according to the structural relationship between X, Y, and Z.
Range overflow is handled by floating point numbers with variable length.
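One way to picture the allocation is to carve the new element's interval out of the unused gap. The slide does not state the allocation policy, so the "middle third of the gap" rule below is purely an assumption, as is the use of Python's `Fraction` as a stand-in for variable-length floating point numbers; `allocate_between` is a hypothetical name.

```python
from fractions import Fraction


def allocate_between(left_end, right_start):
    """Pick (dnn, range) strictly inside the open gap (left_end, right_start).

    Assumed policy: take the middle third of the gap, leaving unused
    space on both sides for later insertions. Fractions never run out
    of precision, which mimics variable-length numbers.
    """
    gap = Fraction(right_start) - Fraction(left_end)
    if gap <= 0:
        raise ValueError("no unused space between the neighbours")
    dnn = Fraction(left_end) + gap / 3
    rng = gap / 3
    return dnn, rng


# X ends at 30 and Z starts at 51 (illustrative numbers); Y goes in between.
dnn, rng = allocate_between(30, 51)
print(dnn, rng)  # 37 7

# A child of Y is allocated inside Y's own interval (37, 44).
print(allocate_between(dnn, dnn + rng))
```

Existing elements keep their numbers when Y arrives, which is the durability property the slide describes.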
14
SPaR Version Model
Elements are stored by their DNN order along with:
–Lifespan --- (T_start, T_end)
–SPaR range
Adding a new version, V_N:
–Delete(E) – Set E.T_end to V_{N-1} and free its SPaR range.
–Insert(E) – Set the lifespan of E to (N, now) and assign it an unused SPaR range.
–Update(E, new_value) – Delete(E) + Insert(new_value) using the same SPaR range.
New elements of V_N are appended into data pages by their DNN order.
However, elements of V_N may be scattered among low usefulness data pages …
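The three operations can be sketched over an in-memory list of elements. This is a simplified model, not the authors' storage layer: page placement and the bookkeeping of freed SPaR ranges are omitted, `NOW` is represented by infinity, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

NOW = float("inf")  # stand-in for the open-ended "now"


@dataclass
class Element:
    label: str
    dnn: float
    rng: float
    t_start: int
    t_end: float = NOW


def insert(store, label, dnn, rng, version):
    """Insert(E): lifespan becomes (version, now)."""
    elem = Element(label, dnn, rng, version)
    store.append(elem)
    return elem


def delete(elem, version):
    """Delete(E) while building `version`: E was last valid in version - 1."""
    elem.t_end = version - 1


def update(store, elem, new_label, version):
    """Update(E, new_value): delete, then insert reusing the same SPaR range."""
    delete(elem, version)
    return insert(store, new_label, elem.dnn, elem.rng, version)


def snapshot(store, version):
    """Elements valid at `version`, in DNN (document) order."""
    live = [e for e in store if e.t_start <= version <= e.t_end]
    return sorted(live, key=lambda e: e.dnn)


store = []
sec = insert(store, "Sec E", 21, 5, version=1)
update(store, sec, "Sec E (revised)", version=3)
print([e.label for e in snapshot(store, 2)])
print([e.label for e in snapshot(store, 3)])
```

Note that the old and new values share the same DNN and range, so the element keeps its position in document order across versions.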
15
Version Reconstruction
To reconstruct version V_N:
Step 1 --- Locate useful data pages using the Sparse Page Index.
Step 2 --- Order elements according to their DNN.
Step 3 --- Reconstruct the ordered-tree structure of the document.
16
Step 1 --- Locate Useful Pages
Sparse Page Index
[Figure: timeline over versions 1 to 10 showing the lifespans of pages P1 to P8.]
List Llpr: P1(1,now) P2(1,6) P3(2,5) P4(3,now) P5(4,10) P6(7,8) P7(8,9) P8(9,now)
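The lookup amounts to an interval-stabbing query over page lifespans. The sketch below uses a plain linear scan as a stand-in for the Sparse Page Index, whose internal structure the slide does not describe, and it assumes the lifespans are inclusive at both ends. The page data is taken from the slide; the function name `useful_pages` is hypothetical.

```python
NOW = float("inf")  # stand-in for "now"

# Page lifespans as listed on the slide.
PAGE_LIFESPANS = {
    "P1": (1, NOW), "P2": (1, 6), "P3": (2, 5), "P4": (3, NOW),
    "P5": (4, 10), "P6": (7, 8), "P7": (8, 9), "P8": (9, NOW),
}


def useful_pages(lifespans, version):
    """Pages whose lifespan covers `version` (inclusive ends assumed)."""
    return [
        page
        for page, (start, end) in lifespans.items()
        if start <= version <= end
    ]


print(useful_pages(PAGE_LIFESPANS, 5))   # ['P1', 'P2', 'P3', 'P4', 'P5']
print(useful_pages(PAGE_LIFESPANS, 10))  # ['P1', 'P4', 'P5', 'P8']
```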
17
Step 2 and 3
Ordering elements by their DNN numbers ---
Valid elements inserted at the same version are already sorted by their DNN number, for instance:
[Figure: sorted element lists (Ch, Sec, Fig, …) for versions V3, V7, and V13.]
Merge-sort these sorted lists.
Reconstructing ordered-tree structure ---
–Parent-child is determined by SPaR ranges.
–Sibling order is implied by the DNN order.
–Maintain a backward ancestor stack for back-tracking.
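Both steps fit in one pass: merge the per-version sorted runs, then attach each element to the nearest enclosing ancestor on a stack. A minimal sketch under assumptions: elements are (dnn, range, label) tuples, the document has a single root, and the sample values are illustrative. `heapq.merge` is used for the merge step; `rebuild` is a hypothetical name.

```python
import heapq


def rebuild(sorted_runs):
    """Rebuild the ordered tree from runs of (dnn, range, label) tuples.

    Each run is already sorted by dnn (one run per insertion version).
    Returns the root as a nested dict {"label": ..., "children": [...]}.
    """
    root = None
    stack = []  # ancestor stack of (interval_end, node)
    for dnn, rng, label in heapq.merge(*sorted_runs):
        node = {"label": label, "children": []}
        # Back-track: drop ancestors whose interval ends before this element.
        while stack and stack[-1][0] <= dnn:
            stack.pop()
        if stack:
            stack[-1][1]["children"].append(node)  # parent found via range
        else:
            root = node  # assumes exactly one root element
        stack.append((dnn + rng, node))
    return root


runs = [
    [(1, 100, "Root"), (5, 25, "Ch A"), (51, 30, "Ch B")],  # e.g. from V3
    [(11, 2, "Fig D"), (55, 10, "Sec F")],                  # e.g. from V7
    [(61, 2, "Fig G")],                                     # e.g. from V13
]
tree = rebuild(runs)
print([c["label"] for c in tree["children"]])  # ['Ch A', 'Ch B']
```

Children are appended in DNN order, so sibling order falls out of the merge without any extra sorting.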
18
Regular Path Expression
Regular Path Query --- “For version 10, retrieve all figures contained by a chapter.”
doc[version=10]/Ch/*/Fig
Basic Ideas:
–Traditional algorithms trace the tree structure to match the path pattern.
–SPaR range makes it possible to evaluate a path query simply using a relational join operator.
–We use the SPaR range of Ch elements to reduce the search space for Fig elements.
–A Multi-version B+ Tree is built to help search based upon DNN numbers.
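The join idea can be sketched as a containment join restricted to elements valid at the queried version. Several assumptions apply: elements are (start, end, t_start, t_end) tuples with inclusive lifespans, a sorted list with `bisect` stands in for the Multi-version B+ Tree, and the sketch matches any figure contained in a chapter (as in the English statement of the query) without enforcing the exactly-one-intermediate-level meaning of `*`. The data values and the name `figs_in_chapters` are illustrative.

```python
import bisect

NOW = float("inf")


def figs_in_chapters(chapters, figures, version):
    """Containment join: (chapter, figure) pairs valid at `version`."""
    live_figs = sorted(f for f in figures if f[2] <= version <= f[3])
    starts = [f[0] for f in live_figs]
    pairs = []
    for ch in chapters:
        lo, hi, t_start, t_end = ch
        if not (t_start <= version <= t_end):
            continue
        # Only figures starting strictly inside (lo, hi) can be contained,
        # so the chapter's range narrows the search to a slice.
        first = bisect.bisect_right(starts, lo)
        last = bisect.bisect_left(starts, hi)
        for fig in live_figs[first:last]:
            if fig[1] < hi:
                pairs.append((ch, fig))
    return pairs


chapters = [(200, 300, 1, NOW), (500, 700, 3, NOW)]
figures = [(250, 260, 2, 10), (550, 560, 2, 9), (400, 410, 1, NOW)]
print(figs_in_chapters(chapters, figures, version=10))
```

At version 10 only the figure at (250, 260) qualifies: the one at (550, 560) expired after version 9, and the one at (400, 410) lies outside every chapter's range.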
19
Dense Element Index
Multi-version B+ Tree (MVBT) keeps history for B+ Tree.
We use MVBT to build dense element indexes.
[Figure: Ch_MVBT and Fig_MVBT indexes; the entries shown are:]
SPaR : (200,300) Life : (1,now) Loc : Page P1
SPaR : (500,700) Life : (3,now) Loc : Page P1
…
SPaR : (400,410) Life : (1,now) Loc : Page P5
SPaR : (480,490) Life : (1,15) Loc : Page P1
…
SPaR : (250,260) Life : (2,10) Loc : Page P3
SPaR : (550,560) Life : (2,9) Loc : Page P1
20
Performance
Pages stored: size(RCS)/(1-U_min)
Retrieval of single version: size(Version)/U_min pages
UBCC uses a separate edit script pointing to the data
–to retrieve only useful pages
–in the right order!
SPaR scheme only needs SPaR ranges to reconstruct versions.
SPaR is slightly better than UBCC in storage cost and version reconstruction.
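To get a feel for the trade-off, the two formulas can be evaluated for a few thresholds. This is plain arithmetic on the slide's expressions; the function name `ubcc_estimates` and the sample sizes (100 pages for the RCS store, 10 pages for one version) are made up for illustration.

```python
def ubcc_estimates(rcs_pages, version_pages, u_min):
    """Evaluate the slide's two expressions for a given U_min.

    Returns (pages stored, pages read to retrieve one version).
    U_min = 1 is excluded here because the storage term divides by 1 - U_min.
    """
    if not 0 < u_min < 1:
        raise ValueError("u_min must be strictly between 0 and 1")
    stored = rcs_pages / (1 - u_min)
    retrieved = version_pages / u_min
    return stored, retrieved


# Hypothetical sizes: RCS store of 100 pages, one version of 10 pages.
for u_min in (0.3, 0.5, 0.7):
    stored, retrieved = ubcc_estimates(100, 10, u_min)
    print(f"U_min={u_min}: store ~{stored:.0f} pages, read ~{retrieved:.1f} pages")
```

Raising U_min lowers the retrieval term and raises the storage term, which is the storage/retrieval trade-off the usefulness threshold controls.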
21
Performance and Storage Cost (10% inserted, 10% deleted)
22
Conclusion and Future Work
The web changes everything; XML unifies everything.
It’s time for a new technology that merges traditional versioning schemes and temporal databases and overcomes their limitations.
Usefulness-based clustering is effective and versatile: we applied it to edit-script-based schemes (UBCC) and to the SPaR scheme.
The SPaR numbering scheme makes it possible to build document structural indexes and efficiently evaluate complex version queries.
Emerging issues:
–Query language support for version queries.
–User interface for browsing versions and presenting query results.