
1 Storing and Querying Multi-version XML Documents using Durable Node Numbers
Shu-Yao Chien, Dept. of CS, UCLA, csy@cs.ucla.edu
Vassilis J. Tsotras, Dept. of CS&E, UC Riverside, tsotras@cs.ucr.edu
Carlo Zaniolo, Dept. of CS, UCLA, zaniolo@cs.ucla.edu
Donghui Zhang, Dept. of CS&E, UC Riverside, donghui@cs.ucr.edu

2 Document Version Management
• Traditional applications migrating to the web:
  – Software configuration management
  – Cooperative work
  – CAD
• An array of web-based applications:
  – Web content providers and trackers
  – Link permanence
  – WebDAV

3 Problem Definition
• A range of new and old applications look to XML for a shared technology and toolset that supports their varied requirements.
• Main requirements and research challenges:
  – Efficient version retrieval
  – Storage efficiency
  – Complex query support

4 Traditional Versioning Schemes
• The naive approach stores each version in its entirety: it minimizes retrieval cost but uses storage very inefficiently.
• RCS (Revision Control System):
  – stores the latest version in its entirety
  – represents old versions by deltas (reverse edit scripts)
  – minimizes storage cost
  – version retrieval cost grows linearly with version number
• SCCS (Source Code Control System):
  – objects are timestamped and stored in document order
  – retrieving a version can cost as much as reading the whole change history
• Most current systems use these schemes, but they need improvements in storage management, retrieval, querying, and support for complex objects.

5 Databases: Temporal, OO, Semi-structured, XML DB, …
• Databases for CAD and for semi-structured information have paid much attention to version support.
• Temporal DBs: efficient support for transaction time through various indexing schemes (Snapshot Index, Multi-Version B+-Trees, etc.).
• But typical DBs do not support object ordering, since reconstructing a complete document is not a critical query for them.
• Numbering schemes have been proposed to represent document structure and to make regular path expressions cheaper to evaluate.

6 Storage Level Enhancement
• UBCC [WebDB200] enhances RCS with page management.
• It allows storage cost to be traded off against retrieval cost.
• It is built on the concept of page usefulness.
• It captures the document order of objects in the (forward) edit script.

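7 Page Usefulness – by Example
[Figure: one data page holding objects A, B, C, D, shown at versions T1, T2, and T3, with deletions (DEL) applied at T2 and T3.]
Version   Page usefulness   Status (U_min = 70%)
T1        100%              useful
T2        75%               useful
T3        25%               useless
• We set a minimum usefulness requirement U_min, e.g. 70% (0 < U_min <= 1).
• A page is useful when its usefulness is above U_min and useless when it is below.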

8 Usefulness Based Copy Control (UBCC)
[Figure: VERSION 1 contains Root, Ch A, Fig D, Sec E, Ch B, Sec F, Fig G, Fig H, stored in pages P1 and P2. The VERSION 2 edit script contains INS(Sec J), DEL, INS(Fig M), DEL, INS(Ch K), INS(Sec L).]
• STEP 1: Determine page usefulness for copying.
  – U(P1) = 75%
  – U(P2) = 50% < U_min = 70%
• STEP 2: Append new/copied objects into new pages by their logical order.
  – P3: Sec J, Ch B (copied), Sec F (copied), Fig M; U(P3) = 100%
  – P4: Ch K, Sec L; U(P4) = 100%
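The two UBCC steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: usefulness is taken to be the fraction of a page's objects still valid in the new version (which matches the percentages on the slide), a page at exactly U_min is treated as useful, the page capacity of 4 objects is assumed, the contents of P1 and P2 are inferred from the figure, and Sec E is assumed to be the object deleted from P1 (the transcript does not say which one it is). The function name `ubcc_new_pages` is hypothetical.

```python
def usefulness(page, live):
    """Fraction of the objects stored in `page` that are still live."""
    return sum(1 for obj in page if obj in live) / len(page)


def ubcc_new_pages(pages, new_version, u_min, capacity):
    """Return the new pages to append for `new_version`.

    pages:       dict mapping page name -> list of stored objects
    new_version: objects of the new version in document (logical) order
    """
    live = set(new_version)
    # STEP 1: live objects on useful pages stay where they are.
    kept = set()
    for objs in pages.values():
        if usefulness(objs, live) >= u_min:  # assumption: ties count as useful
            kept.update(obj for obj in objs if obj in live)
    # STEP 2: new objects, plus live objects from useless pages, are
    # appended in logical order and cut into pages of `capacity` objects.
    to_write = [obj for obj in new_version if obj not in kept]
    return [to_write[i:i + capacity] for i in range(0, len(to_write), capacity)]


# Assumed layout of version 1 (inferred from the figure).
pages = {
    "P1": ["Root", "Ch A", "Fig D", "Sec E"],
    "P2": ["Ch B", "Sec F", "Fig G", "Fig H"],
}
# Assumed document order of version 2 (Sec E, Fig G, Fig H deleted).
version2 = ["Root", "Ch A", "Fig D", "Sec J", "Ch B", "Sec F",
            "Fig M", "Ch K", "Sec L"]

new_pages = ubcc_new_pages(pages, version2, u_min=0.7, capacity=4)
```

Under these assumptions P1 (75%) is left alone, P2 (50%) is useless, and the call produces two pages that match P3 and P4 on the slide.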

9 New Support Is Needed …
• Complex query support:
  – Temporal selection
  – Structural projection
  – Content-based selection
  – Regular path expressions
  – Queries on diffs
• UBCC is not efficient in supporting version queries.
• A new scheme is needed …

10 The SPaR Versioning Scheme
• SPaR numbering scheme
• Version model
• Complex query support
• Usefulness-based storage strategy

11 SPaR Numbering Scheme
• The structure of an XML document is represented by:
  – a Durable Node Number (DNN), and
  – a Range
• DNN is a sparse numbering scheme that preserves element order.
• Range preserves parent-child relationships.
• Documents can be decomposed and stored as separate elements, then reconstructed (possibly only in part) when needed.
• Indexes can be built on DNN and Range for efficient XML query evaluation.

12 SPaR Numbering Scheme: by Example
• DNN is a sparse numbering scheme that preserves element order as a pre-order traversal (the same as document order).
• Range preserves the parent-child containment relationship such that:
  dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P)
[Figure: document tree annotated with DNN and range values, drawn over a number line.]
Element   dnn   range
Root      1     100
Ch A      5     25
Fig D     11
Sec E     21
Ch B      51    30
Sec F     55
Fig G     61
Fig H     71
The figure also lists range values 2, 5, 10, and 2 for the lower-level elements; their pairing with individual elements is not fully recoverable from the transcript.
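The containment inequality can be checked directly. The sketch below uses only the three elements whose DNN and range are both legible in the transcript (Root, Ch A, Ch B); the `Node` type and the `contains` function are illustrative names, not part of the original scheme.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    dnn: float  # Durable Node Number
    rng: float  # SPaR range


def contains(p: Node, c: Node) -> bool:
    """True when c lies inside p, using the inequality from the slide:
    dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P)."""
    return p.dnn < c.dnn < c.dnn + c.rng < p.dnn + p.rng


root = Node("Root", 1, 100)
ch_a = Node("Ch A", 5, 25)
ch_b = Node("Ch B", 51, 30)

# Document order is simply ascending DNN order.
in_document_order = [n.name for n in sorted([ch_b, root, ch_a], key=lambda n: n.dnn)]
```

Root contains both chapters, while neither chapter contains the other, so the two chapters are siblings; no tree traversal is needed to establish this.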

13 Durability upon Updates
• Unused ranges are left between consecutive elements for future insertions.
• When a new element Y is inserted between two consecutive elements X and Z, Y is assigned an unused SPaR range chosen according to the structural relationship between X, Y, and Z.
• Range overflow is handled by variable-length floating point numbers.
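The slides do not state how the unused range is split, so the following is only one possible policy, shown to make the idea concrete. It assumes the caller has already determined the free gap from the structural relationship (here, Y as a sibling between X and Z), gives Y the middle third of that gap, and uses Python's `Fraction` as a stand-in for variable-length numbers. The names `sibling_gap` and `allocate` are hypothetical.

```python
from fractions import Fraction


def sibling_gap(x, z):
    """Free gap between sibling X = (dnn, range) and the following sibling Z."""
    return x[0] + x[1], z[0]


def allocate(lo, hi):
    """Pick (dnn, range) strictly inside the open gap (lo, hi).

    Assumed policy: take the middle third, so that room remains on both
    sides for later insertions. Fractions never run out of precision,
    which mimics variable-length numbers.
    """
    lo, hi = Fraction(lo), Fraction(hi)
    if lo >= hi:
        raise ValueError("empty gap")
    third = (hi - lo) / 3
    return lo + third, third


# Insert a new chapter between Ch A (dnn=5, range=25) and Ch B (dnn=51, range=30).
lo, hi = sibling_gap((5, 25), (51, 30))
dnn, rng = allocate(lo, hi)

# A very small gap still works; the numbers simply get longer.
tiny_dnn, tiny_rng = allocate(37, 38)
```

Existing elements keep their numbers in both cases, which is what makes the numbering durable.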

14 SPaR Version Model
• Elements are stored in DNN order along with:
  – Lifespan: (T_start, T_end)
  – SPaR range
• Adding a new version, V_N:
  – Delete(E): set E.T_end to N-1 and free its SPaR range.
  – Insert(E): set the lifespan of E to (N, now) and assign it an unused SPaR range.
  – Update(E, new_value): Delete(E) + Insert(new_value) using the same SPaR range.
• New elements of V_N are appended to data pages in DNN order.
• However, elements of V_N may be scattered among data pages of low usefulness …
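A minimal in-memory sketch of these three operations follows. It assumes lifespans are inclusive at both ends and represents "now" as infinity; page placement and the freeing of SPaR ranges are left out. The `Element` class and the helper names are illustrative only.

```python
from dataclasses import dataclass

NOW = float("inf")  # stands in for the open end "now"


@dataclass
class Element:
    value: str
    dnn: float
    rng: float
    t_start: int
    t_end: float = NOW

    def alive_at(self, version):
        # Assumption: both ends of the lifespan are inclusive.
        return self.t_start <= version <= self.t_end


def insert(store, value, dnn, rng, n):
    """Insert(E): lifespan (N, now) with the given SPaR range."""
    element = Element(value, dnn, rng, t_start=n)
    store.append(element)
    return element


def delete(element, n):
    """Delete(E): the element was last valid in version N-1."""
    element.t_end = n - 1


def update(store, element, new_value, n):
    """Update(E, new_value): delete, then insert reusing the same range."""
    delete(element, n)
    return insert(store, new_value, element.dnn, element.rng, n)


def snapshot(store, version):
    """Values valid in `version`, in DNN (document) order."""
    live = (e for e in store if e.alive_at(version))
    return [e.value for e in sorted(live, key=lambda e: e.dnn)]


store = []
ch_a = insert(store, "Ch A", 5, 25, n=1)
ch_b = insert(store, "Ch B", 51, 30, n=1)
update(store, ch_b, "Ch B (revised)", n=2)  # version 2
delete(ch_a, n=3)                            # version 3
```

Every version remains queryable from the same store: `snapshot(store, 1)` still returns the original two chapters after later versions have been added.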

15 Version Reconstruction
• To reconstruct version V_N:
  – Step 1: Locate useful data pages using the Sparse Page Index.
  – Step 2: Order the elements by DNN.
  – Step 3: Reconstruct the ordered-tree structure of the document.

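16 Step 1: Locate Useful Pages
• Sparse Page Index
[Figure: timeline over versions 1 to 10 showing the interval in which each page P1 to P8 is useful. The figure also carries the label "List Llpr".]
Page   Lifespan
P1     (1, now)
P2     (1, 6)
P3     (2, 5)
P4     (3, now)
P5     (4, 10)
P6     (7, 8)
P7     (8, 9)
P8     (9, now)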
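The lookup that the Sparse Page Index answers can be illustrated with the page lifespans from the figure. The sketch assumes lifespans are inclusive at both ends and uses a plain linear scan; a real index would avoid scanning every entry, and its internal structure is not described in the transcript. The name `useful_pages` is hypothetical.

```python
NOW = float("inf")  # stands in for "now"

# Page lifespans as listed on the slide.
PAGE_LIFESPANS = {
    "P1": (1, NOW),
    "P2": (1, 6),
    "P3": (2, 5),
    "P4": (3, NOW),
    "P5": (4, 10),
    "P6": (7, 8),
    "P7": (8, 9),
    "P8": (9, NOW),
}


def useful_pages(index, version):
    """Pages whose lifespan covers `version` (inclusive ends assumed)."""
    return [page for page, (start, end) in index.items()
            if start <= version <= end]


pages_v5 = useful_pages(PAGE_LIFESPANS, 5)
pages_v7 = useful_pages(PAGE_LIFESPANS, 7)
```

Only the returned pages need to be read to rebuild the requested version; pages outside the result are skipped entirely.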

17 Steps 2 and 3
• Ordering elements by DNN:
  – Valid elements inserted in the same version are already sorted by DNN (see the figure).
  – Merge-sort these sorted lists.
• Reconstructing the ordered-tree structure:
  – Parent-child relationships are determined by SPaR ranges.
  – Sibling order is implied by DNN order.
  – Maintain a backward ancestor stack for back-tracking.
[Figure: per-version sorted lists of Ch, Sec, and Fig elements for versions V3, V7, V13, …]
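Steps 2 and 3 can be sketched together: a k-way merge of the per-version lists, followed by a single pass with an ancestor stack. The element names and DNNs come from the earlier numbering example; the leaf ranges (2, 5, 10, 2, 2), the resulting nesting of Fig G under Sec F, and the split into two insertion versions are assumptions made for illustration. The function name `reconstruct` is hypothetical.

```python
import heapq


def reconstruct(sorted_lists):
    """Merge per-version lists of (name, dnn, range), each already sorted
    by dnn, and rebuild the tree as nested (name, children) tuples."""
    merged = heapq.merge(*sorted_lists, key=lambda e: e[1])
    roots = []
    stack = []  # ancestor stack of (end_of_range, children_list)
    for name, dnn, rng in merged:
        children = []
        # Back-track: pop ancestors whose range ends before this element.
        while stack and dnn >= stack[-1][0]:
            stack.pop()
        parent_children = stack[-1][1] if stack else roots
        parent_children.append((name, children))
        stack.append((dnn + rng, children))
    return roots


# Hypothetical split of the example document across two insertion versions.
inserted_v1 = [("Root", 1, 100), ("Ch A", 5, 25), ("Sec E", 21, 5), ("Ch B", 51, 30)]
inserted_v2 = [("Fig D", 11, 2), ("Sec F", 55, 10), ("Fig G", 61, 2), ("Fig H", 71, 2)]

tree = reconstruct([inserted_v1, inserted_v2])
```

Because the merged stream is in document order, each element's parent is always the nearest stack entry whose range still covers it, so one pass is enough.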

18 Regular Path Expression
• Regular path query: "For version 10, retrieve all figures contained by a chapter."
  doc[version=10]/Ch/*/Fig
• Basic ideas:
  – Traditional algorithms trace the tree structure to match the path pattern.
  – SPaR ranges make it possible to evaluate a path query using only relational join operators.
  – We use the SPaR ranges of Ch elements to reduce the search space for Fig elements.
  – A Multi-version B+ Tree is built to support searches by DNN.

19 Dense Element Index
• A Multi-version B+ Tree (MVBT) keeps the history of a B+ Tree.
• We use MVBT to build dense element indexes.
[Figure: Ch_MVBT … Fig_MVBT … with sample leaf entries. The entries appear in three groups; which index each group belongs to is not marked in the transcript.]
SPaR        Life       Loc
(200,300)   (1,now)    Page P1
(500,700)   (3,now)    Page P1
…
(400,410)   (1,now)    Page P5
(480,490)   (1,15)     Page P1
…
(250,260)   (2,10)     Page P3
(550,560)   (2,9)      Page P1
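The join idea from the path-expression slide can be tried on the sample index entries. This sketch makes several assumptions: the first two entries are read as Ch entries and the remaining four as Fig entries, each SPaR pair is read as (start, end) of the element's interval, lifespans are inclusive, and sorted Python lists with `bisect` stand in for the MVBT. It evaluates containment ("figures contained by a chapter") and does not enforce the exact depth implied by `/*/`. The function name `figs_in_chapters` is hypothetical.

```python
from bisect import bisect_left, bisect_right

NOW = float("inf")

# (SPaR interval, lifespan, page), as read from the slide (assumed grouping).
CH = [((200, 300), (1, NOW), "P1"), ((500, 700), (3, NOW), "P1")]
FIG = [((400, 410), (1, NOW), "P5"), ((480, 490), (1, 15), "P1"),
       ((250, 260), (2, 10), "P3"), ((550, 560), (2, 9), "P1")]


def alive(life, version):
    return life[0] <= version <= life[1]


def figs_in_chapters(chapters, figures, version):
    """Containment join: figures valid in `version` that fall inside the
    SPaR interval of a chapter valid in the same version."""
    live = sorted(f for f in figures if alive(f[1], version))
    starts = [f[0][0] for f in live]
    result = []
    for (lo, hi), life, _page in chapters:
        if not alive(life, version):
            continue
        # Only figures starting inside (lo, hi) can be contained.
        i = bisect_right(starts, lo)
        j = bisect_left(starts, hi)
        result.extend(f for f in live[i:j] if f[0][1] < hi)
    return result


v10 = figs_in_chapters(CH, FIG, 10)
v9 = figs_in_chapters(CH, FIG, 9)
```

For version 10 only the figure at (250,260) qualifies, since (550,560) expired at version 9; for version 9 both are returned. Each chapter's interval limits the search to a slice of the sorted figure list.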

20 Performance
• Pages stored: size(RCS)/(1 - U_min)
• Retrieval of a single version: size(Version)/U_min pages
• UBCC uses a separate edit script pointing to the data
  – to retrieve only useful pages
  – in the right order
• The SPaR scheme needs only SPaR ranges to reconstruct versions.
• SPaR is slightly better than UBCC in storage cost and version reconstruction.
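To see the trade-off that U_min controls, the two expressions can be evaluated for a few settings. The sizes used here (an RCS archive of 300 pages and a single version of 100 pages) are made-up numbers for illustration, not measurements from the presentation, and the function names are hypothetical.

```python
def pages_stored(rcs_pages, u_min):
    """Storage estimate from the slide: size(RCS) / (1 - U_min)."""
    return rcs_pages / (1 - u_min)


def pages_read(version_pages, u_min):
    """Retrieval estimate from the slide: size(Version) / U_min."""
    return version_pages / u_min


RCS_PAGES = 300      # hypothetical archive size
VERSION_PAGES = 100  # hypothetical size of one version

# U_min = 1 is left out because the storage expression divides by zero there.
for u_min in (0.25, 0.5):
    print(u_min, pages_stored(RCS_PAGES, u_min), pages_read(VERSION_PAGES, u_min))
```

Raising U_min from 0.25 to 0.5 halves the pages read per version (400 to 200) while increasing the pages stored (400 to 600), which is the storage/retrieval trade-off described earlier.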

21 Performance and Storage Cost (10% inserted, 10% deleted)

22 Conclusion and Future Work
• The web changes everything; XML unifies everything.
• It is time for a new technology that merges traditional versioning schemes with temporal databases and overcomes the limitations of both.
• Usefulness-based clustering is effective and versatile: we applied it to edit-script-based schemes (UBCC) and to the SPaR scheme.
• The SPaR numbering scheme makes it possible to build structural indexes on documents and to evaluate complex version queries efficiently.
• Emerging issues:
  – Query language support for version queries
  – User interfaces for browsing versions and presenting query results

