Published by Judith Powell. Modified over 8 years ago.
1
Efficient Complex Query Support For Multi-version XML Documents
Shu-Yao Chien, Dept. of CS, UCLA, csy@cs.ucla.edu
Vassilis J. Tsotras, Dept. of CS&E, UC Riverside, tsotras@cs.ucr.edu
Carlo Zaniolo, Dept. of CS, UCLA, zaniolo@cs.ucla.edu
Donghui Zhang, Dept. of CS&E, UC Riverside, donghui@cs.ucr.edu
2
Content
–Motivation
–Problem statement
–Framework
–Problem Reduction
–Solutions
–Performance
–Conclusions
3
Motivation
The web changes everything; XML unifies everything. A range of new and old applications looks to XML for a shared technology and toolset that supports their varied requirements. Version management for XML documents is an important topic.
Main requirements and research challenges:
–Efficient version retrieval.
–Storage efficiency.
–Complex query support.
4
Problem Definition
Given an XML document that evolves over time, how can we store its whole history and efficiently perform complex queries on any version?
5
Durable Node Numbering Scheme
An XML document has an ordered-tree structure, and each element has:
–a Durable Node Number (DNN), and
–a Range.
[Figure: an element's interval, running from DNN to DNN + Range.]
6
Node Numbering Scheme --- by Example
[Figure: example tree with nodes Root (1), Ch A (2), Ch B (5), Fig G (9), Sec E (4), Sec F (6), Fig H (8), Fig D (3), each labelled with a dnn and a range. dnn labels shown: 1, 5, 51, 61, 11, 21, 55, 71; range labels shown: 100, 25, 30, 2, 5, 10, 2.]
DNN preserves element order as in a pre-order traversal. Range preserves the parent-child relationship such that: dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P).
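The containment inequality can be checked with two comparisons per node pair. The following is a minimal sketch; the `Node` class, the function names, and the dnn/range values assigned to each node are illustrative assumptions, not the assignment shown in the original figure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    dnn: int   # Durable Node Number
    rng: int   # Range: width of the interval reserved for the node's subtree

def is_ancestor(p: Node, c: Node) -> bool:
    """True if c's interval lies strictly inside p's interval:
    dnn(P) < dnn(C) and dnn(C)+range(C) < dnn(P)+range(P)."""
    return p.dnn < c.dnn and c.dnn + c.rng < p.dnn + p.rng

def in_document_order(nodes):
    """DNN follows pre-order, so sorting by DNN yields document order."""
    return sorted(nodes, key=lambda n: n.dnn)

# Hypothetical values chosen only to satisfy the inequality.
root = Node("Root", 1, 100)
ch_a = Node("Ch A", 5, 25)
sec = Node("Sec", 11, 5)
ch_b = Node("Ch B", 51, 30)

print(is_ancestor(root, ch_a), is_ancestor(ch_a, sec), is_ancestor(ch_b, sec))
# prints: True True False
```

No tree traversal is needed for the test, which is what makes range-based numbering attractive for path queries.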
7
Version Model
Each element has:
–Lifespan --- (V_start, V_end)
–SPaR range --- (DNN, Range)
Adding a new version N corresponds to a set of changes:
–Delete(E): set E.V_end to N and free its SPaR range.
–Insert(E): set the lifespan of E to (N, now) and assign it an unused SPaR range.
–Update(E, new value): Delete(E) + Insert(E) using the same SPaR range but the new value.
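The three change operations can be sketched as follows. This is an illustration under stated assumptions: lifespans are treated as half-open intervals (an element is alive for V_start <= v < V_end), "now" is represented by `None`, and the SPaR allocator that would hand out and reclaim ranges is left out. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Element:
    value: str
    dnn: int
    rng: int
    v_start: int
    v_end: Optional[int] = None  # None stands for "now" (assumption)

    def alive_in(self, version: int) -> bool:
        # Assumed half-open lifespan: v_start <= version < v_end
        return self.v_start <= version and (self.v_end is None or version < self.v_end)

class VersionedDoc:
    def __init__(self):
        self.elements = []  # every element record ever created

    def insert(self, n, value, dnn, rng):
        """Insert(E): lifespan becomes (n, now). The caller supplies the SPaR range."""
        e = Element(value, dnn, rng, v_start=n)
        self.elements.append(e)
        return e

    def delete(self, n, e):
        """Delete(E): close the lifespan at version n."""
        e.v_end = n

    def update(self, n, e, new_value):
        """Update(E): Delete + Insert, reusing the same SPaR range."""
        self.delete(n, e)
        return self.insert(n, new_value, e.dnn, e.rng)

    def snapshot(self, version):
        """All elements alive in `version`, in document (DNN) order."""
        alive = [e for e in self.elements if e.alive_in(version)]
        return sorted(alive, key=lambda e: e.dnn)

doc = VersionedDoc()
a = doc.insert(1, "old text", dnn=10, rng=5)
b = doc.update(2, a, "new text")
print([e.value for e in doc.snapshot(1)])  # ['old text']
print([e.value for e in doc.snapshot(2)])  # ['new text']
```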
8
Framework for Storage Schemes
Two types of tags: individual tags (abstract, conclusion) and list tags (chapter, section, figure). Users query list-tag elements by order (e.g. chapter 2) rather than by SPaR range (e.g. the chapter whose SPaR range is (128, 512)). The order must therefore be transformed into a SPaR range, which calls for separate indices.
9
Problem Reduction Complex queries that can be reduced to partial version retrievals: Structural projection: “project the part of document between chapter 2 and 5 in version 20”; Path-expression: “find the chapter that contains figure 7 in version 10”.
10
Problem Reduction
Structural projection: “project the part of document between chapter 2 and 5 in version 20”:
–Query the CH-index to find all chapters in version 20;
–Compute the SPaR range between chapter 2 and 5;
–Perform a partial version retrieval on the full index.
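The first two steps amount to translating chapter ordinals into a DNN interval. A minimal sketch, assuming the CH-index can be modelled as a list of `(dnn, range, v_start, v_end)` tuples with half-open lifespans, and that the projected range runs from the first chapter's DNN to the last chapter's DNN + Range (the function name and data are hypothetical):

```python
INF = float("inf")

def projection_range(ch_index, version, first, last):
    """Map chapter ordinals (1-based) in `version` to a DNN interval."""
    # Step 1: chapters alive in this version, in document order.
    alive = sorted((dnn, rng) for dnn, rng, vs, ve in ch_index if vs <= version < ve)
    # Step 2: interval from the start of `first` to the end of `last`.
    lo = alive[first - 1][0]
    last_dnn, last_rng = alive[last - 1]
    return lo, last_dnn + last_rng

chapters = [
    (10, 20, 1, INF),
    (40, 20, 1, INF),
    (70, 20, 1, 5),     # deleted in version 5
    (100, 20, 1, INF),
    (130, 20, 3, INF),  # inserted in version 3
]
print(projection_range(chapters, 3, 2, 4))  # (40, 120)
print(projection_range(chapters, 5, 2, 4))  # (40, 150)
```

The same ordinals map to different DNN intervals in different versions, which is why the order-to-SPaR translation has to be version-aware.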
11
Problem Reduction
Partial version retrieval: given version i and DNN range r, find all elements in version i whose DNN lies in r.
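As a reference for what the query returns, here is a linear-scan sketch. It only pins down the semantics; an index such as the MVBT exists precisely to avoid this scan. The tuple layout, the half-open lifespan, and the closed DNN range are assumptions.

```python
INF = float("inf")

def partial_version_retrieval(elements, version, lo, hi):
    """elements: iterable of (dnn, v_start, v_end).
    Returns the DNNs alive in `version` with lo <= dnn <= hi, in document order."""
    return sorted(
        dnn
        for dnn, v_start, v_end in elements
        if v_start <= version < v_end and lo <= dnn <= hi
    )

history = [(10, 1, INF), (20, 1, 3), (30, 2, INF), (40, 3, INF)]
print(partial_version_retrieval(history, 2, 15, 35))  # [20, 30]
print(partial_version_retrieval(history, 3, 15, 45))  # [30, 40]
```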
12
Problem Reduction
Path-expression: “find/construct the chapter that contains figure 7 in version 10”:
–Query the FIG-index to find the SPaR for figure 7 in version 10;
–Query the CH-index using that SPaR to find the chapter;
–To construct the chapter, perform a partial version retrieval on the full index.
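The two index lookups can be sketched with plain lists standing in for the FIG-index and CH-index. Both are modelled as `(dnn, range, v_start, v_end)` tuples; the function names, the data, and the half-open lifespan are assumptions made for illustration.

```python
INF = float("inf")

def nth_alive(index, version, n):
    """SPaR (dnn, range) of the n-th (1-based) element alive in `version`."""
    alive = sorted((dnn, rng) for dnn, rng, vs, ve in index if vs <= version < ve)
    return alive[n - 1]

def containing(index, version, spar):
    """SPaR of the indexed element whose interval strictly contains `spar`, or None."""
    dnn, rng = spar
    for p_dnn, p_rng, vs, ve in index:
        if vs <= version < ve and p_dnn < dnn and dnn + rng < p_dnn + p_rng:
            return (p_dnn, p_rng)
    return None

ch_index = [(5, 25, 1, INF), (51, 30, 1, INF)]
fig_index = [(11, 2, 1, INF), (55, 2, 1, INF), (71, 2, 1, 3)]

fig = nth_alive(fig_index, 1, 2)       # SPaR of figure 2 in version 1
print(fig)                             # (55, 2)
print(containing(ch_index, 1, fig))    # (51, 30)
```

The returned chapter SPaR is then the range argument for a partial version retrieval if the chapter's content has to be constructed.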
13
Indexing for List Tags
Indexing the list tags (CH-index, FIG-index) is straightforward because these indices are small. We use the Multi-version B+-tree (MVBT) [BGO+96], which is asymptotically optimal in space, update, and partial version retrieval.
14
Storage and Query Scheme for Full Index We examine two schemes: –MVBT Storage/Index –UBCC Storage + secondary index
15
Motivation for UBCC Storage
The MVBT is capable of storing and querying the multi-versioned XML document, and is asymptotically optimal. Why UBCC? The MVBT is designed for handling one-by-one updates and is not specialized for the batch updates that occur in the document versioning environment.
16
Traditional Versioning Schemes
Naive approach: stores each version in its entirety; minimizes retrieval cost but is very inefficient in storage.
RCS (Revision Control System):
–stores the latest version in its entirety;
–represents old versions by deltas (reverse edit scripts);
–minimizes storage cost;
–version retrieval cost grows linearly with the number of versions.
SCCS (Source Code Control System):
–objects are time-stamped and stored in document order;
–version retrieval cost can be as high as reading the whole change history.
These schemes are used by most current systems, but need improvements in storage management, retrieval, query, and support for complex objects.
17
UBCC Storage Scheme
RCS and SCCS store major versions plus incremental modifications. To answer a query, they find the nearest major version and apply the incremental changes, possibly across multiple versions. They are also designed for full version retrieval.
UBCC [VLDB’01], Usefulness-Based Copy Control, uses the concept of page usefulness.
18
Page Usefulness – by Example
[Figure: a page holding objects A, B, C, D, with deletions applied in V2 and V3.]
Version: Page Usefulness
V1: 100%
V2: 75%
V3: 25%
We set a minimum usefulness requirement U_min, e.g. 70% (0 < U_min <= 1). A page is useful when its usefulness is above U_min and useless when it is below U_min.
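The percentages in this example can be reproduced with a small sketch. It assumes that usefulness is the fraction of a page's objects that are alive in the given version (all objects the same size), that lifespans are half-open, and that a page exactly at U_min counts as useful; the particular objects deleted in each version are also an assumption chosen to match the 100% / 75% / 25% figures.

```python
INF = float("inf")

def usefulness(page, version):
    """Fraction of the page's objects that are alive in `version`."""
    alive = sum(1 for v_start, v_end in page.values() if v_start <= version < v_end)
    return alive / len(page)

def is_useful(page, version, u_min):
    # Treating equality as useful is an assumption; the slide only says above/below.
    return usefulness(page, version) >= u_min

# Lifespans (v_start, v_end): B is deleted in V2, C and D in V3 (hypothetical choice).
page = {"A": (1, INF), "B": (1, 2), "C": (1, 3), "D": (1, 3)}

for v in (1, 2, 3):
    print(v, usefulness(page, v), is_useful(page, v, 0.70))
# 1 1.0 True
# 2 0.75 True
# 3 0.25 False
```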
19
Usefulness Based Copy Control (UBCC)
[Figure: VERSION 1 stores Root, Ch A, Fig D, Sec E, Ch B, Sec F, Fig G, Fig H in pages P1 and P2. VERSION 2 applies the changes INS(Sec J), DEL, INS(Fig M), INS(Sec T), INS(Fig R), DEL, INS(Ch K), INS(Sec L).]
STEP 1: Determine page usefulness for copying. U(P1) = 75%; U(P2) = 50% < U_min = 70%.
STEP 2: Append new/copied objects into new pages by their logical order. P3: Sec J, COPY Ch B, Sec F, Fig M; P4: Ch K, Sec L. The figure also shows the new objects Sec T and Fig R. U(P3) = 100%, U(P4) = 100%.
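The two steps can be sketched as a single check-in routine. This is a simplified illustration, not the published algorithm: the record layout `[name, dnn, v_start, v_end]`, the page capacity of four objects, the DNN values, the split of the eight objects across P1 and P2, the choice of deleted objects, and the decision to end the old record when an object is copied are all assumptions. Only a subset of the slide's insertions is replayed.

```python
INF = float("inf")
PAGE_CAPACITY = 4  # objects per page (assumed)

def check_in(pages, version, deleted, inserted, u_min):
    """pages: list of pages, each a list of [name, dnn, v_start, v_end] records.
    deleted: set of names removed in `version`; inserted: list of (name, dnn).
    Appends the new pages to `pages` and returns them."""
    to_write = [[name, dnn, version, INF] for name, dnn in inserted]
    for page in pages:
        # Step 1: apply deletions, then measure the page's usefulness.
        for rec in page:
            if rec[0] in deleted and rec[3] == INF:
                rec[3] = version
        alive = [rec for rec in page if rec[2] <= version < rec[3]]
        if alive and len(alive) / len(page) < u_min:
            # Useless page: copy its surviving objects forward.
            for rec in alive:
                to_write.append([rec[0], rec[1], version, INF])
                rec[3] = version  # the old copy is superseded from this version on
    # Step 2: write new and copied objects in logical (DNN) order.
    to_write.sort(key=lambda rec: rec[1])
    new_pages = [to_write[i:i + PAGE_CAPACITY]
                 for i in range(0, len(to_write), PAGE_CAPACITY)]
    pages.extend(new_pages)
    return new_pages

pages = [
    [["Root", 10, 1, INF], ["Ch A", 20, 1, INF], ["Fig D", 30, 1, INF], ["Sec E", 40, 1, INF]],
    [["Ch B", 50, 1, INF], ["Sec F", 60, 1, INF], ["Fig G", 70, 1, INF], ["Fig H", 80, 1, INF]],
]
new_pages = check_in(
    pages, 2,
    deleted={"Fig D", "Fig G", "Fig H"},
    inserted=[("Sec J", 45), ("Fig M", 65), ("Ch K", 90), ("Sec L", 95)],
    u_min=0.70,
)
print([[rec[0] for rec in page] for page in new_pages])
# [['Sec J', 'Ch B', 'Sec F', 'Fig M'], ['Ch K', 'Sec L']]
```

With these inputs the first page stays at 75% and is left alone, while the second drops to 50% and has its two survivors copied next to the newly inserted objects.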
20
Complexity Analysis
Version retrieval I/O cost for Version N is bounded by S_N / U_min.
–S_N is the size of Version N.
–E.g. U_min = 50% gives I/O <= 2 * S_N.
Version file size is linear in the size of the change history (as with RCS) and is bounded by O(S_chg / (1 - U_min)), where
–S_chg is the size of the change history;
–U_min is the usefulness requirement.
Both are optimal!
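A short worked calculation shows how the choice of U_min trades one bound against the other. The symbols IO(N) and Size are shorthand introduced here, and the constant hidden by the O-notation is ignored.

```latex
\[
\mathrm{IO}(N) \;\le\; \frac{S_N}{U_{\min}},
\qquad
\mathrm{Size} \;=\; O\!\left(\frac{S_{chg}}{1-U_{\min}}\right)
\]
\[
U_{\min}=0.5:\quad \frac{1}{U_{\min}} = 2,
\qquad \frac{1}{1-U_{\min}} = 2
\]
\[
U_{\min}=0.7:\quad \frac{1}{U_{\min}} \approx 1.43,
\qquad \frac{1}{1-U_{\min}} \approx 3.33
\]
```

Raising U_min tightens the retrieval bound and loosens the storage bound, since pages fall below the threshold sooner and their surviving objects are copied more often.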
21
Using UBCC to cluster the document elements. On top of the document file: –MVBT as a dense index; or –MVRT as a sparse index. Indexing Choices using UBCC
22
Sparse Page Index --- Multi-version R Tree
Multi-version R-Tree (MVRT): each record corresponds to a UBCC page and stores:
–Life Span: (T1, T2)
–Maximum DNN Range: (D1, D2)
–UBCC Page-ID
When a segment of a version is retrieved, the MVRT is traversed to locate useful data pages with an overlapping DNN range.
[Figure: pages P5, P8, P11, P15 and P22 plotted over the DNN range and Version axes, with the query “Retrieve Version 10, Segment (D1,D2)” marked at V10.]
23
Sparse vs. Dense Indexing
Good for sparse MVRT:
–small size;
–each page is checked at most once.
Bad for sparse MVRT:
–may read unnecessary pages, e.g. for the request Version 3, SPaR = (420,700), Page P qualifies but contains no valid element.
Page P: Max DNN Range = (200,500), Life Span = (1,4), Umin = 50%
E1: DNN = 200, Life = (1,4)
E2: DNN = 300, Life = (1,4)
E3: DNN = 400, Life = (1,2)
E4: DNN = 500, Life = (1,2)
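The Page P example can be replayed with a list standing in for the sparse index. This sketch assumes half-open lifespans and closed DNN ranges, and it replaces the R-tree traversal with a linear filter over page records; the function names are hypothetical.

```python
def candidate_pages(index, version, d1, d2):
    """index: list of (page_id, (t1, t2), (lo, hi)) page records.
    A page qualifies if it is alive in `version` and its DNN range overlaps [d1, d2]."""
    return [pid for pid, (t1, t2), (lo, hi) in index
            if t1 <= version < t2 and lo <= d2 and d1 <= hi]

def read_segment(pages, index, version, d1, d2):
    """Read every candidate page and keep only the elements that truly match."""
    hits = []
    for pid in candidate_pages(index, version, d1, d2):
        for dnn, (v_start, v_end) in pages[pid]:
            if v_start <= version < v_end and d1 <= dnn <= d2:
                hits.append(dnn)
    return sorted(hits)

# Page P from the slide: elements E1..E4 as (DNN, lifespan).
pages = {"P": [(200, (1, 4)), (300, (1, 4)), (400, (1, 2)), (500, (1, 2))]}
index = [("P", (1, 4), (200, 500))]

print(candidate_pages(index, 3, 420, 700))      # ['P']
print(read_segment(pages, index, 3, 420, 700))  # []
```

The page record matches the query, so the page is read, yet E1 and E2 fall outside the requested range and E3 and E4 are no longer alive in version 3. A dense index would have avoided the read at the cost of one index entry per element.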
24
Experimental Setup
Platform: Sun Enterprise 250 Server, Solaris 2.8, 16KB page size, 100-page buffer, GNU C++.
Dataset: 1000 versions; initial version of 1000 objects; each object = 200 bytes; 10% change between consecutive versions.
Implemented schemes:
–scheme 1: MVBT storage/index
–scheme 2: UBCC storage, dense MVBT index
–scheme 3: UBCC storage, sparse MVRT index
25
Performance Comparison --- Check-In Time and Index Size
26
Performance Comparison --- Partial Version Retrieval
27
Conclusions and Future Work
We proposed a framework for storing and querying multi-versioned XML documents. We examined techniques that merge traditional versioning schemes and temporal databases for XML version management.
Best scheme:
–UBCC storage
–Sparse MVRT for full index
–Dense MVBT for each tag index
Emerging issues:
–Query language support for version queries.
–User interface for browsing versions and presenting query results.