Storing and Querying Multi-version XML Documents using Durable Node Numbers
Shu-Yao Chien, Dept. of CS, UCLA
Vassilis J. Tsotras, Dept. of CS&E, UC Riverside
Carlo Zaniolo, Dept. of CS, UCLA
Donghui Zhang, Dept. of CS&E, UC Riverside

Document Version Management
• Traditional applications migrating to the web:
  – Software configuration management
  – Cooperative work
  – CAD
• An array of web-based applications:
  – Web content providers and trackers
  – Link permanence
  – WebDAV

Problem Definition
• An assortment of new and old applications look to XML for a shared technology and toolset to support their varied requirements.
• Main requirements and research challenges:
  – Efficient version retrieval
  – Storage efficiency
  – Complex query support

Traditional Versioning Schemes
• The naive approach stores each version in its entirety: it minimizes retrieval cost but uses storage very inefficiently.
• RCS (Revision Control System):
  – stores the latest version in its entirety
  – represents old versions by deltas (reverse edit scripts)
  – minimizes storage cost
  – version retrieval cost grows linearly with the number of versions
• SCCS (Source Code Control System):
  – objects are timestamped and stored in document order
  – version retrieval cost can be as high as reading the whole change history
• Most current systems use these schemes, but they need improvements in storage management, retrieval, querying, and support for complex objects.

Databases: Temporal, OO, Semi-structured, XML DB, …
• Databases for CAD and for semi-structured information have paid much attention to version support.
• Temporal DBs: efficient support for transaction time through various indexing schemes (Snapshot Index, Multi-Version B+ Trees, etc.).
• But typical DBs do not support object ordering, since reconstructing the complete document is not a critical query for them.
• Numbering schemes have been proposed to represent document structure and to make the evaluation of regular path expressions more efficient.

Storage Level Enhancement
• UBCC [WebDB200] enhances RCS with page management.
• It offers flexibility in trading off storage and retrieval costs.
• It uses the concept of page usefulness.
• It captures the document order of objects in the (forward) edit script.

Page Usefulness by Example
[Figure: a page holding objects A, B, C, D across versions T1, T2, and T3. Deletions reduce its usefulness from 100% (T1) to 75% (T2) and 25% (T3).]
• We set a minimum usefulness requirement U_min, e.g. 70% (0 < U_min <= 1).
• A page is useful when its usefulness is above U_min and useless when it is below U_min.
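The usefulness test can be sketched in a few lines of Python. This is only an illustration: it assumes usefulness is the fraction of a page's objects still valid in the current version (the slide does not say whether objects or bytes are counted), and the names `usefulness` and `is_useful` are hypothetical.

```python
def usefulness(live_objects: int, total_objects: int) -> float:
    """Fraction of the objects on a page that are still valid (assumed definition)."""
    if total_objects <= 0:
        raise ValueError("total_objects must be positive")
    return live_objects / total_objects


def is_useful(live_objects: int, total_objects: int, u_min: float = 0.7) -> bool:
    """A page counts as useful when its usefulness reaches U_min.

    Treating a page at exactly U_min as useful is an assumption; the slide
    only distinguishes 'above' and 'below'.
    """
    return usefulness(live_objects, total_objects) >= u_min


# The page from the example: four objects, then three and finally one still valid.
for version, live in [("T1", 4), ("T2", 3), ("T3", 1)]:
    print(version, f"{usefulness(live, 4):.0%}", "useful" if is_useful(live, 4) else "useless")
```

With U_min = 70%, the page is useful at T1 and T2 and useless at T3.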

Usefulness Based Copy Control (UBCC)
[Figure: VERSION 1 stores Root, Ch A, Fig D, Sec E, Ch B, Sec F, Fig G, Fig H in pages P1 and P2. The edit script for VERSION 2 contains INS(Sec J), DEL, INS(Fig M), DEL, INS(Ch K), INS(Sec L).]
• STEP 1: Determine page usefulness for copying. U(P1) = 75%; U(P2) = 50% < U_min = 70%.
• STEP 2: Append new and copied objects into new pages in their logical order.
  – P3: Sec J, Ch B (copied), Sec F (copied), Fig M; U(P3) = 100%
  – P4: Ch K, Sec L; U(P4) = 100%
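The two UBCC steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function `ubcc_append`, the fixed page capacity of four objects, and the exact document order of version 2 (including which object was deleted from P1) are assumptions made so the example runs.

```python
def ubcc_append(version_objects, page_of, page_sizes, u_min=0.7, capacity=4):
    """Return (useless_pages, new_pages) for one new version.

    version_objects: objects of the new version in document order.
    page_of: surviving old objects -> the page that stores them (new objects absent).
    page_sizes: page -> number of objects originally written to it.
    """
    # STEP 1: usefulness = surviving objects / objects originally on the page.
    live = {}
    for obj in version_objects:
        page = page_of.get(obj)
        if page is not None:
            live[page] = live.get(page, 0) + 1
    useless = {p for p, total in page_sizes.items() if live.get(p, 0) / total < u_min}

    # STEP 2: new objects, plus survivors of useless pages, are appended
    # to fresh pages in document (logical) order.
    to_write = [o for o in version_objects
                if page_of.get(o) is None or page_of[o] in useless]
    new_pages = [to_write[i:i + capacity] for i in range(0, len(to_write), capacity)]
    return useless, new_pages


# Hypothetical reconstruction of version 2 (assumes Fig D, Fig G, Fig H were deleted).
v2 = ["Root", "Ch A", "Sec E", "Sec J", "Ch B", "Sec F", "Fig M", "Ch K", "Sec L"]
page_of = {"Root": "P1", "Ch A": "P1", "Sec E": "P1", "Ch B": "P2", "Sec F": "P2"}
useless, new_pages = ubcc_append(v2, page_of, {"P1": 4, "P2": 4})
print(useless)    # pages whose survivors get copied forward
print(new_pages)  # contents of the newly written pages
```

Under these assumptions only P2 falls below U_min, so its two surviving objects are copied next to the newly inserted ones while P1 is left in place.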

New Support Is Needed …
• Complex query support:
  – Temporal selection
  – Structural projection
  – Content-based selection
  – Regular path expressions
  – Queries on diffs
• UBCC is not efficient in supporting version queries.
• A new scheme is needed …

The SPaR Versioning Scheme
• SPaR numbering scheme
• Version model
• Complex query support
• Usefulness-based storage strategy

SPaR Numbering Scheme
• XML document structure is represented by:
  – a Durable Node Number (DNN), and
  – a Range
• DNN is a sparse numbering scheme that preserves element order.
• Range preserves parent-child relationships.
• Documents can be decomposed and stored as separate elements, then reconstructed (possibly only partially) when needed.
• Indexes can be built on DNN and Range for efficient XML query evaluation.

SPaR Numbering Scheme by Example
• DNN is a sparse numbering scheme that preserves element order as a pre-order traversal (the same as document order).
• Range preserves the parent-child containment relationship such that:
  dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P)
[Figure: example document tree annotated with DNNs and ranges. Root (dnn=1, range=100); Ch A (dnn=5, range=25); Ch B (dnn=51, range=30); Fig D (dnn=11); Sec E (dnn=21); Sec F (dnn=55); Fig G (dnn=61); Fig H (dnn=71); ranges of the lower-level elements: 2, 5, 10, 2.]
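The containment inequality translates directly into a predicate. The sketch below uses the Root, Ch A, and Ch B values from the example; the range given to Sec E is an assumption, and the names `Node` and `contains` are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    dnn: float
    range: float


def contains(p: Node, c: Node) -> bool:
    """True when dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P)."""
    return p.dnn < c.dnn < c.dnn + c.range < p.dnn + p.range


root = Node("Root", 1, 100)
ch_a = Node("Ch A", 5, 25)
ch_b = Node("Ch B", 51, 30)
sec_e = Node("Sec E", 21, 5)  # range of 5 is assumed for illustration

print(contains(root, ch_a), contains(ch_a, sec_e), contains(ch_b, sec_e))
```

Note that the test needs no tree traversal: two numbers per element are enough to decide whether one element lies inside another.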

Durability upon Updates
• Unused ranges are left between consecutive elements for future insertions.
• When a new element Y is inserted between two consecutive elements X and Z, an unused SPaR range is assigned to Y according to the structural relationship between X, Y, and Z.
• Range overflow is handled by variable-length floating point numbers.
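One way to picture the allocation is to carve the new element's range out of the middle of the free gap, leaving slack on both sides for later insertions. The sketch below is an assumption-laden illustration: the middle-third policy and the name `allocate_between` are hypothetical, and Python's `Fraction` stands in for the variable-length numbers mentioned above.

```python
from fractions import Fraction


def allocate_between(gap_lo, gap_hi):
    """Pick (dnn, range) inside the open gap (gap_lo, gap_hi).

    The element takes the middle third of the gap, so unused space
    remains before and after it.
    """
    lo, hi = Fraction(gap_lo), Fraction(gap_hi)
    if lo >= hi:
        raise ValueError("gap is empty")
    third = (hi - lo) / 3
    return lo + third, third


# Insert Y between sibling X (dnn=11, range=2, so it ends at 13) and Z (dnn=21).
dnn, rng = allocate_between(11 + 2, 21)
print(dnn, rng)

# Repeated insertions into an ever-shrinking gap still succeed,
# because exact fractions never run out of precision.
lo, hi = Fraction(13), Fraction(21)
for _ in range(50):
    d, r = allocate_between(lo, hi)
    assert lo < d and d + r < hi
    hi = d  # the next element goes just before the one inserted last
```

Existing elements keep their numbers throughout, which is what makes the numbering durable.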

SPaR Version Model
• Elements are stored in DNN order along with:
  – Lifespan: (T_start, T_end)
  – SPaR range
• Adding a new version, V_N:
  – Delete(E): set E.T_end to N-1 and free its SPaR range.
  – Insert(E): set the lifespan of E to (N, now) and assign it an unused SPaR range.
  – Update(E, new_value): Delete(E) + Insert(new_value) using the same SPaR range.
• New elements of V_N are appended to data pages in DNN order.
• However, the elements of V_N may be scattered among data pages of low usefulness …
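The three operations can be sketched with an in-memory list of elements. This is a minimal illustration under stated assumptions: lifespans are closed intervals of version numbers, `NOW` is modeled as infinity, and the class and method names are hypothetical; page storage is ignored entirely.

```python
import math
from dataclasses import dataclass

NOW = math.inf  # stands in for the open-ended "now" timestamp


@dataclass
class Element:
    value: str
    dnn: float
    range: float
    t_start: int
    t_end: float = NOW

    def alive_at(self, version: int) -> bool:
        return self.t_start <= version <= self.t_end


class VersionStore:
    def __init__(self):
        self.elements = []

    def insert(self, version, value, dnn, rng):
        """Insert(E): lifespan becomes (version, now)."""
        elem = Element(value, dnn, rng, t_start=version)
        self.elements.append(elem)
        return elem

    def delete(self, version, elem):
        """Delete(E): the element was last valid in the previous version."""
        elem.t_end = version - 1

    def update(self, version, elem, new_value):
        """Update = Delete + Insert, reusing the same SPaR range."""
        self.delete(version, elem)
        return self.insert(version, new_value, elem.dnn, elem.range)

    def snapshot(self, version):
        """Values valid at `version`, in DNN (document) order."""
        live = [e for e in self.elements if e.alive_at(version)]
        return [e.value for e in sorted(live, key=lambda e: e.dnn)]


store = VersionStore()
a = store.insert(1, "A", 10, 5)
b = store.insert(1, "B", 20, 5)
store.delete(2, b)
store.insert(2, "C", 30, 5)
store.update(3, a, "A2")
print(store.snapshot(1), store.snapshot(2), store.snapshot(3))
```

Because deleted elements are only closed off rather than removed, every earlier version remains reconstructible from the same store.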

Version Reconstruction
• To reconstruct version V_N:
  – Step 1: Locate useful data pages using the Sparse Page Index.
  – Step 2: Order elements according to their DNNs.
  – Step 3: Reconstruct the ordered-tree structure of the document.

Step 1: Locate Useful Pages
• Sparse Page Index
[Figure: pages P1 to P8 plotted against version number, each with its lifespan: P1(1,now), P2(1,6), P3(2,5), P4(3,now), P5(4,10), P6(7,8), P7(8,9), P8(9,now).]
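The lookup that Step 1 performs can be sketched with the page lifespans from the figure. The slide does not describe the index's internal structure, so this illustration simply scans a dictionary; it also assumes that a lifespan is a closed interval of versions and that `now` can be modeled as infinity. The name `useful_pages` is hypothetical.

```python
import math

NOW = math.inf

# Page lifespans as listed in the figure.
PAGE_LIFESPANS = {
    "P1": (1, NOW), "P2": (1, 6), "P3": (2, 5), "P4": (3, NOW),
    "P5": (4, 10), "P6": (7, 8), "P7": (8, 9), "P8": (9, NOW),
}


def useful_pages(index, version):
    """Pages whose lifespan covers `version` (linear scan, for illustration only)."""
    return [page for page, (start, end) in index.items() if start <= version <= end]


print(useful_pages(PAGE_LIFESPANS, 8))
```

Only the pages returned here need to be read to reconstruct the requested version.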

Steps 2 and 3
• Ordering elements by their DNNs:
  – Valid elements inserted in the same version are already sorted by DNN.
  – Merge-sort these sorted lists.
• Reconstructing the ordered-tree structure:
  – Parent-child relationships are determined by SPaR ranges.
  – Sibling order is implied by the DNN order.
  – Maintain a backward ancestor stack for back-tracking.
[Figure: per-version sorted element lists, e.g. for V3, V7, and V13, each a DNN-ordered run of Ch, Sec, and Fig elements.]
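Steps 2 and 3 can be sketched together: a k-way merge of the per-version sorted runs, followed by a single pass that rebuilds the tree with an ancestor stack. The element values below are illustrative (the leaf ranges are assumed), and `rebuild_tree` and `outline` are hypothetical names.

```python
import heapq


def rebuild_tree(runs):
    """runs: lists of (dnn, range, name) tuples, each already sorted by dnn."""
    roots, stack = [], []  # stack holds (end_of_range, node) for open ancestors
    for dnn, rng, name in heapq.merge(*runs):  # Step 2: merge the sorted runs
        node = {"name": name, "children": []}
        end = dnn + rng
        # Step 3: back-track past ancestors whose range does not contain this node.
        while stack and not end < stack[-1][0]:
            stack.pop()
        (stack[-1][1]["children"] if stack else roots).append(node)
        stack.append((end, node))
    return roots


def outline(node, depth=0):
    """Indented listing of a reconstructed subtree."""
    lines = ["  " * depth + node["name"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


runs = [
    [(1, 100, "Root"), (5, 25, "Ch A"), (51, 30, "Ch B")],  # inserted in one version
    [(11, 2, "Fig D"), (21, 5, "Sec E")],                   # inserted in another
    [(55, 10, "Sec F")],
]
print("\n".join(outline(rebuild_tree(runs)[0])))
```

Since the merged stream arrives in document order, each element is pushed and popped at most once, so the tree is rebuilt in one linear pass.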

Regular Path Expression
• Regular path query: “For version 10, retrieve all figures contained in a chapter.”
  doc[version=10]/Ch/*/Fig
• Basic ideas:
  – Traditional algorithms trace the tree structure to match the path pattern.
  – SPaR ranges make it possible to evaluate a path query using only the relational join operator.
  – We use the SPaR ranges of Ch elements to reduce the search space for Fig elements.
  – A Multi-version B+ Tree is built to support searches based on DNNs.

Dense Element Index
• A Multi-version B+ Tree (MVBT) keeps the history of a B+ Tree.
• We use MVBTs to build dense element indexes.
[Figure: Ch_MVBT and Fig_MVBT, one dense index per element type. Example index entries:
  SPaR: (200,300)  Life: (1,now)  Loc: Page P1
  SPaR: (500,700)  Life: (3,now)  Loc: Page P1
  SPaR: (400,410)  Life: (1,now)  Loc: Page P5
  SPaR: (480,490)  Life: (1,15)   Loc: Page P1
  SPaR: (250,260)  Life: (2,10)   Loc: Page P3
  SPaR: (550,560)  Life: (2,9)    Loc: Page P1]
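The join-based evaluation of the figure-in-chapter query can be sketched over the index entries shown on the slide. Several things here are assumptions: plain sorted lists stand in for the MVBTs, each SPaR pair is read as a (start, end) interval, the two widest ranges are taken to be the Ch entries and the rest Fig entries, and lifespans are closed intervals. The sketch also checks containment at any depth rather than the exact one-level `*` step.

```python
import math
from bisect import bisect_right

NOW = math.inf

# (SPaR interval, lifespan, page); assignment to Ch/Fig is assumed.
CH_INDEX = [((200, 300), (1, NOW), "P1"), ((500, 700), (3, NOW), "P1")]
FIG_INDEX = [((400, 410), (1, NOW), "P5"), ((480, 490), (1, 15), "P1"),
             ((250, 260), (2, 10), "P3"), ((550, 560), (2, 9), "P1")]


def alive(life, version):
    return life[0] <= version <= life[1]


def figs_in_chapters(ch_index, fig_index, version):
    """Fig entries valid at `version` whose interval lies inside a valid Ch interval."""
    chapters = [e for e in ch_index if alive(e[1], version)]
    figs = sorted(e for e in fig_index if alive(e[1], version))
    starts = [f[0][0] for f in figs]
    result = []
    for (c_lo, c_hi), _, _ in chapters:
        # The chapter's range bounds the search: skip to the first Fig after c_lo.
        i = bisect_right(starts, c_lo)
        while i < len(figs) and starts[i] < c_hi:
            if figs[i][0][1] < c_hi:
                result.append(figs[i])
            i += 1
    return result


print(figs_in_chapters(CH_INDEX, FIG_INDEX, 10))
```

For version 10 only the figure at (250,260) qualifies: the figure at (550,560) lies inside the second chapter but its lifespan ended at version 9, and the other two figures fall outside both chapter ranges.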

Performance
• Pages stored: size(RCS) / (1 - U_min)
• Retrieval of a single version: size(Version) / U_min pages
• UBCC uses a separate edit script pointing to the data:
  – to retrieve only useful pages
  – in the right order
• The SPaR scheme needs only SPaR ranges to reconstruct versions.
• SPaR is slightly better than UBCC in storage cost and version reconstruction.
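A short calculation shows how U_min trades storage against retrieval cost under the two formulas above. The input sizes (1000 pages for RCS, 100 pages for one version) are made-up numbers, and the function names are hypothetical.

```python
def pages_stored(rcs_pages: float, u_min: float) -> float:
    """Storage estimate from the slide: size(RCS) / (1 - U_min)."""
    if not 0 < u_min < 1:
        raise ValueError("this estimate needs 0 < U_min < 1")
    return rcs_pages / (1 - u_min)


def pages_read(version_pages: float, u_min: float) -> float:
    """Retrieval estimate from the slide: size(Version) / U_min."""
    if not 0 < u_min <= 1:
        raise ValueError("U_min must satisfy 0 < U_min <= 1")
    return version_pages / u_min


for u_min in (0.25, 0.5, 0.75):
    print(u_min, pages_stored(1000, u_min), pages_read(100, u_min))
```

Raising U_min makes each retrieved page carry more valid data, so fewer pages are read per version, at the price of more copying and therefore more stored pages.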

Performance and Storage Cost (10% inserted, 10% deleted)

Conclusion and Future Work
• The web changes everything; XML unifies everything.
• It is time for a new technology that merges traditional versioning schemes and temporal databases and overcomes their limitations.
• Usefulness-based clustering is effective and versatile: we applied it to edit-script-based schemes (UBCC) and to the SPaR scheme.
• The SPaR numbering scheme makes it possible to build a structural index over the document and to evaluate complex version queries efficiently.
• Emerging issues:
  – Query language support for version queries
  – User interfaces for browsing versions and presenting query results