Efficient Processing of Updates in Dynamic XML Data Changqing Li, Tok Wang Ling, Min Hu.


ICDE'06, Efficient Processing of Updates in XML

Outline
Background and related work
Our proposals
– Lexicographical order
– A compact dynamic binary string encoding (CDBS)
– Applying CDBS to different labeling schemes for update processing
– Experimental evaluation
Conclusion

Background and related work: Labeling schemes
Three main categories of labeling schemes are used to process XML queries:
– (1) Containment labeling scheme [Zhang et al. SIGMOD01, etc.]
– (2) Prefix labeling scheme [Tatarinov et al. SIGMOD02, etc.]
– (3) Prime number labeling scheme [Wu et al. ICDE04]
In this talk, we focus on how labeling schemes can process updates efficiently.

(1) Containment scheme
Each node is assigned three values: “start”, “end”, and “level”.
The relationships between nodes are determined from “start”, “end”, and “level”.
Example tree, with labels written as start,end,level:
Level 1: 1,18,1
Level 2: 2,3,2  4,9,2  10,11,2  12,17,2
Level 3: 5,6,3  7,8,3  13,14,3  15,16,3
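The relationship tests described on this slide can be sketched in Python. The names Label, is_ancestor, is_parent, and precedes are illustrative and not from the talk; the checks assume the usual interval-containment reading of “start”, “end”, and “level”.

```python
from typing import NamedTuple


class Label(NamedTuple):
    start: int
    end: int
    level: int


def is_ancestor(a: Label, d: Label) -> bool:
    # a is an ancestor of d when a's interval strictly contains d's interval.
    return a.start < d.start and d.end < a.end


def is_parent(a: Label, d: Label) -> bool:
    # A parent is an ancestor exactly one level above.
    return is_ancestor(a, d) and d.level == a.level + 1


def precedes(a: Label, b: Label) -> bool:
    # Document order follows the "start" values.
    return a.start < b.start
```

For the example tree, the root 1,18,1 is an ancestor of 5,6,3 but its parent is 4,9,2.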

Containment is bad at processing updates
Inserting a node requires re-labeling all of its ancestor nodes and all the nodes after it in document order.
Before the insertion:
Level 1: 1,18,1
Level 2: 2,3,2  4,9,2  10,11,2  12,17,2
Level 3: 5,6,3  7,8,3  13,14,3  15,16,3
After inserting a node labeled 10,11,2:
Level 1: 1,20,1
Level 2: 2,3,2  4,9,2  10,11,2  12,13,2  14,19,2
Level 3: 5,6,3  7,8,3  15,16,3  17,18,3
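The re-labeling cost in this example can be reproduced with a short simulation. This is only a sketch: the node names (root, a, b, ...) and the function insert_leaf are made up for illustration, and the insertion is modeled as opening a gap of two values at the new node's “start”.

```python
def insert_leaf(labels, new_start, level):
    """Open a gap of two values at new_start; return (new_labels, changed_names)."""
    def shift(value):
        return value + 2 if value >= new_start else value

    new_labels, changed = {}, []
    for name, (start, end, lvl) in labels.items():
        moved = (shift(start), shift(end), lvl)
        if moved != (start, end, lvl):
            changed.append(name)  # this existing node had to be re-labeled
        new_labels[name] = moved
    new_labels["new"] = (new_start, new_start + 1, level)
    return new_labels, changed


# Labels of the example tree before the insertion (hypothetical node names).
before = {
    "root": (1, 18, 1),
    "a": (2, 3, 2), "b": (4, 9, 2), "c": (10, 11, 2), "d": (12, 17, 2),
    "b1": (5, 6, 3), "b2": (7, 8, 3), "d1": (13, 14, 3), "d2": (15, 16, 3),
}
after, changed = insert_leaf(before, 10, 2)
```

Here changed lists the root (an ancestor) and every node after the insertion point, while the nodes before it keep their labels.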

Existing approaches to process updates in the containment scheme
Increase the interval size and leave some values unused for future insertions [Li et al. VLDB01]
– When the unused values are used up, nodes have to be re-labeled
Use floating-point values [Amagasa et al. ICDE03]
– A floating-point value is represented in a computer with a fixed number of bits
– Because of the limited floating-point precision, nodes eventually have to be re-labeled
Neither approach can avoid re-labeling.
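The precision limit can be seen in a small experiment. This sketch assumes IEEE 754 double-precision values (Python's float) and a hypothetical worst case in which every new value is inserted directly after the same left neighbor; insertions_until_exhausted is an illustrative name, not part of any cited system.

```python
def insertions_until_exhausted(lo: float, hi: float) -> int:
    """Count how many midpoints fit between lo and an ever closer right neighbor."""
    count = 0
    while True:
        mid = (lo + hi) / 2
        if not (lo < mid < hi):
            # No representable value is left strictly between lo and hi,
            # so the next insertion would force a re-labeling.
            return count
        hi = mid
        count += 1


exhausted_after = insertions_until_exhausted(1.0, 2.0)
```

The loop always terminates after a few dozen insertions, because each midpoint uses up one more bit of the fixed-size mantissa.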

(2) Prefix scheme
Three main prefix schemes:
– DeweyID [Tatarinov et al. SIGMOD02]
– BinaryString [Cohen et al. PODS02]
– OrdPath [O'Neil et al. SIGMOD04]

DeweyID (cont.)
The relationships between nodes are determined from the prefix property: the label of an ancestor is a prefix of the labels of its descendants.
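A minimal sketch of the prefix test, assuming DeweyID labels are written as dot-separated integers such as “1.2.3” (the slide's figure is not preserved, so these sample labels are hypothetical). The comparison is done per component rather than per character, so that “1.1” is not mistaken for a prefix of “1.10.2”.

```python
def parse(label: str) -> tuple:
    # "1.10.2" -> (1, 10, 2)
    return tuple(int(part) for part in label.split("."))


def is_ancestor(a: str, d: str) -> bool:
    pa, pd = parse(a), parse(d)
    # a is an ancestor of d when a's components are a proper prefix of d's.
    return len(pa) < len(pd) and pd[:len(pa)] == pa


def is_parent(a: str, d: str) -> bool:
    return is_ancestor(a, d) and len(parse(d)) == len(parse(a)) + 1
```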

DeweyID is bad at processing order-sensitive updates
Order-sensitive updates maintain the document order when updates are performed.
– All the sibling nodes after the inserted node, and all the descendants of these siblings, need to be re-labeled
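The cost of an order-sensitive insertion can be illustrated with a rough sketch. It assumes DeweyID labels are stored as tuples of integers and that making room means incrementing the sibling position of every following sibling; insert_sibling and the sample labels are hypothetical.

```python
def insert_sibling(labels, new_label):
    """Insert new_label and shift following siblings (and their subtrees) by one."""
    depth = len(new_label) - 1
    parent, position = new_label[:-1], new_label[-1]
    result, relabeled = [], 0
    for label in labels:
        in_shifted_subtree = (
            len(label) > depth
            and label[:depth] == parent
            and label[depth] >= position
        )
        if in_shifted_subtree:
            # The sibling itself or one of its descendants: bump the sibling position.
            label = label[:depth] + (label[depth] + 1,) + label[depth + 1:]
            relabeled += 1
        result.append(label)
    result.append(new_label)
    return sorted(result), relabeled


tree = [(1,), (1, 1), (1, 2), (1, 2, 1), (1, 3)]
new_tree, relabeled = insert_sibling(tree, (1, 2))
```

In this small tree, inserting a new second child re-labels the two following siblings and the one descendant below them.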

Existing approaches to process updates in the prefix scheme: OrdPath
OrdPath [O'Neil et al. SIGMOD04]
– Similar to DeweyID
– But initially, only odd numbers are used

Existing approaches to process updates in the prefix scheme: OrdPath (cont.)
OrdPath example with the sibling nodes a, b, d, c:
– Label of node a: “-1”
– Label of node b: “4.1”
– Label of node c: “4.3”
– Label of node d: “ ”
They are siblings, but their labels look very different.

(3) Prime number scheme [Wu et al. ICDE04]
Instead of re-labeling, Prime re-calculates the SC value to maintain the document order.
But the re-calculation is much more expensive.
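The slide does not spell out how the SC value works. As a rough sketch only, assume the commonly described form of the scheme: every node has a distinct prime self-label, and one SC (simultaneous congruence) value satisfies SC mod self-label = document order for each node, which can be built with the Chinese Remainder Theorem. The function sc_value and the sample primes are illustrative assumptions.

```python
from math import prod


def sc_value(primes, orders):
    """Return SC such that SC % p == order for every (p, order) pair."""
    modulus = prod(primes)
    total = 0
    for p, order in zip(primes, orders):
        partial = modulus // p
        # pow(partial, -1, p) is the modular inverse of partial modulo p.
        total += order * partial * pow(partial, -1, p)
    return total % modulus


# Four nodes in document order 1..4, identified by their prime self-labels.
sc_before = sc_value([2, 3, 5, 7], [1, 2, 3, 4])

# A node with self-label 11 is inserted at position 3: no self-label changes,
# but the SC value has to be computed again over all the nodes.
sc_after = sc_value([2, 3, 11, 5, 7], [1, 2, 3, 4, 5])
```

The sketch shows why no label changes but the whole SC value is recomputed, and the products involved grow with every node.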

Our CDBS encoding
(1) Lexicographical order
(2) Encoding
(3) Applications and processing of updates
(4) Experimental results

(1) Lexicographical order of binary strings
Given two binary strings “0011” and “01”, “0011” < “01” lexicographically because the comparison runs from left to right, and the 2nd bit of “0011” is “0” while the 2nd bit of “01” is “1”.
Given two binary strings “01” and “0101”, “01” < “0101” lexicographically because “01” is a prefix of “0101”.
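The two rules on this slide translate directly into a comparison function. The name lex_less is illustrative; in Python the built-in string comparison gives the same answer for strings of “0” and “1”, which the later sketches rely on.

```python
def lex_less(a: str, b: str) -> bool:
    """True when binary string a comes before b in lexicographical order."""
    for bit_a, bit_b in zip(a, b):
        if bit_a != bit_b:
            # First differing bit decides: "0" sorts before "1".
            return bit_a < bit_b
    # No difference in the shared part: a proper prefix sorts first.
    return len(a) < len(b)
```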

Finding a binary string lexicographically between two binary strings
To insert a binary string between “0011” and “01”:
– The size of “0011” is 4, which is larger than the size 2 of “01”; this is Case (a) (larger than or equal)
– Therefore we directly concatenate one more “1” after “0011”
– The inserted binary string is “00111”, and “0011” < “00111” < “01” lexicographically
To insert a binary string between “01” and “0101”:
– The size of “01” is 2, which is smaller than the size 4 of “0101”; this is Case (b) (smaller than)
– Therefore we change the last bit “1” of “0101” to “01”, i.e. the inserted binary string is “01001”, and “01” < “01001” < “0101” lexicographically
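Cases (a) and (b) can be written as a two-branch function. This is a minimal sketch: insert_between is an illustrative name, both inputs are assumed to end with “1” and to satisfy left < right, and using the empty string for a missing neighbor is an added convention.

```python
def insert_between(left: str, right: str) -> str:
    """Return a binary string strictly between left and right (left < right)."""
    if len(left) >= len(right):
        # Case (a): left is at least as long, so append one more "1" to it.
        return left + "1"
    # Case (b): left is shorter, so change the last "1" of right into "01".
    return right[:-1] + "01"
```

Because every result again ends with “1”, the function can be applied repeatedly, for example to the pair made of an existing string and one it has just produced.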

(2) Compact encoding
The lexicographical insertion achieves the dynamic objective.
Beyond that, we propose a Compact Dynamic Binary String encoding, called CDBS.

Example illustration of CDBS
We show how to encode 18 numbers with our CDBS encoding.
This is only an example; any other numbers can be encoded with our CDBS.
Level 1: 1,18,1
Level 2: 2,3,2  4,9,2  10,11,2  12,17,2
Level 3: 5,6,3  7,8,3  13,14,3  15,16,3

[Table, built up over several slides: columns Integer number, V-Binary, V-CDBS, F-Binary, F-CDBS, with one row for each of the 18 numbers. Total size (bits): V-Binary 64, F-Binary 90.]
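The table rows are not preserved, but the codes can be regenerated with a short sketch. It assigns the middle number first and then recurses into both halves, reusing the Case (a) and Case (b) insertion rules. The function names, the empty-string sentinels at positions 0 and n + 1, and the rounding of the midpoint (half rounds up) are assumptions, chosen because they reproduce the codes quoted elsewhere in this talk.

```python
from typing import List


def insert_between(left: str, right: str) -> str:
    # Case (a): append "1" to left; Case (b): change right's last "1" to "01".
    if len(left) >= len(right):
        return left + "1"
    return right[:-1] + "01"


def cdbs_codes(n: int) -> List[str]:
    """Return the CDBS codes for the numbers 1..n, in order."""
    codes = [""] * (n + 2)  # positions 0 and n + 1 are empty sentinels

    def fill(lo: int, hi: int) -> None:
        if hi - lo < 2:
            return
        mid = (lo + hi + 1) // 2  # midpoint, with .5 rounded up
        codes[mid] = insert_between(codes[lo], codes[hi])
        fill(lo, mid)
        fill(mid, hi)

    fill(0, n + 1)
    return codes[1:n + 1]


codes18 = cdbs_codes(18)
total_bits = sum(len(code) for code in codes18)
```

Under these assumptions, cdbs_codes(2) gives “01” and “1”, cdbs_codes(18) starts with “00001” and ends with “1111”, and the 18 codes add up to 64 bits, the same total as V-Binary.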

(3) Applying CDBS to the containment scheme
Replace the “start” and “end” values 1 to 18 with our CDBS codes.
The relationships are determined by lexicographical order comparison.
The level stays the same.
Level 1: 00001,1111,1
Level 2: 0001,001,2  0011,0111,2  1,10001,2  1001,111,2
Level 3: 01,01001,3  0101,011,3  101,1011,3  11,1101,3

Applying CDBS to the prefix scheme
The CDBS codes for 4 numbers are “001”, “01”, “1”, and “11”.
The CDBS codes for 2 numbers are “01” and “1”.

Applying CDBS to the prime scheme
Store the document order with our CDBS codes.
The order of the nodes is determined by lexicographical order.
The label size and the query performance of Prime are poor, so we do not show the details.

Processing updates based on CDBS: containment scheme
To insert two binary strings between “0011” and “01”, the two inserted binary strings are “00111” and “001111”.
The complete label of the inserted node is “00111,001111,3”.
The existing nodes need no re-labeling, yet relationships such as ancestor-descendant can still be determined and the order is kept.
Existing labels (unchanged):
Level 1: 00001,1111,1
Level 2: 0001,001,2  0011,0111,2  1,10001,2  1001,111,2
Level 3: 01,01001,3  0101,011,3  101,1011,3  11,1101,3
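This insertion can be replayed in a few lines. The sketch stores labels as (start, end, level) tuples of CDBS strings and relies on Python's string comparison being lexicographical; insert_between and is_ancestor are illustrative names.

```python
def insert_between(left: str, right: str) -> str:
    # Case (a): append "1" to left; Case (b): change right's last "1" to "01".
    if len(left) >= len(right):
        return left + "1"
    return right[:-1] + "01"


def is_ancestor(a, d) -> bool:
    # Same containment test as with integers, but on lexicographical order.
    return a[0] < d[0] and d[1] < a[1]


root = ("00001", "1111", 1)
parent = ("0011", "0111", 2)
first_child = ("01", "01001", 3)

# The new node becomes the first child of parent, so its "start" and "end"
# must both fall between parent's start "0011" and first_child's start "01".
new_start = insert_between(parent[0], first_child[0])
new_end = insert_between(new_start, first_child[0])
new_node = (new_start, new_end, parent[2] + 1)
```

No existing tuple is modified: only two new strings are generated, and the containment and order tests keep working on them.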

Processing updates based on CDBS: prefix scheme
To insert a binary string before “01”, the inserted binary string is “001”.
The complete label of the inserted node is “01.001”.
The existing nodes need no re-labeling, yet relationships such as ancestor-descendant can still be determined and the order is kept.
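A sketch of the same update for prefix labels. It assumes, since the slide's figure is not preserved, that the parent's label is “01” and that its two children carry the self-labels “01” and “1”; labels are held as tuples of CDBS codes and printed with dots. All names are illustrative.

```python
def insert_between(left: str, right: str) -> str:
    # Case (a): append "1" to left; Case (b): change right's last "1" to "01".
    if len(left) >= len(right):
        return left + "1"
    return right[:-1] + "01"


def is_ancestor(a: tuple, d: tuple) -> bool:
    # Prefix property, checked per component.
    return len(a) < len(d) and d[:len(a)] == a


parent = ("01",)
children = [("01", "01"), ("01", "1")]

# Insert before the first child: there is no left sibling, so the left
# neighbor is the empty string (an assumed convention).
self_label = insert_between("", children[0][-1])
new_node = parent + (self_label,)
new_label_text = ".".join(new_node)
```

Tuple comparison in Python compares the components in turn, so sorting the labels yields document order with the new node ahead of its siblings.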

Problem with CDBS
V-CDBS and F-CDBS may encounter the overflow problem when many nodes are inserted.
To solve the overflow problem, we propose QED in [Li & Ling CIKM05].
QED uses four quaternary symbols, 0, 1, 2, and 3, and each is stored with 2 bits.
– 0 is used as the separator (delimiter), so QED never encounters the overflow problem
– QED is not as compact as CDBS, and its update cost is higher
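The storage idea behind QED can be sketched as follows: each symbol takes 2 bits and the symbol 0 closes a code, so no separate length field is needed. The functions pack and unpack and the sample codes are illustrative only; how QED chooses its codes is described in the CIKM05 paper and is not reproduced here.

```python
def pack(codes):
    """Concatenate codes over the symbols 1, 2, 3 into a bit string, using 0 as separator."""
    bits = ""
    for code in codes:
        if not code or set(code) - set("123"):
            raise ValueError("codes must be non-empty strings over 1, 2, 3")
        bits += "".join(format(int(symbol), "02b") for symbol in code)
        bits += "00"  # the symbol 0 marks the end of this code
    return bits


def unpack(bits):
    """Split a packed bit string back into its codes."""
    codes, current = [], ""
    for i in range(0, len(bits), 2):
        symbol = int(bits[i:i + 2], 2)
        if symbol == 0:
            codes.append(current)
            current = ""
        else:
            current += str(symbol)
    return codes


sample = ["12", "2", "3123"]
packed = pack(sample)
```

Because the end of a code is marked in-band, a code can grow to any length, at the price of 2 bits per symbol plus 2 bits per separator.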

(4) Experimental results
Experimental setup
Performance study on static XML
Performance study on updates

Experimental setup
All the schemes are implemented in Java, and all the experiments are carried out on a 3.0 GHz Pentium 4 processor with 1 GB RAM running Windows XP Professional.

Experimental setup (cont.)
The following table shows the datasets we used.
Columns: Datasets; Topics; # of files; Max/average fan-out for a file; Max/average depth for a file; Total # of nodes for each dataset
D1  Movie               49014/65/
D2  Department          19233/814/
D3  Actor               48037/115/
D4  Company             24529/1355/
D5  Shakespeare’s play  37434/486/
D6  NASA                /97/

Performance study on static XML
Our V-CDBS and F-CDBS are the most compact variable-length and fixed-length dynamic encodings.
[Figure: Label sizes of different schemes]

The 5 cases of node updates in the experiments
We select one XML file, Hamlet, in dataset D5 to test the update performance (the results are similar for other XML files).
Hamlet has 5 act elements. We test the following 5 cases:
– inserting an act element before act[1],
– inserting an act element before act[2],
– ...,
– and inserting an act element before act[5].

Number of nodes to re-label in updates
Labeling schemes           Number of nodes to re-label (5 cases)
Float-point-Containment    0  0  0  0  0
V-Binary-Containment
F-Binary-Containment
V-CDBS-Containment         0  0  0  0  0
F-CDBS-Containment         0  0  0  0  0
BinaryString-Prefix
DeweyID(UTF8)-Prefix
OrdPath1-Prefix            0  0  0  0  0
OrdPath2-Prefix            0  0  0  0  0
QED-Prefix                 0  0  0  0  0
Prime

Total time for node updates
When only several nodes are inserted, most of the time is I/O time; our approaches are still the best at processing updates.
When considering processing time only, our approaches are much better: more than 300 times faster.
More appropriate for updates with many nodes.
[Figure: Log2(update time) of different schemes]

Conclusion
Our CDBS is dynamic.
Our CDBS is the most compact.
Its update cost is the cheapest: only the last 1 bit of the neighbor label needs to be modified.