QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates Changqing Li,Tok Wang Ling.

Presentation transcript:

QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates Changqing Li,Tok Wang Ling

2 Outline Background and related work Our QED encoding Completely avoid re-labeling in XML updates based on our QED Experiments Conclusion

3 Background Three main categories of labeling schemes to process XML queries –Containment labeling scheme [Zhang et al SIGMOD01 etc.] –Prefix labeling scheme [Tatarinov et al SIGMOD02 etc.] –Prime number labeling scheme [Wu et al ICDE04]

4 (1) Containment Scheme Each label consists of “start”, “end”, and “level” values. Determine ancestor-descendant and parent-child relationships based on the containment property. Example tree labels (start,end,level): 1,16,1; 2,3,2; 4,9,2; 10,13,2; 14,15,2; 5,6,3; 7,8,3; 11,12,3. “5,6,3” is a descendant of “1,16,1” because interval [5,6] is contained in interval [1,16]. “5,6,3” is a child of “4,9,2” because interval [5,6] is contained in interval [4,9] and the levels differ by 3-2=1.
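The containment test above can be sketched in a few lines of Python. This is only an illustration: the `Label` type and the function names are assumptions made here, not names from the slides.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """A containment label (hypothetical type for this sketch)."""
    start: int
    end: int
    level: int


def is_descendant(d: Label, a: Label) -> bool:
    # d is a descendant of a when d's interval lies strictly inside a's.
    return a.start < d.start and d.end < a.end


def is_child(c: Label, p: Label) -> bool:
    # A child is a descendant exactly one level below its parent.
    return is_descendant(c, p) and c.level - p.level == 1


root = Label(1, 16, 1)
inner = Label(4, 9, 2)
leaf = Label(5, 6, 3)

print(is_descendant(leaf, root))  # True
print(is_child(leaf, inner))      # True
print(is_child(leaf, root))       # False
```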

5 (1) Containment Scheme, Containment is bad at processing updates Need to re-label all the ancestor nodes and all the nodes after the inserted node in document order. Tree labels before the insertion: 1,16,1; 2,3,2; 4,9,2; 10,13,2; 14,15,2; 5,6,3; 7,8,3; 11,12,3

6 (1) Containment Scheme, Containment is bad at processing updates Need to re-label all the ancestor nodes and all the nodes after the inserted node in document order. Tree labels after the insertion: 1,18,1; 2,3,2; 4,9,2; 10,11,2; 12,15,2; 16,17,2; 5,6,3; 7,8,3; 13,14,3. All the numbers shown in red on the slide need to be changed, which is very expensive

7 (1) Containment Scheme, Approaches to solve the update problem Increase the interval size and leave some values unused [Li et al VLDB01] –When the unused values are used up, have to re-label Use floating-point values [Amagasa et al ICDE03] –A floating-point value is represented in a computer with a fixed number of bits –Due to the limited floating-point precision, have to re-label Neither approach can completely avoid re-labeling

8 (2) Prefix Scheme Determine ancestor-descendant and parent-child relationships based on the prefix property. “2.1” is a descendant of the root because the label of the root is empty, and the empty label is a prefix of “2.1”. “2.1” is a child of “2” because “2” is an immediate prefix of “2.1”, i.e. after removing “2” from the left side of “2.1”, no other prefix remains.
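A minimal Python sketch of the prefix test, assuming labels are written as dot-separated strings and the root label is the empty string, as in the example above. The helper names are chosen here for illustration.

```python
def components(label: str) -> list[str]:
    # The empty label (the root) has no components.
    return label.split(".") if label else []


def is_descendant(d: str, a: str) -> bool:
    # a is an ancestor of d when a's components are a proper prefix of d's.
    dc, ac = components(d), components(a)
    return len(ac) < len(dc) and dc[:len(ac)] == ac


def is_child(c: str, p: str) -> bool:
    # A child has exactly one more component than its parent.
    return is_descendant(c, p) and len(components(c)) == len(components(p)) + 1


print(is_descendant("2.1", ""))   # True: the root's empty label prefixes everything
print(is_child("2.1", "2"))       # True
print(is_child("2.1", ""))        # False: two levels apart
```

Comparing component lists rather than raw strings avoids treating “2” as a prefix of “21.1”.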

9 (2) Prefix Scheme, Prefix is bad at processing order-sensitive updates Order-sensitive updates: the document order must be maintained when updates are performed. Need to re-label all the sibling nodes after the inserted node and all the descendants of these siblings

10 (2) Prefix Scheme, Prefix is bad at processing order-sensitive updates Order-sensitive updates: the document order must be maintained when updates are performed. Need to re-label all the sibling nodes after the inserted node and all the descendants of these siblings. All the numbers shown in red on the slide need to be changed, which is very expensive

11 (2) Prefix Scheme, Approaches to solve the update problem OrdPath [O'Neil et al SIGMOD04] –At the beginning, use odd numbers only

12 (2) Prefix Scheme, Approaches to solve the update problem OrdPath [O'Neil et al SIGMOD04] –In insertion, use even numbers together with odd numbers Label of node a: “-1” Label of node b: “6.1” Label of node c: “6.3” Label of node d: “ ” All are at the same level, bad

13 (2) Prefix Scheme, Problems of OrdPath Nodes a, b, and c are at the same level, but their labels “-1”, “6.1”, and “6.3” do not show this directly; more time is needed to determine it, which decreases query performance. Wastes half of the numbers (the even numbers), which increases the label size. Needs to calculate the even number between two odd numbers, so the update cost is not cheap. Uses a fixed-length field to store the size of a label; this field will eventually overflow when a lot of nodes are inserted, so OrdPath cannot completely avoid re-labeling

14 (3) Prime scheme Based on a top-down approach, each node is given a unique prime number (self_label), and the label of each node is the product of its parent node’s label (parent_label) and its own self_label. Query –Uses modular and division operations to determine the ancestor-descendant and ordering relationships, which are very expensive Update –When nodes are inserted into the XML tree, the SC (simultaneous congruence) values need to be re-calculated, which is much more expensive than re-labeling Details can be found in [Wu et al ICDE04]
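As a rough illustration of the product-of-primes idea, the sketch below builds labels top-down and tests the ancestor-descendant relationship with a modulo operation. It assumes the root carries label 1 and it leaves out the order-related SC values entirely; the function names are hypothetical, and the full scheme is described in [Wu et al ICDE04].

```python
def make_label(parent_label: int, self_label: int) -> int:
    # Label = parent's label multiplied by the node's own prime.
    return parent_label * self_label


def is_ancestor(a: int, d: int) -> bool:
    # a is an ancestor of d when a's label divides d's label.
    return a != d and d % a == 0


def is_parent(p: int, c: int, c_self: int) -> bool:
    # Dividing the child's label by its own prime gives the parent's label.
    return c % c_self == 0 and c // c_self == p


root = 1                      # assumption: the root is labeled 1
first = make_label(root, 2)   # 2
second = make_label(root, 3)  # 3
inner = make_label(first, 5)  # 10

print(is_ancestor(first, inner))   # True: 10 % 2 == 0
print(is_ancestor(second, inner))  # False: 10 % 3 != 0
print(is_parent(first, inner, 5))  # True: 10 // 5 == 2
```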

15 Our QED encoding Dynamic Quaternary Encoding (QED) Four quaternary numbers “0”, “1”, “2” and “3” are used in the code and each number is stored with two bits, i.e. “00”, “01”, “10” and “11”. The quaternary number “0” is used as the separator, and only “1”, “2”, and “3” are used in the QED encoding. –Compare QED codes based on the lexicographical order

16 Example about QED We show how to encode 16 numbers; we choose 16 because the containment scheme example has 16 “start” and “end” values in total. This is only an example; any other number of values can be encoded by our QED. At each step, encode the (1/3)th and (2/3)th numbers between two numbers –“0” is the separator and only “1”, “2”, and “3” appear in the QED codes, hence the (1/3)th and (2/3)th positions. Containment labels being encoded: 1,16,1; 2,3,2; 4,9,2; 10,13,2; 14,15,2; 5,6,3; 7,8,3; 11,12,3

17 Example about QED Table columns: Decimal number, FixedLength, VarLength, QED, Position
(1/3)th position = 6 = round(0+(17-0)/3)
(2/3)th position = 11 = round(0+(17-0)*2/3)

18 Example about QED Table columns: Decimal number, FixedLength, VarLength, QED, Position
(1/3)th position = 2 = round(0+(6-0)/3)
(2/3)th position = 4 = round(0+(6-0)*2/3)
(1/3)th position = 6 = round(0+(17-0)/3)
(1/3)th position = 8 = round(6+(11-6)/3)
(2/3)th position = 9 = round(6+(11-6)*2/3)
(2/3)th position = 11 = round(0+(17-0)*2/3)
(1/3)th position = 13 = round(11+(17-11)/3)
(2/3)th position = 15 = round(11+(17-11)*2/3)

19 Example about QED Table columns: Decimal number, FixedLength, VarLength, QED, Position
(1/3)th position = 1 = round(0+(2-0)/3)
(1/3)th position = 2 = round(0+(6-0)/3)
(1/3)th position = 3 = round(2+(4-2)/3)
(2/3)th position = 4 = round(0+(6-0)*2/3)
(1/3)th position = 5 = round(4+(6-4)/3)
(1/3)th position = 6 = round(0+(17-0)/3)
(1/3)th position = 7 = round(6+(8-6)/3)
(1/3)th position = 8 = round(6+(11-6)/3)
(2/3)th position = 9 = round(6+(11-6)*2/3)
(1/3)th position = 10 = round(9+(11-9)/3)
(2/3)th position = 11 = round(0+(17-0)*2/3)
(1/3)th position = 12 = round(11+(13-11)/3)
(1/3)th position = 13 = round(11+(17-11)/3)
(1/3)th position = 14 = round(13+(15-13)/3)
(2/3)th position = 15 = round(11+(17-11)*2/3)
(1/3)th position = 16 = round(15+(17-15)/3)
Boundary positions: 0 and 17
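The position arithmetic on these slides can be reproduced with a short recursive sketch. The helper names are assumptions. The traversal here is depth-first, so positions come out in a different order than the slide-by-slide presentation, but each range is split at the same (1/3)th and (2/3)th positions.

```python
def third_positions(left: int, right: int) -> tuple[int, int]:
    # round(left + d/3) and round(left + 2*d/3) using integer arithmetic only.
    # A third never lands exactly on .5, so the rounding mode does not matter.
    d = right - left
    return left + (2 * d + 3) // 6, left + (4 * d + 3) // 6


def split_order(left: int, right: int) -> list[int]:
    """Positions strictly between left and right, in the order they are chosen."""
    if right - left < 2:
        return []
    p1, p2 = third_positions(left, right)
    chosen = [p1] if p1 == p2 else [p1, p2]
    order = list(chosen)
    bounds = [left, *chosen, right]
    # Recurse into each sub-range created by the chosen positions.
    for lo, hi in zip(bounds, bounds[1:]):
        order += split_order(lo, hi)
    return order


print(third_positions(0, 17))   # (6, 11)
print(third_positions(11, 17))  # (13, 15)
print(sorted(split_order(0, 17)) == list(range(1, 17)))  # True
```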

20 Overflow problem of other methods In the previous page, we can see that the FixedLength codes are stored with length 5, i.e. each code is 5 bits long. When a lot of codes are inserted, length 5 is not large enough, and all the FixedLength codes need to be changed. For the VarLength codes, we also need to store the length of each code, e.g., the length of “10000” is 5. We need to store this 5 using a fixed number of bits (“101”; 3 bits). The sizes of the other codes are also stored using the same fixed number of bits (3 bits). When a lot of codes are inserted, the 3-bit size field is not large enough, and then all the codes must be changed. This is called the overflow problem.

21 Our QED uses “0” to separate different codes and will never encounter the overflow problem The QED codes “112”, “12”, “122”, etc. in the table are separated with “0” Stored as “ ”; based on the separator “0”, we can separate different codes The separator “0” will never encounter the overflow problem Our QED encoding can help to completely avoid re-labeling
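The following sketch shows why a separator symbol needs no length field. It assumes, purely for illustration, that a “0” is written after every code; the exact stored layout on the slide is not reproduced here, and the function names are hypothetical.

```python
def pack(codes: list[str]) -> str:
    # Assumed layout: each code is followed by one "0" separator.
    return "".join(code + "0" for code in codes)


def unpack(stream: str) -> list[str]:
    # Codes never contain "0", so splitting on it recovers them exactly.
    return [code for code in stream.split("0") if code]


def to_bits(symbols: str) -> str:
    # Each quaternary symbol takes two bits: 0->00, 1->01, 2->10, 3->11.
    return "".join(format(int(s), "02b") for s in symbols)


codes = ["112", "12", "122"]
stream = pack(codes)
print(unpack(stream) == codes)  # True
print(to_bits("12"))            # 0110
```

Because a code can grow to any length without changing how it is delimited, there is no size field that could overflow.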

22 Lexicographical order for our QED Our QED compares codes based on the lexicographical order The QED codes in the table are lexicographically ordered from top to bottom. –E.g., “132” < “2” lexicographically because the comparison is from left to right, and the 1st symbol of “132” is “1”, while the 1st symbol of “2” is “2”. –Another example, “23” < “232” lexicographically because “23” is a prefix of “232”.
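For codes written as strings over the symbols “1”, “2”, and “3”, Python's built-in string comparison already follows this lexicographical order, which makes it convenient for a quick check. The variable name is chosen here for illustration.

```python
codes = ["2", "132", "232", "23", "112"]

# Left-to-right comparison: the first differing symbol decides.
print("132" < "2")    # True

# A proper prefix sorts before any code that extends it.
print("23" < "232")   # True

print(sorted(codes))  # ['112', '132', '2', '23', '232']
```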

23 (a) Applying QED encoding to the containment scheme Replace the “start” and “end” values “1” to “16” with our QED codes A QED encoding based on the containment scheme is formed Compare labels based on lexicographical order Resulting labels (start,end): 112,332; 12,122; 13,23; 232,32; 322,33; 132,2; 212,22; 3,312 Note that we drop the level values from the right graph just for a clear presentation

24 (b) Applying QED encoding to the prefix scheme The root has 4 children. To encode 4 numbers based on our QED, the codes will be “12”, “2”, “3” and “32”. Similarly if there are 2 siblings, their self_labels (last component, e.g., “3” in “2.3” is the self_label) are “2” and “3”. If there is only 1 sibling, its self_label is “2”

25 (b) Processing the delimiters of the prefix scheme based on our QED In the prefix scheme, the delimiter “.” cannot be stored together with the numbers in the implementation to separate different components. With our QED encoding, the delimiters are processed as follows. –One “0” is used as the delimiter that separates the components of a prefix label, e.g. it separates “12” and “3” in “12.3”; the delimiter “0” is equivalent to the “.”, so “12.3” is stored as “1203” in the implementation; –Two consecutive separators “00” are used to separate different labels, e.g. “ ” represents 2 labels, i.e. “1202” and “1203”.
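A small sketch of the two delimiter levels, assuming labels are handled as dot-separated strings and that “00” is placed between consecutive labels. The function names are hypothetical.

```python
def store_label(label: str) -> str:
    # "." between components becomes a single "0".
    return label.replace(".", "0")


def store_labels(labels: list[str]) -> str:
    # Assumption: "00" is written between consecutive labels.
    return "00".join(store_label(label) for label in labels)


def load_labels(stream: str) -> list[str]:
    # Components are non-empty and never contain "0", so "00" can only
    # occur at a boundary between two labels.
    return [part.replace("0", ".") for part in stream.split("00")]


print(store_label("12.3"))  # 1203
print(load_labels(store_labels(["12.2", "12.3"])))  # ['12.2', '12.3']
```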

26 Algorithm for insertion based on QED
Algorithm: GetInsertedCode
Input: Left_Code, Right_Code
Output: Inserted_Code, such that Left_Code < Inserted_Code < Right_Code lexicographically.
get the sizes of Left_Code and Right_Code
if size(Left_Code) < size(Right_Code) //Case (1)
  then Inserted_Code = (the Right_Code with the last symbol changed to “1”) concatenate “2”
else if size(Left_Code) > size(Right_Code)
  if the last symbol of Left_Code is “2” //Case (2)
    then Inserted_Code = the Left_Code with the last symbol changed from “2” to “3”
  else if the last symbol of Left_Code is “3” //Case (3)
    then Inserted_Code = Left_Code concatenate “2”
else if size(Left_Code) = size(Right_Code) //Case (4)
  then Inserted_Code = Left_Code concatenate “2”
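The four cases translate almost directly into Python. The sketch below assumes that every code ends in “2” or “3” and that an empty string stands for a missing neighbour (inserting before the first or after the last code); both are assumptions of this sketch. The `encode` helper is hypothetical as well: the slides do not state how the initial codes are assigned, so it simply combines the (1/3)th and (2/3)th positions with `get_inserted_code`, which happens to reproduce the codes quoted in the examples.

```python
def get_inserted_code(left: str, right: str) -> str:
    """Return a code strictly between left and right (lexicographically)."""
    if len(left) < len(right):      # Case (1)
        return right[:-1] + "1" + "2"
    if len(left) > len(right):
        if left[-1] == "2":         # Case (2)
            return left[:-1] + "3"
        return left + "2"           # Case (3): last symbol is "3"
    return left + "2"               # Case (4): equal sizes


def encode(n: int) -> list[str]:
    """Codes for positions 1..n (hypothetical helper, not from the slides)."""
    codes = {0: "", n + 1: ""}      # boundary positions carry empty codes

    def fill(lo: int, hi: int) -> None:
        d = hi - lo
        if d < 2:
            return
        p1, p2 = lo + (2 * d + 3) // 6, lo + (4 * d + 3) // 6
        codes[p1] = get_inserted_code(codes[lo], codes[hi])
        if p2 != p1:
            codes[p2] = get_inserted_code(codes[p1], codes[hi])
        for a, b in ((lo, p1), (p1, p2), (p2, hi)):
            fill(a, b)

    fill(0, n + 1)
    return [codes[i] for i in range(1, n + 1)]


print(get_inserted_code("23", "232"))    # 2312  (Case 1)
print(get_inserted_code("2312", "232"))  # 2313  (Case 2)
print(get_inserted_code("2", "3"))       # 22    (Case 4)
print(encode(4))                         # ['12', '2', '3', '32']
```

Each call only reads its two neighbours, so an insertion never touches any existing code.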

27 XML updates based on our QED – containment When we insert a node as shown in the figure, we should insert two QED codes between “23” and “232” –First create the “start” value, i.e. a code between “23” and “232”; the new code is “2312”; see Case (1) of the GetInsertedCode algorithm; –Then create the “end” value, i.e. a code between “2312” and “232”; the new code is “2313”; see Case (2) of the GetInsertedCode algorithm; Since “23” < “2312” < “2313” < “232” lexicographically, we need not re-label any existing nodes. Existing labels (start,end): 112,332; 12,122; 13,23; 232,32; 322,33; 132,2; 212,22; 3,312

28 XML updates based on our QED – prefix scheme When we insert a node as shown in the figure, we should insert one QED code between “2” and “3” –The new QED code between “2” and “3” is “22”; see Case (4) of the GetInsertedCode algorithm; Since “2” < “22” < “3” lexicographically, we need not re-label any existing nodes, and the document order is still kept

29 Experimental results – Experimental setup We mainly report the results on updates. We select the Hamlet file in Shakespeare’s play dataset Intermittent updates –The Hamlet file has 5 act elements, giving 6 insertion cases, i.e. before act[1], between act[1] and act[2], …, between act[4] and act[5], and after act[5]. Uniformly frequent updates –Insertions happen randomly at different places of the Hamlet file Skewed frequent updates –Insertions always happen at a fixed place of the Hamlet file

30 Experimental results – intermittent updates Prime needs to re-calculate fewer SC values, but its re-calculation time is very large Theorem. Our QED never needs to re-label any existing nodes The update time of our QED is much smaller The update performance differences among OrdPath, Float-point, and our QED can be seen in the next page Note that QED represents both the QED encoding and the QED-containment scheme, while QED-PREFIX represents the scheme obtained when we apply the QED encoding to the prefix scheme. Figure panels: (a) Number of nodes to re-label (b) Time to re-label

31 Experimental results – uniformly frequent updates When uniformly frequent updates are performed, –The update time of OrdPath and Float-point is much larger (more than 386 times) than the time required by our QED approaches Our QED encoding only needs to modify the last 2 bits of the neighbor label, which is very cheap Neither OrdPath nor Float-point can completely avoid re-labeling Figure panels: (a) OrdPath1&2 vs QED-PREFIX (b) Float-point vs QED

32 Experimental results – skewed frequent updates When skewed frequent updates are performed, –The update time of OrdPath and Float-point is much larger (more than 8126 times) than the time required by our QED approaches The very large update time makes OrdPath and Float-point unsuitable for answering queries in an environment with frequent insertions. Our QED still works best for answering queries in an environment where frequent insertions are executed Figure panels: (a) OrdPath1&2 vs QED-PREFIX (b) Float-point vs QED

33 Conclusion We propose the QED encoding QED can be applied broadly to different labeling schemes QED can completely avoid re-labeling in XML updates