1 CS 561 Presentation: Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Ming Li
2 Our Objective Developing a system that will enable us to perform XML data queries efficiently.
3 XML Query Languages Used for retrieving data from XML files. Use a regular path expression syntax. e.g. XPath, XQuery.
4 Queries Today - Inefficient Queries are usually evaluated by XML tree traversal, which is inefficient: –Top-Down Approach –Bottom-Up Approach –An example: the query /chapter/_*/figure (finding all figures in all chapters).
5 Our Objective - Refined Developing a system that will enable us to perform XML data queries efficiently Developing such a system consists of: –Developing a way to efficiently store XML data. –Developing efficient algorithms for processing regular path expressions (e.g. XQuery expressions).
6 Storing XML Documents - XISS XISS - XML Indexing and Storage System. Provides us with ways to: –efficiently find all elements or attributes with the same name string, grouped by the document to which they belong. –quickly determine the ancestor-descendant relationship between elements and/or attributes in the hierarchy of XML data.
7 Determining Ancestor-Descendant Relationship According to Dietz's numbering scheme: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal. Example:
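The preorder/postorder test can be tried on a small tree. The following is a minimal sketch, not part of the original slides: the tree, the nested-tuple representation, and the function names `number_tree` and `is_ancestor` are all assumptions made for illustration.

```python
# Hypothetical tree: each node is (name, [children]).
tree = ("book", [("chapter", [("figure", [])]), ("index", [])])


def number_tree(root):
    """Return {name: (preorder_rank, postorder_rank)} for every node."""
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        name, children = node
        counters["pre"] += 1
        pre = counters["pre"]          # rank assigned when the node is entered
        for child in children:
            visit(child)
        counters["post"] += 1          # rank assigned when the node is left
        ranks[name] = (pre, counters["post"])

    visit(root)
    return ranks


def is_ancestor(ranks, x, y):
    """x is an ancestor of y iff x comes first in preorder and last in postorder."""
    pre_x, post_x = ranks[x]
    pre_y, post_y = ranks[y]
    return pre_x < pre_y and post_x > post_y


ranks = number_tree(tree)
print(is_ancestor(ranks, "book", "figure"))    # True
print(is_ancestor(ranks, "chapter", "index"))  # False
```

Once the ranks are stored, each test is two integer comparisons, which is why the check runs in constant time.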
8 Determining Ancestor-Descendant Relationship – cont. Advantage: the ancestor-descendant relationship can be determined in constant time. Disadvantage: a lack of flexibility. –e.g. inserting a new node requires renumbering many tree nodes.
9 Determining Ancestor-Descendant Relationship – cont.
A new numbering scheme:
–Each node is associated with a pair <order, size>.
–For a tree node y and its parent x, the interval [order(y), order(y) + size(y)] is contained in (order(x), order(x) + size(x)], where the left endpoint of the parent's interval is exclusive.
–For two sibling nodes x and y, if x is the predecessor of y in preorder traversal, then order(x) + size(x) < order(y).
10 Determining Ancestor-Descendant Relationship – cont. Fact: for two given nodes x and y of a tree T, x is an ancestor of y iff order(x) < order(y) ≤ order(x) + size(x).
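The interval test above amounts to one chained comparison. A minimal sketch follows; the `Label` type and the hand-assigned numbers are illustrative assumptions, chosen only so that they satisfy the containment and sibling rules of the numbering scheme.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """Hypothetical container for a node's <order, size> pair."""
    order: int
    size: int


def is_ancestor(x: Label, y: Label) -> bool:
    """x is an ancestor of y iff order(x) < order(y) <= order(x) + size(x)."""
    return x.order < y.order <= x.order + x.size


# Hand-assigned labels for a small tree: book -> (chapter -> figure), index.
book = Label(order=1, size=100)
chapter = Label(order=10, size=30)
figure = Label(order=20, size=5)
index = Label(order=50, size=10)

print(is_ancestor(book, figure))    # True
print(is_ancestor(chapter, index))  # False
```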
11 Determining Ancestor-Descendant Relationship – cont. Properties: –the ancestor-descendant relationship can be determined in constant time. –flexibility – node insertion usually doesn't require renumbering of existing tree nodes. –an element can be uniquely identified in a document by its order value.
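The flexibility property can be illustrated by checking whether a new node fits into unused numbering space. This is only a sketch: the helper `fits_as_last_child` is hypothetical, and the slides do not say how XISS decides how much slack to reserve in each size value.

```python
def fits_as_last_child(parent, siblings, new):
    """Check whether `new` can be added as the last child without renumbering.

    parent, new, and each entry of siblings are (order, size) pairs;
    siblings must be sorted by order.
    """
    p_order, p_size = parent
    n_order, n_size = new
    # The child's interval must lie inside (order(parent), order(parent) + size(parent)].
    inside_parent = p_order < n_order and n_order + n_size <= p_order + p_size
    # The preceding sibling's interval must end before the new node starts.
    if siblings:
        last_order, last_size = siblings[-1]
        after_last = last_order + last_size < n_order
    else:
        after_last = True
    return inside_parent and after_last


chapter = (10, 30)   # covers orders up to 40
figure = (20, 5)     # existing child, covers orders up to 25

print(fits_as_last_child(chapter, [figure], (30, 5)))   # True: no renumbering needed
print(fits_as_last_child(chapter, [figure], (30, 20)))  # False: exceeds the parent's range
```

When the check fails, some existing nodes would have to be renumbered, which is why the property says "usually" rather than "always".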
12 XISS System Overview
13 Name Index and Value Table Objective: minimizing the storage and computation overhead by eliminating replicated strings and string comparisons. Name Index - mapping distinct name strings into unique name identifiers (nid). Value Table - mapping distinct value strings (i.e. attribute values and text values) into unique value identifiers (vid). Both are implemented as B+-trees.
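The idea of replacing repeated strings by identifiers can be sketched as follows. Note the assumptions: a Python dict stands in for the B+-tree used by XISS, and the class name `NameIndex` and method `intern` are made up for this illustration.

```python
class NameIndex:
    """Maps each distinct name string to a unique integer identifier (nid)."""

    def __init__(self):
        self._ids = {}  # dict used here as a stand-in for a B+-tree

    def intern(self, name):
        """Return the nid for `name`, assigning the next free one if it is new."""
        return self._ids.setdefault(name, len(self._ids) + 1)


names = NameIndex()
nid_chapter = names.intern("chapter")
nid_figure = names.intern("figure")
nid_chapter_again = names.intern("chapter")

# Later comparisons use integers instead of strings.
print(nid_chapter == nid_chapter_again)  # True
print(nid_chapter == nid_figure)         # False
```

A value table mapping value strings to vid values could be built the same way.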
14 The Element Index Objective: quickly finding all elements with the same name string. Structure:
15 The Attribute Index Objective: quickly finding all attributes with the same name string. Structure: –Same structure as the Element Index, except that each record in the Attribute Index has a value identifier (vid), which is a key used to obtain the attribute value from the Value Table.
16 The Structure Index Objectives: –Finding the parent element and child elements (or attributes) for a given element. –Finding the parent element for a given attribute. Structure:
17 The Structure Index – cont. Structure: –B+-tree using document identifier (did) as a key. –Leaf nodes: linear arrays with records for all elements and attributes from an XML document. –Each record: {nid, <order, size>, Parent order, Child order, Sibling order, Attribute order}. –Records are ordered by order value.
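A small sketch of how such records could answer parent and child lookups. It rests on several assumptions that the slides do not confirm: each record also stores the node's own order and size, "Child order" is read as the order of the first child, "Sibling order" as the order of the next sibling, and a dict keyed by order stands in for the ordered leaf array.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical layout of one structure-index record."""
    nid: int
    order: int
    size: int
    parent: Optional[int] = None     # order of the parent element
    child: Optional[int] = None      # order of the first child (assumed meaning)
    sibling: Optional[int] = None    # order of the next sibling (assumed meaning)
    attribute: Optional[int] = None  # order of the first attribute (assumed meaning)


def children(records, order):
    """Follow the first-child link, then the sibling chain."""
    result = []
    current = records[order].child
    while current is not None:
        result.append(current)
        current = records[current].sibling
    return result


# Records for one document: book -> (chapter -> figure), index.
records = {
    1: Record(nid=1, order=1, size=100, child=10),
    10: Record(nid=2, order=10, size=30, parent=1, child=20, sibling=50),
    20: Record(nid=3, order=20, size=5, parent=10),
    50: Record(nid=4, order=50, size=10, parent=1),
}

print(children(records, 1))  # [10, 50]
print(records[20].parent)    # 10
```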
18 Querying Method Decomposing path expressions into simple path expressions. Applying algorithms on simple path expressions and their intermediate results.
19 Decomposition of Path Expressions The main idea: –A complex path expression is decomposed into several simple path expressions. –Each simple path expression produces an intermediate result that can be used in the subsequent stage of processing. –The results of the simple path expressions are then combined or joined together to obtain the final result of the given query.
20 Basic Subexpressions - Example
Decomposition of (E1/E2)*/E3/((E4[@A=v])|(E5/_*/E6)) into basic subexpression types:
(1) Single Element/Attribute
(2) Element-Attribute
(3) Element-Element
(4) Kleene Closure
(5) Union
[Figure: decomposition tree of the expression, with each operator node labeled by its subexpression type (1)-(5)]
21 Example: EA-Join: Element and Attribute Join
22 EA-Join: Element and Attribute Join Input: {E1, …, Em}: each Ei is a set of elements having a common document identifier (did); {A1, …, An}: each Aj is a set of attributes having a common document identifier (did). Output: a set of (e, a) pairs such that the element e is the parent of the attribute a.
23 EA-Join: Element and Attribute Join
The Algorithm:
// Sort-merge {Ei} and {Aj} by did.
(1) foreach Ei and Aj with the same did do
      // Sort-merge Ei and Aj by PARENT-CHILD relationship.
(2)   foreach e ∈ Ei and a ∈ Aj do
(3)     if (e is a parent of a) then output (e, a)
      end
    end
24 EA-Join – Example Consider the XML document: [XML listing not preserved] And the query: [query not preserved]
25 EA-Join – Querying Sort-merging the “Ele” elements and “Att” attributes by parent-child relationship gives us the list of matching (element, attribute) pairs [entries not preserved]. From that list, the “Ele” elements that have a child attribute “Att” with the value “A1” are easy to find using the information in the Element Record.
26 EA-Join – Comments Only a two-stage sort-merge operation is needed, with no additional sorting cost: –First merge: by did. –Second merge: by examining the parent-child relationship. This merge is based on the order values of the elements and attributes as defined by the numbering scheme. –Attributes should be placed before their sibling elements in the order of the numbering scheme. This guarantees that elements and attributes with the same did can be merged in a single scan.
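The single-scan merge described above can be sketched with two cursors. This is an illustration under stated assumptions, not the authors' implementation: records are plain tuples, both inputs are already sorted by (did, order), and each attribute record is assumed to carry its parent's order value (`parent_order`), a field introduced here so the parent-child test becomes a simple key comparison.

```python
def ea_join(elements, attributes):
    """Join elements with their child attributes in one pass.

    elements:   list of (did, order), sorted by (did, order)
    attributes: list of (did, order, parent_order), sorted by (did, order)
    Returns a list of ((did, element_order), (did, attribute_order)) pairs.
    """
    pairs = []
    i = 0  # element cursor; it only ever moves forward
    for did, a_order, parent_order in attributes:
        key = (did, parent_order)
        # Skip elements that come before this attribute's parent.
        while i < len(elements) and elements[i] < key:
            i += 1
        # Emit a pair only if the parent is in the element list.
        if i < len(elements) and elements[i] == key:
            pairs.append((elements[i], (did, a_order)))
    return pairs


# Hypothetical input: "Ele" elements and "Att" attributes from two documents.
ele = [(1, 2), (1, 6), (1, 10), (2, 1)]
att = [(1, 3, 2), (1, 11, 10), (1, 15, 14), (2, 2, 1)]

result = ea_join(ele, att)
print(result)  # [((1, 2), (1, 3)), ((1, 10), (1, 11)), ((2, 1), (2, 2))]
```

The element cursor never moves backward because, with attributes numbered before their sibling elements, attributes sorted by order have non-decreasing parent order values. The attribute at order 15 is dropped since its parent (order 14) is not an "Ele" element.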
27 Conclusions XISS can efficiently process regular path expression queries. Performance improvement over the conventional methods by up to an order of magnitude. Future work: optimal page size or the break-even point between the two criteria.
28 Thank you so much!