XML Data Management
Yaw-Huei Chen
Department of Computer Science and Information Engineering
National Chiayi University
5/2/2005

Outline
- Introduction to XML
- Storage
- Query Languages
- Indexing
- Query Processing
- Conclusions

From Documents to Data
- HTML describes presentation
- Example: References. S. Abiteboul, P. Buneman, D. Suciu, Data On The Web, 2000.

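From Documents to Data (cont.)
- XML (eXtensible Markup Language) describes content
- Example:
    <references>
      <book>
        <author>S. Abiteboul</author>
        <author>P. Buneman</author>
        <author>D. Suciu</author>
        <title>Data On The Web</title>
        <year>2000</year>
      </book>
    </references>
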
XML Syntax
- Element: a piece of text bounded by matching tags, e.g. <author>D. Suciu</author>
  - elements can be nested
- Attribute: a (name, value) pair: …
  - an alternative way to represent data
- An XML document has a single root element
- Well-formed XML documents
  - tags must nest properly
  - attribute names must be unique within an element

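The well-formedness rules above can be checked with any conforming XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Illustrative inputs (assumed for this sketch).
nested_ok = "<book><author>D. Suciu</author></book>"
bad_nesting = "<book><author>D. Suciu</book></author>"   # tags do not nest properly
duplicate_attr = '<book year="2000" year="2001"/>'       # attribute name repeated
two_roots = "<book/><book/>"                             # more than one root element

for sample in (nested_ok, bad_nesting, duplicate_attr, two_roots):
    print(is_well_formed(sample), sample)
```

Only the first sample satisfies all three rules listed on the slide.
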
XML Hierarchical Data Model
- XML is ordered
- [Figure: tree rooted at references; a book node with three author children (S. Abiteboul, P. Buneman, D. Suciu), a title (Data on the Web), and a year (2000); further nodes elided as …]

Specifying the Structure
- DTD (Document Type Definition): a context-free grammar
    <!DOCTYPE references [ ]>

Specifying the Structure (cont.)
- XML Schema
  - written in XML format
  - element names and types are associated locally
  - includes primitive data types
  - a superset of DTDs
- Valid XML documents
  - the document must be well-formed
  - the element names must follow the structure specified in a DTD file or XML schema file

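To make the idea of validity concrete, here is a rough sketch that treats each element declaration as a regular expression over the sequence of child tag names. The content models in `CONTENT_MODELS` are a hypothetical grammar chosen to fit the running references example; they are not the DTD from the slides, and the sketch ignores attributes and text content.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models (an assumption for illustration): each regex
# matches the space-terminated sequence of child tag names of an element.
CONTENT_MODELS = {
    "references": r"(book )*",
    "book": r"(author )+title year ",
    "author": r"",
    "title": r"",
    "year": r"",
}

def is_valid(elem) -> bool:
    """Check an element and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:                      # undeclared element name
        return False
    children = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, children) is None:
        return False
    return all(is_valid(child) for child in elem)

valid_doc = ET.fromstring(
    "<references><book><author>D. Suciu</author>"
    "<title>Data on the Web</title><year>2000</year></book></references>")
invalid_doc = ET.fromstring(
    "<references><book><title>Data on the Web</title></book></references>")

print(is_valid(valid_doc), is_valid(invalid_doc))
```

Both sample documents are well-formed; only the first also follows the assumed structure.
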
Storing XML Documents
- Designing a specialized system for storing native XML data
- Using a DBMS to store whole XML documents as text fields
- Using a DBMS to store the document contents as data elements
  - It must support XML's ordered data model

Using a DBMS: Relational DTD
- Schema-aware
- An element that can occur at most once in its parent is stored as a column of the table representing its parent

The book table
ParentID | ID | title             | year
2        | 3  | "Data on The Web" | "2000"
2        | 4  | …                 | …

The author table
ParentID | ID | TEXT
3        | 5  | "S. Abiteboul"
3        | 6  | "P. Buneman"
3        | 7  | "D. Suciu"

[Figure: the references tree, with a book node holding three author children, a title, and a year]

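The mapping can be tried out with an in-memory SQLite database. This is a sketch only: the column name `txt` (standing in for the slide's TEXT column), the SQL types, and the sample query are assumptions. The single-occurrence children title and year become columns of book, while the repeatable author element gets its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book   (ParentID INTEGER, ID INTEGER PRIMARY KEY, title TEXT, year TEXT);
CREATE TABLE author (ParentID INTEGER, ID INTEGER PRIMARY KEY, txt TEXT);
""")
conn.execute("INSERT INTO book VALUES (2, 3, 'Data on The Web', '2000')")
conn.executemany(
    "INSERT INTO author VALUES (?, ?, ?)",
    [(3, 5, "S. Abiteboul"), (3, 6, "P. Buneman"), (3, 7, "D. Suciu")],
)

# Authors of a given book: a parent/child join; ordering by ID keeps
# the document order of the author elements.
rows = conn.execute("""
    SELECT a.txt
    FROM book AS b JOIN author AS a ON a.ParentID = b.ID
    WHERE b.title = 'Data on The Web'
    ORDER BY a.ID
""").fetchall()
authors = [row[0] for row in rows]
print(authors)
```
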
Using a DBMS: Edge
- Schema-less
- A single table is used to store the entire document
- Each node is assigned an ID in depth-first order
- [Figure: the references tree with an added root node above references]

Using a DBMS: Edge (cont.)

The edge table
SourceID | tag        | ordinal | TargetID | Data
1        | references | 1       | 2        | NULL
2        | book       | 1       | 3        | NULL
2        | book       | 2       | 4        | NULL
3        | author     | 1       | 0        | "S. Abiteboul"
3        | author     | 2       | 0        | "P. Buneman"
3        | author     | 3       | 0        | "D. Suciu"
3        | title      | 4       | 0        | "Data on The Web"
3        | year       | 5       | 0        | "2000"

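The edge table can be generated with one depth-first traversal. The sketch below assumes the convention suggested by the table: ID 1 is a virtual document root, non-leaf elements receive IDs in depth-first order, and leaf values are stored inline with TargetID 0. The function name `edge_table` and the second book in the sample document are hypothetical, and mixed content is ignored.

```python
import xml.etree.ElementTree as ET

def edge_table(root):
    """Return (SourceID, tag, ordinal, TargetID, Data) rows sorted by (SourceID, ordinal)."""
    rows = []
    next_id = [1]                          # ID 1 is the virtual document root

    def visit(elem, source_id, ordinal):
        if len(elem) == 0:                 # leaf: value kept inline, TargetID 0
            rows.append((source_id, elem.tag, ordinal, 0, (elem.text or "").strip()))
            return
        next_id[0] += 1                    # non-leaf: allocate the next ID
        target_id = next_id[0]
        rows.append((source_id, elem.tag, ordinal, target_id, None))
        for position, child in enumerate(elem, start=1):
            visit(child, target_id, position)

    visit(root, 1, 1)
    return sorted(rows, key=lambda row: (row[0], row[2]))

doc = ET.fromstring(
    "<references>"
    "<book><author>S. Abiteboul</author><author>P. Buneman</author>"
    "<author>D. Suciu</author><title>Data on The Web</title><year>2000</year></book>"
    "<book><title>Another Book</title></book>"
    "</references>")

rows = edge_table(doc)
for row in rows:
    print(row)
```

The ordinal column is what preserves sibling order, which a relational table does not provide on its own.
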
XPath
- XPath is a language for addressing parts of an XML document.
- XPath uses path expressions to select nodes or node-sets that satisfy certain patterns specified in the expression.
- The names in the XPath expression are element or attribute names in the XML document.
- A single slash (/) before an element specifies that the element must appear as a direct child of the previous (parent) element.
- A double slash (//) specifies that the element can appear as a descendant of the previous element at any level.

XPath Examples
- references: selects the references elements that are children of the context node
- /references: selects the root element references
- //book: selects all book elements, wherever they are in the document
- references//book: selects all book elements that are descendants of the references element
- /references/*: selects all child elements of the references element
- //book/title | //book/author: selects all the title and author elements of all book elements

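The difference between the child step and the descendant step can be tried with Python's `xml.etree.ElementTree`, which supports only a limited XPath subset (relative paths, no `|` union). In this sketch the `shelf` element and the nested book are hypothetical additions that make the two steps return different results; the union is emulated by concatenating two result lists, so it is not in document order.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<references>"
    "<book><author>S. Abiteboul</author><author>P. Buneman</author>"
    "<author>D. Suciu</author><title>Data on the Web</title><year>2000</year></book>"
    "<shelf><book><title>Nested</title></book></shelf>"
    "</references>")

child_books = root.findall("book")        # like /references/book
all_books = root.findall(".//book")       # like /references//book
children = root.findall("*")              # like /references/*
# Emulates //book/title | //book/author (result is not in document order).
titles_and_authors = root.findall(".//book/title") + root.findall(".//book/author")

print(len(child_books), len(all_books))
print([e.tag for e in children])
print([e.text for e in titles_and_authors])
```
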
XQuery
- XQuery is a language for finding and extracting elements and attributes from XML documents
- XQuery uses XPath expressions, but has additional constructs
- FLWR stands for the four main clauses of XQuery: FOR, LET, WHERE, RETURN
- For example:
    for $b in doc("references.xml")//book
    where count($b/author) > 0
    return { $b/title } { for $a in $b/author return $a }

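For readers more familiar with procedural code, the FOR/WHERE/RETURN structure of the example corresponds roughly to a loop, a filter, and a result constructor. The Python analogue below is only an illustration of that correspondence, not an XQuery implementation; the function name `books_with_authors`, the tuple result shape, and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

def books_with_authors(root):
    """Python analogue of the FLWR example: (title, [authors]) per qualifying book."""
    results = []
    for b in root.iter("book"):                       # for $b in doc(...)//book
        authors = b.findall("author")                 # $b/author
        if len(authors) > 0:                          # where count($b/author) > 0
            results.append((b.findtext("title"),      # return { $b/title }
                            [a.text for a in authors]))  # { for $a in $b/author return $a }
    return results

root = ET.fromstring(
    "<references>"
    "<book><author>S. Abiteboul</author><author>P. Buneman</author>"
    "<author>D. Suciu</author><title>Data on the Web</title><year>2000</year></book>"
    "<book><title>No Authors Listed</title></book>"
    "</references>")

result = books_with_authors(root)
print(result)
```
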
Indexing
- In order to find all occurrences of a query pattern, efficient mechanisms are needed for
  - determining the ancestor-descendant relationship between XML elements
  - accessing XML values
- Two types of indexes can help determine the ancestor-descendant relationships:
  - Structural index: it can reduce the time for traversing the XML hierarchy.
  - Numbering scheme: it encodes each element by its positional information within the XML hierarchy. Using such a numbering scheme, the ancestor-descendant relationship between a pair of elements can be determined quickly.

Structural Index
- DataGuides [Goldman97]: every label path of the source graph has exactly one data path instance in its DataGuide.
- [Figure: example graphs over the labels A, B, C, D]

Structural Index (cont.)
- 1-Index [Milo99]: groups nodes together if they have the same set of incoming paths
- [Figure: three panels over the labels A, B, C, D: data graph, 1-index, dataguide]

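For tree-shaped data every node has exactly one incoming label path, so grouping nodes by that path yields one summary node per distinct path, which is the situation in which a DataGuide and a 1-index coincide. The sketch below builds such a path summary for a tree only; it does not handle graph data with shared nodes or idref edges, and the name `path_summary` and the sample document are assumptions.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def path_summary(root):
    """Map each distinct root-to-node label path to the elements reached by it."""
    extents = defaultdict(list)

    def visit(elem, path):
        path = path + (elem.tag,)
        extents[path].append(elem)         # group nodes sharing the same incoming path
        for child in elem:
            visit(child, path)

    visit(root, ())
    return dict(extents)

doc = ET.fromstring("<A><B><C/><D/></B><B><C/><D/></B></A>")
summary = path_summary(doc)
for path, nodes in summary.items():
    print("/".join(path), len(nodes))
```

A path query such as A/B/C can then be answered by a dictionary lookup on the summary instead of a traversal of the document.
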
Structural Index (cont.)
- Covering indexes [Kaushik02]
- Forward and Backward Index (F&B-Index)
  - Add inverse edges to the graph
  - Compute the 1-index (or DataGuide) for the modified graph
- The size of the F&B-Index is too large. To reduce the size:
  - only useful tags are indexed
  - do not index all idref edges (XPath gives a higher priority to tree edges, and // matches only tree edges)
  - exploit local similarity (short paths only)
  - restrict tree depth

Numbering Scheme
- Dewey Decimal Coding [Tatarinov02]
- [Figure: references tree with book, author, title, and year nodes; one title node is labeled 1.2.2]

Numbering Scheme (cont.)
- Inserting new elements
- [Figure: the same references tree, marking the new element and the nodes that require renumbering]

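With Dewey codes, each label is the parent's label extended by the node's ordinal among its siblings, so an ancestor test is a prefix test. The sketch below assumes a tree of two books (one with author, title, year; one with author, title); the helper names are illustrative. It also lists which labels change when a node is inserted at a given sibling position: the following siblings and all of their descendants.

```python
def dewey_labels(tree, prefix=(1,)):
    """tree = (tag, [children]); return [(label, tag)] in document order."""
    tag, children = tree
    labels = [(prefix, tag)]
    for ordinal, child in enumerate(children, start=1):
        labels.extend(dewey_labels(child, prefix + (ordinal,)))
    return labels

def is_ancestor(x, y):
    """x is an ancestor of y iff x is a proper prefix of y."""
    return len(x) < len(y) and y[:len(x)] == x

def needs_renumbering(labels, parent, position):
    """Labels that change when a new child is inserted at `position` under `parent`."""
    n = len(parent)
    return [l for l, _ in labels if len(l) > n and l[:n] == parent and l[n] >= position]

# Assumed example tree for this sketch.
tree = ("references", [
    ("book", [("author", []), ("title", []), ("year", [])]),
    ("book", [("author", []), ("title", [])]),
])
labels = dewey_labels(tree)
for label, tag in labels:
    print(".".join(map(str, label)), tag)

# Inserting a new first child under the first book (label 1.1):
print(needs_renumbering(labels, (1, 1), 1))
```
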
Numbering Scheme (cont.)
- Preorder and postorder [Dietz82]
  - each node is labeled (preorder, postorder)
  - x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.
- [Figure: references (1,10); first book (2,6) with five children (3,1), (4,2), (5,3), (6,4), (7,5); second book (8,9) with children (9,7) and title (10,8)]

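The labels in the figure can be reproduced with a single traversal that hands out a preorder number on entry and a postorder number on exit. In this sketch the node names (`book1`, `c1`, and so on) are placeholders chosen to be unique; only the tree shape (one book with five children, one with two) is taken from the figure.

```python
def number(tree):
    """tree = (name, [children]); return {name: (preorder, postorder)}."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        name, children = node
        counters["pre"] += 1               # number assigned when the node is entered
        pre = counters["pre"]
        for child in children:
            visit(child)
        counters["post"] += 1              # number assigned when the node is left
        labels[name] = (pre, counters["post"])

    visit(tree)
    return labels

def is_ancestor(x, y):
    """x is an ancestor of y iff x comes earlier in preorder and later in postorder."""
    return x[0] < y[0] and x[1] > y[1]

tree = ("references", [
    ("book1", [("c1", []), ("c2", []), ("c3", []), ("c4", []), ("c5", [])]),
    ("book2", [("c6", []), ("c7", [])]),
])
labels = number(tree)
print(labels)
print(is_ancestor(labels["book1"], labels["c3"]))   # book1 contains c3
print(is_ancestor(labels["book1"], labels["c7"]))   # c7 belongs to book2
```
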
Numbering Scheme (cont.)
- Various interval schemes
  - (docno, begin:end, level) [Zhang01]: the begin and end positions can be generated by doing a depth-first traversal of the tree and sequentially assigning a number at each visit.
  - (preorder, size) [Li01]: size is an arbitrary integer larger than the total number of the current descendants.
  - (lowest_post, postorder) [Agrawal89]: lowest_post is the lowest postorder number of its descendants.

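As a sketch of the first scheme, the code below assigns begin and end positions from one counter that advances on every visit (entering and leaving a node), and derives the ancestor and parent tests from interval containment. The function names and the nested sample tree are assumptions, and text nodes, which would receive only a single position, are left out.

```python
def region_encode(tree, docno=1):
    """tree = (name, [children]); return {name: (docno, begin, end, level)}."""
    labels = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        name, children = node
        counter += 1                       # position on entering the node
        begin = counter
        for child in children:
            visit(child, level + 1)
        counter += 1                       # position on leaving the node
        labels[name] = (docno, begin, counter, level)

    visit(tree, 1)
    return labels

def is_ancestor(x, y):
    """x contains y: same document and y's interval lies strictly inside x's."""
    return x[0] == y[0] and x[1] < y[1] and y[2] < x[2]

def is_parent(x, y):
    return is_ancestor(x, y) and y[3] == x[3] + 1

# A chain A1 > B1 > A2 > B2 > C1, assumed for illustration.
tree = ("A1", [("B1", [("A2", [("B2", [("C1", [])])])])])
labels = region_encode(tree)
print(labels)
```

Unlike Dewey codes, this test needs only two integer comparisons regardless of depth, but an insertion may shift the positions of many later nodes unless gaps are left in the numbering.
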
Query Processing
- Goal: to find all occurrences of a query pattern in XML documents.
- Navigation-based approach
  - It computes results by analyzing an input document one tag at a time.
  - The query is represented as a non-deterministic finite automaton (NFA) [Diao03]
- Index-based approach
  - It uses precomputed indexes to answer the query.

Holistic Twig Join [Bruno02]
- Indexes
  - string: (doc, left, level)
  - element: (doc, left:right, level)
- Query: A//B//C
- [Figure: data with elements A1, B1, A2, B2, C1; input streams SA, SB, SC; stack encoding holding A1, A2 and B1, B2 and C1]
- Query results: (A1, B1, C1), (A1, B2, C1), (A2, B2, C1)

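The stack encoding can be illustrated with a simplified, path-only variant of the stack-based join: one stream and one stack per query node, elements consumed in order of their left position, and each stack entry keeping a pointer to the top of its parent's stack at push time. This is a sketch under stated assumptions, not the full twig algorithm of [Bruno02]: it handles only linear `//` paths with distinct tags, and the function name, tuple layout, and region numbers are made up for the example.

```python
def path_stack(query, streams):
    """query: tags of q1//q2//...//qk; streams: tag -> [(name, left, right)] sorted by left."""
    k = len(query)
    pos = [0] * k                          # read position in each stream
    stacks = [[] for _ in range(k)]        # entries: (name, left, right, parent_top_index)
    results = []

    def expand(level, top, suffix):
        # Every entry at or below `top` on this stack is an ancestor of the suffix.
        for i in range(top + 1):
            name, _, _, ptr = stacks[level][i]
            if level == 0:
                results.append((name,) + suffix)
            else:
                expand(level - 1, ptr, (name,) + suffix)

    while pos[k - 1] < len(streams[query[k - 1]]):     # leaf stream not exhausted
        live = [i for i in range(k) if pos[i] < len(streams[query[i]])]
        q = min(live, key=lambda i: streams[query[i]][pos[i]][1])
        name, left, right = streams[query[q]][pos[q]]
        pos[q] += 1
        for stack in stacks:               # drop entries that end before this element starts
            while stack and stack[-1][2] < left:
                stack.pop()
        if q == 0 or stacks[q - 1]:        # push only if an ancestor candidate exists
            ptr = len(stacks[q - 1]) - 1 if q > 0 else -1
            stacks[q].append((name, left, right, ptr))
            if q == k - 1:                 # leaf: emit all matches ending here
                expand(q, len(stacks[q]) - 1, ())
                stacks[q].pop()
    return results

# Assumed regions for the nesting A1 > B1 > A2 > B2 > C1.
streams = {
    "A": [("A1", 1, 10), ("A2", 3, 8)],
    "B": [("B1", 2, 9), ("B2", 4, 7)],
    "C": [("C1", 5, 6)],
}
matches = path_stack(["A", "B", "C"], streams)
print(matches)
```

For this input the sketch returns the three query results listed on the slide. The stacks hold the partial matches compactly, so they are expanded into tuples only when a leaf element arrives.
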
Count Operation [Chen04]
- XPath query: Count (A // B // C), evaluated over sequential data with stacks SA, SB, SC
- State after each read:

Step | Stacks           | A-node's count | B-node's count | C-node's count | Note
1    | (empty)          | 1              | 0              | 0              |
2    | (empty)          | 2              | 0              | 0              |
3    | A(2) null        | 0              | 1              | 0              |
4    | A(2) null        | 0              | 2              | 0              |
5    | A(2) null, B(2)  | 0              | 0              | 1              | Query result is 2 * 2 = 4
6    | A(2) null, B(2)  | 0              | 0              | 0              |
7    | A(2) null        | 0              | 1              | 0              | B(2) - 1 = 1
8    | A(2) null        | 0              | 0              | 0              |
9    | (empty)          | 1              | 0              | 0              | A(2) - 1 = 1
10   | (empty)          | 0              | 0              | 0              |

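The count can also be computed in a single pass with a small vector of partial-match counts per open element. The sketch below is a generic dynamic-programming formulation, not the counter-and-stack procedure of [Chen04] traced in the slides; the function name `count_path` and the sample document (two A elements, two B elements, and one C, all nested) are assumptions. It reproduces the result of 4 for that input.

```python
import io
import xml.etree.ElementTree as ET

def count_path(xml_text, steps):
    """Count matches of steps[0]//steps[1]//...//steps[-1] in one pass over parse events."""
    k = len(steps)
    # stack[-1][i] = number of ways to match the first i steps using open elements only;
    # index 0 is the empty match, which always has exactly one way.
    stack = [[1] + [0] * k]
    total = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            prev = stack[-1]               # counts over strict ancestors
            cur = list(prev)
            for i in range(1, k + 1):
                if elem.tag == steps[i - 1]:
                    cur[i] += prev[i - 1]  # extend shorter matches with this element
            total += cur[k] - prev[k]      # full matches that end at this element
            stack.append(cur)
        else:
            stack.pop()                    # element closed: its counts no longer apply
    return total

nested = "<A><A><B><B><C/></B></B></A></A>"
print(count_path(nested, ["A", "B", "C"]))
```

Memory use is proportional to the document depth times the number of query steps, which is what makes this style of evaluation suitable for streaming data.
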
Future Work
- Version management
- Materialized views
- Cache management
- Aggregate query processing
- Streaming data processing

References
[Agrawal89] R. Agrawal et al., "Efficient management of transitive relationships in large data and knowledge bases," SIGMOD, 1989.
[Bruno02] N. Bruno et al., "Holistic twig joins: Optimal XML pattern matching," SIGMOD, 2002.
[Chen04] Yaw-Huei Chen and Ming-Chi Ho, "Aggregate query processing of streaming XML data," ICS, 2004.
[Christophides03] V. Christophides et al., "On labeling schemes for the semantic web," WWW, 2003.
[Diao03] Y. Diao et al., "Path sharing and predicate evaluation for high-performance XML filtering," ACM TODS, 2003.
[Dietz82] P.F. Dietz, "Maintaining order in a linked list," ACM Symposium on Theory of Computing, May 1982.
[Goldman97] R. Goldman and J. Widom, "DataGuides: Enabling query formulation and optimization in semistructured databases," VLDB, 1997.
[Kaushik02] R. Kaushik et al., "Covering indexes for branching path queries," SIGMOD, 2002.
[Li01] Q. Li and B. Moon, "Indexing and querying XML data for regular path expressions," VLDB, 2001.
[Milo99] T. Milo and D. Suciu, "Index structures for path expressions," Proc. of the Int'l Conf. on Database Theory, 1999.
[Tatarinov02] I. Tatarinov et al., "Storing and querying ordered XML using a relational database system," SIGMOD, 2002.
[Tian02] F. Tian et al., "The design and performance evaluation of alternative XML storage strategies," SIGMOD Record, March 2002.
[Zhang01] C. Zhang et al., "On supporting containment queries in relational database management systems," SIGMOD, 2001.