A Document-based Approach to Indexing XML Data. Ya-Hui Chang and Tsan-Lung Hsieh, Department of Computer Science, DBLAB, National Taiwan Ocean University.


1 A Document-based Approach to Indexing XML Data. Ya-Hui Chang and Tsan-Lung Hsieh, Department of Computer Science, National Taiwan Ocean University. yahui@cs.ntou.edu.tw. Sept. 10th, 2002

2 Overview
- XML introduction
- Element block
- Element tree
- Two types of index structures: document index and element index
- Experiment results
- Conclusion

3 Element Block. Example XML data: a Book with Title "Principles of database systems", Author (Lastname "Ullman", Firstname "Jeffrey"), Publisher "Computer Science Press", Date "1999", and Keyword "database".

4 Element Tree. Example of Offset Blocks.

5 The Query Processor. A query is processed in three steps: identifying the document (using the document index), determining the position (using the element index), and retrieving the data (from the XML document). The output is the result.

6 The Index Structures. Purpose: providing efficient query processing over multiple XML documents. Two types:
- Document index: represents the correspondence between document identifiers and element values.
- Element index: represents the positions of elements.

7 Document Index. Based on the B+-tree: the size of each node is restricted by the order, and the tree is balanced. (The example on the slide uses order = 5.)
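As a rough illustration of how a lookup descends a B+-tree from the root to a leaf, the sketch below uses a hand-built two-level tree. The `Node` class, the `search` function, and the sample keys are hypothetical and are not taken from the slides; only the document identifiers B0001 and B0002 appear in the deck. It does not enforce the order limit or handle insertion.

```python
from bisect import bisect_right

class Node:
    """A B+-tree node: internal nodes carry children, leaves carry values."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted search-key values
        self.children = children    # child nodes (internal nodes only)
        self.values = values        # document identifiers (leaves only)

def search(root, key):
    """Return the document identifiers stored under key, or [] if absent."""
    node = root
    while node.children is not None:
        # Child i holds keys below keys[i]; the last child holds the rest.
        node = node.children[bisect_right(node.keys, key)]
    if key in node.keys:
        return node.values[node.keys.index(key)]
    return []

# Hypothetical index over element values (illustrative data only).
left = Node(["Chang", "Hsieh"], values=[["B0002"], ["B0002"]])
right = Node(["Ullman"], values=[["B0001"]])
root = Node(["Ullman"], children=[left, right])

print(search(root, "Ullman"))
```

Only the leaves hold document identifiers here; the internal node just routes the search, which is what keeps every lookup at the same depth in a balanced tree.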

8 Document Index (cont). Each node is represented as an XML document. The search-key value is represented as the attribute key of the element Pointer, while the document identifier is represented as the content. (Figure labels: B0001, B0002, B0001, B3.bt; panels: XML, DTD.)
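A minimal sketch of reading such a node, assuming a layout in which each `Pointer` element carries the search key in its `key` attribute and the document identifier as its text. The root element name `Node`, the sample keys, and the `lookup` helper are hypothetical; the slides only name the `Pointer` element and the `key` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical leaf node of the document index, stored as XML.
NODE_XML = """
<Node>
  <Pointer key="Jeffrey">B0001</Pointer>
  <Pointer key="Ullman">B0001</Pointer>
  <Pointer key="Widom">B0002</Pointer>
</Node>
""".strip()

def lookup(node_xml, key):
    """Return the document identifiers whose Pointer has the given key."""
    root = ET.fromstring(node_xml)
    return [p.text for p in root.findall("Pointer") if p.get("key") == key]

print(lookup(NODE_XML, "Ullman"))
```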

9 Element Index. The position information of elements is represented based on the order specified in the DTD (the element tree). The element indexes are partitioned into offset blocks, which correspond to element blocks, to capture the nesting structure of elements. The name "offset" reflects that we keep the relative positions of elements, which reduces the cost of maintenance. Offset tuples constitute the offset block:
- the first component records the offset to the parent element;
- the last component records the pointer to the offset tuple of the next sibling element;
- the other components record the relative positions of sub-elements.
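To make the three roles of the components concrete, here is a small sketch that decodes one offset tuple. It uses the Author tuple [57, 12, 43, 0] and the Book position 9 from the construction walk-through later in the deck; the `OffsetTuple` type and the `decode` helper are hypothetical names, and treating a trailing 0 as a null sibling pointer is an assumption based on that walk-through.

```python
from typing import List, NamedTuple, Optional

class OffsetTuple(NamedTuple):
    parent_offset: int            # distance from the parent's start position
    children: List[int]           # one component per sub-element, in DTD order
    next_sibling: Optional[int]   # index of the next sibling's tuple, or None

def decode(raw):
    """Split a raw offset tuple into its three roles (0 is read as null)."""
    return OffsetTuple(raw[0], list(raw[1:-1]), raw[-1] or None)

author = decode([57, 12, 43, 0])
book_pos = 9                                      # absolute position of <Book>
author_pos = book_pos + author.parent_offset      # 9 + 57
lastname_pos = author_pos + author.children[0]    # first sub-element
firstname_pos = author_pos + author.children[1]   # second sub-element

print(author_pos, lastname_pos, firstname_pos)
```

Because only the first component refers to the parent, moving the Author element changes that one number, while the offsets of Lastname and Firstname stay valid.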

10 Example of Offset Blocks (figure). Offset tuples, connected by child links that follow the element tree and by sibling links:
Books: [Books, pointer, null]
Book1: [Book1, Title1, pointer, Publisher1, Date1, Keyword1, pointer]
Author1: [Author1, Lastname1, Firstname1, pointer]
Author2: [Author2, Lastname2, Firstname2, null]
Book2: [Book2, Title2, pointer, Publisher2, Date2, Keyword2, null]
Author3: [Author3, Lastname3, Firstname3, null]
Legend: element tree, sibling link, child link.

11 Example of Retrieving Offsets. Suppose we plan to retrieve all the data corresponding to the path "/Books/Book/Title". Based on the element tree, Book is the first child of Books, and Title is the first child of Book. This tells us which components to retrieve in the offset tuples of Books and Book. We also need to follow the sibling links. (Figure: Books [..., null]; Book1 [Title1, ...] linked to Book2 [Title2, ..., null].)

12 Example of Retrieving Offsets (cont). Suppose the input path is "/Books/Book/Author/Lastname", where Book is the first child of Books, Author is the second child of Book, and Lastname is the first child of Author. We need to process the sibling elements for both Author and Book. (Figure: Books [..., null]; Book1 [...] linked to Book2 [..., null]; Author1 [Lastname1, ...] linked to Author2 [Lastname2, ..., null]; Author3 [Lastname3, ..., null].)
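The two retrieval examples can be sketched as a short traversal. This is an illustration under stated assumptions, not the authors' implementation: the function `find_positions`, the dictionary `ELEMENT_TREE`, and all numeric values in `TUPLES` are hypothetical (two books, the first with two authors, mirroring the offset-block figure). Components of internal children are read as tuple indexes, components of leaf children as offsets, and a trailing 0 as a null sibling pointer.

```python
# Element tree in DTD order: internal element -> ordered sub-elements.
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}

# Hypothetical offset tuples: [parent offset, child components..., sibling].
TUPLES = [
    [0, 1, 0],                       # 0: Books
    [9, 9, 2, 231, 279, 299, 4],     # 1: Book1, sibling -> tuple 4
    [57, 12, 43, 3],                 # 2: Author1, sibling -> tuple 3
    [145, 12, 40, 0],                # 3: Author2
    [345, 9, 5, 130, 170, 190, 0],   # 4: Book2
    [50, 12, 41, 0],                 # 5: Author3
]

def find_positions(path, tuples, tree):
    """Return absolute start positions of all elements matching the path."""
    names = path.strip("/").split("/")
    found = []

    def visit(idx, parent_pos, depth):
        while True:
            tup = tuples[idx]
            pos = parent_pos + tup[0]          # absolute position of element
            if depth == len(names) - 1:
                found.append(pos)
            else:
                child = names[depth + 1]
                comp = tup[1 + tree[names[depth]].index(child)]
                if comp is not None:
                    if child in tree:          # internal child: follow link
                        visit(comp, pos, depth + 1)
                    else:                      # leaf child: relative offset
                        found.append(pos + comp)
            idx = tup[-1]                      # follow the sibling link
            if idx == 0:                       # 0 is treated as null
                break

    visit(0, 0, 0)
    return found

print(find_positions("/Books/Book/Title", TUPLES, ELEMENT_TREE))
print(find_positions("/Books/Book/Author/Lastname", TUPLES, ELEMENT_TREE))
```

The element tree decides which component to read at each level, and the sibling loop is what makes the second query visit Author2 and Book2 as well.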

13 Construction Algorithm. Idea: perform a linear scan of the XML document and retrieve the absolute positions of all tags to calculate the offsets. Data structures used:
- StartTagList: the sequence of start-tags and their absolute positions
- EndTagList: the sequence of end-tags and their absolute positions
- Stack: all unfinished elements; the top entry is the most recent one, which is also the parent of the current element
Each internal node of the element tree needs to record how many child nodes it has.
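The idea above can be sketched as follows. This is a simplified reading, not the authors' code: `build_offset_tuples`, `ELEMENT_TREE`, and `SAMPLE_XML` are hypothetical names, the two tag lists are merged into one position-ordered event stream, tags are assumed to have no attributes, leaf sub-elements are assumed to occur at most once per parent, and a sibling pointer of 0 stands for null. The sample document is a reconstruction whose line breaks (CRLF) and tab indentation are guesses.

```python
import re

ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}

SAMPLE_XML = "\r\n".join([
    "<Books>",
    "<Book>",
    "\t<Title>Principles of database systems</Title>",
    "\t<Author>",
    "\t\t<Lastname>Ullman</Lastname>",
    "\t\t<Firstname>Jeffrey</Firstname>",
    "\t</Author>",
    "\t<Publisher>Computer Science Press</Publisher>",
    "\t<Date>1999</Date>",
    "\t<Keyword>database</Keyword>",
    "</Book>",
    "</Books>",
])

def build_offset_tuples(xml, tree):
    """Linear scan over the tags; returns the list of offset tuples."""
    starts = [(m.start(), "start", m.group(1))
              for m in re.finditer(r"<(\w+)>", xml)]
    ends = [(m.end() - 1, "end", m.group(1))
            for m in re.finditer(r"</(\w+)>", xml)]
    tuples = []
    stack = [("/", 0, -1)]   # (name, absolute position, tuple index)
    last_seen = {}           # (parent tuple, name) -> latest tuple index
    for pos, kind, name in sorted(starts + ends):
        if kind == "end":
            if name in tree:             # an internal element is finished
                stack.pop()
            continue
        parent_name, parent_pos, parent_idx = stack[-1]
        slot = None if parent_idx < 0 else 1 + tree[parent_name].index(name)
        if name in tree:                 # internal element: new tuple
            idx = len(tuples)
            tuples.append([pos - parent_pos] + [None] * len(tree[name]) + [0])
            if slot is not None:
                prev = last_seen.get((parent_idx, name))
                if prev is None:
                    tuples[parent_idx][slot] = idx   # child link
                else:
                    tuples[prev][-1] = idx           # sibling link
                last_seen[(parent_idx, name)] = idx
            stack.append((name, pos, idx))
        else:                            # leaf element: relative offset
            tuples[parent_idx][slot] = pos - parent_pos
    return tuples

print(build_offset_tuples(SAMPLE_XML, ELEMENT_TREE))
```

On this reconstructed document the sketch yields the tuples [0, 1, 0], [9, 9, 2, 145, 193, 213, 0], and [57, 12, 43, 0].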

14 Initial Data
StartTagList: … ['Title', 18] ['Book', 9] ['Books', 0]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Stack: ['/', 0, -1]
Offset Tuples: (none)
Document text: Principles of database systems, Ullman, Jeffrey, Computer Science Press, 1999, database
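A small sketch of how the two tag lists might be collected. The document text below is a reconstruction: CRLF line breaks and tab indentation are assumptions, chosen because they reproduce the listed positions, and an end-tag position is taken to be the index of its closing `>`. The names `SAMPLE_XML` and `tag_lists` are hypothetical. The position of the closing Books tag depends on the final line break and is not checked here.

```python
import re

SAMPLE_XML = "\r\n".join([
    "<Books>",
    "<Book>",
    "\t<Title>Principles of database systems</Title>",
    "\t<Author>",
    "\t\t<Lastname>Ullman</Lastname>",
    "\t\t<Firstname>Jeffrey</Firstname>",
    "\t</Author>",
    "\t<Publisher>Computer Science Press</Publisher>",
    "\t<Date>1999</Date>",
    "\t<Keyword>database</Keyword>",
    "</Book>",
    "</Books>",
])

def tag_lists(xml):
    """Return (StartTagList, EndTagList) as [name, absolute position] pairs."""
    start = [[m.group(1), m.start()] for m in re.finditer(r"<(\w+)>", xml)]
    end = [[m.group(1), m.end() - 1] for m in re.finditer(r"</(\w+)>", xml)]
    return start, end

start_tags, end_tags = tag_lists(SAMPLE_XML)
print(start_tags[:3])
print(end_tags[:3])
```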

15 Round 1
StartTagList: … ['Title', 18] ['Book', 9] ['Books', 0]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Stack: ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, _, _]

16 Round 2
StartTagList: … ['Author', 66] ['Title', 18] ['Book', 9]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, 1, _]; 1 [9, _, _, _, _, _, _]

17 Round 3
StartTagList: … ['Lastname', 78] ['Author', 66] ['Title', 18]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, 1, _]; 1 [9, 9, _, _, _, _, _]

18 Round 4
StartTagList: … ['Firstname', 109] ['Lastname', 78] ['Author', 66]
EndTagList: … ['Author', 150] ['Firstname', 138] ['Lastname', 104]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, 1, _]; 1 [9, 9, 2, _, _, _, _]; 2 [57, _, _, _]

19 Round 5
StartTagList: … ['Publisher', 154] ['Firstname', 109] ['Lastname', 78]
EndTagList: … ['Author', 150] ['Firstname', 138] ['Lastname', 104]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, 1, _]; 1 [9, 9, 2, _, _, _, _]; 2 [57, 12, _, _]

20 Round 6
StartTagList: … ['Date', 202] ['Publisher', 154] ['Firstname', 109]
EndTagList: … ['Publisher', 198] ['Author', 150] ['Firstname', 138]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, 1, _]; 1 [9, 9, 2, _, _, _, _]; 2 [57, 12, 43, _]

21 Round 7
StartTagList: ['Keyword', 222] ['Date', 202] ['Publisher', 154]
EndTagList: … ['Date', 218] ['Publisher', 198] ['Author', 150]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, 1, _]; 1 [9, 9, 2, _, _, _, _]; 2 [57, 12, 43, 0]

22 Round 8
StartTagList: ['Keyword', 222] ['Date', 202] ['Publisher', 154]
EndTagList: … ['Keyword', 248] ['Date', 218] ['Publisher', 198]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, 1, _]; 1 [9, 9, 2, 145, _, _, _]; 2 [57, 12, 43, 0]

23 Round 9
StartTagList: ['Keyword', 222] ['Date', 202]
EndTagList: ['Books', 266] ['Book', 257] ['Keyword', 248] ['Date', 218]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, 1, _]; 1 [9, 9, 2, 145, 193, _, _]; 2 [57, 12, 43, 0]

24 Round 10
StartTagList: ['Keyword', 222]
EndTagList: ['Books', 266] ['Book', 257] ['Keyword', 248]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, 1, _]; 1 [9, 9, 2, 145, 193, 213, _]; 2 [57, 12, 43, 0]

25 Round 11
StartTagList: (empty)
EndTagList: ['Books', 266] ['Book', 257]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, 1, _]; 1 [9, 9, 2, 145, 193, 213, 0]; 2 [57, 12, 43, 0]

26 Round 12
StartTagList: (empty)
EndTagList: ['Books', 266]
Stack: ['Books', 0, 0] ['/', 0, -1]
Offset Tuples: 0 [0, 1, 0]; 1 [9, 9, 2, 145, 193, 213, 0]; 2 [57, 12, 43, 0]

27 Final Data
StartTagList: (empty)
EndTagList: (empty)
Stack: ['/', 0, -1]
Offset Tuples: 0 [0, 1, 0]; 1 [9, 9, 2, 145, 193, 213, 0]; 2 [57, 12, 43, 0]
Document text: Principles of database systems, Ullman, Jeffrey, Computer Science Press, 1999, database

28 Performance Evaluation
- Comparison with DOM: shows the efficiency of utilizing the pre-built element index. DOM (Document Object Model) is a tree-based parsing mechanism where each element is a node. The Microsoft MSXML 3.0 DOM API is used.
- Construction of the cost model: shows the scalability of our indexing scheme.
- Comparison with Lore: shows the performance of the whole query processor. Lore is a specialized database system for semi-structured/XML data.

29 Comparison with DOM (figure)

30 Cost Model. The I/O cost consists of processing the following four portions of data:
- the internal nodes of the document index
- the leaf nodes of the document index
- the offset blocks
- the XML files
The cost model is as follows: [formula not captured in the transcript]

31 Experiment Setups
Type                    A    B    C     D      E       F
Number of books (p)     9    81   972   6561   59049   6561
Distinct values (v)     3    9    27    81     243     6561
B+-tree order (k)       4    4    4     4      4       4
Number of results (n)   3    8    31    88     277     1

32 Experiment Data
Time (ms) \ type   A      B       C       D       E        F
Internal           0      0.5     1.2     1.73    2.067    3.367
Leaf               0.5    0.6     0.733   1.2     2.833    0.533
Offset & XML       3.1    8.3     32.5    100.3   337.6    1.1
Actual time        3.6    9.4     34.4    103.3   342.5    5
Estimated time     4.48   11.66   40.81   110.3   338.51   4.2
Ratio              1.24           1.19    1.07    0.99     0.84
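As a quick arithmetic check on the timing data, the sketch below reads the six columns as types A to F (an assumption about the flattened layout) and tests two relationships that the numbers suggest: the actual time is roughly the sum of the three measured parts, and the ratio row is the estimated time divided by the actual time. The variable names are hypothetical, and the ratio computed for type B is a derived value, not one read from the slide.

```python
# Measured times in ms, one entry per experiment type A..F.
types = ["A", "B", "C", "D", "E", "F"]
internal = [0, 0.5, 1.2, 1.73, 2.067, 3.367]
leaf = [0.5, 0.6, 0.733, 1.2, 2.833, 0.533]
offset_xml = [3.1, 8.3, 32.5, 100.3, 337.6, 1.1]
actual = [3.6, 9.4, 34.4, 103.3, 342.5, 5]
estimated = [4.48, 11.66, 40.81, 110.3, 338.51, 4.2]

# Sum of the three measured parts, to compare against the actual time.
part_sums = [i + l + o for i, l, o in zip(internal, leaf, offset_xml)]

# Ratio of the cost-model estimate to the measured time.
ratios = [round(e / a, 2) for e, a in zip(estimated, actual)]

for t, s, a, r in zip(types, part_sums, actual, ratios):
    print(f"{t}: parts={s:.3f} ms, actual={a} ms, estimated/actual={r}")
```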

33 Queries to Compare with Lore
Type   Description                       # of queries
A      Find journals by title            20
B      Find journals by author           40
C      Find journals by title & author   20

34 Experiment Results
Type   Our approach   Lore      Lore-Vindex
A      0.0315s        5.025s    5.065s
B      0.0065s        5.1725s   5.165s
C      0.0075s        4.445s    4.465s

35 Conclusions
Summary: We construct a query processor to retrieve data from multiple XML documents, which utilizes two index structures:
- the document index can quickly identify the required document;
- the maintainable element index can quickly determine the precise location of the desired data.
Experiment results show the efficiency of our approach.
Future work:
- supporting more complicated queries
- improving space utilization

