A Document-based Approach to Indexing XML Data, by Ya-Hui Chang and Tsan-Lung Hsieh (DBLAB, Department of Computer Science, National Taiwan Ocean University)

Presentation transcript:

DBLAB, National Taiwan Ocean University
A Document-based Approach to Indexing XML Data
Ya-Hui Chang and Tsan-Lung Hsieh
Department of Computer Science, National Taiwan Ocean University
Sept. 10th, 2002

Overview
- XML introduction
- Element block
- Element tree
- Two types of index structures
  - Document index
  - Element index
- Experiment results
- Conclusion

Element Block
[Figure: a sample XML document describing one book, with the values "Principles of database systems", "Ullman", "Jeffrey", "Computer Science Press", "1999", and "database".]

Element Tree
[Figure: the element tree; see also Example of Offset Blocks.]

The Query Processor
[Figure: a query is processed in three steps, Identifying Document, Determining Position, and Retrieving Data, which produce the result. The components shown are the Document Index, the Element Index, and the XML Document.]

The Index Structures
Purpose: providing efficient query processing over multiple XML documents.
Two types:
- Document index: represents the correspondence between document identifiers and element values.
- Element index: represents the positions of elements.

Document Index
Based on the B+-tree: the size of each node is restricted by the order, and the tree is balanced.
[Figure: an example tree with order = 5.]

Document Index (cont.)
Each node is represented as an XML document. The search-key value is represented as the attribute key of the element Pointer, while the document identifier is represented as the content.
[Figure: an example node shown as XML together with its DTD; the visible contents include B0001, B0002, and B3.bt.]
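The node layout described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the root element name `Node`, the helper names `make_node` and `lookup`, and the pairing of keys with document identifiers are all made up here. Only the `Pointer` element, its `key` attribute, and identifiers of the form `B0001` come from the slide.

```python
import xml.etree.ElementTree as ET


def make_node(entries):
    """Build one index node; `Node` is a hypothetical root element name."""
    node = ET.Element("Node")
    for key, doc_id in entries:
        # Search key goes into the `key` attribute, document id into the content.
        pointer = ET.SubElement(node, "Pointer", {"key": key})
        pointer.text = doc_id
    return node


def lookup(node, key):
    """Return the document identifiers stored under `key` in this node."""
    return [p.text for p in node.findall("Pointer") if p.get("key") == key]


# Made-up entries: which key maps to which document is an assumption.
leaf = make_node([("Ullman", "B0001"), ("Ullman", "B0002"), ("database", "B0001")])
leaf_xml = ET.tostring(leaf, encoding="unicode")
print(leaf_xml)
print(lookup(leaf, "Ullman"))
```

Storing each node as a small XML document means the index can be read with the same parser as the data. How internal nodes refer to child node files (for example a name such as B3.bt) is not modeled in this sketch.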

Element Index
The position information of elements is represented based on the order specified in the DTD, that is, the element tree.
The element indexes are partitioned into offset blocks corresponding to element blocks, to capture the nesting structure of elements.
It is named "offset" because we keep the relative positions of elements, which reduces the cost of maintenance.
Offset tuples constitute the offset block:
- the first component records the offset to the parent element;
- the last component records the pointer to the offset tuple of the next sibling element;
- the other components record the relative positions of sub-elements.
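To make the tuple layout concrete, the sketch below labels the components of the two offset tuples that appear later in the worked construction example, `[9, 9, 2, 145, 193, 213, 0]` for Book and `[57, 12, 43, 0]` for Author. The function name `label_tuple` is hypothetical. The reading that a component for an internal child (such as Author) holds the number of the child's offset tuple, and that 0 in the last component stands for null, is inferred from that worked example rather than stated on this slide.

```python
# Children of each internal element, in DTD order (as used in the slides).
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}


def label_tuple(element, tup):
    """Attach a name to every component of an offset tuple."""
    children = ELEMENT_TREE[element]
    if len(tup) != len(children) + 2:
        raise ValueError("tuple length does not match the element tree")
    labeled = {"offset_to_parent": tup[0]}
    for child, value in zip(children, tup[1:-1]):
        # Assumption: internal children are referenced by tuple number,
        # leaf children by their offset relative to this element.
        kind = "tuple" if child in ELEMENT_TREE else "offset"
        labeled[child] = (kind, value)
    labeled["next_sibling"] = tup[-1]
    return labeled


book = label_tuple("Book", [9, 9, 2, 145, 193, 213, 0])
author = label_tuple("Author", [57, 12, 43, 0])
print(book)
print(author)
```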

Example of Offset Blocks
[Figure: offset blocks drawn next to the element tree, connected by child links and sibling links. The tuples shown are:]
Books:   pointer, null
Book1:   Title1, pointer, Publisher1, Date1, Keyword1, pointer
Book2:   Title2, pointer, Publisher2, Date2, Keyword2, null
Author1: Lastname1, Firstname1, pointer
Author2: Lastname2, Firstname2, null
Author3: Lastname3, Firstname3, null

Example of Retrieving Offsets
Suppose we plan to retrieve all the data corresponding to the path "/Books/Book/Title". Based on the element tree, Book is the first child of Books, and Title is the first child of Book. This information tells us which components to retrieve in the offset tuples of Books and Book. We also need to follow the sibling links.
[Figure: the Books tuple points to Book1, whose sibling link leads to Book2; Title1 and Title2 are read from the two Book tuples.]

Example of Retrieving Offsets (cont.)
Suppose the input path is "/Books/Book/Author/Lastname", where Book is the first child of Books, Author is the second child of Book, and Lastname is the first child of Author. We need to process the sibling elements for both Author and Book.
[Figure: Book1 leads to Author1 and, through the sibling link, Author2; Book2 leads to Author3; Lastname1, Lastname2, and Lastname3 are read from the Author tuples.]
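The two retrieval examples can be sketched in Python as below. The tuple values are made up for two books with three authors, and the function name `find_positions` is hypothetical. The sketch assumes that a component for an internal child stores the number of the child's first offset tuple, that every tuple (including one reached through a sibling link) stores its offset relative to its parent in the first component, and that 0 denotes a null pointer. These conventions are inferred from the slides, not confirmed by them.

```python
# Children of each internal element, in DTD order (as used in the slides).
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}

# Hypothetical offset tuples: [offset to parent, child components..., next sibling].
TUPLES = [
    [0, 1, 0],                      # 0: Books
    [9, 9, 2, 221, 261, 281, 4],    # 1: Book1, next sibling is tuple 4
    [57, 12, 43, 3],                # 2: Author1, next sibling is tuple 3
    [145, 12, 36, 0],               # 3: Author2
    [330, 9, 5, 120, 160, 180, 0],  # 4: Book2
    [50, 12, 40, 0],                # 5: Author3
]
NULL = 0  # assumed null marker for pointers


def find_positions(tuples, path):
    """Return the absolute start positions of all elements matching `path`."""
    steps = path.strip("/").split("/")
    # Each frontier entry is (tuple number or None for a leaf, absolute position).
    frontier = [(0, tuples[0][0])]
    for parent, child in zip(steps, steps[1:]):
        slot = ELEMENT_TREE[parent].index(child) + 1
        next_frontier = []
        for idx, abs_pos in frontier:
            value = tuples[idx][slot]
            if child in ELEMENT_TREE:
                # Internal child: follow the child link, then the sibling links.
                t = value
                while t != NULL:
                    next_frontier.append((t, abs_pos + tuples[t][0]))
                    t = tuples[t][-1]
            else:
                # Leaf child: the component is an offset relative to the parent.
                next_frontier.append((None, abs_pos + value))
        frontier = next_frontier
    return [pos for _, pos in frontier]


print(find_positions(TUPLES, "/Books/Book/Title"))
print(find_positions(TUPLES, "/Books/Book/Author/Lastname"))
```

Because only relative offsets are stored, the absolute position of each match is accumulated while walking down from the root.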

Constructing Algorithm
Idea: perform a linear scan of the XML document, retrieving the absolute positions of all tags to calculate the offsets.
Data structures used:
- StartTagList: the sequence of start-tags and their absolute positions
- EndTagList: the sequence of end-tags and their absolute positions
- Stack: all unfinished elements; on top is the most recent one, which is also the parent of the current element
Each internal node of the element tree needs to record how many child nodes it has.
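A minimal Python sketch of this construction idea follows. It is a simplification, not the authors' implementation: instead of keeping separate StartTagList and EndTagList structures, it visits start-tags and end-tags in document order with one regular-expression scan. The function name `build_offset_tuples`, the `last_child` bookkeeping, the use of `None` for unfilled components, and the sample document are assumptions. It also assumes attribute-free tags and leaf elements that do not repeat under one parent.

```python
import re

# Children of each internal element, in DTD order (as used in the slides).
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}


def build_offset_tuples(doc):
    """Scan `doc` once and return the list of offset tuples."""
    tuples = []
    stack = [("/", 0, -1)]  # (element name, absolute position, tuple number)
    last_child = {}         # (parent tuple, child name) -> latest child tuple
    for match in re.finditer(r"<(/?)(\w+)>", doc):
        closing, name, pos = match.group(1), match.group(2), match.start()
        if closing:
            if name in ELEMENT_TREE:
                stack.pop()  # an internal element is finished
            continue
        parent_name, parent_pos, parent_idx = stack[-1]
        if name in ELEMENT_TREE:
            # Internal element: open a new tuple with empty child components.
            idx = len(tuples)
            tuples.append([pos - parent_pos] + [None] * len(ELEMENT_TREE[name]) + [0])
            if parent_idx >= 0:
                slot = ELEMENT_TREE[parent_name].index(name) + 1
                previous = last_child.get((parent_idx, name))
                if previous is None:
                    tuples[parent_idx][slot] = idx  # child link
                else:
                    tuples[previous][-1] = idx      # sibling link
                last_child[(parent_idx, name)] = idx
            stack.append((name, pos, idx))
        else:
            # Leaf element: store its offset relative to the parent.
            slot = ELEMENT_TREE[parent_name].index(name) + 1
            tuples[parent_idx][slot] = pos - parent_pos
    return tuples


# Hypothetical one-book document without whitespace between tags.
doc = (
    "<Books><Book><Title>T</Title>"
    "<Author><Lastname>L</Lastname><Firstname>F</Firstname></Author>"
    "<Publisher>P</Publisher><Date>D</Date><Keyword>K</Keyword>"
    "</Book></Books>"
)
offset_tuples = build_offset_tuples(doc)
for number, tup in enumerate(offset_tuples):
    print(number, tup)
```

The stack plays the role described on the slide: its top entry is always the parent of the tag being processed, so every offset can be computed as the tag position minus the parent's position.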

Initial Data
StartTagList: … ['Title', 18] ['Book', 9] ['Books', 0]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Stack: ['/', 0, -1]
Offset tuples: (none)
Document text: Principles of database systems Ullman Jeffrey Computer Science Press 1999 database

Round 1
StartTagList: … ['Title', 18] ['Book', 9] ['Books', 0]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Stack: ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, _, _]
Document text: Principles of database systems Ullman …

Round 2
StartTagList: … ['Author', 66] ['Title', 18] ['Book', 9]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, 1, _]; 1 [9, _, _, _, _, _, _]
Document text: Principles of database systems Ullman …

Round 3
StartTagList: … ['Lastname', 78] ['Author', 66] ['Title', 18]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, 1, _]; 1 [9, 9, _, _, _, _, _]
Document text: Principles of database systems Ullman …

Round 4
StartTagList: … ['Firstname', 109] ['Lastname', 78] ['Author', 66]
EndTagList: … ['Author', 150] ['Firstname', 138] ['Lastname', 104]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, 1, _]; 1 [9, 9, 2, _, _, _, _]; 2 [57, _, _, _]
Document text: Principles of database systems Ullman …

Round 5
StartTagList: … ['Publisher', 154] ['Firstname', 109] ['Lastname', 78]
EndTagList: … ['Author', 150] ['Firstname', 138] ['Lastname', 104]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, 1, _]; 1 [9, 9, 2, _, _, _, _]; 2 [57, 12, _, _]
Document text: Principles of database systems Ullman …

Round 6
StartTagList: … ['Date', 202] ['Publisher', 154] ['Firstname', 109]
EndTagList: … ['Publisher', 198] ['Author', 150] ['Firstname', 138]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, 1, _]; 1 [9, 9, 2, _, _, _, _]; 2 [57, 12, 43, _]
Document text: Ullman Jeffrey Computer Science Press 1999 …

Round 7
StartTagList: ['Keyword', 222] ['Date', 202] ['Publisher', 154]
EndTagList: … ['Date', 218] ['Publisher', 198] ['Author', 150]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, 1, _]; 1 [9, 9, 2, _, _, _, _]; 2 [57, 12, 43, 0]
Document text: Ullman Jeffrey Computer Science Press 1999 …

Round 8
StartTagList: ['Keyword', 222] ['Date', 202] ['Publisher', 154]
EndTagList: … ['Keyword', 248] ['Date', 218] ['Publisher', 198]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, 1, _]; 1 [9, 9, 2, 145, _, _, _]; 2 [57, 12, 43, 0]
Document text: Ullman Jeffrey Computer Science Press 1999 …

Round 9
StartTagList: ['Keyword', 222] ['Date', 202]
EndTagList: ['Books', 266] ['Book', 257] ['Keyword', 248] ['Date', 218]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, 1, _]; 1 [9, 9, 2, 145, 193, _, _]; 2 [57, 12, 43, 0]
Document text: Computer Science Press 1999 database

Round 10
StartTagList: ['Keyword', 222]
EndTagList: ['Books', 266] ['Book', 257] ['Keyword', 248]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, 1, _]; 1 [9, 9, 2, 145, 193, 213, _]; 2 [57, 12, 43, 0]
Document text: Computer Science Press 1999 database

Round 11
StartTagList: (empty)
EndTagList: ['Books', 266] ['Book', 257]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, 1, _]; 1 [9, 9, 2, 145, 193, 213, 0]; 2 [57, 12, 43, 0]
Document text: Computer Science Press 1999 database

Round 12
StartTagList: (empty)
EndTagList: ['Books', 266]
Stack: ['Books', 0, 0] ['/', 0, -1]
Offset tuples: 0 [0, 1, 0]; 1 [9, 9, 2, 145, 193, 213, 0]; 2 [57, 12, 43, 0]
Document text: Computer Science Press 1999 database

Final Data
StartTagList: (empty)
EndTagList: (empty)
Stack: ['/', 0, -1]
Offset tuples: 0 [0, 1, 0]; 1 [9, 9, 2, 145, 193, 213, 0]; 2 [57, 12, 43, 0]
Document text: Principles of database systems Ullman Jeffrey Computer Science Press 1999 database

Performance Evaluation
- Comparison with DOM: showing the efficiency of utilizing the pre-built element index.
  - DOM (Document Object Model): a tree-based parsing mechanism where each element is a node.
  - Using the Microsoft MSXML 3.0 DOM API.
- Construction of the cost model: showing the scalability of our indexing scheme.
- Comparison with Lore: showing the performance of the whole query processor.
  - Lore: a specialized database system for semi-structured/XML data.

Comparison with DOM
[Figure: performance comparison with DOM.]

Cost Model
The I/O cost consists of processing the following four portions of data:
- the internal nodes of the document index
- the leaf nodes of the document index
- the offset blocks
- the XML files
The cost model is as follows: [formula not preserved in the transcript]

Experiment Setups
[Table: six setup types, A to F, each described by four parameters: number of books (p), distinct values (v), B+-tree order (k), and number of results (n). The values are not preserved in the transcript.]

Experiment Data
[Table: time in ms for each setup type, A to F, with the rows Internal, Leaf, Offset & XML, Actual time, Estimated time, and Ratio. The values are not preserved in the transcript.]

Queries to Compare with Lore
Type  Description                      # of queries
A     Find journals by title           20
B     Find journals by author          40
C     Find journals by title & author  20

Experiment Results
Type  Our approach  Lore     Lore-Vindex
A     0.0315s       5.025s   5.065s
B     0.0065s       5.1725s  5.165s
C     0.0075s       4.445s   4.465s
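As a quick reading aid for these numbers, the following Python snippet computes how many times longer each Lore configuration took than the proposed approach, per query type. The timings are copied from the slide; the variable and function names are made up, and the slide does not say whether the times are averages or totals, so the ratios should be read only as relative figures.

```python
# Timings in seconds, copied from the results slide.
timings = {
    "A": {"ours": 0.0315, "Lore": 5.025, "Lore-Vindex": 5.065},
    "B": {"ours": 0.0065, "Lore": 5.1725, "Lore-Vindex": 5.165},
    "C": {"ours": 0.0075, "Lore": 4.445, "Lore-Vindex": 4.465},
}


def slowdown_factors(row):
    """Ratio of each competitor's time to the time of the proposed approach."""
    return {name: t / row["ours"] for name, t in row.items() if name != "ours"}


for query_type, row in timings.items():
    factors = slowdown_factors(row)
    summary = ", ".join(f"{name}: {value:.0f}x" for name, value in factors.items())
    print(f"Type {query_type}: {summary}")
```

For all three query types the ratio is well above 100, which matches the efficiency claim made in the conclusions.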

Conclusions
Summary:
- We construct a query processor to retrieve data from multiple XML documents, which utilizes two index structures:
  - the document index quickly identifies the required document;
  - the maintainable element index quickly determines the precise location of the desired data.
- Experiment results show the efficiency of our approach.
Future work:
- Supporting more complicated queries
- Improving space utilization