Efficient Relational Storage and Retrieval of XML Documents Jill Chen Mojdeh Makabi CS240B.

Presentation transcript:

References
• Kanda Runapongsa and Jignesh M. Patel. Storing and Querying XML Data in Object-Relational DBMSs. In A.B. Chaudhri et al. (Eds.): EDBT 2002 Workshops, LNCS 2490.
• H. Liefke and D. Suciu. XMill: an Efficient Compressor for XML Data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas.
• C. Kanne and G. Moerkotte. Efficient Storage of XML Data. ICDE.
• Albrecht Schmidt, Martin Kersten, Menzo Windhouwer, and Florian Waas. Efficient Relational Storage and Retrieval of XML Documents. WebDB.

XML
• XML is assuming the role of the standard data exchange format in Web database environments.
• XML is semi-structured; one consequence is that we cannot expect all instances of one type to share the same structure.
• Modeling issues arise from the mismatch between semi-structured data on the one hand and fully structured database schemas on the other.
• To make XML the language of Web databases, there must be effective tools for managing XML documents.

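Monet XML Model
• Source: Efficient Relational Storage and Retrieval of XML Documents
• The data model is based on the notion of binary associations.
• It decomposes XML documents into small, flexible, and semantically homogeneous units.
• It is efficient: in the experiments reported later in this talk, it answers queries up to two orders of magnitude faster than a single-table approach.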

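XML Documents and Syntax Tree
[Figure: an example bibliography document and its syntax tree. Node labels include the authors Ben Bit, Bob Byte, and Ken Key, the editor Ed Itor, and the titles "How to Hack" and "Hacking & RSI".]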

Main Question
• The central question in querying XML documents is how to store the syntax tree as a database instance that provides efficient retrieval capabilities.

Different Approaches
• The tree could be stored in a single database table.
  - This makes querying expensive because it forces scans over large amounts of data that are irrelevant to the query.
  - Even with few joins, large data volumes may have to be processed.
• The tree could be stored by placing all associations of the same type in the same binary relation. This is the approach used in the Monet XML Model.

Monet XML Model
The basis for the Monet XML Model:
  - Paths
  - Associations
  - Binary relations

Path
• For a node o in the syntax tree, its path is the sequence of vertex and edge labels along the path from the root to o.
• A path describes the position of an element in the graph relative to the root node.
• The node with OID O3 has the path: bibliography article author
• The string "Ben Bit" has the path: bibliography article author cdata string

Associations
• A pair (o, ·) ∈ oid × (oid ∪ string) is called an association.
• The different types of associations describe different parts of the tree:
  - Associations of type oid × oid represent edges.
  - Associations of type oid × string represent attribute values.

Binary Relation
• To transform an XML document into the Monet model, we need the set of binary relations that contain all associations between nodes.
  - All associations of the same type are stored in the same binary relation.
• Example: the associations between bibliography and article are {(O1, O2), (O1, O7)}.

Monet Transformation
bibliography article title cdata string = {(O6, "How to Hack"), (O15, "Hacking & RSI")}
bibliography article = {(O1, O2), (O1, O7)}
bibliography article author = {(O2, O3), (O7, O10), (O7, O12)}
bibliography article title = {(O2, O5), (O7, O14)}
bibliography article author cdata = {(O3, O4), (O10, O11), (O12, O13)}
bibliography article author cdata string = {(O4, "Ben Bit"), (O11, "Bob Byte"), (O13, "Ken Key")}
bibliography article title cdata = {(O5, O6), (O14, O15)}
bibliography article editor = {(O7, O8)}
bibliography article editor cdata = {(O8, O9)}
bibliography article editor cdata string = {(O9, "Ed Itor")}
bibliography article key = {(O2, "BB88"), (O7, "BK99")}
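The decomposition on this slide can be sketched in a few lines of Python. This is only an illustration, not the Monet implementation: the XML text in `DOC` is a reconstruction assumed from the relations listed above, the function name `decompose` is hypothetical, and OIDs are assumed to be handed out in document order. Mixed content (text following a child element) is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed reconstruction of the example document (not shown in the transcript).
DOC = """<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking &amp; RSI</title>
  </article>
</bibliography>"""


def decompose(xml_text):
    """Map each path (a tuple of labels) to its binary relation (a list of pairs)."""
    relations = defaultdict(list)
    counter = 0

    def new_oid():
        nonlocal counter
        counter += 1
        return f"O{counter}"

    def visit(elem, oid, path):
        # Attribute values become oid x string associations.
        for name, value in elem.attrib.items():
            relations[path + (name,)].append((oid, value))
        # Character data gets its own cdata node plus a string association.
        text = (elem.text or "").strip()
        if text:
            cdata = new_oid()
            relations[path + ("cdata",)].append((oid, cdata))
            relations[path + ("cdata", "string")].append((cdata, text))
        # Parent-child edges become oid x oid associations.
        for child in elem:
            child_oid = new_oid()
            child_path = path + (child.tag,)
            relations[child_path].append((oid, child_oid))
            visit(child, child_oid, child_path)

    root = ET.fromstring(xml_text)
    visit(root, new_oid(), (root.tag,))
    return dict(relations)


relations = decompose(DOC)
for path, pairs in relations.items():
    print(" ".join(path), "=", pairs)
```

The recursion visits every node once and keeps only the current root-to-node path on the stack, which is in line with the linear-time and height-bounded memory claims made for the decomposition.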

Query
• Show Ben Bit's publications whose titles contain the word "Hack":
  SELECT p FROM bibliography article p, p author cdata a, p title cdata t WHERE a = "Ben Bit" AND t LIKE "Hack"
• The FROM clause specifies sets:
  p = {O2, O7}
• It also specifies associations between the first and last element of each path, obtained by equality joins of the binary relations along the path:
  assoc(p → a) = {(O2, "Ben Bit"), (O7, "Bob Byte"), (O7, "Ken Key")}
  assoc(p → t) = {(O2, "How to Hack"), (O7, "Hacking & RSI")}
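As a rough sketch of how the associations for this query could be computed, the Python below composes the binary relations along each path with equality joins and then applies the two predicates. The helper names `compose` and `assoc` are hypothetical, and the substring test stands in for the slide's informal LIKE condition.

```python
# Binary relations copied from the Monet Transformation slide.
bibliography_article = [("O1", "O2"), ("O1", "O7")]
article_author = [("O2", "O3"), ("O7", "O10"), ("O7", "O12")]
author_cdata = [("O3", "O4"), ("O10", "O11"), ("O12", "O13")]
author_cdata_string = [("O4", "Ben Bit"), ("O11", "Bob Byte"), ("O13", "Ken Key")]
article_title = [("O2", "O5"), ("O7", "O14")]
title_cdata = [("O5", "O6"), ("O14", "O15")]
title_cdata_string = [("O6", "How to Hack"), ("O15", "Hacking & RSI")]


def compose(left, right):
    """Equality join on left's second column and right's first column."""
    return [(a, c) for (a, b) in left for (b2, c) in right if b == b2]


def assoc(*relations):
    """Join the relations along a path, keeping only first and last element."""
    result = relations[0]
    for rel in relations[1:]:
        result = compose(result, rel)
    return result


p = [article for (_, article) in bibliography_article]
p_to_a = assoc(article_author, author_cdata, author_cdata_string)
p_to_t = assoc(article_title, title_cdata, title_cdata_string)

by_author = {oid for (oid, name) in p_to_a if name == "Ben Bit"}
by_title = {oid for (oid, title) in p_to_t if "Hack" in title}
answer = [oid for oid in p if oid in by_author and oid in by_title]
print(answer)
```

Only the relations on the two paths named in the query are touched; the editor and key relations are never read.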

Single Database Table
Key    Author     Title           Editor
BB88   Ben Bit    How to Hack     NULL
BK99   Bob Byte   Hacking & RSI   Ed Itor
BK99   Ken Key    Hacking & RSI   Ed Itor
…
Query: SELECT * FROM bibliography WHERE Author = "Ben Bit" AND Title LIKE "Hack"
Disadvantages:
  - Scans over large amounts of data
  - Large data volumes may have to be processed even when only a few joins are needed
  - NULL values must be added for irregularities
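For comparison, the single-table layout can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: SQLite merely stands in for whatever system would hold the table, and the `%Hack%` pattern is an assumed spelling of the slide's informal LIKE condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bibliography (Key TEXT, Author TEXT, Title TEXT, Editor TEXT)")
conn.executemany(
    "INSERT INTO bibliography VALUES (?, ?, ?, ?)",
    [
        ("BB88", "Ben Bit", "How to Hack", None),          # no editor: padded with NULL
        ("BK99", "Bob Byte", "Hacking & RSI", "Ed Itor"),  # one row per author,
        ("BK99", "Ken Key", "Hacking & RSI", "Ed Itor"),   # so title and editor repeat
    ],
)

rows = conn.execute(
    "SELECT * FROM bibliography WHERE Author = ? AND Title LIKE ?",
    ("Ben Bit", "%Hack%"),
).fetchall()
print(rows)
```

Without an index, the query has to look at every row and every column, including the repeated title and editor values and the NULL padding.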

Monet XML Model
• Results in a higher degree of fragmentation.
  - In our example, we have 11 tables.
• Paths are used to group semantically related associations into the same relation.
• There is no need to scan the entire document.
• There is no need to introduce novel features at the storage level to cope with irregularities induced by the semi-structured nature of XML.
• The complete decomposition is linear in the size of the documents.
• Memory requirements are linear in the height of the syntax tree.

Quantitative Assessments
Document              Size in XML   Size in Monet XML   #Tables   Loading
ACM Anthology         46.6 MB       44.2 MB             (lost)    (lost) s
Shakespeare's Plays   7.9 MB        8.2 MB              95        4.5 s
Database Size
  - The size of the resulting decomposition is a critical issue.
  - In the worst case, the size of the path summary can be linear in the size of the documents (if the documents are completely unstructured).
  - In practical applications, there are generally large structured portions.
• The Monet XML version of the ACM Anthology is smaller than the original documents.
  - The reduction is due to the removal of redundantly occurring character data and the removal of tags.

Comparison of Response Times
• Monet XML is compared against SYU/Postgres.
• SYU stores all data in a single table and has to scan these data repeatedly.
• The Monet transform yields smaller data volumes.
• The test set consists of 10 queries (Q1 to Q10) over Shakespeare's plays.
• The substantial difference in response times shows that Monet XML outruns the competitor by up to two orders of magnitude.
[Table: response times per query for Monet XML and SYU. Only two values survive in this transcript: Monet XML 1.2 ms, SYU 150 ms.]

Summary
• Presented a data model for efficient processing of XML documents.
• Experience shows that it is worth taking the plunge and fully decomposing XML documents into binary associations.
• This approach combines the elegance of clear semantics with a highly efficient execution model by means of a simple and effective mapping between XML documents and a relational schema.

XORator & Object-Relational DBMSs

Two Dominating Approaches
• Use a native XML database engine for storing and querying data sets.
  - Provides a more natural data model and query language for XML data (hierarchical or graph representation).
• Map the XML data and queries to constructs provided by a relational DBMS (RDBMS).
  - XML data is mapped to relations, and queries on XML data are converted into SQL queries.

RDBMS
• Advantages
  - The user is not involved in the complexity of the mapping.
  - It can be used for querying both XML data and data that already exists in relational systems.
• Disadvantages
  - Performance can suffer, since mapping XML data to relational data may produce a database schema with many relations.
  - Queries on XML data, when translated to SQL, may contain many joins, making them expensive to evaluate.

In the Paper
• Object-Relational DBMS (ORDBMS)
  - Has all the advantages of an RDBMS
  - Has a more expressive type system than an RDBMS
  - Is better suited for XML documents that may use a richer set of data types
• XORator algorithm
  - Uses Document Type Definitions (DTDs) to map XML documents to tables in an ORDBMS
  - Introduces a new XML data type: XADT (XML Abstract Data Type)

Storing XML Documents in an ORDBMS – Reducing DTD Complexity
• Apply transformations to reduce the number of nested expressions and the number of element items, making the mapping process easier:
  - Flattening (converts a nested definition into a flat representation): (e1, e2)* → e1*, e2*
  - Simplification (reduces multiple unary operators to a single unary operator): e1** → e1*
  - Grouping (groups subelements that have the same name): e0, e1*, e1*, e2 → e0, e1*, e2
  - e+ → e*
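The four rewrite rules can be sketched as a small Python function over a toy representation of DTD content models. Everything here is hypothetical scaffolding for illustration: the constructors `elem`, `seq`, `star`, and `plus` and the function `reduce_content_model` are not from the paper. One assumption goes beyond the slide: when grouping meets a repeated name, the merged entry is always starred, even if no occurrence was starred.

```python
def elem(name):
    return ("elem", name)

def seq(*items):
    return ("seq", list(items))

def star(item):
    return ("star", item)

def plus(item):
    return ("plus", item)


def flatten(node, starred=False):
    """Return a flat list of (name, is_starred) pairs."""
    kind, body = node
    if kind == "elem":
        return [(body, starred)]
    if kind in ("star", "plus"):
        # e+ -> e*, and nested operators collapse: e** -> e*
        return flatten(body, True)
    # Sequence: push the enclosing operator down to every member,
    # e.g. (e1, e2)* -> e1*, e2*
    pairs = []
    for child in body:
        pairs.extend(flatten(child, starred))
    return pairs


def group(pairs):
    """Merge entries with the same name, keeping the first position."""
    merged = {}
    for name, starred in pairs:
        # Assumption: a repeated name becomes name*.
        merged[name] = True if name in merged else starred
    return list(merged.items())


def reduce_content_model(node):
    return ", ".join(name + ("*" if starred else "") for name, starred in group(flatten(node)))


print(reduce_content_model(star(seq(elem("e1"), elem("e2")))))
print(reduce_content_model(seq(elem("e0"), star(elem("e1")), star(elem("e1")), elem("e2"))))
```

Note that flattening deliberately gives up the relative order of e1 and e2 inside the repeated group in exchange for a simpler definition.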

Reducing DTD Complexity (cont.)

Storing XML Documents in an ORDBMS – Building a DTD Graph

Storing XML Documents in an ORDBMS – XORator
• XORator: XML to OR Translator
• The algorithm builds on the Hybrid algorithm.
  - If a non-leaf node N has exactly one parent and there are no links incident on any of its descendants, then N is assigned to an XADT attribute. (If N were assigned to a relation, queries on this node and its parent would require a join.)

XORator (cont.)
  - If a non-leaf node below a * node is accessed by multiple nodes, then it is assigned to a relation, e.g. scene. (For nodes that are mapped to relations, the ancestors of these nodes must also be assigned to relations.)
  - If a leaf node is below a * node, then it is assigned as an attribute of type XADT, e.g. line. Otherwise, it is assigned as an attribute of string type.

XORator (cont.)

Storing XML Documents in an ORDBMS – Defining an XML Data Type
• Compressed representation for the XML fragment
  - Element tags are mapped to integer codes, and the tags in the fragment are replaced by these codes.
  - A small dictionary is stored along with the XML fragment to record the mapping between the integer codes and the actual element tag names.
• Compression is used only if the space efficiency is above a certain threshold value.
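The dictionary-based tag compression can be illustrated with a short Python sketch. It is an assumption-laden toy, not the XADT storage format: the functions `compress` and `decompress`, the regular expression, and the choice to store the dictionary as a Python dict are all made up for this example, and the tag names are borrowed from the Shakespeare data set.

```python
import re

# Matches the name of an opening or closing tag, e.g. "<LINE" or "</LINE".
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)")


def compress(fragment):
    """Replace element tag names with integer codes; return (dictionary, body)."""
    codes = {}

    def encode(match):
        slash, name = match.groups()
        code = codes.setdefault(name, len(codes))
        return f"<{slash}{code}"

    body = TAG.sub(encode, fragment)
    dictionary = {code: name for name, code in codes.items()}
    return dictionary, body


def decompress(dictionary, body):
    """Restore the tag names from the stored dictionary."""
    return re.sub(
        r"<(/?)(\d+)",
        lambda m: f"<{m.group(1)}{dictionary[int(m.group(2))]}",
        body,
    )


fragment = "<SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>To be</LINE><LINE>or not to be</LINE></SPEECH>"
dictionary, body = compress(fragment)
print(dictionary)
print(body)
```

The saving grows with the number of repeated tags in a fragment, which is why a per-data-set threshold decides whether the compressed form is worth using.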

Defining an XML Data Type (XADT) (cont.)
Methods on the XADT:
  - XADT getElm(XADT inXML, VARCHAR rootElm, VARCHAR searchElm, VARCHAR searchKey, INTEGER level)
  - INTEGER findKeyInElm(XADT inXML, VARCHAR searchElm, VARCHAR searchKey)
  - XADT getElmIndex(XADT inXML, VARCHAR parentElm, VARCHAR childElm, INTEGER startPos, INTEGER endPos)

Defining an XML Data Type (XADT) (cont.)

Unnest Operator
  - Required when a query needs to examine individual elements in the set.
  - Example: a distinct list of all speakers who speak in at least one play.
  - Implemented using a table User-Defined Function (UDF).
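The effect of unnesting can be imitated in Python with a generator that turns one XML fragment into one output row per matching element. This is only a sketch of the idea: the function name `unnest`, the sample rows, and the wrapper element are assumptions, and a real table UDF would run inside the DBMS rather than in client code.

```python
import xml.etree.ElementTree as ET


def unnest(fragment, elem_name):
    """Yield the text of every elem_name element found in an XML fragment."""
    # A fragment may hold several top-level elements, so wrap it before parsing.
    root = ET.fromstring(f"<wrapper>{fragment}</wrapper>")
    for elem in root.iter(elem_name):
        yield (elem.text or "").strip()


# Hypothetical rows: (play title, XML fragment holding a set of speeches).
rows = [
    ("Hamlet", "<SPEECH><SPEAKER>HORATIO</SPEAKER></SPEECH><SPEECH><SPEAKER>HAMLET</SPEAKER></SPEECH>"),
    ("Hamlet", "<SPEECH><SPEAKER>HAMLET</SPEAKER></SPEECH>"),
    ("Macbeth", "<SPEECH><SPEAKER>MACBETH</SPEAKER></SPEECH>"),
]

# Distinct list of all speakers who speak in at least one play.
speakers = sorted({name for _, fragment in rows for name in unnest(fragment, "SPEAKER")})
print(speakers)
```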

Defining an XML Data Type (XADT) – Unnest Operator (cont.)

Performance Evaluation Randomly parse a few sample documents to obtain the storage space sizes in both uncompressed and compressed cases. Compressed format is chosen only if it reduces the storage space by at least 20%
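The 20% rule can be written as a one-line check. The function name `use_compressed` and the byte counts in the usage lines are hypothetical; only the threshold comes from the slide.

```python
def use_compressed(uncompressed_size, compressed_size, min_saving_percent=20):
    """Return True if compression saves at least min_saving_percent of the space.

    Sizes are measured on a few sample documents; the compressed size should
    include the tag dictionary stored with each fragment.
    """
    saved = uncompressed_size - compressed_size
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return 100 * saved >= min_saving_percent * uncompressed_size


print(use_compressed(1000, 620))  # a 38% saving
print(use_compressed(1000, 900))  # a 10% saving
```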

Performance – Shakespeare Plays XORator algorithm chooses not to use the compressed storage alternative. The size of the database produced by the XORator algorithm is about 60% of the size of the database produced by the Hybrid algorithm.

Performance – Larger Data Set
• Took the original Shakespeare data set and loaded it multiple times, producing data sets that were two, four, and eight times the original database size (DSx2, DSx4, and DSx8).
• Query sets:
  - QS1: Flattening (list speakers and the lines that they speak)
  - QS2: Full path expression (retrieve the lines that have the keyword "Rising" in the text of the stage direction)
  - QS3: Selection
  - QS4: Multiple selections
  - QS5: Twig with selection
  - QS6: Order access

Performance – Larger Data Set
• Much shorter loading times with the XORator algorithm.
• Significantly better execution times for all queries except QS6.
• All queries required at least one fewer join.
• QS6 is slower because, with the XORator algorithm, the database needs to scan the XADT attribute to extract elements in the specified order, while the Hybrid database only needs to extract the value of the element order attribute.

Performance – SIGMOD Proceedings Data Set Deep DTD – representative of the worst-case scenario for the XORator algorithm. Compressed storage alternative is used – it reduces the database size by about 38%. The size of the database produced by the XORator algorithm is about 65% of the size of the database produced by the Hybrid algorithm

Performance – Larger Data Set
• Took the original SIGMOD Proceedings data set and loaded it multiple times, producing data sets that were two, four, and eight times the original database size (DPx2, DPx4, and DPx8).
• Query sets:
  - QG1: Selection and extraction (retrieve the authors of the papers with the keyword "join" in the paper title)
  - QG2: Flattening (list all authors and the names of the proceeding sections in which their papers appear)
  - QG3: Flattening with selection
  - QG4: Aggregation
  - QG5: Aggregation with selection
  - QG6: Order access with selection

Performance – Larger Data Set
• When the data size is small (DPx1 and DPx2), the XORator algorithm performs worse than the Hybrid algorithm.
• When the data size becomes large (DPx4 and DPx8), the XORator algorithm outperforms the Hybrid algorithm.
• The XORator queries need no table joins, but each query makes 4 to 8 UDF calls to extract subelements or to join elements inside the XADT attribute.

Analysis
• The cost of invoking UDFs is a significant component of query evaluation with the XORator algorithm.
• Does a UDF incur a higher performance penalty than an equivalent built-in function?
  - Implement two string functions (length and substring) both as UDFs and as built-in functions, and test the following queries.
  - QT1: Return the length of the string in the SPEAKER attribute.
  - QT2: Return the substring of the string in the SPEAKER attribute from the fifth position to the last position.
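The shape of this experiment can be reproduced in miniature with Python's `sqlite3` module, which lets a Python function be registered as a SQL function. This is a sketch under clear assumptions: SQLite is not the system measured in the paper, the table name `speech` and the UDF names are made up, and no timing from this setup should be read as confirming the paper's numbers.

```python
import sqlite3


def udf_length(s):
    """UDF counterpart of the built-in length()."""
    return len(s)


def udf_substr_from_fifth(s):
    """UDF counterpart of substr(s, 5): fifth character to the end."""
    return s[4:]


conn = sqlite3.connect(":memory:")
conn.create_function("udf_length", 1, udf_length)
conn.create_function("udf_substr5", 1, udf_substr_from_fifth)
conn.execute("CREATE TABLE speech (SPEAKER TEXT)")
conn.executemany("INSERT INTO speech VALUES (?)", [("HAMLET",), ("HORATIO",)])

# QT1: length of the SPEAKER string, built-in versus UDF.
qt1_builtin = conn.execute("SELECT length(SPEAKER) FROM speech ORDER BY rowid").fetchall()
qt1_udf = conn.execute("SELECT udf_length(SPEAKER) FROM speech ORDER BY rowid").fetchall()

# QT2: substring from the fifth position to the end, built-in versus UDF.
qt2_builtin = conn.execute("SELECT substr(SPEAKER, 5) FROM speech ORDER BY rowid").fetchall()
qt2_udf = conn.execute("SELECT udf_substr5(SPEAKER) FROM speech ORDER BY rowid").fetchall()

print(qt1_builtin, qt1_udf)
print(qt2_builtin, qt2_udf)
```

Both variants return the same rows; timing each pair over a large table (for example with `time.perf_counter`) would give a local estimate of the per-call UDF overhead.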

Analysis (cont.) Using UDFs is about 40% more expensive than using built-in functions.

Analysis (cont.)
Invoking UDFs is expensive because:
  - XADT methods use string compare and copy functions on VARCHAR, which sometimes requires scanning a large amount of data. Remedy: associate metadata with each XADT attribute to quickly access the starting position of each element.
  - The cost of evaluating a UDF is higher than that of an equivalent built-in function. Remedy: implement XADT as a native data type.

Performance
• As the data size increases, the ratio of the Hybrid response time to the XORator response time becomes greater than 1.
• Queries using the XORator algorithm have no joins, so their response time grows at an O(n) rate (scan cost), where n is the number of tuples.
• Queries using the Hybrid algorithm have many joins, so their response time grows at either an O(n log n) rate (merge sort join cost) or an O(n²) rate (nested loop join cost).
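A toy cost model shows why a linear scan with expensive UDF calls can lose on small inputs and win on large ones. All numbers below are invented for illustration: the per-tuple UDF factor of 12, the base size of 1500 tuples, and the function names are assumptions, not measurements from the paper.

```python
import math


def xorator_cost(n, udf_factor=12.0):
    """Scan cost O(n), scaled by an assumed constant UDF overhead per tuple."""
    return udf_factor * n


def hybrid_cost(n):
    """Merge sort join cost O(n log n), with an assumed unit constant."""
    return n * math.log2(n)


base_tuples = 1500  # assumed size of the x1 data set
ratios = {}
for scale in (1, 2, 4, 8):
    n = base_tuples * scale
    # A ratio above 1 means the Hybrid plan is the slower one.
    ratios[scale] = hybrid_cost(n) / xorator_cost(n)
    print(f"x{scale}: Hybrid / XORator = {ratios[scale]:.2f}")
```

In this model the ratio is log2(n) / 12, so it keeps growing with n; with a nested loop join the Hybrid cost would grow as n² and the crossover would come even earlier.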

Summary
• New algorithm: XORator
• New data type: XADT
• Outperforms the Hybrid algorithm due to fewer joins
• Future work: implementation and evaluation of UDFs

Conclusion
• We presented some efficient models for storing and querying XML documents:
  - Monet XML Model
  - XORator Algorithm
• There is still a lot of work to be done to bridge the gap between structured web databases and semistructured XML documents.