Efficient XML Storage, Query, and Update. Shi Xu and Heng Yuan. Spring 2004, CS240B, Prof. Zaniolo.


XML Storage Methods
- Flat streams
- Metamodeling
- Mixed
- Redundant
- Hybrid

Methods Covered
- “Efficient storage of XML data” covers the hybrid method, using a custom-made storage system called Natix.
- “Efficient relational storage and retrieval of XML documents” covers metamodeling, using the authors' Monet database.

Natix Overview
- Natix is an efficient, native repository for storing, retrieving, and managing XML documents.
- It supports tree-structured objects such as XML documents at a low level of the architecture.

Natix architectural overview

Logical Model
- Trees are often used as the logical model of semistructured data.
- Each non-leaf node is labeled with a symbol taken from an alphabet $\Sigma_{DTD}$.
- Leaf nodes can be labeled with the data itself.

A sample XML with its associated logical tree Example XML: OTHELLO Let me see your eyes; Look in my face.

Physical Model
Object content:
- The terms node and object are used interchangeably.
- A record contains a set of nodes (objects).
- Aggregate nodes are the inner nodes of the tree; each contains its child nodes.
- Literal nodes are leaf nodes containing an uninterpreted stream of bytes, such as text strings or graphics.
- Proxy nodes are nodes that point to other records.
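The three node kinds above can be modeled in a few lines. This is a minimal sketch and not Natix's actual C++ layout: the class names, the `record_id` handle, and the element labels in the example are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Literal:
    """Leaf node: an uninterpreted stream of bytes (text, graphics, ...)."""
    data: bytes


@dataclass
class Proxy:
    """Node that points to a different record (record_id is a hypothetical handle)."""
    record_id: int


@dataclass
class Aggregate:
    """Inner node: contains its child nodes."""
    label: str
    children: List["Node"] = field(default_factory=list)


Node = Union[Aggregate, Literal, Proxy]


@dataclass
class Record:
    """A record holds a set of nodes; here, one subtree identified by its root."""
    record_id: int
    root: Node


def count_nodes(node: Node) -> int:
    """Count the nodes physically stored in a record (a proxy counts as one node)."""
    if isinstance(node, Aggregate):
        return 1 + sum(count_nodes(child) for child in node.children)
    return 1


# Illustrative labels only; the rest of the speech is assumed to live in record 2.
speech = Aggregate("speech", [
    Aggregate("speaker", [Literal(b"OTHELLO")]),
    Proxy(record_id=2),
])
record = Record(record_id=1, root=speech)
print(count_nodes(record.root))
```

The proxy stands in for an entire subtree stored elsewhere, so the record above stays small no matter how large the referenced subtree grows.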

Node Representation
- A whole document (or a subtree of a document) can be stored in one record.
- Each record contains exactly one subtree.
- The root node of each record's subtree is called a standalone object; all other nodes are called embedded objects.
- The record size has an upper limit: the page size.

Large Trees
- For a large tree, the physical model must provide a mechanism for distributing the data tree over several pages.
- Method 1: a “flat” representation. It wastes the available structural information about the data.
- Method 2: split large objects based on the underlying tree structure.
  Proxy objects connect the subtrees of the large object that reside in other records.
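Method 2 can be illustrated by moving one subtree into its own record and leaving a proxy in its place. This is a hedged sketch: the `records` dictionary, the id scheme, and the helper name `move_subtree_to_new_record` are assumptions made for illustration, not part of Natix.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Proxy:
    """Placeholder that points to the record holding the moved subtree."""
    record_id: int


@dataclass
class Node:
    label: str
    children: List[Union["Node", Proxy]] = field(default_factory=list)


def move_subtree_to_new_record(parent: Node, index: int,
                               records: Dict[int, Node]) -> int:
    """Move parent.children[index] into a new record and leave a proxy behind."""
    new_id = max(records, default=0) + 1    # hypothetical id allocation
    records[new_id] = parent.children[index]
    parent.children[index] = Proxy(new_id)
    return new_id


# Illustrative document: record 1 initially holds the whole tree.
play = Node("play", [Node("act1", [Node("scene1")]), Node("act2")])
records: Dict[int, Node] = {1: play}
rid = move_subtree_to_new_record(play, 0, records)
print(rid, play.children[0])
```

The tree structure survives the move: following the proxy from record 1 leads to the subtree's new record, which is what a flat byte-stream representation cannot offer.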

A Sample Distribution of logical nodes on records
- Proxies (p1, p2)
- Helper aggregate objects (h1, h2)
- Scaffolding objects include proxies and helper aggregates.
- Facade objects ($f_i$)

Dynamic maintenance of an efficient storage
- The principal problem is that a record containing a subtree can grow larger than a page when a node is added or grows.
- The subtree contained in the record then has to be partitioned into several subtrees.
- Scaffolding nodes link the new records together in the physical tree.

Multiway tree representation of records

Tree Growth Procedure
- Step 1: Determine the record r into which the node has to be inserted.
- Step 2: If there is not enough space on the page, try to move r. If the record still does not fit, split it:
  (a) Determine the separator by recursively descending into r's subtree.
  (b) Distribute the resulting partitions onto records.
  (c) Insert the separator into the parent record, recursively calling this procedure.
- Step 3: Insert the new node.

Determining the Insertion Location
- There are several possible ways to insert a new node $f_n$ into the physical tree.
- The choice is determined by a configuration parameter.

Determining the separator
- Separator: a tree structure with proxies pointing to the new records, indicating where each part of the old record was moved.
- It consists of all the nodes on the path from a chosen node d to the subtree's root.
- The tree is partitioned into a left partition L, a right partition R, and the separator S.
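The partitioning into L, S, and R can be sketched as follows. The slide does not say where d itself ends up, so this sketch assumes that d is excluded from the separator and placed, together with its subtree, in the right partition. The node names and the example tree are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def find_path(root: Node, target: str) -> Optional[List[Node]]:
    """Return the nodes from root down to the node named target, or None."""
    if root.name == target:
        return [root]
    for child in root.children:
        sub = find_path(child, target)
        if sub is not None:
            return [root] + sub
    return None


def subtree_names(node: Node) -> Set[str]:
    """Names of node and all its descendants."""
    names = {node.name}
    for child in node.children:
        names |= subtree_names(child)
    return names


def partition(root: Node, d_name: str) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split the tree into (L, S, R) around the node named d_name."""
    path = find_path(root, d_name)
    if path is None:
        raise ValueError(f"no node named {d_name!r}")
    left: Set[str] = set()
    separator: Set[str] = set()
    right: Set[str] = subtree_names(path[-1])   # assumption: d goes to R
    for parent, on_path in zip(path, path[1:]):
        separator.add(parent.name)              # ancestors of d form S
        passed = False
        for child in parent.children:
            if child is on_path:
                passed = True
            elif passed:
                right |= subtree_names(child)   # siblings right of the path
            else:
                left |= subtree_names(child)    # siblings left of the path
    return left, separator, right


tree = Node("f1", [
    Node("f2", [Node("f3"), Node("f4")]),
    Node("f5", [Node("f6"), Node("f7", [Node("f8")]), Node("f9")]),
    Node("f10"),
])
L, S, R = partition(tree, "f7")
print(sorted(L), sorted(S), sorted(R))
```

Every node lands in exactly one of the three sets, so L and R can be stored in new records while S, with proxies attached, is handed to the parent record.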

A record’s subtree before a split occurs

Splitting a Record
Distributing the nodes on records:
- After the partitioning is determined, the contents of the record have to be distributed onto new records.
- Each resulting subtree is then stored in its own record, called a partition record.
Inserting the separator:
- The separator is moved to the parent record.

Split Algorithm
- Find a node d such that the resulting partitions L and R have the desired ratio of sizes.
- The ratio between the sizes of L and R is determined by a configuration parameter, the split target.
- Another configuration parameter, the split tolerance, specifies the minimum size for the subtree of d. It is used to prevent fragmentation.
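The interplay of split target and split tolerance can be sketched with a brute-force search over candidate nodes. This is not the recursive descent Natix uses. It further assumes that sizes are measured in node counts rather than bytes, that the split target is expressed as the fraction |L| / (|L| + |R|), and that d and its subtree belong to R. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def subtree_size(node: Node) -> int:
    return 1 + sum(subtree_size(child) for child in node.children)


def choose_split_node(root: Node, split_target: float,
                      split_tolerance: int) -> Optional[str]:
    """Pick the non-root node d whose L/R split is closest to split_target.

    Nodes whose subtree is smaller than split_tolerance are skipped,
    which keeps the split from producing tiny fragments.
    """
    total = subtree_size(root)
    best: Optional[Tuple[float, str]] = None
    index = 0  # pre-order position of the node being visited

    def visit(node: Node, depth: int) -> None:
        nonlocal index, best
        my_index = index
        index += 1
        if depth > 0 and subtree_size(node) >= split_tolerance:
            # Nodes before d in pre-order, minus d's ancestors, form L;
            # the ancestors form S; everything else (including d) is R.
            left = my_index - depth
            right = total - left - depth
            error = abs(left / (left + right) - split_target)
            if best is None or error < best[0]:
                best = (error, node.name)
        for child in node.children:
            visit(child, depth + 1)

    visit(root, 0)
    return None if best is None else best[1]


tree = Node("f1", [
    Node("f2", [Node("f3"), Node("f4")]),
    Node("f5", [Node("f6"), Node("f7", [Node("f8")]), Node("f9")]),
    Node("f10"),
])
print(choose_split_node(tree, split_target=0.5, split_tolerance=1))
print(choose_split_node(tree, split_target=0.5, split_tolerance=3))
```

Raising the tolerance rules out small subtrees as split points, so the chosen d moves to a larger subtree even when that gives a less balanced split.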

Record assembly for the subtree from previous figure

Physical storage of the tree represented inside one record

Performance Test
- Test data: an XML-markup version of Shakespeare's plays, 8 MB with 320,000 nodes.
- Platform: Pentium II at 333 MHz with 128 MB of memory, running Windows NT 4.0, with an IBM DCAS disk.
- The record and tree storage managers were implemented in C++.

Test Conditions
- Record:Node 1:1 means that smart record splitting is inhibited.
- Record:Node 1:n means that the algorithm has full control over the distribution of nodes on records.
- Incremental updates distributed over the whole document.
- Updates in pre-order (append).

Insertion

Full tree traversal

Queries
- Retrieve all speakers in the third act, second scene of every play; this accesses all leaf nodes of a certain type in one selected subtree of the document.
- Recreate the textual representation of the complete first speech in every scene; this reads many small contiguous fragments of each document.
- Evaluate a simple path query by reading only the opening speech of each play.

Selection on leaf nodes of document subtree

Small contiguous fragments

Single path for each document

Space requirements

Monet Model
- An XML document is decomposed into binary relations.
- The approach is efficient for storing and retrieving XML documents in a relational database.
- The database used is the authors' Monet database server, which supports the Monet model.

Some Definitions
- An XML document is a rooted tree $d = (V, E, r, \mathrm{label}_E, \mathrm{label}_A, \mathrm{rank})$ with nodes $V$, edges $E \subseteq V \times V$, and a distinguished node $r \in V$.
- The function $\mathrm{label}_E : V \to \mathrm{string}$ assigns labels to nodes.
- $\mathrm{label}_A : V \to \mathrm{string} \times \mathrm{string}$ assigns pairs of strings (attributes and their values) to nodes.
- $\mathrm{rank} : V \to \mathrm{int}$ establishes a ranking that allows for an order among nodes with the same parent node.
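The tuple in this definition can be built from a parsed document using Python's standard `xml.etree.ElementTree`. This is a minimal sketch under stated assumptions: node identifiers are integers assigned in document order, character data is left out, and the element names and the `key` attribute in the example are illustrative rather than taken from the slides.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Set, Tuple


def to_tree(xml_text: str):
    """Return (V, E, r, label_E, label_A, rank) for an XML document."""
    V: List[int] = []
    E: Set[Tuple[int, int]] = set()
    label_E: Dict[int, str] = {}
    label_A: Dict[int, Dict[str, str]] = {}
    rank: Dict[int, int] = {}

    def visit(elem: ET.Element, parent: Optional[int], position: int) -> int:
        oid = len(V)                      # assumption: ids in document order
        V.append(oid)
        label_E[oid] = elem.tag           # node label
        label_A[oid] = dict(elem.attrib)  # attribute/value pairs
        rank[oid] = position              # order among siblings
        if parent is not None:
            E.add((parent, oid))
        for i, child in enumerate(elem, start=1):
            visit(child, oid, i)
        return oid

    r = visit(ET.fromstring(xml_text), None, 1)
    return V, E, r, label_E, label_A, rank


doc = (
    '<bibliography><article key="BB99">'
    "<author>Ben Bit</author><title>How to Hack</title>"
    "</article></bibliography>"
)
V, E, r, label_E, label_A, rank = to_tree(doc)
print(sorted(E), label_E, rank)
```

Here `rank` is what distinguishes the author from the title under the same article: both share a parent, and only their rank records which comes first.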

A sample XML document Ben Bit How to Hack Ed Itor Bob Byte Ken Key Hacking & RSI

Syntax Tree of the Previous XML Document

Monet Transform
Given an XML document d, the Monet transform is a quadruple $M_t(d) = (r, R, A, T)$, where
- R is the set of binary relations that contain all associations between nodes;
- A is the set of binary relations that contain all associations between nodes and their attribute values, including character data;
- T is the set of binary relations that contain all pairs of nodes and their rank;
- r is the root of the document.
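A simplified version of this decomposition can be sketched in Python. Several choices here are assumptions of the sketch, not statements about the Monet implementation: relations are grouped by the label path from the root, attribute relations are named with an `@` prefix, character data is attached directly to its element under a `cdata` suffix, and tail text is ignored. The example document's element names and keys are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def monet_transform(xml_text: str):
    """Return (r, R, A, T): root id plus three families of binary relations."""
    R = defaultdict(list)  # path -> [(parent_oid, child_oid)]
    A = defaultdict(list)  # path -> [(oid, attribute value or character data)]
    T = defaultdict(list)  # path -> [(oid, rank)]
    counter = 0

    def visit(elem: ET.Element, path: str, position: int) -> int:
        nonlocal counter
        oid = counter
        counter += 1
        T[path].append((oid, position))
        for name, value in elem.attrib.items():
            A[f"{path}/@{name}"].append((oid, value))
        text = (elem.text or "").strip()
        if text:
            A[f"{path}/cdata"].append((oid, text))
        for i, child in enumerate(elem, start=1):
            child_path = f"{path}/{child.tag}"
            child_oid = visit(child, child_path, i)
            R[child_path].append((oid, child_oid))
        return oid

    root = ET.fromstring(xml_text)
    r = visit(root, root.tag, 1)
    return r, dict(R), dict(A), dict(T)


doc = (
    '<bibliography>'
    '<article key="BB99"><author>Ben Bit</author>'
    '<title>How to Hack</title></article>'
    '<article key="BK99"><author>Bob Byte</author><author>Ken Key</author>'
    '<title>Hacking &amp; RSI</title></article>'
    '</bibliography>'
)
r, R, A, T = monet_transform(doc)
for path, pairs in R.items():
    print(path, pairs)
```

Under this grouping, a path expression such as "all authors of articles" reads a single relation, `R["bibliography/article/author"]`, instead of scanning the whole document.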

Monet Transform of the Example Document

OQL-like query

Query Handling

Assessment
- Implemented within the Monet database server.
- Tested on a 550 MHz Silicon Graphics 1400 server with 1 GB of main memory.
- A 360 MHz Sun UltraSparc-IIi with 256 MB of main memory was also used, for comparison with a related work.

Size of document collections in XML and Monet XML format

Scaling of Document
- The ACM Anthology was scaled from 30 to $3 \times 10^6$, which corresponds to XML source sizes between 10 KB and 1 GB.
- Four queries, consisting of path expressions of length 1 through 4, were run for various sizes of the anthology.

Response Time vs. Result Size

Comparison of response times for the query set of SYU, another method for storing and retrieving XML documents.

Compare/Contrast Natix and Monet
- Natix uses a custom database, while Monet is built on top of a relational database.
- Neither uses a DTD.
- Natix focuses on XML query as well as update.
- Monet focuses on XML storage and query.
- Although no equivalent test is available, Monet appears to be faster than Natix on queries.
- Monet also seems to be more space-efficient than Natix.

References
- “Efficient storage of XML data” by Carl-Christian Kanne, et al. ICDE
- “Efficient Relational Storage and Retrieval of XML Documents” by Albrecht Schmidt, et al. WebDB m.html