XML as a Boxwood Data Structure Feng Zhou, John MacCormick, Lidong Zhou, Nick Murphy, Chandu Thekkath 8/20/04.

Slides:



Advertisements
Similar presentations
Boxwood: Distributed Data Structures as Storage Infrastructure Lidong Zhou Microsoft Research Silicon Valley Team Members: Chandu Thekkath, Marc Najork,
Advertisements

Boxwood: Abstractions as the Foundation for Storage Infrastructure Lidong Zhou, Microsoft Research Silicon Valley Joint work with Chandu Thekkath, John.
Jiaheng Lu, Ting Chen and Tok Wang Ling National University of Singapore Finding all the occurrences of a twig.
A View Based Security Framework for XML Wenfei Fan, Irini Fundulaki, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis University of Edinburgh Digital.
XML Data Management 8. XQuery Werner Nutt. Requirements for an XML Query Language David Maier, W3C XML Query Requirements: Closedness: output must be.
Querying on the Web: XQuery, RDQL, SparQL Semantic Web - Spring 2006 Computer Engineering Department Sharif University of Technology.
XML: Extensible Markup Language
Bottom-up Evaluation of XPath Queries Stephanie H. Li Zhiping Zou.
Min LuTIMBER: A Native XML DB1 TIMBER: A Native XML Database Author: H.V. Jagadish, etc. Presenter: Min Lu Date: Apr 5, 2005.
Spark: Cluster Computing with Working Sets
1 CS 561 Presentation: Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Ming Li.
QUANZHONG LI BONGKI MOON Indexing & Querying XML Data for../Regular Path Expressions/* SUNDAR SUPRIYA.
Chapter 11: File System Implementation
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
1 COS 425: Database and Information Management Systems XML and information exchange.
1 Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Amnon Shochot.
Database Systems and XML David Wu CS 632 April 23, 2001.
Storing and Querying Ordered XML Using a Relational Database System By Khang Nguyen Based on the paper of Igor Tatarinov and Statis Viglas.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Natix Done by Asmaa Hassanain CSC 5370 Dr. Hachim Haddoutti 12/8/2003.
...Looking back Why use a DBMS? How to design a database? How to query a database? How does a DBMS work?
Indexing XML Data Stored in a Relational Database VLDB`2004 Shankar Pal, Istvan Cseri, Gideon Schaller, Oliver Seeliger, Leo Giakoumakis, Vasili Vasili.
TECHNIQUES FOR OPTIMIZING THE QUERY PERFORMANCE OF DISTRIBUTED XML DATABASE - NAHID NEGAR.
Module 17 Storing XML Data in SQL Server® 2008 R2.
10/14/2001 Coping with Semantics in XML Document Management Thomas Kudrass Leipzig University of Applied Sciences Department of Computer Science and Mathematics.
Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E.
1 The Google File System Reporter: You-Wei Zhang.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Database Performance Tuning and Query Optimization.
XML in SQL Server Overview XML is a key part of any modern data environment It can be used to transmit data in a platform, application neutral form.
Chapter Oracle Server An Oracle Server consists of an Oracle database (stored data, control and log files.) The Server will support SQL to define.
TDDD43 XML and RDF Slides based on slides by Lena Strömbäck and Fang Wei-Kleiner 1.
Chapter pages1 File Management Chapter 12.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation XML Storage Techniques.
Efficient Keyword Search over Virtual XML Views Feng Shao and Lin Guo and Chavdar Botev and Anand Bhaskar and Muthiah Chettiar and Fan Yang Cornell University.
Sofia, Bulgaria | 9-10 October Using XQuery to Query and Manipulate XML Data Stephen Forte CTO, Corzen Inc Microsoft Regional Director NY/NJ (USA) Stephen.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
The main mathematical concepts that are used in this research are presented in this section. Definition 1: XML tree is composed of many subtrees of different.
Chapter 27 The World Wide Web and XML. Copyright © 2004 Pearson Addison-Wesley. All rights reserved.27-2 Topics in this Chapter The Web and the Internet.
1 CPS216: Advanced Database Systems Notes 04: Operators for Data Access Shivnath Babu.
A Summary of XISS and Index Fabric Ho Wai Shing. Contents Definition of Terms XISS (Li and Moon, VLDB2001) Numbering Scheme Indices Stored Join Algorithms.
NoSQL Databases Oracle - Berkeley DB. Content A brief intro to NoSQL About Berkeley Db About our application.
Efficiently Processing Queries on Interval-and-Value Tuples in Relational Databases Jost Enderle, Nicole Schneider, Thomas Seidl RWTH Aachen University,
Optimization in XSLT and XQuery Michael Kay. 2 Challenges XSLT/XQuery are high-level declarative languages: performance depends on good optimization Performance.
The european ITM Task Force data structure F. Imbeaux.
Database Systems Part VII: XML Querying Software School of Hunan University
ROOT I/O for SQL databases Sergey Linev, GSI, Germany.
Chapter 27 The World Wide Web and XML. Copyright © 2004 Pearson Addison-Wesley. All rights reserved.27-2 Topics in this Chapter The Web and the Internet.
1 Biometric Databases. 2 Overview Problems associated with Biometric databases Some practical solutions Some existing DBMS.
The Semistructured-Data Model Programming Languages for XML Spring 2011 Instructor: Hassan Khosravi.
XML and Database.
Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
File Systems. 2 What is a file? A repository for data Is long lasting (until explicitly deleted).
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 B+-Tree Index Chapter 10 Modified by Donghui Zhang Nov 9, 2005.
Query Caching and View Selection for XML Databases Bhushan Mandhani Dan Suciu University of Washington Seattle, USA.
CS 440 Database Management Systems Lecture 6: Data storage & access methods 1.
CS 540 Database Management Systems
1 Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno, Nick Koudas, Divesh Srivastava ACM SIGMOD 2002 Presented by Jun-Ki Min.
Scheduling of Transactions on XML Documents Author: Stijin Dekeyser Jan Hidders Reviewed by Jason Chen, Glenn, Steven, Christian.
11 Copyright © 2004, Oracle. All rights reserved. Managing XML Data in an Oracle 10g Database.
Indexing and Querying XML Data for Regular Path Expressions Quanzhong Li and Bongki Moon Dept. of Computer Science University of Arizona VLDB 2001.
CS 540 Database Management Systems
CS522 Advanced database Systems
XML and Databases.
Introduction to Query Optimization
OrientX: an Integrated, Schema-Based Native XML Database System
Toshiyuki Shimizu (Kyoto University)
2/18/2019.
Presentation transcript:

XML as a Boxwood Data Structure Feng Zhou, John MacCormick, Lidong Zhou, Nick Murphy, Chandu Thekkath 8/20/04

2 Overview Goal: Make XML a native data structure of the Boxwood storage virtualization system Motivation:  Useful for users of Boxwood because XML is gaining popularity as a way of storing and exchanging data  Validates the argument that the modular Boxwood design facilitates construction of complex data structures  Validates the design and implementation of the Boxwood prototype

3 Boxwood overview

4 XML example Beef Parmesan with Garlic Angel Hair Pasta Preheat oven to 350 degrees F (175 degrees C). elements attributes text fragment All called nodes in the XML data model

5 XML as a tree collection recipe titleingredientpreparationnutrition “Beef …”step “Preheat …”

6 Challenges in building a scalable XML store Must support flexible access methods  XPath: file system path like queries  XQuery: more complex SQL like queries Large amount of documents Large documents Document mutation (e.g. node insertions/removals)

7 Outline Overview XML to Boxwood mapping XPath query processing Evaluation

8 Numbering Scheme XPath requires fast access to related nodes: parent/children/sibling Could be done by maintaining pointers to these nodes but too much space overhead  At least 4 pointers needed (parent, first child, next sibling, previous sibling) XML db community proposed schemes to number nodes in regular ways that make these pointers unnecessary. We use the numbering scheme used in the eXist XML database.

9 Numbering nodes in the XML tree collection recipe titleingredientpreparationnutrition “Beef …”step “Preheat …” Max fan-out

10 Mapping to Boxwood Basic Boxwood data structures  Chunks (variable-sized, “persistent malloc()”)  B-Tree (fixed-sized keys  fixed-sized values)

11 XML store data structures “recipe” “doc2” … Structure Index node id  … “Beef…”, name=… Doc Directory … Element Index elem_name  * Value Index (one per user- defined index) value  * Data Chunks Serialized XML nodes B-Tree Chunk ingredient

12 Outline Overview XML to Boxwood mapping XPath query processing Performance and experience

13 XPath query processing XPath expressions  Path expressions:  Comparison expressions:  Arithmetic expressions  …

14 Evaluating path expressions Uses recently proposed XML query processing techniques (Li&Moon, VLDB’01) = No query optimization yet chapter book child descendant joins selections

15 Join algorithms //book/chapter Nested loop: 1. Find “book” elements by enumerating all elements 2. For each “book” element, enumerate all children and output the ones named “chapter” “Path join”: efficient join operations utilizing indexes 1. A={IDs of all elements named “book”}, obtained through the element index 2. B={IDs of all elements named “chapter”} child axis join: 3. Result= “child” join is one of the 13 different join operations chapter book child

16 Performance Results Single machine  Inserting an XML doc containing 47K nodes (SIGMOD conference records) into the store: 75 sec  Read the whole doc: 0.2 sec  A simple XPath (selecting all papers of a certain author): 1.2 sec (whole doc in cache) Scalability experiment  # of machines: 1-4 machines  Local cache size: 10K nodes  Document size: 10K / 40K XML nodes  Issue XPath queries from one machine, or from all machines in parallel

17 Scalability Experiment Results

18 Scalability Experiment Results

19 Issues using Boxwood Bulk-loading a B-tree is slow  Could support a “bulk-loading” mode where coarser- grained locking & recovery is used B-tree only supports fix-sized keys and values  The index implements a dictionary with variable-size keys and values on top of B-tree  Could be generalized Various performance issues  e.g. a flag set when opening the device file causing all i/o to be serialized

20 Boxwood benefits Chunk and B-tree abstracts data layout and distribution Data consistency easier  B-Trees already does proper locking  Others manually call global lock server Recovery support easier  B-Tree operations already transactional  Provide logging and recovery code in other cases Free data caching in B-tree and chunks 7K LOC, about 3K for XPath, reasonably easy to write

21 In this project, we devised a way of mapping XML to existing Boxwood data structures implemented data structures and algorithms to efficiently access and query XML documents demonstrated that Boxwood facilitates building of scalable data structures. made significant improvements to several aspects of Boxwood performance

22 Thank you!

23 Status Non-validating (no DTD or XML Schema support) XML parser XmlReader (.NET framework) interface for sequential access to docs Custom interface for random access Subset of XPath 2.0  Supports: all path expressions axes, predicates, all arithmetic and comparison expressions  Missing: variables, functions, control flow

24 Data vs. Document XML Data XML  e.g. sales orders, stock quotes, scientific data  more structured, fine-grained (smallest unit of data could be a number), order generally not significant Document XML  e.g. books, s, Word documents in XML  less regular, coarse-grained (smallest unit could be a paragraph), lots of mixed content