XML Data Management
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 25, 2008



Administrivia
- For next time, please read & review the TurboXPath paper

XML: A Format of Many Uses
- XML has become the standard for data interchange, and for many document representations
- Sometimes we’d like to store it…
  - Collections of text documents, e.g., the Web, doc DBs
    - … How would we want to query those?
    - IR/text queries, path queries, XQueries?
  - Interchanging data
    - SOAP messages, RSS, XML streams
    - Perhaps subsets of data from RDBMSs
  - Storing native, database-like XML data
  - Caching
  - Logging of XML messages

XML: Hierarchical Data and Its Challenges
- It’s not normalized…
  - It conceptually centers around some origin, meaning that navigation becomes central to querying and visualizing
  - Contrast with E-R diagrams
  - How to store the hierarchy?
    - Complex navigation may include going up, sideways in tree
    - Updates, locking
    - Optimization
- Also, it’s ordered
  - May restrict order of evaluation (or at least presentation)
  - Makes updates more complex
- Many of these issues aren’t unique to XML
  - Semistructured databases, esp. with ordered collections, were similar
  - But our efforts in that area basically failed…

Two Ways of Thinking of XML Processing
- XML databases (today)
  - Hierarchical storage + locking (Natix, TIMBER, BerkeleyDB, Tamino, …)
  - Query optimization
- “Streaming XML” (next time)
  - RDBMS → XML export
  - Partitioning of computation between source and mediator
  - “Streaming XPath” engines
- The difference is in storage (or lack thereof)

XML in a Database
- Use a legacy RDBMS
  - Shredding [Shanmugasundaram+99] and many others
  - Path-based encodings [Cooper+01]
  - Region-based encodings [Bruno+02][Chen+04]
  - Order preservation in updates [Tatarinov+02], …
  - What’s novel here? How does this relate to materialized views and warehousing?
- Native XML databases
  - Hierarchical storage (Natix, TIMBER, BerkeleyDB, Tamino, …)
  - Updates and locking
  - Query optimization (e.g., that on Galax)
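
To make the region-based encoding bullet concrete, here is a minimal sketch, assuming the common (start, end, level) labeling assigned by one depth-first pass. The function names `region_encode`, `is_ancestor`, and `is_parent` and the sample document are illustrative assumptions, not taken from any of the cited papers.

```python
import xml.etree.ElementTree as ET


def region_encode(root):
    """Return (tag, start, end, level) rows in document order."""
    rows = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        slot = len(rows)
        rows.append(None)  # reserve the slot so rows stay in document order
        for child in elem:
            visit(child, level + 1)
        counter += 1
        rows[slot] = (elem.tag, start, counter, level)

    visit(root, 0)
    return rows


def is_ancestor(a, d):
    # a contains d exactly when d's interval nests strictly inside a's
    return a[1] < d[1] and d[2] < a[2]


def is_parent(a, d):
    return is_ancestor(a, d) and d[3] == a[3] + 1


doc = ET.fromstring(
    "<book><title/><chapter><section/><section/></chapter></book>"
)
rows = region_encode(doc)
book, title, chapter, sec1, sec2 = rows
```

With this labeling, an ancestor/descendant test is two integer comparisons, which is why such encodings fit relational joins. The cost shows up on insertion, since new nodes need labels between existing ones.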

Query Processing for XML
- Why is optimization harder?
  - Hierarchy means many more joins (conceptually)
    - “traverse”, “tree-match”, “x-scan”, “unnest”, “path”, … op
    - Though typically parent-child relationships
    - Often don’t have good measure of “fan-out”
    - More ways of optimizing this
  - Order preservation limits processing in many ways
  - Nested content ~ left outer join
    - Except that we need to cluster a collection with the parent
    - Relationship with NF² approach
  - Tags (don’t really add much complexity except in trying to encode efficiently)
  - Complex functions and recursion
    - Few real DB systems implement these fully
- Why is storage harder?
  - That’s the focus of Natix, really
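
The “nested content ~ left outer join” point can be illustrated with a small sketch. The department/employee data and the function names are made up for illustration. The sketch assumes the join output arrives clustered by parent, which is exactly the extra requirement the slide mentions.

```python
from itertools import groupby


def left_outer_join(parents, children):
    """Join (id, name) parents with (parent_id, name) children.

    Parents with no children survive with a None child.
    """
    rows = []
    for pid, pname in parents:
        matches = [c for c in children if c[0] == pid]
        if matches:
            rows.extend((pid, pname, cname) for _, cname in matches)
        else:
            rows.append((pid, pname, None))
    return rows


def nest(rows):
    """Turn flat join rows into (parent, [children]) pairs.

    groupby only merges adjacent rows, so the input must be clustered
    by parent; otherwise one parent is emitted more than once.
    """
    nested = []
    for (_, pname), group in groupby(rows, key=lambda r: (r[0], r[1])):
        nested.append((pname, [c for _, _, c in group if c is not None]))
    return nested


departments = [(1, "db"), (2, "os")]
employees = [(1, "ann"), (1, "bob")]

flat = left_outer_join(departments, employees)
nested = nest(flat)
```

An inner join would drop the childless “os” parent entirely. The outer join keeps it with an empty collection, matching how an XML element with no children still appears in the output.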

The Natix System
- In contrast to many pieces of work on XML, focuses on the bottom layers, equivalent to System R’s RSS
  - Physical layout
  - Indexing
  - Locking/concurrency control
  - Logging/recovery

Physical Layout
- What are our options in storing XML trees?
  - At some level, it’s all smoke-and-mirrors
    - Need to map to “flat” byte sequences on disk
  - But several options:
    - Shred completely, as in many RDBMS mappings
      - Each path may get its own contiguous set of pages, e.g., vectorized XML [Buneman et al.]
      - An element may get its 1:1 children, e.g., shared inlining [Shanmugasundaram+] and [Chen+]
      - All content may be in one table, e.g., [Florescu/Kossmann] and most interval-encoded XML
    - We may embed a few items on the same page and “overflow” the rest
      - How collections are often stored in ORDBMS
    - We may try to cluster XML trees on the same page, as “interpreted BLOBs”
      - This is Natix’s approach (and also IBM’s DB2)
- Pros and cons of these approaches?
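
As a rough illustration of the “all content in one table” option, the sketch below shreds a document into a single edge table and answers a child-axis path with self-joins. The table layout (`id`, `parent`, `ordinal`, `tag`, `text`), the `shred` helper, and the sample document are assumptions chosen for the example, not the schema of any cited system.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(root, conn):
    """Store every element as one row of a single edge table."""
    conn.execute(
        "CREATE TABLE edge (id INTEGER PRIMARY KEY, parent INTEGER, "
        "ordinal INTEGER, tag TEXT, text TEXT)"
    )
    next_id = 0

    def visit(elem, parent_id, ordinal):
        nonlocal next_id
        next_id += 1
        node_id = next_id
        conn.execute(
            "INSERT INTO edge VALUES (?, ?, ?, ?, ?)",
            (node_id, parent_id, ordinal, elem.tag, (elem.text or "").strip()),
        )
        for position, child in enumerate(elem):
            visit(child, node_id, position)

    visit(root, None, 0)


conn = sqlite3.connect(":memory:")
doc = ET.fromstring(
    "<book><title>XML</title><chapter>"
    "<section>Intro</section><section>Storage</section>"
    "</chapter></book>"
)
shred(doc, conn)

# /book/chapter/section needs one self-join per path step.
PATH_QUERY = """
SELECT s.text
FROM edge b
JOIN edge c ON c.parent = b.id AND c.tag = 'chapter'
JOIN edge s ON s.parent = c.id AND s.tag = 'section'
WHERE b.parent IS NULL AND b.tag = 'book'
ORDER BY c.ordinal, s.ordinal
"""
sections = [row[0] for row in conn.execute(PATH_QUERY)]
```

The three-step path already needs two joins, and document order has to be recovered explicitly through the `ordinal` column. Clustering a whole subtree on one page avoids both costs for navigation within that subtree.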

Challenges of the Page-per-Tree Approach
- How big of a tree?
  - What happens if the XML overflows the tree?
- Natix claims an adaptive approach to choosing the tree’s granularity
  - Primarily based on balancing the tree, constraints on children that must appear with a parent
  - What other possibilities make sense?
- Natix uses a B+ Tree-like scheme for achieving balance and splitting a tree across pages

Example
- Split point in parent page
- Note “proxy” nodes
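
The role of proxy nodes can be sketched in a few lines. This is a deliberately simple greedy scheme and not Natix’s actual split algorithm: when a subtree does not fit on a page, the largest child subtree moves to its own page and a proxy takes its place in the parent. The `Node`, `Proxy`, and `store` names, the unit sizes, and the capacity of 6 are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    size: int = 1
    children: list = field(default_factory=list)


@dataclass
class Proxy:
    page_id: int  # page that holds the subtree this proxy stands for
    size: int = 1


def subtree_size(node):
    if isinstance(node, Proxy):
        return node.size
    return node.size + sum(subtree_size(c) for c in node.children)


def store(node, capacity, pages):
    """Place node's subtree on pages; return the page id of its root."""
    while subtree_size(node) > capacity:
        movable = [i for i, c in enumerate(node.children) if isinstance(c, Node)]
        if not movable:
            raise ValueError("node and its proxies do not fit on one page")
        victim = max(movable, key=lambda i: subtree_size(node.children[i]))
        child_page = store(node.children[victim], capacity, pages)
        node.children[victim] = Proxy(child_page)
    page_id = len(pages)
    pages[page_id] = node
    return page_id


def sections(n):
    return [Node("section") for _ in range(n)]


book = Node("book", children=[
    Node("title"),
    Node("chapter", children=sections(3)),
    Node("chapter", children=sections(2)),
])
pages = {}
root_page = store(book, capacity=6, pages=pages)
```

Here the 9-unit document ends up on two pages: the larger chapter on page 0 and the book root, with a proxy pointing to page 0, on page 1. Navigation that crosses a proxy costs a page fetch, which is why the choice of split point matters.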

That Was Simple – But What about Updates?
- Clearly, insertions and deletions can affect things
  - Deletion may ultimately require us to rebalance
  - Ditto with insertion
  - But insertion also may make us run out of space – what to do?
    - Their approach: add another page; ultimately may need to split at multiple levels, as in B+ Tree
- Others have studied this problem and used integer encoding schemes (plus B+ Trees) for the order
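
One way to picture an integer encoding for sibling order is gap-based numbering: keys are spaced out so most insertions take a free integer between two neighbors, and the keys are renumbered only when a gap is exhausted. The sketch below assumes that scheme. The `OrderedSiblings` class and the gap of 8 are illustrative, and a plain Python list stands in for the B+ Tree that would index the keys in a real system.

```python
GAP = 8


class OrderedSiblings:
    def __init__(self, items):
        self.items = list(items)
        self.keys = [(i + 1) * GAP for i in range(len(self.items))]
        self.renumberings = 0

    def _renumber(self):
        # Expensive case: every sibling gets a fresh, evenly spaced key.
        self.keys = [(i + 1) * GAP for i in range(len(self.items))]
        self.renumberings += 1

    def insert_after(self, position, item):
        """Insert item after the sibling at index position (-1 = front)."""
        low = self.keys[position] if position >= 0 else 0
        if position + 1 < len(self.keys):
            high = self.keys[position + 1]
        else:
            high = low + 2 * GAP
        if high - low < 2:  # no free integer strictly between neighbors
            self._renumber()
            return self.insert_after(position, item)
        self.keys.insert(position + 1, (low + high) // 2)
        self.items.insert(position + 1, item)


siblings = OrderedSiblings(["a", "b"])
for name in ["x", "y", "z", "w"]:
    siblings.insert_after(0, name)  # keep inserting right after "a"
```

Repeated insertion at the same spot halves the gap each time, so the fourth insert triggers one renumbering. Larger gaps postpone that cost at the price of wider keys.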

Does this Help?
- According to general lore, yes
  - The Natix experiments in this paper were limited in their query and adaptivity loads
  - But the IBM people say their approach, which is similar, works significantly better than Oracle’s shredded approach

There’s More to Updates than the Pages
- What about concurrency control and recovery?
- We already have a notion of hierarchical locks, but they claim:
  - If we want to support IDREF traversal, and indexing directly to nodes, we need more
- What’s the idea behind SPP locking?
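
For reference, the “hierarchical locks” baseline can be sketched as standard multi-granularity locking: a transaction takes intention locks (IS/IX) on every ancestor before locking the target node. The `LockTable` class and the node paths below are assumptions for illustration, and the sketch does not attempt to model SPP locking.

```python
# Standard multi-granularity compatibility matrix.
COMPATIBLE = {
    "IS": {"IS", "IX", "S", "SIX"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "SIX": {"IS"},
    "X": set(),
}


class LockTable:
    def __init__(self):
        self.held = {}  # node path (tuple) -> list of (txn, mode)

    def _grantable(self, node, txn, mode):
        return all(
            owner == txn or mode in COMPATIBLE[held_mode]
            for owner, held_mode in self.held.get(node, [])
        )

    def lock(self, txn, path, mode):
        """Lock the node at path (root first) in mode 'S' or 'X'.

        Returns False instead of blocking when a conflict exists.
        """
        intent = "IS" if mode == "S" else "IX"
        requests = [(path[:i], intent) for i in range(1, len(path))]
        requests.append((path, mode))
        if not all(self._grantable(n, txn, m) for n, m in requests):
            return False
        for node, node_mode in requests:
            self.held.setdefault(node, []).append((txn, node_mode))
        return True


table = LockTable()
r1 = table.lock("T1", ("doc", "ch1", "s1"), "X")  # IX on doc, ch1; X on s1
r2 = table.lock("T2", ("doc", "ch2"), "S")        # disjoint subtree
r3 = table.lock("T2", ("doc", "ch1"), "S")        # conflicts with T1's IX
r4 = table.lock("T3", ("doc",), "X")              # conflicts at the root
```

The scheme relies on every access walking down from the root so that the ancestors carry intention locks. A transaction that reaches a node through an IDREF or an index entry skips that walk, which is the gap the slide points to.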

Logging
- They claim ARIES needs some modifications – why?
- Their changes:
  - Need to make subtree updates more efficient – don’t want to write a log entry for each subtree insertion
    - Use (a copy of) the page itself as a means of tracking what was inserted, then batch-apply to WAL
  - “Annihilators”: if we undo a tree creation, then we probably don’t need to worry about undoing later changes to that tree
  - A few minor tweaks to minimize undo/redo when only one transaction touches a page

Annihilators
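
The annihilator idea can be sketched as a filter over the undo work for an aborting transaction: if the transaction created a tree, undoing that creation makes the undo of its later changes to the same tree unnecessary. The log record layout and the `undo_plan` function are hypothetical, and the sketch ignores pages, LSN chaining, and compensation records that a real ARIES-style recovery manager would need.

```python
from collections import namedtuple

LogRecord = namedtuple("LogRecord", "lsn txn op tree node")


def undo_plan(log, txn):
    """Return the LSNs to undo, newest first, when rolling back txn."""
    own = [r for r in log if r.txn == txn]
    # Trees this transaction created: undoing the creation drops them.
    created = {r.tree for r in own if r.op == "create_tree"}
    plan = []
    for record in reversed(own):
        if record.tree in created and record.op != "create_tree":
            continue  # annihilated by the undo of the tree creation
        plan.append(record.lsn)
    return plan


log = [
    LogRecord(1, "T1", "create_tree", "t1", "root"),
    LogRecord(2, "T1", "insert", "t1", "n1"),
    LogRecord(3, "T2", "insert", "t0", "n9"),
    LogRecord(4, "T1", "insert", "t1", "n2"),
    LogRecord(5, "T1", "insert", "t0", "n3"),
    LogRecord(6, "T1", "update", "t1", "n1"),
]
plan_t1 = undo_plan(log, "T1")
plan_t2 = undo_plan(log, "T2")
```

Rolling back T1 touches two records instead of five: the insert into the pre-existing tree `t0` and the creation of `t1`. The three changes inside `t1` disappear along with the tree.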

Assessment
- Native XML storage isn’t really all that different from other means of storage
- There are probably some good reasons to make a few tweaks in locking
- Optimization remains harder
- A real solution to materialized view creation would probably make RDBMSs come close to delivering the same performance, modulo locking

Next Time: “Streaming XML”
- An XQuery consists of a series of XPath expressions in the FOR/LET clauses, plus a WHERE condition and a RETURN constructor
- The FOR/LET clauses create bindings between variables and nodes (or node sets)
- We can consider a set of bindings to be a tuple
- So: can we build an XPath matcher that processes XML across the network, and produces tuple streams?
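
As a preview of that question, here is a minimal sketch of a streaming matcher that emits one binding tuple per FOR match, assuming only child-axis steps. It uses the standard library’s incremental parser. The function name `stream_tuples`, its arguments, and the sample `bib` document are illustrative assumptions, not the design of TurboXPath or any other engine.

```python
import io
import xml.etree.ElementTree as ET


def stream_tuples(source, for_path, let_steps):
    """Yield one tuple of bindings per element matching for_path.

    for_path:  tags from the root down, e.g. ["bib", "book"]
    let_steps: child tags bound relative to each match
    """
    stack = []
    current = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            if stack == for_path:
                current = {step: None for step in let_steps}
            continue
        # "end" event: the element's text is complete at this point
        is_child = current is not None and len(stack) == len(for_path) + 1
        if is_child and elem.tag in current and current[elem.tag] is None:
            current[elem.tag] = (elem.text or "").strip()
        if stack == for_path:
            yield tuple(current[step] for step in let_steps)
            current = None
            elem.clear()  # release the matched subtree's content
        stack.pop()


xml_bytes = (
    b"<bib><book><title>A</title><year>2001</year></book>"
    b"<book><title>B</title></book></bib>"
)
tuples = list(stream_tuples(io.BytesIO(xml_bytes), ["bib", "book"],
                            ["title", "year"]))
```

Each tuple is produced as soon as the matching element closes, so a consumer can start joining or filtering before the rest of the document has arrived. A missing binding shows up as `None`, much like the outer-join behavior of nested content.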