XML Databases Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems March 23, 2005.


1 XML Databases Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems March 23, 2005

2 Administrivia
- We’re moving beyond simple databases now…
- For Monday, read and compare the focus of:
  - Hanson: Scalable Trigger Processing
  - Stanford STREAM processor
- For Wednesday:
  - Retrospective on Aurora

3 Today’s Trivia Question

4 XML: What Makes It Hard?
- It’s not normalized…
  - It conceptually centers around some origin, meaning that navigation becomes central
    - Contrast with E-R diagrams
  - How to store the hierarchy?
    - Complex navigation
    - Updates, locking
    - Optimization
- Also, it’s ordered
  - May restrict order of evaluation (or at least presentation)
  - Makes updates more complex
- Many of these issues aren’t unique to XML
  - Semistructured databases, esp. with ordered collections, were similar
  - But our efforts in that area basically failed…

5 XML: What’s It Good For?
- Collections of text documents, e.g., the Web, doc DBs
  - … How would we want to query those?
  - IR/text queries, path queries, XQueries?
- Interchanging data
  - SOAP messages, RSS, XML streams
  - Perhaps subsets of data from RDBMSs
- Storing native, database-like XML data
- Caching
- Logging of XML messages
- …?

6 Lots of XML Research Out There
- Text:
  - Hybrids of database and IR techniques for search (e.g., Amer-Yahia & Shanmugasundaram, Weikum & Ramakrishnan, …)
- Interchange:
  - Web service verification
  - XML stream processing
- XML databases:
  - Natix, TIMBER, …
  - Tamino, DB2 UDB, Oracle, …

7 The Main Focal Points
- XML with documents
  - Inverted indices
  - Integration of ranking into DBMS
  - Interaction between structure and content
- “Streaming XML”
  - RDBMS → XML export
  - Partitioning of computation between source and mediator
  - “Streaming XPath” engines
- XML databases
  - Hierarchical storage + locking (Natix, TIMBER, BerkeleyDB, Tamino, …)
  - Query optimization

8 Text-Based XML
- The fundamental questions:
  1. How should we model ranking in query processing?
     - Simply as another value (e.g., Amer-Yahia & Shanmugasundaram)
     - Using a probabilistic model or as an undefined metric (e.g., Weikum and Ramakrishnan work-in-progress)
  2. How does structure affect ranking?
     - PageRank-style (e.g., Shanmugasundaram et al.)
     - Query relaxation (FleXPath)
     - Other?
  3. How do we achieve efficient pruning?
     - A* search [Cohen 98]
     - Fagin’s Threshold Algorithm
     - Custom logic?
  4. How do we integrate keyword indexing with structural indexing?
     - Multiple indices (e.g., Lore, Natix, …)
     - Integrated indices (e.g., ViST)
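As a rough illustration of the pruning question, here is a minimal Python sketch of the Threshold Algorithm for top-k retrieval. It assumes each ranking source is a dict from item to a non-negative score, that a missing item scores 0, and that the combined score is a plain sum; the function name `threshold_top_k` and the data layout are illustrative choices, not taken from any of the cited systems.

```python
import heapq

def threshold_top_k(lists, k):
    """Return the k items with the highest summed score as (score, item) pairs.

    lists: one dict per ranking source, item -> non-negative score (assumed layout).
    """
    # Sorted access: each source is read in descending score order.
    sorted_lists = [sorted(src.items(), key=lambda kv: -kv[1]) for src in lists]
    seen = set()   # items whose total score is already known
    top = []       # min-heap of (total, item) holding the best k so far
    max_depth = max(len(s) for s in sorted_lists)
    for depth in range(max_depth):
        last_scores = []
        for s in sorted_lists:
            if depth >= len(s):
                last_scores.append(0)   # exhausted source: unseen items score 0 here
                continue
            item, score = s[depth]
            last_scores.append(score)
            if item not in seen:
                seen.add(item)
                # Random access: look the item up in every source.
                total = sum(src.get(item, 0) for src in lists)
                heapq.heappush(top, (total, item))
                if len(top) > k:
                    heapq.heappop(top)
        # No unseen item can score higher than the sum of the last scores read.
        threshold = sum(last_scores)
        if len(top) == k and top[0][0] >= threshold:
            break   # prune: the remaining entries cannot enter the top k
    return sorted(top, reverse=True)
```

With two sources, the loop can stop before reading every entry once the k-th best total reaches the threshold, which is the pruning effect the slide asks about.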

9 XML as a Wire Format
- RDBMS → XML export
  - SilkRoute and Xperanto, outer unions
  - Interaction with RDBMS optimization techniques
- Updates [Tatarinov+01]
  - Cascading updates are already possible in RDBMSs
  - Updating XML views
- Streaming XML
  - SAX-based XPath-matching engines [Ives+01][Altinel&Franklin00][Green+02][Diao&Franklin][Chen+] …
  - Push-down of XPath matching as early as possible
  - Query decomposition (still in need of a standard means of pushing XQuery to a source)
  - Subsets of XQuery that are amenable to streaming
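To make the idea of SAX-based path matching concrete, here is a small Python sketch using the standard `xml.sax` module. It handles only absolute paths made of child steps (such as `/db/book/title`), which is far less than the cited engines support; the class name `PathMatcher` and helper `match_path` are hypothetical.

```python
import xml.sax

class PathMatcher(xml.sax.ContentHandler):
    """Collects the text of every element whose root-to-node path equals `path`."""

    def __init__(self, path):
        super().__init__()
        self.steps = path.strip("/").split("/")
        self.stack = []      # tags of the currently open elements
        self.matches = []    # text content of matched elements, in document order
        self._buf = None     # text buffer, active only inside a match

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.steps:
            self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if self.stack == self.steps and self._buf is not None:
            self.matches.append("".join(self._buf))
            self._buf = None
        self.stack.pop()

def match_path(xml_text, path):
    handler = PathMatcher(path)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matches
```

The handler keeps only the stack of open tags and one text buffer, so memory use depends on document depth and match size rather than on document length.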

10 XML in a Database
- Use a legacy RDBMS
  - Shredding [Shanmugasundaram+99] and many others
  - Path-based encodings [Cooper+01]
  - Region-based encodings [Bruno+02][Chen+04]
  - Order preservation in updates [Tatarinov+02], …
  - What’s novel here? How does this relate to materialized views and warehousing?
- Native XML databases
  - Hierarchical storage (Natix, TIMBER, BerkeleyDB, Tamino, …)
  - Updates and locking
  - Query optimization (e.g., that on Galax)
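A region-based encoding can be sketched in a few lines of Python. The version below assumes one common variant, where each element gets a (start, end, level) triple from a single counter during a depth-first walk; the cited papers differ in details, and the names `region_encode`, `is_ancestor`, and `is_parent` are illustrative.

```python
import xml.etree.ElementTree as ET

def region_encode(xml_text):
    """Return one (tag, start, end, level) row per element, in document order."""
    rows = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        slot = len(rows)
        rows.append(None)            # reserve the row so output stays in document order
        for child in elem:
            visit(child, level + 1)
        counter += 1
        rows[slot] = (elem.tag, start, counter, level)

    visit(ET.fromstring(xml_text), 0)
    return rows

def is_ancestor(a, d):
    # a contains d exactly when d's interval nests strictly inside a's interval.
    return a[1] < d[1] and d[2] < a[2]

def is_parent(a, d):
    return is_ancestor(a, d) and a[3] + 1 == d[3]
```

Once the rows sit in a relational table, an ancestor-descendant step becomes a join on two inequality predicates over `start` and `end`, with no pointer chasing.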

11 Query Processing for XML
- Why is optimization harder?
  - Hierarchy means many more joins (conceptually)
    - “traverse”, “tree-match”, “x-scan”, “unnest”, “path”, … operators
    - Though typically parent-child relationships
    - Often don’t have good measure of “fan-out”
    - More ways of optimizing this
  - Order preservation limits processing in many ways
  - Nested content ~ left outer join
    - Except that we need to cluster a collection with the parent
    - Relationship with NF² approach
  - Tags (don’t really add much complexity except in trying to encode efficiently)
  - Complex functions and recursion
    - Few real DB systems implement these fully
- Why is storage harder?
  - That’s the focus of Natix, really
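The "nested content ~ left outer join" point can be illustrated with a short Python sketch over an in-memory SQLite database. The `book` and `author` tables and the `nested_books` function are made up for the example; the point is that the join keeps parents with no children, and that sorting by parent is what lets each collection be clustered with its parent afterwards.

```python
import sqlite3
from itertools import groupby

def nested_books(conn):
    """Build one nested record per book, keeping books that have no authors."""
    rows = conn.execute(
        "SELECT b.id, b.title, a.name "
        "FROM book b LEFT OUTER JOIN author a ON a.book_id = b.id "
        "ORDER BY b.id, a.pos"   # sort so each book's rows are contiguous and ordered
    ).fetchall()
    result = []
    for (_, title), group in groupby(rows, key=lambda r: (r[0], r[1])):
        # A book without authors yields a single row with name = NULL.
        authors = [name for _, _, name in group if name is not None]
        result.append({"title": title, "authors": authors})
    return result

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE author (book_id INTEGER, pos INTEGER, name TEXT);
    INSERT INTO book VALUES (1, 'XML Basics');
    INSERT INTO book VALUES (2, 'Untitled Notes');
    INSERT INTO author VALUES (1, 2, 'Lee');
    INSERT INTO author VALUES (1, 1, 'Kim');
""")
```

The `pos` column stands in for document order, which the relational side has to carry explicitly.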

12 The Natix System
- In contrast to many pieces of work on XML, focuses on the bottom layers, equivalent to System R’s RSS
  - Physical layout
  - Indexing
  - Locking/concurrency control
  - Logging/recovery

13 Physical Layout
- What are our options in storing XML trees?
  - At some level, it’s all smoke-and-mirrors
    - Need to map to “flat” byte sequences on disk
  - But several options:
    - Shred completely, as in many RDBMS mappings
      - Each path may get its own contiguous set of pages (e.g., vectorized XML [Buneman et al.])
      - An element may get its 1:1 children (e.g., shared inlining [Shanmugasundaram+] and [Chen+])
      - All content may be in one table (e.g., [Florescu/Kossmann] and most interval encoded XML)
    - We may embed a few items on the same page and “overflow” the rest
      - How collections are often stored in ORDBMS
    - We may try to cluster XML trees on the same page, as “interpreted BLOBs”
      - This is Natix’s approach (and also IBM’s DB2)
- Pros and cons of these approaches?
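For the "all content in one table" option, a single-table shredding can be sketched as follows. The row layout (parent_id, ordinal, tag, node_id, text) is an assumption made for this example, loosely in the spirit of an edge table rather than the exact schema of any cited paper; `shred_to_edges` and `children` are hypothetical names.

```python
import xml.etree.ElementTree as ET

def shred_to_edges(xml_text):
    """Return rows (parent_id, ordinal, tag, node_id, text); the root's parent_id is None."""
    rows = []
    next_id = 0

    def visit(elem, parent_id, ordinal):
        nonlocal next_id
        node_id = next_id
        next_id += 1
        text = (elem.text or "").strip() or None
        rows.append((parent_id, ordinal, elem.tag, node_id, text))
        # The ordinal records sibling order, which a plain relation would otherwise lose.
        for i, child in enumerate(elem, start=1):
            visit(child, node_id, i)

    visit(ET.fromstring(xml_text), None, 1)
    return rows

def children(rows, parent_id, tag):
    """One navigation step: children of parent_id with the given tag, in document order."""
    matching = [r for r in rows if r[0] == parent_id and r[2] == tag]
    return sorted(matching, key=lambda r: r[1])
```

Every navigation step is a selection (or self-join) on this one table, which is one reason a fully shredded layout turns path queries into many joins.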

14 Challenges of the Page-per-Tree Approach
- How big of a tree?
  - What happens if the XML overflows the tree?
- Natix claims an adaptive approach to choosing the tree’s granularity
  - Primarily based on balancing the tree, constraints on children that must appear with a parent
  - What other possibilities make sense?
- Natix uses a B+ Tree-like scheme for achieving balance and splitting a tree across pages

15 Example
[Figure annotations: split point in parent page; note “proxy” nodes]
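The idea of splitting a tree across pages and leaving proxy nodes behind can be sketched with a deliberately simple greedy rule. This is not Natix's split algorithm, which is B+ Tree-like and balance-aware; the classes `Node` and `Proxy`, the size units, and the "move the largest child subtree" policy are all assumptions made for the example.

```python
class Node:
    def __init__(self, tag, size=1, children=None):
        self.tag = tag
        self.size = size                  # space the node itself occupies (abstract units)
        self.children = children or []

class Proxy(Node):
    """Placeholder left in the parent page, pointing at the page holding the subtree."""
    def __init__(self, page_id):
        super().__init__("proxy", size=1)
        self.page_id = page_id

def subtree_size(node):
    return node.size + sum(subtree_size(c) for c in node.children)

def store(root, capacity, pages):
    """Put the subtree on a new page, moving child subtrees out until it fits."""
    page_id = len(pages)
    pages.append(root)
    while subtree_size(root) > capacity:
        candidates = [c for c in root.children if not isinstance(c, Proxy)]
        if not candidates:
            raise ValueError("node and its proxies do not fit on one page")
        victim = max(candidates, key=subtree_size)   # greedy choice, an assumption
        child_page = store(victim, capacity, pages)
        # Keep the child's position so sibling order survives the split.
        root.children[root.children.index(victim)] = Proxy(child_page)
    return page_id
```

A real system would also have to decide which children must stay with their parent, which is the granularity question raised on the previous slide.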

16 That Was Simple – But What about Updates?
- Clearly, insertions and deletions can affect things
  - Deletion may ultimately require us to rebalance
  - Ditto with insertion
  - But insertion also may make us run out of space – what to do?
    - Their approach: add another page; ultimately may need to split at multiple levels, as in B+ Tree
    - Others have studied this problem and used integer encoding schemes (plus B+ Trees) for the order
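One simple form of an integer encoding for order is to number siblings with gaps and renumber only when a gap is used up. The sketch below assumes a fixed gap of 100 and a midpoint rule; it is meant to show the trade-off, not to reproduce any specific published scheme, and `initial_keys` and `insert_key` are hypothetical names.

```python
GAP = 100  # spacing between consecutive order keys (an arbitrary choice)

def initial_keys(n):
    """Order keys for n siblings, leaving room for later insertions."""
    return [GAP * (i + 1) for i in range(n)]

def insert_key(keys, index):
    """Insert a new sibling at position `index` of the sorted key list.

    Returns (new_keys, renumbered), where renumbered is True when the gap was
    exhausted and every sibling had to receive a fresh key.
    """
    lo = keys[index - 1] if index > 0 else 0
    hi = keys[index] if index < len(keys) else lo + 2 * GAP
    if hi - lo >= 2:
        new_key = (lo + hi) // 2          # strictly between lo and hi
        return keys[:index] + [new_key] + keys[index:], False
    # No integer left between the neighbours: fall back to renumbering.
    return initial_keys(len(keys) + 1), True
```

Most inserts touch a single key, but repeated insertion at the same spot halves the gap each time and eventually forces a renumbering of all siblings.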

17 Does this Help?
- According to general lore, yes
  - The Natix experiments in this paper were limited in their query and adaptivity loads
  - But the IBM guys say their approach, which is similar, works significantly better than Oracle’s shredded approach

18 There’s More to Updates than the Pages
- What about concurrency control and recovery?
- We already have a notion of hierarchical locks, but they claim:
  - If we want to support IDREF traversal, and indexing directly to nodes, we need more
  - What’s the idea behind SPP locking?
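As background for the "hierarchical locks" the slide refers to, here is a minimal Python sketch of classical multi-granularity locking with intention modes, applied to root-to-node paths in a document tree. It is not SPP locking; the `LockTable` class is hypothetical, and lock release, upgrades, and waiting are omitted.

```python
# Classical multi-granularity compatibility: requested mode -> modes it can coexist with.
COMPATIBLE = {
    "IS": {"IS", "IX", "S", "SIX"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "SIX": {"IS"},
    "X": set(),
}
# Intention mode to take on every ancestor before locking a node in the given mode.
INTENT = {"IS": "IS", "S": "IS", "IX": "IX", "SIX": "IX", "X": "IX"}

class LockTable:
    def __init__(self):
        self.held = {}   # path (tuple of tags from the root) -> list of (txn, mode)

    def _grantable(self, path, txn, mode):
        return all(other == txn or m in COMPATIBLE[mode]
                   for other, m in self.held.get(path, []))

    def lock(self, txn, path, mode):
        """Try to lock the node at `path`; returns True if granted, False on conflict."""
        requests = [(path[:i], INTENT[mode]) for i in range(1, len(path))]
        requests.append((path, mode))
        if not all(self._grantable(p, txn, m) for p, m in requests):
            return False
        for p, m in requests:
            self.held.setdefault(p, []).append((txn, m))
        return True
```

Because every lock is reached through its ancestors, a node reached some other way, for example through an IDREF or an index entry, bypasses the intention locks, which is the gap the slide points at.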

19 Logging
- They claim ARIES needs some modifications – why?
- Their changes:
  - Need to make subtree updates more efficient – don’t want to write a log entry for each subtree insertion
    - Use (a copy of) the page itself as a means of tracking what was inserted, then batch-apply to WAL
  - “Annihilators”: if we undo a tree creation, then we probably don’t need to worry about undoing later changes to that tree
  - A few minor tweaks to minimize undo/redo when only one transaction touches a page

20 Annihilators
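The annihilator idea, as described on the logging slide, can be sketched as a filter over the undo work of an aborted transaction. This is only a reading of that one-line description, not Natix's actual recovery code; the log format of (operation, tree id) pairs and the function `plan_undo` are assumptions made for the example.

```python
def plan_undo(log):
    """Plan the undo of one aborted transaction.

    log: list of (op, tree_id) pairs in execution order (assumed format).
    Returns the undo actions in the order they would be applied.
    """
    # Trees created by this transaction disappear entirely when the creation is undone.
    created = {tree for op, tree in log if op == "create"}
    plan = []
    for op, tree in reversed(log):
        if op == "create":
            plan.append(("drop", tree))       # the annihilating undo
        elif tree not in created:
            plan.append(("undo_" + op, tree))
        # Changes to a tree created in this transaction are skipped:
        # dropping the tree makes undoing them individually unnecessary.
    return plan
```

In this toy log format the saving is a couple of skipped records; with large subtree insertions the skipped work could be substantial.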

21 Assessment
- Native XML storage isn’t really all that different from other means of storage
  - There are probably some good reasons to make a few tweaks in locking
- Optimization stays harder
- A real solution to materialized view creation would probably make RDBMSs come close to delivering the same performance, modulo locking

22 Questions
- Where are the main challenges of XML processing at this point?
- Impact of BinaryXML?
- Are we working on the right problems? What’s XML going to be used for, anyway?

