1 XML Storage and Query Processing Yanlei Diao University of Massachusetts Amherst Some slide content courtesy of Donald Kossmann.

Presentation transcript:

1 XML Storage and Query Processing Yanlei Diao University of Massachusetts Amherst Some slide content courtesy of Donald Kossmann

2 XML Storage Alternatives  Plain Text  Trees with Navigation  Tuples (i.e., mapping to RDBMS)

3 Plain Text  Use XML standards to encode data  Advantages: simple, universal indexing possible  Disadvantages: need to re-parse (re-validate) all the time no compliance with XQuery data model (collections) not an option for XQuery processing

4 Trees  XML data model uses tree semantics: use trees/forests to represent XML instances; annotate nodes of the tree with data model info  Examples: Document Object Model (DOM), Object Exchange Model (OEM) [Figure: example tree with nodes f1 to f8]

5 DataGuides [Goldman & Widom 97]  Schema-based environments [Diagram: nodes Schema, Data, Queries; edges labeled "generates", "formulates", "execute against"]

6 DataGuides [Goldman & Widom 97]  Schema-free environments: the schema is not known in advance; semantic heterogeneity (i.e., a mix of schemas) [Diagram: nodes Data, DataGuides, Queries, App-specific Templates; edges labeled "summarized into", "formulate", "generate", "execute against"]

7 Schema vs. DataGuides  A DataGuide only includes info that exists in a DB.  A schema can be a superset of any DB that conforms to it.  So, a schema defines a superset of a DataGuide.  Issues addressed in the paper: Summarize data into DataGuides; Use them for query formulation and optimization.

8 Object Exchange Model (OEM)  Object Exchange Model (OEM) Each object has an id (oid) and a value (atomic or a set of subobjects). Each edge links an object to one of its subobjects with a label; a subobject may have multiple parents.  Label path : a seq. of labels  Data path : an alternating seq. of labels and oids  Target set : a set of all objects reached by traversing a label path
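The target-set definition above can be sketched in a few lines of Python. This is a minimal illustration only: the dictionary encoding of an OEM source, the `SOURCE` data, and the function name `target_set` are assumptions made for the example, not part of the Lore system.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical OEM encoding: oid -> list of (label, child oid) edges.
# Atomic objects simply have no outgoing edges in this sketch.
Edges = Dict[int, List[Tuple[str, int]]]


def target_set(edges: Edges, root: int, label_path: List[str]) -> Set[int]:
    """Return all oids reached from root by traversing label_path."""
    current = {root}
    for label in label_path:
        current = {
            child
            for oid in current
            for (lbl, child) in edges.get(oid, [])
            if lbl == label
        }
    return current


# Toy source: object 4 is a subobject of both 2 and 3 (multiple parents).
SOURCE: Edges = {
    1: [("A", 2), ("B", 3)],
    2: [("C", 4), ("C", 5)],
    3: [("C", 4)],
}

print(sorted(target_set(SOURCE, 1, ["A", "C"])))
print(sorted(target_set(SOURCE, 1, ["B", "C"])))
```

Because an object may have several parents, two different label paths (here A.C and B.C) can have overlapping but different target sets.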

9 Definition of a DataGuide  Conciseness : a DataGuide describes every unique label path of a source exactly once  Accuracy : a DataGuide does not encode any label path that does not appear in the source  Convenience : represented as an OEM model, like the data  A DataGuide reflects the structure of a DB; it contains no atomic values.

10 From Data to DataGuides  Creating a DataGuide is equivalent to converting an NFA to a DFA! Consider a label path (query) as a string to be accepted by the data source and by the DataGuide. Intuition: in the data source a label path can have multiple matches, so execution is non-deterministic; in the DataGuide it has only one path, so execution is deterministic.  Cost of creation: linear if the source DB is a tree; worst case exponential in the number of objects and edges in the source; empirical results: average performance for certain datasets is quite encouraging

11 Multiple DataGuides  An OEM source may have multiple DataGuides: a single NFA may have many equivalent DFAs.  Minimal DataGuide: can be created using DFA minimization  Minimality may not always be desirable: hard to maintain as the data source changes (a well-known problem with DFAs); does not allow annotations.

12 Annotations  Annotation: a property of the target set of a label path l in the data source s. Statistical information: e.g. # occurrences of l in s; pointers to objects reachable via l; …  Issue with minimality: when a minimal DataGuide uses one node for both A.C and B.C, should that node hold the annotation for A.C or the annotation for B.C?

13 Strong DataGuides  Each set of label paths that share a node in the DataGuide is the set of label paths that share the same target set in the source. Label paths can be merged in the DataGuide if they lead to the same target set.  There is a one-to-one correspondence between source target sets and DataGuide objects.  Creation from the data source: a DFS algorithm that examines source target sets reachable by all possible label paths…  Maintenance uses a similar set of data structures…
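A rough sketch of strong DataGuide creation, written as the subset construction used for NFA-to-DFA conversion. Each DataGuide node is identified by the source target set it stands for, so that set can double as the annotation. The edge encoding, the `SOURCE` data, and the name `build_strong_dataguide` are hypothetical; the algorithm in Goldman & Widom's paper has more machinery (for example, for maintenance).

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical OEM encoding: oid -> list of (label, child oid) edges.
Edges = Dict[int, List[Tuple[str, int]]]
Guide = Dict[FrozenSet[int], Dict[str, FrozenSet[int]]]


def build_strong_dataguide(edges: Edges, root: int) -> Guide:
    """Depth-first subset construction: one guide node per distinct target set."""
    start = frozenset([root])
    guide: Guide = {}
    stack = [start]
    while stack:
        targets = stack.pop()
        if targets in guide:  # target set already has a node (also handles cycles)
            continue
        by_label: Dict[str, set] = {}
        for oid in targets:
            for label, child in edges.get(oid, []):
                by_label.setdefault(label, set()).add(child)
        guide[targets] = {lbl: frozenset(kids) for lbl, kids in by_label.items()}
        stack.extend(guide[targets].values())
    return guide


SOURCE: Edges = {
    1: [("A", 2), ("B", 3)],
    2: [("C", 4), ("C", 5)],
    3: [("C", 4)],
}

guide = build_strong_dataguide(SOURCE, 1)
for node, out in guide.items():
    print(sorted(node), {lbl: sorted(t) for lbl, t in out.items()})
```

In this toy source A.C and B.C reach different target sets ({4, 5} and {4}), so they stay separate nodes, which is what makes per-path annotations possible.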

14 Query Formulation & Optimization  Query formulation Query by example: click buttons to select a path and add value filters Blurs the distinction between formulating a query and browsing a query result  Query optimization Uses the DataGuide for structural matching (e.g. A.B.C) and retrieves the target set Uses value indexes (e.g. B+trees) for value filters for a specific label (e.g. C.price>100) Intersects the two resulting sets of objects
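The optimization step above (structural match via the DataGuide, value filter via a value index, then intersection) might look as follows. Everything here is illustrative: the oids, the prices, and the sorted list standing in for a B+tree are assumptions made for the sketch.

```python
import bisect
from typing import List, Set, Tuple


def oids_with_value_greater(value_index: List[Tuple[int, int]], threshold: int) -> Set[int]:
    """Range scan over a list of (value, oid) pairs sorted by value.

    The sorted list is a stand-in for a B+tree value index on one label.
    """
    keys = [value for value, _ in value_index]
    start = bisect.bisect_right(keys, threshold)
    return {oid for _, oid in value_index[start:]}


# Hypothetical target set for the label path A.B.C, as read off the DataGuide.
structural_matches = {10, 11, 12}

# Hypothetical value index on C.price: (price, oid of the C object).
price_index = sorted([(50, 10), (120, 11), (300, 12), (500, 99)])

result = structural_matches & oids_with_value_greater(price_index, 100)
print(sorted(result))
```

Object 99 passes the value filter but is not reachable via A.B.C, and object 10 is reachable but too cheap, so only the intersection survives.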

15 XML Data Stored as Tuples  Motivation: Use an RDBMS infrastructure to store and process the XML data query optimization scalability richness and maturity of RDBMS  Alternative relational storage approaches: Map XML schema to relational schema Generic shredding of the data (edge, binary, …) New XML storage integrated tightly with the relational processor

16 Relational Support for XML [Zhang et al. 2001]  Goal: relational support for path queries, including storage and query processing  Assumption: we have the DTD/schema  Problem addressed: to support XML path queries Can we use a relational DBMS? Shall we design a native XML store, i.e. using novel storage and indexing techniques?

17 Representation of XML  Each XML document is parsed into a sequence of items: start tags, text words, end tags  All items are numbered in document order [Example document: markup not preserved; its text words are XML, XML Processing, XML Processing Cost, Scalability]

18 Element Index  An Element Index (E-index) records occurrences of each element name inside the entire collection of documents.  Each index entry in an E-index corresponds to one occurrence of the element name. It has: document identifier, start position of the element in the doc, i.e. position of its start tag. end position of the element in the doc, i.e. position of its end tag document level of the element in the doc, i.e. level from the root.  An E-index is sorted in increasing order of.

19 Example of E-Index (entries are (doc_id, start_pos:end_pos, doc_level), one posting list per element name; element names not preserved) (1, 1:25, 1) (2, … (1, 2:24, 2) (1, 6:18, 3) (2, … (1, 3:5, 3) (1, 7:10, 4) (1, 12:16, 5) (1, 20:22, 4) (2, … (1, 11:17, 4) (1, 19:23, 3) (2, …

20 Text Index  A Text Index (T-index) records the occurrences of each text word inside the entire collection of documents, similar to E-Index.  Difference is that each index entry in a T-index contains a single word position, instead of the pair of start and end positions.  Similarly, a T-index is sorted in increasing of.

21 Example of T-Index (entries are (doc_id, word_pos:word_pos, doc_level)) “XML”: (1, 4:4, 4) (1, 8:8, 5) (1, 13:13, 6) (2, … “Processing”: (1, 9:9, 5) (1, 14:14, 6) (2, … “Cost”: (1, 15:15, 6) (2, … “Scalability”: (1, 21:21, 5) (2, …
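A minimal sketch of how the E-index and T-index entries could be produced, assuming positions start at 1, the root element is at level 1, and a text word sits one level below its enclosing element (consistent with the example entries above). The sample document and the function name `build_indexes` are made up for illustration; attributes and other XML details are ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_indexes(doc_id, xml_text):
    """Number start tags, text words and end tags; return (E-index, T-index)."""
    e_index = defaultdict(list)  # element name -> [(doc_id, start, end, level)]
    t_index = defaultdict(list)  # word -> [(doc_id, word_pos, level)]
    counter = 0

    def next_pos():
        nonlocal counter
        counter += 1
        return counter

    def add_words(text, level):
        for word in (text or "").split():
            t_index[word].append((doc_id, next_pos(), level))

    def visit(elem, level):
        start = next_pos()                    # position of the start tag
        add_words(elem.text, level + 1)       # words before the first child
        for child in elem:
            visit(child, level + 1)
            add_words(child.tail, level + 1)  # words after this child
        end = next_pos()                      # position of the end tag
        e_index[elem.tag].append((doc_id, start, end, level))

    visit(ET.fromstring(xml_text), 1)
    for entries in e_index.values():
        entries.sort()
    return dict(e_index), dict(t_index)


# Hypothetical document, not the one from the slides.
DOC = "<book><title>XML</title><section><title>XML Processing</title></section></book>"
e_index, t_index = build_indexes(1, DOC)
print(e_index)
print(t_index)
```

For this document the root `book` spans positions 1 to 11, and the inner `title` (6 to 9) lies strictly inside `section` (5 to 10), which is the containment property the join on the following slides relies on.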

22 Relational Storage (a) Element-Index relation: columns term, doc_id, start_pos, end_pos, doc_level (b) Text-Index relation: columns term, doc_id, word_pos, doc_level; example terms: “XML”, “Processing”, “Cost”, “Scalability”

23 Relational Storage (contd.)  One relation for elements, one for text words  Clustered B+trees over each table On (term, docno) On all columns: lead to index-only plans

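24 “//section//title”  Plan: a containment join (//) over two Element-Index scans, one per element name in the query (section as l, title as r)  Join predicate: l.doc_id = r.doc_id and l.start_pos < r.start_pos and l.end_pos > r.end_pos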
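The relational plan for "//section//title" can be tried out with SQLite from Python. This is a sketch under assumptions: the table name `element_index`, the sample rows, and the use of an ordinary covering index in place of a clustered B+tree are choices made for the example, not the setup used by Zhang et al.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE element_index ("
    " term TEXT, doc_id INTEGER, start_pos INTEGER,"
    " end_pos INTEGER, doc_level INTEGER)"
)
# Index on all columns, so the join can be answered from the index alone.
conn.execute(
    "CREATE INDEX e_idx ON element_index"
    " (term, doc_id, start_pos, end_pos, doc_level)"
)

# Hypothetical E-index rows for one small document.
rows = [
    ("book", 1, 1, 11, 1),
    ("title", 1, 2, 4, 2),
    ("section", 1, 5, 10, 2),
    ("title", 1, 6, 9, 3),
]
conn.executemany("INSERT INTO element_index VALUES (?, ?, ?, ?, ?)", rows)

# Containment join: a title is a descendant of a section if its
# (start_pos, end_pos) interval lies strictly inside the section's interval.
QUERY = """
SELECT DISTINCT r.doc_id, r.start_pos, r.end_pos
FROM element_index AS l
JOIN element_index AS r
  ON l.doc_id = r.doc_id
 AND l.start_pos < r.start_pos
 AND l.end_pos > r.end_pos
WHERE l.term = 'section' AND r.term = 'title'
ORDER BY r.doc_id, r.start_pos
"""
titles = conn.execute(QUERY).fetchall()
print(titles)
```

The title at positions 2 to 4 is a child of `book`, not of any `section`, so only the title at 6 to 9 qualifies. `DISTINCT` guards against duplicates when sections are nested inside each other.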

25 Questions

26 Outline  Storage and Query Processing DataGuides [Goldman and Widom 97] Relational Approach [Zhang et al. 2001]  Other Research Topics Query Rewriting Benchmarking …

27 Node Identifiers  XQuery Data Model Requirements identify a node uniquely (implementing identity) lives as long as node lives robust to updates  Identifiers might include additional information Schema/type information Document order Parent/child relationship Ancestor/descendent relationship Document information  Required for indexes

28 Simple Node Identifiers  Examples: Alternative 1 (data: trees) id of document (integer) pre-order number of node in document (integer) Alternative 2 (data: plain text) file name offset in file  Encode document ordering (Alternative 1) identity: doc1 = doc2 AND pre1 = pre2 order: doc1 < doc2 OR (doc1 = doc2 AND pre1 < pre2)  Assessment: bad: Not robust to updates bad: Not able to answer more complex queries
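The identity and order tests for Alternative 1 translate directly into code. A small sketch; the `NodeId` type and function names are illustrative only.

```python
from typing import NamedTuple


class NodeId(NamedTuple):
    doc: int  # id of the document
    pre: int  # pre-order number of the node within the document


def same_node(a: NodeId, b: NodeId) -> bool:
    """identity: doc1 = doc2 AND pre1 = pre2"""
    return a.doc == b.doc and a.pre == b.pre


def precedes(a: NodeId, b: NodeId) -> bool:
    """order: doc1 < doc2 OR (doc1 = doc2 AND pre1 < pre2)"""
    return a.doc < b.doc or (a.doc == b.doc and a.pre < b.pre)


x, y, z = NodeId(1, 7), NodeId(1, 9), NodeId(2, 1)
print(same_node(x, NodeId(1, 7)), precedes(x, y), precedes(y, z), precedes(z, x))
```

The weakness noted in the assessment is visible here: inserting a node before position 7 shifts the pre-order number of every later node in the document, and the pair alone cannot say whether one node is an ancestor of another.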

29 Dewey Order  Idea: generate a surrogate for each path; e.g., 1.2.3 identifies the third child of the second child of the first child of the given root  Assessment: good: order comparison and ancestor/descendant tests are easy; bad: updates expensive, space overhead  Improvement: ORDPATH bit encoding, O'Neil et al. (Microsoft SQL Server)
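A short sketch of why Dewey identifiers make order and ancestor/descendant checks easy: both reduce to operations on the component tuple. The dotted-string format and the function names are assumptions for illustration; this is plain Dewey order, not the ORDPATH bit encoding.

```python
from typing import Tuple


def parse(dewey: str) -> Tuple[int, ...]:
    """Split a dotted identifier such as '1.2.3' into integer components."""
    return tuple(int(part) for part in dewey.split("."))


def is_ancestor(a: str, d: str) -> bool:
    """a is an ancestor of d iff a's components are a proper prefix of d's."""
    pa, pd = parse(a), parse(d)
    return len(pa) < len(pd) and pd[: len(pa)] == pa


def is_parent(a: str, d: str) -> bool:
    """a is the parent of d iff it is an ancestor exactly one level up."""
    return is_ancestor(a, d) and len(parse(d)) == len(parse(a)) + 1


ids = ["1.10", "1.2", "1.2.3", "1"]
# Tuple comparison yields document (pre-order) order; comparing the raw
# strings would wrongly put '1.10' before '1.2'.
in_document_order = sorted(ids, key=parse)
print(in_document_order)
print(is_ancestor("1.2", "1.2.3"), is_parent("1", "1.2.3"))
```

The update cost is also visible: inserting a new first child under node 1 would force relabeling of 1.2, 1.10 and everything below them.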

30 Example: Dewey Order [Figure: example tree with nodes person, name, child, hobby]

31 XML Storage Alternatives  Plain Text  Trees with Random Access  Tuples (i.e., mapping to RDBMS)

32 Plain Text  Use XML standards to encode data  Advantages: simple, universal indexing possible  Disadvantages: need to re-parse (re-validate) all the time no compliance with XQuery data model (collections) not an option for XQuery processing

33 Trees  XML data model uses tree semantics: use trees/forests to represent XML instances; annotate nodes of the tree with data model info  Example [Figure: example tree with nodes f1 to f8]

34 Trees  Advantages natural representation of XML data good support for navigation, updates index built into the data structure compliance with DOM standard interface  Disadvantages difficult to partition high overhead: mixes indexes and data index everything  Example: Document Object Model (DOM)

35 Edge Approach (Florescu & Kossmann 99)  Edge Table (Source, Label, Target): (0, person, 4711), (0, person, …), (…, name, v1), (4711, child, i…), (…, name, v2), (666, child, i314)  Value Table (String) (Id, Value): (v1, Lilly Potter), (v2, James Potter), (v3, Harry Potter)  Value Table (Integer) (Id, Value): (v4, 12)
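A simplified sketch of generic shredding into an edge table plus typed value tables, in the spirit of the Edge approach. It is an illustration under assumptions: node ids are generated counters, leaf text becomes a value, mixed content and attributes are ignored, and the ordinal and flag columns of the original proposal are left out. The function name `shred` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import count


def shred(xml_text):
    """Return (edges, string_values, integer_values) for one document."""
    node_ids = count(1)
    value_ids = count(1)
    edges = []     # (source, label, target)
    strings = {}   # value id -> str
    integers = {}  # value id -> int

    def store_value(text):
        vid = f"v{next(value_ids)}"
        try:
            integers[vid] = int(text)
        except ValueError:
            strings[vid] = text
        return vid

    def visit(elem, source):
        text = (elem.text or "").strip()
        if len(elem) == 0 and text:
            # Leaf element: the edge points at a row in a value table.
            edges.append((source, elem.tag, store_value(text)))
            return
        node = next(node_ids)
        edges.append((source, elem.tag, node))
        for child in elem:
            visit(child, node)

    visit(ET.fromstring(xml_text), 0)  # 0 plays the role of the document root
    return edges, strings, integers


DOC = "<person><name>Harry Potter</name><age>12</age></person>"
edges, strings, integers = shred(DOC)
print(edges)
print(strings, integers)
```

Because every element becomes a row in one generic edge table, no schema is needed, but evaluating a path of length n takes n self-joins on that table.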

36 XML Example Lilly Potter Harry Potter 12 James Potter

37 [Figure: the example as a graph: person nodes with name edges (Lilly Potter, James Potter, Harry Potter), a child edge to node i314, and an age edge with value 12]

38 Kinds of Indexes 1. Value indexes: index atomic values, e.g., //emp/salary/fn:data(.); use B+ trees (as in the relational world); integration into the query optimizer is more tricky 2. Structure indexes: materialize results of path expressions (counterpart of relational join indexes and OO path indexes) 3. Full-text indexes: keyword search, inverted files (IR world, text extenders)  Any combination of the above

39 Outline  XML Storage  XML Indexing  Query Processing  Other Research Topics Query Rewriting Benchmarking …

40 What is a Correct Rewriting?  E1 -> E2 is a legal rewriting iff: Type(E2) is a subtype of Type(E1); FreeVar(E2) is a subset of FreeVar(E1); and for any binding of the free variables, either E1 or E2 returns ERROR (possibly different errors), or E1 and E2 return the same result  This definition allows the rewrite E1 -> ERROR. Trust your vendor not to do that for every E1!

41 Handling Backwards Navigation  Replace backwards navigation with forward navigation: for $x in $input/a/b return {$x/.., $x/d} becomes for $y in $input/a, $x in $y/b return {$y, $x/d}; for $x in $input/a/b return {$x//e/..} becomes ??  Enables streaming

42 FLWR Unnesting  Traditional database technique: for $x in $input/a/b where $x/c eq 3 return (for $y in $x/d where $x/e eq 4 return $y) becomes for $x in $input/a/b, $y in $x/d where ($x/e eq 4) and ($x/c eq 3) return $y  Problem simpler than in OQL/ODMG: no nested collections in XML

43 XML Query Processing  Techniques vary a lot, depending on: storage model, indexes available, algebra used, …  A large body of ongoing work. Research community: McHugh and Widom 1999, Zhang et al. 2001, Bruno et al. 2002, Guha et al. 2002, Chen et al. 2003, Paparizos et al. 2004, Jagadish 2004, … (just look at SIGMOD and VLDB proceedings in recent years!) Industry: IBM DB2, Oracle, SQL Server, …

44 XML Processing Benchmark  We cannot really compare approaches until we decide on a comparison basis  XML processing is very broad  Industry is not mature enough  Usage patterns are not clear enough  Existing XML benchmarks (XMark, etc.) are limited  Strong need for a TP benchmark