1 XML Storage and Query Processing Yanlei Diao University of Massachusetts Amherst Some slide content courtesy of Donald Kossmann.


1 XML Storage and Query Processing
Yanlei Diao, University of Massachusetts Amherst
Some slide content courtesy of Donald Kossmann

2 XML Storage Alternatives
- Plain Text
- Trees with Navigation
- Tuples (i.e., mapping to RDBMS)

3 Plain Text
- Use XML standards to encode data
- Advantages:
  - simple, universal
  - indexing possible
- Disadvantages:
  - need to re-parse (re-validate) all the time
  - no compliance with XQuery data model (collections)
  - not an option for XQuery processing

4 Trees
- XML data model uses tree semantics
  - use Trees/Forests to represent XML instances
  - annotate nodes of tree with data model info
- Examples:
  - Document Object Model (DOM) http://www.w3.org/DOM/
  - Object Exchange Model (OEM)
[Figure: tree with nodes f1 to f8]

5 DataGuides [Goldman & Widom 97]
- Schema-based environments
[Figure: Schema, Data, and Queries, connected by arrows labeled "generates", "formulates", and "execute against"]

6 DataGuides [Goldman & Widom 97]
- Schema-free environments:
  - don't know the schema in advance
  - semantic heterogeneity (i.e., a mix of schemas)
[Figure: Data, DataGuides, Queries, and App-specific Templates, connected by arrows labeled "summarized into", "formulate", "generate", and "execute against"]

7 Schema vs. DataGuides
- A DataGuide only includes info that exists in a DB.
- A schema can be a superset of any DB that conforms to it.
- So, a schema defines a superset of a DataGuide.
- Issues addressed in the paper:
  - Summarize data into DataGuides;
  - Use them for query formulation and optimization.

8 Object Exchange Model (OEM)
- Object Exchange Model (OEM)
  - Each object has an id (oid) and a value (atomic or a set of subobjects).
  - Each edge links an object to one of its subobjects with a label; a subobject may have multiple parents.
- Label path: a sequence of labels
- Data path: an alternating sequence of labels and oids
- Target set: the set of all objects reached by traversing a label path
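
To make the notions of label path and target set concrete, here is a minimal sketch. The dictionary encoding of an OEM source, the sample oids and labels, and the function name `target_set` are all assumptions made for illustration; they are not taken from the slides or the paper.

```python
# Hypothetical OEM source: oid -> list of (label, subobject oid).
# Object 4 has two parents (2 and 3), which OEM permits.
OEM = {
    1: [("A", 2), ("B", 3)],
    2: [("C", 4)],
    3: [("C", 4), ("C", 5)],
    4: [],
    5: [],
}


def target_set(source, root, label_path):
    """Return the set of oids reached from root by following label_path."""
    current = {root}
    for label in label_path:
        # Follow every edge with the wanted label from every current object.
        current = {
            child
            for oid in current
            for edge_label, child in source[oid]
            if edge_label == label
        }
    return current
```

With this encoding, `target_set(OEM, 1, ["B", "C"])` follows label B and then label C from the root and collects every object reached along any matching data path.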

9 Definition of a DataGuide
- Conciseness: a DataGuide describes every unique label path of a source exactly once
- Accuracy: a DataGuide does not encode any label path that does not appear in the source
- Convenience: represented as an OEM model, like the data
- A DataGuide reflects the structure of a DB; it contains no atomic values.

10 From Data to DataGuides
- Creating a DataGuide is equivalent to converting an NFA to a DFA!
  - Consider a label path (query) as a string to be accepted by the data source and the DataGuide.
  - Intuition: The data source may have multiple matches for a label path, so execution is non-deterministic. But the DataGuide has only one path, so execution is deterministic.
- Cost of creation
  - Source DB is a tree: linear
  - Worst case: exponential in the number of objects and edges in the source
  - Empirical results: average performance for certain datasets is quite encouraging

11 Multiple DataGuides
- An OEM source may have multiple DataGuides
  - A single NFA may have many equivalent DFAs.
- Minimal DataGuide
  - Can be created using DFA minimization
- Minimality may not always be desirable
  - Hard to maintain as the data source changes, a well-known problem with DFAs.
  - Does not allow annotations.

12 Annotations
- Annotation: a property of the target set of a label path l in the data source s
  - Statistical information: e.g., number of occurrences of l in s
  - Pointers to objects reachable via l
  - …
- Issue with minimality
  - Annotation for A.C or annotation for B.C?

13 Strong DataGuides
- Each set of label paths that share a node in the DataGuide is the set of label paths that share the same target set in the source.
  - Label paths can be merged in the DataGuide if they lead to the same target set.
- There is a one-to-one correspondence between source target sets and DataGuide objects.
- Creation from the data source
  - A DFS algorithm that examines source target sets reachable by all possible label paths…
- Maintenance uses a similar set of data structures…
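
The creation step can be sketched as a subset construction, in the spirit of the NFA-to-DFA analogy from slide 10. This is a minimal illustration rather than the algorithm from the paper: the dictionary encoding, the sample source, and the name `build_strong_dataguide` are assumptions, and each DataGuide object is represented directly by the source target set it stands for.

```python
# Hypothetical OEM source: oid -> list of (label, subobject oid).
SOURCE = {
    1: [("A", 2), ("B", 3)],
    2: [("C", 4)],
    3: [("C", 4)],
    4: [],
}


def build_strong_dataguide(source, root):
    """Depth-first subset construction over target sets.

    Returns (start, guide) where guide[target_set][label] is the target
    set reached by extending the label path with that label.
    """
    start = frozenset([root])
    guide = {}
    stack = [start]
    while stack:
        targets = stack.pop()
        if targets in guide:
            continue  # this target set already has a DataGuide object
        by_label = {}
        for oid in targets:
            for label, child in source[oid]:
                by_label.setdefault(label, set()).add(child)
        guide[targets] = {
            label: frozenset(children) for label, children in by_label.items()
        }
        stack.extend(guide[targets].values())
    return start, guide


START, GUIDE = build_strong_dataguide(SOURCE, 1)
```

In the sample source, the label paths A.C and B.C lead to the same target set, so they end at the same DataGuide object, which is the one-to-one correspondence the slide describes. The check `if targets in guide` also keeps the traversal finite when the source contains cycles.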

14 Query Formulation & Optimization
- Query formulation
  - Query by example: click buttons to select a path and add value filters
  - Blurs the distinction between formulating a query and browsing a query result
- Query optimization
  - Uses the DataGuide for structural matching (e.g. A.B.C) and retrieves the target set
  - Uses value indexes (e.g. B+trees) for value filters for a specific label (e.g. C.price>100)
  - Intersects the two resulting sets of objects

15 XML Data Stored as Tuples
- Motivation: Use an RDBMS infrastructure to store and process the XML data
  - query optimization
  - scalability
  - richness and maturity of RDBMS
- Alternative relational storage approaches:
  - Map XML schema to relational schema
  - Generic shredding of the data (edge, binary, …)
  - New XML storage integrated tightly with the relational processor

16 Relational Support for XML [Zhang et al. 2001]
- Goal: relational support for path queries, including storage and query processing
- Assumption: we have the DTD/schema
- Problem addressed: to support XML path queries
  - Can we use a relational DBMS?
  - Shall we design a native XML store, i.e. using novel storage and indexing techniques?

17 Representation of XML
- Each XML document is parsed to a sequence of items:
  - Start tag
  - Text word
  - End tag
- All items are numbered, starting from 1.
[Example: a document with items numbered 1 to 25. The tags are missing from this transcript; the text words and their positions are:
  4 XML; 8 XML, 9 Processing; 13 XML, 14 Processing, 15 Cost; 21 Scalability]

18 Element Index
- An Element Index (E-index) records occurrences of each element name inside the entire collection of documents.
- Each index entry in an E-index corresponds to one occurrence of the element name. It has:
  - document identifier,
  - start position of the element in the doc, i.e., the position of its start tag,
  - end position of the element in the doc, i.e., the position of its end tag,
  - document level of the element in the doc, i.e., its level from the root.
- An E-index is sorted in increasing order of [sort key missing from this transcript].

19 Example of E-Index
Entries have the form (doc_id, start_pos:end_pos, doc_level), with one posting list per element name. The element names are missing from this transcript; the example document is the one from slide 17.
  list 1: (1, 1:25, 1) (2, …
  list 2: (1, 2:24, 2) (1, 6:18, 3) (2, …
  list 3: (1, 3:5, 3) (1, 7:10, 4) (1, 12:16, 5) (1, 20:22, 4) (2, …
  list 4: (1, 11:17, 4) (1, 19:23, 3) (2, …

20 Text Index
- A Text Index (T-index) records the occurrences of each text word inside the entire collection of documents, similar to the E-index.
- The difference is that each index entry in a T-index contains a single word position, instead of the pair of start and end positions.
- Similarly, a T-index is sorted in increasing order of [sort key missing from this transcript].

21 Example of T-Index
(for the example document of slide 17)
  "XML":         (1, 4:4, 4) (1, 8:8, 5) (1, 13:13, 6) (2, …
  "Processing":  (1, 9:9, 5) (1, 14:14, 6) (2, …
  "Cost":        (1, 15:15, 6) (2, …
  "Scalability": (1, 21:21, 5) (2, …
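
The numbering scheme and the two indexes can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation from Zhang et al.: the tokenizer handles only attribute-free tags and whitespace-separated words, the sample document and its element names are made up, and the name `build_indexes` is hypothetical. Entries are kept as plain tuples, `(doc_id, start_pos, end_pos, doc_level)` for elements and `(doc_id, word_pos, doc_level)` for words.

```python
import re
from collections import defaultdict

# One match per item: a start tag, an end tag, or a text word.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*>|([^<\s]+)")

# Hypothetical sample document (not the one from the slides).
DOC = "<book><title>XML</title><section><title>XML Processing</title></section></book>"


def build_indexes(doc_id, xml_text):
    """Number all items from 1 and build the E-index and T-index."""
    e_index = defaultdict(list)
    t_index = defaultdict(list)
    open_elems = []  # stack of (element name, start position)
    position = 0
    for closing, name, word in TOKEN.findall(xml_text):
        position += 1
        if word:
            # A word sits one level below its enclosing element.
            t_index[word].append((doc_id, position, len(open_elems) + 1))
        elif closing:
            open_name, start = open_elems.pop()
            level = len(open_elems) + 1
            e_index[open_name].append((doc_id, start, position, level))
        else:
            open_elems.append((name, position))
    for entries in e_index.values():
        entries.sort()  # nested same-name elements close in inner-first order
    return dict(e_index), dict(t_index)


E_INDEX, T_INDEX = build_indexes(1, DOC)
```

In the sample, the items are numbered 1 to 11, so the outer `book` element spans positions 1 to 11 at level 1, and the word "Processing" sits at position 8, level 4. The convention that a word is one level deeper than its enclosing element matches the slide examples, where the word at position 4 has level 4 inside an element of level 3.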

22 Relational Storage
(a) Element-Index (the values of the term column are missing from this transcript)

  term      doc_id  start_pos  end_pos  doc_level
  (term 1)  1       1          25       1
            2       …          …        …
  (term 2)  1       2          24       2
            1       6          18       3
            2       …          …        …
  (term 3)  1       3          5        3
            1       7          10       4
            1       12         16       5
            1       20         22       4
            2       …          …        …
  (term 4)  1       11         17       4
            1       19         23       3
            2       …          …        …

(b) Text-Index

  term           doc_id  word_pos  doc_level
  "XML"          1       4         4
                 1       8         5
                 1       13        6
                 2       …         …
  "Processing"   1       9         5
                 1       14        6
                 2       …         …
  "Cost"         1       15        6
                 2       …         …
  "Scalability"  1       21        5
                 2       …         …

23 Relational Storage (contd.)
- One relation for elements, one for text words
- Clustered B+trees over each table
  - On (term, docno)
  - On all columns: leads to index-only plans

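24 "//section//title"
[Query plan: an Index Scan on section (l) and an Index Scan on title (r) feed a containment join (//) with the predicate
  l.doc_id = r.doc_id and l.start_pos < r.start_pos and l.end_pos > r.end_pos]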
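
The containment test behind the `//` join can be sketched as follows. This is a naive nested-loop version for illustration only, not the join algorithm evaluated in the paper; the function name `descendant_join` and the sample entries are assumptions, with each entry a tuple `(doc_id, start_pos, end_pos, doc_level)`.

```python
def descendant_join(ancestors, descendants):
    """Return pairs (a, d) where element d lies strictly inside element a."""
    return [
        (a, d)
        for a in ancestors
        for d in descendants
        # Same document, and a's interval strictly contains d's interval.
        if a[0] == d[0] and a[1] < d[1] and a[2] > d[2]
    ]


# Hypothetical E-index entries for the element names in //section//title.
SECTIONS = [(1, 5, 10, 2)]
TITLES = [(1, 2, 4, 2), (1, 6, 9, 3), (2, 6, 9, 3)]

MATCHES = descendant_join(SECTIONS, TITLES)
```

Only the title at positions 6 to 9 of document 1 qualifies: the title at 2 to 4 lies outside the section, and the third title belongs to a different document. Because the entries are kept sorted, a real system could replace the nested loop with a merge-style scan over both lists.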

25 Questions

26 Outline
- Storage and Query Processing
  - DataGuides [Goldman and Widom 97]
  - Relational Approach [Zhang et al. 2001]
- Other Research Topics
  - Query Rewriting
  - Benchmarking
  - …

27 Node Identifiers
- XQuery Data Model Requirements
  - identify a node uniquely (implementing identity)
  - lives as long as node lives
  - robust to updates
- Identifiers might include additional information
  - Schema/type information
  - Document order
  - Parent/child relationship
  - Ancestor/descendent relationship
  - Document information
- Required for indexes

28 Simple Node Identifiers
- Examples:
  - Alternative 1 (data: trees)
    - id of document (integer)
    - pre-order number of node in document (integer)
  - Alternative 2 (data: plain text)
    - file name
    - offset in file
- Encode document ordering (Alternative 1)
  - identity: doc1 = doc2 AND pre1 = pre2
  - order: doc1 < doc2 OR (doc1 = doc2 AND pre1 < pre2)
- Assessment:
  - bad: Not robust to updates
  - bad: Not able to answer more complex queries
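
The identity and order rules of Alternative 1 translate directly into code. The following is a small sketch; the type name `NodeId` and the function names are assumptions chosen for illustration.

```python
from typing import NamedTuple


class NodeId(NamedTuple):
    doc: int  # id of the document
    pre: int  # pre-order number of the node within the document


def same_node(a: NodeId, b: NodeId) -> bool:
    """identity: doc1 = doc2 AND pre1 = pre2"""
    return a.doc == b.doc and a.pre == b.pre


def precedes(a: NodeId, b: NodeId) -> bool:
    """order: doc1 < doc2 OR (doc1 = doc2 AND pre1 < pre2)"""
    return a.doc < b.doc or (a.doc == b.doc and a.pre < b.pre)
```

The weakness noted on the slide shows up immediately: inserting a node changes the pre-order number of every node that follows it in the document, so stored identifiers stop being valid.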

29 Dewey Order
- Idea:
  - Generate surrogates for each path
  - 1.2.3 identifies the third child of the second child of the first child of the given root
- Assessment:
  - good: order comparison, ancestor/descendent easy
  - bad: updates expensive, space overhead
- Improvement: ORDPath Bit Encoding, O'Neil et al. 2004 (Microsoft SQL Server)
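
A short sketch shows why order comparison and ancestor tests are easy with Dewey identifiers. The function names are assumptions made for illustration, and identifiers are taken to be dot-separated strings of positive integers.

```python
def parse(label):
    """Turn '1.2.3' into the tuple (1, 2, 3)."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(a, d):
    """True if a is a proper ancestor of d, i.e. a is a proper prefix of d."""
    a_path, d_path = parse(a), parse(d)
    return len(a_path) < len(d_path) and d_path[: len(a_path)] == a_path


def in_document_order(labels):
    """Sort labels in document order: component-wise, ancestors first."""
    return sorted(labels, key=parse)
```

Comparing the parsed tuples rather than the raw strings matters: as strings, "1.10" would sort before "1.2". The update cost mentioned on the slide is also visible here, since inserting a new second child under node 1 forces 1.2, 1.3, and all of their descendants to be relabeled.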

30 Example: Dewey Order
[Figure: tree with node labels person, name, child, hobby and Dewey identifiers 1, 1.1, 1.2, 1.2.1, 1.2.1.1, 1.2.1.2, 1.2.1.3]

31 XML Storage Alternatives
- Plain Text
- Trees with Random Access
- Tuples (i.e., mapping to RDBMS)

32 Plain Text
- Use XML standards to encode data
- Advantages:
  - simple, universal
  - indexing possible
- Disadvantages:
  - need to re-parse (re-validate) all the time
  - no compliance with XQuery data model (collections)
  - not an option for XQuery processing

33 Trees
- XML data model uses tree semantics
  - use Trees/Forests to represent XML instances
  - annotate nodes of tree with data model info
- Example
[Figure: tree with nodes f1 to f8]

34 Trees
- Advantages
  - natural representation of XML data
  - good support for navigation, updates
  - index built into the data structure
  - compliance with DOM standard interface
- Disadvantages
  - difficult to partition
  - high overhead: mixes indexes and data
  - index everything
- Example: Document Object Model (DOM) http://www.w3.org/DOM/

35 Edge Approach (Florescu & Kossmann 99)

Edge Table
  Source  Label   Target
  0       person  4711
  0       person  666
  4711    name    v1
  4711    child   i314
  666     name    v2
  666     child   i314

Value Table (String)
  Id  Value
  v1  Lilly Potter
  v2  James Potter
  v3  Harry Potter

Value Table (Integer)
  Id  Value
  v4  12
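
Generic shredding into an edge table can be sketched with the standard library. This is a simplified illustration, not the scheme from Florescu & Kossmann: it ignores attributes and ID/IDREF sharing (so the shared child i314 above is not modeled), stores all values in one string table instead of one table per type, and generates node and value ids on the fly. The sample document, its element names, and the function name `shred` are assumptions.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical sample document using the labels from the edge table above.
DOC = (
    "<person><name>Lilly Potter</name>"
    "<child><name>Harry Potter</name><age>12</age></child></person>"
)


def shred(xml_text):
    """Return (edges, values) for a document.

    edges is a list of (source, label, target) rows; values maps a value id
    to the text of a leaf element. Source 0 stands for the document root.
    """
    edges, values = [], {}
    node_ids = count(1)
    value_ids = count(1)

    def visit(elem, parent_id):
        text = (elem.text or "").strip()
        if len(elem) == 0 and text:
            # Leaf with text: the edge points at a row of the value table.
            value_id = f"v{next(value_ids)}"
            values[value_id] = text
            edges.append((parent_id, elem.tag, value_id))
            return
        node_id = next(node_ids)
        edges.append((parent_id, elem.tag, node_id))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), 0)
    return edges, values


EDGES, VALUES = shred(DOC)
```

Every element becomes one row of the edge table regardless of the document's schema, which is what makes the approach generic. The price is that a path query such as person/child/name turns into a chain of self-joins on the edge table.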

36 XML Example
[XML markup missing from this transcript; text content in order: Lilly Potter, Harry Potter, 12, James Potter]

37 [Figure: graph for the XML example. Root 0 has two person edges, to nodes 4711 and 666, with names Lilly Potter and James Potter. Both have a child edge to node i314, which has name Harry Potter and age 12.]

38 Kinds of Indexes
1. Value Indexes
   - index atomic values; e.g., //emp/salary/fn:data(.)
   - use B+ trees (like in relational world)
   - (integration into query optimizer more tricky)
2. Structure Indexes
   - materialize results of path expressions
   - (counterpart to relational join indexes, OO path indices)
3. Full text indexes
   - Keyword search, inverted files
   - (IR world, text extenders)
- Any combination of the above

39 Outline
- XML Storage
- XML Indexing
- Query Processing
- Other Research Topics
  - Query Rewriting
  - Benchmarking
  - …

40 What is a Correct Rewriting
- E1 -> E2 is a legal rewriting iff
  - Type(E2) is a subtype of Type(E1)
  - FreeVar(E2) is a subset of FreeVar(E1)
  - For a binding of free variables, either E1 or E2 returns ERROR (possibly different errors)
  - Or E1 and E2 return the same result
- This definition allows the rewrite E1 -> ERROR
  - Trust your vendor: she does not do that for all E1!

41 Handling Backwards Navigation
- Replace backwards navigation with forward navigation

    for $x in $input/a/b
    return {$x/.., $x/d}

  becomes

    for $y in $input/a,
        $x in $y/b
    return {$y, $x/d}

- Harder case:

    for $x in $input/a/b
    return {$x//e/..}

  becomes ??
- Enables streaming

42 FLWR Unnesting
- Traditional database technique

    for $x in $input/a/b
    where $x/c eq 3
    return (for $y in $x/d
            where $x/e eq 4
            return $y)

  becomes

    for $x in $input/a/b,
        $y in $x/d
    where ($x/e eq 4) and ($x/c eq 3)
    return $y

- Problem simpler than in OQL/ODMG
  - No nested collections in XML

43 XML Query Processing
- Techniques vary a lot, depending on
  - Storage model
  - Indexes available
  - Algebra used
  - …
- A large body of ongoing work
  - Research community: McHugh and Widom 1999, Zhang et al. 2001, Bruno et al. 2002, Guha et al. 2002, Chen et al. 2003, Paparizos et al. 2004, Jagadish 2004, … (just look at SIGMOD and VLDB proceedings in recent years!)
  - Industry: IBM DB2, Oracle, SQL Server, …

44 XML Processing Benchmark
- We cannot really compare approaches until we decide on a comparison basis
- XML processing very broad
- Industry not mature enough
- Usage patterns not clear enough
- Existing XML benchmarks (XMark, etc.) limited
- Strong need for a TP benchmark

