Johannes Kepler University Linz
Department of Business Informatics, Data & Knowledge Engineering
Altenberger Str. 69, 4040 Linz, Austria/Europe

IDAR 2007
A Generic Framework for Querying and Updating Secondary XML Index Structures
Katharina Grün
2 Research Methodology
3 Become aware of problem: Motivation
- Widespread use of XML
- XML databases for efficient query and update processing
- Require index structures on content and structure of documents
  - primary index structure
    - default index on whole document
    - not optimized for specific queries
  - secondary index structures
    - created on demand on specific document fragments
    - adapted to query workload
- Framework for querying and updating secondary XML index structures (SCIENS)
4 Become aware of problem: Running example
[Figure: running example. Surviving fragments: the query //element(resource, Report)[author='Smith'] and the annotations "path:" and "labelpath:"]
5 Become aware of problem: Challenges
- Which secondary index structures are necessary?
  - each kind of query is best supported by a different index structure
  - not possible to provide one index structure for each possible query
- How to integrate them into a common framework?
  - each secondary index can index arbitrary properties of arbitrary fragments
  - query and update processing must not depend on the specific indices defined
- How to update them when documents change?
  - document updates must be propagated to affected index structures
  - incremental index maintenance algorithm
6 Become aware of problem: Related work (1)
- XML databases
  - limited support for secondary index structures
- XML index structures
  - structure and/or content
  - mostly primary index structure
  - based on different models, proprietary structures
- Object-oriented index structures
  - proprietary structures to support queries on path navigation and/or inheritance hierarchies
- Multidimensional index structures
  - support several value dimensions
  - do not consider structure
7 Become aware of problem: Related work (2)
- Extensible indexing
  - object-relational databases
  - adapt index structures to different data types
- Indexing tasks
  - Maintain secondary indices when documents are updated (KeyX [1])
  - Select optimal index for specific query (XML Access Modules [2])
  - Suggest set of indices for query workload (KeyX [1])
- Currently no integrated approach for processing secondary index structures in an XML database

[1] Hammerschmidt, B.C.: KeyX: Selective Key-Oriented Indexing in Native XML Databases. PhD Thesis, University of Lübeck.
[2] Arion, A., Benzaken, V. and Manolescu, I.: XML Access Modules: Towards Physical Data Independence in XML Databases. XIME-P workshop, 2005.
8 Suggest solution: SCIENS - Ideas
Structure and Content Indexing with Extensible, Nestable Structures
- Which secondary index structures are necessary?
  - select a small set of index structures and adapt them to various properties
  - nest index structures to reflect hierarchical queries
- How to integrate them into a common framework?
  - provide an index model
  - common index interface to query and update indices
- How to update them when documents change?
  - index maintenance algorithm that determines updates for arbitrary indices based on update fragments and index definitions
9 Construct solution: Index structures – one dimension (1)
- Value indexing
  - hashtable or B+-tree on ' '
- Structure indexing
  - hashtable or B+-tree on path/labelpath/type
  - examples: //resource, /project[1]//resource, /project[2]/milestone[2]/resource
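Structure indexing on a labelpath can be illustrated with a small hash-table sketch. It assumes that a labelpath is a path with its positional predicates removed, which the slides do not state explicitly; the names `label_path` and `LabelPathIndex` and the node ids are hypothetical.

```python
import re
from collections import defaultdict


def label_path(path: str) -> str:
    """Strip positional predicates: /project[2]/resource -> /project/resource.

    Assumption: this is what "labelpath" means on the slide.
    """
    return re.sub(r"\[\d+\]", "", path)


class LabelPathIndex:
    """Hash table from a labelpath to the ids of the nodes found on it."""

    def __init__(self):
        self._table = defaultdict(set)

    def insert(self, path: str, node_id: str) -> None:
        self._table[label_path(path)].add(node_id)

    def lookup(self, labelpath: str) -> list:
        # .get avoids creating empty buckets for unknown labelpaths
        return sorted(self._table.get(labelpath, ()))


# Hypothetical nodes, identified by made-up ids
idx = LabelPathIndex()
idx.insert("/project[1]/milestone[1]/resource[1]", "n1")
idx.insert("/project[2]/milestone[2]/resource[1]", "n2")
idx.insert("/project[2]/resource[1]", "n3")
matches = idx.lookup("/project/milestone/resource")
```

A B+-tree would replace the hash table where prefix or range lookups on the key are needed; the interface would stay the same.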
10 Construct solution Index structures – one dimension (2)
11 Construct solution: Index structures – multiple dimensions (1)
property: (value |
examples: ' ' and author='Smith'; //project[]/milestone[]/resource ' '
index structure: kdb-tree [1]

[1] Robinson, J.: The K-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes. SIGMOD, ACM Press, 1981.
12 Construct solution: Index structures – multiple dimensions (2)
property: ((value | structure) ∆ (value | structure))+
examples: //project[]/milestone[]/resource ' '; //resource ' '
index structure: index nesting
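Index nesting can be pictured as an outer index whose buckets are themselves indices. The sketch below assumes an outer hash table on a structure key (project and milestone position) and an inner sorted list standing in for a B+-tree on a value; the date values, node ids, and class names are hypothetical and not taken from the slides.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class SortedValueIndex:
    """Stand-in for a B+-tree: (key, node) pairs kept sorted for range lookups."""

    def __init__(self):
        self._entries = []

    def insert(self, key: str, node_id: str) -> None:
        insort(self._entries, (key, node_id))

    def range(self, low: str, high: str) -> list:
        lo = bisect_left(self._entries, (low, ""))
        hi = bisect_right(self._entries, (high, "\uffff"))
        return [node for _, node in self._entries[lo:hi]]


class NestedIndex:
    """Outer hash table on a structure key; each bucket is its own value index."""

    def __init__(self):
        self._outer = defaultdict(SortedValueIndex)

    def insert(self, structure_key, value_key: str, node_id: str) -> None:
        self._outer[structure_key].insert(value_key, node_id)

    def lookup(self, structure_key, low: str, high: str) -> list:
        inner = self._outer.get(structure_key)
        return inner.range(low, high) if inner is not None else []


nested = NestedIndex()
nested.insert(("project[1]", "milestone[1]"), "2007-03-01", "r1")
nested.insert(("project[1]", "milestone[1]"), "2007-06-15", "r2")
nested.insert(("project[2]", "milestone[2]"), "2007-04-01", "r3")
spring = nested.lookup(("project[1]", "milestone[1]"), "2007-01-01", "2007-04-30")
```

With this nesting order, a query restricted to one milestone touches only that milestone's inner index; reversing the order would favor queries that restrict the value first.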
13 Evaluate solution: Comparison
- queries and indices on milestone hierarchy and date
- define index that best matches query workload
[Table: time (ms) per query and index. Columns: I1 (date), I2 (date, hierarchy), I3 (date > hierarchy), I4 (hierarchy > date). Rows: Q1 (specific milestone), Q2 (specific project), Q3 (all), average]
14 Construct solution: Index framework (1)
- index
  - search function consisting of a set of index entries
  - provides interface to update and retrieve index entries
- index entry
  - maps index keys (value, type, path, …) -> returned nodes
  - e.g. TechnicalReport, Smith -> 3.2.1, 4.3.1, ...
- index definition
  - selects nodes to be indexed
  - e.g. //element(resource, $V1)[author=$V2]
  - represented as unordered tree pattern with index variables
- index structure
  - specific data structure (hash table, prefix B+-tree, kdb-tree)
  - one index can use several index structures (index nesting)
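The index model above suggests a common interface that every index structure implements. The following is a minimal sketch of such an interface with a hash-table implementation; the method names `insert`, `delete`, and `lookup` are assumptions for illustration, not the interface defined in SCIENS.

```python
from abc import ABC, abstractmethod


class Index(ABC):
    """Hypothetical common interface: index key tuple -> set of node labels."""

    @abstractmethod
    def insert(self, key: tuple, node: str) -> None: ...

    @abstractmethod
    def delete(self, key: tuple, node: str) -> None: ...

    @abstractmethod
    def lookup(self, key: tuple) -> list: ...


class HashIndex(Index):
    """One possible index structure behind the interface: a hash table."""

    def __init__(self):
        self._entries = {}

    def insert(self, key, node):
        self._entries.setdefault(key, set()).add(node)

    def delete(self, key, node):
        nodes = self._entries.get(key)
        if nodes is not None:
            nodes.discard(node)
            if not nodes:  # drop the entry once no node is left
                del self._entries[key]

    def lookup(self, key):
        return sorted(self._entries.get(key, ()))


# Index entry from the slide: TechnicalReport, Smith -> 3.2.1, 4.3.1
reports = HashIndex()
reports.insert(("TechnicalReport", "Smith"), "3.2.1")
reports.insert(("TechnicalReport", "Smith"), "4.3.1")
found = reports.lookup(("TechnicalReport", "Smith"))
```

Query and update processing would call only the `Index` methods, so swapping the hash table for a B+-tree or kdb-tree would not change the callers.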
15 Construct solution: Index framework (2)
- index configuration
  - provides mapping from index to specific index structure
  - associates with each index variable the index structure to be used
  - e.g. $T1, $E2: kdb-tree
  - e.g. $E2: hash table, $T1: B+-tree
- search configuration
  - used to access index
  - associates index key to be searched with each index variable
  - generated by index selection tool
  - e.g. $T1 = Report, $E2 = 'Smith'
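The interplay of the two configurations can be sketched as follows. The sketch assumes that an index configuration is an ordered list of (variables, structure) pairs, where a list with more than one pair denotes nesting in that order; this reading of the slide's second example, the data layout, and the function name `resolve` are all assumptions.

```python
def resolve(index_config, search_config):
    """Turn variable bindings into one search key per index structure.

    index_config:  list of (variables, structure name), outermost level first
    search_config: dict mapping each index variable to the key to search for
    """
    levels = []
    for variables, structure in index_config:
        missing = [v for v in variables if v not in search_config]
        if missing:
            raise KeyError(f"no search key bound to {missing}")
        key = tuple(search_config[v] for v in variables)
        levels.append((structure, key))
    return levels


# The two index configurations from the slide
flat = [(("$T1", "$E2"), "kdb-tree")]
nested = [(("$E2",), "hash table"), (("$T1",), "B+-tree")]

# The search configuration from the slide
search = {"$T1": "Report", "$E2": "Smith"}

flat_plan = resolve(flat, search)
nested_plan = resolve(nested, search)
```

The same search configuration works for both index configurations, which is the point of separating them: the query side binds variables and never names a concrete index structure.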
16 Construct / evaluate solution: Index maintenance
- propagate document updates to affected indices
- steps
  1. find embeddings of index patterns in update fragments
  2. execute queries
  3. generate index entries
     e.g. [(TechnicalReport, 'Smith') resource] [(TechnicalReport, 'Tim') resource]
- up to 9 times faster than existing approach (KeyX)
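The three maintenance steps can be sketched for the index definition //element(resource, $V1)[author=$V2]. This is a heavily simplified illustration, not the SCIENS algorithm: it assumes the whole pattern lies inside the update fragment, so step 2 reduces to a lookup within the fragment instead of a query against the stored document. The `type` and `id` attributes and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def entries_for_fragment(fragment):
    """Return ((type, author), node id) entries derivable from one fragment."""
    entries = []
    # Step 1: find places where the pattern root (resource) embeds in the fragment
    for res in fragment.iter("resource"):
        rtype = res.get("type")  # stands in for the schema type bound to $V1
        # Step 2: evaluate the rest of the pattern (here: inside the fragment only)
        for author in res.findall("author"):
            # Step 3: generate the index entry
            entries.append(((rtype, author.text), res.get("id")))
    return entries


def apply_insert(index, fragment):
    for key, node in entries_for_fragment(fragment):
        index.setdefault(key, set()).add(node)


def apply_delete(index, fragment):
    for key, node in entries_for_fragment(fragment):
        nodes = index.get(key, set())
        nodes.discard(node)
        if not nodes:
            index.pop(key, None)


fragment = ET.fromstring(
    '<milestone>'
    '<resource id="3.2.1" type="TechnicalReport"><author>Smith</author></resource>'
    '<resource id="3.2.2" type="TechnicalReport"><author>Tim</author></resource>'
    '</milestone>'
)
index = {}
apply_insert(index, fragment)
```

Because the entries are computed from the update fragment alone, inserting or deleting a fragment touches only the affected keys and never rebuilds the index.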
17 Conclusion
- select secondary index structures for XML
  - extensible: various properties and operations on these properties
  - nestable: adapt indices to hierarchical queries
- integrate index structures into framework
  - hides indexing tasks from query and update processing tasks
  - provides index model (common index interface)
- index maintenance algorithm
  - propagate updates to index structures
- flexibility to define indices that match the query workload