Efficient Filtering of XML Documents with XPath Expressions
  Chee-Yong Chan, Pascal Felber, Minos Garofalakis, Rajeev Rastogi
  Information Sciences Research Center, Bell Laboratories, Lucent Technologies
Motivation
  - Growing interest in content-based filtering & routing of data.
  - [Figure: data publishers -> content-based router (subscription table, filtering engine) -> subset of relevant data -> data consumers]
  - XML => more expressive XPath-based subscriptions (e.g., Intel's NetStructure XML Accelerator).
  - Challenge: How to efficiently filter XML data with XPath-based subscriptions?
Problem Abstraction
  - Input: an XML document D and a set S of XPath expressions (XPEs).
  - Output: the subset of S that match D.
  - [Figure: D and S feed the XPath Filter, which outputs the subset of S that match D]
Challenges
  Filtering with XPath expressions (XPEs) is non-trivial:
  - Complexity of XPEs: tree-structured patterns that include "*" and "//" operators.
  - Need for both unordered & ordered matchings.
  - Example: p = //a/b[c/*/d]//e/f
  - [Figure: XPE tree of p, with nodes //a, /b, /c, /*/d, //e, /f]
Our Solution: XTrie
  - Speed up XPE filtering with a novel index called XTrie.
  - Key idea: decompose each complex, tree-structured XPE into a set of simple, linear patterns (substrings), then index the substrings with a trie.
Architecture of XTrie
  - Indexing: complex, tree-structured XPEs -> XTrie Index Construction Algorithm -> XTrie Index.
  - Filtering: XML document D -> XML Parser (SAX based) -> Start/End Element Events -> XTrie Matching Algorithm (using the XTrie Index) -> set of XPEs that match D.
Architecture of XTrie (index construction expanded)
  - Indexing: complex, tree-structured XPEs -> Decompose XPEs -> set of simple, linear patterns (substrings) -> Build XTrie index -> XTrie Index (Trie + Substring Table).
  - Filtering: XML document D -> XML Parser (SAX based) -> Start/End Element Events -> XTrie Matching Algorithm (using the XTrie Index) -> set of XPEs that match D.
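The SAX-based parser stage above can be illustrated with Python's standard xml.sax module. This is only a sketch of what a start/end element event stream might look like; the names EventRecorder and element_events are hypothetical and do not come from the slides, and the matching algorithm itself is not shown.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records start/end element events together with the element depth.

    Hypothetical helper: a real filter would feed these events to the
    matching algorithm instead of storing them in a list.
    """

    def __init__(self):
        super().__init__()
        self.events = []
        self.depth = 0

    def startElement(self, name, attrs):
        # Depth of the document root element is 1.
        self.depth += 1
        self.events.append(("start", name, self.depth))

    def endElement(self, name):
        self.events.append(("end", name, self.depth))
        self.depth -= 1


def element_events(xml_text):
    """Parse an XML string and return its list of element events."""
    handler = EventRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


if __name__ == "__main__":
    for event in element_events("<a><b/><c/></a>"):
        print(event)
```

Because the events arrive in document order, a filter built on them can process a document in a single pass without building the whole tree in memory.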
Decomposition of XPEs
  - Decompose each XPE p into a set of substrings that "cover" p.
  - Substring = sequence of element names along some path in the XPE tree, where each consecutive pair of nodes is related by a "/" operator (without any "*" or "//").
  - Example: p = //a/b[c/*/d]//e/f
  - Substrings in p = {a, b, c, d, e, f, ab, bc, ef, abc}.
  - One possible decomposition of p is {a, bc, d, ef}.
  - [Figure: XPE tree of p, with nodes //a, /b, /c, /*/d, //e, /f]
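The substring definition above can be checked mechanically. The following is a minimal sketch, assuming the XPE tree is given as nested nodes that each record the operator linking them to their parent; the Node class and the substrings function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str    # operator linking this node to its parent: "/", "//" or "/*/"
    name: str  # element name
    children: list = field(default_factory=list)


def substrings(root):
    """Return all substrings of an XPE tree as tuples of element names.

    A substring follows a downward path in which every consecutive pair
    of nodes is linked by a plain "/" operator.
    """
    found = set()

    def walk(node, parent_suffixes):
        # parent_suffixes: substrings ending at the parent node.
        if node.op == "/":
            here = [s + (node.name,) for s in parent_suffixes]
        else:
            here = []  # "//" or "/*/" breaks every substring
        here.append((node.name,))
        found.update(here)
        for child in node.children:
            walk(child, here)

    walk(root, [])
    return found


# p = //a/b[c/*/d]//e/f from the slide
p = Node("//", "a", [
    Node("/", "b", [
        Node("/", "c", [Node("/*/", "d")]),
        Node("//", "e", [Node("/", "f")]),
    ]),
])

if __name__ == "__main__":
    print(sorted("".join(s) for s in substrings(p)))
```

For p this enumerates the same ten substrings as the slide. Choosing a decomposition means picking a subset of them that still covers every node of the tree.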
Decomposition of XPEs
  - In general, there are many possible decompositions.
  - [Figure: XPE tree of p shown under a single-element decomposition and under a minimal decomposition]
Decomposition of XPEs
  - "Enhanced" minimal decomposition = minimal decomposition with a substring ending at each branching node.
  - [Figure: XPE tree of p shown under single-element, minimal, and "enhanced" minimal decompositions]
XTrie
  The XTrie index consists of 2 components:
  - Trie
  - Substring Table
XTrie: example XPEs
  - p = //a/a/b/*/a/b
  - q = /a/b[c]//b/c
XTrie: decomposed substrings
  - XPEs: p = //a/a/b/*/a/b and q = /a/b[c]//b/c
  - [Figure: XPE trees of p and q, with nodes labeled /a, /b, /c, //b, //a, /*/a, split into their decomposed substrings]
XTrie: Substring-Table
  - [Figure: XPE trees of p and q with their decomposed substrings]
  - [Table: Substring-Table with one row per decomposed substring (rows 1 to 5: aab, ab, ab, abc, bc) and columns Parent Row, Rel. Level, Num Child, Rank]
XTrie: Trie and Substring-Table
  - [Figure: trie with nodes 1 to 8 and edges labeled with element names a, b, c, built over the substrings aab, ab, abc, bc]
  - [Table: Substring-Table for the substrings aab, ab, ab, abc, bc with columns Parent Row, Rel. Level, Num Child, Rank, Next Row]
  - Pointer types shown: Child Node Ptr, Substring Table Ptr, Max. Suffix Ptr.
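A trie over the four distinct substrings aab, ab, abc, bc has eight nodes including the root, which agrees with the node numbering on the slide. The sketch below builds such a trie and, as an assumption, computes each node's max-suffix pointer the way Aho-Corasick failure links are computed (pointing to the node whose label is the longest proper suffix present in the trie). The function names and the dictionary layout are hypothetical, element names are taken to be single characters, and the substring-table pointers are omitted.

```python
from collections import deque


def build_trie(substrings):
    """Build a trie with suffix links over strings of 1-char element names.

    Node 0 is the root. Each node is a dict with:
      children: element name -> child node index
      label:    concatenated element names on the path from the root
      suffix:   index of the node labeled with the longest proper suffix
                of this node's label that exists in the trie
                (assumed meaning of the slide's "Max. Suffix Ptr")
    """
    nodes = [{"children": {}, "label": "", "suffix": None}]
    for s in substrings:
        cur = 0
        for ch in s:
            nxt = nodes[cur]["children"].get(ch)
            if nxt is None:
                nxt = len(nodes)
                label = nodes[cur]["label"] + ch
                nodes.append({"children": {}, "label": label, "suffix": None})
                nodes[cur]["children"][ch] = nxt
            cur = nxt

    # Breadth-first pass: a node's suffix link only depends on links of
    # shallower nodes, which are already final when the node is visited.
    queue = deque()
    for child in nodes[0]["children"].values():
        nodes[child]["suffix"] = 0
        queue.append(child)
    while queue:
        cur = queue.popleft()
        for ch, child in nodes[cur]["children"].items():
            f = nodes[cur]["suffix"]
            while f != 0 and ch not in nodes[f]["children"]:
                f = nodes[f]["suffix"]
            nodes[child]["suffix"] = nodes[f]["children"].get(ch, 0)
            queue.append(child)
    return nodes


def suffix_labels(nodes):
    """Map each non-root node label to the label its suffix link points to."""
    return {n["label"]: nodes[n["suffix"]]["label"] for n in nodes[1:]}


if __name__ == "__main__":
    trie = build_trie(["aab", "ab", "abc", "bc"])
    for label, target in sorted(suffix_labels(trie).items()):
        print(repr(label), "->", repr(target))
```

With links of this kind, a matcher that fails to extend the current trie node on the next element name can fall back to the longest suffix that is still a trie prefix instead of restarting from the root.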
Optimizations for XTrie
  - "Lazy" variant of XTrie: reduce the number of accesses to the substring-table by probing it only when the matched substring is a leaf substring of some XPE.
  - XTrie for single-path XPEs: optimize data structures & algorithms by exploiting the simpler structure of single-path XPEs.
Related Work
  - Commercial products (e.g., BEA, Intel).
  - XFilter [Altinel & Franklin, VLDB'00]:
    - Models single-path XPEs as finite state machines (FSMs). Example: p = /a//b/c
    - Builds a hash index on the FSMs' transitions (i.e., element names).
    - [Figure: hash entries for a, b, c, each with a candidate-list and a wait-list]
  - XFilter optimizations:
    - XFilter-LB = XFilter with list balancing.
    - Prefiltering = 2 parses over the XML data to pre-filter some XPEs.
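To make the FSM view of a single-path XPE concrete, here is a simplified sketch that simulates such a machine over one root-to-element path. It is not XFilter's actual algorithm (there is no hash index, candidate list, or wait list); parse_single_path and matches_path are hypothetical names, and predicates and attributes are not handled.

```python
import re


def parse_single_path(xpe):
    """Split a single-path XPE such as "/a//b/c" into (axis, name) steps."""
    return re.findall(r"(//|/)([^/]+)", xpe.replace(" ", ""))


def matches_path(steps, path):
    """Return True if some prefix of `path` (root-first element names)
    matches all steps of the XPE.

    A state is the number of steps matched so far. A "//" axis lets the
    machine stay in place and skip the current element; a "/" axis
    requires the very next element to match.
    """
    n = len(steps)
    states = {0}
    for name in path:
        new_states = set()
        for s in states:
            if s == n:
                continue
            axis, step_name = steps[s]
            if step_name == name or step_name == "*":
                new_states.add(s + 1)
            if axis == "//":
                new_states.add(s)
        if n in new_states:
            return True
        states = new_states
        if not states:
            return False
    return False


if __name__ == "__main__":
    p = parse_single_path("/a//b/c")
    print(p)
    print(matches_path(p, ["a", "x", "b", "c"]))
    print(matches_path(p, ["a", "b", "x", "c"]))
```

In a streaming setting the same state sets would be kept on a stack that is pushed on each start-element event and popped on each end-element event, with one machine per XPE.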
Experimental Evaluation
  - DTD: NITF (News Industry Text Format); 123 elements, 513 attributes.
  - XML data: generated with IBM's XML Generator (size = 20, 100, 1000 tag pairs).
  - XPath expressions: generated using our own generator (P = #XPEs, L = max. depth, Pw = prob. of "*", Pd = prob. of "//", z = skew of element names).
  - Algorithms: Eager & Lazy XTrie, XFilter & XFilter-LB [Altinel & Franklin, VLDB'00].
  - System: Sun Ultra-250 (296 MHz) with 512 MB memory running Solaris 2.7.
[Figure: Scalability (# XPEs)]
[Figure: Scalability (# tags)]
Conclusions
  - XTrie: a novel index structure that supports the efficient filtering of streaming XML data based on XPath expressions.
  - Features:
    - Indexes both simple single-path and complex tree-structured XPath expressions.
    - Handles ordered, unordered, and hybrid modes of matching.
[Figure: Speedup / # XPEs (m)]
[Figure: Wildcards (m)]
[Figure: Descendants (m)]
[Figure: Number of levels (m)]
[Figure: Skew (m)]