Estimating the Selectivity of XML Path Expressions for Internet Scale Applications
By A. Aboulnaga, A. R. Alameldeen and J. F. Naughton, VLDB'01
Presented by Kan Kin Fai
Outline of Presentation Introduction Path Trees Markov Tables Experimental Findings
Motivation XML enables Internet scale applications that query data from many sources Niagara, Xyleme, … Queries over XML data use path expressions
An XML Document
<readings>
  <play>
    <title>Pygmalion</title>
    <author>Bernard Shaw</author>
  </play>
  <novel>
    <title>David Copperfield</title>
    <author>Charles Dickens</author>
  </novel>
</readings>
Querying XML Data
FOR $n_auth IN document("*")//novel/author,
    $p_auth IN document("*")//play/author
WHERE $n_auth/text() = $p_auth/text()
RETURN $n_auth
Optimizing this query requires estimating the selectivity of the path expressions
This requires information about the structure of the XML data
Goal of this Paper
Build database statistics that capture the structure of XML data
Ensure that the statistics fit in a small amount of memory
  For efficient query optimization
  Important for Internet scale applications
Use the statistics to estimate the selectivity of simple XML path expressions
Simple Path Expression
A sequence of tags that represents a navigation through the tree structure of the XML data, starting anywhere in the tree
  //t1/t2/…/tn
Assumes an unordered model of XML
Does not consider navigation based on IDREF attributes or on predicates on attribute values
Path Trees A tree representing the structure of an XML document Every node represents a path starting from the root of the XML document The root node represents the root element
Path Trees A node has a child node for every distinct element directly nested in any of the elements reachable by the path it represents Every node is labeled with the tag name of the elements reachable by the path it represents and with the number of such elements (i.e. frequency of the node)
Path Trees
Example document:
<A>
  <B> </B>
  <B>
    <D> </D>
  </B>
  <C>
    <D> </D>
    <E> </E>
    <E> </E>
    <E> </E>
  </C>
</A>
Path tree (tag and frequency):
A 1
  B 2
    D 1
  C 1
    D 1
    E 3
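The construction described above can be sketched in Python as follows. This is only an illustration: the names `PathNode` and `build_path_tree` are hypothetical, and the sample document is an assumption chosen so that the resulting frequencies are A 1, B 2, C 1, D 1, D 1 and E 3.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class PathNode:
    """One path tree node: a tag, its frequency, and children keyed by tag."""
    tag: str
    freq: int = 0
    children: dict = field(default_factory=dict)


def build_path_tree(xml_text):
    """Build a path tree from an XML string (hypothetical helper)."""
    root_elem = ET.fromstring(xml_text)
    root = PathNode(root_elem.tag)

    def visit(elem, node):
        # Every element reachable by this root path adds 1 to the node's frequency.
        node.freq += 1
        for child in elem:
            # One child node per distinct tag nested under this path.
            child_node = node.children.setdefault(child.tag, PathNode(child.tag))
            visit(child, child_node)

    visit(root_elem, root)
    return root


# Assumed sample document (not taken verbatim from the slides).
SAMPLE = "<A><B/><B><D/></B><C><D/><E/><E/><E/></C></A>"
tree = build_path_tree(SAMPLE)
print(tree.children["B"].freq, tree.children["C"].children["E"].freq)
```

Keying the children by tag name is what merges the two B elements into a single node with frequency 2.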
Summarizing Path Trees
Path trees contain all the information needed for selectivity estimation
Problem: May not fit in available memory
  Small available memory
  Internet scale
Remove low frequency nodes
Removed nodes replaced with *-nodes
  Tag name: * meaning "any tag"
  Frequency: Average frequency of replaced nodes
Sibling-*, Level-*, Global-*, No-*
Sibling-* Summarization
Repeatedly choose the path tree node with the lowest frequency and mark it for deletion
Check its siblings to see if any of them is either a *-node or a regular node that has been marked for deletion
If so, coalesce the node with that sibling into one *-node
Coalesce the children of coalesced nodes if they have the same tag name
Sibling-* Summarization During summarization, all path tree nodes store the number of nodes in the original unsummarized path tree that they represent and the total frequency of these nodes. When the path tree becomes small enough, traverse the tree and compute for every *-node the average frequency of the multiple deleted nodes that it represents.
Sibling-* Summarization
[Figure sequence: step-by-step summarization of the example path tree with nodes A 1, B 13, C 9, D 7, E 5, F 15, G 10, H 6, I 2, J 4, K 11, K 12]
Step 1: I (2) and J (4) are marked for deletion and coalesced into a *-node with f=6, n=2
Step 2: E (5), H (6) and D (7) are marked for deletion; D and E are coalesced into a *-node with f=12, n=2
Step 3: G (10) is marked for deletion and coalesced with H into a *-node with f=16, n=2
Step 4: The two K nodes (11 and 12) are coalesced into one K node with f=23, n=2
Step 5: The average frequency of each *-node is computed: 6 (D, E), 8 (G, H) and 3 (I, J)
Original Path Tree A 1 B 13 C 9 D 7 E 5 F 15 G 10 H 6 I 2 J 4 K 11 K 12
Sibling-* Summarization
[Figure: summarized path tree with *-nodes of average frequency 6, 8 and 3 and a coalesced K node (f=23, n=2)]
Tries to preserve the exact position of the deleted nodes in the original path tree
May need to delete 2n nodes to reduce the size of the tree by n nodes
Selectivity Estimation Try to match the tags in the path expressions with tags in the path tree to find all path tree nodes to which the path expression leads The estimated selectivity is the total frequency of all these nodes.
Selectivity Estimation When we can’t match a tag in the path expression to a path tree node with a regular tag, try to match it to a *-node that can take its place. E.g. //A/B/C would match all of //A/*/C, //A/*/* and //*/B/* Allow matches with any number of *-nodes as long as they include at least one node with a regular tag name
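The matching procedure above can be sketched in Python as follows. This is one possible reading of the slides, not the authors' implementation: the names `Node`, `match` and `estimate` are hypothetical, and the sketch assumes that a *-node is tried only when no regular node with the requested tag is available at that step.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str            # regular tag name, or "*" for a *-node
    freq: float
    children: list = field(default_factory=list)


def descendants(node):
    """Yield the node and every node below it."""
    yield node
    for child in node.children:
        yield from descendants(child)


def match(pool, tag):
    """Prefer regular nodes with this tag; fall back to *-nodes (assumption)."""
    regular = [(n, True) for n in pool if n.tag == tag]
    if regular:
        return regular
    return [(n, False) for n in pool if n.tag == "*"]


def estimate(root, path):
    """Estimate the selectivity of a simple path expression such as //A/B/C."""
    tags = path.strip("/").split("/")
    # The path may start anywhere in the tree.
    frontier = match(list(descendants(root)), tags[0])
    for tag in tags[1:]:
        frontier = [(child, seen or reg)
                    for node, seen in frontier
                    for child, reg in match(node.children, tag)]
    # Only count matches that include at least one regular tag.
    return sum(node.freq for node, seen in frontier if seen)


# Hypothetical summarized tree: A has a regular child B and a *-node child.
summary = Node("A", 1, [Node("B", 13), Node("*", 6)])
print(estimate(summary, "//A/B"), estimate(summary, "//A/Z"), estimate(summary, "//Z"))
```

In this sketch //A/Z falls back to the *-node under A, while //Z matches only a *-node and therefore contributes nothing.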
Level-* Summarization Has a *-node for every level of the path tree representing all deleted nodes at this level All nodes deleted at any given level of the path tree are coalesced into the *-node for this level Preserves only the level in the path tree of the deleted nodes, not their exact position as in sibling-* Need to delete n+h nodes to reduce the size of the path tree by n nodes, where h is the number of levels in the tree
Level-* Summarization
[Figure sequence on the example path tree]
Nodes marked for deletion: I (2), J (4), E (5), H (6), D (7)
Result: the marked nodes are replaced by two *-nodes with average frequencies 6 and 3; the regular nodes B 13, C 9, F 15, G 10, K 11 and K 12 remain
Global-* Summarization A single *-node represents all low-frequency nodes deleted from anywhere in the path tree Preserves less information about the deleted nodes than sibling-* or level-* Needs to delete only n+1 nodes to reduce the size of the path tree by n nodes
Global-* Summarization
[Figure sequence on the example path tree]
Nodes marked for deletion: I (2), J (4), E (5), H (6), D (7)
Final frame: a single *-node with average frequency 3; D (7) and H (6) are still marked; the regular nodes B 13, C 9, F 15, G 10, K 11 and K 12 remain
No-* Summarization Low-frequency nodes are simply deleted and not replaced with *-nodes Deletes exactly n nodes to reduce the size of a path tree by n nodes
No-* Summarization
[Figure sequence on the example path tree]
Nodes marked for deletion: I (2), J (4), E (5), H (6), D (7)
Final frame: I and J are deleted without replacement; D (7), E (5) and H (6) are still marked; B 13, C 9, F 15, G 10, K 11 and K 12 remain
Markov Tables
A table of all distinct paths of length up to m and their frequencies
For paths of length greater than m, combine paths from the Markov table
Uses the "short memory" or "Markov" property
Example: $f(A/B/C/D) = f(A/B/C) \times \frac{f(B/C/D)}{f(B/C)}$
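A small Python sketch of how paths longer than m might be estimated by chaining table entries in the way the example suggests. The function name `estimate`, the `A/C` key notation and the restriction to m ≥ 2 are assumptions of this sketch. The frequencies are the ones from the example Markov table in these slides; AB is left out because its frequency is not given.

```python
def estimate(table, tags, m):
    """Estimate f(t1/.../tn) from a Markov table holding paths of length <= m.

    Assumes m >= 2. Paths missing from the table are treated as frequency 0.
    """
    n = len(tags)
    if n <= m:
        return table.get("/".join(tags), 0)
    # Start with the first m tags, then extend one tag at a time.
    est = table.get("/".join(tags[:m]), 0)
    for i in range(1, n - m + 1):
        numerator = table.get("/".join(tags[i:i + m]), 0)
        denominator = table.get("/".join(tags[i:i + m - 1]), 0)
        if denominator == 0:
            return 0.0
        est *= numerator / denominator
    return est


table = {"A": 1, "B": 11, "C": 15, "D": 19,
         "A/C": 6, "A/D": 4, "B/C": 9, "B/D": 7, "C/D": 8}

# With m = 2: f(A/C/D) is approximated by f(A/C) * f(C/D) / f(C).
print(estimate(table, ["A", "C", "D"], 2))
```

With m = 2 the estimate for A/C/D works out to 6 × 8 / 15.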
Markov Tables
Path  Freq
A     1
B     11
C     15
D     19
AB
AC    6
AD    4
BC    9
BD    7
CD    8
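As an illustration of where such a table could come from, the sketch below counts every path of length up to m while walking an XML document. The function name `build_markov_table`, the `A/B` key notation and the sample document are assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_markov_table(xml_text, m=2):
    """Count all distinct paths of length 1..m in the document (hypothetical helper)."""
    table = Counter()

    def visit(elem, ancestors):
        tags = ancestors + [elem.tag]
        # Record every path of length 1..m that ends at this element.
        for k in range(1, min(m, len(tags)) + 1):
            table["/".join(tags[-k:])] += 1
        # Children only need the last m-1 tags as context.
        context = tags[-(m - 1):] if m > 1 else []
        for child in elem:
            visit(child, context)

    visit(ET.fromstring(xml_text), [])
    return table


# Assumed sample document (not from the slides).
table = build_markov_table("<A><B/><B><D/></B><C><D/><E/><E/><E/></C></A>", m=2)
print(table["D"], table["C/E"])
```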
Summarizing Markov Tables
Exact selectivities for paths of length up to m
Approximate selectivities for paths longer than m
Problem: May not fit in available memory
Remove low frequency paths
  Discard removed paths of length > 2
  Replace removed paths of length 1 or 2 with *-paths
Suffix-*, Global-*, No-*
Suffix-* Summarization
Two special *-paths
  *: all deleted paths of length 1
  */*: all deleted paths of length 2
Adds a deleted low-frequency path of length 1 to *
Keeps a set SD of deleted paths of length 2
When deleting a low-frequency path of length 2
  Looks for a suffix-* path (e.g. A/*) with the same starting tag in the Markov table
  Looks for a path with the same starting tag in SD
Suffix-* Summarization
When the deleted low-frequency path of length 2 is itself a suffix-* path, adds it to */*
At the end, adds the deleted paths remaining in SD to */* and computes the average frequencies of all *-paths
Selectivity Estimation
  Uses the frequencies of suffix-* paths and *-paths if any of the required paths is not found
  Returns 0 if only *-paths are used for estimation
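The suffix-* procedure can be sketched in Python as below. Several details are assumptions of this sketch rather than statements from the slides: the function name `summarize_suffix_star`, the `A/*` key notation, the way table size is counted (SD and */* together count as one entry), and the choice to rank a suffix-* path by its total frequency when picking the next path to delete. The input frequencies come from the example Markov table, without AB because its frequency is not given.

```python
def summarize_suffix_star(table, budget):
    """Shrink a Markov table to at most `budget` entries (hypothetical sketch)."""
    regular = dict(table)          # path -> frequency
    suffix = {}                    # "X/*" -> [total frequency, number of paths]
    star = [0, 0]                  # totals for "*"   (deleted paths of length 1)
    star2 = [0, 0]                 # totals for "*/*" (deleted paths of length 2)
    sd = {}                        # deleted length-2 paths waiting for a partner

    def size():
        return (len(regular) + len(suffix)
                + (1 if star[1] else 0) + (1 if star2[1] or sd else 0))

    while size() > budget and (regular or suffix):
        candidates = {**regular, **{p: t for p, (t, _) in suffix.items()}}
        path = min(candidates, key=candidates.get)
        if path in suffix:
            # A deleted suffix-* path is folded into */*.
            total, count = suffix.pop(path)
            star2[0] += total
            star2[1] += count
            continue
        freq = regular.pop(path)
        tags = path.split("/")
        if len(tags) == 1:
            star[0] += freq
            star[1] += 1
        elif len(tags) == 2:
            key = tags[0] + "/*"
            partner = next((p for p in sd if p.split("/")[0] == tags[0]), None)
            if key in suffix:
                suffix[key][0] += freq
                suffix[key][1] += 1
            elif partner is not None:
                suffix[key] = [freq + sd.pop(partner), 2]
            else:
                sd[path] = freq
        # Deleted paths longer than 2 are simply discarded.

    for freq in sd.values():       # leftovers in SD go to */*
        star2[0] += freq
        star2[1] += 1
    summary = dict(regular)
    summary.update({p: t / n for p, (t, n) in suffix.items()})
    if star[1]:
        summary["*"] = star[0] / star[1]
    if star2[1]:
        summary["*/*"] = star2[0] / star2[1]
    return summary


table = {"A": 1, "B": 11, "C": 15, "D": 19,
         "A/C": 6, "A/D": 4, "B/C": 9, "B/D": 7, "C/D": 8}
print(summarize_suffix_star(table, budget=6))
```

Under these assumptions and a budget of 6 entries, the sketch deletes A, AD, AC, BD, CD, BC and then A/*, in that order.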
Suffix-* Summarization
[Figure sequence: step-by-step summarization of the example Markov table; SD is the set of deleted paths of length 2]
Step 1: A (1) is deleted and added to *, giving * f=1, n=1
Step 2: AD (4) is deleted and added to SD, giving SD = { (AD,4) }
Step 3: AC (6) is deleted and combined with AD from SD into A* with f=10, n=2; SD = { }
Step 4: BD (7) is deleted and added to SD, giving SD = { (BD,7) }
Step 5: CD (8) is deleted and added to SD, giving SD = { (BD,7), (CD,8) }
Step 6: BC (9) is deleted and combined with BD from SD into B* with f=16, n=2; SD = { (CD,8) }
Step 7: A* got a second chance as a suffix-* path, but with f=10 it is deleted next and added to ** with f=10, n=2
Final step: CD from SD is added to ** and the average frequencies of the *-paths are computed
Table used for selectivity estimation:
Path  Freq
B     11
C     15
D     19
AB
B*    8
*     1
**    6
Global-*, No-* Summarization
Global-*
  Two *-paths, * and **
  Adds low-frequency path of length 1 or 2 to the appropriate *-path immediately
  Deletes fewer paths than suffix-* to summarize the Markov table
No-*
  No *-paths
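For comparison, a global-* summarization of a Markov table might look like the Python sketch below. The function name `summarize_global_star`, the `*/*` key notation and the way the table size is counted are assumptions of this sketch. A no-* variant would be the same loop with the deleted frequencies thrown away instead of accumulated.

```python
def summarize_global_star(table, budget):
    """Delete lowest-frequency paths until at most `budget` entries remain."""
    regular = dict(table)
    # Path length -> [total frequency, number of deleted paths].
    star = {1: [0, 0], 2: [0, 0]}

    def size():
        return len(regular) + sum(1 for _, count in star.values() if count)

    while regular and size() > budget:
        path = min(regular, key=regular.get)
        freq = regular.pop(path)
        length = path.count("/") + 1
        if length <= 2:
            # Deleted paths of length 1 or 2 go straight to * or */*.
            star[length][0] += freq
            star[length][1] += 1
        # Deleted paths longer than 2 are discarded.

    summary = dict(regular)
    for length, key in ((1, "*"), (2, "*/*")):
        total, count = star[length]
        if count:
            summary[key] = total / count   # average frequency of deleted paths
    return summary


table = {"A": 1, "B": 11, "C": 15, "D": 19,
         "A/C": 6, "A/D": 4, "B/C": 9, "B/D": 7, "C/D": 8}
print(summarize_global_star(table, budget=7))
```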
Experimental Findings
Path trees
  Query paths in data: Global-*
  Query paths not in data: No-*
Markov tables
  m = 2 is best (practical values: 2 and 3)
  Query paths in data: Suffix-*
Explanation
Methods using *-nodes/*-paths aggressively assume that nodes/paths that cannot be found did exist in the original path tree/Markov table.
No-* conservatively assumes that nodes/paths that cannot be found did not exist in the original path tree/Markov table.
Experimental Findings When the data has many common sub-structures, Markov tables give more accurate estimation. When the data does not have many common sub-structures, path trees give more accurate estimation.
Explanation
DBLP: repeated sub-structures are effectively captured by Markov tables
<sigmod>
  <inproceedings>
    <author>…</author>
    …
  </inproceedings>
  …
</sigmod>
<vldb>
  <inproceedings>
    <author>…</author>
    …
  </inproceedings>
  …
</vldb>