Estimating the Selectivity of XML Path Expressions for Internet Scale Applications
  Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton
  Computer Sciences Department, University of Wisconsin - Madison
Motivation
  - XML enables Internet scale applications that query data from many sources
      - Niagara, Xyleme, …
  - Queries over XML data use path expressions
  - Optimizing these queries requires estimating the selectivity of the path expressions
  - Focus of this talk: building statistics for XML data and using them to estimate the selectivity of simple path expressions
What is XML?
  - Example: a play (Pygmalion, by Bernard Shaw) and a novel (David Copperfield, by Charles Dickens), each marked up with nested tags
Querying XML
    FOR $n_auth IN document("*")//novel/author
        $p_auth IN document("*")//play/author
    WHERE $n_auth/text() = $p_auth/text()
    RETURN $n_auth
  - Optimizing this query requires estimating the selectivity of the path expressions
  - This requires information about the structure of the XML data
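To make "selectivity of a path expression" concrete, the following minimal sketch counts the elements matched by a simple path on a tiny document. Only the paths novel/author and play/author come from the query above; the `<library>` wrapper, the `<title>` tags, and the helper name `exact_selectivity` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document. Only novel/author and play/author come from the query
# above; the <library> wrapper and <title> tags are assumptions of this sketch.
DOC = """
<library>
  <play><title>Pygmalion</title><author>Bernard Shaw</author></play>
  <novel><title>David Copperfield</title><author>Charles Dickens</author></novel>
</library>
"""

def exact_selectivity(root, tags):
    """Number of elements reached by the simple path //t1/t2/.../tn."""
    # './/t1/t2' matches t1 at any depth below root, then its t2 children.
    return len(root.findall(".//" + "/".join(tags)))

root = ET.fromstring(DOC)
print(exact_selectivity(root, ["novel", "author"]))
print(exact_selectivity(root, ["play", "author"]))
```

Counting like this needs a full scan of the data. The statistics in the rest of the talk aim to approximate the same counts from a small summary.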
Goal of this Work
  - Build database statistics that capture the structure of XML data
  - Ensure that the statistics fit in a small amount of memory
      - For efficient query optimization
      - Important for Internet scale applications
  - Use the statistics to estimate the selectivity of simple XML path expressions of the form //t1/t2/…/tn
Outline of Presentation
  - Introduction
  - Path Trees
  - Markov Tables
  - Performance Evaluation
  - Conclusions
Path Trees A 1 C 1 B 2 D 1 D 1 E 3
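As a rough illustration of the data structure, the sketch below builds a path tree in which each node stands for one distinct root-to-element tag path and stores how many elements are reached by it. The sample document is an assumption chosen so that the resulting nodes carry the labels shown on the slide (A 1, B 2, C 1, D 1, D 1, E 3); the class and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

class PathTreeNode:
    def __init__(self, tag):
        self.tag = tag
        self.freq = 0          # number of elements reached by this root path
        self.children = {}     # child tag -> PathTreeNode

def build_path_tree(elem, node=None):
    """Fold the element tree rooted at elem into a path tree."""
    if node is None:
        node = PathTreeNode(elem.tag)
    node.freq += 1
    for child in elem:
        if child.tag not in node.children:
            node.children[child.tag] = PathTreeNode(child.tag)
        build_path_tree(child, node.children[child.tag])
    return node

def count_from(node, tags):
    """Frequency of the path tags[0]/tags[1]/... starting exactly at node."""
    if node.tag != tags[0]:
        return 0
    if len(tags) == 1:
        return node.freq
    child = node.children.get(tags[1])
    return count_from(child, tags[1:]) if child is not None else 0

def estimate(node, tags):
    """Selectivity of //t1/.../tn: sum the matches that start at any node."""
    return count_from(node, tags) + sum(
        estimate(child, tags) for child in node.children.values()
    )

# Assumed sample document, not taken from the talk.
tree = build_path_tree(ET.fromstring("<A><B/><B><D/></B><C><D/><E/><E/><E/></C></A>"))
print(estimate(tree, ["C", "E"]))
```

On an unsummarized path tree this lookup is exact for simple paths, which is why the summarization methods that follow only become necessary once the tree exceeds the memory budget.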
Summarizing Path Trees
  - Path trees contain all the information needed for selectivity estimation
  - Problem: may not fit in available memory
      - Small available memory
      - Internet scale
  - Remove low frequency nodes
  - Removed nodes are replaced with *-nodes
      - Tag name: *, meaning "any tag"
      - Frequency: average frequency of the replaced nodes
  - Four methods: Sibling-*, Level-*, Global-*, No-*
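The coalescing step can be sketched for a single group of siblings as follows. This is a simplified illustration, not the talk's algorithm: it assumes a fixed frequency `threshold` decides which nodes are removed, whereas the slides only say that low frequency nodes are removed until the summary fits in memory. The names `StarNode` and `coalesce_siblings` and the sibling set in the example are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StarNode:
    total_freq: int = 0   # f: summed frequency of the replaced nodes
    count: int = 0        # n: number of replaced nodes

    def absorb(self, freq):
        self.total_freq += freq
        self.count += 1

    @property
    def avg_freq(self):
        # Frequency reported for the *-node once summarization is done.
        return self.total_freq / self.count

def coalesce_siblings(siblings, threshold):
    """Replace children with freq < threshold by one *-node.

    siblings maps tag -> frequency for the children of a single parent.
    Returns (kept_children, star_node_or_None).
    """
    kept, star = {}, StarNode()
    for tag, freq in siblings.items():
        if freq < threshold:
            star.absorb(freq)
        else:
            kept[tag] = freq
    return kept, (star if star.count else None)

# Illustrative sibling set; the threshold of 5 is an assumption.
kept, star = coalesce_siblings({"K": 11, "J": 4, "I": 2}, threshold=5)
print(kept, star, star.avg_freq)
```

Keeping both f and n while nodes are still being deleted lets later deletions be folded into the same *-node before the average is taken.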
Sibling-* Summarization A 1 C 9 B 13 G 10 F 15 H 6 K 12 E 5 D 7 K 11 J 4 I 2
Sibling-* Summarization
  A 1 C 9 B 13 G 10 F 15 H 6 K 12 E 5 D 7 K 11 * f=6 n=2
  - *-nodes represent deleted sibling nodes
  - Memory saved by coalescing nodes
Sibling-* Summarization A 1 C 9 B 13 G 10 F 15 H 6 K 12 K 11 * f=6 n=2 * f=12 n=2
Sibling-* Summarization A 1 C 9 B 13 G 10 F 15 H 6 K 12 * K 11 * f=6 n=2 f=12 n=2
Sibling-* Summarization A 1 C 9 B 13 F 15 K 12 * K 11 * f=6 n=2 f=12 n=2 * f=16 n=2
Sibling-* Summarization A 1 C 9 B 13 * F 15 * K* f=6 n=2 f=12 n=2 f=16 n=2 f=23 n=2
Sibling-* Summarization A 1 C 9 B 13 * F 15 * K* f=23 n=2 68 3
Original Path Tree A 1 C 9 B 13 G 10 F 15 H 6 K 12 E 5 D 7 K 11 J 4 I 2
Sibling-* Summarization
  A 1 C 9 B 13 * F 15 * K* f=23 n=
  - Try to retain as much information as possible about the deleted nodes
Level-* Summarization A 1 C 9 B 13 G 10 F 15 H 6 K 12 E 5 D 7 K 11 J 4 I 2
Level-* Summarization
  A 1 C 9 B 13 G 10 F 15 K 12 K 11 * 6 * 3
  - Less information about deleted nodes than sibling-*
  - Deletes fewer nodes than sibling-*
Global-* Summarization A 1 C 9 B 13 G 10 F 15 H 6 K 12 E 5 D 7 K 11 J 4 I 2
Global-* Summarization C 9 B 13 G 10 F 15 H 6 K 12 D 7 K 11 * 3
No-* Summarization A 1 C 9 B 13 G 10 F 15 H 6 K 12 E 5 D 7 K 11 J 4 I 2
No-* Summarization
  C 9 B 13 G 10 F 15 H 6 K 12 E 5 D 7 K 11
  - Memory savings similar to global-*
  - Conservative assumption about deleted nodes
Outline
  - Introduction
  - Path Trees
  - Markov Tables
  - Performance Evaluation
  - Conclusions
Markov Tables
  - A table of all distinct paths of length up to m and their frequencies
  - For paths of length greater than m, combine paths from the Markov table
  - Uses the "short memory" or "Markov" property
  - Example:
      $$ f(A/B/C/D) = f(A/B/C) \cdot \frac{f(B/C/D)}{f(B/C)} $$
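The example generalizes to any path longer than m: start from the first m tags, then slide a window of length m along the path, multiplying by the fraction of (m-1)-tag prefixes that continue with the next tag. The sketch below shows one way to write this. The function name and the dictionary representation are assumptions, and paths missing from the table are treated as having frequency 0, which corresponds to the most conservative of the summarization choices discussed later. The sample frequencies are those of the example Markov table used in this talk.

```python
def markov_estimate(table, path, m):
    """Estimate f(t1/.../tn) from a table of paths of length <= m.

    table maps tuples of tags to frequencies; m must be at least 2.
    """
    assert m >= 2
    path = tuple(path)
    if len(path) <= m:
        return float(table.get(path, 0))       # short paths are looked up directly
    est = float(table.get(path[:m], 0))        # f(t1/.../tm)
    for i in range(1, len(path) - m + 1):
        num = table.get(path[i:i + m], 0)      # f(t_{1+i}/.../t_{m+i})
        den = table.get(path[i:i + m - 1], 0)  # f(t_{1+i}/.../t_{m+i-1})
        if den == 0:
            return 0.0
        est *= num / den
    return est

# Frequencies from the talk's example table (paths of length 1 and 2).
TABLE = {
    ("A",): 1, ("B",): 11, ("C",): 15, ("D",): 19,
    ("A", "B"): 11, ("A", "C"): 6, ("A", "D"): 4,
    ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 8,
}
print(markov_estimate(TABLE, ["A", "B", "C", "D"], m=2))
```

With m = 2 the estimate for A/B/C/D is f(AB) · f(BC)/f(B) · f(CD)/f(C) = 11 · 9/11 · 8/15 = 4.8, while with m = 3 the same function reduces to the formula on the slide.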
Markov Tables
    Path  Freq        Path  Freq
    A     1           AC    6
    B     11          AD    4
    C     15          BC    9
    D     19          BD    7
    AB    11          CD    8
  Path tree (tag frequency): A 1 D 4 C 6 B 11 D 7 C 9 D 8
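A Markov table can be derived from a path tree by adding each node's frequency to every path of length up to m that ends at that node. The sketch below does this for m = 2 on a tree with the node labels from the slide. The exact shape of the tree is an assumption: the flattened slide does not show which C node is the parent of "D 8", and it is attached under "C 9" here (either choice yields the same table). The tuple encoding and the function name are hypothetical.

```python
from collections import Counter

# (tag, frequency, children). Placement of D 8 under C 9 is an assumption.
TREE = ("A", 1, [
    ("D", 4, []),
    ("C", 6, []),
    ("B", 11, [
        ("D", 7, []),
        ("C", 9, [("D", 8, [])]),
    ]),
])

def markov_table(tree, m):
    """Frequencies of all distinct tag paths of length 1..m in the path tree."""
    table = Counter()

    def visit(node, ancestors):
        tag, freq, children = node
        path = ancestors + (tag,)
        # Every suffix of the root path with length <= m ends at this node.
        for k in range(1, min(m, len(path)) + 1):
            table[path[-k:]] += freq
        for child in children:
            visit(child, path)

    visit(tree, ())
    return table

for path, freq in sorted(markov_table(TREE, 2).items(), key=lambda kv: (len(kv[0]), kv[0])):
    print("".join(path), freq)
```

The result matches the table on the slide, for example C = 6 + 9 = 15 and D = 4 + 7 + 8 = 19.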
Summarizing Markov Tables
  - Exact selectivities for paths of length up to m
  - Approximate selectivities for paths longer than m
  - Problem: may not fit in available memory
  - Remove low frequency paths
      - Discard removed paths of length > 2
      - Replace removed paths of length 1 or 2 with *-paths
  - Three methods: Suffix-*, Global-*, No-*
Suffix-* Summarization
    Path  Freq        Path  Freq
    A     1           AC    6
    B     11          AD    4
    C     15          BC    9
    D     19          BD    7
    AB    11          CD    8
Suffix-* Summarization
    Path  Freq        Path  Freq
    A     1           AC    6
    B     11          AD    4
    C     15          BC    9
    D     19          BD    7
    AB    11          CD    8
    *     0           **    0
Suffix-* Summarization
    Path  Freq        Path  Freq
                      AC    6
    B     11          AD    4
    C     15          BC    9
    D     19          BD    7
    AB    11          CD    8
    *     f=1,n=1     **    0
Suffix-* Summarization
    Path  Freq        Path  Freq
                      AC    6
    B     11          AD    4
    C     15          BC    9
    D     19          BD    7
    AB    11          CD    8
    *     f=1,n=1     **    0
  S_D = { }  (set of deleted paths of length 2)
Suffix-* Summarization
    Path  Freq        Path  Freq
                      AC    6
    B     11
    C     15          BC    9
    D     19          BD    7
    AB    11          CD    8
    *     f=1,n=1     **    0
  S_D = { (AD,4) }
Suffix-* Summarization
    Path  Freq        Path  Freq
                      A*    f=10,n=2
    B     11
    C     15          BC    9
    D     19          BD    7
    AB    11          CD    8
    *     f=1,n=1     **    0
  S_D = { }
Suffix-* Summarization
    Path  Freq        Path  Freq
                      A*    f=10,n=2
    B     11
    C     15          BC    9
    D     19
    AB    11          CD    8
    *     f=1,n=1     **    0
  S_D = { (BD,7) }
Suffix-* Summarization
    Path  Freq        Path  Freq
                      A*    f=10,n=2
    B     11
    C     15          BC    9
    D     19
    AB    11
    *     f=1,n=1     **    0
  S_D = { (BD,7), (CD,8) }
Suffix-* Summarization
    Path  Freq        Path  Freq
                      A*    f=10,n=2
    B     11
    C     15          B*    f=16,n=2
    D     19
    AB    11
    *     f=1,n=1     **    0
  S_D = { (CD,8) }
Suffix-* Summarization
    Path  Freq        Path  Freq
    B     11
    C     15          B*    f=16,n=2
    D     19
    AB    11
    *     f=1,n=1     **    f=10,n=2
  S_D = { (CD,8) }
Suffix-* Summarization
    Path  Freq        Path  Freq
    B     11
    C     15          B*    8
    D     19
    AB    11
    *     1           **    6
  S_D = { }
Global-*, No-* Summarization
  - Global-*
      - Two *-paths, * and **
      - Deletes fewer paths than suffix-* to summarize the Markov table
  - No-*
      - No *-paths
      - Conservatively assumes that paths not in the Markov table do not exist in the data
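The difference between the two methods shows up when a path is looked up that is no longer in the summarized table. The sketch below is a simplified illustration under assumed names: `star_freqs` maps a path length (1 for *, 2 for **) to the frequency stored for that *-path, and passing `None` models no-*. The table contents and *-path frequencies in the example are illustrative values, not results from the talk.

```python
def lookup(table, path, star_freqs=None):
    """Frequency of a path of length 1 or 2 in a summarized Markov table."""
    key = tuple(path)
    if key in table:
        return table[key]
    if star_freqs is None:
        # no-*: a path that is not in the table is assumed not to exist.
        return 0
    # global-*: fall back to the *-path of matching length ('*' or '**').
    return star_freqs.get(len(key), 0)

# Illustrative summarized table and *-path frequencies (assumptions).
summary = {("B",): 11, ("A", "B"): 11}
print(lookup(summary, ["C", "D"]))                # no-*
print(lookup(summary, ["C", "D"], {1: 1, 2: 6}))  # global-*
```

This is consistent with no-* suiting workloads where most query paths do not occur in the data, while the *-based methods suit workloads whose paths do occur.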
Outline
  - Introduction
  - Path Trees
  - Markov Tables
  - Performance Evaluation
  - Conclusions
Data Sets for Experiments
  - Synthetic data set
      - 100,000 XML elements
      - Path tree: 3197 nodes, 6 levels, 38 KB
      - Element frequencies: Zipfian (z=1)
  - DBLP data set
      - 1,399,765 XML elements
      - Path tree: 5883 nodes, 6 levels, 69 KB
Query Workloads
  - 1,000 paths of length between 1 and 4
  - Random paths
      - All query paths exist in the data
  - Random tags
      - Most query paths of length 2 or more do not exist in the data
  - Available memory between 5 and 50 KB
Best Summarization Methods
  - Path trees
      - Query paths in data: Global-*
      - Query paths not in data: No-*
  - Markov tables
      - m = 2 is best
      - Query paths in data: Suffix-*
      - Query paths not in data: No-*
Path Trees vs. Markov Tables
  - When to use path trees and when to use Markov tables?
  - Also compared against Pruned Suffix Trees (PSTs) [Chen et al., ICDE 2001]
      - Can handle branching path expressions
      - Can handle conditions on element values
Synthetic Data – Random Paths
Synthetic Data – Random Tags
DBLP Data – Random Paths
DBLP Data – Random Tags
When are Markov Tables Better? DBLP Repeated sub-structures effectively captured by Markov tables … … …
Conclusions
  - Novel statistics for estimating the selectivity of XML path expressions
      - Scale to "all the XML data on the Internet"
      - More accurate than best previously known alternative
  - Repeated sub-structures: Markov tables
  - No repeated sub-structures: Path trees
  - Query paths exist in the data: Global-*, Suffix-*
  - Query paths do not exist in the data: No-*
  - To appear in VLDB 2001