1
Approximate XML Query Answers
Alkis Polyzotis (UC Santa Cruz), Minos Garofalakis (Bell Labs), Yannis Ioannidis (U. of Athens, Hellas)
2
Motivation
XML: de facto standard for data exchange
Development of the “XML Warehouse”
Conflict between “on-line” and query execution cost
  Increased query response times
  Users might wait for uninteresting results
[Figure: XML data warehouse, query Q, result R]
3
Approximate Query Answers
Evaluate query over a concise data synopsis and obtain an approximation R’ of the true result
Use approximate result as timely feedback
  User can assess the “value” of the query
Goal: reduce number of evaluated queries
[Figure: XML data warehouse and its synopsis; query Q, true result R, approximate result R’]
4
Contributions
TreeSketch Synopses
  Structural summaries for XML data
  Approximate answers for complex twig queries
  Summarization model: structural clustering of elements
  Efficient processing and construction
Element Simulation Distance
  Novel distance metric for XML data
  Captures “approximate” similarity between two XML trees
Experimental Results
  Accurate approximate answers for low space budgets
  Low-error selectivity estimates
  Efficient construction algorithm
5
Outline
Preliminaries
TreeSketches
  Synopsis model
  Computing approximate answers
  Summary construction
Element Simulation Distance
Experimental Study
Conclusions
6
Data and Query Model
[Figure: an XML document tree (elements r, p1, s2, s3, f4, f5, f6, f7, e8, c9, e10, c11, c12, c13); a twig query with nodes q0, q1, q2, q3 and edge labels //section, .//equation, ./figure; the resulting nesting tree; and the binding tuples over (q0, q1, q2, q3), e.g. (r, s2, e8, f4) and (r, s2, e10, f5)]
7
Problem Definition
Process twig query over a synopsis
Compute approximation of nesting tree
[Figure: the twig query (nodes q0 to q3; edge labels //section, .//equation, ./figure) evaluated over the XML data gives the true nesting tree; evaluated over the synopsis it gives the approximate nesting tree]
8
TreeSketch Model
9
Graph Synopsis
Synopsis node: set of elements of the same tag
Synopsis edge: document edge(s)
[Figure: XML document tree (elements r, p1, s2, s3, f4, f5, f6, f7, e8, c9, e10, c11, c12, c13) and its graph synopsis with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4)]
10
TreeSketch Synopsis
Augment graph-synopsis with edge counts
count[u,v]: mean #children in v per element in u
[Figure: XML document and its TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4) and edge counts 1, 2, 1, 1, 1, 1, 1]
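The definition of count[u,v] can be illustrated with a minimal sketch. The toy document, the node names and the function name `edge_counts` are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter, defaultdict

def edge_counts(children, cluster_of):
    """Return {(u, v): mean number of children in v per element in u}.

    children:   element -> list of child elements (hypothetical toy encoding)
    cluster_of: element -> synopsis node that contains it
    """
    size = Counter(cluster_of.values())          # number of elements per node
    edges = defaultdict(int)                     # document edges per node pair
    for parent, kids in children.items():
        for kid in kids:
            edges[(cluster_of[parent], cluster_of[kid])] += 1
    return {(u, v): n / size[u] for (u, v), n in edges.items()}

# Toy document: r has one paper, the paper has two sections,
# one section has two figures and the other has one.
children = {"r": ["p1"], "p1": ["s2", "s3"], "s2": ["f4", "f5"], "s3": ["f6"]}
cluster_of = {"r": "R", "p1": "P", "s2": "S", "s3": "S",
              "f4": "F", "f5": "F", "f6": "F"}
counts = edge_counts(children, cluster_of)
```

Here the S node holds two elements with three figure children in total, so the stored count for the edge (S, F) is the average rather than either exact value.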
11
TreeSketch Synopsis
Is there a lossless synopsis?
What is the quality of a lossy synopsis?
[Figure: XML document and its TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4) and edge counts]
12
Count Stability
(u,v) count-stable: all elements in u have the same child-count in v
[Figure: XML document and its TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4) and edge counts]
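A count-stability check can be sketched directly from the definition. The toy data and the name `unstable_edges` are illustrative assumptions, not part of the slides.

```python
from collections import Counter, defaultdict

def unstable_edges(children, cluster_of):
    """Return the synopsis edges (u, v) that are NOT count-stable."""
    members = defaultdict(list)
    for elem, node in cluster_of.items():
        members[node].append(elem)
    # child_count[e][v] = number of children of element e that lie in node v
    child_count = {e: Counter(cluster_of[k] for k in children.get(e, []))
                   for e in cluster_of}
    unstable = set()
    for u, elems in members.items():
        targets = set()
        for e in elems:
            targets.update(child_count[e])
        for v in targets:
            # Stable only if every element of u has the same count (0 included).
            if len({child_count[e][v] for e in elems}) > 1:
                unstable.add((u, v))
    return unstable

children = {"r": ["s2", "s3"], "s2": ["f4", "f5"], "s3": ["f6"]}
lossy = {"r": "R", "s2": "S", "s3": "S", "f4": "F", "f5": "F", "f6": "F"}
split = dict(lossy, s2="S1", s3="S2")   # put each section in its own node

lossy_result = unstable_edges(children, lossy)
split_result = unstable_edges(children, split)
```

In this toy case the single S node mixes a section with two figures and a section with one, so (S, F) is reported; splitting S removes the instability at the price of a larger synopsis.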
13
Count-Stable TreeSketch
A count-stable synopsis can recover the input tree
Efficient one-pass construction
Stable summary can be too large for practical use!
[Figure: XML document and its count-stable TreeSketch with nodes R(1), P(1), S(1), S(1), F(2), F(2), E(2), C(4) and edge counts]
14
Lossy TreeSketch
[Figure: XML document and a lossy TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4), edge counts, and #F child-count annotations]
15
TreeSketches and Clustering
A TreeSketch is an element clustering
  All elements in a node are mapped to a “centroid”
  Tight clusters give an accurate synopsis
Synopsis quality corresponds to clustering error
  Options: Manhattan Distance, Squared Error, …
Quality can be measured independent of a workload
  Key for effective construction
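As a rough illustration of the squared-error option, the sketch below treats each element as a vector of per-node child counts and the centroid as the vector of mean counts. This reading, the toy data and the name `squared_error` are assumptions; the slides do not spell out the exact formula.

```python
from collections import Counter, defaultdict

def squared_error(children, cluster_of):
    """Sum of squared deviations of per-element child counts from the node mean."""
    members = defaultdict(list)
    for elem, node in cluster_of.items():
        members[node].append(elem)
    total = 0.0
    for u, elems in members.items():
        # One child-count vector per element of node u.
        vectors = [Counter(cluster_of[k] for k in children.get(e, []))
                   for e in elems]
        for v in set().union(*vectors):
            centroid = sum(vec[v] for vec in vectors) / len(vectors)
            total += sum((vec[v] - centroid) ** 2 for vec in vectors)
    return total

children = {"r": ["s2", "s3"], "s2": ["f4", "f5"], "s3": ["f6"]}
lossy = {"r": "R", "s2": "S", "s3": "S", "f4": "F", "f5": "F", "f6": "F"}
stable = dict(lossy, s2="S1", s3="S2")

lossy_error = squared_error(children, lossy)
stable_error = squared_error(children, stable)
```

The error is computed from the document and the clustering alone, which matches the point that quality can be measured without a query workload.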
16
Computing Approximate Answers
Compute TreeSketch of approximate answer
Accuracy depends on quality of clustering
[Figure: a twig query (nodes q0 to q3; edge labels //section, .//equation, .//caption) evaluated over the TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4); the approximate nesting tree has nodes R, S, E, C, with one result edge count obtained as 1+1=2]
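One way to read the 1+1=2 annotation is that a descendant step adds up the contributions of every synopsis path between two nodes, each path contributing the product of its edge counts. The sketch below implements only that reading, on a hypothetical acyclic synopsis; it is not the full evaluation algorithm, and `expected_descendants` is an invented name.

```python
def expected_descendants(counts, u, target):
    """Expected number of descendants in node `target` per element of node `u`.

    counts: {(parent_node, child_node): mean child count}. Assumes the
    synopsis graph is acyclic, otherwise the recursion would not terminate.
    """
    total = 0.0
    for (a, b), c in counts.items():
        if a == u:
            direct = 1.0 if b == target else 0.0
            total += c * (direct + expected_descendants(counts, b, target))
    return total

# Hypothetical TreeSketch: captions (C) are reachable from a section (S)
# through a figure node (F) and through an equation node (E).
counts = {("R", "P"): 1.0, ("P", "S"): 2.0,
          ("S", "F"): 1.0, ("S", "E"): 1.0,
          ("F", "C"): 1.0, ("E", "C"): 1.0}

per_section = expected_descendants(counts, "S", "C")
per_root = expected_descendants(counts, "R", "C")
```

With average counts in place of exact ones, the result is only as accurate as the clustering behind the counts.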
17
TreeSketch Construction
Given an XML tree T, build a TreeSketch of size B
Difficult clustering problem
  Space dimensionality depends on the clustering itself
Construction based on bottom-up clustering
  Compress perfect synopsis by merging clusters
  Best merge determined by marginal gains
  Heuristic to reduce number of candidate merges
[Figure: sequence of synopses from the perfect synopsis down to the space budget]
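The bottom-up idea can be sketched as a greedy loop. Several simplifications here are assumptions rather than the authors' algorithm: the budget is a number of synopsis nodes, the "gain" of a merge is just the resulting squared error, and every same-tag pair is tried instead of a pruned candidate pool.

```python
from collections import Counter, defaultdict
from itertools import combinations

def squared_error(children, cluster_of):
    members = defaultdict(list)
    for elem, node in cluster_of.items():
        members[node].append(elem)
    total = 0.0
    for elems in members.values():
        vectors = [Counter(cluster_of[k] for k in children.get(e, []))
                   for e in elems]
        for v in set().union(*vectors):
            mean = sum(vec[v] for vec in vectors) / len(vectors)
            total += sum((vec[v] - mean) ** 2 for vec in vectors)
    return total

def greedy_merge(children, tag_of, cluster_of, budget):
    """Merge same-tag nodes until at most `budget` nodes remain."""
    cluster_of = dict(cluster_of)
    while len(set(cluster_of.values())) > budget:
        best = None
        for a, b in combinations(sorted(set(cluster_of.values())), 2):
            if tag_of[a] != tag_of[b]:
                continue                      # only same-tag nodes can merge
            trial = {e: (a if n == b else n) for e, n in cluster_of.items()}
            cost = squared_error(children, trial)
            if best is None or cost < best[0]:
                best = (cost, trial)
        if best is None:
            break                             # nothing left to merge
        cluster_of = best[1]
    return cluster_of

children = {"r": ["s1", "s2", "s3"],
            "s1": ["f1", "f2"], "s2": ["f3", "f4"], "s3": ["f5"]}
start = {"r": "R", "s1": "Sa", "s2": "Sb", "s3": "Sc",
         "f1": "F", "f2": "F", "f3": "F", "f4": "F", "f5": "F"}
tag_of = {"R": "r", "Sa": "s", "Sb": "s", "Sc": "s", "F": "f"}

result = greedy_merge(children, tag_of, start, budget=4)
```

In the toy run the two sections with identical structure are merged first, because that merge adds no error, while the third section stays in its own node.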
18
Element Simulation Distance
19
Error of Approximation
Error: distance between R’ and R
Popular metric: Tree-edit distance
  Min-cost sequence of operations that transform R’ to R
  Measures syntactic differences between R and R’
  Not intuitive for approximate answers!
[Figure: three result trees T, T1, T2, each with root r and two s children that have e and f children in differing numbers; one alternative has different counts but a similar trait, the other has the same counts but the opposite trait]
20
Element Simulation Distance
Capture approximate similarity between R and R’
u simulates v: u and v have identical structure
ESD(u,v): “degree” of simulation between u,v
  How well the structure of u matches the structure of v
  Modeled as the distance between multi-sets
Efficient computation using perfect summaries
[Figure: trees T and T2 compared through the multisets of e and f children under each s element; recursive application of ESD]
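The two ingredients named on this slide, identical structure and a distance between multisets, can be illustrated with a deliberately crude stand-in. The code below is not the ESD definition: it compares canonical forms of unordered trees for the "simulates" test and uses the size of the multiset symmetric difference of child shapes as a one-level distance. The tree encoding and all function names are assumptions.

```python
from collections import Counter

def shape(tree):
    """Canonical form of an unordered labeled tree given as (tag, [children])."""
    tag, kids = tree
    return (tag, tuple(sorted(shape(k) for k in kids)))

def simulates(u, v):
    """True when u and v have identical structure, ignoring child order."""
    return shape(u) == shape(v)

def child_multiset_distance(u, v):
    """Size of the symmetric difference between the multisets of child shapes."""
    cu = Counter(shape(k) for k in u[1])
    cv = Counter(shape(k) for k in v[1])
    return sum(((cu - cv) + (cv - cu)).values())

e, f = ("e", []), ("f", [])
s_a = ("s", [e] + [f] * 4)        # one equation, four figures
s_b = ("s", [e] * 2 + [f] * 6)    # two equations, six figures

same = simulates(("s", [e, f]), ("s", [f, e]))
dist = child_multiset_distance(s_a, s_b)
```

A recursive metric would replace the exact shape comparison of children with the distance itself, so that nearly identical subtrees count as close instead of simply different.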
21
Experimental Results
22
Methodology
Data Sets: XMark, DBLP, IMDB, SwissProt
Workload: 1000 random twig queries
Evaluation metrics:
  Average ESD for approximate answers
  Mean absolute relative error for selectivity estimation
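For reference, mean absolute relative error over a workload can be computed as below. This is the plain textbook form and assumes every true selectivity is positive; whether the study applies a lower bound for very small results is not stated on the slide. The function name and the numbers are made up.

```python
def mean_absolute_relative_error(true_counts, estimates):
    """Average of |estimate - true| / true over all queries (true > 0 assumed)."""
    if len(true_counts) != len(estimates):
        raise ValueError("need one estimate per query")
    errors = [abs(est - true) / true
              for true, est in zip(true_counts, estimates)]
    return sum(errors) / len(errors)

# Two hypothetical queries: true result sizes and their estimates.
error = mean_absolute_relative_error([100, 50], [110, 40])
```

The first query is off by 10% and the second by 20%, so the workload error is their average.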
23
Approximate Answers - IMDB
IMDB (~102K Elements)
Avg. Result Size: 3,477 tuples
24
Selectivity Estimation - SwissProt
SwissProt (~182K Elements)
Avg. Result Size: 104,592 tuples
25
Selectivity Estimation - ALL

Data Set   #Elements (x 10^3)   #Tuples (x 10^3)
DBLP       1,500                78
IMDB       236                  13
S-Prot     473                  365
XMark      2,000                145

Data Set   Construction Time (min)
DBLP       11
IMDB       2.5
S-Prot     38
XMark      240
26
Conclusions
Approximate query answering for XML databases
TreeSketch Synopses
  Structural summaries for tree-structured XML
  Approximate answers for twig-queries
  Model: Graph Synopsis + Edge-counts
  Efficient processing and construction
Element Simulation Distance
  Capture approximate similarity between XML trees
Experimental Results
  High accuracy for low space budgets
  Efficient construction
27
Questions?
28
TreeSketch Model (2/2)
Edge count: average number of children
[Figure: XML document and its TreeSketch with nodes R, P(1), S(2), F(2), F(2), E(2), C(4), edge counts, and #E and #C child-count annotations]
29
XML
[Figure: example XML document tree]
Tag legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation
30
TreeSketch Synopsis
Augment graph-synopsis with edge counts
count[u,v]: mean #children in v per element in u
[Figure: XML document and a TreeSketch with nodes R(1), P(1), S(2), F(4), E(2), C(4), edge counts 2, 1, 2, 2, 1, 0.5, and a #F child-count annotation]
31
Depth-Guided Merging
Key observation: two elements have similar structure if their children have similar structure
Bottom-up merging, based on depth
  Depth: distance from the leaves of the tree
  Build a pool of candidate merges by increasing depth
  Replenish the pool when it falls below a given threshold
Reduced construction time, accurate synopses
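The depth ordering can be sketched as follows. The slide does not say whether depth is the longest or the shortest distance to a leaf; the sketch assumes the longest (the subtree height). The pool layout and all names are hypothetical, and the threshold-based replenishing is left out.

```python
from collections import defaultdict
from itertools import combinations

def depths(children, root):
    """Map each element to its height: 0 for leaves, 1 + max child height otherwise."""
    depth = {}
    def visit(elem):
        kids = children.get(elem, [])
        depth[elem] = 1 + max(visit(k) for k in kids) if kids else 0
        return depth[elem]
    visit(root)
    return depth

def candidate_pool(children, root, tag_of):
    """Same-tag, same-depth element pairs, listed by increasing depth."""
    by_level = defaultdict(list)
    for elem, d in depths(children, root).items():
        by_level[(d, tag_of[elem])].append(elem)
    pool = []
    for (d, _tag), elems in sorted(by_level.items()):
        pool.extend((d, a, b) for a, b in combinations(sorted(elems), 2))
    return pool

children = {"r": ["s2", "s3"], "s2": ["f4", "f5"], "s3": ["f6"]}
tag_of = {"r": "r", "s2": "s", "s3": "s", "f4": "f", "f5": "f", "f6": "f"}

depth = depths(children, "r")
pool = candidate_pool(children, "r", tag_of)
```

Leaf-level pairs appear first in the pool, so by the time two sections are considered for merging, the figures below them have already been considered.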
32
Depth-Guided Merging
Observation: two elements have similar structure if their children have similar structure
Heuristic: if a merge of two clusters is good, then merges of the child clusters are likely to have been good as well
  Bottom-up merging strategy
Savings in construction time, accurate synopses