1
Organizing and Searching Information with XML: Selectivity Estimation for XML Queries
Thomas Beer, Christian Linz, Mostafa Khabouze
2
Outline
Definition: Selectivity Estimation
Motivation
Algorithms for Selectivity Estimation
o Path Trees
o Markov Tables
o XPathLearner
o XSketches
Summary
3
Selectivity: Definition
The selectivity of a path expression, σ(p), is defined as the number of paths in the XML data tree that match the tag sequence in p.
[Figure: example XML data tree with nodes A, B, C, D, D, E]
Example: σ(A/B/D) = 2
4
Motivation
Estimating the size of query results and intermediate results is necessary for effective query optimization. Knowing the selectivities of sub-queries helps identify cheap query evaluation plans.
Internet context: quick feedback about the expected result size before evaluating the full query.
5
Example
XQuery expression:
FOR $f IN document("personnel.xml")//department/faculty
WHERE count($f/TA) > 0 AND count($f/RA) > 0
RETURN $f
This expression matches all faculty members that have at least one TA and one RA; one join is computed for every edge.
Assumptions: the number of nodes is known; the join algorithm is nested loop.
[Figure: path Department - Faculty - RA/TA]
6
Evaluating the join
Node counts: Dep. 1, Faculty 3, RA 7, TA 2
[Figure: sample document tree - Department (Name, Faculty, Secretary), Faculty (Name, RA, TA), Faculty (RA), Scientist (Name, RA)]

Method 1: Join 1: (Faculty) - TA; Join 2: (Result Join 1) - RA; Join 3: (Result Join 2) - Dep.
Number of operations: Join 1: 3 * 2 = 6; Join 2: 1 * 7 = 7; Join 3: 1 * 1 = 1; Total = 14

Method 2: Join 1: (Faculty) - Dep.; Join 2: (Result Join 1) - RA; Join 3: (Result Join 2) - TA
Number of operations: Join 1: 3 * 1 = 3; Join 2: 3 * 7 = 21; Join 3: 3 * 2 = 6; Total = 30
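A minimal sketch (Python, not part of the original slides) of how the operation counts above come about; the node counts and the intermediate result sizes (1 after each join in Method 1, 3 in Method 2) are taken from the slide.

```python
def nested_loop_cost(outer_sizes, inner_sizes):
    """Each join compares every tuple of the current intermediate result
    (outer) with every tuple of the next operand (inner)."""
    return sum(o * i for o, i in zip(outer_sizes, inner_sizes))

# Method 1: (Faculty ⋈ TA) ⋈ RA ⋈ Dep.; intermediate results have size 1
print(nested_loop_cost([3, 1, 1], [2, 7, 1]))   # 6 + 7 + 1 = 14

# Method 2: (Faculty ⋈ Dep.) ⋈ RA ⋈ TA; intermediate results have size 3
print(nested_loop_cost([3, 3, 3], [1, 7, 2]))   # 3 + 21 + 6 = 30
```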
7
Outline
Motivation
Definition: Selectivity Estimation
Algorithms for Selectivity Estimation
o Path Trees
o Markov Tables
o XPathLearner
o XSketches
Summary
8
Representing the XML data structure: Path Trees and Markov Tables
9
Path Trees
[Figure: path tree with nodes A, B, C, D, D, E and their frequencies]
Problem: the path tree may become larger than the available memory, so the tree has to be summarized.
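A minimal sketch (not from the slides) of how such a path tree can be derived: every root-to-node tag path is counted, which yields the per-node frequencies shown in the figure. The example document is a hypothetical one matching the earlier A/B/D example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_path_tree(xml_text):
    """Return a dict mapping each root-to-node tag path (tuple of tags)
    to the number of document nodes reached by that path."""
    counts = defaultdict(int)

    def walk(elem, prefix):
        path = prefix + (elem.tag,)
        counts[path] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return dict(counts)

doc = "<A><B><D/><D/><E/></B><C><D/></C></A>"     # hypothetical document
for path, freq in sorted(build_path_tree(doc).items()):
    print("/".join(path), freq)                   # e.g. A/B/D 2
```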
10
Summarizing a Path Tree
Four different algorithms: Sibling-*, Level-*, Global-*, No-*
Idea: delete the nodes with the lowest frequencies and replace them with a "*" (star-node) to preserve some structural information.
11
Sibling-*: operation breakdown
Mark the nodes with the lowest frequencies for deletion.
Check siblings: siblings marked for deletion are coalesced into a single *-node.
Traverse the tree and compute the average frequency: the *-node stores the number of coalesced nodes (n) and their average frequency (f), e.g. n = 2, f = 6.
[Figure: original path tree and its Sibling-* summary]
12
Level-*
As before, delete the nodes with the lowest frequencies; deleted nodes are coalesced into one *-node for every level.
[Figure: original path tree and its Level-* summary]
13
Global-*
Delete the nodes with the lowest frequencies; all deleted nodes are coalesced into one *-node for the complete tree.
[Figure: original path tree and its Global-* summary]
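A minimal sketch of the Global-* idea (an illustration under assumptions, not the authors' code), with the path tree stored as {path tuple: frequency} as in the construction sketch above: the k lowest-frequency entries are removed and summarized by a single *-node.

```python
def global_star(path_tree, k):
    """Remove the k lowest-frequency nodes and summarize them in one *-node."""
    victims = set(sorted(path_tree, key=path_tree.get)[:k])
    kept = {p: f for p, f in path_tree.items() if p not in victims}
    deleted = [path_tree[p] for p in victims]
    # the single *-node stores how many nodes it replaces and their
    # average frequency, which is later used for estimation
    star = (len(deleted), sum(deleted) / len(deleted)) if deleted else (0, 0)
    return kept, star
```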
14
No-*
Low-frequency nodes are deleted and not replaced; the tree may become a forest with many roots. No-* conservatively assumes that nodes that do not exist in the summarized path tree did not exist in the original path tree.
15
Selectivity Estimation
[Figure: summarized path tree with *-nodes and their frequencies]
Find all matching tags; the estimated selectivity is the total frequency of these nodes (a *-node matches any tag).
Example: σ(A/B/F) = 15 + 6 = 21, σ(A/B/Z) = 6, σ(A/C/Z/K) = 11
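A minimal sketch of this matching step, assuming the summarized path tree is stored as nested dicts of the form {tag: (frequency, children)} and that a *-node matches any query tag. The tree below is only an illustrative reconstruction of the slide's figure; with it, the sketch reproduces the three example selectivities.

```python
def estimate(tree, path):
    """Sum the frequencies of every node reached by following `path`,
    letting *-nodes stand in for deleted tags."""
    head, rest = path[0], path[1:]
    total = 0
    for tag, (freq, children) in tree.items():
        if tag == head or tag == "*":
            total += freq if not rest else estimate(children, rest)
    return total

# illustrative summarized tree (frequencies reconstructed from the figure)
summary = {"A": (1, {"B": (9, {"F": (15, {}), "*": (6, {})}),
                     "C": (8, {"*": (3, {"K": (11, {})})})})}
print(estimate(summary, ["A", "B", "F"]))       # 15 + 6 = 21
print(estimate(summary, ["A", "B", "Z"]))       # 6
print(estimate(summary, ["A", "C", "Z", "K"]))  # 11
```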
16
Outline
Motivation
Definition: Selectivity Estimation
Algorithms for Selectivity Estimation
o Path Trees
o Markov Tables
o XPathLearner
o XSketches
Summary
17
What are Markov Tables?
A table containing all distinct paths in the data of length up to m and their selectivities (here m = 2); the order of the table is m - 1. A Markov table is also called a Markov histogram.
[Figure: example data tree - A(1) with children B(11), C(6), D(4); B with children C(9), D(7); C with child D(8)]

Path  Sel.    Path  Sel.
A     1       AC    6
B     11      AD    4
C     15      BC    9
D     19      BD    7
AB    11      CD    8
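A minimal sketch (an assumption about the representation, not the authors' code) of deriving such an order-1 Markov table, i.e. all distinct paths of length 1 and 2 with their selectivities, from path-tree counts of the form produced in the earlier construction sketch.

```python
from collections import defaultdict

def markov_table(path_tree, m=2):
    """Aggregate path-tree frequencies into selectivities of all tag paths
    of length up to m (here m = 2: single tags and tag pairs)."""
    table = defaultdict(int)
    for path, freq in path_tree.items():
        for length in range(1, m + 1):
            if len(path) >= length:
                table[path[-length:]] += freq   # suffix of the node's path
    return dict(table)
```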
18
Selectivity Estimation
The table provides selectivity estimates for all paths of length up to m. The assumption is that the occurrence of a particular tag in a path depends only on the m-1 tags occurring before it. Selectivity estimation for longer path expressions is done with the following formula.
19
Selectivity Estimation
P[t_n]: probability of tag t_n occurring in the XML data tree
N: total number of nodes in the XML data tree
P[t_i | t_{i+1}]: probability of tag t_i occurring before tag t_{i+1}
E: predictand for the occurrence of tag t_n
E1: predictand for the occurrence of tag t_i before tag t_{i+1}
The path t_1/t_2/.../t_n is treated as a Markov chain: each tag depends only on the tag directly before it.
20
Selectivity Estimation
σ(p): selectivity of path p = t_1/t_2/.../t_n
σ(p) ≈ f(t_1, t_2) × ∏_{i=2}^{n-1} f(t_i, t_{i+1}) / f(t_i)
Example: see the sketch below, which evaluates the formula on the table above.
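A minimal sketch of this estimation using the Markov table from the slide above; the example paths A/B/C and A/B/D are my own illustration, not taken from the deck.

```python
# Order-1 Markov estimate: sigma(t1/.../tn) ≈ f(t1,t2) * prod f(ti,ti+1)/f(ti)
table = {"A": 1, "B": 11, "C": 15, "D": 19,
         "AB": 11, "AC": 6, "AD": 4, "BC": 9, "BD": 7, "CD": 8}

def estimate(path):
    sel = table[path[0] + path[1]]
    for i in range(1, len(path) - 1):
        sel *= table[path[i] + path[i + 1]] / table[path[i]]
    return sel

print(estimate("ABC"))   # 11 * 9/11 = 9.0
print(estimate("ABD"))   # 11 * 7/11 = 7.0
```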
21
Summarizing Markov Tables
The entries with the lowest selectivity are deleted and replaced.
Three algorithms: Suffix-*, Global-*, No-*
22
Suffix-*
* path: represents all deleted paths of length 1
*/* path: represents all deleted paths of length 2
Deleting a path of length 1: add its selectivity to the * path.
Deleting a path of length 2: add it to S_D, the set of deleted paths of length 2, and look for deleted paths with the same start tag; such paths are combined into a suffix-* path.
Example: S_D = {(A/C), (G/H)}; deleting (A/B) creates the suffix-* path (A/*).
Before checking S_D, check the Markov table for a matching suffix-* path.
23
Global-*
* path: represents all deleted paths of length 1
*/* path: represents all deleted paths of length 2
Deleting a path of length 1: add its selectivity to the * path.
Deleting a path of length 2: immediately add its selectivity to the */* path.
24
No-*
Does not use *-paths; low-frequency paths are simply discarded. If any of the required paths is not found in the Markov table, its selectivity is conservatively assumed to be zero.
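A minimal sketch (my interpretation of the three variants, not the authors' code) of how a lookup against a summarized Markov table falls back to the special entries; paths are tuples of one or two tags.

```python
def lookup(table, path, no_star=False):
    """Look up the selectivity of a path of length 1 or 2 in a summarized
    Markov table.  Special keys: ('*',) for deleted length-1 paths,
    (tag, '*') for Suffix-* entries, ('*', '*') for deleted length-2 paths."""
    if path in table:
        return table[path]
    if no_star:
        return 0                          # No-*: missing paths count as zero
    if len(path) == 1:
        return table.get(("*",), 0)
    # Suffix-*: prefer a suffix-* entry for this start tag, otherwise fall
    # back to the global */* entry (Global-*)
    return table.get((path[0], "*"), table.get(("*", "*"), 0))
```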
25
Which method should be used?
"*" vs. "No-*": if the queried paths exist in the XML data, use a *-algorithm; if the paths do not exist, use the No-* algorithm.
Path Trees vs. Markov Tables: if the data has a common structure, use Markov Tables; if the data has no common structure, use Path Trees.
26
Outline
Motivation
Definition: Selectivity Estimation
Algorithms for Selectivity Estimation
o Path Trees
o Markov Tables
o XPathLearner
o XSketches
Summary
27
Weaknesses of previous methods
Off-line: require a scan of the entire data set
Limited to simple path expressions
Oblivious to the workload distribution
Updates are too expensive
28
XPathLearner is...
An on-line, self-tuning Markov histogram for XML path selectivity estimation.
On-line: collects statistics from query feedback.
Self-tuning: learns the Markov model from feedback and adapts to changing XML data.
Workload-aware.
Supports simple, single-value and multi-value path expressions.
29
System Overview
[Figure: system architecture - the query optimizer's plan enumerator consults the selectivity estimator (backed by the histogram); the execution engine runs the chosen query plan, returns the result, and sends query feedback to the histogram learner.]
30
Histogram Learner
[Figure: workflow - initial training data seeds the histogram; query feedback (real selectivity) and the estimated selectivity yield the observed estimation error, which drives histogram updates.]
The system uses feedback to update the statistics for the queried path. Updates are based on the observed estimation error.
31
Basics
Relies on path trees as an intermediate representation. Uses a Markov histogram of order m-1 to store the path tree and the statistics. Henceforth m = 2: the table stores tag-tag pairs, tag-value pairs and single tags.
32
Data values
Problem: the number of distinct data values is very large, so the table may become larger than the available memory.
Solution: only the k most frequent tag-value pairs are stored exactly; all other pairs are aggregated into buckets according to some feature. The feature should distribute the pairs as uniformly as possible.
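A minimal sketch of this bucketing scheme, assuming the feature is the first character of the value, as in the example on the next slide; the concrete pair counts below are hypothetical but chosen to mirror that example.

```python
from collections import Counter, defaultdict

def summarize_values(pair_counts, k, feature=lambda v: v[0]):
    """Keep the k most frequent (tag, value) pairs exactly; aggregate the
    rest into (tag, feature) buckets storing the summed count and #pairs."""
    exact = dict(Counter(pair_counts).most_common(k))
    buckets = defaultdict(lambda: [0, 0])
    for (tag, value), count in pair_counts.items():
        if (tag, value) not in exact:
            buckets[(tag, feature(value))][0] += count
            buckets[(tag, feature(value))][1] += 1
    return exact, dict(buckets)

pairs = {("B", "v3"): 3, ("B", "bar"): 1, ("C", "ada"): 1}   # hypothetical
print(summarize_values(pairs, k=1))
# ({('B', 'v3'): 3}, {('B', 'b'): [1, 1], ('C', 'a'): [1, 1]})
```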
33
Example, k = 1
[Figure: data tree with root A(1), children B(6) and C(3), and data values v1, v2, v3]
Data value v1 begins with the letter 'a', v2 with the letter 'b'.
Single-tag counts: A 1, B 6, C 3
Tag-tag counts: AB 6, AC 3
Exact tag-value counts: (B, v3) 3
Buckets: Tag B, Feature 'b', Sum 1, #pairs 1; Tag C, Feature 'a', Sum 1, #pairs 1
34
Selectivity Estimation
P[t_n]: probability of tag t_n occurring in the XML data tree
N: total number of nodes in the XML data tree
P[t_i | t_{i+1}]: probability of tag t_i occurring before tag t_{i+1}
E: expectation for the occurrence of tag t_n
E1: expectation for the occurrence of tag t_i before tag t_{i+1} (if n = 2, t_{i+1} = t_n)
35
Selectivity Estimation
Simple path: p = //t_1/t_2/.../t_n
Analogous for a single-value path: p = //t_1/t_2/.../t_{n-1}=v_{n-1}
Slightly more complicated for a multi-value path.
36
Selectivity Estimation
Simple path p = //t_1/t_2/.../t_n:
σ(p) ≈ f(t_1, t_2) × ∏_{i=2}^{n-1} f(t_i, t_{i+1}) / f(t_i)
Single-value path p = //t_1/t_2/.../t_{n-1}=v_{n-1}: the same product with one extra factor f(t_{n-1}, v_{n-1}) / f(t_{n-1}) for the value.
37
Selectivity Estimation of a multi-value path p = //t_1=v_1/t_2=v_2/.../t_n=v_n
The simple-path estimate is multiplied, for each i, by f(t_i, v_i) / f(t_i): the probability of value v_i occurring at t_i, conditioned on observing t_i.
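A minimal sketch of the three estimates, assuming the histogram is kept as three dicts: single-tag counts f1, tag-tag counts f2 and tag-value counts fv (exact entries or bucket averages). This is my reading of the formulas above, not the authors' code, and the queried path //A/B=v3 in the usage example is my own illustration.

```python
def simple(f1, f2, tags):
    """sigma(//t1/.../tn) ≈ f(t1,t2) * prod_{i=2..n-1} f(ti,ti+1)/f(ti)."""
    sel = f2[(tags[0], tags[1])]
    for i in range(1, len(tags) - 1):
        sel *= f2[(tags[i], tags[i + 1])] / f1[tags[i]]
    return sel

def single_value(f1, f2, fv, tags, value):
    """As `simple`, with one extra factor for the value on the last tag."""
    sel = simple(f1, f2, tags) if len(tags) > 1 else f1[tags[0]]
    return sel * fv[(tags[-1], value)] / f1[tags[-1]]

def multi_value(f1, f2, fv, tags, values):
    """Multiply the simple-path estimate by f(ti,vi)/f(ti) for every value."""
    sel = simple(f1, f2, tags) if len(tags) > 1 else f1[tags[0]]
    for t, v in zip(tags, values):
        sel *= fv[(t, v)] / f1[t]
    return sel

# With the histogram from the example slides (hypothetical query //A/B=v3):
f1 = {"A": 1, "B": 6, "C": 3}
f2 = {("A", "B"): 6, ("A", "C"): 3}
fv = {("B", "v3"): 3}
print(single_value(f1, f2, fv, ["A", "B"], "v3"))   # 6 * 3/6 = 3.0
```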
38
Example
Histogram as on the previous slide: single-tag counts A 1, B 6, C 3; tag-tag counts AB 6, AC 3; exact tag-value count (B, v3) 3; buckets (B, 'b') and (C, 'a'), each with sum 1 and 1 pair.
Real selectivity = 3
39
Updates
Changes in the data require the statistics to be updated. This is done via a query feedback tuple (p, σ), where p denotes the path and σ the accurate selectivity of p. The feedback is distributed over the parts of p according to some update strategy.
40
Learning process
Given: an initially empty Markov histogram f, query feedback (p, σ), and the estimated selectivity σ̂.
Learn any unknown path of length up to 2; update the selectivities of known paths.
Two strategies:
o Heavy-Tail Rule
o Delta Rule
41
Algorithm, Part 1: learn new paths of length up to 2
UPDATE(Histogram f, Feedback (p, σ), Estimate σ̂)
  if |p| ≤ 2 then
    if not exists f(p) then add entry f(p) = σ
    else f(p) ← σ
Example: σ̂(A/D) = 1 (A/D not in f), σ(A/D) = 2: the entry f(A,D) = 2 is added.
[Tables: single-tag counts A 1, B 6, C 3; tag-tag counts AB 6, AC 3, AD 2; tag-value and bucket counts as before]
42
Algorithm, Part 2: learn longer paths (decompose into paths of length 2)
  else
    for each (t_i, t_{i+1}) ∈ p
      if not exists f(t_i, t_{i+1}) then add entry f(t_i, t_{i+1}) = 1
      f(t_i, t_{i+1}) ← update
    endfor
The update of f(t_i, t_{i+1}) depends on the update strategy.
43
Example
σ̂(A/C/D) = 1, σ(A/C/D) = 5: decompose into AC and CD.
AC is present: update its frequency. CD is not present: add the entry f(C,D) = 1, then update it.
[Tables: single-tag counts A 1, B 6, C 3; tag-tag counts AB 6, AC 5, CD 4; tag-value and bucket counts as before]
44
Algorithm, Part 3: learn the frequency of single tags
    for each t_i ∈ p, i ≠ 1
      if not exists f(t_i) then add entry f(t_i)
      f(t_i) ← max{ f(t_i), f(·, t_i) }   (the count of the pair ending in t_i)
    endfor
Example: σ̂(A/D) = 1 (A/D not in f), σ(A/D) = 2: after Part 1 added f(A,D) = 2, Part 3 sets f(D) = 2.
[Tables: single-tag counts A 1, B 6, C 3, D 2; tag-tag, tag-value and bucket counts as before]
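A minimal sketch tying parts 1-3 together, under the assumption that length-1 and length-2 paths share one dict f keyed by tuples; `update_rule` stands for either the Heavy-Tail or the Delta rule from the following slides. It is an illustration of the pseudocode, not the authors' implementation.

```python
def learn(f, path, true_sel, est_sel, update_rule):
    """path: tuple of tags; true_sel: feedback selectivity; est_sel: estimate."""
    if len(path) <= 2:
        # Part 1: feedback on short paths is stored exactly
        f[path] = true_sel
    else:
        # Part 2: decompose into length-2 steps and let the update rule
        # distribute the estimation error over them
        for step in zip(path, path[1:]):
            f.setdefault(step, 1)
        update_rule(f, path, true_sel, est_sel)
    # Part 3: keep single-tag counts consistent with the pair counts
    for prev, t in zip(path, path[1:]):
        f[(t,)] = max(f.get((t,), 0), f.get((prev, t), 0))
```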
45
Update strategies: Heavy-Tail Rule
Attribute more of the estimation error to the end of the path:
f(t_i, t_{i+1}) ← f(t_i, t_{i+1}) + γ · (w_i / W) · Δ
where Δ = σ - σ̂ is the estimation error, the w_i are weighting factors increasing with i (e.g. 2^i), W = Σ_i w_i is the normalizing weight, and γ is the learning rate.
46
Update strategies: Delta Rule
An error-reduction learning technique that minimizes an error function E = (σ - σ̂)² / 2.
The update to the term f(t_i, t_{i+1}) is proportional to the negative gradient of E with respect to f(t_i, t_{i+1}); γ determines the length of a step.
47
Update strategies: Delta Rule (continued)
The update to each term f(·, ·) appearing in the estimate is proportional to the negative gradient of E with respect to that term; γ determines the length of a step.
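Minimal sketches of the two rules, matching the `update_rule` signature used in the learning sketch above. The Delta-rule gradient of the estimate with respect to f(ti,ti+1) is approximated here by est_sel / f(ti,ti+1); the slide's exact formulas are not shown, so treat the constants and this approximation as assumptions.

```python
def heavy_tail(f, path, true_sel, est_sel, gamma=1.0):
    """Distribute the error with weights growing towards the end of the path."""
    delta = true_sel - est_sel
    pairs = list(zip(path, path[1:]))
    weights = [2 ** i for i in range(len(pairs))]       # w_i increasing with i
    W = sum(weights)
    for (ti, tj), w in zip(pairs, weights):
        f[(ti, tj)] += gamma * delta * w / W

def delta_rule(f, path, true_sel, est_sel, gamma=0.5):
    """Gradient step on E = (sigma - sigma_hat)^2 / 2 for every pair count."""
    delta = true_sel - est_sel
    for ti, tj in zip(path, path[1:]):
        # d sigma_hat / d f(ti,tj) ≈ sigma_hat / f(ti,tj) for the Markov estimate
        f[(ti, tj)] += gamma * delta * est_sel / f[(ti, tj)]
```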
48
Evaluation
Good: on-line, adapts to changing data; workload-aware; comparable to off-line methods after the learning phase; update overhead is nearly constant.
Bad: still restricted to XML trees, no support for idrefs.
49
Example
Feedback for path ACD is (ACD, 6); σ̂(ACD) ≈ 3, so ε = 6 - 3 = 3.
Updates (Heavy-Tail attributes more of the error to the end of the path):

Path  before update  after update (Heavy-Tail, γ=1)  after update (Delta, γ=0.5)
AC    3              4                               5
CD    6              8                               8
C     7              8                               9
D     7              9                               9
50
Outline
Motivation
Definition: Selectivity Estimation
Algorithms for Selectivity Estimation
o Path Trees and Markov Tables
o XPathLearner
o XSketches
Summary
51
Preliminaries: XML Data Graph
Labels: A: Author, P: Paper, B: Book, PB: Publisher, T: Title, N: Name
[Figure: example XML data graph with element nodes P0, A1, A2, PB3, N4, B5, P6, P7, N8, B9, T10, T11, T12, T13, E14 and value nodes V4, V8, V10-V14]
52
Preliminaries: Path Expressions
XPath expressions: simple (A/P/T), complex (A[B]/P/T); the result is a set.
[Figure: the example XML data graph]
53
Preliminaries: Path Expressions (continued)
[Figure: evaluation of the path expression on the data graph, highlighting the matching nodes]
54
Preliminaries: Path Expressions
XPath expressions: simple (A/P/T), complex (A[B]/P/T).
The result is a set: {T11, T12}
[Figure: the data graph with the result nodes T11 and T12 highlighted]
55
Preliminaries: Motivation
Selectivity estimation over XML data graphs.
Outline:
o XSketch Synopsis
o Estimation Framework
o XSketch Refinement Operations
o Experiment
56
XSketch Synopsis
[Figure: XML data graph and its general synopsis graph with nodes P(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1)]
Count(A) = |Extent(A)| = |{A1, A2}| = 2
57
Backward-edge Stability
Label(u, v) = b if all elements in v have a parent in u.
[Figure: XML data graph and synopsis graph with b-labelled edges]
58
Backward-edge Stability
Label(A(2), B(2)) and Label(PB(1), B(2)) are empty: not every B element has a parent in the respective extent.
[Figure: synopsis graph with the two non-stable edges highlighted]
59
Forward-edge Stability
Label(u, v) = f if all elements in u have a child in v.
[Figure: XML data graph and synopsis graph with f-labelled edges]
60
Forward-edge Stability
B9 is in B(2) but has no child in E(1), so the edge from B(2) to E(1) is not f-stable.
[Figure: synopsis graph with the non-stable edge highlighted]
61
XSketch Synopsis
An XSketch is a synopsis graph in which every edge (u, v) carries a label from {b, f, b/f, Ø}.
[Figure: XML data graph and its XSketch synopsis with labelled edges]
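A minimal sketch (an illustrative data model, not the paper's construction algorithm) of a label-split synopsis with per-edge stability labels. It groups data-graph nodes into one synopsis node per label, a simplification of the figure, where a label may have several partitions, and marks each synopsis edge as b-stable and/or f-stable according to the definitions above.

```python
from collections import defaultdict

def build_synopsis(nodes, edges):
    """nodes: {node_id: label}; edges: iterable of (parent_id, child_id)."""
    extent = defaultdict(set)
    for nid, label in nodes.items():
        extent[label].add(nid)

    children, parents = defaultdict(set), defaultdict(set)
    for p, c in edges:
        children[p].add(c)
        parents[c].add(p)

    syn_edges = {}
    for lu, u_ext in extent.items():
        for lv, v_ext in extent.items():
            # is there at least one data edge from the extent of lu to lv?
            if not any(children[p] & v_ext for p in u_ext):
                continue
            label = set()
            if all(parents[c] & u_ext for c in v_ext):
                label.add("b")          # every element of v has a parent in u
            if all(children[p] & v_ext for p in u_ext):
                label.add("f")          # every element of u has a child in v
            syn_edges[(lu, lv)] = label
    counts = {l: len(e) for l, e in extent.items()}
    return counts, syn_edges
```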
62
Estimation Framework
Calculate the selectivity of the path expression V = V1/.../Vn: Count(V) = Count(Vn) * f(V).
Case 1: if Label(Vi, Vi+1) = {b} for all i, then f(V) = 1, so Count(V) = Count(Vn).
Example (on the XSketch above): Count(A/P/T) = Count(T) * f(A/P/T) = 2
63
Estimation Framework
Case 2: there exists an i such that Label(Vi, Vi+1) ≠ {b}.
A1. Path-Independence Assumption: f(u/v | v/w) ≈ f(u/v)
A2. B-Edge Uniformity Assumption: the elements of v whose incoming edge Label(u, v) is not b-stable are uniformly distributed over all such parent nodes u.
Example: f(P/PB/B/T) = ?
64
Estimation Framework
Example: f(P/PB/B/T) = ?
f(P/PB/B/T) = f(B/T) * f(P/PB/B | B/T) = f(B/T) * f(PB/B | B/T) * f(P/PB | PB/B/T)
By b-stability, f(B/T) = 1 and f(P/PB | PB/B/T) = 1, leaving f(PB/B | B/T).
A1: f(PB/B | B/T) ≈ f(PB/B)
A2: f(PB/B) = Count(PB) / [Count(PB) + Count(A)]
f(P/PB/B/T) = 1 / (1 + 2) = 1/3
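A tiny numeric sketch of the computation above, using the counts from the XSketch figure; the b-stable steps contribute a factor of 1 and the non-stable step PB to B contributes Count(PB)/(Count(PB)+Count(A)) under A2.

```python
count = {"A": 2, "P": 1, "PB": 1, "B": 2, "T": 2}

f_PB_B = count["PB"] / (count["PB"] + count["A"])   # A2: uniform over parents
f_path = 1 * f_PB_B * 1                             # b-stable steps give 1
print(f_path)                                       # 1/3
print(count["T"] * f_path)                          # Count(P/PB/B/T) estimate
```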
65
Estimation Framework
A3. Branch-Independence Assumption: outgoing paths from v are conditionally independent of the existence of other outgoing paths.
A4. Forward-Edge Uniformity Assumption: the outgoing edges from v to all children u such that Label(v, u) ≠ f are uniformly distributed across all such children.
66
XSketch Refinement Operations
Goal: construct an efficient XSketch for a given space budget.
Refinement operations:
b-Stabilize(XS(G), u, v): applicable when Label(v, u) ≠ b. Refine node u into two element partitions u1, u2 with the same label such that Label(v, u1) = b or Label(v, u2) = b.
[Figure: node u with parents V1...Vn split into u1, u2 so that one incoming edge becomes b-stable]
67
XSketch Refinement Operations
f-Stabilize(XS(G), u, w): applicable when Label(u, w) ≠ f. Refine u into two nodes u1, u2 with the same label such that Label(u1, w) = Label(u, w) ∪ {f}.
[Figure: node u with children W1...Wn split into u1, u2 so that the edge u1 to w becomes f-stable]
68
XSketch Refinement Operations
[Figure: the Backward-Split refinement operation on a node A with children P1...Pi...Pn]
69
Experiment

Dataset   Nr. of elements   Coarsest summary (KB)   Perfect summary (MB)
IMDB      102,755           5.7                     1.5
XMark     206,131           3.7                     6.2
DBLP      1,399,766         17.0                    0.1
70
Workload
1000 positive PEs (path expressions), a biased random sample from the document; path length 2-5.
500 contain range predicates (predicates: random, 10% of the value domain).
Similar results with negative PEs.
71
Accuracy Metric: Average Absolute Relative Error
72
Markov Tables vs. XSketch
73
Outline
Motivation
Definition: Selectivity Estimation
Algorithms for Selectivity Estimation
o Path Trees and Markov Tables
o XPathLearner
o XSketches
Summary
74
Summary
Definition: Selectivity
Summarizing XML documents (Path Trees / Markov Tables)
Application using Markov Tables: XPathLearner
Extension of selectivity estimation to graphs: XSketch
75
Questions?