Organizing and Searching Information with XML
Selectivity Estimation for XML Queries
Thomas Beer, Christian Linz, Mostafa Khabouze
Outline
- Definition: Selectivity Estimation
- Motivation
- Algorithms for Selectivity Estimation
  - Path Trees
  - Markov Tables
  - XPathLearner
  - XSketches
- Summary
Definition: Selectivity
The selectivity σ(p) of a path expression p is defined as the number of paths in the XML data tree that match the tag sequence in p.
[Figure: example XML data tree with nodes A, B, C, D, D, E]
Example: σ(A/B/D) = 2
Motivation
Estimating the size of query results and intermediate results is necessary for effective query optimization.
Knowing the selectivities of sub-queries helps identify cheap query evaluation plans.
Internet context: quick feedback about the expected result size before evaluating the full query.
Example XQuery expression:

  FOR $f IN document("personnel.xml")//department/faculty
  WHERE count($f/TA) > 0 AND count($f/RA) > 0
  RETURN $f

This expression matches all faculty members that have at least one TA and one RA. One join is computed for every edge of the query pattern (Department - Faculty - TA, RA).
Assumptions: the number of nodes per tag is known; the join algorithm is nested loop.
Node counts: Department 1, Faculty 3, RA 7, TA 2
[Figure: personnel.xml data tree with Department, Faculty, Secretary, Scientist, Name, RA and TA nodes]

Evaluating the joins:

Method 1
  Join 1: Faculty - TA
  Join 2: (Result of Join 1) - RA
  Join 3: (Result of Join 2) - Department
  Number of operations: Join 1: 3 * 2 = 6, Join 2: 1 * 7 = 7, Join 3: 1 * 1 = 1, Total = 14

Method 2
  Join 1: Faculty - Department
  Join 2: (Result of Join 1) - RA
  Join 3: (Result of Join 2) - TA
  Number of operations: Join 1: 3 * 1 = 3, Join 2: 3 * 7 = 21, Join 3: 3 * 2 = 6, Total = 30
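The cost arithmetic above can be checked with a few lines. A minimal sketch (not from the original slides), assuming nested-loop joins whose per-step cost is the product of the two input cardinalities, with the intermediate result sizes taken from the example:

```python
# Nested-loop join cost for the two join orders from the example.
# Each step is (left_cardinality, right_cardinality); the left cardinality of
# a step is the size of the previous intermediate result, taken from the slide,
# not computed here.

def nested_loop_cost(steps):
    """Sum of |left| * |right| over all join steps."""
    return sum(left * right for left, right in steps)

# Method 1: Faculty-TA, (result)-RA, (result)-Department
method_1 = [(3, 2), (1, 7), (1, 1)]
# Method 2: Faculty-Department, (result)-RA, (result)-TA
method_2 = [(3, 1), (3, 7), (3, 2)]

print(nested_loop_cost(method_1))  # 14
print(nested_loop_cost(method_2))  # 30
```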
Outline
- Motivation
- Definition: Selectivity Estimation
- Algorithms for Selectivity Estimation
  - Path Trees
  - Markov Tables
  - XPathLearner
  - XSketches
- Summary
Representing XML data structure: Path Trees and Markov Tables
Path Trees
A path tree aggregates all document nodes that are reachable by the same root-to-node tag path into a single node annotated with its frequency.
[Figure: example path tree with root A, children B and C, and leaves D, D, E]
Problem: the path tree may become larger than the available memory, so the tree has to be summarized.
Summarizing a Path Tree
Four different algorithms: Sibling-*, Level-*, Global-*, No-*
Delete the nodes with the lowest frequencies and replace them with a "*" (star node) to preserve some structural information.
Sibling-*
[Figure: original path tree (nodes A, B, C, D, E, F, G, H, I, J, K) and its summarized version with *-nodes]
- Mark the nodes with the lowest frequencies for deletion.
- Check siblings: deleted siblings are coalesced into a single *-node.
- Traverse the tree and compute the average frequency of the coalesced nodes; the *-node stores this average f and the number of coalesced nodes n (e.g. n = 2, f = 6).
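A minimal sketch of the Sibling-* coalescing step, assuming a simple PathNode structure. The frequency threshold is a simplification of "lowest frequencies" (the original deletes nodes until the summary fits the memory budget), and hanging the deleted nodes' children under the new star node is one possible choice, not a detail from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class PathNode:
    tag: str                 # "*" marks a star node
    freq: float              # node frequency (average frequency for a star node)
    coalesced: int = 1       # number of original nodes represented (n)
    children: list = field(default_factory=list)

def sibling_star(node: PathNode, threshold: float) -> None:
    """Coalesce low-frequency children of every node into one sibling '*' node
    that stores the number of coalesced nodes (n) and their average frequency (f)."""
    low = [c for c in node.children if c.tag != "*" and c.freq < threshold]
    kept = [c for c in node.children if c.tag == "*" or c.freq >= threshold]
    if low:
        kept.append(PathNode(
            tag="*",
            freq=sum(c.freq for c in low) / len(low),          # average frequency f
            coalesced=len(low),                                # n
            children=[gc for c in low for gc in c.children],   # keep grandchildren
        ))
    node.children = kept
    for child in node.children:
        sibling_star(child, threshold)
```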
Level-*
[Figure: original path tree and its summarized version]
As before, delete the nodes with the lowest frequencies; coalesce the deleted nodes into one *-node per level.
Global-*
[Figure: original path tree and its summarized version]
Delete the nodes with the lowest frequencies; coalesce the deleted nodes into one *-node for the complete tree.
No-*
Low-frequency nodes are deleted and not replaced; the tree may become a forest with many roots.
No-* conservatively assumes that nodes that do not exist in the summarized path tree did not exist in the original path tree.
Selectivity Estimation
[Figure: summarized path tree with nodes A, B, C, F, K and *-nodes]
Find all tags matching the path expression; the estimated selectivity is the total frequency of these nodes.
Example: σ(A/B/F) = 21, σ(A/B/Z) = 6 (Z matches a *-node), σ(A/C/Z/K) = 11
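A sketch of this lookup. The dict-based node representation is an assumption for illustration: a queried tag with no exact match among a node's children falls back to a '*' child, and the estimate is the summed frequency of the nodes reached by the last tag.

```python
# A node is represented as a dict: {"tag": str, "freq": float, "children": list}.
def estimate_path_tree(root: dict, tags: list) -> float:
    """Estimate sigma(tags) on a (summarized) path tree: follow the tag sequence
    from the root, let tags without an exact match fall back to '*' children,
    and sum the frequencies of the nodes reached by the last tag."""
    frontier = [root] if root["tag"] in (tags[0], "*") else []
    for tag in tags[1:]:
        next_frontier = []
        for node in frontier:
            matches = [c for c in node["children"] if c["tag"] == tag]
            if not matches:  # no exact match: fall back to a star child, if any
                matches = [c for c in node["children"] if c["tag"] == "*"]
            next_frontier.extend(matches)
        frontier = next_frontier
    return sum(node["freq"] for node in frontier)

# Usage: estimate_path_tree(summarized_root, ["A", "B", "F"])
```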
Outline
- Motivation
- Definition: Selectivity Estimation
- Algorithms for Selectivity Estimation
  - Path Trees
  - Markov Tables
  - XPathLearner
  - XSketches
- Summary
What are Markov Tables?
A table containing all distinct paths in the data of length up to m (m >= 2) and their selectivities; its order is m - 1. Markov table = Markov histogram.
[Figure: example path tree with tags A, B, C, D]

Path | Sel.    Path | Sel.
A    | 1       AC   | 6
B    | 11      AD   | 4
C    | 15      BC   | 9
D    | 19      BD   | 7
AB   | 11      CD   | 8
Selectivity Estimation
The table provides selectivity estimates for all paths of length up to m.
Assumption: the occurrence of a particular tag in a path depends only on the m-1 tags occurring before it.
Selectivity estimation for longer path expressions is done with the following formula.
Selectivity Estimation (notation)
P[t_n]: probability of tag t_n occurring in the XML data tree
N: total number of nodes in the XML data tree
P[t_i | t_{i+1}]: probability of tag t_i occurring before tag t_{i+1}
E: predictand for the occurrence of tag t_n
E_1: predictand for the occurrence of tag t_i before tag t_{i+1}
Markov chain: t_1 -> t_2 -> t_3 -> ...
Selectivity Estimation
For order m - 1 = 1, the selectivity of a path p = t_1/t_2/.../t_n is estimated from the stored frequencies f as

  σ(p) ≈ f(t_1, t_2) * Π_{i=2..n-1} f(t_i, t_{i+1}) / f(t_i)

Example: see the worked computation below.
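A sketch of the order-1 formula using the Markov table from the previous slide. The dictionary encoding and the example paths are illustrative assumptions, not necessarily the example from the original slide.

```python
# Markov table (order m-1 = 1) from the previous slide: single tags and
# length-2 paths with their selectivities.
markov = {
    "A": 1, "B": 11, "C": 15, "D": 19,
    "AB": 11, "AC": 6, "AD": 4, "BC": 9, "BD": 7, "CD": 8,
}

def markov_estimate(path: str) -> float:
    """sigma(t1/.../tn) ~= f(t1,t2) * prod_{i=2..n-1} f(ti,t(i+1)) / f(ti)."""
    if len(path) <= 2:
        return markov.get(path, 0)
    result = markov.get(path[:2], 0)
    for i in range(1, len(path) - 1):
        pair, tag = path[i:i + 2], path[i]
        result *= markov.get(pair, 0) / markov[tag]
    return result

print(markov_estimate("ABC"))  # 11 * 9/11  = 9.0
print(markov_estimate("ACD"))  # 6  * 8/15  = 3.2
```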
Summarizing Markov Tables
The entries (paths) with the lowest selectivities are deleted and replaced.
Three algorithms: Suffix-*, Global-*, No-*
Suffix-*
*-path: represents all deleted paths of length 1
*/*-path: represents all deleted paths of length 2
Deleting a path of length 1: add it to the *-path.
S_D: set of deleted paths of length 2.
Deleting a path of length 2: add it to S_D and look for deleted paths with the same start tag; such paths are combined into a suffix-* path.
Example: S_D = {(A/C), (G/H)}; deleting (A/B) yields the suffix-* path (A/*).
Before checking S_D, check whether the Markov table already contains a matching suffix-* path.
Global-*
*-path: represents all deleted paths of length 1
*/*-path: represents all deleted paths of length 2
Deleting a path of length 1: add it to the *-path.
Deleting a path of length 2: immediately add it to the */*-path.
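A compact sketch of the Global-* variant just described. The threshold-based selection of "lowest selectivity" paths and storing the average selectivity in the star entries (by analogy with the path-tree star nodes) are assumptions, not details taken from the slides.

```python
def global_star_prune(markov: dict, threshold: float) -> dict:
    """Global-* summarization of an order-1 Markov table whose keys are paths of
    length 1 ('A') or 2 ('A/B'). Paths below the threshold are deleted; deleted
    length-1 paths are represented by one '*' entry and deleted length-2 paths
    by one '*/*' entry (here storing the average selectivity of the deleted paths)."""
    kept, deleted_1, deleted_2 = {}, [], []
    for path, sel in markov.items():
        if sel >= threshold:
            kept[path] = sel
        elif "/" in path:
            deleted_2.append(sel)
        else:
            deleted_1.append(sel)
    if deleted_1:
        kept["*"] = sum(deleted_1) / len(deleted_1)
    if deleted_2:
        kept["*/*"] = sum(deleted_2) / len(deleted_2)
    return kept
```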
No-*
Does not use *-paths; low-frequency paths are simply discarded.
If any of the required paths is not found in the Markov table, its selectivity is conservatively assumed to be zero.
Which method should be used?
"*" vs. "No-*": if the queried paths exist in the XML data, use a *-algorithm; if the queried paths do not exist, use a No-* algorithm.
Path trees vs. Markov tables: if the data has a common structure, use Markov tables; if the data has no common structure, use path trees.
Outline
- Motivation
- Definition: Selectivity Estimation
- Algorithms for Selectivity Estimation
  - Path Trees
  - Markov Tables
  - XPathLearner
  - XSketches
- Summary
Weaknesses of previous methods
- Off-line: require a scan of the entire data set
- Limited to simple path expressions
- Oblivious to the workload distribution
- Updates too expensive
XPathLearner is...
An on-line, self-tuning Markov histogram for XML path selectivity estimation:
- on-line: collects statistics from query feedback
- self-tuning: learns the Markov model from feedback, adapts to changing XML data
- workload-aware
- supports simple, single-value and multi-value path expressions
System Overview
[Figure: system architecture; the query optimizer (plan enumerator and selectivity estimator) sends query plans to the execution engine, results and query feedback flow to the histogram learner, which maintains the histogram used by the selectivity estimator]
Workflow
[Figure: the histogram learner receives initial training data and query feedback (real selectivities) and updates the histogram; the selectivity estimator returns estimated selectivities]
The system uses feedback to update the statistics for the queried path. Updates are based on the observed estimation error.
Basics
Relies on path trees as an intermediate representation.
Uses a Markov histogram of order m - 1 to store the path tree and the statistics; henceforth m = 2.
The table stores tag-tag pairs, tag-value pairs, and single tags.
Data values
Problem: the number of distinct data values is very large, so the table may become larger than the available memory.
Solution: only the k most frequent tag-value pairs are stored exactly; all other pairs are aggregated into buckets according to some feature. The feature should distribute the pairs as uniformly as possible.
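A sketch of this pruning step: keep the k most frequent tag-value pairs exactly and aggregate the rest into per-tag buckets keyed by a feature, here the first letter of the value as in the example on the next slide. The concrete values "bar" and "abc" are hypothetical stand-ins for v2 and v1.

```python
from collections import defaultdict

def prune_tag_value_pairs(pairs: dict, k: int):
    """pairs maps (tag, value) -> count. Keep the k most frequent pairs exactly;
    aggregate all remaining pairs into buckets keyed by (tag, feature), where the
    feature is the first letter of the value. Each bucket stores the summed count
    and the number of pairs it represents."""
    ranked = sorted(pairs.items(), key=lambda kv: kv[1], reverse=True)
    exact = dict(ranked[:k])
    buckets = defaultdict(lambda: [0, 0])          # (tag, feature) -> [sum, #pairs]
    for (tag, value), count in ranked[k:]:
        bucket = buckets[(tag, value[0])]
        bucket[0] += count
        bucket[1] += 1
    return exact, dict(buckets)

# Example with k = 1 (cf. the next slide): (B, v3) is kept exactly,
# the remaining pairs go into buckets by the first letter of the value.
exact, buckets = prune_tag_value_pairs(
    {("B", "v3"): 3, ("B", "bar"): 1, ("C", "abc"): 1}, k=1)
print(exact)    # {('B', 'v3'): 3}
print(buckets)  # {('B', 'b'): [1, 1], ('C', 'a'): [1, 1]}
```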
Example, k = 1
[Figure: data tree with root A, children tagged B and C, and data values v1, v2, v3]

Tag | Count      Tag pair | Count
A   | 1          AB       | 6
B   | 6          AC       | 3
C   | 3

Tag | Value | Count
B   | v3    | 3

Tag | Feature | Sum | #pairs
B   | b       | 1   | 1
C   | a       | 1   | 1

Data value v1 begins with the letter 'a', v2 with the letter 'b'.
Selectivity Estimation (notation)
P[t_n]: probability of tag t_n occurring in the XML data tree
N: total number of nodes in the XML data tree
P[t_i | t_{i+1}]: probability of tag t_i occurring before tag t_{i+1}
E: expectation for the occurrence of tag t_n
E_1: expectation for the occurrence of tag t_i before tag t_{i+1} (if n = 2, then t_{i+1} = t_n)
Selectivity Estimation
Simple path: p = //t_1/t_2/.../t_n
Analogous for single-value paths: p = //t_1/t_2/.../t_{n-1}=v_{n-1}
Slightly more complicated for multi-value paths.
Selectivity Estimation
Simple path p = //t_1/t_2/.../t_n:

  σ(p) ≈ f(t_1, t_2) * Π_{i=2..n-1} f(t_i, t_{i+1}) / f(t_i)

Single-value path p = //t_1/t_2/.../t_{n-1}=v_{n-1}: the value is treated like one more path step, i.e. the estimate for //t_1/.../t_{n-1} is multiplied by f(t_{n-1}, v_{n-1}) / f(t_{n-1}).
Selectivity Estimation of a multi-value path p = //t_1=v_1/t_2=v_2/.../t_n=v_n:
the estimate for the simple path //t_1/.../t_n is multiplied by f(t_i, v_i) / f(t_i) for each value predicate, i.e. by the probability of v_i occurring after t_i, conditioned on observing t_i.
Example
[Tables as before: tag counts A = 1, B = 6, C = 3; tag-tag pairs AB = 6, AC = 3; tag-value pair (B, v3) = 3; buckets (B, 'b', sum 1, 1 pair) and (C, 'a', sum 1, 1 pair)]
Real selectivity = 3
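The queried path expression is not recoverable from this slide. Assuming it is the single-value path //A/B=v3, the statistics above reproduce the stated real selectivity of 3; a short check under that assumption:

```python
# Statistics from the example tables above.
tag   = {"A": 1, "B": 6, "C": 3}
pair  = {"AB": 6, "AC": 3}
value = {("B", "v3"): 3}

# Single-value path //A/B=v3:  sigma ~= f(A,B) * f(B,v3) / f(B)
estimate = pair["AB"] * value[("B", "v3")] / tag["B"]
print(estimate)  # 6 * 3 / 6 = 3.0, matching the real selectivity of 3
```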
Updates
Changes in the data require the statistics to be updated. This is done via query feedback tuples (p, σ), where p denotes the path and σ the accurate selectivity of p.
The feedback is attributed to the parts of path p according to one of several strategies.
Learning process
Given: an initially empty Markov histogram f, query feedback (p, σ), and the estimated selectivity σ̂.
- Learn any unknown length-2 path
- Update the selectivities of known paths
Two strategies:
- Heavy-Tail rule
- Delta rule
Algorithm, Part 1: learn new paths of length up to 2

  UPDATE(Histogram f, Feedback (p, σ), Estimate σ̂)
    if |p| <= 2 then
      if not exists f(p) then add entry f(p) = σ
      else f(p) = σ

Example: σ̂(AD) = 1 (AD not in f), σ(AD) = 2, so the entry f(AD) = 2 is added.
[Tables after the update: tag counts A = 1, B = 6, C = 3; tag-tag pairs AB = 6, AC = 3, AD = 2; tag-value pair (B, v3) = 3; buckets (B, 'b', 1, 1) and (C, 'a', 1, 1)]
Algorithm, Part 2: learn longer paths (decompose into paths of length 2)

    else
      for each (t_i, t_{i+1}) in p
        if not exists f(t_i, t_{i+1}) then add entry f(t_i, t_{i+1}) = 1
        f(t_i, t_{i+1}) <- update
      endfor

The update of f(t_i, t_{i+1}) depends on the update strategy.
Example
σ̂(ACD) = 1, σ(ACD) = 5. The path is decomposed into AC and CD.
AC is present: its frequency is updated (AC = 5).
CD is not present: the entry f(CD) = 1 is added and then updated (f(CD) = 4).
[Tables after the update: tag counts A = 1, B = 6, C = 3; tag-tag pairs AB = 6, AC = 5, CD = 4; tag-value pair (B, v3) = 3; buckets (B, 'b', 1, 1) and (C, 'a', 1, 1)]
Algorithm, Part 3: learn the frequencies of single tags

    for each t_i in p, i > 1
      if not exists f(t_i) then add entry f(t_i)
      f(t_i) <- max{ f(t_i), f(·, t_i) }
    endfor

Example: σ̂(AD) = 1 (AD not in f), σ(AD) = 2; the single-tag entry D = 2 is added.
[Tables after the update: tag counts A = 1, B = 6, C = 3, D = 2; tag-tag pairs AB = 6, AC = 3, AD = 2; tag-value pair (B, v3) = 3; buckets (B, 'b', 1, 1) and (C, 'a', 1, 1)]
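A consolidated sketch of the three parts above. The dict-based histogram layout, the use of the Heavy-Tail rule from the following slides as the update strategy, and the ceiling on the per-pair increment are assumptions for illustration; Part 3 follows the pseudocode literally by taking the maximum with the pair ending in the tag.

```python
import math

def update(f: dict, path: str, sigma: float, sigma_hat: float, alpha: float = 1.0) -> None:
    """One XPathLearner-style histogram update from a feedback tuple (path, sigma).
    f maps single tags and length-2 tag sequences (strings) to frequencies."""
    if len(path) <= 2:
        # Part 1: paths of length <= 2 are stored directly.
        f[path] = sigma
    else:
        # Part 2: decompose the path into length-2 pieces and distribute the
        # estimation error; the Heavy-Tail rule (weights 2^i) is used here.
        error = sigma - sigma_hat
        pairs = [path[i:i + 2] for i in range(len(path) - 1)]
        weights = [2 ** (i + 1) for i in range(len(pairs))]
        total = sum(weights)
        for pair, w in zip(pairs, weights):
            f.setdefault(pair, 1)
            f[pair] += math.ceil(alpha * error * w / total)
    # Part 3: refresh single-tag frequencies (maximum of the old value and the
    # frequency of the length-2 entry ending in that tag).
    for i in range(1, len(path)):
        tag, pair = path[i], path[i - 1:i + 1]
        f[tag] = max(f.get(tag, 0), f.get(pair, 0))

# Feedback (AD, 2) from the Part-1 example: AD and D are added.
f = {"A": 1, "B": 6, "C": 3, "AB": 6, "AC": 3}
update(f, "AD", sigma=2, sigma_hat=1)
print(f["AD"], f["D"])   # 2 2

# Feedback (ACD, 6) with estimate ~3 (cf. the Heavy-Tail column of the later example).
f = {"A": 1, "C": 7, "D": 7, "AC": 3, "CD": 6}
update(f, "ACD", sigma=6, sigma_hat=3)
print(f["AC"], f["CD"])  # 4 8
```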
Update strategies: Heavy-Tail rule
Attribute more of the estimation error ε = σ - σ̂ to the end of the path:

  Δf(t_i, t_{i+1}) = α * ε * w_i / W

where the w_i are weighting factors (increasing with i, e.g. 2^i), α is the learning rate, and W = Σ_j w_j normalizes the weights.
Update strategies: Delta rule
An error-reduction learning technique that minimizes an error function E (e.g. the squared estimation error):

  Δf(t_i, t_{i+1}) = -α * ∂E / ∂f(t_i, t_{i+1})

i.e. the update to the term f(t_i, t_{i+1}) is proportional to the negative gradient of E with respect to f(t_i, t_{i+1}); the learning rate α determines the length of a step.
Update strategies: Delta rule (continued)
The same rule is applied to the other histogram terms f(·, ·): the update is proportional to the negative gradient of E with respect to that term, and the learning rate determines the length of a step.
Evaluation
Good:
- on-line, adapts to changing data
- workload-aware
- after the learning phase comparable to off-line methods
- update overhead nearly constant
Bad:
- still restricted to XML trees, no support for IDREFs
Example
Feedback for path ACD is (ACD, 6); σ̂(ACD) ≈ 3, so ε = 6 - 3 = 3.
Updates:

Path | before update | Heavy-Tail, α = 1 | Delta, α = 0.5
AC   | 3             | 4                 | 5
CD   | 6             | 8                 | 8
C    | 7             | 8                 | 9
D    | 7             | 9                 | 9

The Heavy-Tail rule attributes more of the error to the end of the path.
Outline
- Motivation
- Definition: Selectivity Estimation
- Algorithms for Selectivity Estimation
  - Path Trees and Markov Tables
  - XPathLearner
  - XSketches
- Summary
Preliminaries: XML Data Graph
Legend: A: Author, P: Paper, B: Book, PB: Publisher, T: Title, N: Name
[Figure: XML data graph with element nodes P0, A1, A2, PB3, N4, B5, P6, P7, N8, B9, T10-T13, E14 and value nodes V4, V8, V10-V14]
Preliminaries: Path Expressions
XPath expressions:
- Simple: A/P/T
- Complex: A[B]/P/T
The result is a set, e.g. {T11, T12} for A[B]/P/T on the example graph.
[Figure: example data graph with the matching title elements highlighted]
Preliminaries: Motivation
Selectivity estimation over XML data graphs.
Outline:
- XSketch Synopsis
- Estimation Framework
- XSketch Refinement Operations
- Experiments
XSketch Synopsis
[Figure: XML data graph (left) and general synopsis graph (right) with synopsis nodes P(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1)]
Each synopsis node stores the size of its extent, e.g. Count(A) = |Extent(A)| = |{A1, A2}| = 2.
Backward-edge Stability
Label(u, v) = b if all elements in v have a parent in u.
[Figure: data graph and synopsis graph with the b-stable edges labeled b]
Backward-edge Stability
Example: Label(A(2), B(2)) and Label(PB(1), B(2)) are empty.
[Figure: data graph and synopsis graph highlighting the non-stable edges into B(2)]
Forward-edge Stability
Label(u, v) = f if all elements in u have a child in v.
[Figure: data graph and synopsis graph with the f-stable edges labeled f]
Forward-edge Stability
Example: B9 is in B(2) but has no child in E(1), so the edge (B(2), E(1)) is not f-stable.
[Figure: data graph and synopsis graph highlighting the edge (B(2), E(1))]
XSketch Synopsis
An XSketch is a synopsis graph in which every edge (u, v) carries a label Label(u, v) from {b, f, b/f, Ø}.
[Figure: XML data graph and XSketch synopsis with b/f edge labels]
Estimation Framework
Calculate the selectivity of the path expression V = V1/.../Vn as Count(V) = Count(Vn) * f(V), where f(V) is the fraction of Vn elements reachable via V.
Case 1: if b ∈ Label(Vi, Vi+1) for all i, then f(V) = 1, so Count(V) = Count(Vn).
Example: Count(A/P/T) = Count(T) * f(A/P/T) = 2
[Figure: XSketch synopsis with b/f edge labels]
Estimation Framework
Case 2: there exists an i such that b ∉ Label(Vi, Vi+1).
A1. Path Independence Assumption: f(u/v | v/w) ≈ f(u/v)
A2. B-Edge Uniformity Assumption: the elements of a node V reached via edges (U, V) with b ∉ Label(U, V) are uniformly distributed over all such possible parents.
Example: f(P/PB/B/T) = ?
[Figure: XSketch synopsis with b/f edge labels]
Estimation Framework
Example: f(P/PB/B/T) = ?

  f(P/PB/B/T) = f(B/T) * f(P/PB/B | B/T)
              = f(B/T) * f(PB/B | B/T) * f(P/PB | PB/B/T)

B-stability of (B, T) and (P, PB) reduces this to f(PB/B | B/T).
A1: f(PB/B | B/T) ≈ f(PB/B)
A2: f(PB/B) = Count(PB) / [Count(PB) + Count(A)]
Hence f(P/PB/B/T) = 1 / (1 + 2) = 1/3.
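A sketch of the Case-2 computation under assumptions A1 and A2. The per-edge representation (a b-stability flag plus the counts of the candidate parent nodes) is an assumption for illustration, not the XSketch data structure itself.

```python
def path_fraction(edges) -> float:
    """f(V1/.../Vn) under the path-independence (A1) and b-edge-uniformity (A2)
    assumptions. Each edge (Vi, Vi+1) is a dict with:
      'b_stable'     : True if every element of Vi+1 has a parent in Vi
      'parent_count' : Count(Vi)
      'all_parents'  : total count of elements in all synopsis nodes that can be
                       a parent of Vi+1 (only needed when the edge is not b-stable)
    """
    f = 1.0
    for edge in edges:
        if edge["b_stable"]:
            continue                  # a b-stable edge contributes a factor of 1
        # A1 + A2: fraction of Vi+1's elements whose parent lies in Vi
        f *= edge["parent_count"] / edge["all_parents"]
    return f

# Example from the slide: f(P/PB/B/T)
edges = [
    {"b_stable": True},                                            # (P, PB)
    {"b_stable": False, "parent_count": 1, "all_parents": 1 + 2},  # (PB, B): PB(1) vs. PB(1) + A(2)
    {"b_stable": True},                                            # (B, T)
]
print(path_fraction(edges))  # 1/3
```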
Estimation Framework
A3. Branch-Independence Assumption: outgoing paths from v are conditionally independent of the existence of other outgoing paths.
A4. Forward-Edge Uniformity Assumption: the outgoing edges from v to all children u of v with f ∉ Label(u, v) are uniformly distributed across all such children.
XSketch Refinement Operations
Goal: construct an efficient XSketch for a given space budget.
b-stabilize(XS(G), u, v): applicable when b ∉ Label(v, u). Refine node u into two element partitions u1, u2 with the same label such that Label(v, u1) = b (or Label(v, u2) = b).
[Figure: node u with parents V1...Vn is split into u1 and u2 so that one of the new edges becomes b-stable]
XSketch Refinement Operations f-Stabilize (Xs(G),u,w): Label(u,w)≠ F Refine u into two nodes u1,u2 with same label s.t. Label (u1,w) = label(u,w)U{F} Example: U W1 W2….Wn U1 U2 f W1 W2…….Wn f - Stabilize
XSketch Refinement Operations: Backward Split
[Figure: backward split of node A with children P1...Pn into partitions A1 and A2, distributing c(A)]
Experiment
Data sets: IMDB (102,... elements), XMark (206,... elements), DBLP (1,399,... elements); for each, the coarsest summary size (KB) and the perfect summary size (MB) were compared.
Workload
- 1000 positive path expressions (PEs), a biased random sample from the document
- Path length: ...
- The PEs contain range predicates; predicates are random and cover 10% of the value domain
- Similar results with negative PEs
Accuracy Metric
Average absolute relative error: the absolute difference between estimated and true selectivity, divided by the true selectivity, averaged over the workload.
Markov Tables vs. XSketch
Outline
- Motivation
- Definition: Selectivity Estimation
- Algorithms for Selectivity Estimation
  - Path Trees and Markov Tables
  - XPathLearner
  - XSketches
- Summary
Summary
- Definition of selectivity
- Summarizing XML documents (path trees / Markov tables)
- An application using Markov tables: XPathLearner
- Extension of selectivity estimation to graphs: XSketch
Questions?