Organizing and Searching Information with XML: Selectivity Estimation for XML Queries. Thomas Beer, Christian Linz, Mostafa Khabouze.

Outline: Definition of Selectivity Estimation; Motivation; Algorithms for Selectivity Estimation (Path Trees, Markov Tables, XPathLearner, XSketches); Summary.

Selectivity Definition: The selectivity σ(p) of a path expression p is defined as the number of paths in the XML data tree that match the tag sequence in p. [Figure: example data tree with nodes A, B, C, E, D, D] Example: σ(A/B/D) = 2.
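A minimal Python sketch of this definition, counting every downward path whose tag sequence equals p. The tree shape is an assumption: the figure only preserves the node labels A, B, C, E, D, D, so the edges below are chosen to reproduce σ(A/B/D) = 2. The name `selectivity` and the tuple encoding are illustrative.

```python
def selectivity(tree, path):
    """Count downward paths in `tree` whose tag sequence equals `path`.

    A node is a tuple (tag, [children]); `path` looks like "A/B/D".
    """
    tags = path.strip("/").split("/")

    def match_from(node, i):
        # Number of matches of tags[i:] that start exactly at `node`.
        tag, children = node
        if tag != tags[i]:
            return 0
        if i == len(tags) - 1:
            return 1
        return sum(match_from(child, i + 1) for child in children)

    def walk(node):
        # A match may start at any node of the tree.
        return match_from(node, 0) + sum(walk(child) for child in node[1])

    return walk(tree)


# Hypothetical tree using the node labels of the slide's figure.
tree = ("A", [("B", [("D", []), ("D", [])]), ("C", [("E", [])])])
print(selectivity(tree, "A/B/D"))
```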

Motivation: Estimating the size of query results and intermediate results is necessary for effective query optimization. Knowing the selectivities of sub-queries helps identify cheap query evaluation plans. Internet context: quick feedback about the expected result size before evaluating the full query result.

Example XQuery expression: FOR $f IN document("personnel.xml")//department/faculty WHERE count($f/TA) > 0 AND count($f/RA) > 0 RETURN $f. This expression matches all faculty members that have at least one TA and one RA. One join is computed for every edge of the query tree. Assumptions: the number of nodes is known, and the join algorithm is nested loop. [Figure: query tree with nodes Department, Faculty, RA, TA]

Evaluating the join
Node counts: Dep. 1, Faculty 3, RA 7, TA 2.
[Figure: example data tree with Department, Name, Faculty, Secretary, Scientist, RA and TA nodes]
Method 1: Join 1: (Faculty) – TA; Join 2: (Result Join 1) – RA; Join 3: (Result Join 2) – Dep. Number of operations: Join 1: 3 * 2 = 6; Join 2: 1 * 7 = 7; Join 3: 1 * 1 = 1; Total = 14.
Method 2: Join 1: (Faculty) – Dep.; Join 2: (Result Join 1) – RA; Join 3: (Result Join 2) – TA. Number of operations: Join 1: 3 * 1 = 3; Join 2: 3 * 7 = 21; Join 3: 3 * 2 = 6; Total = 30.
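The two totals can be recomputed with a short Python sketch. It assumes the nested-loop cost of a join is simply (left input size) × (right input size), as the slide's arithmetic suggests. The intermediate result sizes are taken from the slide, not derived; `nested_loop_cost` is an illustrative name.

```python
def nested_loop_cost(joins):
    """Sum of left_size * right_size over a sequence of nested-loop joins."""
    return sum(left * right for left, right in joins)


# (left input size, right input size) for each join, as given on the slide.
method_1 = [(3, 2), (1, 7), (1, 1)]  # Faculty-TA, then RA, then Dep.
method_2 = [(3, 1), (3, 7), (3, 2)]  # Faculty-Dep., then RA, then TA

print(nested_loop_cost(method_1), nested_loop_cost(method_2))
```

The gap between the two plans comes entirely from the size of the first intermediate result, which is what a selectivity estimate is meant to predict.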

Outline: Motivation; Definition of Selectivity Estimation; Algorithms for Selectivity Estimation (Path Trees, Markov Tables, XPathLearner, XSketches); Summary.

Representing XML data structure: Path Trees and Markov Tables.

Path Trees [Figure: path tree with nodes A, B, C, D, D, E annotated with their frequencies] Problem: the path tree may become larger than the available memory, so the tree has to be summarized.

Summarizing a Path Tree: four different algorithms: Sibling-*, Level-*, Global-*, No-*. Delete the nodes with the lowest frequencies and replace them with a "*" (star node) to preserve some structural information.

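Sibling-* Operation breakdown: (1) Mark the nodes with the lowest frequencies for deletion. (2) Check siblings: marked nodes that are siblings are coalesced into one *-node, annotated with the number of coalesced nodes n and their total frequency f (e.g. n=2, f=6). (3) Traverse the tree and compute the average frequency of each *-node (e.g. 6/2 = 3). [Figure: path tree before and after Sibling-* summarization]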

Level-*: As before, delete the nodes with the lowest frequency. One *-node for every level. [Figure: path tree before and after Level-* summarization]

Global-*: Delete the nodes with the lowest frequency. One *-node for the complete tree. [Figure: path tree before and after Global-* summarization]

No-*: Low-frequency nodes are deleted and not replaced. The tree may become a forest with many roots. No-* conservatively assumes that nodes that do not exist in the summarized path tree did not exist in the original path tree.

Selectivity Estimation: Find all nodes whose tags match the path expression (a *-node matches any tag); the estimated selectivity is the total frequency of these nodes. [Figure: summarized path tree with *-nodes] Example: σ(A/B/F) = 15 + 6 = 21; σ(A/B/Z) = 6; σ(A/C/Z/K) = 11.
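A small Python sketch of this matching rule, assuming that a *-node matches any tag and that the path is anchored at the root, as in the three examples. The summarized tree below is a hypothetical reconstruction chosen to be consistent with those examples; it is not the exact figure from the slide.

```python
def estimate(node, tags):
    """Total frequency of the nodes reached by matching `tags` from `node`.

    A node is (tag, frequency, [children]); a "*" node matches any tag.
    """
    tag, freq, children = node
    if tag != "*" and tag != tags[0]:
        return 0
    if len(tags) == 1:
        return freq
    return sum(estimate(child, tags[1:]) for child in children)


# Hypothetical summarized path tree (assumption, see lead-in).
summary = ("A", 1, [
    ("B", 9, [("F", 15, []), ("*", 6, [])]),
    ("C", 13, [("*", 3, [("*", 11, [])])]),
])

for p in ("A/B/F", "A/B/Z", "A/C/Z/K"):
    print(p, estimate(summary, p.split("/")))
```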

What are Markov Tables? A Markov table contains all distinct paths in the data of length up to m, together with their selectivities (m ≥ 2). Order: m − 1. Markov table = Markov histogram. [Figure: example path tree with tags A, B, C, D]
Path  Sel.    Path  Sel.
A     1       AC    6
B     11      AD    4
C     15      BC    9
D     19      BD    7
AB    11      CD    8

Selectivity Estimation: The table provides selectivity estimates for all paths of length up to m. Assumption: the occurrence of a particular tag in a path depends only on the m−1 tags occurring before it. Selectivity estimation for longer path expressions is done with the following formula.

Selectivity Estimation: P[t_n]: probability of tag t_n occurring in the XML data tree. N: total number of nodes in the XML data tree. P[t_i | t_{i+1}]: probability of tag t_i occurring before tag t_{i+1}. E: expectation for the occurrence of tag t_n. E1: expectation for the occurrence of tag t_i before tag t_{i+1}. Markov chain: t_1, t_2, t_3, …

Selectivity Estimation: σ(p) = selectivity of path p. [Formula and worked example shown as images]
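A minimal Python sketch of the estimate for m = 2, assuming the usual first-order Markov form σ(t_1/.../t_n) ≈ f(t_1/t_2) · Π_{i=2..n-1} f(t_i/t_{i+1}) / f(t_i), where f is the table lookup. This form is an assumption consistent with the symbols defined above. The table values are those of the Markov table example; the function name is illustrative.

```python
MARKOV_TABLE = {
    "A": 1, "B": 11, "C": 15, "D": 19,
    "A/B": 11, "A/C": 6, "A/D": 4, "B/C": 9, "B/D": 7, "C/D": 8,
}


def estimate_selectivity(table, path):
    """First-order Markov estimate; paths of length <= 2 are looked up directly."""
    tags = path.split("/")
    if len(tags) <= 2:
        return float(table.get(path, 0))
    estimate = float(table.get("/".join(tags[:2]), 0))
    for i in range(1, len(tags) - 1):
        pair = table.get(tags[i] + "/" + tags[i + 1], 0)
        single = table.get(tags[i], 0)
        if single == 0:
            # Missing entries are treated as selectivity zero (the No-* convention).
            return 0.0
        # Multiply first, then divide, to keep small integer cases exact.
        estimate = estimate * pair / single
    return estimate


print(estimate_selectivity(MARKOV_TABLE, "A/B/C"))  # 11 * 9 / 11
```

For example, A/C/D is estimated as 6 · 8 / 15: the 6 occurrences of A/C are scaled by the fraction of C nodes that have a D child.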

Summarizing Markov Tables: The paths with the lowest selectivity are deleted and replaced. Three algorithms: Suffix-*, Global-*, No-*.

Suffix-*: The *-path represents all deleted paths of length 1; the */*-path represents all deleted paths of length 2. Deleting a path of length 1: add it to the path *. S_D: set of deleted paths of length 2. Deleting a path of length 2: add it to S_D and look for paths in S_D with the same start tag. Example: S_D = {(A/C), (G/H)}; deleting (A/B) yields the suffix-* path (A/*). Before checking S_D, check the Markov table for an existing suffix-* path.

Global-*: The *-path represents all deleted paths of length 1; the */*-path represents all deleted paths of length 2. Deleting a path of length 1: add it to the path *. Deleting a path of length 2: immediately add it to the path */*.

No-*: Does not use *-paths. Low-frequency paths are simply discarded. If any of the required paths is not found in the Markov table, its selectivity is conservatively assumed to be zero.

Which method should be used? Path Trees vs. Markov Tables: if the data has a common structure, use Markov Tables; if the data has no common structure, use Path Trees. "*" vs. "No-*": if the queried paths exist in the XML data, use a *-algorithm; if the paths do not exist, use the No-* algorithm.

Weaknesses of previous methods: off-line, requiring a scan of the entire data set; limited to simple path expressions; oblivious to workload distribution; updates too expensive.

XPathLearner is an on-line, self-tuning Markov histogram for XML path selectivity estimation. On-line: collects statistics from query feedback. Self-tuning: learns the Markov model from feedback and adapts to changing XML data. Workload-aware. Supports simple, single-value and multi-value path expressions.

System Overview [Figure: components Query Optimizer, Query Plan Enumerator, Selectivity Estimator, Execution Engine, Histogram Learner and Histogram, connected by query plan, result and query feedback]

Workflow: The system uses feedback to update the statistics for the queried path. Updates are based on the observed estimation error. [Figure: workflow between Training data, Histogram Learner, Histogram and Selectivity Estimator, with edges labelled initial training, feedback (real selectivity), updates, estimated selectivity and observed estimation error]

Basics: Relies on path trees as intermediate representation. Uses a Markov histogram of order m−1 to store the path tree and the statistics. Henceforth m = 2: the table stores tag-tag pairs, tag-value pairs and single tags.

Data values. Problem: the number of distinct data values is very large, so the table may become larger than the available memory. Solution: only the k most frequent tag-value pairs are stored exactly; all other pairs are aggregated into buckets according to some feature. The feature should distribute the pairs as uniformly as possible.

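Example, k=1. Tag counts: A 1, B 6, C 3. Tag-tag counts: AB 6, AC 3. Tag-value counts (stored exactly): (B, v3) 3. Buckets (tag, feature: sum, #pairs): (B, b: 1, 1), (C, a: 1, 1). Data value v1 begins with the letter 'a', v2 with the letter 'b'. [Figure: path tree A with children B and C and data values v1, v2, v3]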
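The split into exact entries and buckets can be sketched in Python as follows. The per-pair counts for (B, v2) and (C, v1) are inferred from the bucket sums in the example; `first_letter` is a hypothetical stand-in for the feature function. Estimating a bucketed pair as sum / #pairs is an assumption (uniformity inside a bucket).

```python
def build_value_stats(pair_counts, k, feature):
    """Keep the k most frequent (tag, value) pairs exactly, bucket the rest."""
    ranked = sorted(pair_counts.items(), key=lambda item: -item[1])
    exact = dict(ranked[:k])
    buckets = {}  # (tag, feature) -> (sum of counts, number of pairs)
    for (tag, value), count in ranked[k:]:
        key = (tag, feature(value))
        total, pairs = buckets.get(key, (0, 0))
        buckets[key] = (total + count, pairs + 1)
    return exact, buckets


def value_count(exact, buckets, tag, value, feature):
    """Exact count if stored, otherwise the bucket average (assumed uniform)."""
    if (tag, value) in exact:
        return exact[(tag, value)]
    total, pairs = buckets.get((tag, feature(value)), (0, 0))
    return total / pairs if pairs else 0


# Stand-in feature: the slide says v1 begins with 'a' and v2 with 'b'.
first_letter = {"v1": "a", "v2": "b"}.get
pair_counts = {("B", "v3"): 3, ("B", "v2"): 1, ("C", "v1"): 1}

exact, buckets = build_value_stats(pair_counts, k=1, feature=first_letter)
print(exact, buckets)
```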

Selectivity Estimation: P[t_n]: probability of tag t_n occurring in the XML data tree. N: total number of nodes in the XML data tree. P[t_i | t_{i+1}]: probability of tag t_i occurring before tag t_{i+1}. E: expectation for the occurrence of tag t_n. E1: expectation for the occurrence of tag t_i before tag t_{i+1} (if n = 2, t_{i+1} = t_n).

Selectivity Estimation: Simple path p = //t_1/t_2/.../t_n. Analogous for a single-value path p = //t_1/t_2/.../t_{n-1}=v_{n-1}. Slightly more complicated for a multi-value path.

Selectivity Estimation of a multi-value path p = //t_1=v_1/t_2=v_2/.../t_n=v_n: uses the probability of v_i occurring after t_i, conditioned on observing t_i.

Example. Tag counts: A 1, B 6, C 3. Tag-tag counts: AB 6, AC 3. Tag-value counts: (B, v3) 3. Buckets (tag, feature: sum, #pairs): (B, b: 1, 1), (C, a: 1, 1). Real selectivity σ = 3.

Updates: Changes in the data require the statistics to be updated. This is done via query feedback tuples (p, σ), where p denotes the path and σ denotes the accurate selectivity of p. The feedback is attributed to the paths contained in p according to some strategy.

Learning process. Given: an initially empty Markov histogram f, query feedback (p, σ), and the estimated selectivity σ̂. Learn any unknown length-2 path. Update the selectivities of known paths. Two strategies: Heavy-Tail Rule and Delta Rule.

Algorithm, Part 1: Learn new paths of length up to 2
  UPDATE(Histogram f, Feedback (p, σ), Estimate σ̂)
    if |p| ≤ 2 then
      if not exists f(p) then add entry f(p) = σ
      else f(p) ← σ
Example: σ̂(AD) = 1 (AD is not in f), σ(AD) = 2. Tag counts: A 1, B 6, C 3. Tag-tag counts after the update: AB 6, AC 3, AD 2. Tag-value counts: (B, v3) 3. Buckets (tag, feature: sum, #pairs): (B, b: 1, 1), (C, a: 1, 1).

Algorithm, Part 2: Learn longer paths (decompose into paths of length 2)
    else
      for each (t_i, t_{i+1}) ∈ p
        if not exists f(t_i, t_{i+1}) then add entry f(t_i, t_{i+1}) = 1
        f(t_i, t_{i+1}) ← update
      endfor
The update of f(t_i, t_{i+1}) depends on the update strategy.

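Example: σ̂(ACD) = 1, σ(ACD) = 5. Decompose ACD into AC and CD. AC is present: update the frequency, giving f(AC) = 5. CD is not present: add f(CD) = 1, then update, giving f(CD) = 4. Tag counts: A 1, B 6, C 3. Tag-tag counts after the update: AB 6, AC 5, CD 4. Tag-value counts: (B, v3) 3. Buckets (tag, feature: sum, #pairs): (B, b: 1, 1), (C, a: 1, 1).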

Algorithm, Part 3: Learn the frequency of single tags
    for each t_i ∈ p, i ≠ 1
      if not exists f(t_i) then add entry f(t_i)
      f(t_i) ← max{f(t_i), Σ_α f(α, t_i)}
    endfor
Example: σ̂(AD) = 1 (AD is not in f), σ(AD) = 2. Tag counts after the update: A 1, B 6, C 3, D 2. Tag-tag counts: AB 6, AC 3, AD 2. Tag-value counts: (B, v3) 3. Buckets (tag, feature: sum, #pairs): (B, b: 1, 1), (C, a: 1, 1).
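The three parts can be put together in a short Python sketch. It is a reading of the pseudocode under stated assumptions, not the authors' implementation: the histogram is a dict keyed by tag tuples, a new single tag starts at 0, and the strategy-specific step is passed in as a hypothetical callable `adjust` that is invoked once for all pairs rather than inside the loop.

```python
def update(f, path, sigma, estimate, adjust):
    """Apply one query feedback (path, sigma) to the histogram `f`.

    f        : dict mapping tag tuples, e.g. ("A",) or ("A", "D"), to counts
    path     : list of tags, e.g. ["A", "C", "D"]
    estimate : selectivity estimated before the feedback arrived
    adjust   : callable(f, pairs, sigma, estimate) implementing the strategy
    """
    tags = tuple(path)
    if len(tags) <= 2:
        # Part 1: short paths take the observed selectivity directly.
        f[tags] = sigma
    else:
        # Part 2: decompose into length-2 paths, learning unknown ones.
        pairs = list(zip(tags, tags[1:]))
        for pair in pairs:
            f.setdefault(pair, 1)
        adjust(f, pairs, sigma, estimate)
    # Part 3: a tag occurs at least as often as all pairs ending in it.
    for tag in tags[1:]:
        incoming = sum(count for key, count in f.items()
                       if len(key) == 2 and key[1] == tag)
        f[(tag,)] = max(f.get((tag,), 0), incoming)
    return f


histogram = {("A",): 1, ("B",): 6, ("C",): 3, ("A", "B"): 6, ("A", "C"): 3}
update(histogram, ["A", "D"], sigma=2, estimate=1, adjust=lambda *args: None)
print(histogram[("A", "D")], histogram[("D",)])
```

With the slide's feedback for AD, the sketch adds f(AD) = 2 and then learns f(D) = 2, matching the tables above.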

Update strategies: Heavy-Tail Rule. Attribute more of the estimation error to the end of the path. The update uses weighting factors w_i (increasing with i, e.g. 2^i), a learning rate, and a normalizing weight W.

Update strategies: Delta Rule. An error-reduction learning technique that minimizes an error function E. The update to the term f(t_i, t_{i+1}) is proportional to the negative gradient of E with respect to f(t_i, t_{i+1}). The learning rate determines the length of a step.

Evaluation. Good: on-line, adapts to changing data; workload-aware; after the learning phase comparable to off-line methods; update overhead nearly constant. Bad: still restricted to XML trees, no support for idrefs.

Example: Feedback for path ACD is (ACD, 6); σ̂(ACD) ≈ 3, ε = 6 − 3 = 3.
Updates:
Path  before update  after Heavy-Tail (learning rate 1)  after Delta (learning rate 0.5)
AC    3              4                                   5
CD    6              8                                   8
C     7              8                                   9
D     7              9                                   9
Heavy-Tail attributes more of the error to the end of the path.
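The Heavy-Tail column for AC and CD can be reproduced with the following Python sketch. The update form f ← f + rate · ε · w_i / W with w_i = 2^i and W = Σ w_i is an assumption: the slide names these quantities but its formula is missing, and this form happens to match the numbers. The name `heavy_tail_update` is illustrative.

```python
def heavy_tail_update(f, pairs, sigma, estimate, rate=1.0):
    """Distribute the error (sigma - estimate) over `pairs`, weighted towards the end."""
    error = sigma - estimate
    weights = [2 ** i for i in range(1, len(pairs) + 1)]  # w_i = 2^i
    total = sum(weights)                                  # assumed W = sum of w_i
    for pair, weight in zip(pairs, weights):
        f[pair] = f.get(pair, 0) + rate * error * weight / total
    return f


f = {("A", "C"): 3, ("C", "D"): 6}
heavy_tail_update(f, [("A", "C"), ("C", "D")], sigma=6, estimate=3, rate=1.0)
print(f)
```

With ε = 3 the weights 2 and 4 give increments of 1 and 2, so AC goes from 3 to 4 and CD from 6 to 8.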

Outline: Motivation; Definition of Selectivity Estimation; Algorithms for Selectivity Estimation (Path Trees and Markov Tables, XPathLearner, XSketches); Summary.

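Preliminaries: XML Data Graph. Legend: A: Author, P: Paper, B: Book, PB: Publisher, T: Title, N: Name. [Figure: XML data graph with element nodes P0, A1, A2, PB3, N4, B5, P6, P7, N8, B9, T10, T11, T12, T13, E14 and value nodes V4, V8, V10, V11, V12, V13, V14]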

Preliminaries: Path Expressions. XPath expressions: simple: A/P/T; complex: A[B]/P/T. The result is a set: {T1,T2}. [Figure: XML data graph with the result nodes T11 and T12 highlighted]

Preliminaries; Motivation; Selectivity Estimation over XML Data Graphs. Outline: XSketch Synopsis; Estimation Framework; XSketch Refinement Operations; Experiment.

XSketch Synopsis: General Synopsis Graph. [Figure: XML data graph and its synopsis graph with nodes P(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1)] Count(A) = |Extent(A)| = |{A1, A2}| = 2.

Backward-edge Stability: Label(u,v) = b if all elements in v have a parent in u. In the example, Label(A2,B2) and Label(PB1,B2) are empty. [Figure: XML data graph and synopsis graph with b-labelled edges]

Forward-edge Stability: Label(u,v) = f if all elements in u have a child in v. In the example, B9 is in B(2) but has no child in E(1). [Figure: XML data graph and synopsis graph with f-labelled edges]

XSketch Synopsis: An XSketch is a synopsis graph with Label(u,v) ∈ {b, f, b/f, Ø}. [Figure: XML data graph and XSketch synopsis graph with edge labels f/b, f, b and Ø]

Estimation Framework: Calculate the selectivity for the path expression (PE) V = V1/…/Vn: Count(V) = Count(Vn) * f(V). Case 1: if Label(Vi, Vi+1) = {b} for all i, then f(V) = 1, so Count(V) = Count(Vn). Example: Count(A/P/T) = Count(T) * f(A/P/T) = 2. [Figure: XSketch synopsis graph]

Estimation Framework, Case 2: there exists an i such that Label(Vi, Vi+1) ≠ {b}. A1, Path Independence Assumption: f(u/v | v/w) ≈ f(u/v). A2, B-Edge Uniformity Assumption: the incoming edges to V from all parents U_i with Label(U_i, V) ≠ b are uniformly distributed over all such parents. Example: f(P/PB/B/T) = ? [Figure: XSketch synopsis graph]

Estimation Framework, Example: f(P/PB/B/T) = f(B/T) * f(P/PB/B | B/T) = f(B/T) * f(PB/B | B/T) * f(P/PB | PB/B/T). By B-stability the first and last factors are 1, so f(P/PB/B/T) = f(PB/B | B/T). By A1: ≈ f(PB/B). By A2: = Count(PB) / [Count(PB) + Count(A)]. Hence f(P/PB/B/T) = 1 / (1 + 2) = 1/3.
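The arithmetic of this example can be checked with a short Python sketch. It only covers this one case: the two B-stable factors are set to 1, and the unstable edge PB/B is estimated under A2 from the counts of B's parents PB and A. The helper name is illustrative, and the final Count estimate uses Count(V) = Count(Vn) * f(V) from the framework.

```python
from fractions import Fraction

# Node counts read off the synopsis graph: P(1), A(2), PB(1), B(2), T(2).
counts = {"P": 1, "A": 2, "PB": 1, "B": 2, "T": 2}


def unstable_edge_fraction(counts, parent, all_parents):
    """A2: share of `parent` among all parents of the child node, by count."""
    return Fraction(counts[parent], sum(counts[u] for u in all_parents))


# f(B/T) = 1 and f(P/PB | PB/B/T) = 1 by B-stability; only PB/B is unstable.
f_path = 1 * unstable_edge_fraction(counts, "PB", ["PB", "A"]) * 1
count_estimate = counts["T"] * f_path
print(f_path, count_estimate)
```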

Estimation Framework: A3, Branch-Independence Assumption: outgoing paths from v are conditionally independent of the existence of other outgoing paths. A4, Forward-Edge Uniformity Assumption: the outgoing edges from v to all children u of v such that Label(u,v) ≠ F are uniformly distributed across all such children.

XSketch Refinement Operations. Goal: construct an efficient XSketch for a given space budget. Refinement operations: B-Stabilize(XS(G), u, v), applicable when Label(v,u) ≠ B: refine node u into two element partitions u1, u2 with the same label such that Label(v,u1) = B or Label(v,u2) = B. [Figure: node U with parents V1, V2, …, Vn before, and nodes U1, U2 after b-Stabilize]

XSketch Refinement Operations: F-Stabilize(XS(G), u, w), applicable when Label(u,w) ≠ F: refine u into two nodes u1, u2 with the same label such that Label(u1,w) = Label(u,w) ∪ {F}. [Figure: node U with children W1, W2, …, Wn before, and nodes U1, U2 after f-Stabilize]

XSketch Refinement Operations: Backward Split. [Figure: a node A with parents P_1, …, P_n is split into A_1 and A_2 according to its parents]

Experiment
Data set  Nr. of elements  Coarsest Summary (KB)  Perfect Summary (MB)
IMDB      102,755          5.7                    1.5
XMark     206,131          3.7                    6.2
DBLP      1,399,766        17.0                   0.1

Workload: 1000 positive PEs (path expressions); biased random sample from the document; path length: 2-5; 500 contain range predicates (predicates: random, 10% of the value domain). Similar results with negative PEs.

Accuracy Metric Average Absolute Relative Error
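A minimal Python sketch of this metric, assuming the plain definition: the mean of |estimate − true| / true over all queries with a non-zero true selectivity. The exact variant used in the experiments is not stated here; the function name is illustrative.

```python
def average_absolute_relative_error(results):
    """Mean of |estimate - true| / true over (true, estimate) pairs with true > 0."""
    errors = [abs(estimate - true) / true for true, estimate in results if true > 0]
    return sum(errors) / len(errors)


# Two made-up queries: (true selectivity, estimated selectivity).
print(average_absolute_relative_error([(10, 8), (4, 5)]))
```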

Markov Tables vs. XSketch

Outline: Motivation; Definition of Selectivity Estimation; Algorithms for Selectivity Estimation (Path Trees and Markov Tables, XPathLearner, XSketches); Summary.

Definition of selectivity; summarizing XML documents (Path Trees / Markov Tables); application using Markov Tables: XPathLearner; extension of selectivity estimation to graphs: XSketch.

Questions?
