Organizing and Searching Information with XML Selectivity Estimation for XML Queries Thomas Beer, Christian Linz, Mostafa Khabouze.



Outline: Definition of Selectivity Estimation; Motivation; Algorithms for Selectivity Estimation (Path Trees, Markov Tables, XPathLearner, XSketches); Summary

Selectivity Definition: The selectivity σ(p) of a path expression p is defined as the number of paths in the XML data tree that match the tag sequence in p. Example: σ(A/B/D) = 2. [Figure: XML data tree with nodes A, B, C, E, D, D]
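The definition can be illustrated with a minimal Python sketch. The tree below is hypothetical (the slide's figure is not recoverable) and was chosen so that σ(A/B/D) = 2; the names `TREE`, `count_from` and `selectivity` are assumptions made for this illustration, as is the choice to let a match start at any node.

```python
# Hypothetical XML data tree as (tag, children) tuples; not the slide's figure.
TREE = ("A", [
    ("B", [("D", []), ("D", [])]),
    ("C", [("E", [])]),
])

def count_from(node, tags):
    """Count matches of the tag sequence that start exactly at this node."""
    tag, children = node
    if tag != tags[0]:
        return 0
    if len(tags) == 1:
        return 1
    return sum(count_from(child, tags[1:]) for child in children)

def selectivity(node, tags):
    """Count matches of the tag sequence starting at any node of the tree."""
    _, children = node
    return count_from(node, tags) + sum(selectivity(c, tags) for c in children)

print(selectivity(TREE, ["A", "B", "D"]))  # 2
```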

Motivation: Estimating the size of query results and intermediate results is necessary for effective query optimization. Knowing the selectivities of sub-queries helps identify cheap query evaluation plans. Internet context: quick feedback about the expected result size before evaluating the full query.

Example XQuery expression: FOR $f IN document("personnel.xml")//department/faculty WHERE count($f/TA) > 0 AND count($f/RA) > 0 RETURN $f. This expression matches all faculty members that have at least one TA and one RA. One join is computed for every edge of the query tree. Assumptions: the number of nodes per tag is known; the join algorithm is nested loop. [Figure: query tree with Department above Faculty, and RA and TA below Faculty]

Evaluating the join. Node counts: Dep. 1, Faculty 3, RA 7, TA 2. [Figure: document tree with Department, Name, Faculty, Secretary, Scientist, RA and TA nodes]
Method 1: Join 1: (Faculty) – TA; Join 2: (Result Join 1) – RA; Join 3: (Result Join 2) – Dep. Number of operations: Join 1: 3 * 2 = 6; Join 2: 1 * 7 = 7; Join 3: 1 * 1 = 1; Total = 14.
Method 2: Join 1: (Faculty) – Dep.; Join 2: (Result Join 1) – RA; Join 3: (Result Join 2) – TA. Number of operations: Join 1: 3 * 1 = 3; Join 2: 3 * 7 = 21; Join 3: 3 * 2 = 6; Total = 30.
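The two totals can be re-derived with a short Python sketch of nested-loop cost accounting. The node counts and intermediate result sizes are read off the slide; the function name `plan_cost` and the way a plan is encoded are assumptions for illustration.

```python
COUNTS = {"Dep": 1, "Faculty": 3, "RA": 7, "TA": 2}

def plan_cost(order, result_sizes):
    """Nested-loop cost of joining Faculty with the tags in `order`.

    result_sizes[i] is the size of the intermediate result after join i+1,
    which is exactly what selectivity estimation has to predict.
    """
    left = COUNTS["Faculty"]
    total = 0
    for tag, out_size in zip(order, result_sizes):
        total += left * COUNTS[tag]  # every left tuple is compared with every right node
        left = out_size
    return total

method1 = plan_cost(["TA", "RA", "Dep"], [1, 1, 1])
method2 = plan_cost(["Dep", "RA", "TA"], [3, 3, 1])
print(method1, method2)  # 14 30
```

Joining with the most selective edge first (Faculty with TA) shrinks the intermediate result early, which is why Method 1 is cheaper.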

Outline: Motivation; Definition of Selectivity Estimation; Algorithms for Selectivity Estimation (Path Trees, Markov Tables, XPathLearner, XSketches); Summary

Representing XML data structure: Path Trees; Markov Tables

Path Trees. Problem: the path tree may become larger than the available memory, so the tree has to be summarized. [Figure: path tree with nodes A, B, C, D, D, E]

Summarizing a Path Tree: 4 different algorithms: Sibling-*, Level-*, Global-*, No-*. Delete the nodes with the lowest frequencies and replace them with a "*" (star-node) to preserve some structural information.

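Sibling-*, operation breakdown: (1) Mark the nodes with the lowest frequencies for deletion. (2) Check the siblings: if siblings are marked, coalesce them into one *-node (in the figure: n=2, f=6). (3) Traverse the tree and compute the average frequency (in the figure: 3). [Figure: path tree with nodes A, B, C, E, G, H, K, K, F, D, I, J before summarization and A, B, C, *, K, F, *, * afterwards]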
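The coalescing step can be sketched in Python as follows. This is a simplification under stated assumptions: nodes are marked by a fixed frequency threshold rather than by a memory budget, only one sibling group is handled, and the sibling frequencies (I=4, J=2, K=11) are hypothetical values picked so that the *-node ends up with n=2, f=6 and average 3 as on the slide.

```python
def coalesce_siblings(children, threshold):
    """Replace all low-frequency siblings by a single *-node.

    children maps tag -> frequency for the children of one parent.
    Returns the surviving children and the *-node (or None).
    """
    kept = {tag: freq for tag, freq in children.items() if freq >= threshold}
    low = [freq for freq in children.values() if freq < threshold]
    star = None
    if low:
        # n = number of deleted nodes, f = their total frequency
        star = {"n": len(low), "f": sum(low), "avg": sum(low) / len(low)}
    return kept, star

kept, star = coalesce_siblings({"I": 4, "J": 2, "K": 11}, threshold=5)
print(kept, star)
```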

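Level-*: As before, delete the nodes with the lowest frequency. One *-node is kept for every level of the tree. [Figure: path tree with nodes A, B, C, E, G, H, K, K, F, D, I, J before summarization and A, B, C, G, K, F, *, *, K afterwards]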

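Global-*: Delete the nodes with the lowest frequency. One *-node is kept for the complete tree. [Figure: path tree with nodes A, B, C, E, G, H, K, K, F, D, I, J before summarization and *, B, C, G, H, K, K, F, D afterwards]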

No-*: Low-frequency nodes are deleted and not replaced. The tree may become a forest with many roots. No-* conservatively assumes that nodes that do not exist in the summarized path tree did not exist in the original path tree.

Selectivity Estimation: find all nodes whose tags match the path expression; the estimated selectivity is the total frequency of these nodes. Examples: σ(A/B/F) = 21; σ(A/B/Z) = 6; σ(A/C/Z/K) = 11. [Figure: summarized path tree with nodes A, B, C, *, K, F, *, *]
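A minimal Python sketch of this matching step is given below. It assumes that a tag is matched against regular nodes first and falls back to a *-node only when no regular node matches. The tree is hypothetical: its shape and frequencies are invented so that the three results agree with the slide's examples, since the actual figure cannot be recovered.

```python
def node(tag, freq, children=None):
    return {"tag": tag, "freq": freq, "children": children or []}

# Hypothetical summarized path tree (not the slide's figure).
ROOT = node("A", 1, [
    node("B", 27, [node("F", 21), node("*", 6)]),
    node("C", 11, [node("*", 11, [node("K", 11)])]),
])

def estimate(candidates, tags):
    """Total frequency of all nodes reached by the tag sequence."""
    matches = [n for n in candidates if n["tag"] == tags[0]]
    if not matches:
        # No regular tag matches: let a *-node take its place.
        matches = [n for n in candidates if n["tag"] == "*"]
    if len(tags) == 1:
        return sum(n["freq"] for n in matches)
    return sum(estimate(n["children"], tags[1:]) for n in matches)

print(estimate([ROOT], ["A", "B", "F"]))       # 21
print(estimate([ROOT], ["A", "B", "Z"]))       # 6
print(estimate([ROOT], ["A", "C", "Z", "K"]))  # 11
```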

Outline: Motivation; Definition of Selectivity Estimation; Algorithms for Selectivity Estimation (Path Trees, Markov Tables, XPathLearner, XSketches); Summary

What are Markov Tables? A table containing all distinct paths in the data of length up to m, together with their selectivity (m ≥ 2). Order: m − 1. Markov Table = Markov Histogram. [Figure: path tree with nodes A, B, C, D and frequencies 4, 9, 7, 8]
Path  Sel.    Path  Sel.
A     1       AC    6
B     11      AD    4
C     15      BC    9
D     19      BD    7
AB    11      CD    8

Selectivity Estimation: The table provides selectivity estimates for all paths of length up to m. The assumption is that the occurrence of a particular tag in a path depends only on the m−1 tags occurring before it. Selectivity estimation for longer path expressions is done with the following formula.

Selectivity Estimation, notation: P[t_n]: probability of tag t_n occurring in the XML data tree; N: total number of nodes in the XML data tree; P[t_i | t_{i+1}]: probability of tag t_i occurring before tag t_{i+1}; E: expectation for the occurrence of tag t_n; E1: expectation for the occurrence of tag t_i before tag t_{i+1}. [Figure: Markov chain t1, t2, t3, t…]

Selectivity Estimation: [Formula] = selectivity of path p. Example: [Figure]
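The formula itself did not survive in the transcript. As a hedged illustration, the following Python sketch applies the usual first-order (m = 2) Markov estimate, σ(t1/…/tn) ≈ f(t1/t2) · Π f(t_i/t_{i+1}) / f(t_i), to the Markov table shown earlier. That this is exactly the slide's formula is an assumption; the names `TABLE` and `estimate` are made up for the example.

```python
from fractions import Fraction

# Markov table for m = 2, copied from the "What are Markov Tables?" slide.
TABLE = {
    ("A",): 1, ("B",): 11, ("C",): 15, ("D",): 19,
    ("A", "B"): 11, ("A", "C"): 6, ("A", "D"): 4,
    ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 8,
}

def estimate(tags, table=TABLE):
    """First-order Markov estimate of the selectivity of t1/t2/.../tn."""
    if len(tags) <= 2:
        return Fraction(table.get(tuple(tags), 0))  # stored exactly
    est = Fraction(table.get((tags[0], tags[1]), 0))
    for prev, cur in zip(tags[1:], tags[2:]):
        if table.get((prev,), 0) == 0:
            return Fraction(0)
        # Fraction of prev-nodes that are followed by cur.
        est *= Fraction(table.get((prev, cur), 0), table[(prev,)])
    return est

print(estimate(["A", "B", "C"]))       # 9
print(estimate(["A", "B", "C", "D"]))  # 24/5
```

For A/B/C this gives 11 · 9/11 = 9, and for A/B/C/D it gives 9 · 8/15 = 24/5.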

Summarizing Markov Tables: The paths with the lowest selectivity are deleted and replaced. 3 algorithms: Suffix-*, Global-*, No-*.

Suffix-*: The *-path represents all deleted paths of length 1; the */*-path represents all deleted paths of length 2. When a path of length 1 is deleted, it is added to the path *. S_D is the set of deleted paths of length 2. When a path of length 2 is deleted, it is added to S_D, and S_D is searched for paths with the same start tag. Example: S_D = {(A/C), (G/H)}; deleting (A/B) yields the suffix-* path (A/*). Before checking S_D, check the Markov table for a suffix-* path with the same start tag.
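The deletion rule for paths of length 2 can be sketched in Python as below. The slide does not say how the selectivities of merged paths are aggregated, so keeping a running (total, count) pair per suffix-* path is an assumption, as are the function name `delete_length2` and the example selectivities.

```python
def delete_length2(path, table, deleted, suffix_star):
    """Delete a length-2 path (t1, t2) from the Markov table, Suffix-* style.

    deleted     : S_D, deleted length-2 paths not yet merged (path -> selectivity)
    suffix_star : suffix-* paths keyed by start tag -> (total, count)  [assumed layout]
    """
    sel = table.pop(path)
    start = path[0]
    if start in suffix_star:
        # An existing suffix-* path start/* absorbs the deleted path.
        total, n = suffix_star[start]
        suffix_star[start] = (total + sel, n + 1)
        return
    partner = next((p for p in deleted if p[0] == start), None)
    if partner is not None:
        # Two deleted paths share a start tag: merge them into start/*.
        suffix_star[start] = (deleted.pop(partner) + sel, 2)
    else:
        deleted[path] = sel

table = {("A", "B"): 2, ("B", "C"): 9}          # hypothetical selectivities
deleted = {("A", "C"): 1, ("G", "H"): 3}        # S_D from the slide's example
suffix_star = {}
delete_length2(("A", "B"), table, deleted, suffix_star)
print(suffix_star, deleted)
```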

Global-*: The *-path represents all deleted paths of length 1; the */*-path represents all deleted paths of length 2. When a path of length 1 is deleted, it is added to the path *. When a path of length 2 is deleted, it is immediately added to the path */*.

No-*: does not use *-paths. Low-frequency paths are simply discarded. If any of the required paths is not found in the Markov table, its selectivity is conservatively assumed to be zero.

Which method should be used? "*" vs. "No-*": if the queried paths exist in the XML data, use a *-algorithm; if the paths do not exist, use the No-* algorithm. Path Trees vs. Markov Tables: if the data has common structure, use Markov Tables; if the data has no common structure, use Path Trees.

Outline: Motivation; Definition of Selectivity Estimation; Algorithms for Selectivity Estimation (Path Trees, Markov Tables, XPathLearner, XSketches); Summary

Weaknesses of previous methods: off-line, requiring a scan of the entire data set; limited to simple path expressions; oblivious to workload distribution; updates too expensive.

XPathLearner is an on-line, self-tuning Markov histogram for XML path selectivity estimation. On-line: collects statistics from query feedback. Self-tuning: learns the Markov model from feedback and adapts to changing XML data. Workload-aware. Supports simple, single-value and multi-value path expressions.

System Overview. [Figure: Query Optimizer (Query Plan Enumerator, Selectivity Estimator), Execution Engine, Histogram Learner and Histogram; arrows labeled Query, Query Plan, Result and Query feedback]

Workflow: The system uses feedback to update the statistics for the queried path. Updates are based on the observed estimation error. [Figure: Training data feeds the Histogram Learner (initial training); the Histogram Learner updates the Histogram; the Selectivity Estimator returns the estimated selectivity; the feedback (real selectivity) and the observed estimation error flow back to the Histogram Learner]

Basics: Relies on path trees as intermediate representation. Uses a Markov histogram of order (m−1) to store the path tree and the statistics. Henceforth m=2: the table stores tag-tag pairs, tag-value pairs and single tags.

Data values. Problem: the number of distinct data values is very large, so the table may become larger than the available memory. Solution: only the k most frequent tag-value pairs are stored exactly; all other pairs are aggregated into buckets according to some feature. The feature should distribute the pairs as uniformly as possible.

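Example, k=1: Tag counts: A 1, B 6, C 3. Tag-tag counts: AB 6, AC 3. Tag-value counts (stored exactly): (B, v3) 3. Buckets (tag, feature, sum, #pairs): (B, b, 1, 1), (C, a, 1, 1). Data value v1 begins with the letter 'a', v2 with the letter 'b'. [Figure: tree with root A, children B and C, and data values V3, V1, V2]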
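The split into exact entries and feature buckets can be sketched in Python as follows. The concrete value strings ("cherry", "berry", "apple") are hypothetical stand-ins for v3, v2 and v1; only their first letters for v1 and v2 come from the slide. The function name and the (sum, #pairs) bucket layout mirror the example table but are otherwise assumptions.

```python
from collections import Counter

def build_value_histogram(pair_counts, k, feature=lambda value: value[0]):
    """Keep the k most frequent (tag, value) pairs exactly, bucket the rest.

    Buckets are keyed by (tag, feature(value)) and store (sum, number of pairs).
    """
    exact = dict(Counter(pair_counts).most_common(k))
    buckets = {}
    for (tag, value), count in pair_counts.items():
        if (tag, value) in exact:
            continue
        key = (tag, feature(value))
        total, n = buckets.get(key, (0, 0))
        buckets[key] = (total + count, n + 1)
    return exact, buckets

pairs = {("B", "cherry"): 3, ("B", "berry"): 1, ("C", "apple"): 1}
exact, buckets = build_value_histogram(pairs, k=1)
print(exact)    # {('B', 'cherry'): 3}
print(buckets)  # {('B', 'b'): (1, 1), ('C', 'a'): (1, 1)}
```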

Selectivity Estimation, notation: P[t_n]: probability of tag t_n occurring in the XML data tree; N: total number of nodes in the XML data tree; P[t_i | t_{i+1}]: probability of tag t_i occurring before tag t_{i+1}; E: expectation for the occurrence of tag t_n; E1: expectation for the occurrence of tag t_i before tag t_{i+1} (if n=2, t_{i+1} = t_n).

Selectivity Estimation: Simple path p = //t_1/t_2/.../t_n: [Formula]. Single-value path p = //t_1/t_2/.../t_{n-1}=v_{n-1}: analogous. [Formula] Multi-value path: slightly more complicated.

Selectivity Estimation of a multi-value path p = //t_1=v_1/t_2=v_2/.../t_n=v_n: uses the probability of v_i occurring after t_i, conditioned on observing t_i.

Example: Tag counts: A 1, B 6, C 3. Tag-tag counts: AB 6, AC 3. Tag-value counts (stored exactly): (B, v3) 3. Buckets (tag, feature, sum, #pairs): (B, b, 1, 1), (C, a, 1, 1). Real selectivity σ = 3.
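Since the estimation formulas are missing from the transcript, the following Python sketch shows one plausible reading for a single-value path with m = 2: the tag-tag chain as in a Markov table, multiplied by f(t, v) / f(t) for the final tag-value pair, with bucketed values estimated by the bucket average (sum divided by #pairs). Both the formula and the bucket average are assumptions; the tables are the ones from the example above.

```python
TAG = {"A": 1, "B": 6, "C": 3}
TAG_TAG = {("A", "B"): 6, ("A", "C"): 3}
EXACT = {("B", "v3"): 3}
BUCKETS = {("B", "b"): (1, 1), ("C", "a"): (1, 1)}  # (sum, #pairs)

def value_count(tag, value, feature):
    """Count for tag=value: exact if stored, else the bucket average."""
    if (tag, value) in EXACT:
        return EXACT[(tag, value)]
    total, n = BUCKETS.get((tag, feature), (0, 1))
    return total / n

def estimate_single_value(tags, value, feature):
    """Estimate sigma(//t1/.../tk = value) with a first-order Markov model."""
    est = TAG_TAG.get((tags[0], tags[1]), 0) if len(tags) > 1 else TAG[tags[0]]
    for prev, cur in zip(tags[1:], tags[2:]):
        est *= TAG_TAG.get((prev, cur), 0) / TAG[prev]
    last = tags[-1]
    return est * value_count(last, value, feature) / TAG[last]

print(estimate_single_value(["A", "B"], "v3", feature="c"))  # 3.0
print(estimate_single_value(["A", "C"], "v1", feature="a"))  # 1.0
```

Under these assumptions the estimate for //A/B=v3 is 6 · 3/6 = 3, which agrees with the real selectivity quoted in the example.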

Updates: Changes in the data require the statistics to be updated. This is done via a query feedback tuple (p, σ), where p denotes the path and σ denotes the accurate selectivity of p. The feedback is attributed to the sub-paths of p according to an update strategy.

Learning process. Given: an initially empty Markov histogram f; query feedback (p, σ); the estimated selectivity σ̂. Learn any unknown path of length 2. Update the selectivities for known paths. Two strategies: Heavy-Tail-Rule and Delta-Rule.

Algorithm, Part 1: learn new paths of length up to 2.
UPDATE(Histogram f, Feedback (p, σ), Estimate σ̂)
  if |p| ≤ 2 then
    if not exists f(p) then add entry f(p) = σ
    else f(p) ← σ
Example: σ̂(AD) = 1 (AD not in f), σ(AD) = 2. Tag counts: A 1, B 6, C 3. Tag-tag counts: AB 6, AC 3, AD 2 (new). Tag-value counts: (B, v3) 3. Buckets (tag, feature, sum, #pairs): (B, b, 1, 1), (C, a, 1, 1).

Algorithm, Part 2: learn longer paths (decompose into paths of length 2).
  else
    for each (t_i, t_{i+1}) ∈ p
      if not exists f(t_i, t_{i+1}) then add entry f(t_i, t_{i+1}) = 1
      f(t_i, t_{i+1}) ← update
    endfor
The update of f(t_i, t_{i+1}) depends on the update strategy.

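Example: σ̂(ACD) = 1, σ(ACD) = 5. Decompose ACD into AC and CD. AC is present: update its frequency. CD is not present: add f(CD) = 1, then update f(CD), giving f(CD) = 4. Tag counts: A 1, B 6, C 3. Tag-tag counts: AB 6, AC 5, CD 4 (added as 1, then updated). Tag-value counts: (B, v3) 3. Buckets (tag, feature, sum, #pairs): (B, b, 1, 1), (C, a, 1, 1).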

Algorithm, Part 3: learn the frequency of single tags.
  for each t_i ∈ p, i ≠ 1
    if not exists f(t_i) then add entry f(t_i)
    f(t_i) ← max{ f(t_i), Σ f(·, t_i) }
  endfor
Example: σ̂(AD) = 1 (AD not in f), σ(AD) = 2. Tag counts: A 1, B 6, C 3, D 2 (new). Tag-tag counts: AB 6, AC 3, AD 2. Tag-value counts: (B, v3) 3. Buckets (tag, feature, sum, #pairs): (B, b, 1, 1), (C, a, 1, 1).
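The three parts of the algorithm can be put together in a short Python sketch. It follows the pseudocode as far as it is legible; overwriting an existing short path with σ in Part 1 and passing the update strategy in as a callable are assumptions, and the histogram is simplified to tag entries only (no tag-value pairs).

```python
def update(f, path, sigma, strategy):
    """Apply one feedback (path, sigma) to histogram f (dict: tag tuple -> count)."""
    if len(path) <= 2:
        # Part 1: short paths are stored directly with the observed selectivity.
        f[tuple(path)] = sigma
    else:
        # Part 2: decompose into length-2 paths; unknown ones start at 1.
        pairs = list(zip(path, path[1:]))
        for pair in pairs:
            f.setdefault(pair, 1)
        strategy(f, pairs, sigma)  # e.g. a Heavy-Tail or Delta rule
    # Part 3: a tag occurs at least as often as all pairs ending in it.
    for tag in path[1:]:
        incoming = sum(c for key, c in f.items() if len(key) == 2 and key[1] == tag)
        f[(tag,)] = max(f.get((tag,), 0), incoming)

def no_update(f, pairs, sigma):
    """Placeholder strategy that leaves the pair counts unchanged."""

f = {("A",): 1, ("B",): 6, ("C",): 3, ("A", "B"): 6, ("A", "C"): 3}
update(f, ["A", "D"], 2, strategy=no_update)
print(f[("A", "D")], f[("D",)])  # 2 2
```

The printed values match the AD example: the pair AD is added with count 2, and the single tag D is learned with count 2.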

Update strategies, Heavy-Tail-Rule: attribute more of the estimation error to the end of the path. [Formula] where w_i are weighting factors (increasing with i, e.g. 2^i), γ is the learning rate and W is the normalizing weight.

Update strategies, Delta-Rule: an error-reduction learning technique that minimizes an error function E. [Formula] The update to the term f(t_i, t_{i+1}) is proportional to the negative gradient of E with respect to f(t_i, t_{i+1}); γ determines the length of a step.

Update strategies, Delta-Rule: the update to the term f(·, ·) is proportional to the negative gradient of E with respect to f(·, ·); γ determines the length of a step. [Formula]

Evaluation. Good: on-line, adapts to changing data; workload-aware; after the learning phase comparable to off-line methods; update overhead nearly constant. Bad: still restricted to XML trees, no support for idrefs.

Example: The feedback for path ACD is (ACD, 6); σ̂(ACD) ≈ 3, so ε = 6 − 3 = 3. Updates (Heavy-Tail attributes more to the end of the path):
Path  before update  after update (Heavy-Tail, γ=1)  after update (Delta, γ=0.5)
AC    3              4                               5
CD    6              8                               8
C     7              8                               9
D     7              9                               9
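The Heavy-Tail column for AC and CD can be reproduced with a small Python sketch. Because the slide's formula is missing, the rule used here, f(t_i, t_{i+1}) ← f(t_i, t_{i+1}) + γ · ε · w_i / W with w_i = 2^i and W = Σ w_i, is an assumption that happens to match those two table entries; the Delta column and the single-tag rows are not modeled.

```python
def heavy_tail_update(f, pairs, sigma, sigma_hat, gamma=1.0):
    """Distribute the estimation error over the pairs, more towards the path's end."""
    eps = sigma - sigma_hat                                   # observed estimation error
    weights = [2 ** i for i in range(1, len(pairs) + 1)]      # w_i = 2^i, increasing with i
    total = sum(weights)                                      # W, the normalizing weight
    for pair, w in zip(pairs, weights):
        f[pair] = f[pair] + gamma * eps * w / total

f = {("A", "C"): 3, ("C", "D"): 6}
heavy_tail_update(f, [("A", "C"), ("C", "D")], sigma=6, sigma_hat=3)
print(f)  # {('A', 'C'): 4.0, ('C', 'D'): 8.0}
```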

Outline: Motivation; Definition of Selectivity Estimation; Algorithms for Selectivity Estimation (Path Trees and Markov Tables, XPathLearner, XSketches); Summary

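Preliminaries, XML Data Graph. Legend: A: Author; P: Paper; B: Book; PB: Publisher; T: Title; N: Name. [Figure: XML data graph with element nodes P0, A1, A2, PB3, N4, B5, P6, P7, N8, B9, T10, T11, T12, T13, E14 and value nodes V4, V8, V10, V11, V12, V13, V14]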

Preliminaries, Path Expressions. XPath expressions: simple, e.g. A/P/T; complex, e.g. A[B]/P/T. The result is a set: {T1,T2}. [Figure in three builds over the XML data graph: nodes T11 and T12 highlighted; then T11 only; then T11 and T12]

Preliminaries. Motivation: selectivity estimation over XML data graphs. Outline: XSketch Synopsis; Estimation Framework; XSketch Refinement Operations; Experiment.

XSketch Synopsis: from the XML data graph to a general synopsis graph. Count(A) = |Extent(A)| = |{A1, A2}| = 2. [Figure: XML data graph and synopsis graph with nodes P(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1)]

Backward-edge Stability: Label(u,v) = b if all elements in v have a parent in u. [Figure: XML data graph and synopsis graph with nodes P(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1); backward-stable edges marked b]

Backward-edge Stability: the labels of the synopsis edges from A(2) to B(2) and from PB(1) to B(2) are empty, because neither A(2) nor PB(1) contains a parent for every element of B(2). [Figure: XML data graph and synopsis graph; backward-stable edges marked b]

Forward-edge Stability: Label(u,v) = f if all elements in u have a child in v. [Figure: XML data graph and synopsis graph with nodes P(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1); forward-stable edges marked f]

Forward-edge Stability: B9 is in B(2) but has no child in E(1), so the edge from B(2) to E(1) is not forward-stable. [Figure: XML data graph and synopsis graph; forward-stable edges marked f]

XSketch Synopsis: an XSketch is a synopsis graph in which every edge carries a label Label(u,v) ∈ {b, f, b/f, Ø}. [Figure: XML data graph and XSketch synopsis graph with nodes P(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1) and edge labels f/b, f, b and Ø]
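The two stability definitions translate directly into a Python sketch that computes the label of one synopsis edge. The parent-child pairs below are an assumed reading of the figure (PB3 is the parent of B5, A2 of B9, B5 of E14); only the extents and the fact that B9 has no child in E(1) are stated on the slides.

```python
def edge_label(extent_u, extent_v, edges):
    """Stability label of the synopsis edge u -> v.

    edges is the set of (parent, child) element pairs of the XML data graph.
    Returns a subset of {"b", "f"}; the empty set corresponds to the label Ø.
    """
    label = set()
    # Backward-stable: every element of v has a parent in u.
    if all(any((p, c) in edges for p in extent_u) for c in extent_v):
        label.add("b")
    # Forward-stable: every element of u has a child in v.
    if all(any((p, c) in edges for c in extent_v) for p in extent_u):
        label.add("f")
    return label

EDGES = {("PB3", "B5"), ("A2", "B9"), ("B5", "E14")}  # assumed from the figure
A, PB, B, E = {"A1", "A2"}, {"PB3"}, {"B5", "B9"}, {"E14"}

print(edge_label(A, B, EDGES))   # set(): neither stable
print(edge_label(PB, B, EDGES))  # {'f'}
print(edge_label(B, E, EDGES))   # {'b'}
```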

Estimation Framework: calculate the selectivity for the path expression (PE) V = V1/…/Vn as Count(V) = Count(Vn) * f(V). Case 1: if every edge is backward-stable, i.e. b ∈ Label(Vi, Vi+1) for all i, then f(V) = 1, so Count(V) = Count(Vn). Example: Count(A/P/T) = Count(T) * f(A/P/T) = 2. [Figure: XSketch synopsis graph with nodes P(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1)]

Estimation Framework, Case 2: there exists an i such that Label(Vi, Vi+1) ≠ {b}. A1, Path Independence Assumption: f(u/v | v/w) ≈ f(u/v). A2, Backward-Edge Uniformity Assumption: the incoming edges of v from all parents u with Label(u,v) ≠ b are uniformly distributed over all such parents. Example: f(P/PB/B/T) = ? [Figure: XSketch synopsis graph with nodes P(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1)]

Estimation Framework, Example: f(P/PB/B/T) = ?
f(P/PB/B/T) = f(B/T) * f(P/PB/B | B/T)
            = f(B/T) * f(PB/B | B/T) * f(P/PB | PB/B/T)
            = f(PB/B | B/T)                             (B-stability)
            ≈ f(PB/B)                                   (A1)
            = Count(PB) / [Count(PB) + Count(A)]        (A2)
f(P/PB/B/T) = 1 / (1 + 2) = 1/3
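The derivation can be checked with a small Python sketch that applies only assumptions A1 and A2: a backward-stable edge contributes a factor of 1, and any other edge u → v contributes Count(u) divided by the total count of all parents of v that are not backward-stable. The synopsis edge labels below are an assumed reading of the figure, and the dictionary names are hypothetical.

```python
from fractions import Fraction

COUNT = {"P": 1, "A": 2, "PB": 1, "B": 2, "T": 2}
LABELS = {                      # assumed edge labels of the synopsis graph
    ("P", "A"): {"f", "b"},
    ("P", "PB"): {"f", "b"},
    ("A", "B"): set(),
    ("PB", "B"): {"f"},
    ("B", "T"): {"f", "b"},
}

def path_fraction(path):
    """f(V) for a simple path, using A1 (independence) and A2 (B-edge uniformity)."""
    frac = Fraction(1)
    for u, v in zip(path, path[1:]):
        if "b" in LABELS[(u, v)]:
            continue  # B-stable: every element of v has a parent in u
        unstable = [p for (p, c), lab in LABELS.items() if c == v and "b" not in lab]
        frac *= Fraction(COUNT[u], sum(COUNT[p] for p in unstable))
    return frac

fraction = path_fraction(["P", "PB", "B", "T"])
print(fraction)               # 1/3
print(COUNT["T"] * fraction)  # 2/3, i.e. Count(Vn) * f(V)
```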

Estimation Framework A3. Branch-Independence Assumption: Outgoing paths from v are conditionally independent of the existence of other outgoing paths A4. Forward-Edge Uniformity Assumption : The outgoing edges from v to all children u of v such that Label(u,v) ≠ F are uniformly distributed across all such children

XSketch Refinement Operations. Goal: construct an efficient XSketch for a given space budget. Refinement operation B-Stabilize(Xs(G), u, v), applicable when Label(v,u) ≠ B: refine node u into two element partitions u1, u2 with the same label such that Label(v,u1) = B or Label(v,u2) = B. [Figure: node U with parents V1, V2, …, Vn is split by b-Stabilize into U1 and U2, with one incoming edge labeled b]
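A minimal Python sketch of the split behind B-Stabilize is shown below. It only partitions the extent of u; updating counts and relabeling the remaining edges of the synopsis is omitted. The element names and parent-child pairs are the same assumed reading of the data graph figure as before.

```python
def b_stabilize(extent_u, extent_v, edges):
    """Split u so that the edge from v to the first part becomes backward-stable.

    u1 holds the elements of u that have a parent in v, u2 the remaining ones.
    """
    u1 = {e for e in extent_u if any((p, e) in edges for p in extent_v)}
    return u1, extent_u - u1

EDGES = {("PB3", "B5"), ("A2", "B9")}  # assumed parent-child pairs
u1, u2 = b_stabilize({"B5", "B9"}, {"PB3"}, EDGES)
print(u1, u2)  # {'B5'} {'B9'}
```

After the split every element of u1 has a parent in v, so the edge from v to u1 satisfies the backward-stability definition.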

XSketch Refinement Operations, F-Stabilize(Xs(G), u, w), applicable when Label(u,w) ≠ F: refine u into two nodes u1, u2 with the same label such that Label(u1,w) = Label(u,w) ∪ {F}. [Figure: node U with children W1, W2, …, Wn is split by f-Stabilize into U1 and U2, with one outgoing edge labeled f]

XSketch Refinement Operations A P 1... PiPi P i+1... PnPn PiPi PnPn P 1... A1A1 A2A2 PiPi c(A) P 1... PiPi P i+1... PnPn PnPn Backward Split

Experiment. Columns: data set, Nr. of elements, Coarsest Summary (KB), Perfect Summary (MB). Rows (values truncated in the transcript): IMDB 102,; XMark 206,; DBLP 1,399,

Workload: 1000 positive PEs (path expressions); biased random sample from the document; Path Length: ; contain range predicates; predicates: random, 10% of value domain. Similar results with negative PEs.

Accuracy Metric: Average Absolute Relative Error

Markov Tables vs. XSketch

Outline: Motivation; Definition of Selectivity Estimation; Algorithms for Selectivity Estimation (Path Trees and Markov Tables, XPathLearner, XSketches); Summary

Summary: Definition of selectivity; summarizing XML documents (Path Trees / Markov Tables); application using Markov Tables: XPathLearner; extension of selectivity estimation to graphs: XSketch.

Questions?