XCluster Synopses for Structured XML Content
Alkis Polyzotis (UC Santa Cruz)
Minos Garofalakis (Intel Research, Berkeley)

XML Summarization
- Synopses are essential for XML data management
  - Statistics for XML query optimization
  - Approximate query answering
- Active research topic in the field of XML databases: Markov Tables, XSketch, XPathLearner, CSTs, TreeSketch, ...
- [Figure: count(Q) over the XML data gives the selectivity of Q; count(Q) over the synopsis gives the estimated selectivity of Q]

Content Heterogeneity
- Data (value types: numerical, string, text): 2003; The history of histograms (abridged); Yannis Ioannidis; "The history of histograms is long and rich, full of detailed information in every step. It..."
- Queries (predicate types: range, substring, term containment): //paper[year>2000][author contains “Ioannidis”]//abstract[ftcontains histograms,history]

Synopses and Heterogeneity
- Mixed predicates => unified summarization model
  - Path structure
  - Values of different types
  - Correlations between and across
- Summarization for textual values
- Example query: //paper[year>2000][author contains “Ioannidis”]//abstract[ftcontains histograms,history]
- [Figure: XML data and its synopsis]

XCluster Synopses
- Data synopses for heterogeneous XML content
  - Unified summarization for path structure and numerical, string, and textual content
  - Support for twig queries with mixed predicates
- XCluster model
  - Element clustering
  - Tight cluster: similar structure and values
  - Extensibility to other value types
- Principled compression framework
- Experimental results: high accuracy with low storage requirements

Outline
- Preliminaries
- XCluster Model
- XCluster Compression
- Construction Algorithm
- Experimental Study

Data and Query Model
- Data: tree data with heterogeneous value content (numerical, string, text)
- Query: tree-pattern queries with XPath expressions (range, substring, and term-containment predicates)
- Result: set of binding tuples
- Example: for $q0 in /, $q1 in $q0/p[y>1999], $q2 in $q1/t[contains(XML)], $q3 in $q1/ab[ftcontains(synopsis,data)]
- [Figure: data tree with numerical, string, and text values; query tree with nodes q0, q1, q2, q3]

Problem Definition
- Problem: build a data synopsis that can estimate the selectivity of any query
- Challenges:
  - Heterogeneity of content
  - Data correlations

XCluster Model

Structural Summarization
- Node: elements of the same tag
- Statistical information: node- and edge-counts
  - Node-count: number of elements in cluster
  - Edge-count: average number of children
- [Figure: data tree and the corresponding XCluster]
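A minimal sketch of how the node-counts and edge-counts above could be computed from a data tree, assuming the element-to-node assignment is already given. The names `summarize_structure`, `parent_of`, and `cluster_of` are hypothetical and do not come from the slides.

```python
from collections import Counter


def summarize_structure(parent_of, cluster_of):
    """Compute node-counts and edge-counts for a given clustering.

    parent_of:  element id -> parent element id (None for the root)
    cluster_of: element id -> synopsis node id (hypothetical encoding)
    """
    # Node-count: number of elements assigned to each synopsis node.
    node_count = Counter(cluster_of.values())

    # Total number of parent/child element pairs between two synopsis nodes.
    child_total = Counter()
    for elem, parent in parent_of.items():
        if parent is not None:
            child_total[(cluster_of[parent], cluster_of[elem])] += 1

    # Edge-count: average number of children in v per element of u.
    edge_count = {
        (u, v): total / node_count[u] for (u, v), total in child_total.items()
    }
    return dict(node_count), edge_count


# Toy tree: root r, two A elements, three B elements (b1, b2 under a1; b3 under a2).
parent_of = {"r": None, "a1": "r", "a2": "r", "b1": "a1", "b2": "a1", "b3": "a2"}
cluster_of = {"r": "R", "a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B"}
node_count, edge_count = summarize_structure(parent_of, cluster_of)
print(node_count, edge_count)
```

In this toy tree the two A elements have 2 and 1 children in B, so the edge-count from A to B is their average.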

Value Summarization
- Value summary => fractional value distribution
- Single-dimensional
- Approximation method depends on value type
- [Figure: data tree and the corresponding XCluster]

Types of Value Summaries
- Numerical content => histograms
- String content => pruned suffix tries
- Text content => end-biased term histograms

Text: “The history of histograms is long and rich, full of detailed information in every step. It...”

Term Matrix:
  Term              Freq
  0 (history)       2
  1 (histogram)     7
  2 (data)          6
  3 (database)      5
  4 (information)   3
  5 (value)         2

Term Histogram:
  Bucket   Freq
  010000   7
  001000   6
  000100   5
  100011   7/3
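The term histogram on this slide can be reproduced with a short sketch: the k most frequent terms keep exact frequencies in singleton buckets, and the remaining terms share one uniform bucket that stores their average frequency. The function names and the choice k = 3 are assumptions made for illustration; the slide does not say how the number of singleton buckets is chosen.

```python
from fractions import Fraction


def end_biased_histogram(term_freq, k):
    """Keep the k most frequent terms exactly; average the rest."""
    # Sort by descending frequency, breaking ties by term name.
    ranked = sorted(term_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    singletons = dict(ranked[:k])
    rest = ranked[k:]
    uniform_terms = {term for term, _ in rest}
    uniform_avg = (
        Fraction(sum(freq for _, freq in rest), len(rest)) if rest else Fraction(0)
    )
    return singletons, uniform_terms, uniform_avg


def estimate_frequency(term, singletons, uniform_terms, uniform_avg):
    """Estimated frequency of a term from the histogram."""
    if term in singletons:
        return Fraction(singletons[term])
    if term in uniform_terms:
        return uniform_avg
    return Fraction(0)


# Term frequencies taken from the slide.
term_freq = {
    "history": 2, "histogram": 7, "data": 6,
    "database": 5, "information": 3, "value": 2,
}
singletons, uniform_terms, uniform_avg = end_biased_histogram(term_freq, k=3)
print(singletons, sorted(uniform_terms), uniform_avg)
```

With k = 3 the uniform bucket holds history, information, and value, whose frequencies 2, 3, and 2 average to 7/3, matching the last row of the slide's histogram.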

XCluster Model
- A node aggregates information about its elements
- Correspondence to clustering: a node acts as the centroid of its cluster of elements
- Basic assumptions: independence and uniformity
- Tight clusters => valid assumptions
- Example: each element in A has:
  - 2 children in B
  - 3 children in C
  - value x with prob. 70%
  - value y with prob. 30%

Estimation Example
- Two-step estimation algorithm:
  - Identify embeddings
  - Estimate selectivity of each embedding
- Example embedding: sel(Q) = (1) * (2) * (1 * s_t) * (1/2 * s_k)
  - 1 element
  - 2 children
  - 1 * s_t children
  - 1/2 * s_k children
- Accuracy depends on “tightness” of centroids
- [Figure: XCluster, query, and embedding]
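The product on this slide can be sketched as follows: under the independence and uniformity assumptions, the estimate for one embedding is the root node-count multiplied, for every edge of the embedding, by the edge-count and the selectivity of the predicate on the child. The function names and the concrete values chosen for s_t and s_k are hypothetical; summing over embeddings is an assumption about how the per-embedding estimates are combined.

```python
import math


def embedding_selectivity(root_count, edges):
    """Estimate the number of binding tuples for one embedding.

    root_count: node-count of the synopsis node matched by the query root
    edges:      (edge_count, predicate_selectivity) pairs, one per edge
    """
    estimate = float(root_count)
    for edge_count, predicate_selectivity in edges:
        estimate *= edge_count * predicate_selectivity
    return estimate


def query_selectivity(embeddings):
    """Assumed combination step: add up the estimates of all embeddings."""
    return sum(embedding_selectivity(root, edges) for root, edges in embeddings)


# Slide example: (1) * (2) * (1 * s_t) * (1/2 * s_k), with made-up predicate
# selectivities s_t and s_k.
s_t, s_k = 0.5, 0.2
embedding = (1, [(2, 1.0), (1, s_t), (0.5, s_k)])
print(query_selectivity([embedding]))
```

The first edge carries no value predicate, so its selectivity is set to 1.0.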

XCluster Compression

Structural Compression
- Merge two nodes of same tag
- New node acquires aggregate characteristics
  - Node- and edge-counts are aggregated
  - Value summaries are “fused”
- Conceptually equivalent to cluster merging
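One way to read "node- and edge-counts are aggregated" is sketched below: node-counts add up, and each outgoing edge-count becomes the average weighted by node-count, so that the total number of children is preserved. This weighting is an assumption (the slide does not give the formula), the function name is hypothetical, and incoming edges and value-summary fusion are left out.

```python
def merge_nodes(count_u, edges_u, count_v, edges_v):
    """Merge two same-tag synopsis nodes u and v (outgoing edges only).

    count_*: node-count of each node
    edges_*: child node id -> edge-count (average children per element)
    """
    total = count_u + count_v
    merged_edges = {}
    for child in set(edges_u) | set(edges_v):
        # Total children in `child` across both clusters, spread over all elements.
        children = count_u * edges_u.get(child, 0.0) + count_v * edges_v.get(child, 0.0)
        merged_edges[child] = children / total
    return total, merged_edges


# Hypothetical nodes: u has 2 elements, v has 6 elements.
count, edges = merge_nodes(2, {"B": 2.0, "C": 3.0}, 6, {"B": 1.0})
print(count, edges)
```

After such a merge, elements that had no C children are assumed to have 0.75 on average, which illustrates why merging loose clusters hurts accuracy.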

Value-Based Compression
- Reduce the storage of a single value summary
- Specifics depend on type of summary
  - Histogram: merge k buckets
  - Pruned Suffix Trie: prune k nodes (remove leaf nodes based on statistical independence)
  - Term Histogram: move k terms to the uniform bucket
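For the term-histogram case, a minimal sketch of moving k terms to the uniform bucket is shown below. It assumes the k least frequent singleton terms are the ones moved and that the uniform bucket stores an average frequency; the function name and data layout are hypothetical.

```python
from fractions import Fraction


def demote_terms(singletons, uniform_terms, uniform_avg, k):
    """Move the k least frequent singleton terms into the uniform bucket."""
    # Sort by ascending frequency, breaking ties by term name.
    ranked = sorted(singletons.items(), key=lambda kv: (kv[1], kv[0]))
    demoted, kept = ranked[:k], dict(ranked[k:])

    # Recover the total mass of the uniform bucket, add the demoted terms,
    # and recompute the average over the enlarged bucket.
    total = Fraction(uniform_avg) * len(uniform_terms) + sum(f for _, f in demoted)
    terms = set(uniform_terms) | {term for term, _ in demoted}
    new_avg = total / len(terms) if terms else Fraction(0)
    return kept, terms, new_avg


# Illustrative histogram: three singleton terms plus a three-term uniform bucket.
kept, terms, avg = demote_terms(
    {"histogram": 7, "data": 6, "database": 5},
    {"history", "information", "value"},
    Fraction(7, 3),
    k=1,
)
print(kept, sorted(terms), avg)
```

Each demoted term saves one singleton bucket at the cost of replacing its exact frequency with the bucket average.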

Compression vs. Accuracy
- Δ(S,S’): difference in accuracy between S and S’
- Key idea: apply operations with low Δ(S,S’)
- Absolute vs. relative metric
- [Figure: original XCluster S and compressed XCluster S’, compared under the absolute metric and under the relative metric; the figure also shows a reference R]

Distance Metric Δ(S,S’)
- μ-query => basic query involving structure + values
  - u[s]/c: the number of children in c per element in u that satisfies value predicate s
  - Intuition: capture centroid information pertaining to c and s
- Δ(S,S’): difference of estimates for μ-queries
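A rough sketch of a distance built from μ-query estimates follows. The slide only says that Δ(S,S’) is a difference of estimates; summing absolute differences is an assumption made here, and the dictionary encoding of μ-queries as (u, s, c) keys is hypothetical.

```python
def synopsis_distance(estimates_s, estimates_s_prime):
    """Sum of absolute differences between mu-query estimates of S and S'.

    Each argument maps a mu-query key (u, s, c) to the estimated number of
    children in c per element of u that satisfies predicate s.
    """
    queries = set(estimates_s) | set(estimates_s_prime)
    return sum(
        abs(estimates_s.get(q, 0.0) - estimates_s_prime.get(q, 0.0))
        for q in queries
    )


# Made-up estimates before (S) and after (S') a compression operation.
before = {("A", "y>2000", "B"): 2.0, ("A", "y>2000", "C"): 3.0}
after = {("A", "y>2000", "B"): 1.5, ("A", "y>2000", "C"): 3.0}
print(synopsis_distance(before, after))
```

An operation that leaves every μ-query estimate unchanged gets distance zero, which makes it a natural first candidate for compression.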

XCluster Construction

- Step 1: Build reference synopsis (count stability + detailed value summaries)
- Step 2: Compress structural information
- Step 3: Compress value-based information
- [Figure: XML Data -> (Step 1) Reference Summary -> (Step 2) XCluster with detailed value distributions -> (Step 3) XCluster]

Structural Compression
- Algorithm sketch:
  1. Generate pool of candidate merge operations
  2. Apply operations in increasing order of Δ(S,S’)
  3. Repeat until size < budget
- A-priori generation of candidates
  - Merges at level l trigger merges at level l-1
  - Adaptive, leaf-to-root merging of nodes
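The three-step loop above can be sketched with a priority queue. This is a simplification under stated assumptions: every candidate carries a precomputed Δ and a fixed size saving, whereas a real implementation would have to recompute Δ and regenerate candidates as nodes are merged. The function name and the candidate tuples are hypothetical.

```python
import heapq


def greedy_compress(size, budget, candidates):
    """Apply candidate operations in increasing order of delta.

    candidates: iterable of (delta, size_saving, name) tuples.
    Stops as soon as size < budget or the pool is exhausted.
    """
    pool = list(candidates)
    heapq.heapify(pool)  # smallest delta first
    applied = []
    while size >= budget and pool:
        delta, saving, name = heapq.heappop(pool)
        applied.append(name)
        size -= saving
    return size, applied


# Made-up pool of merge operations.
pool = [(0.5, 20, "m1"), (0.1, 10, "m2"), (0.9, 30, "m3"), (0.3, 15, "m4")]
print(greedy_compress(100, 70, pool))
```

The operation with the largest Δ is only reached if the cheaper ones do not free enough space.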

Value-Based Compression
- Algorithm sketch:
  1. Generate one operation for each value summary
  2. Apply value compression with least Δ(S,S’)
  3. Repeat until size < budget
- Generate operations of “least effect”:
  - Histograms: merge buckets with least difference
  - PSTs: prune leaves with max independence
  - Term Histograms: remove singletons of least freq.
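For the histogram case, one "least effect" operation could look like the sketch below. It assumes that "least difference" means the smallest gap in per-value density between two adjacent buckets; the slide does not define the measure, and the function name and bucket layout are hypothetical.

```python
def merge_closest_buckets(buckets):
    """Merge the adjacent bucket pair with the most similar density.

    buckets: list of (lo, hi, count) with contiguous, non-empty ranges.
    """
    if len(buckets) < 2:
        return list(buckets)

    def density(bucket):
        lo, hi, count = bucket
        return count / (hi - lo)

    # Index of the left bucket of the pair with the smallest density gap.
    i = min(
        range(len(buckets) - 1),
        key=lambda j: abs(density(buckets[j]) - density(buckets[j + 1])),
    )
    lo, _, left_count = buckets[i]
    _, hi, right_count = buckets[i + 1]
    return buckets[:i] + [(lo, hi, left_count + right_count)] + buckets[i + 2:]


# Made-up histogram: densities 10, 9, and 1 per unit of range.
print(merge_closest_buckets([(0, 10, 100), (10, 20, 90), (20, 30, 10)]))
```

Merging the two buckets with nearly equal density changes the approximated distribution very little, which is why it is generated first.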

Experimental Study

Methodology
- Data sets (columns: #Elements, #Value Paths, Ref. Size (KB)):
  - XMark 206130 9869
  - IMDB 236822 7462
- Workloads: random twig queries
  - Structure only and with predicates
  - Biased toward high selectivities
- Metrics:
  - Absolute relative error: |true-estim|/max(true,s)
  - Absolute error: |true-estim|
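The two metrics can be written down directly. In this sketch `s` is treated as a caller-supplied lower bound on the denominator, which is an interpretation of the formula on the slide; the function names and sample numbers are hypothetical.

```python
def absolute_error(true_count, estimate):
    """|true - estim|"""
    return abs(true_count - estimate)


def absolute_relative_error(true_count, estimate, s):
    """|true - estim| / max(true, s)

    s keeps the denominator from becoming very small for queries with a
    low true count (assumed role of s).
    """
    return abs(true_count - estimate) / max(true_count, s)


# Made-up counts: a low-count query and a high-count query, with s = 100.
print(absolute_relative_error(50, 40, s=100))
print(absolute_relative_error(200, 150, s=100))
```

For the low-count query the denominator is s rather than the true count, so a miss of 10 tuples is not reported as a 20% error.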

Accuracy of XClusters
- [Figure: accuracy results on IMDB]

XCluster vs. TreeSketch
- [Figure: comparison on XMark]

Conclusions
- XML synopses are essential for XML query optimization
- Our contribution: XCluster Synopses
  - XML summaries for heterogeneous content
  - Support for twig queries with numerical, string, and textual predicates
  - XCluster model: generalized element clustering
  - Principled construction algorithm
- Experimental results: high accuracy with low storage requirements
