Published by Lynne Dean. Modified over 9 years ago.
1
XCluster Synopses for Structured XML Content
Alkis Polyzotis (UC Santa Cruz)
Minos Garofalakis (Intel Research, Berkeley)
2
XML Summarization
Synopses are essential for XML data management
  Statistics for XML query optimization
  Approximate query answering
Active research topic in the field of XML databases
  Markov Tables, XSketch, XPathLearner, CSTs, TreeSketch, ...
[Figure: count(Q) on the XML data gives the selectivity of Q; count(Q) on the synopsis gives the estimated selectivity of Q]
3
Content Heterogeneity
Data (example values): 2003; The history of histograms (abridged); Yannis Ioannidis; “The history of histograms is long and rich, full of detailed information in every step. It...”
Query: //paper[year>2000][author contains “Ioannidis”]//abstract[ftcontains histograms,history]
Value types (data): Numerical, String, Text
Predicate types (queries): Range, Substring, Term Containment
4
Synopses and Heterogeneity
Mixed predicates => Unified summarization model
  Path structure
  Values of different types
  Correlations between and across
Summarization for textual values
Example query: //paper[year>2000][author contains “Ioannidis”]//abstract[ftcontains histograms,history]
[Figure: XML Data and Synopsis]
5
XCluster Synopses
Data synopses for heterogeneous XML content
  Unified summarization for path structure and numerical, string, and textual content
  Support for twig queries with mixed predicates
XCluster model
  Element clustering
  Tight cluster => Similar structure and values
  Extensibility to other value types
Principled compression framework
Experimental results: high accuracy with low storage requirements
6
Outline
  Preliminaries
  XCluster Model
  XCluster Compression
  Construction Algorithm
  Experimental Study
7
Data and Query Model
Tree data with heterogeneous value content
Tree-pattern queries with XPath expressions
  for $q0 in /, $q1 in $q0/p[y>1999], $q2 in $q1/t[contains(XML)], $q3 in $q1/ab[ftcontains(synopsis,data)]
Result: set of binding tuples
[Figure: data tree and query tree with variables q0, q1, q2, q3; value types Numerical, String, Text; predicate types Range, Substring, Term Containment]
8
Problem Definition
Problem: build a data synopsis that can estimate the selectivity of any query
Challenges:
  Heterogeneity of content
  Data correlations
9
XCluster Model
10
Structural Summarization
Node: elements of the same tag
Statistical information: node- and edge-counts
  Node-count: number of elements in cluster
  Edge-count: average number of children
[Figure: XCluster and Data]
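A minimal sketch of node- and edge-counts, assuming the coarsest possible clustering (one node per tag). The `Element` class, the function name, and the sample tags are hypothetical and are not taken from the slides.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Element:
    tag: str
    children: list = field(default_factory=list)


def summarize_by_tag(root):
    """Return (node_counts, edge_counts) for a one-node-per-tag synopsis."""
    node_counts = Counter()
    child_totals = defaultdict(int)  # (parent tag, child tag) -> total children
    stack = [root]
    while stack:
        elem = stack.pop()
        node_counts[elem.tag] += 1
        for child in elem.children:
            child_totals[(elem.tag, child.tag)] += 1
            stack.append(child)
    # Edge-count: average number of children per element of the parent node.
    edge_counts = {
        edge: total / node_counts[edge[0]] for edge, total in child_totals.items()
    }
    return node_counts, edge_counts


# Hypothetical document: two papers with a different number of authors.
doc = Element("dblp", [
    Element("paper", [Element("author"), Element("author"), Element("year")]),
    Element("paper", [Element("author"), Element("year")]),
])
node_counts, edge_counts = summarize_by_tag(doc)
```

Here the `paper` node holds 2 elements and its edge to `author` records an average of 1.5 children, which hides the fact that one paper has two authors and the other has one.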
11
Value Summarization
Value summary => Fractional value distribution
  Single-dimensional
  Approximation method depends on value type
[Figure: XCluster and Data]
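As an illustration of a fractional, single-dimensional value distribution, the sketch below builds an equi-width histogram over numeric values and estimates a range predicate from it. The equi-width bucketing, the function names, and the sample years are assumptions made for the example; the slides do not say which histogram variant is used.

```python
def build_histogram(values, lo, hi, num_buckets):
    """Equi-width histogram storing the fraction of values per bucket."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    total = len(values)
    return [
        (lo + i * width, lo + (i + 1) * width, c / total)
        for i, c in enumerate(counts)
    ]


def range_selectivity(hist, a, b):
    """Estimate the fraction of values in [a, b), assuming uniformity per bucket."""
    sel = 0.0
    for left, right, frac in hist:
        overlap = max(0.0, min(b, right) - max(a, left))
        sel += frac * overlap / (right - left)
    return sel


# Hypothetical year values for the elements of one node.
years = [1995, 1998, 2001, 2003, 2003, 2004, 2006, 2009]
hist = build_histogram(years, 1990, 2010, 4)
sel_recent = range_selectivity(hist, 2000, 2010)
```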
12
Types of Value Summaries
Numerical Content => Histograms
String Content => Pruned Suffix Tries
Text Content => End-biased Term Histograms
Text: “The history of histograms is long and rich, full of detailed information in every step. It...”

Term Matrix:
Term             Freq
0 (history)      2
1 (histogram)    7
2 (data)         6
3 (database)     5
4 (information)  3
5 (value)        2

Term Histogram:
Bucket  Freq
010000  7
001000  6
000100  5
100011  7/3
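The term histogram in the example can be reproduced with a short sketch: the most frequent terms keep exact singleton buckets, and the remaining terms share one uniform bucket that stores their average frequency. The function names are hypothetical, and the uniform bucket is kept as a set of terms rather than the bit vector shown on the slide.

```python
from fractions import Fraction


def end_biased_term_histogram(term_freqs, num_singletons):
    """Keep the top terms exactly; average the rest into one uniform bucket."""
    ranked = sorted(term_freqs.items(), key=lambda kv: kv[1], reverse=True)
    singletons = dict(ranked[:num_singletons])
    rest = ranked[num_singletons:]
    uniform_terms = {term for term, _ in rest}
    if rest:
        uniform_freq = Fraction(sum(freq for _, freq in rest), len(rest))
    else:
        uniform_freq = Fraction(0)
    return singletons, uniform_terms, uniform_freq


def estimate_freq(hist, term):
    """Estimated frequency of a term under the histogram."""
    singletons, uniform_terms, uniform_freq = hist
    if term in singletons:
        return Fraction(singletons[term])
    if term in uniform_terms:
        return uniform_freq
    return Fraction(0)


# Term frequencies from the slide's term matrix.
freqs = {"history": 2, "histogram": 7, "data": 6,
         "database": 5, "information": 3, "value": 2}
hist = end_biased_term_histogram(freqs, num_singletons=3)
```

With three singleton buckets, `histogram`, `data`, and `database` keep their exact frequencies, and `history`, `information`, and `value` share the average (2 + 3 + 2) / 3 = 7/3, matching the slide.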
13
XCluster Model
A node aggregates information about its elements
Correspondence to clustering: node = cluster centroid of its elements
Basic assumptions: independence and uniformity
  Tight clusters => Valid assumptions
Example: each element in A has
  - 2 children in B
  - 3 children in C
  - value x with prob. 70%
  - value y with prob. 30%
14
Estimation Example
Two-step estimation algorithm:
  1. Identify embeddings
  2. Estimate selectivity of each embedding
Example embedding of the query in the XCluster: 1 element, 2 children, 1*s_t children, 1/2*s_k children
  sel(Q) = (1) * (2) * (1*s_t) * (1/2*s_k)
Accuracy depends on “tightness” of centroids
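The product in the example can be written as a small function: start from the node-count of the root of the embedding and multiply by each edge-count and predicate selectivity along the way, which is what the independence assumption allows. The function names and the numeric values chosen for s_t and s_k are hypothetical.

```python
from math import isclose, prod


def embedding_selectivity(root_count, steps):
    """Selectivity of one embedding.

    steps: (edge_count, predicate_selectivity) pairs, one per query edge;
    use 1.0 as the selectivity for a step without a value predicate.
    """
    return root_count * prod(edge * sel for edge, sel in steps)


def query_selectivity(embeddings):
    """Sum the estimates of all embeddings of the query."""
    return sum(embedding_selectivity(root, steps) for root, steps in embeddings)


# Hypothetical predicate selectivities standing in for s_t and s_k.
s_t, s_k = 0.4, 0.5
# Mirrors sel(Q) = (1) * (2) * (1*s_t) * (1/2*s_k) from the slide.
estimate = embedding_selectivity(1, [(2, 1.0), (1, s_t), (0.5, s_k)])
```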
15
XCluster Compression
16
Structural Compression
Merge two nodes of same tag
New node acquires aggregate characteristics
  Node- and edge-counts are aggregated
  Value summaries are “fused”
Conceptually equivalent to cluster merging
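One way to picture a merge, as a sketch only: node-counts add up, while edge-counts and value fractions are averaged with the node-counts as weights. The count-weighted averaging and the `Node` layout are assumptions made for illustration, and redirecting the incoming edges of the merged nodes is left out.

```python
from dataclasses import dataclass
from math import isclose


@dataclass
class Node:
    tag: str
    count: int     # node-count: number of elements in the cluster
    edges: dict    # child node id -> average children per element
    values: dict   # value -> fraction of elements carrying it


def merge_nodes(u, v):
    """Merge two same-tag nodes into one node with aggregate statistics."""
    if u.tag != v.tag:
        raise ValueError("only nodes of the same tag can be merged")
    total = u.count + v.count

    def weighted(a, b):
        # Count-weighted average; a missing key contributes zero.
        keys = set(a) | set(b)
        return {
            k: (u.count * a.get(k, 0.0) + v.count * b.get(k, 0.0)) / total
            for k in keys
        }

    return Node(u.tag, total, weighted(u.edges, v.edges), weighted(u.values, v.values))


# Hypothetical nodes: 10 papers with 2 authors each, 30 papers with 1 author each.
u = Node("paper", 10, {"author": 2.0}, {"x": 0.7, "y": 0.3})
v = Node("paper", 30, {"author": 1.0, "year": 1.0}, {"x": 0.1, "y": 0.9})
merged = merge_nodes(u, v)
```

The merged node reports 1.25 authors per paper, which is exact on average but no longer distinguishes the two original groups; that loss is what the distance metric later has to measure.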
17
Value-Based Compression
Reduce the storage of a single value summary
Specifics depend on type of summary
  Histogram: merge k buckets
  Pruned Suffix Trie: prune k nodes
    Remove leaf nodes based on statistical independence
  Term Histogram: move k terms to the uniform bucket
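For the term-histogram case, moving k terms to the uniform bucket could look like the sketch below. It assumes the k lowest-frequency singletons are the ones demoted and that the uniform bucket tracks its total frequency so the average can be recomputed; the function name and data layout are hypothetical.

```python
from fractions import Fraction


def demote_terms(singletons, uniform_terms, uniform_total, k):
    """Move the k least frequent singleton terms into the uniform bucket."""
    kept = dict(singletons)
    terms = set(uniform_terms)
    total = uniform_total
    for term in sorted(kept, key=kept.get)[:k]:
        total += kept.pop(term)
        terms.add(term)
    avg = Fraction(total, len(terms)) if terms else Fraction(0)
    return kept, terms, total, avg


# Starting point: the term histogram from the earlier slide
# (three singletons, uniform bucket with total frequency 7 over 3 terms).
kept, terms, total, avg = demote_terms(
    {"histogram": 7, "data": 6, "database": 5},
    {"history", "information", "value"},
    7,
    k=1,
)
```

After demoting `database`, the uniform bucket covers four terms with total frequency 12, so every term in it is estimated at frequency 3.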
18
Compression vs. Accuracy
Δ(S,S’): difference in accuracy between S and S’
Key idea: apply operations with low Δ(S,S’)
Absolute vs. Relative metric
[Figure: original XCluster S, compressed XCluster S’, and R, shown for the Absolute and Relative metrics]
19
Distance Metric Δ(S,S’)
μ-query => basic query involving structure+values
  u[s]/c: the number of children in c per element in u that satisfies value predicate s
  Intuition: capture centroid information pertaining to c and s
Δ(S,S’): difference of estimates for μ-queries
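The slide does not state how the per-query differences are combined, so the sketch below assumes a sum of squared differences over a set of μ-queries. The dictionary layout, keyed by (u, s, c), and the sample estimates are hypothetical.

```python
from math import isclose


def delta(estimates_s, estimates_s_prime):
    """Sum of squared differences between μ-query estimates of S and S'.

    Each argument maps a μ-query (u, s, c) to the estimated number of
    children in c per element of u that satisfies predicate s.
    The squared-sum aggregation is an assumption, not the slide's formula.
    """
    keys = set(estimates_s) | set(estimates_s_prime)
    return sum(
        (estimates_s.get(k, 0.0) - estimates_s_prime.get(k, 0.0)) ** 2
        for k in keys
    )


# Hypothetical estimates before (S) and after (S') a compression step.
before = {("A", "x", "B"): 1.4, ("A", "y", "B"): 0.6}
after = {("A", "x", "B"): 1.0, ("A", "y", "B"): 1.0}
distance = delta(before, after)
```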
20
XCluster Construction
21
Step 1: Build reference synopsis
  Count stability + Detailed value summaries
Step 2: Compress structural information
Step 3: Compress value-based information
[Figure: XML Data => (Step 1) Reference Summary => (Step 2) XCluster with detailed value distributions => (Step 3) XCluster]
22
Structural Compression
Algorithm sketch:
  1. Generate pool of candidate merge operations
  2. Apply operations in increasing order of Δ(S,S’)
  3. Repeat until size < budget
A-priori generation of candidates
  Merges at level l trigger merges at level l-1
  Adaptive, leaf-to-root merging of nodes
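The greedy loop in the algorithm sketch can be illustrated on a toy synopsis. Everything specific here is an assumption: each node carries a single edge-count (`fanout`), the cost function is a stand-in for Δ(S,S’), size is measured as the number of nodes, and the candidate pool is rebuilt from all same-tag pairs on every round instead of the level-wise, leaf-to-root generation described on the slide.

```python
from itertools import combinations
from math import isclose


def merged_fanout(a, b):
    """Count-weighted average fanout of two nodes."""
    total = a["count"] + b["count"]
    return (a["count"] * a["fanout"] + b["count"] * b["fanout"]) / total


def merge_cost(a, b):
    # Hypothetical stand-in for Δ(S, S'): count-weighted squared change in fanout.
    m = merged_fanout(a, b)
    return a["count"] * (a["fanout"] - m) ** 2 + b["count"] * (b["fanout"] - m) ** 2


def compress(nodes, budget):
    """Greedily merge same-tag nodes until at most `budget` nodes remain."""
    nodes = list(nodes)
    while len(nodes) > budget:
        candidates = [
            (merge_cost(nodes[i], nodes[j]), i, j)
            for i, j in combinations(range(len(nodes)), 2)
            if nodes[i]["tag"] == nodes[j]["tag"]
        ]
        if not candidates:
            break  # nothing left to merge
        _, i, j = min(candidates)  # cheapest merge first
        merged = {
            "tag": nodes[i]["tag"],
            "count": nodes[i]["count"] + nodes[j]["count"],
            "fanout": merged_fanout(nodes[i], nodes[j]),
        }
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes


# Hypothetical reference synopsis with three paper clusters and one author cluster.
reference = [
    {"tag": "paper", "count": 10, "fanout": 2.0},
    {"tag": "paper", "count": 10, "fanout": 2.2},
    {"tag": "paper", "count": 5, "fanout": 6.0},
    {"tag": "author", "count": 20, "fanout": 0.0},
]
compressed = compress(reference, budget=3)
```

The two paper clusters with nearly identical fanout are merged first, while the outlier cluster with fanout 6.0 survives, which is the behavior the "low Δ first" rule is after.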
23
Value-Based Compression
Algorithm sketch:
  1. Generate one operation for each value summary
  2. Apply value compression with least Δ(S,S’)
  3. Repeat until size < budget
Generate operations of “least effect”:
  Histograms: merge buckets with least difference
  PSTs: prune leaves with max independence
  Term Histograms: remove singletons of least freq.
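For the histogram case, a "least effect" operation might be sketched as merging the pair of adjacent buckets whose densities differ least. Reading "least difference" as the difference in density (count per unit width) is an assumption, as are the function name and the sample buckets.

```python
def merge_closest_buckets(buckets):
    """Merge the two adjacent buckets with the smallest density difference.

    buckets: list of (lo, hi, count) tuples sorted by range; needs at least two.
    """
    def density(bucket):
        lo, hi, count = bucket
        return count / (hi - lo)

    i = min(
        range(len(buckets) - 1),
        key=lambda k: abs(density(buckets[k]) - density(buckets[k + 1])),
    )
    lo, _, left_count = buckets[i]
    _, hi, right_count = buckets[i + 1]
    return buckets[:i] + [(lo, hi, left_count + right_count)] + buckets[i + 2:]


# Hypothetical histogram: the first two buckets are nearly identical in density.
buckets = [(0, 10, 5), (10, 20, 6), (20, 30, 40), (30, 40, 10)]
smaller = merge_closest_buckets(buckets)
```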
24
Experimental Study
25
Methodology
Data sets (columns: #Elements, #Value Paths, Ref. Size (KB)):
  XMark 206130 9869
  IMDB 236822 7462
Workloads: random twig queries
  Structure only and with predicates
  Biased toward high selectivities
Metrics:
  Absolute relative error: |true-estim|/max(true,s)
  Absolute error: |true-estim|
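The two metrics translate directly into code. The slide does not define s; the sketch reads it as a lower bound on the denominator that keeps queries with very small true counts from dominating the relative error, and that reading is an assumption. The function and parameter names are hypothetical.

```python
from math import isclose


def absolute_error(true, estim):
    """|true - estim|"""
    return abs(true - estim)


def absolute_relative_error(true, estim, s):
    """|true - estim| / max(true, s); s is assumed to be a lower bound on the denominator."""
    return abs(true - estim) / max(true, s)


# Hypothetical counts: a high-selectivity query and a low-count query.
err_large = absolute_relative_error(true=100, estim=80, s=10)
err_small = absolute_relative_error(true=2, estim=6, s=10)
```

Without the bound, the second query would report a relative error of 2.0 for missing by only four elements; with s = 10 it reports 0.4.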
26
Accuracy of XClusters
[Figure: results on IMDB]
27
XCluster vs. TreeSketch
[Figure: results on XMark]
28
Conclusions
XML synopses are essential for XML query optimization
Our contribution: XCluster Synopses
  XML summaries for heterogeneous content
  Support for twig queries with numerical, string, and textual predicates
  XCluster model: generalized element clustering
  Principled construction algorithm
Experimental results: high accuracy with low storage requirements