A Quick Introduction to Approximate Query Processing Part-IV CS286, Spring’2007 Minos Garofalakis.


Slide 2 (CS286, Spring’07 – Minos Garofalakis): Logistics
Draft CS286 web site is finally up!
  – Project list and guidelines being worked on
  – Please contact me & Raghu to discuss your own project ideas…

Slide 3: Approximate Query Processing using Data Synopses
[Figure: an SQL Query run by a Decision Support System (DSS) over GB/TB of data returns an Exact Answer with Long Response Times! A “Transformed” Query run over Compact Data Synopses (KB/MB) returns an Approximate Answer FAST!!]
How to construct effective data synopses ??

Slide 4: Relations as Frequency Distributions
[Figure: a relation with attributes such as name, age, salary, and sales, viewed as frequency distributions. One-dimensional distribution: tuple counts over Age (attribute domain values). Three-dimensional distribution: tuple counts over three attributes.]

Slide 5: Outline
Intro & Approximate Query Answering Overview
  – Synopses, System architectures, Commercial offerings
One-Dimensional Synopses
  – Histograms: Equi-depth, Compressed, V-optimal, Incremental maintenance, Self-tuning
  – Samples: Basics, Sampling from DBs, Reservoir Sampling
  – Wavelets: 1-D Haar-wavelet histogram construction & maintenance
Multi-Dimensional Synopses and Joins
Set-Valued Queries
Discussion & Comparisons
Advanced Techniques & Future Directions

Slide 6: Outline
Intro & Approximate Query Answering Overview
  – Synopses, System architecture, Commercial offerings
One-Dimensional Synopses
  – Histograms, Samples, Wavelets
Multi-Dimensional Synopses and Joins
  – Multi-D Histograms, Join synopses, Wavelets
Set-Valued Queries
  – Error metrics; Using Histograms, Samples, Wavelets
Discussion & Comparisons
Advanced Techniques & Future Directions
  – Dependency-based, Streaming data

Slide 7: Two-dimensional Haar Wavelets -- Non-standard decomposition
Wavelet Transform Array: Averaging & Differencing
  – (a+b-c-d)/4
  – (a+b+c+d)/4
  – (a-b-c+d)/4
  – (a-b+c-d)/4
RECURSE
[Figure: the wavelet transform array and the coefficient “Supports”.]
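The averaging and differencing step can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own code: the function name `haar2d_nonstandard`, the assumed layout of each 2×2 block (a, b on the top row; c, d on the bottom row), and the placement of the detail coefficients in the output quadrants are all assumptions made for the example.

```python
def haar2d_nonstandard(data):
    """Non-standard 2-D Haar decomposition of a 2^k x 2^k array (list of lists)."""
    n = len(data)
    out = [[float(x) for x in row] for row in data]
    size = n
    while size > 1:
        half = size // 2
        prev = [row[:] for row in out]  # values produced by the previous level
        for i in range(half):
            for j in range(half):
                # Assumed 2x2 block layout:  a b
                #                            c d
                a, b = prev[2 * i][2 * j], prev[2 * i][2 * j + 1]
                c, d = prev[2 * i + 1][2 * j], prev[2 * i + 1][2 * j + 1]
                out[i][j] = (a + b + c + d) / 4                # block average
                out[i][j + half] = (a - b + c - d) / 4         # left vs. right
                out[i + half][j] = (a + b - c - d) / 4         # top vs. bottom
                out[i + half][j + half] = (a - b - c + d) / 4  # diagonal
        size = half  # recurse on the top-left quadrant, which holds the averages
    return out


if __name__ == "__main__":
    for row in haar2d_nonstandard([[1, 2, 3, 4],
                                   [5, 6, 7, 8],
                                   [9, 10, 11, 12],
                                   [13, 14, 15, 16]]):
        print(row)
```

Each pass replaces every 2×2 block by one average and three detail coefficients, then repeats on the quarter-size array of averages, so the final array holds one overall average plus detail coefficients at every resolution.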

Slide 8: Multi-dimensional Haar Wavelets
Haar decomposition in d dimensions = d-dimensional array of wavelet coefficients
  – Coefficient support region = d-dimensional rectangle of cells in the original data array
  – Sign of coefficient’s contribution can vary along the quadrants of its support
[Figure: support regions & signs for the 16 nonstandard 2-dimensional Haar coefficients of a 4×4 data array A.]

Slide 9: Range-sum Estimation Using Wavelet Synopses
Coefficient thresholding
  – As in 1-d case, normalizing by appropriate constants and retaining the largest coefficients minimizes the overall L2 error
Range-sums: selectivity estimation or OLAP-cube aggregates [VW99] (“measure attribute” as count)
Only coefficients with support regions intersecting the query hyper-rectangle can contribute
  – Many contributions can cancel each other [CGR00, VW99]
[Figure: Decomposition Tree (1-d) with a Query Range marked; a coefficient whose + and - halves both lie inside the range has contribution to range sum = 0.]
Only nodes on the path to range endpoints can have nonzero contributions (extends naturally to multi-dimensional range sums)
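A small 1-d sketch may help make the cancellation argument concrete. It assumes unnormalized Haar coefficients stored in the usual error-tree order (index 0 is the overall average, index k ≥ 1 is a detail coefficient whose level is the position of k's highest set bit); the helper names `haar1d`, `support`, `threshold`, and `range_sum` are illustrative, not taken from [VW99] or [CGR00].

```python
import math


def haar1d(data):
    """Unnormalized 1-d Haar coefficients in error-tree order (len(data) = 2^k)."""
    n = len(data)
    coeffs = [0.0] * n
    current = [float(x) for x in data]
    while len(current) > 1:
        half = len(current) // 2
        averages = [(current[2 * i] + current[2 * i + 1]) / 2 for i in range(half)]
        details = [(current[2 * i] - current[2 * i + 1]) / 2 for i in range(half)]
        coeffs[half:2 * half] = details
        current = averages
    coeffs[0] = current[0]
    return coeffs


def support(k, n):
    """Return (start, width) of the support region of coefficient k."""
    level = k.bit_length() - 1 if k > 0 else 0
    width = n >> level
    start = (k - (1 << level)) * width if k > 0 else 0
    return start, width


def threshold(coeffs, budget):
    """Keep the `budget` coefficients that are largest after normalization."""
    n = len(coeffs)
    weight = lambda k: abs(coeffs[k]) * math.sqrt(support(k, n)[1])
    keep = set(sorted(range(n), key=weight, reverse=True)[:budget])
    return [coeffs[k] if k in keep else 0.0 for k in range(n)]


def overlap(lo, hi, start, end):
    """Number of cells shared by the half-open ranges [lo, hi) and [start, end)."""
    return max(0, min(hi, end) - max(lo, start))


def range_sum(coeffs, lo, hi):
    """Sum of the (approximate) data values in cells lo..hi, inclusive."""
    n = len(coeffs)
    total = coeffs[0] * (hi - lo + 1)
    for k in range(1, n):
        start, width = support(k, n)
        mid = start + width // 2
        plus = overlap(lo, hi + 1, start, mid)            # cells in the + half
        minus = overlap(lo, hi + 1, mid, start + width)   # cells in the - half
        total += coeffs[k] * (plus - minus)  # zero when the halves balance out
    return total


if __name__ == "__main__":
    data = [2, 2, 0, 2, 3, 5, 4, 4]
    coeffs = haar1d(data)
    print(range_sum(coeffs, 2, 5), sum(data[2:6]))
    print(range_sum(threshold(coeffs, 4), 2, 5))
```

A detail coefficient whose support lies entirely inside the query range covers as many + cells as - cells, so its term is zero; only coefficients whose support straddles a range endpoint survive, which is the path-to-endpoints observation on the slide.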

Slide 10: Approximate Query Processing Using Wavelets [CGR00]
Reduce relations into compact wavelet-coefficient synopses
Entire query processing in the compressed (wavelet) domain
[Figure: two routes from Wavelet Synopses to Final Approximate Results. Compressed domain (FAST): Querying in Wavelet Domain gives Query Results in Wavelet Domain, followed by Render. Relation domain (SLOW): Render gives Approximate Relations, followed by Querying in Relation Domain.]

Slide 11: Wavelet Query Processing
Each operator (e.g., select, project, join, aggregates, etc.)
  – input: set of wavelet coefficients
  – output: set of wavelet coefficients
Finally, rendering step
  – input: set of wavelet coefficients
  – output: (multi)set of tuples
[Figure: query plan with select, project, and join operators, each passing on a set of coefficients, followed by render.]

Slide 12: Selection -- Relational Domain
In relational domain, interested in only those cells inside query range
In wavelet domain, interested in only the coefficients that contribute to those cells
[Figure: a Relation and its Joint Data Distribution Array, with the Query Range marked.]

Slide 13: Selection -- Wavelet Domain
[Figure: coefficient support regions over dimensions D1 and D2, shown against the Query Range.]

Slide 14: Equi-join -- Relational Domain
Join along D1
Relational domain: Join count = 7*3 = (A1-A3)*(B2+B3)
Wavelet domain: A1*B2 + A1*B3 - A3*B2 - A3*B3
Consider all pairs of coefficients: (1) check joinability (overlap in join dimension(s)), (2) compute output coefficients
[Figure: Joint Data Distribution of Relation 1 and Joint Data Distribution of Relation 2 over Dim. D2, Dim. D3, and Join Dim. D1. Coefficients A1 (+) and A3 (-) contribute to the joining cell of Relation 1; coefficients B2 (+) and B3 (+) contribute to the joining cell of Relation 2.]

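Slide 15: Equi-join -- Wavelet Domain
Join output coefficient: v = v1 * v2
[Figure: support regions and signs (+/-) of two joinable input coefficients with values v1 and v2, and of the join output coefficient, over dimensions D1, D2, and D3.]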

Slide 17: Outline
Intro & Approximate Query Answering Overview
One-Dimensional Synopses
Multi-Dimensional Synopses and Joins
Set-Valued Queries
Discussion & Comparisons
Advanced Techniques & Future Directions
Conclusions

Slide 18: Discussion & Comparisons (1)
Histograms & Wavelets: Limited by “curse of dimensionality”
  – Rely on data space partitioning in “regions”
  – Ineffective above 5-6 dimensions
      * Value/frequency uniformity assumptions within buckets break down in medium-to-high dimensionalities!!
Sampling: No such limitations, BUT...
  – Ineffective for ad-hoc relational joins over arbitrary schemas
      * Uniformity property is lost
      * Quality guarantees degrade
  – Effectiveness for set-valued approximate queries is unclear
      * Only (very) small subsets of the answer set are returned (especially, when joins are present)

Slide 19: Discussion & Comparisons (2)
Histograms & Wavelets: Compress data by accurately capturing rectangular “regions” in the data space
  – Advantage over sampling for typical, “range-based” relational DB queries
  – BUT, unclear how to effectively handle unordered/non-numeric data sets (no such issues with sampling...)
Sampling: Provides strong probabilistic quality guarantees (unbiased answers) for individual aggregate queries
  – Histograms & Wavelets: Can guarantee a bound on the overall error (e.g., L2) for the approximation, BUT answers to individual queries can be heavily biased!!
No clear winner exists!! (Hybrids??)

Slide 20: Outline
Intro & Approximate Query Answering Overview
One-Dimensional Synopses
Multi-Dimensional Synopses and Joins
Set-Valued Queries
Discussion & Comparisons
Advanced Techniques & Future Directions
  – Dependency-based Synopses
  – Streaming Data
  – XML Synopses
Conclusions

Slide 21: Dependency-based Histogram Synopses [DGR01]
Extremes in terms of the underlying correlations!!
  – Fully independent attributes: Attribute Value Independence
      * simplistic
      * inaccurate
  – Fully correlated attributes: Multi-dimensional histograms on joint data distribution
      * expensive
      * ineffective in high dimensions
Dependency-Based Histograms: explore space between extremes by explicitly identifying data correlations/independences
  – Build a statistical interaction model on data attributes
  – Based on the model, build a collection of low-dimensional histograms
  – Use this histogram collection to provide approximate answers
General methodology, also applicable to other synopsis techniques (e.g., wavelets)

Slide 22: Dependency-based Histograms
Identify (and exploit) attribute correlation and independence
  – Partial Independence: p(salary, height, weight) = p(salary) * p(height, weight)
  – Conditional Independence: p(salary, age | YPE) = p(salary | YPE) * p(age | YPE)
Use forward selection to build a decomposable statistical model [BFH75], [Lau96] on the attributes
  – A, D are conditionally independent given B, C: p(AD|BC) = p(A|BC) * p(D|BC)
  – Joint distribution p(ABCD) = p(ABC) * p(BCD) / p(BC)
  – Build histograms on model cliques
[Figure: interaction graph on attributes A, B, C, D.]
Significant accuracy improvements (factor of 5) over pure MHIST
New histogram construction & usage algorithms, etc.
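The factorization p(ABCD) = p(ABC) * p(BCD) / p(BC) can be checked numerically on a toy distribution. The sketch below is an illustration only: the binary attributes, the probability tables, and the helper names `make_joint`, `marginal`, and `estimate` are made up for the example, and exact marginal tables stand in for the low-dimensional histograms that would be built on the model cliques.

```python
from itertools import product


def make_joint():
    """Toy joint over binary A, B, C, D in which A and D are independent given (B, C)."""
    p_bc = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
    p_a1 = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 0.7}  # P(A=1 | b, c)
    p_d1 = {(0, 0): 0.3, (0, 1): 0.6, (1, 0): 0.8, (1, 1): 0.1}  # P(D=1 | b, c)
    joint = {}
    for a, b, c, d in product((0, 1), repeat=4):
        pa = p_a1[(b, c)] if a else 1 - p_a1[(b, c)]
        pd = p_d1[(b, c)] if d else 1 - p_d1[(b, c)]
        joint[(a, b, c, d)] = p_bc[(b, c)] * pa * pd
    return joint


def marginal(joint, keep):
    """Marginal distribution over the attribute positions listed in `keep`."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out


def estimate(p_abc, p_bcd, p_bc, a, b, c, d):
    """Rebuild p(a, b, c, d) from the two clique marginals and their separator."""
    return p_abc[(a, b, c)] * p_bcd[(b, c, d)] / p_bc[(b, c)]


if __name__ == "__main__":
    joint = make_joint()
    p_abc = marginal(joint, (0, 1, 2))  # clique {A, B, C}: 8 cells
    p_bcd = marginal(joint, (1, 2, 3))  # clique {B, C, D}: 8 cells
    p_bc = marginal(joint, (1, 2))      # separator {B, C}: 4 cells
    worst = max(abs(estimate(p_abc, p_bcd, p_bc, *cell) - p)
                for cell, p in joint.items())
    print("largest reconstruction error:", worst)
```

When the conditional independence holds exactly, the two three-attribute tables recover the full joint; on real data the independence is only approximate, and the space saved by the smaller cliques grows quickly with attribute domain sizes.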

Slide 23: Workload-tuned Biased Sampling -- Congressional Samples [AGP00]
Decision support queries routinely segment data into groups & then aggregate the information within each group
  – Each table has a set of “grouping columns”: queries can group by any subset of these columns
Goal: Maximize the accuracy for all groups (large or small) in each Group-by query
  – E.g., census DB with state (s), gender (g), and income (i)
  – Q: Avg(i) group-by s : seek good accuracy for all 50 states
  – Q: Avg(i) group-by s,g : seek good accuracy for all 100 groups
Technique: Congressional Samples
  – House: Uniform sample: good when there is no group-by
  – Senate: Same size sample per group when using all grouping columns: good for queries with all columns
  – Congress: Combines House & Senate, but considers all subsets of grouping columns, and then scales down
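One way to read the House, Senate, and Congress descriptions is as rules for dividing a sample budget among the finest groups. The sketch below follows that reading and is an assumption-laden illustration, not the algorithm as stated in [AGP00]: the function names, the `counts` input format (finest-group key mapped to tuple count), and the census-style numbers are all hypothetical.

```python
from itertools import combinations


def house(counts, budget):
    """Uniform sample: each finest group gets space proportional to its size."""
    total = sum(counts.values())
    return {g: budget * n / total for g, n in counts.items()}


def senate(counts, budget):
    """Equal space for every finest group, regardless of its size."""
    return {g: budget / len(counts) for g in counts}


def congress(counts, budget):
    """Take, per finest group, the largest share any grouping asks for, then rescale."""
    ncols = len(next(iter(counts)))
    best = {g: 0.0 for g in counts}
    for r in range(ncols + 1):
        for cols in combinations(range(ncols), r):  # every subset of grouping columns
            totals = {}
            for g, n in counts.items():
                key = tuple(g[i] for i in cols)
                totals[key] = totals.get(key, 0) + n
            share = budget / len(totals)  # equal space per group under this grouping
            for g, n in counts.items():
                key = tuple(g[i] for i in cols)
                # split a group's share among its finest subgroups by size
                best[g] = max(best[g], share * n / totals[key])
    scale = budget / sum(best.values())  # scale down to the available space
    return {g: v * scale for g, v in best.items()}


if __name__ == "__main__":
    counts = {("CA", "F"): 600, ("CA", "M"): 300, ("WY", "F"): 60, ("WY", "M"): 40}
    for name, alloc in (("house", house(counts, 100)),
                        ("senate", senate(counts, 100)),
                        ("congress", congress(counts, 100))):
        print(name, {g: round(v, 1) for g, v in alloc.items()})
```

The empty subset of columns reproduces the House allocation and the full set reproduces the Senate allocation, so small groups end up with more space than under uniform sampling while large groups keep more than under Senate. The sketch ignores the case where a group's allocation exceeds its population.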

Slide 24: Workload-tuned Biased Sampling -- ICICLES [GLR00]
Biased sampling scheme that dynamically adapts to query workload
  – Exploit data locality -- more focus (i.e., #sample points) in frequently-queried regions
  – Let Q = {q1, q2, ...} be a query workload, R(qi) = subset of R used in answering query qi
  – L(R, Q) = Extension of R wrt Q = R ∪ R(q1) ∪ R(q2) ∪ ... (multiset of tuples)
Icicle: Uniform random sample of L(R, Q)
Incrementally maintained and adapts (“self-tunes”) to the workload through the Reservoir Sampling technique [Vit85]
Unbiased Icicle estimators: New formulas to account for duplicates and bias in sample selection
Provably better (smaller variance) than uniform for “focused” queries (that follow the workload model)
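Reservoir sampling, the building block named above, keeps a uniform sample of fixed size over a stream of unknown length. The sketch below shows the basic one-pass version and applies it to a stream made of R followed by the tuples touched by each query. That second part is a deliberate simplification for illustration and is not the maintenance procedure of [GLR00]; the names `reservoir_sample` and `extension` are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of k items from a stream, in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # uniform over 0..i, inclusive
            if j < k:
                reservoir[j] = item  # item i is kept with probability k / (i + 1)
    return reservoir


def extension(relation, query_results):
    """Yield the tuples of R, then R(q1), R(q2), ... as one multiset stream."""
    yield from relation
    for touched in query_results:
        yield from touched


if __name__ == "__main__":
    relation = list(range(100))
    workload = [[5, 6, 7]] * 50  # a workload focused on three tuples
    sample = reservoir_sample(extension(relation, workload), 20, random.Random(0))
    print(sorted(sample))
```

Because frequently queried tuples appear many times in the extension, a uniform sample of the extension is biased toward them when viewed as a sample of R, which is why the slide calls for estimators that correct for duplicates and selection bias.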

Slide 25: Workload-tuned Biased Sampling -- Lifted Workloads [CDN01]
Formulate sample selection as an optimization problem
  – Minimize query-answering error for a given workload model
Technique for “lifting a fixed workload W” to produce a probability distribution over all possible queries
  – Similar to kernel density estimation (queries in W = “sample points”)
  – prob(q|W) = parametric function of q’s overlap with queries in W
[Figure: relation R with workload W = { q1, q2 } and a query q, showing the “Fundamental regions” induced by W.]

Slide 26: Workload-tuned Biased Sampling -- Lifted Workloads
Problem: Find sample of size k that minimizes expected error for a given “lifted” workload
Solution: Stratified sampling [Coc77]
  – Collection of uniform samples (of total size k) over disjoint subsets (“strata”) of the population
  – Much better estimates when variance within strata is small [Coc77]
Stratification: Selecting appropriate partitioning of R
  – Using “fundamental regions” as strata is optimal for COUNT
  – For SUM, partition “fundamental regions” further to reduce variance of the aggregated attribute (Neyman technique [Coc77])
Allocation: Dividing k among strata
  – Closed form solutions (valid under certain simplifying assumptions)
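For the allocation step, the classical textbook closed form is Neyman allocation, which gives each stratum space proportional to its size times its standard deviation. The sketch below uses that formula as an illustration of dividing k among strata; it is not claimed to be the closed form derived in [CDN01], and the function names and input formats are assumptions.

```python
def neyman_allocation(strata, k):
    """Split a sample of size k among strata in proportion to size * std. dev.

    `strata` maps a stratum name to (population size, standard deviation of
    the aggregated attribute). Fractional allocations are returned as-is.
    """
    weights = {name: size * std for name, (size, std) in strata.items()}
    total = sum(weights.values())
    if total == 0:  # no variance anywhere: fall back to proportional allocation
        population = sum(size for size, _ in strata.values())
        return {name: k * size / population for name, (size, _) in strata.items()}
    return {name: k * w / total for name, w in weights.items()}


def stratified_sum(sizes, samples):
    """Estimate SUM over the population from one uniform sample per stratum."""
    return sum(sizes[name] * sum(values) / len(values)
               for name, values in samples.items())


if __name__ == "__main__":
    strata = {"low-variance": (100, 1.0), "high-variance": (100, 3.0)}
    print(neyman_allocation(strata, 40))
    print(stratified_sum({"a": 10, "b": 5}, {"a": [1, 3], "b": [10]}))
```

Strata whose values vary little need few sample points for an accurate estimate, so the budget shifts toward large or high-variance strata.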

Slide 27: Data Streams
Data is continually arriving. Collect & maintain synopses on the data.
Goal: Highly-accurate approximate answers
  – State-of-the-art: Good techniques for narrow classes of queries
  – E.g., Any one-pass algorithm for collecting & maintaining a synopsis can be used effectively for data streams
Alternative scenario: A collection of data sets. Compute a compact sketch of each data set & then answer queries (approximately) comparing the data sets
  – E.g., detecting near-duplicates in a collection of web pages: AltaVista
  – E.g., estimating join sizes among a collection of tables [AGM99]

Slide 28: Looking Forward...
Optimizing queries for approximation
  – e.g., minimize length of confidence interval at the plan root
Exploiting mining-based techniques (e.g., decision trees) for data reduction and approximate query processing
  – see, e.g., [BGR01], [GTK01], [JMN99]
Dynamic maintenance of complex (e.g., dependency-based [DGR01] or mining-based [BGR01]) synopses
Synopses and approximate query processing for richer data models and data streams
  – e.g., XPath/XQuery over XML databases

Slide 29: XML Data (Text)
[Example: XML text for a list of books containing, in order: Richard Feynman, The character of Physical Law, R.K. Narayan, Waiting for the Mahatma, 1981. Callouts mark an Element, its Content, and Nesting.]

Slide 30: XML Data (Tree)
[Figure: book-list data drawn as a tree rooted at booklist, with book nodes, their child elements, and leaf values such as “Richard”, “Feynman”, “The character of physical …”, and “Hardcover”.]

Slide 31: XML Basics
Elements
  – Encode “concepts” in the XML database
  – Nesting denotes association/inclusion
Attributes
  – Record information specific to an element (e.g., the genre of a book)
References
  – Links between elements in different parts of the document

Slide 32: XML vs. Relational Data
[Figure: a Relation with columns name and phone (rows for “John” with phone 3634, “Sue”, and “Dick”) shown next to the same data as XML row elements.]

Slide 33: XML vs. Relational Data
A relation instance is basically a tree with:
  – Unbounded fanout at level 1 (i.e., any # of rows)
  – Fixed fanout at level 2 (i.e., fixed # fields)
XML data is essentially an arbitrary tree
  – Unbounded fanout at all nodes/levels
  – Any number of levels
  – Variable # of children at different nodes, variable path lengths

Slide 34: XPath Expressions
Examples:
  – /booklist/book
  – /booklist/book/author
  – /booklist/book/author/lastname
Given an XML document, the value of a path expression p is a set of elements (= XML subtrees)
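The three example paths can be tried with Python's standard `xml.etree.ElementTree` module, which supports a limited XPath subset. The document below is a guess at the book-list example: the `firstname` and `title` tag names are assumptions, and because ElementTree evaluates paths relative to the root element, `/booklist/book` is written as `./book`.

```python
import xml.etree.ElementTree as ET

# Hypothetical book-list document; tag names other than booklist, book,
# author, and lastname are assumptions made for this example.
DOC = """<booklist>
  <book>
    <author><firstname>Richard</firstname><lastname>Feynman</lastname></author>
    <title>The character of Physical Law</title>
  </book>
  <book>
    <author><firstname>R.K.</firstname><lastname>Narayan</lastname></author>
    <title>Waiting for the Mahatma</title>
  </book>
</booklist>"""

root = ET.fromstring(DOC)  # root is the <booklist> element

books = root.findall("./book")                     # /booklist/book
authors = root.findall("./book/author")            # /booklist/book/author
lastnames = root.findall("./book/author/lastname") # /booklist/book/author/lastname

if __name__ == "__main__":
    print(len(books), "book elements")
    print(len(authors), "author elements")
    print([e.text for e in lastnames])
```

Each call returns a list of element objects, and each element carries its whole subtree, matching the slide's point that the value of a path expression is a set of XML subtrees.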

Slide 35: Path Expressions
XPath expressions
  – Simple: /A/P/T
  – Branching: /A[B]/P/T
  – Values: /A/P/T[=v11]
Result is a set
[Figure: example XML tree rooted at /, with element nodes labeled A, P, B, T, N, and E (numbered 1-14) and value nodes labeled V.]

Slide 40: XPath Syntax
Path wildcards
  – // = descendant at any level (or self)
  – * = any (single) tag
  – Example: /booklist//lastname
Query attributes and attribute content
  – Use
  – Examples:
Branching predicates: A[pred]
  – Predicate on A’s subtree using logical connectives (and, or, etc.), path expressions, built-in functions (e.g., contains()), etc.
  – Example: //author[contains(./lastname, “Fey”)]
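A short sketch of the wildcard, attribute, and branching forms, again using Python's standard `xml.etree.ElementTree`. The document and its `genre` attribute are hypothetical. ElementTree implements only part of XPath and has no `contains()` function, so the last query emulates that predicate with an ordinary Python filter; a full XPath engine would evaluate the slide's expression directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the genre attribute values are made up for the example.
DOC = """<booklist>
  <book genre="Science">
    <author><lastname>Feynman</lastname></author>
  </book>
  <book genre="Fiction">
    <author><lastname>Narayan</lastname></author>
  </book>
</booklist>"""

root = ET.fromstring(DOC)

# Descendant wildcard: /booklist//lastname
all_lastnames = [e.text for e in root.findall(".//lastname")]

# Single-tag wildcard: any child of book, then lastname
via_star = [e.text for e in root.findall("./book/*/lastname")]

# Attribute predicate: books whose genre attribute equals "Fiction"
fiction = root.findall("./book[@genre='Fiction']")

# Branching predicate with an exact-match test on a child's text
feynman = root.findall(".//author[lastname='Feynman']")

# Emulation of //author[contains(./lastname, "Fey")] with a Python filter
fey_authors = [a for a in root.iter("author")
               if "Fey" in (a.findtext("lastname") or "")]

if __name__ == "__main__":
    print(all_lastnames, via_star)
    print(len(fiction), len(feynman), len(fey_authors))
```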

Slide 41: Synopses for XML
Summarize labeled tree/graph structure for approximate path navigation queries
  – Selectivity estimation: How many elements satisfy p?
  – Approximate answers: Return an approximate XML document as output of an XQuery fragment
Key idea: Build a concise Graph Synopsis that captures the path/branching distribution in limited space
  – Use appropriate uniformity/independence assumptions to approximate path structure
  – Refine synopsis in parts of the XML document where assumptions fail
  – XSketches [SIGMOD’02, VLDB’02], TreeSketches [SIGMOD’04]

Slide 42: Conclusions
Commercial data warehouses: approaching several hundred TB and continuously growing
  – Demand for high-speed, interactive analysis (click-stream processing, IP traffic analysis) also increasing
Approximate Query Processing
  – “Tame” these terabytes and satisfy the need for interactive processing and exploration
  – Great promise
  – Commercial acceptance still lagging, but will most probably grow in coming years
  – Still lots of interesting research to be done!!