A Quick Introduction to Approximate Query Processing Part II


A Quick Introduction to Approximate Query Processing Part II CS286, Spring’2007 Minos Garofalakis

Decision Support Systems
- Data Warehousing: consolidate data from many sources into one large repository. Loading, periodic synchronization of replicas, semantic integration.
- OLAP: complex SQL queries and views; queries based on spreadsheet-style operations and a "multidimensional" view of the data; interactive, "online" queries.
- Data Mining: exploratory search for interesting trends and anomalies. (Another lecture!)

Approximate Query Processing using Data Synopses
- Exact path: Decision Support Systems (DSS) issue a SQL query against GB/TB of base data and get an exact answer, but with long response times!
- Approximate path: a "transformed" query runs against compact data synopses (KB/MB) and returns an approximate answer FAST!
- Key question: how do we construct effective data synopses?

Relations as Frequency Distributions
A relation (name, salary, sales, age) can be viewed as a frequency distribution: a one-dimensional distribution maps each attribute domain value (e.g., age) to its tuple count, while a three-dimensional distribution maps each (age, sales, salary) combination to its tuple count. [Figure: example 1-D and 3-D tuple-count distributions.]

Outline
- Intro & Approximate Query Answering Overview: synopses, system architectures, commercial offerings
- One-Dimensional Synopses
  - Histograms: equi-depth, compressed, V-optimal, incremental maintenance, self-tuning
  - Samples: basics, sampling from DBs, reservoir sampling
  - Wavelets: 1-D Haar-wavelet histogram construction & maintenance
- Multi-Dimensional Synopses and Joins
- Set-Valued Queries
- Discussion & Comparisons
- Advanced Techniques & Future Directions

One-Dimensional Haar Wavelets
- Wavelets: mathematical tool for hierarchical decomposition of functions/signals.
- Haar wavelets: the simplest wavelet basis, easy to understand and implement. Recursive pairwise averaging and differencing at different resolutions:

  Resolution   Averages                   Detail Coefficients
  3            [2, 2, 0, 2, 3, 5, 4, 4]   ----
  2            [2, 1, 4, 4]               [0, -1, -1, 0]
  1            [1.5, 4]                   [0.5, 0]
  0            [2.75]                     [-1.25]

- Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
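The recursive averaging/differencing above is easy to sketch in code. A minimal, unnormalized version (function name and layout are mine, not from the slides) that reproduces the slide's example:

```python
def haar_decompose(data):
    """Recursive pairwise averaging and differencing (unnormalized Haar).

    Returns [overall average, detail coefficients ordered coarse-to-fine],
    matching the slide's layout for the wavelet decomposition.
    """
    averages = list(data)
    details = []                       # finest-level details end up last
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_det = [(a - b) / 2 for a, b in pairs]   # pairwise differences
        averages = [(a + b) / 2 for a, b in pairs]    # pairwise averages
        details = level_det + details  # prepend: coarser levels come first
    return averages + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

The result has the overall average first, then details from coarsest to finest, which is exactly the coefficient ordering used in the error-tree view.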

Haar Wavelet Coefficients
Hierarchical decomposition structure (a.k.a. the "error tree"): the overall average 2.75 is at the root, detail coefficients -1.25, 0.5, -1, ... sit below it, and the original data values 2 2 0 2 3 5 4 4 are at the leaves. Each coefficient's "support" is the dyadic interval it affects, contributing with a + sign on the left half and a - sign on the right half. [Figure: error tree and coefficient supports.]

Wavelet-based Histograms [MVW98]
Problem: range-query selectivity estimation.
Key idea: use a compact subset of Haar/linear wavelet coefficients to approximate the data distribution.
Steps:
- Compute the (cumulative) data distribution C.
- Compute the Haar (or linear) wavelet transform of C.
- Coefficient thresholding: only b << |C| coefficients can be kept.
  - Take the largest coefficients in absolute normalized value (Haar basis: divide a coefficient at resolution level j by sqrt(2^j)). This is optimal for the overall mean squared (L2) error.
  - Greedy heuristic methods: retain coefficients leading to large error reduction; throw away coefficients that give only a small increase in error.
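L2-optimal thresholding can be sketched directly: rank coefficients by normalized absolute value (dividing the coefficient at resolution level j by sqrt(2^j), with the overall average at level 0) and zero out the rest. This is a toy sketch assuming the standard coefficient layout (average at index 0, the coefficient at index i >= 1 at level floor(log2 i)):

```python
import math

def threshold(coeffs, b):
    """Keep the b Haar coefficients largest in normalized absolute value.

    coeffs uses the standard layout: coeffs[0] is the overall average,
    and coefficient i >= 1 lives at resolution level floor(log2(i)).
    """
    def level(i):
        return 0 if i == 0 else int(math.log2(i))
    ranked = sorted(range(len(coeffs)),
                    key=lambda i: abs(coeffs[i]) / math.sqrt(2 ** level(i)),
                    reverse=True)
    keep = set(ranked[:b])             # indices of the b retained coefficients
    return [c if i in keep else 0 for i, c in enumerate(coeffs)]

# Keep 4 of the 8 coefficients from the running example.
print(threshold([2.75, -1.25, 0.5, 0, 0, -1, -1, 0], 4))
# [2.75, -1.25, 0, 0, 0, -1, -1, 0]
```

Note how the finest-level coefficients -1, -1 beat the coarser 0.5 once normalization is applied: 1/sqrt(4) = 0.5 > 0.5/sqrt(2).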

Using Wavelet-based Histograms
- Selectivity estimation: sel(a <= X <= b) = C'[b] - C'[a-1], where C' is the (approximate) "reconstructed" cumulative distribution.
- At most logN + 1 coefficients are needed to reconstruct any value C[a].
- Time: O(min{b, logN}), where b = size of the wavelet synopsis (number of coefficients) and N = size of the domain.

Haar Wavelet Coefficients (cont.)
- Reconstructing a data value d(i): d(i) = sum of (+/-1) * (coefficient) over the coefficients on the path from the root to leaf i. E.g., for the running example, d(4) = 3 = 2.75 - (-1.25) + 0 + (-1).
- Range-sum calculation d(l:h): a simple linear combination of the coefficients on the paths to l and h -- only O(logN) terms. E.g., d(0:3) = 6 = 4*2.75 + 4*(-1.25).
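Both operations can be sketched as walks over the error tree (function names are mine; the coefficient array is the standard layout with the overall average at index 0 and node j's children at 2j and 2j+1):

```python
def point_value(coeffs, i):
    """Reconstruct d(i) from the coefficients on its error-tree path."""
    n = len(coeffs)
    value = coeffs[0]                  # overall average, always added
    j, lo, hi = 1, 0, n                # current node and its dyadic interval
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if i < mid:                    # left half: coefficient enters with +
            value += coeffs[j]
            j, hi = 2 * j, mid
        else:                          # right half: coefficient enters with -
            value -= coeffs[j]
            j, lo = 2 * j + 1, mid
    return value

def range_sum(coeffs, l, h):
    """Sum of d(l..h) as a linear combination of coefficients.

    For clarity this walks every node overlapping the range; only the
    O(log N) coefficients whose support is partially covered contribute
    non-zero terms (fully covered nodes cancel: left = right overlap).
    """
    n = len(coeffs)
    def go(j, lo, hi):
        if max(l, lo) >= min(h + 1, hi):      # no overlap with query
            return 0.0
        if hi - lo == 1:                      # below the coefficient level
            return 0.0
        mid = (lo + hi) // 2
        left = max(0, min(h + 1, mid) - max(l, lo))
        right = max(0, min(h + 1, hi) - max(l, mid))
        return (coeffs[j] * (left - right)
                + go(2 * j, lo, mid) + go(2 * j + 1, mid, hi))
    return coeffs[0] * (min(h + 1, n) - max(l, 0)) + go(1, 0, n)

coeffs = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
print(point_value(coeffs, 4))      # 3.0, matching the slide's example
print(range_sum(coeffs, 0, 3))     # 6.0 = 4*2.75 + 4*(-1.25)
```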

Dynamic Maintenance of Wavelet-based Histograms [MVW00]
Build Haar-wavelet synopses over the original data distribution.
Key issues with dynamic wavelet maintenance:
- A change in a single distribution value d (say, from d to d + Δ) can affect the values of many coefficients: the change propagates up the path from d to the root of the decomposition tree.
- As the distribution changes, the "most significant" (e.g., largest) coefficients can also change: important coefficients can become unimportant, and vice versa.

Effect of Distribution Updates
Key observation: for each coefficient c in the Haar decomposition tree,
c = ( AVG(leftChildSubtree(c)) - AVG(rightChildSubtree(c)) ) / 2.
If d increases by Δ, then each coefficient c at height h on path(d) (i.e., with 2^h leaves in its subtree) changes by Δ/2^h: c' = c + Δ/2^h if d lies in c's left child subtree, and c' = c - Δ/2^h if it lies in the right.
Only the coefficients on path(d) are affected, and each can be updated in constant time.
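A toy sketch of this propagation (function name mine; coefficient array in the standard layout with the average at index 0), updating only the O(log N) path coefficients in place:

```python
def propagate_update(coeffs, i, delta):
    """Apply d(i) += delta by adjusting only the path coefficients.

    A coefficient whose subtree covers 2**h leaves changes by delta/2**h,
    with + if d(i) is in its left child subtree and - if in the right;
    the overall average (coeffs[0]) changes by delta/N.
    """
    n = len(coeffs)
    coeffs[0] += delta / n
    j, lo, hi = 1, 0, n
    while hi - lo > 1:
        mid = (lo + hi) // 2
        leaves = hi - lo               # = 2**h for the coefficient at node j
        if i < mid:
            coeffs[j] += delta / leaves
            j, hi = 2 * j, mid
        else:
            coeffs[j] -= delta / leaves
            j, lo = 2 * j + 1, mid

c = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
propagate_update(c, 4, 2)              # running example: d(4): 3 -> 5
print(c)
# [3.0, -1.5, 0.5, 0.5, 0, -1, 0.0, 0]
```

The result matches a fresh decomposition of the updated vector [2, 2, 0, 2, 5, 5, 4, 4].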

Maintenance Architecture
Keep the m + m' top coefficients in a histogram H plus an auxiliary histogram H'; record insertions/deletions in an activity log.
"Shake up" when the log reaches its maximum size: for each insertion at d,
- for each coefficient c on path(d) that is in H': update c;
- for each coefficient c on path(d) that is in neither H nor H': insert c into H' with probability proportional to 1/2^h, where h is the "height" of c (probabilistic counting [FM85]).
Then adjust H and H' (move the largest coefficients to H).

Problems with Conventional Wavelet Synopses
An example data vector and wavelet synopsis (|D| = 16, B = 8 largest coefficients retained):

  Original data:    127 71 87 31 59 3 43 99 100 42 0 58 30 88 72 130
  Wavelet answers:  65  65 65 65 65 65 65 65 100 42 0 58 30 88 72 130

The second half is always accurate, but the first half shows over 2,000% relative error! E.g., the synopsis estimate for both d(0:2) and d(3:5) is 195, while the actual values are d(0:2) = 285 and d(3:5) = 93.
Large variation in answer quality within the same data set -- even when the synopsis is large, when data values are about the same, and when actual answers are about the same. Heavily biased approximate answers!
Root causes:
- Thresholding targets the aggregate L2 error metric.
- Independent, greedy thresholding can leave large regions without any coefficient.
- Heavy bias from dropping coefficients without compensating for the loss.

Approach: Optimize for Maximum-Error Metrics
Key metric for effective approximate answers: relative error with a sanity bound s, which avoids domination by small data values.
To provide tight error guarantees for all reconstructed data values, minimize the maximum relative error in the data reconstruction:

  minimize  max_i |d(i) - d'(i)| / max{ |d(i)|, s }

where d'(i) is the reconstructed value. Another option: minimize the maximum absolute error, max_i |d(i) - d'(i)|. The algorithms can be extended to general "distributive" metrics (e.g., average relative error).

Our Approach: Deterministic Wavelet Thresholding for Maximum Error
Key idea: a dynamic-programming formulation that conditions the optimal solution on the error that "enters" the subtree (through the selection of ancestor nodes).
The DP table: M[j, b, S] = optimal maximum relative (or absolute) error in subtree T(j) with a space budget of b coefficients (chosen in T(j)), assuming the subset S of j's proper ancestors has already been selected for the synopsis.
Clearly, |S| <= min{B - b, logN + 1}. We want to compute M[0, B, {}] (empty ancestor set at the root).
Basic observation: the depth of the error tree is only logN + 1, so we can explore and tabulate all S-subsets for a given node at a space/time cost of only O(N)!

Base Case for the DP Recurrence: Leaf (Data) Nodes
Base case in the bottom-up DP computation: a leaf (i.e., data) node. Assume for simplicity that the data values are numbered N, ..., 2N-1 in the error tree.
- M[j, b, S] is not defined for b > 0: never allocate space to leaves.
- For b = 0: M[j, 0, S] = |d(j) - d'(j, S)| / max{ |d(j)|, s }, where d'(j, S) is the value reconstructed from the selected ancestor coefficients in S alone; this is tabulated for each coefficient subset S with |S| <= min{B, logN + 1}. (Similarly for absolute error.)
Again, the time/space complexity per leaf node is only O(N).

DP Recurrence: Internal (Coefficient) Nodes
Two basic cases when examining node/coefficient j for inclusion in the synopsis: (1) drop j; (2) keep j.
Case (1): Drop coefficient j. The minimum possible maximum relative error in T(j) is

  M[j, b, S] = min over 0 <= b' <= b of  max{ M[2j, b', S], M[2j+1, b - b', S] }

i.e., optimally distribute the space b between j's two child subtrees. Note that the RHS of the recurrence is well-defined: ancestors of j are obviously ancestors of 2j and 2j+1.

DP Recurrence: Internal (Coefficient) Nodes (cont.)
Case (2): Keep coefficient j. The minimum possible maximum relative error in T(j) is

  M[j, b, S] = min over 0 <= b' <= b-1 of  max{ M[2j, b', S ∪ {j}], M[2j+1, b - b' - 1, S ∪ {j}] }

i.e., take one unit of space for coefficient j and optimally distribute the remaining space. The selected subsets on the RHS change, since we choose to retain j; again, the recurrence RHS is well-defined.
Finally, define M[j, b, S] as the minimum of the drop and keep cases. Since at most O(N) ancestor subsets need to be tabulated per node, the full table has O(N^2 B) entries overall, each computable with a min over O(B) budget splits.
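The two recurrences and the leaf base case can be put together in a toy memoized DP. This is a sketch, not the paper's tabulated algorithm: the ancestor subset S is carried around as a frozenset for clarity, the overall-average coefficient c0 is handled as one extra keep/drop choice at the top, and budgets smaller than B are also tried.

```python
from functools import lru_cache

def haar(data):
    """Unnormalized Haar transform: coeffs[0] = overall average; for
    j >= 1, coeffs[j] is the error-tree node with children 2j and 2j+1
    (leaves numbered N..2N-1, as in the slides)."""
    avg, details = list(data), []
    while len(avg) > 1:
        pairs = list(zip(avg[0::2], avg[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        avg = [(a + b) / 2 for a, b in pairs]
    return avg + details

def max_rel_error_dp(data, B, s=1.0):
    """Toy DP following the slides' recurrences: M(j, b, S) = least possible
    maximum relative error in T(j) with budget b, given ancestor subset S."""
    n = len(data)
    coeffs = haar(data)

    def recon(leaf, S):
        # value reconstructed at an error-tree leaf from selected ancestors only
        val, node = (coeffs[0] if 0 in S else 0.0), leaf
        while node > 1:
            parent = node // 2
            if parent in S:            # + from left child, - from right child
                val += coeffs[parent] if node == 2 * parent else -coeffs[parent]
            node = parent
        return val

    @lru_cache(maxsize=None)
    def M(j, b, S):
        if j >= n:                     # base case: data node
            if b:                      # never allocate space to leaves
                return float('inf')
            d = data[j - n]
            return abs(d - recon(j, S)) / max(abs(d), s)
        # Case 1: drop coefficient j -- split budget b among the children.
        best = min(max(M(2 * j, bl, S), M(2 * j + 1, b - bl, S))
                   for bl in range(b + 1))
        if b >= 1:                     # Case 2: keep coefficient j.
            S2 = S | {j}
            best = min(best, min(max(M(2 * j, bl, S2), M(2 * j + 1, b - 1 - bl, S2))
                                 for bl in range(b)))
        return best

    best = float('inf')
    for used in range(B + 1):          # also allow using fewer than B coefficients
        best = min(best, M(1, used, frozenset()))
        if used >= 1:                  # spend one unit on the overall average c0
            best = min(best, M(1, used - 1, frozenset({0})))
    return best

print(max_rel_error_dp([2, 2, 0, 2, 3, 5, 4, 4], 8))   # full budget: exact, 0.0
```

For small N this can be checked exhaustively against the best subset of at most B coefficients, which is what the recurrences implicitly enumerate.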

Outline
- Intro & Approximate Query Answering Overview
- One-Dimensional Synopses
- Multi-Dimensional Synopses and Joins
  - Multi-dimensional histograms
  - Join sampling
  - Multi-dimensional Haar wavelets
- Set-Valued Queries
- Discussion & Comparisons
- Advanced Techniques & Future Directions
- Conclusions

Multi-dimensional Data Synopses
Problem: approximate the joint data distribution of multiple attributes (e.g., Age x Salary).
Motivation:
- Selectivity estimation for queries with multiple predicates.
- Approximating OLAP data cubes and general relations.
Conventional approach: the Attribute-Value Independence (AVI) assumption,
  sel(p(A1) & p(A2) & ...) = sel(p(A1)) * sel(p(A2)) * ...
Simple -- one-dimensional marginals suffice -- BUT almost always inaccurate, with gross errors in practice (e.g., [Chr84, FK97, Poo97]).
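A tiny illustration of why AVI fails (the table, predicates, and function name are hypothetical examples, not from the slides): with perfectly correlated attributes, the product of marginal selectivities can be arbitrarily far from the joint selectivity.

```python
def avi_selectivity(table, preds):
    """Attribute-Value Independence estimate: the product of the
    one-dimensional selectivities. `table` is a list of tuples and
    `preds` maps attribute position -> predicate."""
    n = len(table)
    est = 1.0
    for pos, p in preds.items():
        est *= sum(p(row[pos]) for row in table) / n
    return est

# Perfectly correlated attributes: salary tracks age exactly.
rows = [(i, i) for i in range(100)]
preds = {0: lambda v: v < 50, 1: lambda v: v >= 50}
true_sel = sum(all(p(r[pos]) for pos, p in preds.items()) for r in rows) / len(rows)
print(avi_selectivity(rows, preds), true_sel)
# prints: 0.25 0.0
```

AVI predicts a quarter of the table qualifies; in reality no row satisfies both predicates.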

Multi-dimensional Histograms
Use a small number of multi-dimensional buckets to directly approximate the joint data distribution.
Uniform spread & frequency approximation within buckets: with n(i) = the number of distinct values along attribute Ai and F = the total bucket frequency, approximate the bucket's data points as lying on a uniform n(1) x n(2) x ... grid, each with frequency F / (n(1) * n(2) * ...). [Figure: actual vs. approximate distribution for a single bucket.]

Multi-dimensional Histogram Construction
The construction problem is much harder even for two dimensions [MPS99].
Multi-dimensional equi-depth histograms [MD88]:
- Fix an ordering of the dimensions A1, A2, ..., Ak; let t be the k-th root of the desired number of buckets; initialize B = { entire data distribution }.
- For i = 1, ..., k: split each bucket in B into t equi-depth partitions along Ai; return the resulting buckets to B.
- Problems: a limited set of bucketizations; the fixed t and fixed dimension ordering can result in poor partitionings.
MHIST-p histograms [PI97]: at each step,
- choose the bucket b in B containing the attribute Ai whose marginal is most in need of partitioning, and
- split b along Ai into p (e.g., p = 2) buckets.
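The MD88 loop above can be sketched in a few lines (a toy version, assuming in-memory points as tuples; the function name and the per-dimension split count t are mine):

```python
def equidepth_md(points, t):
    """MD88-style construction: fix the dimension order, then repeatedly
    split every current bucket into t equi-depth partitions along the
    next dimension. Returns the final buckets as lists of points."""
    buckets = [list(points)]           # start with one bucket: everything
    ndims = len(points[0])
    for dim in range(ndims):
        nxt = []
        for b in buckets:
            b.sort(key=lambda p: p[dim])
            size = -(-len(b) // t)     # ceil division -> equi-depth chunks
            nxt.extend(b[i:i + size] for i in range(0, len(b), size))
        buckets = nxt
    return buckets

# 4x4 grid of points, t = 2: 2 splits on A1, then 2 on A2 -> 4 buckets of 4.
pts = [(x, y) for x in range(4) for y in range(4)]
print([len(b) for b in equidepth_md(pts, 2)])
# prints: [4, 4, 4, 4]
```

Note the weakness the slide points out: the bucket boundaries are forced into a fixed grid-of-grids pattern, whatever the data looks like.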

Equi-depth vs. MHIST Histograms
[Figure: equi-depth histogram (a1 = 2, a2 = 3) [MD88] vs. MHIST-2 (MaxDiff) [PI97] bucketizations of the same 2-D distribution over A1 x A2.]
MHIST chooses the bucket/dimension to split based on its "criticality", and allows a much larger class of bucketizations (hierarchical space partitioning). Experimental results verify its superiority over AVI and equi-depth.

Other Multi-dimensional Histogram Techniques -- GENHIST [GKT00]
Key idea: allow overlapping histogram buckets. This permits a much larger number of distinct frequency regions for a given space budget (= number of buckets): e.g., four overlapping buckets with frequencies a, b, c, d induce regions with frequencies such as a+c, b+d, and a+b+c+d -- 9 distinct frequencies (13 if different-size buckets are used).
Greedy construction algorithm: consider increasingly coarser grids; at each step, select the cell(s) c of highest density and move enough randomly-selected points from c into a bucket to make c and its neighbors "close to uniform".
Truly multi-dimensional "split decisions" based on tuple density -- unlike MHIST.

Other Multi-dimensional Histogram Techniques -- STHoles [BCG01]
Multi-dimensional, workload-based histograms that allow bucket nesting -- a "bucket tree".
- Intercept the query result stream and count |q ∩ b| for each bucket b (< 10% overhead in MS SQL Server 2000).
- Drill "holes" in b for regions of different tuple density and "pull" them out as children of b (first-class buckets).
- Consolidate/merge buckets of similar densities (keeping the number of buckets constant).
[Figure: refining a bucket tree b1..b5 with a query q.]

Sampling for Multi-D Synopses
Taking a sample of the rows of a table captures the attribute correlations in those rows. Answers are unbiased and confidence intervals apply, so accuracy is guaranteed for count, sum, and average queries on single tables, as long as the query is not too selective.
Problem with joins [AGP99, CMN99]:
- The join of two uniform samples is not a uniform sample of the join.
- The join of two samples typically has very few tuples. [Figure: foreign-key join with 40% samples -- the actual join has 30 tuples, but the join of the samples has only 3.]
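A quick simulation of the "join of samples" problem (tables, sizes, and sampling fraction are my hypothetical choices): sampling 10% of each side of a foreign-key join retains only about 1% of the join tuples.

```python
import random

def sample(rows, frac, rng):
    """Bernoulli sample: keep each row independently with probability frac."""
    return [r for r in rows if rng.random() < frac]

rng = random.Random(42)
dim = list(range(100))                            # dimension keys 0..99
fact = [rng.randrange(100) for _ in range(1000)]  # fact FK values
dim_keys = set(dim)
true_join = sum(1 for f in fact if f in dim_keys) # every FK matches: 1000

s_dim = set(sample(dim, 0.1, rng))
s_fact = sample(fact, 0.1, rng)
joined = sum(1 for f in s_fact if f in s_dim)
print(true_join, joined)   # the sample join keeps only ~1% of the 1000 tuples
```

Each join tuple survives only if both of its constituent rows were sampled, so the expected retained fraction is 0.1 * 0.1 = 1%, matching the slide's picture.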

Join Synopses for Foreign-Key Joins [AGP99]
Based on sampling from materialized foreign-key joins:
- Typically < 10% added space is required, yet the synopsis can be used to get a uniform sample of ANY foreign-key join -- and it is fast to maintain incrementally.
- Significant improvement over using just table samples. E.g., for TPC-H query Q5 (a 4-way join): 1%-6% relative error vs. 25%-75% relative error for synopsis size = 1.5% and selectivity ranging from 2% to 10%; and 10% vs. 100% error (no answer!) for size = 0.5% and selectivity = 3%.
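The basic construction can be sketched as follows (a toy version under my own schema assumptions -- dict rows with "key"/"fk" fields; the real AGP99 scheme handles chains of foreign keys and incremental maintenance): sample the source table uniformly, then append each sampled row's unique matching dimension row, so every synopsis tuple is a genuine join result.

```python
import random

def join_synopsis(fact, dim, frac, rng):
    """Sketch of an AGP99-style join synopsis for a foreign-key join:
    a uniform sample of the fact table, with each sampled row joined to
    its (unique) matching dimension row. The result is distributed like
    a uniform sample of the full join."""
    dim_by_key = {d["key"]: d for d in dim}
    synopsis = []
    for f in fact:
        if rng.random() < frac:
            synopsis.append((f, dim_by_key[f["fk"]]))  # fact row + its match
    return synopsis

rng = random.Random(7)
dim = [{"key": k, "region": k % 5} for k in range(50)]
fact = [{"fk": rng.randrange(50), "amount": 1} for _ in range(500)]
syn = join_synopsis(fact, dim, 0.1, rng)
print(len(syn))   # roughly 10% of the 500 join tuples, all of them real
```

Because the foreign key guarantees exactly one match per fact row, the synopsis size tracks the fact sample size -- unlike joining two independent samples.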