New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang 02-07-2006.

Presentation transcript:

New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang

2 Outline Introduction Concise samples Counting samples Application to hot list queries Conclusion Reference

3 Introduction In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries whenever possible. The effectiveness of a synopsis is evaluated as a function of its footprint, i.e., the number of memory words needed to store the synopsis. Figure 1: A traditional data warehouse (labels: New Data, Queries, Response, Data Warehouse). Figure 2: Data warehouse set-up for providing approximate query answers (labels: New Data, Queries, Response, Approx. Answer Engine, Data Warehouse).

4 Definition Concise samples: a uniform random sample of the data set in which values appearing more than once in the sample are represented as a ⟨value, count⟩ pair. Counting samples: a variation on concise samples in which the counts keep track of all occurrences of a value inserted into the relation since the value was selected for the sample. Hot list queries: request an ordered set of ⟨value, count⟩ pairs for the k most frequently occurring data values, for some k. Example: the top-selling items in a database of sales transactions.

5 Outline Introduction Concise samples Counting samples Application to hot list queries Conclusion Reference

6 Concise samples Consider a relation R with n tuples and an attribute A. The goal is to obtain a uniform random sample of R.A, i.e., the values of A for a random subset of the tuples in R. Definition: Let $S = \{\langle v_1, c_1\rangle, \ldots, \langle v_j, c_j\rangle, v_{j+1}, \ldots, v_l\}$ be a concise sample. Then $\text{sample-size}(S) = l - j + \sum_{i=1}^{j} c_i$ and $\text{footprint}(S) = l + j$. Lemma 1: For any footprint m ≥ 2, there exist data sets for which the sample-size of a concise sample is n/m times larger than its footprint, where n is the size of the data set.
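The two quantities in the definition can be computed directly from a concise sample. The sketch below is illustrative only: it assumes the sample is stored as a dictionary of pairs (value to count, with count > 1) plus a list of singleton values, and the function name `concise_stats` is hypothetical.

```python
from typing import Dict, List, Tuple


def concise_stats(pairs: Dict[str, int], singletons: List[str]) -> Tuple[int, int]:
    """Return (sample_size, footprint) for a concise sample.

    pairs      -- the j values stored as <value, count> pairs
    singletons -- the l - j values stored on their own
    """
    j = len(pairs)
    l = j + len(singletons)
    # sample-size(S) = l - j + sum of the pair counts
    sample_size = (l - j) + sum(pairs.values())
    # footprint(S) = l + j: one word per value plus one word per count
    footprint = l + j
    return sample_size, footprint


# A sample holding <a, 5>, <b, 3> and the singletons c, d
print(concise_stats({"a": 5, "b": 3}, ["c", "d"]))  # (10, 6)
```

Here ten sample points fit in six memory words, which is the saving that the pair representation is meant to provide.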

7 Concise samples – offline/static Offline/static computation: Repeat m times: select a random tuple from the relation and extract its value for attribute A. Semi-sort the set of values, and replace every value occurring multiple times with a ⟨value, count⟩ pair. The resulting footprint may be smaller than m, so continue to sample until either adding the sample point would increase the concise sample footprint to m+1 or n samples have been taken. For each new value sampled, look it up in the concise sample and then either add a new singleton value, convert a singleton to a pair, or increment the count for a pair.
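A minimal sketch of the offline procedure, under stated assumptions: the relation's attribute values are given as an in-memory list, tuples are drawn uniformly with replacement, and the names `offline_concise_sample` and `footprint` are hypothetical. A `Counter` stands in for the sample, where a count of 1 means a singleton and a count above 1 means a pair.

```python
import random
from collections import Counter
from typing import List, Optional


def footprint(sample: Counter) -> int:
    """One word per singleton, two words per <value, count> pair."""
    return sum(2 if c > 1 else 1 for c in sample.values())


def offline_concise_sample(values: List, m: int,
                           rng: Optional[random.Random] = None) -> Counter:
    """Sample until the footprint would reach m + 1 or n samples are taken."""
    rng = rng or random.Random()
    sample: Counter = Counter()
    fp = 0
    for _ in range(len(values)):           # at most n samples
        v = rng.choice(values)             # random tuple's value for A
        # A new singleton or a singleton-to-pair conversion costs one word;
        # incrementing an existing pair's count costs nothing.
        growth = 1 if sample[v] <= 1 else 0
        if fp + growth > m:
            break                          # adding v would exceed footprint m
        sample[v] += 1
        fp += growth
    return sample


# Extreme skew: a single distinct value, footprint bound m = 2
s = offline_concise_sample(["x"] * 1000, 2, random.Random(0))
print(s["x"], footprint(s))  # 1000 2
```

The final call shows the kind of data set Lemma 1 has in mind: with one distinct value, a footprint of 2 represents all n sample points.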

8 Concise samples – online With concise samples, the sample-size depends on the data distribution to date, and any changes in the data distribution must be reflected in the sampling frequency. Maintenance algorithm: Let S be the current concise sample and consider a new tuple t. Maintain an entry threshold β (initially 1) for new tuples to be selected for the sample.
I. Select t.A for the sample with probability 1/β.
II. If selected, look up t.A in S: a) if it is represented by a pair, its count is incremented; b) if t.A is a singleton in S, a pair is created; c) if it is not in S, a singleton is created.
III. The footprint increases by 1 in cases b) and c).
IV. If the footprint now exceeds the bound m, raise the threshold to some β’ and subject each sample point in S to this higher threshold. Subsequent inserts are selected for the sample with probability 1/β’.
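The maintenance steps can be sketched as a small class. This is a hedged illustration, not the authors' implementation: the class name `ConciseSample`, the 10% threshold growth factor, and the per-point eviction loop are assumptions. The retention probability β/β’ follows from requiring that a point selected with probability 1/β ends up in the sample with probability 1/β’.

```python
import random
from collections import Counter
from typing import Optional


class ConciseSample:
    def __init__(self, m: int, rng: Optional[random.Random] = None,
                 growth: float = 1.1):
        self.m = m                  # footprint bound
        self.beta = 1.0             # entry threshold, initially 1
        self.growth = growth        # assumed factor for raising the threshold
        self.counts: Counter = Counter()   # count 1 = singleton, > 1 = pair
        self.rng = rng or random.Random()

    def footprint(self) -> int:
        return sum(2 if c > 1 else 1 for c in self.counts.values())

    def insert(self, value) -> None:
        # Step I: select the new value with probability 1/beta
        if self.rng.random() < 1.0 / self.beta:
            # Step II: increment a pair, create a pair, or create a singleton
            self.counts[value] += 1
            # Step IV: raise the threshold until the footprint fits again
            while self.footprint() > self.m:
                self._raise_threshold(self.beta * self.growth)

    def _raise_threshold(self, new_beta: float) -> None:
        keep = self.beta / new_beta        # retention probability per point
        for v in list(self.counts):
            kept = sum(1 for _ in range(self.counts[v])
                       if self.rng.random() < keep)
            if kept:
                self.counts[v] = kept
            else:
                del self.counts[v]
        self.beta = new_beta


cs = ConciseSample(m=20, rng=random.Random(42))
data = random.Random(7)
for _ in range(5000):
    cs.insert(data.randrange(50))
print(cs.footprint() <= 20, cs.beta >= 1.0)  # True True
```

Recomputing the footprint on every insert keeps the sketch short; a production version would track it incrementally, as in step III.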

9 Concise samples (cont.) Theorem 2: Consider the family of exponential distributions: for $i = 1, 2, \ldots$, $\Pr(v = i) = \alpha^{-i}(\alpha - 1)$, for $\alpha > 1$. For any footprint $m \ge 2$, the expected sample-size of a concise sample with footprint m is at least $\alpha^{m/2}$. Theorem 3: For any data set, when using a concise sample S with sample-size m, the expected gain is E[m − number of distinct values in S] = [expression missing from the transcript]
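The expectation on the left-hand side of Theorem 3 can be evaluated numerically for a given data set. The sketch below is not the theorem's closed form; it assumes the m sample points are drawn independently and uniformly with replacement, so a value with frequency f out of n is missed by all m draws with probability (1 − f/n)^m. The function name `expected_gain` is hypothetical.

```python
from typing import Sequence


def expected_gain(freqs: Sequence[int], m: int) -> float:
    """E[m - number of distinct values] for m draws with replacement.

    freqs -- frequency of each distinct value in the data set
    """
    n = sum(freqs)
    # A value is present in the sample unless all m draws miss it.
    expected_distinct = sum(1.0 - (1.0 - f / n) ** m for f in freqs)
    return m - expected_distinct


print(expected_gain([10], 5))     # one distinct value: gain is m - 1 = 4.0
print(expected_gain([5, 5], 2))   # two equally likely values, two draws: 0.5
```

The gain is the number of memory words a concise sample saves on singletons relative to a traditional sample of the same sample-size; it grows with skew.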

10 Concise samples (cont.) Update time overheads: (1) the coin flips that must be performed to decide which inserts are added to the concise sample and to evict values from the concise sample when the threshold is raised; (2) the lookups into the current concise sample to see if a value is already present in the sample.

11 Concise Samples – experimental evaluation Figure 3: Comparing sample-sizes of concise and traditional samples as a function of skew, for varying footprints and D/m ratios. In (a) and (b), the authors compare footprint 100 and footprint 1000, respectively, for the same data sets. In (c) and (d), the authors compare D/m = 50 and D/m = 5, respectively, for the same footprint. D: potential number of distinct values; m: footprint size.

12 Outline Introduction Concise samples Counting samples Application to hot list queries Conclusion Reference

13 Counting samples Counting samples – a variation on concise samples in which the counts are used to keep track of all occurrences of a value inserted into the relation since the value was selected for the sample. Definition: A counting sample for R.A with threshold β is any subset of R.A obtained as follows: 1. For each value v occurring c times in R, we flip a coin with probability 1/β of heads until the first heads, up to at most c coin tosses in all; if the i-th coin toss is heads, then v occurs c−i+1 times in the subset, else v is not in the subset. 2. Each value v occurring c’ > 1 times in the subset is represented as a pair ⟨v, c’⟩, and each value v occurring exactly once is represented as a singleton v.
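The definition translates almost line by line into code. The following is a sketch under assumptions: the relation is summarized as a dictionary from value to its frequency c, and the function name `counting_sample` is hypothetical. It builds a counting sample in one pass from known frequencies rather than maintaining it incrementally.

```python
import random
from typing import Dict, Optional


def counting_sample(freqs: Dict[str, int], beta: float,
                    rng: Optional[random.Random] = None) -> Dict[str, int]:
    """Build a counting sample with threshold beta from value frequencies."""
    rng = rng or random.Random()
    sample: Dict[str, int] = {}
    for v, c in freqs.items():
        # Flip a coin with heads probability 1/beta, at most c times.
        for i in range(1, c + 1):
            if rng.random() < 1.0 / beta:
                # First heads on toss i: v occurs c - i + 1 times in the sample.
                sample[v] = c - i + 1
                break
        # No heads within c tosses: v is not in the sample.
    return sample


# With beta = 1 every first toss is heads, so the sample holds exact counts.
print(counting_sample({"a": 3, "b": 1}, 1.0))  # {'a': 3, 'b': 1}
```

A stored count of 1 corresponds to a singleton and a larger count to a pair, as in step 2 of the definition.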

14 Counting samples (cont.) An algorithm for incremental maintenance is introduced. Theorem 4: Let R be an arbitrary relation, and let β be the current threshold for a counting sample S. (i) Any value v that occurs at least β times in R is expected to be in S. (ii) Any value v that occurs $f_v$ times in R will be in S with probability $1 - (1 - 1/\beta)^{f_v}$. (iii) For all $\alpha > 1$, if $f_v \ge \alpha\beta$, then with probability $\ge 1 - e^{-\alpha}$, the value will be in S and its count will be at least $f_v - \alpha\beta$.
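Parts (ii) and (iii) of Theorem 4 can be checked numerically. The helper names below are hypothetical; the first function evaluates the inclusion probability from (ii), and the second evaluates the lower bound from (iii), which follows because the first heads arrives after more than αβ tosses with probability (1 − 1/β)^{αβ} ≤ e^{−α}.

```python
import math


def inclusion_probability(f_v: int, beta: float) -> float:
    """Theorem 4(ii): probability that a value with frequency f_v is in S."""
    return 1.0 - (1.0 - 1.0 / beta) ** f_v


def count_guarantee(alpha: float) -> float:
    """Theorem 4(iii): lower bound on the probability that the value is in S
    with count at least f_v - alpha * beta, given f_v >= alpha * beta."""
    return 1.0 - math.exp(-alpha)


beta = 100.0
print(round(inclusion_probability(100, beta), 3))   # f_v = beta: about 0.634
print(inclusion_probability(200, beta) >= count_guarantee(2.0))  # True
```

The first line shows that a value occurring exactly β times is included a little under two thirds of the time, so (i) should be read as a statement about expectation rather than a guarantee.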

15 Outline Introduction Concise samples Counting samples Application to hot list queries Conclusion Reference

16 Application to hot list queries Hot list queries request an ordered set of ⟨value, count⟩ pairs for the k most frequently occurring data values, for some k. Algorithms: (1) using traditional samples; (2) using concise samples; (3) using counting samples; (4) using a histogram on disk, which maintains a full histogram on disk, i.e., ⟨value, count⟩ pairs for all distinct values in R, with a copy of the top m/2 pairs stored as a synopsis within the approximate answer engine. The histogram on disk is considered only as a baseline for accuracy comparisons.
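As a rough illustration of answering a hot list query from a sample, the sketch below ranks sampled values by count and scales each count by n divided by the sample-size to estimate the count in the full relation. The scaling rule and the function name `hot_list` are assumptions made for this example; the slides do not spell out how each algorithm reports its counts.

```python
from typing import Dict, List, Tuple


def hot_list(sample_counts: Dict[str, int], n: int, k: int) -> List[Tuple[str, float]]:
    """Estimate the k most frequent values and their counts from a sample.

    sample_counts -- value -> number of occurrences in the sample
    n             -- number of tuples in the relation
    """
    sample_size = sum(sample_counts.values())
    if sample_size == 0:
        return []
    scale = n / sample_size            # assumed uniform scaling factor
    top = sorted(sample_counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(value, count * scale) for value, count in top]


print(hot_list({"a": 6, "b": 3, "c": 1}, n=1000, k=2))
# [('a', 600.0), ('b', 300.0)]
```

Because a concise sample holds more sample points than a traditional sample of the same footprint, the same scaling applied to it rests on more evidence per reported value.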

17 Application to hot list queries (cont.) [Figure: x-axis: rank of a value; y-axis: count for the value.]

18 Application to hot list queries (cont.)

19 Application to hot list queries (cont.)

20 Application to hot list queries – overheads

21 Conclusion Using concise samples may offer the best choice when considering both accuracy and overheads. The paper assumes batch-like processing of data warehouse inserts, in which inserts and queries do not intermix. Handling the more general case requires addressing concurrency bottlenecks. Future work is to explore the effectiveness of using concise samples and counting samples for other concrete approximate answer scenarios.

22 Reference P. B. Gibbons and Y. Matias. New Sampling-Based Summary Statistics for Improving Approximate Query Answers. ACM SIGMOD 1998.