Introduction to Histograms Presented By: Laukik Chitnis


2 Introduction to Histograms
This presentation is mostly based on the following work by Yannis Ioannidis and Viswanath Poosala:
– Histogram Based Solutions to Diverse Database Estimation Problems (1995)
– Improved Histograms for Selectivity Estimation of Range Predicates (1996), with Peter Haas and Eugene Shekita of IBM Almaden Research
– History of Histograms (presented by Yannis Ioannidis)

3 Motivation and Expected Features
Why histograms? Size estimates are needed for:
– Estimating the costs of access plans
– Query profiling for user feedback
– Load balancing for parallel join execution
Expected features:
– Should produce estimates with small errors
– Inexpensive to construct, use, and maintain
– Useful across diverse estimation problems

4 Data Distribution
– Histograms are approximations of the data distribution
– A data distribution is a set of (attribute value, frequency) pairs

5 Data Distribution
Set of (attribute value, frequency) pairs
– The data distribution has all the information required to answer queries (count, join, aggregate, ...)
– But it is too bulky!
– So we “approximate” it with a histogram
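As a minimal sketch of the definition above, the following Python builds a data distribution from a column of attribute values. The `ages` column and the function name `data_distribution` are made up for illustration and do not come from the slides.

```python
from collections import Counter


def data_distribution(column):
    """Return the data distribution of a column as a list of
    (attribute value, frequency) pairs, sorted by attribute value."""
    counts = Counter(column)
    return sorted(counts.items())


# Hypothetical attribute column with seven tuples.
ages = [30, 25, 30, 41, 25, 30, 52]
print(data_distribution(ages))  # [(25, 2), (30, 3), (41, 1), (52, 1)]
```

The list has one entry per distinct value, which is why it becomes too bulky for attributes with many distinct values.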

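6 From Data Distribution to Histogram
[Figure: frequency plotted against attribute value]

7 From Data Distribution to Histogram
[Figure: frequency plotted against attribute value, with the value axis split into bucket 1 and bucket 2]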

9 Definitions
Equi-width histogram: a histogram type in which the ‘spread’ of each bucket is the same.
Spread S and area A:
– S_i = V_{i+1} - V_i
– A_i = f_i * S_i
where V_i is the i-th attribute value in sorted order and f_i is its frequency.

10 Spread S and Area A
[Figure: frequency vs. value plot with the spread and the area of a value marked]
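The spread and area definitions can be sketched in a few lines of Python. The slides do not say what the spread of the largest value is, since it has no successor. The `last_spread=1` default below is an assumption made for this sketch, and the function name is hypothetical.

```python
def spreads_and_areas(distribution, last_spread=1):
    """distribution: list of (value, frequency) pairs sorted by value.
    Returns a list of (value, frequency, spread, area) tuples, where
    spread S_i = V_{i+1} - V_i and area A_i = f_i * S_i."""
    rows = []
    for i, (value, freq) in enumerate(distribution):
        if i + 1 < len(distribution):
            spread = distribution[i + 1][0] - value
        else:
            # Assumption: the largest value has no successor, so its
            # spread is set to last_spread.
            spread = last_spread
        rows.append((value, freq, spread, freq * spread))
    return rows


# Hypothetical distribution of four distinct values.
distribution = [(25, 2), (30, 3), (41, 1), (52, 1)]
for row in spreads_and_areas(distribution):
    print(row)
```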

11 Equi-width Histograms
Divide the value axis into buckets of equal ‘width’ (the ranges are equalized).
Advantages:
– The natural order of the attribute values is preserved
– Light on storage requirements
Example: count of tuples having x < 5
Another example: what if the query range boundary does not match a bucket boundary? Scale the last bucket! Assumption: uniform distribution within a bucket.
Disadvantages: high variance within a bucket!
– Errors are difficult to estimate
– Example: our previous graph was ‘continuous’. What if we get crazy graphs, with big guys beside short guys (very high frequencies right next to very low ones)?
– Consider a self-join estimated with this histogram
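The "scale the last bucket" idea can be sketched as follows. This is an illustrative sketch, not code from the cited papers. The function names and sample distribution are made up, and it assumes the distribution has at least two distinct values so that the bucket width is non-zero.

```python
def build_equi_width(distribution, num_buckets):
    """Bucket a sorted (value, frequency) list into ranges of equal width.
    Returns (low, high, count) triples; each bucket covers [low, high),
    except the last one, which also includes the maximum value."""
    lo = distribution[0][0]
    hi = distribution[-1][0]
    width = (hi - lo) / num_buckets  # assumes hi > lo
    counts = [0] * num_buckets
    for value, freq in distribution:
        idx = min(int((value - lo) / width), num_buckets - 1)
        counts[idx] += freq
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(num_buckets)]


def estimate_less_than(buckets, x):
    """Estimate the number of tuples with value < x, assuming tuples are
    spread uniformly inside each bucket."""
    total = 0.0
    for low, high, count in buckets:
        if x >= high:
            total += count  # bucket lies entirely below x
        elif x > low:
            # x falls inside this bucket: scale its count by the covered share
            total += count * (x - low) / (high - low)
    return total


# Hypothetical distribution: frequencies jump around between neighbours.
distribution = [(0, 4), (2, 6), (4, 2), (6, 8), (8, 3), (10, 1)]
buckets = build_equi_width(distribution, 2)
print(buckets)
print(estimate_less_than(buckets, 7))  # compare with the true count of 20
```

With these numbers the estimate for x < 7 comes out below the true count of 20, because the second bucket hides the tall value 6 behind a uniform average. This is the variance problem listed under the disadvantages.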

12 Alternative histogram
Equi-depth histograms: don’t equalize the ranges of values, but the number of tuples in each bucket.
Disadvantages:
– Variance within a bucket may still be very high
– Storage requirement is the same as for equi-width, but they are more complex to maintain
They work well for range queries only when the data distribution has low skew.
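A rough equi-depth construction might look like the sketch below. It assumes that all tuples with the same attribute value stay in one bucket, so the bucket counts are only approximately equal. The function name and sample data are hypothetical.

```python
def build_equi_depth(distribution, num_buckets):
    """Split a sorted (value, frequency) list into at most num_buckets
    buckets holding roughly the same number of tuples.
    Returns (low_value, high_value, count) triples."""
    total = sum(freq for _, freq in distribution)
    target = total / num_buckets  # ideal number of tuples per bucket
    groups = []
    current = []
    running = 0
    for value, freq in distribution:
        current.append((value, freq))
        running += freq
        # Close the bucket once the running count passes the next multiple
        # of the target, keeping the final bucket open for the remainder.
        if running >= target * (len(groups) + 1) and len(groups) < num_buckets - 1:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return [(g[0][0], g[-1][0], sum(f for _, f in g)) for g in groups]


# Hypothetical distribution with 24 tuples.
distribution = [(0, 4), (2, 6), (4, 2), (6, 8), (8, 3), (10, 1)]
print(build_equi_depth(distribution, 3))  # [(0, 2, 10), (4, 6, 10), (8, 10, 4)]
```

Unlike the equi-width case, bucket boundaries depend on the data, which is one reason maintenance under updates is more involved.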

13 Frequency-based histograms
Minimize variance: group the items having similar frequencies in a bucket.
[Figure: serial histogram, frequency vs. value]

14 Serial Histogram H2

15 Serial Histogram
The frequencies of the attribute values associated with each bucket are either all greater than or all less than the frequencies of the attribute values associated with any other bucket.
Advantage: optimal for reducing errors in estimation
Disadvantage: high storage requirement
– There is no order-correlation between values and frequencies, so an index is required that leads to the approximate frequency of every individual attribute value
– Optimal for equality joins and selections only when the attribute value list is maintained for each bucket
Can we reduce the storage requirement?

16 End-biased histograms
– Some of the highest frequencies and some of the lowest frequencies are maintained explicitly and accurately in separate individual buckets
– The remaining (middle) frequencies are all approximated together in a single bucket
– Storage requirement: very little, and no index is required; only the attribute values in individual buckets and their corresponding frequencies are stored
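The end-biased idea can be sketched in Python as below. The parameter names `num_high` and `num_low`, and the representation as a dictionary plus one mean, are choices made for this illustration. The sketch assumes `num_high + num_low` does not exceed the number of distinct values.

```python
def build_end_biased(distribution, num_high, num_low):
    """Keep the num_high most frequent and num_low least frequent values in
    singleton buckets; approximate all remaining values by their mean
    frequency. Returns (exact_frequencies, middle_mean)."""
    by_freq = sorted(distribution, key=lambda pair: pair[1])
    split = len(by_freq) - num_high
    low = by_freq[:num_low]
    high = by_freq[split:]
    middle = by_freq[num_low:split]
    exact = dict(low + high)
    if middle:
        mean = sum(freq for _, freq in middle) / len(middle)
    else:
        mean = 0.0
    return exact, mean


def estimate_frequency(histogram, value):
    """Exact frequency for singleton-bucket values, the shared mean otherwise.
    Note: a value absent from the relation also gets the mean."""
    exact, mean = histogram
    return exact.get(value, mean)


# Hypothetical distribution.
distribution = [(0, 4), (2, 6), (4, 2), (6, 8), (8, 3), (10, 1)]
histogram = build_end_biased(distribution, num_high=1, num_low=1)
print(histogram)                         # ({10: 1, 6: 8}, 3.75)
print(estimate_frequency(histogram, 6))  # 8
print(estimate_frequency(histogram, 4))  # 3.75
```

Only the singleton values and one number for the middle bucket are stored, which matches the small storage footprint claimed on the slide.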

17 Histograms… formally
Definition: a histogram on an attribute is constructed by partitioning the data distribution D into β mutually disjoint subsets called buckets and approximating the frequencies f and values V in each bucket in some common fashion.

18 Key properties that characterize histograms and determine their effectiveness in query result size estimation:
– Partition class
– Sort parameter
– Partition constraint
– Source parameter
These properties are mutually orthogonal and form the basis for a general taxonomy of histograms.

19 Orthogonal Properties of Histogram
Partition class: indicates restrictions on partitioning
– Serial: buckets cover non-overlapping ranges of sort parameter values
– End-biased: at most one non-singleton bucket
Partition constraint: the mathematical constraint that uniquely identifies the histogram within its partition class
– Equi-depth: the number of tuples in each bucket should be equal

20 Orthogonal Properties of Histogram…
Sort parameter and source parameter: each is derived from a data distribution element (its value and/or frequency)
– Attribute values (V)
– Frequencies (F)
– Areas (A) = spread × frequency
Serial: buckets must contain contiguous sort parameter values

21 Orthogonal Properties of Histogram…
Approximation of values within a bucket: the assumption that determines the approximate values within a bucket
– Example: uniform distribution of values within a bucket
Approximation of the frequency of a value within a bucket: the assumption that determines the approximate frequency of each value within a bucket
– Example: the frequency of each value is assumed to be the arithmetic mean of the frequencies in the bucket

22 Histogram Taxonomy
Each histogram is primarily identified by its partition constraint and its sort and source parameters. If the choices for these three properties are p, s, and u, respectively, then the histogram is named p(s,u).

23 Histogram Taxonomy
– Trivial histogram
– Equi-sum(V,S), alias equi-width
– Equi-sum(V,F), alias equi-depth
– V-optimal(F,F)
– V-optimal-end-biased(F,F)
– Spline-based(V,C)

24 Histogram Taxonomy…
Trivial histogram
– All values in a single bucket
Equi-sum(V,S), alias equi-width
– Groups contiguous ranges of attribute values into buckets so that the sum of the spreads in each bucket (i.e., the maximum minus the minimum value in the bucket) is approximately equal to 1/β times the domain range

25 Histogram Taxonomy…
Equi-sum(V,F), alias equi-depth
– Like equi-width histograms, but the sum of the frequencies in each bucket is equal rather than the sum of the spreads
V-optimal(F,F) histograms
– Group contiguous sets of frequencies into buckets so as to minimize the variance of the overall frequency approximation
– Were previously called simply serial histograms
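To make the V-optimal(F,F) objective concrete, the sketch below sorts the frequencies and then picks the contiguous grouping that minimizes the total squared deviation from each bucket's mean. This is the same quantity as the count-weighted sum of bucket variances. It uses a standard dynamic program over the sorted frequencies, which is a substitution made for this illustration and not necessarily the construction method used in the work the slides are based on. It assumes `num_buckets` is at most the number of frequencies. A real histogram would also have to remember which attribute values fall into each bucket.

```python
def v_optimal_ff(frequencies, num_buckets):
    """Partition sorted frequencies into num_buckets contiguous groups that
    minimize the total squared error around each group's mean.
    Returns (groups, total_squared_error)."""
    freqs = sorted(frequencies)
    n = len(freqs)

    # Prefix sums let us compute a group's squared error in O(1).
    prefix = [0.0]
    prefix_sq = [0.0]
    for f in freqs:
        prefix.append(prefix[-1] + f)
        prefix_sq.append(prefix_sq[-1] + f * f)

    def sse(i, j):
        """Squared error of freqs[i:j] around its own mean (requires i < j)."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # best[k][j]: minimum error when freqs[:j] is split into k groups.
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for k in range(1, num_buckets + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + sse(i, j)
                if cost < best[k][j]:
                    best[k][j] = cost
                    cut[k][j] = i

    # Walk the recorded cut points backwards to recover the groups.
    groups = []
    j = n
    for k in range(num_buckets, 0, -1):
        i = cut[k][j]
        groups.append(freqs[i:j])
        j = i
    groups.reverse()
    return groups, best[num_buckets][n]


# Hypothetical frequencies of six attribute values.
groups, error = v_optimal_ff([4, 6, 2, 8, 3, 1], 2)
print(groups)  # [[1, 2, 3, 4], [6, 8]]
print(error)
```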

26 Performance Comparison

27 Observations
Performance
– Frequency-based histograms are much better than traditional ones
– End-biased is almost as good as serial
Construction cost
– Constructing serial histograms is exponential in the number of buckets, whereas end-biased is almost linear
Complexity in storage and usage
– Serial has much higher complexity than end-biased

28 More Histograms
Max-diff and compressed as partition constraints

29 More partition constraints
Max-diff
– There is a bucket boundary between two source parameter values that are adjacent (in sort parameter order) if the difference between these values is one of the β-1 largest such differences
– The goal is to avoid grouping attribute values with vastly different source parameter values into a bucket
– Efficiently constructed by first computing the differences between adjacent source parameter values
Compressed
– The highest source values are stored separately in singleton buckets; the rest are partitioned as in an equi-sum histogram
– Achieves great accuracy in approximating skewed frequency distributions and/or non-uniform spreads
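The Max-diff rule can be sketched as follows, taking attribute value as the sort parameter and frequency as the source parameter, i.e. a MaxDiff(V,F) histogram in the p(s,u) naming. The function name and the sample distribution are hypothetical. Ties between equal differences are broken arbitrarily by index, which is a choice made for this sketch.

```python
def build_max_diff(distribution, num_buckets):
    """distribution: (value, frequency) pairs sorted by value.
    Places bucket boundaries at the num_buckets - 1 largest differences
    between adjacent frequencies. Returns a list of buckets, each a list
    of (value, frequency) pairs."""
    # diffs[i] describes the gap between entry i and entry i + 1.
    diffs = [
        (abs(distribution[i + 1][1] - distribution[i][1]), i)
        for i in range(len(distribution) - 1)
    ]
    largest = sorted(diffs, reverse=True)[:num_buckets - 1]
    boundaries = sorted(i for _, i in largest)

    buckets = []
    start = 0
    for i in boundaries:
        buckets.append(distribution[start:i + 1])
        start = i + 1
    buckets.append(distribution[start:])
    return buckets


# Hypothetical distribution: the largest frequency jumps are around value 6.
distribution = [(0, 4), (2, 6), (4, 2), (6, 8), (8, 3), (10, 1)]
for bucket in build_max_diff(distribution, 3):
    print(bucket)
# [(0, 4), (2, 6), (4, 2)]
# [(6, 8)]
# [(8, 3), (10, 1)]
```

The tall value 6 ends up isolated in its own bucket, which is the effect the "avoid grouping vastly different source values" goal describes.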

Any (more) Questions?

Thank You!