Fast Incremental Maintenance of Approximate Histograms: Phillip B. Gibbons (Intel Research Pittsburgh), Yossi Matias (Tel Aviv University), Viswanath Poosala

Presentation transcript:

Fast Incremental Maintenance of Approximate Histograms
Phillip B. Gibbons (Intel Research Pittsburgh), Yossi Matias (Tel Aviv University), Viswanath Poosala (Bell Laboratories)
Presented by: Amrita Tamrakar, CSE, Feb 2006

Introduction
- What is a histogram?
- Issues in histogram maintenance
- Novel concept of a “backing sample”
- Types of approximate histograms
- Incremental maintenance of approximate histograms
- Challenges and solutions
- Conclusion

What is a histogram?
- Maintained to approximate the distribution of data in the attributes
- Constructed by partitioning the data into mutually disjoint subsets
- Frequency on the y-axis, data value intervals on the x-axis
- Used by Oracle, DB2, SQL Server, Sybase, Informix…

Commercial vendor    Histogram type
IBM DB2              Compressed (V,F)
Oracle               Equi-depth
SQL Server           Equi-depth
Sybase               Equi-depth

History of Histogram
- Equi-width histogram
- Compressed histogram

Issues on Histogram Maintenance
- Histograms are precomputed on the underlying data
- They are stored in main memory, so overhead is low
- What about maintenance?
  - The database is modified
  - The query is changed (?)
  - The histogram becomes outdated
- Do periodic updates solve the problem?
  - They require recomputing from scratch
  - Estimates are poor in the period between recomputations
- What is the solution?

The solution to outdated histograms
- Maintain an approximate histogram in the presence of database updates
- Split and merge technique for quick adjustment
- “Backing sample” stored in secondary memory

Backing Sample
- Stores only the row ID and the necessary attributes
- At any time, the backing sample is a random sample of the relation
- No scan of the entire table is needed
- Records are kept in consecutive disk blocks
[Figure: histogram (2 KB, main memory), backing sample (100 KB), relation (20 GB)]

How to maintain a backing sample? (MaintainBackingSample)
During insertions
- Reservoir sampling technique
- Obtains a sample of the data from a single scan, without a priori knowledge of the number of tuples
- The length of each random skip is chosen so that every tuple is equally likely to be in the reservoir
- The first n tuples fill the reservoir; from tuple n+1 onward, skip a random number of records and replace a reservoir entry
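The insertion rule can be sketched with the textbook form of reservoir sampling, which draws one random number per inserted tuple. This is a minimal illustration, not the skip-based variant named on the slide; the skip-based form reaches the same uniform sample with fewer random draws. The class name `BackingSample` and its fields are hypothetical.

```python
import random

class BackingSample:
    """Fixed-capacity uniform random sample, maintained under insertions only."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity      # n: the reservoir size
        self.items = []               # the sampled rows
        self.seen = 0                 # number of rows inserted so far
        self._rng = random.Random(seed)

    def insert(self, row):
        self.seen += 1
        if len(self.items) < self.capacity:
            # The first n rows fill the reservoir directly.
            self.items.append(row)
            return
        # Row number `seen` replaces a random slot with probability n / seen,
        # which keeps every row seen so far equally likely to be in the sample.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = row

sample = BackingSample(capacity=5, seed=0)
for row_id in range(1000):
    sample.insert(row_id)
```

After the loop the reservoir still holds exactly five distinct row IDs, although a thousand rows passed through it.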

How to maintain a backing sample?
During modification
- Modify the tuple in the sample if it is present
During deletion
- Remove the tuple from the sample
- If the sample size decreases below the lower bound L, recompute the sample from disk
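The modification and deletion rules can be sketched as follows, assuming the sample is held as a dictionary from row ID to attribute value. The function names and the `resample` callback (which stands in for rescanning the relation on disk) are hypothetical.

```python
import random

def modify_in_sample(sample, row_id, new_value):
    """Apply an update only if the tuple is currently in the sample."""
    if row_id in sample:
        sample[row_id] = new_value

def delete_from_sample(sample, row_id, lower_bound, resample):
    """Drop the tuple if sampled; rebuild the sample when it gets too small."""
    sample.pop(row_id, None)
    if len(sample) < lower_bound:
        fresh = resample()            # assumed: draws a new sample from disk
        sample.clear()
        sample.update(fresh)

# Toy relation: row ID -> attribute value.
relation = {row_id: row_id % 7 for row_id in range(100)}
rng = random.Random(0)

def resample(size=4):
    ids = rng.sample(sorted(relation), size)
    return {i: relation[i] for i in ids}

sample = resample()
victims = list(sample)[:2]
for victim in victims:
    del relation[victim]
    delete_from_sample(sample, victim, lower_bound=3, resample=resample)
```

The first deletion leaves three sampled tuples, which still meets the lower bound. The second drops the sample to two and triggers the rebuild.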

Maintain approximate histograms: Different Classes of Histograms
Equi-depth histograms
- The number of tuples in each bucket is the same
- Each bucket covers a contiguous range of attribute values
[Figure: equi-depth histogram; x-axis: data value, y-axis: frequency of occurrence]
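As a minimal sketch of the equi-depth idea, the function below sorts the values and cuts them into buckets of equal size, recording the largest value and the tuple count of each bucket. The function name and the tuple layout are assumptions made for illustration. The sketch ignores the case where one repeated value straddles a bucket boundary.

```python
def equi_depth(values, num_buckets):
    """Return a list of (maxval, count) pairs, one per non-empty bucket."""
    data = sorted(values)
    n = len(data)
    buckets = []
    start = 0
    for i in range(1, num_buckets + 1):
        end = (i * n) // num_buckets      # right edge of the i-th bucket
        if end > start:
            buckets.append((data[end - 1], end - start))
        start = end
    return buckets

hist = equi_depth([1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144], 4)
# hist == [(2, 3), (8, 3), (34, 3), (144, 3)]
```

Each bucket holds three tuples even though the value ranges differ widely in width. An equi-width histogram would fix the widths and let the counts vary instead.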

Different Classes of Histograms
Compressed (V,F) histogram
- The N highest frequencies are stored in singleton buckets
- For the other values, an equi-depth histogram is used
Both histogram classes need to store, for each bucket B
- The largest value in the bucket, B.maxval
- The count, B.count
Approximate histograms are computed from a random sample of the relation.
How to maintain these histograms?

Fast incremental maintenance of approximate equi-depth histograms
During insertion
- Maintain an upper-bound threshold T on the bucket count
- If the number of tuples in the bucket stays below T, the insertion just increments the bucket count
- Otherwise, recompute the histogram
Split and merge algorithm
- Reduces the number of recomputations from the sample
- When a bucket count reaches T, split the bucket in half instead of recomputing
- Keep the number of buckets fixed by merging two buckets whose total count is less than T
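The split and merge step on insertion can be sketched as below. This is an illustration under stated assumptions, not the authors' exact procedure: buckets are `[maxval, count]` lists sorted by `maxval`, the split point is taken as the median of the backing-sample values that fall in the overflowing bucket, and the merged pair is the adjacent pair with the smallest combined count. The name `insert_value` and the `False` return value (meaning "recompute from the backing sample") are hypothetical.

```python
from bisect import bisect_left
from statistics import median_low

def insert_value(buckets, value, threshold, sample):
    """Insert `value`; return False if a full recomputation is needed."""
    maxvals = [b[0] for b in buckets]
    i = min(bisect_left(maxvals, value), len(buckets) - 1)
    buckets[i][0] = max(buckets[i][0], value)   # extend the last bucket if needed
    buckets[i][1] += 1
    if buckets[i][1] < threshold:
        return True                             # common case: only a counter bump

    # Look for the lightest adjacent pair that does not involve bucket i.
    pairs = [j for j in range(len(buckets) - 1) if i not in (j, j + 1)]
    if not pairs:
        return False
    j = min(pairs, key=lambda p: buckets[p][1] + buckets[p + 1][1])
    if buckets[j][1] + buckets[j + 1][1] >= threshold:
        return False                            # nothing can be merged

    # Pick a split point from the sampled values inside bucket i.
    lo = buckets[i - 1][0] if i > 0 else float("-inf")
    inside = [v for v in sample if lo < v <= buckets[i][0]]
    if len(inside) < 2 or median_low(inside) == buckets[i][0]:
        return False
    total = buckets[i][1]
    left = [median_low(inside), total // 2]
    right = [buckets[i][0], total - total // 2]
    merged = [buckets[j + 1][0], buckets[j][1] + buckets[j + 1][1]]

    # Rewrite the higher-indexed slot first so the lower index stays valid.
    if j > i:
        buckets[j:j + 2] = [merged]
        buckets[i:i + 1] = [left, right]
    else:
        buckets[i:i + 1] = [left, right]
        buckets[j:j + 2] = [merged]
    return True

buckets = [[10, 4], [20, 4], [30, 9], [40, 5]]
sample = [2, 7, 12, 18, 22, 24, 26, 29, 33, 38]
ok = insert_value(buckets, 25, threshold=10, sample=sample)
# buckets == [[20, 8], [24, 5], [30, 5], [40, 5]]
```

In the example the third bucket reaches the threshold of 10. The first two buckets (combined count 8) are merged and the overflowing bucket is split at the sampled median 24, so the histogram still has four buckets.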

Split and merge algorithm
[Figure: split and merge against the insert threshold]

To handle modify and delete
- Deletion can lower the bucket count
- Maintain a lower threshold T_l
- If a bucket count falls below T_l, merge the bucket with an adjacent bucket
- Split the bucket with the largest count
[Figure: delete threshold]
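The deletion case can be sketched in the same style. The assumptions are again illustrative: buckets are `[maxval, count]` lists, the underflowing bucket is merged with its lighter neighbour, and the heaviest bucket is split at the median of the backing-sample values it covers. The name `delete_value` is hypothetical, and a `False` result means the caller should recompute from the backing sample.

```python
from bisect import bisect_left
from statistics import median_low

def delete_value(buckets, value, lower, sample):
    """Delete `value`; return False if a full recomputation is needed."""
    maxvals = [b[0] for b in buckets]
    i = min(bisect_left(maxvals, value), len(buckets) - 1)
    buckets[i][1] -= 1
    if buckets[i][1] >= lower:
        return True                      # still above the lower threshold
    if len(buckets) < 3:
        return False

    # Merge the underflowing bucket with its lighter adjacent neighbour.
    neighbours = [n for n in (i - 1, i + 1) if 0 <= n < len(buckets)]
    n = min(neighbours, key=lambda k: buckets[k][1])
    a, b = min(i, n), max(i, n)
    buckets[a:b + 1] = [[buckets[b][0], buckets[a][1] + buckets[b][1]]]

    # Split the heaviest bucket to restore the bucket count.
    h = max(range(len(buckets)), key=lambda k: buckets[k][1])
    lo = buckets[h - 1][0] if h > 0 else float("-inf")
    inside = [v for v in sample if lo < v <= buckets[h][0]]
    if len(inside) < 2 or median_low(inside) == buckets[h][0]:
        return False
    total = buckets[h][1]
    buckets[h:h + 1] = [[median_low(inside), total // 2],
                        [buckets[h][0], total - total // 2]]
    return True

buckets = [[10, 6], [20, 2], [30, 5], [40, 9]]
sample = [3, 8, 15, 24, 28, 31, 33, 36, 39]
ok = delete_value(buckets, 15, lower=2, sample=sample)
# buckets == [[10, 6], [30, 6], [33, 4], [40, 5]]
```

A real implementation would also remove the deleted tuple from the backing sample before using it to choose the split point. That step is left out here.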

Fast incremental maintenance of approximate compressed histograms
- A value with a high frequency can span more than one equi-depth bucket; replace those buckets with a single singleton bucket holding a single count
- Construct the compressed histogram on the sample and scale it by a factor of N/k
- During insertions: if the count does not exceed the threshold, add to the bucket; otherwise update the bucket boundaries
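Constructing a compressed histogram on the sample and scaling it can be sketched as below. Several points are assumptions: N is read as the relation size and k as the sample size, a value becomes a singleton when its sample frequency is at least one equi-depth bucket's worth (k divided by the number of buckets), and that test is applied in a single pass. The function name and return layout are hypothetical.

```python
from collections import Counter

def compressed_histogram(sample, num_buckets, relation_size):
    """Return (singletons, equi) with counts scaled from sample to relation."""
    k = len(sample)
    scale = relation_size / k                 # the N/k factor
    freq = Counter(sample)
    depth = k / num_buckets                   # tuples per bucket in the sample

    # Heavy values get their own singleton bucket: value -> scaled count.
    singletons = {v: c * scale for v, c in sorted(freq.items()) if c >= depth}

    # The remaining values share the remaining buckets, equi-depth style.
    rest = sorted(v for v in sample if v not in singletons)
    remaining = num_buckets - len(singletons)
    equi = []
    start = 0
    for i in range(1, remaining + 1):
        end = (i * len(rest)) // remaining
        if end > start:
            equi.append((rest[end - 1], (end - start) * scale))
        start = end
    return singletons, equi

sample = [5] * 6 + [1, 2, 3, 4, 6, 7]
singletons, equi = compressed_histogram(sample, num_buckets=4, relation_size=1200)
# singletons == {5: 600.0}
# equi == [(2, 200.0), (4, 200.0), (7, 200.0)]
```

The value 5 accounts for half the sample, so it gets a singleton bucket with a scaled count of 600. The other six sampled values are spread over the three remaining equi-depth buckets.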

Challenges to maintain compressed histograms
- New values may lead to data skew, which may call for new singleton buckets
- Values in singleton buckets may no longer belong there as the number of tuples in the equi-depth buckets increases
- The number of equi-depth buckets needs adjustment
- The number of tuples in the equi-depth buckets needs adjustment

Solutions to the challenges
- A large number of insertions of the same value causes an equi-depth bucket to split, leaving adjacent boundaries with the same value; use this as a hint to create a singleton bucket for that value
- Allow singleton buckets with small counts to be merged back into equi-depth buckets
- Use the split and merge technique to control the imbalance between equi-depth buckets and their tuples without recomputation

To handle deletion and modification
- A deletion can decrease the number of tuples in a bucket relative to the other buckets, which affects which buckets should be singleton buckets
- A deletion can also drop a bucket count to the lower threshold T_l
- What to do?
  - Merge the pair with the smallest combined count and split the bucket with the largest count
  - Otherwise, recompute from the backing sample

Conclusion
- Backing sample
- Incremental maintenance of equi-depth and compressed histograms
- Split and merge technique to reduce access to the backing sample

Use of histograms in commercial databases
IBM DB2: Compressed (V,F)
- SASH (Self Adaptive Set of Histograms), research at Watson
- Two phases of automatically building/maintaining histograms based on query feedback
- chap-3
Oracle: Equi-depth
- The Oracle optimizer decides whether to use an index or a full-table scan
- Use of dbms_stats, ANALYZE
- Oracle 10g claims to generate histograms automatically when appropriate
SQL Server: Equi-depth
- A query processor can make more accurate cardinality estimates
- us/oledb/htm/oledbrowsets_special_purpose_rowsets.asp