Overcoming Limitations of Sampling for Aggregation Queries

Slides:



Advertisements
Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Advertisements

Raghavendra Madala. Introduction Icicles Icicle Maintenance Icicle-Based Estimators Quality Guarantee Performance Evaluation Conclusion 2 ICICLES: Self-tuning.
Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
A Paper on RANDOM SAMPLING OVER JOINS by SURAJIT CHAUDHARI RAJEEV MOTWANI VIVEK NARASAYYA PRESENTED BY, JEEVAN KUMAR GOGINENI SARANYA GOTTIPATI.
Fast Algorithms For Hierarchical Range Histogram Constructions
Chapter 6 Sampling and Sampling Distributions
STHoles: A Multidimensional Workload-Aware Histogram Nicolas Bruno* Columbia University Luis Gravano* Columbia University Surajit Chaudhuri Microsoft Research.
Linked Bernoulli Synopses Sampling Along Foreign Keys Rainer Gemulla, Philipp Rösch, Wolfgang Lehner Technische Universität Dresden Faculty of Computer.
Brian Babcock Surajit Chaudhuri Gautam Das at the 2003 ACM SIGMOD International Conference By Shashank Kamble Gnanoba.
Harikrishnan Karunakaran Sulabha Balan CSE  Introduction  Icicles  Icicle Maintenance  Icicle-Based Estimators  Quality & Performance  Conclusion.
February 14, 2006CS DB Exploration 1 Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath.
Written By Surajit Chaudhuri, Gautam Das, Vivek Marasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes.
1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku
Outline SQL Server Optimizer  Enumeration architecture  Search space: flexibility/extensibility  Cost and statistics Automatic Physical Tuning  Database.
An Efficient Cost-Driven Selection Tool for Microsoft SQL Server Surajit ChaudhuriVivek Narasayya Indian Institute of Technology Bombay CS632 Course seminar.
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries By : Surajid Chaudhuri Gautam Das Vivek Narasayya Presented by :Sayed.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
1 Primitives for Workload Summarization and Implications for SQL Prasanna Ganesan* Stanford University Surajit Chaudhuri Vivek Narasayya Microsoft Research.
Chapter 6: Database Evolution Title: AutoAdmin “What-if” Index Analysis Utility Authors: Surajit Chaudhuri, Vivek Narasayya ACM SIGMOD 1998.
1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.
Privacy Preserving OLAP Rakesh Agrawal, IBM Almaden Ramakrishnan Srikant, IBM Almaden Dilys Thomas, Stanford University.
O VERCOMING L IMITATIONS OF S AMPLING FOR A GGREGATION Q UERIES Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Query Optimization Allison Griffin. Importance of Optimization Time is money Queries are faster Helps everyone who uses the server Solution to speed lies.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,
STA Lecture 161 STA 291 Lecture 16 Normal distributions: ( mean and SD ) use table or web page. The sampling distribution of and are both (approximately)
Lecture 12 Statistical Inference (Estimation) Point and Interval estimation By Aziza Munir.
1 A Bayesian Method for Guessing the Extreme Values in a Data Set Mingxi Wu, Chris Jermaine University of Florida September 2007.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Join Synopses for Approximate Query Answering Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth.
Histograms for Selectivity Estimation
Join Synopses for Approximate Query Answering Swarup Acharya, Philip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy By Vladimir Gamaley.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)
Presented By Anirban Maiti Chandrashekar Vijayarenu
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented By: Vivek Tanneeru.
Robust Estimation With Sampling and Approximate Pre-Aggregation Author: Christopher Jermaine Presented by: Bill Eberle.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP BY QUERIES Swaroop Acharya,Philip B Gibbons, VishwanathPoosala By Agasthya Padisala Anusha Reddy.
HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries Xiaohui Yu University of Toronto Joint work with Nick Koudas.
Histograms for Selectivity Estimation, Part II Speaker: Ho Wai Shing Global Optimization of Histograms.
1 A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Proceedings of the.
University of Texas at Arlington Presented By Srikanth Vadada Fall CSE rd Sep 2010 Dynamic Sample Selection for Approximate Query Processing.
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
Chapter 6 Sampling and Sampling Distributions
ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan Shruti P. Gopinath CSE 6339.
Dense-Region Based Compact Data Cube
Confidence Intervals Cont.
OPERATING SYSTEMS CS 3502 Fall 2017
Data Transformation: Normalization
Chapter 7. Classification and Prediction
Data Mining: Concepts and Techniques
A paper on Join Synopses for Approximate Query Answering
Anthony Okorodudu CSE ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan.
Bolin Ding Silu Huang* Surajit Chaudhuri Kaushik Chakrabarti Chi Wang
ICICLES: Self-tuning Samples for Approximate Query Answering
Spatial Online Sampling and Aggregation
Load Shedding Techniques for Data Stream Systems
Random Sampling over Joins Revisited
Introduction to Instrumentation Engineering
Lecture 2 – Monte Carlo method in finance
Reading Report 6 Yin Chen 5 Mar 2004
Feifei Li, Ching Chang, George Kollios, Azer Bestavros
DATABASE HISTOGRAMS E0 261 Jayant Haritsa
Data Transformations targeted at minimizing experimental variance
A Framework for Testing Query Transformation Rules
Presentation transcript:

Overcoming Limitations of Sampling for Aggregation Queries Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Mayur Datar, Stanford University Rajiv Motwani, Stanford University Vivek Narasayya, Microsoft Research Presented by Ranjan Dash CSE@UTA

Outline Abstract Introduction Related Work Study of Limitations Handling Data skew Handling low selectivity & small groups Implementation and experimental result Conclusion and future work 9/20/2018 Ranjan Dash CSE@UTA

Abstract Approximate aggregation query processing using sampling Uniform sampling performs poorly when - Distribution of aggregated attribute is skewed Queries with low selectivity Solution proposed by the paper - Outlier indexing Weighted sampling based on workload information A Combined approach significantly reduces the approximation error. 9/20/2018 Ranjan Dash CSE@UTA

Introduction Approximate query processing used extensively by OLAP and Data mining tools. Use of uniform random sampling is error prone because of - Presence of data skew – presence of outlier values that are significantly different from the rest in terms of their contribution to the aggregate Low selectivity – Very few or no tuples may satisfy the query predicate. Extrapolating from such a small set of tuples led to error. Two techniques to overcome these limitations of URS. Isolate values in the data that could contribute heavily to the error in sampling – Outlier-indexing Exploit the workload information for queries with low selectivity. Single table queries involving selection and group by queries with sum aggregate – can be extended to other aggregation functions and foreign key joins. Combination of outlier-indexing and weighted sampling based on workload info results in significant error reduction. 9/20/2018 Ranjan Dash CSE@UTA

Related Work Related research in approximately answering aggregation queries – Use of online sampling technique for data with little or no variability – [12] Use of pre-computed samples or synopsis in answering aggregation queries – [2] Techniques for fast incremental maintenance of summery statistics to provide approximate query answer – [9] Proposal of a weighted sampling scheme – [1] Use of weighted sampling to continuously tuning representative sample of data Concept of detecting and removing deviants in time series data – [14] 9/20/2018 Ranjan Dash CSE@UTA

Limitations of uniform random sampling – an example Adverse impact of data skew on aggregation queries – Relation R with 10,000 tuples – aggregate on column ‘C’ 99%(9900) tuples 1 – 9900 1%(100) tuples 1000 – 100, 000 1% URS of R – Sample size = 100 result = sum of sample x scale factor(100) Quite likely sample would not include any tuple of value 1000 – estimated result = 100x100 = 10, 000 2 or more tuples of value 1000 gets included in sample – estimated result > 209, 800 Reasonable result only when we get exactly one tuple of value 1000 per sample – This event has probability 0.37. With probability of 0.63, we would get a large error in the estimate. 109, 900 Estimated result >> true sum 109, 900 9/20/2018 Ranjan Dash CSE@UTA

Limitations of uniform random sampling – skewed data Contribution of outliers to the error in estimating sum via uniform sampling - Relation R of size N having values {y1, y2, ……yN}. Let U be a uniform random sample of yi’s of size n. Then Unbiased estimator Actual sum Standard error Where S is the standard deviation of the values in the relation defined as If there are outliers in the data then S could be very large. For a given error bound we need to increase the sample size n. Theorem 1 Ye = (N/n) yi 9/20/2018 Ranjan Dash CSE@UTA

Limitations of uniform random sampling – low selectivity Selection query partitions relation into 2 sub-relation – Tuples that satisfy the condition – relevant sub-relation Tuples that do not satisfy the condition Number of tuples sampled from relevant sub-relation is proportional to its size. If this relevant sample size is small due to low selectivity of the query, it may lead to larger error Group by queries partition the relation into many sub-relations Success of uniform sampling depends on the size of the relevant sub-relation 9/20/2018 Ranjan Dash CSE@UTA

Handling data skew – outlier indexes Large variance in aggregate column due to presence of outliers or deviants lead to large error. Natural idea is to deal with it separately – outlier indexing Example: selection query (Q) with sum aggregate Separate the outliers into a separate sub-relation (RO) Apply the query to this sub-relation of outliers and get the true result of this query. Pick a uniform random sample from the non-outlier part of the relation (RNO) and estimate the approximate true result using theorem 1. Combine the above two to get overall estimate of query’s true result. 9/20/2018 Ranjan Dash CSE@UTA

Handling data skew – outlier indexes – an example Aggregate Query Q - Select sum (sales) from lineitem Preprocessing 1. Partition the relation into RO (outliers) and RNO (non-outliers) 2. Ro is lineitem_outlier a view 3. select a uniform random sample T from relation RNO and materialize it in a table called lineitem_samp Query processing 1. Aggregate outliers – sum(sales) from lineitem_outlier 2. Aggregate non-outliers – apply the query to sample T and extrapolate – sum(sales) for lineitem_samp x scale factor 3. Combine aggregates – approximate result from RNO + exact result for RO = approximate result for R 9/20/2018 Ranjan Dash CSE@UTA

Handling data skew – outlier indexes – selection of outliers Query error is solely due to error in estimating the aggregate from non-outliers Additional overhead of maintaining and accessing the outlier index – Ro contains at most τtuples Paper proposes an algorithm based on a theorem to decide the optimal size of outlier-index An optimal outlier-index Ro(R,C,τ) is defined as a sub-relation Ro c R such that |Ro| ≤ τ E(R\ Ro) = min R’C R ,|R’|≤ τ{E(R\R’)} Defines outlier index as an optimal sub-relation Ro that leads to the minimum possible sampling error, subject to the constraint that Ro has at most τ tuples in R Definition 1 9/20/2018 Ranjan Dash CSE@UTA

Handling data skew – outlier indexes – selection of outliers Theorem 2 Consider a multiset R = {y1, y2,…yN} in sorted order. Ro C R such that |Ro| ≤ τ S(R\ Ro) = min R’C R ,|R’|≤ τ{S(R\R’)} There exists some 0 ≤ τ’ ≤ τ such that Ro = {yi|1 ≤ i≤ τ’ }U{yi|(N+ τ’ +1- τ) ≤i ≤N} States that the subset that minimizes the standard deviation over the remaining set consists of the leftmost τ’ elements and the right most τ- τ’ elements from the multiset R when elements are arranged in sorted order. 9/20/2018 Ranjan Dash CSE@UTA

Handling data skew – outlier indexes – selection of outliers Alogirithm Outlier-Index(R,C, τ): Let the values in col C be sorted. For i = 1 to τ+1, compute E(i) = S({yi, yi+1,…yN- τ+i-1}) Let i’ be the value of i where E(i) is minimum. Outlier-index is the tuples that correspond to the set of values {yj|1 ≤ j≤ τ’ }U{yj|(N+ τ’ +1- τ) ≤j ≤N} where τ’ = i’ - 1 9/20/2018 Ranjan Dash CSE@UTA

Handling data skew – outlier indexes – storage allocation Given sufficient space to store m tuples, how do we allocate storage between samples and outlier-index inorder to minimize error? S(τ) is the SD in non-outliers for an optimal outlier-index of τ. If we allocate space in m such that τ tuples in the outlier and m- τ in the sample. Then error will be proportional to S(τ)/√(m- τ). Then the optimal allocation will be the value of τ for which S(τ)/√(m- τ) will be minimum. 9/20/2018 Ranjan Dash CSE@UTA

Use of outlier-indexing – extension to other aggregates Count aggregate: outlier-indexing not beneficial as there is no variance among the data values Avg aggregate: same as sum aggregate Extensible to class of aggregates that satisfy certain algebraic property Real valued function aggregate: like sum(price*quantity) Rank order dependent aggregates: like min, max etc – outlier not useful Extension to foreign-key join Outlier index and samples are computed over the relation having the aggregation column and then joined with other relations at query processing time. 9/20/2018 Ranjan Dash CSE@UTA

Handling low selectivity and small groups Problem of low selectivity queries and small groups in group-by queries. Use weighted sampling instead of uniform sampling with the help of workload information Sample more from the subsets that are small in size but are important having high usage Usage of a database is typically characterized by considerable locality in the access pattern i.e queries against the database access certain parts more than others. Tune the sample to a representative workload This technique uses pre-computed samples 9/20/2018 Ranjan Dash CSE@UTA

Handling low selectivity and small groups – Exploit workload info Workload collection – collect workload consisting of representative queries from query profilers and query logs Trace query patterns – set of selection condition Trace tuple usage – usage of specific tuples like frequency of access to each tuple tuple ti has weight wi if the tuple ti is required to answer wi of the queries in the workload. Weighted sampling – perform sampling taking into account weights of the tuples The weights are normalized 9/20/2018 Ranjan Dash CSE@UTA

Handling low selectivity and small groups – Exploit workload info – an example Use of weighted sample to answer aggregation query Let the weight of tuple ti be wi Normalized weight be wi’ = wi/ Σwi probability of acceptance of this tuple in the sample pi = n.wi’ where n is the sample size How to answer aggregation queries approximately? For each tuple included in the sample, store the corresponding pi. Each aggregate computed over this tuple gets multiplied by the inverse of pi (multiplication factor). This works well if Access pattern is local i.e. most queries access a small part of relation Workload is representative of actual queries 9/20/2018 Ranjan Dash CSE@UTA

Implementation Outlier-indexing Table: lineitem, Aggregation column: I_extendedprice CREATE VIEW I-extendedprice-otl-idx AS SELECT * from lineitem WHERE (I-extendedprice 5 54819.46) OR (I-extendedprice 2 71442.88) predicate is determined using the algorithm and storage allocated for the outlier-index. Module that automatically rewrites a, query to use the outlier-index and the sample rather than the fact table. Uniform and weighted sampling. Modify the execution tree generated by the SQL Server optimizer For uniform sampling accepts tuples with the specified probability (i.e., the sampling fraction) and stores the accepted tuples in a table. For weighted sampling, the probability of accepting a tuple is proportional to the weight associated with the tuple. Calculated exact weights for each tuple for a given workload and store in an additional column in fact table (lineitem) . Converted a SELECT query in the workload into the corresponding UPDATE statement that increments the weight of all the tuples satisfying the selection predicates. Automatically substitute the sample table for the fact table of an incoming query. 9/20/2018 Ranjan Dash CSE@UTA

Implementation & Experimental Result Goal is to compare performance of uniform sampling, weighted sampling and weighted sampling + outlier-indexing Platform. Microsoft SQL Server 2000. Dell Precision 610 system with a Pentium I11 Xeon 450 Mhz processor with 128 MB RAM and an external 23GB hard drive. Databases. TPC-R benchmark that supports data generation from a uniform distribution. Modified the TPC-R data generation program to generate data with varying degree of skew – using varying Zipfian parameter (z) The ratio of the maximum value to the minimum value of the aggregation column varied between 76 and 106. Workloads. generated using a random query generation program. The program generates queries with foreign key joins between tables, aggregations on the fact table (lineitem) grouping selection. The sum aggregation function is used. 9/20/2018 Ranjan Dash CSE@UTA

Implementation & Experimental Result Parameters (1) skew of the data (z): over 1, 1.5, 2, 2.5, and 3 (2) the sampling fraction (f) - from 1% to loo% (3) the storage for the outlier-index - 1%, 5%, I0%, and 20%. Error metric For each query compute the relative error by dividing the difference between the approximate estimate for the aggregate (sum) and the accurate value of the aggregate by the latter. For queries with a group-by clause: Consider a query that has k groups in the answer obtained from actually executing the query. Build a k dimensional vector where the ith dimension contains the relative error in the aggregate expression for that ith group. The error is the mean of all the points in the vector. Thus, the average relative error over all groups is reported. The error metric for a workload is average error over all queries in the workload. 9/20/2018 Ranjan Dash CSE@UTA

Implementation & Experimental Result Uniform sampling weighted sampling (WSAMP) weighted sampling + outlier-indexing (WSAMP+OTLIDX). 9/20/2018 Ranjan Dash CSE@UTA

Experimental Result - Varying the data skew Vary the data skew, while keeping the sampling fraction constant (5%). Weighted sampling does consistently better than uniform sampling across all data skews Weighted sampling + outlier-indexing performs significantly better than weighted sampling alone. 9/20/2018 Ranjan Dash CSE@UTA

Experimental Result - Varying the sampling fraction varied the sampling fraction with fixed data skew (z=2). weighted sampling gives lower error than uniform sampling across all sampling fractions. greatest benefit occurs at low sampling fractions (e.g., 1%). use of outlier-indexing in addition to weighted sampling improves accuracy significantly. 9/20/2018 Ranjan Dash CSE@UTA

Experimental Result - Varying the selectivity of queries vary selectivity of queries between 1 % and 100% with fixed data skew (z=2) and the sampling fraction (f=l%). weighted sampling better than uniform sampling at very low selectivity. when the workload references large portions of the data (e.g., at 100% selectivity) uniform and weighted sampling are not significantly different. Weighted sampling + outlier-indexing performs well across different selectivities. Its relative improvement over weighted sampling is smaller at lower selectivities. 9/20/2018 Ranjan Dash CSE@UTA

Conclusion and Open Problems Explored problems encountered when using uniform sampling as a means for approximate query answering. Observations: Skew in the aggregation attribute can lead to large errors - proposed outlier-indexing to improve the accuracy. Demonstrated that this technique improves accuracy significantly Problem of low selectivity of queries - Outlined approaches based on workload information. Combination of outlier-indexing and weighted sampling based on workload information has proved to be a significant step forward. Future works: Investigating the problem of building a single outlier-index for different aggregates and aggregate expressions. Tuning the selection of the outlier-index using the workload information is another interesting issue. Investigating extensions of this techniques to a wider class of join queries. 9/20/2018 Ranjan Dash CSE@UTA

References [l] Acharya S., Gibbons P., and Poosala V. Congressional samples for approximate answering of group-by queries. In Proceedings of the ACM SIGMOD Conference, 487-498, 2000. [2] Acharya S., Gibbons P., Poosala V., and Ramaswamy S. Join synopses for approximate query answering. In Proceedings of the ACM SIGMOD Conference, 1999. [3] Bamett V. and Lewis T. Outliers in Statistical Data. John Wiley, 3rd edition, 1994. [4] Chatfield C. The Analysis of Time Series. Chapman and Hall, 1984. [5] Chaudhuri S., Motwani R., and Narasayya V. On random sampling over joins. In Proceedings of the ACM SIGMOD Conference, 1998. [6] Chaudhuri S. and Narasayya V. Program for TPC-D data generation with skew. ftp.research.microsoft.comlpub/users/viveknar/tpcdskew. [7] Cochran W. G. Sampling Techniques. John Wiley & Sons, New York, third edition, 1977. [8] Ganti V., Lee M. L., and Ramakrishnan R. ICICLES: selftuning samples for approximate query answering. In Proceedings of 26th lntemational Conference Very Large Data Bases, 2000. [9] Gibbons P., and Matias Y. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the ACM SIGMOD Conference, 33 1-342, 1998. [IO] Haas P. and Hellerstein J. Ripple joins for online aggregation. In Proceedings of the ACM SIGMOD Conference, 287- 298, 1999. [11] Hawkins D. Identification of Outliers. Chapman and Hall, london, 1980. 9/20/2018 Ranjan Dash CSE@UTA

References [I2] Hellerstein J., Haas P., and Wang H. Online aggregation. In Proceedings of the ACM SIGMOD Conference, 1997. [I3] Hou W., Ozsoyoglu G., and Dogdu E. Error-constrained COUNT query evaluation in relational databases. In Proceedings of the ACM SIGMOD Conference, 218-281, 1991. [I4] Jagdish H., Koudas N. and Muthukrishnan S. Mining deviants in times series database. In Proceedings of 25th International Conference Very Large Data Bases, 102-1 13, 1999. [15] Knorr E. and Ng R. Algorithms for mining distance-based outliers in large datasets. In Proceedings of 24th International Conference Very Large Data Bases, 392-403, 1998. [I6] Ramaswamy S., Rastogi R., and Shim K. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM SIGMOD Conference, 427-438, 2000. [I7] Manku G., Rajagopalan S., and Lindsay B. Random sampling techniques for space efficient online computation of order statistics of large datasets. In Proceedings of the ACM SIGMOD Conference, 251-262, 1999. [ 18] Olken E Random sampling from databases - bibliography. http://pueblo.lbl.gov/ olken/mendel/sampling/bibliography.html. [19] Seshadri P. and Swami A. Generalized partial indexes. In Proceedings of the I Ith International Conference on Data Engineering, 420427, 1995. [20] Zipf G. E. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Inc, 1949. 9/20/2018 Ranjan Dash CSE@UTA

Questions? 9/20/2018 Ranjan Dash CSE@UTA