SAMPLING-BASED SELECTIVITY ESTIMATION


SAMPLING-BASED SELECTIVITY ESTIMATION
Jayant Haritsa
Computer Science and Automation, Indian Institute of Science

BIG PICTURE
Size Estimation, via two main techniques:
- Sampling
- Histograms

SAMPLING
Pluses:
- Fairly easy to compute and maintain samples
- Multi-dimensional data / correlations are not a problem
- Gives whole-tuple information
- "Instant answers" for queries whose output can be served from the sample
Minuses:
- Space consuming
- Sensitive to data skew, especially to low frequencies
- May need to repeat sampling for each query
- Only probabilistic guarantees (a histogram gives deterministic guarantees)
Note: "instant answers" refers to queries such as existence queries or queries on primary-key attributes.

TODAY'S PAPER
How to efficiently estimate selection and join result sizes using sampling. Received the Best Paper award at SIGMOD 1990! (Lipton is a top theoretician.)
Basic issues:
- What kind of samples, and how many, should we take? Each sample costs one disk I/O, so sampling is expensive and we would like to minimize the number of samples.
- What are the error bounds?

Partition Approach
Divide the result into n partitions P1, P2, ..., Pn with Pi ∩ Pj = ∅ and Result = ∪ Pi. Here n refers to the source relation, so many partitions may have value 0.
Randomly sample m partitions, and let s = Σ |Pi| over the sampled partitions.
Estimate |Result| = (s/m) · n.
What should n and m be?
- n: partition = tuple (or = page), so n is the number of tuples (or pages) of the source relation
- m: determined by the error analysis that follows
Partition result size (with tuple partitions):
- 0 or 1 for a selection
- |t ⨝ S| for a join, ranging from 0 to |S|
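To make the estimator concrete, here is a minimal Python sketch of the partition approach with tuple-level partitions. The relation R and the names estimate_result_size and partition_size are ours for illustration, not from the paper.

    import random

    def estimate_result_size(relation, partition_size, m):
        """Estimate |Result| = (s/m) * n with one partition per tuple."""
        n = len(relation)                    # number of (tuple) partitions
        s = sum(partition_size(random.choice(relation)) for _ in range(m))
        return (s / m) * n                   # scale sampled total up to n

    # Example: selection R.a = 5, so each partition contributes 0 or 1.
    R = [{"a": random.randint(0, 9)} for _ in range(100_000)]
    print(estimate_result_size(R, lambda t: 1 if t["a"] == 5 else 0, m=1000))

With uniformly distributed values of a, the estimate should come out near 10,000, i.e., one tenth of the relation.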

Central Limit Theorem
Number of samples = m
Sample mean (average) = M
Sample variance = S²
True mean = μ
True variance = σ²
Then the CLT states that the statistic (M − μ) / √(σ²/m) has (approximately) the N(0, 1) distribution; in practice, σ² is approximated by S².
What is remarkable is that this is true independent of the sample distribution (assuming i.i.d. samples)!
Note: the paper has a mistake here, omitting the square root and the negative sign.
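As a quick illustration (not from the slides), the following simulation draws from a strongly skewed exponential distribution and shows that the standardized statistic is nevertheless approximately N(0, 1):

    import numpy as np

    rng = np.random.default_rng(42)
    m, mu, sigma2 = 50, 1.0, 1.0          # exponential(1): mean 1, variance 1
    z = []
    for _ in range(10_000):
        M = rng.exponential(scale=1.0, size=m).mean()   # sample mean
        z.append((M - mu) / np.sqrt(sigma2 / m))        # CLT statistic
    z = np.array(z)
    print(f"mean ~ 0: {z.mean():.3f}, std ~ 1: {z.std():.3f}")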

CLT History
The CLT dates to de Moivre (1733), with later refinements by Laplace and Lyapunov.
Sir Francis Galton (1889): "I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the 'Law of Frequency of Error'. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along."

Estimation
Then μ = M ± t(1−α/2, m−1) · √(S²/m); that is, with confidence (1 − α), the true mean of the distribution lies within the ± term.
This is the CLT rewritten as in Defn. 2.1 of the paper, with different symbols and equation organization.
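A small worked example of this interval, using scipy's Student t quantiles; the data values below are made up for illustration:

    import numpy as np
    from scipy import stats

    sample = np.array([12.1, 9.8, 11.4, 10.2, 13.0, 10.9, 11.7, 9.5])
    m = len(sample)
    M, S2 = sample.mean(), sample.var(ddof=1)    # sample mean and variance
    alpha = 0.05                                 # for a 95% confidence level
    t = stats.t.ppf(1 - alpha / 2, df=m - 1)     # t(1-alpha/2, m-1)
    half = t * np.sqrt(S2 / m)
    print(f"mu = {M:.2f} +/- {half:.2f} with confidence {1 - alpha:.0%}")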

Unit Normal and Student t Distributions

    1−α          80%      90%      95%      99%
    m−1 = 1      1.376    3.078    6.314    31.82
    m−1 = 10     0.879    1.372    1.812    2.764
    m−1 = 100    0.845    1.290    1.660    2.364
    m−1 = ∞      0.842    1.282    1.645    2.326

The m−1 = ∞ row is the unit normal distribution.
In the English-language literature the distribution takes its name from William Sealy Gosset's 1908 paper in Biometrika, published under the pseudonym "Student". Gosset worked at the Guinness Brewery in Dublin, Ireland, and was interested in the problems of small samples, for example the chemical properties of barley, where sample sizes might be as low as 3. One version of the origin of the pseudonym is that Gosset's employer forbade members of its staff from publishing scientific papers, so he had to hide his identity. Another version is that Guinness did not want their competitors to know that they were using the t-test to test the quality of raw material.
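The table entries can be reproduced with scipy; they are the one-sided quantiles at cumulative probability 1−α, and the ∞ row matches the unit normal:

    from scipy import stats

    levels = (0.80, 0.90, 0.95, 0.99)
    for df in (1, 10, 100):
        print(df, [round(stats.t.ppf(q, df), 3) for q in levels])
    print("inf", [round(stats.norm.ppf(q), 3) for q in levels])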

Algorithm (Figure 1)

    s = 0; m = 0
    while (s < k1 · b · d · (d+1))     -- average-case bound
      and (m < k2 · e²) do             -- worst-case bound
        s = s + RandomSample()
        m = m + 1
    end while
    Ã = n · s / m

- b is an upper bound on the value of RandomSample()
- d and e are accuracy indicators
- k1 and k2 are functions of p
- Ã is within (A/d, bn/e) of the true A with probability p; that is, the bounds are A ± A/d (average case) and A ± bn/e (worst case)
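A minimal runnable sketch of the Figure 1 loop, assuming a random_sample() callback that returns one partition's contribution (at the cost of one disk I/O per call), and taking k1 = k2 = [Φ⁻¹((1+p)/2)]² per the parameter settings on the next slide; the function and parameter names are ours:

    from statistics import NormalDist

    def adaptive_estimate(random_sample, n, b, d, e, p):
        k = NormalDist().inv_cdf((1 + p) / 2) ** 2   # [Phi^-1((1+p)/2)]^2
        s, m = 0.0, 0
        while s < k * b * d * (d + 1) and m < k * e * e:
            s += random_sample()       # one sampled partition, one disk I/O
            m += 1
        return n * s / m               # A~ = n * s / m

With probability p, the error is less than A/d if the loop exits on the s condition (Theorem 2.2) and less than bn/e if it exits on the m condition (Theorem 3.2).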

Algorithm Parameter Settings
d, e, and p are given by the user.
k1 = [Φ⁻¹((1 + p)/2)]²
k2 = [Φ⁻¹((1 + p)/2)]²
Is k1 bigger or k2 bigger? k1 > k2.

Theorem 2.2
Suppose that in a run of the algorithm in Figure 1 the while loop terminates because s ≥ k1 b d (d+1), and let the CLT approximation apply. Then for 0 ≤ p < 1, if k1 ≥ [Φ⁻¹((1 + p)/2)]², the error in Ã is less than A/d with probability p.

Lemma 2.1
Notes:
- V/E² is the squared coefficient of variation; V/E is the coefficient of dispersion.
- The ≥ in the lemma should really be a strict > because the CDF is defined as P(X ≤ x).
- Proof steps: subtract mE, divide by √(mV), and multiply the numerator and denominator of the RHS by E/V.
- It is known in the statistics literature that the required m is proportional to the squared coefficient of variation.

Lemma 2.2

Theorem 2.2

Theorem 2.2 (contd)

Skewed Data
b >> E, meaning, for example, that "10 percent of the samples produced 90 percent of the output tuples".
Extreme case: 1 matching tuple out of a million. The expected size of a random sample is then 1/10⁶, so the sampling requires about 10⁶ · k1 · d(d+1) samples, more than the number of tuples itself! This can happen because the algorithm samples with replacement.
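A back-of-the-envelope check of this blow-up, with illustrative numbers of our choosing (a selection, so b = 1):

    from statistics import NormalDist

    n, b, d, p = 10**6, 1, 10, 0.95           # 1 matching tuple in n
    k1 = NormalDist().inv_cdf((1 + p) / 2) ** 2
    target_s = k1 * b * d * (d + 1)           # loop must accumulate this much
    expected = target_s / (1 / n)             # E[RandomSample()] = 1/n
    print(f"expected samples: {expected:.1e}, relation size: {n:.0e}")

This prints roughly 4.2e8 expected samples against a relation of only 1e6 tuples.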

Theorem 3.2
Suppose that in a run of the algorithm of Figure 1 the while loop terminates because m ≥ k2 e². Then for 0 ≤ p < 1, if k2 ≥ [Φ⁻¹((1 + p)/2)]², the error in Ã is less than A_max/e with probability p.
The proof of Theorem 3.2 appears in the LNS90 technical report.

Implementation
Selection: sampling via an index on any attribute of the relation.
Join: sampling via an index on any attribute of the source relation, plus an index on the join attribute of the target relation.
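A hedged sketch of index-assisted join sampling, which supplies the RandomSample() needed by the adaptive loop above; the dictionary-based "index" and all names are our simplification of what a real B-tree lookup would do:

    import random
    from collections import defaultdict

    def build_join_index(S, attr):
        idx = defaultdict(int)      # join value -> number of matching S tuples
        for t in S:
            idx[t[attr]] += 1
        return idx

    def join_random_sample(R, s_index, attr):
        t = random.choice(R)        # random source tuple, via any index on R
        return s_index[t[attr]]     # |t JOIN S|, via the index on S.attr

    # Usage with the adaptive_estimate() sketch from the Figure 1 slide:
    # s_index = build_join_index(S, "b")
    # est = adaptive_estimate(lambda: join_random_sample(R, s_index, "b"),
    #                         n=len(R), b=max(s_index.values()),
    #                         d=10, e=100, p=0.95)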

END SAMPLING E0 261