Lecture 5 slides on Central Limit Theorem Stratified Sampling How to acquire random sample Prepared by Amrita Tamrakar.

1 Lecture 5 slides on Central Limit Theorem Stratified Sampling How to acquire random sample Prepared by Amrita Tamrakar

2 Central Limit Theorem Assume a given population of numbers P = {x_1, x_2, …}, possibly infinite, with elements x_i, x_j.

3 Let x_p = average of P, σ_p^2 = variance of P, k = number of tuples in the sample, µ_s = average of the sample. Does µ_s remain fixed? No: it changes from sample to sample. The standard error formula says that if σ_s^2 is the variance of the sample average, then E(µ_s) = x_p and σ_s^2 = σ_p^2 / k. Interesting phenomenon: if we plot the values of µ_s over many samples, the plot is not skewed but approaches a bell curve, even though the actual population may follow any distribution (with finite variance).

4 The Central Limit Theorem says: as we repeatedly draw random samples, the distribution of the sample averages µ_s approaches a bell-shaped curve, whatever the shape of the population, and the curve gets tighter as the sample size k grows.
[Figure: skewed distribution of salary (axis marks 0k, 40k, 200k), x = exact average; plot of µ_s]
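The two claims above, E(µ_s) = x_p and σ_s^2 = σ_p^2 / k, can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the exponential "salary" population, the sample size k = 100, the number of trials and the helper name `sample_means` are all assumptions chosen for illustration.

```python
import random
import statistics

def sample_means(population, k, trials, rng):
    """Draw `trials` samples of size k (with replacement) and return their means."""
    return [statistics.fmean(rng.choices(population, k=k)) for _ in range(trials)]

rng = random.Random(0)
# Hypothetical skewed "salary" population (exponential, mean about 40k).
population = [rng.expovariate(1 / 40_000) for _ in range(20_000)]
x_p = statistics.fmean(population)        # population average
var_p = statistics.pvariance(population)  # population variance sigma_p^2

k = 100
means = sample_means(population, k, trials=2_000, rng=rng)

print("population mean       :", round(x_p))
print("mean of sample means  :", round(statistics.fmean(means)))
print("sigma_p^2 / k         :", round(var_p / k))
print("variance of the means :", round(statistics.pvariance(means)))
```

The two pairs of printed numbers should be close to each other, and a histogram of `means` would look roughly bell-shaped even though the population itself is strongly skewed.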

5 Our main objective is not to reduce the error but to give an exact error interval. Hence we need to find the variance. There are two options for finding the variance σ_p^2:
1) Use a materialized view with an extra column, e.g. 0 for females, 1 for males.
2) Calculate the sample variance many times to get an unbiased estimate of the original variance, i.e. use the sample variance as a surrogate for the original variance.
Which one will be better?

6 Error Interval with Confidence Level (source: http://www.math.duke.edu/~wka/math135/confidence.pdf)
To give the error interval with 95% confidence, find a point d such that the area under the curve between x-d and x+d is 0.95 (the total area under the curve is 1); then x ± d is the error interval with 95% confidence. Alternatively, d can be calculated as 1.96 * sd, where the standard deviation sd = σ_p / √k.
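As a worked example of d = 1.96 * σ_p / √k, here is a minimal sketch. It assumes the second option from slide 5, using the sample standard deviation as a surrogate for σ_p; the function name `confidence_interval` and the sample values are made up for illustration.

```python
import math
import statistics

def confidence_interval(sample, z=1.96):
    """Return (low, high) for the mean, with z = 1.96 for 95% confidence."""
    k = len(sample)
    mean = statistics.fmean(sample)
    # Assumption: the sample standard deviation stands in for sigma_p.
    sd = statistics.stdev(sample) / math.sqrt(k)
    d = z * sd
    return mean - d, mean + d

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample values
low, high = confidence_interval(sample)
print(f"95% interval: {low:.3f} .. {high:.3f}")
```

The interval is centered on the sample mean and shrinks in proportion to 1/√k, so quadrupling the sample size halves d.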

7 Stratified Sampling
Will stratifying by salary give more accurate results?
[Figure: salary axis (0k, 50k, 100k, 200k) divided into strata of sizes N_1, N_2, …, N_r]
Population P is broken into r strata (P_1 … P_r). For each stratum P_i we have a sample mean, a σ_i and a sample size k_i:
P_1: σ_1, k_1
P_2: σ_2, k_2
…
P_r: σ_r, k_r
The technique for stratifying is to minimize the variance within each stratum.

8 Total sample = k_1 + k_2 + … + k_r
Mean of sample µ_s = [formula not preserved]
Challenges:
1) Stratification: how to break the population into strata.
2) Allocation: how many samples to take from the 1st group, the 2nd group, …? I.e. how to allocate samples.
In this graph, can we say: get more samples from the 30-70k range (allocation strategy)?
[Figure: salary axis marked 0k, 30k, 40k, 70k]
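The slide's own formula for the mean is not available, so the sketch below uses the standard stratified estimator, which weights each stratum's sample mean by its share of the population, N_i / N. Treat this as an assumption rather than the lecture's definition; the strata sizes and salary values are hypothetical.

```python
import statistics

def stratified_mean(strata):
    """strata: list of (N_i, sample_i) pairs.

    Returns sum_i (N_i / N) * mean(sample_i), where N is the sum of all N_i.
    """
    total = sum(n for n, _ in strata)
    return sum(n / total * statistics.fmean(sample) for n, sample in strata)

# Hypothetical strata: (stratum size N_i, sampled salaries from that stratum).
strata = [
    (700, [20_000, 25_000, 30_000]),
    (250, [60_000, 70_000]),
    (50, [150_000, 200_000, 250_000]),
]
print(stratified_mean(strata))
```

Because the weights come from the stratum sizes and not from the sample sizes k_i, the allocation can oversample a stratum without biasing the estimate.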

9 How is data organized in a database? In disk blocks. To read a single record, the entire disk block must be read. Clustered indexes and B+ trees are some of the indexing techniques.
Two approaches for sampling:
- Online sampling
- Offline sampling, also called pre-computed sampling

10 Effects: Online sampling is costly in terms of response time. Offline sampling can be done during pre-processing time, and the sample can be reused.
How to get sample data: generate a random number between 0 and 10^6 and pull out the record with that record id.
OR Bernoulli sampling: go to each record and toss a coin; if heads, pull out the record, else leave it.
Note: this may not give the exact sample size.
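The two ways of pulling a sample can be sketched as follows. This is an illustration over an in-memory list, not a database implementation; the function names are hypothetical, and the coin is generalized to a keep-probability p (a fair coin is p = 0.5), which is an assumption beyond the slide.

```python
import random

def sample_by_record_id(records, k, rng):
    """Pick k distinct random record ids and fetch those records (exact size k)."""
    ids = rng.sample(range(len(records)), k)
    return [records[i] for i in ids]

def bernoulli_sample(records, p, rng):
    """Visit every record and keep it with probability p (size varies per run)."""
    return [r for r in records if rng.random() < p]

rng = random.Random(1)
records = list(range(1000))  # stand-in for table rows keyed by record id

by_id = sample_by_record_id(records, 10, rng)
by_coin = bernoulli_sample(records, 0.1, rng)
print(len(by_id))    # always 10
print(len(by_coin))  # around 100, but not exactly
```

The first method gives an exact sample size but needs random access by record id; the second needs one sequential pass over the data and only gives the target size on average.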

11 How to maintain the freshness of data in a random sample built offline? It doesn't matter much, as such samples are built over historical data. What if the original query changes? Maybe it was directed towards a particular field only. Generate the random sample again; this costs little in query performance since it is done during pre-processing, e.g. regenerate once every 3 months. Oracle and SQL Server have added random sampling functionality in their newer versions.

