Lecture 5 slides on Central Limit Theorem, Stratified Sampling, and How to Acquire a Random Sample. Prepared by Amrita Tamrakar.


Central Limit Theorem
Assume a given population of numbers $P = \{x_1, x_2, \ldots\}$, possibly infinite.

Let $x_p$ = average of P, $\sigma_p^2$ = variance of P, $k$ = number of tuples in the sample, and $\mu_s$ = average of the sample.
Does $\mu_s$ remain fixed? No: it changes from one random sample to the next.
The standard error formula says $E(\mu_s) = x_p$. If $\sigma_s^2$ is the variance of the sample average, then
$E(\mu_s) = x_p$
$\sigma_s^2 = \sigma_p^2 / k$
Interesting phenomenon: if we plot the values of $\mu_s$ obtained from many samples, the plot is not skewed but forms a bell curve, even though the actual population may follow any distribution.
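The two formulas above can be checked with a small simulation. This is only a sketch under stated assumptions: the skewed "salary" population is synthetic (drawn from an exponential distribution), and the names `population`, `trials`, `mean_of_means`, and `var_of_means` are illustrative, not from the slides. Samples are drawn with replacement so that $\sigma_s^2 = \sigma_p^2 / k$ applies directly.

```python
import random
import statistics

rng = random.Random(42)

# Assumption: a synthetic, right-skewed population standing in for salaries.
population = [rng.expovariate(1 / 40_000) for _ in range(50_000)]
x_p = statistics.fmean(population)        # population average
var_p = statistics.pvariance(population)  # population variance (sigma_p squared)

k = 100         # tuples per sample
trials = 2_000  # number of independent samples

# Draw many samples (with replacement) and record each sample average.
sample_means = [
    statistics.fmean(rng.choices(population, k=k)) for _ in range(trials)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.variance(sample_means)

print(f"x_p = {x_p:.0f}, average of sample means = {mean_of_means:.0f}")
print(f"sigma_p^2 / k = {var_p / k:.0f}, variance of sample means = {var_of_means:.0f}")
```

A histogram of `sample_means` should look roughly bell shaped even though `population` itself is strongly skewed.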

The central limit theorem says: as we repeatedly draw random samples, the distribution of the sample average approaches a bell-shaped curve, and that curve gets tighter as the sample size $k$ grows.
[Figure: a skewed distribution of salary (axis marks at 0k, 40k, 200k) next to a plot of µ centered on x = exact average.]

Our main objective is not to reduce the error but to give an exact error interval. Hence we need to find the variance.
There are two options for finding the variance $\sigma_p^2$:
1) Use a materialized view with an extra column, e.g. 0 for females, 1 for males.
2) Calculate the sample variance many times to get an unbiased estimate of the original variance, i.e. use the sample variance as a surrogate for the original variance.
Which one will be better?
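Option 2 can be illustrated with a short experiment. This is a sketch, assuming a synthetic skewed population and the usual unbiased sample variance (divisor $k - 1$, which is what `statistics.variance` computes); the variable names are illustrative only.

```python
import random
import statistics

rng = random.Random(7)

# Assumption: synthetic skewed population; in practice sigma_p^2 is unknown.
population = [rng.expovariate(1 / 40_000) for _ in range(20_000)]
true_variance = statistics.pvariance(population)

k = 50  # tuples per sample

# Compute the sample variance (divisor k - 1) for many independent samples.
estimates = [
    statistics.variance(rng.choices(population, k=k)) for _ in range(2_000)
]
average_estimate = statistics.fmean(estimates)

print(f"population variance:         {true_variance:.3e}")
print(f"average of sample variances: {average_estimate:.3e}")
```

Any single sample variance can be far off, but their average should sit close to the population variance, which is what makes it usable as a surrogate.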

Error Interval with Confidence Level
[Figure: bell curve with total area $\int = 1$; the region between $x - d$ and $x + d$ has area 0.95.]
To give the error interval with 95% confidence, find a point $d$ such that the area under the curve between $x - d$ and $x + d$ is 0.95; then $x \pm d$ is the error interval with 95% confidence.
Alternatively, $d$ can be calculated as $d = 1.96 \cdot sd$, where the standard deviation is $sd = \sigma_p / \sqrt{k}$.
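The calculation $d = 1.96 \cdot \sigma_p / \sqrt{k}$ can be sketched as follows. Two assumptions are made here: the sample is synthetic (hypothetical salaries drawn from a normal distribution), and the unknown $\sigma_p$ is replaced by the sample standard deviation. The function name `error_interval_95` is illustrative.

```python
import math
import random
import statistics


def error_interval_95(sample):
    """Return (mean, d) so that mean ± d is an approximate 95% interval."""
    k = len(sample)
    mean = statistics.fmean(sample)
    # Assumption: the sample standard deviation stands in for sigma_p.
    sd = statistics.stdev(sample) / math.sqrt(k)
    d = 1.96 * sd
    return mean, d


rng = random.Random(1)
# Hypothetical sample of 400 salaries, for illustration only.
sample = [rng.gauss(50_000, 12_000) for _ in range(400)]

mean, d = error_interval_95(sample)
print(f"estimated average: {mean:.0f} ± {d:.0f} (95% confidence)")
```

Because $d$ shrinks with $\sqrt{k}$, quadrupling the sample size only halves the width of the interval.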

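Stratified Sampling
Will stratification of salary give more accurate results?
[Figure: salary axis from 0k to 200k with marks at 50k and 100k, divided into strata labeled $N_1, N_2, \ldots, N_r$.]
Population P is broken into r strata ($P_1, \ldots, P_r$). Each stratum has its own sample mean, $\sigma_i$, and sample size $k_i$:

Stratum   $\sigma_i$   Sample size
$P_1$     $\sigma_1$   $k_1$
$P_2$     $\sigma_2$   $k_2$
...
$P_r$     $\sigma_r$   $k_r$

The technique for stratifying is to minimize the variance within each stratum.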

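Total sample = $k_1 + k_2 + \cdots + k_r$
Mean of sample $\mu_s$ = [formula missing from the transcript]
Challenges:
1) Stratification: how to break the population into strata.
2) Allocation: how many samples to take from the 1st group, the 2nd group, and so on, i.e. how to allocate samples.
In this graph, can we say "get more samples from the 30-70k range" (an allocation strategy)?
[Figure: salary distribution with axis marks at 0k, 30k, 40k, 70k.]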
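One way to make the allocation question concrete is sketched below. The slides' own formula for $\mu_s$ is not preserved, so this example assumes the standard textbook choices: the stratified estimate is the size-weighted average of the per-stratum sample means, and samples are allocated in proportion to $N_i \sigma_i$ (Neyman allocation). The strata boundaries, sizes, and function names are hypothetical.

```python
import random
import statistics


def neyman_allocation(sizes, sigmas, total_k):
    """Allocate about total_k samples in proportion to N_i * sigma_i.

    Rounding means the result may differ from total_k by a few samples.
    """
    weights = [n * s for n, s in zip(sizes, sigmas)]
    total = sum(weights)
    return [
        min(n, max(1, round(total_k * w / total)))
        for n, w in zip(sizes, weights)
    ]


def stratified_mean(strata, allocation, rng):
    """Size-weighted average of the per-stratum sample means."""
    population_size = sum(len(s) for s in strata)
    estimate = 0.0
    for stratum, k_i in zip(strata, allocation):
        sample = rng.sample(stratum, k_i)  # sample without replacement
        estimate += len(stratum) / population_size * statistics.fmean(sample)
    return estimate


rng = random.Random(3)
# Hypothetical salary strata: many low earners, few high earners.
strata = [
    [rng.uniform(0, 30_000) for _ in range(5_000)],
    [rng.uniform(30_000, 70_000) for _ in range(4_000)],
    [rng.uniform(70_000, 200_000) for _ in range(1_000)],
]
sizes = [len(s) for s in strata]
sigmas = [statistics.pstdev(s) for s in strata]

allocation = neyman_allocation(sizes, sigmas, total_k=1_000)
estimate = stratified_mean(strata, allocation, rng)
true_mean = statistics.fmean(x for s in strata for x in s)

print("allocation per stratum:", allocation)
print(f"stratified estimate: {estimate:.0f}, true average: {true_mean:.0f}")
```

Under this allocation rule a small but highly variable stratum receives more samples than its share of the population, which is one possible answer to "where should more samples come from?".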

How is data organized in a database?
In disk blocks. To read a single record, the entire disk block must be read.
Clustered indexes and B+ trees are some of the indexing techniques.
Two approaches for sampling:
Online sampling
Offline sampling, also called pre-computed sampling

Effects:
Online sampling is costly in terms of response time.
Offline sampling can be done during pre-processing time, and the sample can be reused.
How to get sample data:
Generate a random number within the range of record ids and pull out the record with that record id.
OR
Bernoulli sampling: go to each record and toss a coin. If heads, pull out the record; otherwise leave it.
Note: Bernoulli sampling may not give the exact sample size.
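The two ways of drawing a sample can be sketched in a few lines. This is an in-memory illustration only: `records` is a hypothetical list standing in for a table whose record ids are the list positions, and the function names are made up for this example.

```python
import random


def sample_by_record_id(records, k, rng):
    """Pick k distinct random record ids and fetch those records."""
    ids = rng.sample(range(len(records)), k)
    return [records[i] for i in ids]


def bernoulli_sample(records, p, rng):
    """Visit every record and keep it with probability p (a biased coin)."""
    return [record for record in records if rng.random() < p]


rng = random.Random(11)
records = list(range(10_000))  # hypothetical table of 10,000 records

fixed = sample_by_record_id(records, k=100, rng=rng)
coin = bernoulli_sample(records, p=0.01, rng=rng)

print("record-id sampling size:", len(fixed))  # always exactly k
print("Bernoulli sampling size:", len(coin))   # near 100, but varies
```

The record-id method returns exactly `k` records but needs random access by id, while the coin-toss method scans every record and returns a sample whose size only averages `p * len(records)`.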

How to maintain freshness of the data in a random sample built offline?
It doesn't matter much, as such samples are used for historical data.
What if the original query changes? Maybe it was directed towards a particular field only.
Generate the random sample again; this has little effect on performance since it is done during pre-processing, e.g. regenerate once every 3 months.
Oracle and SQL Server have added random sampling functionality in their newer versions.