Statistical Analysis of Population. Prepared by Sushruth Puttaswamy
Contents: Population; Sampling a Population; Relation between Mean of a Sample & Mean of the Population; Estimation of Error; Sample Query
Population A population is simply a collection of numbers, P = {y1, y2, y3, ..., yn}. The objective is to perform some statistical analysis on the population.
Sampling a Population Consider x to be a number drawn at random from the population. Each element of P has an equal probability of being selected. A sample is a collection of numbers drawn at random from P. Sampling here is assumed to be done with replacement, so the same element may be drawn more than once.
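As a minimal sketch of the sampling step described above, the Python snippet below draws k values uniformly at random, with replacement. The function name `sample_with_replacement` and the example population are hypothetical and only serve as illustration.

```python
import random


def sample_with_replacement(population, k, seed=None):
    """Draw k values uniformly at random from population, with replacement."""
    rng = random.Random(seed)
    # random.choices samples with replacement; with no weights given,
    # every element is equally likely on each draw.
    return rng.choices(population, k=k)


# Hypothetical population P, used only for illustration.
population = [3, 8, 15, 4, 10]
sample = sample_with_replacement(population, k=3, seed=0)
```

Because the draws are made with replacement, k may even exceed the number of elements in P.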
Relation between Mu & Y Let the average (mean) of ‘P’ be ‘Y’. Suppose we sample ‘k’ numbers out of ‘P’ (with replacement). Let ‘Mu’ be the mean of the sample. The objective is to find a relation between ‘Mu’ and ‘Y’, that is, to estimate |‘Mu’ - ‘Y’|. Instead of the absolute value, the squared difference (‘Mu’ - ‘Y’)^2 is better to work with.
Standard Error Formula The Standard Error Formula gives us an estimate of the error in the sampling process. It is given by E[(‘Mu’ - ‘Y’)^2] = Var / k, where Var is the variance of the population ‘P’ and ‘k’ is the sample size. The RHS of the formula is the square of the standard error of the sampling process. The formula does not depend on ‘n’, the number of elements in the population.
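The formula can be checked empirically. The sketch below is a small simulation, not part of the original slides: it repeatedly samples k values with replacement from a hypothetical 0/1 population, averages (Mu - Y)^2 over many trials, and compares the result with Var / k. The population, the value of k, the trial count, and the function name are all assumptions chosen for illustration.

```python
import random
import statistics


def mean_squared_error_of_sample_mean(population, k, trials, seed=0):
    """Average (Mu - Y)^2 over many samples of size k drawn with replacement."""
    rng = random.Random(seed)
    y = statistics.fmean(population)  # population mean Y
    total = 0.0
    for _ in range(trials):
        mu = statistics.fmean(rng.choices(population, k=k))  # sample mean Mu
        total += (mu - y) ** 2
    return total / trials


# Hypothetical population: 80 zeros and 20 ones.
population = [0] * 80 + [1] * 20
k = 25

var = statistics.pvariance(population)  # population variance
predicted = var / k                     # right-hand side of the formula
observed = mean_squared_error_of_sample_mean(population, k, trials=20000)
```

With enough trials, `observed` should land close to `predicted`, and the agreement does not depend on the population size n.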
Sampling Methods Sampling must get all columns of a row from the database. The aim is to reduce the error of the estimate. The estimate should be unbiased, that is, E[‘Mu’] = ‘Y’. Random sampling doesn’t give a good estimate when the query has low selectivity.
Sample Query Let us apply random sampling to a database query. Let ‘Emp’ be a DB table with 100,000 records that has Gender as one of its columns. How many female employees are there? The SQL query for this is "SELECT COUNT(*) FROM Emp WHERE gender='F';".
Query Using Random Sampling Let us select a sample of size k = 100 (Emp_sam) from the n = 100,000 records and assume that no extra overhead is required for getting the sample. The query on the sample is "SELECT COUNT(*)*n/k FROM Emp_sam WHERE gender='F';". To evaluate it, let's assume a hypothetical column in the DB which holds 0 for Male and 1 for Female. Adding up all the 1's in the sample, dividing by k to get the average, and multiplying by n gives the estimated number of females. Let the number of females obtained this way be 20,000, which means there are 80,000 males.
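The scaled count COUNT(*)*n/k can be sketched in Python as follows. This is an illustrative stand-in for the database query, not the author's implementation: the in-memory list `emp`, its 20,000/80,000 split, and the function name `estimate_count` are assumptions made for the example.

```python
import random


def estimate_count(rows, predicate, k, seed=None):
    """Estimate how many rows satisfy predicate from a random sample of size k."""
    rng = random.Random(seed)
    n = len(rows)
    sample = rng.choices(rows, k=k)  # random sample, drawn with replacement
    matches = sum(1 for row in sample if predicate(row))
    # Scale the sample count up to the full table: COUNT(*) * n / k.
    return matches * n / k


# Hypothetical stand-in for the Emp table: 20,000 females and 80,000 males.
emp = [{"gender": "F"}] * 20_000 + [{"gender": "M"}] * 80_000
estimate = estimate_count(emp, lambda row: row["gender"] == "F", k=100, seed=1)
```

Each run with a different seed gives a different estimate; only 100 of the 100,000 rows are inspected per run.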
Estimation of Error To find the error we first need the variance. From the previous result, the number of females is 20,000, which means the hypothetical column holds 20,000 1's and 80,000 0's. The mean ‘Mu’ = 20000/100000 = 0.2. Var = [(0-0.2)^2 * 80000 + (1-0.2)^2 * 20000] / 100000, so we get the variance as 0.16. From the Standard Error formula we have E = Var / k, that is, 0.16 / 100 = 0.0016.
Estimation of Error 0.0016 is the square of the error when trying to estimate the ratio of females to the population. Taking the square root gives 0.04. Multiplying this value by n = 100,000 gives 4,000, which is the error in the estimated count. This means the number of females is 20,000 +/- 4,000.
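The arithmetic of the error estimate can be reproduced step by step. The sketch below only uses the figures stated in the example (n = 100,000 rows, a sample of k = 100, and an estimate of 20,000 females); the variable names are chosen for illustration.

```python
import math

n = 100_000       # rows in Emp
k = 100           # sample size
females = 20_000  # estimated number of females (rows holding a 1)

p = females / n   # mean of the 0/1 column, 0.2

# Variance of the 0/1 column: 80,000 zeros and 20,000 ones around the mean p.
var = ((0 - p) ** 2 * (n - females) + (1 - p) ** 2 * females) / n  # about 0.16

squared_error = var / k                  # about 0.0016
ratio_error = math.sqrt(squared_error)   # about 0.04, error in the ratio
count_error = ratio_error * n            # about 4000, error in the count
```

For a 0/1 column the variance reduces to p * (1 - p), so it can be obtained directly from the estimated ratio.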
Conclusion The error can be reduced by increasing the sample size: since the squared error is Var / k, quadrupling k halves the error. According to the formula, reducing the variance also lessens the error. Without going through all the records, we can estimate the result of the query along with the level of error associated with it.