Quantitative Methods – Week 6: Inductive Statistics I: Standard Errors and Confidence Intervals. Roman Studer, Nuffield College
Repetition: Fitting the Regression Line. The regression line predicts the values of Y based on the values of X. Thus, the best line minimises the deviation between the predicted and the actual values, i.e. the error e_UK = Y_UK − Ŷ_UK. [Figure: scatter plot with fitted regression line, IP = a + b·Wage]
Repetition: The Goodness of Fit. Total variation = explained variation + unexplained variation, i.e. TSS = ESS + USS (the unexplained or residual sum of squares). The coefficient of determination is R² = ESS/TSS. [Figure: scatter plot with regression line illustrating the decomposition of the total variation]
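Written out in standard OLS notation (added here as a compact summary, not taken verbatim from the slides), the fitted line, the residual and the variance decomposition are:

\hat{Y}_i = a + bX_i, \qquad e_i = Y_i - \hat{Y}_i,
\underbrace{\sum_i (Y_i - \bar{Y})^2}_{\text{TSS}}
  = \underbrace{\sum_i (\hat{Y}_i - \bar{Y})^2}_{\text{ESS}}
  + \underbrace{\sum_i e_i^2}_{\text{RSS/USS}},
\qquad R^2 = \frac{\text{ESS}}{\text{TSS}}.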
Homework. The regression coefficient b is the sum of the products of the deviations divided by the sum of the squared deviations of X: b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)². The intercept then follows as a = Ȳ − b·X̄; with the figures from the homework, a = 29826 − (703.55 × 51).
Homework (II). The coefficient of determination is therefore R² = ESS/TSS = 0.72. This means that Education is able to account for 72% of the variation in GDP per person. [Tables: complete data set for Norway, Switzerland, the US, Brazil and Iran with columns Xi, Yi, Yi − Ȳ, predicted Y and the explained and unexplained variation, giving the total, explained and residual sums of squares, plus the resulting a, b and R-squared.]
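A minimal Python sketch of the same calculation, using invented numbers rather than the homework's education/GDP figures (all values below are illustrative only):

import numpy as np

# Illustrative data only -- not the homework's education/GDP figures
x = np.array([10.0, 12.0, 14.0, 6.0, 8.0])    # e.g. years of education
y = np.array([55.0, 70.0, 80.0, 15.0, 20.0])  # e.g. GDP per person (thousands)

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of products of deviations / sum of squared deviations of X
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: a = Y-bar - b * X-bar
a = y_bar - b * x_bar

y_hat = a + b * x
tss = np.sum((y - y_bar) ** 2)      # total sum of squares
ess = np.sum((y_hat - y_bar) ** 2)  # explained sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual (unexplained) sum of squares

r_squared = ess / tss
print(f"a = {a:.2f}, b = {b:.2f}, R^2 = {r_squared:.2f}")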
Inductive Statistics: Introduction. So far we have only looked at samples, and we will most often only have samples rather than entire populations. We have described and analysed these samples and computed means, standard deviations, correlation coefficients, regression coefficients, etc. However, because of "the luck of the draw", the estimated parameters will deviate from the 'true' parameters of the whole population (sampling error). We now move from descriptive statistics to inductive statistics: we no longer only describe samples, but draw conclusions about characteristics of the entire statistical population based on our sample. Chapters 5 & 6 provide the tools necessary to make inferences from a sample.
Inductive Statistics: Introduction (II). What can we infer from a sample? If we know the sample mean, how good an estimator is it of the population mean? If we have calculated the correlation and regression coefficients from a sample of observations, how good are they as estimators of the 'true' correlation and regression coefficients? How reliable are our estimates?
Sample Biases. As a first step, especially when working with historical data, we need to ascertain whether our sample is likely to be representative or whether it may suffer from serious bias problems… Is the sample of records that has survived representative of the full set of records that was originally created (e.g. business records, household inventories)? Did all records have an equal chance of making their way to the archive (success bias)? Is the sample drawn from the records representative of the information in those records? Should you computerise information about people whose surname begins with W, or is B possibly a better choice? E.g. the rate of return on equity (survivorship bias). Is the information in the records representative of a wider population than that covered by the records? E.g. height records of recruits, tax records (selection bias). Sampling will affect the inferences we (can) draw.
Sampling Distribution. The sampling distribution is the distribution of the parameter estimates that would be obtained if a large number of random samples of a given size were drawn from a given population; it is a hypothetical distribution. Example: we draw a sample of 20 rabbits and calculate the mean ear length; afterwards we set the rabbits free. We repeat this 100 times and so obtain 100 estimates of mean ear length based on 100 samples of 20 rabbits. The distribution might look like this: 4 times we calculated 52.5 < mean ≤ 55; 15 times, 55 < mean ≤ 57.5; … times, 57.5 < mean ≤ 60; and so on.
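A small Python simulation of this idea, with a made-up rabbit population (the population parameters below are invented purely for illustration):

import numpy as np

rng = np.random.default_rng(42)

# Invented population: rabbit ear lengths, roughly centred on 56 mm
population = rng.normal(loc=56, scale=5, size=100_000)

sample_size = 20   # rabbits per sample
n_samples = 100    # how many times we repeat the sampling

# Draw 100 samples of 20 rabbits and record the mean of each sample
sample_means = np.array([
    rng.choice(population, size=sample_size, replace=False).mean()
    for _ in range(n_samples)
])

# Tabulate the sampling distribution in bins like those on the slide
bins = [52.5, 55, 57.5, 60]
counts, _ = np.histogram(sample_means, bins=bins)
for lo, hi, c in zip(bins[:-1], bins[1:], counts):
    print(f"{c:3d} times, we calculated {lo} < mean <= {hi}")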
Sampling Distribution (II). The standard error is the estimated standard deviation of the sampling distribution. [Figure: sampling distribution of the sample mean, with the population mean μ at the centre, the sample means X̄ on the horizontal axis, probability on the vertical axis, and the standard error SE(X̄) marking the spread; the sample mean X̄ estimates μ.]
Central Limit Theorem. 1. Regardless of the shape of the population distribution, as the sample size (of the samples used to create the sampling distribution of the mean) increases, the shape of the sampling distribution becomes normal. 2. The mean of the sampling distribution will be equal to the 'true' but unknown population mean: on average, the known sample mean X̄ will be equal to μ, the unknown population mean. 3. The standard deviation of the sample (s) can be taken as the best estimate of the population standard deviation (σ). The standard error (SE) of the sample mean, i.e. the standard deviation of the sampling distribution, is therefore SE(X̄) = s/√N.
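A short Python sketch illustrating points 1–3: it compares the empirical standard deviation of many simulated sample means with the formula s/√N (the population and the sample sizes are invented for the illustration):

import numpy as np

rng = np.random.default_rng(0)

# Invented, deliberately non-normal (skewed) population, to illustrate point 1
population = rng.exponential(scale=10, size=200_000)

N = 50             # sample size
n_repeats = 5_000  # number of samples used to build the sampling distribution

sample_means = np.array([
    rng.choice(population, size=N).mean() for _ in range(n_repeats)
])

# Point 2: the mean of the sampling distribution is close to the population mean
print("population mean:", population.mean().round(3))
print("mean of sample means:", sample_means.mean().round(3))

# Point 3: the standard error s/sqrt(N) approximates the SD of the sampling distribution
one_sample = rng.choice(population, size=N)
se_formula = one_sample.std(ddof=1) / np.sqrt(N)
print("SE from formula s/sqrt(N):", se_formula.round(3))
print("SD of the sampling distribution:", sample_means.std(ddof=1).round(3))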
Standard Normal Probability Distribution. With the mean (X̄) and the standard deviation (SE) of the sampling distribution, we have all the information about the distribution. However, we now want to standardise this sampling distribution using Z = (X̄ − μ)/SE(X̄). The distribution of Z always has a mean of zero and a standard deviation of 1. The proportion of the area under the curve up to or beyond any specific value of Z can now be obtained from a published table.
Standard Normal Probability Distribution (II). A standard normal distribution is a normal distribution N(0,1) with mean μ = 0 and standard deviation σ = 1. [Figure: standard normal curve with 95% of cases between Z = −1.96 and Z = +1.96, and 2.5% of cases in each tail.]
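These cut-off values can also be reproduced without a printed table, e.g. with scipy (a sketch added for illustration, not part of the course material):

from scipy.stats import norm

# Critical value leaving 2.5% in the upper tail: ~1.96
z_crit = norm.ppf(0.975)
print(f"z for 2.5% upper tail: {z_crit:.2f}")

# Share of cases between -1.96 and +1.96: ~95%
central_share = norm.cdf(1.96) - norm.cdf(-1.96)
print(f"share between -1.96 and +1.96: {central_share:.3f}")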
Student's t-distribution. Student's t-distribution is very similar to the standard normal Z-distribution, but adjusts for the degrees of freedom (df). As the sample size N tends to infinity, the t-distribution approaches the standard normal Z-distribution. We know the proportion of cases beyond a certain t-value, e.g. 2.5% of the cases lie above t = 1.98 for N − 1 = 120 degrees of freedom, and above t = 1.96 when N approaches infinity.
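Again, the critical values can be checked in Python rather than in a table (illustrative sketch):

from scipy.stats import norm, t

# 97.5th percentile of the t-distribution, leaving 2.5% in the upper tail
print(f"t crit, df = 120: {t.ppf(0.975, df=120):.3f}")        # ~1.980
print(f"t crit, df = 10000: {t.ppf(0.975, df=10_000):.3f}")   # ~1.960
print(f"z crit (normal): {norm.ppf(0.975):.3f}")              # 1.960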
Confidence Intervals. We now come back to the question asked before: how good are our estimates of the parameters obtained from the sample? How good an estimator is, say, the sample mean X̄ of what we really want to know, the population mean μ? The sample mean can be taken as an estimate of the unknown population mean. Though correct on average, a single estimate from an individual sample might differ from the true mean to some extent. We can generate an interval in which the "true" (population) mean is located with a specified probability. 90% CI: with a probability of 90%, the interval includes μ. 95% CI: in 95 out of 100 cases, the interval includes μ. 99% CI: there is a 99% probability that the interval includes μ.
Confidence Intervals (II). How many standard errors either side of the sample mean do we have to add to achieve a degree of confidence of 95%? The t-distribution gives the exact value. We know the proportion of cases beyond a certain t-value, e.g. 2.5% of the cases lie above t = 1.98 for N − 1 = 120 degrees of freedom and above t = 1.96 when N approaches infinity. Example: birth rate in English parishes. N = 214 parishes; the mean is … births per 100 families; the standard error (SE) is 0.308 births; the t value for 213 degrees of freedom is 1.971. The 95% confidence interval for the mean birth rate of the population is therefore: mean ± (1.971 × 0.308) = mean ± 0.607.
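A Python sketch of this calculation; the sample mean is not reproduced on the slide, so a placeholder value is used here purely to make the code run:

from scipy.stats import t

n = 214       # parishes
mean = 13.0   # placeholder -- the slide's actual mean is not reproduced here
se = 0.308    # standard error of the mean, in births per 100 families

t_crit = t.ppf(0.975, df=n - 1)  # ~1.971 for 213 degrees of freedom
margin = t_crit * se             # ~0.607

print(f"95% CI: {mean - margin:.3f} to {mean + margin:.3f}")
print(f"(mean +/- {margin:.3f})")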
Computer Class: Repetition & Confidence Intervals
Exercises. Weimar elections: unemployment and votes for the Nazi party. Get the dataset about the Weimar elections at … Look at the variables (votes for the Nazi party, level of unemployment) in turn. Get a first visualisation of the data; does it look normally distributed? Compute the mean, median, standard deviation, coefficient of variation, kurtosis and skewness for the voting share of the Nazi party and the level of unemployment. Estimate the following regression for each of the first two of the four elections (09/30, 03/33): Nazi = a + b·Unemployment. Explain in words what the two regressions tell you. Draw the respective scatter plots and draw in the regression lines. Calculate the 90%, 95% and 99% confidence intervals for a and b. Are b and the explanatory power of the regression the same for the election in 1930 and the one in 1933? (A possible starting point in Python is sketched below.)
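A possible Python starting point for the regression and confidence-interval parts of the exercise; the file name and column names below are placeholders and will need to be adapted to the actual dataset:

import pandas as pd
import statsmodels.api as sm

# Placeholder file and column names -- adapt to the actual Weimar dataset
df = pd.read_csv("weimar_elections.csv")
y = df["nazi_share_0930"]                 # Nazi vote share, September 1930 election
X = sm.add_constant(df["unemployment"])   # adds the intercept a

model = sm.OLS(y, X).fit()
print(model.summary())                    # coefficients a and b, R-squared, etc.

# Confidence intervals for a and b at the three requested levels
for level in (0.90, 0.95, 0.99):
    print(f"\n{int(level * 100)}% confidence intervals:")
    print(model.conf_int(alpha=1 - level))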
Homework. Readings: Feinstein & Thomas, Ch. 6. Repeat what we have learned today. Problem Set 5: finish the exercises from today's computer class if you haven't done so already. Include all the results and answers in the file you send me.