Probability & Statistical Inference Lecture 4 MSc in Computing (Data Analytics)


Lecture Outline
- Modern statistics uses a number of mathematical results to relate descriptive statistics and probability theory.
- These can be divided (roughly) under three headings:
  - Central Limit Theorem (large samples)
  - Maximum Likelihood methods (large samples)
  - Small-sample results
- Although the mathematical details are quite different in each case, the end results and the reasoning used are almost identical.
- We will look in detail at the Central Limit Theorem, but without the higher mathematics.
- If you understand how the Central Limit Theorem works, you also get the essential understanding of the other methods.

Sampling Theory: Statistical Models
- Central Limit Theorem (CLT): a description
- How many voters will give F.F. a first preference in the next general election? We have 2 different estimates:
  1. Researcher A (10 people) => 40%
  2. Researcher B (100 people) => 25%
- How much 'better' is estimate B than estimate A?
- Real question: what makes a 'good' estimate?
  - unbiased
  - low variability, i.e. if the survey were repeated we should get a 'similar' answer

Example
- Suppose an engineer wants to estimate the lifetime of an electronic component.
- Using simple random sampling, they select a sample and test it. The sample is taken so that the component lifetimes can be considered independent of each other.
- God's-eye view: mean lifetime μ = 4,900 hours, σ = 3,959 hours (you would never know this in practice).
- [Figure: histogram of the population of component lifetimes.] Note: it is highly skewed and is NOT normal.
- What would happen if we took repeated samples of the same size and calculated their means?

Example Continued
- Experiment: take a sample of size 2 from this population and calculate the mean of the sample.
- Repeat this 2,000 times.
- We now have 2,000 means. What would the histogram of all these means look like?
- What would happen if you did the same experiment, but with samples of sizes 10, 20 and 30?

[Figure: the original distribution alongside the distributions of the sample means for varying sample sizes.] Note that the histograms become more Normal as the sample size increases.

[Figure: the same result, plotted on the same scale.] Note that the spread decreases with increasing sample size.

Central Limit Theorem
- What has happened?
- As the sample size increased, the shape of the histogram of means approached the Normal distribution.
- As the sample size increased, the spread (standard deviation) of the sample means decreased.
- These histograms are pictures of the Sampling Distribution of the Mean.
- This happens for any population that has a finite mean and standard deviation.
- The result that guarantees this is called the Central Limit Theorem (CLT).
- The proof of the CLT involves some fairly non-trivial mathematics.

Central Limit Theorem
- Since bigger samples are more representative, two means from samples of size 100 are likely to be closer together than two means from samples of size 10.
- The larger the sample size, the more the sample means will tend to agree, so the standard deviation of the Sampling Distribution of the Mean will decrease.
- When the sample size is sufficiently large, the Sampling Distribution of the Mean will be approximately Normally distributed.

Central Limit Theorem
- If a random sample is taken from a population, where:
  - the members of the sample can be considered independent of each other,
  - they are all members of the same population,
  - that population has a mean value μ and a standard deviation σ,
- then a sample mean (x̄) can be considered a random variable drawn from a probability distribution of possible sample means of the same size, called the Sampling Distribution of the Mean.

Definition: Central Limit Theorem continued…
- The sampling distribution of the mean has an average value equal to μ (the population mean).
- The sampling distribution of the mean has a standard deviation equal to $\sigma/\sqrt{n}$, where σ is the population standard deviation and n is the sample size taken.
- This value is called the standard error of the mean.
- The Sampling Distribution of the Mean will be approximately a Normal distribution if the sample size is large.

CLT - Summary
- When the sample size is sufficiently large, the Sampling Distribution of the Mean will be
  - approximately normally distributed,
  - with a mean equal to μ,
  - and a standard deviation (i.e. standard error) equal to $\sigma/\sqrt{n}$.

From the simulation above: for a sample size of 2, the standard error of the mean should be 3959 / √2 = 2,799.

             Mean from 2,000 samples   Standard deviation predicted by CLT   Actual standard deviation
Population   4,…                       …                                     …
Size = 2     5,017                     2,799                                 2,805
Size = 10    4,899                     1,251                                 1,232
Size = 20    4,…                       …                                     …
Size = 30    4,…                       …                                     …
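The experiment behind this table can be sketched in Python. This is a minimal sketch, not the lecturer's simulation: the lecture only says that the population is highly skewed, so the gamma distribution used here (matched to μ = 4,900 and σ = 3,959) is an assumption, and the function names are illustrative.

```python
import math
import random
import statistics

# ASSUMPTION: the population is modelled as a gamma distribution matched to
# the lecture's mean and standard deviation. The slides do not name it.
POP_MEAN = 4900.0
POP_SD = 3959.0
SHAPE = (POP_MEAN / POP_SD) ** 2   # gamma shape k = mean^2 / variance
SCALE = POP_SD ** 2 / POP_MEAN     # gamma scale theta = variance / mean


def simulate_sample_means(n, repeats=2000, seed=0):
    """Return `repeats` sample means, each computed from a sample of size n."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gammavariate(SHAPE, SCALE) for _ in range(n))
        for _ in range(repeats)
    ]


def summarise(n, repeats=2000, seed=0):
    """Return (mean of the means, SE predicted by the CLT, observed SD of the means)."""
    means = simulate_sample_means(n, repeats, seed)
    predicted_se = POP_SD / math.sqrt(n)
    return statistics.fmean(means), predicted_se, statistics.stdev(means)


if __name__ == "__main__":
    for n in (2, 10, 20, 30):
        mean_of_means, predicted, observed = summarise(n)
        print(f"n={n:2d}  mean={mean_of_means:7.0f}  "
              f"predicted SE={predicted:6.0f}  observed SD={observed:6.0f}")
```

The simulated figures will not match the slide exactly, since both the population model and the random draws differ, but the observed standard deviation of the means should track σ/√n closely at every sample size.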

Practical use for the CLT continued…
- This avoids the necessity of specifying a complete statistical model for all the sampled data.
- All we have to do is specify a probability model for the sample mean.
- For any sample mean calculated from a large independent random sample taken from ANY population with a mean μ and standard deviation σ, we know from the CLT that this sample mean is a random variable from an approximately Normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$.

Practical use for the CLT continued…
- Take a single sample and calculate x̄.
- This is an estimate of μ, the true (but unknown) population mean.
- But how good is this estimate?
- We assume that x̄ is not exactly μ, but is somewhere near it. How near is it likely to be?

Confidence Intervals: Introduction
- We would like to make probability statements about how close x̄ is likely to be to μ.
- If the sample size is sufficiently large, the estimate x̄ can be considered a random variable from a Normal distribution, so probability statements are possible.
- This is how we use the CLT in practical data analysis.

- For a Normal distribution, we know that 95% of values will be within 1.96 standard deviations of μ.
- So, given one estimate x̄, we can say that this estimate is within 1.96 standard errors of the actual population mean μ, with 95% confidence.
- [Figure: Normal curve with 95% in the shaded area.]
- We can turn this knowledge on its head: given x̄, we can be 95% confident that the true mean μ is within 1.96 standard errors of it.

Confidence Interval
- From this we can specify a range of values within which we are 95% confident that the population mean (μ) lies.
- This is called a confidence interval.
- 95% confidence interval for a population mean (from a large enough sample): $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$
- Remarkably, this result holds to a good approximation for samples of size 30 or more. So a large sample, in this context, is a sample of 30 or more.

Example
- One sample of size 30 from the electronic components yields a sample mean x̄ = 5,873 hours. We know σ = 3,959, so a 95% confidence interval is:
  $5873 \pm 1.96 \times 3959/\sqrt{30} = 5873 \pm 1417$, i.e. 4,456 to 7,290.
- So we would say that the average lifetime of all components (μ) is between 4,456 and 7,290 hours, with 95% confidence.

Confidence Intervals
- Why is this any good?
- Before: one estimate, x̄ = 5,873, but no idea of how good or bad it was, i.e. how close to μ it was likely to be.
- Now: 95% confident that μ is between 4,456 and 7,290 hours.
- So, using CLT => confidence intervals => we get an estimate with a level of confidence that can be justified,
- i.e. it gives us an objective measure of the actual amount of information contained in our sample about the likely location of μ.

General Confidence Interval for μ (σ known)
- The general formula is: $\bar{x} \pm Z_{1-\alpha/2}\,\sigma/\sqrt{n}$
- Where:
  - α is a value between 0 and 1, and (1-α)×100% is the confidence level you want
  - Z_{1-α/2} is a value from the Normal distribution table
- Example: for a 95% CI, α = 0.05 => (1-α)×100% = 95% => Z_{1-α/2} = 1.96
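As a rough illustration of the general formula, the sketch below computes the interval with Python's standard library and applies it to the electronic component example (x̄ = 5,873, σ = 3,959, n = 30). The function name `z_confidence_interval` is illustrative, and the Z value is taken from `statistics.NormalDist` rather than from a printed table.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)             # Z times the standard error
    return sample_mean - margin, sample_mean + margin


# Electronic component example from the lecture.
lower, upper = z_confidence_interval(5873, 3959, 30)
print(f"95% CI: {lower:,.0f} to {upper:,.0f} hours")
```

Rounded to whole hours, this agrees with the 4,456 to 7,290 interval quoted for the component example.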

Problem with σ
- All of the above assumes that the population standard deviation (i.e. σ) is known.
- In practice this is not known (just like μ).
  => So we need to estimate σ as well as μ.
  => We get this estimate from the standard deviation of the sample.
- The sample standard deviation is called 's'.
  => Estimate σ by s when the sample size is large.

Z-Values
- The values of Z_{1-α/2} for other confidence levels are given in standard tables.

Confidence Level   α/2              Z_{1-α/2}
90%                0.05 (5%)        1.645
95%                0.025 (2.5%)     1.960
99%                0.005 (0.5%)     2.576
99.9%              0.0005 (0.05%)   3.291
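Instead of a printed table, these Z values can be obtained from software. A minimal sketch using Python's standard library follows; `z_value` is an illustrative helper name, not something from the lecture.

```python
from statistics import NormalDist


def z_value(confidence):
    """Return Z_{1-alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.99, 0.999):
    print(f"{level:.1%}  alpha/2 = {(1 - level) / 2:.4f}  Z = {z_value(level):.3f}")
```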

Example
- Using these we get the following results for the electronic component example:
- [Table: Confidence level | Z_{1-α/2} | CI, with rows for 90%, 95%, 99% and 99.9%. Most entries did not survive extraction; the last row's interval ends at 9067.]
- Note: as α gets smaller, the CI gets wider.
- At the same time, as n gets bigger the CI narrows, so big samples lead to more precise estimates (i.e. narrower confidence intervals).

What CIs and sample sizes should I use?
- You can't control s: it is inherent in the data (population).
- You can't control x̄ either.
- You can control Z_{1-α/2}, but in practice scientific convention sets this to reflect 90%, 95% or 99% confidence, with 95% being the accepted default.
- You can choose n, but resources may limit you.
- There is a whole topic called sample size determination, which you may want to review before collecting data or starting research.

Assumptions for hypothesis testing about μ (large sample) and calculation of CIs
- Sample size is 30 or greater.
- Experimental units are independent of each other.
- Experimental units were randomly sampled.
- The independence assumption requires that the value of the variable for one experimental unit should not tell us anything about the value for another, e.g. in the rats experiment, different and unrelated rats should be used, not 1 rat tested 100 times.
- Randomness is required to avoid systematic bias in selection.

Exercise
- Complete Exercises 1 & 2

Calculation of CIs for small samples
- What about small samples?
- In the case of CIs about a mean we can use the Student t distribution.
- The process turns out to be very similar, but the CLT no longer applies.

History of the Student t test
- William Gosset used the publishing pseudonym 'Student'. He derived the correct sampling distribution for the mean of samples smaller than 30 and called it the 't distribution'.
- In his honour, it is often called the 'Student t' distribution.
- Gosset was a chief brewer for Guinness.
- The mathematical details are complicated, but it turns out that we perform exactly the same calculations as before, with the one change that the t distribution is used instead of the normal distribution.

Assumptions
- Student's t result applies only to the mean of a sample from a population that is normally distributed with some mean μ and finite standard deviation σ.
- This is in contrast to the CLT for large samples, which requires no such assumption about normality.
- The t-test also requires the assumption of independence in the sample.

Statistical Model for the mean from small samples
- The experimental units are independently sampled from a population with mean μ and standard deviation σ.
- The population is normally distributed (we don't need this with large samples).
- So, to use the t-test for a small sample, you need to establish that the data is sampled from a population that is normally distributed. You could look at the histogram of the sample and see if it is symmetric and bell shaped, or use other methods.

The t-Statistic
- If the assumptions are met, the statistic $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$ can be shown to be distributed according to a (Student) t-distribution.
- The t-distribution has one parameter, called 'degrees of freedom' (df).

The t-Distribution
- The t-distribution itself is bell shaped and symmetric, just like the normal distribution, but is 'flatter' (it has heavier tails).
- There are many t distributions, one for each sample size.
- The rule used is: for a sample of size n, use the t distribution with degrees of freedom = n − 1.
- Example: if the sample size is 15, then use a t distribution with 15 − 1 = 14 degrees of freedom.
- Note that degrees of freedom is often abbreviated to df.

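The t-Distribution
- The t probability density function with k degrees of freedom:
  $f(x) = \dfrac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{\pi k}\,\Gamma\left(\frac{k}{2}\right)} \left(1 + \dfrac{x^2}{k}\right)^{-(k+1)/2}, \quad -\infty < x < \infty$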

General Confidence Interval for μ (small samples)
- The general formula is: $\bar{x} \pm t_{n-1,\,1-\alpha/2}\; s/\sqrt{n}$
- Where (1-α)×100% is the confidence level you want and t(n − 1, 1 − α/2) is a value from the t distribution with df = n − 1 at the specified α level.
- What is t(n − 1, 1 − α/2)?
- A value from the t distribution with n − 1 df such that 100(1 − α)% of values lie within that distance either side of the mean.
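The small-sample formula can be sketched as follows. Python's standard library has no t quantile function, so this sketch assumes the critical value is looked up in a t table and passed in; the function name `t_confidence_interval` and the five measurements are hypothetical and are not data from the lecture.

```python
import math
import statistics


def t_confidence_interval(data, t_crit):
    """CI for a mean from a small sample.

    t_crit is t(n - 1, 1 - alpha/2), read from a t table for df = len(data) - 1.
    """
    n = len(data)
    mean = statistics.fmean(data)
    s = statistics.stdev(data)             # sample SD, divisor n - 1
    margin = t_crit * s / math.sqrt(n)     # t times the estimated standard error
    return mean - margin, mean + margin


# HYPOTHETICAL measurements, for illustration only.
sample = [10.0, 12.0, 11.0, 13.0, 14.0]
# n = 5, so df = 4; the tabulated value t(4, 0.975) is 2.776.
lower, upper = t_confidence_interval(sample, t_crit=2.776)
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

Passing the critical value in keeps the sketch dependency-free; with a statistics package available, the same value could be computed rather than looked up.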

- How do you find t(n − 1, 1 − α/2)?
- From a table specifically designed to give it to you, or use a computer.
- Note: as α gets smaller the CI gets wider; as df gets smaller the CI gets wider.

Confidence Level   α/2              t(df=1)   t(df=10)   t(df=30)
90%                0.05 (5%)        6.314     1.812      1.697
95%                0.025 (2.5%)     12.706    2.228      2.042
99%                0.005 (0.5%)     63.657    3.169      2.750
99.9%              0.0005 (0.05%)   636.619   4.587      3.646

Example
- Internal temperature of autoclaved aerated concrete used in building. An engineer recorded the following data: 23.01, 22.22, 22.04, 22.62, …
- What is the 95% CI for the population mean?

Exercise
- Answer Questions 3-6

Confidence Intervals for Proportions (Large Samples)
- Proportions (including percentages) are often a statistic of interest.
- Think of the proportion of defective items on a production line, the proportion of people who respond favourably to a survey question, or the proportion of successes versus failures in some experiment.
- Proportions are also covered by the CLT. Remember that a proportion is a different kind of average.

Confidence Intervals for Proportions (Large Samples)
- Take a sample of size n of electronic components coming off a production line, and test each one for defects. The statistic of interest is the proportion of defectives produced by the production process.
- The estimated proportion from the sample is $\hat{p} = x/n$, where x is the number of defective components found in the sample,
- and p̂ (p-hat) is the symbol used for the estimated proportion from the sample.

Confidence Intervals for Proportions (Large Samples)
- If the sample size is sufficiently large and we repeat the experiment a large number of times, then:
  - The sampling distribution of the proportion will be approximately normally distributed, by the CLT.
  - The mean of this distribution will be p, i.e. the 'true' population proportion.
  - The standard deviation of the sampling distribution of the proportion, called the standard error of the proportion, is estimated by $\sqrt{\hat{p}(1-\hat{p})/n}$.

Example:
- A pharmaceutical company produces 400,000 capsules per day of a particular drug. They test 200 of the capsules for defects (too much or too little active compound). If the population p = 0.05 and they take 10,000 repeated samples, this is the histogram they would get.
- [Figure: histogram of the 10,000 sample proportions.]

Sample Size
- How big does the sample have to be for the CLT to work with proportions?
- The rule is different from the rule for means. Do the following test.
- A rule of thumb: the sample size is big enough if
  1. np > 5, and
  2. n(1-p) > 5
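The rule of thumb is simple enough to check in code. The sketch below applies it to the two voter surveys from the start of the lecture; `large_sample_ok` is an illustrative name, and the sample proportion stands in for the unknown p.

```python
def large_sample_ok(n, p, threshold=5):
    """Rule of thumb from the lecture: np > 5 and n(1 - p) > 5."""
    return n * p > threshold and n * (1 - p) > threshold


print("Researcher A:", large_sample_ok(10, 0.40))    # np = 4, so the rule fails
print("Researcher B:", large_sample_ok(100, 0.25))   # np = 25 and n(1-p) = 75
```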

General Confidence Interval Formula for a Population Proportion (large sample)
- The general formula is: $\hat{p} \pm Z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$
- where (1-α)×100% is the confidence level and Z_{1-α/2} is a value from the standard normal distribution such that 100(1-α)% of values of a standard normal distribution lie within that distance either side of the mean.
- So the Z_{1-α/2} values used for a population proportion are the same as those used for a population mean.

Example
- How many voters will give F.F. a first preference in the next general election? There are 2 different estimates:
  - Researcher A (10 people) => 40%
  - Researcher B (100 people) => 25%
- How much 'better' is estimate B than estimate A?
- Step one: can we use the large-sample formula?
  1. Researcher A: np = 10 × 0.4 = 4. Since 4 is not greater than 5, you cannot use the large-sample method.
  2. Researcher B: np = 100 × 0.25 = 25 and n(1-p) = 100 × (1-0.25) = 75. Both figures are greater than 5, so you can use the large-sample method.

Example Continued
- Researcher B, 95% confidence interval: $0.25 \pm 1.96\sqrt{0.25 \times 0.75/100} = 0.25 \pm 0.085$
- So the 95% CI is 17% to 33%.
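Researcher B's interval can be checked with a short sketch of the large-sample formula for a proportion. The function name `proportion_confidence_interval` is illustrative, and the Z value comes from Python's `statistics.NormalDist` rather than a table.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """Large-sample (CLT-based) confidence interval for a population proportion."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)   # estimated standard error
    return p_hat - z * se, p_hat + z * se


# Researcher B: 25% of 100 people.
lower, upper = proportion_confidence_interval(0.25, 100)
print(f"95% CI: {lower:.1%} to {upper:.1%}")
```

Rounded to whole percentages, this gives the 17% to 33% interval quoted for Researcher B.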

Example Continued
- NB: In fact we can get a 95% CI for Researcher A's findings using small-sample theory (an exact CI). This is available in SAS and other software.
- Exact CIs are often based on direct use of probability models.
- The method is based directly on calculations for the binomial distribution (see Lecture 3).
- What do we have to do?
- Using the CLT, we found that the 95% CI was composed of the set of values for the mean such that a hypothesis test would not reject the null hypothesis for any value in the set at the α = 0.05 level.

- Using SAS we can calculate a 95% CI for Researcher A:
- 95% CI for Researcher A = 12% to 74%,
- which is too wide to be informative anyway!
- If we use the same technique for Researcher B we get:
- 95% CI for Researcher B = 17% to 35%,
- which is virtually the same as before using the CLT.
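For readers without SAS, an exact interval can be sketched directly from the binomial distribution. The lecture does not say which exact method SAS was asked for; the sketch below assumes the Clopper-Pearson interval, found by bisection, and all function names are illustrative.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(f, target, iters=60):
    """Find p in [0, 1] with f(p) = target by bisection, for f increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x, n, confidence=0.95):
    """Clopper-Pearson interval for x successes in n trials (ASSUMED method)."""
    alpha = 1 - confidence
    # Lower limit: the p at which P(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper limit: the p at which P(X <= x) equals alpha/2,
    # i.e. P(X > x) equals 1 - alpha/2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper


for name, x, n in (("Researcher A", 4, 10), ("Researcher B", 25, 100)):
    lo, hi = exact_ci(x, n)
    print(f"{name}: {lo:.1%} to {hi:.1%}")
```

Under that assumption, the results are consistent with the SAS figures quoted above once rounded to whole percentages.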

Exact CIs and tests for population proportions
- These work for small samples as well as large samples.
- With large samples they give essentially the same results as the CLT-based method.
- They must be used for small samples, however.
- They are based on the binomial probability distribution.

Difference between Exact and CLT-based methods
- When sample sizes are 'large' they give essentially the same results, but exact tests can be very hard to compute even with modern PCs.
- When sample sizes are small, exact methods must be used.
- The CIs from small samples tend to be very wide. There is no substitute for collecting as much high-quality data as you can manage.

Exercise
- Answer Questions 7-9