Sample-Based Epidemiology Concepts: Infant Mortality in the USA (1991)


Presentation transcript:

1 Sample-Based Epidemiology Concepts

Infant Mortality in the USA (1991)

          Unmarried    Married      Total
Deaths       16,712     18,784     35,496
Alive     1,197,142  2,878,421  4,075,563
Total     1,213,854  2,897,205  4,111,059

We rarely have the luxury of having the entire population at our disposal, so we usually take a small random sample from our selected population (or a large one, if you have the money and time, and an even larger one if you also have lots of post-docs to collate data) and estimate the population incidence (probabilities) from the sample. This means that our estimates will contain error: bigger errors if we sample small numbers of people and smaller errors if we sample larger numbers of people. Because of this error in estimating the population parameter, we have to calculate confidence limits for our estimate. Our sample predicts a parameter, but the parameter could be smaller or larger than the predicted value, so we need to know the range of possible values for the predicted parameter. To see how this works we have to delve into the incredibly cool Universe of Statistical Analysis.

2 The terms "confidence limits" and "estimate of population parameters" are central to research in the health sciences, and both are statistical concepts. Statistics and statistical analysis are nothing more than calculating measures of probability, association, central tendency and variance from sample data (statistics), and the probabilities that the calculated statistics relate to the target population (statistical analysis). Of course, sample probabilities are not exactly the same as the actual population probabilities of infant mortality (0.0086) and infant non-mortality (0.9914) for the USA in 1991, which are two separate population parameters. A parameter is any measure from a population, while a statistic is any measure from a sample.
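As a quick check of the two population parameters quoted above, here is a minimal Python sketch that recomputes them from the counts in the 1991 table. The variable names are illustrative only and do not come from the slides.

```python
# Counts taken from the 1991 infant mortality table (whole population).
deaths_total = 35_496
births_total = 4_111_059

# These are parameters, not statistics: the whole population was counted,
# so there is no sampling error to account for.
p_mortality = deaths_total / births_total
p_survival = 1 - p_mortality

print(f"infant mortality:     {p_mortality:.4f}")
print(f"infant non-mortality: {p_survival:.4f}")
```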

3 If we test entire populations then we do not need statistical analysis. For example, if another population (let's say, another country) was measured in its entirety and that country's infant mortality and infant non-mortality were calculated as 0.0085 and 0.9915, respectively [compared to infant mortality (0.0086) and infant non-mortality (0.9914) for the USA in 1991], we could conclude with absolute certainty (100% confidence) that the two populations differ with regard to these two parameters, because the calculated numbers describe the respective populations exactly (even though the difference between the two populations is tiny). Different numbers means different! However, because samples are not necessarily exactly representative of the population from which they came, differing numbers from two (or more) different samples do not guarantee that the samples came from two (or more) different populations.

4 As previously mentioned, we simply NEVER (well, not very often anyway) have the luxury of being able to measure the entire population, so we have to make do with a (usually) small sample that was selected from the population. We then measure whatever it is we are interested in, let's say "Infant Mortality" or "Height", and assume that our sample represents our population, so that whatever the sample statistic is, that same number is an estimate of the parameter of the population from which the sample was selected. Because such an assumption may not be absolutely true (i.e., the sample doesn't perfectly represent the population), we need to have some idea of where the actual population parameter might be. To do this, we perform a particular type of statistical analysis to estimate a range of possible values that would include the population parameter, and we use the sample data to do so: the sample statistic is used to estimate the exact middle of the range, and the variability of the numbers in the sample is used to estimate the highest and lowest values of the range. To understand how these statistical calculations are made we need to start with a frequency distribution of the data:

5 Once we have a frequency distribution of the data, the mathematical properties of the frequency distribution can be used to estimate the range of values within which the population parameter might lie, within certain confidence limits or confidence intervals. The predicted range of values is calculated on the basis of Confidence Intervals, and these are defined by percentages: 95% confidence interval, 90% confidence interval, 99% confidence interval. These percentages relate to statistical probabilities. Strictly speaking, they describe how often intervals calculated this way from repeated samples would capture the parameter; loosely, they are read as follows:

95% CI: There is a probability of 0.95 that the population parameter lies within the calculated range of values, or a probability of 0.05 that it does not.

90% CI: There is a probability of 0.90 that the population parameter lies within the calculated range of values, or a probability of 0.10 that it does not.

99% CI: There is a probability of 0.99 that the population parameter lies within the calculated range of values, or a probability of 0.01 that it does not.

6

7

8

9

10 An extremely accurate, but rather cumbersome, way to describe data, especially if there were hundreds or thousands of people in the population.

11 A slightly less accurate description, but a whole lot easier to give, because only the shape of the line is being described, not each of the individual data points. Note that the shape of the line still accurately describes how the data are distributed on the number line; we just need a more accurate way to describe the line.

12 And there even is a way to calculate those two parts of the curve. (If you look at the right and left halves of the curve separately, you may recognize them as sigmoid curves.)

13 The measure of central tendency most often used to describe the peak of the data curve is called mu (µ, the population parameter) or the mean (x̄, the sample statistic), and the measure of variability most often used to describe the dispersion of the data along the number line is called the standard deviation (σ for the parameter; sd for the statistic), which is equal to the square root of the variance (σ² or V).

$$\mu = \frac{\sum x}{n}$$

(commonly called the average: add up all the scores and divide by the total number of scores)

$$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$$

(subtract the mean from each score, square each result, add up all the squares, and then divide by n; then take the square root to get σ)

The µ corresponds to the exact point on the number line where the central peak of the frequency distribution curve sits, and σ corresponds to the distance from that mid-point to the point on the number line where the curve changes from bending downward to flattening out (its inflection point).
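The two formulas above can be sketched in a few lines of Python. The list of scores is made up purely for illustration (it is not data from the slides), and the function names are hypothetical.

```python
import math

def population_mean(scores):
    """mu: add up all the scores and divide by the number of scores."""
    return sum(scores) / len(scores)

def population_sd(scores):
    """sigma: square root of the mean squared deviation from mu."""
    mu = population_mean(scores)
    variance = sum((x - mu) ** 2 for x in scores) / len(scores)
    return math.sqrt(variance)

# Hypothetical scores, chosen so the arithmetic is easy to follow by hand.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print("mu    =", population_mean(scores))
print("sigma =", population_sd(scores))
```

Note that this divides by n, as in the population formula on the slide; many software packages divide by n - 1 by default when the data are treated as a sample.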

14

15 An advantage of describing your population in terms of how the data are distributed on a number line using µ and σ is that many populations can be represented, at least approximately, by this same kind of curved line, a line often called a normal curve. An important property of these curves is that they are very easy to describe in terms of mathematical probabilities. For example, we know that 50% of all the heights (data points) in the population are greater than the center point (µ = 5’ 6.25”), which means there is a 0.50 probability that a randomly selected individual is taller than 5’ 6.25”. We also know that 68.26% of all the data points are between the two σ limits, µ − 1σ and µ + 1σ (4’ 1.75” to 6’ 10.75”), which means there is a 0.6826 probability that a randomly selected individual will be between 4’ 1.75” tall and 6’ 10.75” tall.

16 This graph simply illustrates more “percentages of the data distributed along the number line” in different sections of the curve, based on how far along the number line you go in σ units. Again, using percentages as probabilities, there is a 0.3413 probability that a randomly selected individual would be between the mean and one standard deviation above the mean. To put it a different way, we would be 34.13% confident that a randomly selected individual would be somewhere between the mean and +1 sd, or 2.28% confident that a randomly selected individual would be more than 2 sd above the mean. Note that the z-score number corresponds to the sd unit.

17 Now, from this curve you notice that standard deviation units and z-score units are the same thing. Between the +1 and -1 units are found 68.26% of all the scores in the frequency distribution. Between the +2 and -2 units are found 95.44% of all the scores in the frequency distribution. To make things easier, tables of z-scores and the % of scores between the z-score limits are available in most statistics textbooks. A few of those values are reproduced here:

Z-Score   %          Z-Score   %
1.00      68.26      2.5       98.76
1.5       86.60      2.57      99.00
1.65      90.00      3.0       99.74
1.96      95.00*     3.27      99.90
2.00      95.44      3.3+      ~100

* Traditional level for “statistical significance”
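For readers without a z-table to hand, the percentages can be recomputed from the standard normal distribution. This is a minimal sketch using only the Python standard library (`statistics.NormalDist` needs Python 3.8 or later); the function names are illustrative. Values may differ from the table in the last decimal place because printed tables are rounded.

```python
import math
from statistics import NormalDist

def central_coverage(z):
    """Fraction of a normal distribution lying between -z and +z sd units."""
    return math.erf(z / math.sqrt(2))

def z_for_confidence(level):
    """z-score whose central region (-z to +z) holds `level` of the scores."""
    return NormalDist().inv_cdf(0.5 + level / 2)

# Forward direction: z-score -> % of scores between the limits.
for z in (1.00, 1.96, 2.00, 3.00):
    print(f"z = {z:.2f} -> {100 * central_coverage(z):.2f}%")

# Reverse direction: chosen confidence level -> z-score to plug in.
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} -> z = {z_for_confidence(level):.3f}")
```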

18 Now...to figure out where the confidence limits actually come from in all those epidemiology papers... The “baby” data illustrates this fairly well...

19
                  Unmarried  Married  Total
Sample 1  Births         35       65    100
Sample 2  Births         29       71    100
Sample 3  Births         33       67    100
Sample 4  Births         41       59    100

If we randomly sampled 100 live births from all of the 4,111,059 live births in the USA in 1991, we might find that 35 births were associated with unmarried mothers. This would give a sample probability (statistic) of 35 unwed mothers / 100 live births = 0.35, an estimate of the population probability (parameter) that a birth is associated with an unmarried mother. The sample probability (statistic) is not the correct probability for the entire population, just the correct probability for the sample.

If we took 3 more (different) random samples from the same population, each of 100 live births, we would probably find a different probability that the birth is associated with unwed mothers for each sample that was randomly selected; we might get 29 / 100 = 0.29, 33 / 100 = 0.33, 41 / 100 = 0.41, and so on, and we would never be 100% certain (confident) that any one sample probability exactly represents the population parameter. We need some way to deal with this uncertainty, so we construct confidence limits or a confidence interval.

20 Marital status of samples of new mothers in the USA (1991)

                  Unmarried  Married  Total
Sample 1  Births         35       65    100
Sample 2  Births         41       59    100
Sample 3  Births         33       67    100
Sample 4  Births         29       71    100
…

If we could keep drawing samples (of n = 100) and calculating probabilities forever, we would end up with an infinite number of sample probabilities. Sample probabilities close to the true population probability would appear numerous times, while those far away would appear less frequently; the most frequently occurring sample probability (from the infinite number of samples) would correspond to the population probability, while the least frequent probabilities would correspond to the extreme values (again, from the infinite number of samples). This infinite number of theoretical sample probabilities would fit a frequency distribution curve that is approximately normally distributed.

From this theoretical Normal Distribution we can construct a confidence interval using standard percentile scores (actually the same sd units, called z-scores, illustrated in previous slides), which depend on just how confident we want to be: 95% confident? 90% confident? 99% confident? Just plug the sample values you are interested in, and the z-score value that corresponds to your chosen %-confidence level, into the formula and voilà: Confidence Intervals.
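The thought experiment of "sampling forever" can be approximated by simulation. The sketch below draws 10,000 samples of 100 births each (a stand-in for "infinite") and prints a crude text histogram of the sample probabilities. It assumes each birth is drawn independently with the population probability, which is a close approximation when 100 births are drawn from over four million; the seed and all names are arbitrary choices for illustration.

```python
import random
from collections import Counter

P_UNMARRIED = 1_213_854 / 4_111_059  # population parameter from the 1991 table
N = 100                               # births per sample
N_SAMPLES = 10_000                    # stand-in for "an infinite number"

rng = random.Random(1991)  # arbitrary fixed seed so the run is repeatable

def sample_proportion(n, p, rng):
    """Proportion of 'unmarried' births in one simulated sample of n births."""
    return sum(rng.random() < p for _ in range(n)) / n

proportions = [sample_proportion(N, P_UNMARRIED, rng) for _ in range(N_SAMPLES)]
mean_of_samples = sum(proportions) / N_SAMPLES

# Crude text histogram: one '#' per 50 samples at each observed proportion.
counts = Counter(proportions)
for value in sorted(counts):
    print(f"{value:.2f} {'#' * (counts[value] // 50)}")
print(f"mean of sample proportions: {mean_of_samples:.3f}")
```

The bars should pile up near the population probability and thin out toward the extremes, which is the bell shape the slide describes.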

21 This is another figure of that same normal curve with z-scores and percentages; the actual z-scores that correspond to 95% and 90% of the data have been added. Just imagine that this curve illustrates the distribution of an infinite number of probabilities calculated from the infinite number of samples (n = 100) that were randomly selected from the same population. We already have some idea where the middle of this “population curve” fits on a number line because we have the (ONE) sample estimate of that point; we are just not 100% confident that the sample statistic is exactly the same as the population parameter. What we need to know is the range of possible values within which the actual population center-point might lie, so we calculate that range using the above theoretical curve.

22 Marital status of a sample of new mothers in the USA (1991)

        Unmarried  Married  Total  Probability
Births         35       65    100         0.35

Confidence Interval - 95% (use z-score of 1.96)

$$0.35 \pm 1.96\sqrt{\frac{0.35 \times 0.65}{100}} = 0.35 \pm 1.96\sqrt{0.002275}$$

= 0.35 (0.257, 0.443)

Confidence Interval - 90% (use z-score of 1.644)

$$0.35 \pm 1.644\sqrt{0.002275}$$

= 0.35 (0.272, 0.428)

* True population probability = 0.295 (1,213,854 / 4,111,059)

The confidence interval is simply the range of values, in a frequency distribution of values from all possible samples of the same size, between which you might expect to find the true population value (parameter); i.e., the sample statistic predicts that the parameter is 0.35, but it is 90% probable that the true parameter is somewhere between 0.272 and 0.428, and 95% probable that the parameter is between 0.257 and 0.443.
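The calculation on this slide can be reproduced with a short Python sketch. The function name is hypothetical; "Wald interval" is the name commonly given to this normal-approximation formula, not a term used in the slides. The z-scores are the ones quoted on the slide.

```python
import math

def wald_interval(p_hat, n, z):
    """Normal-approximation confidence interval for a sample proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Sample from the slide: 35 unmarried mothers among 100 births.
low95, high95 = wald_interval(0.35, 100, 1.96)
low90, high90 = wald_interval(0.35, 100, 1.644)

print(f"95% CI: ({low95:.3f}, {high95:.3f})")
print(f"90% CI: ({low90:.3f}, {high90:.3f})")
```

Both intervals contain the true population probability of 0.295, although a different sample (for example the one with 41 unmarried mothers) would give a differently placed interval.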

23 These two graphs illustrate the previous calculations as well as the effect of sample size on the “accuracy” of using the sample statistic to predict the population parameter. From the previous formula, the z-score values (1.96 or 1.644) set the confidence limits between which we will look for our predicted population “value”. The term √((0.35 × 0.65) / 100) is the standard error of the sample probability (the square root of its variance); note that the sample n is part of the equation. The larger the n, the narrower the range (0.285 to 0.305 for n = 1000 versus 0.3 to 0.4 for n = 100 in the graphs) in predicting the population parameter.
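To see how n enters the calculation, the sketch below evaluates the square-root term for several sample sizes. It holds the sample probability at 0.35 throughout, which is an assumption made for illustration; the graphs on the slide may be based on different sample values, so the printed ranges are not meant to match them.

```python
import math

def standard_error(p_hat, n):
    """Square-root term from the confidence interval formula."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Same sample probability, increasing sample size.
for n in (100, 1000, 10_000):
    se = standard_error(0.35, n)
    print(f"n = {n:>6}: standard error = {se:.4f}, "
          f"95% half-width = {1.96 * se:.4f}")
```

Because n sits under a square root, a tenfold increase in sample size narrows the interval by a factor of about 3.2, not 10.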

24 With smaller sample sizes, or with highly variable data, or with p close to 0 or 1, the previous formula does not predict the range for the population parameter accurately, so this next formula (known as the Wilson score interval) is actually used a lot more:

$$\frac{(2 \times 100 \times 0.35) + 1.96^2 \pm 1.96\sqrt{1.96^2 + (4 \times 100 \times 0.35 \times 0.65)}}{2\,(100 + 1.96^2)}$$

= 0.35 (0.264, 0.447)

[ previous calculation = 0.35 (0.257, 0.443) ]

True population probability = 0.295

* You will notice that epidemiology publications routinely give the confidence intervals associated with each variable measured.

** Since computers do all the work nowadays, and they can calculate exact intervals based on the sampling distribution of P (the binomial distribution), we don’t have to bother with memorizing any of these formulas; we just need an idea of what the formulas are actually calculating.
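The formula on this slide translates directly into code. The following is a minimal sketch with a hypothetical function name, written term by term to mirror the slide rather than for efficiency.

```python
import math

def wilson_interval(p_hat, n, z):
    """Confidence interval for a proportion using the slide-24 formula."""
    centre = 2 * n * p_hat + z ** 2
    spread = z * math.sqrt(z ** 2 + 4 * n * p_hat * (1 - p_hat))
    denominator = 2 * (n + z ** 2)
    return (centre - spread) / denominator, (centre + spread) / denominator

# Same sample as before: 35 unmarried mothers among 100 births, 95% level.
low, high = wilson_interval(0.35, 100, 1.96)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

Unlike the simpler formula, this interval is not symmetric around 0.35: it is pulled slightly toward 0.5, which is what keeps it sensible when p is close to 0 or 1.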

