Categorical data 1
Single proportion and comparison of 2 proportions
Dr. Seyed Ebrahim Jabarifar
Date: 1388 / 2010
Associate Professor, Isfahan University of Medical Sciences, Department of Community Dentistry
The objectives of the session
Sampling distribution of a single proportion
Calculation of a 95% confidence interval for a proportion
The comparison of two proportions (or percentages)
Statistical test of significance for the comparison of two proportions
Calculation of a 95% confidence interval for the difference in two proportions
Categorical data
What is categorical data? Examples?
Examples of categorical data
Education: primary, secondary, university
Marital status: married, single, divorced, widowed
Cigarette smoking history: never smoker, ex-smoker, current smoker
More examples of categorical data
Endpoint in a study:
Person is dead or alive
Person with MI or without MI
Person can rate their own health as very good, good, average, bad or very bad
More examples of categorical data
Quantitative measurements or assessments can be used as categorical data:
Hypertension: yes (for example systolic BP ≥ 160 or diastolic BP ≥ 90 mm Hg) or no
Alcohol consumption: none, light (< 200 ml of ethanol/week), heavy (≥ 200 ml of ethanol/week)
Proportions and percentages
In this session, we will concentrate on the use of binomial data (= data with just two categories).
Example: in a survey, interviews were conducted with 5335 middle-aged women. Of these, 1476 were current smokers while 3859 were not.
Proportion of smokers = 1476/5335 = 0.277
Percentage of smokers = 0.277 × 100 = 27.7%
Sampling variability of a proportion
It is important to take into account the number of subjects included.
The greater the number of subjects, the more reliable our estimates are.
Example: if we want to estimate the proportion of men in a population who smoke cigarettes, a study of 1000 men will be more trustworthy than a study of 10 men.
Important assumption
We need to know that the sample of individuals studied has been randomly selected from some population of interest.
Sampling distribution of a single proportion
Let's continue with the example of middle-aged women. Among 5335 women, there were 1476 smokers.
If we want to say something about the population which this study sample represents, we need the concept of a sampling distribution.
Let's assume that we repeatedly took samples of 5335 women.
For each sample, we calculate the proportion of smokers and then construct a histogram of these values.
This histogram represents the sampling distribution of the proportion; it takes a bell-shaped form.
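The repeated-sampling idea can be sketched in Python. This is an illustration only: the true proportion of 0.277, the 200 repeated surveys, the seed, and the function name `simulate_sample_proportions` are assumptions chosen for the example, not values from the study.

```python
import random
import statistics

def simulate_sample_proportions(true_pi, n, n_samples, seed=0):
    """Draw n_samples random samples of size n; return each sample's proportion."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_samples):
        # Each subject is a smoker with probability true_pi.
        smokers = sum(1 for _ in range(n) if rng.random() < true_pi)
        proportions.append(smokers / n)
    return proportions

# Assumed for illustration: true proportion 0.277, 200 repeated surveys.
props = simulate_sample_proportions(true_pi=0.277, n=5335, n_samples=200)
print("mean of sample proportions:", round(statistics.mean(props), 4))
print("SD of sample proportions:  ", round(statistics.stdev(props), 4))

# Crude text histogram of the sample proportions (bins of width 0.002).
bin_width = 0.002
counts = {}
for p in props:
    k = int(p / bin_width)  # index of the bin containing p
    counts[k] = counts.get(k, 0) + 1
for k in sorted(counts):
    print(f"{k * bin_width:.3f} {'#' * counts[k]}")
```

The printed histogram should be roughly symmetric around the assumed true proportion, with most sample proportions close to it and only a few far away.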
The curve is centred over the value of the proportion of smokers in the population, often referred to as the true proportion and represented by π.
Some of the sample proportions will be larger than π, others will be smaller. Many will be close in value to π; a few will be a lot larger or a lot smaller.
In practice we only conduct one survey, from which we have a sample proportion represented by P. Is P close to π, or is it very different from π?
Only if we are very lucky will P actually be equal to π.
In any random sample, there will be some sampling variation in P. The larger the sample, the smaller the extent of such sampling variation.
Consider $(P - \pi)^2$ as a measure of the variation in P from the true proportion π. Then it can be shown mathematically that, if you took lots of random samples each of n subjects, the average value of $(P - \pi)^2$ is equal to $\pi(1 - \pi)/n$.
Variance and standard error of a proportion
$\pi(1 - \pi)/n$ is the variance of a proportion.
$\sqrt{\pi(1 - \pi)/n}$ is the standard error (SE) of a proportion.
It is a measure of the average extent of error in P, that is, how far we can expect the observed proportion to differ from π on average.
Example
π = 0.4: N = 100, then SE = $\sqrt{0.4 \times 0.6/100}$ = 0.0490; N = 1000, then SE = $\sqrt{0.4 \times 0.6/1000}$ = 0.0155 (SE smaller).
SE does not depend much on π: N = 1000, π = 0.5: SE = 0.0158.
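The standard error formula $\sqrt{\pi(1 - \pi)/n}$ can be checked with a few lines of Python. The function name `se_proportion` is a hypothetical helper introduced here for illustration.

```python
import math

def se_proportion(pi, n):
    """Standard error of a sample proportion: sqrt(pi * (1 - pi) / n)."""
    return math.sqrt(pi * (1 - pi) / n)

# The three cases from the example: a larger n shrinks the SE,
# while moving pi from 0.4 to 0.5 changes it only slightly.
for pi, n in [(0.4, 100), (0.4, 1000), (0.5, 1000)]:
    print(f"pi = {pi}, n = {n}: SE = {se_proportion(pi, n):.4f}")
```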
Back to the example: 5335 women, 1476 current smokers
It means that 27.7% of women are smokers.
The estimated standard error of the proportion of smokers is $\sqrt{0.277 \times 0.723/5335}$ = 0.0061.
We can also use percentages: $\sqrt{27.7 \times 72.3/5335}$ = 0.61%.
95% confidence limits for a proportion
We want to get an interval of possible values within which the true population proportion might lie.
This can be done using the theoretical properties of the Normal distribution.
It can be shown that P will be within 1.96 standard errors of π with probability 0.95.
That is, there is just a 2.5% risk that the observed proportion P will exceed the true population proportion π by more than 1.96 standard errors, and another 2.5% risk that P will underestimate π by more than 1.96 standard errors.
95% confidence limits for a proportion
We use this fact to define a 95% confidence interval: from $P - 1.96 \times \sqrt{P(1 - P)/n}$ to $P + 1.96 \times \sqrt{P(1 - P)/n}$.
Usually written as P ± 1.96 × standard error of P.
Back to the example
The true population percentage of smokers has the following 95% confidence interval: 27.7% ± 1.96 × 0.61% = 27.7% ± 1.2%.
This means that the 95% confidence interval is from 26.5% to 28.9%.
These two values are the lower and upper confidence limits, respectively.
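The interval for the smoking example can be reproduced as follows. This is a minimal sketch of the Normal-approximation interval P ± 1.96 × SE; the helper name `proportion_ci` is hypothetical.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Return (P, SE, lower, upper) using the Normal approximation P +/- z * SE."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, p - z * se, p + z * se

# Survey data from the example: 1476 current smokers among 5335 women.
p, se, lower, upper = proportion_ci(1476, 5335)
print(f"P = {100 * p:.1f}%, SE = {100 * se:.2f}%")
print(f"95% CI: {100 * lower:.1f}% to {100 * upper:.1f}%")
```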
95% confidence interval
95% confidence intervals are the most common statistical technique for displaying the degree of uncertainty that should be attached to any proportion.
There is a 5% risk that the true population proportion lies outside the interval.
That is, you can anticipate that one in every 20 confidence intervals you calculate will not include the true population proportion.
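The "one in every 20" statement can be illustrated by simulation. In this sketch the true proportion (0.277), the sample size (500), the number of simulated surveys (2000) and the seed are all assumptions chosen so that the example runs quickly; `covers` is a hypothetical helper.

```python
import math
import random

def covers(true_pi, n, rng, z=1.96):
    """Simulate one survey of size n; report whether its 95% CI contains true_pi."""
    p = sum(1 for _ in range(n) if rng.random() < true_pi) / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se <= true_pi <= p + z * se

rng = random.Random(1)  # fixed seed so the run is repeatable
n_surveys = 2000
hits = sum(covers(0.277, 500, rng) for _ in range(n_surveys))
coverage = hits / n_surveys
print(f"Share of intervals that include the true proportion: {coverage:.3f}")
```

The printed share should be close to 0.95, so roughly one interval in 20 misses the true proportion.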
Two proportions
Example
Smoking   Men            Women          Total
Yes       566 (45.1%)    313 (23.8%)    879 (34.2%)
No        690            1001           1691
Total     1256           1314           2570
Question
From the table, we want to evaluate how strong the evidence is that men smoke more than women.
The null hypothesis
We need to define the null hypothesis.
In our case, the null hypothesis is that smoking is as frequent among women as it is among men (same proportion of smokers among men and women).
If the null hypothesis were true, then the whole population would have an identical percentage of smokers.
Alternatively, one can say that, if the null hypothesis were true, then for any randomly selected person (man or woman) the probability of being a smoker is the same, independent of the sex of the person selected.
Significance testing for comparing 2 proportions
After defining the null hypothesis, the main question is: if the null hypothesis is true, what are the chances of getting a difference between the two percentages as large as that observed?
For example, in the Czech study, what is the probability of getting a sex difference in smoking as large as (or larger than) 45% versus 24%?
Observed difference in percentages = $P_1 - P_2$ = 45.1% - 23.8% = 21.3%
The overall percentage response = P = 879/2570 × 100 = 34.2%
If the null hypothesis is true, then the only reason that $P_1 - P_2$ differs from 0 is sampling variation.
Under the null hypothesis we are assuming that the two samples of size $n_1$ = 1256 and $n_2$ = 1314 are random samples of people with equal true probabilities of response.
We need to calculate the standard error of the difference in two percentages, where P = 34.2% is the overall percentage:
$\sqrt{P(100 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{34.2 \times 65.8 \times \left(\frac{1}{1256} + \frac{1}{1314}\right)}$ = 1.9%
Now, we compare the observed difference with the standard error of the difference, simply by dividing one by the other. Thus, we compute
Z = (observed difference in percentages) / (standard error of difference) = 21.3/1.9 = 11.2
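The test statistic can be computed directly from the counts. This is a minimal sketch of the pooled two-proportion Z calculation described above; `two_proportion_z` is a hypothetical helper name. Because it works with unrounded values, it gives Z of about 11.3 rather than the 11.2 obtained from the rounded figures 21.3 and 1.9.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Return (difference, pooled SE, Z) for comparing two proportions.

    Under the null hypothesis both groups share one true proportion,
    so the SE uses the overall (pooled) proportion.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return p1 - p2, se, (p1 - p2) / se

# Men: 566 smokers of 1256; women: 313 smokers of 1314.
diff, se, z = two_proportion_z(566, 1256, 313, 1314)
print(f"difference = {100 * diff:.1f}%, SE = {100 * se:.1f}%, Z = {z:.1f}")
```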
How large does Z have to be in order for us to assert that we have strong evidence that the null hypothesis is untrue?
We need to make use of the fact that the difference between two observed proportions has approximately a Normal distribution, since this enables us to convert any value of Z into a probability P (as we have already learnt in previous sessions).
Z exceeds 0.674 (in either direction) with probability 0.5
In our example, Z = 11.2, and so the probability P is (substantially) less than 0.001.
That means that, if the proportion of smokers is the same among men and women, the chance of getting such a big percentage difference in our study is less than 1 in 1000.
We therefore have strong evidence that the proportion of smokers in men and women in the defined population is different (and is lower in women).
We may also say the difference between the percentages is statistically significant at the 0.1% level.
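The conversion from Z to a two-sided probability can be sketched with the standard Normal distribution, here via `math.erfc` from the Python standard library. The helper name `two_sided_p` is hypothetical, and the two-sided convention is an assumption consistent with the 2.5% tails used for the confidence limits.

```python
import math

def two_sided_p(z):
    """Probability that a standard Normal value exceeds |z| in either direction."""
    return math.erfc(abs(z) / math.sqrt(2))

for z in (0.674, 1.96, 11.2):
    print(f"Z = {z}: P = {two_sided_p(z):.3g}")
```

For Z = 0.674 the probability is about 0.5, for Z = 1.96 it is about 0.05, and for Z = 11.2 it is far below 0.001.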
Exercise: we want to know whether smoking depends on marital status
Smoking   Married        Unmarried      Total
Yes       732 (34.1%)    147 (34.9%)    879 (34.2%)
No
Total
The observed difference in percentages is
The standard error of the difference (using the formula given above) is
Z =
P =
95% confidence interval for a difference in two percentages
While giving the actual P-value is useful, we also need to give attention to estimating the magnitude of the difference and to express the uncertainty in such an estimate by using a confidence interval.
The 95% confidence interval for the difference between two percentages is
Observed difference ± 1.96 × standard error of difference
In the calculation of the confidence interval, the formula for the standard error of the difference does not assume the null hypothesis of the two proportions being equal. A slightly different formula is used for the standard error, with $P_1$ and $P_2$ expressed as percentages:
SE (difference in percentages) = $\sqrt{\frac{P_1(100 - P_1)}{n_1} + \frac{P_2(100 - P_2)}{n_2}}$
In our study, the 95% confidence interval for the difference in smoking between men and women is
21.3% ± 1.96 × 1.83% = 17.7% to 24.9%
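The interval for the difference can be sketched as follows, using the standard error that does not assume equal proportions. The helper name `difference_ci` is hypothetical. Working from the raw counts rather than rounded percentages can shift the last digit of each limit slightly.

```python
import math

def difference_ci(x1, n1, x2, n2, z=1.96):
    """Return (difference, SE, lower, upper) for a difference in two proportions.

    The SE adds the two groups' variances and does not pool them,
    because no null hypothesis of equality is assumed here.
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, se, diff - z * se, diff + z * se

# Men: 566 smokers of 1256; women: 313 smokers of 1314.
diff, se, lower, upper = difference_ci(566, 1256, 313, 1314)
print(f"difference = {100 * diff:.1f}%, SE = {100 * se:.2f}%")
print(f"95% CI: {100 * lower:.1f}% to {100 * upper:.1f}%")
```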
Exercise
Calculate the 95% confidence interval for the difference in the percentage of smokers among married and unmarried individuals.
SE = 2.54%
CI = 0.8% ± 1.96 × 2.54% = -4.2% to 5.8%
Note that if such a 95% confidence interval for a difference includes the value 0.0 (i.e. one limit is positive and the other is negative), then P is greater than 0.05.
Conversely, if the 95% confidence interval does not include 0.0, then P is less than 0.05.
This illustrates that there is a close link between significance testing and confidence intervals.
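The link between the interval and the test can be illustrated with a short sketch. It assumes, for simplicity, that the same standard error is used for both the interval and Z, which makes the correspondence exact; in the session the test uses the pooled standard error, so there the link holds approximately. The helper name `ci_and_p` is hypothetical.

```python
import math

def ci_and_p(diff, se, z_crit=1.96):
    """Return (lower, upper, two-sided P) for a difference with a given SE."""
    lower, upper = diff - z_crit * se, diff + z_crit * se
    p_value = math.erfc(abs(diff / se) / math.sqrt(2))
    return lower, upper, p_value

# Marital status exercise: difference 0.8%, SE 2.54%.
lo_m, hi_m, p_m = ci_and_p(0.8, 2.54)
print(f"marital status: CI {lo_m:.1f}% to {hi_m:.1f}%, P = {p_m:.2f}")

# Men versus women: difference 21.3%, SE 1.83%.
lo_s, hi_s, p_s = ci_and_p(21.3, 1.83)
print(f"sex: CI {lo_s:.1f}% to {hi_s:.1f}%, P < 0.001: {p_s < 0.001}")
```

The first interval includes 0 and its P is above 0.05; the second excludes 0 and its P is below 0.05.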