Medical Statistics
Tao Yuchun
9 Statistical Analysis of Enumeration Data
2. Statistical Inference for Enumeration Data
Sampling error of frequency
Example: Suppose the death rate of rats fed a certain poison is 0.2. What will happen when we run the experiment on n = 1, 2, 3 or 4 rats?
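The question above can be explored with a minimal sketch, assuming each rat dies independently with probability 0.2, so that the number of deaths among n rats is binomial. The function name `death_frequency_distribution` is illustrative and does not come from the slides.

```python
from math import comb

def death_frequency_distribution(n, pi=0.2):
    """Return {observed frequency x/n: probability} for n rats.

    Assumption: deaths are independent with probability pi (binomial model).
    """
    return {x / n: comb(n, x) * pi**x * (1 - pi)**(n - x) for x in range(n + 1)}

# List every possible observed death frequency and its probability.
for n in (1, 2, 3, 4):
    dist = death_frequency_distribution(n)
    print(n, {round(p, 2): round(prob, 4) for p, prob in dist.items()})
```

With so few rats the observed frequency can only take the values 0, 1/n, ..., 1, so it will usually differ from 0.2. That gap is the sampling error of the frequency.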
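In general, suppose the population proportion is $\pi$ and the sample size is $n$. The observed frequency $p$ is a random variable with standard error $\sigma_p = \sqrt{\pi(1-\pi)/n}$. When $\pi$ is unknown and $n$ is big enough, $\sigma_p$ is approximately equal to $S_p = \sqrt{p(1-p)/n}$.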
Example 9-1 HBV surface antigen: 200 people were tested, 7 were positive. Here $p = 7/200 = 0.035$ and $S_p = \sqrt{0.035 \times 0.965 / 200} = 0.0130$.
In theory, if the sample size $n$ is big enough and the observed frequency is $p$, then approximately
$\frac{p - \pi}{S_p} \sim N(0, 1)$, where $S_p = \sqrt{p(1-p)/n}$.
Confidence Interval of Probability
If the sample size $n$ is big enough and the observed frequency is $p$, then
95% confidence interval: $p \pm 1.96\,S_p$
99% confidence interval: $p \pm 2.58\,S_p$
Example 9-2 HBV surface antigen: 200 people were tested, 7 were positive. Calculate the confidence interval for $\pi$.
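Example 9-2 can be worked through with a short sketch, assuming the normal-approximation interval p ± z·S_p with S_p = sqrt(p(1 − p)/n). The helper name `proportion_ci` is hypothetical.

```python
from math import sqrt

def proportion_ci(x, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    x: number of positives, n: sample size, z: critical value
    (1.96 for 95%, 2.58 for 99%).
    """
    p = x / n
    s_p = sqrt(p * (1 - p) / n)  # estimated standard error of p
    return p - z * s_p, p + z * s_p

low95, high95 = proportion_ci(7, 200, z=1.96)
low99, high99 = proportion_ci(7, 200, z=2.58)
print(f"95% CI: {low95:.4f} to {high95:.4f}")
print(f"99% CI: {low99:.4f} to {high99:.4f}")
```

The 95% interval comes out at roughly 1% to 6%, and the 99% interval is wider, as expected.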
Distinguish between $\mu$ and $\pi$ for sampling error and confidence interval.
The hypothesis testing of proportion (Z test)
(1) Comparison of sample proportion and population proportion (one-sample Z test)
Example 9-3 Cerebral infarction

Method        Cases   Cure rate
New method    98      50%
Routine               30%

50% is the sample proportion, p = 50%. 30% is the population proportion, $\pi_0$ = 30%.
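Hypotheses and α: $H_0: \pi = \pi_0$, $H_1: \pi \neq \pi_0$, α = 0.05, where $\pi$ is the population cure rate of the new method.
Statistic Z: $Z = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} = \frac{0.5 - 0.3}{\sqrt{0.3 \times 0.7 / 98}} = 4.32$
Decision rule: if $|Z| \ge Z_\alpha$, then reject $H_0$; otherwise, there is no reason to reject $H_0$.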
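The statistic for Example 9-3 can be checked numerically. This is a minimal sketch, assuming the usual one-sample statistic Z = (p − π0) / sqrt(π0(1 − π0)/n) and the two-sided critical value 1.96; the function name `one_sample_z` is illustrative.

```python
from math import sqrt

def one_sample_z(p, pi0, n):
    """Z statistic comparing a sample proportion p with a known pi0."""
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

z = one_sample_z(p=0.50, pi0=0.30, n=98)
reject = abs(z) >= 1.96  # two-sided test at alpha = 0.05
print(round(z, 2), reject)
```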
$Z_\alpha$ is: two sides: $Z_{0.05} = 1.96$; one side: $Z_{0.05} = 1.645$.
Since $|Z| = 4.32 > Z_{0.05} = 1.96$, reject $H_0$. The new method is better than the routine one.
(2) Comparison of two sample proportions (two-sample Z test)
Example 9-4 Carrier rate of hepatitis
In B City: 522 people were tested, 24 carriers, $p_1$ = 4.60% (population carrier rate: $\pi_1$);
in Countryside: 478 people were tested, 33 carriers, $p_2$ = 6.90% (population carrier rate: $\pi_2$).
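$H_0: \pi_1 = \pi_2$, $H_1: \pi_1 \neq \pi_2$, α = 0.05
Statistic Z: $Z = \frac{p_1 - p_2}{S_{p_1 - p_2}}$, with $S_{p_1 - p_2} = \sqrt{p_c(1 - p_c)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$ and $p_c = \frac{X_1 + X_2}{n_1 + n_2}$
Here $p_c$ is the pooled estimate of the two sample proportions ($X_1$ and $X_2$ are the numbers of carriers), and $S_{p_1 - p_2}$ is the standard error of $p_1 - p_2$.
Decision rule: if $|Z| \ge Z_\alpha$, then reject $H_0$; otherwise, there is no reason to reject $H_0$.
Since $|Z| = 1.565 < Z_{0.05} = 1.96$, do not reject $H_0$. There is no evidence that the population carrier rates of B City and the Countryside differ ($\pi_1 = \pi_2$ is not rejected).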
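Example 9-4 can be reproduced with a short sketch, assuming the pooled two-sample statistic Z = (p1 − p2) / sqrt(p_c(1 − p_c)(1/n1 + 1/n2)) with p_c = (x1 + x2)/(n1 + n2). The function name `two_sample_z` is hypothetical.

```python
from math import sqrt

def two_sample_z(x1, n1, x2, n2):
    """Z statistic comparing two sample proportions with a pooled estimate."""
    p1, p2 = x1 / n1, x2 / n2
    p_c = (x1 + x2) / (n1 + n2)                      # pooled proportion
    s = sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))    # standard error of p1 - p2
    return (p1 - p2) / s

z = two_sample_z(24, 522, 33, 478)
print(round(abs(z), 2), abs(z) >= 1.96)
```

With unrounded proportions |Z| is about 1.57. The slide's 1.565 is consistent with rounding the intermediate values, and either way |Z| stays below 1.96.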
Summary
The parameter estimation and hypothesis testing of proportions are based on the normal approximation (when the sample size is big enough). How big is enough? By experience, $n\pi > 5$ and $n(1-\pi) > 5$. For a sample: $np > 5$ and $n(1-p) > 5$.
If the sample size is not big, the Z test can't be used, and there is no t-test for proportions (see a more detailed textbook).
9.4 Chi-square test
The Z test can only be used for comparing $\pi$ with a given $\pi_0$ (one sample) or comparing $\pi_1$ with $\pi_2$ (two samples). If we need to compare more than two samples, the chi-square test is widely used.
(1) Basic idea of the $\chi^2$ test
Given a set of actual frequencies $A_1, A_2, A_3, \dots$, we test whether the data follow a certain theory. If the theory is true, then we will have a set of theoretical frequencies $T_1, T_2, T_3, \dots$ Compare $A_1, A_2, A_3, \dots$ with $T_1, T_2, T_3, \dots$: if they are quite different, then the theory might not be true; otherwise, the theory is acceptable.
(2) Chi-square test for 2×2 table
Example 9-5 Acute lower respiratory infection (theoretical frequencies in parentheses)

Treatment   Effect          Non-effect     Total       Effect rate
Drug A      68 (64.82) a    6 (9.18) b     74 (a+b)    91.89%
Drug B      52 (55.18) c    11 (7.82) d    63 (c+d)    82.54%
Total       120 (a+c)       17 (b+d)       137 (n)     87.59%

$H_0: \pi_1 = \pi_2$, $H_1: \pi_1 \neq \pi_2$, α = 0.05
Here $\pi_1$ is the population effect rate for drug A and $\pi_2$ is the population effect rate for drug B.
To calculate the theoretical frequencies: if $H_0$ is true, $\pi_1 = \pi_2 \approx 120/137$, so
$T_{11} = 74 \times 120/137 = 64.82$, $T_{21} = 63 \times 120/137 = 55.18$,
$T_{12} = 74 \times 17/137 = 9.18$, $T_{22} = 63 \times 17/137 = 7.82$.
To compare A and T by a statistic $\chi^2$: $\chi^2 = \sum \frac{(A - T)^2}{T}$
The chi-square test was invented by Karl Pearson and is also called Pearson's chi-square test.
If $H_0$ is true, $\chi^2$ follows a chi-square distribution with $\nu$ = (row − 1)(column − 1) degrees of freedom.
If the $\chi^2$ value is big enough, we doubt $H_0$ and reject it!
For Example 9-5: $\nu$ = (row − 1)(column − 1) = (2 − 1)(2 − 1) = 1, $\chi^2_{\alpha(\nu)} = \chi^2_{0.05(1)} = 3.84$. Now $\chi^2 = 2.734 < 3.84$, so $P > 0.05$ and $H_0$ is not rejected. We have no reason to say the effects of the two treatments are different.
Question: What is $\chi^2_{\alpha(\nu)}$? Why does $\chi^2 < \chi^2_{0.05(1)}$ imply $P > 0.05$?
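The calculation for Example 9-5 can be sketched as follows, assuming Pearson's statistic (the sum of (A − T)²/T over the four cells) with T = row total × column total / n. The P value line relies on the fact that, for ν = 1 only, the upper tail area of the chi-square distribution equals erfc(sqrt(χ²/2)). The function name `chi_square_2x2` is illustrative.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            t = row_totals[i] * col_totals[j] / n   # theoretical frequency
            chi2 += (observed[i][j] - t) ** 2 / t
    return chi2

chi2 = chi_square_2x2(68, 6, 52, 11)
p_value = erfc(sqrt(chi2 / 2))  # upper tail area, valid for nu = 1 only
print(round(chi2, 3), round(p_value, 3))
```

Without rounding the theoretical frequencies the statistic is about 2.738 rather than 2.734. The area to the right of it is larger than 0.05, which is why a statistic below 3.84 gives P > 0.05.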
The chi-square distribution is a distribution for a continuous variable. It has one parameter, $\nu$ (degrees of freedom), which determines the shape of the $\chi^2$ curve. The area under the $\chi^2$ curve represents probability.
[Figure: the $\chi^2$ curves for different $\nu$ ($\nu$ = 3, 5, 10, 30)]
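The table for the $\chi^2$ distribution: the $\chi^2$ critical value is denoted $\chi^2_{\alpha(\nu)}$, where α is the probability and $\nu$ is the degrees of freedom. The area under the $\chi^2$ curve means [for $\chi^2_{0.05(1)}$]: the area to the right of $\chi^2_{0.05(1)} = 3.84$ is 0.05.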
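For the 2×2 table, there is a specific formula for the chi-square calculation:
$\chi^2 = \frac{(ad - bc)^2\, n}{(a+b)(c+d)(a+c)(b+d)}$
For Example 9-5: $\chi^2 = \frac{(68 \times 11 - 6 \times 52)^2 \times 137}{74 \times 63 \times 120 \times 17} = 2.738$, which agrees with the 2.734 obtained from the rounded theoretical frequencies up to rounding.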
The chi-square test requires a large sample: Pearson's chi-square test statistic follows the chi-square distribution only approximately. For a 2×2 table:
(1) If n ≥ 40 and every $T_i$ ≥ 5, the $\chi^2$ test is applicable;
(2) If n < 40 or any $T_i$ < 1, the $\chi^2$ test is not applicable; you should use Fisher's Exact Test;
(3) If n ≥ 40 and at least one $T_i$ satisfies 1 ≤ $T_i$ < 5, the $\chi^2$ test needs adjustment.
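The correction formula of the $\chi^2$ test for the 2×2 table:
$\chi^2 = \sum \frac{(|A - T| - 0.5)^2}{T} = \frac{(|ad - bc| - n/2)^2\, n}{(a+b)(c+d)(a+c)(b+d)}$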
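Example 9-6 Hematosepsis (theoretical frequencies in parentheses)

Treatment   Effective     No effect    Total   Effective rate (%)
Drug A      28 (26.09)    2 (3.91)     30      93.33
Drug B      12 (13.91)    4 (2.09)     16      75.00
Total       40            6            46      86.96

Here n = 46 > 40, but $T_{12} = 30 \times 6/46 = 3.91 < 5$ and $T_{22} = 16 \times 6/46 = 2.09 < 5$. You should use the correction formula of the $\chi^2$ test for the 2×2 table:
$\chi^2 = \frac{(|28 \times 4 - 2 \times 12| - 46/2)^2 \times 46}{30 \times 16 \times 40 \times 6} = 1.69 < \chi^2_{0.05(1)} = 3.84$, so $P > 0.05$ and $H_0$ is not rejected.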
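The corrected statistic for Example 9-6 can be checked with a minimal sketch, assuming the usual continuity-corrected (Yates) form (|ad − bc| − n/2)² n / ((a+b)(c+d)(a+c)(b+d)). The function name `yates_chi_square_2x2` is hypothetical.

```python
def yates_chi_square_2x2(a, b, c, d):
    """Continuity-corrected chi-square for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = (abs(a * d - b * c) - n / 2) ** 2 * n
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

chi2_corrected = yates_chi_square_2x2(28, 2, 12, 4)
print(round(chi2_corrected, 2), chi2_corrected < 3.84)
```

The correction shrinks the statistic, so it makes the test more conservative when some theoretical frequencies are small.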
(3) Chi-square test for R×C table
Example 9-7 Leukaemia
$H_0$: The distributions of blood types in the two populations are the same.
$H_1$: The distributions are not the same.
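The formula of the $\chi^2$ test statistic for an R×C table:
$\chi^2 = n\left(\sum \frac{A^2}{n_R\, n_C} - 1\right)$, where $n_R$ and $n_C$ are the row total and column total of the cell with actual frequency $A$.
For Example 9-7: $\nu = (R - 1)(C - 1) = (2 - 1)(4 - 1) = 3$. From the table, $\chi^2_{0.05(3)} = 7.81$; now $\chi^2 = 1.84 < 7.81$, so $P > 0.05$ and $H_0$ is not rejected. There is no evidence that the distributions of blood types differ between the two populations.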
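A general R×C calculation can be sketched as below, assuming the shortcut form n(ΣA²/(n_R n_C) − 1) with n_R and n_C the row and column totals. The data of Example 9-7 are not reproduced in this text, so the sketch is run on the 2×2 table of Example 9-5 instead, where it should agree with the 2×2 formula. The function name `chi_square_rxc` is illustrative.

```python
def chi_square_rxc(table):
    """Pearson chi-square and degrees of freedom for an R x C table.

    table: list of rows, each a list of actual frequencies.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = sum(
        a * a / (row_totals[i] * col_totals[j])
        for i, row in enumerate(table)
        for j, a in enumerate(row)
    )
    chi2 = n * (total - 1)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Example 9-5 data, used only as a check on the formula.
chi2_rxc, df = chi_square_rxc([[68, 6], [52, 11]])
print(round(chi2_rxc, 3), df)
```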
Question: Why does $\chi^2 = 1.84 < \chi^2_{0.05(3)} = 7.81$ imply $P > 0.05$? The answer is in this figure! The area under the curve to the right of 1.84 is larger than the area to the right of 7.81, which is 0.05.
(4) Caution for Chi-square test
(1) Both the 2×2 table and the R×C table are called contingency tables. The 2×2 table is a special case of the R×C table.
(2) When R > 2, "$H_0$ is rejected" only means that there is a difference among some groups. It does not necessarily mean that all the groups are different.
(3) The $\chi^2$ test requires a large sample. By experience:
The theoretical frequencies should be greater than 5 in more than 4/5 of the cells;
The theoretical frequency in any cell should be greater than 1.
Otherwise, we cannot use the chi-square test directly.
If the above requirements are violated, what should we do?
(1) Increase the sample size.
(2) Re-organize the categories: pool some categories, or cancel some categories.
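The two rules of thumb above can be checked mechanically. This is a minimal sketch that follows the wording of the slide literally (more than 4/5 of the cells with T > 5, and every cell with T > 1); the function names are hypothetical.

```python
def theoretical_frequencies(table):
    """Return the theoretical frequency of every cell as a flat list."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [r * c / n for r in row_totals for c in col_totals]

def chi_square_directly_applicable(table):
    """Check the large-sample rules of thumb for the chi-square test."""
    expected = theoretical_frequencies(table)
    cells_above_5 = sum(t > 5 for t in expected)
    # Compare counts as integers: more than 4/5 of the cells must have T > 5.
    enough_large_cells = 5 * cells_above_5 > 4 * len(expected)
    return enough_large_cells and min(expected) > 1

ok_9_5 = chi_square_directly_applicable([[68, 6], [52, 11]])  # Example 9-5
ok_9_6 = chi_square_directly_applicable([[28, 2], [12, 4]])   # Example 9-6
print(ok_9_5, ok_9_6)
```

For a 2×2 table such as Example 9-6, failing this check with n ≥ 40 points to the correction formula rather than to pooling categories.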
You should know: The chi-square test is a very important method of statistical inference for enumeration data!