Published by Hugo James. Modified over 9 years ago.
1
Xuhua Xia
Smoking and Lung Cancer
This chest radiograph demonstrates a large squamous cell carcinoma of the right upper lobe. A portion of the tumor demonstrates central cavitation, probably because the tumor outgrew its blood supply. Squamous cell carcinomas are one of the more common primary malignancies of the lung and are most often seen in smokers.
2
Smoking and Lung Cancer
The number of smokers and non-smokers sampled from the population

                  Smoker   Non-smoker
Lung Cancer          105            3
No Lung Cancer     99895        99996
Sub-total         100000       100000
3
Sickness and Medication
Association between being sick and taking medicine:

             Taking medicine   Not taking medicine
Sick                     990                   111
Healthy                   10                   889
Sub-total               1000                  1000

Biological and statistical questions
“Taking medicine” is strongly associated with “Sick”. Can we say that “Sick” is caused by “Taking medicine”?
4
Simpson’s paradox

                 Treatment A      Treatment B
Kidney stones    78% (273/350)    83% (289/350)
Small Stones     93% (81/87)      87% (234/270)
Large Stones     73% (192/263)    69% (55/80)

C. R. Charig et al. 1986. Br Med J (Clin Res Ed) 292: 879–882
Treatment A: all open procedures
Treatment B: percutaneous nephrolithotomy
Question: which treatment is better?
The conclusion changes when a new dimension (stone size) is added.
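The reversal in the table can be checked directly in R. The sketch below is a minimal illustration, not part of the original slides: it types in the counts from the table (the data frame and its column names are made up for this example) and compares success rates within each stone size against the pooled rates.

```r
# Counts copied from the Simpson's paradox table (Charig et al. 1986).
# The object and column names are hypothetical, chosen for this sketch.
stones <- data.frame(
  Size    = c("Small", "Small", "Large", "Large"),
  Treat   = c("A", "B", "A", "B"),
  Success = c(81, 234, 192, 55),
  Total   = c(87, 270, 263, 80)
)

# Success rate within each stone size: Treatment A is higher in both strata
stones$Rate <- stones$Success / stones$Total
print(stones)

# Pool over stone size: Treatment B now has the higher rate
pooled <- aggregate(cbind(Success, Total) ~ Treat, data = stones, FUN = sum)
pooled$Rate <- pooled$Success / pooled$Total
print(pooled)
```

The pooled comparison favors B only because B was used mostly on small stones, which have a high success rate under either treatment.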
5
What is a Contingency Table?
A contingency table is a table of counts cross-classified according to categorical variables.
A contingency table with r rows and c columns is referred to as an r x c contingency table.
The simplest contingency table is a 2 x 2 table.
The most typical null hypothesis: the row variable and the column variable are independent, i.e., the distribution of counts across columns is the same in every row.
6
Goodness-of-fit test
Deviation of sex ratio from 1:1
Deviation from the 3:1 ratio in offspring of Aa x Aa matings
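Both examples can be run with chisq.test() by supplying the expected proportions through the p argument. The counts below are hypothetical, invented only to make the sketch runnable; they do not come from the slides.

```r
# Hypothetical counts (not from the slides): 58 males and 42 females
sex_counts <- c(Male = 58, Female = 42)
gof_sex <- chisq.test(sex_counts, p = c(0.5, 0.5))   # H0: sex ratio is 1:1
print(gof_sex)

# Hypothetical counts: 290 dominant and 110 recessive offspring of Aa x Aa
pheno_counts <- c(Dominant = 290, Recessive = 110)
gof_31 <- chisq.test(pheno_counts, p = c(3, 1) / 4)  # H0: ratio is 3:1
print(gof_31)
```

With k categories and fully specified expected proportions, the test has k - 1 degrees of freedom, so both tests here have df = 1.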
7
Contingency Tables and χ²-Test
The chi-square test is based on the χ² distribution.
The chi-square test is typically used in tests for goodness of fit, i.e., how well the observed values fit the expected values.
Chi-square test and Yates' correction for continuity.
8
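What is a Contingency Table?
General layout (each n_ij is a cell; R_i and C_j are the marginal totals; n is the total):

                            Success
Treat                       Yes    No     Marginal totals (Row totals)
A                           n11    n12    R1
B                           n21    n22    R2
Marginal totals
(Column totals)             C1     C2     n

Observed counts:

          Success
Treat     Yes    No    Total
A         273    77    350
B         289    61    350
Total     562   138    700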
9
Contingency Table
The null hypothesis: Success is independent of Treatment (i.e., the success rate is the same for both treatments).
The null hypothesis can be tested with the chi-square test of goodness-of-fit.
Expected frequencies: E_ij = R_i × C_j / n (the test should be done on counts, not on proportions).
Degrees of freedom: df = (r-1)(c-1)
X² value: 0 if the data are perfectly consistent with the null hypothesis.
p: the probability of obtaining an X² value at least as large as the observed one given that the null hypothesis is true, i.e., P(X² ≥ observed X² | H0).

          Success
Treat     Yes    No    Total
A         273    77    350
B         289    61    350
Total     562   138    700
10
X²-test of a Contingency Table
Do hand-calculation of X².
What is the df associated with the test? df = (r-1)(c-1)

          Success
Treat     Yes    No    Total
A         273    77    350
B         289    61    350
Total     562   138    700

For the χ²-test, each E_ij should be equal to or greater than 5.
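The hand calculation can be mirrored step by step in R. This is a minimal sketch, assuming the usual formulas E_ij = R_i × C_j / n and X² = Σ (O - E)² / E, with Yates' correction subtracting 0.5 from each |O - E|; the variable names are chosen for this example only.

```r
# Observed counts from the Treat x Success table
obs <- matrix(c(273, 77, 289, 61), nrow = 2, byrow = TRUE,
              dimnames = list(Treat = c("A", "B"), Success = c("Yes", "No")))

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson X^2 without and with Yates' continuity correction
x2       <- sum((obs - expected)^2 / expected)
x2_yates <- sum((abs(obs - expected) - 0.5)^2 / expected)

# Degrees of freedom and upper-tail p values
df      <- (nrow(obs) - 1) * (ncol(obs) - 1)
p       <- pchisq(x2, df, lower.tail = FALSE)
p_yates <- pchisq(x2_yates, df, lower.tail = FALSE)

print(expected)
print(c(X2 = x2, X2_Yates = x2_yates, df = df, p = p, p_Yates = p_yates))
print(all(expected >= 5))  # checks the E_ij >= 5 rule of thumb
```

The Yates-corrected value should agree with what chisq.test(obs, correct = TRUE) reports.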
11
Formulas for different statistics
Statistics for significance tests
Measures of association:
Phi coefficient
Contingency coefficient
Cramer's V
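The formulas themselves did not survive in this transcript. As a sketch, the three measures of association can be computed from the uncorrected Pearson X², assuming the common textbook definitions Phi = sqrt(X²/n), C = sqrt(X²/(X² + n)) and V = sqrt(X²/(n × min(r-1, c-1))). The counts are the Treat x Success table used on the neighboring slides.

```r
# Treat x Success counts (rows A, B; columns Yes, No)
obs <- matrix(c(273, 77, 289, 61), nrow = 2, byrow = TRUE)
n   <- sum(obs)

# Pearson X^2 without continuity correction
x2 <- unname(chisq.test(obs, correct = FALSE)$statistic)

# Measures of association (textbook definitions, assumed here)
phi       <- sqrt(x2 / n)
cont_coef <- sqrt(x2 / (x2 + n))
cramers_v <- sqrt(x2 / (n * min(dim(obs) - 1)))

print(round(c(Phi = phi, C = cont_coef, V = cramers_v), 3))
```

For a 2 x 2 table min(r-1, c-1) = 1, so Cramer's V equals Phi.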
12
Chi-square Distribution
(Figure: χ² distribution curves for ν = 2, ν = 4 and ν = 8)
The χ² distribution is a special case of the gamma distribution with α = ν/2 and β = 2. It has a mean of ν and a variance of 2ν.
In EXCEL, p = chidist(x,DF) = 1-gammadist(x,DF/2,2,true)
The p value in chi-square test:
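The same identity holds in R, where pchisq() and pgamma() play the roles of the EXCEL functions. A small sketch, using the X² value and df from the kidney stone example as inputs:

```r
x  <- 2.0308   # X^2 value from the kidney stone example
df <- 1

# Upper-tail probability from the chi-square distribution
p_chisq <- pchisq(x, df, lower.tail = FALSE)

# Same probability from the gamma distribution with shape = df/2, scale = 2
p_gamma <- pgamma(x, shape = df / 2, scale = 2, lower.tail = FALSE)

print(c(p_chisq = p_chisq, p_gamma = p_gamma))
```

Using lower.tail = FALSE avoids the loss of precision that 1 - pchisq(x, df) can suffer when the p value is very small.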
13
R functions
md<-read.table("Kidneystonelumped.txt",header=T)
  Treat Success Freq
1     A     Yes  273
2     A      No   77
3     B     Yes  289
4     B      No   61
tab1<-xtabs(Freq~.,md)
# '.' == all other variables in the data frame, equivalent in this case to xtabs(Freq~Treat+Success)
     Success
Treat  No Yes
    A  77 273
    B  61 289
chisq.test(tab1,correct=T)
        Pearson's Chi-squared test with Yates' continuity correction
data:  tab1
X-squared = 2.0308, df = 1, p-value = 0.1541
library(MASS)
loglm(~Treat+Success,tab1) # Get likelihood ratio chi-square, i.e., G²
install.packages("vcd")
library(vcd)
assocstats(tab1) # get Phi, C, and V
Phi-Coefficient   : 0.057
Contingency Coeff.: 0.057
Cramer's V        : 0.057
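For readers without the input file, the sketch below builds the same data frame inline and obtains the likelihood ratio chi-square (G²) with loglin() from base R instead of MASS::loglm(). This substitution is a choice made for this example so that it runs without extra packages; the manual formula G² = 2 Σ O ln(O/E) is included as a cross-check.

```r
# Same four rows as the file shown on the slide, typed in directly
md <- data.frame(
  Treat   = c("A", "A", "B", "B"),
  Success = c("Yes", "No", "Yes", "No"),
  Freq    = c(273, 77, 289, 61)
)
tab1 <- xtabs(Freq ~ Treat + Success, data = md)

# Independence model: fit only the two one-way margins
fit <- loglin(tab1, margin = list(1, 2), fit = TRUE, print = FALSE)

# G^2 computed by hand from observed and fitted counts
g2_manual <- 2 * sum(tab1 * log(tab1 / fit$fit))

p_g2 <- pchisq(fit$lrt, fit$df, lower.tail = FALSE)
print(c(G2 = fit$lrt, G2_manual = g2_manual, df = fit$df, p = p_g2))
```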
14
Assignment: Sex and Hair Color

GENDER     COLOR
         | Black  | Blond  | Brown  | Red    |  Total
---------+--------+--------+--------+--------+
Female   |     55 |     64 |     65 |     16 |    200
---------+--------+--------+--------+--------+
Male     |     32 |     16 |     43 |      9 |    100
---------+--------+--------+--------+--------+
Total          87       80      108       25     300

Assignment: Arrange the data and save to a text file, read it into R, format it into an R table using xtabs, and use Pearson chi-square (with Yates' continuity correction) and likelihood ratio chi-square to test the null hypothesis of independence. Compute Phi, contingency coefficient and Cramer's V. (Print the slide, fill in the values and conclude.)

Chi-square:                     DF:      p:
Likelihood ratio chi-square:    DF:      p:
Phi:      C:      V:
15
Why are there more blonde females?
An evolutionary explanation
A genetic explanation
A simple chemical explanation
The limitation of statistics
16
Log-linear model
Preferred statistical tool for analyzing multi-way contingency tables
Use the likelihood ratio test to choose the best model
Main effects and interactions can be interpreted in a similar manner as in ANOVA
17
Loglinear model
md<-read.table("KidneyStone.txt",header=T)
attach(md)
tab1<-xtabs(Freq~.,md)
fit<-loglin(tab1,list(c(1,2),c(1,3),c(2,3)))
1-pchisq(fit$lrt,fit$df) # if p>α: fine without 3-way interaction.
0.3153416
fit<-loglin(tab1,list(c(1,3),c(2,3))) # reduced model
1-pchisq(fit$lrt,fit$df) # if p>α: fine without 1*2.
0.1781455
fit<-loglin(tab1,list(c(1,2),c(2,3))) # reduced model
1-pchisq(fit$lrt,fit$df) # if p>α: fine without 1*3.
0
fit<-loglin(tab1,list(c(1,2),c(1,3))) # reduced model
1-pchisq(fit$lrt,fit$df) # if p>α: fine without 2*3.
2.041195e-07
The full model with list(c(1,2,3)) would fit the data perfectly.
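The layout of KidneyStone.txt is not shown, so the sketch below rebuilds a three-way table from the counts in the Simpson's paradox slide. The column names and their order (Size, Treat, Success) are assumptions made for this example, which means the indices 1, 2, 3 here need not correspond to those in the slide's file.

```r
# Counts derived from the Simpson's paradox table (successes and failures).
# Column names and order are assumed for this sketch.
ks <- data.frame(
  Size    = rep(c("Small", "Large"), each = 4),
  Treat   = rep(rep(c("A", "B"), each = 2), times = 2),
  Success = rep(c("Yes", "No"), times = 4),
  Freq    = c(81, 6, 234, 36, 192, 71, 55, 25)
)
tab3 <- xtabs(Freq ~ Size + Treat + Success, data = ks)

# Model with all 2-way interactions but no 3-way interaction
no3 <- loglin(tab3, list(c(1, 2), c(1, 3), c(2, 3)), print = FALSE)
p_no3 <- pchisq(no3$lrt, no3$df, lower.tail = FALSE)

# Saturated model: reproduces the table exactly, leaving 0 df
sat <- loglin(tab3, list(c(1, 2, 3)), print = FALSE)

print(c(G2 = no3$lrt, df = no3$df, p = p_no3))
print(c(G2_saturated = sat$lrt, df_saturated = sat$df))
```

A large p for the first model says the three-way interaction can be dropped; the saturated model always has G² = 0, which is why it is not informative on its own.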
18
Survival data of the Titanic

Class  Sex     Age    Survived  Freq
1st    Male    Child  No           0
2nd    Male    Child  No           0
3rd    Male    Child  No          35
Crew   Male    Child  No           0
1st    Female  Child  No           0
2nd    Female  Child  No           0
3rd    Female  Child  No          17
Crew   Female  Child  No           0
1st    Male    Adult  No         118
2nd    Male    Adult  No         154
3rd    Male    Adult  No         387
Crew   Male    Adult  No         670
1st    Female  Adult  No           4
2nd    Female  Adult  No          13
3rd    Female  Adult  No          89
Crew   Female  Adult  No           3
1st    Male    Child  Yes          5
2nd    Male    Child  Yes         11
3rd    Male    Child  Yes         13
Crew   Male    Child  Yes          0
1st    Female  Child  Yes          1
2nd    Female  Child  Yes         13
3rd    Female  Child  Yes         14
Crew   Female  Child  Yes          0
1st    Male    Adult  Yes         57
2nd    Male    Adult  Yes         14
3rd    Male    Adult  Yes         75
Crew   Male    Adult  Yes        192
1st    Female  Adult  Yes        140
2nd    Female  Adult  Yes         80
3rd    Female  Adult  Yes         76
Crew   Female  Adult  Yes         20
19
Effects

EFFECTS                         WHAT THEY MEAN
Class                           Main effects, generally of little interest
Sex
Age
Survived
Class × Sex                     Six 2-way interaction terms, e.g., a higher
Class × Age                     proportion of males in children than in adults
Class × Survived
Sex × Age
Sex × Survived
Age × Survived
Class × Sex × Age               Four 3-way interaction terms
Class × Sex × Survived
Class × Age × Survived
Sex × Age × Survived
Class × Sex × Age × Survived    Four-way interaction
20
R functions
md<-read.table("Titanic.txt",header=T)
attach(md)
head(md)
tab1<-xtabs(Freq~.,md)
margin.table(tab1)
margin.table(tab1, c(2,4))
summary(tab1) # to get a chance to see if we have E_ij < 5.
Call: xtabs(formula = Freq ~ ., data = md)
Number of cases in table: 2201
Number of factors: 4
Test for independence of all factors:
        Chisq = 1637.4, df = 25, p-value = 0
        Chi-squared approximation may be incorrect
(The line "Chi-squared approximation may be incorrect" means that we have E_ij < 5.)
library(MASS)
loglm(~ Class * Sex * Age * Survived - Class:Sex:Age:Survived, data=tab1)
                          X^2 df  P(> X^2)
Likelihood Ratio 0.0002728865  3 0.9999988
loglm(~ Class * Sex * Age * Survived - Class:Sex:Age:Survived - Sex:Age:Survived, data=tab1)
                      X^2 df  P(> X^2)
Likelihood Ratio 1.685479  4 0.7933536
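The warning about small expected counts can be inspected directly. The sketch below is an illustration that swaps the text file for R's built-in Titanic table (datasets::Titanic), which holds the same counts with the factors in the same order (Class, Sex, Age, Survived); it fits the complete-independence model with loglin() and counts the cells whose expected value is below 5.

```r
tab <- datasets::Titanic   # built-in 4-way table; stands in for Titanic.txt

# Sex x Survived marginal table (dimensions 2 and 4)
sex_by_survival <- margin.table(tab, c(2, 4))
print(sex_by_survival)

# Complete independence: fit only the four one-way margins
indep <- loglin(tab, margin = list(1, 2, 3, 4), fit = TRUE, print = FALSE)

# Number of cells with expected count below 5
n_small <- sum(indep$fit < 5)

print(c(Pearson_X2 = indep$pearson, df = indep$df, cells_E_below_5 = n_small))
```

The Pearson statistic and df of this model correspond to the "Test for independence of all factors" reported by summary().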
21
step function
full.model = loglm(~ Class * Sex * Age * Survived, data=tab1)
step(full.model, direction="backward")
Start:  AIC=64
~Class * Sex * Age * Survived
                         Df AIC
- Class:Sex:Age:Survived  3  58
<none>                       64
(The only 4-way interaction can be removed, which reduces AIC from 64 to 58.)
Step:  AIC=58
~Class + Sex + Age + Survived + Class:Sex + Class:Age + Sex:Age + Class:Survived + Sex:Survived + Age:Survived + Class:Sex:Age + Class:Sex:Survived + Class:Age:Survived + Sex:Age:Survived
                     Df     AIC
- Sex:Age:Survived    1  57.685
<none>                   58.000
- Class:Sex:Age       3  61.783
- Class:Age:Survived  3  89.263
- Class:Sex:Survived  3 117.013
(Only one of the four 3-way interactions can be removed, which reduces AIC from 58 to 57.685. The terms below <none>, if removed, will make things worse.)
The next step shows no more model reduction, and yields a test:
Statistics:
                      X^2 df  P(> X^2)
Likelihood Ratio 1.685479  4 0.7933536
22
Get prediction (fitted value)
fit<-loglm(~Class + Sex + Age + Survived + Class:Sex + Class:Age +
   Sex:Age + Class:Survived + Sex:Survived + Age:Survived +
   Class:Sex:Age + Class:Sex:Survived + Class:Age:Survived,tab1)
fitted(fit)
Re-fitting to get fitted values
, , Age = Adult, Survived = No
      Sex
Class   Female     Male
  1st   4.0000 118.0000
  2nd  13.0000 154.0000
  3rd  91.4328 384.5672
  Crew  3.0000 670.0000
, , Age = Child, Survived = No
      Sex
Class    Female     Male
  1st   0.00000  0.00000
  2nd   0.00000  0.00000
  3rd  14.56719 37.43281
  Crew  0.00000  0.00000
, , Age = Adult, Survived = Yes
      Sex
Class     Female      Male
  1st  140.00000  57.00000
  2nd   79.97709  14.02291
  3rd   73.56719  77.43281
  Crew  20.00000 192.00000
, , Age = Child, Survived = Yes
      Sex
Class    Female     Male
  1st   1.00000  5.00000
  2nd  13.01507 10.98493
  3rd  16.43282 10.56718
  Crew  0.00000  0.00000
23
Use loglin
fit<-loglin(tab1,list(c(1,2,3),c(1,2,4),c(1,3,4),c(2,3,4)))
1-pchisq(fit$lrt,fit$df) # if p>α: fine without 4-way interaction.
0.9999988
fit<-loglin(tab1,list(c(1,2,4),c(1,3,4),c(2,3,4)))
1-pchisq(fit$lrt,fit$df) # if p>α: fine without 1*2*3.
fit<-loglin(tab1,list(c(1,2,3),c(1,3,4),c(2,3,4)))
1-pchisq(fit$lrt,fit$df) # if p>α: fine without 1*2*4.
fit<-loglin(tab1,list(c(1,2,3),c(1,2,4),c(2,3,4)))
1-pchisq(fit$lrt,fit$df) # if p>α: fine without 1*3*4.
fit<-loglin(tab1,list(c(1,2,3),c(1,2,4),c(1,3,4)))
1-pchisq(fit$lrt,fit$df) # if p>α: fine without 2*3*4.
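The four "drop one 3-way term" fits follow the same pattern, so they can be written as a loop. This is a sketch only: it uses the built-in datasets::Titanic table in place of tab1 (same counts, same factor order), and the helper names are invented for this example.

```r
tab <- datasets::Titanic   # dimensions: 1 = Class, 2 = Sex, 3 = Age, 4 = Survived

# The four 3-way margins; each fit leaves one of them out
three_way <- list(c(1, 2, 3), c(1, 2, 4), c(1, 3, 4), c(2, 3, 4))

p_drop <- sapply(seq_along(three_way), function(i) {
  fit <- loglin(tab, three_way[-i], print = FALSE)
  pchisq(fit$lrt, fit$df, lower.tail = FALSE)  # goodness of fit of reduced model
})

# Label each p value with the term that was dropped
names(p_drop) <- sapply(three_way, function(m) {
  paste(names(dimnames(tab))[m], collapse = ":")
})
print(p_drop)
```

Each p value tests the reduced model against the saturated model, matching the slide's 1-pchisq(fit$lrt,fit$df) calls; a term is a candidate for removal when its p exceeds α.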
24
Gender differences (figure)
25
Logistic regression: Gender difference
md <- read.table("MANOVAex1.txt",header=T)
attach(md)
head(md)
# For a factor response, glm codes the 1st level as 0 and the other level as 1; here "Male" is 1.
mylogit <- glm(Gender ~ Height + Weight, family = binomial)
summary(mylogit)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   6.1103     6.2455   0.978   0.3279
Height       -0.6734     0.2686  -2.507   0.0122 *
Weight        0.5701     0.2356   2.420   0.0155 *
nd1<-with(md,data.frame(Height=mean(Height),Weight=Weight))
nd1$rankP<-predict(mylogit,nd1,type="response") # output p(Y=1)
nd1$rankLogOdds<-predict(mylogit,nd1) # output ln(p/(1-p))
nd1
  Height Weight        rankP rankLogOdds
1  70.32     70 0.2079526634  -1.3373107
2  70.32     68 0.0774522032  -2.4774782
3  70.32     73 0.5921693274   0.3729405
4  70.32     79 0.9779779526   3.7934430
5……
This shows how the response variable would change with Weight when Height is set to a constant (= mean).
26
Logistic regression: admission
md <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
head(md) # admit, 1: admit, 0: reject
md$rank<-as.factor(md$rank)
mylogit <- glm(admit ~ gre+gpa+rank, data = md, family = binomial)
summary(mylogit)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *
gpa          0.804038   0.331819   2.423 0.015388 *
rank2       -0.675443   0.316490  -2.134 0.032829 *
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
nd1<-with(md,data.frame(gre=mean(gre),gpa=mean(gpa),rank=factor(1:4)))
nd1$rankP<-predict(mylogit,nd1,type="response") # output p(Y=1)
nd1$rankLogOdds<-predict(mylogit,nd1) # output ln(p/(1-p))
nd1
    gre    gpa rank     rankP rankLogOdds
1 587.7 3.3899    1 0.5166016  0.06643085
2 587.7 3.3899    2 0.3522846 -0.60901208
3 587.7 3.3899    3 0.2186120 -1.27377307
4 587.7 3.3899    4 0.1846684 -1.48503283
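The two predict() outputs are linked by the logistic function: p = 1/(1 + exp(-log-odds)). The sketch below reproduces the table approximately from the printed coefficients and means, without downloading the data; small differences from the slide's values are expected because the inputs are rounded.

```r
# Coefficients and covariate means as printed on the slide (rounded)
b <- c(intercept = -3.989979, gre = 0.002264, gpa = 0.804038,
       rank2 = -0.675443, rank3 = -1.340204, rank4 = -1.551464)
gre_mean <- 587.7
gpa_mean <- 3.3899

# Log-odds for rank 1 (the reference level), then shift by each rank effect
base_logodds <- b[["intercept"]] + b[["gre"]] * gre_mean + b[["gpa"]] * gpa_mean
logodds <- base_logodds + c(rank1 = 0, b[c("rank2", "rank3", "rank4")])

# Inverse logit turns log-odds into probabilities
prob <- plogis(logodds)

print(round(cbind(logodds, prob), 4))
```

Because rank enters additively on the log-odds scale, exp(b["rank2"]) is the odds ratio of admission for rank 2 relative to rank 1, whatever the values of gre and gpa.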
27
Logistic regression: Survival
md<-read.table("Titanic.txt",header=T)
attach(md)
head(md)
fullModel <-glm(Survived~Class*Sex*Age,family=binomial,weights=Freq)
step(fullModel,direction="backward") # will find the best model
fit3<-glm(Survived ~ Class + Sex + Age + Class:Sex + Class:Age, family = binomial, weights = Freq) # fit the best model found
summary(fit3)
Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)         3.55535    0.50709   7.011 2.36e-12 ***
Class2nd           -1.73827    0.58870  -2.953  0.00315 **
Class3rd           -3.77275    0.52878  -7.135 9.69e-13 ***
ClassCrew          -1.65823    0.80030  -2.072  0.03826 *
SexMale            -4.28298    0.53213  -8.049 8.36e-16 ***
AgeChild           15.28493  392.50617   0.039  0.96894
Class2nd:SexMale    0.06801    0.67120   0.101  0.91929
Class3rd:SexMale    2.89768    0.56364   5.141 2.73e-07 ***
ClassCrew:SexMale   1.13608    0.82048   1.385  0.16616
Class2nd:AgeChild   2.19967  520.81278   0.004  0.99663
Class3rd:AgeChild -14.94702  392.50626  -0.038  0.96962
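A simpler, self-contained variant is sketched below. It is not the model on the slide: it fits main effects only, and it uses as.data.frame(datasets::Titanic) in place of Titanic.txt. In the built-in data the reference levels are Male and Child, so the coefficient names differ from the slide (SexFemale and AgeAdult rather than SexMale and AgeChild).

```r
# Long-format version of the built-in table: Class, Sex, Age, Survived, Freq
td <- as.data.frame(datasets::Titanic)

# Main-effects logistic regression; Freq weights each covariate pattern
fit_main <- glm(Survived ~ Class + Sex + Age, family = binomial,
                weights = Freq, data = td)

# exp(coefficient) = odds ratio of survival relative to the reference level
odds_ratios <- exp(coef(fit_main))

print(summary(fit_main)$coefficients)
print(round(odds_ratios, 3))
```

Passing data = td avoids attach(), so the variables used by glm() are unambiguous. The very large standard errors for the AgeChild terms on the slide are typical of cells with zero counts (no children among the crew, no child deaths in 1st and 2nd class), which is one reason to look at a simpler model first.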