Count Data
H T
Cleopatra VII & Marcus Antony C c Aa
[Figure: roulette betting layout with EVEN, ODD, 1st 12, 2nd 12, 3rd 12]
Gregor Mendel,
Observed (Obs) and expected (Expect) counts under the 9:3:3:1 ratio:

              RY      Ry      rY      ry     Total
Obs          950     250     350      50      1600
Expect       900     300     300     100      1600
O-E           50     -50      50     -50         0
(O-E)^2/E   25/9    25/3    25/3      25   25*16/9

Which of the two statements is right?
Goodness-of-fit test of the counts (RY, Ry, rY, ry) against the expected ratio:

> x <- c(950,250,350,50)
> p <- c(9,3,3,1)/16
> chisq.test(x, p=p)

        Chi-squared test for given probabilities

data:  x
X-squared = , df = 3, p-value = 1.214e-09
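The statistic reported by chisq.test can be recomputed by hand. The following is a minimal sketch using the counts and the 9:3:3:1 ratio from the slide; the variable names (obs, p0, expected, x2) are illustrative and not from the original.

```r
# Observed counts from the slide and the 9:3:3:1 null ratio
obs <- c(RY = 950, Ry = 250, rY = 350, ry = 50)
p0  <- c(9, 3, 3, 1) / 16

expected <- sum(obs) * p0                       # expected counts under H0
x2       <- sum((obs - expected)^2 / expected)  # Pearson chi-squared statistic
df       <- length(obs) - 1                     # categories minus one
p_value  <- pchisq(x2, df, lower.tail = FALSE)  # upper-tail probability

# Cross-check against the built-in test
fit <- chisq.test(obs, p = p0)
print(c(by_hand = x2, chisq.test = unname(fit$statistic)))
print(p_value)
```

Using lower.tail = FALSE rather than 1 - pchisq(...) avoids losing precision when the p-value is very small.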
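Chi-square test for independence

The same counts cross-classified by (R, r) and (Y, y):

          Y      y    Total
R       950    250     1200
r       350     50      400
Total  1300    300     1600

The corresponding table of cell probabilities (rows R, r; columns Y, y) sums to 1. Under independence, each cell probability is the product of its row and column marginal probabilities.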
Independence test for the counts (RY, Ry, rY, ry) arranged as a 2 x 2 table:

> mx<- matrix(c(950,250,350,50),2,)
> mx
     [,1] [,2]
[1,]  950  350
[2,]  250   50
> chisq.test(mx,correct=FALSE)

        Pearson's Chi-squared test

data:  mx
X-squared = , df = 1, p-value =
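Under independence the expected count in each cell is (row total x column total) / n. The sketch below recomputes the test from the margins of the same 2 x 2 matrix; the names expected, x2 and fit are illustrative.

```r
# 2 x 2 table of counts, filled column by column as on the slide
mx <- matrix(c(950, 250, 350, 50), nrow = 2)
n  <- sum(mx)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(mx), colSums(mx)) / n

x2      <- sum((mx - expected)^2 / expected)     # Pearson statistic, no correction
df      <- (nrow(mx) - 1) * (ncol(mx) - 1)       # (rows - 1) * (columns - 1)
p_value <- pchisq(x2, df, lower.tail = FALSE)

# Cross-check against chisq.test without Yates' correction
fit <- chisq.test(mx, correct = FALSE)
print(expected)
print(c(by_hand = x2, chisq.test = unname(fit$statistic)))
```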
General k x m table of cell probabilities (all cells sum to 1):

        y1   ...   ym   Tot
r1
...
rk
Tot                       1
Six categories with equal probabilities 1/6:

                                          Total
Obs       8    12     7    14     9    10    60
Expec    10    10    10    10    10    10    60

> x <- c(8,12,7,14,9,10)
> p <- rep(1,6)/6
> chisq.test(x,p=p)

        Chi-squared test for given probabilities

data:  x
X-squared = 3.4, df = 5, p-value =
          H     T   Total
Obs      60    40     100
Expec    50    50     100

> chisq.test(c(60,40),p=c(1,1)/2)

        Chi-squared test for given probabilities

data:  c(60, 40)
X-squared = 4, df = 1, p-value =
Chi-square test for homogeneity of distributions

Do the two coins have the same proportion of heads?

         Caesar   Tolemy
Head        560      640
Tail        440      360

> head2 <- c( 560, 640)
> toss2 <- c( 1000, 1000)
> prop.test(head2, toss2)

        2-sample test for equality of proportions ….

data:  head2 out of toss2
X-squared = , df = 1, p-value =
alternative hypothesis: two.sided
95 percent confidence interval:
sample estimates:
prop 1 prop 2

> mx <- matrix(c(560,440,640,360),2,)
> mx
     [,1] [,2]
[1,]  560  640
[2,]  440  360
> chisq.test(mx,correct=FALSE)

        Pearson's Chi-squared test

data:  mx
X-squared = , df = 1, p-value =

> chisq.test(mx)

        Pearson's Chi-squared test with Yates' continuity correction

data:  mx
X-squared = , df = 1, p-value =
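For two groups, the uncorrected homogeneity statistic is the square of the pooled two-sample z statistic for proportions. The sketch below checks this on the two-coin counts from the slide; the names p_pool, se and z are illustrative.

```r
head2 <- c(560, 640)     # heads for the two coins
toss2 <- c(1000, 1000)   # tosses for the two coins

p_hat  <- head2 / toss2             # sample proportions
p_pool <- sum(head2) / sum(toss2)   # pooled proportion under H0

# Pooled standard error of the difference in proportions
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / toss2))
z  <- (p_hat[1] - p_hat[2]) / se

# Both built-in tests, without continuity correction
pt <- prop.test(head2, toss2, correct = FALSE)
mx <- matrix(c(560, 440, 640, 360), nrow = 2)
ct <- chisq.test(mx, correct = FALSE)

print(c(z_squared  = z^2,
        prop.test  = unname(pt$statistic),
        chisq.test = unname(ct$statistic)))
```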
> # H0 : all four coins have the same proportion of heads
> # H1 : at least one coin has a proportion different from the others
>
> head4 <- c( 83, 90, 129, 70 )
> toss4 <- c( 86, 93, 136, 82 )
> prop.test(head4, toss4)

        4-sample test for equality of proportions without continuity correction

data:  head4 out of toss4
X-squared = , df = 3, p-value =
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4

The same counts read as coins (Head, Tail) or as hospitals (Alive, Dead):

                 Coin 1       Coin 2       Coin 3       Coin 4
                 Hospital 1   Hospital 2   Hospital 3   Hospital 4
Head / Alive         83           90          129           70
Tail / Dead           3            3            7           12
Total                86           93          136           82

> mx <- matrix(c(83,3,90,3,129,7,70,12),2,)
> chisq.test(mx)

        Pearson's Chi-squared test

data:  mx
X-squared = , df = 3, p-value =
Australia rare plants data

Common (C) & Rare (R) in (South Australia, Victoria) and in (Tasmania).
The number of plants in Dry (D), Wet (W), and Wet or Dry (WD) regions:

        D     W    WD
CC     37   190    94
CR     23    59    23
RC     10   141    28
RR     15    58    16

Question (null hypothesis): Is the distribution of plants over (D, W, WD) the same for all of CC, CR, RC and RR?
Australia rare plants data

> rareplants<-matrix(c(37,23,10,15,190,59,141,58,94,23,28,16),4,)
> dimnames(rareplants)<-list(c("CC","CR","RC","RR"),c("D","W","WD"))
> rareplants
> (sout<- chisq.test(rareplants) )

        Pearson's Chi-squared test

data:  rareplants
X-squared = , df = 6, p-value = 4.336e-06

> round( sout$expected,1 )
    D   W   WD
CC
CR
RC
RR
> round( sout$resid,3 )
    D   W   WD
CC
CR
RC
RR
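The expected counts and residuals that chisq.test returns can be reproduced from the table margins, which shows which cells drive the rejection. This is a sketch with illustrative names (expected, pearson); the residuals are Pearson residuals, (O - E) / sqrt(E).

```r
rareplants <- matrix(c(37, 23, 10, 15, 190, 59, 141, 58, 94, 23, 28, 16),
                     nrow = 4,
                     dimnames = list(c("CC", "CR", "RC", "RR"),
                                     c("D", "W", "WD")))

# Expected counts under independence of rows and columns
expected <- outer(rowSums(rareplants), colSums(rareplants)) / sum(rareplants)

# Pearson residuals: large absolute values mark cells far from independence
pearson <- (rareplants - expected) / sqrt(expected)

sout <- chisq.test(rareplants)
print(round(expected, 1))
print(round(pearson, 3))

# The squared Pearson residuals add up to the chi-squared statistic
print(c(sum_sq_resid = sum(pearson^2), X_squared = unname(sout$statistic)))
```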
The lady tasting tea
Fisher's exact test for 2 X 2 tables with small n (n<25)

Guess\Making   Milk 1st   Tea 1st   Sum
Milk 1st           7         1       8
Tea 1st            2         5       7
sum                9         6      15

> chisq.test(matrix(c(7,2,1,5),2,))

        Pearson's Chi-squared test with Yates' continuity correction

X-squared = , df = 1, p-value =

Warning message:
Chi-squared approximation may be incorrect

> fisher.test(matrix(c(7,2,1,5),2,))

        Fisher's Exact Test for Count Data

p-value =
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
sample estimates:
odds ratio

> fisher.test(matrix(c(7,2,1,5),2,),alternative="greater")

        Fisher's Exact Test for Count Data

p-value =
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 Inf
sample estimates:
odds ratio
There are 7 possible tables for the given marginal counts.
Rows are the guess (G), columns are the making (M); M 1st = milk first, T 1st = tea first.

          Guess M 1st           Guess T 1st
Table    M 1st  T 1st  Sum     M 1st  T 1st  Sum
  1        8      0     8        1      6     7
  2        7      1     8        2      5     7
  3        6      2     8        3      4     7
  4        5      3     8        4      3     7
  5        4      4     8        5      2     7
  6        3      5     8        6      1     7
  7        2      6     8        7      0     7

Column sums in every table: M 1st = 9, T 1st = 6, n = 15.

What is the probability that each table will appear in the experiment?
Counts:

G\M      M 1st   T 1st   Sum
M 1st      a       b     a+b
T 1st      c       d     c+d
sum       a+c     b+d     n

Proportions within each column:

G\M      M 1st   T 1st   Sum
M 1st      r       q      v
T 1st     1-r     1-q    1-v
sum        1       1      1

r = q means no discernible ability.
Odds ratio: [r/(1-r)] / [q/(1-q)], estimated with some correction.
Each of the 7 possible tables (top-left count 8, 7, 6, 5, 4, 3, 2) has its own probability when there is no discernible ability and the margins are fixed.

Two-sided test: the p-value of the Fisher exact test is the sum of the probabilities of all tables that are no more probable than the observed table.
One-sided test: the p-value is the sum of the probabilities of the observed table and of the tables more extreme in the direction of the alternative.
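With the margins fixed, the top-left count follows a hypergeometric distribution, so the table probabilities and both p-values can be computed directly. The sketch below uses the tea-tasting margins from the slides (row sums 8 and 7, column sums 9 and 6); the names a, prob, p_one and p_two are illustrative, and the small tolerance factor in the two-sided sum is an assumption that mirrors how ties are usually handled numerically.

```r
# Possible counts in the top-left cell (guess milk first, made milk first)
a <- 2:8

# P(top-left = a): draw the 8 "guessed milk first" cups from 9 milk-first
# and 6 tea-first cups
prob <- dhyper(a, m = 9, n = 6, k = 8)
print(round(setNames(prob, a), 4))

observed <- 7

# One-sided p-value: observed table or more extreme (larger top-left count)
p_one <- sum(prob[a >= observed])

# Two-sided p-value: all tables no more probable than the observed one
p_obs <- prob[a == observed]
p_two <- sum(prob[prob <= p_obs * (1 + 1e-7)])

# Cross-check against fisher.test on the observed table
tea <- matrix(c(7, 2, 1, 5), nrow = 2)
print(c(p_two = p_two, fisher = fisher.test(tea)$p.value))
print(c(p_one = p_one,
        fisher = fisher.test(tea, alternative = "greater")$p.value))
```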
All answers correct:

G\M      M 1st   T 1st   Sum
M 1st      9       0      9
T 1st      0       6      6
sum        9       6     15

Some are misclassified:

G\M      M 1st   T 1st   Sum
M 1st      4       4      8
T 1st      5       2      7
sum        9       6     15

The Fisher exact test considers only the tables with the same fixed margins. The probabilities of tables with different margins are completely ignored. This is sometimes referred to as data-respecting (?) inference.
Use Fisher's exact test only for small n (less than 25).

Guess\Making   Milk 1st   Tea 1st   Sum
Milk 1st          14         2      16
Tea 1st            4        10      14
sum               18        12      30

>

        Pearson's Chi-squared test

X-squared = , df = 1, p-value =

> chisq.test(matrix(c(14,4,2,10),2,))

        Pearson's Chi-squared test with Yates' continuity correction

X-squared = , df = 1, p-value =

> fisher.test(matrix(c(14,4,2,10),2,))

        Fisher's Exact Test for Count Data

p-value =
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
sample estimates:
odds ratio

No big difference when n is large!
Yates' continuity correction

G\M      M 1st   T 1st   Sum
M 1st      a       b     a+b
T 1st      c       d     c+d
sum       a+c     b+d     n
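The formula on this slide did not survive extraction. For reference, the textbook form of the corrected statistic for the a, b, c, d table above is sketched below next to the uncorrected one; this is the standard expression, not necessarily the exact notation the slide used.

```latex
\begin{align*}
X^2 &= \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
  && \text{(Pearson, no correction)} \\[4pt]
X^2_{\text{Yates}} &= \frac{n\,\bigl(\lvert ad - bc \rvert - n/2\bigr)^2}{(a+b)(c+d)(a+c)(b+d)}
  && \text{(with continuity correction)}
\end{align*}
```

The correction shrinks each |O - E| by 1/2, so the corrected statistic is never larger than the uncorrected one and the difference fades as n grows.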
Odds ratio :
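The expression for this slide is missing from the extracted text. A sketch of the usual sample odds ratio for the a, b, c, d table, with the tea-tasting counts (a = 7, b = 1, c = 2, d = 5) plugged in, is given below. The standard error shown is the common large-sample (Wald) form and is an addition for context; fisher.test reports a conditional maximum likelihood estimate instead, so its value differs somewhat from ad/(bc).

```latex
\begin{align*}
\widehat{OR} &= \frac{a/c}{b/d} = \frac{ad}{bc}
  = \frac{7 \cdot 5}{1 \cdot 2} = 17.5 \\[4pt]
\operatorname{SE}\bigl(\log \widehat{OR}\bigr)
  &\approx \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}
\end{align*}
```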
Linear Model (LM)
- Regression
- ANOVA
Generalized Linear Model (GLM)
- Poisson Regression
- Binomial Regression (Logistic Regression)
Logistic regression

Guess\Making   Milk 1st   Tea 1st   Sum
Milk 1st           7         1       8
Tea 1st            2         5       7
sum                9         6      15

These counts are observed!
Logistic regression with the lady tasting tea data

> tm<-data.frame(gm=c(7,1),gt=c(2,5), making=c("M","T"))
> summary( glm(cbind(gm,gt)~making,family=binomial, data=tm) )

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)
makingT                                          *

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: e+00 on 1 degrees of freedom
Residual deviance: e-16 on 0 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 4
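In this saturated model the coefficients are simply log odds computed from the table, which ties the logistic regression back to the odds ratio. A minimal sketch, reusing the tm data frame from the slide; the names fit, b, odds_M and odds_T are illustrative.

```r
tm  <- data.frame(gm = c(7, 1), gt = c(2, 5), making = c("M", "T"))
fit <- glm(cbind(gm, gt) ~ making, family = binomial, data = tm)
b   <- coef(fit)

# Odds of guessing "milk first" within each making group
odds_M <- 7 / 2   # cups made milk first
odds_T <- 1 / 5   # cups made tea first

# Intercept = log odds for making M; makingT = log odds ratio (T versus M)
print(c(intercept = unname(b["(Intercept)"]), log_odds_M = log(odds_M)))
print(c(makingT = unname(b["makingT"]), log_odds_ratio = log(odds_T / odds_M)))

# Sample odds ratio ad/(bc) recovered from the slope
print(exp(-unname(b["makingT"])))
```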
Sprays: A, B, C, D, E, F
> sx<-rep(LETTERS[1:6],each=12)
> dx<-c(10,7,20,14,14,12,10,23,17,20,14,13,11,17,21,11,16,14,17,17,19,21,7,13,
+ 0,1,7,2,3,1,2,1,3,0,1,4,3,5,12,6,4,3,5,5,5,5,2,4,3,5,3,5,3,6,1,1,3,2,6,
+ 4,11,9,15,22,15,16,13,10,26,26,24,13)
> ax<- 30-dx
> insect<-data.frame(dead=dx,alive=ax,spray=sx)
> gout<-glm(cbind(dead,alive)~spray,family=binomial, data=insect)
> summary( gout )

Call:
glm(formula = cbind(dead, alive) ~ spray, family = binomial, data = insect)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)
sprayB
sprayC                                    <2e-16 ***
sprayD                                    <2e-16 ***
sprayE                                    <2e-16 ***
sprayF

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: on 71 degrees of freedom
Residual deviance: on 66 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 4
> gres<-rbind(unique(fitted(gout)),unique(predict(gout)))
> dimnames(gres)[[2]]<-LETTERS[1:6]
> gres
     A B C D E F
[1,]
[2,]
> anova(gout)
Analysis of Deviance Table

Model: binomial, link: logit

Response: cbind(dead, alive)

Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev
NULL
spray
Correlation and causality

APT prices in Seoul

The more STBK stores, the higher the APT price?
The more Starbucks, the higher the APT price!
District       STBK   APT price
Gangnam-gu
Gangdong-gu      2       530
Jung-gu         24       520
Jungnang-gu      0       330

STBK: the number of Starbucks stores
APT price: average apartment (APT) price per 1 m²
y<-c(45, 2,1,4,4,6,4,2,1,0,2,3,10,8,21,3,5,5,3,12,7,1,20,24,0)
x<-c(3373,1907,1115,1413,1286,1861,1218,1018,1250,1135,1240,1528,
     1675,1220,2854,1644,1247,2427,2034,1723,2594,1138,1634,1729,1101)
xm<- x/(3.3)   # x is the price per pyeong (1 pyeong is about 3.3 m^2); xm is the price per m^2
( res<- glm(y~xm, family=poisson) )
anova(res)
summary(res)
plot(xm,y,ylab="Starbucks",xlab="APT price/m2")
points(xm,fitted(res),col="red",pch=16)   # exp(predict(res)) = fitted(res)
> summary(res)

Call:
glm(formula = y ~ xm, family = poisson)

Deviance Residuals:
   Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)
xm                                        <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: on 24 degrees of freedom
Residual deviance: on 23 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 5
> anova(res)
Analysis of Deviance Table

Model: poisson, link: log

Response: y

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev
NULL
xm
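The deviance table can be turned into a test by comparing the drop in deviance with a chi-squared distribution, and the log link can be checked numerically. The sketch below refits the model from the data on the slide; the names eta, mu, dev_drop and p_value are illustrative.

```r
y <- c(45, 2, 1, 4, 4, 6, 4, 2, 1, 0, 2, 3, 10, 8, 21, 3, 5, 5, 3, 12, 7, 1, 20, 24, 0)
x <- c(3373, 1907, 1115, 1413, 1286, 1861, 1218, 1018, 1250, 1135, 1240, 1528,
       1675, 1220, 2854, 1644, 1247, 2427, 2034, 1723, 2594, 1138, 1634, 1729, 1101)
xm  <- x / 3.3
res <- glm(y ~ xm, family = poisson)

# Log link: the linear predictor lives on the log scale of the mean
eta <- predict(res)   # beta0 + beta1 * xm
mu  <- fitted(res)    # estimated mean counts
print(all.equal(exp(eta), mu))

# Deviance explained by xm, compared with chi-squared on 1 degree of freedom
dev_drop <- res$null.deviance - res$deviance
df_drop  <- res$df.null - res$df.residual
p_value  <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
print(c(deviance_drop = dev_drop, df = df_drop, p_value = p_value))
```

A small p-value here says only that store counts and prices move together across districts; as the slide title stresses, it says nothing about which one causes the other.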
A B C distribution & likelihood
What is ? is observed.
distribution & likelihood
[Figure: likelihood plotted against p]
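The formulas on these slides were lost in extraction. As an illustration of the distinction between a distribution and a likelihood, assume a binomial count X with n trials and success probability p (an assumed example; the slide's own example cannot be recovered). The same expression is read in two ways:

```latex
\begin{align*}
\text{Distribution (p fixed, x varies):}\quad
  & P(X = x \mid p) = \binom{n}{x} p^{x} (1-p)^{n-x}, \quad x = 0, 1, \dots, n \\[4pt]
\text{Likelihood (x observed, p varies):}\quad
  & L(p \mid x) = \binom{n}{x} p^{x} (1-p)^{n-x}, \quad 0 \le p \le 1 \\[4pt]
\text{Maximum likelihood estimate:}\quad
  & \hat{p} = \arg\max_{p} L(p \mid x) = \frac{x}{n}
\end{align*}
```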
link function (for Poisson family) the number of parameters linear modeling for the link function
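The equations for this slide are missing. A sketch of the usual Poisson regression with the canonical log link, written for a single covariate x as in the Starbucks example (the symbols beta_0 and beta_1 are assumed notation):

```latex
\begin{align*}
Y_i &\sim \text{Poisson}(\mu_i), \qquad i = 1, \dots, n \\
\log \mu_i &= \beta_0 + \beta_1 x_i
  \quad\Longleftrightarrow\quad
  \mu_i = \exp(\beta_0 + \beta_1 x_i)
\end{align*}
```

Without a model there is one mean per observation; the linear model on the log scale reduces these to the two parameters beta_0 and beta_1.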
link function (for binomial family) linear modeling for the link function
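The equations for this slide are missing as well. A sketch of binomial (logistic) regression with the canonical logit link, under the same assumed notation, with n_i trials and success probability p_i for observation i:

```latex
\begin{align*}
Y_i &\sim \text{Binomial}(n_i, p_i) \\
\operatorname{logit}(p_i) = \log \frac{p_i}{1 - p_i} &= \beta_0 + \beta_1 x_i
  \quad\Longleftrightarrow\quad
  p_i = \frac{1}{1 + \exp\bigl(-(\beta_0 + \beta_1 x_i)\bigr)}
\end{align*}
```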
Independence test in GLM for Australia rare plants data

> rareplants<-matrix(c(37,23,10,15,190,59,141,58,94,23,28,16),4,)
> dimnames(rareplants)<-list(c("CC","CR","RC","RR"),c("D","W","WD"))
> (sout<- chisq.test(rareplants) )

        Pearson's Chi-squared test

data:  rareplants
X-squared = , df = 6, p-value = 4.336e-06

> wdx<-rep(c("D","W","WD"),each=4)
> crx<-rep(c("CC","CR","RC","RR"),3)
> rplants<-data.frame(wd=wdx,cr=crx,r=c(rareplants))
> anova( glm(r~wd*cr,family=poisson,data=rplants) )
Analysis of Deviance Table

Model: poisson, link: log, Response: r

Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev
NULL
wd
cr
wd:cr                             e-15

> 1-pchisq(34.95,6)
[1] e-06
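The deviance attributed to the wd:cr interaction is the likelihood-ratio statistic G^2 for independence, which can also be computed straight from the table. A minimal sketch, with illustrative names (indep, expected, g2):

```r
rareplants <- matrix(c(37, 23, 10, 15, 190, 59, 141, 58, 94, 23, 28, 16),
                     nrow = 4,
                     dimnames = list(c("CC", "CR", "RC", "RR"),
                                     c("D", "W", "WD")))
rplants <- data.frame(wd = rep(c("D", "W", "WD"), each = 4),
                      cr = rep(c("CC", "CR", "RC", "RR"), times = 3),
                      r  = c(rareplants))

# Independence model: main effects only, no wd:cr interaction
indep <- glm(r ~ wd + cr, family = poisson, data = rplants)

# Likelihood-ratio statistic from observed and expected counts
expected <- outer(rowSums(rareplants), colSums(rareplants)) / sum(rareplants)
g2 <- 2 * sum(rareplants * log(rareplants / expected))

# G^2 equals the residual deviance of the independence model,
# i.e. the deviance removed by adding wd:cr
print(c(G2 = g2, deviance = deviance(indep), df = df.residual(indep)))
print(pchisq(g2, df = df.residual(indep), lower.tail = FALSE))
```

G^2 and Pearson's X-squared are different statistics with the same reference distribution, so the two p-values are close but not identical.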
Homogeneity test in GLM for coin tossing example

                 Coin 1     Coin 2     Coin 3     Coin 4
                 Hosp'l 1   Hosp'l 2   Hosp'l 3   Hosp'l 4
Head / Alive        83         90        129         70
Tail / Dead          3          3          7         12
Total               86         93        136         82

> # H0 : all four coins have the same proportion of heads
> # H1 : at least one coin has a proportion different from the others
>
> head4 <- c( 83, 90, 129, 70 )
> toss4 <- c( 86, 93, 136, 82 )
> prop.test(head4, toss4)

        4-sample test for equality of proportions without continuity correction

X-squared = , df = 3, p-value =
alternative hypothesis: two.sided

> coins<-factor(LETTERS[1:4])
> anova(glm(cbind(head4,toss4-head4)~coins,family=binomial))
Analysis of Deviance Table

Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev
NULL
coins                             e-14

> 1-pchisq(10.667,3)
[1]
Thank you !!