
Count Data

H / T

Cleopatra VII & Marcus Antony: the two faces of a coin (head or tail); likewise the allele pairs C/c and A/a in genetics.

[Roulette layout: EVEN / ODD and the dozen bets 1st 12, 2nd 12, 3rd 12]

Gregor Mendel, 1822-1884

Mendel's dihybrid cross: which statement is right, 9:3:3:1 or not?

                  RY    Ry    rY    ry   Total
Obs              950   250   350    50    1600
Expect(9:3:3:1)  900   300   300   100    1600

                  RY    Ry    rY    ry   Total
Obs              950   250   350    50    1600
Expect           900   300   300   100    1600
O-E               50   -50    50   -50       0

$$X^2=\sum\frac{(O-E)^2}{E}=\frac{50^2}{900}+\frac{50^2}{300}+\frac{50^2}{300}+\frac{50^2}{100}=\frac{25}{9}+\frac{25}{3}+\frac{25}{3}+25=\frac{25\times16}{9}=\frac{400}{9}\approx 44.44$$


                  RY    Ry    rY    ry   Total
Obs              950   250   350    50    1600
Expect(9:3:3:1)  900   300   300   100    1600

> x <- c(950,250,350,50)
> p <- c(9,3,3,1)/16
> chisq.test(x, p=p)

        Chi-squared test for given probabilities

data:  x
X-squared = 44.444, df = 3, p-value = 1.214e-09

The same counts as a two-way table with margins, and as proportions:

          Y     y   Total            Y      y     Total
R       950   250    1200    R    0.594  0.156   0.750
r       350    50     400    r    0.219  0.031   0.250
Total  1300   300    1600    Total 0.813 0.188   1

Under independence the cell probabilities factor into the margins:

          Y            y          Total
R      p_R x p_Y    p_R x p_y      p_R
r      p_r x p_Y    p_r x p_y      p_r
Total     p_Y          p_y          1

Chi-square test for independence: H0 : p_ij = p_i. x p_.j

Expected counts under independence, using the margins:

          Y     y   Total
R       950   250    1200
r       350    50     400
Total  1300   300    1600

                  RY    Ry    rY    ry   Total
Obs              950   250   350    50    1600
Expect(indep.)   975   225   325    75    1600

e.g. E(R,Y) = 1200 x 1300 / 1600 = 975

> mx <- matrix(c(950,250,350,50),2,)
> mx
     [,1] [,2]
[1,]  950  350
[2,]  250   50
> chisq.test(mx, correct=F)

        Pearson's Chi-squared test

data:  mx
X-squared = 13.675, df = 1, p-value = 0.0002172

Margins estimated from the data: p_R = 0.75, p_r = 0.25, p_Y = 0.8125, p_y = 0.1875.

In general, for a k x m table of probabilities:

          y1   ...   ym    Total
r1       p11   ...   p1m    p1.
...                         ...
rk       pk1   ...   pkm    pk.
Total    p.1   ...   p.m     1
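In standard notation, independence is tested with the Pearson statistic comparing observed and expected counts:

$$X^2=\sum_{i=1}^{k}\sum_{j=1}^{m}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},\qquad E_{ij}=\frac{n_{i\cdot}\,n_{\cdot j}}{n},$$

which is approximately chi-square distributed with (k-1)(m-1) degrees of freedom under the null.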

Is the die fair?

Face      1    2    3    4    5    6   Total
Obs       8   12    7   14    9   10     60
Expect   10   10   10   10   10   10     60

> x <- c(8,12,7,14,9,10)
> p <- rep(1,6)/6
> chisq.test(x, p=p)

        Chi-squared test for given probabilities

data:  x
X-squared = 3.4, df = 5, p-value = 0.6386

Is the coin fair?

          H    T   Total
Obs      60   40     100
Expect   50   50     100

> chisq.test(c(60,40), p=c(1,1)/2)

        Chi-squared test for given probabilities

data:  c(60, 40)
X-squared = 4, df = 1, p-value = 0.0455

Chi-square test for homogeneity of distributions: are the head proportions of the two coins equal (H0 : p1 = p2)?

         Caesar  Ptolemy
Head       560      640
Tail       440      360

> head2 <- c( 560, 640)
> toss2 <- c( 1000, 1000)
> prop.test(head2, toss2)

        2-sample test for equality of proportions with continuity correction

data:  head2 out of toss2
X-squared = 13.002, df = 1, p-value = 0.0003108
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1238 -0.0362
sample estimates:
prop 1 prop 2
  0.56   0.64

> mx <- matrix(c(560,440,640,360),2,)
> mx
     [,1] [,2]
[1,]  560  640
[2,]  440  360
> chisq.test(mx, cor=F)

        Pearson's Chi-squared test

data:  mx
X-squared = 13.333, df = 1, p-value = 0.0002607

> chisq.test(mx)

        Pearson's Chi-squared test with Yates' continuity correction

data:  mx
X-squared = 13.002, df = 1, p-value = 0.0003108

> # H0 : all four coins have the same proportion showing head side
> # H1 : at least one coin has a different proportion from the others
> head4 <- c( 83, 90, 129, 70 )
> toss4 <- c( 86, 93, 136, 82 )
> prop.test(head4, toss4)

        4-sample test for equality of proportions without continuity correction

data:  head4 out of toss4
X-squared = 12.6, df = 3, p-value = 0.005585
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4
0.9651163 0.9677419 0.9485294 0.8536585

The same 2 x 4 table can be read as four coins or as four hospitals:

               Coin 1  Coin 2  Coin 3  Coin 4
Head (Alive)      83      90     129      70
Tail (Dead)        3       3       7      12
Total             86      93     136      82
            (Hospital 1 to Hospital 4)

> mx <- matrix(c(83,3,90,3,129,7,70,12),2,)
> chisq.test(mx)

        Pearson's Chi-squared test

data:  mx
X-squared = 12.6, df = 3, p-value = 0.005585
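For example (standard computation), the expected counts for Coin 4 / Hospital 4 are 372 x 82/397 = 76.8 heads and 25 x 82/397 = 5.16 tails, so that column contributes

$$\frac{(70-76.8)^2}{76.8}+\frac{(12-5.16)^2}{5.16}\approx 0.61+9.05,$$

most of the total X^2 = 12.6.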

Australia rare plants data

Species are classified Common (C) or Rare (R) in (South Australia, Victoria) and in (Tasmania), giving the four groups CC, CR, RC and RR. The number of plants in Dry (D), Wet (W) and Wet-or-Dry (WD) regions:

       D    W   WD
CC    37  190   94
CR    23   59   23
RC    10  141   28
RR    15   58   16

Question (null hypothesis): is the distribution over (D, W, WD) the same for CC, CR, RC and RR?

Australia rare plants data

> rareplants<-matrix(c(37,23,10,15,190,59,141,58,94,23,28,16),4,)
> dimnames(rareplants)<-list(c("CC","CR","RC","RR"),c("D","W","WD"))
> rareplants
    D   W  WD
CC 37 190  94
CR 23  59  23
RC 10 141  28
RR 15  58  16
> (sout<- chisq.test(rareplants) )

        Pearson's Chi-squared test

data:  rareplants
X-squared = 34.986, df = 6, p-value = 4.336e-06

> round( sout$expected, 1 )
      D     W   WD
CC 39.3 207.2 74.5
CR 12.9  67.8 24.4
RC 21.9 115.6 41.5
RR 10.9  57.5 20.6
> round( sout$resid, 3 )
        D      W     WD
CC -0.369 -1.196  2.263
CR  2.828 -1.067 -0.275
RC -2.547  2.368 -2.099
RR  1.242  0.072 -1.023

The lady tasting tea

Fisher's exact test for 2 x 2 tables with small n (n < 25)

Guess \ Making   Milk 1st  Tea 1st  Sum
Milk 1st             7        1      8
Tea 1st              2        5      7
Sum                  9        6     15

> chisq.test(matrix(c(7,2,1,5),2,))

        Pearson's Chi-squared test with Yates' continuity correction

X-squared = 3.2254, df = 1, p-value = 0.0725
Warning message:
Chi-squared approximation may be incorrect

> fisher.test(matrix(c(7,2,1,5),2,))

        Fisher's Exact Test for Count Data

p-value = 0.04056
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval: ...
sample estimates:
odds ratio ...

> fisher.test(matrix(c(7,2,1,5),2,), alter="greater")

        Fisher's Exact Test for Count Data

p-value = 0.03497
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval: ... Inf
sample estimates:
odds ratio ...

There are 7 possible tables for the given marginal counts (row sums 8 and 7, column sums 9 and 6). Writing each as (Milk-1st guess row | Tea-1st guess row):

(8,0 | 1,6)  (7,1 | 2,5)  (6,2 | 3,4)  (5,3 | 4,3)  (4,4 | 5,2)  (3,5 | 6,1)  (2,6 | 7,0)

What is the probability that each table shows up in the experiment?
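Under the null, the table probabilities follow the hypergeometric distribution (standard result): with a = number of correct "Milk 1st" guesses,

$$P(a)=\frac{\binom{8}{a}\binom{7}{9-a}}{\binom{15}{9}},\quad a=2,\dots,8;\qquad \text{e.g. } P(7)=\frac{\binom{8}{7}\binom{7}{2}}{\binom{15}{9}}=\frac{168}{5005}\approx 0.0336 .$$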

G\M      M 1st   T 1st   Sum         G\M      M 1st   T 1st   Sum
M 1st      a       b     a+b         M 1st      r       q       v
T 1st      c       d     c+d         T 1st     1-r     1-q     1-v
Sum       a+c     b+d     n          Sum        1       1       1

The right-hand table conditions on the true making; r = q (odds ratio theta = 1) means no discernible ability.

Odds ratio: $\hat\theta=\dfrac{ad}{bc}$; with some correction, $\hat\theta=\dfrac{(a+1/2)(d+1/2)}{(b+1/2)(c+1/2)}$.
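Worked example on the observed tea table:

$$\hat\theta=\frac{7\times5}{1\times2}=17.5,\qquad \hat\theta_{\text{corrected}}=\frac{7.5\times5.5}{1.5\times2.5}=11.0 .$$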

Probabilities of the 7 tables when theta = 1:

(8,0 | 1,6)  (7,1 | 2,5)  (6,2 | 3,4)  (5,3 | 4,3)  (4,4 | 5,2)  (3,5 | 6,1)  (2,6 | 7,0)
  0.0014       0.0336       0.1958       0.3916       0.2937       0.0783       0.0056

p-value of the Fisher exact test:
  two-sided: 0.0336 + 0.0014 + 0.0056 = 0.0406   (all tables as or less probable than the observed one)
  one-sided: 0.0336 + 0.0014 = 0.0350

G\M      M 1st   T 1st   Sum         G\M      M 1st   T 1st   Sum
M 1st      9       0      9          M 1st      4       4      8
T 1st      0       6      6          T 1st      5       2      7
Sum        9       6     15          Sum        9       6     15

100% correct answers                 Some answers are misclassified

Fisher's exact test considers only the tables with the same fixed margins; the probabilities of tables with different margins (such as the 100%-correct table on the left) are completely ignored. This is sometimes referred to as "data-respecting" (conditional) inference.

Use Fisher's exact test only for small n (less than 25).

Guess \ Making   Milk 1st  Tea 1st  Sum
Milk 1st            14        2      16
Tea 1st              4       10      14
Sum                 18       12      30

> chisq.test(matrix(c(14,4,2,10),2,), correct=F)

        Pearson's Chi-squared test

X-squared = 10.804, df = 1, p-value = 0.001012

> chisq.test(matrix(c(14,4,2,10),2,))

        Pearson's Chi-squared test with Yates' continuity correction

X-squared = 8.4877, df = 1, p-value = 0.003575

> fisher.test(matrix(c(14,4,2,10),2,))

        Fisher's Exact Test for Count Data

p-value = 0.002185
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval: ...
sample estimates:
odds ratio ...

No big difference when n is large!

Yates' continuity correction

G\M      M 1st   T 1st   Sum
M 1st      a       b     a+b
T 1st      c       d     c+d
Sum       a+c     b+d     n

$$\chi^2_{\text{Yates}}=\sum\frac{(|O-E|-0.5)^2}{E}=\frac{n\,(|ad-bc|-n/2)^2}{(a+b)(c+d)(a+c)(b+d)}$$
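Worked check on the tea table (a=7, b=1, c=2, d=5, n=15):

$$\chi^2_{\text{Yates}}=\frac{15\,(|7\cdot5-1\cdot2|-7.5)^2}{8\cdot7\cdot9\cdot6}=\frac{15\times25.5^2}{3024}\approx 3.2254,$$

matching the chisq.test() output above.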

Odds ratio: $\hat\theta=\dfrac{ad}{bc}$ (with the same 2 x 2 notation as above; its logarithm is used for interval estimation).

The Linear Model (LM), i.e. Regression and ANOVA, generalizes to the Generalized Linear Model (GLM).

Linear Model (LM):
  - Regression
  - ANOVA
Generalized Linear Model (GLM):
  - Poisson regression
  - Binomial regression (logistic regression)
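A minimal sketch of these model families in R (the data here are made up purely for illustration):

x    <- 1:10
y    <- 2 + 0.5*x + rnorm(10)                             # continuous response
cnt  <- rpois(10, lambda = exp(0.1*x))                    # count response
succ <- rbinom(10, size = 10, prob = plogis(-1 + 0.3*x))  # successes out of 10 trials

lm(y ~ x)                                         # linear model (regression/ANOVA)
glm(y ~ x, family = gaussian)                     # the same model, written as a GLM
glm(cnt ~ x, family = poisson)                    # Poisson regression (log link)
glm(cbind(succ, 10-succ) ~ x, family = binomial)  # logistic regression (logit link)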

Guess \ Making   Milk 1st  Tea 1st  Sum
Milk 1st             7        1      8
Tea 1st              2        5      7
Sum                  9        6     15

The counts are observed! Logistic regression:

Logistic regression with the lady tasting tea data

> tm<-data.frame(gm=c(7,1), gt=c(2,5), making=c("M","T"))
> summary( glm(cbind(gm,gt)~making, family=binomial, data=tm) )

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2528     0.8018   1.562    0.118
makingT      -2.8622     1.3575  -2.108    0.035 *

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5.7863e+00  on 1  degrees of freedom
Residual deviance: e-16  on 0  degrees of freedom
AIC: 8.191

Number of Fisher Scoring iterations: 4
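Note that the makingT coefficient is (minus) the log odds ratio: exp(2.8622) = (7 x 5)/(1 x 2) = 17.5, the sample odds ratio, and its Wald p-value 0.035 is close to the one-sided Fisher p-value 0.0350 above.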

[Data table for the insecticide experiment: six sprays A to F, 12 batches of 30 insects each]

> sx<-rep(LETTERS[1:6],e=12)
> dx<-c(10,7,20,14,14,12,10,23,17,20,14,13,11,17,21,11,16,14,17,17,19,21,7,13,
+ 0,1,7,2,3,1,2,1,3,0,1,4,3,5,12,6,4,3,5,5,5,5,2,4,3,5,3,5,3,6,1,1,3,2,6,
+ 4,11,9,15,22,15,16,13,10,26,26,24,13)
> ax<- 30-dx
> insect<-data.frame(dead=dx,alive=ax,spray=sx)
> gout<-glm(cbind(dead,alive)~spray,family=binomial, data=insect)
> summary( gout )

Call:
glm(formula = cbind(dead, alive) ~ spray, family = binomial, data = insect)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.0667     0.1055  -0.632    0.527
sprayB        0.1111     0.1491   0.745    0.456
sprayC       -2.5286     0.2326 -10.871   <2e-16 ***
sprayD       -1.5629     0.1772  -8.821   <2e-16 ***
sprayE       -1.9577     0.1951 -10.033   <2e-16 ***
sprayF        0.2898     0.1496   1.938   0.0527 .

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  on 71  degrees of freedom
Residual deviance:  on 66  degrees of freedom
AIC:

Number of Fisher Scoring iterations: 4

> gres<-rbind(unique(fitted(gout)),unique(predict(gout)))
> dimnames(gres)[[2]]<-LETTERS[1:6]
> gres
           A       B       C       D       E      F
[1,]  0.4833  0.5111  0.0694  0.1639  0.1167 0.5556
[2,] -0.0667  0.0444 -2.5953 -1.6296 -2.0244 0.2231

(row [1,]: fitted proportions; row [2,]: linear predictors, i.e. logits)

> anova(gout)
Analysis of Deviance Table

Model: binomial, link: logit
Response: cbind(dead, alive)
Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev
NULL                     71        ...
spray  5    442.8        66        ...

Correlation and causality

The more STBK (Starbucks) stores, the higher the APT price? The more Starbucks, the higher the APT price! (Average APT prices in Seoul.)

District               STBK   APT price
Gangnam-gu (강남구)      ...        ...
Gangdong-gu (강동구)       2        530
  ...                    ...        ...
Jung-gu (중구)            24        520
Jungnang-gu (중랑구)       0        330

STBK: the number of Starbucks stores
APT price: average APT price per 1 m^2 (the full 25-district data appear in the code below)
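The model fitted below is a Poisson regression of the store count on price:

$$Y_i\sim\text{Poisson}(\mu_i),\qquad \log\mu_i=\beta_0+\beta_1 x_i,$$

where $Y_i$ is the number of Starbucks stores in district $i$ and $x_i$ the average APT price per m^2.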

y<-c(45, 2,1,4,4,6,4,2,1,0,2,3,10,8,21,3,5,5,3,12,7,1,20,24,0)
x<-c(3373,1907,1115,1413,1286,1861,1218,1018,1250,1135,1240,1528,
1675,1220,2854,1644,1247,2427,2034,1723,2594,1138,1634,1729,1101)
xm<- x/(3.3)                     # convert to price per m^2 (1 pyeong = 3.3 m^2)
( res<- glm(y~xm, family=poisson) )
anova(res)
summary(res)
plot(xm,y,ylab="Starbucks",xlab="APT price/m2")
points(xm,fitted(res),col="red",pch=16)      # exp(predict(res)) = fitted(res)

> summary(res)

Call:
glm(formula = y ~ xm, family = poisson)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
    ...     ...     ...     ...     ...

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...      ...
xm               ...        ...     ...   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance:  ... on 24  degrees of freedom
Residual deviance:  ... on 23  degrees of freedom
AIC: ...

Number of Fisher Scoring iterations: 5

> anova(res)
Analysis of Deviance Table

Model: poisson, link: log
Response: y
Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev
NULL                    24        ...
xm    1      ...        23        ...

Distribution & likelihood

Which of the candidate distributions A, B, C is most likely to have produced the observed data?

A distribution gives the probability of the data for a fixed parameter value; once the data are observed, the same expression read as a function of the parameter is the likelihood.

What is p? x is observed. [Plot: likelihood as a function of p]
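For example, with x successes observed in n binomial trials,

$$L(p)=\binom{n}{x}\,p^{x}(1-p)^{\,n-x},\qquad \hat p_{\text{MLE}}=\frac{x}{n},$$

e.g. x = 7 heads in n = 10 tosses gives $\hat p = 0.7$.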

link function (for Poisson family):  $\log\mu=\eta$

Modelling n counts by their own means $\mu_1,\dots,\mu_n$ takes n parameters; linear modelling for the link function,
$$\log\mu_i=\beta_0+\beta_1 x_i,$$
reduces the number of parameters to the number of regression coefficients.

link function (for binomial family):  $\operatorname{logit}(p)=\log\dfrac{p}{1-p}=\eta$

Linear modelling for the link function:  $\operatorname{logit}(p_i)=\beta_0+\beta_1 x_i$.

Independence test in GLM for Australia rare plants data

> rareplants<-matrix(c(37,23,10,15,190,59,141,58,94,23,28,16),4,)
> dimnames(rareplants)<-list(c("CC","CR","RC","RR"),c("D","W","WD"))
> (sout<- chisq.test(rareplants) )

        Pearson's Chi-squared test

data:  rareplants
X-squared = 34.986, df = 6, p-value = 4.336e-06

> wdx<-rep(c("D","W","WD"),e=4)
> crx<-rep(c("CC","CR","RC","RR"),3)
> rplants<-data.frame(wd=wdx,cr=crx,r=c(rareplants))
> anova( glm(r~wd*cr,family=poisson,data=rplants) )
Analysis of Deviance Table

Model: poisson, link: log, Response: r
Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev
NULL                     11     522.11
wd     2   305.30         9     216.81
cr     3   181.86         6      34.95
wd:cr  6    34.95         0       e-15

> 1-pchisq(34.95,6)
[1] 4.407e-06

       D    W   WD
CC    37  190   94
CR    23   59   23
RC    10  141   28
RR    15   58   16

Homogeneity test in GLM for the coin tossing example

> # H0 : all four coins have the same proportion showing head side
> # H1 : at least one coin has a different proportion from the others
> head4 <- c( 83, 90, 129, 70 )
> toss4 <- c( 86, 93, 136, 82 )
> prop.test(head4, toss4)

        4-sample test for equality of proportions without continuity correction

X-squared = 12.6, df = 3, p-value = 0.005585
alternative hypothesis: two.sided

> coins<-factor(LETTERS[1:4])
> anova(glm(cbind(head4,toss4-head4)~coins,family=binomial))
Analysis of Deviance Table

Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev
NULL                      3     10.667
coins  3   10.667         0       e-14

> 1-pchisq(10.667,3)
[1] 0.01367

               Coin 1  Coin 2  Coin 3  Coin 4
Head (Alive)      83      90     129      70
Tail (Dead)        3       3       7      12
Total             86      93     136      82
            (Hospital 1 to Hospital 4)

Thank you !!