Lecture 8 Generalized Additive Models Olivier MISSA, Advanced Research Skills.



2 Outline: Introduce you to Generalized Additive Models.

3 GAMs vs. GLMs. Where a GLM models each predictor with a linear term, a GAM replaces that term with a smooth function of the predictor (e.g. a smooth function of x1).
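Written out, the contrast on this slide is that both models keep a link function g, and the GAM simply swaps each linear term for a smooth function estimated from the data:

```latex
\text{GLM:}\quad g\!\left(\mathbb{E}[y]\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
\qquad
\text{GAM:}\quad g\!\left(\mathbb{E}[y]\right) = \beta_0 + f_1(x_1) + \cdots + f_p(x_p)
```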

4 "Smooth" functions. A number of algorithms are available to fit them, all known generically as "splines". The most frequently used are Thin Plate Regression Splines (the default) and Cubic Regression Splines.

5 More efficient than polynomials. [Slide figure: a Thin Plate Regression Spline fit (est. degrees of freedom: 8.69, deviance explained: 79.8%) compared with degree-5 and degree-10 polynomial fits (degrees of freedom: 10, deviance explained: 78%).]

6 More efficient than polynomials. There is no need to specify the degrees of freedom (the "wiggliness") of the smooth function: the algorithm finds the optimal amount of smoothing for us, and avoids overfitting by cross-validation (the 'leave-one-out' trick).
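The Generalised Cross-Validation (GCV) score that appears later in the summary output is what the algorithm minimises instead of literally refitting n times; for a Gaussian model, one standard form (stated here from the general smoothing literature, not from the slides) is:

```latex
\mathcal{V}_g \;=\; \frac{n \,\lVert \mathbf{y} - \hat{\boldsymbol{\mu}} \rVert^{2}}{\left[\, n - \operatorname{tr}(\mathbf{A}) \,\right]^{2}}
```

where A is the influence ("hat") matrix, so tr(A) plays the role of the effective degrees of freedom.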

7 Example: ozone pollution (revisiting the ozone dataset of Lecture 4, Linear Models III).
> library(mgcv)  ## gam() comes from the mgcv package
> ozone.pollution <- read.table("ozone.data.txt", header=T)  ## in datasets folder
> names(ozone.pollution)
[1] "rad"   "temp"  "wind"  "ozone"
> attach(ozone.pollution)
> modgam <- gam(ozone ~ s(rad) + s(temp) + s(wind))
> plot(ozone ~ modgam$fitted, pch=16)
> abline(0, 1, col="red")

8 Ozone Example
> summary(modgam)
Family: gaussian
Link function: identity
...
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               <2e-16 ***
Approximate significance of smooth terms:
         edf Ref.df     F  p-value
s(rad)                                **
s(temp)                         e-09 ***
s(wind)                         e-08 ***
R-sq.(adj) =      Deviance explained = 74.8%
GCV score = 338   Scale est. =     n = 111
## GCV = Generalised Cross-Validation

9 Ozone Example
> plot(modgam, residuals=T, pch=16)  ## one panel per smooth: rad, temp, wind
> modgam2 <- update(modgam, ~ . - s(rad))
> modgam3 <- update(modgam, ~ . - s(temp))
> modgam4 <- update(modgam, ~ . - s(wind))

10 Ozone Example
> anova(modgam, modgam2, test="F")
Model 1: ozone ~ s(rad) + s(temp) + s(wind)
Model 2: ozone ~ s(temp) + s(wind)
  Resid. Df Resid. Dev Df Deviance    F   Pr(>F)       **
> anova(modgam, modgam3, test="F")
Model 1: ozone ~ s(rad) + s(temp) + s(wind)
Model 2: ozone ~ s(rad) + s(wind)
  Resid. Df Resid. Dev Df Deviance    F   Pr(>F)  e-09 ***
> anova(modgam, modgam4, test="F")
Model 1: ozone ~ s(rad) + s(temp) + s(wind)
Model 2: ozone ~ s(rad) + s(temp)
  Resid. Df Resid. Dev Df Deviance    F   Pr(>F)  e-09 ***

11 Ozone Example
> modgam5 <- gam(ozone ~ rad + s(temp) + s(wind))  ## rad now enters as a linear term
> summary(modgam5)
Family: gaussian
Link function: identity
...
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               e-11 ***
rad                                            **
---
Approximate significance of smooth terms:
          edf Ref.df     F  p-value
s(temp)                         e-09 ***
s(wind)                         e-09 ***
---
R-sq.(adj) =      Deviance explained = 73.4%
GCV score =       Scale est. =     n = 111

12 Ozone Example
> anova(modgam, modgam5, test="F")
Analysis of Deviance Table
Model 1: ozone ~ s(rad) + s(temp) + s(wind)
Model 2: ozone ~ rad + s(temp) + s(wind)
  Resid. Df Resid. Dev Df Deviance    F   Pr(>F)
> shapiro.test(residuals(modgam5, type="deviance"))
        Shapiro-Wilk normality test
data:  residuals(modgam5, type = "deviance")
W = , p-value = 1.999e-06
> resid <- residuals(modgam5, type="deviance")

13 Ozone Example
> qqnorm(resid, pch=16)
> qqline(resid, lwd=2, col="red")
> plot(resid ~ fitted(modgam5), pch=16)
> abline(h=0, col="gray85", lty=2)

14 Ozone Example
> plot(sqrt(abs(resid)) ~ fitted(modgam5), pch=16)
> lines(lowess(sqrt(abs(resid)) ~ fitted(modgam5)), lwd=2, col="red")
> plot(cooks.distance(modgam5), type="h")

15 Possible Refinements
Specify a different family than Gaussian:
> modgam <- gam(resp ~ pred1 + s(pred2) + s(pred3), family=poisson(link="log"))
Specify a different spline basis than Thin Plate (here a cubic regression spline):
> modgam <- gam(resp ~ pred1 + s(pred2, bs="cr") + s(pred3))
Specify a maximum number of degrees of freedom for the spline:
> modgam <- gam(resp ~ pred1 + s(pred2, k=5) + s(pred3))
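These options combine freely. A self-contained sketch on simulated data (the variables x1, x2 and y below are made up for illustration; they are not from the lecture):

```r
library(mgcv)  # recommended package, ships with R

set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
# Poisson counts whose log-mean is linear in x1 but wiggly in x2
y  <- rpois(n, lambda = exp(0.5 + x1 + sin(2 * pi * x2)))

# All three refinements in one call: Poisson family with log link,
# cubic regression spline basis (bs="cr"), basis dimension capped at k=5
m <- gam(y ~ x1 + s(x2, bs = "cr", k = 5), family = poisson(link = "log"))
summary(m)
```

With k = 5 the smooth for x2 can use at most 4 effective degrees of freedom (one basis function is absorbed by the identifiability constraint).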

16 1st Example
> plot(c(0,1) ~ c(1,32), type="n", log="x", xlab="dose", ylab="Probability")
> text(dose, numdead/20, labels=as.character(sex))
> ld <- seq(0, 32, 0.5)
> lines(ld, predict(modb3, data.frame(ldose=log2(ld),
        sex=factor(rep("M", length(ld)), levels=levels(sex))), type="response"))
> lines(ld, predict(modb3, data.frame(ldose=log2(ld),
        sex=factor(rep("F", length(ld)), levels=levels(sex))), type="response"),
        lty=2, col="red")

17 1st Example
> modbp <- glm(SF ~ sex*ldose, family=binomial(link="probit"))
> AIC(modbp)
[1]
> modbc <- glm(SF ~ sex*ldose, family=binomial(link="cloglog"))
> AIC(modbc)
[1]
> AIC(modb3)
[1]

18 1st Example (logit scale)
> summary(modb3)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                               e-13 ***
sexM                                           **
ldose                                     e-16 ***
---
> exp(modb3$coeff)  ## careful, it may be misleading
(Intercept)        sexM       ldose
## odds ratio: p / (1-p)
> exp(modb3$coeff[1] + modb3$coeff[2])  ## odds for males
(Intercept)
Every doubling of the dose will lead to an increase in the odds of dying over surviving by a factor of 2.899.
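The factor 2.899 on the slide is exp of the ldose coefficient: because the dose enters the model on a log2 scale, a one-unit increase in ldose corresponds to a doubling of the dose, so on the logit scale:

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_{\mathrm{sexM}}\,[\mathrm{sex}{=}\mathrm{M}] + \beta_{\mathrm{ldose}}\,\mathrm{ldose}
\quad\Rightarrow\quad
\frac{\mathrm{odds}(2d)}{\mathrm{odds}(d)} = e^{\beta_{\mathrm{ldose}}} \approx 2.899
```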

19 2nd Example: Erythrocyte Sedimentation Rate (ESR) in a group of patients. Two groups: ESR < 20 mm/hour and ESR > 20 mm/hour (ill). Q: Is it related to globulin & fibrinogen levels in the blood?
> data("plasma", package="HSAUR")
> str(plasma)
'data.frame': 32 obs. of 3 variables:
 $ fibrinogen: num
 $ globulin  : int
 $ ESR       : Factor w/ 2 levels "ESR < 20","ESR > 20":
> summary(plasma)
   fibrinogen      globulin           ESR
 Min.   :2.090   Min.   :28.00   ESR < 20:26
 1st Qu.:        1st Qu.:31.75   ESR > 20: 6
 Median :2.600   Median :36.00
 Mean   :2.789   Mean   :
 3rd Qu.:        3rd Qu.:38.00
 Max.   :5.060   Max.   :46.00

20 2nd Example
> stripchart(globulin ~ ESR, vertical=T, data=plasma,
        xlab="Erythrocyte Sedimentation Rate (mm/hr)",
        ylab="Globulin blood level", method="jitter")
> stripchart(fibrinogen ~ ESR, vertical=T, data=plasma,
        xlab="Erythrocyte Sedimentation Rate (mm/hr)",
        ylab="Fibrinogen blood level", method="jitter")

21 2nd Example
> mod1 <- glm(ESR ~ fibrinogen, data=plasma, family=binomial)  ## ESR is a factor
> summary(mod1)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                     *
fibrinogen                                      *
---
(Dispersion parameter for binomial family taken to be 1)
Null deviance:     on 31 degrees of freedom
Residual deviance:     on 30 degrees of freedom
AIC:
> mod2 <- glm(ESR ~ fibrinogen + globulin, data=plasma, family=binomial)
> AIC(mod2)
[1]

22 2nd Example
> anova(mod1, mod2, test="Chisq")
Analysis of Deviance Table
Model 1: ESR ~ fibrinogen
Model 2: ESR ~ fibrinogen + globulin
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
> summary(mod2)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                     *
fibrinogen                                      *
globulin
(Dispersion parameter for binomial family taken to be 1)
Null deviance:     on 31 degrees of freedom
Residual deviance:     on 29 degrees of freedom
AIC:
The difference in deviance between these two models is not significant, which leads us to select the less complex model.
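The test on this slide is the standard likelihood-ratio comparison of nested GLMs: under the null hypothesis that the extra term (globulin) has no effect, the drop in deviance is approximately chi-squared with degrees of freedom equal to the difference in residual degrees of freedom (here 30 - 29 = 1):

```latex
D_{\text{mod1}} - D_{\text{mod2}} \;\mathrel{\dot\sim}\; \chi^{2}_{\,\mathrm{df}_{\text{mod1}} - \mathrm{df}_{\text{mod2}}}
```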

23 2nd Example
> shapiro.test(residuals(mod1, type="deviance"))
        Shapiro-Wilk normality test
data:  residuals(mod1, type = "deviance")
W = , p-value = 5.465e-07
> par(mfrow=c(2,2))
> plot(mod1)