Data mining and statistical learning - lecture 9
Generalized additive models (GAMs)
- Some examples of linear models
- Proc GAM in SAS
- Model selection in GAM
Linear regression models
The inputs can be:
- quantitative inputs
- functions of quantitative inputs
- basis expansions of quantitative inputs
- dummy variables
- interaction terms
Justification of linear regression models
- Many response variables are linearly or almost linearly related to a set of inputs
- Linear models are easy to comprehend and to fit to observed data
- Linear regression models are particularly useful when:
  - the number of cases is moderate
  - data are sparse
  - the signal-to-noise ratio is low
Performance of predictors based on (i) a simple linear regression model and (ii) a quadratic regression model, when the true expected response is a second-order polynomial in the input
[Figure panels: Predictions based on a linear model; Predictions based on a quadratic model]
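The comparison on this slide can be reproduced numerically. The following is a minimal sketch, assuming noise-free data from a hypothetical quadratic (the coefficients 1.0, 0.5 and 2.0 are made up for illustration) and ordinary least squares fitted through the normal equations. The helper names `solve`, `fit_polynomial`, `predict` and `mean_squared_error` are not from the lecture.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients [b0, b1, ..., b_degree]."""
    k = degree + 1
    basis = [[x ** p for p in range(k)] for x in xs]
    xtx = [[sum(row[i] * row[j] for row in basis) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(basis, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(b * x ** p for p, b in enumerate(coefs))


def mean_squared_error(coefs, xs, ys):
    """Average squared prediction error on the given points."""
    return sum((y - predict(coefs, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: the true expected response is a second-order polynomial.
xs = [i / 10 for i in range(-20, 21)]
ys = [1.0 + 0.5 * x + 2.0 * x * x for x in xs]

linear = fit_polynomial(xs, ys, 1)       # model (i)
quadratic = fit_polynomial(xs, ys, 2)    # model (ii)
mse_linear = mean_squared_error(linear, xs, ys)
mse_quadratic = mean_squared_error(quadratic, xs, ys)
```

In this noise-free setting the quadratic model recovers the generating coefficients, while the linear model leaves a systematic curved residual. With noisy, sparse data the simpler model can still predict better, which is the point made on the previous slide.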
Logistic regression of multiple purchases vs first amount spent
Logistic regression for a binary response variable Y
The logit (log-odds) of the expectation of Y given x is a linear function of x:
log(p(x) / (1 - p(x))) = β0 + β1 x, where p(x) = E(Y | x) = P(Y = 1 | x)
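A short sketch of the link between the linear predictor and the expected response in logistic regression. The coefficient values below are hypothetical and chosen only for illustration; they are not estimates from the purchase data on the previous slide.

```python
import math


def logistic(z):
    """Inverse of the logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def expected_response(x, beta0, beta1):
    """E(Y | x) for a binary Y under a logistic regression model."""
    return logistic(beta0 + beta1 * x)


# Hypothetical coefficients (assumption, not fitted values).
beta0, beta1 = -2.0, 0.05
inputs = (0, 40, 80)
probs = [expected_response(x, beta0, beta1) for x in inputs]
log_odds = [logit(p) for p in probs]
```

The probabilities in `probs` are not linear in x, but `log_odds` increases by the same amount for each equal step in x, which is what "linear on the logit scale" means.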
Generalized additive models: some examples
- A nonlinear, additive model
- A mixed linear and nonlinear, additive model
- A mixed linear and nonlinear, additive model with a class variable
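The three model types listed on this slide can be written in a generic form. The notation below is an assumption, since the slide's own formulas are not available: two inputs X_1 and X_2, smooth functions f_1 and f_2, a linear coefficient β, an error term ε, and a class-specific intercept α_k for class k.

```latex
\begin{align*}
\text{Nonlinear, additive:} \quad
  & Y = \alpha + f_1(X_1) + f_2(X_2) + \varepsilon \\
\text{Mixed linear and nonlinear:} \quad
  & Y = \alpha + \beta X_1 + f_2(X_2) + \varepsilon \\
\text{With a class variable:} \quad
  & Y = \alpha_k + \beta X_1 + f_2(X_2) + \varepsilon
\end{align*}
```

In each case the inputs enter through separate terms that are summed, so the contribution of one input can be plotted and interpreted without reference to the others.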
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
[Figure panels: Observed data; Fitted model]
Output: Total-N conc
Inputs: Monthly pattern, Trend function
Modelling the concentration of total nitrogen at Lobith on the Rhine: Extracted additive components
[Figure panels: Year components; Month components]
Weekly mortality and confirmed cases of influenza in Sweden
Response: Weekly mortality
Inputs:
- Confirmed cases of influenza
- Seasonal dummies
- Long-term trend
Syntax for common GAM models

    Type of Model       Syntax                                  Mathematical Form
    Parametric          model y = param(x);                     E(y) = β0 + β1 x
    Nonparametric       model y = spline(x);                    E(y) = β0 + s(x)
    Nonparametric       model y = loess(x);                     E(y) = β0 + s(x)
    Semiparametric      model y = param(x1) spline(x2);         E(y) = β0 + β1 x1 + s(x2)
    Additive            model y = spline(x1) spline(x2);        E(y) = β0 + s1(x1) + s2(x2)
    Thin-plate spline   model y = spline2(x1,x2);               E(y) = β0 + s(x1, x2)
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine

Model 1

    proc gam data=Mining.Rhine;
      model Nconc = spline(Year) spline(Month);
      output out = addmodel1;
    run;

Model 2

    proc gam data=Mining.Rhine;
      model Nconc = spline2(Year, Month);
      output out = addmodel2;
    run;
Proc GAM - degrees of freedom of the spline components
The degrees of freedom of the spline components are selected by the user, or chosen by generalized cross-validation by specifying the model option method=GCV

    proc gam data=Mining.Rhine;
      model Nconc = spline(Year, df=3) spline(Month, df=3);
      output out = addmodel1;
    run;

- df=3 implies that the same cubic polynomial is valid over the entire range of the input
- Increasing the df value implies that knots are introduced
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine

    proc gam data=Mining.Rhine;
      model Nconc = spline(Year) spline(Month);
      output out = addmodel1;
    run;
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine Model 1
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine Model 2 df=4
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine Model 3 df=20
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
Model 1

    The GAM Procedure
    Dependent Variable: Nconc
    Smoothing Model Component(s): spline(Year) spline(Month)

    Summary of Input Data Set
    Number of Observations                    168
    Number of Missing Observations            0
    Distribution                              Gaussian
    Link Function                             Identity

    Iteration Summary and Fit Statistics
    Final Number of Backfitting Iterations    2
    Final Backfitting Criterion               E-30
    The Deviance of the Final Estimate

    The local score algorithm converged.
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
Model 1

    Regression Model Analysis
    Parameter Estimates

    Parameter        Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept                                                        <.0001
    Linear(Year)                                                     <.0001
    Linear(Month)                                                    <.0001

    Smoothing Model Analysis
    Analysis of Deviance

    Source           DF   Sum of Squares   Chi-Square   Pr > ChiSq
    Spline(Year)
    Spline(Month)                                       <.0001
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine Model 2
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine Model 2 (20 df)
Estimation of additive models - the backfitting algorithm
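The backfitting idea can be sketched as follows: start from the mean of the response, then cycle through the inputs, each time smoothing the partial residuals (the response minus the intercept and all other components) against one input, and re-centre the result. This is an illustrative sketch under stated assumptions, not the Proc GAM implementation. It uses a Gaussian-kernel smoother in place of a smoothing spline, and the data, the bandwidth and the function names `kernel_smooth` and `backfit` are hypothetical.

```python
import math


def kernel_smooth(xs, rs, bandwidth):
    """Gaussian-kernel weighted average of rs, evaluated at each point of xs.

    Stands in for the smoothing spline; any scatterplot smoother could be used.
    """
    fitted = []
    for x0 in xs:
        weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
        fitted.append(sum(w * r for w, r in zip(weights, rs)) / sum(weights))
    return fitted


def backfit(columns, ys, bandwidth=0.1, max_iter=50, tol=1e-8):
    """Fit y = alpha + f_1(x_1) + ... + f_p(x_p) by backfitting.

    Returns alpha and a list with the fitted values of each f_j at the data.
    """
    n = len(ys)
    alpha = sum(ys) / n                      # intercept: mean of the response
    funcs = [[0.0] * n for _ in columns]     # all components start at zero
    for _ in range(max_iter):
        change = 0.0
        for j, xs in enumerate(columns):
            # Partial residuals: remove the intercept and all other components.
            partial = [
                ys[i] - alpha - sum(funcs[k][i] for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            new_f = kernel_smooth(xs, partial, bandwidth)
            mean_f = sum(new_f) / n
            new_f = [v - mean_f for v in new_f]   # centre so alpha stays identifiable
            change += sum((a - b) ** 2 for a, b in zip(new_f, funcs[j]))
            funcs[j] = new_f
        if change < tol:                     # components have stopped changing
            break
    return alpha, funcs


# Hypothetical, noise-free data with two additive components.
n = 60
x1 = [i / n for i in range(n)]
x2 = [((7 * i) % n) / n for i in range(n)]
ys = [2.0 + math.sin(2 * math.pi * a) + 4.0 * (b - 0.5) ** 2 for a, b in zip(x1, x2)]

alpha, (f1, f2) = backfit([x1, x2], ys)
fitted = [alpha + a + b for a, b in zip(f1, f2)]
rss_fit = sum((y - f) ** 2 for y, f in zip(ys, fitted))
rss_const = sum((y - alpha) ** 2 for y in ys)
```

Comparing `rss_fit` with `rss_const` shows how much of the variation the two additive components explain relative to a constant fit. Centring each component after smoothing is the usual way to keep the intercept and the components separately identifiable.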
Modelling ln daily electricity consumption as a spline function of the population-weighted mean temperature in Sweden

    proc gam data=sasuser.smhi;
      model lnDaily_consumption = spline(Meantemp, df=20);
      ID Time;
      output out=smhiouttemp pred resid;
    run;
Modelling ln daily electricity consumption as a spline function of the population-weighted mean temperature in Sweden: residual analysis
Modelling ln daily electricity consumption in Sweden - residual analysis
Inputs:
- Spline of temperature
- Spline of Julian day
- Weekday dummies
Modelling ln daily electricity consumption in Sweden - residual analysis
Inputs:
- Spline of temperature
- Spline of Julian day
- Weekday dummies
Inputs:
- Splines of contemporaneous and time-lagged weather data
- Splines of Julian day and time
- Weekday and holiday dummies
Deviance analysis of the investigated models of ln daily electricity consumption in Sweden
- The residual deviance of a fitted model is minus twice its log-likelihood, measured relative to the saturated model that fits every observation exactly
- If the error terms are normally distributed, the deviance is equal to the sum of squared residuals (up to scaling by the error variance)
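The relation between deviance and the sum of squared residuals in the Gaussian case can be checked directly. This is a minimal sketch, assuming a known error variance and defining the deviance as twice the log-likelihood gap to the saturated model. The observations and fitted values are made up, and the function names are not SAS names.

```python
import math


def gaussian_log_likelihood(ys, mus, sigma2):
    """Log-likelihood of independent normal observations with means mus and variance sigma2."""
    n = len(ys)
    rss = sum((y - m) ** 2 for y, m in zip(ys, mus))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)


def gaussian_deviance(ys, mus, sigma2=1.0):
    """Twice the log-likelihood gap between the saturated model and the fitted model."""
    saturated = gaussian_log_likelihood(ys, ys, sigma2)   # fitted value = observation
    fitted = gaussian_log_likelihood(ys, mus, sigma2)
    return 2.0 * (saturated - fitted)


# Made-up observations and fitted values, for illustration only.
ys = [1.0, 2.0, 4.0]
mus = [1.5, 2.0, 3.0]

rss = sum((y - m) ** 2 for y, m in zip(ys, mus))
deviance_unit = gaussian_deviance(ys, mus, sigma2=1.0)
deviance_scaled = gaussian_deviance(ys, mus, sigma2=0.5)
```

With unit variance the deviance coincides with the sum of squared residuals; with another variance it is the sum of squared residuals divided by that variance. The normalizing constant of the normal density cancels in the difference, which is why only the residuals remain.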
Modelling ln daily electricity consumption in Sweden: time series plot of residuals