Dealing with complicated data in time series models
Eric Ward
Feb 16, 2017

Missing data
- Data may be missing (NAs) in the response or predictors
- Some models (functions) require complete datasets
- A variety of interpolation methods exist

Example dataset: Lake WA pH

Simple approach: linear interpolation
- Example series: [3, NA, 4.3, 5.4, NA, 6.1]
- The 2nd observation becomes (3 + 4.3)/2 = 3.65
- The 5th observation becomes (5.4 + 6.1)/2 = 5.75
- Trickier when more data are missing, e.g. [NA, NA, NA, 5.4, NA, 6.1]
- A minimal base-R sketch is below
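
As a minimal base-R sketch (before turning to zoo), approx() does the same linear fill; the vector here is the example above:

y = c(3, NA, 4.3, 5.4, NA, 6.1)
# interpolate at the missing indices using the observed points
filled = approx(x = which(!is.na(y)), y = y[!is.na(y)], xout = seq_along(y))$y
filled  # 3.00 3.65 4.30 5.40 5.75 6.10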

Linear interpolation for longer gaps
- The na.approx() function in the zoo package handles this
- Example for our Lake WA data (missing values filled in via linear interpolation):

library(zoo)
y_new = y
y_new[is.na(y)] = na.approx(y, na.rm = FALSE)[is.na(y)]

Imputed values in red

Two useful extensions
- maxgap: the maximum run of consecutive NAs to fill
  na.approx(y, na.rm = FALSE, maxgap = 3)
- rule (inherited from approx()): should data be extrapolated (rule = 2) or not (rule = 1)?
  na.approx(y, na.rm = FALSE, maxgap = 3, rule = 2)
  The closest data point is used for extrapolation.

Alternative linear interpolation
- Use the rollmean() function in the zoo package to customize the rolling window
- align takes one of "center", "left", or "right", e.g.:
  rollmean(y, k = 3, align = "center")

Alternatives to linear interpolation
- Splines (think GAMs): fit the spline yourself
- Or use other zoo 'na' functions:
  na.locf(): last observation carried forward
  na.spline(): interpolate with a spline

Spline (blue) vs. linear (red)
- Beware near the ends of the time series and near long runs of missing data

# linear interpolation, with imputed points highlighted
y_new = y
y_new[is.na(y)] = na.approx(y, na.rm = FALSE)[is.na(y)]
plot(y_new, col = 1 + as.numeric(is.na(y)))
# spline interpolation, overlaid in blue
y_new[is.na(y)] = na.spline(y, na.rm = FALSE)[is.na(y)]
points(which(is.na(y)), y_new[which(is.na(y))], col = "blue")

Other interpolation approaches
- imputeTS: wrapper for approx() that also offers other splines
- Splines: smooth.spline()
- gam() in the mgcv package

Is interpolating a good idea?
- One consequence is falsely increased precision
- As an example, we can fit a univariate state-space model (with MARSS) to three versions of the data:
  Raw (including missing values)
  Interpolated (linear)
  Interpolated (spline)

How do variances compare?
- Linear interpolation = low observation error

Model           Q         R
Raw             0.0161    0.0148
Linear interp   0.01079   0.01389
Spline interp

# data from the MARSS package; drop 20% of observations at random
y = lakeWAplanktonRaw[1:100, 4]
y[sample(1:100, size = 20, replace = FALSE)] = NA
y_new = y
mod_1 = MARSS(log(y_new))                                # raw, with NAs
y_new[is.na(y)] = na.approx(y, na.rm = FALSE)[is.na(y)]
mod_2 = MARSS(log(y_new))                                # linear interpolation
y_new[is.na(y)] = na.spline(y)[is.na(y)]
mod_3 = MARSS(log(y_new))                                # spline interpolation

Missing covariates
- Missing covariates are generally more of a problem (e.g. in state-space models, DFAs, etc.)
- The same interpolation approaches can be used
- Or, in a Bayesian framework, the missing values can be assigned priors

When missing data are zeros
- Example: Lake WA Daphnia

plot(lakeWAplanktonRaw[, "Daphnia"], main = "Daphnia", ylab = "Density", xlab = "")

Options
- Throw out the observations
- Transform your data
- Work with more complicated statistical models

Data transformations
- ln(y + small constant) is one of the more common approaches
- BUT the choice of constant has an impact on results
- What is adding a constant going to do to the observation and process variances?

Adding constants to Daphnia

Constant   R          Q          U
1          0.002973   0.462834   0.000901
0.1        0.00523    1.20592    0.00419
0.01       0.00881    2.03384    0.00954
0.001      0.0187     2.9671     0.0153
0.0001     0.0714     4.3777     0.0212
0.00001    0.7155     5.62       0.0275

y = lakeWAplanktonRaw[, "Daphnia"]
mod_1 = MARSS(log(y + 1))
mod_2 = MARSS(log(y + 0.1))
mod_3 = MARSS(log(y + 0.01))
mod_4 = MARSS(log(y + 0.001))
mod_5 = MARSS(log(y + 0.0001))
mod_6 = MARSS(log(y + 0.00001))

Alternate statistical distributions
- Use a time series model (MARSS) or other statistical model for the positive values (> 0)
- Apply logistic regression (or a more complicated model) to the zeros

Delta-GLMs
- The density of marine fishes almost always fits this pattern (zero inflated)

Delta-GLM or 'hurdle models'
- Break the response into 2 parts:
  Presence / absence
  Positive density
- 2 separate GLMs:
  May include different covariates
  If we include random effects / shared terms, they usually aren't correlated across models
  Different data + different link functions = interpretation can be awkward
- Results from both models are combined for estimates of total density

Example with Daphnia
- Example code for the presence/absence GLM (using a named time covariate so predict() can match it in newdata):

y_int = ifelse(y > 0, 1, 0)
t = seq(1, length(y_int))
mod = glm(y_int ~ t, family = "binomial")
pred = predict(mod, newdata = data.frame(t = 1:396), type = "response", se.fit = TRUE)

Probability of Daphnia

plot(pred$fit, type = "l", lwd = 2, ylab = "Pr(present)", xlab = "")
lines(pred$fit + pred$se.fit)
lines(pred$fit - pred$se.fit)

Positive model
- Convert 0s to NAs:
  y[which(y == 0)] = NA
- Fit the MARSS model:
  mod = MARSS(log(y))
- Predictions (back-transformed from log space):
  exp(c(mod$states))

Estimating total density
- Multiply Pr(present) * E[density | present]
- Standard errors are more complicated: delta method, Monte Carlo simulation, etc.
- A minimal sketch of the point estimate is below
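
As a sketch (assuming pred from the binomial GLM and mod from the MARSS fit above, on the same time steps), the point estimate of total density is the elementwise product:

# E[density] = Pr(present) * E[density | present]
pr_present = pred$fit
e_pos = exp(c(mod$states))
total_density = pr_present * e_pos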

Combined model: E(present) * E(pos|present)

Other types of response data may be non-normal

We can model the response as a function of predictors using link functions
- Defaults in GLMs:
  Binomial data: logit link
  Poisson, negative binomial, Gamma, lognormal: log link
- Note that these formulas don't include additional error (unlike Gaussian regression)
- A short glm() sketch follows
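
For illustration, a minimal sketch of specifying families and links in base R's glm() (x, y_int, counts, y_pos are hypothetical placeholders):

# logit link is the default for binomial responses
glm(y_int ~ x, family = binomial(link = "logit"))
# log link is the default for Poisson counts
glm(counts ~ x, family = poisson(link = "log"))
# Gamma defaults to an inverse link, so a log link is set explicitly
glm(y_pos ~ x, family = Gamma(link = "log"))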

GLMMs
- Including additional variation turns GLMs into GLMMs
- More data hungry, but flexible
- Random effects allow us to turn ordinary GLMMs into time series models or models with spatial effects (see the sketch below)
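
For example, a hedged sketch of adding a random intercept in glmmTMB (dat, site, and x are hypothetical placeholders):

library(glmmTMB)
# a random intercept per site turns the GLM into a GLMM
mod = glmmTMB(y ~ x + (1 | site), data = dat, family = poisson)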

Where have we seen this before?
- We could construct a DLM with a binomial response (or any other distribution)

Univariate -> multivariate
- For population i, as in MARSS models, we need to think about how to model the deviations:
  Independent with a shared variance across populations?
  Independent with unique variances across populations?
  "equalvarcov"
  Unconstrained
  Covariance modeled as spatially correlated
- A sketch of these options in MARSS syntax follows
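
As a sketch, these structures map onto the Q specification of a MARSS model (assuming a multivariate series y_multi; all other model elements left at their defaults):

library(MARSS)
m1 = MARSS(y_multi, model = list(Q = "diagonal and equal"))    # independent, shared variance
m2 = MARSS(y_multi, model = list(Q = "diagonal and unequal"))  # independent, unique variances
m3 = MARSS(y_multi, model = list(Q = "equalvarcov"))           # equal variances and covariances
m4 = MARSS(y_multi, model = list(Q = "unconstrained"))         # fully unconstrained covariance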

Powerful functions for estimating non-normal response data
- rstanarm: extension of Bayesian regression, GLMs, GLMMs, etc.
- glmmTMB (or lme4, bbmle, etc.):
  Fast maximum likelihood estimation
  Same formula syntax as glm(), lm(), etc.

BUT
- These approaches don't incorporate the time series aspect of our data
- Other packages do: tscount, etc.
- Non-normal families can also be included in our function fit_stan()

Example: time series of E. coli cases

# weekly counts of reported E. coli infections, from the tscount package
y = tscount::ecoli$cases

glmmTMB

library(glmmTMB)
# week and year are columns of the ecoli data frame
mod = glmmTMB(cases ~ week + year, data = tscount::ecoli, family = "poisson")

But residuals show a problem
- Not independent: the ACF shows they're strongly correlated (see the check below)
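
A quick check (assuming mod is the glmmTMB fit above):

# autocorrelation of the residuals; large spikes violate independence
acf(residuals(mod))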

tscount

library(tscount)
mod = tsglm(y, link = "log", distr = "poisson")

Added complexity
- Observations at time t can be made a function of observations or conditional means at previous time steps:
  model = list(past_obs = NULL, past_mean = NULL, ...)
- The size of the moving window is also flexible
- Covariates can be included: time or external predictors
- A minimal sketch follows
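
A minimal sketch with illustrative lag choices (past_obs = 1 and past_mean = 13 are examples, not recommendations):

# condition on the observation at lag 1 and the conditional mean at lag 13
mod = tsglm(y, model = list(past_obs = 1, past_mean = 13),
            link = "log", distr = "poisson")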

Implementation in Stan
- fit_stan() has a 'family' argument which can be specified as "gaussian", "binomial", "poisson", "gamma", "lognormal", "negative-binomial", etc.
- Only for the following models: regression, DLMs, 'MARSS' models

Poisson regression

mod = fit_stan(y, x = model.matrix(lm(y ~ 1)), model = "regression", family = "poisson")
# pars holds the posterior draws (e.g. pars = rstan::extract(mod))
plot(apply(pars$pred, 2, mean), type = "l", lwd = 3)
lines(apply(pars$pred, 2, quantile, 0.025))
lines(apply(pars$pred, 2, quantile, 0.975))
points(y, col = "red", cex = 0.3)

Implementation of DLM
- We'll fit a model with a time-varying level (mean); no covariates included

mod = fit_stan(y, model = "dlm-intercept", family = "poisson")

Now capturing the data much better!

mod = fit_stan(y, model = "dlm-intercept", family = "poisson")
# pars holds the posterior draws (e.g. pars = rstan::extract(mod))
plot(apply(exp(pars$intercept), 2, mean), type = "l", lwd = 3,
     ylim = c(0, 100), ylab = "E coli", xlab = "")
lines(apply(exp(pars$intercept), 2, quantile, 0.025))
lines(apply(exp(pars$intercept), 2, quantile, 0.975))
points(y, col = "red", cex = 0.3)

Predicted vs observed

Residuals look much better
- Still some negative ACF at lags 1-2