Generalized linear MIXED models

Generalized linear MIXED models Claudia von Brömssen Dept of economics Unit of applied statistics and mathematics

Introduction Methods used yesterday all depend on the independence of observations. All collected data should be - true replicates - not clustered - not measured several times over a time period

Independence and replicates 1. Does fish length of species A vary with land use? river 1 (lies in forest): 25, 27, 34, 22, 26 river 2 (lies in agricultural area): 42, 36, 29, 35 river 3 (lies in a mixed area): 34, 27, 32, 41 2. Does fish length of species A vary between rivers? river 1: 25, 27, 34, 22, 26 river 2: 42, 36, 29, 35 river 3: 34, 27, 32, 41 Why can question 2 be answered with statistical methods but not 1.?

Independence and replicates 2. Does fish length of species A vary between rivers? river 1: river 2: river 3: Population 1: all fish in river 1 Population 2: all fish in river 2 Population 3: all fish in river 3 Observations: individual fish of species A in each river = independent fish representing the population
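Question 2 can be answered directly because each fish is an independent observation from its own river. A minimal sketch in R, using the fish lengths listed in the example above; the object names (`fish`, `river_means`, `fit`) are illustrative and not part of the original material, and a one-way ANOVA is only one reasonable choice of method.

```r
# Fish lengths of species A, as listed in the example above
fish <- data.frame(
  river  = factor(rep(c("river 1", "river 2", "river 3"), times = c(5, 4, 4))),
  length = c(25, 27, 34, 22, 26,   # river 1
             42, 36, 29, 35,       # river 2
             34, 27, 32, 41)       # river 3
)

# Mean length per river
river_means <- tapply(fish$length, fish$river, mean)
print(river_means)

# One-way ANOVA: the variation between fish within a river is the yardstick
# against which differences between rivers are judged
fit <- lm(length ~ river, data = fish)
print(anova(fit))
```

The same table cannot answer question 1, because each land-use type is represented by a single river.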

Independence and replicates 1. Does fish length of species A vary with land use? river 1: lies in forest river 2: lies in agricultural area river 3: lies in a mixed area Population 1: fish in rivers in forests Population 2: fish in rivers in agricultural areas Population 3: fish in rivers in mixed areas Observations: the 5 observations in river 1 represent the river but not the population. They are not true replicates but pseudoreplicates.

Independence and replicates 1. Does fish length of species A vary with land use? forest area: rivers 1a, 1b and 1c agricultural area: rivers 2a, 2b, 2c and 2d mixed area: rivers 3a, 3b and 3c Population 1: fish in rivers in forests Population 2: fish in rivers in agricultural areas Population 3: fish in rivers in mixed areas Rivers 1a, 1b and 1c are replicates and represent population 1.

Experimental units When conducting experiments, the experimental unit is the smallest unit that can receive an individual treatment: If you have cows in a box, each cow can get its own diet -> to compare diets, cows are the experimental units, and several cows getting the same diet are replicates. If you treat plots in a forest with a special treatment, the plots are the experimental units. If instead each leaf can get a different treatment, the leaves are the experimental units.

Experimental units Experimental units = independent observations are needed to quantify the variation in the data: how much variation can we expect from completely unconnected individuals/subjects/sites?

Dependent data Often it is easier or of special interest to collect dependent data. Time series/repeated measurements: we are interested in how the treatment affects the experimental unit over time. Clustered/hierarchical data: it is easier and gives a better representation to collect several leaves from several trees within the same experimental plot.

Dependent data If the dependence in the data is ignored in the analysis, this can lead to bias in the estimates and to an underestimation of the variation, which in turn gives p-values that are too low and therefore misleading. If you want to conduct a study that includes dependent data, plan this thoroughly before data collection.
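The effect of ignoring dependence can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not part of the original material: it simulates fish lengths with no true land-use effect, where fish from the same river share a river effect, and compares a naive t-test on individual fish with a t-test on river means. All names and numerical settings (`n_rivers`, `n_fish`, the standard deviations) are made up for the example.

```r
set.seed(1)

n_sim    <- 500  # number of simulated studies
n_rivers <- 3    # rivers per land-use type
n_fish   <- 10   # fish measured per river

simulate_once <- function() {
  land_use <- rep(c("forest", "agricultural"), each = n_rivers)  # one entry per river
  river_id <- rep(seq_len(2 * n_rivers), each = n_fish)          # one entry per fish

  # No true land-use effect: lengths differ only by river and by individual fish
  river_effect <- rnorm(2 * n_rivers, sd = 3)
  fish_length  <- 30 + river_effect[river_id] + rnorm(length(river_id), sd = 3)

  # Naive analysis: every fish is treated as an independent replicate
  fish_land_use <- land_use[river_id]
  p_naive <- t.test(fish_length ~ fish_land_use)$p.value

  # Analysis on the level of the true replicates: one mean per river
  river_mean <- as.numeric(tapply(fish_length, river_id, mean))
  p_means <- t.test(river_mean ~ land_use)$p.value

  c(naive = p_naive, river_means = p_means)
}

p_values <- replicate(n_sim, simulate_once())

# Share of simulated studies declaring a land-use effect at the 5% level
false_positive_rate <- rowMeans(p_values < 0.05)
print(false_positive_rate)
```

Because there is no land-use effect in the simulated data, every significant result is a false positive; the naive analysis produces them far more often than the nominal 5%.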

Dependent data Observe that I am talking about dependencies/independence of observations. Dependencies between variables are desirable for multivariate methods. Dependencies between explanatory variables in general or generalised linear models can be a problem if correlations are very high.

Models for dependent observations - examples If it is important to follow a treatment over time we could make observations on the same plot several times (several days after the treatment, several months after the treatment,…). Data for each plot have a time series structure and measurements on the same plot are not independent. The time series structure is incorporated in the model. We often call these models ’repeated measures models’.

Models for dependent observations - examples To make estimates better we could choose to take measurements several times on the same plot (but at the same time point). This data structure is called clustered or hierarchical and we can use the data to get some idea of how large the variation within the plot is.

Mixed models Data with such structures are analysed with mixed models, where different types of random factors or random effects account for the dependencies in the data. Mixed models in R can be fitted with different functions and packages, all with some restrictions. We will use the functions glmer (package lme4) and glmmPQL (package MASS).

Examples - Lophodermium For the Lophodermium data set there were actually 2 forests observed at each site:
   sample   site  forest  Latitud  veg_period  vegetation_zone  status
1  Sk1G07   1     1       55.9     205         Nemoral          Healthy
2  Th1G07   2     1       56.7     205         Nemoral          Healthy
3  Th2G07   2     2       56.5     205         Nemoral          Healthy
4  Bo1G07   3     1       58.6     205         Nemoral          Healthy
5  Bo2G07   3     2       58.6     205         Nemoral          Healthy
6  Asa1G07  4     1       57.2     185         Hemi             Healthy
7  Asa2G07  4     2       57.2     185         Hemi             Healthy

Examples - Lophodermium Since we now have 2 forests observed for most sites, the two forests at the same site cannot really be regarded as independent of each other. The results from these two forests are probably similar because they are geographically close. We can assume a hierarchical structure. In the model this amounts to estimating variance components for the site and for the forests within each site.

Fixed and random effects $g(\mu_{ij}) = \mu + \beta_i + a_j + e_{ij}$, where $\beta$ is a factor effect (e.g. healthy/sick) and $a$ is a random effect (e.g. of the forest within each site). Generally, the factor effects or fixed effects are the ones that we are interested in modelling, whereas the random effects are there to represent the study design or experimental design.

Fixed and random effects If we only look at the random effect, site: $\mu_1 = \mu + a_1$, $\mu_2 = \mu + a_2$, $\mu_3 = \mu + a_3$, where $a_j$ is random with $E(a_j) = 0$ and takes a different value for each site. The different sites are included in the experiment since they represent different conditions. The variation in the proportion of X6 between the sites is $\sigma_A^2$.

Fixed and random effects It is usually not interesting to learn more about the different levels of a random factor. If we used site as a fixed factor, we would estimate the mean proportion of species 6 for each of the sites. We would make 18 estimates, one for each site except for one. When we treat site as a random factor, we only estimate one parameter: the variance between the different sites.

Fixed and random effects [Figure: the hierarchical structure. Sites 1, 2 and 3 each contain experimental units, with several measurements on the same unit.]

Fixed and random effects Since the forests can be affected by the common factor site we do not see them as independent. Forests within the same site can be more similar than forests from different sites. We model this effect by including the random factor site in the model.

Fixed and random effects We also make several measurements on each forest: we measure both sick and healthy needles in each of the forests and we observe all forests in both 2006 and 2007. The fixed factors status and needle_cohort are nested within forest. In agricultural experiments this type of model is often called a split-plot model.

Fixed and random effects The factor ’site’ is on the large-scale level. It coincides with latitude and to some extent with vegetation_zone. Both forests are observed on the same level of ’site’. The factors status and year are on the small-scale level. They can be observed separately for each of the forests.

Loph: Consideration regarding the factor site In this study design data was collected at different sites. At each site both healthy and diseased needles were collected during both 2006 and 2007. Some measurements are however missing. The correct model yesterday would also need to include the site variable to adjust for local levels. In our model, however, this part was taken by the latitude variable. We could choose to replace the latitude with the site variable (which gives less information) or use the site variable as a random factor and keep latitude in the model as well.

Mixed models for Lophodermium With the type of model we use now, we can easily include the factor ’site’ as a random variable. Forest is also included as a random variable. We assume that both sites and forests are randomly selected from all sites and forests available.

Mixed models for Lophodermium We now need to change to an R package that can fit mixed models. There are several of them, but we start with the glmer function. In glmer we write the model basically the same way as in glm, but we can include random variables by putting them in parentheses:
(1|site) for a random site
(1|site/forest) for a random forest within a random site

Mixed models for Lophodermium
Model1 <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1 | site/forest),
                family = binomial, data = Loph2)
Model3 <- glmer(X6_reads ~ Latitud + status + needle_cohort + (1 | site/forest),
                family = poisson, offset = log_reads, data = Loph2)

Mixed models for Lophodermium
Random effects:
 Groups      Name        Variance Std.Dev.
 forest:site (Intercept) 1.807    1.344
 site        (Intercept) 0.000    0.000
Number of obs: 69, groups:  forest:site, 20; site, 10
Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -28.06088    5.62657   -4.99 6.13e-07 ***
Latitud             0.41701    0.09307    4.48 7.44e-06 ***
statusHealthy      -5.33618    0.13961  -38.22  < 2e-16 ***
needle_cohort2007   0.46179    0.03535   13.06  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
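The nested random-effects syntax can be tried out on simulated data. The following is a minimal sketch, assuming the lme4 package is installed; the data frame `d`, the column `x_reads` and all simulation settings are invented for illustration and only mimic the site/forest/status structure of the Lophodermium data.

```r
library(lme4)

set.seed(42)
n_site <- 10

# Two forests per site, diseased and healthy needles sampled in each forest
d <- expand.grid(status = c("Diseased", "Healthy"),
                 forest = 1:2,
                 site   = 1:n_site)
d$site   <- factor(d$site)
d$forest <- factor(d$forest)

# Simulated random effects: one value per site, one per forest within site
site_effect   <- rnorm(n_site, sd = 0.5)
forest_effect <- rnorm(2 * n_site, sd = 0.8)
forest_id     <- (as.integer(d$site) - 1) * 2 + as.integer(d$forest)

# Linear predictor on the logit scale and simulated read counts
eta <- -1 + 1.0 * (d$status == "Healthy") +
  site_effect[as.integer(d$site)] + forest_effect[forest_id]
d$reads   <- 200
d$x_reads <- rbinom(nrow(d), size = d$reads, prob = plogis(eta))

# (1 | site/forest) is shorthand for (1 | site) + (1 | site:forest)
fit <- glmer(cbind(x_reads, reads - x_reads) ~ status + (1 | site/forest),
             family = binomial, data = d)

print(VarCorr(fit))  # estimated variance components
print(ngrps(fit))    # number of levels of each grouping factor
```

Because forests are labelled 1 and 2 within every site, the nesting in the formula is what tells the model that forest 1 at one site is different from forest 1 at another.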

Mixed models for Lophodermium Quasi-binomial and quasi-Poisson families do not work with glmer. Instead we need to account for overdispersion with yet another method: a random residual. The idea is to estimate a separate variance for the residuals of the model and adjust the p-values for that.
Model1b <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1 | site/forest) + (1 | sample),
                 family = binomial, data = Loph2)
Model1a <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1 | site/forest/sample),
                 family = binomial, data = Loph2)

Mixed models for Lophodermium
Random effects:
 Groups               Name        Variance  Std.Dev.
 sample:(forest:site) (Intercept) 1.996e+00 1.4127259
 forest:site          (Intercept) 8.703e-01 0.9329018
 site                 (Intercept) 1.508e-08 0.0001228
Number of obs: 69, groups:  sample:(forest:site), 69; forest:site, 20; site, 10
Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -28.61700    5.62712  -5.086 3.67e-07 ***
Latitud             0.42339    0.09242   4.581 4.63e-06 ***
statusHealthy      -6.34987    0.54205 -11.715  < 2e-16 ***
needle_cohort2007   0.40321    0.42026   0.959    0.337
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
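The effect of the random residual on the standard errors can be reproduced on simulated data. This is a minimal sketch, assuming the lme4 package is installed; the data, the names (`d`, `fit_plain`, `fit_olre`) and the simulation settings are hypothetical, and the model is deliberately simpler than the Lophodermium models above.

```r
library(lme4)

set.seed(7)

# 10 sites with 6 samples each; every row is one sample
d <- data.frame(site   = factor(rep(1:10, each = 6)),
                status = rep(c("Diseased", "Healthy"), times = 30))
d$sample <- factor(seq_len(nrow(d)))  # one level per observation
d$reads  <- 500

# Extra sample-to-sample noise on the logit scale creates overdispersion
eta <- -1 + 1.0 * (d$status == "Healthy") +
  rnorm(10, sd = 0.5)[as.integer(d$site)] +
  rnorm(nrow(d), sd = 1)
d$x_reads <- rbinom(nrow(d), size = d$reads, prob = plogis(eta))

# Model without a random residual: overdispersion is ignored
fit_plain <- glmer(cbind(x_reads, reads - x_reads) ~ status + (1 | site),
                   family = binomial, data = d)

# Model with a random residual, i.e. a random effect with one level per sample
fit_olre <- glmer(cbind(x_reads, reads - x_reads) ~ status + (1 | site) + (1 | sample),
                  family = binomial, data = d)

se_plain <- coef(summary(fit_plain))[, "Std. Error"]
se_olre  <- coef(summary(fit_olre))[, "Std. Error"]
print(rbind(plain = se_plain, random_residual = se_olre))
```

As in the two Lophodermium outputs, the standard errors grow once the extra variation between samples is estimated, so the tests become more conservative.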

Mixed models for Lophodermium manyglm and edgeR do not seem to offer any way to account for hierarchical or other mixed structures.

Mixed models for Fungi In the second example of yesterday's computer lab we analysed fungi data at a number of sites. The sites in this example are the replicates made in the experiment, the experimental units. At each site we observe a specific combination of tree type, CO2 (yes/no) and Warmed (yes/no). At each experimental unit we make 3 observations (= the three different horizons).

Mixed models for Fungi The structure is similar to the Lophodermium example, where we also had several measurements at each site, but in this case the measurements represent different soil layers. This means that they have a meaning and a specific order.

Mixed models for Fungi In such cases we usually assume that there is a correlation between measurements at the same site. If a measurement is made at a site with a high probability of species 3, the probability will be high at all horizons. Part of this correlation between horizons is described by the model, since we include horizon as a factor. There can, however, still be correlations in the residuals of the model, i.e. the data are not independent.

Mixed models for Fungi [Figure labels: "correlated"; "also correlated, but less"] We can assume that the observations made at the same site are correlated with each other. Observations made close to each other are more strongly correlated than observations farther apart.

Mixed models for Fungi The correlation between layers can be estimated, and the standard errors and p-values are adjusted accordingly. The correlations are estimated on the residuals, i.e. after the model is fitted, to see if there is any remaining dependence between the layers.
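One common way to express that neighbouring horizons are more strongly correlated than horizons farther apart is a first-order autoregressive structure. The slides do not state which correlation structure was used, so the matrix below is only an illustrative assumption, with $e_{s1}, e_{s2}, e_{s3}$ denoting the residuals of the three horizons at site $s$ and $\rho$ a single correlation parameter.

```latex
\operatorname{Corr}
\begin{pmatrix} e_{s1} \\ e_{s2} \\ e_{s3} \end{pmatrix}
=
\begin{pmatrix}
1      & \rho & \rho^2 \\
\rho   & 1    & \rho   \\
\rho^2 & \rho & 1
\end{pmatrix},
\qquad 0 < \rho < 1
```

Adjacent horizons have correlation $\rho$, while the top and bottom horizons have the smaller correlation $\rho^2$; residuals from different sites are assumed uncorrelated.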

Mixed models for Fungi Since horizon is actually a rather important factor in the model, we should also consider interactions between the other factors and horizon. The effect of tree type could be different at different soil horizons. For X3, however, we will not be able to estimate this interaction.

Generalised linear models - overview We use logistic regression or Poisson regression as base models. For DNA sequencing data or similar data specific procedures often use the negative binomial distribution, since overdispersion is almost always observed.

Generalised linear models - overview For these types of models you need to have the data observed as counts. If your response variable is a proportion that cannot be traced back to counts, you use general linear models with a normal distribution for the error term. Sometimes this will require a transformation of the observed data before the model can be fitted. (Look at residual plots to check if the residuals are normally distributed and have equal variances.) If normality does not hold, use nonparametric methods.

Overdispersion - overview There are several ways to handle overdispersion in data:
- use quasi-distributions (this often does not work in mixed settings)
- use the negative binomial distribution (not available in all packages, e.g. not in glm)
- use a random residual (this requires mixed-model software even if the model itself is not mixed)
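For a model without random effects, overdispersion can be checked and handled with base R alone. The sketch below uses simulated counts (all names and settings are invented for illustration): it computes the Pearson dispersion statistic of a Poisson model and refits the model with the quasi-Poisson family, which scales the standard errors by the square root of that statistic.

```r
set.seed(123)

n  <- 200
x  <- rnorm(n)
mu <- exp(1 + 0.5 * x)

# Overdispersed counts: negative binomial with variance mu + mu^2
y <- rnbinom(n, mu = mu, size = 1)

# Poisson model, which assumes variance = mean
fit_pois <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic; values well above 1 indicate overdispersion
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
print(dispersion)

# Quasi-Poisson model: same estimates, standard errors scaled by sqrt(dispersion)
fit_quasi <- glm(y ~ x, family = quasipoisson)

se_pois  <- coef(summary(fit_pois))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]
print(rbind(poisson = se_pois, quasipoisson = se_quasi))
```

The coefficient estimates are identical in both fits; only the standard errors, and therefore the p-values, change.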

Overdispersion - overview Always check that the design is well represented in the model. Leaving out design variables (factors that are used to define the data collection) will almost always lead to overdispersion.

Mixed models - overview If your data is collected according to a specific experimental plan or study design, you need to account for this structure in the analysis. If you do not, you will get faulty variation estimates and therefore wrong p-values (usually p-values that are too low). Leaving out the study design variables can also lead to overdispersion.

Mixed models - overview Typical mixed models are:
- repeated measures models, where an experimental unit is observed several times (in time or space)
- hierarchical models, where several observations are made within the experimental unit (but with no specific order)