From t-test to … multilevel analyses

From t-test to … multilevel analyses Stein Atle Lie

Outline Paired t-test (mean and standard deviation) Two-group t-test (means and standard deviations) Linear regression GLM (general linear model) GEE (generalized estimating equations) GLMM (generalized linear mixed model) … SPSS, Stata, R, MLwiN, gllamm (Stata)

Multilevel models “Same thing – many names”: Generalized estimating equations Random effects models Random intercept and random slope models Mixed effects models Variance component models Frailty models (in survival analyses) Latent variable models

Objective Take the general thinking from simple statistical methods into more sophisticated data structures and statistical analyses. Focus on the interpretation of the results with respect to those found in basic statistical methods.

Multilevel data Types of data: Repeated measures for the same individual: the same measure repeated several times on the same individual, or several observers measuring the same individual, or several different measures for the same individual. Related observations (siblings, families, …). A categorical variable with “many” levels (multicenter data, hospitals, clinics, …). Panel data.

Null hypotheses In ordinary statistics (for both paired and two-sample t-tests) we define a null hypothesis. H0: μ1 = μ2 We assume that the mean of group (or measure) 1 equals the mean of group (or measure) 2. Alternatively H0: Δ = μ1 − μ2 = 0

p-value Definition: “If the null hypothesis is true - what is the probability of observing the data* that we did?” * And hence the mean, t-statistic, etc…

p-value We assume that our null hypothesis is true (μ0 = 0 or μ1 − μ2 = 0) We observe our data (mean value etc.) Under the assumption of normally distributed data, the p-value is the probability of observing our data (or something more extreme) under the given assumptions

Paired t-test The straightforward way to analyze two repeated measures is a paired t-test. The measure at time1 or location1 (e.g. Data1) is directly compared to the measure at time2 or location2 (e.g. Data2). Is the difference between Data1 and Data2 (Diff = Data1 − Data2) different from 0?
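A minimal R sketch of this idea, using simulated numbers (not the lecture's dataset): the paired t-test is the same as a one-sample t-test on the within-pair differences.
set.seed(1)
data1 <- rnorm(10, mean = 7, sd = 4)  # simulated measure at time1/location1
data2 <- rnorm(10, mean = 9, sd = 4)  # simulated measure at time2/location2
t.test(data1, data2, paired = TRUE)   # paired t-test
t.test(data1 - data2, mu = 0)         # identical: one-sample t-test on the differences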

Paired t-test (n=10) PASW: T-TEST PAIRS=Data1 WITH Data2 (PAIRED).

Paired t-test The paired t-test can only be performed on complete (balanced) pairs. What happens if we delete two observations from Data2? (Only 8 complete pairs remain)

Paired t-test (n=8) PASW: T-TEST PAIRS=Data1 WITH Data2 (PAIRED).

Two group t-test If we now consider the data from time1 and time2 (or location1 and location2) to be independent (even though they are not) we can use a two-group t-test on the full dataset of 2 × 10 = 20 observations

Two group t-test (n=20 [10+10]) PASW: T-TEST GROUPS=Grp(1 2) /VARIABLES=Data.

Two group t-test Observe that the means for Grp1 and Grp2 are equal to the means for Data1 and Data2, and that the mean difference is also equal. The difference between the paired t-test and the two-group t-test lies in the variance - and in the number of observations - and therefore in the standard deviation and standard error, and hence in the p-value and confidence intervals
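Continuing the simulated sketch from above: stacking the two measures and ignoring the pairing reproduces the same mean difference, but with a different standard error.
data <- c(data1, data2)              # all 20 observations in one vector
grp  <- factor(rep(1:2, each = 10))  # group indicator (time1 vs time2)
t.test(data ~ grp, var.equal = TRUE) # same mean difference, different SE and p-value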

Two group t-test The two-group t-test is performed on all available data. What happens if we delete two observations from Grp2? (Only 8 complete pairs remain - but 18 observations remain!)

Two group t-test (n=18 [10+8]) PASW: T-TEST GROUPS=Grp(1 2) /VARIABLES=Data.

Two group t-test (σ1 = σ2) [Figure: two normal distributions with equal standard deviations σ1 = σ2, means μ1 and μ2, and mean difference Δ]

Two group t-test (σ1 = σ2) [Figure: the two distributions with equal standard deviations σ1 = σ2]

ANOVA (analysis of variance, σ1 = σ2 = σ3)

Linear regression If we now perform an ordinary linear regression with the data as outcome (dependent variable) and the group variable (Grp = 1 and 2) as independent variable, the coefficient for group is identical to the mean difference, and the standard error, t-statistic, and p-value are identical to those found in the two-group t-test
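The same equivalence can be checked in R with the simulated vectors from above:
summary(lm(data ~ grp))  # coefficient for grp2 = mean difference;
                         # same t-statistic and p-value as the two-group t-test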

Linear regression (n=20) Stata: . regress data grp Source | SS df MS Number of obs = 20 -------------+------------------------------ F( 1, 18) = 1.38 Model | 21.0124998 1 21.0124998 Prob > F = 0.2554 Residual | 274.01701 18 15.2231672 R-squared = 0.0712 -------------+------------------------------ Adj R-squared = 0.0196 Total | 295.02951 19 15.5278689 Root MSE = 3.9017 ------------------------------------------------------------------------------ data | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- grp | 2.05 1.744888 1.17 0.255 -1.615873 5.715873 _cons | 5.33 2.75891 1.93 0.069 -.4662545 11.12625

Linear regression Now exchange the independent variable group (Grp = 1 and 2) with a dummy variable (dummy = 0 for Grp = 1 and dummy = 1 for Grp = 2). The coefficient for the dummy is equal to the coefficient for grp (the mean difference), and the coefficient for the constant term is equal to the mean for Grp1 (the standard error is not!)

Linear regression (n=20) Stata: . regress data dummy Source | SS df MS Number of obs = 20 -------------+------------------------------ F( 1, 18) = 1.38 Model | 21.0124998 1 21.0124998 Prob > F = 0.2554 Residual | 274.01701 18 15.2231672 R-squared = 0.0712 -------------+------------------------------ Adj R-squared = 0.0196 Total | 295.02951 19 15.5278689 Root MSE = 3.9017 ------------------------------------------------------------------------------ data | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- dummy | 2.05 1.744888 1.17 0.255 -1.615873 5.715873 _cons | 7.38 1.233822 5.98 0.000 4.787836 9.972164

Linear models in Stata In ordinary linear models (regress and glm) in Stata one may add an option for clustered data – to obtain standard errors adjusted for intragroup correlation This is ideal when you want to adjust for clustered data, but are not interested in the correlation within or between groups And - you will still have the population effects!!
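A comparable cluster-robust analysis can be sketched in R with the sandwich and lmtest packages (this package choice is mine, not from the lecture; the dummy and cluster identifier below mirror the simulated data above):
library(sandwich)             # cluster-robust covariance estimators
library(lmtest)               # coeftest() with a user-supplied vcov
dummy <- rep(0:1, each = 10)  # 0 = time1, 1 = time2
id    <- rep(1:10, times = 2) # pair/cluster identifier
fit   <- lm(data ~ dummy)
coeftest(fit, vcov = vcovCL(fit, cluster = id))  # SEs adjusted for clustering on id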

Linear regression (n=20) Stata: . regress data dummy, cluster(id) Linear regression Number of obs = 20 F( 1, 9) = 2.64 Prob > F = 0.1388 R-squared = 0.0712 Root MSE = 3.9017 (Std. Err. adjusted for 10 clusters in id) ------------------------------------------------------------------------------ | Robust data | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- dummy | 2.05 1.262145 1.62 0.139 -.8051699 4.90517 _cons | 7.38 1.224847 6.03 0.000 4.609204 10.1508

Linear models in Stata Thus, we now have an alternative to the paired t-test. The mean difference is identical to that obtained from the paired t-test, and the standard errors (and p-values) are adjusted for intragroup correlation As an alternative we may use the program gllamm (Generalized Linear Latent And Mixed Models) in Stata http://www.gllamm.org/

gllamm (n=20) gllamm (Stata): . gllamm data dummy, i(id) number of level 1 units = 20 number of level 2 units = 10 ------------------------------------------------------------------------------ data | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- dummy | 2.05 1.167852 1.76 0.079 -.2389486 4.338949 _cons | 7.379808 1.172819 6.29 0.000 5.081124 9.678492 Variance at level 1 6.8193955 (3.0174853) Variances and covariances of random effects level 2 (id) var(1): 6.8114516 (4.5613185)

Linear models in Stata If we now delete two of the observations in Grp2, we still obtain coefficients (“mean differences”) calculated from all (n=18) data, and standard errors corrected for intragroup correlation - using the commands <regress>, <glm> or <gllamm>

Linear regression (n=18) Stata: . regress data dummy, cluster(id) Linear regression Number of obs = 18 F( 1, 9) = 1.63 Prob > F = 0.2332 R-squared = 0.0587 Root MSE = 4.1303 (Std. Err. adjusted for 10 clusters in id) ------------------------------------------------------------------------------ | Robust data | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- dummy | 1.9575 1.531486 1.28 0.233 -1.506963 5.421963 _cons | 7.38 1.228869 6.01 0.000 4.600105 10.1599

gllamm (n=18) gllamm (Stata): . gllamm data dummy, i(id) number of level 1 units = 18 number of level 2 units = 10 log likelihood = -48.538837 ------------------------------------------------------------------------------ data | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- dummy | 2.458305 1.253552 1.96 0.050 .0013882 4.915223 _cons | 7.357426 1.232548 5.97 0.000 4.941677 9.773176 Variance at level 1 6.4041537 (3.3485133) level 2 (id) var(1): 8.7561818 (5.1671805)

Intra-class correlation (ICC) Variance at level 1 6.4041537 (3.3485133) level 2 (id) var(1): 8.7561818 (5.1671805) The total variance is hence 6.4042 + 8.7562 = 15.1603 (and the standard deviation is hence 3.8936) The proportion of variance attributed to level 2 is therefore ICC = 8.7562/15.1603 = 0.578
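The same calculation from an lme4 fit in R (a minimal sketch, reusing the simulated variables from above):
library(lme4)
fit <- lmer(data ~ dummy + (1 | id))    # random intercept per pair
vc  <- as.data.frame(VarCorr(fit))      # variance components (level 2 and residual)
vc$vcov[vc$grp == "id"] / sum(vc$vcov)  # ICC = level-2 variance / total variance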

Linear regression Ordinary linear regression assumes data are normal and i.i.d. (independent and identically distributed)

Linear regression [Figure: scatter plot with fitted regression line y = b0 + b1·x, marking the intercept b0, the slope b1, and a residual; example pairings: cortisol vs. months, height vs. weight, cortisol vs. time]

Linear regression Assumptions: 1) y1, y2,…, yn are independent and normally distributed 2) The expectation of Yi is: E(Yi) = b0 + b1·xi (linear relation between X and Y) 3) The variance of Yi is: var(Yi) = σ² (equal variance for ALL values of X)

Linear regression Assumptions - residuals (ei): yi = a + b·xi + ei 1) e1, e2,…, en are independent and normally distributed 2) The expectation of ei is: E(ei) = 0 3) The variance of ei is: var(ei) = σ²

Regression [Figure: fitted line ŷi = a + b·xi through the points (xi, yi), with residuals e = yi − ŷi. What are the “best” a and b? The least squares method minimizes the squared residuals (yi − ŷi)²]

Regression Least squares method: We want the sum of squares (the distances from all points to the line [the residuals], squared) to be as small as possible – we wish to find the minimum

Regression The least squares method: The solution is: b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and a = ȳ − b·x̄
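A quick numerical check of these formulas in R (simulated data, purely illustrative):
set.seed(2)
x <- rnorm(20)
y <- 1 + 2 * x + rnorm(20)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # least squares slope
a <- mean(y) - b * mean(x)                                      # least squares intercept
c(a = a, b = b)
coef(lm(y ~ x))  # lm() reproduces the same estimates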

Regression The maximum likelihood method: Assumptions: 1) y1, y2,…, yn are independent, normally distributed observations (i.i.d.) 2) The expectation of Yi is: E(Yi) = a + b·xi 3) The variance of Yi is: var(Yi) = σ² The likelihood function f(y) is maximized with respect to a and b. Under normality this is the same as finding the minimum of Σ(yi − a − b·xi)². For simple linear regression the least squares method and the maximum likelihood method therefore give the same estimates!

Regression [Figure: the maximum likelihood method - “the probability that the line fits the observed points”]

Ordinary linear regression The formula for an ordinary regression can be expressed as: yi = b0 + b1·xi + ei, ei ~ N(0, σe²)

Interpretation of coefficients [Scatter plot: weight in kg (Y) against height in cm (X), for women and men] Y = −97.6 + 0.96·X On the form Y = a + b·X, that is: a = −97.6 and b = 0.96

Interpretation of coefficients Y = −85.0 + 0.91·X1 − 1.86·X2 [Figure: the vertical distance between the two parallel lines equals the X2 coefficient, 1.86 kg]

Random intercept model [Figure: regression lines yij = b0 + b1·xij + vij with common slope b1 and group-specific intercepts b0 + uj; σu is the between-group and σe the within-group standard deviation]

Random intercept model For a random intercept model, we can express the regression line(s) - and the variance components - as yij = b0 + b1·xij + vij vij = uj + eij eij ~ N(0, σe²) (individual) uj ~ N(0, σu²) (group)

Random intercept model Alternatively we may express the formulas, for the simple variance component model, in terms of random intercepts: yij = b0j + b1·xij + eij b0j = b0 + uj eij ~ N(0, σe²) (individual) uj ~ N(0, σu²) (group)

Random slope model For a random slope model (the intercepts are equal), we can express the regression line(s) and the variance components as yij = b0 + b1j·xij + eij b1j = b1 + wj eij ~ N(0, σe²) (individual) wj ~ N(0, σw²) (group)

Random slope and intercept model For a random slope and random intercept model, we can express the regression line(s) and the variance components as yij = b0j + b1j·xij + eij b1j = b1 + wj b0j = b0 + uj eij ~ N(0, σe²) (individual) uj ~ N(0, σu²) (group) wj ~ N(0, σw²) (group)
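In lme4 syntax these three models can be sketched as follows (d, y, x and group are placeholder names, chosen to mirror the lmer calls shown later):
library(lme4)
lmer(y ~ x + (1 | group), data = d)      # random intercept:          b0j = b0 + uj
lmer(y ~ x + (0 + x | group), data = d)  # random slope only:         b1j = b1 + wj
lmer(y ~ x + (1 + x | group), data = d)  # random slope and intercept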

Cortisol data Cortisol level in saliva measured each morning on 3 days, in two periods* 55 individuals 278 observations (52 missing) * The real data were measured 5 times per day, on 3 days and in 3 periods - from the article: Harris A, Marquis P, Eriksen HR, Grant I, Corbett R, Lie SA, Ursin H. Diurnal rhythm in British Antarctic personnel. Rural Remote Health. 2010 Apr-Jun;10(2):1351.

Cortisol data – missing data

Cortisol data – long data format
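Long format means one row per measurement occasion rather than one row per individual. A wide-to-long reshape can be sketched in R (the data frame and column names here are hypothetical, not the lecture's files):
library(tidyr)
# Hypothetical wide data: one row per individual, one column per occasion
wide <- data.frame(ID = 1:3,
                   kortisol_d1 = c(10.2, 12.1, 9.8),
                   kortisol_d2 = c(11.0, 13.4, 8.9))
long <- pivot_longer(wide, cols = starts_with("kortisol_"),
                     names_to = "occasion", values_to = "Kortisol")
long  # one row per individual-occasion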

Cortisol data [Figure: observed cortisol levels, Period 1 and Period 2]

Linear model Stata: . glm kortisol period2 day2 day3, cluster(id) (. regress kortisol period2 day2 day3, cluster(id)) Generalized linear models No. of obs = 278 Optimization : ML Residual df = 274 (Std. Err. adjusted for 55 clusters in id) ------------------------------------------------------------------------------ | Robust kortisol | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- period2 | -2.536544 .9788702 -2.59 0.010 -4.455094 -.6179938 day2 | .1313347 .7238506 0.18 0.856 -1.287386 1.550056 day3 | .6528685 .7052775 0.93 0.355 -.72945 2.035187 _cons | 11.31802 .9542124 11.86 0.000 9.447799 13.18824

Linear mixed model (variance component) Stata: . gllamm kortisol period2 day2 day3, i(id) number of level 1 units = 278 number of level 2 units = 55 ------------------------------------------------------------------------------ kortisol | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- period2 | -2.600979 .6875339 -3.78 0.000 -3.94852 -1.253437 day2 | .05486 .8154391 0.07 0.946 -1.543371 1.653091 day3 | .5183787 .8242555 0.63 0.529 -1.097132 2.13389 _cons | 11.29695 .7666444 14.74 0.000 9.794358 12.79955 Variance at level 1 31.202774 (2.9334224) Variances and covariances of random effects level 2 (id) var(1): 8.6764463 (2.8796675) ICC=0.218
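The reported ICC follows directly from the printed variance components: ICC = 8.6764 / (31.2028 + 8.6764) ≈ 0.218.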

Linear mixed model (variance component) lmer(Kortisol~1+Day2+Day3+Period2 +(1|ID),data=kortisol) Random effects: Groups Name Variance Std.Dev. ID (Intercept) 8.8683 2.9780 Residual 31.6173 5.6229 Number of obs: 278, groups: ID, 55 Fixed effects: Estimate Std. Error t value (Intercept) 11.29105 0.77282 14.610 Day2 0.05431 0.82076 0.066 Day3 0.51766 0.82946 0.624 Period2 -2.60115 0.69204 -3.759 ICC=0.219

Cortisol data [Figure: observed cortisol levels, Period 1 and Period 2]

Linear mixed model (variance component) PASW: MIXED Kortisol BY ID WITH Period2 Day2 Day3 /FIXED=Period2 Day2 Day3 | SSTYPE(3) /METHOD=REML /PRINT=SOLUTION /RANDOM=ID | COVTYPE(VC). ICC=0.219

Linear mixed model (random intercept model) lmer(Kortisol~1+Day+Period2 +(1|ID),data=kortisol) Random effects: Groups Name Variance Std.Dev. ID (Intercept) 8.8879 2.9813 Residual 31.4891 5.6115 Number of obs: 278, groups: ID, 55 Fixed effects: Estimate Std. Error t value (Intercept) 11.2281 0.7394 15.186 Day 0.2546 0.4137 0.616 Period2 -2.6007 0.6907 -3.765 ICC=0.220

Linear mixed model (random intercept model) [Figure: fitted lines, Period 1 and Period 2]

Linear mixed model (random slope model) lmer(Kortisol~1+Day+Period2 +(Day-1|ID),data=kortisol) Random effects: Groups Name Variance Std.Dev. ID Day 6.2228e-08 0.00024945 ! Residual 4.0499e+01 6.36390166 Number of obs: 278, groups: ID, 55 Fixed effects: Estimate Std. Error t value (Intercept) 11.2575 0.6948 16.202 Day 0.3227 0.4660 0.692 Period2 -2.5361 0.7644 -3.318

Linear mixed model (random slope model) [Figure: fitted lines, Period 1 and Period 2]

Linear mixed model (random slope & intercept) lmer(Kortisol~1+Day+Period2 +(1+Day|ID),data=kortisol) Random effects: Groups Name Variance Std.Dev. Corr ID (Intercept) 10.88014 3.29851 Day 0.10535 0.32457 -1.000 Residual 31.38000 5.60179 Number of obs: 278, groups: ID, 55 Fixed effects: Estimate Std. Error t value (Intercept) 11.2138 0.7629 14.698 Day 0.2656 0.4149 0.640 Period2 -2.5940 0.6891 -3.764 ICC=0.257
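Here ICC = 10.8801 / (10.8801 + 31.3800) ≈ 0.257, computed from the intercept and residual variances only; with a random slope the within-individual correlation in fact varies with Day.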

Linear mixed model (random slope & intercept model) [Figure: fitted lines, Period 1 and Period 2]

Summary Parameter estimates for categorical variables (preferably dummy variables) in linear models can be interpreted as mean differences, just as in an ordinary t-test This carries over directly to models for repeated or clustered observations!

Software Personal opinion PASW/SPSS: very easy to fit simple models (menu/syntax); good for arranging data Stata: steeper learning curve to start; easy to extend the simpler models to more sophisticated models (e.g. for other distributions!); gllamm

Software Personal opinion R: steep learning curve; nice graphics MLwiN: based on mouse clicking (impossible syntax); informative screen using formulas SAS: “similar” to SPSS

Estimation methods IGLS – Iterative Generalised Least Squares RIGLS – Residual/Restricted Iterative Generalised Least Squares MCMC – Markov Chain Monte Carlo Bootstrap – «Baron von Munchausen»

Extended models - Stata

Extended models - SPSS

Extended models – Stata (gllamm) Families (F): gaussian, poisson, gamma, binomial Links (g): identity, log, reciprocal, logit, probit, cll (complementary log-log), ll (log-log), ologit (o stands for ordinal), oprobit, ocll, mlogit, sprobit (scaled probit), soprobit

Extended models gllamm also allows for probability weighting (e.g. to adjust for dropout) The “svyset” (survey set) extension also allows for probability weighting, and robust variance estimates (linear models, logistic models, …)

Random slope and intercept model For a random slope and random intercept model, we can express the general regression line(s) and the variance components as g(yij) = b0j + b1j·xij + eij b1j = b1 + wj b0j = b0 + uj eij ~ F (individual) uj ~ N(0, σu²) (group) wj ~ N(0, σw²) (group)
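For a non-normal outcome this is a generalized linear mixed model; a logistic random slope and intercept model can be sketched in lme4 as (placeholder names again):
library(lme4)
glmer(y ~ x + (1 + x | group), data = d,
      family = binomial(link = "logit"))  # g = logit link, F = binomial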