Biostatistics Case Studies 2015 Youngju Pak, PhD. Biostatistician Session 4: Regression Models and Multivariate Analyses.


1 Biostatistics Case Studies 2015 Youngju Pak, PhD. Biostatistician ypak@labiomed.org Session 4: Regression Models and Multivariate Analyses

2 What and Why?
• Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once, in contrast to univariate or bivariate analysis.
• Data have become richer as computational technologies have advanced, which motivates data reduction or classification.
  - e.g., Factor Analysis, Principal Component Analysis (PCA)
• Several variables are often correlated to some degree, which creates potential confounding and can bias the results.
  - e.g., Analysis of Covariance (ANCOVA), Multiple Linear or Generalized Linear Regression Models

3 What and Why?
• Many variables are interrelated, with multiple dependent and independent variables.
  - e.g., Multivariate Analysis of Variance (MANOVA), Path Models, Structural Equation Models (SEM), Partial Least Squares (PLS) Models
• This session will focus on multiple regression models.

4 Why regression models?
• To reduce "random noise" in the data: variance estimates improve when additional sources of variability in the dependent variable are added to the model.
  - e.g., ANCOVA
• To determine an optimal set of predictors, i.e., to build predictive models.
  - e.g., variable selection procedures for multiple regression models
• To adjust for potential confounding effects.
  - e.g., regression models with covariates

5 Actual mathematical Models
• ANOVA: $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, where $Y_{ij}$ represents the $j$th observation ($j = 1, 2, \ldots, n$) on the $i$th treatment ($i = 1, 2, \ldots, l$ levels). The errors $\varepsilon_{ij}$ are assumed to be normally and independently distributed (NID), with mean zero and variance $\sigma^2$.
• ANCOVA with $k$ covariates: $Y_{ij} = \mu + \tau_i + \beta_1 X_{1ij} + \beta_2 X_{2ij} + \ldots + \beta_k X_{kij} + \varepsilon_{ij}$
• MANOVA (with $p$ outcome variables): $Y_{(n \times p)} = X_{(n \times [q+1])} B_{([q+1] \times p)} + E_{(n \times p)}$

6 Actual mathematical Models
• Simple Linear Regression Models (SLR): $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\mu_Y = \beta_0 + \beta_1 X_i$ is the true mean value of Y.
  - $\varepsilon$ = "error" (random noise due to random sampling error), assumed to follow a normal distribution with mean 0 and variance $\sigma^2$
  - $\beta_0$ and $\beta_1$ = intercept and slope, often called regression (or beta) coefficients
  - Y = Dependent Variable (DV)
  - X = Independent Variable (IV)
  - e.g., Y = Insulin Sensitivity, X = Fatty Acid in percentage
• Multiple Linear Regression Models (MLR)
• Simple Logistic Models (SL)
• Multiple Logistic Models (ML)
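The SLR model above can be illustrated with a minimal least-squares sketch. The data values below are hypothetical (they are not the insulin sensitivity data from the presentation), and `fit_slr` is an illustrative helper name, not part of any package.

```python
from statistics import mean

def fit_slr(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_slr(x, y)
# Residuals = observed minus fitted; these are what a residual plot displays
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"intercept={b0:.3f}, slope={b1:.3f}")
```

With an intercept in the model, least-squares residuals always sum to zero, so a residual plot is judged by its pattern rather than its average level.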

7 SLR: Example
SPSS output: two-sided p-value = 0.002. Thus, there is statistically significant evidence (alpha = 0.05) to conclude that the true slope is not zero, i.e., Fatty Acid (%) is significantly related to insulin sensitivity. Mean insulin sensitivity increases by 37.208 units for each one-percentage-point increase in Fatty Acid (%).

8 SLR w/CI

9 Checking the assumptions using a residual plot
The plot should look random: no systematic pattern should appear if the assumptions are met.

10 Actual mathematical Models
• Multiple Linear Regression Models (MLR): $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \varepsilon$, where $\mu_Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$ is the true mean value of Y.
• The assumptions are the same as for SLR, with one addition: the Xs are not highly correlated with one another. If they are, this is called "multicollinearity", which makes the model very unstable.
• Diagnosis for multicollinearity: Variance Inflation Factor (VIF)
  - VIF = 1: OK
  - VIF < 5: Tolerable
  - VIF > 5: Problematic. Remove the variable with the high VIF, or do PCA.
• Simple Logistic Models (SL)
• Multiple Logistic Models (ML)
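As a rough illustration of the VIF diagnostic, the sketch below covers only the special case of two predictors, where the VIF of either predictor reduces to $1 / (1 - r^2)$, with $r$ the Pearson correlation between them. With more predictors, $r^2$ is replaced by the $R^2$ from regressing one predictor on all the others. The data and function names are hypothetical.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictor values, for illustration only
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]

vif = vif_two_predictors(x1, x2)
print(f"r={pearson_r(x1, x2):.2f}, VIF={vif:.2f}")
```

Uncorrelated predictors give VIF = 1, and the value grows without bound as the correlation approaches plus or minus 1.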

11 MLR: Example
$\mu_Y = -56.935 + 1.634 X_1 + 0.249 X_2$
• 1.634 * Flexibility: for every 1 degree increase in flexibility, MEAN punt distance increases by 1.634 feet, adjusting for leg strength.
• 0.249 * Strength: for every 1 lb increase in strength, MEAN punt distance increases by 0.249 feet, adjusting for flexibility.
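The "adjusting for" interpretation can be checked numerically with the fitted equation from this slide. In the sketch below, the coefficients come from the slide, while the input values (flexibility of 100 degrees, strength of 150 lb) and the function name are hypothetical.

```python
def mean_punt_distance(flexibility, strength):
    """Fitted mean punt distance (feet) from the slide's MLR equation."""
    return -56.935 + 1.634 * flexibility + 0.249 * strength

# Hypothetical inputs, for illustration only
base = mean_punt_distance(flexibility=100.0, strength=150.0)

# Raise flexibility by 1 degree while holding strength fixed
more_flexible = mean_punt_distance(flexibility=101.0, strength=150.0)

# Raise strength by 1 lb while holding flexibility fixed
stronger = mean_punt_distance(flexibility=100.0, strength=151.0)

print(round(more_flexible - base, 3))  # change attributable to flexibility
print(round(stronger - base, 3))       # change attributable to strength
```

Whatever value the other predictor is held at, the difference equals the corresponding coefficient, which is what "adjusting for" means in a model without interaction terms.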

12 What do we mean by "adjusted for"?
• What if the covariates are categorical? e.g.,

Mean muscle gain % (N)
          Exercise & Diet    Exercise only
Male      20% (10)           15% (40)
Female    10% (40)           5% (10)

• Mean % gain without adjustment for gender
  - Exercise & Diet: (20% x 10 + 10% x 40) / 50 = 12%
  - Exercise only: (15% x 40 + 5% x 10) / 50 = 13%
• Mean % gain with adjustment for gender
  - Exercise & Diet: Male avg. x 0.5 + Female avg. x 0.5 = 20% x 0.5 + 10% x 0.5 = 15%
  - Exercise only: Male avg. x 0.5 + Female avg. x 0.5 = 15% x 0.5 + 5% x 0.5 = 10%
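The arithmetic on this slide can be reproduced in a few lines. The cell means and sample sizes below are taken from the slide's table; the dictionary layout and function names are illustrative choices only.

```python
# (mean % gain, N) for each gender within each treatment group, from the slide
cells = {
    "Exercise & Diet": {"Male": (20.0, 10), "Female": (10.0, 40)},
    "Exercise only": {"Male": (15.0, 40), "Female": (5.0, 10)},
}

def unadjusted_mean(group):
    """Mean weighted by the observed sample sizes (ignores gender imbalance)."""
    total_n = sum(n for _, n in group.values())
    return sum(m * n for m, n in group.values()) / total_n

def adjusted_mean(group, weights):
    """Mean weighted by a common gender distribution applied to both groups."""
    return sum(weights[gender] * group[gender][0] for gender in group)

equal_weights = {"Male": 0.5, "Female": 0.5}

for name, group in cells.items():
    print(name, unadjusted_mean(group), adjusted_mean(group, equal_weights))
```

The unadjusted means (12% vs. 13%) suggest that exercise alone does slightly better, while the adjusted means (15% vs. 10%) show a 5-point advantage for exercise and diet once both groups are given the same gender mix.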

13 Why different?
• % gain for males is 10 percentage points higher than for females in both treatment groups, so gender is a potential confounder.
• However, the two groups are unbalanced in terms of gender: the exercise-only group is 80% male, while the exercise & diet group is only 20% male. This imbalance dilutes the "treatment effect".
• For continuous covariates such as baseline age, a similar adjustment is performed based on the correlation between % gain and baseline age.

14 Graphical illustration : Adjusting for a continuous covariate * Changes in Adiponectin (a glucose regulating protein) b/w two groups

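15 Multiple Logistic Regression Models
The model: $\mathrm{Logit}(\pi) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$, where $\pi = \mathrm{Prob}(\text{event} = 1)$ and $\mathrm{Logit}(\pi) = \ln[\pi / (1 - \pi)]$.
Equivalently, $\pi = e^{LP} / (1 + e^{LP})$, where $LP = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$.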
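The two forms of the model are inverses of each other, which a short sketch can make concrete. The coefficient values and predictor values below are hypothetical and chosen only for illustration.

```python
from math import exp, log

def logit(p):
    """Log-odds: ln[p / (1 - p)]."""
    return log(p / (1.0 - p))

def inv_logit(lp):
    """Probability from a linear predictor: e^LP / (1 + e^LP)."""
    return exp(lp) / (1.0 + exp(lp))

def predicted_probability(coefs, xs):
    """coefs = [b0, b1, ..., bk]; xs = [x1, ..., xk]."""
    lp = coefs[0] + sum(b * x for b, x in zip(coefs[1:], xs))
    return inv_logit(lp)

# Hypothetical coefficients (intercept, b1, b2) and predictor values
coefs = [-3.0, 0.05, 0.8]
p = predicted_probability(coefs, [50.0, 1.0])  # LP = -3.0 + 2.5 + 0.8 = 0.3
print(round(p, 4))
```

A linear predictor of 0 corresponds to a probability of 0.5; positive values push the probability toward 1 and negative values toward 0.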

16 Interpretation of the coefficients in logistic regression models
• For a continuous predictor, the exponentiated coefficient ($e^{\beta}$) represents the multiplicative change in the odds of Y=1 for a one-unit increase in X, i.e., the odds ratio for X+1 versus X.
• Similarly, for a nominal predictor, the exponentiated coefficient represents the odds ratio for one group (X=1) versus another (X=0).
• Remember, a multiple logistic model contains other covariates. Hence, the interpretation of each coefficient holds when the other covariates are adjusted for.
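The odds-ratio interpretation follows from the fact that the odds equal $e^{LP}$, so a one-unit increase in X multiplies the odds by $e^{\beta}$. The sketch below checks this numerically for a single-predictor model; the coefficients and X values are hypothetical.

```python
from math import exp

def odds(b0, b1, x):
    """Odds of Y=1 at a given x: pi / (1 - pi) = e^(b0 + b1 * x)."""
    return exp(b0 + b1 * x)

# Hypothetical intercept and slope on the log-odds scale
b0, b1 = -3.0, 0.05

# Odds ratio for X+1 versus X, evaluated at two different starting values
odds_ratio_at_40 = odds(b0, b1, 41.0) / odds(b0, b1, 40.0)
odds_ratio_at_70 = odds(b0, b1, 71.0) / odds(b0, b1, 70.0)

print(round(odds_ratio_at_40, 4), round(exp(b1), 4))
```

The ratio is the same at every starting value of X, which is why a single number summarizes the effect of a continuous predictor on the odds scale.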

17 Estimated Prob. vs. Age

18 Other Models
• Ordinal logistic regression for ordinal responses such as cancer stage I, II, III, IV: assumes a constant odds ratio across the cut points between ordered categories (the proportional odds assumption).
• Poisson regression when responses are count data, such as number of pregnancies: overdispersion is common, and a negative binomial distribution is sometimes used instead.
• Mixed models: commonly used for repeated-measures ANOVA or ANCOVA. Time is used as a within-subject factor and a random factor. Mixed models are also used for nested designs.
• Cox proportional hazards models: multivariate models for survival data.

19 General Linear Model vs. Generalized Linear Model (GLM)
• A linear model: General Linear Model
  - e.g., ANOVA, ANCOVA, MANOVA, MANCOVA, linear regression, mixed model
• A non-linear model: Generalized Linear Model
  - e.g., Logistic, Ordinal Logistic, Poisson
  - All of these use a link function for the response variable (Y), such as a logit link (logistic) or a log link (Poisson).
• GEE (Generalized Estimating Equation) models are an extension of GLM.

20 Variable Selection Procedures
• Forward
  - Add the new predictor that has the lowest p-value, and repeat this step until no more predictors can be added at the 0.05 alpha level.
• Backward
  - Start with a full model containing all predictors, eliminate the predictor with the highest p-value, and repeat this procedure until no more predictors can be eliminated at the 0.05 alpha level.
• Stepwise
  - A combination of Forward and Backward
  - Level of stay: 0.01 and level of entry: 0.05 are usually used.
• Many simulation studies suggest that Backward is the most recommendable procedure.
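The control flow of the Backward procedure can be sketched as a loop. To keep the example self-contained, the model fitting is replaced by a lookup table of made-up p-values: the predictor names (`age`, `bmi`, `sex`), the p-values, and the `pvalues_for` callable are all hypothetical stand-ins for refitting a real regression at each step.

```python
def backward_eliminate(predictors, pvalues_for, alpha=0.05):
    """Drop the predictor with the highest p-value until all are <= alpha.

    pvalues_for(names) must return {name: p-value} for a model refit
    with exactly those predictors.
    """
    current = list(predictors)
    while current:
        pvals = pvalues_for(tuple(current))
        worst = max(current, key=lambda name: pvals[name])
        if pvals[worst] <= alpha:
            break  # every remaining predictor is significant at alpha
        current.remove(worst)
    return current

# Hypothetical p-values for each candidate model (not from real data)
FAKE_FITS = {
    ("age", "bmi", "sex"): {"age": 0.01, "bmi": 0.40, "sex": 0.20},
    ("age", "sex"): {"age": 0.005, "sex": 0.03},
}

def fake_pvalues(names):
    return FAKE_FITS[names]

selected = backward_eliminate(["age", "bmi", "sex"], fake_pvalues)
print(selected)
```

Note that p-values are recomputed after each removal: in this made-up example `sex` is not significant in the full model but becomes significant once `bmi` is dropped, which is why the procedure refits instead of removing all non-significant predictors at once.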
