BIOST 536 Lecture 9 – Prediction and Association example: Low birth weight dataset

Presentation transcript:

BIOST 536 Lecture 9 1
Lecture 9 – Prediction and Association example
Low birth weight dataset
Consider a prediction model for low birth weight (< 2500 grams) given the collection of variables available
- Do not particularly care which variables are included
- Want to maximize our prediction of the outcome
- Need to validate our prediction on data not used to generate the model

BIOST 536 Lecture 9 2

BIOST 536 Lecture 9 3
Outcome variable
Look at the distribution of birthweights
Were low birthweight babies oversampled or was there bias in recording?

BIOST 536 Lecture 9 4
Simple descriptives
Number of prior pre-term labors may be too thin to model – consider dichotomizing none versus 1 or more (everptl)

BIOST 536 Lecture 9 5
Number of first trimester physician visits may also have to be grouped for analysis (ptvgrp: 0, 1, 2+)
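
The two recodings suggested on slides 4 and 5 can be sketched in a few lines of Python. This is only an illustration: the slides name the derived variables (everptl, ptvgrp) and the raw count ptl, but the function names and the assumption that the visit count is a non-negative integer are ours.

```python
def make_everptl(ptl):
    """Dichotomize the number of prior pre-term labors: 0 = none, 1 = one or more."""
    return 1 if ptl >= 1 else 0


def make_ptvgrp(visits):
    """Group first trimester physician visits into 0, 1, or 2 (meaning 2 or more)."""
    return min(visits, 2)


# Hypothetical raw counts, for illustration only
ptl_values = [0, 0, 1, 3, 0, 2]
visit_values = [0, 1, 4, 2, 0, 6]

everptl = [make_everptl(v) for v in ptl_values]
ptvgrp = [make_ptvgrp(v) for v in visit_values]
print(everptl)
print(ptvgrp)
```

Collapsing sparse upper categories this way trades some information for more stable estimates, which is the motivation given on the slides.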

BIOST 536 Lecture 9 6
Relationship of LBW with continuous variables age of mother and weight of mother
Possible relationship of LBW with either age or weight univariately
Need to consider setting aside an internal validation sample
Not much data so use 75% for training and 25% for validation

BIOST 536 Lecture 9 7
Generate training and validation samples
Generate a random number U, uniform on the interval (0,1), for each observation
Assign an observation to the training sample if U < 0.75 and to the validation sample if U ≥ 0.75
This will not guarantee that exactly 75% of the observations or cases are in the training sample
If you want exactly 75% of the observations, sort by the random number and assign the first 0.75*n to the training sample
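
A minimal sketch of both assignment rules from slide 7, using only the Python standard library. The function names, the seed, and the sample size of 200 are illustrative assumptions, not values from the lecture.

```python
import random


def bernoulli_split(n, train_frac=0.75, seed=536):
    """Assign each observation independently: train if U < train_frac.

    The realized training fraction is random, so it is only close to train_frac.
    """
    rng = random.Random(seed)
    return ["train" if rng.random() < train_frac else "valid" for _ in range(n)]


def exact_split(n, train_frac=0.75, seed=536):
    """Sort observations by a uniform random number and put the first
    floor(train_frac * n) of them in the training sample."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(n)]
    order = sorted(range(n), key=lambda i: u[i])
    n_train = int(train_frac * n)
    labels = ["valid"] * n
    for i in order[:n_train]:
        labels[i] = "train"
    return labels


n = 200  # hypothetical sample size
print(bernoulli_split(n).count("train"))  # varies with the seed
print(exact_split(n).count("train"))      # always int(0.75 * n)
```

Neither rule balances the outcome: to hold the share of LBW cases fixed as well, the same sort-and-cut step would have to be applied separately within cases and non-cases.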

BIOST 536 Lecture 9 8
This has achieved greater balance in the observations, but still there are not 75% of the cases in the training set
Just use the original training classification in the analysis below
Still need to consider how to model the continuous variables age and weight without being too complex
- Could just use linear age and weight terms
- Could categorize into age groups and weight groups
- Could use a simple polynomial (e.g. age and age squared)
- Could use a smoother or a spline
- Could use fractional polynomials

BIOST 536 Lecture 9 9
Fractional polynomials
First consider a 2 degree polynomial model for age
Two degree model not significantly better than age as a linear term so just use age as a linear variable

BIOST 536 Lecture 9 10
Fractional polynomials
Plot of 2 degree model for age

BIOST 536 Lecture 9 11
Fractional polynomials
Now consider a 2 degree polynomial model for weight
Two degree model not significantly better than weight as a linear term so just use weight as a linear variable

BIOST 536 Lecture 9 12
Fractional polynomials
Plot of 2 degree model for weight

BIOST 536 Lecture 9 13
Model exploration (stepwise)
Try to screen for possible predictors (forward stepwise)
Ptl is included as a linear term – may want to dichotomize
Other race may be excluded due to sample size rather than magnitude
Create indicator covariates

BIOST 536 Lecture 9 14
Model exploration (stepwise)
Refit model
Now other race is significant, but Smoke also is added to the model
Backwards stepwise gives the same result (not shown)
Still may want to dichotomize ptl and put that in the model

BIOST 536 Lecture 9 15
Model exploration
Refit model replacing ptl with everptl
Same number of parameters – 2nd model is “better” and simpler
Look at some goodness-of-fit tests
Test is not reliable since the number of observations is too close to the number of covariate combinations

BIOST 536 Lecture 9 16
Model diagnostics
Use Hosmer-Lemeshow goodness-of-fit test and calculate c-statistic (lroc)
Model predicts pretty well for the training sample – also need to consider the validation sample
Compute estimated probabilities and logits for both samples and compute predictive power
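
The c-statistic mentioned on slide 16 is the area under the ROC curve: the proportion of case/non-case pairs in which the case receives the higher estimated probability, with ties counted as one half. The sketch below computes it directly from that definition. The function name and the toy data are illustrative assumptions; the lecture itself obtains the value from Stata's lroc.

```python
def c_statistic(y, p):
    """Concordance (area under the ROC curve) from outcomes y (0/1)
    and estimated probabilities p, by comparing every case/non-case pair."""
    cases = [pi for yi, pi in zip(y, p) if yi == 1]
    controls = [pi for yi, pi in zip(y, p) if yi == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                concordant += 1.0
            elif pc == pn:
                concordant += 0.5  # ties count as half
    return concordant / (len(cases) * len(controls))


# Toy data, for illustration only
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(c_statistic(y, p))  # 0.75
```

Applying the same function to the training and validation samples separately gives the comparison the slides describe; a drop in the validation sample is the expected sign of optimism in the training estimate.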

BIOST 536 Lecture 9 17
Model diagnostics
Estimation in validation sample is still good, but certainly inferior to the training sample

BIOST 536 Lecture 9 18
Model prediction
Look at classification statistics in the training and validation samples
Are there any risk factors so high that LBW babies are probable?
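
As a rough illustration of the classification statistics referred to on slide 18, the sketch below classifies an observation as LBW when its estimated probability reaches a cutoff and tabulates sensitivity, specificity, and overall accuracy. The 0.5 cutoff, the function name, and the toy data are assumptions made for the example.

```python
def classification_stats(y, p, cutoff=0.5):
    """Sensitivity, specificity, and accuracy when predicting y = 1 if p >= cutoff."""
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= cutoff)
    fn = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi < cutoff)
    fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi >= cutoff)
    tn = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi < cutoff)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y),
    }


# Toy data, for illustration only
y = [1, 1, 0, 0, 0]
p = [0.7, 0.3, 0.6, 0.2, 0.1]
print(classification_stats(y, p))
```

When few observations have estimated probabilities above the cutoff, sensitivity will be low even if the c-statistic is reasonable, which is why the slides go on to ask which risk factor combinations push the probability past 50%.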

BIOST 536 Lecture 9 19
Model prediction
Consider smoking as a risk factor
35% of smokers are predicted as LBW (but they have many other diverse risk factors)
Get estimated probability of LBW by weight & smoking, but no other elevated risk factors

BIOST 536 Lecture 9 20
Model prediction
Higher risk for smokers with low weight, but would still need to have another risk factor to have a low birthweight baby with probability greater than 50%
Consider hypertension as well
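
The point about needing more than one risk factor to exceed 50% follows from how a logistic model turns a linear predictor into a probability. The sketch below shows the calculation. The intercept and coefficients are made-up values chosen only to illustrate the pattern; they are not the fitted values from the lecture, which the transcript does not preserve.

```python
import math


def predicted_probability(intercept, coefs, x):
    """Inverse logit of the linear predictor intercept + sum(coef * x)."""
    logit = intercept + sum(coefs[name] * value for name, value in x.items())
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical log-odds coefficients (NOT the lecture's estimates)
intercept = -1.5
coefs = {"smoke": math.log(2.0), "ht": math.log(6.0)}

p_none = predicted_probability(intercept, coefs, {"smoke": 0, "ht": 0})
p_smoke = predicted_probability(intercept, coefs, {"smoke": 1, "ht": 0})
p_both = predicted_probability(intercept, coefs, {"smoke": 1, "ht": 1})
print(round(p_none, 3), round(p_smoke, 3), round(p_both, 3))
```

With these illustrative numbers a single risk factor leaves the estimated probability below 0.5, while two together push it above, mirroring the qualitative conclusion on the slide.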

BIOST 536 Lecture 9 21
Model prediction
Would have to have several risk factors to be at high risk
We did not consider interactions, but they may help only slightly in improving overall model prediction

BIOST 536 Lecture 9 22
Model prediction
Do the same factors also estimate actual birth weight in grams in a linear regression model?

BIOST 536 Lecture 9 23
Association example
Same dataset
Now interested in whether smoking is related to low birthweight
- Assume this has not been tested before, but some animal models suggest a causal relationship
- May want to control for factors believed to be related to LBW and/or smoking
- Other variables are not of particular interest, but smoking may be a modifiable risk factor
This specific hypothesis is proposed prior to data collection so use all the available data
Unadjusted odds ratio for smoking suggests a possible association

BIOST 536 Lecture 9 24
Potential confounders or predictors of the outcome
May want to consider other variables that are potential confounders for the relationship of smoking with LBW
May also want to consider other variables that predict the outcome even if they are not confounders
- Precision for smoking variable may be improved
What is the relationship of smoking to some of the other variables in the data?
Whites are much heavier smokers so race could be a potential confounder
Consider some of the other covariates

BIOST 536 Lecture 9 25
Potential confounders or predictors of the outcome

BIOST 536 Lecture 9 26
Potential confounders or predictors of the outcome
Explore continuous covariates as well

BIOST 536 Lecture 9 27
Control for age and race

BIOST 536 Lecture 9 28
Control for other factors
Age is neither a predictor nor a confounder, but leave it in the model anyway
Race is both a predictor and a confounder of the association of smoking and low birthweight
Now consider some other potential confounders/predictors
Weight may also be a confounder and a predictor
Add hypertension and uterine irritability to the model

BIOST 536 Lecture 9 29
Control for other factors
Predictors, but not confounders
Only other potentially modifiable risk factor and/or confounder might be number of first trimester prenatal visits
Conduct an unplanned exploratory analysis of this variable and outcome

BIOST 536 Lecture 9 30
Control for other factors
Not a strong predictor as a grouped linear variable
Consider as a categorical variable
Runs into numerical issues due to sparseness of the data

BIOST 536 Lecture 9 31
Control for other factors
Collapse into three levels
Not a strong predictor – return to model for smoking, but add this as a potential confounder

BIOST 536 Lecture 9 32
Control for other factors
Small change in the OR for smoking – leave in anyway
Model shows no obvious lack-of-fit

BIOST 536 Lecture 9 33
Unadjusted and Adjusted Odds Ratios

Model                                       OR for smoking   95% CI         Wald p-value
Unadjusted                                  2.02             (1.08, 3.78)   .028
Adjusted for age and race                   3.01             (1.45, 6.23)   .003
Adjusted for age and race & other factors   2.71             (1.23, 5.99)   .014

Suggests an association of smoking and low birthweight that remains after adjustment for age, race, weight of the mother, hypertension, uterine irritability, and number of physician visits
LR test for smoking in the final model
History of premature labor was accidentally omitted from this analysis but should have been included
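
The three columns of the odds ratio table are linked: a Wald 95% confidence interval is symmetric on the log odds scale, so the standard error of the log OR can be recovered from the interval limits and the Wald p-value recomputed. The sketch below does this for the reported values as a consistency check. The function name is ours, and because the inputs are rounded the recomputed p-values only match the table to about three decimals.

```python
import math


def wald_from_ci(odds_ratio, lower, upper, z_crit=1.959964):
    """Recover the SE of log(OR) from a 95% Wald CI, then the z statistic
    and two-sided p-value for the null hypothesis OR = 1."""
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return se, z, p


# OR and CI values as reported on the slide
reported = {
    "Unadjusted": (2.02, 1.08, 3.78),
    "Adjusted for age and race": (3.01, 1.45, 6.23),
    "Adjusted for age and race & other factors": (2.71, 1.23, 5.99),
}

for label, (or_, lo, hi) in reported.items():
    se, z, p = wald_from_ci(or_, lo, hi)
    print(f"{label}: SE(log OR) = {se:.3f}, z = {z:.2f}, p = {p:.3f}")
```

The larger standard error in the fully adjusted model reflects the wider interval there; the point estimate stays well above 1 in all three rows.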

BIOST 536 Lecture 9 35
Effect Modification
Significant effect modification would render the entire previous analysis null and void
- Some analysts prefer to start with interactions to rule out effect modification before looking at confounding
Add interactions between smoking and each covariate and test in a LR test
If significant, then the interpretation of the association of smoking and low birthweight depends on that effect modifier
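
A likelihood ratio test for an interaction compares the maximized log-likelihoods of two nested models, with and without the interaction terms. The sketch below shows only the final arithmetic, assuming the two log-likelihoods have already been obtained from fitted models. The function name and the log-likelihood values are hypothetical, and to stay within the standard library the chi-square tail is implemented only for 1 and 2 degrees of freedom.

```python
import math


def lr_test(loglik_reduced, loglik_full, df):
    """Likelihood ratio statistic 2*(llf - llr) and its chi-square p-value.

    Closed-form tail areas are used, so only df = 1 or df = 2 is supported here.
    """
    stat = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        p = math.exp(-stat / 2.0)
    else:
        raise NotImplementedError("this sketch supports df = 1 or 2 only")
    return stat, p


# Hypothetical log-likelihoods for a model without and with a
# smoking-by-age interaction (one extra parameter)
stat, p = lr_test(loglik_reduced=-105.20, loglik_full=-104.90, df=1)
print(f"LR chi-square = {stat:.2f}, p = {p:.3f}")
```

An interaction with a multi-level factor such as race adds one parameter per non-reference level, so df equals the number of interaction terms added.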

BIOST 536 Lecture 9 36
Effect Modification with Age?
No apparent effect modification by age

BIOST 536 Lecture 9 37
Effect Modification with Race?
No apparent effect modification by race
No significant effect modification by any variable

BIOST 536 Lecture 9 38
Final model
OR for a smoker with hypertension compared to a non-smoker without hypertension, all else being equal
OR = 2.71*6.68 =
OR for a smoker aged 30 compared to a nonsmoker aged 20, all else equal
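
In a logistic model without interactions, log odds ratios add, so odds ratios for combined exposures multiply, and a difference of several units in a continuous covariate raises the per-unit OR to that power. The sketch below illustrates both rules. The values 2.71 and 6.68 are taken from the slide; the per-year OR for age is not preserved in this transcript, so the value used here is a labeled placeholder and its result should not be read as the lecture's answer.

```python
import math


def combined_or(*odds_ratios):
    """Multiply odds ratios by summing their logs (no-interaction model)."""
    return math.exp(sum(math.log(o) for o in odds_ratios))


def or_for_difference(or_per_unit, delta):
    """OR for a difference of `delta` units in a continuous covariate."""
    return or_per_unit ** delta


or_smoke = 2.71  # from the slide
or_ht = 6.68     # from the slide

# Smoker with hypertension vs non-smoker without hypertension
print(round(combined_or(or_smoke, or_ht), 2))

# Smoker aged 30 vs nonsmoker aged 20: smoking OR times (age OR per year) ** 10
or_age_per_year = 0.97  # PLACEHOLDER, not from the lecture
print(round(combined_or(or_smoke, or_for_difference(or_age_per_year, 10)), 2))
```

Because the inputs are rounded, the product computed this way can differ slightly from one computed from the unrounded coefficients.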