Multivariable model building with continuous data Willi Sauerbrei Institut of Medical Biometry and Informatics University Medical Center Freiburg, Germany.

Slides:



Advertisements
Similar presentations
Part II: Coping with continuous predictors
Advertisements

Brief introduction on Logistic Regression
Week 12 November Four Mini-Lectures QMM 510 Fall 2014.
/k 2DS00 Statistics 1 for Chemical Engineering lecture 4.
6-1 Introduction To Empirical Models 6-1 Introduction To Empirical Models.
Review of Univariate Linear Regression BMTRY 726 3/4/14.
Departments of Medicine and Biostatistics
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 13 Nonlinear and Multiple Regression.
Interactions With Continuous Variables – Extensions of the Multivariable Fractional Polynomial Approach Willi Sauerbrei Institut of Medical Biometry and.
HSRP 734: Advanced Statistical Methods July 24, 2008.
Multiple Linear Regression
Overview of Logistics Regression and its SAS implementation
RELATIVE RISK ESTIMATION IN RANDOMISED CONTROLLED TRIALS: A COMPARISON OF METHODS FOR INDEPENDENT OBSERVATIONS Lisa N Yelland, Amy B Salter, Philip Ryan.
Model assessment and cross-validation - overview
Detecting an interaction between treatment and a continuous covariate: a comparison between two approaches Willi Sauerbrei Institut of Medical Biometry.
Modelling continuous variables with a spike at zero – on issues of a fractional polynomial based procedure Willi Sauerbrei Institut of Medical Biometry.
x – independent variable (input)
Making fractional polynomial models more robust Willi Sauerbrei Institut of Medical Biometry and Informatics University Medical Center Freiburg, Germany.
Statistics for Managers Using Microsoft® Excel 5th Edition
Flexible modeling of dose-risk relationships with fractional polynomials Willi Sauerbrei Institut of Medical Biometry and Informatics University Medical.
1 Chapter 9 Variable Selection and Model building Ray-Bing Chen Institute of Statistics National University of Kaohsiung.
Issues In Multivariable Model Building With Continuous Covariates, With Emphasis On Fractional Polynomials Willi Sauerbrei Institut of Medical Biometry.
Lecture 11 Multivariate Regression A Case Study. Other topics: Multicollinearity  Assuming that all the regression assumptions hold how good are our.
Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc. Chap 15-1 Chapter 15 Multiple Regression Model Building Basic Business Statistics 11 th Edition.
Validation of predictive regression models Ewout W. Steyerberg, PhD Clinical epidemiologist Frank E. Harrell, PhD Biostatistician.
Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation.
Classification and Prediction: Regression Analysis
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 12: Multiple and Logistic Regression Marshall University.
Survival analysis Brian Healy, PhD. Previous classes Regression Regression –Linear regression –Multiple regression –Logistic regression.
The Use of Fractional Polynomials in Multivariable Regression Modeling Part I - General considerations and issues in variable selection Willi Sauerbrei.
DIY fractional polynomials Patrick Royston MRC Clinical Trials Unit, London 10 September 2010.
Building multivariable survival models with time-varying effects: an approach using fractional polynomials Willi Sauerbrei Institut of Medical Biometry.
Modelling continuous exposures - fractional polynomials Willi Sauerbrei Institut of Medical Biometry and Informatics University Medical Center Freiburg,
Chapter 13: Inference in Regression
Advanced Statistics for Interventional Cardiologists.
Simple Linear Regression
Improved Use of Continuous Data- Statistical Modeling instead of Categorization Willi Sauerbrei Institut of Medical Biometry and Informatics University.
Essentials of survival analysis How to practice evidence based oncology European School of Oncology July 2004 Antwerp, Belgium Dr. Iztok Hozo Professor.
Lecture 12 Model Building BMTRY 701 Biostatistical Methods II.
Biostatistics Case Studies 2015 Youngju Pak, PhD. Biostatistician Session 4: Regression Models and Multivariate Analyses.
Variable selection and model building Part II. Statement of situation A common situation is that there is a large set of candidate predictor variables.
Statistical approaches to analyse interval-censored data in a confirmatory trial Margareta Puu, AstraZeneca Mölndal 26 April 2006.
Selecting Variables and Avoiding Pitfalls Chapters 6 and 7.
Basics of Regression Analysis. Determination of three performance measures Estimation of the effect of each factor Explanation of the variability Forecasting.
Week 6: Model selection Overview Questions from last week Model selection in multivariable analysis -bivariate significance -interaction and confounding.
D:/rg/folien/ms/ms-USA ppt F 1 Assessment of prediction error of risk prediction models Thomas Gerds and Martin Schumacher Institute of Medical.
Biostatistics Case Studies 2007 Peter D. Christenson Biostatistician Session 3: Incomplete Data in Longitudinal Studies.
Multivariable regression modelling – a pragmatic approach based on fractional polynomials for continuous variables Willi Sauerbrei Institut of Medical.
Use of FP and Other Flexible Methods to Assess Changes in the Impact of an exposure over time Willi Sauerbrei Institut of Medical Biometry and Informatics.
Data Analysis – Statistical Issues Bernd Genser, PhD Instituto de Saúde Coletiva, Universidade Federal da Bahia, Salvador
HSRP 734: Advanced Statistical Methods July 17, 2008.
Multiple Regression Petter Mostad Review: Simple linear regression We define a model where are independent (normally distributed) with equal.
Model Selection and Validation. Model-Building Process 1. Data collection and preparation 2. Reduction of explanatory or predictor variables (for exploratory.
Data Analysis in Practice- Based Research Stephen Zyzanski, PhD Department of Family Medicine Case Western Reserve University School of Medicine October.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 1 Chapter 8 Variable Selection Terry Dielman Applied Regression Analysis:
1 Statistics & R, TiP, 2011/12 Neural Networks  Technique for discrimination & regression problems  More mathematical theoretical foundation  Works.
Ch15: Multiple Regression 3 Nov 2011 BUSI275 Dr. Sean Ho HW7 due Tues Please download: 17-Hawlins.xls 17-Hawlins.xls.
Basic Business Statistics, 10e © 2006 Prentice-Hall, Inc. Chap 15-1 Chapter 15 Multiple Regression Model Building Basic Business Statistics 10 th Edition.
1 Building the Regression Model –I Selection and Validation KNN Ch. 9 (pp )
DSCI 346 Yamasaki Lecture 6 Multiple Regression and Model Building.
Canadian Bioinformatics Workshops
Yandell – Econ 216 Chap 15-1 Chapter 15 Multiple Regression Model Building.
Logistic Regression APKC – STATS AFAC (2016).
Multivariable regression models with continuous covariates with a practical emphasis on fractional polynomials and applications in clinical epidemiology.
Multivariate Analysis Lec 4
What is Regression Analysis?
Linear Model Selection and regularization
Lecture 12 Model Building
Multivariate Linear Regression
Regression and Clinical prediction models
Presentation transcript:

Multivariable model building with continuous data Willi Sauerbrei Institut of Medical Biometry and Informatics University Medical Center Freiburg, Germany Patrick Royston MRC Clinical Trials Unit, London, UK

2 Overview Issues in regression models Methods for variable selection Functional form for continuous covariates (Multivariable) fractional polynomials (MFP) Summary

3 Observational Studies Several variables, mix of continuous and (ordered) categorical variables Different situations: –prediction –explanation Explanation is the main interest here: Identify variables with (strong) influence on the outcome Determine functional form (roughly) for continuous variables The issues are very similar in different types of regression models (linear regression model, GLM, survival models...) Use subject-matter knowledge for modelling but for some variables, data-driven choice inevitable

4 X=(X 1,...,X p ) covariate, prognostic factors g(x) = ß 1 X 1 + ß 2 X ß p X p (assuming effects are linear) normal errors (linear) regression model Y normally distributed E (Y|X) = ß 0 + g(X) Var (Y|X) = σ 2 I logistic regression model Y binary Logit P (Y|X) = ln survival times T survival time (partly censored) Incorporation of covariates Regression models g(X) (g(X))

5 Central issue To select or not to select (full model)? Which variables to include?

6 Continuous variables – The problem “Quantifying epidemiologic risk factors using non- parametric regression: model selection remains the greatest challenge” Rosenberg PS et al, Statistics in Medicine 2003; 22: Discussion of issues in (univariate) modelling with splines Trivial nowadays to fit almost any model To choose a good model is much harder

7 Rosenberg et al, StatMed 2003 Alcohol consumption as risk factor for oral cancer

8 Building multivariable regression models Before dealing with the functional form, the ‚easier‘ problem of model selection: variable selection assuming that the effect of each continuous variable is linear

9 Multivariable models - methods for variable selection Full model –variance inflation in the case of multicollinearity Stepwise procedures  prespecified (  in,  out ) and actual significance level? forward selection (FS) stepwise selection (StS) backward elimination (BE) All subset selection  which criteria? C p Mallows= (SSE / ) - n + p 2 AICAkaike Information Criterion= n ln (SSE / n) + p 2 BICBayes Information Criterion= n ln (SSE / n) + p ln(n) fit penalty Combining selection with Shrinkage Bayes variable selection Recommendations??? Central issue: MORE OR LESS COMPLEX MODELS?

10 Backward elimination is a sensible approach -Significance level can be chosen -Reduces overfitting Of course required Checks Sensitivity analysis Stability analysis

11 Traditional approaches a) Linear function - may be inadequate functional form - misspecification of functional form may lead to wrong conclusions b) ‘best‘ ‘standard‘ transformation c) Step function (categorial data) - Loss of information - How many cutpoints? - Which cutpoints? - Bias introduced by outcome-dependent choice Continuous variables – what functional form?

12 StatMed 2006, 25:

13 Continuous variables – newer approaches ‘Non-parametric’ (local-influence) models –Locally weighted (kernel) fits (e.g. lowess) –Regression splines –Smoothing splines Parametric (non-local influence) models –Polynomials –Non-linear curves –Fractional polynomials Intermediate between polynomials and non-linear curves

14 Fractional polynomial models Describe for one covariate, X Fractional polynomial of degree m for X with powers p 1, …, p m is given by FPm(X) =  1 X p 1 + … +  m X p m Powers p 1,…, p m are taken from a special set {  2,  1,  0.5, 0, 0.5, 1, 2, 3} Usually m = 1 or m = 2 is sufficient for a good fit Repeated powers (p 1 =p 2 )  1 X p 1 +  2 X p 1 log X 8 FP1, 36 FP2 models

15 Examples of FP2 curves - varying powers

16 Examples of FP2 curves - single power, different coefficients

17 Our philosophy of function selection Prefer simple (linear) model Use more complex (non-linear) FP1 or FP2 model if indicated by the data Contrasts to more local regression modelling –Already starts with a complex model

events for recurrence-free survival time (RFS) in 686 patients with complete data 7 prognostic factors, of which 5 are continuous GBSG-study in node-positive breast cancer

19 FP analysis for the effect of age

20 χ 2 dfp-value Any effect? Best FP2 versus null Linear function suitable? Best FP2 versus linear FP1 sufficient? Best FP2 vs. best FP Function selection procedure (FSP) Effect of age at 5% level?

21 Many predictors – MFP With many continuous predictors selection of best FP for each becomes more difficult  MFP algorithm as a standardized way to variable and function selection (usually binary and categorical variables are also available) MFP algorithm combines backward elimination with FP function selection procedures

22 P-value Continuous factors Different results with different analyses Age as prognostic factor in breast cancer (adjusted)

23 Software sources MFP Most comprehensive implementation is in Stata –Command mfp is part since Stata 8 (now Stata 10) Versions for SAS and R are available –SAS –R version available on CRAN archive mfp package

24 Concluding comments – FPs FPs use full information - in contrast to a priori categorisation FPs search within flexible class of functions (FP1 and FP(2)-44 models) MFP is a well-defined multivariate model-building strategy – combines search for transformations with BE Important that model reflects medical knowledge, e.g. monotonic / asymptotic functional forms Bootstrap and shrinkage are useful ways to investigate model stability and bias in parameter estimates- particularly with flexible FPs and consequent danger of overfitting Preliminary transformation to reduce ‘end-effects’ problem MFP extensions Need to compare FP approach with splines and other methods but ‘standard‘ spline approach? reflect medical knowledge?

25 MFP extensions MFPI – treatment/covariate interactions MFPIgen – interaction between two continuous variables MFPT – time-varying effects in survival data

26 Towards recommendations for model-building by selection of variables and functional forms for continuous predictors under several assumptions IssueRecommendation Variable selection procedureBackward elimination; significance level as key tuning parameter, choice depends on the aim of the study Functional form for continuous covariatesLinear function as the 'default', check improvement in model fit by fractional polynomials. Check derived function for undetected local features Extreme values or influential pointsCheck at least univariately for outliers and influential points in continuous variables. A preliminary transformation may improve the model selected. For a proposal see R & S 2007 Sensitivity analysisImportant assumptions should be checked by a sensitivity analysis. Highly context dependent Check of model stabilityThe bootstrap is a suitable approach to check for model stability Complexity of a predictorA predictor should be 'as parsimonious as possible'

27 Summary Getting the big picture right is more important than optimising aspects and ignoring others strong predictors strong non-linearity strong interactions strong non-PH in survival model

28 Harrell FE jr. (2001): Regression Modeling Strategies. Springer. Royston P, Altman DG. (1994): Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling (with discussion). Applied Statistics, 43, Royston P, Sauerbrei W. (2003): Stability of multivariable fractional polynomial models with selection of variables and transformations: : a bootstrap investigation. Statistics in Medicine, 22, Royston P, Sauerbrei W. (2004): A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Statistics in Medicine, 23, Royston P, Sauerbrei W. (2005): Building multivariable regression models with continuous covariates, with a practical emphasis on fractional polynomials and applications in clinical epidemiology. Methods of Information in Medicine, 44, Royston P, Sauerbrei W. (2007): Improving the robustness of fractional polynomial models by preliminary covariate transformation: a pragmatic approach. Computational Statistics and Data Analysis, 51: Royston, P., Sauerbrei, W. (2008): ‘Multivariable Model-Building – A pragmatic approach to regression analysis based on fractional polynomials for continuous variables’. Wiley. Sauerbrei W. (1999): The use of resampling methods to simplify regression models in medical statistics. Applied Statistics, 48, Sauerbrei W, Meier-Hirmer C, Benner A, Royston P. (2006): Multivariable regression model building by using fractional polynomials: Description of SAS, STATA and R programs. Computational Statistics & Data Analysis, 50, Sauerbrei W, Royston P. (1999): Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. Journal of the Royal Statistical Society A, 162, Sauerbrei, W., Royston, P., Binder H (2007): Selection of important variables and determination of functional form for continuous predictors in multivariable model building. Statistics in Medicine, 26: Sauerbrei W, Royston P, Look M. (2007): A new proposal for multivariable modelling of time-varying effects in survival data based on fractional polynomial time-transformation. Biometrical Journal, 49: Sauerbrei W, Royston P, Zapien K. (2007): Detecting an interaction between treatment and a continuous covariate: a comparison of two approaches. Computational Statistics and Data Analysis, 51: Schumacher M, Holländer N, Schwarzer G, Sauerbrei W. (2006): Prognostic Factor Studies. In Crowley J, Ankerst DP (ed.), Handbook of Statistics in Clinical Oncology, Chapman&Hall/CRC, References