Multivariable regression models with continuous covariates with a practical emphasis on fractional polynomials and applications in clinical epidemiology.


Multivariable regression models with continuous covariates with a practical emphasis on fractional polynomials and applications in clinical epidemiology. Professor Patrick Royston, MRC Clinical Trials Unit, London. Berlin, April 2005. 8/4/2005

The problem: “Quantifying epidemiologic risk factors using non-parametric regression: model selection remains the greatest challenge” (Rosenberg PS et al, Statistics in Medicine 2003; 22:3369-3381). It is trivial nowadays to fit almost any model; choosing a good model is much harder.

Overview: context and motivation; introduction to fractional polynomials for the univariate smoothing problem; extension to multivariable models; more on spline models; stability analysis; Stata aspects; conclusions.

Motivation: We often have continuous risk factors in epidemiology and clinical studies. How should they be modelled? A linear model may describe a dose-response relationship badly (‘linear’ = straight line = $\beta_0 + \beta_1 X + \dots$ throughout the talk). Using cut-points has several problems. Splines are recommended by some but are not ideal: they lack a well-defined approach to model selection, are a ‘black box’, and have robustness issues.

Problems of cut-points: A step function is a poor approximation to the true relationship and almost always fits the data less well than a suitable continuous function. ‘Optimal’ cut-points have several difficulties: biased effect estimates, P-values that are too small (overstated significance), and results that are not reproducible in other studies.

Example datasets 1. Epidemiology: Whitehall 1. 17,370 male Civil Servants aged 40-64 years. Measurements include age, cigarette smoking, BP, cholesterol, height, weight and job grade. Outcomes of interest: coronary heart disease, all-cause mortality → logistic regression. Interested in risk as a function of covariates. Several continuous covariates; some may have no influence in a multivariable context.

Example datasets 2. Clinical studies: German breast cancer study group (BMFT-2). Prognostic factors in primary breast cancer: age, menopausal status, tumour size, grade, number of positive lymph nodes, hormone receptor status. Recurrence-free survival time → Cox regression. 686 patients, 299 events. Several continuous covariates. Interested in a prognostic model and the effect of individual variables.

Example: Systolic blood pressure vs. age

Example: Curve fitting (systolic BP and age, not linear). The underlying functional relationship is probably simple, but a linear function fits badly. Notice how the FP function smooths out complex, implausible irregularities in the curve.

Empirical curve fitting: Aims. Smoothing; visualising the relationship of Y with X; providing and/or suggesting a functional form.

Some approaches. ‘Non-parametric’ (local-influence) models: locally weighted (kernel) fits (e.g. lowess), regression splines, smoothing splines (used in generalized additive models). Parametric (non-local influence) models: polynomials, non-linear curves, and fractional polynomials (intermediate between polynomials and non-linear curves).

Local regression models. Advantages: flexible, because local! May reveal the ‘true’ curve shape (?). Disadvantages: unstable, because local! No concise form for models, and therefore hard for others to use (publication, comparing results with those from other models). Curves are not necessarily smooth. ‘Black box’ approach. Many approaches: which one(s) to use?

Polynomial models: Do not have the disadvantages of local regression models, but do have others: lack of flexibility (low order), artefacts in fitted curves (high order), and they cannot have asymptotes.

Fractional polynomial models: Described here for one covariate, X (multiple regression comes later). A fractional polynomial of degree m for X with powers $p_1, \dots, p_m$ is given by $FP_m(X) = \beta_1 X^{p_1} + \dots + \beta_m X^{p_m}$. The powers $p_1, \dots, p_m$ are taken from a special set $\{-2, -1, -0.5, 0, 0.5, 1, 2, 3\}$, where power 0 denotes $\log X$. Usually m = 1 or m = 2 is sufficient for a good fit.

FP1 and FP2 models: FP1 models are simple power transformations: $1/X^2$, $1/X$, $1/\sqrt{X}$, $\log X$, $\sqrt{X}$, $X$, $X^2$, $X^3$ (8 models). FP2 models are combinations of these, for example $\beta_1 (1/X) + \beta_2 X^2$ (28 models). Note the ‘repeated powers’ models, for example $\beta_1 (1/X) + \beta_2 (1/X) \log X$ (8 further models, so 36 FP2 models in total).
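
The FP1 and FP2 transformations listed above can be sketched in a few lines of Python. This is only an illustration of the definition, not the Stata implementation; the names `FP_POWERS`, `fp_power` and `fp_terms` are invented for this example, and the covariate is assumed to be strictly positive.

```python
import math

# Conventional FP power set; power 0 denotes log X.
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

def fp_power(x, p):
    """Return x**p, with p == 0 meaning log(x). Requires x > 0."""
    if x <= 0:
        raise ValueError("FP transformations require x > 0")
    return math.log(x) if p == 0 else x ** p

def fp_terms(x, powers):
    """Basis terms of an FP with the given powers.

    A repeated power multiplies the previous term by log(x), so the
    powers (-1, -1) give the terms [1/x, (1/x) * log(x)].
    """
    terms = []
    previous = None
    for p in sorted(powers):
        if p == previous:
            terms.append(terms[-1] * math.log(x))  # 'repeated powers' model
        else:
            terms.append(fp_power(x, p))
        previous = p
    return terms
```

For instance, `fp_terms(2.0, (-1, 2))` returns `[0.5, 4.0]`, the two basis terms of the FP2 example $\beta_1 (1/X) + \beta_2 X^2$ evaluated at X = 2; the regression coefficients are then estimated by whatever model is being fitted.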

FP1 and FP2 models: some properties. Many useful curves. A variety of features are available: monotonic; can have an asymptote; non-monotonic (a single maximum or minimum, i.e. a single turning-point). They often fit better than conventional polynomials, even those of higher degree.

Examples of FP2 curves - varying powers

Examples of FP2 curves - single power, different coefficients

A philosophy of function selection: Prefer a simple (linear) model. Use a more complex (non-linear) FP1 or FP2 model if indicated by the data. Contrast with local regression modelling, which already starts with a complex model.

Estimation and significance testing for FP models: Fit the model with each combination of powers (FP1: 8 single powers; FP2: 36 combinations of powers). Choose the model with the lowest deviance (MLE). Comparing FPm with FP(m − 1): compare the deviance difference with $\chi^2$ on 2 d.f. (one d.f. for the power, one d.f. for the regression coefficient). This is supported by simulations and is slightly conservative.
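
The search over power combinations can be sketched as follows. This is a minimal illustration under stated assumptions: a normal-errors linear model, where the lowest deviance corresponds to the lowest residual sum of squares, and a small hand-written least-squares solver so that the example needs only the standard library. The function names are invented for this sketch; logistic or Cox models would need their own deviance calculation.

```python
import math
from itertools import combinations_with_replacement

FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

def fp_terms(x, powers):
    """FP basis terms; power 0 means log(x), a repeated power adds a log(x) factor."""
    terms, previous = [], None
    for p in sorted(powers):
        if p == previous:
            terms.append(terms[-1] * math.log(x))
        else:
            terms.append(math.log(x) if p == 0 else x ** p)
        previous = p
    return terms

def solve(a, b):
    """Solve the small linear system a @ coef = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef

def rss(xs, ys, powers):
    """Residual sum of squares of y = b0 + sum_j b_j * term_j(x), fitted by least squares."""
    rows = [[1.0] + fp_terms(x, powers) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    coef = solve(xtx, xty)
    return sum((y - sum(c * v for c, v in zip(coef, r))) ** 2 for r, y in zip(rows, ys))

def best_fp(xs, ys, degree):
    """Return (powers, rss) of the best-fitting FP of the given degree (1 or 2)."""
    candidates = combinations_with_replacement(FP_POWERS, degree)
    return min(((p, rss(xs, ys, p)) for p in candidates), key=lambda t: t[1])
```

`combinations_with_replacement(FP_POWERS, 1)` yields the 8 FP1 models and `combinations_with_replacement(FP_POWERS, 2)` the 36 FP2 models, including the 8 repeated-powers pairs. Solving the normal equations directly is adequate for a toy example but is numerically fragile; a real implementation would use a QR-based solver.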

Selection of FP function: Has the flavour of a closed test procedure. Use $\chi^2$ approximations to get P-values. Define a nominal P-value for all tests (often 5%). Fit the linear model and the best FP1 and FP2 models. Test FP2 vs. null: test of any effect of X ($\chi^2$ on 4 d.f.). Test FP2 vs. linear: test of non-linearity ($\chi^2$ on 3 d.f.). Test FP2 vs. FP1: test of a more complex function against a simpler one ($\chi^2$ on 2 d.f.).
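
The sequence of tests above can be sketched as a small decision function. This is an illustrative reading of the slide, not the code of the Stata `mfp` command: the function names and return labels are invented here, the four deviances are assumed to have been computed elsewhere, and the $\chi^2$ tail probabilities use closed-form expressions that hold for 1 to 4 d.f. only.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate, for df in {1, 2, 3, 4}."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2))
                + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
    if df == 4:
        return math.exp(-x / 2) * (1 + x / 2)
    raise ValueError("df must be 1, 2, 3 or 4")

def select_fp_function(dev_null, dev_linear, dev_fp1, dev_fp2, alpha=0.05):
    """Choose a function for X from the deviances of the four fitted models.

    dev_null is the deviance without X, dev_linear with X linear, and
    dev_fp1 / dev_fp2 those of the best FP1 and best FP2 models.
    """
    if chi2_sf(dev_null - dev_fp2, 4) >= alpha:
        return "omit X"   # no evidence of any effect of X
    if chi2_sf(dev_linear - dev_fp2, 3) >= alpha:
        return "linear"   # no evidence of non-linearity
    if chi2_sf(dev_fp1 - dev_fp2, 2) >= alpha:
        return "FP1"      # FP2 is not significantly better than FP1
    return "FP2"
```

With deviances 120, 115, 101 and 100, for example, the first two tests are significant at the 5% level but the FP2 vs. FP1 difference of 1 on 2 d.f. is not, so the function returns `"FP1"`.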

Example: Systolic BP and age. Reminder: FP1 had power 3: $\beta_1 X^3$. FP2 had powers (1, 1): $\beta_1 X + \beta_2 X \log X$.

Aside: FP versus spline. Why care about FPs when splines are more flexible? More flexible ⇒ more unstable, with more chance of ‘over-fitting’. In epidemiology, dose-response relationships are often simple. Illustrated by a small simulation example.

FP versus spline (continued): Logarithmic relationships are common in practice. Simulate the regression model $y = \beta_0 + \beta_1 \log(X) + \text{error}$, where the error is normally distributed, $N(0, \sigma^2)$. Take $\beta_0 = 0$, $\beta_1 = 1$; X has a lognormal distribution. Vary $\sigma = \{1, 0.5, 0.25, 0.125\}$. Fit FP1, FP2 and splines with 2, 4 and 6 d.f. Compute the mean square error and compare it with the mean square error for the true model.

FP vs. spline (continued): Example of what the data look like

FP vs. spline (continued)

FP vs. spline (continued)

FP vs. spline (continued): 200 replicates only

FP vs. spline (continued): In this example, the spline is usually less accurate than FP. FP2 is less accurate than FP1 (over-fitting). FP1 and FP2 are more accurate than splines. Splines often had non-monotonic fitted curves, which could be medically implausible. Of course, this is a special example.

Multivariable FP (MFP) models: Assume we have k > 1 continuous covariates and perhaps some categorical or binary covariates. Allow dropping of non-significant variables. Wish to find the best multivariable FP model for all X’s. It is impractical to try all combinations of powers, so an iterative fitting procedure is required.

Fitting multivariable FP models (MFP algorithm): Combine backward elimination of weak variables with a search for the best FP functions. Determine the fitting order from the linear model. Apply the FP model selection procedure to each X in turn, fixing the functions (but not the $\beta$’s) for the other X’s. Cycle until the FP functions (i.e. powers) and the variables selected do not change.
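
The cycling step of the algorithm can be sketched as a loop that stops when a full pass leaves every choice unchanged. This is a skeleton only: `select_function` is a hypothetical callable standing in for the per-covariate FP selection procedure (including the option to drop the variable), and the model fitting itself is not shown.

```python
def mfp_cycle(order, select_function, max_cycles=10):
    """Cycle over covariates until the selected functions stop changing.

    order            covariate names in fitting order (e.g. from a linear model)
    select_function  hypothetical callable (name, others) -> choice, where
                     `others` maps the other covariates to their current
                     choices and `choice` is None (variable dropped) or a
                     tuple of FP powers
    """
    current = {name: (1,) for name in order}   # start with every covariate linear
    for cycle in range(1, max_cycles + 1):
        previous = dict(current)
        for name in order:
            # Functions of the other covariates are held fixed for this step;
            # their regression coefficients would be re-estimated in the fit.
            others = {k: v for k, v in current.items() if k != name}
            current[name] = select_function(name, others)
        if current == previous:                # nothing changed in this pass
            return current, cycle
    return current, max_cycles
```

A dropped variable (choice `None`) is revisited in every later pass, so it can re-enter the model if the functions chosen for the other covariates change. The `max_cycles` guard is an assumption of this sketch, added so the loop always terminates.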

Example: Prognostic factors in breast cancer. Aim to develop a prognostic index for risk of tumour recurrence or death. Have 7 prognostic factors (4 continuous, 3 categorical). Select variables and functions using the 5% significance level.

Univariate linear analysis: Some people might choose to put all variables significant at 5% into the multivariable model.

Univariate FP2 analysis: Gain compares FP2 with linear on 3 d.f. All factors except for X3 have a non-linear effect.

Multivariable FP analysis

Comments on analysis: Conventional backward elimination at the 5% level selects X4a, X5 and X6; X1 is excluded. FP analysis picks up the same variables as backward elimination, and additionally X1. Note the considerable non-linearity of X1 and X5. X1 has no linear influence on risk of recurrence. The FP model detects more structure in the data than the linear model.

Plots of fitted FP functions: Note the non-monotonicity of the X5 function.

Survival by risk groups

Robustness of FP functions: The breast cancer example showed non-robust functions for nodes, which are not medically sensible. The situation can be improved by performing a covariate transformation before FP analysis. This can be done systematically (work in progress). Sauerbrei & Royston (1999) used a negative exponential transformation of nodes: exp(-0.12 * number of nodes).
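
As a small illustration of the transformation quoted above, the following sketch applies exp(-0.12 * number of nodes) to a node count. The function name and the `rate` parameter are invented for this example; only the constant 0.12 comes from the slide.

```python
import math

def transform_nodes(nodes, rate=0.12):
    """Negative exponential transformation of the number of positive nodes.

    Maps 0, 1, 2, ... onto the interval (0, 1]: zero nodes gives 1.0, and
    large counts are compressed towards 0.
    """
    return math.exp(-rate * nodes)
```

Because the transformed value is bounded and decreases smoothly, a handful of patients with very many positive nodes have much less leverage on the fitted FP function than they would on the raw scale.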

Making the function for lymph nodes more robust

2nd example: Whitehall 1. MFP analysis: no variables were eliminated by the MFP algorithm. Weight is eliminated by linear backward elimination.

Plots of FP functions

A new multivariable regression algorithm with spline functions: Inspired by the closed test procedure for selecting an FP function. Start with a predefined number of knots, which determines the maximum complexity of the function. Use predetermined knot positions, e.g. at fixed percentile positions of the distribution of x. The simplest function (default) is linear. A closed test procedure reduces the knot set if some knots are not significant. Apply the backfitting procedure as in mfp. Implemented in Stata as the new command mrsnb.

Splines: Breast cancer example. Selects variables similar to mfp: Grade 2/3 is omitted, otherwise the selected variables are identical. Knots: age (46, 53); transformed nodes (linear); PgR (7, 132). The deviance of the selected model is almost identical to that of the mfp model.

Plots of fitted FP functions

Improving the robustness of spline models: We often have covariates with positively skewed distributions, which can produce curve artefacts. A simple approach is to log-transform covariates with a skew distribution (e.g. skewness $\gamma_1 > 0.5$) and then fit the spline model. In the breast cancer example, this approach gives a more satisfactory log function for PgR.

Stability of FP models: Models (variables, FP functions) are selected by statistical criteria (a cut-off on the P-value). The approach has several advantages … … and is also known to have problems: omission bias, selection bias, and instability (many models may fit equally well).

Stability investigation: Instability may be studied by bootstrap resampling (sampling with replacement). Take a bootstrap sample B times. Select a model by the chosen procedure. Count how many times each variable is selected. Summarise the inclusion frequencies and their dependencies. Study the fitted functions for each covariate. This may lead to choosing several possible models, or a model different from the original one.
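
The resampling and counting steps above can be sketched as follows. This is a generic illustration, not the `mfpboot` command: `select_model` is a hypothetical callable standing in for whatever selection procedure is being studied (for example an MFP fit), and the function name, default `n_boot` and `seed` are assumptions of this sketch.

```python
import random
from collections import Counter

def bootstrap_inclusion(data, select_model, n_boot=200, seed=1):
    """Inclusion frequency of each variable over bootstrap replications.

    data          list of observations (any objects)
    select_model  hypothetical callable: sample -> iterable of the names
                  of the variables selected on that sample
    Returns (frequencies, model_counts), where model_counts records how
    often each distinct set of variables was selected.
    """
    rng = random.Random(seed)
    variable_counts = Counter()
    model_counts = Counter()
    n = len(data)
    for _ in range(n_boot):
        # Bootstrap sample: n draws with replacement from the original data.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        selected = frozenset(select_model(sample))
        variable_counts.update(selected)
        model_counts[selected] += 1
    frequencies = {var: c / n_boot for var, c in variable_counts.items()}
    return frequencies, model_counts
```

The number of distinct keys in `model_counts` shows how many different models the procedure selected across replications. Summarising dependencies between inclusions and comparing the fitted functions would need the selected powers as well, which this sketch does not record.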

Bootstrap stability analysis of the breast cancer dataset: 5000 bootstrap samples taken (!). MFP algorithm with Cox model applied to each sample. Resulted in 1222 different models (!!). Nevertheless, could identify a stable subset consisting of 60% of replications, judged by similarity of the functions selected.

Bootstrap stability analysis of the breast cancer dataset

Bootstrap analysis: summaries of fitted curves from the stable subset. The stable subset comprised 60.1% of replications. Functions are summarised over the replications in which the variable was selected.

Presentation of models for continuous covariates: The function + 95% CI gives the whole story. Functions for important covariates should always be plotted. In epidemiology, it is sometimes useful to give a more conventional table of results in categories; this can be done from the fitted function.

Example: Cigarette smoking and all-cause mortality (Whitehall 1)

Other issues (1). Handling continuous confounders: may use a larger P-value for selection, e.g. 0.2; not so concerned about functional form here. Binary/continuous covariate interactions: can be modelled using FPs (Royston & Sauerbrei 2004); adjust for other factors using MFP.

Other issues (2). Time-varying effects in survival analysis: can be modelled using FP functions of time (Berger; also Sauerbrei & Royston, in progress). Checking adequacy of FP functions: may be done by using splines; fit the FP function and see if a spline function adds anything, adjusting for the fitted FP function.

Stata aspects: Command mfp is part of Stata 8. Example of use:
mfp stcox x1 x2 x3 x4a x4b x5 x6 x7 hormon, select(0.05, hormon:1)
Command mrsnb is available from PR (Patrick Royston):
mrsnb stcox x1 x2 x3 x4a x4b x5 x6 x7 hormon, select(0.05, hormon:1)
Command mfpboot is available from PR; it does bootstrap stability analysis of MFP models.

Concluding remarks (1): FP method in general. There is no reason (other than convention) why regression models should include only positive integer powers of covariates. FP is a simple extension of an existing method: simple to program and simple to explain. It is parametric, so predicted values are easy to obtain. FP usually gives a better fit than standard polynomials; it cannot do worse, since standard polynomials are included.

Concluding remarks (2): Multivariable FP modelling. Many applications in the general context of multiple regression modelling. Well-defined procedure based on standard principles for selecting variables and functions. Aspects of robustness and stability have been investigated (and methods are available). Much experience gained so far suggests that the method is very useful in clinical epidemiology.

Some references
Royston P, Altman DG (1994) Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Applied Statistics 43: 429-467.
Royston P, Altman DG (1997) Approximating statistical functions by using fractional polynomial regression. The Statistician 46: 1-12.
Sauerbrei W, Royston P (1999) Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. JRSS(A) 162: 71-94. Corrigendum JRSS(A) 165: 399-400, 2002.
Royston P, Ambler G, Sauerbrei W (1999) The use of fractional polynomials to model continuous risk variables in epidemiology. International Journal of Epidemiology 28: 964-974.
Royston P, Sauerbrei W (2004) A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Statistics in Medicine 23: 2509-2525.
Royston P, Sauerbrei W (2003) Stability of multivariable fractional polynomial models with selection of variables and transformations: a bootstrap investigation. Statistics in Medicine 22: 639-659.
Armitage P, Berry G, Matthews JNS (2002) Statistical Methods in Medical Research. Oxford, Blackwell.