Department of Epidemiology and Public Health, Unit of Biostatistics and Computational Sciences. Regression models for binary and survival data. PD Dr. C. Schindler, Swiss Tropical and Public Health Institute, University of Basel. Annual meeting of the Swiss Societies of Clinical Neurophysiology and of Neurology, Lugano, May 3rd 2012.

Binary outcome data

Binary endpoints Y: Y = 1 if the event occurred, Y = 0 otherwise. Examples: Y = "death within 1 year", Y = "disease progression within 2 years", Y = "remission within three months".

Preliminaries: 1. Meaning of E(Y) for a binary variable Y: E(Y) = mean of Y at the population level = P(Y = 0) · 0 + P(Y = 1) · 1 = P(Y = 1). Thus, mean of Y = probability of the event represented by Y.

2. Notion of odds: Odds(Y) = P(Y = 1) / P(Y = 0) = P(Y = 1) / [1 – P(Y = 1)]. Conversely, P(Y = 1) = Odds(Y) / [1 + Odds(Y)] (*). Example: Y = "disease progression", P(Y = 1) = 0.3. Odds(Y) = 0.3 / [1 – 0.3] = 0.3 / 0.7 = 0.429. Check with (*): P(Y = 1) = 0.429 / [1 + 0.429] = 0.3.
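A minimal sketch of this probability/odds conversion in Python (plain Python, no packages; the function names are mine):

```python
def prob_to_odds(p):
    """Odds(Y) = P(Y = 1) / [1 - P(Y = 1)]."""
    return p / (1.0 - p)

def odds_to_prob(odds):
    """P(Y = 1) = Odds(Y) / [1 + Odds(Y)]."""
    return odds / (1.0 + odds)

p = 0.3                      # P(Y = 1), e.g. disease progression
odds = prob_to_odds(p)       # 0.3 / 0.7 = 0.4286...
print(round(odds, 3))        # 0.429
print(odds_to_prob(odds))    # back to 0.3 (up to rounding)
```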

3. Notion of odds ratio (OR): X = "high risk" (0 -> normal risk, 1 -> increased risk). P(Y = 1 | X = 0) = P(Y = 1) in subjects with X = 0 = 0.2, so Odds(Y | X = 0) = 0.2 / 0.8 = 0.25. P(Y = 1 | X = 1) = P(Y = 1) in subjects with X = 1 = 0.4, so Odds(Y | X = 1) = 0.4 / 0.6 = 0.667. OR(Y | X) = Odds(Y | X = 1) / Odds(Y | X = 0) = 0.667 / 0.25 = 2.67.

Symmetry of OR:

                            with outcome (Y = 1)    w/o outcome (Y = 0)
with risk factor (X = 1)            40                      60
w/o risk factor (X = 0)             20                      80

Prospective (cohort study): OR of Y between X = 1 and X = 0 = 40/60 : 20/80 = 2.67
Retrospective (case-control study): OR of X between Y = 1 and Y = 0 = 40/20 : 60/80 = 2.67

4. Calculus of odds and odds ratios. Example: risk of disease progression without risk factors A and B = 20%. OR of disease progression associated with risk factor A = 2.0. OR of disease progression associated with risk factor B = 3.0. Odds without risk factors = 0.2 / (1 – 0.2) = 0.25. a) Odds with risk factor A only = 0.25 × 2.0 = 0.5. b) Odds with risk factor B only = 0.25 × 3.0 = 0.75. c) Odds with both risk factors = 0.25 × 2.0 × 3.0 = 1.5 (if the factors do not interact). Corresponding risks: a) 0.5/1.5 = 0.33, b) 0.75/1.75 = 0.43, c) 1.5/2.5 = 0.6.
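A quick numerical check of this calculus (a sketch reusing the conversion helpers above; all names and values are taken from the example on the slide):

```python
def prob_to_odds(p):
    return p / (1.0 - p)

def odds_to_prob(odds):
    return odds / (1.0 + odds)

baseline_risk = 0.20          # risk without risk factors A and B
or_a, or_b = 2.0, 3.0         # odds ratios associated with factors A and B

baseline_odds = prob_to_odds(baseline_risk)     # 0.25
odds_a  = baseline_odds * or_a                  # 0.5
odds_b  = baseline_odds * or_b                  # 0.75
odds_ab = baseline_odds * or_a * or_b           # 1.5 (no interaction assumed)

for label, odds in [("A only", odds_a), ("B only", odds_b), ("A and B", odds_ab)]:
    print(label, round(odds_to_prob(odds), 2))  # 0.33, 0.43, 0.6
```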

Regression models for probabilities / odds

Idea: as in classical regression, consider E(Y) = β0 + β1 · risk score. Equivalent formulation of the model: P(Y = 1) = β0 + β1 · risk score. Problem: P(Y = 1) must lie in (0, 1), but β0 + β1 · risk score can take values outside (0, 1).

Solution: P(Y = 1) = F(β0 + β1 · risk score), where F(z) is a function whose values are always in (0, 1). z = linear predictor (linear prediction score).

Logistic function (standard choice): F(z) = e^z / (1 + e^z). a) e^z > 0 => F(z) > 0. b) e^z < 1 + e^z => F(z) < 1. Hence F(z) always lies strictly between 0 and 1 (sigmoid-shaped curve in z).
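A minimal numerical sketch of F(z) (plain Python; the function name is mine):

```python
import math

def logistic(z):
    """F(z) = e^z / (1 + e^z); always strictly between 0 and 1."""
    return math.exp(z) / (1.0 + math.exp(z))

for z in (-5, -1, 0, 1, 5):
    print(z, round(logistic(z), 3))   # 0.007, 0.269, 0.5, 0.731, 0.993
```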

Y = outcome (1 = present, 0 = absent), X = risk factor (1 = present, 0 = absent). Linear predictor (logit): z = β0 + β1 · x. Recalling that P(Y = 1 | X) = Odds(Y | X) / [1 + Odds(Y | X)] shows that Odds(Y | X) = e^z = e^(β0 + β1 · x).

x = 0: Odds(Y | X = 0) = e^(β0). x = 1: Odds(Y | X = 1) = e^(β0 + β1). Odds ratio of Y between x = 1 and x = 0: OR = e^(β0 + β1) / e^(β0) = e^(β1). Hence β0 = ln(Odds(Y | X = 0)) and β1 = ln(OR(Y | X)).

Logistic regression: P(Y = 1) = e^z / (1 + e^z). Probit regression: P(Y = 1) = Φ(z), where Φ = cumulative distribution function of the standard normal distribution (another sigmoid-shaped function ranging from 0 to 1).

Logistic regression    Number of obs = 200    LR chi2(1) = 9.66    Prob > chi2 =
Log likelihood =    Pseudo R2 =

outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
risk_factor |
_cons |

Coefficient of risk_factor = ln(Odds Ratio) -> Odds Ratio = e^coefficient = 2.67
Intercept (_cons) = ln(Odds(Y | X = 0)) -> Odds(Y | X = 0) = e^intercept = 0.25
P(Y = 1 | X = 0) = 0.25 / (1 + 0.25) = 0.2
P(Y = 1 | X = 1) = 0.25 · 2.67 / (1 + 0.25 · 2.67) = 0.4

Summary exp(coefficient of risk factor) = OR of outcome between those with and those without risk factor (cohort study) = OR of risk factor between those with and those without outcome (case-control study) exp(intercept term) = odds of outcome among unexposed subjects

Direct output of the odds ratio and of the odds of the unexposed:

Logistic regression    Number of obs = 200    LR chi2(1) = 9.66    Prob > chi2 =
Log likelihood =    Pseudo R2 =

outcome | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
risk_factor |
_cons |

95%-confidence interval of the odds ratio: (1.42, 5.02)

Note: 1. Confidence intervals of odds ratios are asymmetrical! 2. If they do not include 1, then the respective association is statistically significant at the 5% level.
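A sketch of how this fit could be reproduced in Python with statsmodels, using hypothetical individual-level data that matches the 2x2 table above (the slides use Stata; variable names are mine):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data reproducing the 2x2 table:
# exposed (X = 1): 40 events / 100 subjects; unexposed (X = 0): 20 events / 100 subjects
x = np.repeat([1, 1, 0, 0], [40, 60, 20, 80])
y = np.repeat([1, 0, 1, 0], [40, 60, 20, 80])

X = sm.add_constant(x)                    # intercept + risk factor
fit = sm.Logit(y, X).fit(disp=False)

or_hat = np.exp(fit.params[1])            # exp(coefficient) = odds ratio
ci = np.exp(fit.conf_int()[1])            # 95% CI on the odds-ratio scale
print(round(or_hat, 2), np.round(ci, 2))  # ~2.67 with an asymmetric CI around it
```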

Example from the literature: Ritchie K. et al., The neuroprotective effects of caffeine – a prospective population study (The Three City Study), Neurology 2007; 69. Outcome = cognitive decline (CD), measured as either a) decline by at least 6 units in the Isaacs set test or b) decline by at least 2 units in the Benton visual retention test over four years.

1. On average, the odds of CD increased by 6% per additional year of age at baseline. 2. On average, the odds of CD increased by 8% per additional unit in the baseline cognitive test. 3. Compared to subjects with 5 years of education, the odds of CD among subjects with ≥ 12 years of education were reduced by more than 40%.

Main result of the paper: significantly reduced risk of Δ Isaacs ≤ –6 among women who drank more than 3 units of caffeine per day at baseline, compared to women who drank only 0–1 units.

CAVE: Odds ratio (OR) ≠ relative risk (RR); odds ≠ risk (relative frequency). Interpretation of the OR as a relative risk, and of odds as a risk (relative frequency), is only appropriate if risks are small (i.e., < 10%). NOTE: odds > risk (relative frequency). Possible situations for OR and RR: a) OR < RR < 1, b) OR = RR = 1, c) 1 < RR < OR.
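A small numerical illustration of why the OR overstates the RR when risks are not small (plain Python; the risk values are hypothetical):

```python
def odds(p):
    return p / (1.0 - p)

# Hypothetical risks in exposed (p1) and unexposed (p0) groups
for p1, p0 in [(0.02, 0.01), (0.4, 0.2)]:
    rr = p1 / p0                       # relative risk
    odds_ratio = odds(p1) / odds(p0)   # odds ratio
    print(p1, p0, round(rr, 2), round(odds_ratio, 2))
# small risks:  RR = 2.0, OR = 2.02  (nearly equal)
# larger risks: RR = 2.0, OR = 2.67  (OR > RR)
```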

Model comparison Likelihood ratio test Akaike information criterion (AIC) Bayesian information criterion (BIC) Pseudo-R2

Likelihood L of a model: the probability of observing exactly the same outcome data again if exactly the same predictor data are given, provided that the model describes reality in the best possible way with the given variables. Log-likelihood ln(L) = natural logarithm of the likelihood. ln(L) is always ≤ 0, since probabilities are ≤ 1. The perfect model would have L = 1 and ln(L) = 0. The better the model, the closer ln(L) is to 0.

Comparison of nested models: a model M1 is said to be nested in another model M2 if M1 is a special case of M2, e.g., if all the terms of M1 are also in M2 but not vice versa. Under the hypothesis that the additional terms of M2 are of no predictive value, the difference D = 2 · [ln(L2) – ln(L1)] = 2 · ln(L2/L1) (i.e., twice the logarithm of the likelihood ratio) has an approximate Chi²-distribution with df2 – df1 degrees of freedom (where dfi = number of parameters of model i). CAVE: both models must be based on exactly the same data. In particular, their n's must be identical.
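A sketch of the likelihood-ratio test given the two fitted log-likelihoods (Python with scipy; the numerical values are placeholders, not taken from the slides):

```python
from scipy.stats import chi2

# Hypothetical fitted log-likelihoods of two nested models
ll_small, df_small = -135.9, 1    # M1: e.g. intercept only
ll_large, df_large = -131.1, 2    # M2: intercept + risk factor

D = 2.0 * (ll_large - ll_small)             # likelihood-ratio statistic
p_value = chi2.sf(D, df_large - df_small)   # upper tail of the Chi2 distribution
print(round(D, 2), round(p_value, 4))
```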

Akaike information criterion ("smaller is better"): AIC = –2 ln(L) + 2 · p, where p = number of parameters of the model in addition to the intercept parameter; the term 2 · p is a penalty for the complexity of the model. The two models compared must be based on exactly the same data (same n!), but they need not be nested and can contain different variables.

Bayesian information criterion (Schwarz criterion) ("smaller is better"): BIC = –2 ln(L) + p · ln(n), where p = number of parameters of the model in addition to the intercept parameter and n = sample size; the term p · ln(n) is a penalty for the complexity of the model. The two models compared need not be based on the same data, nor do they have to be nested.
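A minimal sketch computing both criteria from a fitted log-likelihood (plain Python; the values are placeholders and p is counted as on the slides, i.e., excluding the intercept):

```python
import math

def aic(log_lik, p):
    """AIC = -2 ln(L) + 2 p  (p = parameters in addition to the intercept)."""
    return -2.0 * log_lik + 2.0 * p

def bic(log_lik, p, n):
    """BIC = -2 ln(L) + p ln(n)."""
    return -2.0 * log_lik + p * math.log(n)

log_lik, p, n = -131.1, 1, 200     # hypothetical model fit
print(round(aic(log_lik, p), 1), round(bic(log_lik, p, n), 1))
```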

Pseudo-R2: there exists an analog of R2 for logistic and other generalized linear regression models. Pseudo-R2 = [ln(L model) – ln(L null model)] / [0 – ln(L null model)] ("variance explained" / "total variance"), where the null model is the model with an intercept term only. Since ln(L null model) ≤ ln(L model) ≤ 0, the pseudo-R2 lies between 0 and 1.
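The same quantity as a quick function (plain Python; this formula corresponds to McFadden's pseudo-R2, which is what Stata reports as "Pseudo R2"):

```python
def pseudo_r2(ll_model, ll_null):
    """McFadden pseudo-R2: [ln(L_model) - ln(L_null)] / [0 - ln(L_null)]."""
    return (ll_model - ll_null) / (0.0 - ll_null)

# Hypothetical log-likelihoods of the fitted and the intercept-only (null) model
print(round(pseudo_r2(-131.1, -135.9), 3))   # ~0.035
```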

Goodness of fit of a logistic regression model can be assessed using the Hosmer-Lemeshow test. Mechanics of the test: 1. For each subject, the logistic regression model predicts the individual probability of having Y = 1. 2. Subjects are then categorized into a certain number of classes based on the size of their predicted probabilities. 3. In each of the classes, the proportion of subjects with Y = 1 is determined and compared with the mean value of the predicted probabilities (ideally, the two values should coincide in each class).
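A sketch of these three steps as a simplified decile-based version of the test (numpy/scipy; this is my own illustrative implementation, not the slides' software):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, n_groups=10):
    """y, p_hat: numpy arrays of observed outcomes and predicted probabilities."""
    order = np.argsort(p_hat)
    groups = np.array_split(order, n_groups)       # classes of roughly equal size
    stat = 0.0
    for g in groups:
        obs = y[g].sum()                           # observed events in the class
        exp = p_hat[g].sum()                       # expected events = sum of predictions
        n = len(g)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n))
    p_value = chi2.sf(stat, n_groups - 2)          # usual reference distribution
    return stat, p_value

# usage sketch: stat, p = hosmer_lemeshow(y, fitted_model.predict(X))
```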

Analysis of survival data

Censored and uncensored survival times. Observation period from t0 to t1 (= individual time scale!). True event (x) -> uncensored survival time. Loss to follow-up (o) -> censored survival time. Event-free survival until t1 -> censored survival time.

In survival analyses, two outcome variables are needed: 1. Event variable: 1 = event was observed -> uncensored observation; 0 = event was not observed -> censored observation. 2. Variable for the event-free time observed: = time until event (uncensored observation), = observation time (censored observation).

Simple group comparisons of survival data: 1. Construction of survival curves using the method of Kaplan-Meier. 2. Comparison of survival curves using the log-rank or the Wilcoxon test.

[Figure: Kaplan-Meier survival estimates for groups 1, 2 and 3; x-axis: analysis time (weeks)]
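A sketch of how such curves and the corresponding log-rank comparison could be produced in Python with the lifelines package (the data frame, with one row per subject, is hypothetical):

```python
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Hypothetical data: event-free time (weeks), event indicator and group
df = pd.DataFrame({
    "time":  [5, 8, 12, 20, 4, 9, 15, 30],
    "event": [1, 0, 1, 1, 1, 1, 0, 1],   # 1 = event observed, 0 = censored
    "group": [1, 1, 1, 1, 2, 2, 2, 2],
})

kmf = KaplanMeierFitter()
for g, sub in df.groupby("group"):
    kmf.fit(sub["time"], event_observed=sub["event"], label=f"group {g}")
    print(kmf.survival_function_)        # or kmf.plot_survival_function()

g1, g2 = df[df.group == 1], df[df.group == 2]
res = logrank_test(g1["time"], g2["time"], g1["event"], g2["event"])
print(res.test_statistic, res.p_value)
```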

S(t) = proportion of patients without event until time t; h(t) = instantaneous event risk (hazard at time t). Over a short interval Δt: S(t + Δt) ≈ S(t) – S(t) · h(t) · Δt, i.e., [S(t + Δt) – S(t)] / Δt ≈ –h(t) · S(t). Letting Δt -> 0 gives dS/dt = –h(t) · S(t), and therefore d/dt [ln(S(t))] = –h(t), i.e., h(t) = –d/dt [ln(S(t))].

Assumption of proportional hazards (PH): h1(t) = hazard function in group 1, h2(t) = hazard function in group 2. PH: the ratio of hazards is independent of time t, i.e., h2(t) / h1(t) = HR (hazard ratio), or equivalently h2(t) = HR · h1(t).

[Figures: smoothed hazard estimates and Kaplan-Meier survival curves for groups 1, 2 and 3; x-axis: analysis time (weeks)] Hazard functions of groups 1 and 2 are proportional; the hazard function of group 3 violates the PH assumption.

[Figures: smoothed hazard estimates and logarithmized hazard estimates for groups 1, 2 and 3; x-axis: analysis time (weeks)] The logarithmized hazards of groups 1 and 2 run parallel (PH assumption).

Modelling of the hazard ratio

Sir David Roxbee Cox * Cambridge University Birkbeck College, London Imperial College, London Oxford University

HR = e^(β·x), x dichotomous: x = 1 with risk factor, x = 0 without risk factor. HR between x = 1 and x = 0: e^(β·1) = e^β = hazard ratio associated with the risk factor.

HR = e^(β·x), x continuous, e.g., x = age at baseline. e^(β·1) = e^β = hazard ratio associated with a unit increase in x (cross-sectional comparison).

Multiple proportional hazards regression model ("Cox regression model"): HR = e^(β1·x1 + β2·x2 + ... + βk·xk). Reference category: subjects with x1 = x2 = ... = xk = 0.
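A sketch of fitting such a model in Python with lifelines (the slides use Stata's stcox; the small data frame below is hypothetical and purely illustrative):

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical data: time, event indicator and two dummy-coded group indicators
df = pd.DataFrame({
    "time":    [5, 8, 12, 20, 4, 9, 15, 30, 7, 11],
    "event":   [1, 0, 1, 1, 1, 1, 0, 1, 1, 0],
    "group_2": [0, 0, 0, 1, 1, 1, 0, 0, 1, 0],
    "group_3": [1, 1, 0, 0, 0, 0, 1, 1, 0, 0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
print(cph.summary[["coef", "exp(coef)"]])   # exp(coef) = hazard ratio vs. reference group
```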

No. of subjects = 1500   Number of obs = 1500   No. of failures = 671   Time at risk =
LR chi2(2) =   Log likelihood =   Prob > chi2 =

_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
group_2 |
group_3 |

In our example: on average, the hazards in groups 3 and 2 were higher by 83% (95%-CI: 51 to 121%) and 39% (95%-CI: 15 to 70%), respectively, than in group 1 (= reference group).

Analysis restricted to the first year:
No. of subjects = 1500   Number of obs = 1500   No. of failures = 579   Time at risk =
LR chi2(2) =   Log likelihood =   Prob > chi2 =
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
group_2 |
group_3 |

Analysis restricted to survivors of the first year:
No. of subjects = 872   Number of obs = 872   No. of failures = 92   Time at risk =
LR chi2(2) =   Log likelihood =   Prob > chi2 =
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
group_2 |
group_3 |

Conclusion: 1. The hazard ratio between group 2 and group 1 was very similar in both periods, i.e., around 1.4 (confirming the proportionality of the two hazard functions). 2. The hazard ratio between group 3 and group 1 was higher than 2 in the first year but smaller than 1 thereafter (confirming the sharp decrease of the hazard function of group 3, falling below the one of group 1 after 1 year).

Example from the literature: Koch M. et al., The natural history of secondary progressive multiple sclerosis, J Neurol Neurosurg Psychiatry 2010; 81. Some study characteristics: patients from British Columbia with a remitting disease at baseline; onset of immunomodulatory treatment was considered as a censoring event.

Kaplan-Meier curves: PH-assumption was probably not satisfied by the factor gender!

1. On average, the covariate-adjusted hazard of secondary progression was higher in men than in women by 43%, and the median time to secondary progression was lower by 25% (i.e., 17.1 vs. years). 2. On average, the covariate-adjusted hazard of secondary progression increased by 5% with each additional year of age at baseline.

CAVE: 1. In general, the ratio of median or mean survival times between two groups is not inversely proportional to the hazard ratio. But in the special case of constant hazards, this relation holds for the mean survival times. 2. The hazard ratio may be interpreted as a relative risk only if the event rates are small in the respective groups during the time period of interest.

Thank you again for your attention!