The SSC presented a data set on cervical cancer for analysis. Purpose of the analysis: determine the different attributes (covariates) for predicting relapse.

Presentation transcript:

INTRODUCTION

The SSC presented a data set on cervical cancer for analysis. The purpose of the analysis was to determine which attributes (covariates) predict relapse in women who had cervical cancer and surgery, and to classify the patients into Low, Medium and High risk.

It was assumed that prediction would be done with the information obtained right after the surgery. Hence, outcomes observed between the surgery date and the last follow-up date were not used. These variables are "if patients received radiation therapy or not" and "dead with disease, dead without disease, alive with disease, etc.", the latter recorded at the time of last follow-up.

905 patients entered the study; 34 patients were dropped since they had no follow-up date yet.

Covariates:
surgery date
last follow-up date
age of the patient at time of surgery (Age)
capillary lymphatic spaces (0=negative, 1,2=positive) (Cls)
cell differentiation (1=better, 2=moderate, 3=worst) (Grad)
histology of the cancer cells (determined by the pathologist, ranges from 0 to 6) (Histolog)
disease left after surgery (0=clear, 1=para-vaginal area, 2=vaginal area, 3=both) (Margins)
depth of the tumour (in mm.) (Maxdepth)
pelvis involvement (0=negative, 1=positive) (Pellymph)
size of the tumour (in mm.) (Size)

EXPLORATORY ANALYSIS

Univariate plots of the variables, such as these, were produced to better understand their behaviour. Pairwise contingency tables were also used as an exploratory tool.

EXPLORATORY ANALYSIS

Classification trees are used to uncover inherent structure in data. They are binary arrangements created by splitting observations into "more homogeneous" groups, dictated by rules of the form (e.g.) "if Age<24 and Cls is positive then the response is likely 1".

The full tree is a complex model; a smaller tree might do. Dropping observations with NA loses too much information, so NA is used as a factor level in all variables.

Misclassification Rate= Residual Mean Deviance=

EXPLORATORY ANALYSIS

This smaller tree is easier to follow, and its misclassification rate is still of acceptable size. Maxdepth, Size and Cls are observed to be important variables in the structure.

Just as regression uses the Residual Sum of Squares as a diagnostic of fit, trees use the Residual Deviance. Hence a decrease in deviance means a better-fitting tree. In regression, more parameters might give a better fit but a more complex interpretation. In a tree, the number of terminal nodes plays the role of the number of parameters. Pruning of a tree can be done based on the following:

Misclassification Rate= Residual Mean Deviance=
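The two pruning diagnostics named above can be computed from the class counts in each terminal node. The following is a minimal sketch, assuming the usual definition for classification trees (deviance = -2 times the sum of n_k log(n_k / n_leaf) over terminal nodes and classes, and residual mean deviance = deviance divided by the number of observations minus the number of terminal nodes). The function name and the node counts are hypothetical and are not taken from the poster.

```python
import math

def tree_diagnostics(leaves):
    """leaves: list of (n_no_relapse, n_relapse) counts, one per terminal node."""
    n_total = sum(a + b for a, b in leaves)
    deviance = 0.0
    misclassified = 0
    for counts in leaves:
        n_leaf = sum(counts)
        for n_k in counts:
            if n_k > 0:  # 0 * log(0) is taken as 0
                deviance += -2.0 * n_k * math.log(n_k / n_leaf)
        # each leaf predicts its majority class; the rest are errors
        misclassified += n_leaf - max(counts)
    return {
        "misclassification_rate": misclassified / n_total,
        "residual_deviance": deviance,
        "residual_mean_deviance": deviance / (n_total - len(leaves)),
    }

# Hypothetical counts for a three-leaf tree (illustration only)
small_tree = [(300, 10), (200, 20), (5, 15)]
stats = tree_diagnostics(small_tree)
```

Comparing these values for the full tree and a pruned tree shows how much fit is given up in exchange for fewer terminal nodes.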

SURVIVAL ANALYSIS

The variable Size is important, as seen in the trees. Nevertheless, it has many missing values, and analyses usually drop such observations. In order to keep the information, we categorised it with the missing values as the lowest of the levels and used the quartiles as cutoffs for the other levels.

A Cox Proportional Hazards model was assumed. During the process of modeling, it was seen that the important levels of Size were three categories: Not Measured (NA's), ≤30 and >30.

The model for prediction agreed on included Age, Cls, Maxdepth and Size as predictors, along with two two-way interactions: Age with Cls and Maxdepth with Size. Specifically the hazard as a function of time can be seen as

The Proportional Hazards Assumption was not violated, either for individual covariates or for the model as a whole (global pvalue=0.14).
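The categorisation of Size described above (missing values as the lowest level, quartiles as cutoffs for the remaining levels) can be sketched as follows. This is an illustration only: the function name, the integer level codes and the sample sizes are hypothetical, and the quartile method is whatever Python's `statistics.quantiles` uses by default, which may differ from the S-plus computation used for the poster.

```python
import statistics

def categorise_size(sizes):
    """Map tumour sizes (mm, None = not measured) to ordered levels 0..4."""
    observed = [s for s in sizes if s is not None]
    q1, q2, q3 = statistics.quantiles(observed, n=4)  # quartile cutoffs
    levels = []
    for s in sizes:
        if s is None:
            levels.append(0)  # missing value = lowest level
        elif s <= q1:
            levels.append(1)
        elif s <= q2:
            levels.append(2)
        elif s <= q3:
            levels.append(3)
        else:
            levels.append(4)
    return levels

# Hypothetical sizes in mm (illustration only)
sizes = [None, 10, 20, 30, 40, 50, None, 60, 70, 80]
levels = categorise_size(sizes)
```

Keeping the missing values as their own level means no patient is dropped from the model because Size was not measured.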

SURVIVAL ANALYSIS

The Cox curves were calculated as the average of the curves corresponding to the different covariate patterns, rather than plotting curves with the average VALUE of the covariates (using the S-plus function avg.surv created by Dr. R. Brant, CHS Dept, U of C).
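The distinction between the two kinds of curve can be illustrated numerically. Under a Cox model the survival curve for a covariate pattern is the baseline survival raised to the power exp(linear predictor). The sketch below is not the avg.surv function; the function names, baseline values and linear predictors are hypothetical and chosen only to show that the two approaches give different curves.

```python
import math

def cox_survival(base_surv, lp):
    """Survival curve for linear predictor lp: S0(t) ** exp(lp)."""
    return [s0 ** math.exp(lp) for s0 in base_surv]

def average_of_curves(base_surv, lps):
    """Average the individual curves, one per covariate pattern."""
    curves = [cox_survival(base_surv, lp) for lp in lps]
    return [sum(vals) / len(vals) for vals in zip(*curves)]

def curve_at_average(base_surv, lps):
    """Single curve evaluated at the average linear predictor."""
    return cox_survival(base_surv, sum(lps) / len(lps))

# Hypothetical baseline survival at four time points and three
# hypothetical linear predictors (beta'x), for illustration only.
base_surv = [1.0, 0.9, 0.8, 0.7]
lps = [-1.0, 0.0, 2.0]

avg_curve = average_of_curves(base_surv, lps)
mean_cov_curve = curve_at_average(base_surv, lps)
```

Because the linear predictor is linear in the covariates, the curve at the average linear predictor is the curve at the average covariate values. The two results differ because survival is a non-linear function of the linear predictor.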

SURVIVAL ANALYSIS

As with any model, assumptions are needed. Censoring was assumed to be non-informative (censoring not related to the chances of recurrence).

Some interesting results and interpretation for the model:

The hazard ratio comparing Cls positive1 to positive2, keeping all other variables fixed:

So, for Age=30, the hazard ratio= , that is, the hazard of having a relapse when Cls is positive1 is times the hazard of relapse when Cls is positive2 at age 30. Now, for Age=50, the hazard ratio= , with analogous interpretation. We can see the effect of the interaction between Age and Cls.

Similarly, we can look at the hazard ratio for an increase in tumour size, namely:

So, for Maxdepth=10, the hazard ratio= , that is, the hazard of having a relapse when Size is less than 30 is times the hazard of relapse when Size>30.
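The fitted equation and its coefficient values are not reproduced in this transcript. As a sketch only, assuming indicator coding for the Cls levels (C_1 for positive1, C_2 for positive2, negative as the reference) and using symbols introduced here purely for illustration, the dependence of the hazard ratio on Age follows from the interaction terms:

```latex
h(t \mid x) = h_0(t)\,\exp\{\eta(x)\}, \qquad
\eta(x) = \beta_A\,\mathrm{Age} + \beta_1 C_1 + \beta_2 C_2
        + \gamma_1\,\mathrm{Age}\,C_1 + \gamma_2\,\mathrm{Age}\,C_2 + \dots

\mathrm{HR}(\text{positive1 vs positive2} \mid \mathrm{Age})
  = \frac{h(t \mid C_1 = 1,\, C_2 = 0)}{h(t \mid C_1 = 0,\, C_2 = 1)}
  = \exp\{(\beta_1 - \beta_2) + (\gamma_1 - \gamma_2)\,\mathrm{Age}\}
```

The baseline hazard and all terms not involving Cls cancel in the ratio, which is why the comparison can be made "keeping all other variables fixed" and why the result changes between Age=30 and Age=50.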

LOGISTIC REGRESSION ANALYSIS

Logistic regression models the log of the odds of a binary outcome as a linear function of the covariates. The odds are the ratio of the probability of an event happening to the probability of the same event not happening.

Recall that during the process of modeling, it was seen that the important levels of Size were really three categories: Not Measured (NA's), ≤30 and >30.

The model for prediction agreed on included Age, Cls, Maxdepth, Size and Pellymph as predictors, along with a two-way interaction between Age and Cls. Specifically the logistic model can be seen as

The statistically significant model included an interaction between Pellymph and Size >30. However, there were only three observations with such values, and the inclusion of this interaction created problems for prediction. Hence, for the sake of interpretability and in order to be able to predict, we decided to drop it. The change in residual deviance from the fuller model to the one kept was from to
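The relationship between probability, odds and log-odds described above can be written out directly. This is a generic sketch with hypothetical function names; it does not use the poster's fitted coefficients, which are not available in this transcript.

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds, the quantity modeled as linear in the covariates."""
    return math.log(odds(p))

def inv_logit(eta):
    """Map a linear predictor (log-odds) back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

p = 0.2
eta = logit(p)           # log-odds for a probability of 0.2
p_back = inv_logit(eta)  # recovers the original probability
```

A fitted model supplies the linear predictor for each patient, and `inv_logit` turns it into the predicted probability of relapse.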

LOGISTIC REGRESSION ANALYSIS

The usual plot for this type of analysis is a probability curve. Since we had 2 continuous variables in our model, we present some examples of probability surfaces. These make it possible to look up the probability for any Age/Maxdepth combination. Observe the interaction of Age and Cls.

LOGISTIC REGRESSION ANALYSIS

We can see that Age plays a bigger role when Cls is at level positive2. Changing Size from ≤30 to >30 increases the probability of relapse, for a fixed set of the other variables (compare top to bottom).

LOGISTIC REGRESSION ANALYSIS

Some interesting results and interpretation for the model:

The odds ratio comparing Cls positive1 to positive2, keeping all other variables fixed:

So, for Age=30, the odds ratio= , that is, the odds of having a relapse when Cls is positive1 are times the odds of relapse when Cls is positive2. Now, for Age=50, the odds ratio= , with analogous interpretation. We can see the effect of the interaction between Age and Cls.

Similarly, we can look at the odds ratio for an increase of 10mm in tumour depth, namely:

So, for fixed values of the other variables and an increase of 10 in Maxdepth, the odds ratio= . That is, the odds of having a relapse when the tumour is 10mm deeper are times greater.
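The odds ratios discussed above are exponentials of differences in the linear predictor. The sketch below shows the arithmetic, assuming indicator coding for the Cls levels and separate Age interaction terms for positive1 and positive2. All coefficient values and names are hypothetical placeholders, since the fitted values do not appear in this transcript.

```python
import math

# Hypothetical log-odds coefficients, for illustration only
coef = {
    "cls_pos1": 0.8,
    "cls_pos2": 1.5,
    "age_x_cls_pos1": -0.010,
    "age_x_cls_pos2": -0.030,
    "maxdepth": 0.05,  # per mm of tumour depth
}

def or_pos1_vs_pos2(age):
    """Odds ratio of Cls positive1 vs positive2 at a given age."""
    main = coef["cls_pos1"] - coef["cls_pos2"]
    interaction = coef["age_x_cls_pos1"] - coef["age_x_cls_pos2"]
    return math.exp(main + interaction * age)

def or_depth_increase(mm):
    """Odds ratio for a tumour that is `mm` millimetres deeper."""
    return math.exp(coef["maxdepth"] * mm)

or_age_30 = or_pos1_vs_pos2(30)
or_age_50 = or_pos1_vs_pos2(50)
or_10mm = or_depth_increase(10)
```

With an interaction in the model, the Cls odds ratio is a function of Age rather than a single number, which is why it is reported separately for Age=30 and Age=50. The Maxdepth odds ratio does not depend on the other covariates because Maxdepth has no interaction in the logistic model.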

LOGISTIC REGRESSION ANALYSIS

One of the purposes of the case study was to classify patients as Low, Medium or High risk of relapse. We suggest doing this using the probabilities obtained from this logistic regression in the following way: calculate the probability from the model for each patient. If the probability is within a prefixed range, the patient is set as Low; if it is within another range, Medium; and so on. For example: Low if in (0,.35], Med if in (.35,.60] and High if >.60.

Another way of classifying would involve at risk or not at risk as the possible classifications (as a +/- test). Although this gives only two possibilities, predictive values can be calculated, which gives a measure of accuracy. Do this by setting a cutoff point for the calculated probabilities and setting the value of the test for the patient as + or -. Some examples for different cutoffs follow.

Classified + if predicted Pr(D) >=
True D defined as relapse ~= 0

              -------- True --------
Classified |      D         ~D      |   Total
     +     |      5          3      |       8
     -     |                        |
   Total   |                        |     657

Positive predictive value   Pr( D| +)   62.50%
Negative predictive value   Pr(~D| -)   94.30%
Correctly classified                    93.91%

For the next cutoff values the table itself is omitted.
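Both classification schemes can be sketched in a few lines. The risk ranges are the example ranges given above. For the predictive values, the + row (5 and 3) and the total (657) are from the poster, but the - row is blank in this transcript. The counts 37 and 612 used below are back-calculated from the reported percentages and should be read as an assumption, not as reported data. Function names are hypothetical.

```python
def risk_class(p, low_max=0.35, med_max=0.60):
    """Low if p in (0, .35], Med if in (.35, .60], High if > .60."""
    if p <= low_max:
        return "Low"
    if p <= med_max:
        return "Med"
    return "High"

def predictive_values(tp, fp, fn, tn):
    """PPV, NPV and overall accuracy (in %) from a 2x2 table."""
    total = tp + fp + fn + tn
    return {
        "ppv": 100.0 * tp / (tp + fp),
        "npv": 100.0 * tn / (fn + tn),
        "correct": 100.0 * (tp + tn) / total,
    }

# tp=5, fp=3 and the total of 657 are from the poster; fn=37 and tn=612
# are back-calculated from the reported percentages (an assumption).
pv = predictive_values(tp=5, fp=3, fn=37, tn=612)
classes = [risk_class(p) for p in (0.10, 0.35, 0.50, 0.75)]
```

Rounded to two decimals, these counts reproduce the three reported percentages, which is the only sense in which the back-calculated row is supported.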

LOGISTIC REGRESSION ANALYSIS

True D defined as relapse ~= 0

Classified + if       Positive predictive   Negative predictive   Correctly
predicted Pr(D) >=    value Pr( D| +)       value Pr(~D| -)       classified
.25                   33.33%                94.76%                92.24%
.4                    70.00%                94.59%                94.22%
.6                    50.00%                93.74%                93.61%

As a "goodness of fit", a table for groups follows.

Logistic model for relapse, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)

Group   Prob   Obs_1   Exp_1   Obs_0   Exp_0   Total

number of observations = 657
number of groups = 10
Pvalue=
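The column headings of the goodness-of-fit table (observed and expected counts per group of estimated probabilities) match the layout of a Hosmer-Lemeshow style test; that identification is an assumption, since the poster does not name the test. A minimal sketch of how such a table and its statistic could be built, with a hypothetical function name and hypothetical data:

```python
def hosmer_lemeshow(probs, outcomes, groups=10):
    """Return (statistic, table); cases are grouped by sorted predicted probability.

    Predicted probabilities are assumed to lie strictly between 0 and 1.
    """
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    stat = 0.0
    table = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        total = len(chunk)
        obs_1 = sum(y for _, y in chunk)  # observed relapses
        exp_1 = sum(p for p, _ in chunk)  # expected relapses
        obs_0, exp_0 = total - obs_1, total - exp_1
        stat += (obs_1 - exp_1) ** 2 / exp_1 + (obs_0 - exp_0) ** 2 / exp_0
        # row: Group, upper Prob in group, Obs_1, Exp_1, Obs_0, Exp_0, Total
        table.append((g + 1, chunk[-1][0], obs_1, exp_1, obs_0, exp_0, total))
    return stat, table

# Hypothetical data: two groups whose observed counts match expectation
probs = [0.1] * 10 + [0.5] * 10
outcomes = [1] + [0] * 9 + [1] * 5 + [0] * 5
stat, table = hosmer_lemeshow(probs, outcomes, groups=2)
```

The statistic is usually referred to a chi-square distribution with (number of groups - 2) degrees of freedom. The p-value computation is left out here to keep the sketch free of external libraries.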

CONCLUSIONS

Given the nature of the study, and the assumption that prediction of relapse would be done right after surgery, variables observed after surgery were not taken into account. These were: status of the patient at last follow-up date and whether the patient received radiation.

Contrary to what we expected, Disease left after surgery did not play an important role in prediction.

There was agreement throughout the different analyses (exploratory, survival and logistic) regarding the importance of including three covariates: Maxdepth, Capillary Lymphatic Spaces (Cls) and Size.

The effect of the variable Age on relapse is affected by its interaction with Capillary Lymphatic Spaces (Cls).

The important variables for predicting the survival to relapse are Age, Cls, Size and Maxdepth.

The important variables for predicting the probability of relapse are Age, Cls, Size, Maxdepth and Pellymph.

FUTURE WORK

It would be of relevance to check the importance of the covariates when separating the response variable into no relapse, relapse before a specific time and relapse after that time.

Use of trees as a classification tool rather than an exploratory tool.

ACKNOWLEDGEMENTS

We would like to thank the following for their help and support in the creation of this poster:
StatCar lab, Mathematics and Statistics Dept., U of C
Dr. R. Brant, CHS, U of C
Dr. P. Ehlers, Math and Stats, U of C
B. Teare, Math and Stats, U of C
Learning Commons, U of C

BIBLIOGRAPHY

Rose, S. Lecture notes for Biostatistics II.
Venables, W.N. and Ripley, B.D. Modern Applied Statistics with S-plus. Springer Statistics and Computing Series, New York, 1994.
Insightful. S-plus 2000 Guide to Statistics. Seattle, 1999.