Biostatistics Case Studies 2008 Peter D. Christenson Biostatistician Session 6: Classification Trees.


Case Study
Goal of paper: classify subjects as IR or non-IR using subject characteristics other than definitive IR measures such as the clamp: gender, weight, lean body mass, BMI, waist and hip circumferences, LDL, HDL, total cholesterol, triglycerides, FFA, DBP, SBP, fasting insulin and glucose, HOMA, family history of diabetes, and some ratios derived from these.

Major Conclusion Using All Predictors (p. 336, 1st column)
[Figure: BMI and HOMA jointly partition subjects into IR and non-IR regions.]

Broad Approaches to Classification
1. Regression, discriminant analysis - modeling.
2. Cluster analyses - geometric.
3. Trees - partitioning.

Overview of Classification Trees
General concept, based on subgroupings:
1. Form combinations of subgroups according to High or Low on each characteristic.
2. Find actual IR rates in each subgroup.
3. Combine subgroups that give similar IR rates.
4. Classify as IR if the IR rate is large enough.
Notes:
1. No model or statistical assumptions.
2. And so no p-values.
3. Many options are involved in the grouping details.
4. Actually implemented hierarchically - next slide.

Figure 2
[Tree diagram: branches end in "Classify as IR" or "Classify as non-IR".]

Alternative: Logistic Regression
1. Find equation: Prob(IR) = function(w1*BMI + w2*HOMA + w3*LDL + ...), where the w's are weights (coefficients).
2. Classify as IR if Prob(IR) is large enough.
Notes: Assumes a specific statistical model. Gives p-values (which depend on the model being correct). Needs to use Prob(IR), which is very model-dependent, unlike High/Low categorizations.
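The two steps above can be sketched in a few lines of code. This is a minimal illustration, not the paper's fitted model: the weights, intercept, and feature values below are hypothetical placeholders.

```python
import math

def prob_ir(weights, intercept, features):
    """Logistic model: Prob(IR) = e^u / (1 + e^u), with u a weighted
    sum of the predictors. Weights here are illustrative only."""
    u = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-u))

def classify_ir(p, cutpoint=0.5):
    """Step 2: classify as IR if the modeled probability is large enough."""
    return "IR" if p > cutpoint else "non-IR"

# Hypothetical subject with features [BMI, HOMA, LDL] and made-up weights
p = prob_ir(weights=[0.05, 0.9, 0.01], intercept=-5.0,
            features=[31.0, 5.2, 130.0])
label = classify_ir(p)
```

Note that the classification depends on the whole modeled Prob(IR), not on a High/Low cut of any single predictor.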

Trees or Logistic Regression?
Logistic: Not originally designed for classifying, but for estimating Prob(IR). Requires specification of predictor interrelations, either known or found through data examination; thus not as flexible. Dependent on the model being correct. Can prove whether predictors are associated with IR.
Trees: Designed for classifying. Interrelations are not pre-specified, but detected in the analysis. Do not prove associations "beyond reasonable doubt", as regression can.

Outline of this Session
Use simulated data to:
1. Classify via logistic regression using only one predictor.
2. Classify via trees using one predictor.
3. Show that the results are identical.
4. Show how the results differ when 2 predictors are used.

IR and HOMA
Simulated data: N=2138, with the IR rate increasing with HOMA as in the actual data in the paper. Overall IR rate = 700/2138 = 33%.

IR and HOMA: Logistic Fit
The fitted logistic model predicts the probability of IR as:
Prob(IR) = e^u / (1 + e^u), where u is a fitted linear function of HOMA.
[Figure: a logistic curve has this sigmoidal shape.]

Using the Logistic Model for Classification
The logistic model proves that the risk of IR increases as HOMA increases (the coefficient is significant, p<0.0001). How can we classify as IR or not based on HOMA? Use Prob(IR). We need a cutpoint c so that we classify as:
If Prob(IR) > c, classify as IR.
If Prob(IR) ≤ c, classify as non-IR.
Regression does not supply c. It is chosen to balance sensitivity and specificity.

IR and HOMA: Logistic with Arbitrary Cutpoint
If cutpoint c = 0.50 is chosen, then we have:
Assign IR: actual IR N=440, actual non-IR N=99.
Assign non-IR: actual IR N=260, actual non-IR N=1339.
Sensitivity = 440/(440 + 260) = 62.9%
Specificity = 1339/(99 + 1339) = 93.1%
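The sensitivity and specificity arithmetic on this slide can be checked directly from the four cell counts:

```python
def sens_spec(tp, fn, fp, tn):
    """Sensitivity = TP/(TP+FN): fraction of actual IR assigned IR.
    Specificity = TN/(TN+FP): fraction of actual non-IR assigned non-IR."""
    return tp / (tp + fn), tn / (tn + fp)

# Counts from the slide: 440 IR assigned IR, 260 IR assigned non-IR,
# 99 non-IR assigned IR, 1339 non-IR assigned non-IR.
sens, spec = sens_spec(tp=440, fn=260, fp=99, tn=1339)
# sens = 0.629 (62.9%), spec = 0.931 (93.1%), matching the slide
```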

IR and HOMA: Logistic with Other Cutpoints
From SAS ("Event" = IR):
[Classification table: for each probability-level cutpoint, counts of correct and incorrect events and non-events, percent correct, sensitivity, specificity, false positives, and false negatives.]
Often, the overall percentage correct is used to choose the "optimal" cutpoint. Here, that gives cutpoint = 0.37, with % correct = 85.2%, sensitivity = 71.7% and specificity = 91.7%.

Using Classification Trees with One Predictor
Choose every possible HOMA value and find the sensitivity and specificity at each (as in creating a ROC curve). Assign relative weights to sensitivity and specificity, often equal, as previously. We need a cutpoint h so that we classify as:
If HOMA > h, classify as IR.
If HOMA ≤ h, classify as non-IR.
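The cutpoint search described above can be sketched as a scan over every observed value, scoring each candidate by overall percent correct (i.e., equal weights on sensitivity and specificity). The toy data below is invented for illustration:

```python
def best_cutpoint(values, labels):
    """Scan every observed predictor value as a candidate cutpoint h and
    return the h maximizing overall percent correct (equal weights)."""
    best_h, best_correct = None, -1
    for h in sorted(set(values)):
        correct = sum(
            (v > h) == (y == 1)  # predict IR (label 1) when value exceeds h
            for v, y in zip(values, labels)
        )
        if correct > best_correct:
            best_h, best_correct = h, correct
    return best_h, best_correct / len(values)

# Toy data: subjects with IR (1) tend to have higher HOMA
homa = [1.2, 2.0, 3.5, 4.8, 5.6, 7.1]
ir   = [0,   0,   0,   1,   1,   1]
h, acc = best_cutpoint(homa, ir)  # h = 3.5 separates this toy data perfectly
```

Unequal weights on the two error types would simply change the scoring line, not the structure of the scan.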

IR and HOMA: Trees with Other Cutpoints
[Table: for each HOMA-level cutpoint, counts of correct and incorrect events and non-events, percent correct, sensitivity, specificity, false positives, and false negatives.]
If the overall percentage correct is used to choose the "optimal" cutpoint, then cutpoint = 4.61, with % correct = 85.2%, sensitivity = 71.7% and specificity = 91.7%.

IR and HOMA: Final, Simple Tree
Root: 700/2138 (33% IR).
HOMA ≤ 4.61: 198/1517 (13% IR).
HOMA > 4.61: 502/621 (81% IR).
This is exactly the result from the logistic regression, since the logistic function is monotone in HOMA. See next slide.

IR and HOMA: Logistic Equivalent to Tree
Assign IR: actual IR N=502, actual non-IR N=119.
Assign non-IR: actual IR N=198, actual non-IR N=1319.

Summary: Classifying IR from HOMA
One predictor: same classification with trees and logistic, because logistic regression gives Prob(IR) = e^u / (1 + e^u), where u is an increasing linear function of HOMA, so a large Prob(IR) is equivalent to a large HOMA. This is not the case with 2 predictors, as in the next slide.

[Figure: % IR across HOMA and BMI subgroups.]
1. %IR increases non-smoothly with both HOMA and BMI.
2. Logistic regression fits a smooth surface to these %s.

Classifying IR from HOMA and BMI
Two predictors: different classification with trees and logistic, because logistic regression gives Prob(IR) = e^u / (1 + e^u), where u is a linear combination of HOMA and BMI (with coefficient 0.87 on HOMA), so Prob(IR) increases smoothly as this combination does. Trees allow different IR rates in (HOMA, BMI) subgroups.

Classifying IR from HOMA and BMI: Logistic
Logistic regression forces a smooth partition of the (HOMA, BMI) plane, with the demarcation line given by setting the linear combination 0.87(HOMA) + (weight)(BMI) equal to the cutpoint, although adding a HOMA-BMI interaction could give curvature to the demarcation line. Compare this to the tree partitioning on the next slide.
[Figure: BMI vs. HOMA plane split by a straight line into non-IR and IR regions.]

Classifying IR from HOMA and BMI: Trees
[Figure: BMI vs. HOMA plane cut into rectangular regions labeled IR and non-IR.]
Trees partition HOMA-BMI combinations into subgroups, some of which are then combined as IR and non-IR. We now consider the steps and options that need to be specified in a tree analysis.

Implausible Biological Conclusions from Trees
[Figure: a BMI vs. HOMA tree partition into IR and non-IR regions.]

Potential Logistic Modeling Inadequacy
Logistic regression gives Prob(IR) = e^u / (1 + e^u), where u is a linear combination of HOMA and BMI, so Prob(IR) increases smoothly as this combination does. The data may not be fit well by this form.
[Figures contrast data that does follow the logistic curve with data that does not.]

Classification Tree Software
CART, from Salford Systems.
Statistica.
SAS: in the Enterprise Miner module.
SPSS: has Salford CART in their Clementine Data Mining module.

Remaining Slides: Overview of Decisions Needed for Classification Trees

Classification Tree Steps
There are several flavors of tree methods, each with many options, but most involve:
1. Specifying criteria for predictive accuracy.
2. Tree building.
3. Tree-building stopping rules.
4. Pruning.
5. Cross-validation.

Specifying Criteria for Predictive Accuracy
Misclassification cost generalizes the concept of misclassification rates so that some types of misclassification are given greater weight. Relative weights, or costs, are assigned to each type of misclassification. A prior probability of each outcome is specified, usually as the observed prevalence of the outcome in the data, but it could come from previous research or from other populations. The costs and priors together give the criteria for balancing specificity and sensitivity. Observed prevalence with equal weights → minimizing overall misclassification.
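The cost-and-prior criterion above can be written out explicitly. This is a sketch of the general idea, not any package's exact formula; the function name is mine:

```python
def expected_cost(sens, spec, prior_ir, cost_fn=1.0, cost_fp=1.0):
    """Expected misclassification cost: the prior probabilities weight the
    two error rates, and the per-error costs let one kind of mistake
    (e.g., missing an IR subject) count more than the other."""
    fn_rate = (1 - sens) * prior_ir        # actual IR called non-IR
    fp_rate = (1 - spec) * (1 - prior_ir)  # actual non-IR called IR
    return cost_fn * fn_rate + cost_fp * fp_rate

# With equal costs and the observed prevalence, minimizing this is the same
# as minimizing the overall misclassification rate. Using the earlier slide's
# numbers (sens 71.7%, spec 91.7%, prevalence 700/2138), the expected cost
# is about 0.148, i.e., 1 minus the 85.2% correct reported there.
c = expected_cost(sens=0.717, spec=0.917, prior_ir=700 / 2138)
```

Raising cost_fn relative to cost_fp shifts the preferred cutpoint toward higher sensitivity.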

Tree Building
Recursively apply what we did for HOMA to each of the two resulting partitions, then to the next set, etc. Every factor is screened at every step, and the same factor may be reused. Some algorithms allow certain linear combinations of factors (called discriminant functions, e.g., as logistic regression provides) to be screened. An "impurity measure" or "splitting function" specifies the criterion for measuring how different two potential new subgroups are. Some choices are "Gini", chi-square, and G-square.
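As a concrete example of one impurity measure, the Gini index for a node and for a candidate split can be computed as follows (a minimal sketch; counts are per-class subject counts such as [n_IR, n_non-IR]):

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum of squared class proportions.
    0 for a pure node; 0.5 for a 50/50 two-class node."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate split; tree building
    chooses the split minimizing this, i.e., the 'purest' children."""
    n = sum(left) + sum(right)
    return (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)

# A split that perfectly separates 10 IR from 10 non-IR subjects is pure:
pure = split_impurity([10, 0], [0, 10])    # 0.0
# A split leaving both children at 50/50 gains nothing:
useless = split_impurity([5, 5], [5, 5])   # 0.5
```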

Tree-Building Stopping Rules
It is possible to continue splitting and building a tree until all subgroups are "pure", containing only one type of outcome. This may be too fine to be useful. One alternative is "minimum N": allow only pure subgroups or subgroups of a minimum size. Another choice is "fraction of objects", in which splitting stops once a minimum fraction of an outcome class, or a pure class, is obtained.

Tree Pruning
Pruning tries to solve the problem of poor generalizability due to over-fitting the results to the data at hand. Start at the latest splits and measure how much each split reduces misclassification; remove the split if the reduction is not large. How large is "not large"? This can be made at least objective, if not foolproof, by a complexity parameter related to the depth of the tree, i.e., the number of levels of splits. Combining that with the misclassification cost function gives "cost-complexity pruning", which is used in this paper.
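The cost-complexity idea can be stated in one line: penalize each extra terminal node by a complexity parameter alpha, and keep a split only if it lowers the penalized total. A minimal sketch (the function and its arguments are illustrative, not CART's internal API):

```python
def cost_complexity(misclassified, n_leaves, alpha):
    """Cost-complexity criterion: misclassification cost plus a penalty
    of alpha per terminal node. Pruning removes splits that do not
    lower this penalized total."""
    return misclassified + alpha * n_leaves

# With alpha = 1, a subtree with 5 leaves and 10 errors beats a pruned
# version with 4 leaves and 12 errors (15 vs. 16), so the split is kept;
# a larger alpha would reverse the decision and prune it.
keep = cost_complexity(10, 5, alpha=1.0) < cost_complexity(12, 4, alpha=1.0)
```

Larger alpha thus yields smaller, more generalizable trees; alpha itself is typically chosen by cross-validation.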

Cross-Validation
At least two data sets are used. The decision rule is built with the training set(s) and applied to the test set(s). If the misclassification costs for the test sets are similar to those for the training sets, then the decision rule is considered "validated". With large datasets, as in business data mining, only one training and one test set are used. For smaller datasets, "v-fold cross-validation" is used: the data is randomly split into v sets; each set serves as the test set once, with the combined remaining v-1 sets as the training set, and serves v-1 times as part of the training set, for v analyses. The average cost is compared to that for the entire set.
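The v-fold partitioning described above can be sketched as follows. This is a generic illustration of the splitting scheme, not any particular package's implementation:

```python
import random

def v_fold_splits(n, v, seed=0):
    """Randomly partition indices 0..n-1 into v folds. Each fold serves as
    the test set once, with the remaining v-1 folds combined as the
    training set, yielding v (train, test) pairs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    folds = [idx[i::v] for i in range(v)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

splits = v_fold_splits(n=10, v=5)
# 5 (train, test) pairs; every subject appears in exactly one test set
# and in v-1 = 4 of the training sets
```

The decision rule would be rebuilt on each training set, its cost measured on the corresponding test set, and the average test cost compared with the cost on the full data.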