Decision Tree & Bootstrap Forest

Decision Tree & Bootstrap Forest. C. H. Alex Yu, Park Ranger of the National Bootstrap Forest.

Why not regression? OLS regression is good for small-sample analysis. If you have an extremely large sample (e.g., archival data), the power level may approach 1 (.99999, although it can never be exactly 1). What is the problem?
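A minimal simulation (my own sketch, not from the slides) makes the problem concrete: with a huge sample, even a negligible difference yields p < .05.

    import numpy as np
    from scipy import stats

    # Two groups that differ by a trivial 0.01 SD -- practically nothing.
    rng = np.random.default_rng(0)
    n = 1_000_000
    a = rng.normal(0.00, 1, n)
    b = rng.normal(0.01, 1, n)

    t, p = stats.ttest_ind(a, b)
    print(p)  # far below .05: with this much power, everything is "significant"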

What is a decision tree? We need to grow trees first in order to grow a forest. Also called a classification tree or recursive partition tree, it was developed by Breiman et al. (1984). The aim is to find which independent variable(s) can successively make decisive partitions of the data with reference to the dependent variable.

We need a quick decision! When heart attack patients are admitted to a hospital, physiological measures such as heart rate and blood pressure, as well as background information such as the patient's personal and family medical history, are usually obtained. But can you afford any delay? Do you need to run 20 tests before doing anything?

We need a decision tree! Breiman et al. developed a three-question decision tree. What is the patient's minimum systolic blood pressure over the initial 24-hour period? What is his/her age? Does he/she display sinus tachycardia?

Nested-if logic If the patient's minimum systolic blood pressure over the initial 24-hour period is greater than 91, then if the patient's age is over 62.5 years, then if the patient displays sinus tachycardia, then and only then is the patient predicted not to survive for at least 30 days.
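The rule translates directly into code; here is a sketch (the function and argument names are mine):

    def predicted_not_to_survive(min_systolic_bp, age, sinus_tachycardia):
        # Breiman et al.'s three-question rule as nested-if logic.
        if min_systolic_bp > 91:           # question 1
            if age > 62.5:                 # question 2
                if sinus_tachycardia:      # question 3
                    return True            # then, and only then, high risk
        return False

    print(predicted_not_to_survive(120, 70, True))   # True
    print(predicted_not_to_survive(120, 50, True))   # False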

Titanic survivors A night to remember! After the disaster, people asked: what types of people tended to survive?

Decision tree

Leaf report: Nested-if and interaction If the passenger is a female and she has a first-class ticket, then the probability of survival is .9299 (sex × class interaction). If the passenger is a male and he is not a young boy (age ≥ 10), then his chance of survival is .1693 (sex × age interaction).
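A sketch of growing such a tree in Python, assuming seaborn's bundled copy of the Titanic data (which may differ slightly from the version used in the slides, so the leaf probabilities will not match exactly):

    import seaborn as sns
    from sklearn.tree import DecisionTreeClassifier, export_text

    df = sns.load_dataset("titanic")[["survived", "sex", "pclass", "age"]].dropna()
    # Recode sex as 1 = female, 0 = male so the tree can split on it.
    X = df[["sex", "pclass", "age"]].assign(sex=(df["sex"] == "female").astype(int))

    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, df["survived"])
    print(export_text(tree, feature_names=["sex", "pclass", "age"]))  # the leaf rules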

Leaf report: Nested-if and interaction How unfair! Prior research shows that women are more robust against hypothermia!

ROC curves and AUC Four possible outcomes: true positive (TP), false positive (FP), false negative (FN), true negative (TN). Sensitivity = TP/(TP+FN); 1 - Specificity = 1 - [TN/(FP+TN)]. No model = a 50% chance.
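A short sketch of these quantities with hypothetical scores:

    from sklearn.metrics import confusion_matrix, roc_auc_score

    y_true  = [1, 1, 0, 0, 1, 0, 1, 0]          # hypothetical labels
    y_score = [.9, .8, .7, .2, .6, .4, .3, .1]  # hypothetical predicted probabilities
    y_pred  = [int(s >= .5) for s in y_score]   # classify at a .5 cut-off

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)                 # TP/(TP+FN)
    one_minus_spec = 1 - tn / (fp + tn)          # 1 - [TN/(FP+TN)]
    print(sensitivity, one_minus_spec, roc_auc_score(y_true, y_score))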

Criteria for judging ROC curves The curve on the left is a coin-flipping model. The cut-off is a suggestion only; don't fall into alpha < 0.05-style reasoning.

Cross-validation You can hold back a portion of your data for cross-validation. If you let the program randomly divide the data, you may not get the same result every time. If you assign a group number (a validation ID) to the observations, you get the same subsets every time.
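A sketch of the difference in Python (the slides do this through JMP's dialogs; the seed value here is arbitrary):

    import numpy as np

    n = 1000

    # Random holdback: a different 30% validation set on every run.
    random_flag = np.random.default_rng().random(n) < 0.30

    # Assigned validation ID: generate once with a fixed seed, save the column
    # with the data, and every later analysis reuses exactly the same subsets.
    validation_id = (np.random.default_rng(2024).random(n) < 0.30).astype(int)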

Example from PISA: saturated The same variable recurs; the tree is saturated. You cannot add anything to make it better.

After pruning A very simple model

Compare DT with logistic regression

How about? Stepwise regression: takes a long time. Generalized regression: takes forever; go to high tea at the Hotel Hilton and come back afterwards.

Non-linear relationships Unlike regression, which is confined to linear modeling, the decision tree can detect non-linear relationships. For example, in a data set about relapse to drug use collected by Dr. Rachel Castaneda, it was found that participants who never or very often saw a counselor tended to use drugs, whereas participants who sometimes saw a counselor tended to be drug-free.
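A synthetic stand-in for the counseling example (my own illustration, not the actual data): a U-shaped relationship that a straight line misses but a shallow tree captures.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, (500, 1))                  # e.g., counseling frequency
    y = (x.ravel() - 5) ** 2 + rng.normal(0, 2, 500)  # high at both extremes

    print(LinearRegression().fit(x, y).score(x, y))                  # R-square near 0
    print(DecisionTreeRegressor(max_depth=3).fit(x, y).score(x, y))  # far higher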

What is the decision criterion? Splitting criterion: LogWorth examines each independent variable to identify which one can decisively split the sample with reference to the dependent variable. If the input is continuous (e.g. household income), every value of the variable could be a potential split point. If the input is categorical (e.g. gender), then the average value of the outcome variable is taken for each level of the predictor variable.

What is the decision criterion? Afterwards, a 2x2 crosstab table is formed (e.g. [‘proficient’ or ‘not proficient’] x [‘male’ or ‘female’]). Next, Pearson’s Chi-square is used for examining the association between 2 variables. But the result of Chi-square is dependent on the sample size. When the sample size is extremely large, the p-value is close to 0 and virtually everything appears to be ‘significant.’

What is the decision criterion? As a remedy, the quality of the split is reported by LogWorth, which is defined as –log10(p). Because the LogWorth statistics is the inverse of the p value, a bigger LogWorth is considered better. If the outcome variable is categorical, G^2 (the likelihood ratio of chi-square) is reported. LogWorth was invented by R. A. Fisher!

Decision tree in SPSS Chi-square automatic interaction detection (CHAID): picks the predictor that has the strongest interaction with the DV. Exhaustive CHAID: a modification of CHAID that examines all possible splits for each predictor. Quick, unbiased, efficient statistical tree (QUEST): fast and avoids bias. Classification and regression tree (CRT): tries to make each subgroup as homogeneous as possible.

Decision tree in SPSS Shih (2004): when the Pearson chi-square statistic is used as the splitting criterion, and the split with the largest value is chosen to channel observations into the corresponding subnodes, variable selection bias may result. This problem is especially serious when the number of available split points differs from variable to variable. On many occasions, CRT and JMP's tree produce virtually the same results.

Decision tree in SPSS It may be too complicated, but you cannot prune it.

Bootstrap forest The bootstrap forest is built on the idea of bootstrapping. Originally it was called the random forest, a name trademarked by its inventor, Breiman (1928-2005). A random forest picks random predictors and random subjects; JMP calls its version the bootstrap forest (it picks random subjects only); TETRAD's path searching picks random predictors.

Random forest Breiman synthesized statistics and computer science. The idea of the random forest was published in the journal Machine Learning. Unsupervised learning: fully data-driven. In a single tree, the data are partitioned based on the MSE (MSW) at most. A single tree does have another form of resampling, cross-validation, but that is resampling without replacement.

The power of random forest!

Example from PISA Yu, C. H., Wu, S. F., & Mangan, C. (in press). Identifying crucial and malleable factors of successful science learning from the 2012 PISA. In M. S. Khine (Ed.), Science education in East Asia: Pedagogical innovations and best practices. New York, NY: Springer. What factors can predict PISA science test performance? The OECD collects data about student attitudes, family background, teacher qualifications, and school resources. There are about 400 predictors; it is inappropriate to run a regression model.

Bootstrap Forest in JMP Pro

Bootstrap forest in JMP The bootstrap forest is an ensemble of many classification trees resulting from repeated sampling from the same data set (with replacement). Afterward, the results are combined to reach a converged conclusion. While the validity of one single analysis may be questionable due to a small sample size, the bootstrap can validate the results based on many samples.
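A sketch of the idea with scikit-learn; setting max_features=None restricts the randomness to bootstrapped rows, matching the slide's description of JMP (a standard random forest also samples predictors):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.30, random_state=0)

    # 100 trees, each grown on a bootstrap sample (rows drawn with replacement);
    # their predictions are then aggregated into one conclusion.
    forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                    max_features=None, random_state=0)
    forest.fit(X_tr, y_tr)
    print(forest.score(X_va, y_va))   # accuracy on the 30% holdback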

Column contributions The importance of the predictors is ranked by both the number of splits and the sum of squares (SS) statistic, and these results are presented in a table called column contributions. The number of splits is simply a vote count: how often does this variable appear across all the trees?
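A rough Python analog of the table (an approximation of JMP's output, not a replica): count splits per predictor across the trees and compare with the impurity-based importance.

    from collections import Counter
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    data = load_iris()
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(data.data, data.target)

    splits = Counter()
    for est in forest.estimators_:
        for f in est.tree_.feature:          # leaf nodes are coded as -2
            if f >= 0:
                splits[data.feature_names[f]] += 1

    print(splits.most_common())              # vote count: splits per predictor
    print(forest.feature_importances_)       # rough analog of the SS ranking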

Committee decision It is important to point out that the column contribution table lists all the important predictors shown across all trees, but it is not necessary to accept all of them. Think of a committee of experts: each tree is a judge that can vote, and the final decision is based on the vote counts. Common rule (a suggestion only): if the DV is binary (1/0, P/F) or multinomial, use majority or plurality voting; if the DV is continuous, use the average of the predicted values.
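A tiny sketch of the two aggregation rules, with made-up predictions:

    import numpy as np

    # Five trees vote on four cases (binary DV): majority rule.
    votes = np.array([[1, 0, 1, 1],
                      [1, 0, 0, 1],
                      [0, 0, 1, 1],
                      [1, 1, 1, 0],
                      [1, 0, 1, 1]])
    print((votes.mean(axis=0) > 0.5).astype(int))   # -> [1 0 1 1]

    # Three trees predict a continuous DV for two cases: average the predictions.
    preds = np.array([[2.1, 3.0], [1.9, 3.4], [2.0, 3.2]])
    print(preds.mean(axis=0))                       # -> [2.  3.2]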

When the dependent variable is categorical, the splits are determined by G², which is based on the LogWorth statistic. LogWorth is similar to the likelihood-ratio chi-square statistic. The observations are analyzed in a crosstab, and the fit between the observed and the expected counts is compared. In other words, the fitted values are the estimated proportions within the groups.

Findings of PISA study

Much better than regression! Salford Systems (2015) compared several predictive modeling methods using an engineering data set. OLS regression explained 62% of the variance with a mean square error of 107, whereas the random forest achieved an R-square of 91% with an MSE as low as 26.

Recommended strategies If a reviewer gives you harsh comments and demands more evidence, you can drop the bootstrap forest on their head as the ultimate "nuclear weapon." But if you use the bootstrap forest the first time, you have nothing left in reserve. The decision tree (classification tree, recursive partition tree) has another form of resampling: cross-validation. So do not use the bootstrap forest in the first paper submission; usually CV is good enough for validation.

Recommended strategies You can use two criteria (the number of splits and G², or the number of splits and SS) to select variables if and only if the journal welcomes technical details (e.g., Journal of Data Science, Journal of Data Mining in Education, etc.). If the journal does not like technical details, use the number of splits only. If the reviewers don't understand SS, LogWorth, G², etc., your paper is more likely to be rejected.

Assignment 10.1 Open the data set 'US demographics' from the JMP sample data library. Use household income as Y, and use almost all the other variables as predictors, except region, gross state product, latitude, and longitude. Use a decision tree and hold back 30% of the data as the validation portion. What is the best predictive model?

Assignment 10.2 Download the data set 'PISA_ANN.jmp' from the Unit 9 folder. Choose Analyze → Modeling → Partition. Put ability into Y and all the other variables into X, except proficiency. Run a bootstrap forest with 100 trees, holding back 30% of the data as the validation portion. Which predictors would you retain?