Decision Trees Part II
BMTRY790: Machine Learning, Summer 2017
Tree-Based Methods
- Non-parametric and semi-parametric methods
- Models can be represented graphically, making them intuitive and easy to interpret relative to many statistical models
- Similar to how clinicians think when making decisions about patient care
- Examples of popular decision tree methods include Classification and Regression Trees (CART) and its extensions, conditional CART, and logic regression
General Decision Trees
A majority of decision tree algorithms have three major tasks:
(1) Defining how to partition the data at each step
(2) Determining when to stop partitioning
(3) Determining how to predict the value of the outcome y in a partition
There are many approaches to the first task. Many tree algorithms use univariate splits:
- xi ≤ c (xi continuous or semi-continuous)
- xi ∈ B (xi categorical)
The choice of split is often found by an exhaustive search that optimizes a node impurity criterion (e.g., the Gini index) or the sum of squared errors.
Partitioning for Regression
For continuous y, define a measure of impurity of a node to help decide how to split a node, or which node to split.
In a region Rm we model the response by a constant, f(x) = cm for x ∈ Rm (estimated by the mean of the training responses in Rm).
A split s on variable Xj defines the pair of regions
R1(j, s) = {X | Xj ≤ s} and R2(j, s) = {X | Xj > s}
Regression: Impurity of a Node
We want to choose j and s to solve
min_{j,s} [ min_{c1} Σ_{xi ∈ R1(j,s)} (yi − c1)² + min_{c2} Σ_{xi ∈ R2(j,s)} (yi − c2)² ]
The inner minimizations are solved by the region means,
ĉ1 = ave(yi | xi ∈ R1(j, s)) and ĉ2 = ave(yi | xi ∈ R2(j, s))
(i.e., finding the best split s for each Xj). Once these best splits are determined, it is a simple process to choose the Xj and its associated s that is the overall minimizer.
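A minimal R sketch of this inner search for a single continuous predictor, predicting each side of a candidate split by its mean (the function and data names are illustrative, not from the slides):

# Find the split point s on one continuous predictor x that minimizes
# the sum of squared errors when each side is predicted by its mean.
best_split_sse <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2   # midpoints between observed values
  sse  <- sapply(cuts, function(s) {
    left  <- y[x <= s]
    right <- y[x >  s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split = cuts[which.min(sse)], sse = min(sse))
}

# Repeating this for every predictor Xj and keeping the (j, s) pair with the
# smallest SSE gives the overall best split for the node.
set.seed(1)
x <- runif(50); y <- ifelse(x > 0.6, 2, 0) + rnorm(50, sd = 0.3)
best_split_sse(x, y)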
Partitioning for Classification
In this case we also need a measure of impurity of a node to help decide how to split a node, or which node to split.
The measure should be at a maximum when a node is equally divided amongst all classes, and the impurity should be 0 when prediction is perfect.
Consider the proportion of observations of class k in node m:
p̂_mk = (1 / N_m) Σ_{xi ∈ Rm} I(yi = k)
Classification Measures of Impurity
Possible measures of impurity:
(1) Misclassification rate: 1 − p̂_mk(m), where k(m) is the majority class in node m
- Situations can occur where no split improves the misclassification rate
- The misclassification rate can be equal for two splits even when one option is clearly better for the next step
(2) Deviance/cross-entropy: −Σ_k p̂_mk log p̂_mk
(3) Gini index: Σ_k p̂_mk (1 − p̂_mk)
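A short R sketch computing the three impurity measures from the class labels in a node (illustrative code, not from the slides):

# Node impurity measures for a vector of class labels in node m
node_impurity <- function(classes) {
  p <- table(classes) / length(classes)   # class proportions p_mk
  misclass <- 1 - max(p)                  # misclassification rate
  entropy  <- -sum(p * log(p))            # deviance / cross-entropy
  gini     <- sum(p * (1 - p))            # Gini index
  c(misclassification = misclass, entropy = entropy, gini = gini)
}

node_impurity(c("A", "A", "A", "B"))   # fairly pure node
node_impurity(c("A", "A", "B", "B"))   # maximally impure two-class node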
General Decision Trees
There are also several ways to accomplish the second task:
- Stopping rules: stop the growth of the tree at a specified point
- Pruning: grow the tree to a (near) perfect fit and then prune it back
Examples of rules for either approach:
- minimum number of observations in a node to allow a split
- minimum number of observations in a terminal node
- maximum depth of any node in the tree
Validation approaches can be used to select these parameters.
General Decision Trees
The third task, prediction of the outcome y, is relatively easy.
For regression trees the prediction in terminal node m is the node mean, ĉ_m = ave(yi | xi ∈ Rm).
For classification trees the prediction is the majority class, k(m) = argmax_k p̂_mk.
Classification and Regression Trees: CART
Main idea: divide the feature space into a disjoint set of rectangles and fit a simple model (e.g., a constant) in each one.
For regression, the predicted output is the mean of the training responses in that region.
For classification, the predicted class is the most frequent class in the training data in that region; class probabilities are estimated by the class frequencies in the training data in that region.
CART Trees
CART trees are graphical representations of a CART model. Features of a CART tree:
- Nodes/partitions: binary splits in the data based on one feature (e.g., X1 < t1)
- Branches: paths leading down from a split; go left if X follows the rule defined at the split, go right if X does not meet the criterion
- Terminal nodes: the rectangular regions where splitting ends
[Example tree: splits X1 < t1, X2 < t2, X1 < t3, X2 < t4 defining regions R1 to R5]
CART Tree Predictions
Predictions from a tree are made from the top down. For a new observation with feature vector X, follow the path through the tree that is true for the observed vector until reaching a terminal node; the prediction is the one made at that terminal node. In the case of classification, it is the class with the largest probability at that node.
[Same example tree: splits X1 < t1, X2 < t2, X1 < t3, X2 < t4 leading to regions R1 to R5]
Building a CART Tree
Recursive Partitioning (aka Greedy Search) Algorithm
(1) Consider all possible splits for each feature/predictor (think of a split as a cutpoint; when X is binary this is always ½)
(2) Select the predictor, and the split on that predictor, that provides the best separation of the data
(3) Divide the data into 2 subsets based on the selected split
(4) For each subset, consider all possible splits for each feature
(5) Select the predictor/split that provides the best separation for each subset of the data
(6) Repeat this process until some desired level of data purity is reached; this is a terminal node!
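A compact R sketch of this greedy algorithm for a continuous response, stopping when a node is too small or no split helps (all function, argument, and data names are illustrative and not from the slides):

# Greedy recursive partitioning for a continuous response.
# Returns a nested list of splits; terminal nodes store the node mean.
grow_tree <- function(X, y, min_n = 10) {
  if (nrow(X) < min_n || length(unique(y)) == 1)
    return(list(prediction = mean(y)))                 # terminal node
  best <- list(sse = Inf)
  for (j in seq_along(X)) {                            # (1) all splits, all predictors
    for (s in unique(X[[j]])) {
      left <- X[[j]] <= s
      if (!any(left) || all(left)) next
      sse <- sum((y[left]  - mean(y[left]))^2) +
             sum((y[!left] - mean(y[!left]))^2)
      if (sse < best$sse) best <- list(var = j, split = s, sse = sse)  # (2) best separation
    }
  }
  if (!is.finite(best$sse)) return(list(prediction = mean(y)))
  left <- X[[best$var]] <= best$split                  # (3) divide into 2 subsets
  list(var = names(X)[best$var], split = best$split,
       left  = grow_tree(X[left,  , drop = FALSE], y[left],  min_n),   # (4)-(6) recurse
       right = grow_tree(X[!left, , drop = FALSE], y[!left], min_n))
}

set.seed(2)
d <- data.frame(x1 = runif(100), x2 = runif(100))
y <- ifelse(d$x1 > 0.5, 3, 0) + rnorm(100, sd = 0.5)
str(grow_tree(d, y, min_n = 20), max.level = 2)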
Additional Features of CART
CART can build models when missing data are present, using surrogate splits:
- Say node m splits on xi < c and xi has missing values
- Apply recursive partitioning to the other predictors to identify surrogate splits that group the observations with missing xi into xi < c or xi > c
- Rank all other variables x−i according to how well they reproduce this grouping
- Use the "best" surrogate split to determine which branch an observation with missing xi belongs to
CART also provides a measure of variable importance:
- Sum the goodness-of-split measure for xi across all nodes that split on xi
- Also sum the adjusted goodness of fit for splits that include a surrogate split on xi
- Scale the importance across all x's so the importances sum to 100
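Both features are exposed by the rpart package used later in these slides; a brief hedged sketch using R's built-in airquality data (which contains missing values):

library(rpart)
# Surrogate-split behaviour is controlled through rpart.control():
#   maxsurrogate = number of surrogate splits retained at each node,
#   usesurrogate = how surrogates are used to send missing values down the tree.
fit <- rpart(Ozone ~ ., data = airquality,
             control = rpart.control(maxsurrogate = 5, usesurrogate = 2))

summary(fit)              # lists primary and surrogate splits at each node
fit$variable.importance   # summed (adjusted) goodness-of-split per variable
# Scaling so the importances sum to 100, as described above:
round(100 * fit$variable.importance / sum(fit$variable.importance), 1)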
Recursive Partitioning Splits
Consider a small example data set with predictors X1, X2, X3 and a continuous response Y (table shown on the slides). Using recursive partitioning, the best split found for each variable is:
- X1: X1 > 0.33
- X2: X2 > 2.06
- X3: X3 < 1.07
Surrogate Splits
Now suppose one value of x2 is missing (data table shown on the slides; the splits X1 > 0.33 and X2 > 2.06 have already been identified). Recursive partitioning is applied to the other predictors to choose the surrogate split that best reproduces the grouping defined by X2 > 2.06; here the best surrogate is X3 < 2.64, which determines which branch the observation with the missing x2 value follows.
Fitting CART in R
There are several packages in R that will fit a CART model:
- rpart: probably the most commonly used package; has a lot of good functionality, but the plots produced by rpart are not high quality
- tree: less commonly used; not as much "statistical" functionality as rpart and less well documented, but the graphics are better, and you can interactively work with a fitted CART model from tree once it is plotted
- rpart.plot: a package designed to extend and improve the plotting of CART trees from the rpart package
CART Regression: Ozone Example
Example: ozone concentrations over time
Data include 111 observations from an environmental study that measured six variables over 111 (nearly) consecutive days:
- Ozone: surface concentration of ozone (parts per billion)
- Solar.R: level of solar radiation (Langleys)
- Wind: wind speed (mph)
- Temp: air temperature (°F)
- Month: calendar month
- Day: day of the month
rpart Package
library(rpart); library(rpart.plot)
Cair1 <- rpart(log(Ozone) ~ ., data = air)
Cair1
n=116 (37 observations deleted due to missingness)
node), split, n, deviance, yval
      * denotes terminal node
 1) root
   2) Temp< …
     4) Solar.R< … *
     5) Solar.R>=… *
   3) Temp>=…
     6) Temp< …
      12) Day< … *
      13) Day>=… *
     7) Temp>=…
      14) Temp< … *
      15) Temp>=… *
rpart Package
summary(Cair1)
Call: rpart(formula = log(Ozone) ~ ., data = air)
n=116 (37 observations deleted due to missingness)
CP table: CP, nsplit, rel error, xerror, xstd (values shown on slide)
Variable importance: Temp, time, Month, Solar.R, Wind, Day
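The CP table that summary() reports can also be examined and plotted directly with standard rpart helpers; a brief sketch assuming the Cair1 fit above:

printcp(Cair1)    # CP table: CP, nsplit, rel error, xerror, xstd
plotcp(Cair1)     # cross-validated error versus cp, with a 1-SE reference line
Cair1$cptable     # the same table as a matrix, useful for programmatic selection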
rpart Package
summary(Cair1) …
Node number 1: 116 observations, complexity param=…
  mean=…, MSE=…
  left son=2 (52 obs), right son=3 (64 obs)
  Primary splits:
    Temp    < 77.5 to the left,  improve=…, (0 missing)
    Solar.R < 81.5 to the left,  improve=…, (5 missing)
    Wind    < 8.3  to the right, improve=…, (0 missing)
    time    < 23.5 to the left,  improve=…, (0 missing)
    Month   < 6.5  to the left,  improve=…, (0 missing)
  Surrogate splits:
    Month   < 6.5  to the left,  agree=0.750, adj=0.442, (0 split)
    time    < 28.5 to the left,  agree=0.750, adj=0.442, (0 split)
    Wind    < 8.9  to the right, agree=0.681, adj=0.288, (0 split)
    Solar.R < 138  to the left,  agree=0.647, adj=0.212, (0 split)
    Day     < 10.5 to the right, agree=0.612, adj=0.135, (0 split) …
plot(Cair1, compress=T, main="CART model 1 for Ozone")
text(Cair1, use.n=T)
library(rpart.plot)
rpart.plot(Cair1, compress=T, main="CART model 1 for Ozone")
rpart Package
### Tuning the CART model using "prune"
Cair1
n=116 (37 observations deleted due to missingness)
node), split, n, deviance, yval
      * denotes terminal node
 1) root
   2) Temp< …
…
  14) Temp< … *
  15) Temp>=… *

prune(Cair1, cp=0.05)
 1) root
   2) Temp< …
     4) Solar.R< … *
     5) Solar.R>=… *
   3) Temp>=…
     6) Temp< … *
     7) Temp>=… *
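prune() needs a cp value; one common way to choose it is the 1-SE rule applied to the cross-validated error in the cp table. A hedged sketch, assuming the Cair1 fit above:

cp_tab <- Cair1$cptable
best   <- which.min(cp_tab[, "xerror"])                          # smallest CV error
thresh <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]          # 1-SE threshold
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= thresh)[1], "CP"]   # simplest tree within 1 SE
Cair1_pruned <- prune(Cair1, cp = cp_1se)
Cair1_pruned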
rpart Package
### Tuning CART from the front end using rpart.control
Cair2 <- rpart(log(Ozone) ~ ., data=air, control=rpart.control(minsplit=6, minbucket=2, cp=0.01))
Cair3 <- rpart(log(Ozone) ~ ., data=air, control=rpart.control(minsplit=10, minbucket=5, cp=0.05))
n=116 (37 observations deleted due to missingness)
node), split, n, deviance, yval
      * denotes terminal node
 1) root
   2) Temp< …
     4) Solar.R< … *
     5) Solar.R>=… *
   3) Temp>=…
     6) Temp< … *
     7) Temp>=… *
rpart Package
### Tuning using the caret package (data must be complete)
library(caret)
air2 <- na.omit(air[,c(1:4,7)])
trair <- train(log(Ozone) ~ ., data=air2, method="rpart",
               trControl=trainControl(method="boot", number=1000))
trair
CART
111 samples, 4 predictors
No pre-processing
Resampling: Bootstrapped (1000 reps)
Summary of sample sizes: 111, 111, 111, 111, 111, 111, ...
Resampling results across tuning parameters: cp, RMSE, Rsquared (values shown on slide)
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = …
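By default train() tries only a small set of cp values; a tuning grid can be supplied explicitly. A sketch assuming air2 as defined above (the grid values and use of 10-fold CV here are illustrative choices):

library(caret)
set.seed(123)
trair2 <- train(log(Ozone) ~ ., data = air2, method = "rpart",
                tuneGrid  = expand.grid(cp = seq(0.001, 0.10, by = 0.005)),
                trControl = trainControl(method = "cv", number = 10))
trair2$bestTune     # cp value with the best resampled RMSE
plot(trair2)        # RMSE profile across the cp grid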
Lupus Nephritis Example: treatment response
Data include 213 observations examining treatment response at 1 year in patients with lupus nephritis. The data include:
- Demographics: age, race
- Treatment response: yes/no
- Clinical markers: c4c, dsDNA, EGFR, UrPrCr
- Urine markers: IL2ra, IL6, IL8, IL12
rpart Package
### Fitting a classification model: treatment response in lupus nephritis
ln <- read.csv("H:/public_html/BMTRY790_MachineLearning/Datasets/LupusNephritis.csv")
Cmod1 <- rpart(CR90~., data=ln, method="class", control=rpart.control(minsplit=6, minbucket=2, cp=0.05))
Cmod1
n= 280
node), split, n, loss, yval, (yprob)
      * denotes terminal node
 1) root ( )
   2) il2ra< … ( ) *
   3) il2ra>=… ( )
     6) egfr< … ( ) *
     7) egfr>=… ( ) *
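Predictions and a simple resubstitution confusion matrix follow directly from the fitted classification tree; a sketch assuming Cmod1 and the ln data above:

pred_class <- predict(Cmod1, type = "class")   # predicted class at each terminal node
pred_prob  <- predict(Cmod1, type = "prob")    # estimated class probabilities
table(observed = ln$CR90, predicted = pred_class)   # resubstitution confusion matrix
mean(pred_class == ln$CR90)                          # apparent (optimistic) accuracy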
rpart Package
### Training the model using caret
trln <- train(as.factor(CR90) ~ ., data=ln, method="rpart",
              trControl=trainControl(method="boot", number=1000))
trln
CART
280 samples, 10 predictors, 2 classes: '0', '1'
No pre-processing
Resampling: Bootstrapped (1000 reps)
Summary of sample sizes: 280, 280, 280, 280, 280, 280, ...
Resampling results across tuning parameters: cp, Accuracy, Kappa (values shown on slide)
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = …
Issues with CART
First, CART is biased toward variables that have more distinct values: a variable with m distinct values allows (m - 1) splits, and continuous variables often have m ≈ n while categorical variables have m << n.
Second, CART has difficulty capturing additive structure, for example when the X's are associated with Y through additive main effects and interactions.
Third, CART (and most tree-based methods) suffers from instability of the trees: it is referred to as a weak learner, and small changes in the data can result in very different trees.
Alternatives to CART
- QUEST: selects the variable xi on which to split and then selects the split; xi is selected using an ANOVA or χ² approach (depending on the type of xi)
- CRUISE: similar to QUEST in that it selects xi and then selects the split; allows multiple splits on xi at a node and includes a test for interactions during variable selection
- GUIDE: constructs piecewise-constant, multiple linear, and simple polynomial tree models by (i) fitting a model to the training data at the current node, (ii) cross-tabulating the signs of the residuals with each predictor and selecting the most significant χ² statistic, and (iii) selecting the best split on the selected variable using the appropriate loss function
- Conditional CART: a unified framework embedding recursive binary partitioning with piecewise constant fits into the well-defined theory of permutation tests
These are just a few (others include CHAID, C4.5, C5.0, MARS, and ordinal CART).
Logic Regression
Logic regression is an alternative decision tree method that uses exclusively binary features, although the response can be binary or continuous.
Logic regression produces models that represent Boolean (i.e., logical) combinations of binary predictors using the operators "AND", "OR", and "NOT" (the complement of Xj is written !Xj).
This provides greater flexibility to model complex interactions among binary predictors.
Logic Regression
Like other decision tree approaches, logic regression is a non-parametric/semi-parametric, data-driven method that can model continuous and categorical responses; it can also be used for time-to-event (survival) outcomes.
Unlike the decision trees we discussed previously, logic regression requires all features to be binary. (Note: features in CART/decision trees are also effectively treated as binary, but the "best" split for a given feature is identified during model development.)
Logic regression uses a simulated annealing algorithm, rather than recursive partitioning, to find the "best" model.
Comparing Tree Structures
Common decision tree approaches
- Nodes: define splits on features in the tree
- Branches: define relationships between features in the tree
- Terminal nodes: defined by a subgroup of the data (the prediction is defined by the results in the terminal node)
- Predictions: start at the top of the tree and go down the appropriate splits until a terminal node is reached
Logic regression
- Knots (nodes): the "AND"/"OR" operators in the tree
- Leaves (terminal nodes): define the features (i.e., X's) used in the tree
- Predictions: the tree can be broken down into a set of logical relationships among features; if an observation matches one of the feature combinations defined in the tree, its class is predicted to be 1
Logic Regression
Branches in a decision tree can be thought of as "AND" combinations of the features at each node along the branch, and the only "OR" operator in a decision tree occurs at the first split.
The inclusion of "OR", "AND", and "NOT" operators in logic regression trees provides greater flexibility than is seen in CART models.
For data with predominantly binary predictors (e.g., single nucleotide polymorphisms, survey data, etc.), logic regression is a better option. Consider an example…
CART vs. Logic Regression Trees
Assume the features are a set of binary predictors and the following combination predicts that an individual has disease (versus being healthy): (X1 AND X4) OR (X1 AND !X5).
[Side-by-side diagrams: the CART tree encodes this rule through splits X1 < 0.5, X4 < 0.5, and X5 < 0.5, with predictions C = 1 or C = 0 at the terminal nodes; the logic regression tree represents it directly as OR(AND(X1, X4), AND(X1, !X5)).]
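The logic model is simply a Boolean expression, which can be evaluated directly with R's logical operators; a small illustration with simulated binary predictors (the data here are simulated purely for illustration):

set.seed(3)
X <- data.frame(X1 = rbinom(10, 1, 0.5),
                X4 = rbinom(10, 1, 0.5),
                X5 = rbinom(10, 1, 0.5))
# (X1 AND X4) OR (X1 AND !X5), written with R's logical operators
disease <- (X$X1 == 1 & X$X4 == 1) | (X$X1 == 1 & X$X5 == 0)
cbind(X, predicted_class = as.integer(disease))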
Logic Regression Trees Not Unique
Logic regression trees are not necessarily unique; the same model, (X1 AND X4) OR (X1 AND !X5), can be represented by two seemingly different trees.
[Tree 1: OR(AND(X1, X4), AND(X1, !X5)); Tree 2: AND(X1, OR(X4, !X5)), which is logically equivalent.]
Cancer Example of Logic Regression
Logic model for stage P1 cancer and its equivalent logic tree [shown on slide: a tree built from H-ras, FGFR3, 9q-, !p21, Rb, and !p53 combined with AND/OR operators]
- Logic regression (LR) is a statistical method capable of discovering logical combinations of predictors that describe a binary response
- We will use the model of urinary bladder cancer proposed by Mitra et al. (2006) as an example of how an LR model might look
- The logic tree pictured here represents that published model
- Note: a particular logic tree is not necessarily a unique representation of the model
Mitra, A. P. et al., 2006
Cancer Example of Logic Regression
Logic model for stage P1 cancer and its equivalent logic tree [same tree as above, built from H-ras, FGFR3, 9q-, !p21, Rb, and !p53]
- Note that all three interaction terms in the model include loss of p53
- If we fail to capture this information when collecting data, it may be difficult to correctly identify relationships among the other predictors that are observed
Mitra, A. P. et al. J Clin Oncol; 24:
Fitting a Logic Regression Model
CART considers one feature at a time; logic regression considers all logical combinations of predictors (up to a fixed size), so the search space can be very large.
Recursive partitioning could be used for logic regression; however, since it only considers one predictor at a time, it may fail to identify optimal combinations of predictors.
Simulated annealing is a stochastic learning algorithm that is a viable alternative to recursive partitioning.
Simulated Annealing
The original idea for simulated annealing (SA) comes from metallurgy: with metal mixtures, strength, malleability, etc. are controlled by controlling the temperature at which the metals anneal.
SA borrows from this idea to effectively search a large feature space without getting "stuck" at a local max/min.
SA conducts its search using a random-walk Markov chain.
The rate of annealing is controlled by:
- the number of random steps the algorithm takes
- the cooling schedule (the starting and ending annealing "temperatures")
Simulated Annealing
(0) Select a measure of model fit (scoring function)
(1) Select a random starting point in the search space (i.e., one possible combination of features)
(2) Randomly select one of the allowable moves to update the current model
(3) Compare the new model to the old model using the measure of fit
(4) If the new model is better, accept it unconditionally; if the new model is worse, accept it with some probability, which is related to the current annealing "temperature"
(5) Repeat for some set number of iterations (number of steps)
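A generic sketch of the acceptance step under a cooling schedule (a schematic Metropolis-style rule, not the LogicReg implementation itself; note that the start/end values in logreg.anneal.control are log10 temperatures, so end = -1 corresponds to a temperature of 0.1):

# Accept or reject a proposed model given current and proposed scores
# (lower = better) and the current annealing temperature.
accept_move <- function(score_current, score_proposed, temperature) {
  if (score_proposed <= score_current) return(TRUE)            # better: always accept
  prob <- exp((score_current - score_proposed) / temperature)  # worse: accept sometimes
  runif(1) < prob
}

# As the temperature cools, worse moves become increasingly unlikely to be accepted.
set.seed(4)
temps <- c(1, 0.1, 0.01)
sapply(temps, function(tmp) accept_move(score_current = 1.0,
                                        score_proposed = 1.2,
                                        temperature = tmp))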
Six Allowable Moves
[Figure: the six permissible moves for modifying a logic tree: alternate a leaf, alternate an operator, grow a branch, prune a branch, split a leaf, and delete a leaf]
Ruczinski I, Kooperberg C, LeBlanc M (2003). Logic Regression. JCGS, 12(3).
Simulated Annealing
Measures of model fit:
1. Classification: misclassification rate (classification model); simulated annealing does not have the same issues with misclassification that recursive partitioning does. Deviance/cross-entropy (logistic classification model)
2. Regression: sum-of-squares error
3. Time to event/survival: hazard function
4. User-specified model fit/loss function (write your own)
Other Fitting Considerations
Choosing the annealing parameters (start/end temperatures, number of iterations):
- Determine an appropriate set of annealing parameters prior to running the model
- Starting temperature: select so that 90-95% of "worse" models are accepted
- Ending temperature: select so that fewer than 5% of worse models are accepted
- The number of iterations is a matter of choice; more iterations = longer run time
Choosing the maximum number of leaves/trees:
- It is also good to determine an appropriate number of leaves/trees before running the final model
- This can be done by cross-validation (select the model size with the smallest CV error)
Example: Breast Cancer
Study examining factors that impact breast cancer grade. Specifically, the PI is interested in determining factors associated with higher cancer grade (grade 3 or higher vs. lower grade).
Predictors include:
- Patient age, race (AA vs. White)
- Positive on ECHO (yes/no)
- Estrogen receptor status (present/absent)
- Her2 status (present/absent)
- Surgical margin (positive/negative)
Fitting Logic Regression Model in R
### First ensure all predictor variables are binary
library(LogicReg)
BC <- read.csv("H:\\public_html\\BMTRY790_MachineLearning\\Datasets\\BC_trees.csv")
BC$agecat <- ifelse(BC$age < 61, 0, 1)
BC$gradecat <- ifelse(BC$grade < 3, 0, 1)   # assuming the raw grade column is named "grade"
Fitting Logic Regression Model in R
### Fitting a classification model
### Using CV to select the number of leaves
anneal.params <- logreg.anneal.control(start = 2, end = -1, iter = …)
logreg(resp=BC2$gradecat, bin=BC2[,c(1:8,10,11)], type=1, select=3, ntrees=1,
       nleaves=c(3,8), kfold=5, anneal.control = anneal.params)
The number of trees in these models is 1
The model size is 2
      training-now  training-ave  test-now  test-ave
Step 1 of 5 [ 1 trees; 2 leaves] CV score: …
Step 2 of 5 [ 1 trees; 2 leaves] CV score: …
Step 3 of 5 [ 1 trees; 2 leaves] CV score: …
Step 4 of 5 [ 1 trees; 2 leaves] CV score: …
Step 5 of 5 [ 1 trees; 2 leaves] CV score: …
Fitting Logic Regression Model in R
logreg(resp=BC2$gradecat, bin=BC2[,c(1:8,10,11)], type=1, select=3, ntrees=1,
       nleaves=c(3,8), kfold=5, anneal.control = anneal.params)
…
The model size is 8
      training-now  training-ave  test-now  test-ave
Step 1 of 5 [ 1 trees; 8 leaves] CV score: …
Step 2 of 5 [ 1 trees; 8 leaves] CV score: …
Step 3 of 5 [ 1 trees; 8 leaves] CV score: …
Step 4 of 5 [ 1 trees; 8 leaves] CV score: …
Step 5 of 5 [ 1 trees; 8 leaves] CV score: …
ntree  nleaf  train.ave  train.sd  cv/test  cv/test.sd  (values shown on slide)
Fitting Logic Regression Model in R
### Fitting a classification model using the selected number of leaves,
### BUT make sure the annealing parameters are good
anneal.params <- logreg.anneal.control(start = 1, end = -1, iter = …, update=1000)
fit <- logreg(resp=BC2$gradecat, bin=BC2[,c(1:8,10,11)], type=1, select=1, ntrees=1,
              nleaves=5, anneal.control = anneal.params)
log-temp  current score  best score  acc / rej / sing
(iteration log shown on slide)
Fitting Logic Regression Model in R
### Fitting logistic regression tree(s) instead of a classification tree
### Using CV to select the number of leaves
anneal.params <- logreg.anneal.control(start = 2, end = -4, iter = …)
logreg(resp=BC$gradecat, bin=BC[,c(3:8,10)], type=2, select=3, ntrees=c(1,2),
       nleaves=c(3,8), kfold=5, anneal.control = anneal.params)
ntree  nleaf  train.ave  train.sd  cv/test  cv/test.sd  (values shown on slide)
Fitting Logic Regression Model in R
### Fitting a logistic classification model using the selected number of leaves/trees
### Again, check that the annealing parameters are good
anneal.params <- logreg.anneal.control(start = 2, end = -4, iter = …, update=5000)
fit <- logreg(resp=BC$gradecat, bin=BC[,c(3:8,10)], type=2, select=1, ntrees=1,
              nleaves=7, anneal.control = anneal.params)
log-temp  current score  best score  acc / rej / sing  current parameters
(iteration log shown on slide)
Fitting Logic Regression Model in R
### Fitting a logistic classification model using the selected number of leaves/trees
### Again, check that the annealing parameters are good (here with adjusted start/end temperatures)
anneal.params <- logreg.anneal.control(start = 1, end = -3.5, iter = …, update=5000)
fit <- logreg(resp=BC$gradecat, bin=BC[,c(3:8,10)], type=2, select=1, ntrees=1,
              nleaves=7, anneal.control = anneal.params)
log-temp  current score  best score  acc / rej / sing  current parameters
(iteration log shown on slide)
fit
score …
 * ((her2cat and ((not prcat) or (not echocat))) or ((not er) and (racecat or (not margincat))))
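The fitted tree corresponds to a single Boolean rule, so the predicted class can be written out directly; a hedged sketch assuming the BC data frame above with all binary variables coded 1 = present/positive:

# Fitted rule: (her2cat AND (!prcat OR !echocat)) OR (!er AND (racecat OR !margincat))
rule <- with(BC, (her2cat == 1 & (prcat == 0 | echocat == 0)) |
                 (er == 0 & (racecat == 1 | margincat == 0)))
pred_high_grade <- as.integer(rule)
table(observed = BC$gradecat, predicted = pred_high_grade)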
Some Final Notes: The Good
- Tree-based methods are very flexible for classification and regression
- The models can be presented graphically for easier interpretation; clinicians like them, since they tend to think this way anyway
- Decision trees such as CART are very useful in settings where features are continuous and binary
- Logic regression is useful for modeling data with all binary features (e.g., SNP data)
Some Final Notes: The Bad (and Sometimes Ugly)
- They have a tendency to over-fit the data (not uncommon among machine learning methods)
- They are referred to as weak learners: small changes in the data can result in very different models
There has been extensive research into improving the performance of these methods, which is the topic of our next class.