Published by Leslie Miller. Modified over 9 years ago.
1
Chapter 1: Associations 1.1 Introduction to Categorical Data 1.2 Examining Associations among Variables 1.3 Correspondence Analysis 1.4 Recursive Partitioning 1
3
Objectives Recognize the differences between categorical and continuous data analysis. Identify the scale of measurement for your response variable. Examine the distribution of categorical data. 3
4
Categorical Data Categorical data represents categories, classes and classifications, groups, or qualitative characteristics or attributes. –respondent gender ( male or female ) –product disposition ( conforming or nonconforming ) –patient mortality ( survived or died ) Continuous data represents measurements. –length, time, temperature, concentration Categorical data is qualitative; continuous data is quantitative. Categorical data values are discrete, and the distance between categories is unknown. 4
5
Categorical Response The methods presented in this course are appropriate for a response (dependent variable) that is categorical. –Methods such as the Student t-test, a two-way analysis of variance (ANOVA), or multiple least squares linear regression are not appropriate. The explanatory variables (independent or predictor variable) can be continuous or categorical. –The nature of the explanatory variable can also determine which methods are appropriate. 5
6
Probability The analysis or modeling of a continuous response directly applies to the value or measurement itself. –This approach is not possible for a categorical response. The analysis or modeling of a categorical response is based on the proportion or probability of each level. 6
7
Common Applications Medicine, epidemiology, and public health Sociology and behavioral science Marketing and demographics Political science Quality and Six Sigma 7
9
1.01 Multiple Answer Poll What is your area of application for categorical data analysis? a.Medicine, epidemiology, and public health b.Sociology and behavioral science c.Marketing and demographics d.Political science e.Quality and Six Sigma f.Other 9
10
Data Type for Categorical Data You might use either the numeric or the character data type to represent categorical data, such as customer satisfaction. –1, 2, 3, 4, 5 (a Likert scale) –Poor, fair, good, very good, excellent You must use the numeric data type to represent continuous data, such as a physical measurement. 10
11
Modeling Type for Categorical Data You must use either the nominal or ordinal modeling type for categorical data. –Nominal variables contain values without any natural ordering. Hair color, gender, political affiliation, or county of residence –Ordinal variables contain values with a natural order. Satisfaction index, income category, or level of education You must use the continuous modeling type for interval or ratio data. 11
13
1.02 Multiple Choice Poll What is the best choice for the data type and modeling type for the combination of variables Age (in years) and Gender ( male or female )? a.(numeric, continuous) and (character, ordinal) b.(numeric, ordinal) and (character, continuous) c.(numeric, continuous) and (character, nominal) d.(character, nominal) and (numeric, continuous) 13
14
1.02 Multiple Choice Poll – Correct Answer What is the best choice for the data type and modeling type for the combination of variables Age (in years) and Gender ( male or female )? a.(numeric, continuous) and (character, ordinal) b.(numeric, ordinal) and (character, continuous) c.(numeric, continuous) and (character, nominal) d.(character, nominal) and (numeric, continuous) Correct answer: c. Age in years is a numeric measurement, and Gender has no natural ordering. 14
15
Titanic Example You will use the Titanic data set to explore the nature of categorical data. –Class: first, second, or third class passengers, or crew members –Age: adult or child –Sex: male or female –Survived: yes or no 15
16
This demonstration illustrates the concepts discussed previously. Categorical Data Example 16
18
1.03 Multiple Choice Poll What data type and modeling type are used for the Age variable? a.Character, ordinal b.Numeric, nominal c.Character, nominal d.Character, continuous 18
19
1.03 Multiple Choice Poll – Correct Answer What data type and modeling type are used for the Age variable? a.Character, ordinal b.Numeric, nominal c.Character, nominal d.Character, continuous 19
21
Distribution of Continuous Data Continuous data might be realized as an infinity of values, within an arbitrary level of discreteness, over a given range. The distribution or frequency of these values depends on the process that generates them. –Many examples can be described by the normal distribution. The distribution might be asymmetric when values approach a natural boundary. The distribution might exhibit unusual tails. 21
22
Distribution Models for Continuous Data Many mathematical models exist for continuous data. The model parameters determine the characteristics of the distribution. –The model is fit to the data by determining the best values for the parameters. The model can be expressed as functions: –probability density function (PDF) –cumulative distribution function (CDF) Common examples of models are the normal, lognormal, Weibull, Johnson, and gamma distributions. 22
23
Distribution of Categorical Data Categorical data might be realized only as discrete values, few or many. The distribution or frequency of these values depends on the process that generates them. –Many examples of dichotomous responses can be described by the binomial distribution. The distribution might not be symmetric. The distribution of many levels might exhibit unusual tails. 23
24
Distribution Models for Categorical Data Many mathematical models exist for categorical data. The model parameters determine the characteristics of the distribution. –The model is fit to the data by determining the best values for the parameters. The model can be expressed as functions: –probability mass function (PMF) –cumulative distribution function (CDF) Common examples of models are the binomial, negative binomial, geometric, hypergeometric, and Poisson distributions. 24
25
Binomial Distribution Model The basis for this distribution is a Bernoulli trial. –There are only two possible outcomes of each trial, generally 1 for success or 0 for failure. –Each individual outcome (y_i) is independent of the others, and the probability of the outcome 1 is the same in every trial. The total number of successes (outcome of 1) is y. 25
26
Binomial Distribution Model The binomial distribution describes the probability of y, the number of successes, from 0 to n. The parameters in this model are n, the number of trials, and π, the probability of outcome 1 in each trial. The expected value (mean) is nπ and the variance is nπ(1−π) for the binomial distribution. 26
27
Example of Binomial Distribution A college basketball player finished the last season with a record of 77% success making free throws. –What performance should you expect from this player if her free-throw success rate has not changed? Specifically, how many baskets should she make in 25 attempts? 27
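The free-throw question can be worked directly from the binomial model. The following is a minimal sketch, not part of the original slides; the function name binomial_pmf is illustrative, and the values n = 25 and π = 0.77 are taken from the example above.

```python
from math import comb, sqrt


def binomial_pmf(y: int, n: int, pi: float) -> float:
    """Probability of exactly y successes in n independent trials."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)


n, pi = 25, 0.77                      # attempts and success rate from the example
mean = n * pi                         # expected number of baskets, n * pi
sd = sqrt(n * pi * (1 - pi))          # standard deviation, sqrt(n * pi * (1 - pi))
# The single most probable count of baskets under this model.
most_likely = max(range(n + 1), key=lambda y: binomial_pmf(y, n, pi))

print(f"mean = {mean:.2f}, sd = {sd:.2f}, most likely count = {most_likely}")
```

Under these assumptions the expected count is nπ = 19.25 baskets, so roughly 19 to 20 made free throws out of 25 would be unsurprising.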
29
1.04 Multiple Choice Poll What is the parameter π in the binomial distribution model? a.The total number of successes b.The probability of success in each trial c.The number of possible outcomes from each trial d.The proportion of failures in each trial 29
30
1.04 Multiple Choice Poll – Correct Answer What is the parameter π in the binomial distribution model? a.The total number of successes b.The probability of success in each trial c.The number of possible outcomes from each trial d.The proportion of failures in each trial Correct answer: b. 30
31
Graphics for Frequency and Proportion Statistical graphics are designed to interpret the data. The bar chart represents the frequency of each level by the length of its bar. The mosaic plot represents the proportion of each level by the length of its segment. 31
32
Multinomial Distribution Model Some categorical responses have more than two possible values. The idea of the binomial distribution can be extended to the multinomial distribution. 32
33
Test Proportions There might be supposed proportions for each of the categories in the response variable. The sample can be used to test that supposition. JMP calls this command test probabilities. Enter a probability for only the subset of levels that you want to test, and leave the others blank, when you have a response with more than two levels. –Enter 1 for all levels to test if they are equal. 33
34
Chi-Square Test for Proportions The appropriate test of proportions is based on the chi-square statistic. –This statistic is covered in detail in the next section. The test is available for three situations: –Test whether probabilities are not equal to supposition –Test whether probabilities are greater than supposed –Test whether probabilities are less than supposed 34
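A chi-square test of supposed proportions compares the observed count in each level with the count expected under the supposition. The sketch below is an illustration only: the counts are hypothetical, the function name chi_square_gof is invented here, and rescaling the entered values so that they sum to 1 is an assumption made so that entering 1 for every level means equal proportions. It does not reproduce the JMP dialog.

```python
def chi_square_gof(observed, hypothesized):
    """Pearson chi-square statistic for supposed proportions.

    `hypothesized` is rescaled to sum to 1, so [1, 1, 1] means equal proportions.
    Returns the statistic and its degrees of freedom (levels - 1).
    """
    total = sum(observed)
    scale = sum(hypothesized)
    expected = [total * p / scale for p in hypothesized]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


counts = [30, 50, 20]                       # hypothetical counts for three levels
stat, df = chi_square_gof(counts, [1, 1, 1])  # test whether all levels are equally likely
print(f"chi-square = {stat:.3f} on {df} degrees of freedom")
```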
35
Poisson Distribution Sometimes the number of trials is not fixed and there is no practical upper limit. The response y is the count of events over time. The Poisson distribution is often a good model for the distribution of y. This model has a single parameter, λ. 35
36
36 This demonstration illustrates the concepts discussed previously. Examining Distributions
38
Exercise This exercise reinforces the concepts discussed previously. 38
39
Chapter 1: Associations 1.1 Introduction to Categorical Data 1.2 Examining Associations among Variables 1.3 Correspondence Analysis 1.4 Recursive Partitioning 39
40
Objectives Determine whether an association exists among categorical variables. Perform a stratified analysis of categorical variables. 40
41
Association An association exists between two variables if the distribution of one variable changes when the level (or value) of the other variable changes. If there is no association, the distribution of the first variable is the same, regardless of the level of the other variable. 41
42
No Association Is mood associated with the weather? [Figure: the mood percentages are 72% and 28% at each level of the weather.] 42
43
Association Is mood associated with the weather? [Figure: the mood percentages are 82% and 18% at one level of the weather, and 40% and 60% at the other.] 43
44
44 This demonstration illustrates the concepts discussed previously. Recognizing Associations
45
Marginal Distribution in an Association The marginal distribution of the response ignores the explanatory variable. The mosaic plot explores the data without regard to any association. 45
46
Conditional Distribution in an Association The conditional distribution of the response describes the frequency of the responses for each level of the explanatory variable. The mosaic plot explores the data and the possibility of an association. 46
47
Two-Dimensional Mosaic Plot This mosaic plot includes the marginal distribution on the right and the conditional distribution on the left. 47
48
48 This demonstration illustrates the concepts discussed previously. Exploring Associations
50
1.05 Quiz Is there an association between the severity of an adverse reaction and the treatment? 50
51
1.05 Quiz – Correct Answer Is there an association between the severity of an adverse reaction and the treatment? No, the distribution of ADR SEVERITY is the same between the two levels of TREATMENT GROUP. 51
52
Test for Association The row percentage (proportion or probability) is used to test the association between Survived and Class. 52
53
Null Hypothesis H 0 : There is no association between Survived and Class. The probability of surviving is the same, regardless of the class of the passenger. 53
54
Alternative Hypothesis H 1 : There is an association between Survived and Class. The probability of surviving is different between crew, first, second, and third class passengers. 54
55
Chi-Square Test The expected frequencies are based on the marginal distribution, or null hypothesis. –No association: observed frequencies = expected frequencies –Association: observed frequencies ≠ expected frequencies 55
56
Expected Frequency The expected frequency of each cell is based on the marginal distribution (null hypothesis). It is the product of the marginal proportion of the explanatory variable and the marginal frequency of the response. Example: 0.4021 × 1490 = 599.114 56
57
Pearson Chi-Square Statistic The observed frequency is compared to the expected frequency. The cell statistics are accumulated into the sample statistic. Example: (73.886)² / 599.114 = 9.112 57
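The expected-frequency and cell-statistic arithmetic can be reproduced in a few lines. This sketch assumes the commonly distributed Titanic counts by Class and Survived, which are not listed in the slides; they are used here because they agree with the 0.4021, 1490, and 599.114 figures quoted above. The function name pearson_chi_square is illustrative.

```python
def pearson_chi_square(table):
    """Return expected counts, per-cell statistics, the total statistic, and df."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count = row total * column total / grand total (no association).
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    cells = [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    stat = sum(sum(row) for row in cells)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, cells, stat, df


# Assumed counts. Rows: crew, first, second, third. Columns: Survived = no, yes.
titanic = [[673, 212], [122, 203], [167, 118], [528, 178]]
expected, cells, stat, df = pearson_chi_square(titanic)

print(f"expected (crew, no) = {expected[0][0]:.3f}")
print(f"cell statistic (crew, no) = {cells[0][0]:.3f}")
print(f"Pearson chi-square = {stat:.2f} on {df} degrees of freedom")
```

The first two printed values correspond to the 599.114 and 9.112 shown on the slides; the total statistic is the sum of all eight cell statistics.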
58
p-Value for Chi-Square Test This p-value is the –probability of observing a chi-square sample statistic at least as large as the one actually observed, given that there is no association between the variables –probability that an association as strong as the one that you observe in the data occurs by chance alone. 58
59
Chi-Square Tests Chi-square tests and the corresponding p-values –determine whether an association exists –do not measure the strength of an association –depend on and reflect the sample size. 59
60
Agreement A stronger relationship than an association might be sought when the two variables use the same levels. Agreement measures the strength of such a relationship. Cohen’s kappa, κ, for agreement. Bowker’s test of symmetry (association) McNemar’s test of agreement (Bowker’s test when levels are the same) 60
61
Trend in Association Two variables might exhibit a trend in the association between their ordered levels. –The response has two levels. –The predictor is ordinal. The Cochran-Armitage test is available for a trend. 61
62
62 This demonstration illustrates the concepts discussed previously. Chi-Square Test
64
1.06 Quiz Is there sufficient evidence that an association exists between adverse effect severity and treatment? 64
65
1.06 Quiz – Correct Answer Is there sufficient evidence that an association exists between adverse effect severity and treatment? No, the p-value for the Pearson chi-square statistic is 0.7919, so there is insufficient evidence to reject the null (that no association exists) at α=0.05. 65
67
When Not to Use the Chi-Square Test When more than 20% of the cells have expected counts less than five. 67
68
Observed versus Expected Values
Observed Values
1 5 8
5 6 7
6 5 6
Expected Values
3.43 4.57 6.00
4.41 5.88 7.71
4.16 5.55 7.29
4 of 9 cells, or 44%, with expected value less than 5. 1 of 9 cells, or 11%, with observed value less than 5. 68
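The 20% rule is checked against the expected counts, not the observed ones. The sketch below recomputes the expected values from the observed 3×3 table on this slide and counts the small cells; the helper name expected_counts is illustrative.

```python
def expected_counts(table):
    """Expected counts under no association: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


observed = [[1, 5, 8], [5, 6, 7], [6, 5, 6]]   # observed values from the slide
expected = expected_counts(observed)

n_cells = sum(len(row) for row in observed)
small_expected = sum(e < 5 for row in expected for e in row)
small_observed = sum(o < 5 for row in observed for o in row)

print(f"{small_expected} of {n_cells} cells have expected count < 5")
print(f"{small_observed} of {n_cells} cells have observed count < 5")
# More than 20% of cells with small expected counts suggests an exact test instead.
use_exact_test = small_expected / n_cells > 0.20
```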
69
Small Samples – Fisher’s Exact Test [Figure: Fisher’s exact test on a sample-size scale that runs from small to large.] 69
70
Example: Tea and Milk Suppose you want to test whether someone can determine whether a cup of tea with milk had the milk poured first or the tea poured first. 70
71
Fisher’s Exact Test Example 8 cups of tea: 4 with milk first (M) and 4 with tea first (T). Predict which cups had tea poured first. [Table: Prepared by Test, each with levels M and T; the marginal totals are fixed at 4 for every row and column.] 71
72
Basis for Fisher’s Exact Test The row and column totals are fixed at 4 in the Prepared by Test table (levels M and T). Sample: (3 1 / 1 3). Other possible samples: (0 4 / 4 0), (1 3 / 3 1), (2 2 / 2 2), and (4 0 / 0 4). 72
73
Fisher’s Exact Test Hypotheses Null Hypothesis: There is no association. Alternative Hypothesis: There is an association. Left-tailed Right-tailed Two-tailed 73
74
Left-Tailed Alternative Hypothesis The alternative hypothesis is that the prediction is worse than that by chance. [Figure: left-tailed p-value, showing the actual table (3 1 / 1 3) with the tables (2 2 / 2 2), (1 3 / 3 1), and (0 4 / 4 0).] 74
75
Right-Tailed Alternative Hypothesis The alternative hypothesis is that the prediction is better than that by chance. [Figure: right-tailed p-value, showing the actual table (3 1 / 1 3) with the table (4 0 / 0 4).] 75
76
Two-Tailed Alternative Hypothesis [Figure: two-tailed p-value, showing the actual table (3 1 / 1 3) with the other possible tables (0 4 / 4 0), (1 3 / 3 1), (2 2 / 2 2), and (4 0 / 0 4).] 76
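Because the margins are fixed, each possible table is determined by one cell, and its probability follows the hypergeometric distribution. The sketch below enumerates the five possible tea-tasting tables and accumulates the three p-values for the sample with 3 correct in the top-left cell. The two-tailed rule used here (sum the tables that are no more probable than the observed one) is a common convention and is an assumption about what the software reports; the function name table_probability is illustrative.

```python
from math import comb


def table_probability(k, row1=4, row2=4, col1=4):
    """Hypergeometric probability that the top-left cell equals k, margins fixed."""
    return comb(row1, k) * comb(row2, col1 - k) / comb(row1 + row2, col1)


observed_k = 3                                   # top-left cell of the sample table
probs = {k: table_probability(k) for k in range(5)}

left_p = sum(p for k, p in probs.items() if k <= observed_k)
right_p = sum(p for k, p in probs.items() if k >= observed_k)
# Two-tailed: tables that are as probable as, or less probable than, the observed one.
two_p = sum(p for p in probs.values() if p <= probs[observed_k] + 1e-12)

print(f"left = {left_p:.4f}, right = {right_p:.4f}, two-tailed = {two_p:.4f}")
```

With only 8 cups, even 3 of 4 correct gives a right-tailed p-value well above 0.05, which illustrates how little power the test has in very small samples.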
77
77 This demonstration illustrates the concepts discussed previously. Fisher’s Exact Test
79
1.07 Quiz Which test should you use for the alternative hypothesis that finishing the prescription decreases the chance of a relapse, and is the test significant at α=0.05? 79
80
1.07 Quiz – Correct Answer Which test should you use for the alternative hypothesis that finishing the prescription decreases the chance of a relapse, and is the test significant at α=0.05? The Left test is for the specified hypothesis and the p-value=0.0007 is significant at the α=0.05 level. 80
82
Stratified Data Analysis Stratified data analysis is the process of dividing subjects into groups defined by the levels of a third variable. Use this analysis when you want to examine the association between two variables within the levels of a third variable. 82
83
Unstratified Data Analysis Of the 39 single people, 23% have lung cancer and 77% do not. Of the 36 married people, 17% have lung cancer and 83% do not. 83
84
Stratified Data Analysis Of the 28 single smokers, 28% have lung cancer and 72% do not. Of the 14 married smokers, 28% have lung cancer and 72% do not. 84
85
Cochran-Mantel-Haenszel Statistics 85
86
Sample Size for CMH versus Chi-Square It is recommended that you have either –a sample size of 25 for each degree of freedom in the original table, or –at least 80% of cells with an expected frequency of at least 5 (the same as for the unstratified test). 86
87
1. Correlation of Scores Test for a linear association between A and B. 87
88
2. Row Scores by Column Categories Test for equal row scores. 88
89
3. Column Scores by Row Categories Test for equal column scores. 89
90
4. General Association of Categories Test for a general association between A and B. 90
92
1.08 Multiple Choice Poll Which CMH test is the most appropriate for Survived (nominal, columns) versus Class (ordinal, rows) when stratified by Sex? a.Row Scores by Column Categories b.General Association of Categories c.Correlation of Scores d.Column Scores by Row Categories 92
93
1.08 Multiple Choice Poll – Correct Answer Which CMH test is the most appropriate for Survived (nominal, columns) versus Class (ordinal, rows) when stratified by Sex? a.Row Scores by Column Categories b.General Association of Categories c.Correlation of Scores d.Column Scores by Row Categories A. Row Scores for ordinal Class by Column Categories of nominal Survived. 93
94
CMH Statistics and 2x2 Tables For a 2x2 table, all CMH statistics are equal. 94
95
When Do CMH Tests Lack Power? The CMH statistics accumulate over the strata. If the association is similar in all strata, then the statistics are strengthened. –This case is easier to detect, and the tests have more power. If the association changes or reverses across strata, then the statistics are weakened. –This case is more difficult to detect, and the tests have less power. 95
96
Concordance and Discordance A crosstabulation of ordinal data introduces the ideas of concordance and discordance. –These ideas involve a pair of observations. The association might exhibit a trend. A pair is concordant if one observation that is ranked higher on X is also ranked higher on Y. A pair is discordant if one observation that is ranked higher on X is ranked lower on Y. A pair is tied if both observations have the same level for X and Y. 96
97
Measures of Association Measures of association for ordinal variables serve like the correlation coefficient for continuous variables that exhibit a linear trend. –Gamma: ignores ties –Kendall’s τ-b: corrects for ties –Stuart’s τ-c: corrects for table size and ties –Somers’ D: asymmetric modification of τ-b –Lambda: measures improvement in predicting Y, given X; two asymmetric forms –Uncertainty Coefficient U: proportion of uncertainty explained 97
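Concordant and discordant pairs can be counted directly from a crosstabulation of two ordinal variables, and gamma follows from the two counts as (C − D) / (C + D). The sketch below is illustrative: the function names and the example table are hypothetical, and the rows and columns are assumed to be listed in their natural order.

```python
def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in an ordered r x c table."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):        # cells ranked higher on X
                for m in range(n_cols):
                    pairs = table[i][j] * table[k][m]
                    if m > j:                     # also ranked higher on Y
                        concordant += pairs
                    elif m < j:                   # ranked lower on Y
                        discordant += pairs
    return concordant, discordant


def goodman_kruskal_gamma(table):
    """Gamma ignores tied pairs: (C - D) / (C + D)."""
    c, d = concordant_discordant(table)
    return (c - d) / (c + d)


example = [[3, 1], [1, 3]]                        # hypothetical ordered 2x2 table
c, d = concordant_discordant(example)
gamma = goodman_kruskal_gamma(example)
print(f"C = {c}, D = {d}, gamma = {gamma:.2f}")
```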
98
98 This demonstration illustrates the concepts discussed previously. CMH Tests
100
100 Exercise This exercise reinforces the concepts discussed previously.
102
1.09 Multiple Choice Poll The Correlation of Scores CMH test has which null hypothesis? a.There is no linear association between the row and column variables in any stratum. b.The mean scores for each column are equal in each stratum. c.The mean scores for each row are equal in each stratum. d.There is no association between the row and column variables in any stratum. 102
103
1.09 Multiple Choice Poll – Correct Answer The Correlation of Scores CMH test has which null hypothesis? a.There is no linear association between the row and column variables in any stratum. b.The mean scores for each column are equal in each stratum. c.The mean scores for each row are equal in each stratum. d.There is no association between the row and column variables in any stratum. Correct answer: a. 103
104
Chapter 1: Associations 1.1 Introduction to Categorical Data 1.2 Examining Associations among Variables 1.3 Correspondence Analysis 1.4 Recursive Partitioning 104
105
Objectives Explain how correspondence analysis can help you find associations. Perform a simple correspondence analysis. Interpret a correspondence plot. 105
106
What Is Correspondence Analysis? Correspondence analysis is a data analysis technique that enables you to –display the associations between the levels of two or more categorical variables graphically –extract information from a frequency table with many levels for the rows and columns. 106
107
Row and Column Profiles Row and column percentages are used to obtain row and column profiles. [Table: a frequency table with rows 1–3 and columns A–C that lists Row % and Column % in each cell; Row % gives the row profile and Column % gives the column profile.] 107
108
Example Data collected for these two categorical variables: –Mental health status ( well, mild symptom formation, moderate symptom formation, or impaired ) –Parent socioeconomic status ( A through F ) Is there an association? Which levels of each variable are associated? 108
109
Rows A and B have similar profiles. Their points are close together and fall away from the origin in the same direction. The profile for Row F is different. Its point falls away from the origin in a different direction. Correspondence Plot 109
110
Rows A and B and Column Well fall in approximately the same direction from the origin, and are relatively close to one another. Association 110
112
1.10 Multiple Answer Poll In correspondence analysis, which of the following are true? (Choose all answers that apply.) a.Row points that fall far from each other but in the same direction away from the origin indicate that they have similar profiles. b.Column points that fall close together and in the same direction away from the origin indicate that they have similar profiles. c.Row and column points that fall in the same direction away from the origin indicate that they have an association. 112
113
1.10 Multiple Answer Poll – Correct Answers In correspondence analysis, which of the following are true? (Choose all answers that apply.) a.Row points that fall far from each other but in the same direction away from the origin indicate that they have similar profiles. b.Column points that fall close together and in the same direction away from the origin indicate that they have similar profiles. c.Row and column points that fall in the same direction away from the origin indicate that they have an association. 113
114
Sample Data Set 114
115
Analysis Approaches You want to perform an analysis that takes into account the three variables Movie, Age, and Gender. There are several approaches. Analyze a two-way table where the columns correspond to the levels of Movie and the rows correspond to combinations of the levels of Age and Gender. Treat Gender as a stratification variable and analyze males and females separately. 115
116
116 This demonstration illustrates the concepts discussed previously. Correspondence Analysis
118
118 Exercise This exercise reinforces the concepts discussed previously.
120
1.11 Quiz Ice cream brands A through D are tested by a panel, and rated from 1 through 9 (with 9 as the best score). What can you conclude from the Correspondence Analysis? 120
121
1.11 Quiz – Correct Answer Ice cream brands A through D are tested by a panel, and rated from 1 through 9 (with 9 as the best score). What can you conclude from the Correspondence Analysis? Answers will vary. 121
122
Chapter 1: Associations 1.1 Introduction to Categorical Data 1.2 Examining Associations among Variables 1.3 Correspondence Analysis 1.4 Recursive Partitioning 122
123
Objectives Define recursive partitioning. Understand the splitting criteria used in JMP. Review algorithm parameters available in JMP. Use the Partition platform in JMP. 123
124
Recursive Partitioning Recursive partitioning refers to segmenting the data into groups that are as homogeneous as possible with respect to the dependent variable (Y) and maximizing the difference in the response of the groups. Successive splits produce a structure of rules and groups known as a decision tree, a model of the data. –Splits are binary. –The reverse of splitting is pruning. The tree helps interpret the associations in the data. 124
125
Split into New Groups What factors determine the country from which cars are purchased? [Figure: the root node for Country (n=303) is split on size into size(Large) and size(Medium, Small); the two new groups have n=42 and n=261.] 125
126
Model Metrics R² represents the amount of uncertainty in the data that has been accounted for by the explanatory variables. –Larger R² is better. Akaike’s Information Criterion (AICc) measures the decrease in the uncertainty but adds a penalty for excessive splitting. –Smaller AICc is better. 126
127
Splitting Metrics Candidate G² measures the change in the entropy. –Larger G² values are better. Candidate LogWorth is the negative log of the p-value for the likelihood ratio chi-square. –Larger LogWorth values are better. –Monte Carlo simulation adjusts the p-value. The criterion for the best split is LogWorth. 127
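The two splitting metrics can be illustrated for a single candidate binary split of a two-level response. The sketch below computes the likelihood ratio chi-square G² = 2 Σ O ln(O / E) and converts an unadjusted p-value to LogWorth = −log10(p). The cell counts are hypothetical, the function names are illustrative, and the adjustment of the p-value mentioned above is not reproduced, so the result would not match software output.

```python
from math import erfc, log, log10, sqrt


def g_squared(table):
    """Likelihood ratio chi-square: 2 * sum(O * ln(O / E)) over non-empty cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / n
                total += observed * log(observed / expected)
    return 2 * total


def chi_square_p_value_df1(stat):
    """Upper-tail probability of a chi-square with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))


def logworth(p_value):
    """LogWorth is the negative base-10 log of the p-value."""
    return -log10(p_value)


# Hypothetical counts. Rows: the two sides of a candidate split.
# Columns: the two levels of the response.
candidate = [[30, 12], [60, 201]]
g2 = g_squared(candidate)
p = chi_square_p_value_df1(g2)       # valid for a 2x2 table only (df = 1)
print(f"G^2 = {g2:.2f}, unadjusted LogWorth = {logworth(p):.2f}")
```

Comparing the LogWorth of every candidate split, across all cutting points and all explanatory variables, and keeping the largest is the selection rule that the next slides walk through.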
128
Partition Algorithm: Calculate Split Metric [Plot: LogWorth versus size.] 128
129
Partition Algorithm: Find Best Cutting Point [Plot: LogWorth versus size, with the best split marked.] 129
130
Partition Algorithm: Calculate for Other Variables [Plot: LogWorth versus type.] 130
131
Partition Algorithm: Compare the Best Splits [Plot: the best split for type compared with the best split for size.] 131
132
Partition Algorithm: Partition with Best Split 132
133
Partition Algorithm: Repeat within Partitions 133
134
Under-fitting and Over-fitting Under-fitting is a situation where too few splits are used and prediction suffers. –The uncertainty could be reduced further. Over-fitting is a situation where too many splits are used and prediction also suffers. –The model incorporates features of random noise in the data, which will not be repeated again. Both problems adversely affect model predictions. 134
135
Crossvalidation Crossvalidation attempts to find the optimum number of splits. The sample data are divided into groups. One group is designated as the hold-out set. –It is not used to train (fit) the model (tree). –It is used for predictions (as if it were future cases). The other group is used to train the model. JMP offers two methods of crossvalidation. –K-fold crossvalidation –Excluded rows 135
136
K -fold Crossvalidation Divide the data into k groups. Designate one group as the hold-out set. Designate the other groups for making the tree. Rotate the roles of the training groups and the hold-out set until all groups have been held out once. Combine the statistics of the hold-out sets. 136
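The rotation of hold-out and training groups can be sketched without any modeling code. The generator below is a minimal illustration, not the JMP implementation; the function name, the random assignment of rows to groups, and the seed parameter are assumptions.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (training_rows, holdout_rows) once for each of the k groups."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)            # random assignment to groups
    folds = [rows[i::k] for i in range(k)]       # k groups of nearly equal size
    for held_out, holdout_rows in enumerate(folds):
        training_rows = [
            row
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for row in fold
        ]
        yield training_rows, holdout_rows


splits = list(k_fold_indices(n_rows=23, k=5))
for training_rows, holdout_rows in splits:
    # Fit the tree on training_rows and score it on holdout_rows here.
    print(len(training_rows), len(holdout_rows))
```

Each row appears in exactly one hold-out set, so the combined hold-out statistics cover the whole sample once.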
137
Evaluate Crossvalidation Specify the number of groups, k. –The default is 5 groups. The -2LogLikelihood measures the decrease in the uncertainty from the overall probabilities. K-fold crossvalidation leads to over-fitting. 137
138
Crossvalidation by Excluded Rows A portion of the sample is randomly selected. Exclude these rows to make the hold-out set. The other rows are used to make the tree. There is no universal rule for the size of the portion for the hold-out set. –25% to 50% 138
139
Stopping Rule You can avoid repeatedly clicking the Split button by clicking the Go button that appears when crossvalidation is used. The Partition platform continues to split until the R² value for the validation data is better than what the next 10 splits would obtain. The R² for the training and the validation data is presented in a run chart in the Split History report. 139
140
Akaike’s Information Criterion It is a popular and rigorous criterion for comparing models. It is based on the likelihood of the data under the current model (partition). It includes a penalty for over-fitting. It includes a correction for small samples. Smaller values suggest better models. [Formula with the penalty term and the correction term labeled.] 140
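For reference, the widely used textbook form of the corrected criterion is shown below. The symbols are assumptions rather than notation from the slides: L is the maximized likelihood, k is the number of estimated parameters, and n is the number of observations. How the parameters of a tree are counted is not stated in the source.

```latex
\mathrm{AIC_c}
  = \underbrace{-2 \ln L}_{\text{fit}}
  + \underbrace{2k}_{\text{penalty}}
  + \underbrace{\frac{2k(k+1)}{n - k - 1}}_{\text{small-sample correction}}
```

The correction term shrinks toward zero as n grows relative to k, so the corrected and uncorrected criteria agree for large samples.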
141
Special Cases Limit the splitting by specifying the smallest group size. –Default minimum size is 5 cases. Outliers form their own nodes and do not interfere with the rest of the tree. Linear relationships with continuous explanatory variables might require very many splits to adequately model the effects. 141
142
Missing Data A missing response causes the entire case to be excluded unless you enable the Missing Value Categories option when launching Partition. –A new response level is added for missing values. A missing categorical explanatory variable is imputed (random selection of other levels) or a new category is created for missing values. A missing continuous explanatory variable is randomly assigned to one of the two splits. 142
143
Evaluate Model: ROC Curve The receiver operating characteristic curve (ROC) evaluates the ability of the model to distinguish the levels of the response. It is based on the sensitivity (true positive rate) and the 1-specificity (false positive rate). 143
144
Sensitivity The sensitivity is the probability or rate of a true positive prediction of the given level. For this example, if the model predicts Survived= no for 992 cases out of 1004 cases where it is true, then the sensitivity is 0.988 or 98.8%. The sensitivity should be near 1. 144
145
Specificity The specificity is the probability or rate of a true negative prediction of the given level. For this example, if the model does not predict Survived= no for 184 cases out of 494 cases where it is not true, then the specificity is 0.37 or 37%. 1 – specificity, or the false positive rate, should ideally be near 0. 145
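The two rates can be checked from the counts quoted in the examples. This is a minimal sketch that treats Survived = no as the positive level; the function names are illustrative.

```python
def sensitivity(true_positives, false_negatives):
    """True positive rate: correct positive predictions / all actual positives."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """True negative rate: correct negative predictions / all actual negatives."""
    return true_negatives / (true_negatives + false_positives)


# Counts from the examples: 992 of 1004 actual "no" cases predicted as "no",
# and 184 of 494 other cases not predicted as "no".
sens = sensitivity(992, 1004 - 992)
spec = specificity(184, 494 - 184)
false_positive_rate = 1 - spec

print(f"sensitivity = {sens:.3f}, specificity = {spec:.2f}, "
      f"false positive rate = {false_positive_rate:.2f}")
```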
146
Evaluate Model: ROC Curve Rank order the fitted probabilities for the response. For each row, move up if the response is correct, move right if the response is wrong. 146
147
Area under the Curve The area under the ROC curve (AUC) measures the goodness of fit for the tree to the data. A general rule for interpretation of AUC:
Result              Discrimination
AUC = 0.5           None
0.7 < AUC < 0.8     Acceptable
0.8 < AUC < 0.9     Excellent
AUC > 0.9           Outstanding
147
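The move-up, move-right construction of the ROC curve and the area beneath it can be sketched as follows. The scores and labels are hypothetical, the function names are illustrative, and tied scores are not given special treatment in this sketch.

```python
def roc_points(scores, is_positive):
    """Trace the ROC curve: sort by fitted probability, step up or right per row."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for i in order:
        if is_positive[i]:
            true_pos += 1          # correct response: move up
        else:
            false_pos += 1         # wrong response: move right
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points


def area_under_curve(points):
    """Trapezoid rule over consecutive (false positive rate, sensitivity) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical fitted probabilities and actual responses for six rows.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
is_positive = [True, True, False, True, False, False]

points = roc_points(scores, is_positive)
auc = area_under_curve(points)
print(f"AUC = {auc:.3f}")
```

The resulting area equals the share of positive-negative pairs in which the positive row received the higher fitted probability, which is why 0.5 corresponds to no discrimination.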
148
Evaluate Model: Lift Curve Shows performance of tree predictions. Orders cases by predicted probability. Compares proportion of cases with one response level in a given portion to proportion of cases with this response overall. 148
149
Evaluate Model: Confusion Matrix The actual response is compared to the predicted response from the model in the confusion matrix. A model that predicts better than chance has more cases on the diagonal than off the diagonal. This example shows a no response that is predicted well and a yes response that is not predicted well. The confusion matrix is not useful for model selection when the marginal distribution is not near a probability of 0.5 for both levels. 149
150
150 This demonstration illustrates the concepts discussed previously. Recursive Partitioning
152
152 Exercise This exercise reinforces the concepts discussed previously.
154
1.12 Quiz In which leaf, and on what variable, will JMP next split? 154
155
1.12 Quiz – Correct Answer In which leaf, and on what variable, will JMP next split? Of the leaves, the highest LogWorth is for Age (.7313), in the Gender(Female) leaf. This is where JMP will next split. 155