Chapter 1: Stratified Data Analysis
1.1 Introduction
1.2 Examining Associations among Variables
1.3 Recursive Partitioning
1.4 Introduction to Logistic Regression
1.1 Introduction
Objectives
Recognize the differences between categorical and continuous data analysis.
Identify the scale of measurement for your response variable.

Categorical versus Continuous Data Analysis
Identifying the Scale of Measurement
Before analyzing, identify the measurement scale for each variable.
[Figure: a survey variable with the levels Agree, No Opinion, and Disagree.]
Nominal Variables
Variable: Type of Beverage
[Figure: beverage types.]

Ordinal Variables
Variable: Size of Beverage (Small, Medium, Large)

Continuous Variables
Variable: Volume of Beverage (for example, 4.0)
1.01 Quiz
A car dealer records several inventory variables, including Type (automatic or standard), Time (the number of seconds it takes for the car to go from 0 to 60 mph), and Model (basic, middle, or luxury). Match the modeling type on the left with the appropriate variable on the right.
1. Continuous    A. Type
2. Ordinal       B. Time
3. Nominal       C. Model
1.01 Quiz – Correct Answer
1-B, 2-C, 3-A (Time is continuous, Model is ordinal, and Type is nominal.)
What’s Next?
[Figure: the survey variable with the levels Agree, No Opinion, and Disagree is identified as ordinal.]
1.2 Examining Associations among Variables
Objectives
Examine the distribution of categorical variables.
Determine whether an association exists among categorical variables.
Perform a stratified analysis of categorical variables.

Sample Data Set

Demonstration: Examining Distributions
This demonstration illustrates the concepts discussed previously.
Association
An association exists between two variables if the distribution of one variable changes when the level (or value) of the other variable changes. If there is no association, the distribution of the first variable is the same, regardless of the level of the other variable.
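No Association
Is your manager’s mood associated with the weather?
[Figure: the distribution of mood is 72% versus 28% in each weather condition.]

Association
Is your manager’s mood associated with the weather?
[Figure: the distribution of mood is 82% versus 18% in one weather condition and 40% versus 60% in the other.]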
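Comparing row percentages is one way to see whether a distribution changes across levels. The sketch below is illustrative only: the counts are hypothetical (100 days per weather condition, chosen so that the percentages match the 82%/18% and 40%/60% example), and the labels `Sunny`, `Rainy`, `Good mood`, and `Bad mood` and the function name `row_percents` are assumptions, not taken from the course data.

```python
def row_percents(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for row_label, row in table.items():
        total = sum(row.values())
        result[row_label] = {col: 100.0 * n / total for col, n in row.items()}
    return result

# Hypothetical counts: manager's mood (columns) by weather (rows).
counts = {
    "Sunny": {"Good mood": 82, "Bad mood": 18},
    "Rainy": {"Good mood": 40, "Bad mood": 60},
}

percents = row_percents(counts)
for weather, distribution in percents.items():
    print(weather, distribution)
```

If the rows printed the same percentages, the distribution of mood would not depend on the weather, which is the no-association case.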
Demonstration: Recognizing Associations
This demonstration illustrates the concepts discussed previously.

1.02 Quiz
Is there an association between finishing a prescription (Rx) and experiencing a relapse?
1.02 Quiz – Correct Answer
Yes. The distribution of Yes/No for Did not finish Rx is different from the distribution of Yes/No for Finished Rx.
Tests for Association
Row percents of Income by Purchase

Income    $100 +    Under $100
Low       32%       68%
Medium    32%       68%
High      48%       52%
Null Hypothesis
There is no association between Income and Purchase. The probability of purchasing items of $100 or more is the same, regardless of income level.

Alternative Hypothesis
There is an association between Income and Purchase. The probability of purchasing items of $100 or more is not the same for Low, Medium, and High income customers.
Chi-Square Test
NO ASSOCIATION: observed frequencies = expected frequencies
ASSOCIATION: observed frequencies ≠ expected frequencies
p-Value for Chi-Square Test
This p-value is the probability of observing a chi-square statistic at least as large as the one actually observed, given that there is no association between the variables. Informally, it is the probability of the association you observe in the data occurring by chance.
Chi-Square Tests
Chi-square tests and the corresponding p-values
determine whether an association exists
do not measure the strength of an association
depend on and reflect the sample size.
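The Pearson chi-square statistic compares observed counts with the counts expected under no association. The following is a minimal sketch, not the JMP implementation: the counts are hypothetical (100 customers per income level, chosen only so that the row percents match the Income by Purchase table), and `chi_square` is an illustrative name. The p-value uses the closed form exp(-x/2), which holds only for 2 degrees of freedom.

```python
import math

def chi_square(observed):
    """Return (statistic, degrees of freedom, expected counts) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under no association: row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

# Hypothetical counts: rows are Low, Medium, High income;
# columns are "$100 +" and "Under $100".
observed = [[32, 68], [32, 68], [48, 52]]
stat, df, expected = chi_square(observed)

# For df = 2 the chi-square upper-tail probability is exactly exp(-x / 2).
p_value = math.exp(-stat / 2) if df == 2 else None
print(round(stat, 3), df, p_value)
```

Multiplying every count by the same factor multiplies the statistic by that factor, which is one way to see that the test reflects sample size rather than strength of association.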
Demonstration: Chi-Square Test
This demonstration illustrates the concepts discussed previously.

1.03 Quiz
Is there sufficient evidence that an association exists between Relapsed and Rx Status?
1.03 Quiz – Correct Answer
Yes, there is sufficient evidence that an association exists between Relapsed and Rx Status. The p-value for the Pearson chi-square statistic is .0005, so at alpha = .05, there is sufficient evidence to reject the null hypothesis (that no association exists) in favor of the alternative (that an association exists).
When Not to Use the Chi-Square Test
When more than 20% of the cells have expected counts less than five.
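The 20% rule can be checked directly from the expected counts. This is a small sketch under stated assumptions: both tables are hypothetical, and the function names `expected_counts` and `too_many_small_cells` are illustrative.

```python
def expected_counts(observed):
    """Expected cell counts under no association."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def too_many_small_cells(observed, min_expected=5, max_share=0.20):
    """True when more than max_share of the cells have expected counts below min_expected."""
    cells = [e for row in expected_counts(observed) for e in row]
    share_small = sum(e < min_expected for e in cells) / len(cells)
    return share_small > max_share

small_table = [[3, 1], [1, 4]]                 # hypothetical, 9 observations
large_table = [[32, 68], [32, 68], [48, 52]]   # hypothetical, 300 observations

print(too_many_small_cells(small_table))
print(too_many_small_cells(large_table))
```

When the check returns True, an exact test is the usual alternative to the chi-square approximation.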
Observed versus Expected Values
[Figure: a table of observed values beside a table of expected values.]

Small Samples – Fisher’s Exact Test
[Figure: Fisher’s exact test placed on a sample-size scale running from Small to Large.]
Example: Tea and Milk
Suppose you want to test whether someone can determine if a cup of tea with milk had the milk poured first or the tea poured first.

Fisher’s Exact Test Example
9 Cups of Tea: 4 with Milk First and 5 with Tea First
Predict which cups had tea poured first.
[Figure: 2x2 table of Actual (M, T) by Guess (M, T) with fixed marginal totals.]

Basis for Fisher’s Exact Test
The row and column totals are fixed.
[Figure: the other possible 2x2 tables of Actual by Guess with the same totals.]
Fisher’s Exact Test Hypotheses
Null Hypothesis: There is no association.
Alternative Hypothesis: There is an association (two-tailed, left-tailed, or right-tailed).
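Left-Tailed Alternative Hypothesis
[Figure: the Actual by Guess tables that contribute to the left-tailed p-value.]

Right-Tailed Alternative Hypothesis
[Figure: the Actual by Guess tables that contribute to the right-tailed p-value.]

Two-Tailed Alternative Hypothesis
[Figure: the Actual by Guess tables that contribute to the two-tailed p-value.]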
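With the margins fixed, the count in one cell follows a hypergeometric distribution, and each p-value is a sum of table probabilities. The sketch below is a minimal illustration for the nine-cup setup: the observed table (3 of the 4 milk-first cups identified correctly) is hypothetical, `fisher_exact` is an illustrative name, and the two-tailed rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention that may differ from what a given package reports.

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Left-, right-, and two-tailed p-values for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    probs = {k: prob(k) for k in range(lowest, highest + 1)}
    p_obs = probs[a]
    left = sum(p for k, p in probs.items() if k <= a)
    right = sum(p for k, p in probs.items() if k >= a)
    # Small relative tolerance guards against floating-point ties
    two = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return left, right, two

# Hypothetical result. Rows: actual M, actual T. Columns: guess M, guess T.
left_p, right_p, two_p = fisher_exact(3, 1, 1, 4)
print(left_p, right_p, two_p)
```

Here the top-left cell counts correctly identified milk-first cups, so the right tail corresponds to guessing better than chance.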
Demonstration: Fisher’s Exact Test
This demonstration illustrates the concepts discussed previously.

1.04 Quiz
What can you conclude from each of the p-values from the Fisher’s Exact Test for the association between Relapsed and Rx Status?
1.04 Quiz – Correct Answer
The Left p-value = .0007, so there is sufficient evidence to conclude that the probability of a relapse is greater for those who did not finish the Rx than for those who did.
The Right p-value = .9999, so there is not sufficient evidence to conclude that the probability of a relapse is greater for those who finished the Rx than for those who did not.
The 2-Tail p-value = .0008, so there is sufficient evidence to conclude that the probability of a relapse is different depending on whether an Rx was finished or not.
What Happens If There Is a Third Variable?
[Figure: diagram relating Income, Gender, and the $100 purchase variable.]
Stratified Data Analysis
Stratified data analysis is the process of dividing subjects into groups defined by the levels of a third variable. Use this analysis when you want to examine the association between two variables within the levels of a third variable.

Stratified Data Analysis
Of the 39 single people, 23% have lung cancer and 77% do not. Of the 36 married people, 17% have lung cancer and 83% do not.

Stratified Data Analysis
Of the 28 single smokers, 28% have lung cancer and 72% do not. Of the 14 married smokers, 28% have lung cancer and 72% do not.

Cochran-Mantel-Haenszel Statistics

CMH versus Chi-Square
1. Correlation of Scores
Tests for linear association. [Figure: table of A by B.]

2. Row Scores by Column Categories
Tests for equal row scores. [Figure: table of A by B.]

3. Column Scores by Row Categories
Tests for equal column scores. [Figure: table of A by B.]

4. General Association of Categories
Tests for general association. [Figure: table of A by B.]
CMH Statistics and 2x2 Tables
For 2x2 tables, the CMH statistics are all equal.
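For a set of 2x2 tables, the CMH statistic pools the difference between an observed cell count and its expected value across strata. The sketch below is a minimal version without a continuity correction; `cmh_2x2` is an illustrative name, and the counts are hypothetical, chosen to be roughly consistent with the rounded lung cancer percentages quoted earlier (smokers and non-smokers as strata, marital status by lung cancer within each).

```python
def cmh_2x2(strata):
    """CMH statistic for a list of 2x2 tables [[a, b], [c, d]], one per stratum."""
    diff_sum = 0.0
    var_sum = 0.0
    for (a, b), (c, d) in strata:
        n1, n2 = a + b, c + d          # row totals
        m1, m2 = a + c, b + d          # column totals
        n = n1 + n2
        diff_sum += a - n1 * m1 / n    # observed minus expected for the top-left cell
        var_sum += n1 * n2 * m1 * m2 / (n ** 2 * (n - 1))
    return diff_sum ** 2 / var_sum

# Hypothetical counts. Rows: single, married. Columns: lung cancer, no lung cancer.
smokers = [[8, 20], [4, 10]]
non_smokers = [[1, 10], [2, 20]]

cmh = cmh_2x2([smokers, non_smokers])
print(cmh)
```

With these counts the pooled table shows 9 of 39 single people and 6 of 36 married people with lung cancer, yet the stratified statistic is zero because the rates are identical within each stratum.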
When Do CMH Statistics Lack Power?
When the response pattern is reversed between strata.
Demonstration: CMH Tests
This demonstration illustrates the concepts discussed previously.

Exercise
This exercise reinforces the concepts discussed previously.
1.05 Multiple Choice Poll
The Correlation of Scores CMH test has which null hypothesis?
a. There is no linear association between the row and column variables in any stratum.
b. The mean scores for each column are equal in each stratum.
c. The mean scores for each row are equal in each stratum.
d. There is no association between the row and column variables in any stratum.
1.05 Multiple Choice Poll – Correct Answer
a. There is no linear association between the row and column variables in any stratum.
1.3 Recursive Partitioning
Objectives
Define partitioning.
Understand the splitting criteria used in JMP.
Review algorithm parameters available in JMP.
Use the Partition platform in JMP.

Recursive Partitioning
Partitioning refers to segmenting the data into groups that are as homogeneous as possible with respect to the dependent variable (Y).
Divide and Conquer
What factors affect the country from which cars are purchased?
[Figure: tree in which the root node, Country (n = 303), is split on size into size(Large) and size(Medium, Small) leaves, with n = 42 and n = 261.]
Tree Algorithm: Calculate Separation of the Response
[Figure: separation of the response for candidate splits on X1.]

Tree Algorithm: Find Best Split for the Independent Variable
[Figure: the best split on X1.]

Tree Algorithm: Repeat for the Other Independent Variables
[Figure: separation of means for candidate splits on X2.]

Tree Algorithm: Compare the Best Splits
[Figure: the best split on X2 compared with the best split on X1.]

Tree Algorithm: Partition with Best Split

Tree Algorithm: Repeat within Partitions
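The search for a best split on one predictor can be sketched as follows. This is an illustration under stated assumptions, not JMP's implementation: it handles a binary response and a single numeric predictor, scores each cut point with the likelihood-ratio chi-square (G^2), and reports -log10 of the unadjusted 1-df p-value as a stand-in for LogWorth (JMP's LogWorth is based on an adjusted p-value). The data and the names `g2_2x2` and `best_split` are hypothetical.

```python
import math

def g2_2x2(table):
    """Likelihood-ratio chi-square (G^2) for a 2x2 table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = table[i][j]
            if o > 0:  # cells with zero count contribute nothing
                g2 += 2 * o * math.log(o * n / (rows[i] * cols[j]))
    return max(g2, 0.0)

def best_split(x, y):
    """Try every cut point on x; return (cut, logworth) for the best split of binary y."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        table = [[left.count(0), left.count(1)],
                 [right.count(0), right.count(1)]]
        # Upper-tail probability of a chi-square with 1 degree of freedom
        p = max(math.erfc(math.sqrt(g2_2x2(table) / 2)), 1e-300)
        logworth = -math.log10(p)
        if best is None or logworth > best[1]:
            best = (cut, logworth)
    return best

# Hypothetical data: the response switches from 0 to 1 between x = 4 and x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]
cut, logworth = best_split(x, y)
print(cut, round(logworth, 2))
```

Running the same search on every predictor, splitting on the overall winner, and then repeating inside each resulting partition gives the recursive procedure outlined on the slides.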
Demonstration: Recursive Partitioning
This demonstration illustrates the concepts discussed previously.

Exercise
This exercise reinforces the concepts discussed previously.

1.06 Quiz
In which leaf, and on what variable, will JMP next split?
1.06 Quiz – Correct Answer
Across the leaves, the highest LogWorth is for Age (.7313), in the Gender(Female) leaf. This is where JMP will next split.
1.4 Introduction to Logistic Regression
Objectives
Explain the concepts of logistic regression.
Fit a logistic regression model using JMP software.
Examine logistic regression output.
Overview

Response       Predictor                    Analysis
Continuous     Categorical or Continuous    Linear Regression Analysis
Categorical    Categorical or Continuous    Logistic Regression Analysis
Types of Logistic Regression

Response Variable                     Type of Logistic Regression
Two categories                        Binary
Three or more categories, nominal     Nominal
Three or more categories, ordinal     Ordinal
What Does Logistic Regression Do?
The logistic regression model uses the predictor variables, which can be categorical or continuous, to predict the probability of specific outcomes. In other words, logistic regression is designed to describe probabilities associated with the values of the response variable.

The Logistic Curve
The relationship between the probability of a response variable and a predictor variable might be an S-shaped curve. Linear regression cannot model this relationship, but logistic regression can.
Logistic Regression Curves
This graph shows the relationship between the probability of Sale and Price.
Logit Transformation
$$\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right)$$
where
i indexes all cases (observations)
p_i is the probability that the event (a sale, for example) occurs in the i-th case
1 - p_i is the probability that the event (a sale, for example) does not occur in the i-th case
log is the natural log (to the base e).
Assumption
[Figure: p_i plotted against the predictor, and the logit transform plotted against the predictor.]
Logistic Regression Model
$$\operatorname{logit}(p_i) = B_0 + B_1 X_1$$
where
logit(p_i) is the logit transformation of the probability of the event
B_0 is the intercept of the regression line
B_1 is the slope of the regression line.
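The logit and its inverse can be written in a few lines, which also shows how a linear predictor is turned back into a probability. This is a minimal sketch: the coefficient values used in the example call are hypothetical, and the function names are illustrative.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse of the logit: maps any real number to a probability."""
    return 1 / (1 + math.exp(-x))

def predicted_probability(b0, b1, x):
    """Probability implied by logit(p) = b0 + b1 * x."""
    return inv_logit(b0 + b1 * x)

# Hypothetical coefficients, for illustration only.
b0, b1 = -2.0, 0.5
for x in (0, 4, 8):
    print(x, round(predicted_probability(b0, b1, x), 3))
```

Plotting `predicted_probability` over a range of x values traces the S-shaped logistic curve, while the logit of those probabilities is a straight line in x.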
Likelihood Function
A likelihood function expresses the probability of the observed data as a function of the unknown parameters. The goal is to derive values of the parameters such that the probability of the observed data is as large as possible.
Maximum Likelihood Estimate
[Figure: the log-likelihood function.]

Model Inference
[Figure: the log-likelihood function with the values LogL0 and LogL1 marked.]

Logistic Curve
[Figure: logistic curves for a weak relationship, a strong relationship, and a very strong relationship.]
Example of Binary Logistic Regression Model
You want to predict the probability of defaulting on credit card payments based on having or not having a history of late payments. You can postulate this model:
logit(Probability of Defaulting) = B_0 + B_1 * (Late Payment)
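The log-likelihood comparison behind model inference can be illustrated with this default example. The sketch assumes that the LogL0 and LogL1 comparison refers to a likelihood ratio test between an intercept-only model and the model with Late Payment; the counts are the ones from the loan default table later in this section (20 defaults and 60 non-defaults with late payments, 10 and 90 without). Because the predictor has only two levels, the maximum likelihood fit reproduces the observed group proportions, so no iterative fitting is needed here.

```python
import math

# (defaults, non-defaults) for each group
groups = {"late": (20, 60), "not_late": (10, 90)}

def log_likelihood(probs):
    """Bernoulli log-likelihood of the data given a default probability per group."""
    return sum(
        yes * math.log(probs[g]) + no * math.log(1 - probs[g])
        for g, (yes, no) in groups.items()
    )

total_yes = sum(yes for yes, _ in groups.values())
total_n = sum(yes + no for yes, no in groups.values())

# Intercept-only model: one common default probability for everyone
logl0 = log_likelihood({g: total_yes / total_n for g in groups})

# Model with Late Payment: each group gets its own fitted probability
logl1 = log_likelihood({g: yes / (yes + no) for g, (yes, no) in groups.items()})

# Likelihood ratio chi-square with 1 degree of freedom
lr = 2 * (logl1 - logl0)
p_value = math.erfc(math.sqrt(lr / 2))
print(round(logl0, 3), round(logl1, 3), round(lr, 3), round(p_value, 4))
```

A larger gap between the two log-likelihoods means the predictor explains more about the response than the intercept alone.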
Demonstration: Binary Logistic Regression
This demonstration illustrates the concepts discussed previously.

1.07 Quiz
You want to predict the probability of a defect, given the width of a product. What kind of association exists between Defect and Width: a strong relationship or a weak relationship?
1.07 Quiz – Correct Answer
Weak. The fitted logistic curve is nearly flat, indicating a weak association between Defect and Width.
Multiple Logistic Regression

Interaction

Demonstration: Multiple Logistic Regression
This demonstration illustrates the concepts discussed previously.
What Is an Odds Ratio?
An odds ratio indicates how much more likely, with respect to odds, a certain event occurs in one group relative to its occurrence in another group.
Example: How much more likely are females to purchase 100 dollars or more in items compared to males?
Example: How much more likely is a person with a history of late payments on credit cards to default on a loan relative to a person who does not have a history of late payments?
Probability of Outcome

                              Default on Loan
                              Yes    No     Total
Late Payments (Group A)       20     60     80
No Late Payments (Group B)    10     90     100
Total                         30     150    180

Probability of defaulting = 20/80 (.25) in Group A
Probability of not defaulting = 60/80 (.75) in Group A
Odds
Odds of Outcome in Group A
= (probability of defaulting in group with history of late payments) ÷ (probability of not defaulting in group with history of late payments)
= 0.25 ÷ 0.75 = 0.33

Odds Ratio
Odds Ratio of Group A to Group B
= (odds of defaulting in group with history of late payments) ÷ (odds of defaulting in group with no history of late payments)
= 0.33 ÷ 0.11 = 3
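The arithmetic for the odds and the odds ratio can be reproduced directly from the loan default counts. This is a small sketch; the variable and function names are illustrative.

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

# Counts from the loan default table
p_default_a = 20 / (20 + 60)   # Group A: history of late payments
p_default_b = 10 / (10 + 90)   # Group B: no history of late payments

odds_a = odds(p_default_a)     # 0.25 / 0.75
odds_b = odds(p_default_b)     # 0.10 / 0.90
odds_ratio = odds_a / odds_b

print(round(odds_a, 2), round(odds_b, 2), round(odds_ratio, 2))
```

An odds ratio of 1 would mean the odds of defaulting are the same in both groups; values above 1 mean higher odds in Group A.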
Properties of the Odds Ratio
Odds Ratio from a Logistic Regression Model
For a predictor variable that has only two levels, you can exponentiate twice the parameter estimate that JMP provides to obtain the odds ratio. (The factor of 2 arises when the two levels are coded +1 and -1, so that the levels are two units apart.)
Estimated odds ratio = exp(2 * parameter estimate)
Example: What are the odds that a female purchases 100 dollars or more in items compared to a male?
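The exp(2 * parameter estimate) rule can be checked numerically. The sketch below assumes the two-level predictor is coded +1 for one level and -1 for the other, which is my understanding of how JMP parameterizes nominal effects; it reuses the loan default counts (20/60 with late payments, 10/90 without) rather than the purchase data, and solves for the coefficients directly because a two-level model reproduces the observed group logits.

```python
import math

# Observed log odds of defaulting in each group
logit_late = math.log(20 / 60)       # history of late payments, coded +1 (assumed)
logit_not_late = math.log(10 / 90)   # no history of late payments, coded -1 (assumed)

# With +1/-1 coding:
#   logit_late     = b0 + b1
#   logit_not_late = b0 - b1
b0 = (logit_late + logit_not_late) / 2
b1 = (logit_late - logit_not_late) / 2

odds_ratio = math.exp(2 * b1)
print(round(b1, 4), round(odds_ratio, 4))
```

With 0/1 coding instead, the slope itself would be the log odds ratio and the factor of 2 would not be needed.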
Demonstration: Odds Ratios
This demonstration illustrates the concepts discussed previously.

Exercise
This exercise reinforces the concepts discussed previously.
1.08 Multiple Choice Poll
Suppose processes A and B are used to make a product, and each product is evaluated as defective or non-defective. Suppose the probability of a defective from A is .2 and of a non-defective from A is .8. Which is true?
a. The odds of a defective from group A is given by .8/.2 = 4.
b. The odds of a defective from group A is given by .2/.8 = .25.
1.08 Multiple Choice Poll – Correct Answer
b. The odds of a defective from group A is given by .2/.8 = .25.
1.09 Multiple Choice Poll
The odds of getting a defective product from process A is .25. What is its interpretation?
a. You expect only 1/4 as many defectives as non-defectives from process A.
b. You expect only 1/4 as many defectives as non-defectives from process B.
1.09 Multiple Choice Poll – Correct Answer
a. You expect only 1/4 as many defectives as non-defectives from process A.