Published by Alexander McDaniel. Modified over 9 years ago.
1
Chapter 1: Stratified Data Analysis
1.1 Introduction
1.2 Examining Associations among Variables
1.3 Recursive Partitioning
1.4 Introduction to Logistic Regression
2
1.1 Introduction
3
Objectives
Recognize the differences between categorical and continuous data analysis.
Identify the scale of measurement for your response variable.
4
Categorical versus Continuous Data Analysis
5
Identifying the Scale of Measurement
Before analyzing, select the measurement scale for each variable.
[Figure: a variable with the levels Agree, No Opinion, and Disagree]
6
Nominal Variables
Variable: Type of Beverage
[Figure: beverage types coded 1, 2, 3]
7
Ordinal Variables
Variable: Size of Beverage
Levels: Small, Medium, Large
8
Continuous Variables
Variable: Volume of Beverage
[Figure: numeric scale from 0 to 4.0]
10
1.01 Quiz
A car dealer records several inventory variables, including Type (automatic or standard), Time (the number of seconds it takes for the car to go from 0 to 60 mph), and Model (basic, middle, or luxury). Match the modeling type on the left with the appropriate component on the right.
1. Continuous    A. Type
2. Ordinal       B. Time
3. Nominal       C. Model
11
1.01 Quiz – Correct Answer
1-B, 2-C, 3-A: Time is continuous, Model is ordinal, and Type is nominal.
12
What’s Next?
[Figure: the variable opinion, with levels Agree, No Opinion, and Disagree, is identified as ordinal]
14
1.2 Examining Associations among Variables
15
Objectives
Examine the distribution of categorical variables.
Determine whether an association exists among categorical variables.
Perform a stratified analysis of categorical variables.
16
Sample Data Set
17
Examining Distributions
This demonstration illustrates the concepts discussed previously.
18
Association
An association exists between two variables if the distribution of one variable changes when the level (or value) of the other variable changes. If there is no association, the distribution of the first variable is the same regardless of the level of the other variable.
19
No Association
Is your manager’s mood associated with the weather?
[Figure: the mood distribution is the same, 72% versus 28%, whatever the weather]
20
Association
Is your manager’s mood associated with the weather?
[Figure: the mood distribution is 82% versus 18% under one weather condition and 40% versus 60% under the other]
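The definition of association can be checked numerically by comparing row percentages. The sketch below is only an illustration: the counts are hypothetical, chosen so that the row percentages match the 72%/28% and 82%/18% versus 40%/60% figures on the slides, and the function names are made up for this example.

```python
def row_percents(table):
    """Return each row of counts as percentages of that row's total."""
    return [[100.0 * count / sum(row) for count in row] for row in table]

def rows_match(table, tolerance=1e-9):
    """True when every row has the same percentage distribution."""
    percents = row_percents(table)
    first = percents[0]
    return all(
        abs(a - b) <= tolerance
        for row in percents[1:]
        for a, b in zip(first, row)
    )

# Hypothetical counts (not from the slides): rows are weather conditions,
# columns are the manager's mood.
no_association = [[36, 14], [72, 28]]   # 72% / 28% in both rows
association = [[41, 9], [20, 30]]       # 82% / 18% versus 40% / 60%

print(row_percents(no_association), rows_match(no_association))
print(row_percents(association), rows_match(association))
```

Identical row distributions correspond to no association. In real samples the rows rarely match exactly, which is why a formal test is needed.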
21
Recognizing Associations
This demonstration illustrates the concepts discussed previously.
23
1.02 Quiz
Is there an association between finishing a prescription (Rx) and experiencing a relapse?
24
1.02 Quiz – Correct Answer
Yes. The distribution of Yes/No for Did not finish Rx is different from the distribution of Yes/No for Finished Rx.
25
Tests for Association
Row percents of Income by Purchase

Income    $100 +    Under $100
Low       32%       68%
Medium    32%       68%
High      48%       52%
26
Null Hypothesis
There is no association between Income and Purchase. The probability of purchasing items of $100 or more is the same, regardless of income level.
27
Alternative Hypothesis
There is an association between Income and Purchase. The probability of purchasing items of $100 or more differs among Low, Medium, and High income customers.
28
Chi-Square Test
NO ASSOCIATION: observed frequencies = expected frequencies
ASSOCIATION: observed frequencies ≠ expected frequencies
29
p-Value for Chi-Square Test
This p-value is the probability of observing a chi-square statistic at least as large as the one actually observed, given that there is no association between the variables. Informally, it is the probability that an association as strong as the one in the data would occur by chance alone.
30
Chi-Square Tests
Chi-square tests and the corresponding p-values determine whether an association exists, do not measure the strength of an association, and depend on and reflect the sample size.
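The dependence on sample size can be seen in a small calculation. The following is a sketch, not the course's own demonstration: the counts are hypothetical (100 customers per income level, chosen to match the 32%/68%, 32%/68%, 48%/52% row percents), and the p-value shortcut exp(-x/2) applies only because a 3 x 2 table has 2 degrees of freedom.

```python
import math

def pearson_chi_square(observed):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    return statistic

def p_value_df2(statistic):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)

# Hypothetical counts: rows are Low, Medium, High income;
# columns are purchases of $100 or more and under $100.
base = [[32, 68], [32, 68], [48, 52]]
larger = [[10 * count for count in row] for row in base]  # same percents, 10x the sample

for name, table in (("base", base), ("larger", larger)):
    chi2 = pearson_chi_square(table)
    print(name, round(chi2, 3), p_value_df2(chi2))
```

Both tables have the same row percents, so the strength of the association is identical, yet the statistic grows in proportion to the sample size and the p-value shrinks accordingly.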
31
Chi-Square Test
This demonstration illustrates the concepts discussed previously.
33
1.03 Quiz
Is there sufficient evidence that an association exists between Relapsed and Rx Status?
34
1.03 Quiz – Correct Answer
Yes, there is sufficient evidence that an association exists between Relapsed and Rx Status. The p-value for the Pearson chi-square statistic is 0.0005, so at alpha = 0.05 there is sufficient evidence to reject the null hypothesis (that no association exists) in favor of the alternative (that an association exists).
35
When Not to Use the Chi-Square Test
When more than 20% of the cells have expected counts less than five.
36
Observed versus Expected Values

Observed Values
1    5    8
5    6    7
6    5    6

Expected Values
3.43    4.57    6.00
4.41    5.88    7.71
4.16    5.55    7.29
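The expected values follow from the row and column totals of the observed table. The sketch below reads the observed counts as 1 5 8 / 5 6 7 / 6 5 6, a reading that reproduces the expected values shown on the slide, and then applies the 20% rule of thumb. The function name is made up for this example.

```python
def expected_counts(observed):
    """Expected count per cell under no association: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Observed counts as read from the slide.
observed = [[1, 5, 8], [5, 6, 7], [6, 5, 6]]
expected = expected_counts(observed)

# Share of cells whose expected count is below five.
cells = [value for row in expected for value in row]
share_small = sum(value < 5 for value in cells) / len(cells)

for row in expected:
    print([round(value, 2) for value in row])
print("share of cells with expected count < 5:", round(share_small, 2))
```

Four of the nine cells have expected counts below five, well over 20%, so this table is a case where the chi-square approximation is doubtful.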
37
Small Samples – Fisher’s Exact Test
[Figure: sample size scale from Small to Large, with Fisher’s exact test indicated for small samples]
38
Example: Tea and Milk
Suppose you want to test whether someone can determine if a cup of tea with milk had the milk poured first or the tea poured first.
39
Fisher’s Exact Test Example
9 Cups of Tea: 4 with Milk First and 5 with Tea First
Predict which cups had tea poured first.
[Table: Actual (M, T) by Guess (M, T), with fixed marginal totals of 4 for M and 5 for T on both rows and columns]
40
Basis for Fisher’s Exact Test
With the row and column totals fixed at 4 (M) and 5 (T), only five tables are possible. Each is written as the Actual M row / the Actual T row, with columns Guess M and Guess T:
0 4 / 4 1
1 3 / 3 2
2 2 / 2 3
3 1 / 1 4
4 0 / 0 5
41
Fisher’s Exact Test Hypotheses
Null Hypothesis: There is no association.
Alternative Hypothesis: There is an association (two-tailed, left-tailed, or right-tailed).
42
Left-Tailed Alternative Hypothesis
The left-tailed p-value covers these tables (Actual M row / Actual T row, columns Guess M and Guess T):
0 4 / 4 1
1 3 / 3 2
43
Right-Tailed Alternative Hypothesis
The right-tailed p-value covers these tables (Actual M row / Actual T row, columns Guess M and Guess T):
1 3 / 3 2
2 2 / 2 3
3 1 / 1 4
4 0 / 0 5
44
Two-Tailed Alternative Hypothesis
[Figure: all five possible tables (Actual M row / Actual T row), from 0 4 / 4 1 through 4 0 / 0 5, with the two-tailed p-value marked]
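The three p-values can be computed directly from the hypergeometric distribution of the possible tables. This is a sketch under stated assumptions: the observed table is taken to be the one with 1 correct milk-first guess (1 3 / 3 2), which is an interpretation of the tail slides rather than something they state, and the two-tailed value uses the common convention of summing all tables that are no more probable than the observed one.

```python
from math import comb

def table_probabilities(milk_first=4, tea_first=5, guessed_milk=4):
    """Probability of each count k of correct milk-first guesses,
    with all marginal totals held fixed (hypergeometric distribution)."""
    denominator = comb(milk_first + tea_first, guessed_milk)
    low = max(0, guessed_milk - tea_first)
    high = min(milk_first, guessed_milk)
    return {
        k: comb(milk_first, k) * comb(tea_first, guessed_milk - k) / denominator
        for k in range(low, high + 1)
    }

def fisher_p_values(observed_k, probs):
    """Left-, right-, and two-tailed p-values for the observed count."""
    p_observed = probs[observed_k]
    left = sum(p for k, p in probs.items() if k <= observed_k)
    right = sum(p for k, p in probs.items() if k >= observed_k)
    # Small relative tolerance guards against floating-point ties.
    two_tailed = sum(p for p in probs.values() if p <= p_observed * (1 + 1e-12))
    return left, right, two_tailed

probs = table_probabilities()
left, right, two_tailed = fisher_p_values(1, probs)  # assumed observed table: 1 3 / 3 2
print(probs)
print(left, right, two_tailed)
```

The left and right tails both include the observed table, so the two one-tailed p-values sum to more than 1.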
45
Fisher’s Exact Test
This demonstration illustrates the concepts discussed previously.
47
1.04 Quiz
What can you conclude from each of the p-values from the Fisher’s Exact Test for the association between Relapsed and Rx Status?
48
1.04 Quiz – Correct Answer
The Left p-value = 0.0007, so there is sufficient evidence to conclude that the probability of a relapse is greater for those who did not finish the Rx than for those who did.
The Right p-value = 0.9999, so there is not sufficient evidence to conclude that the probability of a relapse is greater for those who finished the Rx than for those who did not.
The 2-Tail p-value = 0.0008, so there is sufficient evidence to conclude that the probability of a relapse differs depending on whether an Rx was finished.
49
What Happens If There Is a Third Variable?
[Figure: Income, Gender, $100]
50
Stratified Data Analysis
Stratified data analysis is the process of dividing subjects into groups defined by the levels of a third variable. Use this analysis when you want to examine the association between two variables within the levels of a third variable.
51
Stratified Data Analysis
Of the 39 single people, 23% have lung cancer and 77% do not. Of the 36 married people, 17% have lung cancer and 83% do not.
52
Stratified Data Analysis
Of the 28 single smokers, 28% have lung cancer and 72% do not. Of the 14 married smokers, 28% have lung cancer and 72% do not.
53
Cochran-Mantel-Haenszel Statistics
54
CMH versus Chi-Square
55
1. Correlation of Scores
[Figure: table of A by B] Test: linear association
56
2. Row Scores by Column Categories
[Figure: table of A by B] Test: equal row scores
57
3. Column Scores by Row Categories
[Figure: table of A by B] Test: equal column scores
58
4. General Association of Categories
[Figure: table of A by B] Test: general association
59
CMH Statistics and 2x2 Tables
For 2 x 2 tables, the CMH statistics are all equal.
60
When Do CMH Statistics Lack Power?
When the direction of the response is reversed across strata.
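The loss of power under reversal shows up in the arithmetic of the statistic. The sketch below implements the usual CMH formula for a set of 2 x 2 tables without a continuity correction. The strata are hypothetical, built so that the association in the second stratum is the mirror image of the first.

```python
def cmh_2x2(strata):
    """CMH statistic (no continuity correction) for 2x2 tables [[a, b], [c, d]],
    one table per stratum."""
    diff_sum = 0.0
    var_sum = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff_sum += a - row1 * col1 / n                        # observed minus expected
        var_sum += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return diff_sum ** 2 / var_sum

# Hypothetical strata (not from the slides).
same_direction = [[[20, 10], [10, 20]], [[20, 10], [10, 20]]]
reversed_strata = [[[20, 10], [10, 20]], [[10, 20], [20, 10]]]

print(cmh_2x2(same_direction))    # deviations add up across strata
print(cmh_2x2(reversed_strata))   # deviations cancel
```

Because the statistic sums the stratum-level deviations before squaring, opposite associations cancel, and the test can miss a strong association that is present within each stratum.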
61
CMH Tests
This demonstration illustrates the concepts discussed previously.
63
Exercise
This exercise reinforces the concepts discussed previously.
65
1.05 Multiple Choice Poll
The Correlation of Scores CMH test has which null hypothesis?
a. There is no linear association between the row and column variables in any stratum.
b. The mean scores for each column are equal in each stratum.
c. The mean scores for each row are equal in each stratum.
d. There is no association between the row and column variables in any stratum.
66
1.05 Multiple Choice Poll – Correct Answer
a. There is no linear association between the row and column variables in any stratum.
67
1.3 Recursive Partitioning
68
Objectives
Define partitioning.
Understand the splitting criteria used in JMP.
Review algorithm parameters available in JMP.
Use the Partition platform in JMP.
69
Recursive Partitioning
Partitioning refers to segmenting the data into groups that are as homogeneous as possible with respect to the dependent variable (Y).
70
Divide and Conquer
What factors affect the country from which cars are purchased?
[Figure: partition tree for Country. The root node (n = 303) splits into child nodes size(Large) and size(Medium, Small), with n = 42 and n = 261.]
71
Tree Algorithm: Calculate Separation of the Response
[Figure: separation of the response across values of X1]
72
Tree Algorithm: Find Best Split for the Independent Variable
[Figure: best split on X1]
73
Tree Algorithm: Repeat for the Other Independent Variables
[Figure: separation of means across values of X2]
74
Tree Algorithm: Compare the Best Splits
[Figure: best split on X1 compared with best split on X2]
75
Tree Algorithm: Partition with Best Split
76
Tree Algorithm: Repeat within Partitions
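The steps above can be sketched for a binary response and numeric predictors. This is a simplified illustration, not JMP's implementation: it scores each candidate split with the likelihood-ratio chi-square (G²) and converts it to an unadjusted LogWorth with 1 degree of freedom, whereas JMP applies further p-value adjustments. The data and function names are hypothetical.

```python
import math

def g_squared(table):
    """Likelihood-ratio chi-square (G^2) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # a zero cell contributes nothing
                expected = row_totals[i] * col_totals[j] / total
                g2 += 2.0 * count * math.log(count / expected)
    return g2

def best_split(xs, ys):
    """Try each midpoint between sorted distinct x values; return (threshold, G^2)."""
    values = sorted(set(xs))
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2.0
        table = [[0, 0], [0, 0]]  # rows: left/right of threshold; columns: y = 0/1
        for x, y in zip(xs, ys):
            table[0 if x < threshold else 1][y] += 1
        score = g_squared(table)
        if best is None or score > best[1]:
            best = (threshold, score)
    return best

def logworth_df1(g2):
    """-log10 of the unadjusted chi-square p-value with 1 degree of freedom."""
    return -math.log10(math.erfc(math.sqrt(max(g2, 0.0) / 2.0)))

# Hypothetical data: X1 separates the response cleanly, X2 does not.
y = [0, 0, 0, 0, 1, 1, 1, 1]
predictors = {"X1": [1, 2, 3, 4, 5, 6, 7, 8], "X2": [5, 1, 7, 3, 8, 2, 6, 4]}

splits = {name: best_split(xs, y) for name, xs in predictors.items()}
winner = max(splits, key=lambda name: splits[name][1])
print(splits, winner, logworth_df1(splits[winner][1]))
```

A full tree would partition the data at the winning split and repeat the same search inside each resulting group.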
77
Recursive Partitioning
This demonstration illustrates the concepts discussed previously.
79
Exercise
This exercise reinforces the concepts discussed previously.
81
1.06 Quiz
In which leaf, and on what variable, will JMP next split?
82
1.06 Quiz – Correct Answer
Of the leaves, the highest LogWorth is for Age (0.7313), in the Gender(Female) leaf. This is where JMP will split next.
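To put the LogWorth in more familiar terms, it can be converted back to a p-value. This assumes LogWorth is defined as the negative base-10 logarithm of the (adjusted) p-value for the candidate split.

```python
# LogWorth reported for Age in the Gender(Female) leaf.
logworth = 0.7313

# Assumed definition: LogWorth = -log10(p-value), so p-value = 10 ** (-LogWorth).
p_value = 10 ** (-logworth)
print(round(p_value, 4))
```

Under that definition, a LogWorth of 2 corresponds to a p-value of 0.01, so a value below 1 indicates a fairly weak candidate split even though it is the best one available.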
83
1.4 Introduction to Logistic Regression
84
Objectives
Explain the concepts of logistic regression.
Fit a logistic regression model using JMP software.
Examine logistic regression output.
85
Overview

Response       Predictor                    Analysis
Continuous     Categorical or Continuous    Linear Regression Analysis
Categorical    Categorical or Continuous    Logistic Regression Analysis
86
Types of Logistic Regression

Response Variable                       Type of Logistic Regression
Two Categories (for example, Yes/No)    Binary
Three or More Categories, Nominal       Nominal
Three or More Categories, Ordinal       Ordinal
87
What Does Logistic Regression Do?
The logistic regression model uses the predictor variables, which can be categorical or continuous, to predict the probability of specific outcomes. In other words, logistic regression is designed to describe probabilities associated with the values of the response variable.
88
The Logistic Curve
The relationship between the probability of a response variable and a predictor variable might be an S-shaped curve. Linear regression cannot model this relationship, but logistic regression can.
89
Logistic Regression Curves
This graph shows the relationship between the probability of Sale and Price.
90
Logit Transformation
$$\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right)$$
where
$i$ indexes all cases (observations)
$p_i$ is the probability that the event (a sale, for example) occurs in the $i$th case
$1 - p_i$ is the probability that the event (a sale, for example) does not occur in the $i$th case
$\log$ is the natural log (to the base $e$).
91
Assumption
[Figure: $p_i$ plotted against the predictor, and the same relationship after the logit transform]
92
Logistic Regression Model
$$\operatorname{logit}(p_i) = B_0 + B_1 X_1$$
where
$\operatorname{logit}(p_i)$ is the logit transformation of the probability of the event
$B_0$ is the intercept of the regression line
$B_1$ is the slope of the regression line.
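A short sketch may help connect the logit scale to the probability scale. It implements the logit transformation and its inverse, then uses the inverse to turn a linear predictor into a probability. The coefficient values are hypothetical, picked only to make the arithmetic easy to follow.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Map a value on the logit scale back to a probability (the S-shaped curve)."""
    return 1.0 / (1.0 + math.exp(-x))

def predicted_probability(b0, b1, x1):
    """Probability implied by logit(p) = b0 + b1 * x1."""
    return inverse_logit(b0 + b1 * x1)

# Hypothetical coefficients, not estimates from any data in the slides.
b0, b1 = -2.0, 0.5
for x1 in (0.0, 4.0, 8.0):
    print(x1, round(predicted_probability(b0, b1, x1), 3))
```

The model is linear on the logit scale, while the predicted probabilities stay between 0 and 1 and trace out the S-shaped logistic curve.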
93
Likelihood Function
A likelihood function expresses the probability of the observed data as a function of the unknown parameters. The goal is to derive values of the parameters such that the probability of the observed data is as large as possible.
94
Maximum Likelihood Estimate
[Figure: log-likelihood curve, with the maximum likelihood estimate at its peak]
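The idea of maximizing the log-likelihood can be shown with the simplest possible model, a single unknown probability. The sketch below uses illustrative counts (30 events in 180 trials, assumed for this example) and a plain grid search in place of the iterative fitting that statistical software performs.

```python
import math

def log_likelihood(p, events, trials):
    """Log-likelihood of a single event probability p, up to a constant."""
    return events * math.log(p) + (trials - events) * math.log(1.0 - p)

events, trials = 30, 180  # illustrative counts, assumed for this sketch

# Evaluate the log-likelihood on a grid of candidate values and keep the best.
grid = [k / 1000.0 for k in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, events, trials))

print(p_hat, events / trials)
```

The grid maximum lands next to the sample proportion, which is the closed-form maximum likelihood estimate for this model.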
95
Model Inference
[Figure: log-likelihood function, comparing LogL 0 with LogL 1]
96
Logistic Curve
[Figure: three logistic curves showing a weak, a strong, and a very strong relationship]
97
Example of Binary Logistic Regression Model
You want to predict the probability of defaulting on credit card payments based on having or not having a history of late payments. You can postulate this model:
logit(Probability of Defaulting) = B_0 + B_1 * (Late Payment)
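With one two-level predictor, this model can be fit by hand, which makes the role of the log-likelihood concrete. The sketch below uses the counts from the loan-default table in these slides (20 of 80 defaulting with a history of late payments, 10 of 100 without) and assumes Late Payment is coded 1 or 0. That coding is an assumption for the example, and software may parameterize the predictor differently.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

# Counts as read from the loan-default table in these slides.
late_default, late_total = 20, 80
no_late_default, no_late_total = 10, 100

# With a single two-level predictor the fitted probabilities equal the
# observed proportions, so the estimates have a closed form (0/1 coding assumed).
p_late = late_default / late_total
p_no_late = no_late_default / no_late_total
b0 = logit(p_no_late)
b1 = logit(p_late) - logit(p_no_late)

def log_likelihood(prob_late, prob_no_late):
    """Log-likelihood of the data for given default probabilities in each group."""
    return (
        late_default * math.log(prob_late)
        + (late_total - late_default) * math.log(1.0 - prob_late)
        + no_late_default * math.log(prob_no_late)
        + (no_late_total - no_late_default) * math.log(1.0 - prob_no_late)
    )

logl_1 = log_likelihood(p_late, p_no_late)            # model with Late Payment
p_all = (late_default + no_late_default) / (late_total + no_late_total)
logl_0 = log_likelihood(p_all, p_all)                 # intercept-only model (B_1 = 0)
g2 = 2.0 * (logl_1 - logl_0)                          # likelihood-ratio chi-square

print(b0, b1, logl_0, logl_1, g2)
```

The likelihood-ratio statistic compares the two log-likelihoods. A large value relative to a chi-square distribution with 1 degree of freedom suggests that Late Payment improves on the intercept-only model.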
98
Binary Logistic Regression
This demonstration illustrates the concepts discussed previously.
100
1.07 Quiz
You want to predict the probability of a defect, given the width of a product. What kind of association exists between Defect and Width: a strong relationship or a weak relationship?
101
1.07 Quiz – Correct Answer
Weak. The fitted regression line is nearly flat, indicating a weak association between Defect and Width.
102
Multiple Logistic Regression
103
Interaction
104
Multiple Logistic Regression
This demonstration illustrates the concepts discussed previously.
105
What Is an Odds Ratio?
An odds ratio indicates how much more likely, with respect to odds, a certain event is to occur in one group relative to its occurrence in another group.
Example: How much more likely are females than males to purchase 100 dollars or more in items?
Example: How much more likely is a person with a history of late payments on credit cards to default on a loan than a person with no history of late payments?
106
Probability of Outcome

                              Default on Loan
                              Yes    No     Total
Late Payments (Group A)       20     60     80
No Late Payments (Group B)    10     90     100
Total                         30     150    180

Probability of defaulting in Group A = 20/80 (0.25)
Probability of not defaulting in Group A = 60/80 (0.75)
107
Odds
Odds of outcome in Group A = (probability of defaulting in group with history of late payments) ÷ (probability of not defaulting in group with history of late payments) = 0.25 ÷ 0.75 = 0.33
108
Odds Ratio
Odds ratio of Group A to Group B = (odds of defaulting in group with history of late payments) ÷ (odds of defaulting in group with no history of late payments) = 0.33 ÷ 0.11 = 3
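The same arithmetic can be written as a short sketch. It uses the counts from the loan-default table (20 of 80 in Group A, 10 of 100 in Group B) and also shows the equivalent cross-product form. The function name is made up for this example.

```python
def odds(probability):
    """Odds of an event: probability that it occurs over probability that it does not."""
    return probability / (1.0 - probability)

# Counts from the loan-default table: defaults / group size.
p_a = 20 / 80    # Group A: history of late payments
p_b = 10 / 100   # Group B: no history of late payments

odds_a = odds(p_a)             # 0.25 / 0.75
odds_b = odds(p_b)             # 0.10 / 0.90
odds_ratio = odds_a / odds_b

# Equivalent cross-product form computed straight from the cell counts.
cross_product = (20 * 90) / (60 * 10)

print(round(odds_a, 2), round(odds_b, 2), round(odds_ratio, 2), cross_product)
```

An odds ratio of 3 means the odds of defaulting are three times as high in Group A. It does not mean the probability is three times as high: the probabilities are 0.25 and 0.10, a ratio of 2.5.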
109
Properties of the Odds Ratio
110
Odds Ratio from a Logistic Regression Model
For a predictor variable that has only two levels, you can exponentiate twice the parameter estimate that JMP provides to obtain the odds ratio.
Estimated odds ratio = exp(2 * parameter estimate)
What are the odds that a female purchases 100 dollars or more in items compared to a male?
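One way to see where the factor of 2 could come from is to assume that the two-level predictor enters the model coded as +1 and -1, so that the two groups sit two units apart on the predictor scale. The sketch below works under that assumption with the loan-default proportions from these slides (0.25 and 0.10). Check the parameterization your software reports before applying the formula.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

# Default proportions from the loan-default table in these slides.
p_a, p_b = 20 / 80, 10 / 100

# Assumed coding: x = +1 for Group A and x = -1 for Group B, so
# logit(p_a) = b0 + b1 and logit(p_b) = b0 - b1.
b0 = (logit(p_a) + logit(p_b)) / 2.0
b1 = (logit(p_a) - logit(p_b)) / 2.0

odds_ratio = math.exp(2.0 * b1)    # the groups differ by 2 units of x
print(b1, odds_ratio, math.exp(b1))
```

Under this coding, exp(2 * b1) recovers the odds ratio of 3 computed from the table, while exp(b1) alone gives only its square root.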
111
Odds Ratios
This demonstration illustrates the concepts discussed previously.
113
Exercise
This exercise reinforces the concepts discussed previously.
115
1.08 Multiple Choice Poll
Suppose processes A and B are used to make a product, and each product is evaluated as defective or non-defective. Suppose the probability of a defective from A is 0.2 and of a non-defective from A is 0.8. Which is true?
a. The odds of a defective from group A is given by 0.8/0.2 = 4.
b. The odds of a defective from group A is given by 0.2/0.8 = 0.25.
116
1.08 Multiple Choice Poll – Correct Answer
b. The odds of a defective from group A is given by 0.2/0.8 = 0.25.
117
1.09 Multiple Choice Poll
The odds of getting a defective product from process A is 0.25. What is its interpretation?
a. You expect only 1/4 as many defectives as non-defectives from process A.
b. You expect only 1/4 as many defectives as non-defectives from process B.
118
1.09 Multiple Choice Poll – Correct Answer
a. You expect only 1/4 as many defectives as non-defectives from process A.