Chapter 1: Stratified Data Analysis 1.1 Introduction 1.2 Examining Associations among Variables 1.3 Recursive Partitioning 1.4 Introduction to Logistic Regression

1.1 Introduction

Objectives: Recognize the differences between categorical and continuous data analysis. Identify the scale of measurement for your response variable.

Categorical versus Continuous Data Analysis

Identifying the Scale of Measurement: Before analyzing, select the measurement scale for each variable. [Figure: a variable with the levels Agree, No Opinion, Disagree]

Nominal Variables. Variable: Type of Beverage [Figure: beverage types]

Ordinal Variables. Variable: Size of Beverage (Small, Medium, Large)

Continuous Variables. Variable: Volume of Beverage (for example, 4.0)

1.01 Quiz
A car dealer records several inventory variables, including Type (automatic or standard), Time (the number of seconds it takes for the car to go from 0 to 60 mph), and Model (basic, middle, or luxury). Match the modeling type on the left with the appropriate component on the right.
1. Continuous    A. Type
2. Ordinal       B. Time
3. Nominal       C. Model

1.01 Quiz – Correct Answer: 1-B, 2-C, 3-A. Time (the number of seconds from 0 to 60 mph) is continuous, Model (basic, middle, or luxury) is ordinal, and Type (automatic or standard) is nominal.

What’s Next? Ah ha! Ordinal! Variable: opinion (Agree, No Opinion, Disagree)

1.2 Examining Associations among Variables

Objectives: Examine the distribution of categorical variables. Determine whether an association exists among categorical variables. Perform a stratified analysis of categorical variables.

Sample Data Set

Demonstration: Examining Distributions. This demonstration illustrates the concepts discussed previously.

Association: An association exists between two variables if the distribution of one variable changes when the level (or values) of the other variable changes. If there is no association, the distribution of the first variable is the same, regardless of the level of the other variable.

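No Association: Is your manager’s mood associated with the weather? [Figure: the mood distribution is 72% / 28% under both weather conditions.]

Association: Is your manager’s mood associated with the weather? [Figure: the mood distribution is 82% / 18% under one weather condition and 40% / 60% under the other.]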
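The two slides above compare the distribution of one variable within each level of the other. A minimal sketch of that comparison follows, assuming hypothetical counts chosen only to reproduce the percentages on the slides (the weather labels and the counts themselves are not from the original deck).

```python
def row_percents(table):
    """Return each row of counts as percentages of that row's total."""
    result = []
    for row in table:
        total = sum(row)
        result.append([100.0 * count / total for count in row])
    return result

# Hypothetical counts: rows are two weather conditions, columns are two moods.
no_association = [[72, 28], [36, 14]]   # same distribution in both rows
association = [[82, 18], [40, 60]]      # distribution changes with the row

print(row_percents(no_association))
print(row_percents(association))
```

When the rows of percentages match, the distribution does not depend on the row variable; when they differ, the sample shows an association, and a test is then needed to judge whether it could be due to chance.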

Demonstration: Recognizing Associations. This demonstration illustrates the concepts discussed previously.

1.02 Quiz: Is there an association between finishing a prescription (Rx) and experiencing a relapse?

1.02 Quiz – Correct Answer: Yes. The distribution of Yes/No for Did not finish Rx is different from the distribution of Yes/No for Finished Rx.

Tests for Association
Row percents of Income by Purchase

Income    Purchase $100+    Purchase under $100
Low       32%               68%
Medium    32%               68%
High      48%               52%

Null Hypothesis: There is no association between Income and Purchase. The probability of purchasing items of $100 or more is the same, regardless of income level.

Alternative Hypothesis: There is an association between Income and Purchase. The probability of purchasing items of $100 or more is different between Low, Medium, and High income customers.

Chi-Square Test
NO ASSOCIATION: observed frequencies = expected frequencies
ASSOCIATION: observed frequencies ≠ expected frequencies
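As a rough illustration of the comparison between observed and expected frequencies, the sketch below computes expected counts under no association (row total times column total, divided by the grand total) and the Pearson chi-square statistic. The function names and the example counts are assumptions made for this example, not part of the original slides.

```python
def expected_counts(observed):
    """Expected cell counts if the row and column variables were unassociated."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Hypothetical tables: the first has identical row distributions.
same_distribution = [[72, 28], [36, 14]]
different_distribution = [[20, 60], [10, 90]]

print(pearson_chi_square(same_distribution))       # observed equals expected
print(pearson_chi_square(different_distribution))  # observed departs from expected
```

The statistic is zero when every observed count equals its expected count and grows as the table departs from what no association would predict.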

p-Value for Chi-Square Test: This p-value is the probability of observing a chi-square statistic at least as large as the one actually observed, given that there is no association between the variables. Informally, it is the probability of an association as strong as the one you observe in the data occurring by chance.
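The p-value described above is the upper-tail probability of the chi-square distribution. The following is a sketch using only the Python standard library; the function name, the statistic 7.2, and the degrees of freedom are illustrative assumptions. It relies on the closed forms for 1 and 2 degrees of freedom and the standard recurrence for the regularized upper incomplete gamma function.

```python
import math

def chi_square_p_value(statistic, df):
    """Upper-tail probability P(X >= statistic) for a chi-square with integer df >= 1."""
    if statistic <= 0:
        return 1.0
    z = statistic / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))   # closed form for df = 1
    else:
        a, q = 1.0, math.exp(-z)              # closed form for df = 2
    # Step up two degrees of freedom at a time: Q(a + 1, z) = Q(a, z) + z^a e^-z / Gamma(a + 1)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical statistic from a 2x2 table, so df = (2 - 1) * (2 - 1) = 1.
p_value = chi_square_p_value(7.2, 1)
reject_null = p_value < 0.05
print(p_value, reject_null)
```

In practice a statistics package reports this value directly; the sketch only shows where the number comes from.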

Chi-Square Tests: Chi-square tests and the corresponding p-values determine whether an association exists; do not measure the strength of an association; depend on and reflect the sample size.

Demonstration: Chi-Square Test. This demonstration illustrates the concepts discussed previously.

1.03 Quiz: Is there sufficient evidence that an association exists between Relapsed and Rx Status?

1.03 Quiz – Correct Answer: Yes, there is sufficient evidence that an association exists between Relapsed and Rx Status. The p-value for the Pearson chi-square statistic is .0005, so at alpha = .05, there is sufficient evidence to reject the null (that no association exists) in favor of the alternative (that an association exists).

When Not to Use the Chi-Square Test: When more than 20% of the cells have expected counts less than five.
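The rule of thumb above can be checked mechanically. This is a small sketch under the assumption that the table is given as a list of rows of counts; the function name and the two example tables are hypothetical.

```python
def chi_square_is_appropriate(observed, min_expected=5.0, max_fraction=0.20):
    """True unless more than max_fraction of cells have expected counts below min_expected."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    small_cells = sum(1 for e in expected if e < min_expected)
    return small_cells / len(expected) <= max_fraction

large_sample = [[20, 60], [10, 90]]   # hypothetical counts
small_sample = [[3, 1], [1, 4]]       # hypothetical counts, only 9 observations

print(chi_square_is_appropriate(large_sample))
print(chi_square_is_appropriate(small_sample))
```

When the check fails, an exact test such as Fisher's is the usual alternative.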

Observed versus Expected Values [Tables: Observed Values; Expected Values]

Small Samples – Fisher’s Exact Test [Figure labels: Fisher’s Exact Test; Sample Size: Small, Large]

Example: Tea and Milk. Suppose you want to test whether someone can determine if a cup of tea with milk had the milk poured first or the tea poured first.

Fisher’s Exact Test Example. 9 cups of tea: 4 with milk first and 5 with tea first. Predict which cups had tea poured first. [Table: Actual (M, T) by Guess (M, T), with fixed marginal totals]

Basis for Fisher’s Exact Test: row and column totals fixed. [Tables: other possible Actual (M, T) by Guess (M, T) tables with the same totals]

Fisher’s Exact Test Hypotheses. Null Hypothesis: There is no association. Alternative Hypothesis: There is an association (two-tailed, left-tailed, or right-tailed).

Left-Tailed Alternative Hypothesis [Tables: Actual (M, T) by Guess (M, T) tables contributing to the left-tailed p-value]

Right-Tailed Alternative Hypothesis [Tables: Actual (M, T) by Guess (M, T) tables contributing to the right-tailed p-value]

Two-Tailed Alternative Hypothesis [Tables: Actual (M, T) by Guess (M, T) tables contributing to the two-tailed p-value]
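With row and column totals fixed, every possible 2x2 table is determined by its top-left cell, and that cell follows a hypergeometric distribution. The sketch below computes the three p-values on that basis. The guess outcome used in the example (3 of the 4 milk-first cups identified correctly) is hypothetical, and the two-tailed rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention that may differ from a given package's.

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Return (left, right, two_tailed) p-values for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    lowest, highest = max(0, c1 - r2), min(r1, c1)
    p_observed = prob(a)
    left = sum(prob(x) for x in range(lowest, a + 1))
    right = sum(prob(x) for x in range(a, highest + 1))
    two_tailed = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return left, right, two_tailed

# Rows: actual (M, T); columns: guess (M, T). Margins are 4 and 5, as on the slide.
left, right, two_tailed = fisher_exact(3, 1, 1, 4)
print(left, right, two_tailed)
```

With only 9 cups there are just five possible tables, which is why an exact enumeration is feasible for small samples.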

Demonstration: Fisher’s Exact Test. This demonstration illustrates the concepts discussed previously.

1.04 Quiz: What can you conclude from each of the p-values from the Fisher’s Exact Test for the association between Relapsed and Rx Status?

1.04 Quiz – Correct Answer: The Left p-value = .0007, so there is sufficient evidence to conclude that the probability of a relapse is greater for those who did not finish the Rx than for those who did. The Right p-value = .9999, so there is not sufficient evidence to conclude that the probability of a relapse is greater for those who finished the Rx than for those who did not. The 2-Tail p-value = .0008, so there is sufficient evidence to conclude that the probability of a relapse is different depending on whether a Rx was finished or not.

What Happens If There Is a Third Variable? [Figure labels: Income, Gender, $100]

Stratified Data Analysis: Stratified data analysis is the process of dividing subjects into groups defined by the levels of a third variable. Use this analysis when you want to examine the association between two variables within the levels of a third variable.

Stratified Data Analysis: Of the 39 single people, 23% have lung cancer and 77% do not. Of the 36 married people, 17% have lung cancer and 83% do not.

Stratified Data Analysis: Of the 28 single smokers, 28% have lung cancer and 72% do not. Of the 14 married smokers, 28% have lung cancer and 72% do not.

Cochran-Mantel-Haenszel Statistics

CMH versus Chi-Square

1. Correlation of Scores: test linear association (between A and B)
2. Row Scores by Column Categories: test equal row scores
3. Column Scores by Row Categories: test equal column scores
4. General Association of Categories: test general association

CMH Statistics and 2x2 Tables: For 2 x 2 tables, the CMH statistics are all equal.

When Do CMH Statistics Lack Power? When the response is reversed in the strata.
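For a set of 2x2 tables, one per stratum, the CMH statistic can be written in its textbook form: sum the differences between the observed top-left cell and its expected value across strata, square the sum, and divide by the summed hypergeometric variances. The sketch below uses that form without a continuity correction; the function name and all counts are hypothetical, and a statistics package may report a slightly different variant.

```python
def cmh_2x2(strata):
    """CMH statistic for a list of 2x2 tables given as (a, b, c, d) tuples."""
    total_difference = 0.0
    total_variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        r1, r2, c1, c2 = a + b, c + d, a + c, b + d
        total_difference += a - r1 * c1 / n                     # observed minus expected
        total_variance += r1 * r2 * c1 * c2 / (n * n * (n - 1))  # hypergeometric variance
    return total_difference ** 2 / total_variance

# Hypothetical strata: the association runs the same way in both.
consistent = [(20, 10, 10, 20), (18, 12, 12, 18)]
# Hypothetical strata: the association is reversed in the second stratum.
reversed_direction = [(20, 10, 10, 20), (10, 20, 20, 10)]

print(cmh_2x2(consistent))
print(cmh_2x2(reversed_direction))
```

Because the differences are summed before squaring, opposite associations in different strata cancel, which is one way to see why the statistic can miss an association whose direction changes from stratum to stratum.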

Demonstration: CMH Tests. This demonstration illustrates the concepts discussed previously.

Exercise: This exercise reinforces the concepts discussed previously.

1.05 Multiple Choice Poll
The Correlation of Scores CMH test has which null hypothesis?
a. There is no linear association between the row and column variables in any stratum.
b. The mean scores for each column are equal in each stratum.
c. The mean scores for each row are equal in each stratum.
d. There is no association between the row and column variables in any stratum.

1.05 Multiple Choice Poll – Correct Answer: a. There is no linear association between the row and column variables in any stratum.

1.3 Recursive Partitioning

Objectives: Define partitioning. Understand the splitting criteria used in JMP. Review algorithm parameters available in JMP. Use the Partition platform in JMP.

Recursive Partitioning: Partitioning refers to segmenting the data into groups that are as homogeneous as possible with respect to the dependent variable (Y).

Divide and Conquer: What factors affect the country from which cars are purchased? [Figure: tree for the response Country; the root node (n = 303) splits on size into size(Large) and size(Medium, Small), giving child nodes of n = 42 and n = 261.]

Tree Algorithm:
1. Calculate the separation of the response (for candidate splits on X1).
2. Find the best split for the independent variable (best split on X1).
3. Repeat for the other independent variables (X2).
4. Compare the best splits (best split on X1 versus best split on X2).
5. Partition with the best split.
6. Repeat within partitions.
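The tree-building steps on these slides can be sketched for a binary response and numeric predictors. This example measures separation with the Pearson chi-square of the 2x2 table formed by each candidate cut, which is a stand-in chosen for simplicity and not necessarily the criterion JMP uses; the function names, variable names, and data are all hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denominator == 0 else n * (a * d - b * c) ** 2 / denominator

def best_split(x, y):
    """Try every midpoint between adjacent x values; return (threshold, separation)."""
    best = (None, 0.0)
    values = sorted(set(x))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        a = sum(1 for xi, yi in zip(x, y) if xi < threshold and yi == 1)
        b = sum(1 for xi, yi in zip(x, y) if xi < threshold and yi == 0)
        c = sum(1 for xi, yi in zip(x, y) if xi >= threshold and yi == 1)
        d = sum(1 for xi, yi in zip(x, y) if xi >= threshold and yi == 0)
        separation = chi_square_2x2(a, b, c, d)
        if separation > best[1]:
            best = (threshold, separation)
    return best

# Hypothetical data: X1 separates the response perfectly, X2 does not.
predictors = {"X1": [1, 2, 3, 4, 5, 6], "X2": [5, 1, 4, 2, 6, 3]}
y = [0, 0, 0, 1, 1, 1]

splits = {name: best_split(x, y) for name, x in predictors.items()}
winner = max(splits, key=lambda name: splits[name][1])
print(splits, winner)
```

Partitioning with the winning split and calling the same search on each resulting subset gives the recursive part of the procedure.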

Demonstration: Recursive Partitioning. This demonstration illustrates the concepts discussed previously.

Exercise: This exercise reinforces the concepts discussed previously.

1.06 Quiz: In which leaf, and on what variable, will JMP next split?

1.06 Quiz – Correct Answer: Of the leaves, the highest LogWorth is for Age (0.7313), in the Gender(Female) leaf. This is where JMP will next split.
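The quiz answer picks the candidate split with the highest LogWorth. As I understand it, LogWorth is the negative base-10 logarithm of the split's p-value, so a larger value means a smaller p-value. The sketch below works on that assumption; only the 0.7313 value comes from the slide, and the other candidates are made up for illustration.

```python
import math

def logworth(p_value):
    """LogWorth, assumed here to be -log10(p-value)."""
    return -math.log10(p_value)

# Candidate splits as (leaf, variable) -> LogWorth. Only the first value is from the slide.
candidates = {
    ("Gender(Female)", "Age"): 0.7313,
    ("Gender(Male)", "Age"): 0.41,       # hypothetical
    ("Gender(Male)", "Income"): 0.18,    # hypothetical
}

next_split = max(candidates, key=candidates.get)
implied_p_value = 10 ** -candidates[next_split]
print(next_split, implied_p_value)
```

Under this assumption a LogWorth of 0.7313 corresponds to a p-value of roughly 0.19, so the best available split here is not a strong one.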

1.4 Introduction to Logistic Regression

Objectives: Explain the concepts of logistic regression. Fit a logistic regression model using JMP software. Examine logistic regression output.

Overview: With categorical or continuous predictors, a continuous response calls for linear regression analysis and a categorical response calls for logistic regression analysis.

Types of Logistic Regression

Response Variable                                 Type of Logistic Regression
Two categories (for example, Yes / No)            Binary
Three or more categories, nominal                 Nominal
Three or more categories, ordinal                 Ordinal

What Does Logistic Regression Do? The logistic regression model uses the predictor variables, which can be categorical or continuous, to predict the probability of specific outcomes. In other words, logistic regression is designed to describe probabilities associated with the values of the response variable.

The Logistic Curve: The relationship between the probability of a response variable and a predictor variable might be an S-shaped curve. Linear regression cannot model this relationship, but logistic regression can.

Logistic Regression Curves: This graph shows the relationship between the probability of Sale and Price.

Logit Transformation: logit(p_i) = log(p_i / (1 - p_i)), where
i indexes all cases (observations).
p_i is the probability that the event (a sale, for example) occurs in the i-th case.
1 - p_i is the probability that the event (a sale, for example) does not occur in the i-th case.
log is the natural log (to the base e).

Assumption [Figure: p_i plotted against the predictor, and the logit transform of p_i plotted against the predictor]

Logistic Regression Model: logit(p_i) = B_0 + B_1 X_1, where
logit(p_i) is the logit transformation of the probability of the event.
B_0 is the intercept of the regression line.
B_1 is the slope of the regression line.
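To make the logit and the model equation concrete, the sketch below converts between probabilities and logits and evaluates the predicted probability for a single predictor. The function names and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Map a logit back to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))

def predicted_probability(b0, b1, x1):
    """Probability of the event under logit(p) = b0 + b1 * x1."""
    return inverse_logit(b0 + b1 * x1)

# Hypothetical coefficients: the probability of a sale falls as price rises.
b0, b1 = 4.0, -0.5
for price in (2, 8, 14):
    print(price, round(predicted_probability(b0, b1, price), 3))
```

The model is linear on the logit scale, and applying the inverse logit produces the S-shaped curve on the probability scale.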

Likelihood Function: A likelihood function expresses the probability of the observed data as a function of the unknown model parameters. The goal is to derive values of the parameters such that the probability of the observed data is as large as possible.

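Maximum Likelihood Estimate [Figure: log-likelihood as a function of the parameter]

Model Inference [Figure: log-likelihood function, with LogL_0 and LogL_1 marked]

Logistic Curve [Figure: three logistic curves showing a weak relationship, a strong relationship, and a very strong relationship]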
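The log-likelihood behind these two slides can be written out for a binary response. In the sketch below, the counts are taken from the loan-default table that appears later in this transcript. Reading LogL_0 as the intercept-only model and LogL_1 as the model with the predictor, and comparing them through twice their difference (a likelihood ratio statistic), is an interpretation of the figure and not something the slide text states.

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Log-likelihood of binary outcomes ys under logit(p) = b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# x = 1 for a history of late payments (80 people, 20 defaults), x = 0 otherwise (100 people, 10 defaults).
xs = [1] * 80 + [0] * 100
ys = [1] * 20 + [0] * 60 + [1] * 10 + [0] * 90

# With one binary predictor the maximum likelihood estimates have closed forms.
logl_0 = log_likelihood(math.log(30 / 150), 0.0, xs, ys)            # intercept only
logl_1 = log_likelihood(math.log(10 / 90), math.log(3.0), xs, ys)   # intercept and slope
likelihood_ratio_statistic = 2.0 * (logl_1 - logl_0)
print(logl_0, logl_1, likelihood_ratio_statistic)
```

Moving either coefficient away from its estimate lowers the log-likelihood, which is what it means for the estimate to be the maximum.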

Example of Binary Logistic Regression Model: You want to predict the probability of defaulting on credit card payments based on having or not having a history of late payments. You can postulate this model: logit(Probability of Defaulting) = B_0 + B_1 * (Late Payment)

Demonstration: Binary Logistic Regression. This demonstration illustrates the concepts discussed previously.

1.07 Quiz: You want to predict the probability of a defect, given the width of a product. What kind of association exists between Defect and Width: a strong relationship or a weak relationship?

1.07 Quiz – Correct Answer: Weak. The fitted regression line is nearly flat, indicating a weak association between Defect and Width.

Multiple Logistic Regression

Interaction

Demonstration: Multiple Logistic Regression. This demonstration illustrates the concepts discussed previously.

What Is an Odds Ratio? An odds ratio indicates how much more likely, with respect to odds, a certain event occurs in one group relative to its occurrence in another group.
Example: How much more likely are females to purchase 100 dollars or more in items compared to males?
Example: How much more likely is a person with a history of late payments on credit cards to default on a loan relative to a person who does not have a history of late payments?

Probability of Outcome

                                Default on Loan
                                Yes     No      Total
Late Payments (Group A)         20      60      80
No Late Payments (Group B)      10      90      100
Total                           30      150     180

Probability of defaulting = 20/80 (.25) in Group A
Probability of not defaulting = 60/80 (.75) in Group A

Odds of Outcome in Group A = (probability of defaulting in group with history of late payments) ÷ (probability of not defaulting in group with history of late payments) = 0.25 ÷ 0.75 = 0.33

Odds Ratio of Group A to Group B = (odds of defaulting in group with history of late payments) ÷ (odds of defaulting in group with no history of late payments) = 0.33 ÷ 0.11 = 3
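The odds and odds ratio arithmetic can be reproduced directly from the loan-default counts given on the slides (20 of 80 defaulting in Group A, 10 of 100 in Group B). The helper name `odds` is an assumption made for this sketch.

```python
def odds(probability):
    """Odds of an event: probability that it occurs divided by probability that it does not."""
    return probability / (1.0 - probability)

p_default_a = 20 / 80     # Group A: history of late payments
p_default_b = 10 / 100    # Group B: no history of late payments

odds_a = odds(p_default_a)        # about 0.33
odds_b = odds(p_default_b)        # about 0.11
odds_ratio = odds_a / odds_b      # about 3

print(round(odds_a, 2), round(odds_b, 2), round(odds_ratio, 2))
```

An odds ratio of about 3 means the odds of default in Group A are roughly three times those in Group B; it does not mean the probability of default is three times as large (0.25 versus 0.10).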

Properties of the Odds Ratio

Odds Ratio from a Logistic Regression Model: For a predictor variable that has only two levels, you can exponentiate twice the parameter estimate that JMP provides to obtain the odds ratio. Estimated odds ratio = exp(2 * parameter estimate). What are the odds a female purchases more than 100 dollars in items compared to a male?
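One way to see where the factor of 2 comes from is to assume the two-level predictor is coded +1 for one group and -1 for the other, so the two groups sit two units apart on the coded scale. That coding is an assumption of this sketch and is not stated on the slide. The counts are the loan-default counts from the earlier slides.

```python
import math

# Log odds of default in each group, from the loan-default table.
log_odds_a = math.log((20 / 80) / (60 / 80))      # history of late payments
log_odds_b = math.log((10 / 100) / (90 / 100))    # no history of late payments

# Assumed coding: z = +1 for Group A and z = -1 for Group B, with logit(p) = b0 + b1 * z.
b0 = (log_odds_a + log_odds_b) / 2.0
b1 = (log_odds_a - log_odds_b) / 2.0

# The groups differ by 2 units of z, so the log odds ratio is 2 * b1.
estimated_odds_ratio = math.exp(2.0 * b1)
print(b1, estimated_odds_ratio)
```

Under a 0/1 coding the slope itself would be the log odds ratio, so the odds ratio would be exp(parameter estimate) with no factor of 2.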

Demonstration: Odds Ratios. This demonstration illustrates the concepts discussed previously.

Exercise: This exercise reinforces the concepts discussed previously.

1.08 Multiple Choice Poll
Suppose processes A and B are used to make a product, and each product is evaluated as defective or non-defective. Suppose the probability of a defective from A is .2 and of a non-defective from A is .8. Which is true?
a. The odds of a defective from group A is given by .8/.2 = 4.
b. The odds of a defective from group A is given by .2/.8 = .25.

1.08 Multiple Choice Poll – Correct Answer: b. The odds of a defective from group A is given by .2/.8 = .25.

1.09 Multiple Choice Poll
The odds of getting a defective product from process A is .25. What is its interpretation?
a. You expect only 1/4 as many defectives as non-defectives from process A.
b. You expect only 1/4 as many defectives as non-defectives from process B.

1.09 Multiple Choice Poll – Correct Answer: a. You expect only 1/4 as many defectives as non-defectives from process A.