Categorical Data Analysis


Chapter 10: Categorical Data Analysis

Inference for a Single Proportion (p)
Goal: Estimate the proportion of individuals in a population with a certain characteristic (p). This is equivalent to estimating a binomial probability.
Sample: Take an SRS of n individuals from the population and observe y that have the characteristic.
The sample proportion is $\hat{p} = y/n$. Its sampling distribution has mean $p$ and standard error $\sqrt{p(1-p)/n}$, and is approximately normal for large n.

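Large-Sample Confidence Interval for p
Take an SRS of size n from a population where p is the true (unknown) proportion of successes.
Observe y successes.
Set the confidence level (1-α) and obtain $z_{\alpha/2}$ from the z-table.
The interval is $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$, where $\hat{p} = y/n$.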

Example - Ginkgo and Acetazolamide for AMS
Study Goal: Measure the effect of Ginkgo and Acetazolamide on the occurrence of Acute Mountain Sickness (AMS) in Himalayan trekkers.
Parameter: p = true proportion of all trekkers receiving Ginkgo and Acetazolamide (G&A) who would suffer from AMS.
Sample Data: n=126 trekkers received G&A, y=18 suffered from AMS.
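
The large-sample interval can be checked numerically for the trekker data (n=126, y=18). The following is a minimal sketch, assuming a 95% confidence level (z = 1.96); the function name `wald_ci` is illustrative and not from the slides.

```python
import math

def wald_ci(y, n, z=1.96):
    """Large-sample confidence interval for a proportion: p_hat +/- z * SE."""
    p_hat = y / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    return p_hat - z * se, p_hat + z * se

# Counts from the Ginkgo and Acetazolamide example
lower, upper = wald_ci(18, 126)
print(f"95% CI for p: ({lower:.3f}, {upper:.3f})")
```

The whole interval lies below 0.25, which agrees with the one-sided test reported later in the deck.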

Wilson-Agresti-Coull Method
For moderate to small sample sizes, large-sample methods may not work well with respect to coverage probabilities.
A simple approach that works well in practice: adjust the observed number of successes (y) and the sample size (n), then apply the large-sample interval to the adjusted values.

Example: Lister's Tests with Antiseptic
Experiments with antiseptic in patients with upper limb amputations (Joseph Lister, circa 1870).
n=12 patients received antiseptic, y=1 died.
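
The adjustment itself did not survive in the transcript. The sketch below assumes the commonly stated Agresti-Coull form, which adds z²/2 to the success count and z² to the sample size (roughly "add 2 successes and 2 failures" at 95%); treat that as an assumption rather than the slide's exact formula. It is applied to the antiseptic data (n=12, y=1).

```python
import math

def adjusted_ci(y, n, z=1.96):
    """Agresti-Coull style interval (assumed form: y + z^2/2, n + z^2)."""
    y_adj = y + 0.5 * z ** 2
    n_adj = n + z ** 2
    p_adj = y_adj / n_adj
    se = math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # A proportion cannot fall outside [0, 1], so clip the limits
    return max(0.0, p_adj - z * se), min(1.0, p_adj + z * se)

# Counts from the antiseptic example
lower, upper = adjusted_ci(1, 12)
print(f"Adjusted 95% CI: ({lower:.3f}, {upper:.3f})")
```

With only 12 patients the interval is wide, and the lower limit is clipped at 0.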

Sample Size for Margin of Error = E
Goal: Estimate p within E with 100(1-α)% confidence.
The confidence interval will have a width of 2E.
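
The sample-size formula is missing from the transcript. A minimal sketch, assuming the usual rule obtained by setting the margin of error z·sqrt(p(1-p)/n) equal to E and solving for n; the default planning value p = 0.5 is the conservative choice and is an assumption here, as are the example margins.

```python
import math

def sample_size(margin, p_guess=0.5, z=1.96):
    """Smallest n with z * sqrt(p(1-p)/n) <= margin."""
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)

n_conservative = sample_size(0.03)             # no prior guess for p
n_informed = sample_size(0.05, p_guess=0.25)   # hypothetical prior guess
print(n_conservative, n_informed)
```

A prior guess for p further from 0.5 reduces the required sample size, since p(1-p) is largest at p = 0.5.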

Significance Test for a Proportion
Goal: Test whether a proportion (p) equals some null value p0.
H0: p=p0
The large-sample test works well when np0 and n(1-p0) are both ≥ 5.

Ginkgo and Acetazolamide for AMS
Can we claim that the incidence rate of AMS is less than 25% for trekkers receiving G&A?
H0: p=0.25
Ha: p < 0.25
Conclusion: Strong evidence that the incidence rate is below 25% (p < 0.25).
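
The test statistic for this example is not in the transcript. The sketch below assumes the standard large-sample statistic z = (p̂ - p0) / sqrt(p0(1-p0)/n) and a lower-tail P-value, using the sample data given earlier (n=126, y=18).

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def one_sample_z(y, n, p0):
    """Large-sample z statistic for H0: p = p0 (SE computed under H0)."""
    p_hat = y / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

z_obs = one_sample_z(18, 126, 0.25)
p_value = norm_cdf(z_obs)  # lower tail, because Ha is p < 0.25
print(f"z = {z_obs:.2f}, one-sided P-value = {p_value:.4f}")
```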

Comparing Two Population Proportions
Goal: Compare two populations/treatments with respect to a nominal (binary) outcome.
Sampling Design: Independent vs. dependent samples.
Methods based on large vs. small samples.
Contingency tables are used to summarize the data.
Measures of Association: Absolute Risk, Relative Risk, Odds Ratio.

Contingency Tables
Tables representing all combinations of levels of the explanatory and response variables.
Numbers in the table represent counts of the number of cases in each cell.
Row and column totals are called marginal counts.

2x2 Tables - Notation

           Outcome
Group      Present    Absent              Total
Group 1    y1         n1-y1               n1
Group 2    y2         n2-y2               n2
Total      y1+y2      (n1+n2)-(y1+y2)     n1+n2

Example - Firm Type/Product Quality

                         Outcome
Group                    High Quality    Low Quality    Total
Not Integrated           33              55             88
Vertically Integrated    5               79             84
Total                    38              134            172

Groups: Not Integrated (Weave only) vs. Vertically Integrated (Spin and Weave) Cotton Textile Producers
Outcomes: High Quality (High Count) vs. Low Quality (Low Count)
Source: Temin (1988)

Notation
Proportion in Population 1 with the characteristic of interest: p1
Sample size from Population 1: n1
Number of individuals in Sample 1 with the characteristic of interest: y1
Sample proportion from Sample 1 with the characteristic of interest: $\hat{p}_1 = y_1/n_1$
Similar notation for Population/Sample 2.

Example - Cotton Textile Producers
p1 - True proportion of all non-integrated firms that would produce High quality
p2 - True proportion of all vertically integrated firms that would produce High quality

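Notation (Continued)
Parameter of Primary Interest: p1-p2, the difference in the 2 population proportions with the characteristic (2 other measures given below).
Estimator: $\hat{p}_1 - \hat{p}_2$
Standard Error (and its estimate): $\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}$, estimated by replacing $p_1$ and $p_2$ with $\hat{p}_1$ and $\hat{p}_2$
Pooled Estimated Standard Error when p1=p2=p: $\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$, where $\hat{p} = (y_1+y_2)/(n_1+n_2)$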

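Cotton Textile Producers (Continued)
Parameter of Primary Interest: p1-p2, the difference in the 2 population proportions that produce High quality output.
Estimator: $\hat{p}_1 - \hat{p}_2 = 33/88 - 5/84 = 0.375 - 0.060 = 0.315$
Estimated Standard Error: $\sqrt{0.375(0.625)/88 + 0.060(0.940)/84} = 0.058$
Pooled Estimated Standard Error when p1=p2=p: with $\hat{p} = 38/172 = 0.221$, $\sqrt{0.221(0.779)(1/88 + 1/84)} = 0.063$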

Significance Tests for p1-p2
Deciding whether p1=p2 can be done by interpreting "plausible values" of p1-p2 from the confidence interval:
If the entire interval is positive, conclude p1 > p2 (p1-p2 > 0).
If the entire interval is negative, conclude p1 < p2 (p1-p2 < 0).
If the interval contains 0, do not conclude that p1 ≠ p2.
Alternatively, we can conduct a significance test:
H0: p1 = p2
Ha: p1 ≠ p2 (2-sided) or Ha: p1 > p2 (1-sided)
Test Statistic: $z_{obs} = (\hat{p}_1 - \hat{p}_2) / \sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$, where $\hat{p} = (y_1+y_2)/(n_1+n_2)$ is the pooled sample proportion
RR: $|z_{obs}| \geq z_{\alpha/2}$ (2-sided), $z_{obs} \geq z_{\alpha}$ (1-sided)
P-value: $2P(Z \geq |z_{obs}|)$ (2-sided), $P(Z \geq z_{obs})$ (1-sided)

Example - Cotton Textile Production
Again, there is strong evidence that non-integrated firms are more likely to produce high quality output than integrated firms.
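
The comparison for the cotton textile producers can be reproduced from the 2x2 table (33 of 88 non-integrated and 5 of 84 integrated firms produced high quality output). This is a minimal sketch assuming the usual unpooled standard error for the confidence interval and the pooled standard error for the test; the function name is illustrative.

```python
import math

def two_proportion_summary(y1, n1, y2, n2, z=1.96):
    """Difference in proportions, 95% CI (unpooled SE), and z test (pooled SE)."""
    p1, p2 = y1 / n1, y2 / n2
    diff = p1 - p2
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    p_pool = (y1 + y2) / (n1 + n2)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    ci = (diff - z * se_unpooled, diff + z * se_unpooled)
    return diff, ci, diff / se_pooled

diff, ci, z_obs = two_proportion_summary(33, 88, 5, 84)
p_two_sided = math.erfc(abs(z_obs) / math.sqrt(2))  # equals 2 * P(Z >= |z_obs|)
print(f"diff = {diff:.3f}, CI = ({ci[0]:.3f}, {ci[1]:.3f}), z = {z_obs:.2f}")
```

The interval is entirely positive and the z statistic is far into the rejection region, matching the slide's conclusion.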

Fisher’s Exact Test
Method for testing whether p2=p1 when one or both of the group sample sizes are small.
Measures (conditional on the group sizes and the number of cases with and without the characteristic) the chance that we would see differences of this magnitude or larger in the sample proportions if there were no differences in the populations.

Example – Echinacea Purpurea for Colds
Healthy adults were randomized to receive EP (n1=24) or placebo (n2=22, two were dropped).
Among EP subjects, 14 of 24 developed a cold after exposure to RV-39 (58%).
Among placebo subjects, 18 of 22 developed a cold after exposure to RV-39 (82%).
Out of a total of 46 subjects, 32 developed a cold.
Out of a total of 46 subjects, 24 received EP.
Source: Sperber, et al (2004)

Example – Echinacea Purpurea for Colds Conditional on 32 people developing colds and 24 receiving EP and 22 receiving placebo, the following table gives the outcomes that would have been as strong or stronger evidence that EP reduced risk of developing cold (1-sided test). P-value from SPSS is .079 (next slide).
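
The one-sided P-value can be reproduced by summing hypergeometric probabilities over the tables that are at least as favorable to EP as the observed one. The following is a sketch of that calculation using only the counts stated in the example; the function name is illustrative.

```python
from math import comb

def fisher_lower_tail(a, b, c, d):
    """P(top-left cell <= a), conditional on all row and column totals.

    Table layout: rows = groups, columns = (event, no event).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)
    k_min = max(0, col1 - row2)  # smallest count the margins allow
    numer = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(k_min, a + 1))
    return numer / denom

# EP: 14 colds, 10 no cold; placebo: 18 colds, 4 no cold
p_value = fisher_lower_tail(14, 10, 18, 4)
print(f"One-sided P-value = {p_value:.3f}")
```

The result rounds to .079, the value the slide attributes to SPSS.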

Example - SPSS Output

McNemar’s Test for Paired Samples
Common subjects (or matched pairs) are observed under 2 conditions (2 treatments, before/after, 2 diagnostic tests) in a crossover setting.
Two possible outcomes (Presence/Absence of Characteristic) on each measurement.
Four possibilities for each subject/pair with respect to the outcome:
Present in both conditions
Absent in both conditions
Present in Condition 1, Absent in Condition 2
Absent in Condition 1, Present in Condition 2

McNemar’s Test for Paired Samples

McNemar’s Test for Paired Samples
Data: n12 = # of pairs where the characteristic is present in condition 1 and not 2, and n21 = # of pairs where it is present in condition 2 and not 1.
H0: Probability the outcome is Present is the same for the 2 conditions (p1 = p2)
HA: Probabilities differ for the 2 conditions (p1 ≠ p2)
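
The test statistic is missing from the transcript. A minimal sketch, assuming the standard large-sample form z = (n12 - n21) / sqrt(n12 + n21), which uses only the discordant pairs. The counts below are hypothetical and are not taken from any example in the deck.

```python
import math

def mcnemar_z(n12, n21):
    """Large-sample McNemar statistic based on the discordant pairs only."""
    return (n12 - n21) / math.sqrt(n12 + n21)

# Hypothetical discordant counts, for illustration only
z_obs = mcnemar_z(25, 10)
p_two_sided = math.erfc(abs(z_obs) / math.sqrt(2))  # equals 2 * P(Z >= |z_obs|)
print(f"z = {z_obs:.3f}, two-sided P-value = {p_two_sided:.4f}")
```

Pairs with the same outcome under both conditions carry no information about a difference, which is why they do not enter the statistic.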

Example - Reporting of Silicone Breast Implant Leakage in Revision Surgery
Subjects: 165 women having revision surgery involving silicone gel breast implants.
Conditions (each being observed on all women):
Self Report of Presence/Absence of Rupture/Leak
Surgical Record of Presence/Absence of Rupture/Leak
Source: Brown and Pennello (2002), “Replacement Surgery and Silicone Gel Breast Implant Rupture”, Journal of Women’s Health & Gender-Based Medicine, Vol. 11, pp. 255-264

Example - Reporting of Silicone Breast Implant Leakage in Revision Surgery H0: Tendency to report ruptures/leaks is the same for self reports and surgical records HA: Tendencies differ

Multinomial Experiment / Distribution
Extension of the Binomial Distribution to experiments where each trial can end in exactly one of k categories.
n independent trials.
The probability that a trial results in category i is pi.
ni is the number of trials resulting in category i.
p1+…+pk = 1
n1+…+nk = n

Multinomial Distribution / Test for Cell Probabilities
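
The formula for this slide was lost. The sketch below assumes the usual chi-square test of hypothesized cell probabilities, with expected counts E_i = n·p_i0 and statistic X² = Σ (n_i - E_i)² / E_i on k-1 degrees of freedom. The counts and null probabilities are hypothetical.

```python
import math

def multinomial_chi_square(counts, null_probs):
    """Chi-square statistic comparing observed counts with n * p_i0."""
    n = sum(counts)
    expected = [n * p for p in null_probs]
    x2 = sum((obs - exp) ** 2 / exp for obs, exp in zip(counts, expected))
    return x2, len(counts) - 1

# Hypothetical data: 100 trials, H0 probabilities 0.25, 0.50, 0.25
x2, df = multinomial_chi_square([30, 50, 20], [0.25, 0.50, 0.25])
# With exactly 2 degrees of freedom the upper-tail area is exp(-x/2)
p_value = math.exp(-x2 / 2)
print(f"X2 = {x2:.2f}, df = {df}, P-value = {p_value:.3f}")
```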

Goodness of Fit Test for a Probability Distribution
Data are collected and we wish to determine whether they come from a particular probability distribution (e.g. Poisson, Normal, Gamma).
Estimate any unknown model parameters (p estimates).
Break down the range of data values into k > p intervals (typically where ≥ 80% have expected counts ≥ 5).
Obtain observed (n) and expected (E) values for each interval.

Associations Between Categorical Variables Case where both explanatory (independent) variable and response (dependent) variable are qualitative Association: The distributions of responses differ among the levels of the explanatory variable (e.g. Party affiliation by gender)

Contingency Tables Cross-tabulations of frequency counts where the rows (typically) represent the levels of the explanatory variable and the columns represent the levels of the response variable. Numbers within the table represent the numbers of individuals falling in the corresponding combination of levels of the two variables Row and column totals are called the marginal distributions for the two variables

Example - Cyclones Near Antarctica
Period of Study: September 1973 - May 1975
Explanatory Variable: Region (40-49, 50-59, 60-79 degrees South latitude)
Response: Season (Aut(4), Wtr(5), Spr(4), Sum(8)), with the number of months in parentheses
Units: Cyclones in the study area
The observed cyclones are treated as a “random sample” of all cyclones that could have occurred.
Source: Howarth (1983), “An Analysis of the Variability of Cyclones around Antarctica and Their Relation to Sea-Ice Extent”, Annals of the Association of American Geographers, Vol. 73, pp. 519-537

Example - Cyclones Near Antarctica
For each region (row) we can compute the percentage of storms occurring during each season, the conditional distribution. Of the 1517 cyclones in the 40-49 band, 370 occurred in Autumn, a proportion of 370/1517=.244, or 24.4% as a percentage.

Example - Cyclones Near Antarctica Graphical Conditional Distributions for Regions

Guidelines for Contingency Tables
Compute percentages for the response (column) variable within the categories of the explanatory (row) variable. Note that in journal articles, rows and columns may be interchanged.
Divide the cell counts by the row (explanatory category) total and multiply by 100 to obtain a percent. The row percents will add to 100.
Give a title and clearly define variables and categories.
Include row (explanatory) total sample sizes.

Independence & Dependence
Statistically Independent: Population conditional distributions of one variable are the same across all levels of the other variable.
Statistically Dependent: Conditional distributions are not all equal.
When testing, researchers typically wish to demonstrate dependence (alternative hypothesis) and to refute independence (null hypothesis).

Pearson’s Chi-Square Test
Can be used for nominal or ordinal explanatory and response variables.
Variables can have any number of distinct levels.
Tests whether the distribution of the response variable is the same for each level of the explanatory variable (H0: No association between the variables).
r = # of levels of explanatory variable
c = # of levels of response variable

Pearson’s Chi-Square Test
Intuition behind the test statistic:
Obtain the marginal distribution of outcomes for the response variable.
Apply this common distribution to all levels of the explanatory variable, by multiplying each proportion by the corresponding sample size.
Measure the difference between the actual cell counts and the expected cell counts from the previous step.

Pearson’s Chi-Square Test
Notation to obtain the test statistic:
Rows represent the explanatory variable (r levels)
Cols represent the response variable (c levels)

         1      2      …    c      Total
1        n11    n12    …    n1c    n1.
2        n21    n22    …    n2c    n2.
…        …      …      …    …      …
r        nr1    nr2    …    nrc    nr.
Total    n.1    n.2    …    n.c    n..

Pearson’s Chi-Square Test
Observed frequency (nij): The number of individuals falling in a particular cell.
Expected frequency (Eij): The number we would expect in that cell, given the sample sizes observed in the study and the assumption of independence. Computed by multiplying the row total and the column total, and dividing by the overall sample size: $E_{ij} = n_{i.} n_{.j} / n_{..}$. This applies the overall marginal probability of the response category to the sample size of the explanatory category.

Pearson’s Chi-Square Test
Large-sample test (at least 80% of Eij > 5)
H0: Variables are statistically independent (No association between variables)
Ha: Variables are statistically dependent (Association exists between variables)
Test Statistic: $X^2_{obs} = \sum_{i}\sum_{j} (n_{ij} - E_{ij})^2 / E_{ij}$
P-value: Area above $X^2_{obs}$ in the chi-squared distribution with (r-1)(c-1) degrees of freedom. (Critical values in Table 8)
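
The expected counts and the test statistic can be sketched in a few lines. Because the full cyclone table is not in the transcript, the example reuses the 2x2 firm type/product quality table from earlier in the deck; that pairing is a choice made here for illustration, not something the slides do.

```python
import math

def pearson_chi_square(table):
    """Pearson X2 with E_ij = (row total * column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

# Rows: Not Integrated, Vertically Integrated; columns: High, Low quality
x2, df = pearson_chi_square([[33, 55], [5, 79]])
# With 1 degree of freedom, P(chi-square >= x) = P(|Z| >= sqrt(x))
p_value = math.erfc(math.sqrt(x2 / 2))
print(f"X2 = {x2:.2f}, df = {df}, P-value = {p_value:.2e}")
```

For a 2x2 table, X² equals the square of the pooled two-proportion z statistic, so the two tests agree.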

Example - Cyclones Near Antarctica Observed Cell Counts (nij): Note that overall: (1876/9165)100%=20.5% of all cyclones occurred in Autumn. If we apply that percentage to the 1517 that occurred in the 40-49S band, we would expect (0.205)(1517)=310.5 to have occurred in the first cell of the table. The full table of Eij:

Example - Cyclones Near Antarctica
Computation of the chi-square statistic

Example - Cyclones Near Antarctica
H0: Seasonal distribution of cyclone occurrences is independent of latitude band
Ha: Seasonal distributions of cyclone occurrences differ among latitude bands
Test Statistic: $X^2_{obs} = 71.2$
RR: $X^2_{obs} \geq \chi^2_{.05,6} = 12.59$
P-value: Area in the chi-squared distribution with (3-1)(4-1)=6 degrees of freedom above 71.2
From Table 8, $P(\chi^2 \geq 22.46) = .001$, so P < .001
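
The table lookups in this example can be checked without a chi-square table. For an even number of degrees of freedom the upper-tail area has a closed form, exp(-x/2) · Σ (x/2)^j / j! for j = 0, …, df/2 - 1. The sketch below uses that identity; it covers even df only, which is enough for the 6 degrees of freedom here.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(chi-square_df >= x) for even df, using the closed-form series."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j      # builds (x/2)^j / j!
        total += term
    return math.exp(-half) * total

area_at_critical = chi2_upper_tail_even_df(12.59, 6)  # slide: .05 critical value
area_at_22_46 = chi2_upper_tail_even_df(22.46, 6)     # slide: .001
p_value = chi2_upper_tail_even_df(71.2, 6)            # observed statistic
print(f"{area_at_critical:.4f}, {area_at_22_46:.4f}, {p_value:.2e}")
```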

Likelihood Ratio Statistic Note: The formula on page 512 of textbook is incorrect
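
The slide's own formula is not in the transcript. The sketch below assumes the standard likelihood ratio statistic G² = 2 Σ n_ij · ln(n_ij / E_ij), with the same expected counts and degrees of freedom as Pearson's test, and applies it to the 2x2 firm type/product quality table from earlier in the deck as an illustration.

```python
import math

def likelihood_ratio_statistic(table):
    """G2 = 2 * sum of n_ij * ln(n_ij / E_ij); cells with n_ij = 0 contribute 0."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / n
                g2 += observed * math.log(observed / expected)
    return 2 * g2

g2 = likelihood_ratio_statistic([[33, 55], [5, 79]])
print(f"G2 = {g2:.2f}")
```

G² and Pearson's X² are usually close in large samples; here both are far above the 1-df critical value of 3.84.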

SPSS Output - Cyclone Example P-value

Misuses of chi-squared Test
Expected frequencies too small (at least 80% of expected counts should be above 5; this is not necessary for the observed counts).
Dependent samples (the same individuals are in each row; see McNemar’s test).
Can be used for nominal or ordinal variables, but more powerful methods exist for when both variables are ordinal and a directional association is hypothesized.

Residual Analysis
Once dependence has been determined from a chi-squared test, we are often interested in determining which cells contributed.
Residual: fo-fe measures the difference between the observed and expected counts. A positive residual implies more were observed than expected.
A residual’s practical importance depends on the level of fe.
Adjusted Residual (computed for each cell): $(f_o - f_e) / \sqrt{f_e (1 - \text{row proportion})(1 - \text{column proportion})}$
Adjusted residuals above 3 in absolute value give strong evidence against independence in that cell.

Example - Cyclones Near Antarctica Adjusted residuals are computed in the following table. Row proportion for Region 40-49S: 1517/9165=0.1655 Column Proportion for Season Autumn is: 1876/9165=0.2047
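
The adjusted residual for the first cell (40-49S, Autumn) can be computed from the counts quoted in the deck: 370 observed, row total 1517, column total 1876, overall total 9165. This sketch assumes the standard form, (observed - expected) divided by sqrt(expected · (1 - row proportion) · (1 - column proportion)).

```python
import math

def adjusted_residual(observed, row_total, col_total, n):
    """Adjusted (standardized) residual for one cell of a contingency table."""
    expected = row_total * col_total / n
    row_prop = row_total / n
    col_prop = col_total / n
    return (observed - expected) / math.sqrt(expected * (1 - row_prop) * (1 - col_prop))

resid = adjusted_residual(370, 1517, 1876, 9165)
print(f"Adjusted residual = {resid:.2f}")
```

The value exceeds 3, so this cell shows more Autumn cyclones in the 40-49S band than independence would predict.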

Ordinal Explanatory and Response Variables
Pearson’s Chi-square test can be used to test associations among ordinal variables, but more powerful methods exist.
When theories exist that the association is directional (positive or negative), measures exist to describe and test for these specific alternatives to independence:
Gamma
Kendall’s τb

Concordant and Discordant Pairs
Concordant Pairs - Pairs of individuals where one individual scores “higher” on both ordered variables than the other individual.
Discordant Pairs - Pairs of individuals where one individual scores “higher” on one ordered variable and the other individual scores “higher” on the other.
C = # Concordant Pairs
D = # Discordant Pairs
Under positive association, expect C > D.
Under negative association, expect C < D.
Under no association, expect C ≈ D.

Example - Alcohol Use and Sick Days Alcohol Risk (Without Risk, Hardly any Risk, Some to Considerable Risk) Sick Days (0, 1-6, 7) Concordant Pairs - Pairs of respondents where one scores higher on both alcohol risk and sick days than the other Discordant Pairs - Pairs of respondents where one scores higher on alcohol risk and the other scores higher on sick days Source: Hermansson, et al (2003)

Example - Alcohol Use and Sick Days Concordant Pairs: Each individual in a given cell is concordant with each individual in cells “Southeast” of theirs Discordant Pairs: Each individual in a given cell is discordant with each individual in cells “Southwest” of theirs

Example - Alcohol Use and Sick Days

Measures of Association
Goodman and Kruskal’s Gamma: $\hat{\gamma} = (C - D)/(C + D)$
Kendall’s τb: also based on C - D, with a denominator that adjusts for tied pairs
When there’s no association between the ordinal variables, the population-based values of these measures are 0.
Statistical software packages provide these tests.
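
Counting concordant and discordant pairs follows the "Southeast"/"Southwest" rule described above: each cell is paired with every cell below and to its right (concordant) or below and to its left (discordant). The sketch below computes C, D, and gamma = (C - D)/(C + D) for a hypothetical 3x3 table with ordered rows and columns; the counts are not from the alcohol use study.

```python
def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in an ordered table."""
    n_rows, n_cols = len(table), len(table[0])
    conc = disc = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):       # rows below cell (i, j)
                for m in range(n_cols):
                    if m > j:                     # below and to the right
                        conc += table[i][j] * table[k][m]
                    elif m < j:                   # below and to the left
                        disc += table[i][j] * table[k][m]
    return conc, disc

# Hypothetical counts, for illustration only
table = [[10, 5, 2],
         [6, 8, 4],
         [3, 7, 9]]
C, D = concordant_discordant(table)
gamma = (C - D) / (C + D)
print(f"C = {C}, D = {D}, gamma = {gamma:.3f}")
```

Pairs tied on either variable are ignored by gamma, which is why it can be larger in magnitude than measures that adjust for ties.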

Example - Alcohol Use and Sick Days

Measures of Association
Absolute Risk (AR): p1-p2
Relative Risk (RR): p1 / p2
Odds Ratio (OR): o1 / o2, where o = p/(1-p)
Note that if p1 = p2 (no association between outcome and grouping variables): AR=0, RR=1, OR=1

Relative Risk
Ratio of the probability that the outcome characteristic is present for one group, relative to the other.
Sample proportions with the characteristic from groups 1 and 2: $\hat{p}_1 = y_1/n_1$, $\hat{p}_2 = y_2/n_2$

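Relative Risk
Estimated Relative Risk: $RR = \hat{p}_1 / \hat{p}_2$
95% Confidence Interval for Population Relative Risk: $RR \cdot \exp\left(\pm 1.96 \sqrt{(1-\hat{p}_1)/y_1 + (1-\hat{p}_2)/y_2}\right)$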

Relative Risk Interpretation
Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is above 1.
Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is below 1.
Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1.
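
A relative risk and its interval can be sketched as follows, assuming the usual log-scale interval RR · exp(±1.96 · sqrt((1-p̂1)/y1 + (1-p̂2)/y2)). Since the concussion table is not in the transcript, the example reuses the cotton textile counts from earlier in the deck (33 of 88 vs. 5 of 84 producing high quality output), purely for illustration.

```python
import math

def relative_risk_ci(y1, n1, y2, n2, z=1.96):
    """Estimated relative risk with a log-scale confidence interval."""
    p1, p2 = y1 / n1, y2 / n2
    rr = p1 / p2
    se_log = math.sqrt((1 - p1) / y1 + (1 - p2) / y2)  # SE of ln(RR)
    factor = math.exp(z * se_log)
    return rr, rr / factor, rr * factor

rr, rr_lower, rr_upper = relative_risk_ci(33, 88, 5, 84)
print(f"RR = {rr:.2f}, 95% CI = ({rr_lower:.2f}, {rr_upper:.2f})")
```

The interval is built on the log scale because the sampling distribution of ln(RR) is closer to normal than that of RR itself. Here the entire interval lies above 1.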

Example - Concussions in NCAA Athletes
Units: Game exposures among college soccer players, 1997-1999
Outcome: Presence/Absence of a Concussion
Group Variable: Gender (Female vs Male)
Contingency Table of case outcomes:
Source: Covassin, et al (2003)

Example - Concussions in NCAA Athletes There is strong evidence that females have a higher risk of concussion

Odds Ratio
The odds of an event is the probability it occurs divided by the probability it does not occur.
The odds ratio is the odds of the event for group 1 divided by the odds of the event for group 2.
Sample odds of the outcome for each group: $odds_i = \hat{p}_i / (1 - \hat{p}_i) = y_i / (n_i - y_i)$

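Odds Ratio
Estimated Odds Ratio: $OR = \dfrac{y_1 (n_2 - y_2)}{y_2 (n_1 - y_1)}$
95% Confidence Interval for Population Odds Ratio: $OR \cdot \exp\left(\pm 1.96 \sqrt{1/y_1 + 1/(n_1 - y_1) + 1/y_2 + 1/(n_2 - y_2)}\right)$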

Odds Ratio Interpretation
Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is above 1.
Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is below 1.
Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1.

Osteoarthritis in Former Soccer Players
Units: 68 former British professional football players and 136 age/sex matched controls
Outcome: Presence/Absence of Osteoarthritis (OA)
Data: Of n1=68 former professionals, y1=9 had OA, n1-y1=59 did not. Of n2=136 controls, y2=2 had OA, n2-y2=134 did not.
Interval > 1
Source: Shepard, et al (2003)
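
The odds ratio for this example can be computed directly from the four cell counts. This is a minimal sketch assuming the usual log-scale interval, OR · exp(±1.96 · sqrt(1/y1 + 1/(n1-y1) + 1/y2 + 1/(n2-y2))); the function name is illustrative.

```python
import math

def odds_ratio_ci(y1, n1, y2, n2, z=1.96):
    """Estimated odds ratio with a log-scale confidence interval."""
    a, b = y1, n1 - y1   # group 1: present, absent
    c, d = y2, n2 - y2   # group 2: present, absent
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    factor = math.exp(z * se_log)
    return odds_ratio, odds_ratio / factor, odds_ratio * factor

# Former professionals: 9 of 68 with OA; controls: 2 of 136 with OA
or_hat, or_lower, or_upper = odds_ratio_ci(9, 68, 2, 136)
print(f"OR = {or_hat:.2f}, 95% CI = ({or_lower:.2f}, {or_upper:.2f})")
```

The interval is wide because only 2 controls had OA, but it lies entirely above 1, consistent with the slide's "Interval > 1".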

Mantel-Haenszel Test / CI for Multiple Tables
Data are collected from q studies or strata in 2x2 contingency tables with common groupings/outcomes.
Each table has 4 cells: nh11, nh12, nh21, nh22 (h=1,…,q)
They can be combined for an overall Chi-square statistic or odds ratio and confidence interval.

Mantel-Haenszel Computations
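
The computations on this slide did not survive extraction. As a partial illustration, the sketch below shows only the standard Mantel-Haenszel estimate of the common odds ratio, Σ(nh11·nh22/nh) / Σ(nh12·nh21/nh), where nh is the total count in stratum h. The chi-square statistic and confidence interval are not covered, and the two strata are hypothetical.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio across 2x2 tables.

    Each stratum is a tuple (n11, n12, n21, n22).
    """
    numer = denom = 0.0
    for n11, n12, n21, n22 in strata:
        total = n11 + n12 + n21 + n22
        numer += n11 * n22 / total
        denom += n12 * n21 / total
    return numer / denom

# Hypothetical strata, for illustration only
strata = [(10, 20, 5, 25), (8, 12, 4, 16)]
or_mh = mantel_haenszel_or(strata)
print(f"Mantel-Haenszel OR = {or_mh:.3f}")
```

Weighting each table by its size keeps small strata from dominating the pooled estimate.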

Inter-Rater Agreement – Cohen’s Kappa Two Raters rate the same items, typically on an ordinal scale Goal: Measure Strength of their agreement above “chance”

Agreement Among Movie Reviewers
Reviews by Gene Siskel and Roger Ebert (160 movies from April 1995 through September 1996)
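
The reviewers' cross-classification table is not in the transcript, so the sketch below uses a hypothetical 3x3 agreement table. It assumes the standard definition of Cohen's kappa, (observed agreement - chance agreement) / (1 - chance agreement), where chance agreement is the sum over categories of the product of the two raters' marginal proportions.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square table (rows: rater 1, columns: rater 2)."""
    n = sum(sum(row) for row in table)
    row_props = [sum(row) / n for row in table]
    col_props = [sum(col) / n for col in zip(*table)]
    observed = sum(table[i][i] for i in range(len(table))) / n
    chance = sum(r * c for r, c in zip(row_props, col_props))
    return (observed - chance) / (1 - chance)

# Hypothetical ratings of 100 items, for illustration only
ratings = [[20, 5, 5],
           [5, 20, 5],
           [5, 5, 30]]
kappa = cohens_kappa(ratings)
print(f"kappa = {kappa:.3f}")
```

A kappa of 0 means agreement no better than chance, and 1 means perfect agreement.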