Lecture notes 10a: Inference for categorical variables


Lecture notes 10a: Inference for categorical variables

Outline:
The contingency table
Independence and dependence
Hypothesis test for independence
The chi-square distribution
The "expected" table
The test statistic
Worked example

Categorical variables

Recall that categorical variables take the form of labels or descriptions. Variables like gender, political affiliation, state of birth, and model of car are all categorical. Nominal categorical variables can be given in no particular order; all of the variables listed above are nominal. Ordinal categorical variables have an order they ought to be listed in. Any variable whose categories are ranked (from first to last, from best to worst, from strongest to weakest, etc.) is ordinal.

The contingency table

A contingency table is a convenient tool for displaying how two categorical variables relate to one another. One variable's different categories (often called "levels") are listed as the rows; the other's are listed as the columns. The last row and last column show totals for the rows and columns, and the bottom right corner shows the grand total (to which the row and column totals both sum). Each cell of the contingency table shows how many observations fall into that combination of categories.

A boring contingency table

Here is an exceptionally boring contingency table, showing how a deck of cards is divided into suits and colors. Note that the row totals and column totals both sum to the grand total of 52.

        Diamond   Heart   Club   Spade   Total
Red        13       13      0      0       26
Black       0        0     13     13       26
Total      13       13     13     13       52

Finding probabilities

Contingency tables allow us to easily compute probabilities. To find the probability that an observation falls into any particular row or column category, simply divide the total for that row or column by the grand total. Using our boring contingency table, we can compute:

P(Red) = 26/52 = 1/2
P(Diamond) = 13/52 = 1/4

Finding probabilities

We can also find what we call "conditional probabilities". Conditional probabilities are probabilities of certain events occurring, "given" that some other event has occurred. We write the conditional probability of A, given B, as:

P(A|B)

Finding probabilities

To find the conditional probability that an observation falls in a certain row (or column) category, given that it falls in a certain column (or row) category, make the denominator the total for the category that is "given", and the numerator the cell count for the intersection of the two categories. Examples:

P(Diamond|Red) = 13/26 = 1/2
P(Red|Diamond) = 13/13 = 1
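These calculations are easy to reproduce in software. Below is a minimal sketch in Python (assuming numpy is available; the code is illustrative and not part of the original notes):

```python
import numpy as np

# The card contingency table: rows are colors, columns are suits.
#                  Diamond  Heart  Club  Spade
table = np.array([[13,      13,    0,    0],    # Red
                  [0,        0,   13,   13]])   # Black

grand_total = table.sum()            # 52
row_totals = table.sum(axis=1)       # [26, 26]
col_totals = table.sum(axis=0)       # [13, 13, 13, 13]

# Marginal probabilities: category total / grand total.
print(row_totals[0] / grand_total)   # P(Red) = 0.5
print(col_totals[0] / grand_total)   # P(Diamond) = 0.25

# Conditional probabilities: cell count / total of the "given" category.
print(table[0, 0] / row_totals[0])   # P(Diamond|Red) = 0.5
print(table[0, 0] / col_totals[0])   # P(Red|Diamond) = 1.0
```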

A less boring contingency table

Here is a contingency table comparing the "Mediterranean diet" habits of respondents to their smoking status. The data come from a 2005 survey of men that investigated the longevity benefits of a Mediterranean diet.

                  Mediterranean Diet
Smoking Status     Low    Medium   High    Total
Never             2516     2920    2417     7853
Former            3657     4653    3449    11759
Current           2012     1627    1294     4933
Total             8185     9200    7160    24545

A less boring contingency table

We can use this to find the following probabilities:

P(Currently) = 4933/24545 ≈ 0.201
P(Never) = 7853/24545 ≈ 0.320
P(Currently|Low) = 2012/8185 ≈ 0.246
P(Never|High) = 2417/7160 ≈ 0.338
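The same arithmetic in Python (again a sketch assuming numpy; not part of the original notes). The observed table defined here is reused in later sketches:

```python
import numpy as np

# Observed counts: rows = smoking status (Never, Former, Current),
# columns = Mediterranean diet (Low, Medium, High).
observed = np.array([[2516, 2920, 2417],
                     [3657, 4653, 3449],
                     [2012, 1627, 1294]])

N = observed.sum()                     # 24545
row_totals = observed.sum(axis=1)      # [7853, 11759, 4933]
col_totals = observed.sum(axis=0)      # [8185, 9200, 7160]

print(row_totals[2] / N)               # P(Currently)     ≈ 0.201
print(row_totals[0] / N)               # P(Never)         ≈ 0.320
print(observed[2, 0] / col_totals[0])  # P(Currently|Low) ≈ 0.246
print(observed[0, 2] / col_totals[2])  # P(Never|High)    ≈ 0.338
```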

Estimating proportions

All of these probabilities are based on sample proportions, e.g. the proportion of people in our sample who have never smoked. We can perform hypothesis tests to see if a sample proportion differs significantly from some hypothesized value. We can also create confidence intervals for proportions. These techniques work in a nearly identical manner to hypothesis tests and confidence intervals for means. We will not cover them in this class, but they are commonly used, for instance in opinion polling.

Are our variables related?

We often want to know whether or not two categorical variables are related. By "are they related?", we mean "if an observation falls into a certain category for one variable, does this make it more likely to fall into a certain category of the other?" Another way of putting this is: "are our variables independent or dependent?" If they are independent, then knowing that an observation falls into a certain category of the row variable does not affect the probability that it will fall into any of the categories of the column variable, and vice versa.

Independence

If two events A and B are independent, then:

P(A|B) = P(A) and P(B|A) = P(B)

Otherwise they are dependent. Here is the logic behind this: if knowing that B occurred has no impact on the probability that A occurred, then P(A|B) = P(A), and vice versa. However, if knowing that B occurred does affect the probability that A occurred, then P(A|B) ≠ P(A).

Testing for independence

In our diet vs. smoking status example, it appears that the events "Currently" and "Low", as well as "Never" and "High", may be related. This is because

P(Currently) ≠ P(Currently|Low) and P(Never) ≠ P(Never|High)

There is some uncertainty in this statement, as this table is taken from a random sample, and any apparent relationship we observe between variables from a random sample could be due to chance.

Testing for independence

Still, this sample is large and the data is suggestive of a relationship. In order to account for the possibility that our results are due to random chance, we should perform an inferential procedure. The first type of analysis we will look at for categorical data is a formal statistical test for independence. We would like to use our sample data to determine whether we believe two variables are independent or dependent. First, we need to establish some notation.

Generic contingency table

Below is a "generic" 3x3 table. The subscripts on each "n" identify the row and column, respectively, of the appropriate cell. The dots mean "add them all up". The capital N denotes the grand total. This notation applies to tables of any dimension (2x2, 5x9, etc.). The last column and last row hold the totals.

n11   n12   n13   n1●
n21   n22   n23   n2●
n31   n32   n33   n3●
n●1   n●2   n●3   N

Hypothesis test for independence

We can now perform a hypothesis test for independence. The test we will look at is often called a "chi-square test for independence" or "Pearson's chi-squared test", after its creator, Karl Pearson. Our hypotheses are:

H0: The variables are independent
Ha: The variables are dependent

We now need to compute a test statistic that will tell us how much evidence we have against the null hypothesis.

Hypothesis test for independence

We do this by figuring out what our contingency table would look like if H0 were true. Then we compare this to our actual table. We calculate a test statistic that quantifies just how different these two tables are, and based on this test statistic we can compute a p-value. This p-value tells us how likely it is that we would observe the table we observed if the two variables were independent (i.e. if H0 were true). If this p-value is small enough, we reject H0.

The Χ2 distribution

The test statistic we calculate follows a distribution that we have not yet seen. It is called the chi-squared distribution or Χ2 distribution (hence the name of the test). The Χ2 distribution is created by adding up squared standard normal random variables. It has degrees of freedom (df) equal to however many squared standard normal random variables were used to create it.

The Χ2 distribution

Here are some examples: [Figure: Χ2 density curves for several values of df. The curves are right-skewed, and they shift to the right and flatten as df increases.]
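To see the definition in action, here is a small simulation sketch in Python (assuming numpy and scipy; illustrative only, not part of the original notes): summing df squared standard normals produces draws that match the Χ2(df) distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
df = 3

# 100,000 draws, each the sum of `df` squared standard normals.
sims = (rng.standard_normal((100_000, df)) ** 2).sum(axis=1)

print(sims.mean())          # ≈ 3: the mean of a Χ2(df) distribution is df
print(stats.chi2.mean(df))  # exactly 3.0

# A Kolmogorov-Smirnov test cannot distinguish the simulated draws
# from the theoretical Χ2(df) distribution (large p-value).
print(stats.kstest(sims, stats.chi2(df).cdf).pvalue)
```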

Degrees of freedom

We will compute a statistic based on the Χ2 distribution. Just as with the t-distribution, the Χ2 distribution looks different depending on the degrees of freedom (df). We calculate df using the formula:

df = (r − 1)(c − 1)

where r is the number of rows and c is the number of columns in the contingency table (not counting the totals).

The "expected" table

To determine if we have enough evidence to reject the null hypothesis of independence, we need to determine what our contingency table would look like if the two variables were perfectly independent. We do this using the following formula:

Eij = (ni● × n●j) / N

i.e. expected count = (row total × column total) / (grand total), where Eij is the expected count for row i, column j.

The "expected" table

Here is the generic form of an expected 3x3 table (the totals are unchanged):

n1●·n●1/N   n1●·n●2/N   n1●·n●3/N   n1●
n2●·n●1/N   n2●·n●2/N   n2●·n●3/N   n2●
n3●·n●1/N   n3●·n●2/N   n3●·n●3/N   n3●
n●1         n●2         n●3         N
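In code, the whole expected table can be built at once as an outer product of the row and column totals. A sketch in Python (numpy assumed; not part of the original notes), using the diet-vs-smoking table from earlier:

```python
import numpy as np

observed = np.array([[2516, 2920, 2417],
                     [3657, 4653, 3449],
                     [2012, 1627, 1294]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
N = observed.sum()

# np.outer forms every product n_i● × n_●j; dividing by N gives E_ij.
expected = np.outer(row_totals, col_totals) / N
print(expected.round(3))
```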

The test statistic

The test statistic we compute is a Χ2 statistic, because it follows the Χ2 distribution. The formula is:

Χ2 = Σ (O − E)² / E

where "O" stands for "observed count", "E" stands for "expected count", and the sum runs over every cell of the table. In other words, for each cell, we take the difference between the observed count and the expected count, square it, then divide by the expected count. Then we sum all of these numbers up, giving us the chi-square test statistic.
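A compact Python sketch of this computation (numpy assumed; illustrative only):

```python
import numpy as np

observed = np.array([[2516, 2920, 2417],
                     [3657, 4653, 3449],
                     [2012, 1627, 1294]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Chi-square statistic: sum over all cells of (O - E)^2 / E.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_stat, 2))   # ≈ 165.89
```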

One important assumption

On the previous slide, we noted that our test statistic follows a Χ2 distribution. This is actually only true "asymptotically", meaning that as our sample size gets larger, the distribution of our test statistic under H0 gets more and more like a theoretical Χ2 distribution. If our sample size is small, or there are categories with very few observations in them, this assumption is not satisfied, and the test is invalid.

One important assumption

When using any statistical model, we need to take care that the assumptions behind the model aren't being violated, or at least aren't being violated severely. A general rule of thumb is that the assumption is satisfied if there are no cells with expected counts less than 5. If this is not the case, there is an alternative test, but we will not discuss it in this class.
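This check is easy to automate. A small Python sketch (numpy assumed; the helper names are mine, not from the notes):

```python
import numpy as np

def expected_counts(observed):
    """E_ij = (row total * column total) / grand total."""
    observed = np.asarray(observed, dtype=float)
    return np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

def rule_of_thumb_ok(observed, minimum=5):
    """True when every expected count is at least `minimum`."""
    return bool((expected_counts(observed) >= minimum).all())

# The diet-vs-smoking table easily satisfies the rule of thumb.
print(rule_of_thumb_ok([[2516, 2920, 2417],
                        [3657, 4653, 3449],
                        [2012, 1627, 1294]]))   # True
```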

The p-value

Once we have our Χ2 test statistic, we find the p-value the same way we did for z-tests and t-tests: it is the area under the curve beyond the test statistic. For the chi-square distribution, we only calculate areas to the right of the test statistic. Suppose we have df = 5 and a test statistic of Χ2test = 12. [Figure: the Χ2 density with df = 5, with the area to the right of 12 shaded.] Note: this curve is the sampling distribution of the test statistic, assuming that the null hypothesis is true.
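In Python this right-tail area is the survival function of the Χ2 distribution (a sketch assuming scipy; not part of the original notes):

```python
from scipy.stats import chi2

# Area to the right of the test statistic under the chi-square density.
p_value = chi2.sf(12, df=5)   # sf is the survival function, 1 - cdf
print(p_value)                # ≈ 0.0348
```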

The statistical decision

As before, we make our statistical decision by comparing the p-value to the level of significance (α). If the p-value is less than α, we reject H0 and conclude that the two variables are dependent. Otherwise we FTR H0, i.e. we fail to reject that they are independent. We could also find the Χ2 critical value corresponding to the level of significance, which would separate our chi-square distribution into "reject" and "fail to reject" regions, and then see which region the test statistic falls into. In this class, we will only use the p-value.

Worked example

Let's apply this technique to the "Mediterranean diet vs. smoking status" example. Here again is the contingency table showing the observed counts:

                  Mediterranean Diet
Smoking Status     Low    Medium   High    Total
Never             2516     2920    2417     7853
Former            3657     4653    3449    11759
Current           2012     1627    1294     4933
Total             8185     9200    7160    24545

H0 and Ha

Any time we do a chi-square test for independence, our hypotheses will be of the general form:

H0: The variables are independent.
Ha: The variables are dependent.

Applied to this example, we have:

H0: Mediterranean diet and smoking status are independent.
Ha: Mediterranean diet and smoking status are dependent.

For α, we will use the standard 0.05.

The expected table

We use Eij = (ni● × n●j) / N to fill in the expected counts:

                     Mediterranean Diet
Smoking Status        Low               Medium            High          Total
Never            7853·8185/24545   7853·9200/24545   7853·7160/24545    7853
Former          11759·8185/24545  11759·9200/24545  11759·7160/24545   11759
Current          4933·8185/24545   4933·9200/24545   4933·7160/24545    4933
Total                 8185              9200              7160         24545

The expected table

Note that, although cell counts for actual data are discrete, we do not round our expected counts.

                  Mediterranean Diet
Smoking Status      Low       Medium      High      Total
Never             2618.733   2943.475   2290.792     7853
Former            3921.264   4407.529   3430.207    11759
Current           1645.003   1848.996   1439.001     4933
Total             8185       9200       7160        24545

The test statistic

Now we take the observed and expected counts and compute:

Χ2 = (2516 − 2618.733)²/2618.733 + (2920 − 2943.475)²/2943.475 + (2417 − 2290.792)²/2290.792
   + (3657 − 3921.264)²/3921.264 + (4653 − 4407.529)²/4407.529 + (3449 − 3430.207)²/3430.207
   + (2012 − 1645.003)²/1645.003 + (1627 − 1848.996)²/1848.996 + (1294 − 1439.001)²/1439.001

The test statistic

Now we take the observed and expected counts and compute:

Χ2 = 4.03 + 0.19 + 6.95 + 17.81 + 13.67 + 0.10 + 81.88 + 26.65 + 14.61 = 165.89

The p-value

We will use the Χ2cdf function on the calculator to compute the p-value. The syntax is the same as with tcdf. In our example, the test statistic is Χ2test = 165.89, and df = (3 − 1)(3 − 1) = 4.

p-value = Χ2cdf(165.89, 1E99, 4) ≈ 8×10⁻³⁵, which is essentially 0.
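The whole test can also be checked in Python with scipy's built-in routine (a sketch; scipy assumed, not part of the original notes):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[2516, 2920, 2417],
                     [3657, 4653, 3449],
                     [2012, 1627, 1294]])

# Returns the statistic, p-value, degrees of freedom, and expected table.
stat, p_value, df, expected = chi2_contingency(observed)
print(round(stat, 2))   # ≈ 165.89
print(df)               # 4 = (3 - 1)(3 - 1)
print(p_value)          # ≈ 8e-35, effectively 0
```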

The statistical decision

Because the p-value is tiny, we reject H0 and conclude that Mediterranean diet habits and smoking status are dependent. It is important to remember that "dependent" in this sense just means that knowing something about one variable gives you information about the probability of the other. It does not imply that one causes the other! It could be that Mediterranean diet habit affects smoking status, or that smoking status affects Mediterranean diet habit, or that other, confounding variables affect both of these things. In other words: correlation does not imply causation!

Sources of dependence

You can see which cells contributed the most to the Χ2 test statistic by looking at which ones have the largest value of (O − E)²/E. In this case, "Low and Currently" and "Medium and Currently" contributed the biggest values (81.88 and 26.65, respectively). This means that these cells had the largest differences between observed and expected counts.
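A quick way to inspect these contributions in Python (numpy assumed; illustrative only):

```python
import numpy as np

observed = np.array([[2516, 2920, 2417],
                     [3657, 4653, 3449],
                     [2012, 1627, 1294]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Per-cell contributions to the chi-square statistic: (O - E)^2 / E.
contributions = (observed - expected) ** 2 / expected
print(contributions.round(2))
# The largest values sit in the "Current" row: ≈ 81.88 (Low) and 26.65 (Medium).
```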

Sources of dependence

Here, the observed count for "Low and Currently" was much higher than what we would expect under independence (2012 vs. 1645). Also, the observed count for "Medium and Currently" was much lower than what we would expect under independence (1627 vs. 1849). This suggests that the strongest relationship between Mediterranean diet and smoking status occurs in these categories. Perhaps for other categories (e.g. Former and High), there is not as strong a relationship.

Comparing specific categories

In the next set of notes, we will further examine the "sources of dependence" by introducing a way to compare specific categories from two variables with one another. This is useful because sometimes the Χ2 test for independence finds two variables to be statistically significantly dependent, when in reality all of the dependence exists in only two categories. In the example we just saw, it appears that almost all of the statistical dependence is accounted for by the "Currently" and "Low / Medium" categories. We will introduce the "odds ratio" next, which will allow us to make these comparisons in more detail.