Chapter 10 Analyzing Categorical Variables and Interpreting Research.

Chapter 10 Topics
Explore associations between categorical variables
Discuss important considerations when reading research papers

Section 10.1 The Basic Ingredients for Testing with Categorical Variables
Identify the Basic Ingredients for Testing with Categorical Variables

Introductory Example: Fair Die
Suppose we want to determine whether a standard six-sided die is fair. In a perfect world, a fair die would produce each of its six outcomes equally often.

We roll a die 60 times and record the number of spots. The outcomes are shown in the table and graph below.

We can see that our outcomes were not exactly what we would expect in a perfect world. We will use a statistic, called the chi-square statistic, to compare the real outcomes with the expected outcomes. We use the chi-square distribution to find p-values that tell us whether we should be suspicious that our outcomes are not matching our expectations.

Contingency Table (Two-Way Table)
A summary table that displays frequencies for outcomes when two categorical variables are analyzed. Even though there are numbers in the table, these numbers are summaries of variables whose values are categories.

Example: Contingency Table
A sample of US adults and members of the American Association for the Advancement of Science (AAAS) were asked by the Pew Poll, “Is it safe to eat genetically modified foods?” The results are shown in the contingency table. (We assumed the sample size was 100 for each group.)

         US Adults   AAAS Scientists
Yes          37             88
No           63             12

Expected Counts
The expected counts are the numbers of observations we would see in each cell of the contingency table if the null hypothesis were true.

Example: Expected Counts
In our previous example of the fair die, if the die is rolled 60 times, we would expect 10 of each outcome. The table below shows the expected counts and the observed counts from the experiment.
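The comparison between observed and expected counts can be sketched in a few lines of Python. The rolls below are simulated with a fixed seed, so they are hypothetical data and not the counts recorded in the original experiment. The function name `compare_with_fair_die` is likewise an illustrative choice.

```python
import random
from collections import Counter

def compare_with_fair_die(rolls):
    """Return observed counts, the fair-die expected count, and their differences."""
    counts = Counter(rolls)
    observed = [counts.get(face, 0) for face in range(1, 7)]
    expected = len(rolls) / 6  # a fair die spreads the rolls evenly over six faces
    differences = [o - expected for o in observed]
    return observed, expected, differences

# Hypothetical data: 60 simulated rolls, not the counts from the original experiment.
rng = random.Random(2024)  # fixed seed so the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(60)]
observed, expected, differences = compare_with_fair_die(rolls)

print("expected count per face:", expected)
for face, (o, d) in enumerate(zip(observed, differences), start=1):
    print(f"face {face}: observed {o}, observed - expected = {d:+.0f}")
```

The differences always sum to zero, which is one reason the chi-square statistic squares them before adding them up.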

Example: Finding Expected Counts
In the Pew Research poll, there are two categorical variables: Background (US Adult or scientist) and Belief in Safety of GMO (Yes/No). What counts should we expect if these variables are truly not related to each other?

         US Adults   AAAS Scientists
Yes          37             88
No           63             12

Starting with the GMO (Yes/No) variable: 125/200 (0.625) of the sample said “Yes” and 75/200 (0.375) of the sample said “No.” If GMO (Yes/No) is independent of background, then we should expect the same percentage of US adults and AAAS scientists to say “Yes” and the same percentage to say “No.”

There were 100 US Adults in the sample, so we expect 0.625(100) = 62.5 to say “Yes” and 0.375(100) = 37.5 to say “No.” Because the sample of AAAS Scientists is the same size as the sample of US Adults (100 in each group), we expect the same numbers of scientists to say “Yes” and “No.”

Contingency Table Showing Observed and Expected Counts
The table below shows the actual counts and the expected counts (in parentheses):

         US Adults    AAAS Scientists   Total
Yes      37 (62.5)    88 (62.5)         125
No       63 (37.5)    12 (37.5)          75
Total    100          100               200

Notes about Expected Counts
In this example we started with the GMO variable. We could also have started with the background variable and computed the expected counts using the background percentages. For example, 50% of the sample were AAAS Scientists, so we would expect 50% of those saying “Yes” to be scientists. If the expected counts are computed this way, the results are exactly the same. The formula $E = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$ can also be used to compute the expected counts, but in practice the expected counts are computed using technology.
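As a minimal sketch of how technology might carry out this step, the function below multiplies each row total by each column total and divides by the grand total. The name `expected_counts` and the nested-list layout of the table are assumptions made for illustration.

```python
def expected_counts(observed):
    """Expected counts for a two-way table, assuming the two variables are unrelated."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Each cell: (row total) * (column total) / (grand total)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Rows are Yes / No; columns are US Adults / AAAS Scientists.
gmo_observed = [[37, 88], [63, 12]]
gmo_expected = expected_counts(gmo_observed)
print(gmo_expected)  # [[62.5, 62.5], [37.5, 37.5]]
```

The output agrees with the 62.5 and 37.5 found from the percentages.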

The Chi-Square Statistic
The chi-square statistic measures how much our observed counts differ from our expected counts. The formula for the chi-square statistic is $X^2 = \sum \frac{(O - E)^2}{E}$, where O is the observed count in each cell, E is the expected count in each cell, and $\sum$ means to add the results from every cell.

Example: Chi-Square Statistic
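In our GMO example,
$X^2 = \frac{(37 - 62.5)^2}{62.5} + \frac{(88 - 62.5)^2}{62.5} + \frac{(63 - 37.5)^2}{37.5} + \frac{(12 - 37.5)^2}{37.5} = 10.404 + 10.404 + 17.34 + 17.34 \approx 55.49$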
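The GMO statistic can be checked directly from the observed and expected counts given earlier. This is a sketch only: the function name `chi_square_statistic` is hypothetical, and the tables are entered by hand as nested lists.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over every cell of the table."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

# Rows are Yes / No; columns are US Adults / AAAS Scientists.
gmo_observed = [[37, 88], [63, 12]]
gmo_expected = [[62.5, 62.5], [37.5, 37.5]]
gmo_statistic = chi_square_statistic(gmo_observed, gmo_expected)
print(round(gmo_statistic, 3))  # 55.488
```

A statistic of zero would mean the observed counts match the expected counts exactly. Larger values mean a larger mismatch.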

Finding the P-Value for the Chi-Square Statistic
The p-value is found using the chi-square distribution. The chi-square distribution allows only non-negative values for the test statistic and is right skewed. Like the t-distribution, the shape of the chi-square distribution depends on a number called the degrees of freedom.

The Chi-Square Distribution
The chi-square distribution provides a good approximation to the sampling distribution of the chi-square statistic only if the sample size is large enough (if each expected count is five or higher). In practice, technology is used to compute the chi-square statistic and the accompanying p-value.
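For readers curious about what the technology is doing, here is a standard-library sketch of the two steps just described: checking that every expected count is at least five, and finding the upper-tail area of the chi-square distribution. Both function names are hypothetical. The tail area is built from a standard recurrence for the incomplete gamma function and assumes whole-number degrees of freedom. A statistics library such as SciPy would normally be used instead.

```python
import math

def expected_counts_large_enough(expected, minimum=5):
    """Large-sample check: every expected count must be at least `minimum`."""
    return all(e >= minimum for row in expected for e in row)

def chi_square_p_value(statistic, df):
    """Upper-tail area P(X2 >= statistic) for whole-number degrees of freedom."""
    if df < 1:
        raise ValueError("degrees of freedom must be a positive integer")
    if statistic <= 0:
        return 1.0
    y = statistic / 2
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-y)              # exact tail area for df = 2
    else:
        a, tail = 0.5, math.erfc(math.sqrt(y))   # exact tail area for df = 1
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return tail

# GMO example: statistic 55.488 from a 2 x 2 table, so df = (2 - 1)(2 - 1) = 1.
gmo_expected = [[62.5, 62.5], [37.5, 37.5]]
print(expected_counts_large_enough(gmo_expected))  # True
print(chi_square_p_value(55.488, 1) < 0.0001)      # True
```

As a sanity check, a statistic of 3.841 with one degree of freedom gives a tail area of about 0.05, which matches the familiar critical value.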

GMO Example: Conclusion
Chi-square statistic: X2 ≈ 55.49
p-value: < 0.0001
US Adults and AAAS scientists differ in their support of GMO foods.

Section 10.2 Chi-Square Tests for Associations between Categorical Variables
Use the Chi-Square Test to Determine Whether There Is an Association between Two Categorical Variables

Two Tests for Association
There are two tests to determine whether two categorical variables are associated. Which test you use depends on how the data were collected. Both methods use two-way tables to display data. Both are conducted in similar ways.

Test of Homogeneity
Collect two (or more) independent samples, one from each population. Each object sampled has a categorical value that is recorded.

Test of Homogeneity
Example: Collect a random sample of men and a random sample of women. Ask each person sampled whether they agree that global warming is a serious problem. In this example we have two samples, one categorical response variable (opinion), and one categorical grouping variable (gender).

Test of Independence
Collect only one sample. For objects in the sample we record two categorical response variables.
Example: Collect a large sample of people and record their marital status and income.

Similarities in the Two Approaches
In both situations we are interested in knowing whether the two categorical variables are related or unrelated. Both tests use the same chi-square test statistic and the same chi-square distribution to find the p-value.

Example: Homogeneity or Independence?
A polling organization asks a random sample of people for their party affiliation (Democrat, Republican, or other) and whether they think the minimum wage should be raised. If the organization wanted to test whether party affiliation and opinion on minimum wage are associated, would this be a test of homogeneity or independence?

This is a test of independence because only one sample was collected and two categorical variables (party affiliation and opinion on minimum wage) were recorded for each member of the sample.

In 2013 the Pew Organization surveyed adults in eight countries that had legalized same-sex marriage, asking the question, “Should homosexuality be accepted?” If the organization wanted to investigate whether country of origin and opinion are associated, would this be a test of homogeneity or independence?

This is an example of a test of homogeneity. Eight independent samples are collected (a sample from each of eight countries) and a single categorical variable is recorded for each member of the sample (the response to the question “Should homosexuality be accepted?”).

Tests of Homogeneity and Independence
Hypothesize
H0: There is no association between the two variables (the variables are independent).
Ha: There is an association between the two variables (the variables are not independent).

Prepare
Random samples
Independent samples and observations
Large samples: The expected counts must be five or more in each cell of the table.

Compute to Compare
Test statistic is $X^2 = \sum \frac{(O - E)^2}{E}$
Degrees of freedom = (#rows – 1)(#columns – 1)
The p-value comes from the X2 distribution.
Technology can be used to compute the test statistic and p-value.

Interpret
If the p-value is less than or equal to the significance level, we reject H0 and conclude there is an association between the variables. Otherwise we do not reject H0 and we cannot conclude there is an association between the variables.
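The four steps can be strung together in one short function. What follows is a sketch under stated assumptions, not the output of any particular package. The function name `chi_square_test`, the dictionary it returns, and the default significance level of 0.05 are all illustrative choices. The demonstration reuses the GMO table from Section 10.1.

```python
import math

def chi_square_test(observed, alpha=0.05):
    """Chi-square test of association for a two-way table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    if df < 1:
        raise ValueError("the table needs at least two rows and two columns")

    # Prepare: expected counts under H0 and the large-sample check.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    large_enough = all(e >= 5 for row in expected for e in row)

    # Compute to compare: the chi-square statistic.
    statistic = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

    # Upper-tail chi-square area via Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    y = statistic / 2
    a, p_value = (1.0, math.exp(-y)) if df % 2 == 0 else (0.5, math.erfc(math.sqrt(y)))
    while a < df / 2:
        p_value += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1

    # Interpret: reject H0 when the p-value is at or below the significance level.
    return {
        "expected": expected,
        "large_enough": large_enough,
        "statistic": statistic,
        "df": df,
        "p_value": p_value,
        "reject_h0": p_value <= alpha,
    }

# Rows are Yes / No; columns are US Adults / AAAS Scientists.
result = chi_square_test([[37, 88], [63, 12]])
print(result["df"], round(result["statistic"], 3), result["reject_h0"])  # 1 55.488 True
```

The same function applies to a test of homogeneity and a test of independence, because the two tests differ only in how the data were collected.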

Example: Republican Views on Global Warming
The Yale Project on Climate Change investigated views on global warming among the Republican Party. Republicans surveyed identified themselves as Liberal, Moderate, Conservative, or Tea Party Republicans and also answered the question, “Do you believe global warming is happening?” The results are shown in the following two-way table.

Is this a test of homogeneity or independence? Run a test to see if there is an association between type of Republican and opinion. The report on this survey is titled “Not All Republicans Think Alike about Global Warming.” Do the results of the hypothesis test support this headline?

“Do you believe global warming is happening?”

         Liberal Republican   Moderate   Conservative   Tea Party
Yes              72              335         483           120
No               34              205         788           292

Hypothesize
H0: Republican type and opinion are independent.
Ha: Republican type and opinion are not independent.
Prepare
Samples are random and independent. Check on the technology output that all expected counts are greater than or equal to 5.
StatCrunch: Stats > Table > Contingency > with summary

All expected counts (in parentheses) are greater than or equal to 5.
Compute to Compare
Test statistic: X2 ≈ 151.6 (degrees of freedom = 3)
p-value: < 0.0001
Interpret
Reject H0. There is an association between type of Republican and opinion.

This was a test of independence. There is an association between Republican type and opinion. The study supports the headline, “Not all Republicans Think Alike about Global Warming.”

To Run a Test of Independence Using a TI-84 Calculator
To run a test of independence on the TI-84 calculator:
Enter your data into a Matrix.
Push STAT > TESTS, then select the X2 Test option.
Press Calculate.
The X2 test statistic and p-value will be displayed.

Example: Education and Marital Status
Does a person’s educational level affect his or her decision about marrying? A sample of 665 people was taken. Their marital status and educational level were recorded. The data is shown in the table. Are the variables marital status and educational level independent?

The Data: Education and Marital Status
             College or higher    HS    Less HS
Divorced            15            59      10
Married             98           240      70
Single              27            68      17
Widow/er             3            30      28

Hypothesize
H0: Marital status and educational level are independent.
Ha: Marital status and educational level are associated.
Prepare
We use technology to compute the test statistic, p-value, and expected counts. We need to check that the expected counts are all five or more.

All expected counts (in parentheses) are greater than or equal to 5.
Compute to Compare
Test statistic: X2 = 39.97
p-value: p < 0.0001

Interpret
We reject H0 because the p-value is small. Marital status and educational level are associated.
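The reported values for this example can be reproduced from the table of counts. The sketch below is a hand-rolled check, not the technology used to produce the original output, and its variable names are illustrative. It uses the fact that a 4 × 3 table has (4 – 1)(3 – 1) = 6 degrees of freedom, and that for an even number of degrees of freedom the chi-square tail area has a simple closed form.

```python
import math

# Rows: Divorced, Married, Single, Widow/er. Columns: College or higher, HS, Less HS.
observed = [[15, 59, 10], [98, 240, 70], [27, 68, 17], [3, 30, 28]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]
smallest_expected = min(e for row in expected for e in row)

statistic = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# For even df, the tail area is exp(-y) * (1 + y + y**2/2! + ...), with df/2 terms.
y = statistic / 2
p_value = math.exp(-y) * sum(y ** i / math.factorial(i) for i in range(df // 2))

print("sample size:", n)                                   # 665
print("smallest expected count:", round(smallest_expected, 2))
print("X2:", round(statistic, 2), "df:", df)
print("p-value below 0.0001:", p_value < 0.0001)
```

The largest single contribution to the statistic comes from the Widow/er, Less HS cell, where 28 people were observed but only about 11 were expected.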

Drawback of the Chi-Square Test
The chi-square test reveals only whether two variables are associated, not how they are associated. When both categorical variables have only two categories, the data can instead be analyzed using a two-proportion z-test, which gives more information on how the variables are associated.

Example: AIDS Vaccine
In a study of a potential AIDS vaccine, 8200 volunteers were randomly assigned to receive a vaccine against AIDS and another 8200 to receive a placebo. The number in each group who had contracted AIDS at the end of 3 years was recorded. The data is shown in the following table.

           Vaccine   No Vaccine   Total
AIDS            51           74     125
No AIDS       8149         8126   16275
Total         8200         8200   16400

Example: AIDS Vaccine
A chi-square test could be used to determine if vaccine and AIDS are independent. The conclusion of this test could tell us there is an association between the variables but not how they are associated. What the researchers want to know is if the vaccine is effective.

Example: AIDS Vaccine
Because both categorical variables, Vaccine and AIDS, have only two outcomes (yes/no), the data can be analyzed using a two-proportion z-test. By testing the hypotheses
H0: p_vaccine = p_placebo
Ha: p_vaccine < p_placebo
where p is the proportion in each group who contracted AIDS, the researchers can determine the direction of the effect; in other words, whether the vaccine was effective.
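A rough sketch of the two-proportion z-test for these data is shown below. It assumes the usual pooled standard error under H0 and a one-sided alternative (the vaccine group has the smaller proportion). The function name `two_proportion_z_test` is hypothetical, and the slides do not state a significance level, so none is applied here.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """One-sided test of H0: p1 = p2 against Ha: p1 < p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = NormalDist().cdf(z)  # lower-tail area, because Ha says p1 < p2
    return z, p_value

# Group 1: vaccine (51 of 8200 contracted AIDS). Group 2: placebo (74 of 8200).
z, p_value = two_proportion_z_test(51, 8200, 74, 8200)
print("z =", round(z, 3))
print("one-sided p-value =", round(p_value, 4))
```

The negative sign of z carries the information the chi-square test cannot: the proportion contracting AIDS was lower in the vaccine group than in the placebo group.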

Section 10.3 Reading Research Papers
Discuss Methods to Critically Evaluate Published Research

Some Guiding Principles
Pay attention to how randomness is used.
Don’t rely solely on the conclusions of any single paper.
Extraordinary claims require extraordinary evidence.
Be wary of conclusions based on very complex statistical or mathematical models.
Stick to peer-reviewed journals.

Random vs. Non-Random Assignment
Randomness in study design makes certain inferences possible and others not possible.

Reading Abstracts
An abstract is a short paragraph at the beginning of a research article that describes its basic findings. It often includes a description of:
The methods used in the study
The results of the study
The conclusions of the researchers

Evaluating Abstracts
When reading an abstract, answer these questions:
What is the research question that the investigators are trying to answer?
What is their answer to the research question?
What were the methods they used to collect data?
Is the conclusion appropriate for the methods used to collect data?
To what population do the conclusions apply?
Have the results been replicated in other articles?
Are the results consistent with what other researchers have suggested?

Beware of Data Dredging
Data Dredging: the practice of stating a hypothesis after first looking at the data. This makes it more likely that we will mistakenly reject the null hypothesis. If we first look at the test statistic to decide what the hypothesis should be, we are rigging the system in favor of the alternative hypothesis.

Beware of Publication Bias
Publication Bias: most scientific and medical journals prefer to publish “positive” findings, those in which the null hypothesis is rejected. If a journal favors positive findings over negative findings, then we may read only the studies that find a drug works (for example), even if the vast majority of researchers came to the opposite conclusion in studies that went unpublished.

Beware of Profit Motive
Much statistical research is now paid for by corporations that hope their products make life better for people. Sometimes the corporation funding the research can influence whether or not results get published. Always evaluate the methods used in the study and decide whether those methods are sound.

Beware of the Media
Media often use “catchy” headlines that do not always capture the true spirit of the study. The most common problem is that the headlines often suggest a cause-and-effect relationship even though such a conclusion is not supported by the data.

Clinical vs. Statistical Significance
An outcome of an experiment or study that is large enough to have a real effect on people’s health or lifestyle is said to have clinical significance. Sometimes researchers discover that a treatment effect is statistically significant (meaning the observed effect is too large to be plausibly explained by chance alone) but too small to be meaningful (so it is not clinically significant).

Example: Clinically vs. Statistically Significant
A rare disease affects only one person in 10 million. A controlled experiment finds that a new drug “significantly reduces your risk of getting this disease.” Given that the disease is so rare, is it worth producing the drug to lower your chance of getting the disease from one in 10 million to one in 20 million? This is a case where the treatment may not be clinically significant.
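The arithmetic behind this example can be made explicit. The short sketch below uses exact fractions to compare the relative reduction in risk with the absolute reduction. The variable names are illustrative, and the two risks are the ones quoted in the example.

```python
from fractions import Fraction

risk_without_drug = Fraction(1, 10_000_000)
risk_with_drug = Fraction(1, 20_000_000)

absolute_reduction = risk_without_drug - risk_with_drug
relative_reduction = absolute_reduction / risk_without_drug
people_treated_per_case_prevented = 1 / absolute_reduction

print(relative_reduction)                 # 1/2
print(absolute_reduction)                 # 1/20000000
print(people_treated_per_case_prevented)  # 20000000
```

The drug halves the risk, which sounds impressive, yet on average 20 million people would have to take it to prevent a single case.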
