Looking at Data - Relationships Data analysis for two-way tables


1 Looking at Data - Relationships Data analysis for two-way tables
IPS Chapter 2.5 © 2009 W.H. Freeman and Company

2 Objectives (IPS Chapter 2.5)
Data analysis for two-way tables: two-way tables; joint distributions; marginal distributions; relationships between categorical variables; conditional distributions; Simpson’s paradox.

3 Charting Categorical Data
RECALL: For two quantitative variables, we usually graph them using a scatterplot. If you have one categorical and one quantitative variable, we use the various graphs discussed in chapter 1 (e.g. histograms, bar charts, etc.). If you have two categorical variables, we usually display them using “two-way tables”.

4 Two-way tables for Categorical Data Factors & Levels
When examining the relationship between two categorical variables, we can’t use a scatterplot. However, we can use a plain old-fashioned “two-way table”. Factors: In this table, we have two factors: age and education. Levels: Age has three levels, and education has four levels. (First factor: age. Second factor: education. Data obtained from the 2000 U.S. Census.)

5 Respect tables! Treat tables with respect! While they may seem simple at first glance, they sometimes contain all kinds of information that may not be apparent without some careful examination. By the same token, watch out for traps – there are all kinds of ways of misinterpreting tables!!

6 Frequency Table When each cell in the table contains a simple frequency count, we call it a frequency table.

7 Marginal Frequencies Sometimes, we include the total for each row and each column in the “margins”. These numbers are known as marginal frequencies, and they are useful to include in your tables. Marginal frequencies are sometimes expressed as percentages (though not in this case). (Data: 2000 U.S. census.)

8 As always, YOU get to decide how to look at the data
For example, you might want to look at each of the two marginal distributions separately. In that case, you might create a separate bar graph for each. In the case shown here, the authors made it even easier to compare the levels within each factor by converting each frequency to a percentage of the total, e.g. 58,077 / 175,230 = 33.1%.

9 Conditional Frequency Table
For each individual cell, we sometimes compute a proportion by dividing the cell count by the sum of all values in the original table. A new table with the collection of these proportions is called a conditional frequency table here (because every cell is divided by the grand total, this is the joint distribution of the two variables). In this example, the 25-34/No-H.S. cell is calculated by dividing 4,459 by 175,230 to give us 0.0254. In other words, of all the people in this study, 2.54% fell into the 25-34, no HS category. Proportions can be shown for the entire table (as shown here), or computed within a particular row or column.
           25-34    35-54    >55
No HS      0.0254   0.0524   0.0812
HS         0.0660   0.1510   0.1145
<4 Coll    0.0610   0.1292   0.0635
4+ Coll    0.0632   0.1322   0.0605
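The relationship between these whole-table proportions and the marginal distributions can be checked with a short Python sketch. It uses only the rounded proportions from the table above plus the single count quoted in the text; the variable names are illustrative, and because the inputs are rounded the results agree with the slide figures only to about three decimal places.

```python
# Proportions of the grand total, transcribed from the table above.
ages = ["25-34", "35-54", ">55"]
joint = {
    "No HS":   [0.0254, 0.0524, 0.0812],
    "HS":      [0.0660, 0.1510, 0.1145],
    "<4 Coll": [0.0610, 0.1292, 0.0635],
    "4+ Coll": [0.0632, 0.1322, 0.0605],
}

# The one cell whose raw count is quoted in the text: 4,459 out of 175,230.
cell = 4459 / 175230

# Marginal distribution of education: sum across each row.
education_marginal = {edu: sum(row) for edu, row in joint.items()}

# Marginal distribution of age: sum down each column.
age_marginal = {
    age: sum(row[i] for row in joint.values())
    for i, age in enumerate(ages)
}

# All cells together should account for (approximately) everyone.
grand_total = sum(education_marginal.values())

print(f"25-34 / No HS cell: {cell:.4f}")
for edu, p in education_marginal.items():
    print(f"{edu:8s} {p:.3f}")
for age, p in age_marginal.items():
    print(f"{age:8s} {p:.3f}")
print(f"sum of all cells: {grand_total:.4f}")
```

The HS row sums to roughly 0.33, which is consistent with the 58,077 / 175,230 calculation on the marginal-distribution slide.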

10 Conditional Frequency by Level
Sometimes we wish to look at conditional frequencies for each level of a factor. For example, in the table below, the 25 to 34 age group occupies the first column. To find the conditional distribution of education in this age group (i.e. for this particular level), look only at that column. Compute each count as a percent of the column total. (Next slide). These percents should add up to 100% because all persons in this age group fall into one of the education categories. These four percents together are “the conditional distribution of education, given the 25 to 34 age group” – see next slide. 2000 U.S. census

11 Conditional distributions
For example, the percentage of college graduates given the 25 to 34 age group is 29.30%. The percentage of college graduates given the 35 to 54 age group is 28.44%. And so on. Note that the conditional distributions in this particular table were calculated within each age category. If you were interested in conditioning on the education category, you’d need to create a separate table. Here the percents are calculated by age range (columns). 29.30% = 11,071 / 37,785 = cell total / column total.
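A minimal Python sketch of the column and row calculations follows. It starts from the rounded whole-table proportions shown on the earlier slide rather than the raw census counts, so the percentages match the slide values only to about three significant digits; the function names are illustrative.

```python
# Whole-table proportions from the earlier slide (rows: education, columns: age).
ages = ["25-34", "35-54", ">55"]
education = ["No HS", "HS", "<4 Coll", "4+ Coll"]
joint = [
    [0.0254, 0.0524, 0.0812],
    [0.0660, 0.1510, 0.1145],
    [0.0610, 0.1292, 0.0635],
    [0.0632, 0.1322, 0.0605],
]


def conditional_given_column(table, col):
    """Distribution of the row variable within one column (column percents)."""
    column = [row[col] for row in table]
    total = sum(column)
    return [value / total for value in column]


def conditional_given_row(table, row):
    """Distribution of the column variable within one row (row percents)."""
    values = table[row]
    total = sum(values)
    return [value / total for value in values]


# Education given age: each column is rescaled to sum to 1.
edu_given_25_34 = conditional_given_column(joint, 0)
edu_given_35_54 = conditional_given_column(joint, 1)

# Age given education: the "separate table" needed when education is the focus.
age_given_college_grad = conditional_given_row(joint, 3)

for label, p in zip(education, edu_given_25_34):
    print(f"{label:8s} given 25-34: {p:.1%}")
for label, p in zip(ages, age_given_college_grad):
    print(f"{label:8s} given 4+ Coll: {p:.1%}")
```

Dividing by a column total answers "what is the education mix within this age group?", while dividing by a row total answers "what is the age mix within this education level?". The two generally give different numbers for the same cell.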

12 Here, the percents are calculated by age range (columns).
The conditional distributions can be compared graphically using side-by-side bar graphs of one variable for each value of the other variable.

13 Example A study was done to establish the preferred activities among men and women between TV, sports, and dancing. A random sample of 30 men and 20 women were asked their preferences.
          Dance   Sports   TV   Total
Men         2       11      7     30
Women      16        5      9     20
Total      18       16     16     50
At first glance, we might be tempted to say that all three activities were about equally popular (18, 16, and 16). However, upon closer examination, we see some major differences. For example, women overwhelmingly preferred dance relative to men. In fact, the effect is even more pronounced than the raw counts suggest: 16 women preferred dance compared with only 2 men, but that was 2 men out of 30 and 16 women out of 20. This is why it is important to go beyond basic frequencies and look at conditional frequencies. Key Point: With tables, as with just about anything in statistics, if you look only at pieces without carefully examining their relationship to the whole picture, you run the risk of drawing entirely flawed conclusions!

14 Music and wine purchase decision
What is the relationship between type of music played in supermarkets and type of wine purchased? We want to compare the conditional distributions of the response variable (wine purchased) for each value of the explanatory variable (music played). Therefore, we calculate column percents: cell total / column total, e.g. 30 / 84 = 35.7%. Calculations: When no music was played, there were 84 bottles of wine sold. Of these, 30 were French wine, so 30/84 = 35.7% of the wine sold was French when no music was played. Note how both variables are categorical: music (French composer, Italian composer, etc.) and wine (from France, Italy, etc.). We calculate the column conditional percents similarly for each of the nine cells in the table.

15 For every two-way table, there are two sets of possible conditional distributions.
Does background music in supermarkets influence customer purchasing decisions? In this case, we can look at either (or both) of the following two distributions: wine purchased for each kind of music played (column percents), and music played for each kind of wine purchased (row percents).

16 ** Simpson’s paradox Combining groups together can lead to misleading conclusions. Example: Hospital death rates. On the surface, Hospital B would seem to have a better record. But once patient condition is taken into account, we see that Hospital A in fact has the better record, and for both patient conditions!

17 ** Simpson’s paradox In this case, the misleading information occurred because all patients were grouped together instead of being stratified by the condition in which they were first admitted. Example: Hospital death rates. Once patient condition is taken into account, we see that Hospital A in fact has a better record for both patient conditions. Here, patient condition was the lurking variable. The Moore textbook has a very good example using airline arrival times.
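The reversal can be reproduced with a small Python sketch. The counts below are illustrative assumptions chosen to show the pattern described here; they are not taken from the slides, whose table is not included in this transcript.

```python
# Illustrative (deaths, patients) counts per admission condition.
# These numbers are assumptions for demonstration, not the slide's data.
hospitals = {
    "Hospital A": {"good": (6, 600), "poor": (57, 1500)},
    "Hospital B": {"good": (8, 600), "poor": (8, 200)},
}


def overall_rate(groups):
    """Death rate with all patients pooled together."""
    deaths = sum(d for d, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return deaths / patients


def rates_by_condition(groups):
    """Death rate computed separately within each patient condition."""
    return {condition: d / n for condition, (d, n) in groups.items()}


overall = {name: overall_rate(g) for name, g in hospitals.items()}
stratified = {name: rates_by_condition(g) for name, g in hospitals.items()}

for name in hospitals:
    print(f"{name}: overall {overall[name]:.1%}, "
          f"good {stratified[name]['good']:.1%}, "
          f"poor {stratified[name]['poor']:.1%}")
```

With these counts, Hospital B looks better when patients are pooled, yet Hospital A has the lower death rate within each condition. The pooled comparison is misleading because Hospital A treats far more patients admitted in poor condition, which is exactly the role a lurking variable plays.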

