Relations in Categorical Data
When a researcher studies the relationship between two variables and both are numerical, scatterplots, the correlation coefficient, and regression analysis are useful tools. When the variables are categorical, the study of a possible relationship begins with a two-way table. Let’s consider an example in which a retailer records, over a period of time, the types of wine sold. The three types are French, Italian, and Other (all remaining wines). It is also of interest to see whether the type of music playing in the store influences the type of wine purchased. So, each time a bottle of wine is sold, the type of music playing in the store (None, French music, or Italian music) must be noted. The table with the counts is on the next slide.
                 Music
Wine       None   French   Italian
French
Italian      11        1        19
Other

All the counts in the table add up to 243. This is the total number of bottles of wine sold during the study. Because the type of music was noted for every sale, it is also the total number of music observations. On the next slide I will add a column and a row to the table. The column will contain the total number of bottles of each type of wine, the row will contain the total number of bottles sold under each type of music, and the corner cell will contain the grand total.
                 Music
Wine       None   French   Italian   Total
French
Italian      11        1        19      31
Other
Total                                  243

The grand total in the corner is again 243, the number of bottles sold during the study; the Total column and the Total row each add up to it. The Total column is the distribution of the types of wine sold, and we could make pie charts and bar graphs from this information. This column is often called the marginal distribution of the row variable because it is written in the margin of the table. The Total row has a similar interpretation: it is the marginal distribution of the column variable, the type of music.
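The Total column and Total row described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Italian row (11, 1, 19) and the grand total of 243 come from the slides, while the French and Other rows are assumed values chosen so that the table sums to 243. The names `counts`, `row_totals`, and `column_totals` are hypothetical.

```python
# Counts of bottles sold: wine type (rows) by music type (columns).
# ASSUMPTION: only the Italian row and the grand total (243) come from the
# slides; the French and Other rows are illustrative values chosen so that
# the whole table sums to 243.
MUSIC = ["None", "French", "Italian"]
counts = {
    "French":  [30, 39, 30],
    "Italian": [11, 1, 19],
    "Other":   [43, 35, 35],
}


def row_totals(table):
    """Marginal distribution of the row variable (wine type), as counts."""
    return {wine: sum(row) for wine, row in table.items()}


def column_totals(table, columns):
    """Marginal distribution of the column variable (music type), as counts."""
    return {
        col: sum(row[j] for row in table.values())
        for j, col in enumerate(columns)
    }


wine_totals = row_totals(counts)
music_totals = column_totals(counts, MUSIC)
grand_total = sum(wine_totals.values())

print("Total column:", wine_totals)
print("Total row:   ", music_totals)
print("Grand total: ", grand_total)
```

Summing the Total column and summing the Total row both give the grand total, which is why the same number counts both bottles sold and music observations.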
In this example the retailer's real interest is in understanding the types of wine sold. Thus the type of wine sold is the response variable, and we put it in the rows. The type of music is thought to explain the type of wine sold and is thus the explanatory variable (sometimes called the treatment variable); here it is put in the columns. To explore the possible relationship between the variables, we look at each value of the explanatory variable (here, each column) and divide each count in that column by the column total. We see this on the next slide. Check my work – please!
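The division of each count by its column total can be sketched as follows. As before, this is a rough illustration and not the retailer's actual data: only the Italian row (11, 1, 19) and the total of 243 come from the slides, and the French and Other rows are assumed values chosen to sum to 243. The function name `column_proportions` is hypothetical.

```python
# Counts of bottles sold: wine type (rows) by music type (columns).
# ASSUMPTION: the French and Other rows are illustrative values; only the
# Italian row and the grand total of 243 are taken from the slides.
MUSIC = ["None", "French", "Italian"]
counts = {
    "French":  [30, 39, 30],
    "Italian": [11, 1, 19],
    "Other":   [43, 35, 35],
}


def column_proportions(table, columns):
    """Divide every count by the total of its column.

    Each resulting column is the distribution of wine type for one
    type of music, so each column sums to 1.
    """
    n_cols = len(columns)
    col_totals = [sum(row[j] for row in table.values()) for j in range(n_cols)]
    return {
        wine: [row[j] / col_totals[j] for j in range(n_cols)]
        for wine, row in table.items()
    }


props = column_proportions(counts, MUSIC)

# Print the proportions in the same layout as the two-way table.
print(f"{'Wine':<8}" + "".join(f"{m:>9}" for m in MUSIC))
for wine, row in props.items():
    print(f"{wine:<8}" + "".join(f"{p:>9.3f}" for p in row))
```

Comparing the columns of the printed table is what reveals a possible relationship: if music had no influence, the three columns would show roughly the same proportions.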
                 Music
Wine       None   French   Italian   Total
French
Italian
Other
Total         1        1         1       1

When no music was playing, wines other than French or Italian made up the majority of wines purchased. French was the second most common, followed by Italian. When French music was playing, French wine made up a larger share of purchases, very few people bought Italian wine, and the percent who bought other wines decreased. When Italian music was playing, French wine made up the same percent of bottles purchased as when no music was playing, but Italian wine made up a larger share and other wines a smaller share. So, the music played does seem to matter. French wines sell in greater proportion when French music is played than when it is not, and the same is true for Italian wines and Italian music.