Section 4.4: Contingency Tables and Association

Contingency table
– What a contingency table is and why to use one
– Marginal distribution
– Conditional distribution
Simpson’s Paradox
– What is it?
– What causes it?
Contingency tables are for summarizing bivariate (or multivariate) qualitative data.

sex     height  shoe  eyes   hair        hand
male    70      9     brown  brown       right
male    71      11    blue   blond       left
male    73      11.5  blue   blond       right
female  64      7     brown  black       right
male    66      7.5   brown  lightbrown  right
female  63      6.5   brown  black       right
female  64      6.5   blue   red         right
male    72      10    brown  blond       left
male    66      8.5   green  lightbrown  right
female  67      8     brown  lightbrown  right
male    74      11.5  brown  brown       left
male    72      12    blue   brown       right
female  68      8.5   blue   lightbrown  right
male    78      12    blue   blond       right
male    70      12    green  blond       right
female  68      8     blue   red         both
female  68      9.5   green  brown       left
female  66      7     blue   blond       right
male    66      10    brown  brown       right
:       :       :     :      :           :
Contingency table results: Rows: eyes Columns: hair

        black  blond  brown  lightbrown  red  Total
blue
brown
green
Total
Contingency table results: Rows: hair Columns: eyes

            blue  brown  green  Total
black       0     5      1      6
blond
brown
lightbrown  2     4      2      8
red         3     1      0      4
Total

Often it is arbitrary which variable gets to be the row variable.
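As a rough illustration of how such a table is tallied, the sketch below counts (eyes, hair) pairs using only the Python standard library. It uses just the 19 rows visible in the data listing, so its counts will not match the tables above, which summarize the full data set. The helper names `crosstab` and `print_table` are made up for this example.

```python
from collections import Counter

# (eyes, hair) pairs for the 19 rows visible in the data listing.
# The full class data set is longer, so these counts are only illustrative.
pairs = [
    ("brown", "brown"), ("blue", "blond"), ("blue", "blond"),
    ("brown", "black"), ("brown", "lightbrown"), ("brown", "black"),
    ("blue", "red"), ("brown", "blond"), ("green", "lightbrown"),
    ("brown", "lightbrown"), ("brown", "brown"), ("blue", "brown"),
    ("blue", "lightbrown"), ("blue", "blond"), ("green", "blond"),
    ("blue", "red"), ("green", "brown"), ("blue", "blond"),
    ("brown", "brown"),
]


def crosstab(pairs):
    """Count each (row_value, column_value) pair; return counts and labels."""
    cells = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return cells, rows, cols


def print_table(cells, rows, cols):
    """Print the counts with row and column totals in the margins."""
    width = max(len(label) for label in rows + cols + ["Total"]) + 2
    print("".ljust(width) + "".join(c.ljust(width) for c in cols) + "Total")
    for r in rows:
        counts = [cells[(r, c)] for c in cols]
        line = "".join(str(n).ljust(width) for n in counts)
        print(r.ljust(width) + line + str(sum(counts)))
    col_totals = [sum(cells[(r, c)] for r in rows) for c in cols]
    line = "".join(str(n).ljust(width) for n in col_totals)
    print("Total".ljust(width) + line + str(sum(col_totals)))


eyes_by_hair, eye_colors, hair_colors = crosstab(pairs)
print_table(eyes_by_hair, eye_colors, hair_colors)

# Swapping the roles of the two variables simply transposes the table.
hair_by_eyes, hair_rows, eye_cols = crosstab([(h, e) for e, h in pairs])
```

Because swapping the variables only transposes the counts, the choice of row variable does not change the information in the table.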
Displaying three variables (sex, eye color, hair color):

Contingency table results for sex=female: Rows: eyes Columns: hair

        black  blond  brown  lightbrown  red  Total
blue
brown
green
Total

Contingency table results for sex=male:

        black  blond  brown  lightbrown  red  Total
blue
brown
green
Total

We will focus on two variables.
Survival of the 793 adult male passengers, by 1st class, 2nd class, and 3rd class fares:

Status \ Class  1st Class  2nd Class  3rd Class  Total
Saved           58         13         60         131
Lost            118        154        390        662
Total           176        167        450        793
Status \ Class  1st Class     2nd Class     3rd Class     Total
Saved           58 (7.31%)    13 (1.64%)    60 (7.57%)    131 (16.52%)
Lost            118 (14.88%)  154 (19.42%)  390 (49.18%)  662 (83.48%)
Total           176 (22.19%)  167 (21.06%)  450 (56.75%)  793 (100%)

Relative frequencies (in parentheses): each count divided by the grand total of 793.
Marginal distribution:
– The margins (the Total row and the Total column) show the relative amount in each row or column.
– The percentages in each margin add to one (100%).
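The marginal percentages can be checked by hand or with a few lines of code. The following is a minimal sketch using the counts from the table; the variable names are chosen for this example only.

```python
# Counts from the table: rows are survival status, columns are fare class.
counts = {
    "Saved": {"1st": 58, "2nd": 13, "3rd": 60},
    "Lost": {"1st": 118, "2nd": 154, "3rd": 390},
}
classes = ["1st", "2nd", "3rd"]

grand_total = sum(sum(row.values()) for row in counts.values())

# Row margin: share of all passengers in each survival status.
status_margin = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Column margin: share of all passengers in each fare class.
class_margin = {
    c: sum(counts[status][c] for status in counts) / grand_total
    for c in classes
}

for status, share in status_margin.items():
    print(f"{status}: {share:.2%}")
for c, share in class_margin.items():
    print(f"{c} class: {share:.2%}")
```

Each margin is a distribution of one variable on its own, ignoring the other, which is why each one sums to 100%.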
Conditional Distribution
– Either rows or columns add to one (100%).

Percentages conditioned on survival status (each row adds to 100%):

Status \ Class  1st Class     2nd Class     3rd Class     Total
Saved           58 (44.27%)   13 (9.92%)    60 (45.8%)    131 (100%)
Lost            118 (17.82%)  154 (23.26%)  390 (58.91%)  662 (100%)
Total           176 (22.19%)  167 (21.06%)  450 (56.75%)  793 (100%)
Percentages conditioned on passenger class (each column adds to 100%):

Status \ Class  1st Class     2nd Class     3rd Class     Total
Saved           58 (32.95%)   13 (7.78%)    60 (13.33%)   131 (16.52%)
Lost            118 (67.05%)  154 (92.22%)  390 (86.67%)  662 (83.48%)
Total           176 (100%)    167 (100%)    450 (100%)    793 (100%)
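The two conditional distributions differ only in which total each count is divided by. A minimal sketch, using the same counts and two helper functions named here purely for illustration:

```python
# Counts from the table: rows are survival status, columns are fare class.
counts = {
    "Saved": {"1st": 58, "2nd": 13, "3rd": 60},
    "Lost": {"1st": 118, "2nd": 154, "3rd": 390},
}


def condition_on_rows(table):
    """Divide each count by its row total, so every row sums to 1."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {c: n / row_total for c, n in row.items()}
    return result


def condition_on_columns(table):
    """Divide each count by its column total, so every column sums to 1."""
    col_totals = {}
    for row in table.values():
        for c, n in row.items():
            col_totals[c] = col_totals.get(c, 0) + n
    return {
        row_label: {c: n / col_totals[c] for c, n in row.items()}
        for row_label, row in table.items()
    }


by_status = condition_on_rows(counts)    # class distribution within each status
by_class = condition_on_columns(counts)  # status distribution within each class

print(f"Of those saved, {by_status['Saved']['1st']:.2%} were 1st class")
print(f"Of 1st class, {by_class['Saved']['1st']:.2%} were saved")
```

The same cell (58 saved 1st-class passengers) gives two different percentages because the two questions use different denominators: 131 saved passengers versus 176 first-class passengers.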
        Women & children  Men  Total
Saved
Lost
Total

1. What proportion of passengers were women & children?
2. What proportion of the passengers were lost?
3. What proportion of the women & children were lost?
4. Of the passengers who were lost, what proportion were women and children?
Simpson’s Paradox: Example

Hypothetical graduate school acceptance data:

        Accepted  Rejected  Total  % Accepted
Male
Female

Men do better.
But if a third variable is accounted for, the story changes…

        Major A                                Major B
        Accepted  Rejected  Total  % Accepted  Accepted  Rejected  Total  % Accepted
Male
Female

Women actually do better.
        Both majors                            Major A                                Major B
        Accepted  Rejected  Total  % Accepted  Accepted  Rejected  Total  % Accepted  Accepted  Rejected  Total  % Accepted
Male
Female

Why the change?
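One way to see how the comparison can flip is to work through numbers. The counts below are invented for this sketch and are not the figures from the slides; they are chosen only so that women have the higher acceptance rate within each major while men have the higher rate once the majors are pooled.

```python
# Hypothetical (accepted, rejected) counts, invented for illustration only.
data = {
    "Major A": {"Male": (160, 40), "Female": (45, 5)},
    "Major B": {"Male": (10, 40), "Female": (60, 140)},
}


def acceptance_rate(accepted, rejected):
    """Fraction of applicants who were accepted."""
    return accepted / (accepted + rejected)


# Acceptance rates within each major.
for major, by_sex in data.items():
    for sex, (accepted, rejected) in by_sex.items():
        print(f"{major}, {sex}: {acceptance_rate(accepted, rejected):.0%}")

# Pool the two majors by adding the counts for each sex.
combined = {}
for by_sex in data.values():
    for sex, (accepted, rejected) in by_sex.items():
        a, r = combined.get(sex, (0, 0))
        combined[sex] = (a + accepted, r + rejected)

for sex, (accepted, rejected) in combined.items():
    print(f"Both majors, {sex}: {acceptance_rate(accepted, rejected):.0%}")
```

In these made-up numbers most men apply to Major A, where acceptance is high for everyone, while most women apply to Major B, where it is low for everyone. The pooled rate for each sex is therefore dominated by a different major, and the direction of the comparison reverses.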
Simpson’s Paradox represents a situation in which an association between two variables inverts or goes away when a third variable is introduced to the analysis. See: