Topic 4: Exploring Categorical Data
Frequency tables and bar plots
Data matrix for emails
Rows 1, 2, 3, and 3921 of a data matrix are displayed below. It contains data collected on 3,921 emails that were received.

        Spam   Num_char   Line_breaks   Format   Number
1       no     21706      551           html     small
2              7011       183                    big
3       yes    631        28            text     none
...
3921           2225       65

Variable      Description
Spam          Specifies whether the email is spam
Num_char      Number of characters in email
Line_breaks   Number of line breaks in email
Format        Specifies whether email was in html or text format
Number        Indicates if email contained no number, a small number (under 1,000,000), or a big number
Data matrix for emails: categorical variables
In the data matrix above, Spam, Format, and Number are categorical variables; Num_char and Line_breaks are numerical.
Frequency Table
A table that summarizes data for a single categorical variable is called a frequency table. A frequency table can display raw counts, proportions, or both. Examples for the variable number are below.

Raw count:
None   Small   Big   Total
549    2827    545   3921

Proportion:
None   Small   Big    Total
0.14   0.72    0.14   1

Both:
         Count   Proportion
None     549     0.14
Small    2827    0.72
Big      545     0.14
Total    3921    1
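As a minimal sketch of how the proportions follow from the raw counts, the snippet below divides each count by the total. The counts come from the frequency table for the variable number; the function name `frequency_table` is an illustrative choice, not something from the slides.

```python
# Raw counts for the variable "number", taken from the frequency table.
counts = {"None": 549, "Small": 2827, "Big": 545}

def frequency_table(counts):
    """Return rows of (category, count, proportion), ending with a Total row."""
    total = sum(counts.values())
    rows = [(name, n, round(n / total, 2)) for name, n in counts.items()]
    rows.append(("Total", total, 1.0))
    return rows

table = frequency_table(counts)
for name, n, p in table:
    # Left-align the label, right-align the count and proportion.
    print(f"{name:<6}{n:>6}{p:>8.2f}")
```

Each proportion is simply count / total, so the proportions sum to 1 up to rounding.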
Bar plot
A bar plot is a graphical representation of a frequency table.
[Figure: two bar plots of the variable number, one by raw count and one by proportion, each shown beside its frequency table.]
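To make the link between a frequency table and its bar plot concrete, here is a rough text-only sketch in which each bar's length is proportional to the category's count. The helper `text_bar_plot` and the `width` parameter are hypothetical names introduced for this illustration; a real plot would normally be drawn with a plotting library.

```python
# Raw counts for the variable "number", taken from the frequency table.
counts = {"None": 549, "Small": 2827, "Big": 545}

def text_bar_plot(counts, width=40):
    """Return one line per category; bar length is proportional to the count."""
    largest = max(counts.values())
    lines = []
    for name, n in counts.items():
        # Scale so that the largest category gets a bar of `width` characters.
        bar = "#" * round(n / largest * width)
        lines.append(f"{name:<6}{bar} {n}")
    return lines

bars = text_bar_plot(counts)
print("\n".join(bars))
```

Replacing the counts with proportions would change only the axis labels, not the relative bar lengths.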
The order of the bars There is often a natural ordering for the bars, such as by class year in the example below.
Changing the order of the bars When the bars are ordered from highest count to lowest count, it is sometimes called a Pareto chart.
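A small sketch of the Pareto ordering, assuming the counts of the variable number from the frequency table earlier in this topic. The function name `pareto_order` is an illustrative choice.

```python
# Raw counts for the variable "number", taken from the frequency table.
counts = {"None": 549, "Small": 2827, "Big": 545}

def pareto_order(counts):
    """Return (category, count) pairs sorted from highest count to lowest."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

ordered = pareto_order(counts)
print(ordered)  # [('Small', 2827), ('None', 549), ('Big', 545)]
```

Plotting the bars in this order gives the highest-to-lowest arrangement described above, at the cost of losing the natural None, Small, Big ordering.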
Bar plot vs. pie chart
Pie charts are another way to graphically represent a frequency table. They are well known, but generally less useful than bar plots, because comparing the sizes of slices is harder than comparing the heights of bars.
Categorical data pairs: contingency tables, side-by-side bar plots, segmented bar plots, and mosaic plots
Recall the data matrix for emails
The data matrix shown earlier contains data collected on 3,921 emails; its categorical variables are Spam, Format, and Number.
Pairing two categorical variables
The tables that follow pair two of these variables, Spam and Number.
Contingency Table A table that summarizes data for two categorical variables is called a contingency table.
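One way to picture how a contingency table is built is to count how often each combination of levels occurs. The sketch below does this for a handful of hypothetical (spam, number) records that were invented for illustration and are not rows of the real email data; the function name `contingency_table` is also an assumption.

```python
from collections import Counter

# Hypothetical toy records of (spam, number) pairs, invented for illustration.
# They are NOT rows from the real email data set.
records = [
    ("no", "small"), ("no", "big"), ("yes", "none"),
    ("no", "small"), ("yes", "small"), ("no", "none"),
]

def contingency_table(pairs, row_levels, col_levels):
    """Count how often each (row level, column level) combination occurs."""
    cell_counts = Counter(pairs)
    return {r: {c: cell_counts[(r, c)] for c in col_levels} for r in row_levels}

table = contingency_table(records, ["yes", "no"], ["none", "small", "big"])
for level, row in table.items():
    print(level, row, "row total:", sum(row.values()))
```

Each cell holds the number of records with that pair of values, and the row and column totals are the sums across cells.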
Row and column proportions
Row proportions are computed using row totals, and column proportions using column totals.

Row proportions:
           None                Small                Big                 Total
Spam       149/367 = 0.406     168/367 = 0.458      50/367 = 0.136      1.000
Not spam   400/3554 = 0.113    2659/3554 = 0.748    495/3554 = 0.139    1.000
Total      549/3921 = 0.140    2827/3921 = 0.721    545/3921 = 0.139    1.000

Column proportions:
           None                Small                Big                 Total
Spam       149/549 = 0.271     168/2827 = 0.059     50/545 = 0.092      367/3921 = 0.094
Not spam   400/549 = 0.729     2659/2827 = 0.941    495/545 = 0.908     3554/3921 = 0.906
Total      1.000               1.000                1.000               1.000
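The two kinds of proportions can be sketched as follows. The cell counts are those used in the slide, with one assumption stated plainly: the not-spam/small count is taken as 2659, the value implied by the row total 3554 and the column total 2827. The function names are illustrative.

```python
# Cell counts for spam status by number.
# Assumption: the not-spam/small cell is 2659, the value implied by the
# row total (3554) and the column total (2827).
counts = {
    "Spam":     {"None": 149, "Small": 168,  "Big": 50},
    "Not spam": {"None": 400, "Small": 2659, "Big": 495},
}

def row_proportions(table):
    """Divide each cell by its row total."""
    return {r: {c: n / sum(cells.values()) for c, n in cells.items()}
            for r, cells in table.items()}

def column_proportions(table):
    """Divide each cell by its column total."""
    col_totals = {}
    for cells in table.values():
        for c, n in cells.items():
            col_totals[c] = col_totals.get(c, 0) + n
    return {r: {c: n / col_totals[c] for c, n in cells.items()}
            for r, cells in table.items()}

row_p = row_proportions(counts)
col_p = column_proportions(counts)
print(round(row_p["Spam"]["None"], 3))  # share of spam emails with no number
print(round(col_p["Spam"]["None"], 3))  # share of no-number emails that are spam
```

The same cell (spam, none) gives 0.406 as a row proportion and 0.271 as a column proportion, which is why it matters which total is used as the denominator.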
Segmented bar plot vs. side-by-side bar plot
Segmented bar plot: count vs. proportion
Mosaic Plot
Simpson’s Paradox
Example: long-term study on smoking
A survey of 1,314 women in the United Kingdom during 1972-1974 asked each woman whether she was a smoker. Twenty years later, a follow-up survey observed whether each woman was dead or still alive. Below is a summary of the results.

                         Survival Status
Smoking Status   Dead            Alive           Total
Smoker           139 (23.88%)    443 (76.12%)    582 (100%)
Non-smoker       230 (31.42%)    502 (68.58%)    732 (100%)
Total            369 (28.08%)    945 (71.92%)    1314 (100%)
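The percentages in the summary are row proportions, that is, each count divided by the total for its smoking status. A minimal sketch that recomputes the death rates from the counts is given below; the function name `death_rate` is an illustrative choice.

```python
# Counts from the smoking study summary table.
counts = {
    "Smoker":     {"Dead": 139, "Alive": 443},
    "Non-smoker": {"Dead": 230, "Alive": 502},
}

def death_rate(row):
    """Proportion of women in this group who had died by the follow-up."""
    return row["Dead"] / (row["Dead"] + row["Alive"])

# Death rate as a percentage for each smoking status.
rates = {status: round(100 * death_rate(row), 2) for status, row in counts.items()}

# Overall death rate, pooling smokers and non-smokers.
total_dead = sum(row["Dead"] for row in counts.values())
total_women = sum(row["Dead"] + row["Alive"] for row in counts.values())
overall = round(100 * total_dead / total_women, 2)

print(rates, overall)
```

In the aggregated table, the death rate among smokers (23.88%) is lower than among non-smokers (31.42%).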