1
Topic 4: Exploring Categorical Data
2
Frequency tables and bar plots
3
Data matrix for emails
Rows 1, 2, 3, and 3921 of a data matrix are displayed below. It contains data collected on 3,921 emails that were received.

        Spam   Num_char   Line_breaks   Format   Number
1       no     21706      551           html     small
2              7011       183                    big
3       yes    631        28            text     none
...
3921           2225       65

Variable      Description
Spam          Specifies whether the email is spam
Num_char      Number of characters in the email
Line_breaks   Number of line breaks in the email
Format        Specifies whether the email was in html or text format
Number        Indicates if the email contained no number, a small number (under 1,000,000), or a big number
4
Data matrix for emails: categorical variables
In the data matrix above, Spam, Format, and Number are categorical variables.
5
Frequency Table
A table that summarizes data for a single categorical variable is called a frequency table. A frequency table can display raw counts, proportions, or both. Examples for the variable number are below.

Raw count:
None   Small   Big   Total
549    2827    545   3921

Proportion:
None   Small   Big    Total
0.14   0.72    0.14   1

Both:
        Count   Proportion
None    549     0.14
Small   2827    0.72
Big     545     0.14
Total   3921    1
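The step from raw counts to proportions can be sketched in a few lines of Python. This is only an illustration: the counts come from the frequency table for the variable number, while the function name `frequency_table` is a hypothetical helper, not part of the original slides.

```python
# Raw counts for the variable "number", taken from the frequency table.
counts = {"None": 549, "Small": 2827, "Big": 545}


def frequency_table(counts):
    """Return {category: (count, proportion)} with an added Total row.

    Hypothetical helper for illustration only.
    """
    total = sum(counts.values())
    table = {cat: (n, n / total) for cat, n in counts.items()}
    table["Total"] = (total, 1.0)
    return table


table = frequency_table(counts)

# Print the "both" layout: one row per category with count and proportion.
for cat, (n, p) in table.items():
    print(f"{cat:<6}{n:>6}{p:>8.2f}")
```

Each proportion is the category count divided by the total of 3921, which is why the proportions sum to 1.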
8
Bar plot
A bar plot is a graphical representation of a frequency table. For the variable number, the bars can show raw counts (None 549, Small 2827, Big 545) or proportions (None 0.14, Small 0.72, Big 0.14).
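As a rough sketch of how a bar plot encodes a frequency table, the snippet below draws text bars whose lengths are proportional to the counts. The helper `text_bar_plot` and the bar width of 40 characters are assumptions made for illustration; a plotting library such as matplotlib would normally be used for a graphical version.

```python
# Raw counts for the variable "number", taken from the frequency table.
counts = {"None": 549, "Small": 2827, "Big": 545}


def text_bar_plot(values, width=40):
    """Return one text line per category; bar length is proportional to value.

    Hypothetical helper: the longest bar is `width` characters long.
    """
    largest = max(values.values())
    lines = []
    for label, value in values.items():
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<6}|{bar} {value}")
    return lines


bars = text_bar_plot(counts)
print("\n".join(bars))
```

Passing proportions instead of counts gives bars of the same relative lengths, since only the axis scale changes.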
9
The order of the bars
There is often a natural ordering for the bars, such as by class year in the example below.
10
Changing the order of the bars
When the bars are ordered from highest count to lowest count, the plot is sometimes called a Pareto chart.
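A minimal sketch of the reordering, assuming the counts for the variable number from the email data (the class-year data behind the slide's figure is not available in the text):

```python
# Raw counts for the variable "number" from the email data set.
counts = {"None": 549, "Small": 2827, "Big": 545}

# Sort categories from highest count to lowest count (Pareto order).
pareto_order = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for category, count in pareto_order:
    print(f"{category:<6}{count:>6}")
```

Sorting by count discards any natural ordering of the categories, so this layout is most useful when the categories have no inherent order.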
11
Bar plot vs. pie chart
Pie charts are another way to graphically represent a frequency table. They are well known, but generally not as useful as bar plots.
12
Categorical data pairs: contingency tables, side-by-side bar plots, segmented bar plots, and mosaic plots
13
Recall the data matrix for emails
Spam, Format, and Number are the categorical variables in the email data matrix shown earlier.
14
Pairing two categorical variables
The examples that follow pair the variables Spam and Number from the email data matrix.
15
Contingency Table
A table that summarizes data for two categorical variables is called a contingency table.
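One way to see how a contingency table arises is to tally pairs of values, one pair per email. The sketch below is illustrative only. The cell counts are the ones used in the row and column proportion calculations later in this deck, with one assumption: the not-spam/small cell is taken as 2659, the value implied by the stated row total of 3554 and column total of 2827.

```python
from collections import Counter

# Cell counts for spam status by number.
# Assumption: the not-spam/small cell is 2659, the value implied by the
# stated row total (3554) and column total (2827).
cell_counts = {
    ("spam", "none"): 149,
    ("spam", "small"): 168,
    ("spam", "big"): 50,
    ("not spam", "none"): 400,
    ("not spam", "small"): 2659,
    ("not spam", "big"): 495,
}

# Expand to one (spam, number) pair per email, as rows of a data matrix.
observations = [pair for pair, n in cell_counts.items() for _ in range(n)]

# Tally the pairs to get the table cells, then the row and column totals.
table = Counter(observations)
row_totals = Counter(spam for spam, _ in observations)
col_totals = Counter(number for _, number in observations)

print(dict(row_totals))
print(dict(col_totals))
```

The row totals and column totals recovered here are the single-variable frequency tables for spam status and number.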
16
Row and column proportions
Row proportions divide each count by its row total; column proportions divide each count by its column total.

Row proportions:
           None               Small               Big                Total
Spam       149/367 = 0.406    168/367 = 0.458     50/367 = 0.136     1.000
Not spam   400/3554 = 0.113   2659/3554 = 0.748   495/3554 = 0.139   1.000
Total      549/3921 = 0.140   2827/3921 = 0.721   545/3921 = 0.139   1.000

Column proportions:
           None              Small               Big               Total
Spam       149/549 = 0.271   168/2827 = 0.059    50/545 = 0.092    367/3921 = 0.094
Not spam   400/549 = 0.729   2659/2827 = 0.941   495/545 = 0.908   3554/3921 = 0.906
Total      1.000             1.000               1.000             1.000
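The two calculations can be sketched as follows. The function names are hypothetical, and the not-spam/small count is assumed to be 2659, the value consistent with the row total 3554 and the column total 2827.

```python
# Contingency table of spam status by number.
# Assumption: not-spam/small is 2659, consistent with the stated totals.
counts = {
    "Spam": {"None": 149, "Small": 168, "Big": 50},
    "Not spam": {"None": 400, "Small": 2659, "Big": 495},
}


def row_proportions(counts):
    """Divide each cell by the total of its row."""
    return {
        row: {col: n / sum(cells.values()) for col, n in cells.items()}
        for row, cells in counts.items()
    }


def column_proportions(counts):
    """Divide each cell by the total of its column."""
    col_totals = {}
    for cells in counts.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: n / col_totals[col] for col, n in cells.items()}
        for row, cells in counts.items()
    }


row_props = row_proportions(counts)
col_props = column_proportions(counts)

# Share of spam emails that contain no number.
print(round(row_props["Spam"]["None"], 3))
# Share of emails with no number that are spam.
print(round(col_props["Spam"]["None"], 3))
```

The two printed values answer different questions: the first conditions on the email being spam, the second on the email containing no number.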
17
Segmented bar plot vs. side-by-side bar plot
18
Segmented bar plot: count vs. proportion
19
Mosaic Plot
20
Mosaic Plot
21
Simpson’s Paradox
22
Example: long-term study on smoking
A survey of 1,314 women in the United Kingdom asked each woman whether she was a smoker. Twenty years later, a follow-up survey observed whether each woman was dead or still alive. Below is a summary of the results.

                            Survival Status
Smoking Status   Dead           Alive          Total
Smoker           139 (23.88%)   443 (76.12%)   582 (100%)
Non-smoker       230 (31.42%)   502 (68.58%)   732 (100%)
Total            369 (28.08%)   945 (71.92%)   1314 (100%)
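The percentages in the summary are row proportions, which the sketch below reproduces from the counts. In this aggregated table smokers show the lower death rate; Simpson's paradox refers to cases where such an aggregate comparison reverses once the data are split by a third variable. That split is not available in the text here, so the code covers only the aggregate figures. The helper name `death_rate` is hypothetical.

```python
# Counts from the summary table: survival status by smoking status.
counts = {
    "Smoker": {"Dead": 139, "Alive": 443},
    "Non-smoker": {"Dead": 230, "Alive": 502},
}


def death_rate(cells):
    """Proportion of a group that had died by the follow-up survey."""
    return cells["Dead"] / (cells["Dead"] + cells["Alive"])


rates = {group: death_rate(cells) for group, cells in counts.items()}

# Overall death rate across both groups combined.
total_dead = sum(cells["Dead"] for cells in counts.values())
total_women = sum(sum(cells.values()) for cells in counts.values())
overall_rate = total_dead / total_women

for group, rate in rates.items():
    print(f"{group:<11}{100 * rate:6.2f}%")
print(f"{'Overall':<11}{100 * overall_rate:6.2f}%")
```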