AP Statistics Chapter 3 Part 1 Displaying and Describing Categorical Data
Learning Objectives
1. Summarize the distribution of a categorical variable with a frequency table.
2. Display the distribution of a categorical variable with a bar chart or pie chart.
3. Recognize misleading statistics.
4. Know how to make and examine a contingency table.
5. Be able to make and examine a segmented bar chart of the conditional distribution of a variable for two or more categories.
Rubric:
Level 1 – Know the objectives.
Level 2 – Fully understand the objectives.
Level 3 – Use the objectives to solve simple problems.
Level 4 – Use the objectives to solve more advanced problems.
Level 5 – Adapt and apply the objectives to different and more complex problems.
Learning Objectives
6. Describe the distribution of a categorical variable in terms of its possible values and relative frequency.
7. Understand how to examine the association (independence or dependence) between categorical variables by comparing conditional and marginal percentages.
8. Know what Simpson’s paradox is and be able to recognize when it occurs.
Learning Objective 1: Distributions Definition: Distribution ‒ The pattern of variation of a variable. ‒ What values a variable takes and how often it takes these values. ‒ A distribution tells us the possible values a variable takes as well as the occurrence of those values (frequency or relative frequency).
Learning Objective 1: Proportion & Percentage (Relative Frequencies) The proportion of the observations that fall in a certain category is the frequency (count) of observations in that category divided by the total number of observations:
proportion = (frequency of that class) / (sum of all frequencies)
The percentage is the proportion multiplied by 100. Proportions and percentages are also called relative frequencies.
Learning Objective 1: Frequency, Proportion, & Percentage - Example
If 4 students received an “A” out of 40 students, then:
1. 4 is the frequency.
2. 4/40 = 0.10 is the proportion (relative frequency).
3. 10% is the percentage (0.10 · 100 = 10%).
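The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch; the variable names are chosen for illustration and are not part of the original slides.

```python
# Frequency, proportion, and percentage for one category.
frequency = 4   # students who received an "A"
total = 40      # all students in the class

proportion = frequency / total   # relative frequency as a proportion
percentage = proportion * 100    # relative frequency as a percentage

print(f"frequency  = {frequency}")
print(f"proportion = {proportion:.2f}")
print(f"percentage = {percentage:.0f}%")
```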
Learning Objective 1: Frequency Table A frequency table is a listing of possible values for a variable, together with the number of observations and/or relative frequencies for each value. Frequency tables are often used to organize categorical data. Frequency tables display the category names and the counts of the number of data values in each category. Relative frequency tables also display the category names, but they give the percentages (and/or relative frequencies) rather than counts for each category.
Learning Objective 1: Class Problem
A stock broker has been following different stocks over the last month and has recorded whether each stock is up, the same, or down in value. The results were:
Performance of stock    Count
Up                         21
Same                        7
Down                       12
1. What is the variable of interest?
2. What type of variable is it?
3. Add proportions to this frequency table.
Learning Objective 2: Graphs for Categorical Variables
Displaying categorical data: Frequency tables can be difficult to read. Sometimes it is easier to analyze a distribution by displaying it with a bar graph or pie chart.
Frequency Table
Format                Count of Stations
Adult Contemporary                 1556
Adult Standards                    1196
Contemporary Hit                    569
Country                            2066
News/Talk                          2179
Oldies                             1060
Religious                          2014
Rock                                869
Spanish Language                    750
Other Formats                      1579
Total                             13838
Relative Frequency Table
Format                Percent of Stations
Adult Contemporary                   11.2
Adult Standards                       8.6
Contemporary Hit                      4.1
Country                              14.9
News/Talk                            15.7
Oldies                                7.7
Religious                            14.6
Rock                                  6.3
Spanish Language                      5.4
Other Formats                        11.4
Total                                99.9
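As a rough illustration of how the relative frequency table follows from the frequency table, the sketch below divides each count by the total and rounds to one decimal place. The dictionary name `counts` is an assumption made for this example. Because each percentage is rounded separately, the rounded values add up to 99.9 rather than exactly 100.

```python
# Counts of radio stations by format, taken from the frequency table.
counts = {
    "Adult Contemporary": 1556,
    "Adult Standards": 1196,
    "Contemporary Hit": 569,
    "Country": 2066,
    "News/Talk": 2179,
    "Oldies": 1060,
    "Religious": 2014,
    "Rock": 869,
    "Spanish Language": 750,
    "Other Formats": 1579,
}

total = sum(counts.values())

# Relative frequency (percent) for each format, rounded to one decimal place.
percents = {fmt: round(100 * n / total, 1) for fmt, n in counts.items()}

for fmt, pct in percents.items():
    print(f"{fmt:<20}{pct:>6.1f}")
print(f"{'Total':<20}{sum(percents.values()):>6.1f}")
```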
Learning Objective 2: Graphs for Categorical Variables Use pie charts and bar graphs to summarize categorical variables. ‒ Bar Graph: A graph that displays a vertical bar for each category. ‒ Pie Chart: A circle having a “slice of pie” for each category.
Learning Objective 2: Graphs for Categorical Variables Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.).
Learning Objective 2: Pie Charts When you are interested in parts of the whole (relative frequency or percentages), a pie chart might be your display of choice. Pie charts show the whole group of cases as a circle. They slice the circle into pieces whose size is proportional to the fraction of the whole in each category.
Learning Objective 2: Pie Charts - Procedure A pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data. First draw a circle; then subdivide the circle into sectors that correspond in area to the relative frequency for each category. Since there are 360 degrees in a circle, a category with a relative frequency of 0.25 would consume 0.25(360) = 90 degrees of the circle. “Good practice” requires including a title and either wedge labels or a legend.
Learning Objective 2: Pie Chart Example Construct a pie chart for the table on U.S. sources of electricity below.
Learning Objective 2: Pie Chart Example - Solution
Step 1: Convert the percentage or relative frequency of each category to an angle measurement.
Step 2: Draw a circle and divide it into sectors using the angles calculated.
Angle
0.51 · 360 = 183.6°
0.06 · 360 = 21.6°
0.16 · 360 = 57.6°
0.21 · 360 = 75.6°
0.03 · 360 = 10.8°
Total for the full circle: 360°
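Step 1 can be sketched in Python as follows. The category names from the electricity table are not preserved on this slide, so the example uses only the relative frequencies listed above; the list name `relative_frequencies` is an assumption made for illustration.

```python
# Relative frequencies as listed in the solution (category names not preserved).
relative_frequencies = [0.51, 0.06, 0.16, 0.21, 0.03]

# Each sector's angle is its relative frequency times 360 degrees.
angles = [round(rf * 360, 1) for rf in relative_frequencies]

for rf, angle in zip(relative_frequencies, angles):
    print(f"{rf:.2f} * 360 = {angle} degrees")
```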
Learning Objective 2: Pie Chart Example - Solution Step 3: Using “good practices,” include a title and either wedge labels or a legend.
Learning Objective 2: Pie Chart
Learning Objective 2: Bar Graphs Bar graphs are used for summarizing a categorical variable. ‒ Bar graphs display a vertical bar for each category. ‒ The height of each bar represents either counts (“frequencies”) or percentages (“relative frequencies”) for that category. ‒ It is usually easier to compare categories with a bar graph than with a pie chart. ‒ A bar chart stays true to the area principle. ‒ The bars are separated to emphasize the fact that each class is a separate category.
Learning Objective 2: Bar Graphs Either counts (frequency bar chart) or proportions (relative frequency bar chart) may be shown on the y-axis. This will not change the shape or relationships of the graph. Make sure all graphs have a descriptive title and that the axes are labeled (this is true for all graphs).
Learning Objective 2: Bar Graphs - Procedure A bar chart is a graphical device for depicting categorical data. On one axis (usually the vertical axis) pick an appropriate scale for frequency, relative frequency, or percentage and label. On the other axis (usually the horizontal axis), specify the labels that are used for each of the categories. Using a bar of fixed width (to maintain the area principle) drawn above each class label, extend the height appropriately. Title the graph.
Learning Objective 2: Bar Graphs Example
Construct a bar graph from the following table of absences today by grade level.
Grade Level    Absences Today
6th                         7
7th                        12
8th                         4
Learning Objective 2: Bar Graphs Example - Solution Step One: Draw your axes.
Learning Objective 2: Bar Graphs Example - Solution Step Two: Scale and label your axes. (Figure: horizontal axis “Grade Level” with categories 6th, 7th, 8th; vertical axis “# of Absences”.)
Learning Objective 2: Bar Graphs Example - Solution Step Three: Plot your data.
Learning Objective 2: Bar Graphs Example - Solution Step Four: Fill in your bars.
Learning Objective 2: Bar Graphs Example - Solution Step Five: Title the graph: “Absences in Each Grade Level”.
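The five steps can be approximated with a small text-based chart. This is only a sketch under simplifying assumptions: the bars are drawn horizontally rather than vertically, and the helper `text_bar_chart` is a hypothetical name introduced for this example. Every `#` has the same width, so bar length stays proportional to the count, in keeping with the area principle.

```python
# Absences today by grade level, from the example table.
absences = {"6th": 7, "7th": 12, "8th": 4}


def text_bar_chart(data, title, category_label, value_label):
    """Build a titled, labeled bar chart using one '#' per observation."""
    lines = [title, f"{category_label} | {value_label}"]
    for category, count in data.items():
        # Fixed-width category label, then a bar whose length equals the count.
        lines.append(f"{category:<5} | {'#' * count} {count}")
    return "\n".join(lines)


chart = text_bar_chart(
    absences,
    title="Absences in Each Grade Level",
    category_label="Grade Level",
    value_label="# of Absences",
)
print(chart)
```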
Learning Objective 2: Graphs for Categorical Variables Many students spend lots of time constructing graphs only to forget the labels. It is imperative to communicate the data with the proper labels and scaling. Unless specifically directed to do so, do not create a pie chart. Statisticians prefer bar charts to pie charts because they are easier to create and compare.
Learning Objective 3: Misleading Statistics “There are three kinds of lies: lies, damned lies, and statistics.” Benjamin Disraeli
Learning Objective 3: Misleading Statistics ‒ Survey problems: choice of sample, question phrasing. ‒ Misleading graphs: scale, missing numbers, pictographs. ‒ Correlation vs. causation. ‒ Self-interest study. ‒ Partial pictures. ‒ Deliberate distortions. ‒ Mistakes.
Learning Objective 3: Misleading Statistics Questions to Ask When Looking at Data and/or Graphs. Is the information presented correctly? Is the graph trying to influence you? Does the scale use a regular interval? What impression is the graph giving you?
Learning Objective 3: Misleading Statistics The best data displays observe a fundamental principle of graphing data called the area principle. The area principle says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents. Violations of the area principle are a common way of misleading with statistics.
Learning Objective 3: Misleading Statistics Adjusting the scale of a graph is a common way to mislead (or lie) with statistics, because the resulting graph does not follow the area principle. Example:
Learning Objective 3: Misleading Statistics - Why is this graph misleading? The title tells the reader what to think (that there are huge increases in price). The actual increase in price is 2,000 pounds, which is less than a 3% increase. The graph shows the second bar as being 3 times the size of the first bar, which implies that the price tripled (a 200% increase). This violates the area principle. The scale moves from 0 to 80,000 in the same amount of space as 80,000 to 81,000.
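To see how far the picture departs from the data, one can compare the actual percent change with the change implied by the bar heights. The sketch below assumes prices of 80,000 and 82,000 pounds, which are consistent with the 2,000-pound increase described here but are not stated on the slide, and bar heights of 1 and 3 units.

```python
# Assumed prices (hypothetical, chosen to match a 2,000-pound increase).
old_price = 80_000
new_price = 82_000

# Actual percent increase in price.
actual_increase_pct = (new_price - old_price) / old_price * 100

# Bar heights as drawn: the second bar is 3 times as tall as the first.
bar_height_old = 1.0
bar_height_new = 3.0

# Percent increase that the picture suggests to the reader.
implied_increase_pct = (bar_height_new - bar_height_old) / bar_height_old * 100

print(f"actual increase:  {actual_increase_pct:.1f}%")
print(f"implied increase: {implied_increase_pct:.0f}%")
```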
Learning Objective 3: Misleading Statistics - A more accurate graph: ‒ An unbiased title. ‒ A scale with a regular interval. ‒ Shows a more accurate picture of the increase. ‒ Follows the area principle.
Learning Objective 3: Misleading Statistics Why is this graph misleading? The scale does not have a regular interval.
Learning Objective 3: Misleading Statistics Graphs in the news can be misleading. The margin of error is the amount (usually in percentage points) that the results can be “off by.” Be wary of data with large margins of error.
Learning Objective 3: Misleading Statistics From CNN.com
Learning Objective 3: Misleading Statistics Problems: The difference between Democrats and Republicans (and between Democrats and Independents) is 8 percentage points (62 – 54). Since the margin of error is 7 percentage points, the true difference could be considerably smaller. The graph implies that the Democrats were 8 times more likely to agree with the decision. In truth, they were only slightly more likely to agree with the decision. The graph does not make clear that a majority of all groups interviewed agreed with the decision.
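One rough way to see why the 8-point gap deserves caution is to add and subtract the margin of error from each group's percentage and check whether the resulting ranges overlap. This is an informal check under the figures quoted above, not a formal significance test; the helper `ranges_overlap` is a hypothetical name used only for this sketch.

```python
# Percent agreeing in each group and the reported margin of error (percentage points).
results = {"Democrats": 62, "Republicans": 54, "Independents": 54}
margin_of_error = 7

# Plausible range for each group: reported value plus or minus the margin of error.
ranges = {
    group: (pct - margin_of_error, pct + margin_of_error)
    for group, pct in results.items()
}


def ranges_overlap(a, b):
    """Return True when two (low, high) ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


for group, (low, high) in ranges.items():
    print(f"{group:<13} {low}% to {high}%")

print(ranges_overlap(ranges["Democrats"], ranges["Republicans"]))
```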
Learning Objective 3: Misleading Statistics CNN.com updates the graph:
Learning Objective 3: Area Principle - Pictographs Double the length, width, and height of a cube, and the volume increases by a factor of eight.
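The same effect can be worked out for any scale factor. In the sketch below, `scaled_measures` is a hypothetical helper written for this example: scaling every dimension of a picture by a factor k multiplies its area by k squared and its volume by k cubed, which is why pictographs that grow in every direction exaggerate differences.

```python
def scaled_measures(scale):
    """Return how length, area, and volume grow when every dimension is scaled."""
    return {
        "length": scale,        # one dimension
        "area": scale ** 2,     # two dimensions
        "volume": scale ** 3,   # three dimensions
    }


doubled = scaled_measures(2)
print(doubled)
```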
Learning Objective 3: Misleading Statistics What’s Wrong With This Picture? You might think that a good way to show the Titanic data is with this display:
Learning Objective 3: Misleading Statistics The ship display makes it look like most of the people on the Titanic were crew members, with a few passengers along for the ride. When we look at each ship, we see the area taken up by the ship, instead of the length of the ship. The ship display violates the area principle: The area occupied by a part of the graph should correspond to the magnitude of the value it represents.
Learning Objective 3: Misleading Statistics Missing Numbers
Learning Objective 4: Contingency Table
We have already looked at how to summarize one categorical variable using a frequency or relative frequency table. When we are interested in looking at a possible relationship between two variables, we organize the data into a two-way table called a contingency table.
(Table: After High School Plans, with columns 4 Year College, 2 Year College, Enlist, and Total, by Gender, with rows Female, Male, and Total.)
Learning Objective 4: Association The main purpose of data analysis with two variables is to investigate whether there is an association and to describe that association. An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.
Learning Objective 4: Contingency Table A contingency table or two-way table: ‒ Displays two categorical variables. ‒ The rows list the categories of one variable. ‒ The columns list the categories of the other variable. ‒ Entries in the table are frequencies.
Learning Objective 4: Contingency Table The table below presents Census Bureau data describing the age and sex of college students. This is a two-way table because it describes two categorical variables. (Age is categorical here because the students are grouped into age categories.) Age group is the row variable because each row in the table describes students in one age group. Sex is the column variable because each column describes one sex. The entries in the table are the counts of students in each age-by-sex class.
Learning Objective 4: Contingency Table Discrepancies may appear in tabular data. For example, the sum of entries in the “25 to 34” row is 1,904 + 1,589 = 3,493. The entry in the total column is 3,494. The explanation is rounding error.
Learning Objective 4: Marginal Distribution To best grasp the information contained in the table, first look at the distribution of each variable separately. The distributions of sex alone and age alone are called marginal distributions because they appear at the right and bottom margins of the two-way table. The distribution of a categorical variable says how often each outcome occurred. Usually it is advantageous to look at percents as opposed to counts.
Learning Objective 4: Calculating Marginal Distributions
When we find a marginal distribution, we look only at totals (the values found in the right margin or bottom margin). To obtain the marginal distributions, divide the column or row totals by the grand (table) total. This is usually expressed as a percentage.
                                       Age Group
Education                  25 to 34   35 to 54      55+     Total
Did not complete HS           4,474      9,155   14,224    27,853
Completed HS                 11,546     26,481   20,060    58,087
1 to 3 years of college      10,700     22,618   11,127    44,445
4+ years of college          11,066     23,183   10,596    44,845
Total                        37,786     81,435   56,008   175,230
Learning Objective 4: Calculating Marginal Distributions - Example
Calculate the marginal distribution for Education (the row categorical variable). Divide each row total by the table total.
Education, by Age Group, 2000 (thousands of persons)
Education                  25 to 34   35 to 54      55+     Total
Did not complete HS           4,474      9,155   14,224    27,853
Completed HS                 11,546     26,481   20,060    58,087
1 to 3 years of college      10,700     22,618   11,127    44,445
4+ years of college          11,066     23,183   10,596    44,845
Total                        37,786     81,435   56,008   175,230
Education Distribution
Did not complete HS        15.9%
Completed HS               33.1%
1 to 3 years of college    25.4%
4+ years of college        25.6%
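The calculation can be reproduced with a short Python sketch. The dictionary layout and names such as `table` and `marginal` are assumptions made for illustration; the cell counts come from the education table (in thousands of persons), with each row listed in the order 25 to 34, 35 to 54, 55+.

```python
# Cell counts (thousands of persons) for each education level,
# in age-group order: 25 to 34, 35 to 54, 55+.
table = {
    "Did not complete HS": [4474, 9155, 14224],
    "Completed HS": [11546, 26481, 20060],
    "1 to 3 years of college": [10700, 22618, 11127],
    "4+ years of college": [11066, 23183, 10596],
}

# Row totals are the values in the right margin of the table.
row_totals = {level: sum(cells) for level, cells in table.items()}
grand_total = sum(row_totals.values())

# Marginal distribution of education: each row total divided by the table total.
marginal = {
    level: round(100 * total / grand_total, 1)
    for level, total in row_totals.items()
}

for level, pct in marginal.items():
    print(f"{level:<25}{pct:>6.1f}%")
```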
Learning Objective 4: Displaying Marginal Distributions - Example
Each marginal distribution from a two-way table is a distribution for a single categorical variable. We could use a pie graph or bar graph to display such a distribution.
Education Distribution
Did not complete HS        15.9%
Completed HS               33.1%
1 to 3 years of college    25.4%
4+ years of college        25.6%
Learning Objective 4: Conditional Distribution
Marginal distributions tell us nothing about the relationship between two categorical variables. To examine the relationship between two categorical variables, we look at the conditional distributions. A conditional distribution shows the distribution of one variable for just the individuals who satisfy some condition on another variable.
Education, by Age Group, 2000 (thousands of persons)
Education                  25 to 34   35 to 54      55+     Total
Did not complete HS           4,474      9,155   14,224    27,853
Completed HS                 11,546     26,481   20,060    58,087
1 to 3 years of college      10,700     22,618   11,127    44,445
4+ years of college          11,066     23,183   10,596    44,845
Total                        37,786     81,435   56,008   175,230
Learning Objective 4: Calculating Conditional Distributions
The “conditional” part is worded like:
‒ “on the condition the respondents are 35 to 54”
‒ “among those who have completed high school but did not go to college”
‒ “for those respondents over 55 years of age”
Education                  25 to 34   35 to 54      55+     Total
Did not complete HS           4,474      9,155   14,224    27,853
Completed HS                 11,546     26,481   20,060    58,087
1 to 3 years of college      10,700     22,618   11,127    44,445
4+ years of college          11,066     23,183   10,596    44,845
Total                        37,786     81,435   56,008   175,230
Learning Objective 4: Calculating Conditional Distributions
When we look at conditional distributions, we are restricted to a particular column or row (but not the “margins”). In conditional distributions, we divide by the “Total” of that column or row.
Education                  25 to 34   35 to 54      55+     Total
Did not complete HS           4,474      9,155   14,224    27,853
Completed HS                 11,546     26,481   20,060    58,087
1 to 3 years of college      10,700     22,618   11,127    44,445
4+ years of college          11,066     23,183   10,596    44,845
Total                        37,786     81,435   56,008   175,230
Learning Objective 4: Calculating Conditional Distributions - Example
Education, by Age Group, 2000 (thousands of persons)
Education                  25 to 34   35 to 54      55+     Total
Did not complete HS           4,474      9,155   14,224    27,853
Completed HS                 11,546     26,481   20,060    58,087
1 to 3 years of college      10,700     22,618   11,127    44,445
4+ years of college          11,066     23,183   10,596    44,845
Total                        37,786     81,435   56,008   175,230
Calculate the conditional distribution for those persons who have completed HS. Divide each cell value in the row “Completed HS” by the total for the row.
                 25 to 34   35 to 54     55+
Completed HS        19.9%      45.6%   34.5%
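A minimal Python sketch of this row-conditional calculation is shown below. The names `age_groups`, `completed_hs`, and `conditional` are assumptions made for illustration; the counts are the “Completed HS” row of the education table, in thousands of persons.

```python
# The "Completed HS" row of the table (thousands of persons).
age_groups = ["25 to 34", "35 to 54", "55+"]
completed_hs = [11546, 26481, 20060]

# Condition on "Completed HS": divide each cell by that row's total.
row_total = sum(completed_hs)
conditional = {
    age: round(100 * count / row_total, 1)
    for age, count in zip(age_groups, completed_hs)
}

for age, pct in conditional.items():
    print(f"{age:<10}{pct:>6.1f}%")
```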
Learning Objective 4: Displaying Conditional Distributions - Example
Each row category and column category gives a different conditional distribution. We can use a pie graph or bar graph to display these conditional distributions.
                 25 to 34   35 to 54     55+
Completed HS        19.9%      45.6%   34.5%
Learning Objective 4: Displaying Conditional Distributions Side-by-side bar charts can be used to show conditional proportions. They allow for easy comparison of the row variable with respect to the column variable.
Learning Objective 4: Displaying Conditional Distributions Does background music in supermarkets influence customer purchasing decisions? For every two-way table, there are two sets of possible conditional distributions: ‒ Wine purchased for each kind of music played (column conditionals). ‒ Music played for each kind of wine purchased (row conditionals).