Published by Martina Allison. Modified over 8 years ago.
1
Chapter 3 Displaying and Describing Categorical Data
2
Learning Objectives
1. Summarize the distribution of a categorical variable with a frequency table.
2. Display the distribution of a categorical variable with a bar chart or pie chart.
3. Recognize misleading statistics.
4. Know how to make and examine a contingency table.
5. Be able to make and examine a segmented bar chart of the conditional distribution of a variable for two or more categories.
Rubric:
Level 1 – Know the objectives.
Level 2 – Fully understand the objectives.
Level 3 – Use the objectives to solve simple problems.
Level 4 – Use the objectives to solve more advanced problems.
Level 5 – Adapt and apply the objectives to different and more complex problems.
3
Learning Objectives
6. Describe the distribution of a categorical variable in terms of its possible values and relative frequency.
7. Understand how to examine the association (independence or dependence) between categorical variables by comparing conditional and marginal percentages.
8. Know what Simpson’s paradox is and be able to recognize when it occurs.
4
Learning Objective 1: Distributions
Definition: Distribution
‒ The pattern of variation of a variable.
‒ What values a variable takes and how often it takes these values.
‒ A distribution tells us the possible values a variable takes as well as the occurrence of those values (frequency or relative frequency).
5
Learning Objective 1: Proportion & Percentage (Relative Frequencies)
The proportion of the observations that fall in a certain category is the frequency (count) of observations in that category divided by the total number of observations:
Proportion = (Frequency of that class) / (Sum of all frequencies)
The percentage is the proportion multiplied by 100.
Proportions and percentages are also called relative frequencies.
6
Learning Objective 1: Frequency, Proportion, & Percentage - Example
If 4 students out of 40 received an “A”, then
1. 4 is the frequency.
2. 4/40 = 0.10 is the proportion and relative frequency.
3. 10% is the percentage (0.10 · 100 = 10%).
7
Learning Objective 1: Frequency Table
A frequency table is a listing of possible values for a variable, together with the number of observations and/or relative frequencies for each value.
8
Learning Objective 1: Class Problem
A stock broker has been following different stocks over the last month and has recorded whether a stock is up, the same, or down in value. The results were:
Performance of stock   Count
Up                        21
Same                       7
Down                      12
1. What is the variable of interest?
2. What type of variable is it?
3. Add proportions to this frequency table.
9
Learning Objective 1: Class Problem - Solution
1. What is the variable of interest? Performance of stock.
2. What type of variable is it? Categorical.
3. Add proportions to this frequency table. See below.
Performance of stock   Count   Relative Frequency
Up                        21   21/40 = 0.525
Same                       7   7/40 = 0.175
Down                      12   12/40 = 0.30
Total                     40   1
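The relative frequencies in the solution can be reproduced with a few lines of Python. This is only a minimal sketch using the counts from the class problem; the function name `relative_frequencies` is an illustrative choice, not something from the slides.

```python
# Counts from the stock-performance class problem.
counts = {"Up": 21, "Same": 7, "Down": 12}

def relative_frequencies(counts):
    """Return each category's proportion of the total count."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

proportions = relative_frequencies(counts)

# Print a small frequency table: category, count, proportion, percentage.
for category, p in proportions.items():
    print(f"{category:<5} {counts[category]:>3}  {p:.3f}  {p * 100:.1f}%")
```

The proportions should sum to 1, which is a quick check that no category was left out.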
10
Learning Objective 2: Graphs for Categorical Variables
Displaying categorical data
Frequency tables can be difficult to read. Sometimes it is easier to analyze a distribution by displaying it with a bar graph or pie chart.
Frequency Table
Format               Count of Stations
Adult Contemporary    1556
Adult Standards       1196
Contemporary Hit       569
Country               2066
News/Talk             2179
Oldies                1060
Religious             2014
Rock                   869
Spanish Language       750
Other Formats         1579
Total                13838
Relative Frequency Table
Format               Percent of Stations
Adult Contemporary    11.2
Adult Standards        8.6
Contemporary Hit       4.1
Country               14.9
News/Talk             15.7
Oldies                 7.7
Religious             14.6
Rock                   6.3
Spanish Language       5.4
Other Formats         11.4
Total                 99.9
11
Learning Objective 2: Graphs for Categorical Variables
Use pie charts and bar graphs to summarize categorical variables.
‒ Pie Chart: A circle having a “slice of pie” for each category.
‒ Bar Graph: A graph that displays a vertical bar for each category.
[Diagram: categorical data can be graphed with a pie chart or a bar chart.]
12
Learning Objective 2: Graphs for Categorical Variables Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.). Bar charts Pie charts
13
Learning Objective 2: Pie Charts When you are interested in parts of the whole (relative frequency or percentages), a pie chart might be your display of choice. Pie charts show the whole group of cases as a circle. They slice the circle into pieces whose size is proportional to the fraction of the whole in each category.
14
Learning Objective 2: Pie Charts - Procedure
A pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data.
First draw a circle; then subdivide the circle into sectors that correspond in area to the relative frequency for each category.
Since there are 360 degrees in a circle, a category with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle.
“Good practice” requires including a title and either wedge labels or a legend.
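The angle rule in the procedure (relative frequency times 360 degrees) can be sketched in Python as follows. The helper name `pie_angles` is hypothetical, and the example input reuses the stock-performance proportions from the earlier class problem.

```python
def pie_angles(relative_freqs):
    """Convert relative frequencies (proportions) to sector angles in degrees."""
    return {category: rf * 360 for category, rf in relative_freqs.items()}

# A relative frequency of .25 takes up a quarter of the circle.
quarter = pie_angles({"Example category": 0.25})

# Stock-performance proportions from the earlier class problem.
angles = pie_angles({"Up": 0.525, "Same": 0.175, "Down": 0.30})

for category, angle in angles.items():
    print(f"{category:<5} {angle:6.1f} degrees")
```

When the relative frequencies sum to 1, the angles sum to 360 degrees, which is a useful check before drawing.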
15
Learning Objective 2: Pie Chart Example
Construct a pie chart for the table on U.S. sources of electricity below.
16
Learning Objective 2: Pie Chart Example - Solution
Step 1: Convert the percentage or relative frequencies of each category to an angle measurement.
Step 2: Draw a circle and divide into sectors using the angles calculated.
Angle
.51 · 360 = 183.6°
.06 · 360 = 21.6°
.16 · 360 = 57.6°
.21 · 360 = 75.6°
.03 · 360 = 10.8°
360°
17
Learning Objective 2: Pie Chart Example - Solution
Step 3: Using “good practices,” include a title and either wedge labels or a legend.
18
Learning Objective 2: Pie Chart
19
Learning Objective 2: Bar Graphs
‒ Bar graphs are used for summarizing a categorical variable.
‒ Bar graphs display a vertical bar for each category.
‒ The height of each bar represents either counts (“frequencies”) or percentages (“relative frequencies”) for that category.
‒ It is usually easier to compare categories with a bar graph than with a pie chart.
‒ A bar chart stays true to the area principle.
‒ The bars are separated to emphasize the fact that each class is a separate category.
20
Learning Objective 2: Bar Graphs Either counts (frequency bar chart) or proportions (relative frequency bar chart) may be shown on the y-axis. This will not change the shape or relationships of the graph. Make sure all graphs have a descriptive title and that the axes are labeled (this is true for all graphs).
21
Learning Objective 2: Bar Graphs - Procedure A bar chart is a graphical device for depicting categorical data. On one axis (usually the vertical axis) pick an appropriate scale for frequency, relative frequency, or percentage and label. On the other axis (usually the horizontal axis), specify the labels that are used for each of the categories. Using a bar of fixed width (to maintain the area principle) drawn above each class label, extend the height appropriately. Title the graph.
22
Learning Objective 2: Bar Graphs Example
Construct a bar graph for the following table of absences today by grade level.
Grade Level   Absences Today
6th            7
7th           12
8th            4
23
Learning Objective 2: Bar Graphs Example - Solution
Step One: Draw your axes.
24
Learning Objective 2: Bar Graphs Example - Solution
Step Two: Scale and label your axes.
[Graph: horizontal axis “Grade Level” (6th, 7th, 8th); vertical axis “# of Absences” (5, 10, 15).]
25
Learning Objective 2: Bar Graphs Example - Solution
Step Three: Plot your data.
26
Learning Objective 2: Bar Graphs Example - Solution
Step Four: Fill in your bars.
27
Learning Objective 2: Bar Graphs Example - Solution
Step Five: Title the graph.
[Graph title: “Absences in Each Grade Level”.]
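As a rough illustration of the five steps, the sketch below prints a plain-text bar chart of the absences data. It is an assumption-laden stand-in for a drawn graph: the bars run horizontally rather than vertically to keep the code short, and the function name `text_bar_chart` is invented for this example. Every bar uses the same symbol width per absence, which keeps the chart faithful to the area principle.

```python
# Absences today by grade level, from the example table.
absences = {"6th": 7, "7th": 12, "8th": 4}

def text_bar_chart(data, title, symbol="#"):
    """Build a titled, labeled text bar chart with one symbol per unit."""
    lines = [title]
    for category, count in data.items():
        # Category label, a bar whose length equals the count, then the count.
        lines.append(f"{category:>4} | {symbol * count} {count}")
    return "\n".join(lines)

chart = text_bar_chart(absences, "Absences in Each Grade Level")
print(chart)
```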
28
Learning Objective 2: Graphs for Categorical Variables Many students spend lots of time constructing graphs only to forget the labels. It is imperative to communicate the data with the proper labels and scaling. Unless specifically directed to do so, do not create a pie chart. Statisticians prefer bar charts to pie charts because they are easier to create and compare.
29
Learning Objective 3: Misleading Statistics
“There are three kinds of lies: lies, damned lies, and statistics.”
Benjamin Disraeli (1804 - 1881)
30
Learning Objective 3: Misleading Statistics
‒ Survey problems: choice of sample, question phrasing
‒ Misleading graphs: scale, missing numbers, pictographs
‒ Correlation vs. causation
‒ Self-interest study
‒ Partial pictures
‒ Deliberate distortions
‒ Mistakes
31
Learning Objective 3: Misleading Statistics Questions to Ask When Looking at Data and/or Graphs. Is the information presented correctly? Is the graph trying to influence you? Does the scale use a regular interval? What impression is the graph giving you?
32
Learning Objective 3: Misleading Statistics The best data displays observe a fundamental principle of graphing data called the area principle. The area principle says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents. Violations of the area principle are a common way of misleading with statistics.
33
Learning Objective 3: Misleading Statistics Adjusting the scale of a graph is a common way to mislead (or lie) with statistics. Not following the area principle. Example:
34
Learning Objective 3: Misleading Statistics - Why is this graph misleading? This title tells the reader what to think (that there are huge increases in price). The actual increase in price is 2,000 pounds, which is less than a 3% increase. The graph shows the second bar as being 3 times the size of the first bar, which implies a 300% increase in price. Violates the area principle. The scale moves from 0 to 80,000 in the same amount of space as 80,000 to 81,000.
35
Learning Objective 3: Misleading Statistics - A more accurate graph:
‒ An unbiased title.
‒ A scale with a regular interval.
‒ This shows a more accurate picture of the increase.
‒ Follows the area principle.
36
Learning Objective 3: Misleading Statistics The scale does not have a regular interval. Why is this graph misleading?
37
Learning Objective 3: Misleading Statistics
Graphs in the news can be misleading.
The margin of error is the amount (usually in percentage points) that the results can be “off by.”
Be wary of data with large margins of error.
38
Learning Objective 3: Misleading Statistics From CNN.com
39
Learning Objective 3: Misleading Statistics
Problems:
‒ The difference between Democrats and Republicans (and between Democrats and Independents) is 8 percentage points (62 – 54). Since the margin of error is 7 percentage points, the true difference may be even smaller.
‒ The graph implies that the Democrats were 8 times more likely to agree with the decision. In truth, they were only slightly more likely to agree with the decision.
‒ The graph does not accurately demonstrate that a majority of all groups interviewed agreed with the decision.
40
Learning Objective 3: Misleading Statistics CNN.com updates the graph:
41
Learning Objective 3: Area Principle - Pictographs
Double the length, width, and height of a cube, and the volume increases by a factor of eight.
42
Learning Objective 3: Misleading Statistics What’s Wrong With This Picture? You might think that a good way to show the Titanic data is with this display:
43
Learning Objective 3: Misleading Statistics
The ship display makes it look like most of the people on the Titanic were crew members, with a few passengers along for the ride.
When we look at each ship, we see the area taken up by the ship, instead of the length of the ship.
The ship display violates the area principle: The area occupied by a part of the graph should correspond to the magnitude of the value it represents.
44
Learning Objective 3: Misleading Statistics
Missing Numbers
45
Learning Objective 4: Contingency Table
We have already looked at how to summarize one categorical variable using a frequency or relative frequency table.
When we are interested in looking at a possible relationship between two variables, we organize data into a two-way table called a contingency table.
                 After High School Plans
Gender    4 Year College   2 Year College   Enlist   Total
Female          4                2             5       11
Male            4                1             2        7
Total           8                3             7       18
46
Learning Objective 4: Association The main purpose of data analysis with two variables is to investigate whether there is an association and to describe that association. An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.
47
Learning Objective 4: Contingency Table A contingency table or two-way table: ‒ Displays two categorical variables. ‒ The rows list the categories of one variable. ‒ The columns list the categories of the other variable. ‒ Entries in the table are frequencies.
48
Learning Objective 4: Contingency Table
The table below presents Census Bureau data describing the age and sex of college students. This is a two-way table because it describes two categorical variables. (Age is categorical here because the students are grouped into age categories.) Age group is the row variable because each row in the table describes students in one age group. Sex is the column variable because each column describes one sex. The entries in the table are the counts of students in each age-by-sex class.
49
Learning Objective 4: Contingency Table Discrepancies may appear in tabular data. For example, the sum of entries in the “25 to 34” row is 1,904 + 1,589 = 3,493. The entry in the total column is 3,494. The explanation is rounding error.
50
Learning Objective 4: Marginal Distribution To best grasp the information contained in the table, first look at the distribution of each variable separately. The distributions of sex alone and age alone are called marginal distributions because they appear at the right and bottom margins of the two-way table. The distribution of a categorical variable says how often each outcome occurred. Usually it is advantageous to look at percents as opposed to counts.
51
Learning Objective 4: Calculating Marginal Distributions
When we find a marginal distribution, we only look at totals (the values found on the right margin or bottom margin).
To obtain the marginal distributions, divide the column or row totals by the grand (table) total. This is usually expressed as a percentage.
                                       Age Group
Education                   25 to 34   35 to 54      55+     Total
Did not complete HS            4,474      9,155   14,224    27,853
Completed HS                  11,546     26,481   20,060    58,087
1 to 3 years of college       10,700     22,618   11,127    44,445
4+ years of college           11,066     23,183   10,596    44,845
Total                         37,786     81,435   56,008   175,230
52
Learning Objective 4: Calculating Marginal Distributions - Example
Calculate the marginal distribution for Education (the row categorical variable). Divide each row total by the table total.
Education, by Age Group, 2000 (thousands of persons)
Education                   25 to 34   35 to 54      55+     Total
Did not complete HS            4,474      9,155   14,224    27,853
Completed HS                  11,546     26,481   20,060    58,087
1 to 3 years of college       10,700     22,618   11,127    44,445
4+ years of college           11,066     23,183   10,596    44,845
Total                         37,786     81,435   56,008   175,230
Education Distribution
Did not complete HS        15.9%
Completed HS               33.1%
1 to 3 years of college    25.4%
4+ years of college        25.6%
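The marginal distribution for Education can be checked with a short Python sketch. It uses only the row totals and the table total from the slide; the variable names are illustrative.

```python
# Row totals (thousands of persons) from the Education by Age Group table.
row_totals = {
    "Did not complete HS": 27853,
    "Completed HS": 58087,
    "1 to 3 years of college": 44445,
    "4+ years of college": 44845,
}
table_total = 175230

# Marginal distribution: each row total divided by the table total, in percent.
marginal = {
    level: round(100 * total / table_total, 1)
    for level, total in row_totals.items()
}

for level, percent in marginal.items():
    print(f"{level:<25} {percent:5.1f}%")
```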
53
Learning Objective 4: Displaying Marginal Distributions - Example
Each marginal distribution from a two-way table is a distribution for a single categorical variable. We could use a pie graph or bar graph to display such a distribution.
Education Distribution
Did not complete HS        15.9%
Completed HS               33.1%
1 to 3 years of college    25.4%
4+ years of college        25.6%
54
Learning Objective 4: Conditional Distribution
Marginal distributions tell us nothing about the relationship between two categorical variables.
To examine the relationship between two categorical variables, we look at the conditional distributions.
A conditional distribution shows the distribution of one variable for just the individuals who satisfy some condition on another variable.
Education, by Age Group, 2000 (thousands of persons)
Education                   25 to 34   35 to 54      55+     Total
Did not complete HS            4,474      9,155   14,224    27,853
Completed HS                  11,546     26,481   20,060    58,087
1 to 3 years of college       10,700     22,618   11,127    44,445
4+ years of college           11,066     23,183   10,596    44,845
Total                         37,786     81,435   56,008   175,230
55
Learning Objective 4: Calculating Conditional Distributions
The “conditional” part is worded like:
‒ “on the condition the respondents are 35 to 54”
‒ “among those who have completed high school but did not go to college”
‒ “for those respondents over 55 years of age”
Education                   25 to 34   35 to 54      55+     Total
Did not complete HS            4,474      9,155   14,224    27,853
Completed HS                  11,546     26,481   20,060    58,087
1 to 3 years of college       10,700     22,618   11,127    44,445
4+ years of college           11,066     23,183   10,596    44,845
Total                         37,786     81,435   56,008   175,230
56
Learning Objective 4: Calculating Conditional Distributions
When we look at conditional distributions, we are restricted to a particular column or row (but not the “margins”).
In conditional distributions, we divide by the “Total” of the column or row.
Education                   25 to 34   35 to 54      55+     Total
Did not complete HS            4,474      9,155   14,224    27,853
Completed HS                  11,546     26,481   20,060    58,087
1 to 3 years of college       10,700     22,618   11,127    44,445
4+ years of college           11,066     23,183   10,596    44,845
Total                         37,786     81,435   56,008   175,230
57
Learning Objective 4: Calculating Conditional Distributions - Example
Education, by Age Group, 2000 (thousands of persons)
Education                   25 to 34   35 to 54      55+     Total
Did not complete HS            4,474      9,155   14,224    27,853
Completed HS                  11,546     26,481   20,060    58,087
1 to 3 years of college       10,700     22,618   11,127    44,445
4+ years of college           11,066     23,183   10,596    44,845
Total                         37,786     81,435   56,008   175,230
Calculate the conditional distribution for those persons who have completed HS. Divide each cell value in the row “Completed HS” by the total for the row.
               25 to 34   35 to 54     55+
Completed HS      19.9%      45.6%   34.5%
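The same calculation can be sketched in Python: restrict attention to the “Completed HS” row and divide each cell by that row's total. The variable names below are illustrative assumptions.

```python
# Cell counts (thousands of persons) in the "Completed HS" row.
completed_hs = {"25 to 34": 11546, "35 to 54": 26481, "55+": 20060}

# The condition is "completed HS", so divide by the row total,
# not by the table total.
row_total = sum(completed_hs.values())

conditional = {
    age_group: round(100 * count / row_total, 1)
    for age_group, count in completed_hs.items()
}

for age_group, percent in conditional.items():
    print(f"{age_group:<9} {percent:5.1f}%")
```

Dividing by the table total (175,230) instead would give joint percentages, which do not describe the age distribution among high school completers.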
58
Learning Objective 4: Calculating Conditional Distributions - Example
Each row category and column category gives a different conditional distribution. We can use a pie graph or bar graph to display these conditional distributions.
               25 to 34   35 to 54     55+
Completed HS      19.9%      45.6%   34.5%
59
Learning Objective 4: Displaying Conditional Distributions
Side-by-side bar charts can be used to show conditional proportions. They allow easy comparison of the row variable with respect to the column variable.
60
Learning Objective 4: Displaying Conditional Distributions
Does background music in supermarkets influence customer purchasing decisions?
For every two-way table, there are two sets of possible conditional distributions:
‒ Wine purchased for each kind of music played (column conditionals)
‒ Music played for each kind of wine purchased (row conditionals)
61
Learning Objective 4: Contingency Table - Review
                  Job Satisfaction
Income        1      2      3      4   Row Total
< 30K        20     24     80     82      206
30K-50K      22     38    104    125      289
50K-80K      13     28     81    113      235
> 80K         7     18     54     92      171
C. Total     62    108    319    412      901
[Callouts on the table: Conditional distribution; Marginal distribution; Table total.]
This is a contingency table with Income Level as the row variable and Job Satisfaction as the column variable.
The distributions of income to job satisfaction or job satisfaction to income are called conditional distributions.
The distributions of income alone and job satisfaction alone are called marginal distributions.
Relationships between categorical variables are described by calculating appropriate percents from the counts given in each cell.
62
Learning Objective 4: Contingency Table – Your Turn Many kidney dialysis patients get vitamin D injections to correct for a lack of calcium. Two forms of vitamin D injections are used: calcitriol and paricalcitol. The records of 67,000 dialysis patients were examined, and half received one drug; the other half the other drug. After three years, 58.7% of those getting paricalcitol had survived, while only 51.5% of those getting calcitriol had survived. Construct an approximate two-way table of the data (due to rounding of the percentages we can’t recover the exact counts – round to whole numbers).
63
Learning Objective 4: Contingency Table - Solution
Two forms of vitamin D injections are used: calcitriol and paricalcitol. The records of 67,000 dialysis patients were examined, and half received one drug; the other half the other drug. After three years, 58.7% of those getting paricalcitol had survived, while only 51.5% of those getting calcitriol had survived.
                Survived     Died    Total
Calcitriol        17,252   16,248   33,500
Paricalcitol      19,665   13,835   33,500
Total             36,917   30,083   67,000
Is survival independent of the form of vitamin D injected? Why?
No. The survival percentages differ between the two drugs (58.7% for paricalcitol versus 51.5% for calcitriol), so survival appears to depend on the form of vitamin D injected.
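A short Python sketch can confirm that the approximate counts in the solution table are consistent with the reported percentages. The counts are taken from the solution; the dictionary layout and the helper name `row_percentages` are assumptions made for illustration.

```python
# Approximate counts from the solution table (rounded to whole patients).
table = {
    "Calcitriol":   {"Survived": 17252, "Died": 16248},
    "Paricalcitol": {"Survived": 19665, "Died": 13835},
}

def row_percentages(row):
    """Conditional distribution of outcome within one drug group, in percent."""
    total = sum(row.values())
    return {outcome: round(100 * count / total, 1) for outcome, count in row.items()}

survival_by_drug = {drug: row_percentages(row) for drug, row in table.items()}
grand_total = sum(sum(row.values()) for row in table.values())

for drug, dist in survival_by_drug.items():
    print(f"{drug:<13} survived {dist['Survived']}%  died {dist['Died']}%")
print("Table total:", grand_total)
```

Because the two conditional survival percentages differ, the table does not look like what independence would produce.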
64
Learning Objective 4: Contingency Table - Your Turn:
The following two-way table summarizes the number of cancer patients treated at two cancer clinics who died or survived. What percentage of the cancer patients survived?
a) 390 / 1000 = 39%
b) 320 / 1000 = 32%
c) 710 / 1000 = 71%
d) 290 / 1000 = 29%
65
Learning Objective 4: Contingency Table - Solution:
What percentage of the cancer patients survived?
Answer: c) 710 / 1000 = 71% (all survivors divided by all patients).
66
Learning Objective 4: Contingency Table - Your Turn:
The following two-way table summarizes the number of cancer patients treated at two cancer clinics who died or survived. What percentage of the cancer patients at Clinic A survived?
a) 390 / 1000 = 39%
b) 390 / 710 = 55%
c) 710 / 1000 = 71%
d) 390 / 600 = 65%
67
Learning Objective 4: Contingency Table - Solution:
What percentage of the cancer patients at Clinic A survived?
Answer: d) 390 / 600 = 65% (Clinic A survivors divided by all Clinic A patients).
68
Learning Objective 4: Contingency Table - Your Turn:
The following two-way table summarizes the number of cancer patients treated at two cancer clinics who died or survived. What percentage of the cancer patients who survived were treated at Clinic B?
a) 320 / 1000 = 32%
b) 320 / 400 = 80%
c) 320 / 710 = 45%
d) 710 / 1000 = 71%
69
Learning Objective 4: Contingency Table - Solution:
What percentage of the cancer patients who survived were treated at Clinic B?
Answer: c) 320 / 710 = 45% (Clinic B survivors divided by all survivors).
70
Learning Objective 4: Contingency Table - Your Turn:
The following two-way table summarizes the number of single and married students in a basic statistics course who like watching professional football. The percentage of students in this class who are married is considered
a) A marginal percentage
b) A conditional percentage
c) Something else
71
Learning Objective 4: Contingency Table - Solution:
The percentage of students in this class who are married is considered
Answer: a) A marginal percentage (a margin total divided by the table total).
72
Learning Objective 4: Contingency Table - Your Turn:
The following two-way table summarizes the number of single and married students in a basic statistics course who like watching professional football. The percentage of married students in this class who like football is considered
a) A marginal percentage
b) A conditional percentage
c) Something else
73
Learning Objective 4: Contingency Table - Solution:
The percentage of married students in this class who like football is considered
Answer: b) A conditional percentage (restricted to the married students).
74
Learning Objective 4: Calculating Marginal and Conditional Distributions - Problem
Find each percentage and state whether it is a marginal or conditional distribution.
                      Seniors
Plans             White   Minority   Total
4-year college      198         44     242
2-year college       36          6      42
Enlist                4          1       5
Employment           14          3      17
Other                16          3      19
Total               268         57     325
a) What percent of the seniors are white?
268/325 x 100% ≈ 82.5% Marginal
b) What percent of the seniors are planning to attend a 2-year college?
42/325 x 100% ≈ 12.9% Marginal
c) What percent of the seniors are white and planning to attend a 2-year college?
36/325 x 100% ≈ 11.1% Neither
d) What percent of the white seniors are planning to attend a 2-year college?
36/268 x 100% ≈ 13.4% Conditional
e) What percent of the seniors planning to attend a 2-year college are white?
36/42 x 100% ≈ 85.7% Conditional
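The five answers differ only in which count goes in the numerator and which total goes in the denominator. The Python sketch below makes that choice explicit; the table layout and the helper name `pct` are illustrative assumptions.

```python
# Seniors' plans by group, from the problem's two-way table.
table = {
    "4-year college": {"White": 198, "Minority": 44},
    "2-year college": {"White": 36, "Minority": 6},
    "Enlist": {"White": 4, "Minority": 1},
    "Employment": {"White": 14, "Minority": 3},
    "Other": {"White": 16, "Minority": 3},
}

def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)

grand_total = sum(sum(row.values()) for row in table.values())
white_total = sum(row["White"] for row in table.values())
two_year_total = sum(table["2-year college"].values())
white_two_year = table["2-year college"]["White"]

answers = {
    "a": pct(white_total, grand_total),       # marginal: column total / table total
    "b": pct(two_year_total, grand_total),    # marginal: row total / table total
    "c": pct(white_two_year, grand_total),    # neither: single cell / table total
    "d": pct(white_two_year, white_total),    # conditional on being white
    "e": pct(white_two_year, two_year_total), # conditional on 2-year college plans
}
print(answers)
```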
75
Learning Objective 4: Calculating Marginal and Conditional Distributions – Your Turn
An article in the Winter 2003 issue of Chance magazine reported on the Houston Independent School District’s magnet schools programs.
Find each percentage and state whether it is a marginal or conditional distribution.
a) What percent of all applicants were Asian?
292/1755 x 100% ≈ 16.6% Marginal
b) What percent of the students accepted were Asian?
110/931 x 100% ≈ 11.8% Conditional
c) What percent of Asians were accepted?
110/292 x 100% ≈ 37.7% Conditional
d) What percent of all students were accepted?
931/1755 x 100% ≈ 53% Marginal
76
Learning Objective 5: Segmented Bar Charts A segmented bar chart displays conditional distributions the same as a pie chart, but in the form of bars instead of circles. Each bar is treated as the “whole” and is divided proportionally into segments corresponding to the percentage in each group of the conditional distribution.
77
Learning Objective 5: Segmented Bar Charts Contingency table of ticket class vs. survival on the Titanic Conditional distributions of surviving the Titanic Conditional distributions of dying on the Titanic
78
Learning Objective 6: Describing Categorical Distributions To describe a marginal distribution, 1)Use the data in the table to calculate the marginal distribution (in percents) of the row or column totals. 2)Make a graph to display the marginal distribution. 3)Comment on and compare the heights/percentages of the different categories.
79
Learning Objective 6: Describing Categorical Distributions
Describe the marginal distribution of chance of getting rich.
Young adults by gender and chance of getting rich
                                 Female   Male   Total
Almost no chance                     96     98     194
Some chance, but probably not       426    286     712
A 50-50 chance                      696    720    1416
A good chance                       663    758    1421
Almost certain                      486    597    1083
Total                              2367   2459    4826
Response            Percent
Almost no chance    194/4826 = 4.0%
Some chance         712/4826 = 14.8%
A 50-50 chance      1416/4826 = 29.3%
A good chance       1421/4826 = 29.4%
Almost certain      1083/4826 = 22.4%
“A good chance” and “a 50-50 chance” have the highest percentages at about 29% each, and “almost no chance” the lowest at 4%. “Some chance” is at about 15% and “almost certain” at about 22%.
80
Learning Objective 6: Describing Categorical Distributions To describe or compare conditional distributions, 1)Select the row(s) or column(s) of interest. 2)Use the data in the table to calculate the conditional distribution (in percents) of the row(s) or column(s). 3)Make a graph to display the conditional distribution. Use a side-by-side bar graph or segmented bar graph to compare distributions. 4)Comment on and compare the heights/percentages of the different categories.
81
Learning Objective 6: Describing Categorical Distributions
Calculate the conditional distribution of opinion among males. Describe the relationship between gender and opinion.
Young adults by gender and chance of getting rich
                                 Female   Male   Total
Almost no chance                     96     98     194
Some chance, but probably not       426    286     712
A 50-50 chance                      696    720    1416
A good chance                       663    758    1421
Almost certain                      486    597    1083
Total                              2367   2459    4826
Response            Male                 Female
Almost no chance    98/2459 = 4.0%       96/2367 = 4.1%
Some chance         286/2459 = 11.6%     426/2367 = 18.0%
A 50-50 chance      720/2459 = 29.3%     696/2367 = 29.4%
A good chance       758/2459 = 30.8%     663/2367 = 28.0%
Almost certain      597/2459 = 24.3%     486/2367 = 20.5%
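Both conditional distributions come from the same operation applied to a different column: divide each count by its column total. A minimal Python sketch, with illustrative names, is:

```python
responses = [
    "Almost no chance",
    "Some chance, but probably not",
    "A 50-50 chance",
    "A good chance",
    "Almost certain",
]

# Counts by gender, in the same order as `responses`.
counts = {
    "Female": [96, 426, 696, 663, 486],
    "Male": [98, 286, 720, 758, 597],
}

def conditional_distribution(column_counts):
    """Percent of the column total in each response category."""
    total = sum(column_counts)
    return [round(100 * count / total, 1) for count in column_counts]

by_gender = {gender: conditional_distribution(col) for gender, col in counts.items()}

# Side-by-side listing makes the comparison between genders easier.
for i, response in enumerate(responses):
    print(f"{response:<30} F {by_gender['Female'][i]:5.1f}%  M {by_gender['Male'][i]:5.1f}%")
```

Listing the two columns side by side plays the same role as a side-by-side bar graph: it lines up corresponding categories for comparison.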
82
Learning Objective 7: Association Between Categorical Variables One type of relationship between categorical variables is an association. Definition: We say that there is an association between two variables if specific values of one variable tend to occur in common with specific values of another. Two categorical variables are associated if knowing the value of one variable helps you predict the value of the other variable. If two categorical variables are associated, then they are dependent. If two categorical variables are not associated, then they are independent.
83
Learning Objective 7: Association Between Categorical Variables To examine data for an association from a frequency table, 1)Select the row(s) or column(s) of interest (the condition rows or columns). 2)Use the data in the table to calculate the conditional distribution (in percents) of the row(s) or column(s). 3)Determine whether a specific value of one variable tends to occur in common with a specific value of another. If all the percentages are approximately the same there is no association and the variables are independent. If the percentages are different there is an association and the variables are dependent.
84
Learning Objective 7: Checking for Independence Between Variables
For a period of five years, physicians at McGill University Health Center followed more than 5000 adults over the age of 50. The researchers were investigating whether people taking a certain class of antidepressants (SSRIs) might be at greater risk of bone fractures. Their observations are summarized in the table:
                         Taking SSRI   No SSRI   Total
Experienced Fractures             14       244     258
No Fractures                     123      4627    4750
Total                            137      4871    5008
Do these results suggest there’s an association between experiencing bone fractures and the taking of SSRI antidepressants (is experiencing bone fractures independent of taking antidepressants)? Explain.
85
Learning Objective 7: Checking for Independence Between Variables
1) Select the row(s) or column(s) of interest (the condition rows or columns).
To determine if there is an association between experiencing bone fractures and the taking of SSRI antidepressants, look at the conditional distribution of fractures versus no fractures, conditional on SSRI group (the columns).
                         Taking SSRI   No SSRI   Total
Experienced Fractures             14       244     258
No Fractures                     123      4627    4750
Total                            137      4871    5008
86
Learning Objective 7: Checking for Independence Between Variables
2) Use the data in the table to calculate the conditional distribution (in percents) of the columns.
                         Taking SSRI   No SSRI   Total
Experienced Fractures             14       244     258
No Fractures                     123      4627    4750
Total                            137      4871    5008
                         Taking SSRI   No SSRI   Total
Experienced Fractures          10.2%      5.0%    5.2%
No Fractures                   89.8%     95.0%   94.8%
Total                           100%      100%    100%
87
Learning Objective 7: Checking for Independence Between Variables
3) Determine whether a specific value of one variable tends to occur in common with a specific value of another. If all the percentages are approximately the same, there is no association and the variables are independent. If the percentages are different, there is an association and the variables are dependent.
                         Taking SSRI   No SSRI   Total
Experienced Fractures          10.2%      5.0%    5.2%
No Fractures                   89.8%     95.0%   94.8%
Total                           100%      100%    100%
There appears to be an association between experiencing bone fractures and the taking of SSRI antidepressants (they are dependent, not independent). Overall, approximately 5% of the respondents experienced fractures, while among those taking SSRIs the rate was about twice that, at 10%. Likewise, overall about 95% had no fractures, while among those taking SSRIs only about 90% had no fractures.
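The three steps can be sketched in Python as follows. This is an informal comparison of percentages, not a formal statistical test; the function names and the `tolerance_points` cutoff for “approximately the same” are hypothetical choices made only for this illustration.

```python
# Step 1: the columns (SSRI groups) are the conditions.
counts = {
    "Taking SSRI": {"Experienced Fractures": 14, "No Fractures": 123},
    "No SSRI": {"Experienced Fractures": 244, "No Fractures": 4627},
}

def column_percentages(column):
    """Step 2: conditional distribution of outcomes within one column."""
    total = sum(column.values())
    return {outcome: round(100 * n / total, 1) for outcome, n in column.items()}

conditional = {group: column_percentages(col) for group, col in counts.items()}

def looks_associated(conditional, outcome, tolerance_points=1.0):
    """Step 3: compare one outcome's percentage across the conditions.

    `tolerance_points` is an arbitrary illustrative cutoff, in percentage
    points, for treating the percentages as approximately the same.
    """
    values = [dist[outcome] for dist in conditional.values()]
    return max(values) - min(values) > tolerance_points

print(conditional)
print(looks_associated(conditional, "Experienced Fractures"))
```

With these counts the fracture percentages are 10.2% and 5.0%, so the comparison points toward an association, in line with the slide's conclusion.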
88
Learning Objective 7: Association Between Categorical Variables To examine data for an association from graphs. 1)Select the row(s) or column(s) of interest (the condition rows or columns). 2)Use the data in the table to make pie graphs of the conditional distributions or a segmented bar graph of the conditional distributions. 3)If the sectors (of the pie graphs) or the segments (of the bars) are approximately the same size, then the variables are not associated (independent). If the sectors (of the pie graphs) or the segments (of the bars) are not the same size, then the variables are associated (dependent).
89
Learning Objective 7: Association Graphs No association – Independent (corresponding sectors between pie graphs are approximately the same size for females and males).
90
Learning Objective 7: Association Graphs Association – Dependent (corresponding segments between bars are different sizes for males and females).
91
Learning Objective 7: Checking for Independence Between Variables The contingency table shows the relationship between class of ticket and surviving the sinking of the Titanic. Is there an association between ticket class and surviving the Titanic (are ticket class and survival dependent or independent)?
92
Learning Objective 7: Checking for Independence Between Variables 1)Select the row(s) or column(s) of interest (the condition rows or columns). Is there an association between ticket class and surviving the Titanic? The row variable, survival, is the condition.
93
Learning Objective 7: Checking for Independence Between Variables 2)Use the data in the table to make pie graphs of the conditional distributions or a segmented bar graph of the conditional distributions. Survival is the condition, so construct segmented bar graphs or pie graphs of the categories alive and dead.
94
Learning Objective 7: Checking for Independence Between Variables 3)If the sectors (of the pie graphs) or the segments (of the bars) are approximately the same size, then the variables are not associated (independent). If the sectors (of the pie graphs) or the segments (of the bars) are not the same size, then the variables are associated (dependent). In this case the sectors or segments for corresponding categories are not approximately the same size, so class and survival are dependent. There is an association between the variables ticket class and survival.
95
Learning Objective 7: Checking for Independence Between Variables – Class Problem Examine the table below about ethnicity and acceptance for the Houston Independent School District’s magnet schools program. Does it appear that the admissions decisions are made independently of the applicant’s ethnicity?
96
Learning Objective 7: Solution First calculate the Ethnicity conditional distributions (row conditional distributions).
Admission decision by ethnicity (row percentages):
Ethnicity           Accepted   Turned Away   Wait-listed   Total
Black / Hispanic      93.81%         0%          6.19%      100%
Asian                 37.67%        16.78%      45.55%      100%
White                 35.52%        26.53%      37.95%      100%
Total                 53.05%        17.09%      29.86%      100%
97
Learning Objective 7: Solution – Pie Charts by Ethnicity
98
Learning Objective 7: Solution – Segmented Bar Graph by Ethnicity
99
Learning Objective 7: Solution - Conclusion The admissions decisions of the Houston Independent School District’s magnet schools program are not independent of the applicant’s ethnicity. About 53% of all applicants were accepted, compared with about 94% of Black / Hispanic applicants but only about 38% of Asian and 36% of White applicants. There are similar differences for the categories Turned Away and Wait-listed. The graphs also show a difference in the size of corresponding sectors and segments, indicating dependence.
Admission decision by ethnicity (row percentages):
Ethnicity           Accepted   Turned Away   Wait-listed   Total
Black / Hispanic      93.81%         0%          6.19%      100%
Asian                 37.67%        16.78%      45.55%      100%
White                 35.52%        26.53%      37.95%      100%
Total                 53.05%        17.09%      29.86%      100%
100
Learning Objective 7: Independence Between Variables – Class Problem Medical researchers followed 6272 Swedish men for 30 years to see if there was any association between the amount of fish in their diet and prostate cancer. Their results are summarized in the table. Is there an association between fish consumption and prostate cancer?
                        Prostate Cancer
Fish Consumption          No     Yes
Never/Seldom             110      14
Small Part of Diet      2420     201
Moderate Part           2769     209
Large Part               507      42
101
Learning Objective 7: Solution - Independent
Counts:
                        Prostate Cancer
Fish Consumption          No     Yes   Total
Never/Seldom             110      14     124
Small Part of Diet      2420     201    2621
Moderate Part           2769     209    2978
Large Part               507      42     549
Total                   5806     466    6272
Distribution of fish consumption, conditioned on cancer status (column percentages):
Fish Consumption          No (conditioned on no cancer)   Yes (conditioned on cancer)   Total (marginal distribution)
Never/Seldom                                      1.89%                            3%                           1.98%
Small Part of Diet                                41.7%                         43.1%                           41.8%
Moderate Part                                     47.7%                         44.8%                           47.5%
Large Part                                        8.73%                            9%                            8.8%
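As a rough check of the percentages in this solution, the conditional and marginal distributions can be recomputed from the counts. This is a sketch added for illustration, not part of the original slides; the helper name `percentages` and the `largest_gap` summary are assumptions, and the slides give no numeric cutoff for "approximately the same".

```python
# Counts from the fish-consumption / prostate-cancer table.
fish_levels = ["Never/Seldom", "Small Part of Diet", "Moderate Part", "Large Part"]
no_cancer = [110, 2420, 2769, 507]
cancer = [14, 201, 209, 42]


def percentages(counts):
    """Convert a list of counts to percentages of their total."""
    total = sum(counts)
    return [100 * c / total for c in counts]


pct_no = percentages(no_cancer)   # conditioned on no cancer
pct_yes = percentages(cancer)     # conditioned on cancer
pct_all = percentages([a + b for a, b in zip(no_cancer, cancer)])  # marginal

# Largest difference, in percentage points, between the two conditional
# distributions. A small gap is consistent with independence.
largest_gap = max(abs(a - b) for a, b in zip(pct_no, pct_yes))

for level, a, b, c in zip(fish_levels, pct_no, pct_yes, pct_all):
    print(f"{level:<20}{a:7.2f}%{b:7.2f}%{c:7.2f}%")
print(f"Largest gap: {largest_gap:.2f} percentage points")
```

The two conditional distributions differ by less than 3 percentage points in every category, which matches the slide's conclusion that the variables appear independent.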
102
Learning Objective 8: Simpson’s Paradox A paradox is “a statement that is seemingly contradictory or opposed to common sense and yet is perhaps true”. Described by E. H. Simpson in 1951. Occurs when averaging samples of different sizes. Two groups from one sample are compared to two similar groups from another sample. Not E. H. Simpson
103
Learning Objective 8: Simpson’s Paradox Within each of the two groups, one sample has a higher success rate than the other sample. However, when the groups are combined, the sample with the lower success rate in every group ends up with the better overall proportion of successes. Thus, the paradox. One group usually has considerably fewer members than the others, so the groups carry very different weights in the combined rate. Simpson’s Paradox does not occur when both samples are split across the groups in the same proportions.
104
Learning Objective 8: Simpson’s Paradox What is Simpson’s Paradox? Simpson’s Paradox occurs when an association between two variables is reversed upon observing a third variable. In Simpson’s Paradox, the third (lurking) variable creates a reversal in the direction of an association (“confounding”). To uncover Simpson’s Paradox, divide the data into subgroups based on the lurking variable.
105
Learning Objective 8: Simpson’s Paradox
Recent Cleveland Indians season records
2003: 68-94, 42.0% winning percentage
2004: 80-82, 49.4% winning percentage
Two-season record: 148-176, 45.7% winning percentage
Recent Minnesota Twins season records
2003: 90-72, 55.6% winning percentage
2004: 92-70, 56.8% winning percentage
Two-season record: 182-142, 56.2% winning percentage
Notice that the Twins had a higher percentage in both 2003 and 2004, as well as in the two-year period. Not Simpson’s Paradox.
106
Learning Objective 8: Simpson’s Paradox – At Work
Ronnie Belliard
2002: 61/289, .211 of his at-bats were hits
2003: 124/447, .277 of his at-bats were hits
Two-season average: 185/736, hits .2514 of the time
Casey Blake
2002: 4/20, .200 of his at-bats were hits
2003: 143/557, .257 of his at-bats were hits
Two-season average: 147/577, hits .2548 of the time
Belliard’s two-season batting average was lower than Blake’s, but when the data are divided into separate seasons, Belliard had the higher batting average in both seasons. This is Simpson’s Paradox.
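The comparison of separate and combined rates can be sketched in Python. This is a minimal illustration using the baseball figures from the slides, not part of the original deck; the function names `rate` and `shows_reversal` are assumptions made for this example.

```python
def rate(records):
    """Combined success rate for a list of (successes, trials) pairs."""
    successes = sum(s for s, _ in records)
    trials = sum(t for _, t in records)
    return successes / trials


def shows_reversal(a, b):
    """True if one side wins every subgroup but loses the combined comparison."""
    a_wins_each = all(rate([x]) > rate([y]) for x, y in zip(a, b))
    b_wins_each = all(rate([x]) < rate([y]) for x, y in zip(a, b))
    if a_wins_each:
        return rate(a) < rate(b)
    if b_wins_each:
        return rate(b) < rate(a)
    return False


# (hits, at-bats) for 2002 and 2003.
belliard = [(61, 289), (124, 447)]
blake = [(4, 20), (143, 557)]

# (wins, games) for 2003 and 2004.
indians = [(68, 162), (80, 162)]
twins = [(90, 162), (92, 162)]

print(shows_reversal(belliard, blake))  # True: Simpson's paradox
print(shows_reversal(indians, twins))   # False: no reversal
```

The batting example reverses because Blake's weak 2002 season covers only 20 at-bats, so it barely affects his combined average, while Belliard's weak 2002 season covers 289 of his 736 at-bats. Both teams played 162 games each season, so their seasons carry equal weight and no reversal is possible.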
107
Learning Objective 8: Simpson’s Paradox – At Work Discrimination? Consider college acceptance rates by sex.
          Accepted   Not accepted   Total
Men            198            162     360
Women           88            112     200
Total          286            274     560
198 of 360 men (55%) were accepted. 88 of 200 women (44%) were accepted. Is there a sex bias?
108
Or is there a lurking variable that explains the association? To evaluate this, split applications according to the lurking variable “major applied to”: Business School (240 applicants) and Art School (320 applicants).
109
Learning Objective 8: Simpson’s Paradox – At Work
BUSINESS SCHOOL
          Accepted   Not accepted   Total
Men             18            102     120
Women           24             96     120
Total           42            198     240
18 of 120 men (15%) were accepted to B-school. 24 of 120 women (20%) were accepted to B-school. A higher percentage of women were accepted.
110
Learning Objective 8: Simpson’s Paradox – At Work
ART SCHOOL
          Accepted   Not accepted   Total
Men            180             60     240
Women           64             16      80
Total          244             76     320
180 of 240 men (75%) were accepted. 64 of 80 women (80%) were accepted. A higher percentage of women were accepted.
111
Within each school, a higher percentage of women were accepted than men. No discrimination against women. Possible discrimination against men. This is an example of Simpson’s Paradox. When the lurking variable (School applied to) was ignored, the data suggest discrimination against women. When the School applied to was considered, the association is reversed.
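One way to see why the reversal happens is to write each overall acceptance rate as a weighted average of the two school rates, weighted by the share of that sex's applications going to each school. The sketch below is an added illustration using the counts from the slides; the variable names and dictionary layout are assumptions made for this example.

```python
# (accepted, applicants) by sex and school, from the admissions tables.
applications = {
    "Men": {"Business": (18, 120), "Art": (180, 240)},
    "Women": {"Business": (24, 120), "Art": (64, 80)},
}

overall = {}
for sex, schools in applications.items():
    total_applicants = sum(n for _, n in schools.values())
    weighted_sum = 0.0
    for school, (accepted, n) in schools.items():
        share = n / total_applicants   # fraction of this sex applying here
        school_rate = accepted / n     # acceptance rate at this school
        weighted_sum += share * school_rate
        print(f"{sex:<6}{school:<9} share={share:.2f} rate={school_rate:.2f}")
    overall[sex] = weighted_sum
    print(f"{sex:<6}overall rate={weighted_sum:.2f}")
```

Two thirds of the men applied to the Art School, where acceptance rates are high for everyone, while 60% of the women applied to the Business School, where acceptance rates are low. The different weights, not the within-school rates, produce the 55% versus 44% overall gap.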
112
Learning Objective 8: Simpson’s Paradox – The Paradox What’s true for the parts isn’t true for the whole.
113
Learning Objective 8: Simpson’s Paradox – CONCLUSION!!!! Simpson’s paradox is a rare phenomenon! It does not occur often! Thus statisticians must be trained academically & ethically well enough to make sure that if it has occurred they will detect and correct it. This is where practice, critical thinking skills, and repetition come into play!
114
Assignment Chapter 3 Notes Worksheet Chapter 3, Exercises pg. 37 – 43: #5, 7, 9, 11, 17 - 25 odd, 33, 35 Read Ch-4, pg. 44 - 71