1
Chapter 2 Summarizing and Graphing Data
2
Recall: The 2 Types of data variables:
3
2.1 Graphs for qualitative variables
Bar graphs (frequency and relative frequency)
Pie charts
Pareto charts
4
Graphs for qualitative variables
The values of a qualitative or categorical variable are labels. The distribution of a categorical variable lists the count or percentage of individuals in each category. Counts: A sample of 400 wireless internet users.
5
Wireless internet users
Male     288   (72%)
Female   112   (28%)
Total    400   (100%)
6
Frequency Distribution (or Frequency Table)
lists each category of data and the number of occurrences for each category of data.
7
Frequency Distribution Ages of Best Actresses
Original Data Frequency Distribution
8
Lower Class Limits
are the smallest numbers that can actually belong to different classes.
9
Upper Class Limits
are the largest numbers that can actually belong to different classes.
10
Class Midpoints
can be found by adding the lower class limit to the upper class limit and dividing the sum by two.
Class midpoints: 25.5, 35.5, 45.5, 55.5, 65.5, 75.5
11
Class Width
is the difference between two consecutive lower class limits or two consecutive lower class boundaries.
Class width: 10
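The two definitions above can be checked with a short sketch. The class limits used here (21 to 30, 31 to 40, and so on) are an assumption: the table itself does not survive in this transcript, and these limits were chosen only because they reproduce the listed midpoints and the class width of 10.

```python
# Assumed class limits (lower, upper); chosen to match the listed midpoints.
classes = [(21, 30), (31, 40), (41, 50), (51, 60), (61, 70), (71, 80)]

# Midpoint = (lower class limit + upper class limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Class width = difference between two consecutive lower class limits
lower_limits = [lower for lower, _ in classes]
widths = [b - a for a, b in zip(lower_limits, lower_limits[1:])]

print(midpoints)  # [25.5, 35.5, 45.5, 55.5, 65.5, 75.5]
print(widths)     # [10, 10, 10, 10, 10]
```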
12
EXAMPLE Organizing Qualitative Data into a Frequency Distribution
The data on the next slide represent the color of M&Ms in a bag of plain M&Ms. Construct a frequency distribution of the color of plain M&Ms.
13
Frequency table
14
The relative frequency is the proportion (or percent) of observations within a category and is found using the formula:
relative frequency = frequency / sum of all frequencies
A relative frequency distribution lists the relative frequency of each category of data.
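As a minimal sketch of this calculation, the counts from the wireless internet users example (288 male, 112 female) can be converted to relative frequencies. The function name is illustrative only.

```python
def relative_frequencies(counts):
    """Return {category: frequency / sum of all frequencies}."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

# Counts from the wireless internet users example (sample of 400).
wireless_users = {"Male": 288, "Female": 112}
rel = relative_frequencies(wireless_users)
print(rel)  # {'Male': 0.72, 'Female': 0.28}
```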
15
EXAMPLE Organizing Qualitative Data into a Relative Frequency Distribution
Use the frequency distribution obtained in the prior example to construct a relative frequency distribution of the color of plain M&Ms.
16
Relative Frequency: 0.2222, 0.2, 0.1333, 0.0667, 0.1111
17
Bar Graphs A bar graph is constructed by labeling each category of data on either the horizontal or vertical axis and the frequency or relative frequency of the category on the other axis.
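A rough text-only sketch of a frequency bar graph, with the categories on the vertical axis and the frequency along the horizontal axis. It uses the wireless internet users counts from earlier. The function name, the `#` symbol and the scale of one symbol per 16 users are arbitrary choices for illustration.

```python
def text_bar_graph(values, scale=1.0, symbol="#"):
    """Return one line per category: label, bar, and value."""
    width = max(len(label) for label in values)
    lines = []
    for label, value in values.items():
        bar = symbol * round(value * scale)
        lines.append(f"{label:<{width}} | {bar} {value}")
    return lines

counts = {"Male": 288, "Female": 112}
lines = text_bar_graph(counts, scale=1 / 16)  # one symbol per 16 users
for line in lines:
    print(line)
```

Passing relative frequencies instead of counts (with a larger scale) gives the relative frequency bar graph; the shape of the bars is unchanged.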
18
EXAMPLE Constructing a Frequency and Relative Frequency Bar Graph
Use the M&M data to construct a frequency bar graph and a relative frequency bar graph.
21
Actresses example
28/76 = 37%
30/76 = 39%
etc.
Total Frequency = 76
22
Frequency bar graph
The horizontal scale represents the classes of data values; the vertical scale represents the frequencies.
23
Relative Frequency Graph
Has the same shape and horizontal scale as the bar graph, but the vertical scale is marked with relative frequencies instead of actual frequencies
24
Interpreting Frequency Distributions
In later chapters, there will be frequent reference to data with a normal distribution. One key characteristic of a normal distribution is that it has a “bell” shape. The frequencies start low, then increase to some maximum frequency, then decrease to a low frequency. The distribution should be approximately symmetric.
25
Example: “bell” shape
26
EXAMPLE Comparing Two Data Sets
The following data represent the marital status (in millions) of U.S. residents 18 years of age or older in 1990 and 2006. Draw a side-by-side relative frequency bar graph of the data.
Marital Status   1990    2006
Never married    40.4    55.3
Married          112.6   127.7
Widowed          13.8    13.9
Divorced         15.1    22.8
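Because the two years have different totals, the comparison needs relative frequencies rather than raw counts. A minimal sketch of that step, using the 1990 and 2006 figures from the example (the variable and function names are illustrative):

```python
status = ["Never married", "Married", "Widowed", "Divorced"]
millions_1990 = [40.4, 112.6, 13.8, 15.1]
millions_2006 = [55.3, 127.7, 13.9, 22.8]

def to_relative(freqs):
    """Divide each frequency by the sum of all frequencies."""
    total = sum(freqs)
    return [f / total for f in freqs]

rel_1990 = to_relative(millions_1990)
rel_2006 = to_relative(millions_2006)

# One row per category, the two years side by side.
for name, r90, r06 in zip(status, rel_1990, rel_2006):
    print(f"{name:<14} 1990: {r90:.4f}   2006: {r06:.4f}")
```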
27
Marital Status in 1990 vs. 2006
Side-by-side bar graph; vertical axis: Relative Frequency; bars: 1990, 2006
28
Another Example: On the morning of April 10, 1912 the Titanic sailed from the port of Southampton (UK) bound for New York. Altogether there were 2,201 passengers and crew members on board. This is the table of the survivors of the famous tragic accident.
                Survived           Dead
                Male    Female     Male    Female
First class     62      141        118     4
Second class    25      93         154     13
Third class     88      90         422     106
Crew members    192     20         670     3
Define the categorical variables
29
Bar chart representing the data in the table above (in percentages)
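One way to turn the table into percentages is sketched below. Two assumptions are made: the four numbers in each row are read as survived male, survived female, dead male, dead female, and the percentage computed is the share of each group that survived. The slide does not say which percentages its chart uses.

```python
# (survived male, survived female, dead male, dead female) per group
titanic = {
    "First class": (62, 141, 118, 4),
    "Second class": (25, 93, 154, 13),
    "Third class": (88, 90, 422, 106),
    "Crew members": (192, 20, 670, 3),
}

# Sanity check against the total stated in the text.
on_board = sum(sum(row) for row in titanic.values())

# Assumed measure: percentage of each group that survived.
survival_pct = {
    group: 100 * (sm + sf) / (sm + sf + dm + df)
    for group, (sm, sf, dm, df) in titanic.items()
}

print(on_board)  # 2201
for group, pct in survival_pct.items():
    print(f"{group:<12} {pct:.1f}% survived")
```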
30
A Pareto chart is a bar graph where the bars are drawn in decreasing order of frequency or relative frequency.
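The only extra step compared with an ordinary bar graph is the sort. A short sketch, reusing the 2006 marital status frequencies (in millions) from the earlier example:

```python
# Marital status of U.S. residents in 2006 (millions), from the earlier example.
frequencies = {
    "Never married": 55.3,
    "Married": 127.7,
    "Widowed": 13.9,
    "Divorced": 22.8,
}

# Pareto order: categories sorted by decreasing frequency.
pareto_order = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

for category, freq in pareto_order:
    print(f"{category:<14} {freq}")
```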
31
Pareto Chart
32
Pie Chart A pie chart is a circle divided into sectors. Each sector represents a category of data. The area of each sector is proportional to the frequency of the category.
33
EXAMPLE Constructing a Pie Chart
The following data represent the marital status (in millions) of U.S. residents 18 years of age or older in 2006. Draw a pie chart of the data.
Marital Status   Frequency
Never married    55.3
Married          127.7
Widowed          13.9
Divorced         22.8
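In a circle, a sector's area is proportional to its central angle, so each angle can be taken as 360 degrees times the relative frequency. A minimal sketch with the frequencies above (names are illustrative):

```python
frequencies = {
    "Never married": 55.3,
    "Married": 127.7,
    "Widowed": 13.9,
    "Divorced": 22.8,
}

total = sum(frequencies.values())

# Central angle of each sector, in degrees.
angles = {category: 360 * freq / total for category, freq in frequencies.items()}

for category, angle in angles.items():
    print(f"{category:<14} {angle:6.1f} degrees")
```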
34
Other example: A graph depicting qualitative data as slices of a pie
35
2.2 Graphs for quantitative variables:
Histograms (discrete data and continuous data)
Stem-and-leaf plots
Time series
Dot plots
Distributions
36
Histogram: Example: CEO salaries
Forbes magazine published data on the best small firms in [...]. These were firms with annual sales of more than $5 million and less than $350 million. Firms were ranked by five-year average return on investment. The data extracted are the age and annual salary of the chief executive officer for the first 60 ranked firms. (Data at [...])
Variable: salary of chief executive officer (including bonuses), in $ thousands
Histogram on CEO salaries
37
Drawing a histogram
Construct a distribution table:
- Define class intervals or bins (choose intervals of equal width!).
- Count the percentage of observations in each interval.
- End-point convention: the left endpoint of the interval is included and the right endpoint is excluded, i.e. [a, b).
Draw the horizontal axis.
Construct the blocks: height of block = percentage!
The total area under a histogram must be 100%.
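The counting step, with the left-endpoint convention, might be sketched as follows. The data values and class intervals are hypothetical and serve only to show what happens to observations that fall exactly on an endpoint.

```python
def percent_histogram(data, edges):
    """Percentage of observations in each interval [edges[i], edges[i+1]).

    Assumes every observation lies inside the range covered by edges.
    """
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            # Left endpoint included, right endpoint excluded.
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return [100 * c / len(data) for c in counts]

# Hypothetical data, for illustration only.
salaries = [50, 100, 120, 199, 200, 250, 310, 399]
edges = [0, 100, 200, 300, 400]  # equal-width class intervals
percents = percent_histogram(salaries, edges)
print(percents)  # [12.5, 37.5, 25.0, 25.0]
```

The values 100 and 200 fall in the intervals that start at them, not in the ones that end there, and the percentages add up to 100.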
38
Percentage = (frequency / total) x 100
Class intervals (use left end-point), frequency, percentage:
0-100: frequency 2, percentage 2/59 x 100 = 3.39
Other rows (frequency, percentage): 3, 5.08; 4, 4/59 x 100 = 6.78; 18, 30.50; 6.78; 14, 23.73; 1, 1.70; 6, 10.18
Total: 59, 100%
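The percentage column can be checked with a few lines. The sketch below reproduces four of the table entries from their frequencies and the total of 59; the function name is illustrative.

```python
def percentage(frequency, total):
    """Percentage = (frequency / total) x 100."""
    return frequency / total * 100

total = 59
for frequency in (2, 3, 4, 14):
    print(f"{frequency}/{total} x 100 = {percentage(frequency, total):.2f}")
```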
39
The area of each block represents the percentage of cases in the corresponding class interval (or bin).
Block labels shown: 30.50%, 23.73%, 3.39%, 1.70%
40
Remarks
- A histogram represents percent by area. The area of each block represents the percentage of cases in the corresponding class interval.
- The total area under a histogram is 100%.
- There is no fixed choice for the number of classes in a histogram: if class intervals are too small, the histogram will have spikes; if class intervals are too large, some information will be missed. Use your judgment!
- Typically statistical software will choose the class intervals for you, but you can modify them.
Let's try various binning levels.
41
Example: Smoking
In a Public Health Service study, a histogram was plotted showing the number of cigarettes smoked per day by each subject (male current smokers), as shown below. The density is marked in parentheses. The class intervals include the left endpoint, but not the right. (Note: there are 20 cigarettes in a pack.)
- The percentage who smoked less than two packs a day but at least a pack is around: 1.5%, 15%, 30%, 50%
- The percent who smoked at least a pack a day is around: 1.5%, [...]%, 30%, 50%
- The percent who smoked at least 3 packs a day is around: 0.25 of 1%, 0.5 of 1%, 10%
- The percent who smoked 20 cigarettes a day is around: 0.35 of 1%, [...] of 1%, 1.5%, 3.5%, 10%
42
Answers: (Note: there are 20 cigarettes in a pack.)
- The percentage who smoked less than two packs a day but at least a pack is given by the area of the third block: 1.5 x (40 - 20) = 1.5 x 20 = 30%.
- The percent who smoked at least a pack a day is given by the area of the third and fourth blocks: 1.5 x 20 + 0.5 x 40 = 50%.
- The percent who smoked at least 3 packs a day is the area of the block for number of cigarettes greater than or equal to 60. This is half of the fourth block: 10%.
- The percent who smoked 20 cigarettes a day: use the left endpoint convention, so 20 belongs to the third block. The answer is 1.5%.
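The answers all follow from area = density x width. The sketch below encodes only the two blocks the answers rely on. The third block's endpoints and density come from the answers directly. The fourth block's endpoints (40 to 80) and density (0.5% per cigarette) are assumptions inferred from them, since the histogram itself is not reproduced here.

```python
# (left endpoint, right endpoint, density in % per cigarette)
third_block = (20, 40, 1.5)
fourth_block = (40, 80, 0.5)  # assumed: 20% of cases spread over 40 to 80

def area(block, start=None, stop=None):
    """Percent of cases in [start, stop) within one block: density x width."""
    left, right, density = block
    start = left if start is None else max(start, left)
    stop = right if stop is None else min(stop, right)
    return density * max(stop - start, 0)

print(area(third_block))                       # 30.0  at least one pack, under two
print(area(third_block) + area(fourth_block))  # 50.0  at least one pack
print(area(fourth_block, start=60))            # 10.0  at least three packs
print(area(third_block, start=20, stop=21))    # 1.5   exactly 20 cigarettes
```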
43
Using histograms for comparisons
Fuel economy for model year 2001 compact and two-seater cars (Table 1.8, pg 38)
Histograms: city consumption, highway consumption
45
Stemplot (or Stem-and-Leaf Plot)
Represents data by separating each value into two parts: the stem (the leftmost digits) and the leaf (the rightmost digit).
Example: a data value of 147 would have 14 as the stem and 7 as the leaf.
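A minimal sketch of the split and of a complete stemplot for non-negative integers. The function names and the sample values (other than 147) are hypothetical.

```python
from collections import defaultdict

def split_stem_leaf(value):
    """Split a non-negative integer into (stem, leaf); the leaf is the last digit."""
    return divmod(value, 10)

def stemplot(values):
    """Return stemplot lines, with stems and leaves in increasing order."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = split_stem_leaf(value)
        leaves[stem].append(leaf)
    # Include every stem between the smallest and largest, even if it is empty.
    return [
        f"{stem:>3} | {''.join(str(leaf) for leaf in leaves.get(stem, []))}"
        for stem in range(min(leaves), max(leaves) + 1)
    ]

print(split_stem_leaf(147))  # (14, 7)

# Hypothetical values, for illustration only.
plot = stemplot([147, 152, 138, 141, 149, 160, 152])
for line in plot:
    print(line)
```

Empty stems are kept so that gaps in the data remain visible in the plot.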
46
To make a Stemplot:
47
Example:
49
Advantage of Stem-and-Leaf Diagrams over Histograms
Once a frequency distribution or histogram of continuous data is created, the raw data are lost (unless reported with the frequency distribution). The raw data can, however, be retrieved from a stem-and-leaf plot.
50
Dot plots
A dot plot is drawn by placing each observation horizontally in increasing order and placing a dot above the observation each time it is observed.
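A text-only sketch of this construction for small integer observations. The observations below are hypothetical and are not the household data from the example. The layout assumes single-digit values so that the columns line up.

```python
from collections import Counter

def dot_plot(observations):
    """Return text rows: stacked dots above each value, then the value axis."""
    counts = Counter(observations)
    values = range(min(counts), max(counts) + 1)
    height = max(counts.values())
    rows = []
    for level in range(height, 0, -1):
        # A dot appears at this level if the value was observed at least `level` times.
        rows.append(" ".join("*" if counts[v] >= level else " " for v in values))
    rows.append(" ".join(str(v) for v in values))
    return rows

# Hypothetical observations, for illustration only.
rows = dot_plot([0, 1, 1, 2, 2, 2, 3, 1, 2, 4])
for row in rows:
    print(row)
```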
51
EXAMPLE Drawing a Dot Plot
The following data represent the number of available cars in a household based on a random sample of 50 households. Draw a dot plot of the data. Data based on results reported by the United States Bureau of the Census.
53
Examining distributions
Purpose of graph: to understand data better.
Histograms and stemplots display the main features of a distribution similarly.
Features to be observed:
- Modes (how many?)
- Symmetry vs skewness
- Outliers
55
EXAMPLE Identifying the Shape of the Distribution
Identify the shape of the following histogram, which represents the time between eruptions at Old Faithful.
56
Time-Series Graphs Data that have been collected at different points in time
58
Example:
61
Time series graph:
62
Time series graph with seasonal variation:
63
Other types of graphs:
- Frequency Polygon
- Ogive (cumulative frequencies)
- Scatter Plot (to relate two variables)
64
Frequency polygons
The class midpoint is found by adding consecutive lower class limits and dividing the result by 2. A frequency polygon is drawn by plotting a point above each class midpoint on a horizontal axis at a height equal to the frequency of the class. After the points for each class are plotted, draw straight lines between consecutive points.
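A sketch of how the points of a frequency polygon could be assembled. The lower class limits follow the eruption-time classes (670, 680, ..., 730), and 740 is added as the assumed lower limit of the next class so that the last midpoint can be computed. The frequencies are hypothetical.

```python
def polygon_points(lower_limits, frequencies):
    """Pair each class midpoint with its frequency.

    lower_limits needs one extra entry that closes the last class.
    """
    midpoints = [(a + b) / 2 for a, b in zip(lower_limits, lower_limits[1:])]
    if len(midpoints) != len(frequencies):
        raise ValueError("need exactly one frequency per class")
    return list(zip(midpoints, frequencies))

# 740 is an assumed closing limit for the class 730 to 739.
lower_limits = [670, 680, 690, 700, 710, 720, 730, 740]
# Hypothetical frequencies, for illustration only.
frequencies = [1, 3, 6, 9, 7, 4, 2]

points = polygon_points(lower_limits, frequencies)
print(points[0])  # (675.0, 1)
```

Joining consecutive points with straight line segments gives the polygon.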
65
Time between Eruptions (seconds)   Class Midpoint   Frequency   Relative Frequency
670 – 679                          675              2           0.0444
680 – 689                          685
690 – 699                          695              7           0.1556
700 – 709                          705              9           0.2
710 – 719                          715
720 – 729                          725              11          0.2444
730 – 739                          735
66
Frequency Polygon
Horizontal axis: Time (seconds)
67
Practice
68
CO2 emission levels in the world:
Burning fuel in power plants or motor vehicles emits carbon dioxide (CO2), which contributes to global warming. The table in the next slide displays CO2 emissions per person from countries with populations of at least 20 million.
Questions:
- Why do you think we choose to measure emissions per person rather than total CO2 emissions for each country?
- Display the data of the table in a graph. Describe the shape, center, and spread of the distribution. Which countries are outliers?
- Make a stemplot, then a histogram.
69
2.3 3.9 17 0.2 1.8 16 2.5 1.4 1.7 6.1 10 0.9 1.2 7.3 3.8 3.6 9.1 0.3 9.7 8.8 4.6 3.7 1.0 0.1 0.7 0.8 8 10.2 11 8.1 6.8 2.8 7.6 9 19.9 4.8 5.1 0.5
70
Answer: (a) Total emissions would almost certainly be higher for very large countries; for example, we would expect that even with great attempts to control emissions, China (with over 1 billion people) would have higher total emissions than the smallest countries in the data set.
71
Answer: (stemplot)
(b) Graph representation of the data:
1) Stemplot:
72
Answer: (histogram)
(b) continued: Graph representation of the data:
2) Histogram: (for example, using Excel. Note: in Excel, the convention is 'right point belongs in bin, left point out'.) (Demo in class)
Summary of steps:
- Find min and max of data
- Choose binning
- From Menus: Tools, Data Analysis, Histograms
- Define: Input range, Bin range, Output range
- Check Chart output.
- Click OK.
- Adjust width between bars (right-click on bars, format data series, options, set gap width to zero).
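The two endpoint conventions can give different counts for the same data and the same bins. The sketch below compares them on hypothetical values. It illustrates the conventions only and is not a reproduction of Excel's histogram tool.

```python
from bisect import bisect_left, bisect_right

def bin_counts(data, edges, right_closed=False):
    """Count observations per interval between consecutive edges.

    right_closed=False: intervals [a, b)  (left endpoint included)
    right_closed=True:  intervals (a, b]  (right endpoint included)
    Values outside the outer edges are ignored.
    """
    counts = [0] * (len(edges) - 1)
    locate = bisect_left if right_closed else bisect_right
    for x in data:
        i = locate(edges, x) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts

# Hypothetical data with several values exactly on a bin edge.
data = [0.5, 2, 2, 3.9, 4, 5.1]
edges = [0, 2, 4, 6]

left_in = bin_counts(data, edges)                      # [a, b)
right_in = bin_counts(data, edges, right_closed=True)  # (a, b]
print(left_in)   # [1, 3, 2]
print(right_in)  # [3, 2, 1]
```

Only observations that sit exactly on an edge move between bins, so the difference matters most for rounded data.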
73
Answer: (histogram)
(b) continued: Histogram:
min: [...], max: 19.9
Bin Frequency: 2 18 4 9 6 3 8 5 10 12 14 16 1 20 22
Interpretation of graphs: The graph is not symmetric. There is a strong right skew with a high peak at low metric tons per person. The three highest countries (the U.S., Canada, and Australia) appear to be outliers; apart from those countries, the distribution is spread from 0 to 11 metric tons per person (see table).