Presentation is loading. Please wait.

Presentation is loading. Please wait.

Graphs for categorical data

Similar presentations


Presentation on theme: "Graphs for categorical data"— Presentation transcript:

1 Graphs for categorical data

2 Bar Chart When to Use Categorical data How to construct
Draw a horizontal line; write the categories or labels below the line at regularly spaced intervals Draw a vertical line; label the scale using frequency or relative frequency Place equal-width rectangular bars above each category label with a height determined by its frequency or relative frequency

3 Bar Chart (continued) What to Look For
Frequently or infrequently occurring categories Collect the following data and then display the data in a bar chart: What is your favorite ice cream flavor? Vanilla, chocolate, strawberry, or other

4 Why MUST we use relative frequencies?
Double Bar Charts When to Use Categorical data How to construct Constructed like bar charts, but with two (or more) groups being compared MUST use relative frequencies on the vertical axis MUST include a key to denote the different bars Why MUST we use relative frequencies?

5 What should you do first?
Each year the Princeton Review conducts a survey of students applying to college and of parents of college applicants. In 2009, 12,715 high school students responded to the question “Ideally how far from home would you like the college you attend to be?” Also, 3007 parents of students applying to college responded to the question “how far from home would you like the college your child attends to be?” Data is displayed in the frequency table below. What should you do first? The first step is to calculate the relative frequencies for students and for parents. Frequency Ideal Distance Students Parents Less than 250 miles 4450 1594 250 to 500 miles 3942 902 500 to 1000 miles 2416 331 More than 1000 miles 1907 180 Create a comparative bar chart with these data.

6 Relative Frequency Ideal Distance Students Parents Less than 250 miles .35 .53 250 to 500 miles .31 .30 500 to 1000 miles .19 .11 More than 1000 miles .15 .06 Found by dividing the frequency by the total number of students Found by dividing the frequency by the total number of parents What does this graph show about the ideal distance college should be from home? See page 97 for a discussion of graph

7 Segmented (or Stacked) Bar Charts
When to Use Categorical data How to construct MUST first calculate relative frequencies Draw a bar representing 100% of the group Divide the bar into segments corresponding to the relative frequencies of the categories

8 Remember the Princeton survey . . .
Create a segmented bar graph with these data. First draw a bar that represents 100% of the students who answered the survey. Relative Frequency Ideal Distance Students Parents Less than 250 miles .35 .53 250 to 500 miles .31 .30 500 to 1000 miles .19 .11 More than 1000 miles .15 .06

9 Next, divide the bar into segments.
Relative Frequency Ideal Distance Students Parents Less than 250 miles .35 .53 250 to 500 miles .31 .30 500 to 1000 miles .19 .11 More than 1000 miles .15 .06 Notice that this segmented bar chart displays the same relationship between the opinions of students and parents concerning the ideal distance that college is from home as the double bar chart does. First draw a bar that represents 100% of the students who answered the survey. Relative frequency Students Next, divide the bar into segments. Do the same thing for parents – don’t forget a key denoting each category Less than 250 miles 250 to 500 miles 500 to 1000 miles More than 1000 miles Parents

10 Pie (Circle) Chart When to Use Categorical data How to construct
Draw a circle to represent the entire data set Calculate the size of each “slice”: Relative frequency × 360° Using a protractor, mark off each slice To describe – comment on which category had the largest proportion or smallest proportion

11 Create a pie chart for these data.
Typos on a résumé do not make a very good impression when applying for a job. Senior executives were asked how many typos in a résumé would make them not consider a job candidate. The resulting data are summarized in the table below. Number of Typos Frequency Relative Frequency 1 60 .40 2 54 .36 3 21 .14 4 or more 10 .07 Don’t know 5 .03 Create a pie chart for these data.

12 First draw a circle to represent the entire data set.
Number of Typos Frequency Relative Frequency 1 60 .40 2 54 .36 3 21 .14 4 or more 10 .07 Don’t know 5 .03 What does this pie chart tell us about the number of typos occurring in résumés before the applicant would not be considered for a job? First draw a circle to represent the entire data set. Next, calculate the size of the slice for “1 typo” .40×360º =144º Draw that slice. Repeat for each slice. Here is the completed pie chart created using Minitab. See Page 98 for this example.

13 Graphs for numerical data

14 Dotplot How to construct When to Use Small numerical data sets
Draw a horizontal line and mark it with an appropriate numerical scale Locate each value in the data set along the scale and represent it by a dot. If there are two are more observations with the same value, stack the dots vertically

15 Dotplot (continued) What to Look For
The representative or typical value The extent to which the data values spread out The nature of the distribution along the number line The presence of unusual values Collect the following data and then display the data in a dotplot: How many body piercings do you have?

16 How to describe a numerical, univariate graph
Do after Features of Distributions Activity

17 What strikes you as the most distinctive difference among the distributions of exam scores in classes A, B, & C ? Adapted from “Workshop Statistics” by Rossman

18 The mean and/or median is typically reported rather than the mode.
1. Center discuss where the middle of the data falls three measures of central tendency mean, median, & mode The mean and/or median is typically reported rather than the mode.

19 What strikes you as the most distinctive difference among the distributions of scores in classes D, E, & F? Adapted from “Workshop Statistics” by Rossman

20 2. Spread discuss how spread out the data is
refers to the variability in the data Measure of spread are Range, standard deviation, IQR Remember, Range = maximum value – minimum value Standard deviation & IQR will be discussed in Chapter 2

21 What strikes you as the most distinctive difference among the distributions of exam scores in classes G, H, & I ? Adapted from “Workshop Statistics” by Rossman

22 The following slides will discuss these shapes.
refers to the overall shape of the distribution The following slides will discuss these shapes.

23 Symmetrical 1. Collect data by rolling two dice and recording the sum of the two dice. Repeat three times. 2. Plot your sums on the dotplot on the board. 3. What shape does this distribution have? refers to data in which both sides are (more or less) the same when the graph is folded vertically down the middle bell-shaped is a special type has a center mound with two sloping tails

24 Uniform 1. Collect data by rolling a single die and recording the number rolled. Repeat five times. 2. Plot your numbers on the dotplot on the board. 3. What shape does this distribution have? refers to data in which every class has equal or approximately equal frequency To help remember the name for this shape, picture soldier standing in straight lines. What are they wearing?

25 Skewed 1. Collect data finding the age of five coins in circulation (current year minus year of coin) and record 2. Plot the ages on the dotplot on the board. 3. What shape does this distribution have? Name a variable with a distribution that is skewed left. refers to data in which one side (tail) is longer than the other side the direction of skewness is on the side of the longer tail Example of a left skewed distribution, hopefully, is the distribution of grades on the first statistics test. The directions are right skewed or left skewed.

26 Bimodal (multi-modal)
Suppose collect data on the time it takes to drive from San Luis Obispo, California to Monterey, California. Some people may take the inland route (approximately 2.5 hours) while others may take the coastal route (between 3.5 and 4 hours). What shape would this distribution have? refers to the number of peaks in the shape of the distribution Bimodal would have two peaks Multi-modal would have more than two peaks Bimodal distributions can occur when the data set consist of observations from two different kinds of individuals or objects. See page 127 for an example of a data set that is bimodal. What would a distribution be called if it had ONLY one peak? Unimodal

27 3. Shape refers to the overall shape of the distribution
symmetrical, uniform, skewed, or bimodal

28 What strikes you as the most distinctive difference among the distributions of exam scores in class J ? Adapted from “Workshop Statistics” by Rossman

29 4. Unusual occurrences Outlier - value that lies away from the rest of the data Gaps Clusters

30 5. In context You must write your answer in reference to the context in the problem, using correct statistical vocabulary and using complete sentences!

31 Dotplot (continued) What to Look For
The representative or typical value The extent to which the data values spread out The nature of the distribution along the number line The presence of unusual values Collect the following data and then display the data in a dotplot: How many body piercings do you have? Describe the distribution of the number of body piercings the class has.

32 Numerical Graphs Continued

33 Stem-and-Leaf Displays
When to Use Univariate numerical data How to construct Select one or more of the leading digits for the stem List the possible stem values in a vertical column Record the leaf for each observation beside each corresponding stem value Indicate the units for stems and leaves in a key or legend To describe – comment on the center, spread, and shape of the distribution and if there are any unusual features Each number is split into two parts: Stem – consists of the first digit(s) Leaf - consists of the final digit(s) Can also create comparative stem-and-leaf displays Remember the data set collected in Chapter 1 – how many piercings do you have? Would a stem-and-leaf display be a good graph for this distribution? Why or why not? Use for small to moderate sized data sets. Doesn’t work well for large data sets. Be sure to list every stem from the smallest to the largest value If you have a long lists of leaves behind a few stems, you can split stems in order to spread out the distribution.

34 Create a stem-and-leaf display with this data?
The following data are price per ounce for various brands of different brands of dandruff shampoo at a local grocery store. Create a stem-and-leaf display with this data? What would an appropriate stem be? For the observation of “0.32”, write the 2 behind the “3” stem. List the stems vertically The median price per ounce for dandruff shampoo is $0.285, with a range of $ The distribution is positively skewed with an outlier at $0.54. Stem Leaf 1 2 3 4 5 Continue recording each leaf with the corresponding stem Describe this distribution. 7 1 9 8 3 2 6 4

35 Notice that now you can see the shape of this distribution.
The Census Bureau projects the median age in 2030 for the 50 states and Washington D.C. A stem-and-leaf display is shown below. Notice that now you can see the shape of this distribution. We use L for lower leaf values (0-4) and H for higher leaf values (5-9). Notice that you really cannot see a distinctive shape for this distribution due to the long list of leaves We can split the stems in order to better see the shape of the distribution.

36 Let’s truncate the leaves to the unit place.
The median percentage of primary-school-aged children enrolled in school is larger for countries in Northern Africa than in Central Africa, but the ranges are the same. The distribution for countries in Northern Africa is strongly negatively skewed, but the distribution for countries in Central Africa is approximately symmetrical. The following is data on the percentage of primary-school-aged children who are enrolled in school for 19 countries in Northern Africa and for 23 countries in Central African. Northern Africa Central Africa Let’s truncate the leaves to the unit place. “4.6” becomes “4” Create a comparative stem-and-leaf display. Be sure to use comparative language when describing these distributions! What is an appropriate stem?

37 Constructed differently for discrete versus continuous data
Histograms When to Use Univariate numerical data How to construct Discrete data Draw a horizontal scale and mark it with the possible values for the variable Draw a vertical scale and mark it with frequency or relative frequency Above each possible value, draw a rectangle centered at that value with a height corresponding to its frequency or relative frequency To describe – comment on the center, spread, and shape of the distribution and if there are any unusual features Constructed differently for discrete versus continuous data For comparative histograms – use two separate graphs with the same scale on the horizontal axis

38 Create a histogram for the number of partners of the queen bees.
Queen honey bees mate shortly after they become adults. During a mating flight, the queen usually takes several partners, collecting sperm that she will store and use throughout the rest of her life. A study on honey bees provided the following data on the number of partners for 30 queen bees. Create a histogram for the number of partners of the queen bees.

39 What do you notice about the shapes of these two histograms?
Draw a rectangle above each value with a height corresponding to the frequency. First draw a horizontal axis, scaled with the possible values of the variable of interest. Next draw a vertical axis, scaled with frequency or relative frequency. Suppose we use relative frequency instead of frequency on the vertical axis. What do you notice about the shapes of these two histograms?

40 This is the type of histogram that most students are familiar with.
Histograms When to Use Univariate numerical data How to construct Continuous data Mark the boundaries of the class intervals on the horizontal axis Draw a vertical scale and mark it with frequency or relative frequency Draw a rectangle directly above each class interval with a height corresponding to its frequency or relative frequency To describe – comment on the center, spread, and shape of the distribution and if there are any unusual features This is the type of histogram that most students are familiar with.

41 Notice the common scale on the horizontal axis
A study examined the length of hours spent watching TV per day for a sample of children age 1 and for a sample of children age 3. Below are comparative histograms. The median number of hours spent watching TV per day was greater for the 1-year-olds than for the 3-year-olds. The distribution for the 3-year-olds was more strongly skewed right than the distribution for the 1-year-olds, but the two distributions had similar ranges. Children Age 1 Children Age 3 Notice the common scale on the horizontal axis Write a few sentences comparing the distributions. See pages

42 Cumulative Relative Frequency Plot
When to use - used to answer questions about percentiles. How to construct - Mark the boundaries of the intervals on the horizontal axis - Draw a vertical scale and mark it with relative frequency - Plot the point corresponding to the upper end of each interval with its cumulative relative frequency, including the beginning point - Connect the points. Percentiles are a value with a given percent of observations at or below that value.

43 Cumulative relative frequency
The National Climatic Center has been collecting weather data for many years. The annual rainfall amounts for Albuquerque, New Mexico from 1950 to 2008 were used to create the frequency distribution below. Find the cumulative relative frequency for each interval Annual Rainfall (in inches) Relative frequency Cumulative relative frequency 4 to <5 0.052 5 to <6 0.103 6 to <7 0.086 7 to <8 8 to <9 0.172 9 to <10 0.069 10 to < 11 0.207 11 to <12 12 to <13 13 to <14 0.052 + 0.155 + 0.241 Continue this pattern to complete the table

44 Cumulative relative frequency
The National Climatic Center has been collecting weather data for many years. The annual rainfall amounts for Albuquerque, New Mexico from 1950 to 2008 were used to create the frequency distribution below. To create a cumulative relative frequency plot, graph a point for the upper value of the interval and the cumulative relative frequency In the context of this problem, explain the meaning of this value. Annual Rainfall (in inches) Relative frequency Cumulative relative frequency 4 to <5 0.052 5 to <6 0.103 0.155 6 to <7 0.086 0.241 7 to <8 0.344 8 to <9 0.172 0.516 9 to <10 0.069 0.585 10 to < 11 0.207 0.792 11 to <12 0.895 12 to <13 0.947 13 to <14 0.999 Why isn’t this value one (1)? Plot a point for each interval. Plot a starting point at (4,0). Connect the points. In the context of this problem, explain the meaning of this value.

45 What proportion of years had rainfall amounts that were 9
What proportion of years had rainfall amounts that were 9.5 inches or less? Rainfall Cumulative relative frequency Approximately 0.55

46 Approximately 30% of the years had annual rainfall less than what amount?
Cumulative relative frequency Approximately 7.5 inches

47 Which interval of rainfall amounts had a larger proportion of years –
9 to 10 inches or 10 to 11 inches? Explain Rainfall Cumulative relative frequency The interval 10 to 11 inches, because its slope is steeper, indicating a larger proportion occurred.

48 Displaying Bivariate Numerical Data

49 Scatterplots are discussed in much greater depth in Chapter 5.
When to Use Bivariate numerical data How to construct - Draw a horizontal scale and mark it with appropriate values of the independent variable - Draw a vertical scale and mark it appropriate values of the dependent variable - Plot each point corresponding to the observations To describe - comment the relationship between the variables Scatterplots are discussed in much greater depth in Chapter 5.

50 Time Series Plots When to Use How to construct To describe
- measurements collected over time at regular intervals How to construct - Draw a horizontal scale and mark it with appropriate values of time - Draw a vertical scale and mark it appropriate values of the observed variable - Plot each point corresponding to the observations and connect To describe - comment on any trends or patterns over time Can be considered bivariate data where the y-variable is the variable measured and the x-variable is time

51 The accompanying time-series plot of movie box office totals (in millions of dollars) over 18 weeks in the summer for 2001 and 2002 appeared in USA Today (September 3, 2002). Describe any trends or patterns that you see.


Download ppt "Graphs for categorical data"

Similar presentations


Ads by Google