1 Chapter 7 Looking at Distributions
2 Modeling by a Distribution For a given data set, we want to know which distribution fits each variable. This is a modeling problem. When we have prior knowledge suggesting a specific type of distribution (normal, exponential, Poisson), a goodness-of-fit test is useful. Q-Q plots of various kinds are also very useful for finding a suitable distribution to fit the data.
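As a rough illustration of what a normal Q-Q plot compares, the sketch below pairs each sorted observation with the quantile a fitted normal distribution would place at the same rank. The function name normal_qq_points, the plotting position (i - 0.5) / n, and the sample values are assumptions made for this example; SPSS may use a different plotting position.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Pair each sorted observation with the fitted-normal quantile at its rank."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # (i - 0.5) / n is one common plotting-position convention; others exist.
    return [(fitted.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]


# Hypothetical completion times in hours, for illustration only.
times = [3.1, 3.4, 3.6, 3.9, 4.0, 4.2, 4.5, 4.9, 5.6, 6.3]
points = normal_qq_points(times)
for theoretical, observed in points:
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the points fall close to the line observed = theoretical, the normal distribution is a plausible fit; systematic curvature suggests otherwise.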
3 Two data sets The contents of this chapter are from Chapter 7 of the textbook. The textbook uses the data set marathon.sav to show how to use SPSS for looking at distributions. The Chicago Marathon has been run yearly since ... Because we use the student version of SPSS, which limits the number of rows and columns, we use a similar data set, mar1500.sav, instead.
4 Data set “mar1500.sav” The data set contains the following variables: “age”, “sex”, “hours”, “agecat8”, and “agecat6”.
hours: completion time in hours.
agecat8: 1 = 24 or less, 2 = 25-39, 3 = 40-44, 4 = 45-49, 5 = 50-54, 6 = 55-59, 7 = 60-64, 8 = 65+.
agecat6: 1 = 44 or less, 2 = 45-49, 3 = 50-54, 4 = 55-59, 5 = 60-64, 6 = 65+.
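The two age codings can be reproduced from “age” with a small lookup. This is a minimal sketch: the cut points come from the coding listed on this slide, while the function name age_category and the constant names are made up for illustration.

```python
from bisect import bisect_right

# Lower bounds of categories 2, 3, ... as given in the coding on the slide.
AGECAT8_CUTS = [25, 40, 45, 50, 55, 60, 65]
AGECAT6_CUTS = [45, 50, 55, 60, 65]


def age_category(age, cuts):
    """Return the 1-based category whose age range contains `age`."""
    return bisect_right(cuts, age) + 1


for age in (24, 39, 44, 65):
    print(age, age_category(age, AGECAT8_CUTS), age_category(age, AGECAT6_CUTS))
```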
5 Histogram
6 Impressions of the histogram The mean falls in ... The distribution is not symmetric about the mean; it has a tail toward larger times. Low marathon times are difficult to achieve: it is hard to break the world record. Since the distribution has a tail toward larger values, the median should be somewhat less than the mean.
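The relation between the tail and the two averages can be checked on a small example. The values below are hypothetical and are not taken from mar1500.sav; they only mimic a distribution with a tail toward larger times.

```python
from statistics import mean, median

# Hypothetical right-skewed completion times in hours (not the real data).
skewed_times = [3.2, 3.5, 3.7, 3.9, 4.0, 4.2, 4.4, 4.8, 5.5, 6.8]

sample_mean = mean(skewed_times)
sample_median = median(skewed_times)
# The few large values pull the mean upward but barely move the median.
print(f"mean = {sample_mean:.2f}, median = {sample_median:.2f}")
```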
7 Basic statistics
8 The 5% trimmed mean excludes the largest 5% and the smallest 5% of the values, so it is based on the middle 90% of cases. The trimmed mean provides an alternative to the median when there are some outliers. In these data the 5% trimmed mean does not differ much from the usual mean, because the distribution is not too far from symmetric.
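A minimal sketch of the idea follows, assuming whole cases are dropped from each end; SPSS may handle a fractional number of trimmed cases differently, so results can differ slightly. The function name trimmed_mean and the sample values are hypothetical.

```python
from statistics import mean


def trimmed_mean(values, percent=5):
    """Mean after dropping `percent`% of the cases from each end (whole cases only)."""
    xs = sorted(values)
    k = len(xs) * percent // 100  # number of cases dropped at each end
    return mean(xs[k:len(xs) - k])


# Hypothetical times: 19 ordinary values plus one very slow finisher.
sample = [round(3.0 + 0.1 * i, 1) for i in range(19)] + [9.5]
print(f"mean = {mean(sample):.2f}, 5% trimmed mean = {trimmed_mean(sample):.2f}")
```

With 20 cases, one case is removed from each end, so the single slow finisher no longer influences the result.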
9 Comparison of completion times by gender
10 Comparison of completion times by gender
11 Comparison of completion times by gender The difference between men and women in all of the percentile values of completion time is about ... hour. The weighted percentiles and Tukey’s hinges are two different ways of calculating sample percentiles. For more details, see p. 120.
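The sketch below shows why the two kinds of percentiles can disagree on the same data. It assumes the weighted percentile interpolates at position (n + 1)p, which is what Python's statistics.quantiles computes with method="exclusive", and it takes Tukey's hinges as the medians of the lower and upper halves, each half including the overall median when n is odd. Whether these match the SPSS definitions exactly is an assumption; the data are hypothetical.

```python
from statistics import median, quantiles


def tukey_hinges(values):
    """Medians of the lower and upper halves (each includes the median if n is odd)."""
    xs = sorted(values)
    half = (len(xs) + 1) // 2
    return median(xs[:half]), median(xs[-half:])


# Hypothetical data chosen so the two definitions give different answers.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
lower_hinge, upper_hinge = tukey_hinges(data)
print("weighted 25th and 75th percentiles:", q1, q3)
print("Tukey's hinges:", lower_hinge, upper_hinge)
```

For large samples such as the marathon data the two methods usually give nearly identical values.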
12 Histogram of completion times for women
13 Histogram of completion times for men
14 Age and Gender
15 Age and Gender
16 Boxplots of completion times by age and gender
17 Remarks Average completion times for men and women of different ages are shown. In every age group, the average time for men is less than the average time for women. For men and women younger than 45, age does not seem to matter very much. For both men and women, the variability of completion times is quite stable across age groups, except in the oldest age group.
18 Detecting outliers Cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box are called outliers and are designated with an “o”. Cases with values of more than 3 box lengths from the upper or lower edge of the box are called extreme values. They are designated with “*”.
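The rule above can be written as a short function. This is a sketch only: the name classify_case and the box edges in the example are hypothetical, and the edges of the box are taken as given rather than computed from data.

```python
def classify_case(value, lower_edge, upper_edge):
    """Label a case by its distance from the box, measured in box lengths."""
    box_length = upper_edge - lower_edge
    # Distance beyond the nearer edge; zero for values inside the box.
    distance = max(lower_edge - value, value - upper_edge, 0)
    if distance > 3 * box_length:
        return "extreme"  # plotted as "*"
    if distance > 1.5 * box_length:
        return "outlier"  # plotted as "o"
    return "regular"


# Hypothetical box with edges at 3.5 and 4.5 hours (box length 1.0).
for hours in (5.0, 6.2, 8.0):
    print(hours, classify_case(hours, 3.5, 4.5))
```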
19
20 Stem-and-leaf plots A stem-and-leaf plot is a display very much like a histogram, but it shows more information about the data. In a stem-and-leaf plot, each row corresponds to a stem and each case is represented by a leaf.
21 Stem-and-leaf plots The following are the prices paid by 15 students eating lunch at a fast-food restaurant: 5.35, 4.75, 4.30, 5.47, 4.85, 6.62, 3.54, 4.87, 6.26, 5.48, 7.27, 8.45, 6.05, 4.76, ... The first value, 5.35, is rounded to 5.4, and the second value, 4.75, is rounded to 4.8. Their stems are 5 and 4, respectively, and their leaves are 4 and 8, respectively. In the resulting plot, for example, the row for stem 5 reads 5 | 4559 and the row for stem 6 reads 6 | 631 (frequency 3).
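The construction can be sketched in a few lines, assuming each price is rounded half up to one decimal place, the stem is the whole-number part, and the leaf is the tenths digit. Only the 14 prices legible on the slide are used, so the row for stem 5 has one leaf fewer than the plot on the slide. The function name stem_and_leaf is made up for this example.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP


def stem_and_leaf(values):
    """Map each integer stem to its tenths-digit leaves, in input order."""
    rows = defaultdict(str)
    for v in values:
        # Round half up to one decimal place, e.g. 5.35 -> 5.4.
        rounded = Decimal(v).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
        stem, leaf = divmod(int(rounded * 10), 10)
        rows[stem] += str(leaf)
    return dict(sorted(rows.items()))


# Prices are given as strings so that 5.35 is exactly 5.35, not a binary float.
prices = ["5.35", "4.75", "4.30", "5.47", "4.85", "6.62", "3.54",
          "4.87", "6.26", "5.48", "7.27", "8.45", "6.05", "4.76"]
plot = stem_and_leaf(prices)
for stem, leaves in plot.items():
    print(f"{len(leaves):>2}  {stem} | {leaves}")
```

Leaves appear in the order the values were entered; sorting the leaves within each row is a common refinement.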
22 Stem-and-leaf plots [SPSS output: stem-and-leaf plot of completion time in hours for agecat6 = ...; columns: Frequency, Stem & Leaf; extremes (>=6.2); stem width: 1.00; each leaf: 1 case(s)]