Numerical Methods for Describing Data Distributions


1 Numerical Methods for Describing Data Distributions
Chapter 3 Numerical Methods for Describing Data Distributions Created by Kathy Fritz

2 Suppose that you have just received your score on an exam in one of your classes. What would you want to know about the distribution of scores for this exam? Measures of center Measures of spread

3 The stress of the final years of medical training can contribute to depression and burnout. The authors of the paper “Rates of Medication Errors Among Depressed and Burnt Out Residents” (British Medical Journal [2008]: 488) studied 24 residents in pediatrics. Medical records of patients treated by these residents during a fixed time period were examined for errors in ordering or administering medications. The accompanying dotplot displays the total number of medication errors for each of the 24 residents.

4 Choosing Appropriate Measures for Describing Center and Spread
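If the shape of the data distribution is approximately symmetric, describe center and spread using the mean and standard deviation. If the shape of the data distribution is skewed or has outliers, describe center and spread using the median and interquartile range.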

5 Describing Center and Spread For Data Distributions That Are Approximately Symmetric
Mean Standard Deviation

6 Mean Some notation: x = the variable of interest; n = the sample size; x1, x2, …, xn are the individual observations in the data set. Definition: The sample mean, $\bar{x}$, is the arithmetic average of the n observations in the sample: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$. In mathematics, the capital Greek letter Σ is short for “add them all up.” Therefore, the formula for the mean can be written in more compact notation: $\bar{x} = \frac{\sum x}{n}$. The population mean, μ (the Greek letter mu), is the arithmetic average of all the x values in an entire population.

7 Measuring Center Use the data below to calculate the mean of the commuting times (in minutes) of 20 randomly selected New York workers. 10 30 5 25 40 20 15 85 65 60 45 [Stemplot of the commuting times.] Key: 4|5 represents a New York worker who reported a 45-minute travel time to work.

8 Measuring Variability
Consider the three sets of six exam scores displayed below: Each data set has a mean exam score of 75. Does that completely describe these data sets?

9 Range

10 Deviations The most widely used measures of variability are based on how far each observation deviates from the mean.

11 Variance and Standard Deviation
Suppose that we are interested in finding the “typical” or average deviation from the mean. The deviations from the mean were -25, -15, -5, 5, 15, and 25. These deviations add up to zero, so to calculate the “typical” or average deviation from the mean, we must first square each deviation. Then all the squared deviations are positive. The squares of these deviations from the mean are 625, 225, 25, 25, 225, and 625. Now we can average these.
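The arithmetic on this slide can be checked with a short sketch. The six scores below are an assumption: they are reconstructed by adding the listed deviations to the mean of 75 quoted on the earlier slide, since the original score table does not appear in the transcript.

```python
# Assumed data: mean of 75 plus the listed deviations -25, -15, -5, 5, 15, 25.
scores = [50, 60, 70, 80, 90, 100]

mean = sum(scores) / len(scores)

# Deviation of each score from the mean.
deviations = [x - mean for x in scores]

# Squaring removes the signs, so the values no longer cancel out.
squared_deviations = [d ** 2 for d in deviations]

print(mean)                     # 75.0
print(sum(deviations))          # 0.0
print(squared_deviations)       # [625.0, 225.0, 25.0, 25.0, 225.0, 625.0]
print(sum(squared_deviations))  # 1750.0
```

The zero sum is why the plain average of the deviations cannot serve as a measure of spread.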

12 Variance and Standard Deviation

13 Variance and Standard Deviation

14 Variance and Standard Deviation
Consider the following data on the number of pets owned by a group of 9 children. (One deviation from the mean is marked on the slide: deviation = 3.) Can we just calculate the arithmetic average of the deviations from the mean? Why or why not? Since the sum of the deviations from the mean is always zero, you cannot just add the deviations and then divide by the number of deviations. What do you do? Wait a minute . . . If the data values represented the entire population, then we would divide by the number of values (n). However, more often than not, the data values represent a sample from the population and we divide by (n – 1). Why? If the spread of the population were from 50 to 100, samples would rarely have the same spread. The samples would tend to have a smaller spread (less variability). By dividing by the smaller number n – 1, we get a better estimate of the true “typical” deviation from the mean.
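To see what the choice of divisor does, here is a minimal sketch using Python's standard `statistics` module. The six scores are an assumed illustration (a mean of 75 with deviations of ±5, ±15, and ±25), not the pets data, which is not fully recoverable from this transcript.

```python
import math
import statistics

# Assumed illustrative data: six exam scores with mean 75.
scores = [50, 60, 70, 80, 90, 100]

# Treating the values as an entire population: divide by n.
population_variance = statistics.pvariance(scores)

# Treating the values as a sample: divide by n - 1.
sample_variance = statistics.variance(scores)
sample_sd = math.sqrt(sample_variance)

print(round(population_variance, 2))  # 291.67
print(sample_variance)                # 350
print(round(sample_sd, 2))            # 18.71
```

Dividing by n – 1 always gives the larger of the two values, which compensates for the tendency of samples to show less spread than the population.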

15 Measuring Spread: The Standard Deviation
Table columns: $x_i$, $(x_i - \bar{x})$, $(x_i - \bar{x})^2$. Values of $x_i$ listed: 1 3 4 5 7 8 9. “Average” squared deviation = 52/(9 – 1) = 6.5. This is the variance. Standard deviation = square root of variance = $\sqrt{6.5} \approx 2.55$.

16 Notation to remember

17 Putting it Together

18

19 Describing Center and Spread For Data Distributions That Are Skewed or Have Outliers
Median Interquartile Range

20 Median The median M The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then . . .

21 Forty students were enrolled in a statistical reasoning course at a California college. The instructor made course materials, grades, and lecture notes available to students on a class web site. Course management software kept track of how often each student accessed any of these web pages. The data set below (in order from smallest to largest) is the number of times each of the 40 students had accessed the class web page during the first month. 3 4 5 7 8 12 13 14 16 18 19 20 21 22 23 26 36 37 42 84 331

22 Comparing the Mean and the Median
The mean and median measure center in different ways, and both are useful. Don’t confuse the “average” value of a variable (the mean) with its “typical” value, which we might describe by the median.
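A small sketch makes the difference concrete. It uses only the 21 website-access counts that appear in this transcript, which is an assumption of convenience: the original data set has 40 students, so these results are illustrative and are not the slide's own answers.

```python
import statistics

# The 21 values listed in the transcript (the full data set has 40 students).
visits = [3, 4, 5, 7, 8, 12, 13, 14, 16, 18, 19,
          20, 21, 22, 23, 26, 36, 37, 42, 84, 331]

mean_all = statistics.mean(visits)
median_all = statistics.median(visits)

# Drop the single extreme value (331) and recompute both measures.
without_extreme = visits[:-1]
mean_trimmed = statistics.mean(without_extreme)
median_trimmed = statistics.median(without_extreme)

print(round(mean_all, 2), median_all)  # 36.24 19
print(mean_trimmed, median_trimmed)    # 21.5 18.5
```

Removing one observation moves the mean by almost 15 visits but moves the median by only half a visit.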

23 Measuring Spread - Interquartile Range
The interquartile range (iqr) is based on quantities called quartiles, which divide the data set into four equal parts (quarters). Lower quartile (Q1) = median of the lower half of the data set. Upper quartile (Q3) = median of the upper half of the data set. If n is odd, the median of the entire data set is excluded from both halves when computing quartiles. The sample standard deviation, s, can also be greatly affected by the presence of even one outlier. The interquartile range is a measure of variability that is resistant to the effects of outliers.

24 How to Calculate the Quartiles and the Interquartile Range
A measure of center alone can be misleading. A useful numerical description of a distribution requires both a measure of center and a measure of spread. To calculate the quartiles: order the data, split it at the median into a lower half and an upper half, and take the median of each half. The interquartile range is the upper quartile minus the lower quartile.

25 Recall the website data set:
3 4 5 7 8 12 13 14 16 18 19 20 21 22 23 26 36 37 42 84 331
The lower quartile (Q1) is the median of the lower 20 data values. The upper quartile (Q3) is the median of the upper 20 data values. The interquartile range (iqr) is the difference between the upper and lower quartiles.
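The median-of-halves procedure can be sketched as follows. The helper name `quartiles` is hypothetical, and the input is only the 21 values that appear in this transcript, so the output illustrates the method (including the odd-n case) and is not the answer for the full 40-student data set.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2  # when n is odd, the overall median is left out of both halves
    lower_half = ordered[:half]
    upper_half = ordered[n - half:]
    return statistics.median(lower_half), statistics.median(upper_half)

# The 21 values listed in the transcript.
visits = [3, 4, 5, 7, 8, 12, 13, 14, 16, 18, 19,
          20, 21, 22, 23, 26, 36, 37, 42, 84, 331]

q1, q3 = quartiles(visits)
iqr = q3 - q1
print(q1, q3, iqr)  # 10.0 31.0 21.0
```

The extreme value 331 sits in the upper half but does not affect Q3, which is why the interquartile range is described as resistant to outliers.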

26 Putting it Together The Chronicle of Higher Education (Almanac Issue, ) published the accompanying data on the percentage of the population with a bachelor’s degree or graduate degree in 2007 for each of the 50 U.S. states and the District of Columbia. The data distribution is shown in the histogram below. Step 1: Select

27 Putting it Together Step 2: Calculations Step 3: Interpret

28 Find and Interpret the IQR
Travel times to work for 20 randomly selected New Yorkers: 10 30 5 25 40 20 15 85 65 60 45
In order: 5 10 15 20 25 30 40 45 60 65 85
Q1 = 15, M = 22.5, Q3 = 42.5
IQR = Q3 – Q1 = 42.5 – 15 = 27.5 minutes
Interpretation: The range of the middle half of travel times for the New Yorkers in the sample is 27.5 minutes.

29 Boxplots General Boxplots Modified Boxplots

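30 Five-Number Summary The five-number summary consists of the following: the smallest observation, the lower quartile (Q1), the median, the upper quartile (Q3), and the largest observation.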

31 Boxplots When to Use Univariate numerical data How to construct
What to look for: center, spread, and shape of the data distribution, and whether there are any unusual features.

32 Boxplot Example

33 Comparative Boxplots A comparative boxplot is
Recall the video game study. There were two groups: 1) told to improve total score or 2) told to improve a different aspect, such as speed.

34 Identifying Outliers
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers. Definition: The 1.5 × IQR Rule for Outliers. Call an observation an outlier if it falls more than 1.5 × IQR above the upper quartile or below the lower quartile. [Stemplot of the travel times.] In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes.
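As a sketch of how the rule applies here, the code below builds the two cutoffs from the quartiles quoted on the slide, taking the usual form of the rule as an assumption: an observation is flagged when it lies more than 1.5 × IQR below Q1 or above Q3. The travel times are the values listed in this transcript, and the variable names are illustrative.

```python
# Quartiles quoted on the slide for the New York travel time data.
q1, q3 = 15, 42.5
iqr = q3 - q1

# Cutoffs ("fences") for the 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Travel times as listed in the transcript.
times = [10, 30, 5, 25, 40, 20, 15, 85, 65, 60, 45]
outliers = [t for t in times if t < lower_fence or t > upper_fence]

print(lower_fence, upper_fence)  # -26.25 83.75
print(outliers)                  # [85]
```

No travel time can fall below the negative lower cutoff, so only the 85-minute commute is flagged.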

35 Modified boxplots How to construct
Compute the values in the five-number summary.
Draw a horizontal line and add an appropriate scale.
Draw a box above the line that extends from the lower quartile (Q1) to the upper quartile (Q3).
Draw a line segment inside the box at the location of the median.

36 Construct a Boxplot Consider our NY travel times data. Construct a boxplot. 10 30 5 25 40 20 15 85 65 60 45

37 Big Mac prices in U.S. dollars for 44 different countries were given in the article “Big Mac Index 2010”. The following 44 Big Mac prices are arranged in order from the lowest price (Ukraine) to the highest price (Norway). 1.84 1.86 1.90 1.95 2.17 2.19 2.28 2.33 2.34 2.45 2.46 2.50 2.51 2.60 2.62 2.67 2.71 2.80 2.82 2.99 3.08 3.33 3.34 3.43 3.48 3.54 3.56 3.59 3.67 3.73 3.74 3.83 3.84 3.86 3.89 4.00 4.33 4.39 4.90 4.91 6.19 6.56 7.20

38 Big Mac Prices Continued . . .
Smallest observation = Upper quartile = Lower quartile = Median = Largest observation =
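The five quantities named above, plus the outlier check needed for a modified boxplot, can be sketched as below. Two assumptions apply: the list holds only the 43 prices that appear in this transcript (the slide refers to 44 countries), so the output illustrates the method and is not the slide's answer, and the 1.5 × IQR rule is used in its usual form. The helper name `quartiles` is hypothetical.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2  # the overall median is excluded when n is odd
    return (statistics.median(ordered[:half]),
            statistics.median(ordered[n - half:]))

# The 43 prices listed in the transcript.
prices = [1.84, 1.86, 1.90, 1.95, 2.17, 2.19, 2.28, 2.33, 2.34, 2.45,
          2.46, 2.50, 2.51, 2.60, 2.62, 2.67, 2.71, 2.80, 2.82, 2.99,
          3.08, 3.33, 3.34, 3.43, 3.48, 3.54, 3.56, 3.59, 3.67, 3.73,
          3.74, 3.83, 3.84, 3.86, 3.89, 4.00, 4.33, 4.39, 4.90, 4.91,
          6.19, 6.56, 7.20]

q1, q3 = quartiles(prices)
five_number_summary = (min(prices), q1, statistics.median(prices),
                       q3, max(prices))

# Points beyond the fences are drawn individually in a modified boxplot.
iqr = q3 - q1
outliers = [p for p in prices
            if p < q1 - 1.5 * iqr or p > q3 + 1.5 * iqr]

print(five_number_summary)  # (1.84, 2.46, 3.33, 3.84, 7.2)
print(outliers)             # [6.19, 6.56, 7.2]
```

In a modified boxplot of these listed values, the upper whisker would stop at 4.91, the largest price that is not an outlier.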

39 The salaries of NBA players published on the web site hoopshype.com were used to construct the comparative boxplot of salary data for five teams. See page 198 for more information.

40 Measures of Relative Standing
z -scores Percentiles

41 Percentiles For a number r between 0 and 100, the rth percentile is a value such that r percent of the observations fall AT or BELOW that value. This diagram illustrates the 90th percentile.

42 Measuring Position: Percentiles
One way to describe the location of a value in a distribution is to tell what percent of observations are less than it. [Stemplot of the class test scores.] Jenny earned a score of 86 on her test. How did she perform relative to the rest of the class?

43 In addition to weight and length, head circumference is another measure of health in newborn babies. The National Center for Health Statistics reports the following summary values for head circumference (in cm) at birth for boys.
Percentile                 5     10    25    50    75    90    95
Head circumference (cm)    32.2  33.2  34.5  35.8  37.0  38.2  38.6
What value of head circumference is at the 75th percentile? What is the median value of head circumference?

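44 z-scores Definition: The z-score tells you how many standard deviations the value is from the mean: $z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$.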

45 Measuring Position: z-Scores
Jenny earned a score of 86 on her test. The class mean is 80 and the standard deviation is What is her standardized score?

46 Using z-scores for Comparison
We can use z-scores to compare the position of individuals in different distributions. Jenny earned a score of 86 on her statistics test. The class mean was 80 and the standard deviation was She earned a score of 82 on her chemistry test. The chemistry scores had a fairly symmetric distribution with a mean 76 and standard deviation of 4. On which test did Jenny perform better relative to the rest of her class?

47 What do these z-scores mean?
-2.3 1.8

48 Suppose that two graduating seniors, one a marketing major and one an accounting major, are comparing job offers. The accounting major has an offer for $45,000 per year, and the marketing major has an offer for $43,000 per year. Accounting: mean = 46,000 standard deviation = 1500 Marketing: mean = 42,500 standard deviation = 1000
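The comparison can be worked through with a short sketch, assuming the usual definition of a z-score as (value – mean) / standard deviation. The function name `z_score` is illustrative. The chemistry score from the earlier slide is included because all three of its numbers are given.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Job offers relative to the typical offer in each major.
z_accounting = z_score(45_000, 46_000, 1_500)
z_marketing = z_score(43_000, 42_500, 1_000)

# Jenny's chemistry test: score 82, mean 76, standard deviation 4.
z_chemistry = z_score(82, 76, 4)

print(round(z_accounting, 2))  # -0.67
print(z_marketing)             # 0.5
print(z_chemistry)             # 1.5
```

The accounting offer is larger in dollars, but it sits about two thirds of a standard deviation below the mean for accounting majors, while the marketing offer is half a standard deviation above the mean for marketing majors.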

49 Density Curve Definition: A density curve is a curve that
A density curve describes the overall pattern of a distribution. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval. The overall pattern of this histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa Test of Basic Skills (ITBS) can be described by a smooth curve drawn through the tops of the bars.

50 Normal Distributions One particularly important class of density curves is the Normal curves, which describe Normal distributions. All Normal curves are symmetric, single-peaked, and bell-shaped. A specific Normal curve is described by giving its mean µ and standard deviation σ. [Figure: two Normal curves, showing the mean µ and standard deviation σ.]

51 Normal Distributions Definition:
A Normal distribution is described by a Normal density curve. Any particular Normal distribution is completely specified by two numbers: its mean µ and standard deviation σ. Normal distributions are good descriptions for some distributions of real data. Normal distributions are good approximations of the results of many kinds of chance outcomes. Many statistical inference procedures are based on Normal distributions.

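52 Empirical Rule If the data distribution is mound shaped and approximately symmetric, then . . . Approximately 68% of the observations are within 1 standard deviation of the mean. Approximately 95% of the observations are within 2 standard deviations of the mean. Approximately 99.7% of the observations are within 3 standard deviations of the mean.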

53 Empirical Rule This illustrates the percentages given by the Empirical Rule.

54 The distribution of Iowa Test of Basic Skills (ITBS) vocabulary scores for 7th grade students in Gary, Indiana, is close to Normal. Suppose the distribution is N(6.84, 1.55). Sketch the Normal density curve for this distribution. What percent of ITBS vocabulary scores are less than 3.74? What percent of the scores are between 5.29 and 9.94?
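One way to approach these questions is sketched below, assuming the usual 68/95/99.7 form of the Empirical Rule and reading N(6.84, 1.55) as mean 6.84 and standard deviation 1.55. The comparison against an exact Normal curve uses `statistics.NormalDist` from the Python standard library (3.8+); that comparison is an addition for checking and is not part of the slide.

```python
from statistics import NormalDist

mu, sigma = 6.84, 1.55

def z(x):
    """Standardized score of x under N(mu, sigma)."""
    return (x - mu) / sigma

# Each cutoff is a whole number of standard deviations from the mean.
z_low, z_mid, z_high = round(z(3.74), 2), round(z(5.29), 2), round(z(9.94), 2)

# Empirical Rule: 95% lies within 2 sd, so half of the remaining 5% is below -2 sd.
empirical_below = (100 - 95) / 2
# From -1 sd to the mean is 68/2; from the mean to +2 sd is 95/2.
empirical_between = 68 / 2 + 95 / 2

# Exact areas under the Normal curve, for comparison.
dist = NormalDist(mu, sigma)
exact_below = 100 * dist.cdf(3.74)
exact_between = 100 * (dist.cdf(9.94) - dist.cdf(5.29))

print(z_low, z_mid, z_high)                # -2.0 -1.0 2.0
print(empirical_below, empirical_between)  # 2.5 81.5
print(round(exact_below, 1), round(exact_between, 1))  # 2.3 81.9
```

The Empirical Rule answers (2.5% and 81.5%) are close to the exact Normal areas, as expected for a rule stated in rounded percentages.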

55 Common Mistakes

56 Avoid these Common Mistakes
Watch out for categorical data that look numerical! Often, categorical data is coded numerically. For example gender might be coded as 0 = female and 1 = male, but this does not make gender a numerical variable. Categorical data CANNOT be summarized using the mean and standard deviation or the median and interquartile range.

57 Avoid these Common Mistakes
Measures of center don’t tell all. Although measures of center, such as the mean and the median, do give you a sense of what might be a typical value for a variable, this is only one characteristic of a data set. Without additional information about variability and distribution shape, you don’t really know much about the behavior of the variable.

58 Avoid these Common Mistakes
Data distributions with different shapes can have the same mean and standard deviation. For example, consider the following two histograms: Both histograms have the same mean of 10 and standard deviation of 2, but VERY different shapes.

59 Avoid these Common Mistakes
Both the mean and the standard deviation are sensitive to extreme values in a data set, especially if the sample size is small. If the data distribution is markedly skewed or if the data set has outliers, the median and interquartile range are a better choice for describing center and spread.

60 Avoid these Common Mistakes
Measures of center and measures of variability describe values of a variable, not frequencies in a frequency distribution or heights of bars in a histogram. For example, consider the following two frequency distributions and histograms:

61 Avoid these Common Mistakes
Be careful with boxplots based on small sample sizes. Boxplots convey information about center, variability, and shape, but interpreting shape information is problematic when the sample size is small.

62 Avoid these Common Mistakes
Not all distributions are mound shaped. Using the Empirical Rule in situations where you are not convinced that the data distribution is mound shaped and approximately symmetric can lead to incorrect statements.

63 Avoid these Common Mistakes
Watch for outliers! Unusual observations in a data set often provide important information about the variable under study, so it is important to consider outliers in addition to describing what is typical. Outliers can also be problematic because the values of some summaries are influenced by outliers and because some methods for drawing conclusions from data are not appropriate if the data set has outliers.

