10 Chapter Data Analysis/Statistics: An Introduction

Copyright © 2016, 2013, and 2010, Pearson Education, Inc.

10-4 Measures of Central Tendency and Variation
Students will be able to understand and explain: • Central tendency and variation (spread); • Computing the mean; • Median and mode of a data set; • Appropriate measures of central tendency; • Measures of spread; • Boxplots and comparing sets of data; • Variance and standard deviation; • Normal distribution; and • Percentiles, quartiles, and deciles.

Measures of Central Tendency
Two important aspects of data are its center and its spread. The mean and median are measures of central tendency that describe where data are centered. The range, interquartile range, variance, mean absolute deviation, and standard deviation describe the spread of data and should be used with measures of central tendency.

Computing Means The number commonly used to characterize a set of data is the arithmetic mean, frequently called the average, or the mean. The arithmetic mean of the numbers $x_1, x_2, \ldots, x_n$, denoted $\bar{x}$ and read “x bar,” is given by $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$.

Understanding the Mean as a Balance Point
The mean of 5 is the balance point where the sum of the distances from the mean to the data points above the mean equals the sum of the distances from the mean to the data points below the mean.

The sum of the distances above the mean is 8. The sum of the distances below the mean is also 8. The data are centered about the mean, but the mean does not belong to the set of data.
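The balance-point property can be checked numerically. The sketch below uses a hypothetical data set (1, 2, 4, 8, 10), chosen only because its mean is 5 and the distances on each side sum to 8; it is an assumption for illustration and not necessarily the data shown in the original figure.

```python
from statistics import mean

# Hypothetical data chosen for illustration; not taken from the slide's figure.
data = [1, 2, 4, 8, 10]
center = mean(data)  # arithmetic mean: sum of the values divided by their count

# Total distance from the mean for the points above it and for the points below it.
above = sum(x - center for x in data if x > center)
below = sum(center - x for x in data if x < center)

print(center, above, below)
```

For any data set the two totals agree, because the deviations from the mean always sum to zero.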

It is possible to rearrange the data to have the same mean but be spread very differently.

Computing Medians The value exactly in the middle of an ordered set of numbers is the median. To find the median for a set of n numbers, arrange the numbers in order from least to greatest.
a. If n is odd, the median is the middle number.
b. If n is even, the median is the mean of the two middle numbers.
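The odd and even cases of the median rule can be sketched as a short function. The function name and the sample values are illustrative assumptions; Python's standard library offers `statistics.median`, which follows the same rule.

```python
def median(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)  # arrange from least to greatest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # n is odd: the single middle number
    return (ordered[mid - 1] + ordered[mid]) / 2  # n is even: mean of the two middle numbers

# Illustrative values only.
odd_case = median([7, 3, 5])
even_case = median([20, 22, 22, 25, 26, 27, 27, 28, 30, 35])
print(odd_case, even_case)
```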

Computing Medians A median is often reported with the interquartile range, a measure of spread that shows where the middle 50% of the scores lie with the median in that range. The two together form a much better pair to describe the data than the median alone.

Finding Modes The mode of a set of data is sometimes reported as a measure of central tendency, but when it is reported in that form, it is frequently being misused. The mode of a set of data is the number that appears most frequently, if there is one, but the mode does not have to be in any way a measure of central tendency. A mode is frequently reported with categorical data.

Example Find the (a) mean, (b) median, and (c) mode for the data:
a. b. The data is arranged from least to greatest and there are an even number of data, so the median is

Example (continued) Find the (a) mean, (b) median, and (c) mode for the data: c. The set of data is bimodal. Both 60 and 95 are modes.

Choosing the Most Appropriate Average
Although the mean is the most commonly used “average” to describe a set of data, it may not always be the most appropriate choice.

Example Suppose a company employs 20 people. The president of the company earns $200,000, the vice president earns $75,000, and 18 employees earn $10,000 each. Is the mean the best number to choose to represent the “average” salary for the company?

Example (continued) The mean salary is ($200,000 + $75,000 + 18 × $10,000) ÷ 20 = $455,000 ÷ 20 = $22,750.
In this case, the mean salary of $22,750 is not representative. Either the median or mode, both of which are $10,000, would describe the typical salary better. The mean is affected by extreme values. In most cases, the median is not affected by extreme values.
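The salary example can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative, and the data are the 20 salaries described in the example.

```python
from statistics import mean, median, mode

# One president, one vice president, and 18 employees at $10,000 each.
salaries = [200_000, 75_000] + [10_000] * 18

mean_salary = mean(salaries)      # pulled upward by the two large salaries
median_salary = median(salaries)  # middle of the ordered list
mode_salary = mode(salaries)      # most frequent value

print(mean_salary, median_salary, mode_salary)
```

Replacing the president's salary with a much larger figure would change the mean but leave the median and mode at $10,000, which is the sense in which the mean is sensitive to extreme values.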

Example Suppose nine students make the following scores on a test:
Is the median the best “average” to represent the set of scores? The median score is 92.

Example (continued) From that score, one might infer that the students all scored very well, yet 92 is certainly not a typical score. In this case, the mean of approximately 69 might be more appropriate than the median. However, with the spread of the scores, neither is very appropriate for this distribution.

Example Is the mode an appropriate “average” for the following test scores? The mode is 98. The score of 98 is not representative of the set of data because of the large spread of scores and the much lower mean (and median).

Measures of Spread Consider the following data: 20, 22, 22, 25, 26, 27, 27, 28, 30, 35. Range = upper extreme − lower extreme = 35 − 20 = 15.

Measures of Spread 20, 22, 22, 25, 26, 27, 27, 28, 30, 35 The median of the lower half (20, 22, 22, 25, 26) is 22 = Q1. The median of the upper half (27, 27, 28, 30, 35) is 28 = Q3. Interquartile range (IQR) = Q3 − Q1 = 28 − 22 = 6. When the interquartile range is reported along with the median, not only do we know the middle, we also know how spread out the middle 50% of the data are.
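The range and interquartile range for this data set can be computed with a small helper that takes the quartiles as the medians of the lower and upper halves. The helper name is hypothetical, and skipping the middle value when the count is odd is an assumed convention; statistical software often uses slightly different quartile rules.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values below the median position
    upper = ordered[half + n % 2:]  # for odd n the middle value is skipped (assumed convention)
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

data = [20, 22, 22, 25, 26, 27, 27, 28, 30, 35]
low, q1, med, q3, high = five_number_summary(data)
data_range = high - low  # upper extreme minus lower extreme
iqr = q3 - q1            # spread of the middle 50% of the data

print(data_range, q1, q3, iqr)
```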

Box Plots A box plot is a way to display data visually and draw informal conclusions. Box plots show only the visual representations of the five-number summary of the data: the median, the upper and lower quartiles, and the least and greatest values in the distribution.

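Box Plots [Figure: a box plot above a number line, with the minimum and maximum data points at the ends of the whiskers, the box running from Q1 to Q3 with the median (Q2) marked inside it, and the bottom 25% and top 25% of the data lying along the two whiskers.]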

Example What are the minimum and maximum values, the median, and the lower and upper quartiles of the box plot below? Minimum: 0 Maximum: 70 Median: 20 Lower quartile: 10 Upper quartile: 35

Outliers An outlier is a value that is widely separated from the rest of a group of data. In the set of scores 91, 92, 92, 93, 93, 93, 94, all data are grouped close together and no values are widely separated. In the set of scores 21, 92, 92, 93, 93, 93, 95, 150, both 21 and 150 are widely separated from the rest of the data. These values are potential outliers.

Outliers An outlier is any value that is more than 1.5 times the interquartile range above the upper quartile or below the lower quartile. Outliers are commonly indicated with an asterisk. Whiskers are then drawn to the extreme points that are not outliers.
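The 1.5 × IQR rule can be sketched as follows, using the two sets of scores from the outlier discussion. The function name is hypothetical, and the quartiles are taken as medians of the lower and upper halves, which is an assumption about the convention in use.

```python
from statistics import median

def find_outliers(values):
    """Return the values more than 1.5 * IQR below Q1 or above Q3."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])          # lower quartile
    q3 = median(ordered[half + n % 2:])  # upper quartile
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in ordered if x < low_fence or x > high_fence]

tight = find_outliers([91, 92, 92, 93, 93, 93, 94])
spread = find_outliers([21, 92, 92, 93, 93, 93, 95, 150])
print(tight, spread)
```

For the second set the quartiles are 92 and 94, so the fences are 89 and 97, and both 21 and 150 fall outside them.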

Example The table shows the final medal standings for the top 20 countries in the 2008 Summer Olympics. Draw a box plot of the data and identify possible outliers.

Country         Count   Country       Count   Country       Count
U.S.            110     South Korea   31      Canada        18
China           100     Italy         28      Netherlands   16
Russia          72      Ukraine       27      Brazil        15
Great Britain   47      Japan         25      Kenya         14
Australia       46      Cuba          24      Kazakhstan    13
Germany         41      Belarus       19      Jamaica       11
France          40      Spain

Example (continued) Extreme scores: 110, 11 Median: 26 Q1 = 17 Q3 = 43.5
IQR = 43.5 − 17 = 26.5 Outliers are scores greater than 43.5 + 1.5(26.5), or 83.25, or less than 17 − 1.5(26.5), or −22.75. 110 and 100 are the only outliers.

Comparing Sets of Data Box plots are used primarily for large sets of data or for comparing several distributions. The stem and leaf plot is usually a much clearer display for a single distribution. Parallel box plots drawn using the same number line give us the easiest comparison of medians, extreme scores, and the quartiles for the sets of data.

Comparing Sets of Data Although we cannot spot clusters or gaps in box plots as we can with stem and leaf or line plots, we can more easily compare data from different sets. With box plots, we do not need to have sets of data that are approximately the same size, as we did for stem and leaf plots.

Comparing Sets of Data To compare data from two or more sets using their box plots, first study the boxes to see if they are located in approximately the same places. Next, consider the lengths of the boxes to see if the variability of the data is about the same. Also check whether the median, the quartiles, and the extreme values in one set are greater than those in another set.

Comparing Sets of Data From the box plot, we can see that the mean salaries for males have been higher than those for females, since the extreme values, median, and quartiles for the males are greater than those for females. Also, more than 50% of the mean salaries for males are greater than the mean salaries for females over the time period.

Variation: Mean Absolute Deviation, Variance, and Standard Deviation
A measure of spread is needed when data are summarized with a single number, such as the mean or median. The simplest way is to find the range. We can also use the interquartile range, the IQR. The most sophisticated is the standard deviation.

Mean Absolute Deviation
The mean absolute deviation (MAD) makes use of the absolute value to find the distance each data point is away from the mean. Then the mean of those distances is found to give an “average distance from the mean” for each of the points.

Compute the mean absolute deviation (MAD) of n numbers as follows:
1. Measure the distance from the mean by subtracting the mean from each data value.
2. Find the absolute value of each difference.
3. Sum those absolute values (the absolute deviation).
4. Find the mean absolute deviation (MAD) by dividing the sum by the number of scores.
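The MAD procedure translates directly into code. The sketch below is illustrative: the function name and the sample scores are assumptions, not the data from the slide's table.

```python
def mean_absolute_deviation(values):
    """Average distance of the data values from their mean."""
    n = len(values)
    if n == 0:
        raise ValueError("MAD of an empty data set is undefined")
    center = sum(values) / n
    differences = [x - center for x in values]  # subtract the mean from each value
    distances = [abs(d) for d in differences]   # absolute value of each difference
    total = sum(distances)                      # the absolute deviation
    return total / n                            # divide by the number of scores

# Hypothetical scores with mean 5; the distances are 4, 3, 1, 3, 5.
mad = mean_absolute_deviation([1, 2, 4, 8, 10])
print(mad)
```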

The table shows a set of data along with the computation of the mean absolute deviation.

Pictures of the mean absolute deviation for the given set of test scores are shown in the figures.

Variance and Standard Deviation
The variance and the standard deviation are two commonly used measures of dispersion. Like the mean absolute deviation, these measures are based on how far the scores are from the mean.

Compute the variance, v, of n numbers as follows:
1. Find the mean of the numbers.
2. Subtract the mean from each number.
3. Square each difference found in Step 2.
4. Find the sum of the squares in Step 3.
5. Divide the sum in Step 4 by n to obtain the variance, v.
6. Find the square root of v to obtain the standard deviation, s.

The standard deviation, s, of n numbers is the square root of the variance, v.
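The variance and standard deviation steps can be followed one at a time in code. This is a sketch with hypothetical function names and sample data. Because the procedure divides by n, it corresponds to `statistics.pvariance` and `statistics.pstdev` in Python's standard library, not to `statistics.variance` and `statistics.stdev`, which divide by n − 1.

```python
from math import sqrt

def variance(values):
    """Variance found by dividing the sum of squared differences by n."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty data set is undefined")
    mean = sum(values) / n                   # find the mean
    differences = [x - mean for x in values] # subtract the mean from each number
    squares = [d * d for d in differences]   # square each difference
    total = sum(squares)                     # sum the squares
    return total / n                         # divide by n

def standard_deviation(values):
    """Square root of the variance."""
    return sqrt(variance(values))

# Hypothetical data with mean 5; the squared differences sum to 32.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
v = variance(sample)
s = standard_deviation(sample)
print(v, s)
```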

Example Professor Abel gave two group exams. Exam A had grades of 0, 0, 0, 100, 100, 100, and exam B had grades of 50, 50, 50, 50, 50, 50. Find the following for each exam:
a. Mean Exam A: 50; exam B: 50
b. Range Exam A: 100; exam B: 0
c. Mean absolute deviation Exam A: 50; exam B: 0

Example (continued)
d. Standard deviation Exam A: $s = \sqrt{\frac{3(0 - 50)^2 + 3(100 - 50)^2}{6}} = \sqrt{2500} = 50$; exam B: $s = 0$, since every grade equals the mean
e. Median Exam A: 50; exam B: 50
f. Interquartile range Exam A: 100; exam B: 0
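The six statistics for Professor Abel's two exams can be verified with a short script. The `summarize` helper and its dictionary keys are hypothetical names. The quartiles are taken as medians of the lower and upper halves, and `pstdev` is used because the procedure in this section divides by n.

```python
from statistics import mean, median, pstdev

def summarize(grades):
    """Return the six statistics from the example as a dictionary."""
    ordered = sorted(grades)
    n = len(ordered)
    half = n // 2
    center = mean(ordered)
    q1 = median(ordered[:half])          # median of the lower half
    q3 = median(ordered[half + n % 2:])  # median of the upper half
    return {
        "mean": center,
        "range": ordered[-1] - ordered[0],
        "mad": sum(abs(x - center) for x in ordered) / n,
        "std_dev": pstdev(ordered),      # divides by n, as in the steps above
        "median": median(ordered),
        "iqr": q3 - q1,
    }

exam_a = summarize([0, 0, 0, 100, 100, 100])
exam_b = summarize([50, 50, 50, 50, 50, 50])
print(exam_a)
print(exam_b)
```

Both exams share a mean and median of 50, so only the measures of spread distinguish them.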

Normal Distributions The graphs of normal distributions are the bell-shaped curves called normal curves. A normal curve is a smooth, bell-shaped curve that depicts frequency values distributed symmetrically about the mean. The mean, median, and mode all have the same value.

Normal Distributions The curve extends infinitely in both directions and gets closer and closer to the x-axis but never reaches it. The curve is symmetrical about its center point, but not all symmetrical distributions are normal.


Normal Distributions On a normal curve, about 68% of the values lie within 1 standard deviation of the mean, about 95% lie within 2 standard deviations, and about 99.7% lie within 3 standard deviations. The percentages approximate the percent of the total area under the curve.

Example 10 When a standardized test was scored, there was a mean of 500 and a standard deviation of 100. Suppose that 10,000 students took the test and their scores had a bell-shaped distribution, making it possible to use a normal curve to approximate the distribution. a. How many scored between 400 and 600? Since 1 standard deviation on either side of the mean is from 400 to 600, about 68% of the scores fall in this interval. Thus, 0.68(10,000), or 6800, students scored between 400 and 600.

Example 10 (continued) b. How many scored between 300 and 700?
About 95% of 10,000, or 9500, students scored between 300 and 700. c. How many scored between 200 and 800? About 99.7% of 10,000, or 9970, students scored between 200 and 800. d. How many scored above 800? About 0.3% of the scores lie more than 3 standard deviations from the mean, half of them above it, so about 0.15% of 10,000, or 15, students scored above 800.
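The rounded percentages can be compared with areas computed from the normal model itself. The sketch below uses `statistics.NormalDist` from Python's standard library; the helper name `share_between` is an illustrative assumption.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)  # mean 500, standard deviation 100
students = 10_000

def share_between(low, high):
    """Fraction of the area under the normal curve between two scores."""
    return scores.cdf(high) - scores.cdf(low)

within_1 = share_between(400, 600)  # about 0.6827
within_2 = share_between(300, 700)  # about 0.9545
within_3 = share_between(200, 800)  # about 0.9973
above_800 = 1 - scores.cdf(800)     # about 0.00135

for share in (within_1, within_2, within_3, above_800):
    print(round(share * students))
```

The model's counts differ slightly from those obtained with the rounded percentages, which is expected because the percentages are approximations.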

Application of the Normal Curve
Suppose that a group of 200 students asked their teacher to grade “on a curve.” If the mean on the test was 71, with a standard deviation of 7, the graph shows how the grades could be assigned.

Application of the Normal Curve
Based on the normal curve, the table shows the range of grades that the teacher might assign if the grades are rounded.

Percentiles When students take a standardized test such as the ACT or SAT, their scores are often reported in percentiles. A percentile shows a person’s score relative to other scores. Percentiles divide the set of data into 100 equal parts. Deciles are points that divide a distribution into 10 equally spaced sections. The rth percentile is represented by Pr.

Example 10 A standardized test that was distributed along a normal curve had a median (and mean) of 500 and a standard deviation of 100. The 16th percentile, P16, is 400 because 400 is 1 standard deviation below the median: about 34% of the scores lie between 400 and 500, so about 50% − 34% = 16% lie below 400. Find P50 and P84. Since 500 is the median, 50% of the distribution is less than 500, so P50 = 500. Since 600 is 1 standard deviation above the median, about 50% + 34% = 84% of the distribution is less than 600, so P84 = 600.
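Percentiles on a normal curve correspond to values of the cumulative distribution function. The following sketch, which assumes the same mean of 500 and standard deviation of 100, checks the three percentiles with `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

test = NormalDist(mu=500, sigma=100)

below_400 = test.cdf(400)  # fraction of scores under 400, roughly 0.16
below_500 = test.cdf(500)  # exactly half of the distribution
below_600 = test.cdf(600)  # roughly 0.84

# Going the other way: the score at which 84% of the distribution lies below.
score_at_p84 = test.inv_cdf(0.84)  # close to 600

print(below_400, below_500, below_600, score_at_p84)
```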

Example 10 a. Ossie was ranked 25th in a class of 250. What was his percentile rank? There were 250 − 25, or 225, students ranked below Ossie. Hence, 225/250, or 90%, of the class ranked below him. Therefore, Ossie ranked at the 90th percentile.

Example 10 (continued) b. In a class of 50, Cathy has a percentile rank of 60. What is her class standing? Since Cathy’s percentile rank is 60, 60% of the class ranks below her. Since 60% of 50 = 30, 30 students ranked below Cathy. Therefore, Cathy is 20th in her class.
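The two percentile-rank calculations are inverses of each other, which a pair of small functions makes explicit. The function names are hypothetical, and the sketch assumes, as the examples do, that percentile rank is the percent of the class ranked below the student.

```python
def percentile_rank(standing, class_size):
    """Percent of the class ranked below a student in position `standing` (1 = top)."""
    below = class_size - standing
    return 100 * below / class_size

def class_standing(percentile, class_size):
    """Position in the class for a student with the given percentile rank."""
    below = round(percentile / 100 * class_size)  # number of students ranked below
    return class_size - below

ossie = percentile_rank(25, 250)  # 225 of 250 students ranked below
cathy = class_standing(60, 50)    # 30 of 50 students ranked below
print(ossie, cathy)
```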
