Descriptive Statistics: Presenting and Describing Data

Frequency Distribution A table or graph describing the number of observations in each category or class of a data set.

Example: Consider the number of bottles of soda sold in a snack bar during lunch hour on each of 40 days. (The numbers have been arranged in increasing order.)

63 66 67 68 68 70 71 71
71 73 73 74 74 75 75 75
76 76 76 77 78 79 79 79
81 82 82 84 84 84 85 85
85 85 86 86 89 90 92 94

To get a better grasp of this distribution of numbers, we’ll organize them into categories or classes. We’ll look at absolute frequency, relative frequency, cumulative absolute frequency, and cumulative relative frequency.

Notation
[20, 30] denotes all real numbers between 20 and 30, including both 20 and 30.
(20, 30) denotes all real numbers between 20 and 30, including neither 20 nor 30.
[20, 30) denotes all real numbers between 20 and 30, including 20 but not 30.
(20, 30] denotes all real numbers between 20 and 30, including 30 but not 20.
So a square bracket means the endpoint is included, and a round parenthesis means the endpoint is not included.

Absolute Frequency

class       abs. freq.
[60, 65)     1
[65, 70)     4
[70, 75)     8
[75, 80)    11
[80, 85)     6
[85, 90)     7
[90, 95)     3
Total       40

Histogram of Absolute Frequency
[Figure: histogram. Horizontal axis: bottles of soda, 60 to 95 in steps of 5. Vertical axis: absolute frequency, 0 to 12.]

Relative Frequency

class       abs. freq.   rel. freq.
[60, 65)     1           0.025
[65, 70)     4           0.100
[70, 75)     8           0.200
[75, 80)    11           0.275
[80, 85)     6           0.150
[85, 90)     7           0.175
[90, 95)     3           0.075
Total       40           1.000

Relative Frequency
This graph looks the same as the last one, except the numbers on the vertical axis are percentages (in decimal form) instead of integers.
[Figure: histogram. Horizontal axis: bottles of soda, 60 to 95 in steps of 5. Vertical axis: relative frequency, 0.000 to 0.300.]

Frequency Polygon
A line connecting the midpoints of the tops of the histogram bars.
[Figure: frequency polygon drawn over the histogram. Horizontal axis: bottles of soda, 60 to 95. Vertical axis: absolute frequency, 0 to 12.]

Cumulative Absolute Frequency

class       abs. freq.   rel. freq.   cum. abs. freq.
[60, 65)     1           0.025         1
[65, 70)     4           0.100         5
[70, 75)     8           0.200        13
[75, 80)    11           0.275        24
[80, 85)     6           0.150        30
[85, 90)     7           0.175        37
[90, 95)     3           0.075        40
Total       40           1.000

Cumulative Absolute Frequency
Notice that the graph of the cumulative absolute frequency looks like a set of stairs going up from left to right.
[Figure: cumulative histogram. Horizontal axis: bottles of soda, 60 to 95. Vertical axis: cumulative absolute frequency, 0 to 40.]

Cumulative Relative Frequency

class       abs. freq.   rel. freq.   cum. abs. freq.   cum. rel. freq.
[60, 65)     1           0.025         1                0.025
[65, 70)     4           0.100         5                0.125
[70, 75)     8           0.200        13                0.325
[75, 80)    11           0.275        24                0.600
[80, 85)     6           0.150        30                0.750
[85, 90)     7           0.175        37                0.925
[90, 95)     3           0.075        40                1.000
Total       40           1.000
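The four frequency columns can be reproduced with a short script. This is only a sketch: the function name `frequency_table` and its parameters are illustrative choices, not part of the slides, and the classes are assumed to be half-open intervals [lower, upper) of equal width, as in the tables.

```python
# Soda sales for the 40 days in the example, in increasing order.
sales = [63, 66, 67, 68, 68, 70, 71, 71,
         71, 73, 73, 74, 74, 75, 75, 75,
         76, 76, 76, 77, 78, 79, 79, 79,
         81, 82, 82, 84, 84, 84, 85, 85,
         85, 85, 86, 86, 89, 90, 92, 94]


def frequency_table(data, start, stop, width):
    """Return rows (lower, upper, abs, rel, cum_abs, cum_rel) for classes [lower, upper)."""
    n = len(data)
    rows = []
    cum = 0
    for lower in range(start, stop, width):
        upper = lower + width
        # Half-open class: include the lower endpoint, exclude the upper one.
        count = sum(1 for x in data if lower <= x < upper)
        cum += count
        rows.append((lower, upper, count, count / n, cum, cum / n))
    return rows


table = frequency_table(sales, 60, 95, 5)
for lower, upper, a, r, ca, cr in table:
    print(f"[{lower}, {upper})  {a:2d}  {r:.3f}  {ca:2d}  {cr:.3f}")
```

The last row's cumulative absolute frequency equals the number of observations, and its cumulative relative frequency equals 1.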

Cumulative Relative Frequency
Again we have our stairs, but the numbers on the vertical axis are percentages (in decimal form), and the height of the last bar is always 1 (or 100%).
[Figure: cumulative histogram. Horizontal axis: bottles of soda, 60 to 95. Vertical axis: cumulative relative frequency, 0.00 to 1.00.]

Cumulative Relative Frequency Ogive
A line connecting the points at the back of the steps.
[Figure: ogive drawn over the cumulative histogram. Horizontal axis: bottles of soda, 60 to 95. Vertical axis: cumulative relative frequency, 0.00 to 1.00.]

Next we will consider two types of summary measures:
1. Measures of the center of the distribution (also called measures of central tendency)
2. Measures of the spread of the distribution

Measures of the center of the distribution, or central tendency, or typical value, or average

Measures of the Center of the Distribution
Mean or Arithmetic Mean: add up the values of the observations, then divide by the number of observations.
Median: the value for which half of the observations are above that value and half are below it.
Mode: the most common, most frequent, or most probable value.

Determining the location of the median Recall that the median is the value for which half of the observations are above that value & half are below it. So we are looking for the middle value. Suppose there are n numbers in our data set. We arrange them in order from the smallest value to the largest, and give the smallest value rank 1, the second smallest rank 2, and so forth up to the largest value, which has rank n. The rank of the median will be (n+1)/2.

Remember that n is the number of elements in the data set. We have two possible cases. Case 1: n is odd. Case 2: n is even.

Case 1: n is odd. Recall that the rank of the median is (n+1)/2. Example: n = 9. Then (n+1)/2 = (9+1)/2 = 10/2 = 5. So the value of the 5th number is our median.

Case 2: n is even. Recall that the rank of the median is (n+1)/2. Example: n = 10 Then (n+1)/2 = (10+1)/2 = 11/2 = 5.5. So we are looking for the value halfway between the 5th and 6th numbers. So we add the values of the 5th and 6th numbers together and divide by 2. The result is our median.
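The two cases can be sketched in a few lines. The function name `median_by_rank` and the test lists are illustrative assumptions, not taken from the slides; the logic follows the (n+1)/2 rank rule described above.

```python
def median_by_rank(values):
    """Median via the rank rule: the median sits at rank (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (n + 1) / 2
    if n % 2 == 1:
        # Odd n: the rank is a whole number; ranks start at 1, list indices at 0.
        return ordered[int(rank) - 1]
    # Even n: the rank ends in .5, so average the two neighbouring values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


nine = [1, 2, 3, 4, 5, 6, 7, 8, 9]        # n = 9, rank 5
ten = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]     # n = 10, rank 5.5
print(median_by_rank(nine), median_by_rank(ten))
```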

Example 1
Observations: 2, 2, 3, 4, 8, 10, 13
Mean: 6   Median: 4   Mode: 2

Example 2
Observations: -5, 8, 8, 9, 10, 12
Mean: 7   Median: 8.5   Mode: 8

Example 3
Observations: 2, 3, 4, 4, 4, 7
Mean: 4   Median: 4   Mode: 4

Example 4
Observations: 11, 9, 26, 11, 10, 11
To calculate the median, we will want to have the observations in order: 9, 10, 11, 11, 11, 26
Mean: 13   Median: 11   Mode: 11
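The four examples can be checked with Python's standard `statistics` module. This is a verification sketch; the dictionary name `examples` is an illustrative choice.

```python
import statistics

# Observations from Examples 1 to 4.
examples = {
    1: [2, 2, 3, 4, 8, 10, 13],
    2: [-5, 8, 8, 9, 10, 12],
    3: [2, 3, 4, 4, 4, 7],
    4: [11, 9, 26, 11, 10, 11],   # statistics.median sorts internally
}

summary = {}
for label, obs in examples.items():
    summary[label] = (
        statistics.mean(obs),
        statistics.median(obs),
        statistics.mode(obs),
    )
    print(label, summary[label])
```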

Computing the Mean for a Frequency Distribution of a Population

Salary x_i   Freq. f_i
 700          8
 800         23
 900         75
1000         90
1100         43
1200         11
Total       250

We will denote the number of observations in our population as N. In this example, it’s 250.

First we need the sum of all the observations:
(700 + 700 + … + 700) + (800 + 800 + … + 800) + … + (1200 + 1200 + … + 1200)
= (700 • 8) + (800 • 23) + (900 • 75) + (1000 • 90) + (1100 • 43) + (1200 • 11)

Salary x_i   Freq. f_i   x_i f_i
 700          8            5,600
 800         23           18,400
 900         75           67,500
1000         90           90,000
1100         43           47,300
1200         11           13,200
Total       250          242,000

Then to get the mean, we divide that sum by the number of observations. So the mean equals 242,000 / 250 = 968.0.
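The same computation as a brief sketch. The variable names are illustrative; the data are the salary frequencies from the table.

```python
salaries = [700, 800, 900, 1000, 1100, 1200]   # x_i
freqs = [8, 23, 75, 90, 43, 11]                # f_i

N = sum(freqs)                                            # number of observations
total = sum(x * f for x, f in zip(salaries, freqs))       # sum of x_i * f_i
mu = total / N                                            # population mean

print(N, total, mu)
```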

Notation
We denote the mean of a population by the Greek letter mu: μ.

For a simple list of numbers, we computed μ as:
$\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$

If c is the number of categories or classes in our frequency distribution, then we computed μ for a frequency distribution as:
$\mu = \dfrac{\sum_{i=1}^{c} x_i f_i}{N}$

What is the mode of this frequency distribution?

Salary x_i   Freq. f_i
 700          8
 800         23
 900         75
1000         90
1100         43
1200         11
Total       250

The mode is the most frequent or most common value, which in this example is 1000.

What is the median of this frequency distribution?

Remember, the median is the middle value, or the average of the two middle values when there is an even number of observations, as there is here.

Where is the median?
Salary value:  x  x  x  …  x    x    x    x    …  x    x    x
Position:      1  2  3  …  124  125  126  127  …  248  249  250
The middle is between the salaries in the 125th and 126th positions, where there are 125 values below and 125 above. So we need to determine what salaries are in the 125th and 126th positions.

In the $700 category, we have observations 1 through 8.
In the $800 category, we have observations 9 through 31 (= 8+23).
In the $900 category, we have observations 32 through 106 (= 8+23+75).
In the $1000 category, we have observations 107 through 196 (= 8+23+75+90).
So the 125th and 126th observations are in the $1000 category. Averaging the values of the two middle observations, we get (1000+1000)/2 = 1000. So our median is 1000.
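Walking the cumulative frequencies until the target position is reached can be sketched as follows. The helper name `value_at_rank` is a hypothetical choice for illustration.

```python
salaries = [700, 800, 900, 1000, 1100, 1200]
freqs = [8, 23, 75, 90, 43, 11]


def value_at_rank(values, counts, rank):
    """Return the value held by the observation at a 1-based rank."""
    cum = 0
    for x, f in zip(values, counts):
        cum += f               # this category holds ranks up to cum
        if rank <= cum:
            return x
    raise ValueError("rank exceeds the number of observations")


N = sum(freqs)
# N is even, so average the observations at ranks N/2 and N/2 + 1.
median = (value_at_rank(salaries, freqs, N // 2)
          + value_at_rank(salaries, freqs, N // 2 + 1)) / 2
# The mode is the value with the largest frequency.
mode = max(zip(salaries, freqs), key=lambda pair: pair[1])[0]

print(median, mode)
```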

Calculating mean and median for interval data
Suppose we have the following population data.

Interval    frequency f
[0, 15)     10
[15, 30)    10
[30, 45)     5
[45, 60)     5
[60, 75)     5
Total       35

We will compute the mean first. We have 35 observations. We need a representative element from each interval. For that we’ll use the midpoint. Then we continue as we did before to calculate the mean for a frequency distribution: multiply each midpoint by its frequency, add up, and divide by the number of observations.

Interval    frequency f   midpoint x   xf
[0, 15)     10             7.5          75.0
[15, 30)    10            22.5         225.0
[30, 45)     5            37.5         187.5
[45, 60)     5            52.5         262.5
[60, 75)     5            67.5         337.5
Total       35                        1087.5

μ = 1087.5/35 ≈ 31.07
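A sketch of the midpoint method. The frequencies 10, 10, 5, 5, 5 are an assumption inferred from the xf column and the total of 35, since some frequency entries did not survive in the transcript; the variable names are illustrative.

```python
import math

intervals = [(0, 15), (15, 30), (30, 45), (45, 60), (60, 75)]
freqs = [10, 10, 5, 5, 5]   # inferred from xf / midpoint; sums to 35

# Use the midpoint of each interval as its representative value.
midpoints = [(lo + hi) / 2 for lo, hi in intervals]
xf = [x * f for x, f in zip(midpoints, freqs)]

N = sum(freqs)
grouped_mean = sum(xf) / N

print(midpoints, xf, round(grouped_mean, 2))
```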

Now let’s calculate the median.

To calculate the median of interval data, we need to make an assumption. We know the number of observations in each interval, but not exactly what they are. We’re going to assume that the observations are evenly distributed within each interval.

First, we need to figure out which category contains the median.

Interval    frequency f
[0, 15)     10
[15, 30)    10
[30, 45)     5
[45, 60)     5
[60, 75)     5
Total       35

There are 35 observations, so the middle one is the 18th one. (There are 17 observations below the 18th and 17 above it.) The first 10 observations are in the first category. The 11th to the 20th observations are in the second category. So the median must be in the second category.

The formula for calculating the median for interval data looks quite different from what we did before:

$\text{Median} = L_{md} + \dfrac{N/2 - \sum f_p}{f_{md}} \times \text{width}$

L_md is the lower limit of the category containing the median.
N is the population size.
Σf_p is the sum of the frequencies of the categories preceding the category containing the median.
f_md is the frequency of the category containing the median.
width is the width of the interval containing the median.

Let’s go through the parts of the formula, keeping in mind that the median is in the second category, [15, 30):
L_md = 15
N = 35
Σf_p = 10
f_md = 10
width = 15

Now we just assemble the pieces:

$\text{Median} = 15 + \dfrac{35/2 - 10}{10} \times 15 = 15 + 0.75 \times 15 = 26.25$

What does this mean? Remember that the median is the 18th observation. That means it’s the 8th observation of 10 in the second category. So it is closer to the end of that interval than the beginning. What the formula is telling us is that the median is 0.75 or ¾ of the way through the distance of 15 units, in the interval starting at 15.
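The interpolation can be sketched as below. Two assumptions are worth flagging: the formula is taken to be L_md + ((N/2 − Σf_p) / f_md) × width, which is the standard grouped-median form and matches the fraction 0.75 quoted above, and the frequencies 10, 10, 5, 5, 5 are inferred from the earlier xf column. The function name `grouped_median` is illustrative.

```python
intervals = [(0, 15), (15, 30), (30, 45), (45, 60), (60, 75)]
freqs = [10, 10, 5, 5, 5]   # inferred; sums to N = 35


def grouped_median(classes, counts):
    """Interpolated median, assuming observations are evenly spread within each class."""
    half = sum(counts) / 2          # N / 2
    preceding = 0                   # sum of frequencies before the current class
    for (lower, upper), f in zip(classes, counts):
        if preceding + f >= half:
            fraction = (half - preceding) / f   # how far into the class the median lies
            return lower + fraction * (upper - lower)
        preceding += f
    raise ValueError("no observations")


median = grouped_median(intervals, freqs)
print(median)
```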

Measures of dispersion or the spread of the distribution

Measures of Dispersion
Range
Mean Absolute Deviation (MAD)
Mean Squared Deviation (MSD)
Coefficient of Variation (CV)
As we shall see, the first three are measures of absolute dispersion, while the CV is a measure of relative dispersion.

Range: largest value minus smallest value

Example 1
Observations: 1 2 2 2 3 4 4 5 6
The range is 6 - 1 = 5.

Example 2
Observations: 1 1 1 1 1 1 1 1 6
The range is 6 - 1 = 5.
Intuitively, this distribution seems to be less spread out than the distribution in Example 1, but the range doesn’t capture that.
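A short sketch of the point made by the two examples: both lists have the same range even though the second is far more concentrated. The function name `value_range` is an illustrative choice.

```python
def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)


example_1 = [1, 2, 2, 2, 3, 4, 4, 5, 6]
example_2 = [1, 1, 1, 1, 1, 1, 1, 1, 6]

print(value_range(example_1), value_range(example_2))
```

Because the range looks only at the two extreme values, it cannot distinguish these data sets; the measures that follow use every observation.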

Mean Absolute Deviation (MAD)

$\text{MAD} = \dfrac{\sum_{i=1}^{N} |x_i - \mu|}{N}$

This formula is for the MAD for a simple list of numbers. We’ll do the MAD for a frequency distribution shortly.

Example: First we need the mean. Then we compute each deviation from the mean, take absolute values, add them up, and divide by the number of observations.

x       x − μ   |x − μ|
 4      −6       6
 8      −2       2
10       0       0
13       3       3
15       5       5
Total 50        16

μ = 50/5 = 10
MAD = 16/5 = 3.2
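The MAD for this list can be sketched directly from its definition. The function name is illustrative.

```python
import math


def mean_absolute_deviation(values):
    """Population MAD: the average absolute distance from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(abs(x - mu) for x in values) / n


data = [4, 8, 10, 13, 15]
mad = mean_absolute_deviation(data)
print(mad)
```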

Population Variance or Mean Squared Deviation (MSD)

$\sigma^2 = \text{MSD} = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

Example: Recall μ = 10.

x       x − μ   (x − μ)²
 4      −6      36
 8      −2       4
10       0       0
13       3       9
15       5      25
Total           74

σ² = MSD = 74/5 = 14.8

Population standard deviation: σ = √(population variance)

Example: population standard deviation
In the example we just did, the population variance was 14.8. So the standard deviation is √14.8 ≈ 3.847.
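A sketch that computes the population variance by hand and cross-checks it against `statistics.pvariance` and `statistics.pstdev` from the standard library. Variable names are illustrative.

```python
import math
import statistics

data = [4, 8, 10, 13, 15]
N = len(data)
mu = sum(data) / N

# Mean squared deviation: average of the squared distances from the mean.
msd = sum((x - mu) ** 2 for x in data) / N
sd = math.sqrt(msd)

print(msd, round(sd, 3))
print(statistics.pvariance(data), statistics.pstdev(data))
```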

Calculating the MAD, MSD, and Std. Dev. for a Frequency Distribution

x_i   f_i
1     3
2     5
3     2

The total number of observations N is the sum of the frequencies, or 10.

Calculate the population mean μ, then the Mean Absolute Deviation (MAD), then the Mean Squared Deviation (MSD) or population variance σ².

x_i    f_i   x_i f_i   x_i − μ   |x_i − μ|   |x_i − μ| f_i   (x_i − μ)²   (x_i − μ)² f_i
1       3     3        −0.9      0.9         2.7             0.81         2.43
2       5    10         0.1      0.1         0.5             0.01         0.05
3       2     6         1.1      1.1         2.2             1.21         2.42
Total  10    19                              5.4                          4.90

μ = 19/10 = 1.9
MAD = 5.4/10 = 0.54
σ² = MSD = 4.90/10 = 0.49

Last, we calculate the standard deviation.
σ = √0.49 = 0.7
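The frequency-weighted versions can be sketched as follows. The data x = 1, 2, 3 with frequencies 3, 5, 2 are an assumption reconstructed from the column totals (N = 10, Σxf = 19), since the transcript lost some table cells; the variable names are illustrative.

```python
import math

values = [1, 2, 3]   # x_i (reconstructed from the column totals)
freqs = [3, 5, 2]    # f_i

N = sum(freqs)
mu = sum(x * f for x, f in zip(values, freqs)) / N

# Weight each deviation by how many observations share that value.
mad = sum(abs(x - mu) * f for x, f in zip(values, freqs)) / N
msd = sum((x - mu) ** 2 * f for x, f in zip(values, freqs)) / N
sd = math.sqrt(msd)

print(round(mu, 2), round(mad, 2), round(msd, 2), round(sd, 2))
```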

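So the formulae for calculating the MAD and MSD (or population variance) for a frequency distribution are:

$\text{MAD} = \dfrac{\sum_{i=1}^{c} |x_i - \mu| f_i}{N}$

$\sigma^2 = \text{MSD} = \dfrac{\sum_{i=1}^{c} (x_i - \mu)^2 f_i}{N}$

The standard deviation is still just the square root of the variance.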

The formulae we have been using are for populations. If we have samples instead, we have some notational changes and one change in the calculation process.

Notational Changes for Samples instead of Populations
First, the sample size is n instead of N. Next, we denote the sample mean by “Xbar” ($\bar{X}$) instead of μ.

To calculate the sample mean for a simple list of numbers we have:
$\bar{X} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

To calculate the sample mean for a frequency distribution we have:
$\bar{X} = \dfrac{\sum_{i=1}^{c} x_i f_i}{n}$

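Mean Absolute Deviation (MAD) for a sample

MAD for a simple list of numbers:
$\text{MAD} = \dfrac{\sum_{i=1}^{n} |x_i - \bar{X}|}{n}$

MAD for a frequency distribution:
$\text{MAD} = \dfrac{\sum_{i=1}^{c} |x_i - \bar{X}| f_i}{n}$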

Sample Variance The MSD is just for the population variance. We don’t have an MSD for the sample. The calculation is also slightly different for the sample variance than it was for the population variance.

Sample Variance (denoted by s²)

Sample variance for a simple list of numbers:
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$

Sample variance for a frequency distribution:
$s^2 = \dfrac{\sum_{i=1}^{c} (x_i - \bar{X})^2 f_i}{n - 1}$

The only change that is not just notational is that instead of dividing by n, we divide by n-1. The reason for this change is so that the sample variance will be an unbiased estimator of the population variance. We’ll discuss the idea of unbiasedness later in the semester.
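The effect of dividing by n-1 instead of n can be seen in a small sketch. Treating the list 4, 8, 10, 13, 15 as a sample here is an assumption made only for illustration; `statistics.variance` uses the n-1 divisor and `statistics.pvariance` uses n.

```python
import math
import statistics

sample = [4, 8, 10, 13, 15]   # treated as a sample for illustration
n = len(sample)
xbar = sum(sample) / n
squared_devs = sum((x - xbar) ** 2 for x in sample)

sample_var = squared_devs / (n - 1)      # divide by n - 1
population_var = squared_devs / n        # divide by n

print(sample_var, population_var)
print(statistics.variance(sample), statistics.pvariance(sample))
```

The sample variance is always the larger of the two, and the gap shrinks as n grows.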

Coefficient of Variation (CV) Our other measures of the spread of the distribution were measures of absolute dispersion. The CV is calculated relative to the mean, so it is considered a relative measure of dispersion.

Coefficient of Variation (CV)
The CV is simply the standard deviation divided by the mean, multiplied by 100 to put it in percentage terms:

$\text{CV} = \dfrac{\text{standard deviation}}{\text{mean}} \times 100\%$

Example: Suppose you have data for a sample that has a mean of 200 and a standard deviation of 10. What is the coefficient of variation (CV)?

$\text{CV} = \dfrac{10}{200} \times 100\% = 5\%$

This result tells us that the standard deviation is 5% as large as the mean.
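The CV calculation as a one-line function. The function name is an illustrative choice; the inputs are the mean and standard deviation from the example.

```python
def coefficient_of_variation(mean, std_dev):
    """CV in percent: the standard deviation relative to the mean."""
    return std_dev / mean * 100


cv = coefficient_of_variation(mean=200, std_dev=10)
print(cv)   # 5.0
```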

At my web page, there is a summary sheet called “Selected Descriptive Statistics.” It shows some of the different formulae for simple lists of numbers and frequency distributions, for populations and samples. Print it out and look at how the formulae are similar and how they’re different.

Empirical Rule
In most data sets, many of the values tend to cluster near the center of the distribution. In symmetric, bell-shaped distributions,
approximately 68% of the values are within 1 standard deviation of the mean;
approximately 95% of the values are within 2 standard deviations of the mean; and
approximately 99.7% of the values are within 3 standard deviations of the mean.

Thus, values that are more than 3 standard deviations from the mean are very atypical and are often called “outliers.” Determining whether an observation is an outlier is equivalent to determining if its “z-score” is less than -3 or greater than +3. The formula for the z-score is:

$z = \dfrac{x - \text{mean}}{\text{standard deviation}}$

Example
Suppose that a particular sample has a mean of 100 and a standard deviation of 10. Would a value of 120 be considered an outlier?
z = (120 - 100)/10 = 2
Since the z-score is neither less than -3 nor greater than +3, 120 is not an outlier in this sample.

Example
Is 150 an outlier in this sample (with mean 100 and standard deviation 10)?
z = (150 - 100)/10 = 5
Since the z-score is greater than +3, 150 is an outlier in this sample.

Example
Is 60 an outlier in this sample (with mean 100 and standard deviation 10)?
z = (60 - 100)/10 = -4
Since the z-score is less than -3, 60 is an outlier in this sample.
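The three outlier checks can be sketched as follows. The function names and the `threshold` parameter are illustrative; the cutoff of 3 standard deviations comes from the rule stated above.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


def is_outlier(x, mean, std_dev, threshold=3):
    """Flag values whose z-score is below -threshold or above +threshold."""
    return abs(z_score(x, mean, std_dev)) > threshold


results = {x: (z_score(x, 100, 10), is_outlier(x, 100, 10)) for x in (120, 150, 60)}
for x, (z, flag) in results.items():
    print(x, z, flag)
```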

Symmetric versus Skewed Distributions

Symmetric Distribution
For a symmetric distribution, the left and right sides are mirror images of each other. When the distribution has a single peak, the mean, median, and mode are the same.

Positively or Right-Skewed Distribution If the longer tail is to the right, the distribution is positively skewed or skewed to the right. A small number of very large values pulls the mean up, so the mean is larger than the median.

Negatively or Left-Skewed Distribution If the longer tail is to the left, the distribution is negatively skewed or skewed to the left. A small number of very small values pulls the mean down, so the mean is smaller than the median.
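The pull of a few extreme values on the mean can be illustrated with made-up numbers. Both lists below are hypothetical data invented for this sketch, not taken from the slides; the second is a mirror image of the first.

```python
import statistics

# Hypothetical right-skewed data: one very large value in the right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Mirror image of the list above, so the long tail is on the left.
left_skewed = [21 - x for x in right_skewed]

right_mean = statistics.mean(right_skewed)
right_median = statistics.median(right_skewed)
left_mean = statistics.mean(left_skewed)
left_median = statistics.median(left_skewed)

print(right_mean, right_median)   # mean above median
print(left_mean, left_median)     # mean below median
```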