Graphs and Descriptive Stats

Presentation transcript:

Graphs and Descriptive Stats
- Center of a distribution
- Spread of a distribution
- Quartiles
- 5-Number Summary and Boxplot
- Outliers

Learning Objectives
By the end of this lecture, you should be able to:
- Recognize how scales, mislabeled axes, etc. on charts can be misleading
- Describe the two most common statistics used to describe the center of a dataset, and when they should be used
- Describe two common statistics used to describe the spread of a dataset, and when they should be used
- Understand boxplots and the 5-number summary
- Describe what is meant by an outlier, and describe two techniques for identifying outliers
- Describe and apply the 1.5*IQR rule for outliers

Misleading chart through poor choice of scale/axis

Scales matter
A picture is worth a thousand words, BUT there is nothing like hard numbers. How you stretch the axes and choose your scales can give a very different impression of the same data. Look at the scales.

Outliers
This is a very important topic. Outliers are values that seem somehow 'extreme', or well outside the typical range of values in your dataset. How to deal with outliers is a very involved subject, and while it certainly merits much discussion, we will not delve into it too much today.
Your goal for today is to identify outliers. That is, to develop some ability to look at a number and make a reasonably educated decision as to whether or not that value is an outlier. We will discuss two techniques for doing so shortly:
- Examination of a histogram
- Using the "1.5 * IQR" rule

Describing the center and spread of a distribution
A distribution is best described through a combination of visuals (e.g. graphs) and numbers. Two key numeric descriptions are the center (e.g. the mean) and the spread (aka "variation").
Center: statistics for describing the center are the mean, median, and mode.
- Mean: Most of us are familiar with the 'mean' (average). However, we should typically only use the mean if the dataset has no outliers and is not highly skewed.
- Median: A better choice for the center of a distribution that has outliers or is skewed.
- Mode: Will be discussed later.
Spread (variation): statistics for describing the spread are percentiles, quartiles, and standard deviation. We will discuss these shortly.

Measure of center: the mean
The mean, or arithmetic average: to calculate the mean, add all the values, then divide by the number of individuals. It is the "center of mass."
Example (heights of 25 women, in inches): the sum of the heights is 1598.3. Divided by 25 women, this gives a mean of 63.9 inches.
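The calculation on this slide can be sketched in a few lines of Python. The total (1598.3) and the count (25) come from the slide; the variable names and the small `mean` helper are only illustrative, since the individual heights are not listed here.

```python
def mean(values):
    """Arithmetic mean: add all the values, then divide by how many there are."""
    return sum(values) / len(values)

# Figures taken from the slide: 25 heights that sum to 1598.3 inches.
total_height = 1598.3
n_women = 25
mean_height = total_height / n_women
print(round(mean_height, 1))  # 63.9
```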

Another measure of center: the median
The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.
1. Sort the observations by size. Let n = number of observations.
2.a. If n is odd, the median is observation (n+1)/2 down the list. Example: n = 25, (n+1)/2 = 26/2 = 13, Median = 3.4
2.b. If n is even, the median is the mean of the two middle observations. Example: n = 24, n/2 = 12, Median = (3.3 + 3.4)/2 = 3.35
Data: survival years for Disease X
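The two-case rule above can be written as a short function. This is a minimal sketch: the function name is my own, and the two small lists are made-up illustrations, not the Disease X dataset (which is not reproduced in this transcript).

```python
def median(values):
    """Median by the rule on the slide: sort, then take the middle
    observation (n odd) or the mean of the two middle ones (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Observation (n+1)/2 in 1-based counting is index n//2 in 0-based.
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical data, for illustration only.
odd_sample = [2.1, 0.6, 3.4, 4.5, 6.1]   # n = 5, middle value is 3.4
even_sample = [3.3, 3.4, 1.2, 5.0]       # n = 4, middle pair is 3.3 and 3.4
print(median(odd_sample))
print(median(even_sample))
```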

'Resistant' is an important term
We say that the median is 'resistant' to outliers because the presence of one or two outliers does not affect the median dramatically. Conversely, the mean is not resistant to outliers.
Example: Consider a series of incomes (in thousands of dollars) taken from a graduate classroom: 18, 24, 37, 41, 62, 63, 2000
The median income is the middle value in the dataset: $41,000. However, the mean is dramatically higher, about $321,000, since the one individual who made $2 million pulls the mean disproportionately in the high direction. As a result, we end up with a 'center' value that probably does not truly represent the 'average' income of our sample.
In other words:
- The median is resistant to outliers
- The mean is not resistant to outliers
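The income example can be checked with Python's standard `statistics` module. The seven incomes are the ones given on the slide; the comparison with the outlier removed is an extra step I have added for contrast.

```python
from statistics import mean, median

# Incomes in thousands of dollars, as listed on the slide.
incomes = [18, 24, 37, 41, 62, 63, 2000]
# Same data with the single $2 million income removed (added for comparison).
incomes_without_outlier = [x for x in incomes if x != 2000]

print(median(incomes), round(mean(incomes), 1))                  # 41 320.7
print(median(incomes_without_outlier),
      round(mean(incomes_without_outlier), 1))                   # 39.0 40.8
```

Dropping the one extreme income moves the median by only 2 (thousand dollars), while the mean falls from about 321 to about 41.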

Effect of outliers on the mean and median
[Figure panels: without the outliers; with the outliers. Axis: percent of people dying.]
Note the presence of outliers: those two fortunate people who managed to live several years longer than the others. These two somewhat larger values moved the mean up from 3.4 to 4.2. However, the median (the number of years it takes for half the people to die) only went from 3.4 to 3.6.
This is typical behavior for the mean and median. The mean is sensitive to outliers because, when you add all the values up to get the mean, the outliers are weighted disproportionately by their large size. When you compute the median, however, they are just another two points to count; the actual size of those values does not affect things.
Also note that the median is fairly resistant, but not 100% resistant. The median is not sensitive to the size of the outliers; rather, it is sensitive to the number of outliers.
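To illustrate the last point, here is a small sketch that varies first the size and then the number of outliers. It reuses the income figures from the earlier slide as a base; the 20000 and 3000 values are invented purely for the demonstration.

```python
from statistics import mean, median

base = [18, 24, 37, 41, 62, 63]  # incomes (thousands) without any outlier

# Changing the SIZE of a single outlier: the median stays put, the mean does not.
for outlier in (2000, 20000):
    data = base + [outlier]
    print(outlier, median(data), round(mean(data), 1))
# 2000 41 320.7
# 20000 41 2892.1

# Changing the NUMBER of outliers: now the median shifts as well.
two_outliers = base + [2000, 3000]
print(median(two_outliers))  # 51.5
```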

Measures of spread / variation
Most people intuitively 'get' the benefit of knowing the center of a distribution (e.g. the 'average' salary of first-year doctors). However, a piece of information that is sadly neglected, but is EVERY bit as important, is the spread of the data (also known as the variation).
Just as there are different ways of describing the center of a distribution (e.g. mean, median, mode), there are different techniques for describing the spread of a distribution. As with the center, you must know which description of the spread is the best or most accurate tool for the dataset at hand.
Common techniques for describing the variation in a dataset:
- Range: the distance between the lowest and highest values in the dataset. Important, but outliers can give people a highly inaccurate picture (imagine if you looked at the range of salaries).
- Quartiles: the values that divide the sorted data into four groups with equal numbers of observations.
- Standard deviation / variance: one of the most effective means of describing the spread, and a tool that we will come back to constantly throughout this course.
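A quick sketch of how strongly a single outlier can distort the range and the standard deviation. It reuses the income data from the 'resistant' slide; the `value_range` helper is a name I made up, and `statistics.stdev` computes the sample standard deviation.

```python
from statistics import stdev

def value_range(values):
    """Range as a single number: highest value minus lowest value."""
    return max(values) - min(values)

incomes = [18, 24, 37, 41, 62, 63, 2000]   # thousands of dollars
without_outlier = incomes[:-1]             # drop the 2000

print(value_range(incomes))           # 1982
print(value_range(without_outlier))   # 45

# The sample standard deviation is inflated by the outlier as well.
print(stdev(incomes) > stdev(without_outlier))  # True
```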

Percentiles and Quartiles
The xth percentile (e.g. the 38th percentile) is the value below which 'x' percent of the observations fall.
Example: If your height is said to be in the 80th percentile, it means that 80% of the people measured were shorter than you.
Two commonly used percentiles are the first quartile and the third quartile. These refer to the 25th and 75th percentiles respectively.
- Q1 (first quartile): the 25th percentile, i.e. 25% of observations fall below this value.
- Q2 (second quartile): the 50th percentile. In other words, the median!
- Q3 (third quartile): the 75th percentile, i.e. 75% of observations fall below this value.
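The height example can be turned into a tiny calculation. This is only a sketch of the "percent of observations below a value" idea; the `percent_below` helper and the ten heights are hypothetical.

```python
def percent_below(values, x):
    """Percentage of observations that are strictly smaller than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical heights in inches, for illustration only.
heights = [60, 62, 63, 64, 65, 66, 67, 68, 70, 72]
print(percent_below(heights, 69))  # 80.0
```

Someone who is 69 inches tall is taller than 8 of these 10 people, so that height sits at about the 80th percentile of this small group.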

5-Number Summary and Box Plot
Once you have divided your dataset into quartiles, you have one very widely used technique for creating a neat little summary of the data. It is called the '5-Number Summary' and is made up of:
- Lowest number
- First (lower) quartile
- Median (not the mean!)
- Third (upper) quartile
- Highest number
Once you have this summary in hand, you can even 'draw' it using a simple (but very convenient) plot known as a box plot.

Determining the quartiles
Start by sorting the data and finding the median. (This is Q2.)
Then find the middle value between the lowest number and the median, excluding the median itself. This is the first quartile, Q1: the median of the lower half of the sorted data. It is the value in the sample that has 25% of the observations (data points) at or below it.
Then find the middle value between the median and the highest number, again excluding the median. This is the third quartile, Q3: the median of the upper half of the sorted data. It is the value in the sample that has 75% of the data at or below it.
Example: survival time (years), n = 25
Q1 = first quartile = 2.2
Median = "Q2" = 3.4
Q3 = third quartile = 4.35

Determining the Five-Number Summary
The five-number summary is made up of:
- Minimum number
- Q1
- Median (Q2)
- Q3
- Maximum number
For this dataset, the summary is: 0.6, 2.2, 3.4, 4.35, 6.1
Again, the five-number summary is a good tool for summarizing the center and spread of skewed distributions.
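The quartile procedure and the five-number summary can be sketched together as one function. The function name is my own, and because the survival-time dataset is not listed in this transcript, the example uses the seven incomes from the earlier 'resistant' slide. Note that statistical software often uses slightly different quartile conventions, so its results may not match this median-of-halves method exactly.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), taking Q1 and Q3 as the medians
    of the lower and upper halves of the sorted data. When n is odd, the
    overall median is excluded from both halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

incomes = [18, 24, 37, 41, 62, 63, 2000]  # thousands of dollars
print(five_number_summary(incomes))  # (18, 24, 41, 63, 2000)
```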

The boxplot is a graph of the 5-Number Summary
[Figure: boxplot of the five-number summary (min, Q1, M, Q3, max), labeled from bottom to top: smallest = min = 0.6; Q1 = first quartile = 2.2; M = median = 3.4; Q3 = third quartile = 4.35; largest = max = 6.1]

Boxplots for skewed data
Comparing box plots for a normal and a right-skewed distribution: boxplots remain true to the data and clearly depict symmetry or skew. In addition to being very useful for visualizing the spread, boxplots are very useful for comparing two or more datasets.

Interpreting Boxplots
One of the best uses for boxplots is to compare and contrast two or more datasets. For example, let's compare our two diseases (MM and Disease X).
- Start at the bottom: you can die of either disease in the first year.
- The first quarter of individuals with Disease X die within 2 years. Compare that with MM, where 25% die within ONE year.
- The median tells us the point at which half are dead. The two medians are not too different: 3 1/2 vs 2 1/2 years.
- For both diseases, 3/4 of people are dead by about 4.5 years.
- Disease X kills everyone by year 6, while some people with MM hang on a long time.
- The spread around the midpoint is symmetric for Disease X and highly skewed towards larger values for MM.
This boxplot has given us quite a bit of information!

OUTLIERS: Identification of the Outlier
At what point do we typically label a datapoint as an outlier? We will discuss two methods here:
- One way is to look at a chart and see if any values appear to be "off the chart" relative to the large majority of values.
- Another tool is the "1.5 IQR" rule for outliers.
Example on the next slide

Identifying outlier(s) on a histogram
The overall pattern is fairly symmetrical except for 2 states that clearly do not belong to the main trend. Alaska and Florida have an unusual representation of the elderly in their populations: unusually low in Alaska and unusually high in Florida. A large gap in the histogram is suggestive of outliers.
Again, for the time being, we are NOT interested in what to do with outliers, merely in how to identify them. How to handle them can be a surprisingly controversial topic!

Identification of outliers using the 1.5 IQR Rule
1. To start, we need the 5-Number Summary.
2. Determine the distance between Q1 and Q3. This is called the interquartile range, or IQR.
3. Multiply the IQR by 1.5.
4. Determine the distance from the suspicious data point to the nearest quartile (Q1 or Q3).
We call an observation a suspected outlier if it falls more than 1.5 times the size of the interquartile range (IQR) below the first quartile or above the third quartile. This technique is called the "1.5 * IQR rule for outliers."
Example on the next slide

Example of the 1.5 IQR Rule
Here is the 5-number summary for the dataset discussed earlier: 0.6, 2.2, 3.4, 4.35, 6.1 (Q1 = first quartile = 2.2, Q3 = third quartile = 4.35).
Would a value of 7.5 be an outlier? What about 8?
IQR = 4.35 - 2.2 = 2.15
1.5 * IQR = 3.23
For a number to be an outlier on the high side, it must be greater than 4.35 + 3.23 = 7.58. So 7.5 would not be considered an outlier by this criterion. However, 8 would.
Similarly, numbers below -1.03 (i.e. 2.2 - 3.23) would be outliers in the negative direction. However, in this case, such numbers cannot exist since there are no "negative" survival times! Remember to always keep the "real world" in mind!
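The same check can be sketched in code. The quartiles (2.2 and 4.35) and the two candidate values come from the slide; the function names are illustrative. Because the code does not round 1.5 * IQR to 3.23, it gets cutoffs of about -1.025 and 7.575 rather than -1.03 and 7.58, which leads to the same conclusions.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper cutoffs for the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_suspected_outlier(x, q1, q3):
    """True if x lies more than 1.5 * IQR below Q1 or above Q3."""
    low, high = iqr_fences(q1, q3)
    return x < low or x > high

q1, q3 = 2.2, 4.35  # quartiles from the slide
low, high = iqr_fences(q1, q3)
print(round(low, 3), round(high, 3))       # -1.025 7.575
print(is_suspected_outlier(7.5, q1, q3))   # False
print(is_suspected_outlier(8, q1, q3))     # True
```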

Remember that a histogram does not give you ALL of the data. It is merely a summary (albeit a good one!) of the data (the distribution). In order to calculate statistics that use specific numbers (e.g. a 5-number summary), you would need to see the actual dataset. For this example, I will provide you with Q1 and Q3:
Q1: 19.27
Q3: 45.40
IQR = 45.40 - 19.27 = 26.13
1.5 * IQR = 39.2
Any amount more than 45.40 + 39.2 = 84.60 is a suspected outlier.

How to deal with OUTLIERS
Outliers require thought. The first step is to decide whether a data point should indeed be labeled as an outlier. Once you have decided that it is an outlier, the next question is what you want to do with it. There are two options for dealing with outliers: you can include them in your analysis, or you can leave them out.
- Exclude outliers: Suppose you have a datapoint that is extremely high, and you think it was recorded in error. In this case, you would not want to include this value in your calculations, since values like the mean and standard deviation would be thrown off by this bad datapoint. However, if you choose to leave out a datapoint, you MUST include in your paper a discussion of your reasons for doing so.
- Include outliers: The other option, of course, is to include the outlier(s) in your calculations and analysis. In this case, you have to decide which statistics to use (mean vs median, etc.).
Discussion question: Suppose we wanted to determine the average height of DePaul students and we use our class as a sample. However, that particular day, we are being visited by an incoming freshman who just happens to be the tallest person in the world. Would you include this person in your analysis?
I would probably leave this person out of the analysis, since they do not represent the 'typical' DePaul student. However, I MUST report that I did so, and explain my decision.
Example on the next slide