Unit 1 – Descriptive Statistics Throughout the course of these lectures we will work within this same scenario: We are a team of junior climate scientists who have been tasked by our superiors to gather and analyze the yearly temperature data for region CA105 (Tracy, CA). Our first task was to gather daily temperature measures for 15 consecutive days using precisely calibrated monitoring equipment at 1:00pm each day.
Unit 1 – Descriptive Statistics Data Set 1: Temperature (F) at 1:00pm for region CA105 (June 1 – June 15, 2015)
Unit 1 – Descriptive Statistics Lecture Notes – Part 1 Measures of Center: Mean, Median, Mode. Measures of Spread: Range, Interquartile Range, Standard Deviation.
Measures of Center Mean (Average) The mean is the average of the data values. That is, if the total of all the values were divided evenly among the data points, the mean is how much each point would get. x̄ (“x-bar”) is the symbol we use for the mean. To quickly calculate the mean, enter the data set into L1, then press STAT ► CALC ► 1-Var Stats
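As a rough illustration of the "divide the total evenly" idea, here is a minimal Python sketch. The 15 readings are hypothetical stand-ins (the actual Data Set 1 values are not reproduced in these notes), and the variable names are only for illustration.

```python
# Hypothetical stand-in for Data Set 1 (NOT the real readings).
temps = [95, 87, 100, 80, 92, 96, 95, 84, 100, 103, 90, 86, 98, 109, 120]

total = sum(temps)   # add up all of the readings
n = len(temps)       # count how many readings there are
x_bar = total / n    # share the total evenly across the n data points

print(f"sum = {total}, n = {n}, mean = {x_bar:.2f}")
```

This should agree with the x̄ value that 1-Var Stats reports when the same list is entered into L1.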
Measures of Center Median (Middle) The Median is the Middle data point or, in the case of a data set with an even number of data points, the average of the two middle data points. M is the symbol we use for the median. To quickly calculate the Median, enter the data set into L1, then press STAT ►CALC ►1-Var Stats
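The odd and even cases described above can be sketched in Python as follows. The function name and both small data sets are hypothetical and used only for illustration.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of points: there is a single middle point.
        return ordered[mid]
    # Even number of points: average the two middle points.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_set = [95, 87, 100, 80, 92]        # hypothetical, 5 points
even_set = [95, 87, 100, 80, 92, 96]   # hypothetical, 6 points

print(median(odd_set))    # 92
print(median(even_set))   # 93.5
```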
Measures of Center Mode (Most Common) The Mode is the most frequent data point(s). The Mode is unique because there can be more than one in a given data set. The Mode is pretty much useless. There isn’t a shortcut to find the mode; however, you can sort a list, which helps you spot repeated values faster. To sort List 1 Ascending: STAT ► EDIT ► SortA(L1)
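For comparison, a short Python sketch of finding the mode(s) by sorting and counting. The readings are hypothetical stand-ins, not the real Data Set 1; `multimode` is from Python's standard `statistics` module (Python 3.8 or later).

```python
from collections import Counter
from statistics import multimode

# Hypothetical stand-in for Data Set 1 (NOT the real readings).
temps = [95, 87, 100, 80, 92, 96, 95, 84, 100, 103, 90, 86, 98, 109, 120]

# Sorting puts repeated values next to each other, like SortA(L1).
print(sorted(temps))

# Count how often each value appears, then keep every value tied for the top.
counts = Counter(temps)
modes = sorted(multimode(temps))

print(modes)                         # [95, 100]
print(counts[95], counts[100])       # 2 2
```

Note that this list has two modes, which is the "more than one" situation described above.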
Measures of Spread Range (Spread) The Range is the simplest way to measure the spread of a data set: it is the distance between the largest and smallest data values. To quickly calculate the Range, use the 1-Var Stats printout and compute maxX – minX.
Measures of Spread Interquartile Range (IQR) The Interquartile Range is the distance between Quartiles 1 and 3 (IQR = Q3 – Q1). The best way to think of this is that Q1 and Q3 are the “Medians of the Median”: Q1 is the median of the lower half of the data and Q3 is the median of the upper half. These are sometimes easy to find by hand and sometimes a little complicated (for example, with an even number of data points). Use the 1-Var Stats printout as a shortcut.
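A minimal Python sketch of the "median of each half" idea follows. The helper name `quartiles` and the readings are hypothetical; the readings were chosen so that Q1 and Q3 come out to the 87 and 100 used later in these notes. Different tools use slightly different quartile conventions, so treat this as one reasonable convention rather than the definitive one.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # points below the middle position
    upper = ordered[half + (n % 2):]    # skip the middle point when n is odd
    return median(lower), median(ordered), median(upper)

# Hypothetical stand-in for Data Set 1 (NOT the real readings).
temps = [95, 87, 100, 80, 92, 96, 95, 84, 100, 103, 90, 86, 98, 109, 120]

q1, med, q3 = quartiles(temps)
iqr = q3 - q1
print(q1, med, q3, iqr)   # 87 95 100 13
```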
Measures of Spread Standard Deviation (σ “sigma”) The Standard Deviation is the most common measure of spread. Notice that in the 1-Var Stats printout, we read the Standard Deviation from Sx (s), rather than σx (sigma). We will discuss why at a later date.
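For readers who want to see the arithmetic behind the printout, here is a hedged Python sketch. It assumes that s is the sample standard deviation (dividing by n − 1) and that σ is the population version (dividing by n); the readings are hypothetical stand-ins, not the real Data Set 1.

```python
from math import sqrt
from statistics import pstdev, stdev

# Hypothetical stand-in for Data Set 1 (NOT the real readings).
temps = [95, 87, 100, 80, 92, 96, 95, 84, 100, 103, 90, 86, 98, 109, 120]

n = len(temps)
x_bar = sum(temps) / n

# Squared distance of each reading from the mean.
squared_devs = [(x - x_bar) ** 2 for x in temps]

s = sqrt(sum(squared_devs) / (n - 1))   # assumed meaning of s (sample)
sigma = sqrt(sum(squared_devs) / n)     # assumed meaning of sigma (population)

print(f"s = {s:.3f}, sigma = {sigma:.3f}")

# The standard library gives the same two values.
print(f"stdev = {stdev(temps):.3f}, pstdev = {pstdev(temps):.3f}")
```

Because s divides by a smaller number, it is always a little larger than σ for the same data.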
Unit 1 – Descriptive Statistics Lecture Notes – Part 2 Outliers, 1.5 IQR Test, Resistant Measure, Not Resistant
Outliers Outliers are data points which are far enough away from the rest of the data set to be considered abnormal. The test that is typically applied to determine if a data point is an outlier is called the 1.5 IQR Test
1.5 IQR Test To conduct the 1.5 IQR Test, first find the IQR (Interquartile Range). IQR = Q3 – Q1. IQR = 100 – 87 = 13. Next, multiply the IQR by 1.5: 1.5 x 13 = 19.5
1.5 IQR Test cont. Now take that value (19.5) and do this: 1st: Subtract it from Q1: 87 – 19.5 = 67.5. 2nd: Add it to Q3: 100 + 19.5 = 119.5. Any data point that falls within this interval (67.5 to 119.5) is not an outlier. Data points which fall outside of this interval are considered outliers.
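The steps of the 1.5 IQR Test can be sketched in Python as below, using the Q1 = 87 and Q3 = 100 from the worked example. The names `lower_fence`, `upper_fence`, and `is_outlier` are illustrative choices, not terms used in these notes.

```python
q1, q3 = 87, 100          # quartiles from the worked example

iqr = q3 - q1             # 100 - 87 = 13
step = 1.5 * iqr          # 1.5 x 13 = 19.5

lower_fence = q1 - step   # 87 - 19.5 = 67.5
upper_fence = q3 + step   # 100 + 19.5 = 119.5

def is_outlier(x):
    """A point is an outlier if it falls outside the interval."""
    return x < lower_fence or x > upper_fence

print(lower_fence, upper_fence)   # 67.5 119.5
print(is_outlier(120))            # True
print(is_outlier(110))            # False
```

A reading of 120 falls just above the upper end of the interval, so it is flagged; 110 is not.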
Resistant vs. Not Resistant Outliers are important because they can influence the behavior of other statistics. Some Statistical measures are “Resistant” – that is, they are not influenced by an outlier. Some are “Not Resistant” – they are influenced by outliers
Resistant vs. Not Resistant The following statistical measures ARE resistant: Median, IQR. The following statistical measures are NOT resistant: Mean, Range, Standard Deviation.
Resistant vs. Not Resistant The following statistical measures ARE resistant: Median, IQR. The Median and the IQR simply are not impacted by the presence of an outlier. Try changing 120 to a different value, for example, 110, and note that both the Median and IQR remain the same. This is because both values measure the “middleness” of the data set, so changing the extremes has no impact on them.
Resistant vs. Not Resistant The following statistical measures are NOT resistant: Mean, Range, Standard Deviation. All 3 of these values are impacted by the presence of an outlier, but we typically don’t worry much about the Range. The impact on the Mean and Standard Deviation is the most important. Try changing our outlier to 110 to see what happens to both the mean and standard deviation.
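The "try changing 120 to 110" experiment can also be run in Python. This is only a sketch: the 14 base readings are hypothetical stand-ins (not the real Data Set 1), the `summary` helper is an illustrative name, and the quartiles use the median-of-each-half convention.

```python
from statistics import mean, median, stdev

def summary(values):
    """Collect the five measures discussed in these notes."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + (n % 2):])   # skip the middle point when n is odd
    return {
        "median": median(ordered),
        "iqr": q3 - q1,
        "mean": mean(ordered),
        "range": ordered[-1] - ordered[0],
        "stdev": stdev(ordered),
    }

# Hypothetical readings WITHOUT the highest value.
base = [80, 84, 86, 87, 90, 92, 95, 95, 96, 98, 100, 100, 103, 109]

with_120 = summary(base + [120])   # highest reading is 120
with_110 = summary(base + [110])   # highest reading changed to 110

for name in with_120:
    print(f"{name:7s} {with_120[name]:8.2f} {with_110[name]:8.2f}")
```

In the printout, the median and IQR rows are identical in both columns, while the mean, range, and standard deviation rows all shrink when 120 becomes 110.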
Resistant vs. Not Resistant Why does this matter? Outliers cause “skew” in our data set, which will be discussed later. For now, try looking back at the other 3 data sets we have worked with. Do any of those data sets have outliers? Do any have no outliers? What do you notice about the relationship between the Median and the Mean when there is an outlier vs. when there isn’t?
Resistant vs. Not Resistant You should notice that for a data set with no outliers, the Median and Mean are very close together. In a data set with a high outlier, the Mean > Median. In a data set with a low outlier, the Mean < Median. Talk to your neighbor about why this is the case. In either case, what will be the impact of the outlier on standard deviation?
Unit 1 – Descriptive Statistics Lecture Notes – Part 3 1.5 IQR Test Shortcut, Additive Transformations
1.5 IQR Shortcut We’ll learn more about Box and Whisker Plots later but we might as well see them now. Steps: 1.►STAT PLOT 2.Stat Plot 1 ► Turn On ► Type: Modified Box Plot 3.►Zoom ►
1.5 IQR Shortcut Modified Box Plot Now press Trace. The following will be displayed: Min Q1 Med Q3 Max Outlier(s)
Additive Transformation We just got bad news from our project manager – apparently our equipment wasn’t calibrated correctly. After some testing, it was found that all of the temperature readings were 4 degrees too high. To adjust our data set, we simply use the formula: y = x – 4 Where x is the old data and y is the new data.
Additive Transformation y = x – 4 Predict: What will happen to each measure? Center: Mean, Median, Mode. Spread: Range, IQR, Standard Deviation. What will happen to the outliers?
Additive Transformation y = x – 4 Mean = decreases by 4; Median = decreases by 4; Mode = decrease(s) by 4; Range = no change; IQR = no change; Standard Deviation = no change; Outliers = decreases by 4
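These results can be checked with a short Python sketch of y = x – 4. The readings are hypothetical stand-ins, not the real Data Set 1, and `data_range` is an illustrative helper name.

```python
from statistics import mean, median, multimode, stdev

# Hypothetical stand-in for the original (too-high) readings.
old = [95, 87, 100, 80, 92, 96, 95, 84, 100, 103, 90, 86, 98, 109, 120]
new = [x - 4 for x in old]   # y = x - 4

def data_range(values):
    return max(values) - min(values)

# Measures of center all shift down by 4.
print("mean  ", round(mean(old), 2), "->", round(mean(new), 2))
print("median", median(old), "->", median(new))                        # 95 -> 91
print("modes ", sorted(multimode(old)), "->", sorted(multimode(new)))  # [95, 100] -> [91, 96]

# Measures of spread do not change.
print("range ", data_range(old), "->", data_range(new))                # 40 -> 40
print("stdev ", round(stdev(old), 3), "->", round(stdev(new), 3))

# The highest reading moves down by 4 along with everything else.
print("max   ", max(old), "->", max(new))                              # 120 -> 116
```

Sliding every point by the same amount moves the whole data set without stretching it, which is why only the measures of center change.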
Unit 1 – Descriptive Statistics Lecture Notes – Part 4 Multiplicative Transformation
We just got even worse news from our project manager – apparently our equipment was really acting up. After some additional testing, it was found that all of the temperature readings were 10% too high and need to be multiplied by 0.9 to correct for the error. To adjust our data set, we simply use the formula: y = 0.9x
Multiplicative Transformation Predict: What will happen to each measure? Center: Mean, Median, Mode. Spread: Range, IQR, Standard Deviation. What will happen to the outliers?
Multiplicative Transformation Mean = decreases by 10% ► 82.5; Median = decreases by 10% 91 ► 81.9; Mode = decrease(s) by 10% 91 and 96 ► 81.9 and 86.4; Range = decreases by 10% 40 ► 36; IQR = decreases by 10% 13 ► 11.7; Standard Deviation = decreases by 10% ►; Outliers = decreases by 10% 116 ► 104.4
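A similar Python sketch for y = 0.9x is shown below. The readings are hypothetical stand-ins that have already had the 4-degree correction applied; they are not the real data, although they were chosen to share the median (91) and range (40) quoted above. `data_range` is an illustrative helper name.

```python
from statistics import mean, median, stdev

# Hypothetical stand-in readings, already corrected by y = x - 4.
shifted = [91, 83, 96, 76, 88, 92, 91, 80, 96, 99, 86, 82, 94, 105, 116]
scaled = [0.9 * x for x in shifted]   # y = 0.9x

def data_range(values):
    return max(values) - min(values)

measures = {"mean": mean, "median": median, "range": data_range, "stdev": stdev}

# Every measure, center and spread alike, is multiplied by 0.9.
for name, fn in measures.items():
    before, after = fn(shifted), fn(scaled)
    print(f"{name:7s} {before:7.2f} -> {after:7.2f}  (ratio {after / before:.2f})")
```

Unlike the additive case, multiplying stretches or shrinks the distances between points, so the measures of spread change along with the measures of center.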
Unit 1.1 Concept Check Using Flashcards, Notes, Warmups, Homeworks, etc., check with a partner for the remainder of the period that you each understand all of the following concepts.
Center vs. Spread; Mean; Median; Mode; Range; IQR; Standard Deviation; Outlier; Resistant vs. Not Resistant; Outliers’ effect on the Mean, Median, Mode, Range, IQR, Standard Deviation
1.5 IQR Test; Box and Whisker Plot; Additive Transformations (+ Impact on each measure); Multiplicative Transformation (+ Impact on each measure)
Calculator Skills: Using Lists; Unarchiving Lists; Sorting Lists; 1-Var Stats; Stat Plots; Modified Box Plot; Trace; StatZoom; Side by Side Box Plots