Stat 31, Section 1, Last Time Distributions (how are data “spread out”?) Visual Display: Histograms Binwidth is critical Bivariate display: scatterplot Course Organization & Website https://www.unc.edu/%7Emarron/UNCstat31-2005/Stat31sec1Home.html
Exploratory Data Analysis 4 “Time Plots”, i.e. “Time Series”: Idea: when time structure is important, plot variable as a function of time (vertical axis: variable, horizontal axis: time) Often useful to “connect the dots”
Class Time Series Example Monthly Airline Passenger Numbers https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Done.xls Increasing Trend (long term growth, over years) Increasing Variation (appears proportional to trend) “Seasonal Effect” - 12 Month Cycle (Peak in summer, less in winter)
Airline Passengers Example Interesting variation: log transformation Stabilizes variation Since log of product is sum Shows changing variation prop’l to trend Log10 is “most interpretable” (log10(1000) = 3, …) Generally useful trick (there are others)
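The "log of product is sum" point can be checked with a small sketch. The series below is synthetic (it is NOT the airline passenger data): an assumed 2% monthly growth trend multiplied by an assumed 12-month seasonal factor. The helper name yearly_ranges is made up for this illustration.

```python
import math

# Synthetic monthly series (NOT the airline data): an assumed exponential
# trend multiplied by an assumed 12-month seasonal factor.
months = range(48)
trend = [100 * 1.02 ** t for t in months]
season = [1 + 0.2 * math.sin(2 * math.pi * t / 12) for t in months]
series = [tr * s for tr, s in zip(trend, season)]

# log10(trend * season) = log10(trend) + log10(season)
logged = [math.log10(y) for y in series]

def yearly_ranges(values):
    """Max minus min within each consecutive block of 12 months."""
    return [max(values[i:i + 12]) - min(values[i:i + 12])
            for i in range(0, len(values), 12)]

raw_ranges = yearly_ranges(series)   # spread within each year, raw scale
log_ranges = yearly_ranges(logged)   # spread within each year, log10 scale

print("raw:", [round(r, 1) for r in raw_ranges])
print("log:", [round(r, 4) for r in log_ranges])
```

On the raw scale the within-year spread grows along with the trend. On the log10 scale it is the same every year, because the multiplicative seasonal factor has become an additive one.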
Airline Passengers Example A look under the hood https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Raw.xls Use Chart Wizard Chart Type: Line (or could do XY) Use subtype for points & lines Use menu for first log10 Although could just type it in Drag down to repeat for whole column
Time Series HW HW: 1.36, 1.37 Use EXCEL
Exploratory Data Analysis 5 Numerical Summaries of Quant. Variables: Idea: Summarize distributional information (“center”, “spread”, “skewed”) In Text, Sec. 1.2 for data $x_1, x_2, \ldots, x_n$ (subscripts allow “indexing numbers” in list)
Numerical Summaries “Centers” (note there are several) “Mean” = Average = $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$ Greek letter “Sigma” ($\Sigma$), for “sum” In EXCEL, use “AVERAGE” function
Numerical Summaries of Center “Median” = Value in middle (of sorted list) Unsorted E.g: 3, 1, 27, 2, 0 (27 “in middle”? no) Sorted E.g: 0, 1, 2, 3, 27 (2 better “middle”!) EXCEL: use function “MEDIAN”
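A minimal sketch of the two "centers" for the five values in the example above (3, 1, 27, 2, 0), using Python's standard statistics module in place of the EXCEL functions AVERAGE and MEDIAN:

```python
import statistics

data = [3, 1, 27, 2, 0]           # the unsorted example list

mean = statistics.mean(data)      # (3 + 1 + 27 + 2 + 0) / 5
median = statistics.median(data)  # middle value of the sorted list

print(sorted(data))               # [0, 1, 2, 3, 27]
print(mean, median)               # 6.6 2
```

The median does not depend on the order in which the values are entered, since the function works from the sorted list.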
Difference Betw’n Mean & Median Symmetric Distribution: Essentially no difference Right Skewed: median $M$ splits the area 50% / 50%, mean $\bar{x}$ is bigger since it “feels tails more strongly”
Difference Betw’n Mean & Median Outliers (unusual values): Nice Web Example: http://www.stat.sc.edu/~west/applets/box.html Mean feels outliers much more strongly Leaves “range of most of data” Good notion of “center”? (perhaps not) Median affected very minimally Robustness Terminology: Median is “resistant to the effect of outliers”
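The resistance of the median can be seen in a small sketch. The numbers are made up for illustration: four fixed values plus one value that is moved farther and farther out.

```python
import statistics

base = [0, 1, 2, 3]               # illustrative values, not course data

results = []
for outlier in (4, 27, 270):      # push the largest value farther out
    data = base + [outlier]
    results.append((outlier,
                    statistics.mean(data),
                    statistics.median(data)))

for outlier, mean, median in results:
    print(outlier, mean, median)
```

The mean moves from 2 to 6.6 to 55.2 and leaves the "range of most of data", while the median stays at 2 throughout.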
Difference Betw’n Mean & Median A more flexible web example: http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html Get various dist’ns, by manipulating bar heights See Mean, Median and more Similar for symmetric distributions Very different when skewed “Big Gap”, can make median jump a lot But mean is less sensitive (more “continuous”)
Numerical Centerpoint HW HW: 1.49 a (but make histograms), b Use EXCEL
Numerical Summaries (cont.) “Spreads” (again there are several) 1. Range = biggest - smallest Problems: Feels only “outliers” Not “bulk of data” Very non-resistant to outliers
Numerical Summaries of Spread Variance = $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ = “average squared distance to $\bar{x}$” EXCEL: VAR Drawback: units are wrong, e.g. for $x_i$ in feet, $s^2$ is in square feet
Numerical Summaries of Spread Standard Deviation = $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$ EXCEL: STDEV Scale is right But not resistant to outliers Will use quite a lot later (for reasons described later)
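A worked sketch of the variance and standard deviation, computed step by step and then checked against Python's statistics.variance and statistics.stdev. The sketch assumes the sample versions with divisor n - 1, which is what EXCEL's VAR and STDEV use. The data are the five values from the median example (3, 1, 27, 2, 0).

```python
import math
import statistics

data = [3, 1, 27, 2, 0]
n = len(data)
xbar = sum(data) / n                              # mean = 6.6

# Sum of squared distances to the mean, divided by n - 1 (assumed divisor).
squared_distances = [(x - xbar) ** 2 for x in data]
variance = sum(squared_distances) / (n - 1)       # in squared units
std_dev = math.sqrt(variance)                     # back in original units

print(round(variance, 2), round(std_dev, 2))      # 131.3 11.46

# The library functions use the same n - 1 divisor.
lib_variance = statistics.variance(data)
lib_std_dev = statistics.stdev(data)
```

The single large value 27 contributes most of the sum of squares, which illustrates why neither measure is resistant to outliers.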
Interactive View of S. D. Revisit flexible web example: http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html Note SD range centered at mean Can put SD “right near middle” (densely packed data) Can put SD at “edges of data” (U shaped data) Can put SD “outside of data” (big spike + outlier) But generally “sensible measure of spread”
Variance – S. D. HW HW: for both data sets in 1.49, find the: Standard Deviation (26.4, 32.9) Use EXCEL
Numerical Summaries of Spread Interquartile Range = IQR Based on “quartiles”, Q1 and Q3 (idea: shows where are 25% & 75% “through the data”) Quartiles cut the data into four parts of 25% each: Q1, Q2 = median, Q3 IQR = Q3 – Q1
Quartiles Example Revisit flexible web example: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls Right skewness gives: Median < Mean (mean “feels farther points more strongly”) Q1 near median Q3 quite far (makes sense from histogram)
Quartiles Example A look under the hood: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Raw.xls Can compute as separate functions for each Or use: Tools > Data Analysis > Descriptive Stats Which gives many other measures as well Use “k-th largest & smallest” to get quartiles
5 Number Summary 1. Minimum 2. Q1 - 1st Quartile 3. Median 4. Q3 - 3rd Quartile 5. Maximum Summarize Information About: Center - from 3 Spread - from 2 & 4 (maybe 1 & 5) Skewness - from 2, 3 & 4 Outliers - from 1 & 5
5 Number Summary How to Compute? https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls EXCEL function QUARTILE “One stop shopping” IQR seems to need explicit calculation
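A minimal sketch of a 5 number summary outside EXCEL, using statistics.quantiles from Python's standard library. The data values and the helper name five_number_summary are made up for this illustration. The "inclusive" interpolation method is one common convention for quartiles; the textbook's hand rule and other software may give slightly different Q1 and Q3 on some data sets.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' method."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

# Illustrative right-skewed data (hypothetical, not the stamps data).
data = [2, 4, 4, 5, 7, 9, 10, 12, 30]

summary = five_number_summary(data)
iqr = summary[3] - summary[1]     # IQR = Q3 - Q1, computed explicitly

print(summary)                    # (2, 4.0, 7.0, 10.0, 30)
print(iqr)                        # 6.0
```

As in EXCEL, the IQR is not returned directly and has to be computed as Q3 minus Q1.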
Rule for Defining “Outliers” Caution: There are many of these Textbook version: Above Q3 + 1.5 * IQR Below Q1 – 1.5 * IQR For stamps data: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls No outliers at “low end” Some at “high end”
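The 1.5 * IQR rule can be sketched as follows. The data are hypothetical (not the stamps data), and the quartiles come from statistics.quantiles with the "inclusive" method, which is an assumption about the quartile convention; a different convention can move the cutoffs slightly.

```python
import statistics

# Hypothetical data with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 50]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr        # values below this are flagged
high_fence = q3 + 1.5 * iqr       # values above this are flagged

outliers = [x for x in data if x < low_fence or x > high_fence]

print(low_fence, high_fence)      # -3.0 13.0
print(outliers)                   # [50]
```

Here nothing is flagged at the low end and one value is flagged at the high end, the same pattern the slide reports for the stamps data.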
Box Plot Additional Visual Display Device Again legacy from pencil & paper days Not supported in EXCEL We will skip
5 Number Sum. & Outliers HW HW: 1.49 c, d; 1.46, and add: (d) How much does the mean change if you omit Montana and Wyoming?