Published by Colin Caldwell. Modified over 9 years ago.
1
Week 2 September 8-12 Five Mini-Lectures QMM 510 Fall 2014
2
4-2 Describing Data Numerically (ML 2.1) Chapter 4 Contents: 4.1 Numerical Description; 4.2 Measures of Center; 4.3 Measures of Variability; 4.4 Standardized Data; 4.5 Percentiles, Quartiles, and Box Plots; 4.6 Correlation and Covariance; 4.7 Grouped Data; 4.8 Skewness and Kurtosis. So many topics, so little time …
3
4-3 Chapter 4 Center, Variability, Shape. Three key characteristics of numerical data: center, variability, and shape.
4
4-4 Chapter 4 Visual Description
5
4-5 Chapter 4 Measures of Center: Mean. A familiar measure of center. Excel function: =AVERAGE(Data), where Data is an array of data values.
6
4-6 Chapter 4 Measures of Center: Median. The median (M) is the 50th percentile or midpoint of the sorted sample data. M separates the upper and lower halves of the sorted observations. If n is odd, the median is the middle observation in the data array. If n is even, the median is the average of the middle two observations in the data array.
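The odd/even rule for the median can be sketched in a few lines of Python. This is only an illustration; the function name and the sample values are hypothetical and do not come from the lecture.

```python
def median(values):
    """Median by the odd/even rule: middle value, or average of the middle two."""
    data = sorted(values)          # the rule applies to the sorted data array
    n = len(data)
    mid = n // 2
    if n % 2 == 1:                 # odd n: the middle observation
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2   # even n: average of the middle two

odd_sample = [7, 3, 9, 1, 5]       # hypothetical values, n = 5
even_sample = [7, 3, 9, 1, 5, 11]  # hypothetical values, n = 6
print(median(odd_sample))          # middle of 1 3 5 7 9
print(median(even_sample))         # average of 5 and 7
```

Excel's =MEDIAN(Data) applies the same rule without requiring the data to be sorted first.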
7
4-7 Chapter 4 Measures of Center: Mode. The most frequently occurring data value. Familiar and easy to understand, but data may have multiple modes or no mode. Most useful for discrete or categorical data with only a few values. Rarely useful for continuous data or data with a wide range. Example: Revenue growth in 32 bio-tech companies last year. Caution: In decimal data, some data values may occur more than once, but this is likely due to chance (not central tendency). Excel’s =MODE(Data) returns only the first mode (1.71 in this example).
8
4-8 Compare mean and median or look at the histogram to determine degree of skewness. Figure 4.10 shows prototype population shapes showing varying degrees of skewness. Chapter 4 Measures of Center
9
4-9 Chapter 4 Measures of Center: Geometric Mean. The geometric mean (G) is a multiplicative average. In Excel: =GEOMEAN(Data) or =(2*3*7*9*10*12)^(1/6). Growth Rates: A variation on the geometric mean used to find the average growth rate for a time series.
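As a rough check of the Excel expression =(2*3*7*9*10*12)^(1/6), the same calculation can be written in Python. The six values are the ones in the slide's formula; the variable names are only illustrative.

```python
import math

values = [2, 3, 7, 9, 10, 12]        # the six values in the slide's Excel formula
product = math.prod(values)           # 2*3*7*9*10*12
g = product ** (1 / len(values))      # nth root of the product

print(product)                        # the product inside the parentheses
print(round(g, 3))                    # geometric mean G
```

Because G is a multiplicative average, it always falls between the smallest and largest value when all values are positive.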
10
4-10 Chapter 4 Measures of Center: Growth Rates. For example, from 2006 to 2010, JetBlue Airlines revenues are: [revenue table not reproduced]. The average growth rate: [formula not reproduced], or 12.5% per year.
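The formula image did not survive in this transcript. A commonly used form of the average growth rate, assumed here rather than taken from the slide, uses only the first and last values of a series of n observations:

```latex
\[
GR = \left( \frac{x_n}{x_1} \right)^{\frac{1}{n-1}} - 1
\]
```

With five annual figures (2006 through 2010), n = 5, so the ratio of the last to the first revenue is raised to the power 1/4.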
11
4-11 Chapter 4 Measures of Center: Midrange. The midrange is the point halfway between the lowest and highest values of X. Easy to use but sensitive to extreme data values. For the J.D. Power quality data, the midrange (126.5) is higher than the mean (114.70) or median (113).
12
4-12 Chapter 4 Measures of Center: Trimmed Mean. To calculate the trimmed mean, first remove the highest and lowest k percent of the observations. For example, for the n = 33 P/E ratios, we want a 5 percent trimmed mean (i.e., k = 0.05). To determine how many observations to trim, multiply k by n, which is 0.05 × 33 = 1.65, or 2 observations. So, we would remove the two smallest and two largest observations before averaging the remaining values.
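The trimming steps can be sketched as follows. The function name and the sample are hypothetical (this is not the P/E data), and rounding k × n to the nearest whole number is an assumption based on the slide's "1.65 or 2 observations".

```python
def trimmed_mean(values, k):
    """Mean after dropping round(k * n) observations from each end of the sorted data."""
    data = sorted(values)
    n = len(data)
    trim = round(k * n)            # e.g. 0.05 * 33 = 1.65 -> 2 (assumed rounding rule)
    kept = data[trim:n - trim]     # drop the `trim` smallest and `trim` largest
    return sum(kept) / len(kept)

# Hypothetical sample of n = 33 values with two extreme highs.
sample = list(range(1, 32)) + [500, 900]
print(round(0.05 * len(sample)))       # observations trimmed from each end
print(sum(sample) / len(sample))       # ordinary mean, pulled up by 500 and 900
print(trimmed_mean(sample, 0.05))      # mean of the remaining 29 values
```

Spreadsheet functions may count the trimmed observations differently, so results from Excel's trimmed-mean function need not match this sketch exactly.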
13
4-13 Chapter 4 Measures of Center: Trimmed Mean. Here is a summary of all the measures of central tendency for the J.D. Power data, along with Excel functions. The trimmed mean mitigates the effects of very high values.
14
4-14 Chapter 4 Measures of Variability. Variability is the “spread” of data points about the center of the distribution in a sample.
15
4-15 Chapter 4 Population variance Population standard deviation Measures of Variability
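The formula images for this slide are missing from the transcript. The standard textbook definitions, given here as an assumption about what the slide showed, are:

```latex
\[
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N},
\qquad
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
\]
```

The sample versions replace μ with the sample mean, N with n, and divide by n − 1 instead of N.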
16
4-16 Chapter 4 Measures of Variability
17
4-17 Chapter 4 Measures of Variability: Coefficient of Variation. Useful for comparing variables measured in different units or with different means. A unit-free measure of dispersion. Expressed as a percent of the mean. Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.
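A minimal sketch of the coefficient of variation, assuming the usual definition CV = 100 × s / mean with s the sample standard deviation (the slide's own formula image is not in this transcript). The data are hypothetical and simply show the same measurements in two different units.

```python
import statistics

def coefficient_of_variation(values):
    """CV in percent: 100 * s / mean. Raises if the mean is zero or negative."""
    mean = statistics.mean(values)
    if mean <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * statistics.stdev(values) / mean   # stdev uses the n - 1 divisor

# Hypothetical data: identical heights recorded in centimeters and in meters.
heights_cm = [160, 170, 180]
heights_m = [1.60, 1.70, 1.80]
print(coefficient_of_variation(heights_cm))
print(coefficient_of_variation(heights_m))
```

The two results agree (up to floating-point noise) because the units cancel, which is what makes the CV unit-free.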
18
4-18 Chapter 4 Example: Class scores on 16-point quiz on first day of class and after students had an opportunity to review the material. Caution: Only appropriate for nonnegative data. CV is undefined if the mean is zero or negative (this could happen, for example, if stocks in a portfolio had negative rates of return). Measures of Variability
19
4-19 Standardized Data (ML 2.2) Chapter 4 Topics: sorting, standardizing, z-scores; normal distribution as a benchmark; Empirical Rule (MegaStat); outliers and unusual observations; Excel functions (Appendix J); examples: birth weight, voting; using MegaStat and Minitab.
20
4-20 Chapter 4 The Empirical Rule. The normal distribution is symmetric and is also known as the bell-shaped curve. The Empirical Rule states that for data from a normal distribution, we expect the interval μ ± kσ to contain a known percentage of observed data: k = 1: 68.26% will lie within μ ± 1σ; k = 2: 95.44% will lie within μ ± 2σ; k = 3: 99.73% will lie within μ ± 3σ.
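The Empirical Rule percentages can be reproduced from the normal distribution itself. The sketch below uses the identity P(|Z| < k) = erf(k / √2); the function name is illustrative. Computed this way, the k = 1 and k = 2 values round to 68.27% and 95.45%, so the slide's 68.26% and 95.44% appear to come from doubling four-decimal table entries.

```python
import math

def normal_within(k):
    """Probability that a normal observation lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{100 * normal_within(k):.2f}%")
```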
21
4-21 Chapter 4 Standardized Data: The Empirical Rule. Note: No upper bound is given. Data values outside μ ± 3σ are rare.
22
4-22 Chapter 4 Standardized Data. A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean. A negative z value means the observation is to the left of the mean. A positive z value means the observation is to the right of the mean. Standardization formula for a population: z = (x − μ)/σ. Standardization formula for a sample (for n > 30): z = (x − x̄)/s.
23
4-23 Chapter 4 Standardized Data
24
4-24 Chapter 4 Standardized Data Example: Birth Weights (n = 1429), measured in ounces. 5 pound baby’s z-score: z = (80 − 116.14)/21.96 = −1.65. 9 pound baby’s z-score: z = (144 − 116.14)/21.96 = 1.27. 11 pound baby’s z-score: z = (176 − 116.14)/21.96 = 2.73. The distribution resembles a normal except for the low tail (a few extremely tiny babies). Source: Birth records from the North Carolina State Center for Health and Environmental Statistics and the Institute for Research in Social Science at University of North Carolina at Chapel Hill.
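The three z-scores can be checked with a short script. The mean (116.14 oz) and standard deviation (21.96 oz) are the slide's values; the function and constant names are illustrative. Note that 144 oz corresponds to 9 lb at 16 oz per pound.

```python
OUNCES_PER_POUND = 16
MEAN_OZ, SD_OZ = 116.14, 21.96     # birth weight mean and standard deviation from the slide

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

for ounces in (80, 144, 176):
    z = z_score(ounces, MEAN_OZ, SD_OZ)
    print(ounces / OUNCES_PER_POUND, "lb ->", round(z, 2))
```

Excel's =STANDARDIZE(x, mean, sd) performs the same subtraction and division.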
25
4-25 Chapter 4 Standardized Data Example: Voting in 2004 Presidential Election. Only two states stand out as unusual. Note: Sorting the data values allows you to see the extremes. Values within μ ± 1σ are not less interesting. Use Excel’s function =STANDARDIZE(x, μ, σ).
26
4-26 Chapter 4 Excel: Voting percent in 50 states. Note: In Excel’s Descriptive Statistics, you can’t choose the statistics displayed.
27
4-27 Chapter 4 MegaStat: Voting percent in 50 states. Note: You can choose the statistics displayed (e.g., Empirical Rule).
28
4-28 Chapter 4 Appendix J: Excel Functions
29
4-29 Chapter 4 Appendix J: Excel Functions
30
4-30 Quantiles (ML 2.3) Chapter 4 Topics: percentiles, quartiles, boxplots; fences, another view of outliers; examples: birth weight, city MPG.
31
4-31 Chapter 4 Percentiles, Quartiles, and Box-Plots: Percentiles. Percentiles are data that have been divided into 100 groups. For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you. Deciles are data that have been divided into 10 groups. Quintiles are data that have been divided into 5 groups. Quartiles are data that have been divided into 4 groups.
32
4-32 Chapter 4 Percentiles, Quartiles, and Box-Plots: Percentiles. Percentiles may be used to establish benchmarks for comparison purposes (e.g., health care, manufacturing, and banking industries use 5th, 25th, 50th, 75th and 90th percentiles). Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios. Percentiles can be used in employee merit evaluation and salary benchmarking.
33
4-33 Chapter 4 Percentiles, Quartiles, and Box-Plots: Quartiles. Quartiles are scale points that divide the sorted data into four groups of approximately equal size. The three values that separate the four groups are called Q1, Q2, and Q3.
34
4-34 Chapter 4 Percentiles, Quartiles, and Box-Plots: Quartiles. The second quartile Q2 is the median, a measure of central tendency.
35
4-35 Chapter 4 Percentiles, Quartiles, and Box-Plots: Method of Medians. For small data sets, find quartiles using the method of medians: Step 1: Sort the observations. Step 2: Find the median Q2. Step 3: Find the median of the data values that lie below Q2. Step 4: Find the median of the data values that lie above Q2.
36
4-36 Chapter 4 Percentiles, Quartiles, and Box-Plots: Quartiles, the Method of Medians. The first quartile Q1 is the median of the data values below Q2. The third quartile Q3 is the median of the data values above Q2. For the first half of the data, 50% lie above and 50% lie below Q1. For the second half of the data, 50% lie above and 50% lie below Q3.
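The four steps of the method of medians can be sketched as below. The function name and sample are hypothetical. Splitting the sorted data by position (and leaving out the middle observation when n is odd) is an assumption about how "below Q2" and "above Q2" are handled; it can differ from a split by value when the data contain ties at the median.

```python
import statistics

def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians. Needs at least 2 observations."""
    data = sorted(values)                  # Step 1: sort the observations
    n = len(data)
    q2 = statistics.median(data)           # Step 2: the median Q2
    half = n // 2
    lower = data[:half]                    # observations below Q2 (by position)
    upper = data[half + 1:] if n % 2 else data[half:]   # observations above Q2
    q1 = statistics.median(lower)          # Step 3: median of the lower half
    q3 = statistics.median(upper)          # Step 4: median of the upper half
    return q1, q2, q3

sample = [12, 5, 7, 3, 9, 15, 10, 1]       # hypothetical values, n = 8
print(quartiles_by_medians(sample))
```

Other software (including Excel's quartile functions) interpolates between observations, so its quartiles may differ slightly from this method for small samples.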
37
4-37 Chapter 4 Percentiles, Quartiles, and Box-Plots: Method of Medians. Example:
38
4-38 Chapter 4 Percentiles, Quartiles, and Box-Plots: Box Plots. A useful tool of exploratory data analysis (EDA). Also called a box-and-whisker plot. Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax. For the previous P/E ratios example: Xmin = 7, Q1 = 27, Q2 = 35.5, Q3 = 40.5, Xmax = 49.
39
4-39 The box plot is displayed visually, like this. Chapter 4 Box Plots Percentiles, Quartiles, and Box-Plots
40
4-40 Chapter 4 Box Plots Percentiles, Quartiles, and Box-Plots
41
4-41 Chapter 4 Percentiles, Quartiles, and Box-Plots: Box Plots: Midhinge. The midhinge is the average of the first and third quartiles: Midhinge = (Q1 + Q3)/2. The name midhinge derives from the idea that, if the “box” were folded in half, it would resemble a “hinge”.
42
4-42 Chapter 4 Percentiles, Quartiles, and Box-Plots: Box Plots: Fences and Unusual Data Values. Use quartiles to detect unusual data points. The cutoff points are called fences and can be found using the following formulas: [formulas not reproduced]. Values outside the inner fences are unusual, while those outside the outer fences are outliers.
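The fence formulas are missing from this transcript. The sketch below assumes the conventional definitions (inner fences at 1.5 times the interquartile range beyond Q1 and Q3, outer fences at 3.0 times), which may not be exactly what the slide showed. The function names are illustrative; Q1 = 27 and Q3 = 40.5 are the P/E quartiles from the box plot slide.

```python
def fences(q1, q3):
    """Inner and outer fences, assuming the conventional 1.5 and 3.0 IQR multipliers."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

def classify(x, inner, outer):
    """Label a value using the slide's wording: unusual beyond inner, outlier beyond outer."""
    if x < outer[0] or x > outer[1]:
        return "outlier"
    if x < inner[0] or x > inner[1]:
        return "unusual"
    return "typical"

inner, outer = fences(27, 40.5)        # P/E ratios: Q1 = 27, Q3 = 40.5
print(inner, outer)
for x in (7, 49):                      # Xmin and Xmax from the five-number summary
    print(x, classify(x, inner, outer))
```

Under these assumed multipliers, neither extreme of the P/E sample falls outside the inner fences.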
43
4-43 Chapter 4 Box-Plots with Fences Example: Birth Weights (n = 1429). Note: The middle 50% of birth weights lie within a small range (105 to 130 oz, or about 6.56 lb to 8.13 lb). But there are extremes on the low end. Source: Birth records from the North Carolina State Center for Health and Environmental Statistics and the Institute for Research in Social Science at University of North Carolina at Chapel Hill.
44
4-44 Chapter 4 Box-Plots with Fences: Fences Visualized. Example: Interpretation: There are three outliers (beyond the inner upper fence). One is on the border of the upper outer fence, so is almost an extreme outlier. Lower fences are not displayed since they are irrelevant for this sample.
45
4-45 Chapter 4 Box-Plots with Fences Example: Fences and Unusual Data Values. Interpretation: Based on the fences, there is only one outlier and no extreme outliers. Lower fences are not displayed since they are not needed for this sample.
46
4-46 Correlation, Grouped Data, Shape (ML 2.4) Chapter 4 Topics: scatter plots; correlation coefficient; covariance (population, sample); mean from grouped data; skewness, kurtosis (Excel).
47
4-47 Chapter 4 Correlation and Covariance: Correlation Coefficient. The sample correlation coefficient is a statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y. Note: -1 ≤ r ≤ +1, where r = -1 is perfect negative correlation and r = +1 is perfect positive correlation.
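The slide's formula for r is not in this transcript. The sketch below uses the standard Pearson formula (sum of cross-products divided by the square root of the product of the sums of squares), which is assumed to be what the course intends. The function name and the paired data are hypothetical.

```python
import math

def sample_correlation(xs, ys):
    """Pearson's sample correlation coefficient r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # cross-products
    sxx = sum((x - mean_x) ** 2 for x in xs)                        # sum of squares, X
    syy = sum((y - mean_y) ** 2 for y in ys)                        # sum of squares, Y
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                               # hypothetical paired data
print(sample_correlation(x, [2, 4, 6, 8, 10]))    # points on a rising straight line
print(sample_correlation(x, [10, 8, 6, 4, 2]))    # points on a falling straight line
```

In Excel, =CORREL(XData, YData) returns the same statistic.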
48
4-48 Illustration of Correlation Coefficients Chapter 4 Correlation and Covariance
49
4-49 Chapter 4 Correlation and Covariance: Correlation Coefficient: Examples. The sample correlation coefficient describes the degree of linearity between paired observations on two quantitative variables X and Y. Note: -1 ≤ r ≤ +1. Example 1: X = car weight (lbs), Y = city MPG. Example 2: X = gestation (months), Y = birth weight (oz).
50
4-50 Chapter 4 Correlation and Covariance: Correlation Coefficient: Example. The sample correlation coefficient describes the degree of linearity between paired observations on two quantitative variables X and Y. Note: -1 ≤ r ≤ +1.
51
4-51 Chapter 4 Correlation and Covariance: Covariance. The covariance of two random variables X and Y (denoted σXY) measures the degree to which the values of X and Y change together. Caution: The covariance is not easy to interpret because its units depend on the units of X and Y (e.g., dollars). That’s why we usually refer to the correlation coefficient (it is unit free).
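The covariance formulas are not reproduced in this transcript. The standard population and sample definitions, offered as an assumption about the slide's content, and their link to the correlation coefficient are:

```latex
\[
\sigma_{XY} = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{N},
\qquad
s_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1},
\qquad
r = \frac{s_{XY}}{s_X \, s_Y}
\]
```

Dividing the covariance by both standard deviations cancels the units, which is why r is unit free.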
52
4-52 Chapter 4 Grouped Data: Group Mean, Weighted Mean.
53
4-53 Chapter 4 Grouped Data: Group Mean. Note: You will rarely need this. If you are given only grouped data, you will have to make your own tables in Excel (like this).
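The worksheet for this slide is not in the transcript. As a rough sketch, the mean of grouped data is usually approximated as a weighted mean of the bin midpoints, with the frequencies as weights; that approach is assumed here. The function name and the frequency table are hypothetical.

```python
def grouped_mean(bins):
    """Approximate mean from (lower, upper, frequency) rows using bin midpoints as values."""
    total = sum(freq for _, _, freq in bins)
    weighted = sum(((lo + hi) / 2) * freq for lo, hi, freq in bins)
    return weighted / total

# Hypothetical frequency table: (lower limit, upper limit, frequency).
table = [(0, 10, 4), (10, 20, 10), (20, 30, 6)]
print(grouped_mean(table))     # (5*4 + 15*10 + 25*6) / 20
```

The result is only an approximation, because every observation in a bin is treated as if it sat at the midpoint.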
54
4-54 Chapter 4 Skewness. To interpret Excel’s skewness coefficient, you need a table showing critical values for various sample sizes. Note: You can assess skewness from the histogram or boxplot (usually revealed by outliers or a long tail). It’s usually not worth it to bother with the table.
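For reference, a sample skewness coefficient can be computed as below. The adjusted formula used here, n / ((n − 1)(n − 2)) times the sum of cubed standardized deviations, is believed to match what Excel's =SKEW reports, but treat that as an assumption. The function name and data are hypothetical.

```python
import statistics

def sample_skewness(values):
    """Adjusted sample skewness coefficient. Needs at least 3 observations."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)               # sample standard deviation (n - 1 divisor)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

symmetric = [1, 2, 3, 4, 5]                    # hypothetical data, symmetric about 3
right_tail = [1, 2, 3, 4, 50]                  # hypothetical data with one large value
print(sample_skewness(symmetric))              # essentially zero
print(sample_skewness(right_tail))             # positive: long right tail
```

A coefficient near zero suggests symmetry, a positive value a long right tail, and a negative value a long left tail.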
55
4-55 Chapter 4 Kurtosis. To interpret Excel’s kurtosis coefficient, you need a table showing critical values for various sample sizes. Caution: You cannot reliably assess kurtosis from the histogram, because its x-axis scale affects its appearance. Maybe best to let statisticians worry about this topic.
56
0-56 Assignments (ML 2.5) Connect C-2 (covers chapter 4): You get three attempts. Feedback is given if requested. Printable if you wish. Deadline is midnight each Monday. Project P-1 (data, tasks, questions): Review instructions. Look at the data. Your task is to write a nice, readable report (not a spreadsheet). Length is up to you.
57
0-57 Projects: General Instructions. For each team project, submit a short (5-10 page) report (using Microsoft Word or equivalent) that answers the questions posed. Strive for effective writing (see textbook Appendix I). Creativity and initiative will be rewarded. Avoid careless spelling and grammar. Paste graphs and computer tables or output into your written report (it may be easier to format tables in Excel and then use Paste Special > Picture to avoid weird formatting and permit sizing within Word). Allocate tasks among team members as you see fit, but all should review and proofread the report (submit only one report).
58
0-58 Project P-1
Random teams are assigned on Moodle (submit only one report).
Data: Download Big Dataset 02 - Crime in Major Cities from Moodle. Your team is assigned one crime category (but you can change it if you wish). Copy the city names and the chosen crime data column to a new spreadsheet. Delete lines (if any) with missing data.
Analysis:
(a) Sort the observations (with city names).
(b) List the top 10 and bottom 10 data values (with city names).
(c) For the entire data set, calculate the mean and median. What do they tell you about center? Would the mode be helpful for this type of data? Explain.
(d) Calculate the standard deviation.
(e) Calculate the standardized z-value for each observation.
(f) Are there outliers or unusual data values (see p. 137)? Discuss.
(g) Use MegaStat (or Minitab or Excel) to make a histogram. Describe its shape.
(h) Calculate the quartiles. Make a boxplot and describe it.
(i) Make a scatter plot of your kind of crime versus a different type of crime. What does it show?
(j) Ambitious students: Sort the database in random order (see bottom of page 36) using Excel’s function =RAND(). Copy and paste the first few sorted lines into your report to illustrate your sorting method.
Comment on anything unusual (or interesting things that you might find on the web).
Watch the video walkthrough using Voting, North Carolina Births, and CEO compensation as examples (posted on Moodle).
59
0-59 Project P-1 your 2010 data will look like this (2005 and 2000 are also available)
60
0-60 Example: CEO Compensation sorting is a good first step
61
0-61 Example: CEO Compensation Highlight all data (including the headings) and use Custom Sort
62
0-62 Example: CEO Compensation now you can clearly see the high and low data values (and comment on any weird data values)
63
0-63 Example: CEO Compensation use MegaStat’s Descriptive Statistics to get your basic stats along with a nice boxplot
64
0-64 Example: CEO Compensation use MegaStat’s Frequency Distributions to get a frequency table, histogram, etc. Annotations on the histogram (added by user): severely skewed; normal if logs used?
65
0-65 Example: CEO Compensation standardize the sorted list by subtracting the mean from each x value and then dividing by the standard deviation (or use =STANDARDIZE function)
66
0-66 Example: CEO Compensation after standardizing the sorted list, unusual z values can be seen
67
0-67 Example: CEO Compensation to randomize the list, paste values of =RAND() beside data and custom sort on =RAND()
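The same randomizing idea (put a random number beside each row, then sort on that column) can be sketched outside Excel. This is only an analogy to the =RAND() approach; the function name, the seed parameter, and the rows are hypothetical.

```python
import random

def random_order(rows, seed=None):
    """Attach a random key to each row and sort on it, like a helper column of =RAND() values."""
    rng = random.Random(seed)                          # seed only makes the example repeatable
    keyed = [(rng.random(), row) for row in rows]      # random number beside each row
    keyed.sort(key=lambda pair: pair[0])               # custom sort on the random column
    return [row for _, row in keyed]

# Hypothetical rows; the real project data come from the course dataset.
rows = [("City A", 10), ("City B", 20), ("City C", 30), ("City D", 40)]
shuffled = random_order(rows, seed=1)
print(shuffled)
```

Each row keeps its own city name and value together, which is the point of sorting whole rows rather than a single column.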