Numerical Descriptive Techniques

Slides:



Advertisements
Similar presentations
Copyright © 2014 by McGraw-Hill Higher Education. All rights reserved.
Advertisements

1 1 Slide © 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole.
Descriptive Statistics
Calculating & Reporting Healthcare Statistics
Ka-fu Wong © 2007 ECON1003: Analysis of Economic Data Lesson2-1 Lesson 2: Descriptive Statistics.
Chap 3-1 EF 507 QUANTITATIVE METHODS FOR ECONOMICS AND FINANCE FALL 2008 Chapter 3 Describing Data: Numerical.
Descriptive Statistics – Central Tendency & Variability Chapter 3 (Part 2) MSIS 111 Prof. Nick Dedeke.
B a c kn e x t h o m e Parameters and Statistics statistic A statistic is a descriptive measure computed from a sample of data. parameter A parameter is.
Copyright © 2010 Pearson Education, Inc. Publishing as Prentice Hall Ch. 2-1 Statistics for Business and Economics 7 th Edition Chapter 2 Describing Data:
1 1 Slide © 2003 South-Western/Thomson Learning TM Slides Prepared by JOHN S. LOUCKS St. Edward’s University.
Numerical Descriptive Techniques Chapter 會計資訊系統計學 ( 一 ) 上課投影片 4-2 Introduction §Recall Chapter 2, where we used graphical techniques to describe.
Slides by JOHN LOUCKS St. Edward’s University.
Basic Business Statistics 10th Edition
Chap 3-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 3 Describing Data: Numerical Statistics for Business and Economics.
1 1 Slide © 2003 South-Western/Thomson Learning TM Slides Prepared by JOHN S. LOUCKS St. Edward’s University.
Descriptive Statistics  Summarizing, Simplifying  Useful for comprehending data, and thus making meaningful interpretations, particularly in medium to.
Chapter 2 Describing Data with Numerical Measurements
Numerical Descriptive Techniques
1 Descriptive Statistics: Numerical Methods Chapter 4.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. 4.1 Chapter Four Numerical Descriptive Techniques.
Copyright © 2009 Cengage Learning 4.1 Day 5 Numerical Descriptive Techniques.
Review of Measures of Central Tendency, Dispersion & Association
1 Tendencia central y dispersión de una distribución.
Describing distributions with numbers
Economics 173 Business Statistics Lecture 2 Fall, 2001 Professor J. Petry
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. 4.1 Chapter Four Numerical Descriptive Techniques.
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 3-1 Chapter 3 Numerical Descriptive Measures Statistics for Managers.
Chapter 2 Describing Data with Numerical Measurements General Objectives: Graphs are extremely useful for the visual description of a data set. However,
1 1 Slide © 2014 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole.
Numerical Descriptive Techniques
Chapter 3 – Descriptive Statistics
1 1 Slide © 2009 Thomson South-Western. All Rights Reserved Slides by JOHN LOUCKS St. Edward’s University.
1 Measure of Center  Measure of Center the value at the center or middle of a data set 1.Mean 2.Median 3.Mode 4.Midrange (rarely used)
McGraw-Hill/IrwinCopyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved. Chapter 3 Descriptive Statistics: Numerical Methods.
© Copyright McGraw-Hill CHAPTER 3 Data Description.
McGraw-Hill/IrwinCopyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved. Chapter 3 Descriptive Statistics: Numerical Methods.
Business Statistics: Communicating with Numbers
Chapter 3 Descriptive Statistics: Numerical Methods Copyright © 2014 by The McGraw-Hill Companies, Inc. All rights reserved.McGraw-Hill/Irwin.
Descriptive Statistics: Numerical Methods
Review of Measures of Central Tendency, Dispersion & Association
Describing distributions with numbers
1 1 Slide © 2007 Thomson South-Western. All Rights Reserved.
McGraw-Hill/Irwin Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 3 Descriptive Statistics: Numerical Methods.
Variation This presentation should be read by students at home to be able to solve problems.
1 Economics 173 Business Statistics Lectures 1 & 2 Summer, 2001 Professor J. Petry.
Dr. Serhat Eren 1 CHAPTER 6 NUMERICAL DESCRIPTORS OF DATA.
Chapter Four Numerical Descriptive Techniques Sir Naseer Shahzada.
Business Statistics, A First Course (4e) © 2006 Prentice-Hall, Inc. Chap 3-1 Chapter 3 Numerical Descriptive Measures Business Statistics, A First Course.
Business Statistics Spring 2005 Summarizing and Describing Numerical Data.
Chapter 3, Part B Descriptive Statistics: Numerical Measures n Measures of Distribution Shape, Relative Location, and Detecting Outliers n Exploratory.
Statistics Lecture Notes Dr. Halil İbrahim CEBECİ Chapter 03 Numerical Descriptive Techniques.
Business Statistics, 4e, by Ken Black. © 2003 John Wiley & Sons. 3-1 Business Statistics, 4e by Ken Black Chapter 3 Descriptive Statistics.
Copyright © 2015 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.
EXPECTATION, VARIANCE ETC. - APPLICATION 1. 2 Measures of Central Location Usually, we focus our attention on two types of measures when describing population.
Statistical Methods © 2004 Prentice-Hall, Inc. Week 3-1 Week 3 Numerical Descriptive Measures Statistical Methods.
Copyright © 2009 Cengage Learning 4.1 Chapter Four Numerical Descriptive Techniques.
Describing Data: Summary Measures. Identifying the Scale of Measurement Before you analyze the data, identify the measurement scale for each variable.
MM150 ~ Unit 9 Statistics ~ Part II. WHAT YOU WILL LEARN Mode, median, mean, and midrange Percentiles and quartiles Range and standard deviation z-scores.
Descriptive Statistics ( )
Business and Economics 6th Edition
Numerical Descriptive Techniques
Chapter 4 Describing Data (Ⅱ ) Numerical Measures
Descriptive Statistics: Numerical Methods
Ch 4 實習.
Keller: Stats for Mgmt & Econ, 7th Ed
Numerical Descriptive Statistics
MBA 510 Lecture 2 Spring 2013 Dr. Tonya Balan 4/20/2019.
St. Edward’s University
Business and Economics 7th Edition
Presentation transcript:

Numerical Descriptive Techniques Chapter 4 Numerical Descriptive Techniques

4.2 Measures of Central Location Usually, we focus our attention on two types of measures when describing population characteristics: Central location (e.g. average) Variability or spread The measure of central location reflects the locations of all the actual data points.

4.2 Measures of Central Location The measure of central location reflects the locations of all the actual data points. How? With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them). But if the third data point appears on the left hand-side of the midrange, it should “pull” the central location to the left. With one data point clearly the central location is at the point itself.

The Arithmetic Mean This is the most popular and useful measure of central location Sum of the observations Number of observations Mean =

The Arithmetic Mean Sample mean Population mean Sample size Population size

The Arithmetic Mean Example 4.1 Example 4.2 The reported time on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet. 7 22 11.0 Example 4.2 Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is 42.19 38.45 45.77 43.59

The Median The Median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude. Example 4.3 Find the median of the time on the internet for the 10 adults of example 4.1 Suppose only 9 adults were sampled (exclude, say, the longest time (33)) Comment 0, 0, 5, 7, 8, 9, 12, 14, 22, 33 Even number of observations Odd number of observations 0, 0, 5, 7, 8, 9, 12, 14, 22, 33 8.5, 0, 0, 5, 7, 8 9, 12, 14, 22 8

The Mode The Mode of a set of observations is the value that occurs most frequently. Set of data may have one mode (or modal class), or two or more modes. For large data sets the modal class is much more relevant than a single-value mode. The modal class

The Mode The Mean, Median, Mode The Mode Example 4.5 Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 Solution All observation except “0” occur once. There are two “0”. Thus, the mode is zero. Is this a good measure of central location? The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the mode = 8.5).

Relationship among Mean, Median, and Mode If a distribution is symmetrical, the mean, median and mode coincide If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ. A positively skewed distribution (“skewed to the right”) Mode Mean Median

Relationship among Mean, Median, and Mode If a distribution is symmetrical, the mean, median and mode coincide If a distribution is non symmetrical, and skewed to the left or to the right, the three measures differ. A positively skewed distribution (“skewed to the right”) A negatively skewed distribution (“skewed to the left”) Mode Mean Mean Mode Median Median

The Geometric Mean This is a measure of the average growth rate. Let Ri denote the the rate of return in period i (i=1,2…,n). The geometric mean of the returns R1, R2, …,Rn is the constant Rg that produces the same terminal wealth at the end of period n as do the actual returns for the n periods.

The Geometric Mean = For the given series of rate of returns the nth period return is calculated by: If the rate of return was Rg in every period, the nth period return would be calculated by: = Rg is selected such that…

4.3 Measures of variability Measures of central location fail to tell the whole story about the distribution. A question of interest still remains unanswered: How much are the observations spread out around the mean value?

4.3 Measures of variability Observe two hypothetical data sets: Small variability The average value provides a good representation of the observations in the data set. This data set is now changing to...

4.3 Measures of variability Observe two hypothetical data sets: Small variability The average value provides a good representation of the observations in the data set. Larger variability The same average value does not provide as good representation of the observations in the data set as before.

The range The range of a set of observations is the difference between the largest and smallest observations. Its major advantage is the ease with which it can be computed. Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points. But, how do all the observations spread out? ? ? ? The range cannot assist in answering this question Range Smallest observation Largest observation

The Variance This measure reflects the dispersion of all the observations The variance of a population of size N x1, x2,…,xN whose mean is m is defined as The variance of a sample of n observations x1, x2, …,xn whose mean is is defined as

Why not use the sum of deviations? Consider two small populations: 9-10= -1 A measure of dispersion Should agrees with this observation. Can the sum of deviations Be a good measure of dispersion? 11-10= +1 The sum of deviations is zero for both populations, therefore, is not a good measure of dispersion. 8-10= -2 A 12-10= +2 8 9 10 11 12 Sum = 0 …but measurements in B are more dispersed then those in A. The mean of both populations is 10... 4-10 = - 6 16-10 = +6 B 7-10 = -3 4 7 10 13 16 13-10 = +3

The Variance Let us calculate the variance of the two populations Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!!

Which data set has a larger dispersion? The Variance Let us calculate the sum of squared deviations for both data sets Which data set has a larger dispersion? Data set B is more dispersed around the mean A B 1 2 3 1 3 5

The Variance A B SumA = (1-2)2 +…+(1-2)2 +(3-2)2 +… +(3-2)2= 10 SumB = (1-3)2 + (5-3)2 = 8 SumA > SumB. This is inconsistent with the observation that set B is more dispersed. A B 1 2 3 1 3 5

The Variance However, when calculated on “per observation” basis (variance), the data set dispersions are properly ranked. sA2 = SumA/N = 10/5 = 2 sB2 = SumB/N = 8/2 = 4 A B 1 2 3 1 3 5

The Variance Example 4.7 Solution The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Finds its mean and variance Solution

The Variance – Shortcut method

Standard Deviation The standard deviation of a set of observations is the square root of the variance .

Standard Deviation Example 4.8 To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club. The distances were recorded. Which 7-iron is more consistent?

Standard Deviation Example 4.8 – solution The Standard Deviation Standard Deviation Example 4.8 – solution Excel printout, from the “Descriptive Statistics” sub-menu. The innovation club is more consistent, and because the means are close, is considered a better club

Interpreting Standard Deviation The standard deviation can be used to compare the variability of several distributions make a statement about the general shape of a distribution. The empirical rule: If a sample of observations has a mound-shaped distribution, the interval

Interpreting Standard Deviation Example 4.9 A statistics practitioner wants to describe the way returns on investment are distributed. The mean return = 10% The standard deviation of the return = 8% The histogram is bell shaped.

Interpreting Standard Deviation Example 4.9 – solution The empirical rule can be applied (bell shaped histogram) Describing the return distribution Approximately 68% of the returns lie between 2% and 18% [10 – 1(8), 10 + 1(8)] Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)] Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]

The Chebysheff’s Theorem The proportion of observations in any sample that lie within k standard deviations of the mean is at least 1-1/k2 for k > 1. This theorem is valid for any set of measurements (sample, population) of any shape!! K Interval Chebysheff Empirical Rule 1 at least 0% approximately 68% 2 at least 75% approximately 95% 3 at least 89% approximately 99.7% (1-1/12) (1-1/22) (1-1/32)

The Chebysheff’s Theorem Example 4.10 The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000,respectively. What can you say about the salaries at this chain? Solution At least 75% of the salaries lie between $22,000 and $34,000 28000 – 2(3000) 28000 + 2(3000) At least 88.9% of the salaries lie between $$19,000 and $37,000 28000 – 3(3000) 28000 + 3(3000)

The Coefficient of Variation The coefficient of variation of a set of measurements is the standard deviation divided by the mean value. This coefficient provides a proportionate measure of variation. A standard deviation of 10 may be perceived large when the mean value is 100, but only moderately large when the mean value is 500

4.4 Measures of Relative Standing and Box Plots Percentile The pth percentile of a set of measurements is the value for which p percent of the observations are less than that value 100(1-p) percent of all the observations are greater than that value. Example Suppose your score is the 60% percentile of a SAT test. Then 60% of all the scores lie here 40% Your score

Quartiles Commonly used percentiles First (lower)decile = 10th percentile First (lower) quartile, Q1, = 25th percentile Second (middle)quartile,Q2, = 50th percentile Third quartile, Q3, = 75th percentile Ninth (upper)decile = 90th percentile

Quartiles Example Find the quartiles of the following set of measurements 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8

Quartiles Solution Sort the observations 2, 4, 4, 5, 7, 8, 10, 12, 17, 18, 18, 21, 27, 29, 30 15 observations The first quartile At most (.25)(15) = 3.75 observations should appear below the first quartile. Check the first 3 observations on the left hand side. At most (.75)(15)=11.25 observations should appear above the first quartile. Check 11 observations on the right hand side. Comment:If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.

Location of Percentiles Find the location of any percentile using the formula Example 4.11 Calculate the 25th, 50th, and 75th percentile of the data in Example 4.1

Location of Percentiles Example 4.11 – solution After sorting the data we have 0, 0, 5, 7, 8, 9, 22, 33. 2 3 0 5 1 Location Values Location 3 3.75 The 2.75th location Translates to the value (.75)(5 – 0) = 3.75 2.75

Location of Percentiles Example 4.11 – solution continued The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is 8.5.

Location of Percentiles Example 4.11 – solution continued The 75th percentile is one quarter of the distance between the eighth and ninth observation that is 14+.25(22 – 14) = 16. Eighth observation Ninth observation

Quartiles and Variability Quartiles can provide an idea about the shape of a histogram Q1 Q2 Q3 Q1 Q2 Q3 Positively skewed histogram Negatively skewed histogram

Interquartile range = Q3 – Q1 This is a measure of the spread of the middle 50% of the observations Large value indicates a large spread of the observations Interquartile range = Q3 – Q1

Box Plot This is a pictorial display that provides the main descriptive measures of the data set: L - the largest observation Q3 - The upper quartile Q2 - The median Q1 - The lower quartile S - The smallest observation 1.5(Q3 – Q1) Whisker S Q1 Q2 Q3 L

Box Plot Example 4.14 (Xm02-01) Left hand boundary = 9.275–1.5(IQR)= -104.226 Right hand boundary=84.9425+ 1.5(IQR)=198.4438 -104.226 9.275 84.9425 119.63 198.4438 26.905 No outliers are found

Box Plot Additional Example - GMAT scores Create a box plot for the data regarding the GMAT scores of 200 applicants (see GMAT.XLS) 417.5 449 512 537 575 669.5 788 512-1.5(IQR) 575+1.5(IQR)

Box Plot GMAT - continued Interpreting the box plot results Q1 512 Q2 537 Q3 575 449 669.5 25% 50% 25% Interpreting the box plot results The scores range from 449 to 788. About half the scores are smaller than 537, and about half are larger than 537. About half the scores lie between 512 and 575. About a quarter lies below 512 and a quarter above 575.

The histogram is positively skewed Box Plot GMAT - continued The histogram is positively skewed Q1 512 Q2 537 Q3 575 449 669.5 25% 50% 25% 50% 25% 25%

Box Plot Example 4.15 (Xm04-15) Example 4.15 – solution A study was organized to compare the quality of service in 5 drive through restaurants. Interpret the results Example 4.15 – solution Minitab box plot

Box Plot Jack in the Box Jack in the box is the slowest in service Hardee’s Hardee’s service time variability is the largest McDonalds Wendy’s service time appears to be the shortest and most consistent. Wendy’s Popeyes

Box Plot Times are symmetric Times are positively skewed Jack in the Box Jack in the box is the slowest in service Hardee’s Hardee’s service time variability is the largest McDonalds Wendy’s service time appears to be the shortest and most consistent. Wendy’s Popeyes

4.5 Measures of Linear Relationship The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables. Covariance - is there any pattern to the way two variables move together? Coefficient of correlation - how strong is the linear relationship between two variables

Covariance mx (my) is the population mean of the variable X (Y). N is the population size. x (y) is the sample mean of the variable X (Y). n is the sample size.

Covariance Compare the following three sets xi yi (x – x) (y – y) 2 6 7 13 20 27 -3 1 -7 21 14 x=5 y =20 Cov(x,y)=17.5 xi yi 2 6 7 20 27 13 Cov(x,y) = -3.5 x=5 y =20 xi yi (x – x) (y – y) (x – x)(y – y) 2 6 7 27 20 13 -3 1 -7 -21 -14 x=5 y =20 Cov(x,y)=-17.5

Covariance If the two variables move in the same direction, (both increase or both decrease), the covariance is a large positive number. If the two variables move in opposite directions, (one increases when the other one decreases), the covariance is a large negative number. If the two variables are unrelated, the covariance will be close to zero.

The coefficient of correlation This coefficient answers the question: How strong is the association between X and Y.

The coefficient of correlation +1 -1 Strong positive linear relationship COV(X,Y)>0 or r or r = No linear relationship COV(X,Y)=0 Strong negative linear relationship COV(X,Y)<0

The coefficient of correlation If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship). If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship). No straight line relationship is indicated by a coefficient close to zero.

The coefficient of correlation and the covariance – Example 4.16 Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another. Solution We believe GMAT affects GPA. Thus GMAT is labeled X GPA is labeled Y

The coefficient of correlation and the covariance – Example 4.16 Student x y x2 y2 xy 9.6 1 599 358801 92.16 5750.4 cov(x,y)=(1/12-1)[67,559.2-(7587)(106.4)/12]=26.16 Sx = {(1/12-1)[4,817,755-(7587)2/12)]}.5=43.56 Sy = similar to Sx = 1.12 r = cov(x,y)/SxSy = 26.16/(43.56)(1.12) = .5362 2 689 8.8 474721 77.44 6063.2 3 584 7.4 341056 54.76 4321.6 4 …………………………………………………. 631 10 398161 100 6310 11 593 8.8 351649 77.44 5218.4 12 683 8 466489 64 5464 Total 7,587 106.4 4,817,755 957.2 67,559.2

The coefficient of correlation and the covariance – Example 4 The coefficient of correlation and the covariance – Example 4.16 – Excel Use the Covariance option in Data Analysis If your version of Excel returns the population covariance and variances, multiply each one by n/n-1 to obtain the corresponding sample values. Use the Correlation option to produce the correlation matrix. Variance-Covariance Matrix Population values Population values GPA GMAT 1.15 23.98 1739.52 GPA GMAT 1.25 26.16 1897.66 Sample values Sample values 12 12-1 ´

The coefficient of correlation and the covariance – Example 4 The coefficient of correlation and the covariance – Example 4.16 – Excel Interpretation The covariance (26.16) indicates that GMAT score and performance in the MBA program are positively related. The coefficient of correlation (.5365) indicates that there is a moderately strong positive linear relationship between GMAT and MBA GPA.

The Least Squares Method We are seeking a line that best fits the data when two variables are (presumably) related to one another. We define “best fit line” as a line for which the sum of squared differences between it and the data points is minimized. The y value of point i calculated from the equation The actual y value of point i

The least Squares Method Y Errors Errors X Different lines generate different errors, thus different sum of squares of errors. There is a line that minimizes the sum of squared errors

The least Squares Method The coefficients b0 and b1 of the line that minimizes the sum of squares of errors are calculated from the data.

The Least Squares Method Example 4.17 Find the least squares line for Example 4.16 (Xm04-16.xls)