Measures of Variability (Dispersion, Spread): 1. Range 2. Inter-Quartile Range 3. Variance, standard deviation 4. Pseudo-standard deviation

Measures of Central Location: 1. Mean 2. Median

1. Range: $R = \text{Range} = \max - \min$
2. Inter-Quartile Range (IQR): $\text{IQR} = Q_3 - Q_1$

Example
The Verbal IQ scores of n = 23 students, arranged in increasing order, give: min = 80, $Q_1 = 89$, $Q_2 = 96$, $Q_3 = 105$, max = 119.

Range and IQR
Range = max - min = 119 - 80 = 39
Inter-Quartile Range = IQR = $Q_3 - Q_1$ = 105 - 89 = 16
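The range and IQR calculation can be sketched in Python as follows. This is only an illustration: it assumes the median-of-halves rule for quartiles (with the median left out of both halves when n is odd), and the data list is hypothetical, not the Verbal IQ scores.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # observations below the median position
    upper = xs[half + (n % 2):]   # observations above it (median excluded when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

# Hypothetical data, chosen only for illustration.
data = [7, 10, 13, 15, 21, 24, 30]
lo, q1, q2, q3, hi = five_number_summary(data)

data_range = hi - lo   # Range = max - min
iqr = q3 - q1          # IQR = Q3 - Q1
print(data_range, iqr)
```

Other quartile conventions exist and can give slightly different values of $Q_1$ and $Q_3$ on small samples.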

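3. Sample Variance
Let $x_1, x_2, x_3, \dots, x_n$ denote a set of n numbers. Recall that the mean of the n numbers is defined as:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The numbers $d_i = x_i - \bar{x}$ ($i = 1, \dots, n$) are called deviations from the mean.

The sum $\sum_{i=1}^{n} d_i^2$ is called the sum of squares of deviations from the mean. Writing it out in full:
$$d_1^2 + d_2^2 + \dots + d_n^2$$
or
$$(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2$$

The Sample Variance is defined as the quantity:
$$\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
and is denoted by the symbol $s^2$.

The Sample Standard Deviation s
Definition: The Sample Standard Deviation is defined by:
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$
Hence the Sample Standard Deviation, s, is the square root of the sample variance.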
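A minimal Python sketch of the sample variance and standard deviation, assuming the usual n - 1 divisor. The function name and the data are hypothetical; the result is cross-checked against `statistics.variance` from the standard library.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    deviations = [x - mean for x in xs]
    return sum(d * d for d in deviations) / (n - 1)

# Hypothetical data: mean 5, deviations -3, -1, -1, 0, 5.
data = [2.0, 4.0, 4.0, 5.0, 10.0]
s2 = sample_variance(data)   # 36 / 4
s = math.sqrt(s2)            # standard deviation = square root of the variance
print(s2, s)
```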

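Example
Let $x_1, x_2, x_3, x_4, x_5$ denote the set of 5 numbers in the following table [table not reproduced].

Then $\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 66$ and $\bar{x} = 66/5 = 13.2$.

The deviations from the mean $d_1, d_2, d_3, d_4, d_5$, where $d_i = x_i - 13.2$, are given in the following table [table not reproduced].

The sum of squared deviations $\sum d_i^2$ and the sample variance $s^2 = \sum d_i^2/(n-1)$ follow from these deviations.

Also the standard deviation is $s = \sqrt{s^2}$.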

Interpretations of s
In Normal distributions:
– Approximately 2/3 of the observations will lie within one standard deviation of the mean
– Approximately 95% of the observations lie within two standard deviations of the mean
– In a histogram of the Normal distribution, the standard deviation is approximately the distance from the mode to the inflection point

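[Figures: Normal curve marking the mode and the inflection point at distance s from it; about 2/3 of the area lies within s of the mean and about 95% within 2s.]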

Example
A researcher collected data on 1500 males aged [age range missing]. The variables measured were cholesterol and blood pressure.
– The mean blood pressure was 155 with a standard deviation of 12.
– The mean cholesterol level was 230 with a standard deviation of 15.
– In both cases the data were normally distributed.

Interpretation of these numbers
Blood pressure levels vary about the value 155 in males of this age group.
Cholesterol levels vary about the value 230 in males of this age group.

2/3 of these males have blood pressure within 12 of 155, i.e. between 155 - 12 = 143 and 155 + 12 = 167.
2/3 of these males have cholesterol within 15 of 230, i.e. between 230 - 15 = 215 and 230 + 15 = 245.

95% of these males have blood pressure within 2(12) = 24 of 155, i.e. between 155 - 24 = 131 and 155 + 24 = 179.
95% of these males have cholesterol within 2(15) = 30 of 230, i.e. between 230 - 30 = 200 and 230 + 30 = 260.
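The interval arithmetic in this example can be reproduced with a short Python sketch. The helper name `normal_intervals` is hypothetical; it simply applies the "mean ± s" and "mean ± 2s" rules stated above.

```python
def normal_intervals(mean, sd):
    """Intervals holding roughly 2/3 and 95% of Normally distributed data."""
    return {
        "about 2/3": (mean - sd, mean + sd),
        "about 95%": (mean - 2 * sd, mean + 2 * sd),
    }

blood_pressure = normal_intervals(155, 12)
cholesterol = normal_intervals(230, 15)
print(blood_pressure)
print(cholesterol)
```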

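A computing formula for the sum of squares of deviations from the mean:
$$\sum_{i=1}^{n}(x_i - \bar{x})^2$$
The difficulty with this formula is that $\bar{x}$ will have many decimals. The result will be that each term in the above sum will also have many decimals.

The sum of squares of deviations from the mean can also be computed using the following identity:
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

To use this identity we need to compute:
$$\sum_{i=1}^{n} x_i \quad\text{and}\quad \sum_{i=1}^{n} x_i^2$$

Then:
$$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2/n}{n-1}}$$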
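As a quick numerical check, the two ways of computing the sum of squared deviations can be compared in Python. This is a sketch on hypothetical data; the shortcut uses $\sum x_i^2 - (\sum x_i)^2/n$, and the function names are illustrative only.

```python
import math

def sum_sq_dev_definition(xs):
    """Sum of (x_i - mean)^2, computed directly from the deviations."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def sum_sq_dev_shortcut(xs):
    """Same quantity from the totals: sum(x^2) - (sum(x))^2 / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n

# Hypothetical data: sum = 40, sum of squares = 394.
data = [3, 5, 8, 10, 14]
direct = sum_sq_dev_definition(data)
shortcut = sum_sq_dev_shortcut(data)
s = math.sqrt(shortcut / (len(data) - 1))
print(direct, shortcut, s)
```

In floating-point arithmetic the shortcut can lose precision when the mean is large relative to the spread, so the direct form is usually preferred in software; the shortcut is mainly a hand-calculation aid.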

Example
Verbal IQ on n = 23 students, arranged in increasing order [data not reproduced].

$\sum x_i = 2244$ [the value of $\sum x_i^2$ is missing].

Then s follows from the identity. You will obtain exactly the same answer if you use the left-hand side of the equation.

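A quick (rough) calculation of s:
$$s \approx \frac{\text{Range}}{4}$$
The reason for this is that approximately all (95%) of the observations are between $\bar{x} - 2s$ and $\bar{x} + 2s$. Thus $\max \approx \bar{x} + 2s$ and $\min \approx \bar{x} - 2s$, so $\text{Range} = \max - \min \approx 4s$.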

Example
Verbal IQ on n = 23 students: min = 80 and max = 119, so $s \approx (119 - 80)/4 = 39/4 = 9.75$.
This compares with the exact value of s [value missing]. The rough method is useful for checking your calculation of s.
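The rough check is a one-line calculation. A small Python sketch, assuming the range/4 rule of thumb (the function name is hypothetical):

```python
def rough_sd(minimum, maximum):
    """Rule-of-thumb estimate of s: about 95% of data span roughly 4 standard deviations."""
    return (maximum - minimum) / 4

rough = rough_sd(80, 119)   # Verbal IQ example: range = 39
print(rough)
```

If a computed s differs from this estimate by a large factor, the calculation is worth rechecking.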

The Pseudo Standard Deviation (PSD)
Definition: The Pseudo Standard Deviation (PSD) is defined by:
$$\text{PSD} = \frac{\text{IQR}}{1.35}$$

Properties
– For Normal distributions the pseudo standard deviation (PSD) and the standard deviation (s) will be approximately equal
– For leptokurtic distributions the standard deviation (s) will be larger than the pseudo standard deviation (PSD)
– For platykurtic distributions the standard deviation (s) will be smaller than the pseudo standard deviation (PSD)

Example
Verbal IQ on n = 23 students
Inter-Quartile Range = IQR = $Q_3 - Q_1$ = 105 - 89 = 16
Pseudo standard deviation: PSD = IQR/1.35 = 16/1.35 ≈ 11.85
This compares with the standard deviation s [value missing].
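The following Python sketch illustrates why a divisor of about 1.35 makes the PSD comparable to s for Normal data. It assumes the definition PSD = IQR/1.35 and uses `statistics.NormalDist` from the standard library to obtain the IQR of a standard Normal distribution.

```python
from statistics import NormalDist

# IQR of the standard Normal distribution (standard deviation = 1).
z = NormalDist(mu=0.0, sigma=1.0)
normal_iqr = z.inv_cdf(0.75) - z.inv_cdf(0.25)   # close to 1.35

def pseudo_sd(iqr):
    """Pseudo standard deviation, assuming PSD = IQR / 1.35."""
    return iqr / 1.35

psd = pseudo_sd(105 - 89)   # Verbal IQ example: IQR = 16
print(round(normal_iqr, 3), round(psd, 2))
```

Because the IQR ignores the tails, heavy tails inflate s but not the PSD, which is the basis of the comparison between the two.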

An outlier is a “wild” observation in the data.
Outliers occur because of:
– errors (typographical and computational)
– extreme cases in the population
We will now consider the drawing of box plots where outliers are identified.

Box-whisker Plots showing outliers

To draw a box plot we need to:
– Compute the hinge (median, $Q_2$) and the mid-hinges (first and third quartiles, $Q_1$ and $Q_3$)
– Compute the inner and outer fences, which are used to identify outliers

The fences are like the fences at a prison. We expect the entire population to be within both sets of fences. If a member of the population is between the inner and outer fences it is a mild outlier. If a member of the population is outside of the outer fences it is an extreme outlier.

Lower outer fence: $F_1 = Q_1 - 3\,\text{IQR}$
Upper outer fence: $F_2 = Q_3 + 3\,\text{IQR}$

Lower inner fence: $f_1 = Q_1 - 1.5\,\text{IQR}$
Upper inner fence: $f_2 = Q_3 + 1.5\,\text{IQR}$

Observations that are between the lower and upper inner fences are considered to be non-outliers.
Observations that are outside the inner fences but not outside the outer fences are considered to be mild outliers.
Observations that are outside the outer fences are considered to be extreme outliers.

Mild outliers are plotted individually in a box plot using one symbol, and extreme outliers are plotted individually using a different symbol [symbols not reproduced].
Non-outliers are represented with the box and whiskers, with
– Max = largest observation within the fences
– Min = smallest observation within the fences

[Figure: box-whisker plot showing the inner fences, an outer fence, mild outliers, an extreme outlier, and the box and whiskers representing the data that are not outliers.]

Example
Data collected on n = 109 countries in [year missing], on k = 25 variables.

The variables
1. Population Size (in 1000s)
2. Density = Number of people per square kilometer
3. Urban = percentage of population living in cities
4. Religion
5. lifeexpf = Average female life expectancy
6. lifeexpm = Average male life expectancy
7. literacy = % of population who read
8. pop_inc = % increase in population size (1995)
9. babymort = Infant mortality (deaths per 1000)
10. gdp_cap = Gross domestic product per capita
11. Region = Region or economic group
12. calories = Daily calorie intake
13. aids = Number of AIDS cases
14. birth_rt = Birth rate per 1000 people
15. death_rt = Death rate per 1000 people
16. aids_rt = Number of AIDS cases per [number missing] people
17. log_gdp = $\log_{10}$(gdp_cap)
18. log_aidsr = $\log_{10}$(aids_rt)
19. b_to_d = birth to death ratio
20. fertility = average number of children in family
21. log_pop = $\log_{10}$(population)
22. cropgrow = ??
23. lit_male = % of males who can read
24. lit_fema = % of females who can read
25. Climate = predominant climate

The data file as it appears in SPSS

Consider the data on infant mortality.
Stem-leaf diagram: stem = 10s, leaf = unit digit [diagram not reproduced]

Summary Statistics
median = $Q_2$ = 27
Quartiles
Lower quartile = $Q_1$ = the median of the lower half = 12
Upper quartile = $Q_3$ = the median of the upper half = 66.5
Interquartile range (IQR)
IQR = $Q_3 - Q_1$ = 66.5 - 12 = 54.5

The Outer Fences
lower = $Q_1$ - 3(IQR) = 12 - 3(54.5) = -151.5
upper = $Q_3$ + 3(IQR) = 66.5 + 3(54.5) = 230
No observations are outside of the outer fences.

The Inner Fences
lower = $Q_1$ - 1.5(IQR) = 12 - 1.5(54.5) = -69.75
upper = $Q_3$ + 1.5(IQR) = 66.5 + 1.5(54.5) = 148.25
Only one observation (168, Afghanistan) is outside of the inner fences (a mild outlier).
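The fence calculation and the outlier rule can be sketched in Python as below. The function names are hypothetical; the quartiles ($Q_1 = 12$, $Q_3 = 66.5$) and the observation 168 are the infant mortality values quoted above.

```python
def fences(q1, q3):
    """Inner fences at 1.5 IQR and outer fences at 3 IQR beyond the quartiles."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

def classify(x, q1, q3):
    """Label an observation as a non-outlier, mild outlier, or extreme outlier."""
    f = fences(q1, q3)
    inner_lo, inner_hi = f["inner"]
    outer_lo, outer_hi = f["outer"]
    if x < outer_lo or x > outer_hi:
        return "extreme outlier"
    if x < inner_lo or x > inner_hi:
        return "mild outlier"
    return "non-outlier"

infant_fences = fences(12, 66.5)
print(infant_fences)
print(classify(168, 12, 66.5))   # the Afghanistan observation
print(classify(27, 12, 66.5))    # the median
```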

[Figure: Box-Whisker Plot of Infant Mortality]

Example 2 In this example we are looking at the weight gains (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork). – Ten test animals for each diet

Table: Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)
Columns: High Protein (Beef, Cereal, Pork) and Low Protein (Beef, Cereal, Pork), one column per diet
Summary rows: Median, Mean, IQR, PSD, Variance, Std. Dev. [values not reproduced]

[Figure: weight gains by level of protein (High Protein, Low Protein) and source of protein (Beef, Cereal, Pork)]

Conclusions
– Weight gain is higher for the high protein meat diets
– Increasing the level of protein increases weight gain, but only if the source of protein is a meat source

Measures of Shape

Skewness: Positively skewed, Negatively skewed, Symmetric
Kurtosis: Platykurtic, Leptokurtic, Normal (mesokurtic)

Measure of Skewness: based on the sum of cubes
Measure of Kurtosis: based on the sum of 4th powers

The Measure of Skewness

The Measure of Kurtosis The 3 is subtracted so that g 2 is zero for the normal distribution

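Interpretations of Measures of Shape
Skewness: $g_1 > 0$ positively skewed; $g_1 = 0$ symmetric; $g_1 < 0$ negatively skewed
Kurtosis: $g_2 < 0$ platykurtic; $g_2 = 0$ normal (mesokurtic); $g_2 > 0$ leptokurtic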
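The exact formulas for $g_1$ and $g_2$ are not recoverable from the transcript, so the Python sketch below assumes one common moment-based form: $g_1 = m_3 / m_2^{3/2}$ and $g_2 = m_4 / m_2^2 - 3$, where $m_k$ is the average of the k-th powers of the deviations from the mean. Other textbooks use slightly different divisors. The data sets are hypothetical.

```python
def shape_measures(xs):
    """Return (g1, g2) using central moments with divisor n (an assumed convention)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n   # sum of cubes drives skewness
    m4 = sum((x - mean) ** 4 for x in xs) / n   # sum of 4th powers drives kurtosis
    g1 = m3 / m2 ** 1.5
    g2 = m4 / m2 ** 2 - 3   # 3 is subtracted so that g2 is zero for the Normal distribution
    return g1, g2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

g1_sym, g2_sym = shape_measures(symmetric)
g1_skew, g2_skew = shape_measures(right_skewed)
print(g1_sym, g2_sym)
print(g1_skew, g2_skew)
```

The evenly spaced sample has $g_1 = 0$ and a negative $g_2$ (flat, platykurtic), while the sample with one large value has a positive $g_1$.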

Descriptive techniques for Multivariate data In most research situations data is collected on more than one variable (usually many variables)

Graphical Techniques
– The scatter plot
– The two-dimensional histogram

The Scatter Plot
For two variables X and Y we will have a measurement of each variable on each case: $(x_i, y_i)$, where $x_i$ = the value of X for case i and $y_i$ = the value of Y for case i.

To construct a scatter plot we plot the points $(x_i, y_i)$ for each case on the X-Y plane.

Data Set #3
The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program.
Columns: Student, Verbal IQ, Math IQ, Initial Reading Achievement, Final Reading Achievement [values not reproduced]

[Figure: scatter plot with the point (84, 80) labeled]

Some Scatter Patterns

Circular
– No relationship between X and Y
– Unable to predict Y from X

Ellipsoidal
– Positive relationship between X and Y
– Increases in X correspond to increases in Y (but not always)
– Major axis of the ellipse has positive slope

Example: Verbal IQ, Math IQ

Some More Patterns

Ellipsoidal (thinner ellipse)
– Stronger positive relationship between X and Y
– Increases in X correspond to increases in Y (more frequently)
– Major axis of the ellipse has positive slope
– Minor axis of the ellipse much smaller

Increased strength in the positive relationship between X and Y
– Increases in X correspond to increases in Y (almost always)
– Minor axis of the ellipse extremely small in relation to the major axis of the ellipse

Perfect positive relationship between X and Y
– Y perfectly predictable from X
– Data fall exactly along a straight line with positive slope

Ellipsoidal
– Negative relationship between X and Y
– Increases in X correspond to decreases in Y (but not always)
– Major axis of the ellipse has negative slope

The strength of the relationship can increase until changes in Y can be perfectly predicted from X

Some Non-Linear Patterns

In a linear pattern, Y increases with respect to X at a constant rate.
In a non-linear pattern, the rate at which Y increases with respect to X is variable.

Growth Patterns

Growth patterns frequently follow a sigmoid curve:
– Growth at the start is slow
– It then speeds up
– It slows down again as it reaches its limiting size

Measures of strength of a relationship (Correlation)
– Pearson’s correlation coefficient (r)
– Spearman’s rank correlation coefficient (rho, $\rho$)

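Assume that we have collected data on two variables X and Y. Let $(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_n, y_n)$ denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

From this data we can compute summary statistics for each variable. The means
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad\text{and}\quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

The standard deviations
$$s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} \quad\text{and}\quad s_y = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}}$$

These statistics:
– give information for each variable separately
– but give no information about the relationship between the two variables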

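Consider the statistics:
$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

The first two statistics, $S_{xx}$ and $S_{yy}$:
– are used to measure variability in each variable
– are used to compute the sample standard deviations $s_x = \sqrt{S_{xx}/(n-1)}$ and $s_y = \sqrt{S_{yy}/(n-1)}$

The third statistic, $S_{xy}$, is used to measure correlation.
If two variables are positively related, the sign of $(x_i - \bar{x})$ will agree with the sign of $(y_i - \bar{y})$.

When $(x_i - \bar{x})$ is positive, $(y_i - \bar{y})$ will be positive: when $x_i$ is above its mean, $y_i$ will be above its mean.
When $(x_i - \bar{x})$ is negative, $(y_i - \bar{y})$ will be negative: when $x_i$ is below its mean, $y_i$ will be below its mean.
The product $(x_i - \bar{x})(y_i - \bar{y})$ will be positive for most cases.

This implies that the statistic $S_{xy}$ will be positive, since most of the terms in this sum will be positive.

On the other hand, if two variables are negatively related, the sign of $(x_i - \bar{x})$ will be opposite to the sign of $(y_i - \bar{y})$.

When $(x_i - \bar{x})$ is positive, $(y_i - \bar{y})$ will be negative: when $x_i$ is above its mean, $y_i$ will be below its mean.
When $(x_i - \bar{x})$ is negative, $(y_i - \bar{y})$ will be positive: when $x_i$ is below its mean, $y_i$ will be above its mean.
The product $(x_i - \bar{x})(y_i - \bar{y})$ will be negative for most cases.

Again this implies that the statistic $S_{xy}$ will be negative, since most of the terms in this sum will be negative.

Pearson’s correlation coefficient is defined as below:
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

The denominator, $\sqrt{S_{xx}\,S_{yy}}$, is always positive.

The numerator, $S_{xy}$, is positive if there is a positive relationship between X and Y and negative if there is a negative relationship between X and Y. This property carries over to Pearson’s correlation coefficient r.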

Properties of Pearson’s correlation coefficient r
1. The value of r is always between -1 and +1.
2. If the relationship between X and Y is positive, then r will be positive.
3. If the relationship between X and Y is negative, then r will be negative.
4. If there is no relationship between X and Y, then r will be zero (or close to zero in a sample).
5. The value of r will be +1 if the points $(x_i, y_i)$ lie on a straight line with positive slope.
6. The value of r will be -1 if the points $(x_i, y_i)$ lie on a straight line with negative slope.
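These properties can be checked with a short Python sketch. It assumes the deviation-based definition of r (the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations); the function name and data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2 * v + 1 for v in x])     # points on a line with positive slope
r_down = pearson_r(x, [10 - 3 * v for v in x])  # points on a line with negative slope
r_mixed = pearson_r(x, [2, 4, 5, 4, 5])         # positive but imperfect relationship
print(r_up, r_down, r_mixed)
```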

[Figures: scatter plots illustrating r = 1, 0.95, 0.7, 0.4, 0, -0.4, -0.7, -0.8, -0.95, and -1]

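Computing formulae for the statistics:
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \qquad S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

To compute $S_{xx}$, $S_{yy}$ and $S_{xy}$, first compute
$$\sum x_i, \quad \sum y_i, \quad \sum x_i^2, \quad \sum y_i^2, \quad \sum x_i y_i$$
Then substitute these totals into the computing formulae.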
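A sketch of the computing-formula route in Python, on hypothetical data. It assumes the totals-based forms, for example $S_{xy} = \sum x_i y_i - (\sum x_i)(\sum y_i)/n$; the names `s_xx`, `s_yy`, `s_xy` are illustrative labels for the three statistics.

```python
import math

def correlation_from_totals(xs, ys):
    """Compute S_xx, S_yy, S_xy and r from the five totals."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    r = s_xy / math.sqrt(s_xx * s_yy)
    return s_xx, s_yy, s_xy, r

# Hypothetical data: totals are 15, 20, 55, 86 and 66.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
s_xx, s_yy, s_xy, r = correlation_from_totals(x, y)
print(s_xx, s_yy, s_xy, round(r, 4))
```

The result agrees with the deviation-based definition; the totals form is mainly convenient for hand calculation.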

Example Verbal IQ, MathIQ

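Now compute the required totals, and hence the sums of squares and cross-products [numerical values missing].

Thus Pearson’s correlation coefficient r is obtained from these quantities [value missing].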