
Measure of Variability (Dispersion, Spread) 1.Range 2.Inter-Quartile Range 3.Variance, standard deviation 4.Pseudo-standard deviation.


1 Measure of Variability (Dispersion, Spread) 1. Range 2. Inter-Quartile Range 3. Variance, standard deviation 4. Pseudo-standard deviation

2 Measure of Central Location 1. Mean 2. Median

3 1. Range: R = Range = max - min 2. Inter-Quartile Range (IQR): IQR = $Q_3 - Q_1$

4 Example The data on Verbal IQ for n = 23 students, arranged in increasing order, are: 80 82 84 86 86 89 90 94 94 95 95 96 99 99 102 102 104 105 105 109 111 118 119. min = 80, $Q_1$ = 89, $Q_2$ = 96, $Q_3$ = 105, max = 119

5 Range and IQR Range = max - min = 119 - 80 = 39 Inter-Quartile Range = IQR = $Q_3 - Q_1$ = 105 - 89 = 16
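The range and IQR above can be reproduced with a short sketch. It assumes the quartiles are found by the median-of-halves rule (the middle observation is left out of both halves when n is odd), which matches the values quoted on the slide; other quartile conventions can give slightly different results. The function name `quartiles` is illustrative only.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumption: for odd n the middle observation is excluded from both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
             99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]

q1, q2, q3 = quartiles(verbal_iq)
data_range = max(verbal_iq) - min(verbal_iq)  # max - min
iqr = q3 - q1                                 # Q3 - Q1
print(q1, q2, q3, data_range, iqr)
```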

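6 3. Sample Variance Let $x_1, x_2, x_3, \ldots, x_n$ denote a set of n numbers. Recall the mean of the n numbers is defined as: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

7 The numbers $d_1 = x_1 - \bar{x},\ d_2 = x_2 - \bar{x},\ \ldots,\ d_n = x_n - \bar{x}$ are called deviations from the mean.

8 The sum $\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$ is called the sum of squares of deviations from the mean. Writing it out in full: $d_1^2 + d_2^2 + \cdots + d_n^2$ or $(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2$

9 The Sample Variance is defined as the quantity: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$ and is denoted by the symbol $s^2$.

10 The Sample Standard Deviation s Definition: The Sample Standard Deviation is defined by: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$ Hence the Sample Standard Deviation, s, is the square root of the sample variance.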

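11 Example Let $x_1, x_2, x_3, x_4, x_5$ denote the set of 5 numbers in the following table.

    i       1     2     3     4     5
    x_i    10    15    21     7    13

12 Then $\sum x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 10 + 15 + 21 + 7 + 13 = 66$ and $\bar{x} = \frac{66}{5} = 13.2$

13 The deviations from the mean $d_1, d_2, d_3, d_4, d_5$ are given in the following table.

    i       1     2     3     4     5
    x_i    10    15    21     7    13
    d_i  -3.2   1.8   7.8  -6.2  -0.2

14 The sum $\sum d_i^2 = (-3.2)^2 + (1.8)^2 + (7.8)^2 + (-6.2)^2 + (-0.2)^2 = 10.24 + 3.24 + 60.84 + 38.44 + 0.04 = 112.80$ and $s^2 = \frac{112.80}{4} = 28.2$

15 Also the standard deviation is: $s = \sqrt{28.2} = 5.31$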
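As a check on the five-number example, here is a minimal sketch that follows the definitions step by step (mean, deviations, sum of squares, division by n - 1). The variable names are illustrative; the final comparison uses `statistics.variance` from the Python standard library, which also divides by n - 1.

```python
import math
from statistics import variance

values = [10, 15, 21, 7, 13]
n = len(values)

mean = sum(values) / n                             # x-bar
deviations = [x - mean for x in values]            # d_i = x_i - x-bar
sum_sq_dev = sum(d * d for d in deviations)        # sum of squared deviations
sample_variance = sum_sq_dev / (n - 1)             # s^2, divisor n - 1
sample_sd = math.sqrt(sample_variance)             # s

print(mean, [round(d, 1) for d in deviations])
print(round(sum_sq_dev, 2), round(sample_variance, 2), round(sample_sd, 2))

# Cross-check against the standard library's sample variance.
assert math.isclose(sample_variance, variance(values))
```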

16 Interpretations of s In Normal distributions –Approximately 2/3 of the observations will lie within one standard deviation of the mean –Approximately 95% of the observations lie within two standard deviations of the mean –In a histogram of the Normal distribution, the standard deviation is approximately the distance from the mode to the inflection point

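17 [Figure: Normal curve; s is the distance from the mode to the inflection point]

18 [Figure: Normal curve; about 2/3 of the area lies within s of the mean]

19 [Figure: Normal curve; about 95% of the area lies within 2s of the mean]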

20 Example A researcher collected data on 1500 males aged 60-65. The variables measured were cholesterol and blood pressure. –The mean blood pressure was 155 with a standard deviation of 12. –The mean cholesterol level was 230 with a standard deviation of 15. –In both cases the data were normally distributed.

21 Interpretation of these numbers Blood pressure levels vary about the value 155 in males aged 60-65. Cholesterol levels vary about the value 230 in males aged 60-65.

22 Approximately 2/3 of males aged 60-65 have blood pressure within 12 of 155, i.e. between 155 - 12 = 143 and 155 + 12 = 167. Approximately 2/3 of males aged 60-65 have cholesterol within 15 of 230, i.e. between 230 - 15 = 215 and 230 + 15 = 245.

23 Approximately 95% of males aged 60-65 have blood pressure within 2(12) = 24 of 155, i.e. between 155 - 24 = 131 and 155 + 24 = 179. Approximately 95% of males aged 60-65 have cholesterol within 2(15) = 30 of 230, i.e. between 230 - 30 = 200 and 230 + 30 = 260.
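The intervals above are just mean ± k standard deviations. The sketch below computes them and, assuming an exactly Normal distribution, uses `statistics.NormalDist` to show how close the "2/3" and "95%" rules of thumb are to the exact Normal probabilities. The helper names `interval` and `coverage` are illustrative.

```python
from statistics import NormalDist

def interval(mean, sd, k):
    """Return (mean - k*sd, mean + k*sd)."""
    return mean - k * sd, mean + k * sd

def coverage(k):
    """Probability that a Normal observation lies within k sd of its mean."""
    z = NormalDist()  # standard Normal: mean 0, sd 1
    return z.cdf(k) - z.cdf(-k)

print(interval(155, 12, 1), interval(155, 12, 2))   # blood pressure
print(interval(230, 15, 1), interval(230, 15, 2))   # cholesterol
print(round(coverage(1), 4), round(coverage(2), 4))  # about 0.68 and 0.95
```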

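24 A computing formula for the sum of squares of deviations from the mean: $\sum_{i=1}^{n} (x_i - \bar{x})^2$. The difficulty with this formula is that $\bar{x}$ will have many decimals. The result will be that each term in the above sum will also have many decimals.

25 The sum of squares of deviations from the mean can also be computed using the following identity: $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$

26 To use this identity we need to compute: $\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$ and $\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2$

27 Then: $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$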
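For readers who want to see why the identity holds, the following is a standard algebraic derivation (it is not reproduced from the slides). It uses only the definition of the mean, $\sum x_i = n\bar{x}$.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} \left( x_i^2 - 2\bar{x}\,x_i + \bar{x}^2 \right) \\
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2
     \qquad \text{since } \sum_{i=1}^{n} x_i = n\bar{x} \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
   = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}
\end{align*}
```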

28

29 Example The data on Verbal IQ for n = 23 students, arranged in increasing order, are: 80 82 84 86 86 89 90 94 94 95 95 96 99 99 102 102 104 105 105 109 111 118 119

30 $\sum x_i$ = 80 + 82 + 84 + 86 + 86 + 89 + 90 + 94 + 94 + 95 + 95 + 96 + 99 + 99 + 102 + 102 + 104 + 105 + 105 + 109 + 111 + 118 + 119 = 2244 and $\sum x_i^2$ = $80^2 + 82^2 + 84^2 + 86^2 + 86^2 + 89^2 + 90^2 + 94^2 + 94^2 + 95^2 + 95^2 + 96^2 + 99^2 + 99^2 + 102^2 + 102^2 + 104^2 + 105^2 + 105^2 + 109^2 + 111^2 + 118^2 + 119^2$ = 221494

31 Then: $\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 221494 - \frac{2244^2}{23} = 221494 - 218936.348 = 2557.652$. You will obtain exactly the same answer if you use the left hand side of the equation.
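The claim that both sides of the identity give the same answer can be checked numerically. This sketch computes the sum of squared deviations for the Verbal IQ data both from the definition and from the computing formula, then derives s with divisor n - 1; the variable names are illustrative.

```python
import math

verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
             99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]
n = len(verbal_iq)

sum_x = sum(verbal_iq)                   # sum of x_i
sum_x2 = sum(x * x for x in verbal_iq)   # sum of x_i squared
mean = sum_x / n

# Left hand side: definition, using deviations from the mean.
ss_definition = sum((x - mean) ** 2 for x in verbal_iq)
# Right hand side: computing formula, no deviations needed.
ss_shortcut = sum_x2 - sum_x ** 2 / n

s = math.sqrt(ss_shortcut / (n - 1))     # sample standard deviation
print(sum_x, sum_x2, round(ss_definition, 3), round(ss_shortcut, 3), round(s, 3))
```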

32

33

34 A quick (rough) calculation of s: $s \approx \frac{\text{Range}}{4}$. The reason for this is that approximately all (95%) of the observations are between $\bar{x} - 2s$ and $\bar{x} + 2s$. Thus $\max \approx \bar{x} + 2s$ and $\min \approx \bar{x} - 2s$, so Range = max - min $\approx 4s$.

35 Example Verbal IQ on n = 23 students: min = 80 and max = 119, so $s \approx \frac{119 - 80}{4} = \frac{39}{4} = 9.75$. This compares with the exact value of s which is 10.782. The rough method is useful for checking your calculation of s.

36 The Pseudo Standard Deviation (PSD) Definition: The Pseudo Standard Deviation (PSD) is defined by: $\text{PSD} = \frac{\text{IQR}}{1.35}$

37 Properties For Normal distributions the pseudo standard deviation (PSD) and the standard deviation (s) will be approximately the same. For leptokurtic distributions the standard deviation (s) will be larger than the pseudo standard deviation (PSD). For platykurtic distributions the standard deviation (s) will be smaller than the pseudo standard deviation (PSD).

38 Example Verbal IQ on n = 23 students Inter-Quartile Range = IQR = $Q_3 - Q_1$ = 105 - 89 = 16 Pseudo standard deviation: PSD = $\frac{\text{IQR}}{1.35} = \frac{16}{1.35} = 11.85$. This compares with the standard deviation s = 10.782.
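Both quick estimates of s can be sketched in a few lines. The divisor 1.35 is taken from the PSD values tabulated later in the transcript (each equals IQR/1.35); the last lines show, as a supporting observation and assuming an exactly Normal distribution, that the IQR of a Normal distribution is about 1.35 standard deviations. Function names are illustrative.

```python
from statistics import NormalDist

def rough_sd(minimum, maximum):
    """Rough estimate of s: range divided by 4."""
    return (maximum - minimum) / 4

def pseudo_sd(q1, q3):
    """Pseudo standard deviation: IQR divided by 1.35."""
    return (q3 - q1) / 1.35

print(rough_sd(80, 119))             # Verbal IQ: min = 80, max = 119
print(round(pseudo_sd(89, 105), 2))  # Verbal IQ: Q1 = 89, Q3 = 105

# IQR of a standard Normal distribution, in standard-deviation units.
z = NormalDist()
normal_iqr = z.inv_cdf(0.75) - z.inv_cdf(0.25)
print(round(normal_iqr, 3))
```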

39 An outlier is a “wild” observation in the data. Outliers occur because of –errors (typographical and computational) –extreme cases in the population. We will now consider the drawing of box-plots where outliers are identified.

40 Box-whisker Plots showing outliers


42 To draw a box plot we need to: compute the hinge (median, $Q_2$) and the mid-hinges (first and third quartiles, $Q_1$ and $Q_3$). To identify outliers we will compute the inner and outer fences.

43 The fences are like the fences at a prison. We expect the entire population to be within both sets of fences. If a member of the population is between the inner and outer fences it is a mild outlier. If a member of the population is outside of the outer fences it is an extreme outlier.

44 Lower outer fence: $F_1 = Q_1 - 3\,\mathrm{IQR}$ Upper outer fence: $F_2 = Q_3 + 3\,\mathrm{IQR}$

45 Lower inner fence: $f_1 = Q_1 - 1.5\,\mathrm{IQR}$ Upper inner fence: $f_2 = Q_3 + 1.5\,\mathrm{IQR}$

46 Observations that are between the lower and upper inner fences are considered to be non-outliers. Observations that are outside the inner fences but not outside the outer fences are considered to be mild outliers. Observations that are outside the outer fences are considered to be extreme outliers.

47 Mild outliers are plotted individually in a box-plot using one symbol. Extreme outliers are plotted individually in a box-plot using a different symbol. Non-outliers are represented with the box and whiskers, with –Max = largest observation within the inner fences –Min = smallest observation within the inner fences

48 [Figure: box-whisker plot showing the inner fences, the outer fences, mild outliers, an extreme outlier, and the box and whiskers representing the data that are not outliers]
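The fence rules can be sketched as two small functions. The names `fences` and `classify` are illustrative, and treating an observation that lies exactly on a fence as being inside it is an assumption (the slides do not say). The usage lines take $Q_1$ = 12 and $Q_3$ = 66.5 from the infant mortality example later in the transcript.

```python
def fences(q1, q3):
    """Return the inner and outer fences for given quartiles."""
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3 * iqr,
    }

def classify(x, q1, q3):
    """Label x as a non-outlier, mild outlier, or extreme outlier.

    Assumption: a value exactly on a fence counts as inside that fence.
    """
    f = fences(q1, q3)
    if x < f["lower_outer"] or x > f["upper_outer"]:
        return "extreme outlier"
    if x < f["lower_inner"] or x > f["upper_inner"]:
        return "mild outlier"
    return "non-outlier"

# Infant mortality example: Q1 = 12, Q3 = 66.5.
print(fences(12, 66.5))
print(classify(168, 12, 66.5))  # the largest observation (Afghanistan)
print(classify(27, 12, 66.5))   # the median
```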

49 Example Data collected on n = 109 countries in 1995. Data collected on k = 25 variables.

50 The variables 1. Population Size (in 1000s) 2. Density = number of people per square kilometer 3. Urban = percentage of population living in cities 4. Religion 5. lifeexpf = average female life expectancy 6. lifeexpm = average male life expectancy

51 7. literacy = % of population who can read 8. pop_inc = % increase in population size (1995) 9. babymort = infant mortality (deaths per 1000) 10. gdp_cap = gross domestic product per capita 11. Region = region or economic group 12. calories = daily calorie intake 13. aids = number of AIDS cases 14. birth_rt = birth rate per 1000 people

52 15. death_rt = death rate per 1000 people 16. aids_rt = number of AIDS cases per 100000 people 17. log_gdp = $\log_{10}$(gdp_cap) 18. log_aidsr = $\log_{10}$(aids_rt) 19. b_to_d = birth to death ratio 20. fertility = average number of children in family 21. log_pop = $\log_{10}$(population)

53 22. cropgrow = ?? 23. lit_male = % of males who can read 24. lit_fema = % of females who can read 25. Climate = predominant climate

54 The data file as it appears in SPSS

55 Consider the data on infant mortality Stem-Leaf diagram stem = 10s, leaf = unit digit

56 Summary Statistics median = $Q_2$ = 27 Quartiles: lower quartile = $Q_1$ = the median of the lower half = 12; upper quartile = $Q_3$ = the median of the upper half = 66.5 Interquartile range (IQR): IQR = $Q_3 - Q_1$ = 66.5 - 12 = 54.5

57 The Outer Fences lower = $Q_1$ - 3(IQR) = 12 - 3(54.5) = -151.5; upper = $Q_3$ + 3(IQR) = 66.5 + 3(54.5) = 230.0. No observations are outside of the outer fences. The Inner Fences lower = $Q_1$ - 1.5(IQR) = 12 - 1.5(54.5) = -69.75; upper = $Q_3$ + 1.5(IQR) = 66.5 + 1.5(54.5) = 148.25. Only one observation (168, Afghanistan) is outside of the inner fences (a mild outlier).

58 Box-Whisker Plot of Infant Mortality Infant Mortality

59 Example 2 In this example we are looking at the weight gains (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork). – Ten test animals for each diet

60 Table: Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)

    Level           High Protein                Low Protein
    Source       Beef   Cereal    Pork      Beef   Cereal    Pork
    Diet            1        2       3         4        5       6
                   73       98      94        90      107      49
                  102       74      79        76       95      82
                  118       56      96        90       97      73
                  104      111      98        64       80      86
                   81       95     102        86       98      81
                  107       88     102        51       74      97
                  100       82     108        72       74     106
                   87       77      91        90       67      70
                  117       86     120        95       89      61
                  111       92     105        78       58      82
    Median      103.0     87.0   100.0      82.0     84.5    81.5
    Mean        100.0     85.9    99.5      79.2     83.9    78.7
    IQR          24.0     18.0    11.0      18.0     23.0    16.0
    PSD         17.78    13.33    8.15     13.33    17.04   11.85
    Variance   229.11   225.66  119.17    192.84   246.77  273.79
    Std. Dev.   15.14    15.02   10.92     13.89    15.71   16.55
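The summary rows of the table can be recomputed from the raw weight gains. This is a sketch under two assumptions: quartiles are found by the median-of-halves rule and PSD = IQR/1.35; both reproduce the tabulated IQR values. The dictionary keys and the helper name `iqr` are illustrative.

```python
from statistics import mean, median, stdev, variance

diets = {
    "High/Beef":   [73, 102, 118, 104, 81, 107, 100, 87, 117, 111],
    "High/Cereal": [98, 74, 56, 111, 95, 88, 82, 77, 86, 92],
    "High/Pork":   [94, 79, 96, 98, 102, 102, 108, 91, 120, 105],
    "Low/Beef":    [90, 76, 90, 64, 86, 51, 72, 90, 95, 78],
    "Low/Cereal":  [107, 95, 97, 80, 98, 74, 74, 67, 89, 58],
    "Low/Pork":    [49, 82, 73, 86, 81, 97, 106, 70, 61, 82],
}

def iqr(values):
    """IQR by the median-of-halves rule (middle value dropped if n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(upper) - median(lower)

summary = {}
for name, gains in diets.items():
    summary[name] = {
        "median": median(gains),
        "mean": round(mean(gains), 1),
        "iqr": iqr(gains),
        "psd": round(iqr(gains) / 1.35, 2),
        "variance": round(variance(gains), 2),  # divisor n - 1
        "sd": round(stdev(gains), 2),
    }
    print(name, summary[name])
```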

61 [Figure: weight gains by level of protein (High Protein, Low Protein) and source of protein (Beef, Cereal, Pork)]

62 Conclusions Weight gain is higher for the high protein meat diets. Increasing the level of protein increases weight gain, but only if the source of protein is a meat source.

63 Measures of Shape

64 Skewness: positively skewed, symmetric, negatively skewed. Kurtosis: platykurtic, Normal (mesokurtic), leptokurtic.

65 Measure of Skewness – based on the sum of cubes. Measure of Kurtosis – based on the sum of 4th powers.

66 The Measure of Skewness

67 The Measure of Kurtosis The 3 is subtracted so that g 2 is zero for the normal distribution
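The exact formulas for $g_1$ and $g_2$ did not survive in this transcript. The sketch below therefore uses one common moment-based convention as an assumption: $g_1 = m_3 / m_2^{3/2}$ and $g_2 = m_4 / m_2^2 - 3$, where $m_k$ is the average of the k-th powers of the deviations from the mean (divisor n). It is consistent with the slides in that $g_1$ is built from cubes, $g_2$ from 4th powers, and 3 is subtracted so that $g_2$ is zero for the Normal distribution; the original slides may use a different divisor.

```python
def shape_measures(values):
    """Return (g1, g2): moment-based skewness and excess kurtosis.

    Assumption: central moments use divisor n; other conventions exist.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n   # sum of cubes
    m4 = sum((x - mean) ** 4 for x in values) / n   # sum of 4th powers
    g1 = m3 / m2 ** 1.5
    g2 = m4 / m2 ** 2 - 3
    return g1, g2

print(shape_measures([1, 2, 3, 4, 5]))   # symmetric, flat: g1 = 0, g2 < 0
print(shape_measures([1, 1, 1, 10]))     # long right tail: g1 > 0
```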

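68 Interpretations of Measures of Shape Skewness: $g_1 > 0$ (positively skewed), $g_1 = 0$ (symmetric), $g_1 < 0$ (negatively skewed). Kurtosis: $g_2 < 0$ (platykurtic), $g_2 = 0$ (Normal, mesokurtic), $g_2 > 0$ (leptokurtic).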

69 Descriptive techniques for Multivariate data In most research situations data is collected on more than one variable (usually many variables)

70 Graphical Techniques The scatter plot The two dimensional Histogram

71 The Scatter Plot For two variables X and Y we will have a measurement for each variable on each case: $(x_i, y_i)$, where $x_i$ = the value of X for case i and $y_i$ = the value of Y for case i.

72 To construct a scatter plot we plot the points $(x_i, y_i)$ for each case on the X-Y plane. [Figure: a single point $(x_i, y_i)$ plotted at horizontal position $x_i$ and vertical position $y_i$]

73 Data Set #3 The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program.

    Student  Verbal IQ  Math IQ  Initial Reading  Final Reading
                                 Achievement      Achievement
     1          86        94        1.1              1.7
     2         104       103        1.5              1.7
     3          86        92        1.5              1.9
     4         105       100        2.0              2.0
     5         118       115        1.9              3.5
     6          96       102        1.4              2.4
     7          90        87        1.5              1.8
     8          95       100        1.4              2.0
     9         105        96        1.7              1.7
    10          84        80        1.6              1.7
    11          94        87        1.6              1.7
    12         119       116        1.7              3.1
    13          82        91        1.2              1.8
    14          80        93        1.0              1.7
    15         109       124        1.8              2.5
    16         111       119        1.4              3.0
    17          89        94        1.6              1.8
    18          99       117        1.6              2.6
    19          94        93        1.4              1.4
    20          99       110        1.4              2.0
    21          95        97        1.5              1.3
    22         102       104        1.7              3.1
    23         102        93        1.6              1.9

74

75 (84,80)

76

77 Some Scatter Patterns

78

79

80 Circular No relationship between X and Y Unable to predict Y from X

81

82

83 Ellipsoidal Positive relationship between X and Y Increases in X correspond to increases in Y (but not always) Major axis of the ellipse has positive slope

84

85 Example Verbal IQ, MathIQ

86

87 Some More Patterns

88

89

90 Ellipsoidal (thinner ellipse) Stronger positive relationship between X and Y Increases in X correspond to increases in Y (more frequently) Major axis of the ellipse has positive slope Minor axis of the ellipse much smaller

91

92 Increased strength in the positive relationship between X and Y Increases in X correspond to increases in Y (almost always) Minor axis of the ellipse extremely small in relationship to the Major axis of the ellipse.

93

94

95 Perfect positive relationship between X and Y Y perfectly predictable from X Data falls exactly along a straight line with positive slope

96

97

98 Ellipsoidal Negative relationship between X and Y Increases in X correspond to decreases in Y (but not always) Major axis of the ellipse has negative slope

99

100 The strength of the relationship can increase until changes in Y can be perfectly predicted from X

101

102

103

104

105

106 Some Non-Linear Patterns

107

108

109 In a linear pattern Y increases with respect to X at a constant rate. In a non-linear pattern the rate at which Y increases with respect to X is variable.

110 Growth Patterns

111

112

113 Growth patterns frequently follow a sigmoid curve: growth at the start is slow, it then speeds up, and it slows down again as it reaches its limiting size.

114 Measures of strength of a relationship (Correlation) Pearson’s correlation coefficient (r) Spearman’s rank correlation coefficient (rho, $\rho$)

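115 Assume that we have collected data on two variables X and Y. Let $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

116 From this data we can compute summary statistics for each variable. The means $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

117 The standard deviations $s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$ and $s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$

118 These statistics give information for each variable separately but give no information about the relationship between the two variables.

119 Consider the statistics: $\sum_{i=1}^{n} (x_i - \bar{x})^2$, $\sum_{i=1}^{n} (y_i - \bar{y})^2$ and $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

120 The first two statistics, $\sum (x_i - \bar{x})^2$ and $\sum (y_i - \bar{y})^2$, are used to measure variability in each variable; they are used to compute the sample standard deviations $s_x$ and $s_y$.

121 The third statistic, $\sum (x_i - \bar{x})(y_i - \bar{y})$, is used to measure correlation. If two variables are positively related, the sign of $x_i - \bar{x}$ will agree with the sign of $y_i - \bar{y}$.

122 When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will be positive: when $x_i$ is above its mean, $y_i$ will be above its mean. When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will be negative: when $x_i$ is below its mean, $y_i$ will be below its mean. The product $(x_i - \bar{x})(y_i - \bar{y})$ will be positive for most cases.

123 This implies that the statistic $\sum (x_i - \bar{x})(y_i - \bar{y})$ will be positive: most of the terms in this sum will be positive.

124 On the other hand, if two variables are negatively related, the sign of $x_i - \bar{x}$ will be opposite to the sign of $y_i - \bar{y}$.

125 When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will be negative: when $x_i$ is above its mean, $y_i$ will be below its mean. When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will be positive: when $x_i$ is below its mean, $y_i$ will be above its mean. The product $(x_i - \bar{x})(y_i - \bar{y})$ will be negative for most cases.

126 This again implies that the statistic $\sum (x_i - \bar{x})(y_i - \bar{y})$ will be negative: most of the terms in this sum will be negative.

127 Pearson's correlation coefficient is defined as below: $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$

128 The denominator, $\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}$, is always positive.

129 The numerator, $\sum (x_i - \bar{x})(y_i - \bar{y})$, is positive if there is a positive relationship between X and Y and negative if there is a negative relationship between X and Y. This property carries over to Pearson’s correlation coefficient r.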

130 Properties of Pearson’s correlation coefficient r 1. The value of r is always between -1 and +1. 2. If the relationship between X and Y is positive, then r will be positive. 3. If the relationship between X and Y is negative, then r will be negative. 4. If there is no linear relationship between X and Y, then r will be zero (or close to zero in a sample). 5. The value of r will be +1 if the points $(x_i, y_i)$ lie on a straight line with positive slope. 6. The value of r will be -1 if the points $(x_i, y_i)$ lie on a straight line with negative slope.
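The definition of r and a few of its properties can be illustrated with a minimal sketch. The function name `pearson_r` and the small data sets are made up for illustration; the last example shows that r measures linear association only, since y = x² is perfectly predictable from x yet gives r = 0 on symmetric x values.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from sums of products of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # straight line, positive slope
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # straight line, negative slope
print(pearson_r([-1, 0, 1], [1, 0, 1]))        # y = x^2: no linear relationship
```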

131 r =1

132 r = 0.95

133 r = 0.7

134 r = 0.4

135 r = 0

136 r = -0.4

137 r = -0.7

138 r = -0.8

139 r = -0.95

140 r = -1

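141 Computing formulae for the statistics: $\sum (x_i - \bar{x})^2$, $\sum (y_i - \bar{y})^2$ and $\sum (x_i - \bar{x})(y_i - \bar{y})$

142

143 To compute these, first compute $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum y_i^2$ and $\sum x_i y_i$. Then $\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$, $\sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$ and $\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}$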

144 Example Verbal IQ, MathIQ


146

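147 Now, with x = Verbal IQ and y = Math IQ, $\sum x_i = 2244$, $\sum y_i = 2307$, $\sum x_i^2 = 221494$, $\sum y_i^2 = 234363$ and $\sum x_i y_i = 227199$. Hence $\sum (x_i - \bar{x})^2 = 221494 - \frac{2244^2}{23} = 2557.652$, $\sum (y_i - \bar{y})^2 = 234363 - \frac{2307^2}{23} = 2960.870$ and $\sum (x_i - \bar{x})(y_i - \bar{y}) = 227199 - \frac{(2244)(2307)}{23} = 2116.043$

148 Thus Pearson's correlation coefficient is: $r = \frac{2116.043}{\sqrt{(2557.652)(2960.870)}} = 0.769$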
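The correlation between Verbal IQ and Math IQ can be recomputed from Data Set #3 using the raw sums. This is a sketch assuming the usual computing formulae (each sum of squares or products of deviations equals the raw sum minus the product of totals divided by n); the variable names are illustrative.

```python
import math

# Data Set #3, students 1 to 23.
verbal = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
          82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]
n = len(verbal)

# Raw sums needed by the computing formulae.
sum_x, sum_y = sum(verbal), sum(math_iq)
sum_x2 = sum(x * x for x in verbal)
sum_y2 = sum(y * y for y in math_iq)
sum_xy = sum(x * y for x, y in zip(verbal, math_iq))

# Sums of squares and products of deviations from the means.
sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n
sxy = sum_xy - sum_x * sum_y / n

r = sxy / math.sqrt(sxx * syy)
print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(sxx, 3), round(syy, 3), round(sxy, 3), round(r, 3))
```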

