
1 8/9/2015 Slide 1 The standard deviation statistic is challenging to present to our audiences. Statisticians often resort to the “empirical rule” to describe the standard deviation. The empirical rule states that approximately 68% of the cases in a normal distribution fall within one standard deviation of the mean, 95% of the cases in a normal distribution fall within two standard deviations of the mean, and 99.7% of all cases fall within three standard deviations of the mean. While distributions of real quantitative variables are usually not normal, the empirical rule has been demonstrated to be applicable if the distribution is “nearly normal.” The determination that a variable is “nearly normal” requires us to propose a set of criteria for determining the boundary between “nearly normal” and “not nearly normal.”

2 Slide 2 Like all of the criteria that we use in statistics, we will propose a criterion, recognize that it is really an approximation rather than a precise estimate, and hope that common sense will prevail in applying it. We have previously identified the criteria we will use for assessing the “nearly normal” condition: skewness, kurtosis, and extreme outliers. We will use our previous requirements for skewness and kurtosis (both between -1.0 and +1.0), but we will define outliers as cases that are three or more standard deviations from the mean (either above or below). The last criterion is derived from the empirical rule: if 99.7% of the cases in a normal distribution fall within three standard deviations of the mean, then those that fall outside three standard deviations must be relatively uncommon.

3 Slide 3 The requirement to compare the scores in a distribution to the mean plus or minus three standard deviations could lead to a lot of tedious arithmetic. Fortunately, there is a relatively easy substitute: converting the values in the distribution to “standard scores.” A standard score expresses a value from any distribution as the distance between that value and the mean of the distribution, measured in standard deviation units. Standardizing variables gives them a common unit of measure that makes it easy to compare scores across quantitative variables. For example, if I converted a student’s GRE score (e.g., 1100) and GPA (e.g., 3.78) to standard scores, I would know which was further above the mean for all students, and thus the stronger indicator of academic potential.
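
As a rough illustration of standard scores, the sketch below uses only Python's standard library. The five values are made up for the example, and the helper name `standard_scores` is hypothetical. It divides by the sample standard deviation (n - 1 denominator), which is assumed here to be the convention behind the SPSS z-scores.

```python
from statistics import mean, stdev

def standard_scores(values):
    """Return each value's distance from the mean in standard deviation units."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator), an assumption
    return [(v - m) / s for v in values]

# Hypothetical scores, not taken from the presentation's data set.
scores = [2.0, 4.0, 4.0, 5.0, 10.0]
z = standard_scores(scores)
print([round(v, 2) for v in z])
```

Standard scores always have a mean of 0 and a standard deviation of 1, which is what makes them comparable across variables measured in different units.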

4 Slide 4 SPSS will automatically convert any distribution to standard scores (also referred to as z-scores), and we can use the same formula over and over to identify outliers. Many procedures use standardized scores to present findings or diagnostics, e.g., we will analyze standardized residuals in regression analysis. If the original variable does not satisfy the “nearly normal condition,” we will re-express the data values as logarithms or squares to see if we can induce normality. If the transformation is successful at meeting the criteria for a nearly normal distribution, we will calculate the percentage of cases falling within 1 and 2 standard deviations of the mean and compare our findings to the percentage prescribed by the empirical rule.

5 Slide 5 In these problems, we will base our assessment of normality on broader criteria than we have used previously. Since we are concerned with determining probabilities or percentages based on the normal distribution, we are concerned with kurtosis as well as skewness. The height of the distribution as measured by kurtosis has an impact on the standard deviation, which in turn has an impact on the percentage of cases within one standard deviation of the mean and within two standard deviations of the mean. In the last assignment, we used a boxplot strategy to identify outliers. In this assignment, we will define outliers as cases falling outside three standard deviations of the mean.

6 Slide 6 SOLVING HOMEWORK PROBLEMS The Empirical Rule states that about 68% of the values will fall within 1 standard deviation of the mean and 95% of the values will fall within 2 standard deviations of the mean, provided the variable satisfies the nearly normal condition that the distribution is unimodal and symmetric. There are numerous statistical tests and graphic methods for evaluating the normality of a distribution. In these problems, we will use a simple rule of thumb: the distribution of the variable is reasonably normal if both the skewness and the kurtosis of the distribution are between -1.0 and +1.0 and there are no outliers, i.e., no cases with standard scores less than or equal to -3.0 or greater than or equal to +3.0.
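
The rule of thumb can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the sample data are invented, and the skewness and kurtosis use the common bias-adjusted sample formulas, which are assumed (not confirmed by the slides) to correspond to the values SPSS reports. Treating the limits of -1.0 and +1.0 as inclusive is also an assumption.

```python
from statistics import mean, stdev

def skewness(values):
    """Bias-adjusted sample skewness (0 for a symmetric distribution)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)

def excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourth = sum(((v - m) / s) ** 4 for v in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

def is_nearly_normal(values):
    """Rule of thumb: skewness and kurtosis within [-1, +1] and no |z| >= 3."""
    m, s = mean(values), stdev(values)
    has_outlier = any(abs((v - m) / s) >= 3.0 for v in values)
    return (-1.0 <= skewness(values) <= 1.0
            and -1.0 <= excess_kurtosis(values) <= 1.0
            and not has_outlier)

# Hypothetical, roughly bell-shaped sample.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(is_nearly_normal(sample))
```

Keeping the three checks in one function mirrors the way the problems treat them as a single condition: failing any one of them means the nearly normal condition is not met.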

7 Slide 7 If the distribution satisfies the nearly normal condition, we will test whether or not the percentages specified by the empirical rule hold for the variable. We will consider the rule to be satisfied if the actual percentage of values falls within 2% of the percentage indicated by the empirical rule. If the distribution does not satisfy the nearly normal condition, we will re-express the values and examine the impact on the normality assumption: if the variable is skewed to the right, we will compute the logarithm of the values; if the variable is negatively skewed, we will square the values. If the transformation is successful at meeting the criteria for a nearly normal distribution, we will calculate the percentage of cases falling within 1 and 2 standard deviations of the mean and compare the actual percentage to the percentage prescribed by the empirical rule.
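
The choice of re-expression can be written as a small decision function. This is a minimal sketch: `reexpress` is a hypothetical name, the input values are invented, and returning the values unchanged when the skewness is exactly zero is an assumption, since the slides only cover the positive and negative cases.

```python
import math

def reexpress(values, skew):
    """Pick the re-expression from the sign of the skewness."""
    if skew > 0:
        # Skewed to the right: base-10 logarithm (requires positive values).
        return "log", [math.log10(v) for v in values]
    if skew < 0:
        # Negatively skewed: square the values.
        return "square", [v ** 2 for v in values]
    # Assumption: no re-expression when the skewness is exactly zero.
    return "none", list(values)

# Hypothetical right-skewed values with the skewness reported for pop.
kind, transformed = reexpress([10, 100, 1000], 11.71)
print(kind, transformed)
```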

8 Slide 8 The introductory statement in the question indicates: the data set to use (2001WorldFactBook); the task to accomplish (verifying the empirical rule for a normally distributed variable); and the variable to use in the analysis, population [pop].

9 Slide 9 These problems also contain a second paragraph of instructions that provides the formulas to use if the analysis requires us to re-express or transform the variable to achieve normality, and the formula to restore the transformed values to the original scale.
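
The transcript does not reproduce the restoring formulas themselves. As a sketch under that caveat, the inverse of a base-10 logarithm is assumed to be raising 10 to the transformed value, and the inverse of squaring is assumed to be the square root (valid for non-negative originals). The function names are hypothetical; the population figure is the one quoted for China later in the presentation.

```python
import math

def restore_from_log(lg_value):
    """Undo a base-10 log transformation."""
    return 10 ** lg_value

def restore_from_square(sq_value):
    """Undo a square transformation (assumes the original value was >= 0)."""
    return math.sqrt(sq_value)

population = 1273111290          # China, as quoted in the presentation
lg = math.log10(population)      # value on the log scale
print(lg, restore_from_log(lg))  # round trip, up to floating-point error
```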

10 Slide 10 The first statement concerns the number of valid cases. To answer this question, we produce the descriptive statistics using the SPSS Descriptives procedure. The Descriptives procedure can also save standard scores for the variable, which will facilitate our check of the empirical rule.

11 Slide 11 To compute the descriptive statistics and standard scores, select the Descriptive Statistics > Descriptives command from the Analyze menu.

12 Slide 12 Move the variable for the analysis, pop, to the Variable(s) list box. Click on the Options button to select optional statistics.

13 Slide 13 The check boxes for Mean and Std. Deviation are already marked by default. Mark the Kurtosis and Skewness check boxes; these provide the statistics for assessing normality. Then click on the Continue button to close the dialog box.

14 Slide 14 Mark the check box Save standardized values as variables. Then click on the OK button to produce the output.

15 Slide 15 If we scroll the Data View all the way to the right, we see that SPSS has created the standard scores. To name the new variable, SPSS prepends the letter “Z” to the variable name.

16 Slide 16 In the output table for Descriptive Statistics, the number of valid cases for population is 218. If we had more than one variable in the table, the Valid N (listwise) row would tell us the number of cases that are not missing data for any of the variables in the table. SPSS does not tell us the number of cases that are missing data in this table. To get the number missing, we would have to compare the number of cases in the data set to the N for population.
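
The subtraction described here is simple enough to sketch directly. The column below is invented, `None` stands in for a missing value, and `valid_and_missing` is a hypothetical helper name.

```python
def valid_and_missing(values):
    """Return (valid N, number missing) for one column; None marks a missing case."""
    valid = sum(1 for v in values if v is not None)
    return valid, len(values) - valid

# Hypothetical column with two missing cases.
pop_column = [1200.0, None, 56000.0, 3400.0, None]
print(valid_and_missing(pop_column))
```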

17 Slide 17 The 'Descriptive Statistics' table in the SPSS output showed the number of cases for the variable "population" [pop] to be 218. Click on the check box to mark the statement as correct.

18 Slide 18 The next statement requires us to check the evidence for meeting the “nearly normal condition”: skewness between -1.0 and +1.0; kurtosis between -1.0 and +1.0; and no outliers with standard scores less than or equal to -3.0 or greater than or equal to +3.0.

19 Slide 19 "Population" [pop] did not satisfy the criteria for a normal distribution. Both the skewness (11.71) and kurtosis (155.82) fell outside the range from -1.0 to +1.0.

20 Slide 20 Though we know that we do not satisfy the “nearly normal condition,” we will still do the check for outliers. Click the right mouse button on the column header for Zpop, and select Sort Ascending from the pop-up menu. This will show any negative outliers at the top of the column.

21 Slide 21 At the top of the column, we do not see any negative values less than or equal to -3.0.

22 Slide 22 Click the right mouse button again on the column header for Zpop, and select Sort Descending from the pop-up menu. This will show any positive outliers at the top of the column.

23 Slide 23 At the top of the column, we see one positive value (13.52) greater than or equal to +3.0.

24 Slide 24 If we scroll back to the left, we see that the outlier for population was China, with a population of 1,273,111,290.

25 Slide 25 "Population" [pop] did not satisfy the criteria for a normal distribution. Both the skewness (11.71) and kurtosis (155.82) fell outside the range from -1.0 to +1.0. There was one outlier that had a standard score less than or equal to -3.0 or greater than or equal to +3.0: China, with a value of 1,273,111,290 (z = 13.52). We do not mark the check box for the nearly normal condition.
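
Sorting the standard scores and reading off the extremes can be approximated in code. In this sketch, `find_outliers` is a hypothetical helper; the z-score for China is the one reported above, while the other two entries are invented placeholders.

```python
def find_outliers(z_by_case, cutoff=3.0):
    """Return (case, z) pairs with |z| >= cutoff, largest z first."""
    flagged = [(name, z) for name, z in z_by_case.items() if abs(z) >= cutoff]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

z_scores = {
    "China": 13.52,       # reported in the presentation
    "Country A": 0.12,    # hypothetical
    "Country B": -0.27,   # hypothetical
}
print(find_outliers(z_scores))
```

Using `>=` on the absolute value matches the "less than or equal to -3.0 or greater than or equal to +3.0" wording, so a score of exactly 3.0 counts as an outlier.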

26 Slide 26 The next pair of statements asks us about two possibilities for re-expressing the values to see if the transformed distribution satisfies the nearly normal condition. If the skewness of the distribution of the variable is positive, we test the log transformation. If the skewness is negative, we test the square transformation. In this problem, the skewness was 11.71, so we use the logarithmic transformation.

27 Slide 27 The formula for transforming pop to LG_pop is provided in the second paragraph of instructions: LG10(pop).

28 Slide 28 To compute the transformed variable, select the Compute command from the Transform menu.

29 Slide 29 In the Compute Variable dialog box, we type the name for the new variable, LG_pop, in the Target Variable text box. In the Numeric Expression text box, type the formula as shown to compute base 10 logarithms of the values of pop. My convention for naming transformed variables is to add the variable name to the letters LG_ for a log transformation and SQ_ for a square transformation. This helps me keep the relationship between the variables clear. Click on the OK button to compute the transformed variable.

30 Slide 30 Scroll the data editor window to the right to see the transformed variable, LG_pop.

31 Slide 31 To calculate the descriptive statistics so we can check the normality conditions for the transformed variable, click on the Dialog Recall tool button, and select Descriptives.

32 Slide 32 Since we want the same statistics that we computed for pop, we only need to replace the variable pop with LG_pop. Be sure the check box for saving standardized values remains checked so that Descriptives will compute standard scores for LG_pop. Then click on the OK button to produce the output.

33 Slide 33 The log transformation of "population" [LG_pop] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.50) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.41) was between -1.0 and +1.0. Next, we will check for outliers that had a standard score less than or equal to -3.0 or greater than or equal to +3.0.

34 Slide 34 The Descriptives procedure adds ZLG_pop to the data set. When we sort ZLG_pop in ascending order, we see that there are no outliers with standard scores less than or equal to -3.0.

35 Slide 35 When we sort ZLG_pop in descending order, we see that there are no outliers with standard scores greater than or equal to +3.0.

36 Slide 36 The log transformation of "population" [LG_pop] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.50) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.41) was between -1.0 and +1.0. There were no outliers that had a standard score less than or equal to -3.0 or greater than or equal to +3.0. The log distribution satisfies the nearly normal condition so we mark the check box.

37 Slide 37 The final pair of questions in the problem focuses on verifying whether or not percentages based on the distribution of the log transformed variable agree with the percentages specified in the empirical rule.

38 Slide 38 We will create a new variable that will have a value of 1 if the standard score is within 1 standard deviation of the mean, and 0 if it has a value outside this range. To compute the new variable, select the Compute command from the Transform menu.

39 Slide 39 We will name the new variable within1sd, selecting a name which describes its contents. Type the formula as shown in the Numeric Expression text box. The formula will assign within1sd a value of 1 if the standard score for the log transformation of population is greater than or equal to -1.0 and less than or equal to +1.0. If the value is not between -1.0 and +1.0, within1sd will be assigned a 0.
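
The SPSS formula itself appears only on the slide image, so the following is a stand-in sketch of the same logic rather than the original expression. The function name `within_k_sd` and its `k` parameter are hypothetical; with `k=2.0` the same function covers the two-standard-deviation indicator.

```python
def within_k_sd(z, k=1.0):
    """Return 1 if the standard score lies in [-k, +k], otherwise 0."""
    return 1 if -k <= z <= k else 0

# Hypothetical standard scores.
z_values = [-1.4, -1.0, 0.3, 1.0, 2.5]
within1sd = [within_k_sd(z) for z in z_values]
print(within1sd)
```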

40 Slide 40 Scroll down in data view to see values of 0 and 1 for within1sd. When the standard score for LG_pop is larger than 1.0 or smaller than -1.0, within1sd is assigned the value of 0. When the standard score is between -1.0 and +1.0, inclusive, within1sd is assigned the value of 1.

41 Slide 41 To find the percentage of cases that have a standard score between -1.0 and +1.0 (within1sd = 1), we will run a frequency distribution on within1sd. To create the frequency distribution, select Descriptive Statistics > Frequencies from the Analyze menu.

42 Slide 42 First, move the variable within1sd to the Variable(s) list box. Second, click on the OK button to produce the output.

43 Slide 43 66.5% of the values fall within one standard deviation of the mean. If we use 2% as the margin of error, 66.5% is within 2% of the 68% prescribed by the empirical rule.
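
A frequency distribution of a 0/1 indicator reduces to counts and percentages. The sketch below imitates that table with Python's `collections.Counter`; the indicator values are invented and `frequency_table` is a hypothetical name.

```python
from collections import Counter

def frequency_table(values):
    """Map each distinct value to (count, percent of all cases)."""
    counts = Counter(values)
    total = len(values)
    return {value: (count, 100.0 * count / total)
            for value, count in sorted(counts.items())}

# Hypothetical indicator: 6 of 8 cases fall within one standard deviation.
indicator = [1, 1, 0, 1, 0, 1, 1, 1]
print(frequency_table(indicator))
```

The percentage reported for the value 1 is the figure that gets compared with the empirical rule.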

44 Slide 44 We will create a second new variable that will have a value of 1 if the standard score is within 2 standard deviations of the mean, and 0 if it has a value outside this range. To compute the new variable, click on the Dialog Recall tool button, and select Compute Variable.

45 Slide 45 Replace the variable name “within1sd” with the name “within2sd”. Replace the limit of -1.0 with -2.0 and the limit of +1.0 with +2.0.

46 Slide 46 Scroll down in data view to see values of 0 and 1 for within2sd. When the standard score for LG_pop is larger than 2.0 or smaller than -2.0, within2sd is assigned the value of 0. When the standard score is between -2.0 and +2.0, inclusive, within2sd is assigned the value of 1.

47 Slide 47 We will request a second frequency distribution to tally within2sd. To request the frequency distribution, click on the Dialog Recall tool button, and select Frequencies.

48 Slide 48 First, remove the variable within1sd from the Variable(s) list box and move the variable within2sd into the list box. Second, click on the OK button to produce the output.

49 Slide 49 95.9% of the values fall within two standard deviations of the mean. If we use 2% as the margin of error, 95.9% is within 2% of the 95% prescribed by the empirical rule.

50 Slide 50 The actual percentage of the values of ZLG_pop between -1.0 and +1.0 was 66.5%, which is within 2% of 68%. The actual percentage of the values of ZLG_pop between -2.0 and +2.0 was 95.9%, which is within 2% of 95%. The statement that "the actual percentage of cases within one standard deviation of the mean was close to the percentage predicted by the empirical rule" is correct. The statement that "the actual percentage of cases within two standard deviations of the mean was close to the percentage predicted by the empirical rule" is correct. We mark both of the check boxes.
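
The final comparison is plain arithmetic, sketched below with the percentages reported above. The function name `close_to_rule` is hypothetical, and treating a difference of exactly 2 points as acceptable is an assumption.

```python
def close_to_rule(actual_pct, expected_pct, tolerance=2.0):
    """True if the actual percentage is within `tolerance` points of the expected one."""
    return abs(actual_pct - expected_pct) <= tolerance

# Percentages reported for ZLG_pop in the presentation.
print(close_to_rule(66.5, 68.0))  # within one standard deviation
print(close_to_rule(95.9, 95.0))  # within two standard deviations
```

The differences are 1.5 and 0.9 percentage points, so both checks pass under the 2% margin.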

51 Slide 51 In solving this problem, we created five variables in the data set: Zpop, LG_pop, ZLG_pop, within1sd, and within2sd. Since subsequent problems will create additional variables, I suggest you delete the created variables at the conclusion of each problem by selecting the columns in the data set and using the Clear command in the Edit menu.

