Intro to Statistics for the Behavioral Sciences PSYC 1900 Lecture 4: The Normal Distribution and Z-Scores
Quick Review of Box-and-Whisker Plots
  First find the median location and the median (mdn)
  Find the quartile locations
  Medians of the upper and lower halves of the distribution
  Quartile location = (mdn location + 1) / 2
  These are termed the “hinges”
  Note: drop fractional values of mdn location
  Hinges bracket the interquartile range (IQR)
  Hinges serve as top and bottom of the box
Box-and-Whisker Plots
  Find the H-spread
  Range between the two quartiles
  Simply the IQR
  Area inside the box in the plot
  Draw the whiskers
  Lines from the hinges to the farthest points lying no more than 1.5 × H-spread from the hinges
  Outliers
  Points beyond the whiskers
  Denoted with asterisks
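The box-and-whisker steps can be sketched in Python. This is only an illustration under stated assumptions: the median location is taken as (N + 1) / 2, a location ending in .5 averages the two neighboring values, and the scores are hypothetical rather than the lecture's data. The helper names `value_at` and `box_and_whisker` are invented for this sketch.

```python
import math

def value_at(sorted_vals, location):
    """Return the value at a 1-based location; a .5 location averages its two neighbors."""
    lower = sorted_vals[math.floor(location) - 1]
    upper = sorted_vals[math.ceil(location) - 1]
    return (lower + upper) / 2

def box_and_whisker(scores):
    xs = sorted(scores)
    mdn_location = (len(xs) + 1) / 2  # assumed convention: (N + 1) / 2
    mdn = value_at(xs, mdn_location)
    # Drop any fractional part of the median location before locating the hinges.
    q_location = (math.floor(mdn_location) + 1) / 2
    lower_hinge = value_at(xs, q_location)        # counted up from the bottom
    upper_hinge = value_at(xs[::-1], q_location)  # counted down from the top
    h_spread = upper_hinge - lower_hinge
    # Whiskers stop at the farthest points within 1.5 H-spreads of each hinge.
    low_limit = lower_hinge - 1.5 * h_spread
    high_limit = upper_hinge + 1.5 * h_spread
    inside = [x for x in xs if low_limit <= x <= high_limit]
    return {
        "median": mdn,
        "hinges": (lower_hinge, upper_hinge),
        "h_spread": h_spread,
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in xs if x < low_limit or x > high_limit],
    }

scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]  # hypothetical data
summary = box_and_whisker(scores)
print(summary)
```

With these scores the hinges are 2 and 4, so the H-spread is 2 and any point more than 3 units beyond a hinge (here, 12) is plotted as an outlier.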
Stem-and-Leaf Plot
  [Figure: stem-and-leaf display with columns Frequency and Stem & Leaf; values >= 12 listed as extremes; each leaf represents 1 case.]
Outlier Detection
  One rule of thumb is to classify points as outliers if they are beyond 3 sd’s from the mean.
  As we’ll see later in this lecture, that implies that they are very rare occurrences.
  One problem: the presence of an outlier inflates the standard deviation.
  Box-and-whisker plot outlier detection is not influenced by this issue.
  H-spread “trims” off the influence of extreme points.
Descriptives With and Without “Outlier”
  If the point is included and allowed to inflate the variance, the 3-sd rule will not classify it as an outlier.
  If the point is left out when the variance is computed, it will.
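A small numeric sketch of this effect, using hypothetical scores (not the lecture's data set) and the sample standard deviation from Python's `statistics` module. The suspect point is scored against the 3-sd rule twice: once with itself included in the mean and sd, and once with itself left out.

```python
from statistics import mean, stdev

scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]  # hypothetical data
suspect = 12

# Mean and sd computed with the suspect point included.
z_with = (suspect - mean(scores)) / stdev(scores)

# Mean and sd computed with the suspect point left out.
trimmed = [x for x in scores if x != suspect]
z_without = (suspect - mean(trimmed)) / stdev(trimmed)

print(round(z_with, 2), round(z_without, 2))
```

Included, the point sits fewer than 3 sd's from the mean because it has inflated the sd; excluded, it sits far more than 3 sd's away.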
Boxplots to Compare Groups
  Useful in providing a quick visual check on group distributions in an experiment.
  Mean = 3 in all groups
The Normal Distribution
  A specific distribution characterized by a bell-shaped form
  Much used to calculate probabilities of scores on variables
What’s So Useful About Distributions?
  Distributions specify the way scores deviate around a measure of central tendency.
  In so doing, they allow us to calculate the probabilities of specific values occurring.
Pie Chart
  An example for a nominal scale
  Areas “under the curve” provide information on probabilities
  Most criminals are on probation
  There is a 70% chance (probability .70) that a criminal is on probation or in jail
More on Distributions & Prob
  Same “adding” of areas under the curve holds for histograms
  If 64 of 289 cases occur within an interval of interest:
  22% of cases have this “score”
  Probability of any selected case having this score is .22
  Integrating the area under the curve provides a probability estimate
Normal Distribution
  For continuous variables, we simply connect the “tops” of the bars to form a curve.
  Abscissa: horizontal axis
  Ordinate: vertical axis
  Density: height of the curve at a value of X
Normal Distribution
  Mathematically defined as:
  $f(X) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(X-\mu)^2}{2\sigma^2}}$, where $\mu$ is the mean and $\sigma$ is the sd
  Pi and e are constants (approximately 3.14 and 2.72)
  When the mean and sd are calculated, the distribution can be drawn and densities at any given points determined.
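As a hedged sketch, the normal density can be written directly in Python: the height of the curve at X depends only on the mean, the sd, and the constants pi and e. The function name `normal_density` is chosen for this illustration, and the mean of 50 and sd of 10 are example values.

```python
import math

def normal_density(x, mean, sd):
    """Height of the normal curve at x for a given mean and sd."""
    coefficient = 1 / (sd * math.sqrt(2 * math.pi))
    exponent = -((x - mean) ** 2) / (2 * sd ** 2)
    return coefficient * math.exp(exponent)

# Densities at the mean and one sd above it, for mean = 50 and sd = 10.
at_mean = normal_density(50, 50, 10)
one_sd_above = normal_density(60, 50, 10)
print(at_mean, one_sd_above)
```

The curve is highest at the mean and symmetric around it, so the density one sd below the mean equals the density one sd above.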
Normal Distribution
  It would be difficult to calculate probabilities/densities for each new sample.
  Therefore, we use the standard normal distribution and transform scores on variables to fit it.
  A normal distribution with a mean of zero and an sd = 1 [N(0,1)].
Distribution Forms
  Many processes can be described by a normal distribution, but not all.
  Number of meteor strikes, number of Supreme Court retirements?
  Here we use the Poisson distribution, which is governed by the expected number of occurrences for an interval.
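For illustration only, the Poisson probability of exactly k occurrences can be computed from a single parameter, the expected number of occurrences per interval. The rate of 0.5 events per year below is a made-up value, not a figure from the lecture, and `poisson_probability` is a name invented for this sketch.

```python
import math

def poisson_probability(k, expected):
    """P(exactly k occurrences) when `expected` occurrences are expected per interval."""
    return expected ** k * math.exp(-expected) / math.factorial(k)

expected_per_year = 0.5  # hypothetical rate, for illustration only
for k in range(4):
    print(k, round(poisson_probability(k, expected_per_year), 3))
```

Unlike the normal curve, this distribution is defined only for counts (0, 1, 2, ...) and is skewed when the expected number is small.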
Score Transformations
  In order to use the standard normal tables to determine probabilities, we transform scores.
  Linear transformations of scores do not change the shape of the distribution.
  If we have a distribution with a mean of 50, we need to transform scores so that 50 becomes 0.
  Take deviations: (X - 50) for new point values
  This solves the problem of getting the mean to zero, but what about the standard deviation?
Score Transformations
  The standard normal has an sd = 1
  If we divide all values of a variable by a constant, we divide the standard deviation by that constant
  To get an sd = 1, we simply divide the mean-transformed scores (i.e., deviation scores) by the sd of the distribution.
  If the sd = 5, dividing all deviation scores by 5 produces an sd = 1
Z-scores and the Standard Normal Distribution
  This transformation of raw scores produces z scores
  Z scores are interpreted as the number of standard deviation units above or below the mean
  Raw score of 7 in a distribution with mean = 10 and sd = 2 produces:
  z = (X - mean) / sd = (7 - 10) / 2 = -1.5
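A minimal sketch of the z-score transformation: subtract the mean, then divide by the sd. The first call uses the slide's values (raw score 7, mean 10, sd 2); the list `raw` is hypothetical and is there only to check that a standardized set ends up with mean 0 and sd 1. The population sd (`pstdev`) is an assumption of this sketch.

```python
from statistics import mean, pstdev

def z_score(x, mu, sd):
    """Number of sd units that x lies above (+) or below (-) the mean."""
    return (x - mu) / sd

print(z_score(7, 10, 2))  # (7 - 10) / 2

raw = [40, 45, 50, 55, 60]  # hypothetical scores
mu, sd = mean(raw), pstdev(raw)
z = [z_score(x, mu, sd) for x in raw]
print(round(mean(z), 6), round(pstdev(z), 6))
```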
Z Score Transformation
  A linear transformation
  Addition, subtraction, multiplication, and/or division by constants
  Does not change the form of the distribution
  Z-scoring or “standardizing” a distribution does not make the distribution a normal one
  Shape will be the same, but mean = 0 and sd = 1
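One way to see that standardizing leaves the shape alone is to compare a shape index before and after. The sketch below uses skewness, computed here as the average cubed z score (a choice made for this illustration, not something the lecture specifies), on a hypothetical right-skewed set of scores.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert every value to a z score using the set's own mean and sd."""
    mu, sd = mean(values), pstdev(values)
    return [(x - mu) / sd for x in values]

def skewness(values):
    """Average cubed z score: 0 for a symmetric set, positive for a right skew."""
    return mean([z ** 3 for z in standardize(values)])

skewed = [1, 1, 2, 2, 3, 10]  # hypothetical, clearly right-skewed
z_scores = standardize(skewed)

print(round(skewness(skewed), 4), round(skewness(z_scores), 4))
```

The two skewness values agree: the z scores have mean 0 and sd 1, but they are exactly as skewed as the raw scores, so the set is no closer to normal than it was.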
Z Score Benefits
  Allows us to compare scores collected on different metrics
  Each score can be interpreted based on its deviation from the mean with respect to the magnitude of average deviations
  Allows us to easily obtain probabilities for specific scores based on a “known” normal distribution density function
Z Score to Probabilities
  If we know a z score, we can calculate probabilities attached to it.
  Area under the curve is 1.00
  Tabled values of the standard normal distribution reflect the area from the mean to that value
  Note that if the distribution shape differs substantially from normal, probability estimates will be incorrect
Z Score to Probabilities
  A z = 1.00 in the table corresponds to an area of 0.34
  A score between z = 0 and z = 1 has a probability of occurring of 0.34
  The probability of a score at or below z = 1 is: .50 + .34 = .84
  The probability of a score higher than z = 1 is: 1 - .84 = .16; or .50 - .34 = .16
  The probability of a score -1 < z < 1? .34 + .34 = .68
  Distribution is symmetric
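These areas can be checked without a printed table. The sketch below assumes Python 3.8 or later, where `statistics.NormalDist` provides the cumulative area up to a given z; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area from the mean (z = 0) up to z = 1, which is what the table reports.
mean_to_z1 = standard_normal.cdf(1) - 0.5

at_or_below_z1 = 0.5 + mean_to_z1   # lower half plus mean-to-z area
above_z1 = 1 - at_or_below_z1       # total area is 1.00
between_pm1 = 2 * mean_to_z1        # symmetry: same area on each side of the mean

for area in (mean_to_z1, at_or_below_z1, above_z1, between_pm1):
    print(round(area, 2))
```

Rounded to two decimals, the four areas match the slide's .34, .84, .16, and .68.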
Curve Area Applet
Setting Probable Limits for Observations
  Many times, it is useful to predict an interval in which a randomly sampled data point will fall.
  A randomly sampled individual’s score should fall between X and X’ with 95% certainty.
  This implies we’re looking for the area under the curve that covers 95% (cut off 2.5% in each tail)
Setting Probable Limits for Observations
  From the table, we can see that z = 1.96 leaves 2.5% remaining in the tail.
  We simply need to calculate what raw score corresponds to z = 1.96.
  Note that here we must know the population mean and sd.
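Setting Probable Limits for Observations
  If mean is 50 and sd = 10:
  X = mean ± 1.96 (sd) = 50 ± 1.96 (10) = 50 ± 19.6
  A randomly sampled score should fall between 30.4 and 69.6 with 95% certainty.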
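A sketch of the same calculation, assuming Python 3.8 or later for `statistics.NormalDist`. Instead of reading 1.96 from a table, the cutoff is taken as the z that leaves 2.5% in the upper tail, and the limits are then converted back to raw scores using the slide's mean of 50 and sd of 10.

```python
from statistics import NormalDist

population_mean = 50
population_sd = 10

# z value with 97.5% of the area below it, i.e. 2.5% left in the upper tail.
z_cutoff = NormalDist().inv_cdf(0.975)

# Convert the cutoff back to raw-score units on each side of the mean.
lower_limit = population_mean - z_cutoff * population_sd
upper_limit = population_mean + z_cutoff * population_sd

print(round(z_cutoff, 2), round(lower_limit, 1), round(upper_limit, 1))
```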
Converting Z’s to Other Standard Scores
  Standard scores are ones with predetermined means and sd’s
  New score = New SD (z) + New Mean
  For IQ [N(100,15)]:
  IQ score for z of 1 = 15 (1) + 100 = 115
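The conversion rule, New score = New SD (z) + New Mean, is short enough to sketch as a one-line function. The name `to_standard_score` is invented for this illustration; the IQ scale values (mean 100, sd 15) come from the slide.

```python
def to_standard_score(z, new_mean, new_sd):
    """Rescale a z score onto a scale with the chosen mean and sd."""
    return new_sd * z + new_mean

print(to_standard_score(1, 100, 15))     # IQ for z = 1
print(to_standard_score(-1.5, 100, 15))  # IQ for z = -1.5
```

A z of 1 maps to an IQ of 115, and a z of -1.5 maps to 77.5, one and a half sd's below the IQ mean.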