The Gaussian (Normal) Distribution: More Details & Some Applications
The Gaussian (Normal) Distribution The Gaussian Distribution is one of the most used distributions in all of science. It is also called the “bell curve” or the Normal Distribution. If this is the “Normal Distribution”, logically, shouldn’t there also be an “Abnormal Distribution”?
Johann Carl Friedrich Gauss (1777–1855, Germany) Mathematician, Astronomer & Physicist. Sometimes called the “Prince of Mathematics” (?) A child prodigy in math. (Do you have trouble believing some of the following? I do!) Age 3: He informed his father of a mistake in a payroll calculation & gave the correct answer!! Age 7: His teacher gave his class the problem of summing all integers from 1 to 100 to keep them busy. Gauss quickly wrote the correct answer, 5050, on his slate!! Whether or not you believe all of this, it is 100% true that he made a HUGE number of contributions to Mathematics, Physics, & Astronomy!!
Johann Carl Friedrich Gauss A Genius! He made a HUGE number of contributions to Mathematics, Physics, & Astronomy: 1. Proved the Fundamental Theorem of Algebra, that every non-constant polynomial has a root of the form a+bi. 2. Proved the Fundamental Theorem of Arithmetic, that every natural number greater than 1 can be represented as a product of primes in only one way (up to the order of the factors). 3. Proved that every positive integer is the sum of at most 3 triangular numbers. 4. Developed the method of least squares fitting & many other methods in statistics & probability. 5. Proved many theorems of integral calculus, including the divergence theorem (when applied to the E field, it gives what is called Gauss’s Law). 6. Proved many theorems of number theory. 7. Made many contributions to the orbital mechanics of the solar system. 8. Made many contributions to Non-Euclidean geometry. 9. One of the first to rigorously study the Earth’s magnetic field.
Characteristics of a Normal or Gaussian Distribution [Figure: normal distribution with μ = 0, σ = 1; axes x and f(x)] It is symmetric. Its mean, median, & mode are equal.
A 2-Dimensional Gaussian
Gaussian or Normal Distribution It is a symmetrical, bell-shaped curve. It has points of inflection 1 standard deviation on either side of the mean (at x = μ ± σ). Formula: \( f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \)
The Normal Distribution Note the constants: π = 3.14159 and e = 2.71828 (both rounded). This is a bell-shaped curve with different centers and spreads depending on μ and σ.
There are only 2 parameters that determine the curve, the mean μ & the variance σ². The rest are constants. For “z scores” (μ = 0, σ = 1), the equation becomes: \( f(z) = \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2} \) The negative exponent means that big |z| values give small function values in the tails.
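Normal Distribution It’s a probability density function, so no matter what the values of μ and σ, it must integrate to 1! \( \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = 1 \)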
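The claim that the curve integrates to 1 for any mean and standard deviation can be checked numerically. The sketch below assumes the usual normal density with mean `mu` and standard deviation `sigma`; the function names, the midpoint rule, and the three parameter pairs are illustrative choices, not part of the original slides.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def integrate_pdf(mu, sigma, width=10.0, steps=20000):
    """Midpoint-rule area under the density over mu +/- width * sigma.

    The tails beyond 10 standard deviations are negligible, so this
    approximates the integral over the whole real line.
    """
    a = mu - width * sigma
    h = (2.0 * width * sigma) / steps
    return sum(normal_pdf(a + (i + 0.5) * h, mu, sigma) for i in range(steps)) * h


# Hypothetical (mu, sigma) pairs chosen only to show that the area does not depend on them.
areas = [integrate_pdf(mu, sigma) for mu, sigma in [(0, 1), (100, 10), (-3, 0.5)]]
print(areas)
```

Each entry of `areas` should come out very close to 1, whatever the center and spread.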
The Normal Distribution is Defined by its Mean & Standard Deviation. Mean = μ, Variance = σ², Standard Deviation = σ
Normal Distribution Can take on an infinite number of possible values. The probability of any one of those values occurring is essentially zero. Curve has area or probability = 1
A normal distribution with a mean of 0 and a standard deviation of 1 is called the standard normal distribution. Z Value: The distance between a selected value, designated X, and the population mean μ, divided by the population standard deviation σ: \( Z = \frac{X - \mu}{\sigma} \)
Example 1 The monthly incomes of recent MBA graduates in a large corporation are normally distributed with a mean of $2000 and a standard deviation of $200. What is the Z value for an income of $2200? An income of $1700? For X = $2200, Z = (2200 − 2000)/200 = 1. For X = $1700, Z = (1700 − 2000)/200 = −1.5. A Z value of 1 indicates that the value of $2200 is 1 standard deviation above the mean of $2000, while a Z value of −1.5 indicates that the value of $1700 is 1.5 standard deviations below the mean of $2000.
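The arithmetic of Example 1 can be reproduced in a few lines. This is a minimal sketch; the helper name `z_score` is an illustrative choice.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


# Values from Example 1: mean income $2000, standard deviation $200.
z_2200 = z_score(2200, 2000, 200)
z_1700 = z_score(1700, 2000, 200)
print(z_2200, z_1700)
```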
Probabilities Depicted by Areas Under the Curve Total area under the curve is 1 The area in red is equal to p(z > 1) The area in blue is equal to p(-1< z <0) Since the properties of the normal distribution are known, areas can be looked up on tables or calculated on a computer.
Probability of an Interval
Cumulative Probability
A table will give this probability. Given any positive value for z, the corresponding probability can be looked up in standard tables. The probability found using a table is the probability of having a standard normal variable between 0 & the given positive z.
Areas Under the Standard Normal Curve
Areas and Probabilities The Table shows cumulative normal probabilities F(z). [Table: selected entries of z and F(z)] About 54% of scores fall below a z of .1. About 46% of scores fall below a z of −.1 (1 − .54 = .46). About 14% of scores fall between z of 1 and 2 (.98 − .84 = .14).
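Instead of a printed table, the cumulative probabilities can be computed directly. The sketch below assumes the standard identity F(z) = ½ · (1 + erf(z / √2)) for the standard normal distribution; the function and variable names are illustrative.

```python
import math


def standard_normal_cdf(z):
    """Cumulative probability F(z) = P(Z <= z) for a standard normal variable Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


below_pos = standard_normal_cdf(0.1)    # share of scores below z = 0.1
below_neg = standard_normal_cdf(-0.1)   # share of scores below z = -0.1
between_1_2 = standard_normal_cdf(2.0) - standard_normal_cdf(1.0)

# Tables that list the area between 0 and a positive z correspond to F(z) - 0.5.
area_0_to_1 = standard_normal_cdf(1.0) - 0.5

print(round(below_pos, 2), round(below_neg, 2), round(between_1_2, 2))
```

Rounded to two decimals, these reproduce the 54%, 46%, and 14% quoted on the slide.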
Areas Under the Normal Curve About 68 percent of the area under the normal curve is within one standard deviation of the mean. About 95 percent is within two standard deviations of the mean. About 99.7 percent is within three standard deviations of the mean.
Areas Under the Normal Curve [Figure: normal distribution with μ = 0, σ = 1, showing the percentage of area between μ ± 1σ, μ ± 2σ, and μ ± 3σ; axes x and f(x)] Irwin/McGraw-Hill © The McGraw-Hill Companies, Inc.,
Key Areas Under the Curve For normal distributions: μ ± 1σ ~ 68%, μ ± 2σ ~ 95%, μ ± 3σ ~ 99.7%
“68-95-99.7 Rule” [Figure: normal curve with 68% of the data within 1σ of the mean, 95% of the data within 2σ, and 99.7% of the data within 3σ]
68-95-99.7 Rule For a Normally distributed variable: 1. About 68.27% of all possible observations lie within one standard deviation on either side of the mean (between μ − σ and μ + σ). 2. About 95.45% of all possible observations lie within two standard deviations on either side of the mean (between μ − 2σ and μ + 2σ). 3. About 99.73% of all possible observations lie within three standard deviations on either side of the mean (between μ − 3σ and μ + 3σ).
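The three percentages in the rule can be recomputed rather than memorized. The sketch below assumes the identity P(|Z| ≤ k) = erf(k / √2) for a standard normal Z; because any normal variable can be standardized, the result does not depend on the mean or standard deviation. The names are illustrative.

```python
import math


def prob_within(k):
    """P(|Z| <= k): area under the normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


within = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} standard deviation(s): {100 * p:.2f}%")
```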
Using the unit normal (z), we can find areas and probabilities for any normal distribution. Suppose X = 120, μ = 100, σ = 10. Then z = (120 − 100)/10 = 2. About 98% of cases fall below a score of 120 if the distribution is normal. In the normal, most (95%) are within 2σ of the mean. Nearly everybody (99.7%) is within 3σ of the mean.
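68-95-99.7 Rule
68-95-99.7 Rule in Math terms… \( P(\mu - \sigma \le X \le \mu + \sigma) \approx 0.68 \), \( P(\mu - 2\sigma \le X \le \mu + 2\sigma) \approx 0.95 \), \( P(\mu - 3\sigma \le X \le \mu + 3\sigma) \approx 0.997 \)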
Example 2 The daily water usage per person in New Providence, New Jersey is normally distributed with a mean of 20 gallons and a standard deviation of 5 gallons. About 68% of the daily water usage per person in New Providence lies between what two values? About 68% lies within one standard deviation of the mean: μ ± 1σ = 20 ± 1(5). That is, about 68% of the daily water usage will lie between 15 and 25 gallons.
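Example 2 can be cross-checked with Python's standard library. This sketch uses `statistics.NormalDist` (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

# Daily water usage per person: mean 20 gallons, standard deviation 5 gallons.
usage = NormalDist(mu=20, sigma=5)

# One standard deviation on either side of the mean.
low = usage.mean - usage.stdev
high = usage.mean + usage.stdev

# Share of daily usage that falls inside that interval.
share = usage.cdf(high) - usage.cdf(low)
print(low, high, round(share, 4))
```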
Normal Approximation to the Binomial Using the normal distribution (a continuous distribution) as a substitute for a binomial distribution (a discrete distribution) for large values of n seems reasonable because as n increases, a binomial distribution gets closer and closer to a normal distribution. The normal probability distribution is generally deemed a good approximation to the binomial probability distribution when np and n(1 − p) are both greater than 5, where p is the probability of success on a single trial.
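To see how close the approximation can be, the sketch below compares an exact binomial probability with its normal counterpart. The values n = 40, p = 0.5, k = 22 are hypothetical. The half-unit continuity correction is a common refinement that the slides do not mention, so it is an assumption here. The approximating normal uses mean np and standard deviation √(np(1 − p)).

```python
import math
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k), using mean n*p and sd sqrt(n*p*(1-p)).

    Evaluating at k + 0.5 is the continuity correction (an assumption added here).
    """
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).cdf(k + 0.5)


# Hypothetical example values; here n*p = n*(1-p) = 20.
n, p, k = 40, 0.5, 22
exact = binomial_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

Repeating the comparison with a small n or with p near 0 or 1 shows the approximation getting worse.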
Binomial Distribution for n = 3 & n =
Central Limit Theorem Flip a coin N times. Each outcome has an associated random variable X_i (= 1 if heads, otherwise 0). Number of heads: N_H is a random variable, N_H = X_1 + X_2 + … + X_N
Coin flip problem. Probability function of N_H, with P(Head) = 0.5 (fair coin). [Figure: three panels showing the distribution of N_H for N = 5, N = 10, and N = 40]
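The probability function of the number of heads can be tabulated exactly for the three values of N shown on the slide. The sketch below assumes independent flips, so the number of heads follows a binomial distribution; the function name `heads_pmf` is illustrative.

```python
import math


def heads_pmf(n_flips, p_head=0.5):
    """Return [P(N_H = 0), ..., P(N_H = n_flips)] for independent coin flips."""
    return [
        math.comb(n_flips, k) * p_head**k * (1 - p_head) ** (n_flips - k)
        for k in range(n_flips + 1)
    ]


for n in (5, 10, 40):
    pmf = heads_pmf(n)
    # Crude text histogram: one '#' per 1% of probability.
    print(f"N = {n}")
    for k, prob in enumerate(pmf):
        if prob >= 0.01:
            print(f"  {k:2d} heads  {'#' * round(100 * prob)}")
```

As N grows, the printed bars trace out an increasingly bell-shaped profile centered on N/2.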
Central Limit Theorem The distribution of the sum of N independent random variables (identically distributed, with finite variance) becomes increasingly Gaussian as N grows. Example: N uniform [0,1] random variables.
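The uniform example can be tried with a small simulation. This sketch sums N = 12 independent uniform [0, 1] draws many times and checks that about 68% of the sums fall within one standard deviation of their mean, as they would for a Gaussian. The values N = 12, 20000 repetitions, and the fixed seed are arbitrary illustrative choices.

```python
import random
import statistics


def sum_of_uniforms(n, rng):
    """Sum of n independent uniform [0, 1] random numbers."""
    return sum(rng.random() for _ in range(n))


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 12                  # each sum has theoretical mean N/2 = 6 and variance N/12 = 1
samples = [sum_of_uniforms(N, rng) for _ in range(20000)]

mean = statistics.fmean(samples)
sd = statistics.pstdev(samples)
frac_within_1sd = sum(abs(s - mean) <= sd for s in samples) / len(samples)

print(f"mean = {mean:.3f}, sd = {sd:.3f}, within 1 sd = {frac_within_1sd:.3f}")
```

With N = 1 the same fraction would be about 58% (a flat distribution), so its drift toward 68% as N increases is one simple sign of the sums becoming Gaussian.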
[Figure: Normal Distribution, with probability shown as percentages of the area under the curve]
Why are normal distributions so important? Many dependent variables are commonly assumed to be normally distributed in the population. If a variable is approximately normally distributed, we can make inferences about values of that variable. Example: Sampling distribution of the mean. So what? Remember the Binomial distribution: with a few trials we were able to calculate the possible outcomes and the probabilities of those outcomes. Now try it for a continuous distribution with an infinite number of possible outcomes. Yikes! The normal distribution and its properties are well known, and if our variable of interest is normally distributed, we can apply what we know about the normal distribution to our situation and find the probabilities associated with particular outcomes.
Since we know the shape of the normal curve, we can calculate the area under the curve. The fraction of that area lying beyond a given value gives the probability that a value at least that extreme could be drawn from a given distribution. The area under the curve tells us about the probability; in other words, we can obtain a p-value for our result (data) by treating it as drawn from a normal distribution.
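As a sketch of how such a tail area turns into a p-value, the code below reuses the numbers from the earlier z-score illustration (a score of 120 from a normal distribution with mean 100 and standard deviation 10). The function name and the choice to report both one-sided and two-sided values are assumptions made here, not something the slides specify.

```python
from statistics import NormalDist


def upper_tail_p(x, mu, sigma):
    """Area under the normal curve to the right of x: P(X >= x)."""
    return 1.0 - NormalDist(mu, sigma).cdf(x)


# Score of 120 under a normal distribution with mean 100 and standard deviation 10 (z = 2).
p_one_sided = upper_tail_p(120, 100, 10)

# By symmetry, the probability of a value at least this far from the mean in either direction.
p_two_sided = 2.0 * p_one_sided

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

The one-sided value of about 0.023 is the complement of the "about 98% of cases fall below a score of 120" statement.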