Raoul LePage, Professor, STATISTICS AND PROBABILITY, www.stt.msu.edu/~lepage (click on STT315_F06). Week 9-25-06 and some preparation for exam 2.
Suggested exercises (solutions given in text): 3-33, 3-41, 3-42 (except b, c, h, m, n), 3-43, 3-49, 3-57 (except c, d), 3-59, 3-61, 3-63, 3-65. Textbook exercises are not comprehensive.
PROBABILITY MODELS HAVING BROAD APPLICATION: NORMAL DISTRIBUTION, BERNOULLI TRIALS, BINOMIAL DISTRIBUTION, POISSON DISTRIBUTION.
NORMAL DISTRIBUTION: WHERE ARE THE MEAN AND STANDARD DEVIATION IN THIS PICTURE? Note the balance point. Note the point of inflexion.
IQ DISTRIBUTION: ~NORMAL, MEAN 100, STANDARD DEVIATION 15. The mean (100) is the balance point; the SD (15) is the distance from the mean to the point of inflexion.
DISTRIBUTION OF THE NUMBER OF HEADS IN 100 COIN TOSSES: APPROXIMATELY NORMAL, MEAN 50, STD DEVIATION 5.
DISTRIBUTION OF THE NUMBER OF ACCIDENTS IN ONE MONTH IF WE AVERAGE 39.7 PER MONTH: APPROXIMATELY NORMAL, MEAN 39.7, STD DEVIATION 6.3.
NORMAL DISTRIBUTIONS ARE ALIKE IN SD UNITS FROM THE MEAN: ~68% WITHIN 1 SD OF MEAN, ~95% WITHIN 2 SD OF MEAN. Illustrated for the Standard Normal (Mean = 0, SD = 1).
IQ DISTRIBUTION: ~NORMAL, MEAN 100, STANDARD DEVIATION 15. Between 85 and 100 (one SD below the mean): ~68/2 = 34%. Between 100 and 130 (two SD above the mean): ~95/2 = 47.5%.
STANDARD SCORES CONVERT TO MEAN 0, SD 1. For IQ (mean 100, SD 15) the standard score is Z = (IQ - 100) / 15, and Z is ~Standard Normal.
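The IQ percentages and the conversion to standard scores can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `z_score` is an illustrative assumption, not something from the slides.

```python
from statistics import NormalDist

# IQ modeled as normal with mean 100 and SD 15 (values from the slides).
iq = NormalDist(mu=100, sigma=15)

def z_score(x, mean=100.0, sd=15.0):
    """Standard score: how many SDs x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

print(z_score(85), z_score(130))  # -1.0 2.0

# Area between 85 and 100 (about 34%) and between 100 and 130 (about 47.5%).
p_85_100 = iq.cdf(100) - iq.cdf(85)
p_100_130 = iq.cdf(130) - iq.cdf(100)
print(round(p_85_100, 3), round(p_100_130, 3))
```

The exact areas differ slightly from 34% and 47.5% because the 68% and 95% figures are themselves rounded.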
Z - TABLE CUT AND PASTE. P(Z > 0) = P(Z < 0) = 0.5. P(Z > 2.66) = 0.5 - P(0 < Z < 2.66) = 0.5 - 0.4961 = 0.0039 (0.4961 is the table entry for z = 2.66). P(Z < 1.92) = 0.5 + P(0 < Z < 1.92) = 0.5 + 0.4726 = 0.9726.
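The cut-and-paste arithmetic can be reproduced without a printed table. This is a sketch only: it assumes the table lists areas between 0 and z, and it takes 0.4961 to be the entry for z = 2.66, which is inferred from the table value rather than stated on the slide. The names `phi` and `table_entry` are illustrative.

```python
import math

def phi(z):
    """Standard normal CDF, P(Z < z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def table_entry(z):
    """Area between 0 and z, P(0 < Z < z), as a half-table lists it."""
    return phi(z) - 0.5

# Paste: area to the left of 1.92 is the left half plus the table entry.
print(round(table_entry(1.92), 4))        # 0.4726
print(round(0.5 + table_entry(1.92), 4))  # 0.9726

# Cut: upper tail beyond 2.66 is the right half minus the table entry.
print(round(table_entry(2.66), 4))        # 0.4961
print(round(0.5 - table_entry(2.66), 4))  # 0.0039
```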
BERNOULLI DISTRIBUTION
x      p(x)
1      p     (1 denotes “success”)
0      q     (0 denotes “failure”)
total  1
0 < p < 1, q = 1 - p
Notation: BERNOULLI RANDOM VARIABLE X. P(success) = P(X = 1) = p, P(failure) = P(X = 0) = q. e.g. X = “sample voter is Democrat.” Population has 48% Dem, so p = 0.48, q = 0.52, P(X = 1) = 0.48.
INDEPENDENT BERNOULLI-p TRIALS. "S" denotes success, "F" denotes failure. P(S1 S2 F3 F4 F5 F6 S7) = p^3 q^4; just write P(SSFFFFS) = p^3 q^4. “The answer only depends upon how many of each, not their order.” e.g. 48% Dem, 5 sampled, with-repl: P(Dem Rep Dem Dem Rep) = 0.48^3 × 0.52^2.
BINOMIAL DISTRIBUTION FOR THE TOTAL NUMBER OF SUCCESSES IN INDEPENDENT p-BERNOULLI TRIALS. e.g. P(exactly 2 Dems out of sample of 4) = P(DDRR) + P(DRDR) + P(DRRD) + P(RDDR) + P(RDRD) + P(RRDD) = 6 × 0.48^2 × 0.52^2 ~ 0.374. There are 6 ways to arrange 2D 2R.
BINOMIAL DISTRIBUTION FOR THE TOTAL NUMBER OF SUCCESSES IN INDEPENDENT p-BERNOULLI TRIALS. e.g. P(exactly 3 Dems out of sample of 5) = P(DDDRR) + P(DDRDR) + P(DDRRD) + P(DRDDR) + P(DRDRD) + P(DRRDD) + P(RDDDR) + P(RDDRD) + P(RDRDD) + P(RRDDD) = 10 × 0.48^3 × 0.52^2 ~ 0.299. There are 10 ways to arrange 3D 2R. Same as the number of ways to select 3 from 5.
COUNTING ARRANGEMENTS. There are 5! ways to arrange 5 things in a line. Do it thus (1:1 with arrangements): select 3 of the 5 to go first in line, arrange those 3 at the head of the line, then arrange the remaining 2 after. So 5! = (ways to select 3 from 5) × 3! × 2!, and the number of ways to select 3 from 5 must be 5! / (3! 2!) = 10.
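The count of 10 can be confirmed by brute force. The sketch below lists every distinct ordering of three D's and two R's and compares the count with 5!/(3! 2!); the variable names are illustrative only.

```python
import math
from itertools import permutations

# All distinct orderings of 3 D's and 2 R's (the set removes repeats).
arrangements = set(permutations("DDDRR"))
print(len(arrangements))  # 10

# Same count from the factorial argument, and from the built-in.
by_factorials = math.factorial(5) // (math.factorial(3) * math.factorial(2))
print(by_factorials, math.comb(5, 3))  # 10 10

# Each arrangement has probability 0.48^3 * 0.52^2, so the total is:
prob_3_of_5 = len(arrangements) * 0.48**3 * 0.52**2
print(round(prob_3_of_5, 3))  # 0.299
```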
BINOMIAL FORMULA. Let random variable X denote the number of “S” in n independent Bernoulli p-Trials. By definition, X has a Binomial Distribution and for each of x = 0, 1, 2, …, n: P(X = x) = (n!/(x! (n-x)!)) p^x q^(n-x). e.g. P(44 Dems in sample of 100 voters) = (100!/(44! 56!)) 0.48^44 0.52^(100-44) = 0.05812.
Caveats: Binomial. The Binomial Coefficient n!/(x! (n-x)!) is the count of how many arrangements there are of a string of x letters “S” and n-x letters “F.” p^x q^(n-x) is the shared probability of each string of x letters “S” and n-x letters “F.” (Define 0! = 1 and p^0 = q^0 = 1, and the formula goes through for every one of x = 0 through n.) The symbol “n choose x” is short for the arrangement count = Binomial Coefficient.
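The formula translates directly into a few lines of code. This is a minimal sketch; `binomial_pmf` is an illustrative name, and `math.comb` supplies the binomial coefficient.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X = number of successes in n independent Bernoulli-p trials."""
    q = 1.0 - p
    # math.comb(n, x) = n! / (x! (n-x)!), the number of arrangements.
    return math.comb(n, x) * p**x * q**(n - x)

print(round(binomial_pmf(2, 4, 0.48), 3))     # about 0.374
print(round(binomial_pmf(3, 5, 0.48), 3))     # about 0.299
print(round(binomial_pmf(44, 100, 0.48), 4))  # about 0.058

# The probabilities over x = 0, ..., n add up to 1.
total = sum(binomial_pmf(x, 100, 0.48) for x in range(101))
print(round(total, 6))
```

Python evaluates `0.48**0` and `math.comb(n, 0)` as 1, matching the conventions 0! = 1 and p^0 = q^0 = 1, so the endpoints x = 0 and x = n need no special handling.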
Normal Approx of Binomial. Poisson and its Normal Approx. Aspects of random sampling.
Normal Approx of Binomial: n = 10, p = 0.4, mean = n p = 4, sd = root(n p q) ~ 1.55.
Normal Approx of Binomial: n = 30, p = 0.4, mean = n p = 12, sd = root(n p q) ~ 2.683.
Normal Approx of Binomial: n = 100, p = 0.4, mean = n p = 40, sd = root(n p q) ~ 4.89898.
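How well the normal curve tracks the binomial at these three sample sizes can be checked numerically. The sketch below compares each binomial probability with the normal density at the same point and reports the largest gap; this particular measure of closeness is an assumption chosen for illustration, not one taken from the slides.

```python
import math

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def normal_density(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def mean_sd(n, p):
    """Binomial mean n p and sd root(n p q)."""
    return n * p, math.sqrt(n * p * (1 - p))

def max_gap(n, p):
    """Largest |binomial probability - normal density| over x = 0, ..., n."""
    mean, sd = mean_sd(n, p)
    return max(abs(binomial_pmf(x, n, p) - normal_density(x, mean, sd))
               for x in range(n + 1))

for n in (10, 30, 100):
    mean, sd = mean_sd(n, 0.4)
    print(n, round(mean, 3), round(sd, 3), round(max_gap(n, 0.4), 4))
```

The gap shrinks as n grows, which is the sense in which the approximation improves.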
Poisson Distribution Governing Counts of Rare Events: p(x) = e^(-mean) mean^x / x! for x = 0, 1, 2, .. ad infinitum.
Poisson e.g. X = number of times ace of spades turns up in 104 tries. X ~ Poisson with mean 2. p(x) = e^(-mean) mean^x / x!, e.g. p(3) = e^(-2) 2^3 / 3! ~ 0.18.
Poisson e.g. X = number of raisins in MY cookie. Batter has 400 raisins and makes 144 cookies. E X = 400/144 ~ 2.78 per cookie. p(x) = e^(-mean) mean^x / x!, e.g. p(2) = e^(-2.78) 2.78^2 / 2! ~ 0.24 (around 24% of cookies have 2 raisins).
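Both Poisson examples can be reproduced with a short function. This is a minimal sketch; `poisson_pmf` is an illustrative name.

```python
import math

def poisson_pmf(x, mean):
    """p(x) = e^(-mean) mean^x / x! for x = 0, 1, 2, ..."""
    return math.exp(-mean) * mean**x / math.factorial(x)

# Ace of spades in 104 tries: mean = 104/52 = 2.
p_ace_3 = poisson_pmf(3, 104 / 52)
print(round(p_ace_3, 2))  # 0.18

# Raisins per cookie: mean = 400/144, about 2.78.
p_raisins_2 = poisson_pmf(2, 400 / 144)
print(round(p_raisins_2, 2))  # 0.24
```

Written this way the function suits small means; for large means, `mean**x` can overflow and a logarithmic form is safer.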
THE FIRST BEST THING ABOUT THE POISSON IS THAT THE MEAN ALONE TELLS US THE ENTIRE DISTRIBUTION! Note: Poisson sd = root(mean).
E X = 400/144 ~ 2.78 raisins per cookie; sd = root(mean) = 1.67 (for Poisson).
Poisson: THE SECOND BEST THING ABOUT THE POISSON IS THAT FOR A MEAN AS SMALL AS 3 THE NORMAL APPROXIMATION WORKS WELL. Illustrated for mean 2.78, sd = root(mean) = 1.67 (sd = root(mean) is special to Poisson).
WE AVERAGE 127.8 ACCIDENTS PER MO. E X = 127.8 accidents. If Poisson then sd = root(mean) = root(127.8) = 11.3049 (special to Poisson), and the approx dist is ~Normal with mean 127.8 accidents and sd 11.3.
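One way to see the normal approximation at work is to add up exact Poisson probabilities within 2 sd of the mean and compare the total with the normal figure of about 95%. The sketch below does this for the accident example; the log form of the formula is an implementation choice made here to avoid overflow at large counts, and the function names are illustrative.

```python
import math

def poisson_pmf(x, mean):
    # Log form of e^(-mean) mean^x / x!; lgamma(x + 1) = log(x!).
    return math.exp(-mean + x * math.log(mean) - math.lgamma(x + 1))

def prob_within_2sd(mean):
    """Exact Poisson probability that X lies within 2 sd of its mean."""
    sd = math.sqrt(mean)  # special to Poisson
    lo = max(math.ceil(mean - 2 * sd), 0)
    hi = math.floor(mean + 2 * sd)
    return sum(poisson_pmf(x, mean) for x in range(lo, hi + 1))

print(round(math.sqrt(127.8), 4))        # sd for the accident example
print(round(prob_within_2sd(127.8), 3))  # close to 0.95
```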
Aspects of Random Sampling
THE GREAT TRICK OF STATISTICS: The overwhelming majority of samples of n from a population of N can stand in for the population. Population of N = 5: ATT, Sysco, Pepsico, GM, Dow. Sample of n = 2: ATT, Pepsico.
GREAT TRICK: SOME CAVEATS. Sample size n must be “large.” It holds for only a few characteristics at a time, such as profit, sales, dividend. SPECTACULAR FAILURES MAY OCCUR! Population of N = 5: ATT 12, Sysco 21, Pepsi 42, GM 8, Dow 9. Sample of n = 2.
HOW ARE WE SAMPLING? With replacement vs without replacement. Population of N = 5: ATT 12, Sysco 21, Pepsi 42, GM 8, Dow 9. Sample of n = 2; in the with-replacement illustration, Pepsi 42 is shown selected.
GREAT TRICK: SOME CAVEATS. This sample is obviously “not representative.” Population of N = 5: ATT 12, Sysco 21, Pepsi 42, GM 8, Dow 9. Sample of n = 2: Sysco 21, Pepsi 42.
DOES IT MAKE A DIFFERENCE? Are sampling with and without replacement the SAME for a population of N and a sample of n? Rule of thumb: with and without replacement are about the same if root[(N-n)/(N-1)] ~ 1.
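The rule of thumb is easy to evaluate. The sketch below computes root[(N-n)/(N-1)] for the five-company example and for a much larger population; the name `replacement_factor` and the second pair of numbers are illustrative assumptions.

```python
import math

def replacement_factor(N, n):
    """root[(N - n) / (N - 1)]; near 1 means with/without replacement barely differ."""
    return math.sqrt((N - n) / (N - 1))

# Five companies, sample of two: noticeably below 1, so the method matters.
print(round(replacement_factor(5, 2), 3))  # 0.866

# A very large population with a modest sample: essentially 1.
print(round(replacement_factor(500_000_000, 600), 6))
```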
CORRECTION TO PAGE 25 OF TEXT. They would have you believe the population is {8, 9, 12, 42} and the sample is {42}. A SET is a collection of distinct entities. Population: ATT 12, IBM 42, AAA 9, Pepsi 42, GM 8, Dow 9. Sample: Pepsi 42. WE SAMPLE COMPANIES; NUMBERS COME WITH THEM.
THE ROLE OF RANDOM SAMPLING IF THE OVERWHELMING MAJORITY OF SAMPLES ARE “GOOD SAMPLES” THEN WE CAN OBTAIN A “GOOD” SAMPLE BY RANDOM SELECTION.
HOW TO SAMPLE RANDOMLY? SELECTING A LETTER AT RANDOM. Digits are made to correspond to letters: a = 00-02, b = 03-05, …, z = 75-77. Random digits then give random letters.
1559 9068 … (Table 14, pg. 809)
15 59 90 68 etc… (split into pairs)
f t * w etc… (take chosen letters; * marks a pair outside 00-77, which is passed over)
For samples without replacement just pass over any duplicates.
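The digit-to-letter scheme can be sketched in code. The function names below are illustrative, and the digit strings are typed in by hand rather than read from the table; only the first eight digits come from the slide.

```python
def pair_to_letter(pair):
    """Map a two-digit string to a letter: 00-02 -> a, 03-05 -> b, ..., 75-77 -> z."""
    code = int(pair)
    if code > 77:
        return "*"  # 78-99 correspond to no letter and are passed over
    return chr(ord("a") + code // 3)

def digits_to_letters(digits):
    """Split a digit string into pairs and translate each pair."""
    pairs = [digits[i:i + 2] for i in range(0, len(digits) - 1, 2)]
    return [pair_to_letter(p) for p in pairs]

def sample_without_replacement(digits, size):
    """Take letters in order, passing over '*' and any duplicates."""
    chosen = []
    for letter in digits_to_letters(digits):
        if letter != "*" and letter not in chosen:
            chosen.append(letter)
        if len(chosen) == size:
            break
    return chosen

print(digits_to_letters("15599068"))                  # ['f', 't', '*', 'w']
# The trailing digits 1507 are made up to show a duplicate being skipped.
print(sample_without_replacement("155990681507", 4))  # ['f', 't', 'w', 'c']
```

Because each letter owns exactly three of the codes 00-77, every letter is equally likely whenever the digit pairs are uniformly random.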
The Great Trick is far more powerful than we have seen. A typical sample closely estimates such things as a population mean or the shape of a population density. But it goes beyond this to reveal how much variation there is among sample means and sample densities. A typical sample not only estimates population quantities. It estimates the sample-to-sample variations of its own estimates.
EXAMPLE: ESTIMATING A MEAN. The average account balance is $421.34 for a random with-replacement sample of 50 accounts. We estimate from this sample that the average balance is $421.34 for all accounts. From this sample we also estimate and display a “margin of error”: $421.34 +/- $65.22 = xBAR +/- 1.96 s / root(n), where s denotes the "sample standard deviation."
SAMPLE STANDARD DEVIATION NOTE: Sample standard deviation s may be calculated in several equivalent ways, some sensitive to rounding errors, even for n = 2.
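The sensitivity to rounding can be seen with just two numbers. The sketch below compares two algebraically equivalent formulas for s: one subtracts the mean before squaring, the other uses the shortcut based on the sum of squares. The data values are made up for illustration, and the function names are assumptions.

```python
import math

def sd_two_pass(data):
    """s = root( sum (x - mean)^2 / (n - 1) ): subtract the mean first."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

def sd_shortcut(data):
    """s = root( (sum x^2 - n mean^2) / (n - 1) ): same algebra, fragile numerically."""
    n = len(data)
    mean = sum(data) / n
    variance = (sum(x * x for x in data) - n * mean * mean) / (n - 1)
    return math.sqrt(max(variance, 0.0))  # guard against a negative rounding result

small = [2.0, 4.0]
shifted = [1e9 + 2.0, 1e9 + 4.0]  # same spread, large common offset

print(sd_two_pass(small), sd_shortcut(small))      # both root(2)
print(sd_two_pass(shifted), sd_shortcut(shifted))  # shortcut loses the answer
```

Both data sets have the same true s = root(2). The shortcut fails on the shifted data because the squares are near 10^18, where double-precision numbers are spaced far more than 2 apart.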
EXAMPLE: MARGIN OF ERROR CALCULATION. The following margin of error calculation for n = 4 is only an illustration. A sample of four would not be regarded as large enough. Profits per sale = {12.2, 15.3, 16.2, 12.8}. Mean = 14.125, s = 1.92765, root(4) = 2. Margin of error = +/- 1.96 (1.92765 / 2). Report: 14.125 +/- 1.8891. A precise interpretation of margin of error will be given later in the course, including the role of 1.96. The interval 14.125 +/- 1.8891 is called a “95% confidence interval for the population mean.” We used: (12.2-14.125)^2 + (15.3-14.125)^2 + (16.2-14.125)^2 + (12.8-14.125)^2 = 11.1475.
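The calculation above can be reproduced step by step. This is a minimal sketch using the standard-library `statistics` module, whose `stdev` divides by n - 1; the variable names are illustrative.

```python
import math
from statistics import mean, stdev

profits = [12.2, 15.3, 16.2, 12.8]
n = len(profits)

xbar = mean(profits)
sum_sq_dev = sum((x - xbar) ** 2 for x in profits)  # 11.1475 on the slide
s = stdev(profits)                                  # root(sum_sq_dev / (n - 1))
margin = 1.96 * s / math.sqrt(n)

print(round(xbar, 3), round(sum_sq_dev, 4), round(s, 5), round(margin, 4))
print(f"Report: {xbar:.3f} +/- {margin:.4f}")
```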
EXAMPLE: ESTIMATING A PERCENTAGE. A random with-replacement sample of 50 stores participated in a test marketing. In 39 of these 50 stores (i.e. 78%) the new package design outsold the old package design. We estimate from this sample that 78% of all stores will sell more of new vs old. We also estimate a “margin of error” of +/- 11.5%. Figured (in the Binomial setup): 1.96 root(pHAT qHAT)/root(n) = 1.96 root(0.78 × 0.22)/root(50) = 0.114823.
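The same arithmetic in code, as a minimal sketch; `proportion_margin` is an illustrative name.

```python
import math

def proportion_margin(successes, n, z=1.96):
    """Margin of error z * root(pHAT qHAT) / root(n) for a sample proportion."""
    p_hat = successes / n
    q_hat = 1.0 - p_hat
    return z * math.sqrt(p_hat * q_hat) / math.sqrt(n)

p_hat = 39 / 50
margin = proportion_margin(39, 50)
print(f"{p_hat:.0%} +/- {margin:.1%}")  # 78% +/- 11.5%
```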
SAMPLING ONLY 600 FROM 500 MILLION? A sample of only n = 600 from a population of N = 500 million. Sample mean = 32.84; POP mean = 32.02. The FINE resolution densities are very close. (Figure label: population of N = 500,000 with a sample of n = 600.)