Chapter 4. Elements of Statistics
# A brief introduction to some concepts of statistics
# Two branches: descriptive statistics and inductive statistics (statistical inference)
# Classification of the field of statistics
i) Sampling theory
ii) Estimation theory
iii) Hypothesis testing
iv) Curve fitting or regression
v) Analysis of variance
4.2 Sampling Theory – The Sample Mean
How many samples are required for a given degree of confidence in the result?
# Terminology
- population: size N, either very large or infinite
- (random) sample: size n, drawn at random from the population
# One of the most important quantities is the sample mean.
How close is the sample mean likely to be to the average value of the population?
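Let the sample have the numerical values $x_1, x_2, \ldots, x_n$. Then the sample mean is given by
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Note that we are interested in the statistical properties of arbitrary random samples rather than those of any particular sample. That is, the sample mean becomes a random variable. Therefore, it is appropriate to denote the sample mean as
$\hat{\bar{X}} = \frac{1}{n}\sum_{i=1}^{n} X_i$
where the $X_i$ are random variables drawn from the population.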
We want the mean value of the sample mean to be close to the true mean value $\bar{X}$ of the population:
$E[\hat{\bar{X}}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n\bar{X} = \bar{X}$
the mean value of the sample mean = the true mean value of the population
The sample mean is an unbiased estimate of the true mean. But this alone is not sufficient to indicate whether the sample mean is a good estimator of the true population mean; its variance must also be small.
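The variance of the sample mean?
Assume $N \gg n$ (the population is much larger than the sample).
Var = mean square of $\hat{\bar{X}}$ - square of the mean:
$\mathrm{Var}(\hat{\bar{X}}) = E[\hat{\bar{X}}^2] - (\bar{X})^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} E[X_i X_j] - (\bar{X})^2$
$X_i, X_j$ ($i \neq j$): statistically indep., so $E[X_i X_j] = \overline{X^2}$ for $i = j$ and $(\bar{X})^2$ for $i \neq j$.
$\mathrm{Var}(\hat{\bar{X}}) = \frac{1}{n^2}\left[n\,\overline{X^2} + (n^2 - n)(\bar{X})^2\right] - (\bar{X})^2 = \frac{\overline{X^2} - (\bar{X})^2}{n} = \frac{\sigma^2}{n}$   (4-4) (important!)
where $\sigma^2$ is the true variance of the population. As $n \to \infty$, the variance $\to 0$, which means that large sample sizes lead to a better estimate.
* Two cases:
1) $N \to \infty$, or $N$ finite and sampling with replacement: $\mathrm{Var}(\hat{\bar{X}}) = \frac{\sigma^2}{n}$
2) $N$ finite and sampling without replacement: $\mathrm{Var}(\hat{\bar{X}}) = \frac{\sigma^2}{n}\cdot\frac{N - n}{N - 1}$
As $N \to \infty$ this reduces to $\sigma^2/n$; when $N = n$ the variance is 0 (the whole population has been sampled, so the sample mean is exact!).
Two examples: pp. 163-165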
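The two cases for the variance of the sample mean can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sample_mean_variance` and `required_sample_size` are hypothetical (not from the text), and the second function simply inverts $\sigma^2/n$ for the infinite-population case.

```python
import math


def sample_mean_variance(sigma2, n, N=None):
    """Variance of the sample mean for a sample of size n.

    sigma2 : true population variance
    N      : population size for sampling WITHOUT replacement;
             None means an infinite population or sampling with replacement.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if N is None:
        return sigma2 / n
    if n > N:
        raise ValueError("n cannot exceed N when sampling without replacement")
    if N == 1:
        return 0.0  # the single element is the whole population
    # Finite-population correction factor (N - n) / (N - 1)
    return (sigma2 / n) * (N - n) / (N - 1)


def required_sample_size(sigma2, target_variance):
    """Smallest n such that sigma2 / n <= target_variance (infinite population)."""
    return math.ceil(sigma2 / target_variance)
```

For instance, `sample_mean_variance(4.0, 10, N=10)` gives 0, matching the remark that sampling the entire population leaves no uncertainty in the mean.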
4.3 Sampling Theory – The Sample Variance
The population variance is needed for determining the sample size required to achieve a desired variance of the sample mean (see eq. 4-4).
Definition (Sample Variance):
$S^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \hat{\bar{X}}\right)^2$
The expected value of the sample variance can be derived easily:
$E[S^2] = \frac{n-1}{n}\,\sigma^2$
This is not the true variance; that is, $S^2$ is a biased estimate rather than an unbiased one.
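Now we redefine the sample variance to obtain an unbiased estimate of the population variance:
$\tilde{S}^2 = \frac{n}{n-1}\,S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \hat{\bar{X}}\right)^2, \qquad E[\tilde{S}^2] = \sigma^2$
Note that these results hold for very large N, that is, $N = \infty$. What about when the population size is not large?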
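# When N is not large, the expected value of $S^2$ is given by
$E[S^2] = \frac{N}{N-1}\cdot\frac{n-1}{n}\,\sigma^2$
To obtain an unbiased estimate, we redefine
$\tilde{S}^2 = \frac{N-1}{N}\cdot\frac{n}{n-1}\,S^2$
# The variance of the estimates of the variance (for $N = \infty$; the expressions are approximations that are good for large n):
the variance of $S^2$: $\mathrm{Var}(S^2) = \frac{\mu_4 - \sigma^4}{n}$
the variance of $\tilde{S}^2$: $\mathrm{Var}(\tilde{S}^2) = \frac{n(\mu_4 - \sigma^4)}{(n-1)^2}$
where $\mu_4 = E[(X - \bar{X})^4]$ is the 4th central moment of the population.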
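As a minimal sketch of the biased and unbiased variance estimates, the following Python computes the sample variance with the 1/n definition and then applies the n/(n-1) correction, with the optional (N-1)/N factor for a finite population. The function names are hypothetical and chosen only for this illustration.

```python
def sample_mean(xs):
    """Arithmetic mean of the sample values."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Biased sample variance S^2 (divides by n)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def unbiased_sample_variance(xs, N=None):
    """Unbiased estimate of the population variance.

    N is the population size when it is not large and the sample is
    taken without replacement; None means N is effectively infinite.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two samples")
    correction = n / (n - 1)
    if N is not None:
        correction *= (N - 1) / N
    return correction * sample_variance(xs)
```

With the made-up sample `[2, 4, 4, 4, 5, 5, 7, 9]`, the biased value is 4 while the unbiased value is 32/7, which shows how noticeable the correction is for small n.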
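4.4 Sampling Distributions & Confidence Intervals
What is the probability that the estimates are within specified bounds? To answer this, the p.d.f. of the estimate is needed. Here only the sample mean is considered!
Normalized sample mean:
$Z = \frac{\hat{\bar{X}} - \bar{X}}{\sigma/\sqrt{n}}$
$X_i$ Gaussian and independent => Z is Gaussian (0,1), i.e. zero mean and unit variance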
$X_i$ not Gaussian, $n \to \infty$ => Z is asymptotically Gaussian by the central limit theorem ($n \geq 30$; a rule of thumb)
H.W) Solve the problems in chap. 4: 4-2.1, 4-2.5, 4-3.1, 4-4.1, 4-5.1, 4-6.1
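When the true variance $\sigma^2$ is unknown and is replaced by its estimate, the normalized sample mean becomes
$T = \frac{\hat{\bar{X}} - \bar{X}}{\tilde{S}/\sqrt{n}} = \frac{\hat{\bar{X}} - \bar{X}}{S/\sqrt{n-1}}$
No longer Gaussian => Student's t distribution with $v = n - 1$ degrees of freedom (d.o.f.)
p.d.f. of Student's t distribution:
$f_T(t) = \frac{\Gamma\!\left(\frac{v+1}{2}\right)}{\sqrt{\pi v}\;\Gamma\!\left(\frac{v}{2}\right)}\left(1 + \frac{t^2}{v}\right)^{-(v+1)/2}$
where $\Gamma(\cdot)$ is the gamma function:
$\Gamma(k+1) = k\,\Gamma(k)$ for any k; $\Gamma(k+1) = k!$ for integer k
The t distribution has heavier tails than the Gaussian; for large n ($n \geq 30$) it is nearly Gaussian.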
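The Student's t density can be evaluated directly with the gamma function from Python's standard library. This is a sketch, not a library implementation: the names `student_t_pdf` and `gaussian_pdf` are hypothetical, and `math.lgamma` is used in place of `math.gamma` only to avoid overflow for a large number of degrees of freedom.

```python
import math


def student_t_pdf(t, v):
    """Student's t probability density with v degrees of freedom (v > 0)."""
    if v <= 0:
        raise ValueError("degrees of freedom must be positive")
    # Gamma((v+1)/2) / Gamma(v/2), computed through logarithms for stability
    gamma_ratio = math.exp(math.lgamma((v + 1) / 2) - math.lgamma(v / 2))
    return gamma_ratio / math.sqrt(math.pi * v) * (1 + t * t / v) ** (-(v + 1) / 2)


def gaussian_pdf(z):
    """Gaussian (0,1) probability density, for comparison."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
```

Comparing `student_t_pdf(3, 8)` with `gaussian_pdf(3)` shows the heavier tail of the t density, while for large v the two densities nearly coincide.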
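Confidence interval? An interval estimate (rather than a point estimate): a range within which the estimate lies with a stated probability.
q-percent confidence interval: the interval within which the estimate lies with probability q/100
$\bar{X} - \frac{k\sigma}{\sqrt{n}} \leq \hat{\bar{X}} \leq \bar{X} + \frac{k\sigma}{\sqrt{n}}$
k is determined by q and the p.d.f. of the normalized sample mean: $\Pr(-k \leq Z \leq k) = \int_{-k}^{k} f_Z(z)\,dz = \frac{q}{100}$
ex) Gaussian: q = 95% -> k = 1.96 (q = 99% -> k = 2.58!)
When $\sigma$ is unknown and the sample is small:
$\bar{X} - \frac{t\,\tilde{S}}{\sqrt{n}} \leq \hat{\bar{X}} \leq \bar{X} + \frac{t\,\tilde{S}}{\sqrt{n}}$
t: determined by q from the probability distribution function $F_T(t)$ for Student's t distribution (see Appendix F, or Table 4-2 on page 172 for v = 8)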
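For the Gaussian case, the multiplier k and the resulting interval can be computed with `statistics.NormalDist` (available in Python 3.8 and later). The sketch below assumes a two-sided interval centred on the true mean; the function names are hypothetical. The standard library has no Student's t quantile function, so for small samples the value of t would still come from a table.

```python
import math
from statistics import NormalDist


def gaussian_k(q):
    """k such that Pr(-k <= Z <= k) = q/100 for a Gaussian (0,1) variable Z."""
    if not 0 < q < 100:
        raise ValueError("q must be strictly between 0 and 100")
    # Probability q/100 in the centre leaves (1 - q/100)/2 in each tail.
    return NormalDist().inv_cdf(0.5 + q / 200)


def confidence_interval(true_mean, sigma, n, q):
    """q-percent interval for the sample mean of n samples (sigma known)."""
    half_width = gaussian_k(q) * sigma / math.sqrt(n)
    return true_mean - half_width, true_mean + half_width
```

Calling `gaussian_k(95)` and `gaussian_k(99)` gives values that round to the familiar 1.96 and 2.58.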
4.5 Hypothesis Testing
The question arises: how does one decide to accept or reject a given hypothesis when the sample size and the confidence level are specified?
Two steps:
i) make some hypothesis about the population
ii) determine whether the observed sample confirms or rejects this hypothesis
Two tests: one-sided or two-sided
ex) one-sided: the average lifetime of a light bulb >= 1000 hours
ex) two-sided: 100-ohm resistors whose value must be neither too high nor too low
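One-sided test
ex) A capacitor manufacturer claims a mean value of breakdown voltage >= 300 V. A sample of 100 capacitors is tested, and a 99% confidence level is used.
Q) Is the manufacturer's claim valid?
Normalized r.v.: $Z = \frac{\hat{\bar{X}} - 300}{\tilde{S}/\sqrt{n}}$ with n = 100 (large sample, so Z is approximately Gaussian)
99%: critical value $z_c = -2.33$, from $\Pr(Z \geq z_c) = 0.99$; the hypothesis is accepted if $Z \geq z_c$
A) The observed value of Z falls below $z_c$: we would reject the hypothesis!
99.5%: $z_c = -2.576$; the observed Z is above $z_c$ – accept the hypothesis
A higher confidence level makes rejection less likely, so the lower confidence level is the more severe requirement.
(level of significance) = (100% - confidence level); a larger level of significance is more severe!
ex) sample size = 9: no longer Gaussian -> Student's t distribution with v = n - 1 = 8 d.o.f.
99%: $t_c = -2.896$; the observed T is above $t_c$ – accept the hypothesis
With a small sample size, the t distribution has heavier tails, so the critical value lies farther out and the statistic is more likely to exceed it (the hypothesis is more likely to be accepted). Small-sample tests are therefore less reliable (less severe) than large-sample tests.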
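The one-sided test can be sketched in Python as follows. The sample statistics used in the usage note (sample mean 290 V, unbiased sample standard deviation 40 V) are assumed values chosen for illustration, since the observed values are not given in the text above; the function names are also hypothetical. The t critical value is passed in from a table because the standard library has no t quantile function.

```python
import math
from statistics import NormalDist


def normalized_statistic(sample_mean, claimed_mean, s_unbiased, n):
    """(sample mean - claimed mean) / (unbiased sample std / sqrt(n))."""
    return (sample_mean - claimed_mean) / (s_unbiased / math.sqrt(n))


def one_sided_z_test(sample_mean, claimed_mean, s_unbiased, n, confidence):
    """Large-sample test of the claim 'true mean >= claimed_mean'.

    confidence is a fraction such as 0.99.
    Returns (z, z_c, accepted); the claim is accepted when z >= z_c.
    """
    z = normalized_statistic(sample_mean, claimed_mean, s_unbiased, n)
    z_c = NormalDist().inv_cdf(1 - confidence)  # negative critical value
    return z, z_c, z >= z_c


def one_sided_t_test(sample_mean, claimed_mean, s_unbiased, n, t_table_value):
    """Small-sample version; t_table_value is read from a Student's t table
    for v = n - 1 degrees of freedom at the chosen confidence level."""
    t = normalized_statistic(sample_mean, claimed_mean, s_unbiased, n)
    t_c = -abs(t_table_value)
    return t, t_c, t >= t_c
```

With the assumed numbers, `one_sided_z_test(290.0, 300.0, 40.0, 100, 0.99)` rejects the claim, the same call with `0.995` accepts it, and `one_sided_t_test(290.0, 300.0, 40.0, 9, 2.896)` accepts it, which is the pattern of outcomes described above.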
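Two-sided test
ex) A manufacturer of Zener diodes claims that the true mean breakdown voltage = 10 V.
hypothesis: the true mean breakdown voltage is 10 V. Accept or reject?
100 samples -> 95%: $Z = \frac{\hat{\bar{X}} - 10}{\tilde{S}/\sqrt{n}}$; accept if $-z_c \leq Z \leq z_c$ with $z_c = 1.96$
A) Rejected! z is outside the interval (-1.96, 1.96)
ex) 9 samples: Student's t with v = 8 d.o.f.; accept if $-t_c \leq T \leq t_c$ with $t_c = 2.306$
t is inside the interval (-2.306, 2.306): accepted! – Less severe than a large-sample test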
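A two-sided test differs only in that the statistic must fall between two symmetric critical values. The sketch below is illustrative: the function name `two_sided_test` is hypothetical, and the sample statistics in the usage note (sample mean 10.3 V, unbiased sample standard deviation 1.2 V) are assumed values, not data from the text.

```python
import math
from statistics import NormalDist


def two_sided_test(sample_mean, claimed_mean, s_unbiased, n, confidence,
                   t_table_value=None):
    """Test the claim 'true mean == claimed_mean'.

    confidence    : fraction such as 0.95
    t_table_value : for small samples, the two-sided critical value read
                    from a Student's t table (v = n - 1); if None, the
                    Gaussian critical value is used (large sample).
    Returns (statistic, critical, accepted).
    """
    stat = (sample_mean - claimed_mean) / (s_unbiased / math.sqrt(n))
    if t_table_value is None:
        # Split the rejection probability equally between the two tails.
        critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    else:
        critical = abs(t_table_value)
    return stat, critical, -critical <= stat <= critical
```

With the assumed numbers, `two_sided_test(10.3, 10.0, 1.2, 100, 0.95)` rejects the claim, whereas `two_sided_test(10.3, 10.0, 1.2, 9, 0.95, t_table_value=2.306)` accepts it.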
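4.6 Curve Fitting and Linear Regression
Given paired data (x, y), two ways to study the relationship between x and y:
1. (linear) regression: fit an equation that relates y to x, or
2. correlation analysis: measure how strongly x and y are related.
– Scatter diagram: a plot of the data (n samples) as points $(x_i, y_i)$
- Curve fitting: to find a mathematical relationship between x and y
  regression curve (equation): the resulting curve
- What is the best fit? In a least-squares sense:
– Let $e_i$ be the errors between the regression curve and the scatter diagram.
– Choose the curve so that $\sum_{i=1}^{n} e_i^2$ is a minimum.
– The type of equation to be fitted to the data must be chosen first; a low-order curve fitted to n points gives smoothing.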
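Linear regression: $y = a + bx$. a, b?
Minimize $\sum_{i=1}^{n}\left[y_i - (a + b x_i)\right]^2$ with respect to a and b (set both partial derivatives to zero):
$b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad a = \frac{\sum y_i - b\sum x_i}{n}$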
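MATLAB built-in function: p = polyfit(x, y, n) returns the coefficients of the order-n polynomial that fits y to x in the least-squares sense (n = 1 for linear regression)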
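For readers without MATLAB, a first-order least-squares fit can be sketched in plain Python using the closed-form solution of the normal equations for the line y = a + bx. The function name `linear_regression` is hypothetical and the code is a minimal illustration rather than a replacement for a numerical library.

```python
def linear_regression(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("need at least two data points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_xx - sum_x * sum_x
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom  # slope
    a = (sum_y - b * sum_x) / n               # intercept
    return a, b
```

Points that lie exactly on a line, such as x = 0, 1, 2, 3 with y = 1, 3, 5, 7, return the intercept 1 and slope 2 of that line.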
A second-order regression: $y = a + bx + cx^2$ (p. 180, 4-3, 4-6)
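A second-order fit leads to three normal equations in the three coefficients. The sketch below builds those equations from power sums and solves them by Gaussian elimination; both function names are hypothetical, and the approach is one simple option (it can become ill-conditioned for high orders or widely spread x values). Coefficients are returned in ascending order, which is the reverse of MATLAB's polyfit ordering.

```python
def solve_linear_system(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(row) + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("system is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def polynomial_regression(xs, ys, order=2):
    """Least-squares polynomial fit; returns [a, b, c, ...] for
    y = a + b*x + c*x**2 + ... (ascending powers)."""
    if len(xs) != len(ys) or len(xs) <= order:
        raise ValueError("need matching lists with more points than the order")
    m = order + 1
    power_sums = [sum(x ** k for x in xs) for k in range(2 * order + 1)]
    A = [[power_sums[i + j] for j in range(m)] for i in range(m)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    return solve_linear_system(A, rhs)
```

Data generated from y = 1 + 2x + 3x² at x = -2, ..., 2 is recovered as coefficients close to 1, 2 and 3.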
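4.7 Correlation between Two Sets of Data
Are two data sets correlated or not?
Linear correlation coefficient (Pearson's r):
$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2\,\sum_{i=1}^{n}(y_i - \bar{y})^2}}, \qquad -1 \leq r \leq 1$
Usage: useful in determining the sources of errors
ex) A point-to-point digital communication link: the BER (Bit Error Rate) measures link quality. The BER may fluctuate randomly due to wind.
Q) Is wind an error source? 20 wind measurements and the resulting BER -> correlation test: r = 0.891 -> yes!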
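Pearson's r can be computed directly from its definition. The following is a minimal sketch with a hypothetical function name; the wind and BER measurements themselves are not reproduced in the text, so no attempt is made here to recreate the r = 0.891 result.

```python
import math


def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient of two equal-length data sets."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("need at least two data pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("a constant data set has no defined correlation")
    return s_xy / math.sqrt(s_xx * s_yy)
```

Values of r near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none; for example, the pairs (1, 1), (2, 3), (3, 2), (4, 4) give r = 0.8.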