
1 Brief Review Probability and Statistics

2 Probability distributions: Continuous distributions

3 Defn (density function): Let x denote a continuous random variable. Then f(x) is called the density function of x if: 1) f(x) ≥ 0, 2) ∫ f(x) dx = 1 (integrating over −∞ < x < ∞), and 3) P[a ≤ x ≤ b] = the integral of f(x) from a to b.

4 Examples of some important Univariate distributions

5 1. The Normal distribution. A common probability density curve is the "Normal" density curve: symmetric and bell shaped. Comment: if μ = 0 and σ = 1 the distribution is called the standard normal distribution. [Plots: a Normal distribution with μ = 50 and σ = 15, and a Normal distribution with μ = 70 and σ = 20.]

6 The Normal density with mean μ and standard deviation σ: f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)), −∞ < x < ∞.
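
To make the density-function conditions concrete, here is a minimal Python sketch (NumPy is an assumption of this review, not part of the slides) that checks them numerically for the Normal curve with μ = 50 and σ = 15 from the slide above:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Normal density f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-200.0, 300.0, 200_001)
f = normal_pdf(x, mu=50.0, sigma=15.0)
print(f.min() >= 0)                 # condition 1: f(x) >= 0 everywhere
print(np.trapz(f, x))               # condition 2: integral over the line ~ 1.0
mask = (x >= 35.0) & (x <= 65.0)
print(np.trapz(f[mask], x[mask]))   # condition 3: P[35 <= x <= 65] ~ 0.68
```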

7 2. The Chi-squared distribution with ν degrees of freedom: f(u) = (1/(2^(ν/2) Γ(ν/2))) u^(ν/2 − 1) e^(−u/2) for u ≥ 0.

8 [Plots of the chi-squared density for several values of ν.]

9 Comment: If z1, z2, ..., zν are independent random variables each having a standard normal distribution, then U = z1² + z2² + ... + zν² has a chi-squared distribution with ν degrees of freedom.

10 3. The F distribution with ν1 degrees of freedom in the numerator and ν2 degrees of freedom in the denominator: f(x) = K x^(ν1/2 − 1) (1 + ν1x/ν2)^(−(ν1 + ν2)/2) if x ≥ 0, where K = (ν1/ν2)^(ν1/2) Γ((ν1 + ν2)/2) / (Γ(ν1/2) Γ(ν2/2)).

11 [Plots of the F density for several pairs (ν1, ν2).]

12 Comment: If U1 and U2 are independent random variables having chi-squared distributions with ν1 and ν2 degrees of freedom respectively, then F = (U1/ν1)/(U2/ν2) has an F distribution with ν1 degrees of freedom in the numerator and ν2 degrees of freedom in the denominator.

13 4. The t distribution with ν degrees of freedom: f(t) = K (1 + t²/ν)^(−(ν + 1)/2), where K = Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)).

14 [Plots of the t density for several values of ν, compared with the standard normal.]

15 Comment: If z and U are independent random variables, and z has a standard Normal distribution while U has a chi-squared distribution with ν degrees of freedom, then t = z/√(U/ν) has a t distribution with ν degrees of freedom.
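
The three constructions above (chi-squared from squared normals, F from two chi-squareds, t from a normal and a chi-squared) are easy to verify by simulation. A small sketch, with arbitrary degrees of freedom and sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
nu1, nu2, n = 5, 10, 200_000

z = rng.standard_normal((n, nu1))
U1 = (z ** 2).sum(axis=1)                  # sum of nu1 squared standard normals
print(U1.mean(), U1.var())                 # ~ nu1 and 2*nu1: chi-squared moments

U2 = (rng.standard_normal((n, nu2)) ** 2).sum(axis=1)
F = (U1 / nu1) / (U2 / nu2)                # F with (nu1, nu2) degrees of freedom
print(F.mean())                            # ~ nu2/(nu2 - 2) = 1.25

t = rng.standard_normal(n) / np.sqrt(U2 / nu2)   # t with nu2 degrees of freedom
print(t.var())                             # ~ nu2/(nu2 - 2) = 1.25
```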

16 Multivariate Distributions

17 Defn (Joint density function): Let x = (x1, x2, ..., xn) denote a vector of continuous random variables. Then f(x) = f(x1, x2, ..., xn) is called the joint density function of x if: 1) f(x) ≥ 0, 2) ∫...∫ f(x1, ..., xn) dx1...dxn = 1 (integrating over all of n-dimensional space), and 3) P[x ∈ A] = ∫...∫ over A of f(x1, ..., xn) dx1...dxn.


19 Defn (Marginal density function): The marginal density of x1 = (x1, x2, ..., xp) (p < n) is defined by: f1(x1) = ∫...∫ f(x1, x2) dxp+1...dxn, where x2 = (xp+1, xp+2, ..., xn). Similarly, the marginal density of x2 = (xp+1, xp+2, ..., xn) is defined by: f2(x2) = ∫...∫ f(x1, x2) dx1...dxp, where x1 = (x1, x2, ..., xp).

20 Defn (Conditional density function): The conditional density of x1 given x2 (as defined on the previous slide, p < n) is: f1|2(x1|x2) = f(x1, x2)/f2(x2). The conditional density of x2 given x1 is: f2|1(x2|x1) = f(x1, x2)/f1(x1).

21 Marginal densities describe how the subvector xi behaves ignoring xj. Conditional densities describe how the subvector xi behaves when the subvector xj is held fixed.

22 Defn (Independence): The two sub-vectors x1 and x2 are called independent if: f(x) = f(x1, x2) = f1(x1) f2(x2), the product of the marginals, or equivalently if the conditional density of xi given xj equals the marginal density of xi: fi|j(xi|xj) = fi(xi).

23 Example (p-variate Normal): The random vector x (p × 1) is said to have the p-variate Normal distribution with mean vector μ (p × 1) and covariance matrix Σ (p × p, a positive definite matrix), written x ~ Np(μ, Σ), if: f(x) = (2π)^(−p/2) |Σ|^(−1/2) exp(−(x − μ)′Σ⁻¹(x − μ)/2).
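
As a sketch, this density can be coded directly from the formula (the μ, Σ and evaluation point below are arbitrary illustrations):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """p-variate Normal density evaluated directly from the formula above."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff          # (x - mu)' Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * quad)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
print(mvn_pdf(np.array([0.5, -0.2]), mu, Sigma))
```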

24 Example (bivariate Normal): The random vector x = (x1, x2)′ is said to have the bivariate Normal distribution with mean vector μ = (μ1, μ2)′ and covariance matrix Σ = [σ11 σ12; σ12 σ22].

25-26 [Surface and contour plots of the bivariate Normal density.]

27 The Univariate Normal distribution (mean μ, variance σ²).

28 [Plot of the univariate Normal density.]

29 Theorem (Transformations): Let x = (x1, x2, ..., xn) denote a vector of continuous random variables with joint density function f(x1, ..., xn) = f(x). Let y1 = φ1(x1, ..., xn), y2 = φ2(x1, ..., xn), ..., yn = φn(x1, ..., xn) define a 1-1 transformation of x into y.

30 Then the joint density of y is g(y) given by: g(y) = f(x(y)) |J|, where J = det[∂xi/∂yj] is the Jacobian of the transformation (x written as a function of y).

31 Corollary (Linear Transformations): Let x = (x1, x2, ..., xn) denote a vector of continuous random variables with joint density function f(x1, ..., xn) = f(x). Let y1 = a11x1 + a12x2 + ... + a1nxn, y2 = a21x1 + a22x2 + ... + a2nxn, ..., yn = an1x1 + an2x2 + ... + annxn define a 1-1 transformation of x into y, i.e. y = Ax with A = (aij) nonsingular.

32 Then the joint density of y is g(y) given by: g(y) = f(A⁻¹y) / |det A|.

33 Corollary (Linear Transformations of Normal Random variables): Let x = (x1, x2, ..., xn) denote a vector of continuous random variables having an n-variate Normal distribution with mean vector μ and covariance matrix Σ, i.e. x ~ Nn(μ, Σ). Let y1 = a11x1 + ... + a1nxn, ..., yn = an1x1 + ... + annxn define a 1-1 transformation of x into y. Then y = (y1, y2, ..., yn) ~ Nn(Aμ, AΣA′).
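
A simulation sketch of the corollary, with an arbitrary μ, Σ and a nonsingular A (all values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 1.0,  0.0],
              [0.0, 1.0, -1.0],
              [2.0, 0.0,  1.0]])     # nonsingular, so y = Ax is a 1-1 transformation

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T
print(y.mean(axis=0), A @ mu)                       # sample mean ~ A mu
print(np.cov(y, rowvar=False))                      # sample covariance ~ A Sigma A'
print(A @ Sigma @ A.T)
```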

34 Defn (Expectation): Let x = (x1, ..., xn) denote a vector of continuous random variables with joint density function f(x) = f(x1, ..., xn). Let U = h(x) = h(x1, ..., xn). Then E[U] = ∫...∫ h(x1, ..., xn) f(x1, ..., xn) dx1...dxn.

35 Defn (Conditional Expectation): Let x = (x1, ..., xn) = (x1, x2) denote a vector of continuous random variables with joint density function f(x) = f(x1, ..., xn) = f(x1, x2). Let U = h(x1) = h(x1, ..., xp). Then the conditional expectation of U given x2 is E[U|x2] = ∫...∫ h(x1, ..., xp) f1|2(x1|x2) dx1...dxp.

36 Defn (Variance): Let x = (x1, ..., xn) denote a vector of continuous random variables with joint density function f(x) = f(x1, ..., xn). Let U = h(x) = h(x1, ..., xn). Then Var[U] = E[(U − E[U])²] = E[U²] − (E[U])².

37 Defn (Conditional Variance): Let x = (x1, ..., xn) = (x1, x2) denote a vector of continuous random variables with joint density function f(x) = f(x1, ..., xn) = f(x1, x2). Let U = h(x1) = h(x1, ..., xp). Then the conditional variance of U given x2 is Var[U|x2] = E[(U − E[U|x2])² | x2].

38 Defn (Covariance, Correlation): Let x = (x1, ..., xn) denote a vector of continuous random variables with joint density function f(x) = f(x1, ..., xn). Let U = h(x) = h(x1, ..., xn) and V = g(x) = g(x1, ..., xn). Then the covariance of U and V is Cov[U, V] = E[(U − E[U])(V − E[V])] = E[UV] − E[U]E[V], and the correlation of U and V is ρUV = Cov[U, V]/√(Var[U] Var[V]).

39 Properties of Expectation, Variance, Covariance and Correlation.

40 1. E[a1x1 + a2x2 + a3x3 + ... + anxn] = a1E[x1] + a2E[x2] + a3E[x3] + ... + anE[xn], or E[a′x] = a′E[x].

41 2. E[UV] = E[h(x1)g(x2)] = E[U]E[V] = E[h(x1)]E[g(x2)] if x1 and x2 are independent.

42 3. Var[a1x1 + a2x2 + ... + anxn] = Σi ai² Var[xi] + 2 Σi<j ai aj Cov[xi, xj], or Var[a′x] = a′Σa.

43 4. Cov[a1x1 + a2x2 + ... + anxn, b1x1 + b2x2 + ... + bnxn] = Σi Σj ai bj Cov[xi, xj], or Cov[a′x, b′x] = a′Σb.

44 5. and 6. [Two further properties; the formulas were not captured in the transcript.]

45 Statistical Inference: Making decisions from data.

46 There are two main areas of Statistical Inference: Estimation (deciding on the value of a parameter): point estimation, and confidence interval / confidence region estimation. Hypothesis testing: deciding whether a statement (hypothesis) about a parameter is True or False.

47 The general statistical model. Most data sets fit this situation.

48 Defn (The Classical Statistical Model): The data vector x = (x1, x2, ..., xn). The model: let f(x|θ) = f(x1, x2, ..., xn | θ1, θ2, ..., θp) denote the joint density of the data vector x = (x1, x2, ..., xn) of observations, where the unknown parameter vector θ ∈ Ω (a subset of p-dimensional space).

49 An Example. The data vector x = (x1, x2, ..., xn) is a sample from the normal distribution with mean μ and variance σ². The model: f(x|μ, σ²) = f(x1, ..., xn | μ, σ²), the joint density of x, takes on the form f(x|μ, σ²) = (2πσ²)^(−n/2) exp(−Σ(xi − μ)²/(2σ²)), where the unknown parameter vector θ = (μ, σ²) ∈ Ω = {(x, y) | −∞ < x < ∞, 0 ≤ y < ∞}.

50 Defn (Sufficient Statistics): Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then S = (S1(x), S2(x), ..., Sk(x)) is called a set of sufficient statistics for the parameter vector θ if the conditional distribution of x given S = (S1(x), ..., Sk(x)) is not functionally dependent on the parameter vector θ. A set of sufficient statistics contains all of the information concerning the unknown parameter vector.

51 A Simple Example illustrating Sufficiency. Suppose that we observe a Success-Failure experiment n = 3 times. Let θ denote the probability of Success. The data collected are x1, x2, x3, where xi takes on the value 1 if the i-th trial is a Success and 0 if the i-th trial is a Failure.

52 The following table gives the possible values of (x1, x2, x3) together with S = x1 + x2 + x3, the number of Successes. The data can be generated in two equivalent ways: 1. Generating (x1, x2, x3) directly from f(x1, x2, x3 | θ), or 2. Generating S from g(S|θ), then generating (x1, x2, x3) from f(x1, x2, x3 | S). Since the second step does not involve θ, no additional information about θ is obtained by knowing (x1, x2, x3) once S is determined.

53 The Sufficiency Principle: Any decision regarding the parameter θ should be based on a set of sufficient statistics S1(x), S2(x), ..., Sk(x) and not otherwise on the value of x.

54 A useful approach in developing a statistical procedure: 1. Find sufficient statistics. 2. Develop estimators, tests of hypotheses etc. using only these statistics.

55 Defn (Minimal Sufficient Statistics): Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then S = (S1(x), S2(x), ..., Sk(x)) is a set of Minimal Sufficient statistics for the parameter vector θ if S is a set of sufficient statistics and can be calculated from any other set of sufficient statistics.

56 Theorem (The Factorization Criterion): Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then S = (S1(x), S2(x), ..., Sk(x)) is a set of sufficient statistics for the parameter vector θ if and only if f(x|θ) = h(x)g(S, θ) = h(x)g(S1(x), ..., Sk(x), θ). This is useful for finding sufficient statistics: if you can factor the θ-dependence through a set of statistics, then those statistics are a set of sufficient statistics.
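
For the Success-Failure example above, the factorization criterion shows directly that S = x1 + x2 + x3 is sufficient. A worked version of the factorization:

```latex
f(x_1,x_2,x_3 \mid \theta)
  = \prod_{i=1}^{3}\theta^{x_i}(1-\theta)^{1-x_i}
  = \theta^{S}(1-\theta)^{3-S},
  \qquad S = \sum_{i=1}^{3} x_i .
```

Here h(x) = 1 and g(S, θ) = θ^S (1 − θ)^(3−S), so S is a sufficient statistic for θ.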

57 Defn (Completeness): Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then S = (S1(x), S2(x), ..., Sk(x)) is a set of Complete Sufficient statistics for the parameter vector θ if S is a set of sufficient statistics and whenever E[φ(S1(x), ..., Sk(x))] = 0 for all θ ∈ Ω, then P[φ(S1(x), ..., Sk(x)) = 0] = 1.

58 Defn (The Exponential Family): Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then f(x|θ) is said to be a member of the exponential family of distributions if: f(x|θ) = g(θ) h(x) exp[p1(θ)S1(x) + ... + pk(θ)Sk(x)] for ai < xi < bi (i = 1, ..., n) and θ ∈ Ω, where:

59 1) −∞ < ai < bi < ∞ are not dependent on θ. 2) Ω contains a nondegenerate k-dimensional rectangle. 3) g(θ), ai, bi and pi(θ) are not dependent on x. 4) h(x), ai, bi and Si(x) are not dependent on θ.

60 If in addition: 5) the Si(x) are functionally independent for i = 1, 2, ..., k; 6) ∂Si(x)/∂xj exists and is continuous for all i = 1, 2, ..., k and j = 1, 2, ..., n; 7) pi(θ) is a continuous function of θ for all i = 1, 2, ..., k; 8) R = {[p1(θ), p2(θ), ..., pk(θ)] | θ ∈ Ω} contains a nondegenerate k-dimensional rectangle; then the set of statistics S1(x), S2(x), ..., Sk(x) forms a Minimal Complete set of Sufficient statistics.
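
As a worked instance, the normal sample of the earlier example can be put in this exponential-family form, which identifies its minimal complete sufficient statistics:

```latex
f(\mathbf{x}\mid\mu,\sigma^2)
  = \underbrace{(2\pi\sigma^2)^{-n/2}
      e^{-n\mu^2/(2\sigma^2)}}_{g(\theta)}
    \cdot \underbrace{1}_{h(\mathbf{x})}
    \cdot \exp\!\Big[\underbrace{-\tfrac{1}{2\sigma^2}}_{p_1(\theta)}
                      \underbrace{\textstyle\sum_i x_i^2}_{S_1(\mathbf{x})}
                   + \underbrace{\tfrac{\mu}{\sigma^2}}_{p_2(\theta)}
                      \underbrace{\textstyle\sum_i x_i}_{S_2(\mathbf{x})}\Big].
```

Conditions 1)-8) hold here, so (Σxi, Σxi²) is a minimal complete set of sufficient statistics for (μ, σ²).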

61 Defn (The Likelihood function): Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then for a given value of the observation vector x, the Likelihood function Lx(θ) is defined by: Lx(θ) = f(x|θ) with θ ∈ Ω. The log-likelihood function lx(θ) is defined by: lx(θ) = ln Lx(θ) = ln f(x|θ) with θ ∈ Ω.

62 The Likelihood Principle: Any decision regarding the parameter θ should be based on the likelihood function Lx(θ) and not otherwise on the value of x. If two data sets result in the same likelihood function, the decision regarding θ should be the same.

63 Some statisticians find it useful to plot the likelihood function Lx(θ) given the value of x. It summarizes the information contained in x regarding the parameter vector θ.

64 An Example. The data vector x = (x1, x2, ..., xn) is a sample from the normal distribution with mean μ and variance σ². The joint distribution of x: f(x|μ, σ²) = f(x1, ..., xn | μ, σ²) takes on the form f(x|μ, σ²) = (2πσ²)^(−n/2) exp(−Σ(xi − μ)²/(2σ²)), where the unknown parameter vector θ = (μ, σ²) ∈ Ω = {(x, y) | −∞ < x < ∞, 0 ≤ y < ∞}.

65 The Likelihood function. Assume the data vector x = (x1, x2, ..., xn) is known. Then L(μ, σ) = f(x|μ, σ) = f(x1, ..., xn | μ, σ²),

66 or L(μ, σ) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σ(xi − μ)²),

67 hence l(μ, σ) = ln L(μ, σ) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ(xi − μ)². Now consider the following data (n = 10): [data table not captured in the transcript].

68-69 [Contour and surface plots of the likelihood function L(μ, σ) for the n = 10 data set; μ axis from 0 to 70.]

70 Now consider the following data (n = 100): [data table not captured in the transcript].

71-72 [Contour and surface plots of the likelihood function L(μ, σ) for the n = 100 data set; μ axis from 0 to 70.]

73 The Sufficiency Principle: Any decision regarding the parameter θ should be based on a set of sufficient statistics S1(x), S2(x), ..., Sk(x) and not otherwise on the value of x. If two data sets result in the same values for the set of sufficient statistics, the decision regarding θ should be the same.

74 Theorem (Birnbaum: Equivalency of the Likelihood Principle and Sufficiency Principle): Lx1(θ) = K × Lx2(θ) for some constant K if and only if S1(x1) = S1(x2), ..., and Sk(x1) = Sk(x2).

75 [Table of the possible values of (x1, x2, x3) and the corresponding likelihood functions θ^S(1 − θ)^(3−S).]

76 Estimation Theory: Point Estimation.

77 Defn (Estimator): Let x = (x1, x2, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then an estimator of the parameter φ(θ) = φ(θ1, θ2, ..., θp) is any function T(x) = T(x1, x2, ..., xn) of the observation vector.

78 Defn (Mean Square Error): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let T(x) be an estimator of the parameter φ(θ). Then the Mean Square Error of T(x) is defined to be: MSEθ[T(x)] = E[(T(x) − φ(θ))²].

79 Defn (Uniformly Better): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let T(x) and T*(x) be estimators of the parameter φ(θ). Then T(x) is said to be uniformly better than T*(x) if: MSEθ[T(x)] ≤ MSEθ[T*(x)] for all θ ∈ Ω.

80 Defn (Unbiased): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let T(x) be an estimator of the parameter φ(θ). Then T(x) is said to be an unbiased estimator of the parameter φ(θ) if: E[T(x)] = φ(θ) for all θ ∈ Ω.

81 Theorem (Cramer-Rao Lower Bound): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Suppose that the standard regularity conditions hold; in particular: i) ∂ln f(x|θ)/∂θi exists for all x and for all θ ∈ Ω, and ii)-iv) differentiation with respect to θ and integration over x can be interchanged for f(x|θ) and for T(x)f(x|θ). [Conditions ii)-iv) were given as formulas not captured in the transcript.]

82 Let M denote the p × p matrix with ij-th element Mij = E[(∂ln f(x|θ)/∂θi)(∂ln f(x|θ)/∂θj)] (the Fisher information matrix). Then V = M⁻¹ is the lower bound for the covariance matrix of unbiased estimators of θ. That is, var(c′θ̂) = c′var(θ̂)c ≥ c′M⁻¹c = c′Vc, where θ̂ is a vector of unbiased estimators of θ.
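
A worked one-parameter case (the normal mean with σ² known), showing that the bound is attained by the sample mean:

```latex
\frac{\partial \ln f(\mathbf{x}\mid\mu)}{\partial\mu}
   = \sum_{i=1}^{n}\frac{x_i-\mu}{\sigma^2},\qquad
M = E\!\left[\Big(\frac{\partial \ln f}{\partial\mu}\Big)^{2}\right]
  = \frac{n}{\sigma^{2}},\qquad
V = M^{-1} = \frac{\sigma^{2}}{n} = \operatorname{var}(\bar{x}).
```

Since x̄ is unbiased for μ and its variance equals the lower bound, x̄ has minimum variance among unbiased estimators of μ in this model.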

83 Defn (Uniformly Minimum Variance Unbiased Estimator): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then T*(x) is said to be the UMVU (Uniformly Minimum Variance Unbiased) estimator of φ(θ) if: 1) E[T*(x)] = φ(θ) for all θ ∈ Ω; 2) Var[T*(x)] ≤ Var[T(x)] for all θ ∈ Ω whenever E[T(x)] = φ(θ).

84 Theorem (Rao-Blackwell): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let S1(x), S2(x), ..., Sk(x) denote a set of sufficient statistics. Let T(x) be any unbiased estimator of φ(θ). Then T*[S1(x), ..., Sk(x)] = E[T(x) | S1(x), ..., Sk(x)] is an unbiased estimator of φ(θ) such that: Var[T*(S1(x), ..., Sk(x))] ≤ Var[T(x)] for all θ ∈ Ω.

85 Theorem (Lehmann-Scheffé): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let S1(x), S2(x), ..., Sk(x) denote a set of complete sufficient statistics. Let T*[S1(x), ..., Sk(x)] be an unbiased estimator of φ(θ). Then T*(S1(x), ..., Sk(x)) is the UMVU estimator of φ(θ).

86 Defn (Consistency): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let Tn(x) be an estimator of φ(θ). Then Tn(x) is called a consistent estimator of φ(θ) if for any δ > 0: lim (n→∞) P[|Tn(x) − φ(θ)| > δ] = 0.

87 Defn (M.S.E. Consistency): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let Tn(x) be an estimator of φ(θ). Then Tn(x) is called an M.S.E. consistent estimator of φ(θ) if: lim (n→∞) E[(Tn(x) − φ(θ))²] = 0.

88 Methods for Finding Estimators: 1. The Method of Moments. 2. Maximum Likelihood Estimation.

89 Method of Moments. Let x1, …, xn denote a sample from the density function f(x; θ1, …, θp) = f(x; θ). The k-th moment of the distribution being sampled is defined to be: μk′ = E[x^k] = ∫ x^k f(x; θ) dx.

90 The k-th sample moment is defined to be: mk = (1/n) Σ (i = 1 to n) xi^k. To find the method of moments estimators of θ1, …, θp we set up the equations: μk′(θ1, …, θp) = mk for k = 1, 2, …, p.

91 We then solve these p equations for θ1, …, θp. The solutions θ̂1, …, θ̂p are called the method of moments estimators.
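
A sketch of the method for the Gamma(α, λ) distribution used in the maximum likelihood example below: equating the first two distribution moments, α/λ and α(α + 1)/λ², to the sample moments gives closed-form estimators (NumPy assumed; the true parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=1 / 2.0, size=50_000)   # alpha = 3, lambda = 2

m1 = x.mean()            # first sample moment
m2 = (x ** 2).mean()     # second sample moment

# Solving m1 = alpha/lam and m2 = alpha*(alpha + 1)/lam**2 for (alpha, lam):
alpha_hat = m1 ** 2 / (m2 - m1 ** 2)
lam_hat = m1 / (m2 - m1 ** 2)
print(alpha_hat, lam_hat)   # ~ 3 and 2
```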

92 The Method of Maximum Likelihood. Suppose that the data x1, …, xn have joint density function f(x1, …, xn; θ1, …, θp), where θ = (θ1, …, θp) are unknown parameters assumed to lie in Ω (a subset of p-dimensional space). We want to estimate the parameters θ1, …, θp.

93 Definition: Maximum Likelihood Estimation. Suppose that the data x1, …, xn have joint density function f(x1, …, xn; θ1, …, θp). Then the Likelihood function is defined to be L(θ) = L(θ1, …, θp) = f(x1, …, xn; θ1, …, θp). The Maximum Likelihood estimators of the parameters θ1, …, θp are the values that maximize L(θ) = L(θ1, …, θp).

94 The Maximum Likelihood estimators of the parameters θ1, …, θp are the values θ̂1, …, θ̂p such that L(θ̂1, …, θ̂p) = max over θ ∈ Ω of L(θ1, …, θp). Note: maximizing L(θ) is equivalent to maximizing the log-likelihood function l(θ) = ln L(θ).

95 Example. Let x1, …, xn denote a sample from the Gamma distribution with parameters α and λ: f(x; α, λ) = (λ^α / Γ(α)) x^(α − 1) e^(−λx) for x ≥ 0.

96 The joint density of x1, …, xn is given by: f(x1, …, xn; α, λ) = (λ^(nα) / Γ(α)^n) (x1 x2 ··· xn)^(α − 1) e^(−λ(x1 + … + xn)).

97 The log-likelihood function is given by: l(α, λ) = nα ln λ − n ln Γ(α) + (α − 1) Σ ln xi − λ Σ xi. Differentiating we get: ∂l/∂λ = nα/λ − Σ xi and ∂l/∂α = n ln λ − n ψ(α) + Σ ln xi, where ψ(α) = Γ′(α)/Γ(α) is the digamma function.

98 Setting the derivatives to zero: λ̂ = α̂/x̄, and n ln λ̂ − n ψ(α̂) + Σ ln xi = 0, i.e. ln α̂ − ψ(α̂) = ln x̄ − (1/n) Σ ln xi, where x̄ = (1/n) Σ xi.

99 Hence to compute the maximum likelihood estimates of α and λ: 1. Compute c = ln x̄ − (1/n) Σ ln xi. 2. Solve the equation ln α − ψ(α) = c for the MLE α̂ of α. 3. Then the maximum likelihood estimate of λ is λ̂ = α̂/x̄.
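
A sketch of this recipe in Python. SciPy's digamma function and root finder are assumptions of the sketch, not part of the slides; step 2 has no closed form, so it is solved numerically:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def gamma_mle(x):
    """MLE of (alpha, lambda) for a Gamma sample, following steps 1-3 above."""
    xbar = x.mean()
    c = np.log(xbar) - np.log(x).mean()                                # step 1
    alpha = brentq(lambda a: np.log(a) - digamma(a) - c, 1e-6, 1e6)    # step 2
    return alpha, alpha / xbar                                         # step 3

rng = np.random.default_rng(3)
x = rng.gamma(shape=3.0, scale=1 / 2.0, size=50_000)  # alpha = 3, lambda = 2
print(gamma_mle(x))                                    # ~ (3, 2)
```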

100 Application: The General Linear Model.

101 Consider the random variable Y with 1. E[Y] = g(U1, U2, ..., Uk) = β1φ1(U1, ..., Uk) + β2φ2(U1, ..., Uk) + ... + βpφp(U1, ..., Uk), and 2. var(Y) = σ², where β1, β2, ..., βp are unknown parameters and φ1, φ2, ..., φp are known functions of the nonrandom variables U1, U2, ..., Uk. Assume further that Y is normally distributed.

102 Thus the density of Y is: f(Y|β1, ..., βp, σ²) = f(Y|β, σ²) = (1/(σ√(2π))) exp(−(Y − Σ (i = 1 to p) βiφi)²/(2σ²)).

103 Now suppose that n independent observations of Y, (y1, y2, ..., yn), are made corresponding to n sets of values of (U1, U2, ..., Uk): (u11, u12, ..., u1k), (u21, u22, ..., u2k), ..., (un1, un2, ..., unk). Let xij = φj(ui1, ui2, ..., uik), j = 1, 2, ..., p; i = 1, 2, ..., n. Then the joint density of y = (y1, y2, ..., yn) is: f(y1, y2, ..., yn | β1, ..., βp, σ²) = f(y|β, σ²).

104 f(y|β, σ²) = (2πσ²)^(−n/2) exp(−(y − Xβ)′(y − Xβ)/(2σ²)), where X = (xij) is the n × p design matrix.

105 Thus f(y|β, σ²) is a member of the exponential family of distributions, and S = (y′y, X′y) is a Minimal Complete set of Sufficient Statistics.

106 Estimation: The General Linear Model.

107 The Maximum Likelihood estimates of β and σ² are the values β̂ and σ̂² that maximize L(β, σ²) = f(y|β, σ²), or equivalently l(β, σ²) = ln L(β, σ²) = −(n/2) ln(2πσ²) − (y − Xβ)′(y − Xβ)/(2σ²).

108 Setting ∂l/∂β = 0 yields the system of linear equations (the Normal Equations): X′Xβ = X′y, while setting ∂l/∂σ² = 0 yields the equation: σ̂² = (y − Xβ̂)′(y − Xβ̂)/n.

109 If [X′X]⁻¹ exists then the normal equations have solution: β̂ = (X′X)⁻¹X′y, and σ̂² = (y − Xβ̂)′(y − Xβ̂)/n.
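
A sketch of these estimates in code, with a simulated design (the dimensions, coefficients and noise level are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = np.column_stack([np.ones(n),
                     rng.uniform(0, 10, n),
                     rng.uniform(0, 5, n)])          # n x p design matrix
beta_true = np.array([2.0, 1.5, -0.7])
y = X @ beta_true + rng.normal(0, 2.0, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)         # normal equations X'X b = X'y
resid = y - X @ beta_hat
sigma2_mle = resid @ resid / n                       # MLE of sigma^2 (divides by n)
print(beta_hat, sigma2_mle)
```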

110 Properties of the Maximum Likelihood Estimates: Unbiasedness, Minimum Variance.

111 Note: E[β̂] = E[(X′X)⁻¹X′y] = (X′X)⁻¹X′E[y] = (X′X)⁻¹X′Xβ = β. Thus β̂ is an unbiased estimator of β. Since β̂ is also a function of the set of complete minimal sufficient statistics, it is the UMVU estimator of β (Lehmann-Scheffé).

112-114 Note: σ̂² = (y − Xβ̂)′(y − Xβ̂)/n = y′(I − X(X′X)⁻¹X′)y/n. In general, for a random vector y with mean μ and covariance matrix Σ, E[y′Ay] = tr(AΣ) + μ′Aμ. Applying this with A = (I − X(X′X)⁻¹X′)/n, Σ = σ²I and μ = Xβ (for which μ′Aμ = 0, since (I − X(X′X)⁻¹X′)X = 0) gives: E[σ̂²] = (σ²/n) tr(I − X(X′X)⁻¹X′) = σ²(n − p)/n.

115 Let s² = n σ̂²/(n − p) = (y − Xβ̂)′(y − Xβ̂)/(n − p). Then E[s²] = σ². Thus s² is an unbiased estimator of σ². Since s² is also a function of the set of complete minimal sufficient statistics, it is the UMVU estimator of σ².

116 Hypothesis Testing

117 Defn (Test of size α): Let x = (x1, x2, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let ω be any subset of Ω. Consider testing the Null Hypothesis H0: θ ∈ ω against the alternative hypothesis H1: θ ∉ ω.

118 Let A denote the acceptance region for the test (all values x = (x1, ..., xn) such that the decision to accept H0 is made) and let C denote the critical region for the test (all values x = (x1, ..., xn) such that the decision to reject H0 is made). Then the test is said to be of size α if: max over θ ∈ ω of P[x ∈ C | θ] = α.

119 Defn (Power): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Consider testing the Null Hypothesis H0: θ ∈ ω against the alternative hypothesis H1: θ ∉ ω, where ω is any subset of Ω. Then the Power of the test for θ ∉ ω is defined to be: β(θ) = P[x ∈ C | θ].

120-122 Defn (Uniformly Most Powerful (UMP) test of size α): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Consider testing the Null Hypothesis H0: θ ∈ ω against the alternative hypothesis H1: θ ∉ ω, where ω is any subset of Ω. Let C denote the critical region for the test. Then the test is called the UMP test of size α if: max over θ ∈ ω of P[x ∈ C | θ] = α, and for any other critical region C* such that max over θ ∈ ω of P[x ∈ C* | θ] = α, we have P[x ∈ C | θ] ≥ P[x ∈ C* | θ] for all θ ∉ ω.

123 Theorem (Neyman-Pearson Lemma): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ), where the parameter vector can take only the two values θ0 and θ1. Consider testing the Null Hypothesis H0: θ = θ0 against the alternative hypothesis H1: θ = θ1. Then the UMP test of size α has critical region: C = {x : f(x|θ1)/f(x|θ0) ≥ K}, where K is chosen so that P[x ∈ C | θ0] = α.

124 Defn (Likelihood Ratio Test of size α): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Consider testing the Null Hypothesis H0: θ ∈ ω against the alternative hypothesis H1: θ ∉ ω, where ω is any subset of Ω. Then the Likelihood Ratio (LR) test of size α has critical region: C = {x : λ(x) ≤ K}, where λ(x) = (max over θ ∈ ω of Lx(θ)) / (max over θ ∈ Ω of Lx(θ)) and K is chosen so that max over θ ∈ ω of P[x ∈ C | θ] = α.

125 Theorem (Asymptotic distribution of the likelihood ratio test criterion): Let x = (x1, ..., xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Consider testing H0: θ ∈ ω against H1: θ ∉ ω, where ω is any subset of Ω. Then under proper regularity conditions, when H0 is true, U = −2 ln λ(x) possesses an asymptotic Chi-square distribution with degrees of freedom equal to the difference between the number of independent parameters in Ω and ω.
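
A small simulation sketch of the theorem for testing H0: μ = μ0 in a normal model with known σ (here Ω has one free parameter and ω has none, so the reference distribution is chi-square with 1 degree of freedom; in this special case −2 ln λ(x) = n(x̄ − μ0)²/σ² exactly, and NumPy/SciPy are assumptions of the sketch):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
n, mu0, sigma, reps = 30, 0.0, 1.0, 100_000

x = rng.normal(mu0, sigma, size=(reps, n))   # data generated under H0
xbar = x.mean(axis=1)
U = n * (xbar - mu0) ** 2 / sigma ** 2       # -2 ln(lambda) for this model

# Rejecting when U exceeds the chi-square(1) 0.95 point gives size ~ 0.05:
print((U > chi2.ppf(0.95, df=1)).mean())
```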

126 Advanced Introductory Techniques

127 Comparing k Population Means: One-way Analysis of Variance (ANOVA).

128 The F test

129 The F test for comparing k means. Situation: We have k normal populations. Let μi and σ denote the mean and standard deviation of population i, i = 1, 2, 3, …, k. Note: we assume that the standard deviation is the same for each population: σ1 = σ2 = … = σk = σ.

130 We want to test H0: μ1 = μ2 = … = μk against HA: at least one pair of means differ.

131 To test H0 against HA, use the test statistic F = MS_Between / MS_Error.

132 SS_Between = Σ (i = 1 to k) ni(x̄i − x̄)² is called the Between Sum of Squares; it measures the variability between samples. The statistic k − 1 is known as the Between degrees of freedom, and MS_Between = SS_Between/(k − 1) is called the Between Mean Square.

133 SS_Error = Σ (i = 1 to k) Σ (j = 1 to ni) (xij − x̄i)² is called the Error Sum of Squares. The statistic N − k (with N = Σ ni) is known as the Error degrees of freedom, and MS_Error = SS_Error/(N − k) is called the Error Mean Square.

134 Then F = MS_Between / MS_Error.

135 The computing formula for F. Compute: 1) Ti = Σj xij, the total for sample i; 2) G = Σi Ti, the grand total; 3) N = Σi ni; 4) Σi Σj xij²; 5) Σi Ti²/ni.

136 Then: 1) SS_Between = Σi Ti²/ni − G²/N; 2) SS_Error = Σi Σj xij² − Σi Ti²/ni; 3) F = [SS_Between/(k − 1)] / [SS_Error/(N − k)].
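
These computing formulas translate directly into code. A sketch for k samples of possibly unequal sizes (NumPy/SciPy assumed; the data values are arbitrary illustrations, not the diet data):

```python
import numpy as np
from scipy.stats import f as f_dist

samples = [np.array([63.0, 57.0, 69.0]),        # sample 1
           np.array([52.0, 50.0, 61.0, 55.0]),  # sample 2
           np.array([71.0, 64.0, 68.0])]        # sample 3

k = len(samples)
T = np.array([s.sum() for s in samples])        # 1) treatment totals T_i
n = np.array([len(s) for s in samples])
G, N = T.sum(), n.sum()                         # 2)-3) grand total and N
sum_sq = sum((s ** 2).sum() for s in samples)   # 4) sum of all x_ij^2
A = (T ** 2 / n).sum()                          # 5) sum of T_i^2 / n_i

ss_between = A - G ** 2 / N
ss_error = sum_sq - A
F = (ss_between / (k - 1)) / (ss_error / (N - k))
p = f_dist.sf(F, k - 1, N - k)                  # right-tail p-value
print(F, p)
```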

137 The critical region for the F test: we reject H0 if F ≥ Fα, where Fα is the critical point under the F distribution with ν1 = k − 1 degrees of freedom in the numerator and ν2 = N − k degrees of freedom in the denominator.

138 Example. In the following example we are comparing weight gains resulting from the following six diets: 1. Diet 1 - High Protein, Beef; 2. Diet 2 - High Protein, Cereal; 3. Diet 3 - High Protein, Pork; 4. Diet 4 - Low Protein, Beef; 5. Diet 5 - Low Protein, Cereal; 6. Diet 6 - Low Protein, Pork.

139 [Data table: the 60 weight gains, 10 observations per diet; not captured in the transcript.]

140 Hence [the sums of squares are computed; the results appear in the ANOVA table below].

141 Thus F = MS_Between/MS_Error = 922.587/214.556 = 4.3. Since F = 4.3 > 2.386 (the 0.05 critical value of the F distribution with 5 and 54 degrees of freedom), we reject H0.

142 The ANOVA Table: A convenient method for displaying the calculations for the F-test.

143 Anova Table:
Source  | d.f.  | Sum of Squares | Mean Square | F-ratio
Between | k - 1 | SS_Between     | MS_Between  | MS_B/MS_E
Within  | N - k | SS_Error       | MS_Error    |
Total   | N - 1 | SS_Total       |             |

144 The Diet Example:
Source  | d.f. | Sum of Squares | Mean Square | F-ratio
Between | 5    | 4612.933       | 922.587     | 4.3 (p = 0.0023)
Within  | 54   | 11586.000      | 214.556     |
Total   | 59   | 16198.933      |             |

145 Using SPSS Note: The use of another statistical package such as Minitab is similar to using SPSS

146 Assume the data is contained in an Excel file

147 Each variable is in a column: 1. Weight gain (wtgn), 2. diet, 3. Source of protein (Source), 4. Level of Protein (Level).

148 After starting the SPSS program the following dialogue box appears:

149 If you select Opening an existing file and press OK the following dialogue box appears

150 The following dialogue box appears:

151 If the variable names are in the file ask it to read the names. If you do not specify the Range the program will identify the Range: Once you “click OK”, two windows will appear

152 One that will contain the output:

153 The other containing the data:

154 To perform ANOVA select Analyze->General Linear Model-> Univariate

155 The following dialog box appears

156 Select the dependent variable and the fixed factors. Press OK to perform the Analysis.

157 The Output

158 Comments. The F-test tests H0: μ1 = μ2 = μ3 = … = μk against HA: at least one pair of means are different. If H0 is accepted we conclude that all means are equal (not significantly different). If H0 is rejected we conclude that at least one pair of means is significantly different. The F-test gives no information as to which pairs of means are different. One can now use two-sample t tests to determine which pairs of means are significantly different.

159 Fisher's LSD (least significant difference) procedure: 1. Test H0: μ1 = μ2 = μ3 = … = μk against HA: at least one pair of means are different, using the ANOVA F-test. 2. If H0 is accepted we conclude that all means are equal (not significantly different) and stop. 3. If H0 is rejected we conclude that at least one pair of means is significantly different, and follow up with two-sample t tests to determine which pairs of means are significantly different.

160 Linear Regression: Hypothesis Testing and Estimation.

161 Assume that we have collected data on two variables X and Y. Let (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

162 The Statistical Model

163 Each yi is assumed to be randomly generated from a normal distribution with mean μi = α + βxi and standard deviation σ (α, β and σ are unknown). [Figure: the line Y = α + βX, with slope β and intercept α; yi scatters about the line at x = xi.]

164 The Data. The Linear Regression Model: the data fall roughly about an unseen straight line Y = α + βX.

165 The Least Squares Line: Fitting the best straight line to "linear" data.

166 Let Y = a + bX denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y. For example, if X = xi (as for the i-th case) then the predicted value of Y is: ŷi = a + bxi.

167 The residual ri = yi − ŷi = yi − (a + bxi) can be computed for each case in the sample. The residual sum of squares (RSS) = Σ ri² = Σ (yi − a − bxi)² is a measure of the "goodness of fit" of the line Y = a + bX to the data.

168 The optimal choice of a and b will result in the residual sum of squares attaining a minimum. If this is the case then the line Y = a + bX is called the Least Squares Line.

169 The equation for the least squares line. Let Sxx = Σ(xi − x̄)², Syy = Σ(yi − ȳ)², and Sxy = Σ(xi − x̄)(yi − ȳ).

170 Computing formulae: Sxx = Σxi² − (Σxi)²/n, Syy = Σyi² − (Σyi)²/n, Sxy = Σxiyi − (Σxi)(Σyi)/n.

171 Then the slope of the least squares line can be shown to be: b = Sxy/Sxx.

172 and the intercept of the least squares line can be shown to be: a = ȳ − b x̄.

173 The residual sum of squares. Computing formula: RSS = Σ(yi − a − bxi)² = Syy − Sxy²/Sxx.

174 Estimating σ, the standard deviation in the regression model. Computing formula: s = √(RSS/(n − 2)) = √((Syy − Sxy²/Sxx)/(n − 2)). This estimate of σ is said to be based on n − 2 degrees of freedom.
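
These formulas, coded as a sketch and applied to the cigarette/lung-cancer data of the example below (the printed values should reproduce the slides' b = 0.228, a = 6.756 and s = 8.35 up to rounding):

```python
import numpy as np

x = np.array([48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130], dtype=float)
y = np.array([18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20], dtype=float)
n = len(x)

Sxx = (x ** 2).sum() - x.sum() ** 2 / n
Syy = (y ** 2).sum() - y.sum() ** 2 / n
Sxy = (x * y).sum() - x.sum() * y.sum() / n

b = Sxy / Sxx                   # slope
a = y.mean() - b * x.mean()     # intercept
rss = Syy - Sxy ** 2 / Sxx      # residual sum of squares
s = np.sqrt(rss / (n - 2))      # estimate of sigma on n - 2 df
print(b, a, s)                  # ~ 0.228, 6.756, 8.35
```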

175 Sampling distributions of the estimators

176 The sampling distribution of the slope of the least squares line: it can be shown that b has a normal distribution with mean β and standard deviation σb = σ/√Sxx.

177 Thus z = (b − β)/(σ/√Sxx) has a standard normal distribution, and t = (b − β)/(s/√Sxx) has a t distribution with df = n − 2.

178 (1 − α)100% Confidence Limits for the slope β: b ± tα/2 s/√Sxx, where tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.

179 Testing the slope: H0: β = β0 against HA: β ≠ β0. The test statistic is: t = (b − β0)/(s/√Sxx), which has a t distribution with df = n − 2 if H0 is true.

180 The Critical Region: reject H0 if |t| > tα/2, df = n − 2. This is a two-tailed test. One-tailed tests are also possible.

181 The sampling distribution of the intercept of the least squares line: it can be shown that a has a normal distribution with mean α and standard deviation σa = σ√(1/n + x̄²/Sxx).

182 Thus z = (a − α)/(σ√(1/n + x̄²/Sxx)) has a standard normal distribution, and t = (a − α)/(s√(1/n + x̄²/Sxx)) has a t distribution with df = n − 2.

183 (1 − α)100% Confidence Limits for the intercept α: a ± tα/2 s√(1/n + x̄²/Sxx), where tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.

184 Testing the intercept: H0: α = α0 against HA: α ≠ α0. The test statistic is: t = (a − α0)/(s√(1/n + x̄²/Sxx)), which has a t distribution with df = n − 2 if H0 is true.

185 The Critical Region: reject H0 if |t| > tα/2, df = n − 2.

186 Example

187 The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950. TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.
Country (i)   | Xi  | Yi
Australia     | 48  | 18
Canada        | 50  | 15
Denmark       | 38  | 17
Finland       | 110 | 35
Great Britain | 110 | 46
Holland       | 49  | 24
Iceland       | 23  | 6
Norway        | 25  | 9
Sweden        | 30  | 11
Switzerland   | 51  | 25
USA           | 130 | 20

188 [Scatter plot of Yi against Xi for the 11 countries.]

189 Fitting the Least Squares Line

190 First compute the following three quantities: Sxx = 54404 − (664)²/11 = 14322.55, Syy = 6018 − (226)²/11 = 1374.73, Sxy = 16914 − (664)(226)/11 = 3271.82.

191 Computing the estimates of the slope (β), intercept (α) and standard deviation (σ): b = Sxy/Sxx = 3271.82/14322.55 = 0.228, a = ȳ − b x̄ = 226/11 − 0.228(664/11) = 6.756, and s = √((Syy − Sxy²/Sxx)/(n − 2)) = √(627.3/9) = 8.35.

192 95% Confidence Limits for the slope β: b ± t.025 s/√Sxx = 0.228 ± 2.262(8.35/√14322.55), i.e. 0.0706 to 0.3862. t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.

193 95% Confidence Limits for the intercept α: a ± t.025 s√(1/n + x̄²/Sxx) = 6.756 ± 2.262(4.906), i.e. −4.34 to 17.85. t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.

194 The fitted line: Y = 6.756 + (0.228)X. 95% confidence limits for the slope: 0.0706 to 0.3862. 95% confidence limits for the intercept: −4.34 to 17.85.

195 Testing for a positive slope: H0: β = 0 against HA: β > 0. The test statistic is: t = b/(s/√Sxx) = 0.228/0.0698 = 3.27.

196 The Critical Region: reject H0 if t > t0.05 = 1.833, df = 11 − 2 = 9. A one-tailed test.

197 Since t = 3.27 > 1.833, we reject H0: β = 0 and conclude HA: β > 0.

198 Confidence Limits for Points on the Regression Line. The intercept α is a specific point on the regression line: it is the y-coordinate of the point on the regression line when x = 0, i.e. the predicted value of y when x = 0. We may also be interested in other points on the regression line, e.g. when x = x0. In this case the y-coordinate of the point on the regression line when x = x0 is α + βx0.

199 [Figure: the point (x0, α + βx0) on the line y = α + βx.]

200 (1 − α)100% Confidence Limits for α + βx0: a + bx0 ± tα/2 s√(1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

201 Prediction Limits for new values of the dependent variable y. An important application of the regression line is prediction: knowing the value of x (x0), what is the value of y? The predicted value of y when x = x0 is y = α + βx0. This in turn can be estimated by: ŷ = a + bx0.

202 The predictor ŷ = a + bx0 gives only a single value for y. A more appropriate piece of information would be a range of values: a range of values that has a fixed probability of capturing the value for y, i.e. a (1 − α)100% prediction interval for y.

203 (1 − α)100% Prediction Limits for y when x = x0: a + bx0 ± tα/2 s√(1 + 1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
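
Continuing the earlier sketch, the confidence limits for α + βx0 and the prediction limits for a new y at x = x0 differ only by the extra "1 +" under the square root (SciPy's t quantile is an assumption of the sketch):

```python
import numpy as np
from scipy.stats import t as t_dist

def limits(x, y, x0, conf=0.95):
    """Return (CI for alpha + beta*x0, PI for a new y at x0), per the formulas above."""
    n = len(x)
    Sxx = (x ** 2).sum() - x.sum() ** 2 / n
    Sxy = (x * y).sum() - x.sum() * y.sum() / n
    Syy = (y ** 2).sum() - y.sum() ** 2 / n
    b = Sxy / Sxx
    a = y.mean() - b * x.mean()
    s = np.sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))
    tcrit = t_dist.ppf(1 - (1 - conf) / 2, df=n - 2)
    fit = a + b * x0
    half_ci = tcrit * s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)
    half_pi = tcrit * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)
    return (fit - half_ci, fit + half_ci), (fit - half_pi, fit + half_pi)

x = np.array([48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130], dtype=float)
y = np.array([18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20], dtype=float)
print(limits(x, y, x0=60))   # PI is always wider than the CI
```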

204 Example. In this example we are studying building fires in a city, and are interested in the relationship between: 1. X = the distance between the building that sounded the alarm and the closest fire hall, and 2. Y = the cost of the damage ($1000s). The data were collected on n = 15 fires.

205 The Data. [Table of distance and damage cost for the 15 fires; not captured in the transcript.]

206 Scatter Plot

207 Computations

208 Computations Continued

209-210 [Computation of Sxx, Sxy, Syy and of b, a and s for the fire data; the fitted line appears below.]

211 95% Confidence Limits for the slope β: 4.07 to 5.77. t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.

212 95% Confidence Limits for the intercept α: 7.21 to 13.35. t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.

213 Least Squares Line: y = 4.92x + 10.28.

214 (1 − α)100% Confidence Limits for α + βx0: a + bx0 ± tα/2 s√(1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

215-216 [95% confidence limits for α + βx0 computed and plotted over the range of the fire data.]

217 (1 − α)100% Prediction Limits for y when x = x0: a + bx0 ± tα/2 s√(1 + 1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

218-219 [95% prediction limits for y when x = x0 computed and plotted over the range of the fire data; the prediction band is wider than the confidence band.]

220 Linear Regression Summary: Hypothesis Testing and Estimation.

221 (1 − α)100% Confidence Limits for the slope β: b ± tα/2 s/√Sxx, where tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.

222 Testing the slope: H0: β = β0 against HA: β ≠ β0. The test statistic is: t = (b − β0)/(s/√Sxx), which has a t distribution with df = n − 2 if H0 is true.

223 (1 − α)100% Confidence Limits for the intercept α: a ± tα/2 s√(1/n + x̄²/Sxx), where tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.

224 Testing the intercept: H0: α = α0 against HA: α ≠ α0. The test statistic is: t = (a − α0)/(s√(1/n + x̄²/Sxx)), which has a t distribution with df = n − 2 if H0 is true.

225 (1 − α)100% Confidence Limits for α + βx0: a + bx0 ± tα/2 s√(1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

226 (1 − α)100% Prediction Limits for y when x = x0: a + bx0 ± tα/2 s√(1 + 1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

227 Comparing k Populations: Proportions. The χ² test for independence.


229 Situation: We have two categorical variables R and C. The number of categories of R is r; the number of categories of C is c. We observe n subjects from the population and count xij = the number of subjects for which R = i and C = j. (R = rows, C = columns.)

230 Example: Both Systolic Blood Pressure (C) and Serum Cholesterol (R) were measured for a sample of n = 1237 subjects. The categories for Blood Pressure are: <126, 127-146, 147-166, 167+. The categories for Cholesterol are: <200, 200-219, 220-259, 260+.

231 Table: two-way frequency

232 The χ² test for independence. Define Eij = RiCj/n = the expected frequency in the (i,j)-th cell in the case of independence, where Ri is the i-th row total and Cj is the j-th column total.

233 Then to test H0: R and C are independent against HA: R and C are not independent, use the test statistic χ² = Σi Σj (xij − Eij)²/Eij, where xij = the observed frequency in the (i,j)-th cell and Eij = the expected frequency in the (i,j)-th cell in the case of independence.

234 Sampling distribution of the test statistic when H0 is true: the χ² distribution with degrees of freedom ν = (r − 1)(c − 1). Critical and Acceptance Regions: reject H0 if χ² ≥ χ²α; accept H0 if χ² < χ²α, where χ²α is the upper α critical point of the χ² distribution with (r − 1)(c − 1) degrees of freedom.
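
A sketch of the test in Python; scipy.stats.chi2_contingency computes the expected frequencies Eij, the statistic and the p-value from the observed table (the counts below are arbitrary illustrations, not the slide's data):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 25, 20],
                     [20, 30, 25],
                     [10, 15, 40]])   # x_ij: r = 3 rows, c = 3 columns

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(chi2_stat, p_value, dof)        # dof = (r - 1)(c - 1) = 4
print(expected)                       # E_ij = R_i * C_j / n
```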

235 [Table of observed and expected frequencies for the blood pressure / cholesterol data.]

236 Standardized residuals: rij = (xij − Eij)/√Eij. Degrees of freedom = (r − 1)(c − 1) = 9. The test statistic exceeds the critical value, so we reject H0 using α = 0.05.

237 Another Example This data comes from a Globe and Mail study examining the attitudes of the baby boomers. Data was collected on various age groups

238 One question, with responses: Are there differences in weekly consumption of alcohol related to age?

239 Table: Expected frequencies

240 Table: Residuals. Conclusion: There is a significant relationship between age group and weekly alcohol use.

241 Examining the residuals allows one to identify the cells that indicate a departure from independence. Large positive residuals indicate cells where the observed frequencies were larger than expected under independence; large negative residuals indicate cells where the observed frequencies were smaller than expected under independence.

242 Another question, with responses: Are there differences in weekly internet use related to age? (In an average week, how many times would you surf the internet?)

243 Table: Expected frequencies

244 Table: Residuals. Conclusion: There is a significant relationship between age group and weekly internet use.

245-249 [Bar charts of responses for each age group: Echo (Age 20-29), Gen X (Age 30-39), Younger Boomers (Age 40-49), Older Boomers (Age 50-59), Pre Boomers (Age 60+).]

