
1 Review

2 The Simple Regression Model
y = β0 + β1x + u

3 Some Terminology
In the simple linear regression model, where y = β0 + β1x + u, we typically refer to y as the Dependent Variable, or Left-Hand Side Variable, or Explained Variable, or Regressand.

4 Some Terminology, cont.
In the simple linear regression of y on x, we typically refer to x as the Independent Variable, or Right-Hand Side Variable, or Explanatory Variable, or Regressor, or Covariate.

5 Some Terminology, cont.
Earnings = β0 + β1Educ + u
The parameters of the model, β0 and β1, are constants. In this case they are the intercept and slope of a function relating education to earnings.
The error term, u, includes all other factors affecting earnings, including unobserved factors.
One could estimate the parameters of this model using data on a random sample of working people.

6 A Simple Assumption
We start by making a simple assumption about the error: the average value of u, the error term, in the population is 0. That is, E(u) = 0.
This is not a restrictive assumption, since we can always use β0 to normalize E(u) to 0 (e.g. use β0 to normalize average ability to zero).

7 Zero Conditional Mean
We need to make a crucial assumption about how u and x are related: the average value of u does not depend on the value of x.
That is, E(u|x) = E(u) = 0, which implies E(y|x) = β0 + β1x.

8 Zero Conditional Mean
To get some intuition as to what this means, think about the returns to education example. Suppose ability (which is unobservable to us) is included in u.
Our assumption implies: E(ability | 8 years ed.) = E(ability | 12 years ed.)
This would not be the case if those with more ability chose more education.

9 Population Regression Function
E(y|x) as a linear function of x, where for any x the distribution of y is centered about E(y|x).
[Figure: conditional densities f(y|x) at x1 and x2, each centered on the population regression function E(y|x) = β0 + β1x]

10 Ordinary Least Squares
Note that I haven't mentioned data yet. The basic idea of regression is to estimate the population parameters (β0 and β1) from a sample.
Let {(xi, yi): i = 1, …, n} denote a random sample of size n from the population.
Since they are drawn from the population, each observation in this sample satisfies yi = β0 + β1xi + ui.
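
To make the sampling idea concrete, here is a minimal sketch in Python with NumPy that draws one such random sample. The parameter values and the distributions of x and u are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population parameters -- unknown to the econometrician,
# chosen here only so we can draw a sample from y = b0 + b1*x + u.
beta0, beta1 = 1.0, 0.5
n = 100

x = rng.uniform(0, 16, size=n)    # e.g. years of education
u = rng.normal(0, 2, size=n)      # error term with E(u) = 0
y = beta0 + beta1 * x + u         # each draw satisfies y_i = b0 + b1*x_i + u_i
```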

11 Population regression line, some actual draws from the population and the associated error terms
•Assume a population regression function (the true relationship; parameters unknown to us).
•Actual outcomes differ from the true relationship because of u.
•We want to estimate the true relationship based on data.
[Figure: sample points (x1, y1), …, (x4, y4) around the line E(y|x) = β0 + β1x, with the errors u1, …, u4 drawn as vertical deviations from the line]

12 Deriving OLS Estimates
We need some way of estimating the true relationship using the sample data. We can do this using some of our assumptions.
First we need to realize that our main assumption, E(u|x) = E(u) = 0, also implies Cov(x,u) = E(xu) = 0.
Why? From basic probability we know that Cov(x,u) = E(xu) − E(x)E(u). The zero conditional mean assumption implies Cov(x,u) = 0, and since E(u) = 0 by assumption, the identity reduces to E(xu) = Cov(x,u) = 0.

13 Deriving OLS continued
So we have that
1. E(u) = 0
2. E(xu) = 0
These can be rewritten in terms of x, y, β0 and β1, since u = y − β0 − β1x:
1. E(y − β0 − β1x) = 0
2. E[x(y − β0 − β1x)] = 0
These are called moment restrictions.

14 Deriving OLS using M.O.M.
So we have two equations (2 restrictions) and 2 unknowns (β0, β1). We will solve using the sample counterparts of these two equations.
The method of moments approach to estimation imposes the population moment restrictions on the sample moments.
What does this mean? Recall that for E(X), the mean of a population distribution, a sample estimator of E(X) is simply the arithmetic mean of the sample.

15 More Derivation of OLS
We want to choose values of the parameters that ensure that the sample versions of our moment restrictions hold. The sample versions are as follows (all sums run over i = 1, …, n):
1A) (1/n) Σ (yi − β̂0 − β̂1xi) = 0
2A) (1/n) Σ xi (yi − β̂0 − β̂1xi) = 0

16 More Derivation of OLS
Given the definition of a sample mean, and properties of summation, we can rewrite equation 1A) as
ȳ = β̂0 + β̂1 x̄
so the estimated regression line passes through the means of x and y. Solving for β̂0:
β̂0 = ȳ − β̂1 x̄
We still need an expression for the estimate of β1 to get an estimate of β0.

17 More Derivation of OLS
Equation 2A) can be rewritten, dropping the 1/n and plugging in our estimate of β̂0:
Σ xi (yi − (ȳ − β̂1 x̄) − β̂1 xi) = 0
Σ xi (yi − ȳ) = β̂1 Σ xi (xi − x̄)
Σ (xi − x̄)(yi − ȳ) = β̂1 Σ (xi − x̄)²

18 So the OLS estimated slope is
β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
provided that Σ (xi − x̄)² > 0.
Of course, this sum can't be less than zero. But if it's zero, we're out of luck. Intuitively, this should make sense: we need variation in x to discern a relationship between x and y.
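
As an illustration, here is a minimal NumPy sketch of these two formulas. The function name ols_simple is ours, and the guard mirrors the "no variation in x" caveat above:

```python
import numpy as np

def ols_simple(x, y):
    """OLS intercept and slope for y = b0 + b1*x, via the moment formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)
    if sxx == 0:
        # no variation in x: the slope is not identified
        raise ValueError("x does not vary in the sample")
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```

As a sanity check, np.polyfit(x, y, 1) should return the same two numbers, slope first.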

19 Summary of OLS slope estimate
The slope estimate is the sample covariance between x and y divided by the sample variance of x.
If x and y are positively correlated, the slope will be positive; if they are negatively correlated, the slope will be negative.
We only need x to vary in our sample to get an estimate.
Example: if everyone in the population has the same education, the denominator equals zero, and it is an uninteresting question anyway.

20 OLS Interpretation
Intuitively, OLS fits a line through the sample points such that the sum of squared residuals is as small as possible, hence the term least squares.
The residual, ûi, is an estimate of the error term, ui, and is the difference between the actual data point, yi, and its fitted value:
ûi = yi − ŷi = yi − β̂0 − β̂1 xi

21 Sample Regression Function, sample data points and the associated estimated error terms
The residual is the difference between the actual y and the fitted value.
The SRF is obtained for a given sample by minimizing the sum of squared residuals. Each new sample will generate a different function, unlike the PRF, which is fixed.
[Figure: sample points (x1, y1), …, (x4, y4) around the fitted line ŷ = β̂0 + β̂1x, with the residuals û1, …, û4 drawn as vertical deviations from the line]

22 Algebraic Properties of OLS
These properties hold basically by definition.
1. The sum of the OLS residuals is zero (this follows from the first FOC): Σ ûi = 0. Thus the sample average of the OLS residuals is zero as well: (1/n) Σ ûi = 0.
2. The sample covariance between the regressor and the OLS residuals is zero (by the second FOC): Σ xi ûi = 0.
3. The OLS regression line always goes through the sample means: ȳ = β̂0 + β̂1 x̄.
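
These three properties are easy to verify numerically. A sketch on simulated data (hypothetical parameter values; any sample in which x varies would do):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=50)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)

print(np.isclose(u_hat.sum(), 0))                 # 1. residuals sum to zero
print(np.isclose((x * u_hat).sum(), 0))           # 2. zero sample covariance with x
print(np.isclose(y.mean(), b0 + b1 * x.mean()))   # 3. line passes through the means
```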

23 Decomposition of sum of squares
We can think of each observation as being made up of two parts, an explained part and an unexplained part: yi = ŷi + ûi.
We can then define the following:
Total sum of squares: SST ≡ Σ (yi − ȳ)²
Explained sum of squares: SSE ≡ Σ (ŷi − ȳ)²
Residual sum of squares: SSR ≡ Σ ûi²

24 Proof that SST = SSE + SSR
We can show that SST = SSE + SSR:
Σ (yi − ȳ)² = Σ [(yi − ŷi) + (ŷi − ȳ)]²
Recall that ûi = yi − ŷi, so this equals
Σ [ûi + (ŷi − ȳ)]² = Σ ûi² + 2 Σ ûi (ŷi − ȳ) + Σ (ŷi − ȳ)²
= SSR + 2 Σ ûi (ŷi − ȳ) + SSE
and we know that Σ ûi (ŷi − ȳ) = 0, since the sample covariance between û and ŷ is zero.

25 Goodness-of-Fit
We can now tell how well our sample regression line fits our sample data (i.e. how well x explains y): compute the fraction of the total sum of squares (SST) that is explained by the model.
This is called the coefficient of determination, or the R-squared of the regression:
R² = SSE/SST = 1 − SSR/SST
R² always lies between 0 and 1.
Example: if R² = 0.65, x explains 65% of the variation in y.
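
A minimal sketch of this computation, assuming y_hat holds the OLS fitted values (the decomposition SST = SSE + SSR relies on that):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 via the sum-of-squares decomposition."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
    sse = np.sum((y_hat - y.mean()) ** 2)    # explained sum of squares
    ssr = np.sum((y - y_hat) ** 2)           # residual sum of squares
    assert np.isclose(sst, sse + ssr)        # holds for OLS fits with an intercept
    return 1.0 - ssr / sst
```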

26 Properties of OLS Estimators
We want to know how our estimates of β0 and β1 relate to the population parameters.
Suppose we get estimates of the parameters from one sample of the population, then take new samples and get a new estimate of the parameters for each sample.
What are the sampling properties of our estimates of β0 and β1?

27 Unbiasedness of OLS
We can establish the unbiasedness of OLS under a few simple assumptions.
1. Assume the population model is linear in parameters: y = β0 + β1x + u.
2. Assume we have a random sample of size n, {(xi, yi): i = 1, 2, …, n}, from the population. We can then rewrite the population model in terms of the random sample as yi = β0 + β1xi + ui for i = 1, 2, …, n, where ui is the error for observation i.

28 Unbiasedness of OLS
3. Assume there is at least some variation in the xi.
4. Assume E(u|x) = 0. For a random sample this implies E(ui|xi) = 0 for all i.
This allows us to derive properties of the OLS estimators conditional on the sample. That is, we will assume that the x's remain constant in repeated sampling (so that what changes in repeated sampling is the y's and the u's).
The assumption that ui and xi are independent should not be taken for granted. When it is violated, the parameter estimates will, in general, be biased.
In order to think about unbiasedness, we need to rewrite our estimator in terms of the population parameter.

29 Unbiasedness of OLS (cont)
Start with a simple rewrite of the formula:
β̂1 = Σ (xi − x̄) yi / Σ (xi − x̄)²
We can do this rewrite because Σ (xi − x̄)(yi − ȳ) = Σ (xi − x̄) yi.

30 Unbiasedness of OLS (cont)
Substituting for yi to express the numerator in terms of the population parameters:
Σ (xi − x̄) yi = Σ (xi − x̄)(β0 + β1xi + ui)
= β0 Σ (xi − x̄) + β1 Σ (xi − x̄) xi + Σ (xi − x̄) ui

31 Unbiasedness of OLS (cont)
Recall from the appendices that
Σ (xi − x̄) = 0 and Σ (xi − x̄) xi = Σ (xi − x̄)²
Thus the numerator can be rewritten as β1 Σ (xi − x̄)² + Σ (xi − x̄) ui, so
β̂1 = β1 + Σ (xi − x̄) ui / Σ (xi − x̄)²

32 Unbiasedness of OLS (cont)
Let di = (xi − x̄), so that
β̂1 = β1 + (Σ di ui) / (Σ di²)
This tells us that the estimator equals the population slope plus a linear combination of the errors. Since the x's are assumed constant in repeated samples, the second term varies (in repeated samples) only through the u's.
In general the estimator will differ from the true population value. But will it differ systematically?

33 Unbiasedness of OLS (cont)
Taking the expected value of the estimator (conditional on the x's):
E(β̂1) = β1 + (1 / Σ di²) Σ E(di ui) = β1 + (1 / Σ di²) Σ di E(ui)
= β1 + (1 / Σ di²) Σ di · 0 = β1
We find that the slope coefficient estimator is unbiased. It doesn't vary systematically from the true population parameter.

34 Unbiasedness Summary
Similarly, we could also prove that our estimate of β0 is unbiased (homework).
Thus, the OLS estimates of β1 and β0 are unbiased.
The proof of unbiasedness depends on our 4 assumptions; if any assumption fails, then OLS is not necessarily unbiased.
Remember, unbiasedness is a description of the estimator: in a given sample we may be "near" or "far" from the true parameter, as the simulation sketch below illustrates.
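
A Monte Carlo sketch of what unbiasedness means in repeated sampling, with hypothetical parameter values. The x's are held fixed and only the errors are redrawn, matching the conditioning argument above:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, n, reps = 1.0, 0.5, 100, 5000

x = rng.uniform(0, 10, size=n)        # x's held constant across repeated samples
sxx = np.sum((x - x.mean()) ** 2)

estimates = np.empty(reps)
for r in range(reps):
    u = rng.normal(0, 2, size=n)      # fresh errors each replication, E(u|x) = 0
    y = beta0 + beta1 * x + u
    estimates[r] = np.sum((x - x.mean()) * (y - y.mean())) / sxx

print(estimates.mean())               # averages out close to beta1 = 0.5
```

Any single estimate may land far from 0.5; it is the average over repeated samples that centers on the truth.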

35 Variance of the OLS Estimators
Now we know that the sampling distribution of our estimate is centered around the true parameter. We want to think about how spread out this distribution is.
It is much easier to think about this variance under an additional assumption, so assume Var(u|x) = σ².
We are assuming that the variance of the error term is constant over all values of x. This is referred to as assuming that the errors are homoskedastic.

36 Variance of OLS (cont)
Var(u|x) = E(u²|x) − [E(u|x)]² = σ²
Since E(u|x) = 0 and E(u|x) = E(u), we have σ² = E(u²|x) = E(u²) = Var(u).
Thus σ² is also the unconditional variance of u, and σ is called the standard deviation of the error.
For convenience, we can say: E(y|x) = β0 + β1x and Var(y|x) = σ²

37 Homoskedastic Case
Var(y|x) does not depend on x.
[Figure: identical conditional densities f(y|x) at x1 and x2, centered on the line E(y|x) = β0 + β1x]

38 Heteroskedastic Case
When Var(u|x) depends on x, so does Var(y|x).
[Figure: conditional densities f(y|x) at x1, x2, x3 with differing spreads, centered on the line E(y|x) = β0 + β1x]

39 Variance of OLS (cont)
The error term in our returns to education example may exhibit heteroskedasticity. Recall that y is the wage.
Low educated: likely to become tradespersons or factory workers. Low wage variance.
High educated: may become poets or musicians (low wage), investment bankers (high wage), or anything else. High wage variance.
Later in the course, we will relax the homoskedasticity assumption.

40 Variance of OLS (cont)
Now we will use the homoskedasticity assumption to find the sampling variance of the OLS slope estimator:
Var(β̂1) = Var(β1 + (Σ di ui) / (Σ di²))
We are conditioning on the x's (so the di are constants), and the errors are independent across observations, so
= (1 / Σ di²)² Var(Σ di ui) = (1 / Σ di²)² Σ di² Var(ui)
= (1 / Σ di²)² σ² Σ di² = σ² / Σ di²

41 Some Intuition
Under homoskedasticity, we now know that
Var(β̂1) = σ² / Σ (xi − x̄)²
A larger σ² means more variance in β̂1: more variation in the unobservables (which are captured in u) makes it harder to discern the effect of x on y.
Larger variation in the x's (a bigger denominator) lowers the variance of β̂1: more variation in the "treatment" (independent variable) makes it easier to discern the effect of x on y.
Note also that a larger sample increases the denominator, giving a more precise estimator.

42 Variance of OLS (Continued)
We could also solve for the sampling variance of the intercept estimator:
Var(β̂0) = σ² (n⁻¹ Σ xi²) / Σ (xi − x̄)²
These formulas would not hold in the presence of heteroskedasticity, i.e. we couldn't rely on these variances for hypothesis testing.

43 Variance of OLS Summary
We will primarily be interested in the slope estimate. Its variance showed:
The larger the error variance, σ², the larger the variance of the slope estimate.
The larger the variability in the xi, the smaller the variance of the slope estimate.
As a result, a larger sample size should decrease the variance of the slope estimate (giving us a more precise estimate).

44 Estimating the Error Variance
Recall: σ² = E(u²), so an unbiased estimator of σ² would be (1/n) Σ ui².
Problem: we don't know what the error variance, σ², is, because we don't observe the errors, ui. However, we do observe the residuals, ûi.
Why not replace the errors with the residuals to form an estimate of the error variance:
σ̂² = (1/n) Σ ûi² = SSR/n

45 Error Variance Estimate (cont)
However, this would be a biased estimate, because it does not account for the 2 restrictions placed on the OLS residuals:
Σ ûi = 0 and Σ xi ûi = 0
If we know n − 2 of the residuals, we can solve for the other 2 (there are n − 2 degrees of freedom). So an unbiased estimate is:
σ̂² = (1/(n − 2)) Σ ûi² = SSR/(n − 2)

46 Error Variance Estimate (cont)
Work through the proof on pg. 57 to prove that this is an unbiased estimator. You will need to express the residuals as a function of the parameters:
ûi = yi − β̂0 − β̂1 xi
Substituting for yi:
ûi = (β0 + β1xi + ui) − β̂0 − β̂1 xi
Substitute this in and take the expected value of the error variance estimator.

47 Error Variance Estimate (cont)
We can now also calculate estimates of the standard errors of the coefficient estimates.
Define the Standard Error of the Regression (SER): σ̂ = √σ̂²
We can use this to estimate the standard deviations of the coefficient estimates:
Var(β̂1) = σ² / Σ (xi − x̄)², so sd(β̂1) = σ / √(Σ (xi − x̄)²)

48 Error Variance Estimate (cont)
A natural estimator of this standard deviation is
se(β̂1) = σ̂ / √(Σ (xi − x̄)²)
This is called the standard error of β̂1.

49 Error Variance Estimate (cont)
Similarly for β0:
se(β̂0) = σ̂ √( (n⁻¹ Σ xi²) / Σ (xi − x̄)² )
These standard errors will differ for each sample. They will be used to construct test statistics and confidence intervals.
We will do this in a more general framework (with more RHS variables), but this simple case is meant to illustrate the basic ideas as simply as possible.
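
Putting the last few slides together, a minimal sketch of the error-variance estimate and both standard-error formulas (the function name is ours):

```python
import numpy as np

def ols_standard_errors(x, y):
    """sigma_hat^2 = SSR/(n-2), then se(b1) and se(b0) as derived above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    u_hat = y - b0 - b1 * x
    sigma2_hat = np.sum(u_hat ** 2) / (n - 2)   # unbiased estimate of sigma^2
    ser = np.sqrt(sigma2_hat)                   # standard error of the regression
    se_b1 = ser / np.sqrt(sxx)
    se_b0 = ser * np.sqrt(np.sum(x ** 2) / (n * sxx))
    return se_b0, se_b1
```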

