Linear Regression with One Regressor

• Introduction
• Linear Regression Model
• Measures of Fit
• Least Squares Assumptions
• Sampling Distribution of the OLS Estimator

Introduction

Empirical problem: class size and educational output.
• Policy question: What is the effect of reducing class size by one student per class? By 8 students per class?
• What is the right output (performance) measure?
  - parent satisfaction
  - student personal development
  - future adult welfare
  - future adult earnings
  - performance on standardized tests

What do the data say about class sizes and test scores?

The California Test Score Data Set
All K-6 and K-8 California school districts (n = 420)
Variables:
• 5th grade test scores (Stanford-9 achievement test, combined math and reading), district average.
• Student-teacher ratio (STR) = number of students in the district divided by the number of full-time equivalent teachers.

An initial look at the California test score data:

Question: Do districts with smaller classes (lower STR) have higher test scores? And by how much?

The class size/test score policy question:
• What is the effect on test scores of reducing STR by one student per teacher?
• Object of policy interest: ∆TestScore/∆STR. This is the slope of the line relating test score and STR.

This suggests that we want to draw a line through the Test Score vs. STR scatterplot, but how?

Linear Regression: Some Notation and Terminology

The population regression line is

Test Score = β0 + β1 × STR

• β0 and β1 are "population" parameters.
• We would like to know the population value of β1.
• We don't know β1, so we must estimate it using data.

The Population Linear Regression Model (general notation)

Yi = β0 + β1Xi + ui, i = 1, …, n

• X is the independent variable or regressor.
• Y is the dependent variable.
• β0 = intercept.
• β1 = slope.

• ui = the regression error.
• The regression error consists of omitted factors, or possibly measurement error in Y. In general, these omitted factors are factors other than X that influence Y.

Example: The population regression line and the error term

Data and sampling

The population objects ("parameters") β0 and β1 are unknown; to draw inferences about these unknown parameters we must collect relevant data.

Simple random sampling: choose n entities at random from the population of interest, and observe (record) X and Y for each entity.

Simple random sampling implies that {(Xi, Yi)}, i = 1, …, n, are independently and identically distributed (i.i.d.). (Note: (Xi, Yi) are distributed independently of (Xj, Yj) for different observations i and j.)

The Ordinary Least Squares Estimator

How can we estimate β0 and β1 from data? We will focus on the least squares ("ordinary least squares" or "OLS") estimator of the unknown parameters β0 and β1, which solves

min over (b0, b1) of Σ [Yi − (b0 + b1Xi)]²

The OLS estimator solves:
• The OLS estimator minimizes the sum of squared differences between the actual values of Yi and the predicted values based on the estimated line.
• This minimization problem can be solved using calculus (a numerical sketch follows below).
• The result is the OLS estimators of β0 and β1.
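As a sanity check, the minimization can be carried out numerically. A minimal sketch on synthetic data; the variable names and coefficient values below are hypothetical, not the California data set:

    # Numerically minimize the sum of squared residuals over (b0, b1).
    # Synthetic data; the coefficients used to generate y are illustrative.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n = 420
    x = rng.uniform(14.0, 26.0, n)                  # hypothetical STR-like regressor
    y = 699.0 - 2.3 * x + rng.normal(0.0, 18.0, n)  # hypothetical test scores

    def ssr(b):
        """Sum of squared residuals for candidate (b0, b1)."""
        return np.sum((y - b[0] - b[1] * x) ** 2)

    result = minimize(ssr, x0=np.zeros(2))
    print(result.x)  # numerical (b0_hat, b1_hat); matches the closed form derived below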

Why use OLS, rather than some other estimator?
• OLS is a generalization of the sample average: if the "line" is just an intercept (no X), then the OLS estimator is just the sample average of Y1, …, Yn (that is, Ȳ).
• Like Ȳ, the OLS estimator β̂1 has some desirable properties: under certain assumptions, it is unbiased (that is, E(β̂1) = β1), and it has a tighter sampling distribution than some other candidate estimators of β1 (more on this later).
• Importantly, this is what everyone uses: the common "language" of linear regression.

Derivation of the OLS Estimators

Setting the derivatives of Σ [Yi − (b0 + b1Xi)]² with respect to b0 and b1 to zero gives the two normal equations:

(1) Σ (Yi − b0 − b1Xi) = 0
(2) Σ Xi(Yi − b0 − b1Xi) = 0

β̂0 and β̂1 are the values of b0 and b1 that solve the above two normal equations.

Dividing each term of equations (1) and (2) by n, we have

(3) Ȳ = b0 + b1X̄
(4) (1/n) Σ XiYi = b0X̄ + b1 (1/n) Σ Xi²

From (3), b0 = Ȳ − b1X̄; substituting this into (4) and collecting terms, we have

β̂1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²
β̂0 = Ȳ − β̂1X̄
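A direct sketch of these two formulas (a minimal implementation; the function name is ours):

    import numpy as np

    def ols_fit(x, y):
        """Closed-form OLS estimates for the line y = b0 + b1*x + u."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        dx = x - x.mean()
        b1 = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
        b0 = y.mean() - b1 * x.mean()
        return b0, b1

Applied to the synthetic (x, y) from the earlier sketch, ols_fit(x, y) reproduces the numerically minimized coefficients.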

Application to the California Test Score-Class Size data
• Estimated slope = β̂1 = −2.28
• Estimated intercept = β̂0 = 698.9
• Estimated regression line: TestScore-hat = 698.9 − 2.28 × STR

Interpretation of the estimated slope and intercept
• Districts with one more student per teacher on average have test scores that are 2.28 points lower.
• That is, ∆TestScore/∆STR = −2.28.
• The intercept (taken literally) means that, according to this estimated line, districts with zero students per teacher would have a (predicted) test score of 698.9.
• This interpretation of the intercept makes no sense: it extrapolates the line outside the range of the data. In this application, the intercept is not itself economically meaningful.

Predicted values and residuals:

One of the districts in the data set is Antelope, CA, for which STR = 19.33 and Test Score = 657.8.

Predicted value: Ŷ = 698.9 − 2.28 × 19.33 = 654.8
Residual: û = 657.8 − 654.8 = 3.0
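The same arithmetic as code (the estimates and district values are those reported above):

    # Predicted value and residual for Antelope, CA.
    b0_hat, b1_hat = 698.9, -2.28
    str_antelope, score_antelope = 19.33, 657.8

    predicted = b0_hat + b1_hat * str_antelope   # about 654.8
    residual = score_antelope - predicted        # about 3.0
    print(predicted, residual)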

Measures of Fit

A natural question is how well the regression line "fits" or explains the data. There are two regression statistics that provide complementary measures of the quality of fit:
• The regression R² measures the fraction of the variance of Y that is explained by X; it is unitless and ranges between zero (no fit) and one (perfect fit).
• The standard error of the regression (SER) measures the magnitude of a typical regression residual in the units of Y.

The R²: The regression R² is the fraction of the sample variance of Yi "explained" by the regression.

Write Yi = Ŷi + ûi, where Ŷi is the OLS predicted value and ûi is the OLS residual; this decomposition follows from equations (1) and (2).

Definition of R²:

R² = ESS/TSS, where ESS = Σ (Ŷi − Ȳ)² is the explained sum of squares and TSS = Σ (Yi − Ȳ)² is the total sum of squares.

• R² = 0 means ESS = 0.
• R² = 1 means ESS = TSS.
• 0 ≤ R² ≤ 1.
• For regression with a single X, R² equals the square of the correlation coefficient between X and Y. (Exercise 4.12)
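A minimal sketch of R² computed from a fitted line, using the definitions above (the function name is ours):

    import numpy as np

    def r_squared(x, y, b0, b1):
        """R^2 = ESS/TSS for the fitted line y_hat = b0 + b1*x."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        y_hat = b0 + b1 * x
        ess = np.sum((y_hat - y.mean()) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        return ess / tss

    # With a single regressor, this equals the squared sample correlation:
    # np.corrcoef(x, y)[0, 1] ** 2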

The Standard Error of the Regression (SER)

The SER measures the spread of the distribution of u. The SER is (almost) the sample standard deviation of the OLS residuals:

SER = √[(1/(n−2)) Σ (ûi − û̄)²] = √[(1/(n−2)) Σ ûi²]

The second equality holds because the sample mean of the OLS residuals is zero (û̄ = 0, by normal equation (1)).

The SER:
• has the units of u, which are the units of Y.
• measures the average "size" of the OLS residual (the average "mistake" made by the OLS regression line).
• The root mean squared error (RMSE) is closely related to the SER:

RMSE = √[(1/n) Σ ûi²]

This measures the same thing as the SER; the minor difference is division by n instead of n−2.

Technical note: why divide by n−2 instead of n−1?
• Division by n−2 is a "degrees of freedom" correction, just like division by n−1 in the sample variance s²Y, except that for the SER two parameters have been estimated (β0 and β1, by β̂0 and β̂1), whereas in s²Y only one has been estimated (μY, by Ȳ).
• When n is large, it makes negligible difference whether n, n−1, or n−2 is used, although the conventional formula uses n−2 when there is a single regressor.
• For details, see Section 17.4.
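A sketch of the SER and RMSE side by side, showing the n−2 versus n division (the function name is ours):

    import numpy as np

    def ser_and_rmse(x, y, b0, b1):
        """SER divides the SSR by n-2 (two estimated parameters); RMSE by n."""
        u_hat = np.asarray(y, float) - (b0 + b1 * np.asarray(x, float))
        n = u_hat.size
        ssr = np.sum(u_hat ** 2)
        return np.sqrt(ssr / (n - 2)), np.sqrt(ssr / n)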

Example of the R² and the SER

R² = 0.05, SER = 18.6

STR explains only a small fraction of the variation in test scores. Does this make sense? Does this mean STR is unimportant in a policy sense? No.

The Least Squares Assumptions
• What, in a precise sense, are the properties of the OLS estimator? We would like it to be unbiased and to have a small variance. Does it? Under what conditions is it an unbiased estimator of the true population parameters?
• To answer these questions, we need to make some assumptions about how Y and X are related to each other, and about how they are collected (the sampling scheme).
• These assumptions (there are three) are known as the Least Squares Assumptions.

The Least Squares Assumptions
1. The conditional distribution of u given X has mean zero, that is, E(u|X = x) = 0. This implies that β̂1 is unbiased.
2. (Xi, Yi), i = 1, …, n, are i.i.d. This is true if X and Y are collected by simple random sampling. This delivers the sampling distribution of β̂0 and β̂1.
3. Large outliers in X and/or Y are rare. Technically, X and u have four moments, that is: E(X⁴) < ∞ and E(u⁴) < ∞. Outliers can result in meaningless values of β̂1.

Least squares assumption #1: E(u|X = x) = 0. For any given value of X, the mean of u is zero.

Example: Assumption #1 and the class size example. "Other factors" include:
• parental involvement
• outside learning opportunities (extra math class, …)
• home environment
• family income (a useful proxy for many such factors)

So E(u|X = x) = 0 means E(Family Income|STR) = constant (which implies that family income and STR are uncorrelated).

Least squares assumption #2: (Xi, Yi), i = 1, …, n, are i.i.d.

This arises automatically if the entity (individual, district) is sampled by simple random sampling: the entity is selected, and then, for that entity, X and Y are observed (recorded).

The main place we will encounter non-i.i.d. sampling is when data are recorded over time ("time series data"); this will introduce some extra complications.

Least squares assumption #3: Large outliers are rare.

Technical statement: E(X⁴) < ∞ and E(u⁴) < ∞.
• A large outlier is an extreme value of X or Y.
• On a technical level, if X and Y are bounded, then they have finite fourth moments. (Standardized test scores automatically satisfy this; STR, family income, etc. satisfy this too.)
• However, the substance of this assumption is that a large outlier can strongly influence the results.

OLS can be sensitive to an outlier
• Is the lone point an outlier in X or Y?
• In practice, outliers are often data glitches (coding or recording problems), so check your data for outliers! The easiest way is to produce a scatterplot. The small simulation below illustrates the sensitivity.
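A minimal simulation of that sensitivity, on hypothetical data (all numbers are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, 50)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 50)

    def slope(x, y):
        """OLS slope estimate."""
        dx = x - x.mean()
        return np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)

    print(slope(x, y))              # close to the true slope of 2
    x_bad = np.append(x, 10.0)      # a single extreme X value...
    y_bad = np.append(y, -30.0)     # ...paired with an extreme Y value
    print(slope(x_bad, y_bad))      # pulled far from 2 by one point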

Sampling Distribution of the OLS Estimator

The OLS estimator is computed from a sample of data; a different sample gives a different value of β̂1. This is the source of the "sampling uncertainty" of β̂1. We want to:
• quantify the sampling uncertainty associated with β̂1.
• use β̂1 to test hypotheses such as β1 = 0.
• construct a confidence interval for β1.

All these require figuring out the sampling distribution of the OLS estimator. Two steps to get there:
• Probability framework for linear regression.
• Distribution of the OLS estimator.

Probability Framework for Linear Regression

The probability framework for linear regression is summarized by the three least squares assumptions.
• Population: the population of interest (e.g., all possible school districts).
• Random variables: Y, X (e.g., Test Score, STR).
• Joint distribution of (Y, X): the population regression function is linear; E(u|X) = 0; X and Y have finite fourth moments.
• Data collection by simple random sampling: {(Xi, Yi)}, i = 1, …, n, are i.i.d.

The Sampling Distribution of β̂1

Like Ȳ, β̂1 has a sampling distribution.
• What is E(β̂1)? (Where is it centered?)
• What is Var(β̂1)? (A measure of sampling uncertainty.)
• What is its sampling distribution in small samples?
• What is its sampling distribution in large samples?

The mean and variance of the sampling distribution of β̂1

Some algebra shows that

β̂1 − β1 = [Σ (Xi − X̄)ui] / [Σ (Xi − X̄)²]

Thus E(β̂1 − β1) = 0 because E(ui|Xi) = 0 (least squares assumption #1).

Hence β̂1 is unbiased. (The argument uses the Law of Iterated Expectations: E(Y) = E(E(Y|X)).)

Next calculate Var(β̂1). Write

β̂1 − β1 = [(1/n) Σ vi] / [((n−1)/n) sX²], where vi = (Xi − X̄)ui and sX² is the sample variance of X.

When n is large:
• vi = (Xi − X̄)ui ≈ (Xi − μX)ui, which is i.i.d. and has two moments; that is, Var(vi) < ∞. Thus v̄ = (1/n) Σ vi is distributed N(0, Var(vi)/n) when n is large (central limit theorem).
• The sample variance sX² is approximately equal to σX² when n is large.
• (n−1)/n ≈ 1 when n is large.

Putting these together we have:

Large-n approximation to the distribution of β̂1:

β̂1 − β1 ≈ v̄ / σX², which is approximately distributed N(0, σv² / (n σX⁴)).

Because v = (X − μX)u, we can write this as:

β̂1 is approximately distributed N(β1, (1/n) × var[(X − μX)u] / [σX²]²).
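A Monte Carlo sketch of this result, on simulated data (all parameter values are illustrative): drawing many samples and re-estimating β̂1 each time shows a distribution centered at β1 whose spread shrinks with n.

    import numpy as np

    rng = np.random.default_rng(2)
    beta0, beta1, n, reps = 5.0, 2.0, 100, 5000

    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.normal(0.0, 1.0, n)
        u = rng.normal(0.0, 1.0, n)
        y = beta0 + beta1 * x + u
        dx = x - x.mean()
        slopes[r] = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)

    print(slopes.mean())  # close to beta1 = 2 (unbiasedness)
    print(slopes.std())   # sampling SD; shrinks roughly like 1/sqrt(n)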

The larger the variance of X, the smaller the variance of β̂1

The math:

Var(β̂1) ≈ (1/n) × σv² / (σX²)², where vi = (Xi − μX)ui.

The variance of X appears (squared) in the denominator, so increasing the spread of X decreases the variance of β̂1.

The intuition: if there is more variation in X, then there is more information in the data that you can use to fit the regression line. This is most easily seen in a figure.

The larger the variance of X, the smaller the variance of β̂1

There are the same number of black and blue dots: using which would you get a more accurate regression line? (The simulation below makes the same point numerically.)
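A numerical version of the figure's point, again on simulated data (parameter values are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)

    def mc_slope_sd(x_sd, n=100, reps=5000):
        """Monte Carlo SD of the OLS slope for a given spread of X."""
        slopes = np.empty(reps)
        for r in range(reps):
            x = rng.normal(0.0, x_sd, n)
            y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)
            dx = x - x.mean()
            slopes[r] = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
        return slopes.std()

    print(mc_slope_sd(0.5))  # small spread in X: noisier slope estimate
    print(mc_slope_sd(2.0))  # large spread in X: tighter slope estimate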

• Law of Large Numbers: under certain conditions on Y1, …, Yn, the sample average Ȳ converges in probability to the population mean. If Y1, …, Yn are i.i.d., E(Yi) = μY, and Var(Yi) < ∞, then Ȳ →p μY.

• The Central Limit Theorem: if Y1, …, Yn are i.i.d. and 0 < σY² < ∞, then √n (Ȳ − μY) / σY →d N(0, 1). In other words, the asymptotic distribution of √n (Ȳ − μY) / σY is N(0, 1).
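Both theorems are easy to see in simulation (a sketch on normal draws; the distribution choice is illustrative):

    import numpy as np

    rng = np.random.default_rng(4)
    mu, sigma, reps = 3.0, 2.0, 10000

    for n in (10, 1000):
        draws = rng.normal(mu, sigma, (reps, n))
        ybar = draws.mean(axis=1)
        z = np.sqrt(n) * (ybar - mu) / sigma
        print(n, ybar.mean(), z.std())  # ybar -> mu (LLN); z approx N(0,1) (CLT)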

Slutsky's theorem combines consistency and convergence in distribution. Suppose that an →p a, where a is a constant, and Sn →d S. Then an + Sn →d a + S, and an Sn →d aS.

• Continuous mapping theorem: if g is a continuous function, then Sn →p a implies g(Sn) →p g(a), and Sn →d S implies g(Sn) →d g(S).

Another approach to obtaining an estimator: apply the Law of Large Numbers.

Least squares assumption #1 implies E(u) = 0 and E(Xu) = 0, that is,

E(Y − β0 − β1X) = 0 and E[X(Y − β0 − β1X)] = 0.

Applying the LLN, the sample averages (1/n) Σ (Yi − b0 − b1Xi) and (1/n) Σ Xi(Yi − b0 − b1Xi) converge in probability to these population means.

Replacing population means with sample averages in this way is called the analogy principle. Setting the two sample averages above to zero leads to the two normal equations of bivariate least squares regression.
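A sketch of the analogy principle in code: set the two sample moments to zero and solve the resulting 2×2 linear system (the function name is ours).

    import numpy as np

    def ols_normal_equations(x, y):
        """Solve the two normal equations for (b0, b1)."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = x.size
        # sum(y - b0 - b1*x) = 0  and  sum(x*(y - b0 - b1*x)) = 0
        A = np.array([[n, x.sum()], [x.sum(), np.sum(x ** 2)]])
        b = np.array([y.sum(), np.sum(x * y)])
        return np.linalg.solve(A, b)  # (b0_hat, b1_hat)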

Summary for the OLS estimator β̂1: under the three Least Squares Assumptions,
• The exact (finite sample) sampling distribution of β̂1 has mean β1 (β̂1 is an unbiased estimator of β1), and Var(β̂1) is inversely proportional to n.
• Other than its mean and variance, the exact distribution of β̂1 is complicated and depends on the distribution of (X, u).
• β̂1 →p β1 (law of large numbers).
• (β̂1 − E(β̂1)) / √Var(β̂1) is approximately distributed N(0, 1). (CLT)