CENTRE FOR INNOVATION, RESEARCH AND COMPETENCE IN THE LEARNING ECONOMY Session 3: Basic techniques for innovation data analysis. Part II: Introducing regression.

Presentation transcript:

CENTRE FOR INNOVATION, RESEARCH AND COMPETENCE IN THE LEARNING ECONOMY
Session 3: Basic techniques for innovation data analysis. Part II: Introducing regression analysis
Taehyun Jung, CIRCLE, Lund University
December, for the Survey of Quantitative Research, NORSI

Objectives of this session

Contents

Bivariate Linear Regression Model


Least squares residuals

Highest R2

Unbiasedness

Efficiency
 Would you prefer to obtain your estimate by making a single random draw out of an unbiased sampling distribution with a small variance or out of an unbiased sampling distribution with a large variance?
 The unbiased estimator with the smallest variance is said to be efficient (the "best" unbiased estimator).
 BLUE: best linear unbiased estimator.

Ordinary Least Squares


 The discrepancies between the actual and fitted values of Y are known as the residuals.
– Note that the values of the residuals are not the same as the values of the disturbance term.

Deriving linear regression coefficients
Conditions for minimizing RSS
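The formulas on this slide did not survive the transcript; as a sketch of the standard derivation, in the b1, b2 notation used on the following slides, the residual sum of squares and the first-order conditions for minimizing it are:

RSS = \sum_{i=1}^{n} (Y_i - b_1 - b_2 X_i)^2
\frac{\partial RSS}{\partial b_1} = -2 \sum_{i=1}^{n} (Y_i - b_1 - b_2 X_i) = 0
\frac{\partial RSS}{\partial b_2} = -2 \sum_{i=1}^{n} X_i (Y_i - b_1 - b_2 X_i) = 0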

Deriving linear regression coefficients (cont'd)

We chose the parameters of the fitted line so as to minimize the sum of the squares of the residuals. As a result, we derived the expressions for b1 and b2.
[Scatter diagram: the true model Y = β1 + β2X + u and the fitted line Ŷ = b1 + b2X]
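The expressions themselves are not reproduced in the transcript; solving the first-order conditions sketched above gives the standard OLS results

b_2 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}, \qquad b_1 = \bar Y - b_2 \bar X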

[Scatter diagram: hourly earnings in 2002 plotted against years of schooling, defined as highest grade completed, for a sample of 540 respondents from the National Longitudinal Survey of Youth]

Interpretation of a regression equation
 In this case there is only one explanatory variable, S, and the output reports its estimated coefficient. _cons, in Stata, refers to the constant; the coefficient reported for it is the estimate of the intercept. The numerical values are interpreted on the next slide.
. reg EARNINGS S
[Stata output: regression of EARNINGS on S for 540 observations; the numerical values were not preserved in the transcript]
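A minimal Stata sketch of this step, assuming the NLSY earnings data used in the slides are loaded (the variable names EARNINGS and S follow the slides):
. regress EARNINGS S
. display _b[S]       // estimated slope: change in hourly earnings per extra year of schooling
. display _b[_cons]   // estimated intercept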

Interpretation of a regression equation
 Hourly earnings increase by $2.46 for each extra year of schooling.
 Literally, the constant indicates that an individual with no years of education would have to pay $13.93 per hour to be allowed to work.
– Nonsense!
– The only function of the constant term is to enable you to draw the regression line at the correct height on the scatter diagram.

Testing a hypothesis relating to a regression coefficient
 You can see that the t statistic for the coefficient of S is enormous. We would reject the null hypothesis that schooling does not affect earnings at the 1% significance level (critical value about 2.59).
. reg EARNINGS S
[Stata output: same regression as above; the numerical values were not preserved in the transcript]
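The test can be reproduced in Stata as follows (a sketch under the same assumed dataset and variable names as above):
. regress EARNINGS S
. display _b[S]/_se[S]   // t statistic for H0: the coefficient on S is zero
. test S = 0             // the equivalent F test reported by Stata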

Testing a hypothesis relating to a regression coefficient: confidence intervals
 The 95% confidence interval for the slope takes the form b2 – tcrit × s.e.(b2) ≤ β2 ≤ b2 + tcrit × s.e.(b2).
– The critical value of t at the 5% significance level with 538 degrees of freedom is approximately 1.96, so the limits reported by Stata are the point estimate plus or minus 1.96 standard errors.
. reg EARNINGS S
[Stata output: the numerical confidence limits for the coefficient on S were not preserved in the transcript]
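The interval can also be computed by hand from the stored results (same assumptions as above):
. regress EARNINGS S
. display _b[S] - invttail(e(df_r), 0.025)*_se[S]   // lower 95% confidence limit for the slope
. display _b[S] + invttail(e(df_r), 0.025)*_se[S]   // upper 95% confidence limit for the slope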

Hypotheses concerning goodness of fit are tested via the F statistic
 The null hypothesis that we are going to test is that the model has no explanatory power.
 k is the number of parameters in the regression equation, which at present is just 2.
 n – k is, as with the t statistic, the number of degrees of freedom.
 F is a monotonically increasing function of R2.
– Why do we perform the test indirectly, through F, instead of directly through R2? After all, it would be easy to compute the critical values of R2 from those for F.
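For reference (a reconstruction, since the formula is not reproduced in the transcript), the F statistic in terms of R2, with k and n as defined above, is:

F(k-1,\; n-k) = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}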

For simple regression analysis, the F statistic is the square of the t statistic.

Calculation of F statistic
. reg EARNINGS S
[Stata output: ANOVA table and F statistic; the numerical values were not preserved in the transcript]
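A sketch of how the reported F statistic relates to R2, using Stata's stored results after regress (same assumed dataset and variable names as above):
. regress EARNINGS S
. display e(F)                                      // F statistic from the regression header
. display (e(r2)/e(df_m)) / ((1 - e(r2))/e(df_r))   // the same value computed from R-squared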

OLS Assumptions

Assumptions for OLS
A.1 The model is linear in parameters and correctly specified.

A.2 There is some variation in the regressor in the sample.

A.3 The disturbance term has zero expectation.

A.4 The disturbance term is homoscedastic
 We assume that the disturbance term is homoscedastic, meaning that its value in each observation is drawn from a distribution with constant population variance.
 Once we have generated the sample, the disturbance term will turn out to be greater in some observations and smaller in others, but there should not be any reason for it to be more erratic in some observations than in others.

Consequences of using OLS in the presence of heteroskedasticity
 OLS estimation still gives unbiased coefficient estimates, but they are no longer BLUE.
 This implies that if we still use OLS in the presence of heteroskedasticity, our standard errors could be inappropriate and hence any inferences we make could be misleading.
 Whether the standard errors calculated using the usual formulae are too big or too small will depend upon the form of the heteroskedasticity.
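A common follow-up, sketched here under the same assumed dataset and variable names, is to test for heteroskedasticity and, if it is suspected, to report heteroskedasticity-robust standard errors:
. regress EARNINGS S
. estat hettest                     // Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
. regress EARNINGS S, vce(robust)   // same point estimates, heteroskedasticity-robust standard errors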

Multiple Regression

 An earnings function model where hourly earnings, EARNINGS, depend on years of schooling (highest grade completed), S, and years of work experience, EXP.
 Note that the interpretation of the model does not depend on whether S and EXP are correlated or not.
 However, we do assume that the effects of S and EXP on EARNINGS are additive. The impact of a difference in S on EARNINGS is not affected by the value of EXP, or vice versa.
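Written out in the notation of the earlier slides (the equation itself did not survive the transcript), the model is:

EARNINGS = \beta_1 + \beta_2\, S + \beta_3\, EXP + u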

Calculating regression coefficients
 The expression for b1 is a straightforward extension of the expression for it in simple regression analysis.
 However, the expressions for the slope coefficients are considerably more complex than that for the slope coefficient in simple regression analysis.
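As an illustration of that complexity (a standard result, not reproduced on the slide), in a model Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u the OLS estimator of the slope on X_2 is

b_2 = \frac{\mathrm{Cov}(X_2, Y)\,\mathrm{Var}(X_3) - \mathrm{Cov}(X_3, Y)\,\mathrm{Cov}(X_2, X_3)}{\mathrm{Var}(X_2)\,\mathrm{Var}(X_3) - \left[\mathrm{Cov}(X_2, X_3)\right]^2}

using sample variances and covariances, with a symmetric expression for b_3, and b_1 obtained from the sample means.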

Interpretation of a regression equation
 The regression indicates that earnings increase by $2.68 for every extra year of schooling and by $0.56 for every extra year of work experience.
 Intercept: the literal interpretation is obviously impossible. The lowest value of S in the sample was 6. We have obtained a nonsense estimate because we have extrapolated too far from the data range.
. reg EARNINGS S EXP
[Stata output: regression of EARNINGS on S and EXP; the numerical values were not preserved in the transcript]

Properties of the multiple regression coefficients
 The assumptions are the same as in the simple regression model; only A.2 is different (there must be no exact linear relationship among the regressors).

Multicollinearity
 The inclusion of the new term, EXPSQ, has had a dramatic effect on the coefficient of EXP.
 The high correlation between EXP and EXPSQ causes the standard error of EXP to be larger than it would have been if EXP and EXPSQ had been less highly correlated, warning us that the point estimate is unreliable.
. reg EARNINGS S EXP EXPSQ
. reg EARNINGS S EXP
. cor EXP EXPSQ
[Stata output: the two coefficient tables and the correlation between EXP and EXPSQ for 540 observations; the numerical values were not preserved in the transcript]
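A minimal Stata sketch of the steps behind these tables, assuming (as the variable name suggests) that EXPSQ is the square of EXP:
. generate EXPSQ = EXP^2
. correlate EXP EXPSQ
. regress EARNINGS S EXP EXPSQ
. estat vif          // variance inflation factors, a standard multicollinearity diagnostic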

 When high correlations among the explanatory variables lead to erratic point estimates of the coefficients, large standard errors, and unsatisfactorily low t statistics, the regression is said to be suffering from multicollinearity.
– The standard errors and t tests remain valid.
 Multicollinearity may also be caused by an approximate linear relationship among the explanatory variables. When there are only two explanatory variables, an approximate linear relationship means there will be a high correlation, but this is not always the case when there are more than two.
 Note that multicollinearity does not cause the regression coefficients to be biased.

What can you do about multicollinearity?
 Reduce the variance of the disturbance term by including further relevant variables in the model.
 Increase the number of observations.
 Increase MSD(X2) (the variation in the explanatory variables).
– For example, if you were planning a household survey with the aim of investigating how expenditure patterns vary with income, you should make sure that the sample included relatively rich and relatively poor households as well as middle-income households.
 Reduce …
 Combine the correlated variables.
 Drop some of the correlated variables.
– However, this approach to multicollinearity is dangerous. It is possible that some of the variables with insignificant coefficients really do belong in the model and that the only reason their coefficients are insignificant is because there is a problem of multicollinearity.

Kennedy's 10 commandments of applied econometrics
 Use common sense and economic theory.
 Avoid Type III errors.
– Producing the right answer to the wrong question is called a Type III error.
– Place relevance before mathematical elegance.
 Know the context.
– Do not perform ignorant statistical analyses.
 Inspect the data.
– Place data cleanliness ahead of econometric godliness.
 Keep it sensibly simple.
– Do not talk Greek without knowing the English translation.
 Look long and hard at your results.
– Apply the laugh test.
 Beware the costs of data mining.
– E.g. tailoring one's specification to the data, resulting in a specification that is misleading.
 Be prepared to compromise.
– Should a proxy be used? Can sample attrition be ignored?
 Do not confuse significance with substance.
 Report a sensitivity analysis.