Quantitative Methods – Week 5: Linear Regression Analysis


1 Quantitative Methods – Week 5: Linear Regression Analysis
Roman Studer Nuffield College

2 Homework (II)
Many factors (variables) are potentially associated with the drop in crime rates in the US. Where do you find correlations between a variable and the falling crime rate? Which variables are positively, which ones negatively correlated with the crime rate variable? Where do you not find any correlation? Explanations?

Concept/Variable       Measured Variable                   Correlation
Strong economy         Unemployment rate, poverty level
Reliance on prisons    Number of prisoners
Education              Educational attainment
Legalizing abortion    Number of abortions

How did you solve the lag problem with the abortion variable?

3 Simple Linear Regression: Introduction
As with correlation, we are still looking at the relationship between two (ratio-level) variables, but we no longer treat them symmetrically. Instead we distinguish between the two: x influences y.
x: independent variable, explanatory variable
y: dependent variable, explained variable
As a consequence, the questions that can be answered with correlation analysis and regression analysis are different:

Correlation                                  Regression
Is there an association between X and Y?     How exactly does Y change if X changes?
How strong? Positive or negative?            How much of the change in Y is explained?

4 Simple Linear Regression: Introduction (II)
Therefore, we move from issues of mere association to issues of causality… HOWEVER: A simple linear regression cannot establish causation! We assume that causality runs from x to y. Sometimes this assumption is questionable; then more sophisticated models and tests are needed: simultaneous equation models, two-stage least squares regressions, causality tests. So we have to be careful when making such assumptions. Let’s look at some pairs of variables. Which is the explanatory variable and which the dependent variable? In which cases might you expect a mutual interaction between the two variables?
Height and weight
Birth rate and marriage rate
Rainfall and crop yield
Rate of unemployment and level of relief expenditure in British parishes
Government spending on welfare programmes and the voting share of left-liberal parties
CO2 emissions and global warming

5 Simple Linear Regression: Introduction (III)
Simple linear regression assumes a straight line (“linear”) relationship between two variables. The relationship between two variables can also take a nonlinear form: Kuznets (1955) hypothesised that the relationship between the level of economic development (x) and income inequality (y) is nonlinear (inverted-U shaped). [Figure: income inequality plotted against economic development] Multiple or multivariate regression includes two or more explanatory variables.

6 The Equation of a Straight Line
The equation of a straight line: Y = a + bX. Example: y = 2 + 3x.

7 The Equation of a Straight Line (II)
Equation of a straight line: Y = a + bX. Y and X are the dependent and the independent variable respectively, and y1, y2, …, yn and x1, x2, …, xn are their values. a is the intercept: it determines the level at which the straight line crosses the vertical axis, i.e. it gives the value of y when x = 0. b measures the slope of the line. Positive relationship: b > 0. No relationship: b = 0. Negative relationship: b < 0. If x increases by one unit, y changes by b units (an increase if b is positive, a decrease if b is negative).
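As a quick check of what the intercept and slope mean, here is a minimal Python sketch using the slide's own example y = 2 + 3x. The function name `line` and the test points are illustrative choices, not part of the original slides.

```python
def line(x, a=2.0, b=3.0):
    """Return y = a + b*x; the defaults reproduce the example y = 2 + 3x."""
    return a + b * x

# The intercept a is the value of y where the line crosses the vertical axis (x = 0).
intercept = line(0)

# The slope b is the change in y when x increases by one unit (here from 4 to 5).
step = line(5) - line(4)

print(intercept, step)  # prints 2.0 3.0
```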

8 Fitting the Regression Line
How do we find the line of “best fit”, i.e. the line that best describes the linear relationship between X and Y? This is an issue because, in the real world, relationships between two variables never follow a completely linear pattern…

9 Fitting the Regression Line (II)
The regression line predicts the values of Y based on the values of X. Thus, the best line will minimise the deviation between the predicted and the actual values (the error, e). For a single observation such as the UK: e_UK = Y_UK − Ŷ_UK. [Figure: regression line IP = a + b·Wage, with the error for the UK marked]

10 Fitting the Regression Line (III)
However, to avoid the problem that positive and negative deviations cancel out, we look at squared deviations (Yi − Ŷi)². Also, as we want to minimise the total error, we are interested in minimising the sum of all squared errors. Therefore, the regression line is the line that “minimizes the sum of the squares of the vertical deviation of all pairs of the values of X and Y from the regression line”. This estimation procedure is known as ordinary least squares (OLS) regression. Its formal derivation yields the two formulae needed to calculate the regression line, i.e. a formula for the intercept (a) and a formula for the slope (b).
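The two formulae are not preserved in this transcript. In their standard textbook form, assuming X̄ and Ȳ denote the sample means of X and Y and the line is written Y = a + bX as above, they read:

```latex
b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
a = \bar{Y} - b\,\bar{X}
```

The slope is computed first. The intercept then follows from the fact that the OLS line always passes through the point of means (X̄, Ȳ).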

11 Fitting the Regression Line (IV)
With these two formulae, we can easily calculate the regression line for small datasets by hand. However, for large datasets and when we have more than one explanatory variable, we use Stata to do it for us.
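The hand calculation can be sketched in a few lines of Python, assuming the standard OLS formulae (slope = sum of cross-deviations divided by sum of squared deviations of X; intercept = mean of Y minus slope times mean of X). The function name `ols_fit` and the five data points are made up for illustration and do not come from the course datasets.

```python
def ols_fit(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Illustrative (hypothetical) data, small enough to verify by hand.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = ols_fit(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} X")  # prints: Y = 2.20 + 0.60 X
```

By hand: the means are 3 and 4, the cross-deviations sum to 6 and the squared deviations of X sum to 10, so b = 0.6 and a = 4 − 0.6 × 3 = 2.2.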

12 The Goodness of Fit
Once we get the “best fit” regression line, we still want to know “how good a fit” this line really is: How much of the variation in Y is explained by this regression line? The measure to describe the explanatory power of a regression is the coefficient of determination r², which is equal to the square of the correlation coefficient. It is a measure of the success with which the movements in Y are explained by the movements in X. In particular, it measures how much of the deviation of Y from the mean of Y is explained by the regression.

13 The Goodness of Fit (II)
This is the interactive part of the class…. Please explain the concept depicted in the following graph in your own words… [Figure: regression line plotted against x]

14 The Goodness of Fit (III)
Total variation = explained variation + unexplained variation, i.e. TSS = ESS + USS. R² = ESS/TSS (Explained Sum of Squares / Total Sum of Squares).
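The decomposition can be checked numerically. The following Python sketch fits a line by the standard OLS formulae and then computes TSS, ESS and USS. The function name `sums_of_squares` and the five data points are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def sums_of_squares(xs, ys):
    """Fit Y = a + b*X by OLS and return (TSS, ESS, USS)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]
    tss = sum((y - y_bar) ** 2 for y in ys)              # total variation
    ess = sum((f - y_bar) ** 2 for f in fitted)          # explained variation
    uss = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    return tss, ess, uss

# Illustrative (hypothetical) data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

tss, ess, uss = sums_of_squares(xs, ys)
r_squared = ess / tss
print(f"TSS={tss:.2f} ESS={ess:.2f} USS={uss:.2f} R2={r_squared:.2f}")
```

For these points TSS = 6, ESS = 3.6 and USS = 2.4, so the identity holds and R² = 0.6: the line explains 60% of the variation in Y.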

15 The Goodness of Fit (IV)
R² gives the proportion of the sample variation in y that is explained by x. R² = 0.35 means that the explanatory variable explains 35% of the variation in the dependent variable. R² ranges between 0 and 1. The higher R², the better the fit of the regression line to the data. R² can be used to compare the explanatory power of different regression models (with the same dependent variable). R² has to be interpreted in the light of what we would expect. A low R² does not necessarily mean that an OLS regression equation is useless.

16 Computer Class: Regression Analysis

17 Exercises 1. Weimar elections: Unemployment and votes for the Nazis
A) Descriptive Statistics: Get the dataset about the Weimar election of 1932 at […]. Look at the variables (votes for the Nazi party, level of unemployment) in turn. Get a first visualisation of the data; does it look normally distributed? Compute the mean, median, standard deviation, coefficient of variation, kurtosis and skewness for each variable. Make a scatter plot; do you think the two variables are associated? How and how strongly?
B) Regression Analysis: Which one is the dependent variable? Estimate the regression line. What is the interpretation? What is the explanatory power of the regression? Draw a scatter plot and add the regression line.

18 Exercises (II) 2. Weimar elections: Nazi votes and the share of Catholics. Do the same exercises as with the unemployment rate: A) Descriptive Statistics, B) Regression Analysis.

19 Appendix: STATA Commands
regress depvar indepvars
Linear regression; the dependent and independent variables are indicated by their order. The dependent variable depvar comes first, then the independent variable(s) indepvars follow.
predict yhat, xb
Calculates the linear prediction for each observation, i.e. yhat_i = a + b*indepvar_i.
predict res, resid
The option “resid” after the comma lets STATA calculate and save the residuals, i.e. res_i = y_i − yhat_i.
generate newvar=exp
Creates a new variable. STATA provides numerous functions; see STATA’s Help on “Functions and expressions”/“math functions”.

20 Homework
Readings: Feinstein & Thomas, Ch. 5
Problem Set 4:
Finish the exercises from today’s computer class if you haven’t done so already. Include all the results and answers in the file you send me.
On the next slide you see part of the macroeconomic dataset we used in week 3. Look at the variables “GDP per head” and “education”, assuming that GDP per head is the dependent variable and education the explanatory variable. Calculate by hand: the regression coefficient, b; the intercept, a; the total sum of squares; the explained sum of squares; the residual sum of squares; the coefficient of determination (hint: graph the values first by hand and then look at tables 4.1 and 4.2).
Download the complete data set at […] (“macro data”) and calculate the regression line and the coefficient of determination. Interpret the results. Do the results differ from the correlation results?

21 Homework (II)
Country          GDP per Head   Agriculture   Education
Norway           54360          1.60          81
Switzerland      49660          1.40          49
United States    39430          1.20          83
Brazil           3340           10.10         21
Iran             2340           13.70

