Before the class starts:
Log in to a computer
Read the Data analysis assignment 1 on MyCourses
If you use Stata:
  Start Stata
  Start a new do file
  Open the PDF documentation about regression
If you use RStudio:
  Start RStudio
  Start a new R script
  Open R in Action, chapter 8
Linear model and its estimator(s)
Regression analysis
Idea of regression analysis
Concepts:
  Dependent variable (explained variable, response variable, predicted variable, regressand)
  Independent variable (explanatory variable, control variable, predictor variable, regressor)
Objective: Explain the variation in the dependent variable by using the variation in the independent variables
For example: Explain patient satisfaction with physician productivity, physician quality, and physician accessibility
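Model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$
Example
Patient satisfaction = $\beta_0$ + $\beta_1$ physician productivity + $\beta_2$ physician quality + $\beta_3$ physician accessibility + $u$
[Figure: diagram of the model relating $x_1, x_2, \dots, x_k$ to $y$ through $\beta_1, \beta_2, \dots, \beta_k$, with intercept $\beta_0$ and error term $u$]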
Graphical illustration of linear regression
One dependent and one or more independent variables
Explains the conditional mean of the dependent variable
The dependent variable should be normally distributed around its conditional mean
The variance (width) of the dependent variable should not depend on the independent variables
Wooldridge, J. M. (2009). Introductory econometrics: a modern approach (4th ed.). Mason, OH: South Western, Cengage Learning. (p. 26)
Interpretation of the model
Patient satisfaction = $\beta_0$ + $\beta_1$ physician productivity + $\beta_2$ physician quality + $\beta_3$ physician accessibility + $u$
Ceteris paribus (holding the other variables constant), a one-unit increase in physician productivity is associated with a $\beta_1$-unit increase in patient satisfaction
Goodness of fit: $R^2$ and adjusted $R^2$
$R^2$
  The proportion of variance explained
  Also called the “coefficient of determination”
  Positively biased; can only go up (never decreases) when variables are added to the model
Adjusted $R^2$
  Penalizes for a large number of variables and a small sample size
  Not unbiased either
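Both fit measures can be computed by hand and compared with what `lm` reports. The sketch below is only an illustration on simulated data: the variables `x1`, `x2`, `y` and the coefficient values are made up, and it assumes the usual textbook definitions $R^2 = 1 - SSR/SST$ and adjusted $R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)$.

```r
# Simulated data (hypothetical): y depends on x1 only, x2 is pure noise
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
k   <- 2                                  # number of independent variables

ssr <- sum(residuals(fit)^2)              # sum of squared residuals
sst <- sum((y - mean(y))^2)               # total sum of squares
r2     <- 1 - ssr / sst
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hand-computed values should agree with the summary() output
print(all.equal(r2, summary(fit)$r.squared))
print(all.equal(adj_r2, summary(fit)$adj.r.squared))

# R-squared does not go down when the noise variable x2 is added
r2_small <- summary(lm(y ~ x1))$r.squared
print(r2 >= r2_small)
```

Adjusted $R^2$, in contrast, can go down when a variable such as `x2` adds little explanatory power.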
Example data
Estimation
Model: prestige = $\beta_0$ + $\beta_1$ education + $u$
The estimates of $\beta_0$ and $\beta_1$ define the regression line
The rule that is used to obtain estimates from the data is called an estimator
Estimation
Model: prestige = $\beta_0$ + $\beta_1$ education + $u$
Properties of a good estimator of $\beta_0$ and $\beta_1$
  Estimates using population data equal population values (consistency)
  Estimates are correct on average (unbiasedness)
  Variance of the estimates is smaller than variance of estimates from alternative estimators (efficiency)
  Estimates are normally distributed (or at least have a known distribution, normality)
Estimation
Model: prestige = $\beta_0$ + $\beta_1$ education + $u$
One good rule: “Choose $\beta_0$ and $\beta_1$ so that the sum of squared residuals is as small as possible”
This is known as the ordinary least squares (OLS) estimator
A linear model estimated with the OLS estimator is known as OLS regression
A residual is the difference between the observed value and the fitted value: the part of the data not explained by the model
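The least squares rule can be tried out directly. The following is a minimal sketch on simulated data, not the Prestige dataset used in class: the values of `education` and `prestige` are generated with made-up coefficients, and the helper `ssr` is a hypothetical name introduced here. It computes the one-regressor OLS estimates from the closed-form expressions and checks that they match `lm` and give a smaller sum of squared residuals than nearby alternatives.

```r
# Simulated data (hypothetical values, not the class dataset)
set.seed(42)
n <- 100
education <- runif(n, 6, 16)
prestige  <- 10 + 3 * education + rnorm(n, sd = 5)

# Closed-form OLS estimates for a model with one independent variable
b1 <- cov(education, prestige) / var(education)
b0 <- mean(prestige) - b1 * mean(education)

# Sum of squared residuals for a candidate pair (intercept, slope)
ssr <- function(b) sum((prestige - b[1] - b[2] * education)^2)

# lm() uses OLS, so its coefficients should match the closed form
fit <- lm(prestige ~ education)
print(coef(fit))
print(c(b0, b1))

# Moving away from the OLS estimates increases the sum of squared residuals
print(ssr(c(b0, b1)) < ssr(c(b0 + 0.1, b1)))
print(ssr(c(b0, b1)) < ssr(c(b0, b1 + 0.1)))
```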
Summary of the assumptions
1. All relationships are linear
2. Independence of observations
3. No perfect collinearity and non-zero variances of independent variables
4. Error term has expected value of zero given any values of independent variables
5. Error term has equal variance given any values of independent variables
6. Error term is normally distributed
Important to check after estimation (post-estimation diagnostics)
Regression with excel
Data analysis assignment 1
Task
Do a regression analysis with statistical software of your choice using the Prestige dataset used in class. Try to explain income with the other variables. First explain income itself and then, if you find it necessary, explain the logarithm of income. The part about logarithmic transformations in Wooldridge's book is well worth reading. Document your thought process: how you explored the data, how you checked the assumptions, and how the model evolved.
How to get your analysis file started
Stata
  Load the data following the instructions
  Explore the data using e.g. describe, summarize, inspect, codebook, graph matrix, and stem
RStudio
  Load the data following the instructions
  Load the psych, car, and texreg packages by adding library commands to the start of the R file. (If a package is not found, you need to install it.)
  Explore the data using e.g. describe, lowerCor, corr.test, and scatterplotMatrix
How to submit your answer
Stata
  Set your working directory
  Start your do file with log using assignment1, replace text
  End your do file with log close
  After each graph add graph export plotX.pdf
  Open the Word document template from MyCourses
  Copy-paste the content of assignment1.log into the document template and insert the exported figures in the right places
  In Word, write comments in normal style and use headings where appropriate
RStudio
  Compile a notebook in MS Word format
  In Word, write comments in normal style and use headings where appropriate
Regress income on prestige, education, and share of women
Stata
  regress income prestige educat percwomn
  estimates store m1
RStudio
  m1 <- lm(income ~ prestige + educat + percwomn, data = Prestige)
  summary(m1)
Stata
Source | SS df MS    Number of obs =
    F( 3, 98) =
Model |    Prob > F =
Residual |    R-squared =
    Adj R-squared =
Total |    Root MSE =
income | Coef. Std. Err. t P>|t| [95% Conf. Interval]
prestige |
educat |
percwomn |
_cons |

RStudio
Call:
lm(formula = income ~ education + prestige + women, data = Prestige)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
education
prestige e-06 ***
women e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2575 on 98 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 98 DF, p-value: < 2.2e-16
Extract fitted values and residuals
Stata
  Use the predict command
  Plot the distributions using the kdensity command
RStudio
  Use the residuals and fitted functions
  Plot the distributions using the plot(density()) function combination
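For the R route, the extraction and plotting steps might look like the sketch below. It runs on a small simulated data frame `d` with made-up variables `x` and `y` rather than on the class model, so the names are assumptions; with the assignment data the same calls would be applied to the fitted model object.

```r
# Simulated data and model (hypothetical; stands in for the assignment model)
set.seed(7)
d   <- data.frame(x = rnorm(80))
d$y <- 2 + 0.5 * d$x + rnorm(80)
m   <- lm(y ~ x, data = d)

res      <- residuals(m)   # observed minus fitted
fit_vals <- fitted(m)      # values predicted by the model

# Fitted values plus residuals reproduce the observed data
print(all.equal(unname(fit_vals + res), d$y))

# Kernel density plots of the two distributions
plot(density(res), main = "Residuals")
plot(density(fit_vals), main = "Fitted values")
```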
Diagnose the model using the following list of plots
Plot | Stata command | R command
Getting help | help regress postestimation diagnostics plots | Chapter 8 of R in Action
Q-Q plot of studentized residuals | qnorm | qqPlot
Residual-versus-fitted plot | rvfplot | residualPlot
Component-plus-residual plot | cprplot | crPlots
Added-variable plots | avplots | avPlots
Residual-versus-leverage plot | lvr2plot | influencePlot
Modify the model and/or data
Stata
  Delete outliers with drop
  Apply log transformation of variables
  Repeat the regression model
  Apply diagnostic plots
RStudio
  Delete outliers with subset
  Apply log transformation of variables
  Repeat the regression model
  Apply diagnostic plots
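The R steps above can be sketched as follows. This is an illustration under stated assumptions: the data frame `d` is simulated, one artificial outlier is planted in it, and the cutoff of 1e5 is an arbitrary choice for this example only. With real data, dropping an observation should be justified by the diagnostics and documented.

```r
# Simulated data (hypothetical): income is log-linear in education
set.seed(3)
d <- data.frame(education = runif(60, 6, 16))
d$income <- exp(7 + 0.15 * d$education + rnorm(60, sd = 0.3))
d$income[1] <- 1e6                      # planted outlier for the demonstration

# Original model on the raw scale
m_raw <- lm(income ~ education, data = d)

# Delete the outlier with subset(); the threshold is arbitrary here
d_trim <- subset(d, income < 1e5)

# Repeat the regression with a log-transformed dependent variable
m_log <- lm(log(income) ~ education, data = d_trim)

print(nrow(d_trim))
print(coef(m_log))
```

In the log model, the coefficient of `education` is read approximately as the proportional change in income for a one-unit increase in education.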
Extract fitted values and residuals
Stata
  Use the predict command
  Plot the distributions using the kdensity command
RStudio
  Use the residuals and fitted functions
  Plot the distributions using the plot(density()) function combination
Optional: add categorical variable type
Stata
  Add i.type to regression model
RStudio
  Add type to regression model
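In R, a variable stored as a factor is expanded into dummy variables automatically when it is added to the model formula. The sketch below shows this on simulated data; the level labels `bc`, `wc`, and `prof` and all numeric values are made up for illustration.

```r
# Simulated data (hypothetical) with a three-level categorical variable
set.seed(5)
d <- data.frame(
  education = runif(90, 6, 16),
  type      = factor(rep(c("bc", "wc", "prof"), each = 30))
)
d$income <- 1000 + 500 * d$education +
  2000 * (d$type == "prof") + rnorm(90, sd = 800)

# Adding the factor to the formula creates one dummy per non-reference level
m <- lm(income ~ education + type, data = d)

# The first level in alphabetical order ("bc") is the reference category
print(levels(d$type))
print(names(coef(m)))
```

Each `type` coefficient is then interpreted as the difference from the reference category, holding education constant.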
Report several models as one table
Stata
  Use estimates table m1 m2 m3…
RStudio
  Use screenreg(list(m1, m2, m3, …))
Simulation demonstration: heteroskedasticity