Bivariate Relationships (17.871)

Bivariate Relationships

Testing associations (not causation!)
Continuous data:
- Scatter plot (always use first!)
- (Pearson) correlation coefficient (rare, should be rarer!)
- (Spearman) rank-order correlation coefficient (rare)
- Regression coefficient (common)
Discrete data:
- Cross tabulations
- Differences in means, box plots
- χ²
- Gamma, Beta, etc.

Continuous DV, continuous EV
Dependent Variable: DV
Explanatory (or independent) Variable: EV
Example: What is the relationship between Black percent in state legislatures and Black percent in state populations?

Regression interpretation
Three key things to learn (today):
1. Where regression comes from
2. How to interpret the regression coefficient
3. How to interpret the confidence interval
We will learn how to calculate confidence intervals in a couple of weeks.

Linear Relationship between African American Population & Black Legislators
[Scatter plot: Black % in state legislatures (y-axis) vs. Black % in state population (x-axis)]

The linear relationship between two variables
Regression quantifies how one variable can be described in terms of another.

Linear Relationship between African American Population & Black Legislators
[Scatter plot with fitted line Ŷ: Black % in state legislatures vs. Black % in state population]

How did we get that line?
1. Pick a value of Y_i
[Scatter plot with the point Y_i highlighted: Black % in state legis. vs. Black % in state population]

How did we get that line?
2. Decompose Y_i into two parts
[Scatter plot: Black % in state legis. vs. Black % in state population]

How did we get that line?
3. Label the points: Y_i, the fitted value Ŷ_i, and the "residual" ε_i = Y_i - Ŷ_i
[Scatter plot: Black % in state legis. vs. Black % in state population]

What is ε_i? (sometimes written u_i)
- Wrong functional form
- Measurement error
- Stochastic component in Y

The Method of Least Squares
[Scatter plot labeling Y_i, the fitted value Ŷ_i, and the residual ε_i = Y_i - Ŷ_i]

Solve for the coefficient estimates (B_0 and B_1) by minimizing the sum of squared residuals. Remember this for the problem set!
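For reference, the textbook statement of the least-squares problem and its closed-form bivariate solution (the slide's own formula did not carry over, so this is the standard result it points to):

\[
\min_{\hat{\beta}_0,\hat{\beta}_1}\ \sum_{i=1}^{n}\bigl(Y_i-\hat{\beta}_0-\hat{\beta}_1 X_i\bigr)^2
\qquad\Rightarrow\qquad
\hat{\beta}_1=\frac{\sum_i (X_i-\bar{X})(Y_i-\bar{Y})}{\sum_i (X_i-\bar{X})^2},\qquad
\hat{\beta}_0=\bar{Y}-\hat{\beta}_1\bar{X}
\]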

Regression commands in Stata
reg depvar expvars
- E.g., reg y x
- E.g., reg beo bpop
Making predictions from regression lines:
- predict newvar
- predict newvar, resid (newvar will now equal ε_i)
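A minimal sketch of that workflow, assuming the black-elected-officials data used elsewhere in these slides (variables beo and bpop) are in memory; the new variable names are placeholders:

* run the bivariate regression
reg beo bpop
* fitted values (Y-hat)
predict beohat
* residuals (epsilon-hat)
predict beores, resid
* inspect a few observations
list beo beohat beores in 1/5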

Black elected officials example
. reg beo bpop
[Stata regression output; F(1, 39); remaining numbers not preserved]
Interpretation: a one percentage point increase in black population leads to a .36 increase in black percentage in the legislature.
Always include interpretation in your presentations and papers.

The Linear Relationship between African American Population & Black Legislators
[Scatter plot with fitted regression line: Black % in state legislatures (Y) vs. Black % in state population (X)]

More regression examples

Temperature and Latitude
scatter JanTemp latitude, mlabel(city)

. reg jantemp latitude
[Stata regression output; F(1, 18); remaining numbers not preserved]
Interpretation: a one-degree increase in latitude is associated with about a 2.3-degree decrease in average January temperature (in Fahrenheit).

How to add a regression line: Stata command lfit
scatter JanTemp latitude, mlabel(city) || lfit JanTemp latitude
or, often better:
scatter JanTemp latitude, mlabel(city) m(i) || lfit JanTemp latitude

Presenting regression results (brief aside)
First, show scatter plot:
- Label data points (if possible)
- Include best-fit line
Second, show regression table:
- Assess statistical significance with confidence interval or p-value
- Assess robustness to control variables (internal validity: nonrandom selection)

Bush vote and Southern Baptists

Coefficient interpretation: A one percentage point increase in Baptist percentage is associated with a .26 increase in Bush vote share at the state level.
. reg bush sbc_mpct [aw=votes]
(sum of wgt is e+08)
[Stata regression output; F(1, 48); remaining numbers not preserved]

Interpreting confidence interval
Coefficient interpretation: A one percentage point increase in Baptist percentage is associated with a .26 increase in Bush vote share at the state level.
Confidence interval interpretation: The 95% confidence interval lies between .18 and .34.
. reg bush sbc_mpct [aw=votes]
(sum of wgt is e+08)
[Stata regression output; F(1, 48); remaining numbers not preserved]

. reg loss gallup
[Stata regression output; F(1, 15) = 5.70; remaining numbers not preserved]
Coefficient interpretation: A one percentage point increase in presidential approval is associated with an average of 1.28 more seats won by the president's party.
Confidence interval interpretation: The 95% confidence interval lies between .14 and 2.43.

Additional bivariate regression topics
- Residuals
- Comparing coefficients
- Functional form
- Goodness of fit (R² and SER)
- Correlation
- Discrete DV, discrete EV
- Using the appropriate graph/table

Residuals

e_i = Y_i - B_0 - B_1 X_i

One important numerical property of residuals
The sum of the residuals is zero.

Generating predictions and residuals
. reg jantemp latitude
[Stata regression output; F(1, 18); remaining numbers not preserved]
. predict py
(option xb assumed; fitted values)
. predict ry, resid

. gsort -ry
. list city jantemp py ry
[Listing of the 20 cities sorted from largest to smallest residual: PortlandOR, SanFranciscoCA, LosAngelesCA, PhoenixAZ, NewYorkNY, MiamiFL, BostonMA, NorfolkVA, BaltimoreMD, SyracuseNY, MobileAL, WashingtonDC, MemphisTN, ClevelandOH, DallasTX, HoustonTX, KansasCityMO, PittsburghPA, MinneapolisMN, DuluthMN; numeric values not preserved]
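A quick check of the zero-sum property from the earlier slide, a sketch assuming the residual variable ry created above is still in memory:

* residuals from the regression above
summarize ry
* the mean (and therefore the sum) should be zero up to rounding error
display r(mean)
display r(sum)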

Use residuals to diagnose potential problems

. reg loss gallup if year>1946
[Stata regression output; F(1, 12); remaining numbers not preserved]

. reg loss gallup
[Stata regression output; F(1, 15) = 5.70; remaining numbers not preserved]

scatter loss gallup, mlabel(year) || lfit loss gallup || lfit loss gallup if year >1946

Comparing regression coefficients
As a general rule:
- Code all your variables to vary between 0 and 1 (that is, minimum = 0, maximum = 1)
- Regression coefficients then represent the effect of shifting from the minimum to the maximum
- This allows you to more easily compare the relative importance of coefficients

How to recode variables to 0-1 scale
Party ID example: pid7 (on pset)
- Usually varies from 1 (strong Republican) to 8 (strong Democrat); sometimes 0 needs to be recoded to missing (".")
- Stata code? replace pid7 = ?
- Answer: replace pid7 = (pid7-1)/7
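A minimal do-file sketch of that recode, assuming pid7 is coded 1-8 with 0 standing in for missing (the coding is taken from the slide; the final check is an addition):

* treat 0 as missing, per the slide's note
replace pid7 = . if pid7 == 0
* rescale 1-8 to the 0-1 range
replace pid7 = (pid7 - 1) / 7
* sanity check: min should be 0, max should be 1
summarize pid7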

Regression interpretation with 0-1 scale
Continue with the pid7 example:
- regress natlecon pid7 (both recoded to 0-1 scales)
- pid7 coefficient: b = -.46 (CCES data from 2006)
- Interpretation? Shifting from being a strong Republican to a strong Democrat corresponds with a .46 drop in evaluations of the national economy (on the one-point national economy scale)

Functional Form

About the Functional Form
Linear in the variables vs. linear in the parameters:
- Y = a + bX + e (linear in both)
- Y = a + bX + cX² + e (linear in the parameters, not in the variables)
- Y = a + X^b + e (not linear in the parameters)
Regression must be linear in the parameters.

The Linear and Curvilinear Relationship between African American Population & Black Legislators
scatter beo bpop || qfit beo bpop
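A sketch of estimating the curvilinear specification explicitly rather than only plotting it with qfit, assuming the beo/bpop variables used above (the squared-term variable name is a placeholder):

* create the squared term for the quadratic specification
gen bpop2 = bpop^2
* quadratic (curvilinear) regression: still linear in the parameters
reg beo bpop bpop2
* plot the data with the fitted quadratic
scatter beo bpop || qfit beo bpop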

Log transformations
- Typical case: Y = a + bX + e. b = dY/dX, the unit change in Y given a unit change in X.
- Log explanatory variable: Y = a + b lnX + e. b = dY/(dX/X), the unit change in Y given a % change in X.
- Log dependent variable: lnY = a + bX + e. b = (dY/Y)/dX, the % change in Y given a unit change in X.
- Log both sides (e.g., economic production): lnY = a + b lnX + e. b = (dY/Y)/(dX/X), the % change in Y given a % change in X (an elasticity).
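A hedged sketch of running the log-log (elasticity) specification in Stata, using generic variable names y and x that are not from the slides:

* generate logged versions of the variables
gen ln_y = ln(y)
gen ln_x = ln(x)
* log-log regression: the coefficient on ln_x is the elasticity of y with respect to x
reg ln_y ln_x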

Goodness of regression fit

How “good” is the fitted line?
Goodness-of-fit is often not relevant to research, and it receives too much emphasis.
Focus on:
- Substantive interpretation of coefficients (most important)
- Statistical significance of coefficients (less important): confidence interval, standard error of a coefficient, t-statistic (coeff./s.e.)
Nevertheless, you should know about:
- Standard Error of the Regression (SER), also called the Standard Error of the Estimate (SEE), and regrettably called Root Mean Squared Error (Root MSE) in Stata
- R-squared (R²): often not informative, use sparingly

Standard Error of the Regression: picture
[Scatter plot labeling Y_i, the fitted values Ŷ_i, and the residuals ε_i = Y_i - Ŷ_i]
Add these up after squaring.

Standard Error of the Regression (SER), or Standard Error of the Estimate, or Root Mean Squared Error (Root MSE):
SER = sqrt( Σ e_i² / d.f. )
where d.f. equals n minus the number of estimated coefficients (the B's). In the bivariate regression case, d.f. = n - 2.

SER interpretation
Called “Root MSE” in Stata. On average, in-sample predictions will be off the mark by about one standard error of the regression.
. reg beo bpop
[Stata regression output; F(1, 39); remaining numbers not preserved]
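As a small convenience (a sketch, assuming the beo/bpop regression above has just been run): Stata stores the Root MSE and R-squared after reg, so they can be pulled out directly.

* stored results from the most recent regression
display e(rmse)    // Root MSE, i.e., the SER
display e(r2)      // R-squared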

R²: A less useful measure of fit
[Plot of beo against bpop with fitted values, marking the deviations (Y_i - Ŷ_i) and (Ŷ_i - Ȳ)]

R²: A less useful measure of fit
[Same plot, decomposing the total deviation (Y_i - Ȳ) into the explained part (Ŷ_i - Ȳ) and the residual (Y_i - Ŷ_i)]

R-squared (R²)
Also called the “coefficient of determination”:
R² = explained variation / total variation = Σ(Ŷ_i - Ȳ)² / Σ(Y_i - Ȳ)² = 1 - Σ(Y_i - Ŷ_i)² / Σ(Y_i - Ȳ)²

Interpreting SER (Root MSE) and R²
. reg bush sbc_mpct
[Stata regression output; F(1, 48); remaining numbers not preserved]
Interpreting SER (Root MSE): On average, in-sample predictions about Bush's vote share will be off the mark by about 7.6%.
Interpreting R²: The regression model explains about 19.8% of the variation in Bush vote.

Correlation

Measures how closely data points fall along the line. Varies between -1 and 1 (compare with Tufte p. 102).
Corr(BushPct00, BushPct04) = 0.96, where in general r = Cov(X, Y) / (SD_X · SD_Y).
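In Stata, the same quantity comes from correlate; the variable names here are assumptions, not names from the original data set:

* Pearson correlation between the two vote-share variables
correlate bushpct00 bushpct04
* Spearman rank-order correlation, for comparison
spearman bushpct00 bushpct04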

Warning: Don’t correlate often!
- Correlation only measures a linear relationship
- Correlation is sensitive to variance
- Correlation usually doesn't measure a theoretically interesting quantity
- The same criticisms apply to R², which is the squared correlation between predictions and data points
Instead, focus on regression coefficients (slopes).

Discrete DV, discrete EV
- Crosstabs
- χ²
- Gamma, Beta, etc.

Example: What is the relationship between abortion sentiments and vote choice?
The abortion scale:
1. BY LAW, ABORTION SHOULD NEVER BE PERMITTED.
2. THE LAW SHOULD PERMIT ABORTION ONLY IN CASE OF RAPE, INCEST, OR WHEN THE WOMAN'S LIFE IS IN DANGER.
3. THE LAW SHOULD PERMIT ABORTION FOR REASONS OTHER THAN RAPE, INCEST, OR DANGER TO THE WOMAN'S LIFE, BUT ONLY AFTER THE NEED FOR THE ABORTION HAS BEEN CLEARLY ESTABLISHED.
4. BY LAW, A WOMAN SHOULD ALWAYS BE ABLE TO OBTAIN AN ABORTION AS A MATTER OF PERSONAL CHOICE.
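A sketch of the corresponding crosstab with a chi-squared test in Stata; abortion_scale and votechoice are placeholder variable names, not names from the original data:

* cross-tabulation with row percentages and a Pearson chi-squared test
tabulate abortion_scale votechoice, row chi2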

Abortion and vote choice in 2006

Use the appropriate graph/table
Continuous DV, continuous EV:
- E.g., vote share by income growth
- Use scatter plot
Continuous DV, discrete and unordered EV:
- E.g., vote share by religion or by union membership
- Box plot, dot plot
Discrete DV, discrete EV:
- No graph: use crosstabs (tabulate)
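For the middle case (continuous DV, discrete EV), a hedged sketch of the box-plot and dot-plot options in Stata; voteshare and religion are assumed variable names:

* distribution of the continuous DV within each category of the discrete EV
graph box voteshare, over(religion)
* a dot plot alternative, showing the mean within each category
graph dot (mean) voteshare, over(religion)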

Additional slides

Intuition behind regression: covariance
Cov(BushPct00, BushPct04) = Σ(x_i - x̄)(y_i - ȳ) / (n - 1), computed here from the centered variables x′ and y′, formed by subtracting the sample means (0.588 and 0.609).