QM222 Class 8 Section A1 Using categorical data in regression

QM222 Class 8 Section A1
Using categorical data in regression
And if time, beginning on coefficient statistics
QM222 Fall 2017 Section A1

To-dos
Assignment 2 is due on Wednesday… But you lose only a tiny fraction of points for each day late, so it’s better to hand in a completed assignment late than an incomplete one on time.

Today:
We learn how to make a categorical variable (with 2 categories) into a dummy variable, a variable that is one if in one category, zero otherwise.
We learn to make new variables in Stata.
We learn how to make a dummy variable in Stata, using logical statements.
We run a regression using this dummy variable, and interpret the coefficient on this variable.
We start learning about statistics about coefficients.

Dummy variables (also called indicator variables, binary variables)
Dummy variables take a value of one if a condition is true (that is, a given observation falls into a category) and zero otherwise.
In the Brookline condo data, we know the StreetName. Let’s say that we believe that whether a condo is on Beacon Street (or not) will change its price.
Using data on streets, we can construct a dummy variable, making beaconstreet=1 if a condo is located on Beacon Street, and beaconstreet=0 if located elsewhere.
Note: In this example there are TWO categories: on Beacon or not. We make ONE dummy variable.

Interpreting a Regression with a Dummy Variable
We write down the following linear regression model:
    Price = b_0 + b_1 * beaconstreet
To understand the interpretation of the coefficients, let’s start with the calculation of the following predictions:
Price of condos on Beacon Street (beaconstreet=1):
    Price = b_0 + b_1 * beaconstreet = b_0 + b_1 * 1 = b_0 + b_1
Price of condos located elsewhere (beaconstreet=0):
    Price = b_0 + b_1 * beaconstreet = b_0 + b_1 * 0 = b_0
In other words, the regression with the dummy beaconstreet will give us the predicted price on Beacon Street (when beaconstreet=1) and not on Beacon Street (when beaconstreet=0). The coefficient b_1 is the difference between the two.
We call NOT being on Beacon St. the reference category. It is what happens when the dummy is NOT TRUE, that is, not equal to 1.

Open Brookline Condo data set in Stata (Other materials/brookline_condo.dta)

Making new variables in Stata
Stata commands must be typed in lower case.
Stata variable names are case-sensitive (lower case and upper case are treated as different).
It is easiest if you keep names in lower case. Names cannot contain spaces.
Try to keep the names relatively short (so they print out in full in lists).
How do you make new numerical variables in Stata? In Stata:
    generate newvar = (here put in a formula using PEMDAS, numbers, and variable names)

Making new variables in Stata
For instance, you might want to create a variable for the average size per room. DO IT!
    gen roomsize = size/Rooms
Stata tip: you can generally abbreviate Stata commands. Here, I always use gen instead of generate.
But how do we generate the variable beaconstreet that is equal to one IF something is true? You need a logical statement!
Stata (like Excel) uses logical statements. In Stata, a logical statement starts with the word if and is added to the end of a command.

Logical (if) statements
In Stata logical statements (only), you can use these “operators”:
    ==              equals (double equal signs; use in logical statements only)
    &               and
    |               or
    !=              not equal to
    <  >  <=  >=    less than, greater than, less than or equal to, greater than or equal to
Example:
    sum wage if agep>=25
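To see the operators at work, here is a minimal sketch. It assumes auto.dta, the small example dataset that ships with Stata (it is not part of the course data), and the cutoff values are arbitrary.

```stata
* Assumption: auto.dta is Stata's built-in example dataset, not the course data
sysuse auto, clear

* & (and): cars that are foreign AND cost more than $6,000
count if foreign == 1 & price > 6000

* | (or): cars that are cheap OR fuel-efficient
count if price < 4000 | mpg >= 30

* != (not equal to): summarize price for cars that are not foreign
summarize price if foreign != 1

* Stata treats a missing numeric value (.) as larger than any number,
* so exclude missing values explicitly when using > or >=
summarize price if rep78 >= 4 & rep78 != .
```

The last command is worth remembering: without the `rep78 != .` condition, cars with a missing repair record would be counted as if rep78 were very large.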

Making a dummy variable for Beacon Street
First browse the data to see how StreetName is coded. DO IT!
    gen beaconstreet = 1 if StreetName == "BEACON ST"
Browse again. Why are there so many missing values? Because we didn’t tell the computer what to do if StreetName is NOT "BEACON ST".
What to do instead?
    replace beaconstreet = 0 if StreetName != "BEACON ST"
Or (if you have not yet created beaconstreet), first start with:
    gen beaconstreet = 0
then:
    replace beaconstreet = 1 if StreetName == "BEACON ST"
DO IT!

Making a dummy variable for Beacon Street (Optional)
You can also make a dummy variable in one step, since Stata (like many programs) will put in a “1” if a logical statement is true and a 0 if it is false. Here you could type:
    generate beaconstreet = (StreetName == "BEACON ST")
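Whichever way you build the dummy, it is worth checking it before running a regression. The sketch below uses a tiny made-up dataset (the street names and prices are hypothetical, not taken from the Brookline file) so that it runs on its own.

```stata
* Hypothetical toy data, typed in by hand for illustration only
clear
input str12 StreetName price
"BEACON ST"   450000
"HARVARD ST"  520000
"BEACON ST"   480000
"WINTHROP RD" 510000
end

* One-step dummy: the logical expression is 1 if true, 0 if false
generate beaconstreet = (StreetName == "BEACON ST")

* Check: every street should map to exactly one value, with no missing values
tabulate StreetName beaconstreet, missing

* assert stops with an error if the condition fails for any observation
assert beaconstreet == 1 if StreetName == "BEACON ST"
assert beaconstreet == 0 if StreetName != "BEACON ST"
```

String comparisons in Stata are exact and case-sensitive, so "Beacon St" would not match "BEACON ST". That is why browsing the data first matters.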

Now run a regression of price on beaconstreet
    regress price beaconstreet
DO IT!

(I changed the font here to Courier New)
    . regress price beaconstreet
          Source |       SS           df       MS      Number of obs   =     1,085
    -------------+----------------------------------   F(1, 1083)      =      3.31
           Model |  2.2855e+11         1  2.2855e+11   Prob > F        =    0.0689
        Residual |  7.4673e+13     1,083  6.8951e+10   R-squared       =    0.0031
    -------------+----------------------------------   Adj R-squared   =    0.0021
           Total |  7.4902e+13     1,084  6.9098e+10   Root MSE        =    2.6e+05
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    beaconstreet |  -46969.18   25798.41    -1.82   0.069    -97589.71    3651.345
           _cons |   520728.9   8435.427    61.73   0.000     504177.2    537280.5
    ------------------------------------------------------------------------------
Write the regression equation:
What is the predicted price of a condo on Beacon Street?
What is the predicted price of a condo that’s not on Beacon Street?
What is the difference in prices between those on Beacon St. and NOT?

(I changed the font here to Courier New)
    . regress price beaconstreet
          Source |       SS           df       MS      Number of obs   =     1,085
    -------------+----------------------------------   F(1, 1083)      =      3.31
           Model |  2.2855e+11         1  2.2855e+11   Prob > F        =    0.0689
        Residual |  7.4673e+13     1,083  6.8951e+10   R-squared       =    0.0031
    -------------+----------------------------------   Adj R-squared   =    0.0021
           Total |  7.4902e+13     1,084  6.9098e+10   Root MSE        =    2.6e+05
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    beaconstreet |  -46969.18   25798.41    -1.82   0.069    -97589.71    3651.345
           _cons |   520728.9   8435.427    61.73   0.000     504177.2    537280.5
    ------------------------------------------------------------------------------
Write the regression equation: price = 520729 - 46969 * beaconstreet
What is the predicted price of a condo on Beacon Street? 520729 - 46969 = $473,760
What is the predicted price of a condo that’s not on Beacon Street? $520,729
What is the difference in prices between those on Beacon St. and NOT? $46,969 (condos on Beacon St. are predicted to cost $46,969 less)
YOU PICK UP THE COEFFICIENT ON THE DUMMY ONLY IF THE DUMMY=1
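With a single dummy as the only explanatory variable, the predictions are simply the average prices of the two groups. A minimal sketch of that idea follows. The five prices are invented so the example runs on its own, and the results will not match the Brookline numbers.

```stata
* Hypothetical toy data, values made up for illustration only
clear
input beaconstreet price
1 450000
1 480000
0 520000
0 510000
0 530000
end

regress price beaconstreet

* Predicted price NOT on Beacon Street: the constant alone
display _b[_cons]

* Predicted price on Beacon Street: constant plus the dummy's coefficient
display _b[_cons] + _b[beaconstreet]

* Compare with the average price within each category
tabulate beaconstreet, summarize(price)
```

In this toy data the two displayed predictions should equal the two group means shown by tabulate, and the coefficient on beaconstreet should equal the difference between them.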

Challenge questions (for team)
    . regress price beaconstreet
          Source |       SS           df       MS      Number of obs   =     1,085
    -------------+----------------------------------   F(1, 1083)      =      3.31
           Model |  2.2855e+11         1  2.2855e+11   Prob > F        =    0.0689
        Residual |  7.4673e+13     1,083  6.8951e+10   R-squared       =    0.0031
    -------------+----------------------------------   Adj R-squared   =    0.0021
           Total |  7.4902e+13     1,084  6.9098e+10   Root MSE        =    2.6e+05
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    beaconstreet |  -46969.18   25798.41    -1.82   0.069    -97589.71    3651.345
           _cons |   520728.9   8435.427    61.73   0.000     504177.2    537280.5
    ------------------------------------------------------------------------------
What regression would you get if you made a dummy variable =1 if the condo is NOT on Beacon Street (notonbeacon)?
The intercept (constant _cons) would be:
The coefficient would be:
Choose from: 46969, -46969, 520729, 473760
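Once your team has settled on an answer, one way to check the reasoning is to flip a dummy yourself and compare the two regressions. This is only a sketch on made-up prices (hypothetical values, not the condo data), and notonbeacon is built as 1 minus the original dummy.

```stata
* Hypothetical toy data, values made up for illustration only
clear
input beaconstreet price
1 450000
1 480000
0 520000
0 510000
0 530000
end

* The flipped dummy: 1 when NOT on Beacon Street, 0 when on it
generate notonbeacon = 1 - beaconstreet

* Run both regressions and compare the constants and coefficients
regress price beaconstreet
regress price notonbeacon
```

Look at which number moves into the constant and what happens to the sign of the coefficient when the reference category changes.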

How certain are we that the coefficients we measured are accurate, given that we have only a limited number of observations?

Let’s remember means and standard deviations with normally distributed variables
Approximately 68% (or around 2/3rds) of a variable’s values are within one standard deviation of the mean. We call this the 68% confidence interval (CI), because 68% of the time, the value falls in this range.
Approximately 95% of the values are within two standard deviations of the mean. We call this the 95% confidence interval. The exact multiplier is 1.96; 2.0 is just 1.96 rounded. Use either!
Do problem sets on your own: it is the best way to learn the material. Mistakes on problem sets are not excessively penalized. There may be a pop quiz on the problem set in section when it is due (with p=.5).
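The 68% and 95% figures can be checked directly in Stata. This short sketch uses normal(), Stata's cumulative standard normal function, and invnormal(), its inverse. It needs no data.

```stata
* Share of a normal variable within 1 standard deviation of the mean (about 0.68)
display normal(1) - normal(-1)

* Share within 2 standard deviations (a bit more than 0.95)
display normal(2) - normal(-2)

* Share within 1.96 standard deviations (almost exactly 0.95)
display normal(1.96) - normal(-1.96)

* The multiplier that leaves 2.5% in each tail (about 1.96)
display invnormal(0.975)
```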

Central Limit Theorem (QM221)
The Central Limit Theorem tells us that if you took many samples from a population:
The sample means are distributed (approximately, when each sample is reasonably large) according to a normal distribution curve.
The average of the sample means (across many samples) is the same as the population mean (μ).
The standard deviation of the sample means (across many samples) is the standard error (SE).
[Figure: normal curve of sample means, horizontal axis marked at -3SE, -2SE, -1SE, μ, +1SE, +2SE, +3SE]
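A small simulation can make the theorem concrete. The sketch below is an illustration only. The population (uniform between 0 and 1), the sample size (30), the number of samples (1,000), and the seed are all arbitrary choices, not part of the course materials.

```stata
* Simulation sketch: every number here is an arbitrary choice for illustration
clear
set seed 222
set obs 1000

* Each row is one sample of 30 draws from a uniform(0,1) population,
* which is flat rather than bell-shaped
forvalues i = 1/30 {
    generate u`i' = runiform()
}

* The sample mean for each of the 1,000 samples
egen samplemean = rowmean(u1-u30)

* The mean of the sample means should be close to the population mean, 0.5.
* Their standard deviation is the simulated standard error (roughly 0.05 here).
summarize samplemean

* Histogram of the sample means with a normal curve overlaid
histogram samplemean, normal
```

Even though the underlying population is not normal, the histogram of sample means should look close to the bell curve.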

Standard errors more generally
Sample means have a standard error that tells you how much the means would vary if you had lots of different samples.
Any statistic estimated on a sample has a standard error that tells you how much that statistic would vary if you had lots of different samples.
Regression coefficients also have standard errors.
We are (approximately) 68% certain that the true regression coefficient (if estimated on the entire population) will be within one standard error of the estimated coefficient.
We are (approximately) 95% certain that the true regression coefficient (if estimated on the entire population) will be within two standard errors of the estimated coefficient.

Standard errors of coefficients
price = 12934 + 407.45 size
          Source |       SS       df       MS              Number of obs =    1085
    -------------+------------------------------           F(  1,  1083) = 3232.35
           Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
        Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
    -------------+------------------------------           Adj R-squared =  0.7488
           Total |  7.4902e+13  1084  6.9098e+10           Root MSE      =  1.3e+05
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
           _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
    ------------------------------------------------------------------------------
Next to each coefficient is a standard error.
We are approximately 68% certain that the true coefficient (with an infinitely large sample) is within one standard error of this coefficient: 407.45 +/- 7.167
We are approximately 95% certain that the true coefficient (with an infinitely large sample) is within two standard errors of this coefficient: 407.45 +/- 2 * 7.167

Standard errors of coefficients
price = 12934 + 407.45 size
          Source |       SS       df       MS              Number of obs =    1085
    -------------+------------------------------           F(  1,  1083) = 3232.35
           Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
        Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
    -------------+------------------------------           Adj R-squared =  0.7488
           Total |  7.4902e+13  1084  6.9098e+10           Root MSE      =  1.3e+05
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
           _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
    ------------------------------------------------------------------------------
NOTE: The regression output gives you the 95% confidence interval!
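To see where the reported interval comes from, the sketch below redoes the arithmetic by hand for the coefficient on size. The scalar names (coef, stderr, tcrit) are just labels chosen for this example. The second calculation assumes Stata builds its interval from the t distribution with the residual degrees of freedom (1,083 here).

```stata
* Coefficient and standard error on size, copied from the regression output
scalar coef   = 407.4513
scalar stderr = 7.166659

* Rule-of-thumb 95% interval: coefficient +/- 2 standard errors
display coef - 2*stderr
display coef + 2*stderr

* Interval using the t distribution with 1,083 degrees of freedom;
* invttail(df, p) returns the value with probability p in the upper tail
scalar tcrit = invttail(1083, 0.025)
display coef - tcrit*stderr
display coef + tcrit*stderr
```

The rule-of-thumb interval is slightly wider than the one Stata prints, because the multiplier with this many observations is close to 1.96 rather than 2. The second pair of numbers should come very close to the printed interval.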

How is your project coming?
People who work with the same data set can get together to share what they learn about using it.
ADD HEALTH users: You need my help to read your data. But first you need to list all of the variables that you could possibly want, and which wave each one is in.
ACS users: You need a TA to run the do-file (file ending in .do) that you get when you download the data for Stata.