Independent (X) variables that take on only a limited number of values are termed categorical variables, dummy variables, or indicator variables. Common examples are time periods in which there is a price surge or bubble, months of the year, days of the week, gender, and educational level.
Executive   Annual_Salary
0           $38,450
0           $50,912
0           $29,356
0           $27,750
1           $109,285
0           $48,442
0           $40,207
0           $42,331
1           $87,489
0           $26,118
Salary is the dependent variable that is to be estimated. Executive is a categorical variable that has only two values: 0 means the employee is not an executive; 1 means the employee is an executive. A linear regression model may be constructed: Salary = a + b * Executive. Because Executive is either 0 or 1, a is the average salary of non-executives and a + b is the average salary of executives.
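As a minimal sketch of this model, the fit below uses only the ten sample rows listed in the table above, not the full 52-observation data set, so its coefficients will differ from the full-sample results. The variable names and the choice of NumPy's least-squares solver are assumptions made for illustration.

```python
import numpy as np

# The ten sample rows shown in the table (not the full 52-observation data set).
executive = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0], dtype=float)
salary = np.array([38450, 50912, 29356, 27750, 109285,
                   48442, 40207, 42331, 87489, 26118], dtype=float)

# Design matrix: a column of ones for the intercept a, then the dummy for b.
X = np.column_stack([np.ones_like(executive), executive])
(a, b), *_ = np.linalg.lstsq(X, salary, rcond=None)

# With a single 0/1 regressor, the fit reproduces the two group means.
mean_non_exec = salary[executive == 0].mean()
mean_exec = salary[executive == 1].mean()
print(f"a = {a:,.2f}, b = {b:,.2f}")
print(f"non-executive mean = {mean_non_exec:,.2f}, executive mean = {mean_exec:,.2f}")
```

On these ten rows the intercept a matches the non-executive average and a + b matches the executive average, which is the same relationship the full-sample regression expresses.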
Salary = a + b * Executive

Regression Statistics
Multiple R           0.874
R Square             0.764
Adjusted R Square    0.760
Standard Error
Observations         52

             Coefficients   Standard Error   t Stat
Intercept
Executive

Result: Highly Significant
Average Salary for non-Executive: $37,514
Average Salary for Executive: $90,601
The effect of a categorical variable is to add a constant amount to the y-intercept for the subset of points included in the category. Graphically, it creates a separate regression line for each category. The slope of the lines is the same, but the y-intercepts differ.
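The parallel-lines picture can be sketched with a small example. The data here are hypothetical and noise-free: x stands for any numeric predictor and d for a 0/1 category indicator, neither of which comes from the salary data set.

```python
import numpy as np

# Hypothetical, noise-free data: x is a numeric predictor and d is a
# 0/1 category indicator. Both are illustrative assumptions.
x = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
d = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
y = 10.0 + 2.0 * x + 5.0 * d  # true model: intercept 10, slope 2, shift 5

# Columns: intercept, numeric predictor, category indicator.
X = np.column_stack([np.ones_like(x), x, d])
(a, slope, shift), *_ = np.linalg.lstsq(X, y, rcond=None)

# Two parallel lines: same slope, intercepts a and a + shift.
print(f"category 0: y = {a:.1f} + {slope:.1f} * x")
print(f"category 1: y = {a + shift:.1f} + {slope:.1f} * x")
```

The fitted coefficient on d is the vertical distance between the two lines; the slope on x is shared by both categories.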
Salary = a + b * Gender

Regression Statistics
Multiple R           0.177
R Square             0.031
Adjusted R Square    0.012
Standard Error
Observations         52

             Coefficients   Standard Error   t Stat
Intercept
Gender

Result: Not Significant
Average Salary for 0-Gender: $52,126
Average Salary for 1-Gender: $43,646
Salary = a + b * Education

Regression Statistics
Multiple R           0.650
R Square             0.422
Adjusted R Square    0.411
Standard Error
Observations         52

             Coefficients   Standard Error   t Stat
Intercept
Education

Result: Significant
Average Salary for 0 Education: $11,656
Additional Value per year: $8,448
The Education variable has multiple values: 0, 2, 4, 6, 8. This variable was used directly in the regression estimation. Although it appeared categorical, it was in fact used as a numerical variable. Implicit in the use of any explanatory variable is that its effect is linearly increasing or decreasing. For the education variable, this would mean that the effect on Salary of having a two-year degree would be exactly ½ of the effect of having a four-year degree. This linearity may be questionable.
Linearity would imply, incorrectly, that dropping out after three years of college would cost only $8,448 in salary compared with finishing one’s Bachelor’s degree. But since holding a degree is really a 0-or-1 outcome, a better approach is to treat each degree level as a separate categorical variable.
If the linearity of a limited-value variable is questionable, then the variable may be better modeled by constructing a series of indicator or dummy variables that each represent exactly one value: Education0, Education2, Education4, Education6, Education8. In this way, the effect of each level can be considered independently. This technique is frequently used with time variables, e.g. months. One should not implicitly assume that the monthly effect in December (12) is 12 times as large as the monthly effect in January (1).
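Constructing the indicator variables is mechanical. The sketch below builds one 0/1 column per level from a hypothetical list of Education codes; the eight sample codes are made up for illustration and are not the source data.

```python
import numpy as np

# Hypothetical Education codes for eight employees (illustrative only).
education = np.array([0, 2, 4, 4, 6, 8, 0, 2])
levels = [0, 2, 4, 6, 8]

# One 0/1 indicator column per level: Education0, Education2, ...
dummies = {f"Education{lvl}": (education == lvl).astype(int) for lvl in levels}

for name, column in dummies.items():
    print(name, column)
```

Each employee has exactly one indicator equal to 1. For that reason, when all levels enter one regression together with an intercept, one indicator is normally left out as the reference category.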
Variable     R2     Y-Intercept   Tstat Yint   Slope   Tstat Slope
Education0   14%
Education2   11%
Education4   0%
Education6   2%
Education8   27%
The previous results reveal some effects but obscure others. They show that having less than a Bachelor’s degree has a significant $30,000 effect on average salary at this company. The lack of statistical significance for the Bachelor’s degree and Master’s degree variables is misleading, however. Used alone, each of these variables lumps all other employees into one group, so the salaries of Ph.D.’s are conflated with the salaries of those with no college, when it is clear that at this company the two groups earn nowhere near the same salary.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.709
R Square             0.503
Adjusted R Square    0.461
Standard Error       17748
Observations         52

             Coefficients   Standard Error   t Stat
Intercept
Education
Education
Education
Education
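A regression with an intercept and four Education indicators can be sketched as follows. The salaries are hypothetical, and treating Education0 as the omitted reference category is an assumption; the summary output does not say which level was left out.

```python
import numpy as np

# Hypothetical salaries and Education codes (illustrative, not the source data).
education = np.array([0, 0, 2, 2, 4, 4, 6, 6, 8, 8])
salary = np.array([20000, 24000, 30000, 34000, 50000,
                   54000, 60000, 64000, 90000, 94000], dtype=float)

# Assumption: level 0 is the omitted reference category.
kept_levels = [2, 4, 6, 8]
X = np.column_stack(
    [np.ones(len(education))]
    + [(education == lvl).astype(float) for lvl in kept_levels]
)
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)

# The intercept is the reference-group mean; each dummy coefficient is that
# level's mean salary minus the reference-group mean.
intercept, effects = coef[0], coef[1:]
print(f"reference (Education0) mean: {intercept:,.0f}")
for lvl, effect in zip(kept_levels, effects):
    print(f"Education{lvl}: {effect:+,.0f} relative to the reference")
```

Because each level gets its own coefficient, no linear spacing between levels is imposed, which is the point of replacing the single numeric Education variable.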