Modeling Possibilities Example 11.3 Possible Gender Discrimination in Salary at Fifth National Bank of Springfield
Objective To use StatPro’s multiple regression procedure to analyze whether the bank discriminates against females in terms of salary.
BANK.XLS The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. The bank’s employee database is listed in this file. Here is a partial list of the data.
Variables For each of the 208 employees, the data set includes the following variables: EducLev: education level, a categorical variable with categories 1 (finished high school), 2 (finished some college courses), 3 (obtained a bachelor’s degree), 4 (took some graduate courses) and 5 (obtained a graduate degree); JobGrade: a categorical variable indicating the current job level, with possible levels from 1 to 6 (6 is highest); YrHired: year employee was hired; YrBorn: year employee was born; Gender: a categorical variable with values “Female” and “Male”
Variables -- continued YrsPrior: number of years of work experience at another bank prior to working at Fifth National; PCJob: a dummy variable with value 1 if the employee’s current job is computer-related and value 0 otherwise; Salary: current annual salary in thousands of dollars. Do the data provide evidence that females are discriminated against in terms of salary?
Naïve Approach A naïve approach to the problem is to compare the average salaries of the males and females. The average of all salaries is $39,922, the average female salary is $37,210, and the average male salary is $45,505. The difference between the two group averages is statistically significant. The females are definitely earning less on average, but perhaps there is a reason. The question is whether the difference between the average salaries is still evident after taking other attributes into account. This is a natural task for regression.
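The naïve comparison amounts to averaging Salary within each Gender group and taking the difference. A minimal sketch follows, using a handful of made-up rows rather than the contents of BANK.XLS; the helper name mean_by_group is hypothetical.

```python
from statistics import mean

# Hypothetical (gender, salary in $1000s) rows; NOT the bank's data.
rows = [
    ("Female", 30.0),
    ("Female", 34.0),
    ("Male", 40.0),
    ("Male", 44.0),
    ("Female", 32.0),
]

def mean_by_group(rows):
    """Return {gender: average salary} for a list of (gender, salary) pairs."""
    groups = {}
    for gender, salary in rows:
        groups.setdefault(gender, []).append(salary)
    return {gender: mean(salaries) for gender, salaries in groups.items()}

averages = mean_by_group(rows)
# Raw gap between the group averages, before controlling for anything else.
gap = averages["Male"] - averages["Female"]
print(averages, gap)
```

The gap computed this way ignores every other attribute, which is exactly the limitation the regression approach is meant to address.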
Dummy Variables Some potential explanatory variables are categorical and cannot be measured on a quantitative scale. However, we often need to use these variables because they are related to the response variable. The trick is to create dummy variables, also called indicator or 0-1 variables. These are variables that indicate the category a given observation is in.
Dummy Variables -- continued To create dummy variables, we can use an IF statement or StatPro’s Dummy variable procedure. The Dummy variable procedure is usually easier, particularly when there are multiple categories. Once the dummy variables are created, we can combine categories if we like by simply adding the corresponding dummy columns to get the dummy for the new, combined category.
Regression Analysis In this example we create dummy variables for Gender and EducLev. Then we can run a regression analysis with Salary as the response variable, using any combination of numerical and dummy explanatory variables. We must follow two rules: (1) We shouldn’t use any of the original categorical variables that the dummies are based on. (2) We should use one fewer dummy than the number of categories for any categorical variable.
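The IF-statement approach and the column-adding trick can be sketched outside the spreadsheet as follows. This is an illustration only: make_dummies is a hypothetical helper, not StatPro's procedure, and the EducLev values are made up.

```python
def make_dummies(values, prefix):
    """Return {column name: list of 0/1}, one column per distinct category.

    Each entry is the equivalent of a spreadsheet IF(value = category, 1, 0).
    """
    categories = sorted(set(values))
    return {
        f"{prefix}_{c}": [1 if v == c else 0 for v in values]
        for c in categories
    }

# Hypothetical education levels for six employees (not the bank's data).
educ_lev = [1, 3, 5, 2, 3, 4]
ed = make_dummies(educ_lev, "Ed")

# Combine categories 3, 4 and 5 into one "bachelor's degree or more" dummy
# by adding the three columns row by row. Because an employee belongs to
# exactly one category, the sum is still 0 or 1.
ed_3plus = [a + b + c for a, b, c in zip(ed["Ed_3"], ed["Ed_4"], ed["Ed_5"])]
print(ed_3plus)
```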
Regression Analysis -- continued This second rule is a technical one. If we violate it, the software will give us an error message. For example, of the five education dummies Ed_1 to Ed_5, any four can be used. The omitted dummy then corresponds to the reference category. As we will see, the interpretations of the dummy variable coefficients are all relative to this reference category. To get used to dummy variables in regression analysis, we will proceed in several stages.
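One way to see why the second rule is needed, assuming the regression includes a constant term (as the equations in this example do): if every category gets its own dummy, the dummies add up to 1 in every row, which duplicates the constant column, so the coefficients cannot be determined uniquely. The sketch below checks this on made-up education levels and shows that, after dropping one dummy, the reference category is simply the set of rows where all remaining dummies are 0.

```python
# Hypothetical education levels (1 to 5) for seven employees.
educ_lev = [1, 2, 3, 4, 5, 3, 1]

# One row per employee, one 0/1 column per category (Ed_1 ... Ed_5).
all_dummies = [[1 if v == k else 0 for k in range(1, 6)] for v in educ_lev]

# With all five dummies, every row sums to 1: the same as the constant column.
row_sums = [sum(row) for row in all_dummies]

# Drop Ed_1 so that level 1 becomes the reference category.
reduced = [row[1:] for row in all_dummies]

# Reference-category employees are those with all remaining dummies equal to 0.
reference_rows = [i for i, row in enumerate(reduced) if sum(row) == 0]
print(row_sums, reference_rows)
```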
Regression Analysis -- continued We first estimate a regression equation with only one explanatory variable, the Female dummy. The output is shown in this table. The resulting equation is Predicted Salary = 45.505 - 8.296Female
Regression Analysis -- continued To interpret this equation, recall that Female has only two possible values, 0 and 1. If we substitute 1, the predicted salary equals 37.209, and if we substitute 0, the predicted salary is 45.505. These are the average salaries of females and males. Therefore the interpretation of the -8.296 coefficient of the Female dummy variable is straightforward: it is the difference between the average female salary and the average male salary.
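The fact that a regression on a single 0-1 variable reproduces the two group averages can be checked directly. The sketch below fits a one-variable least squares line with the textbook formulas on made-up data (the function name fit_simple and the data are illustrative assumptions), then repeats the subtraction using the two predicted salaries quoted in the example.

```python
from statistics import mean

def fit_simple(x, y):
    """Least squares intercept and slope for a single explanatory variable."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical data: Female dummy and salary in $1000s (not the bank's data).
female = [1, 1, 1, 0, 0]
salary = [30.0, 34.0, 32.0, 40.0, 44.0]

intercept, slope = fit_simple(female, salary)
# intercept is the male average; intercept + slope is the female average.
print(intercept, intercept + slope)

# With the example's two predicted salaries (in $1000s), the implied
# Female coefficient is the female value minus the male value.
implied_coefficient = 37.209 - 45.505
print(round(implied_coefficient, 3))
```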
Regression Analysis -- continued The above equation tells only part of the story, because it ignores all information except gender. We expand this equation by adding the experience variables, YrsExper and YrsPrior. The output is shown in this table.
Regression Analysis -- continued The corresponding equation is Predicted Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior - 8.080Female It is useful to write two separate equations, one for females and one for males. Females (Female = 1): Predicted Salary = 27.412 + 0.988YrsExper + 0.131YrsPrior Males (Female = 0): Predicted Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior We interpret the coefficient -8.080 of the Female dummy variable as the average salary disadvantage for females relative to males after controlling for job experience. But there is still more to the story.
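Splitting the equation by gender is just a matter of folding the Female coefficient into the intercept. The sketch below uses the coefficients quoted in this example (taking 0.988 for YrsExper, as in the two separate equations); the function name and the experience values passed to it are illustrative assumptions.

```python
# Coefficients as quoted in the example (salary in $1000s).
COEF = {"Intercept": 35.492, "YrsExper": 0.988, "YrsPrior": 0.131, "Female": -8.080}

def predicted_salary(yrs_exper, yrs_prior, female):
    """Predicted salary for given experience values and Female = 0 or 1."""
    return (
        COEF["Intercept"]
        + COEF["YrsExper"] * yrs_exper
        + COEF["YrsPrior"] * yrs_prior
        + COEF["Female"] * female
    )

# Intercept of the female equation: the male intercept plus the dummy coefficient.
female_intercept = COEF["Intercept"] + COEF["Female"]

# The female-minus-male gap is the same at any (hypothetical) experience level,
# because the two equations differ only in their intercepts.
gap_new_hire = predicted_salary(0, 0, 1) - predicted_salary(0, 0, 0)
gap_veteran = predicted_salary(10, 2, 1) - predicted_salary(10, 2, 0)
print(round(female_intercept, 3), round(gap_new_hire, 3), round(gap_veteran, 3))
```

Because this model has no interaction between Female and the experience variables, the two equations are parallel: the gap does not depend on experience.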
Regression Analysis -- continued We next add education level to the equation by including four of the five education level dummies. Although any four could be used, we use Ed_2 to Ed_5, so that the lowest level becomes the reference category. We would expect this to lead to positive coefficients for these dummies, which are easier to interpret. The resulting output is shown in the table on the next slide.
Regression Analysis -- continued
Regression Analysis -- continued The estimated regression equation is now Predicted Salary = 26.613 + 1.033YrsExper + 0.362YrsPrior - 4.501Female + 0.160Ed_2 + 4.765Ed_3 + 7.320Ed_4 + 11.770Ed_5 There are now two categorical variables involved, gender and education level. However, we can still write a separate equation for any combination of categories by setting the dummies to the appropriate values.
Regression Analysis -- continued For example, the equation for females at the fifth education level is found by setting Female=1 and Ed_5=1 and setting the other education dummies equal to 0. The resulting equation is Predicted Salary = 33.882 + 1.033YrsExper + 0.362YrsPrior We interpret this equation as follows: For either gender and any education level, the expected increase in salary for one extra year of experience with Fifth National is $1033; the expected increase in salary for one extra year of prior experience with another bank is $362.
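The intercept for any gender and education combination can be obtained mechanically by adding the relevant dummy coefficients to the base intercept. A short sketch using the coefficients quoted in this example follows; the function name group_intercept is hypothetical.

```python
# Coefficients as quoted in the example (salary in $1000s).
COEF = {
    "Intercept": 26.613, "YrsExper": 1.033, "YrsPrior": 0.362, "Female": -4.501,
    "Ed_2": 0.160, "Ed_3": 4.765, "Ed_4": 7.320, "Ed_5": 11.770,
}

def group_intercept(female, educ_lev):
    """Intercept of the equation for one gender/education combination.

    Education level 1 is the reference category, so it adds nothing.
    """
    value = COEF["Intercept"] + COEF["Female"] * female
    if educ_lev != 1:
        value += COEF[f"Ed_{educ_lev}"]
    return value

# Females at the fifth education level: 26.613 - 4.501 + 11.770.
print(round(group_intercept(1, 5), 3))

# Intercepts for all ten combinations; the slopes on YrsExper and YrsPrior
# are the same in every one of these equations.
for female in (0, 1):
    for level in range(1, 6):
        print(female, level, round(group_intercept(female, level), 3))
```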
Regression Analysis -- continued The coefficients of the education dummies indicate the average increase in salary an employee can expect relative to the reference (lowest) education level. The key coefficient, the negative $4501 for females, indicates the average salary disadvantage for females relative to males, given that they have the same experience levels and the same education levels. One further explanation for gender differences in salary might be job grade. Perhaps females tend to be in lower job grades, which would help explain why they get lower salaries on average.
Regression Analysis -- continued One way to check this is with a pivot table, as shown below, where we put job grade in the row area, gender in the column area, and request counts, displayed as percentages of columns. Clearly, females tend to be concentrated at the lower job grades.
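The pivot table described here (counts by job grade and gender, shown as percentages of each gender's column total) can be sketched as follows. The rows are made up for illustration and are not the bank's data; column_percentages is a hypothetical helper.

```python
from collections import Counter

# Hypothetical (job grade, gender) pairs; NOT the bank's data.
rows = [
    (1, "Female"), (1, "Female"), (2, "Female"), (3, "Female"),
    (1, "Male"), (3, "Male"), (5, "Male"), (6, "Male"),
]

def column_percentages(rows):
    """Return {(grade, gender): percent of that gender's total in that grade}."""
    counts = Counter(rows)                              # cell counts
    totals = Counter(gender for _, gender in rows)      # column totals
    return {
        (grade, gender): 100.0 * n / totals[gender]
        for (grade, gender), n in counts.items()
    }

pct = column_percentages(rows)
# Print the table sorted by gender, then by job grade.
for (grade, gender), value in sorted(pct.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(gender, grade, value)
```

Using column percentages rather than raw counts makes the two genders comparable even when the groups differ in size.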
Regression Analysis -- continued This certainly helps to explain why females get lower salaries on average. We can go one step further to see the effect of job grade on salary by including the dummies for job grade in the equation, along with the other variables we have included so far. As with the education dummies, we use the lowest job grade as the reference category and include only the five dummies for the other categories.
Regression Analysis -- continued While we’re at it, we add the other two potential explanatory variables to the equation: Age, coded as 95 minus YrBorn, and HasPCJob, a dummy based on the PCJob categorical variable. The regression output is shown on the next slide. As expected, the coefficients of the job grade dummies are all positive, and they increase as the job grade increases: it pays to be in the higher job grades.
Regression Analysis -- continued
Regression Analysis -- continued The effect of age appears to be minimal, and there appears to be a “bonus” of close to $5000 for having a PC-related job. The R² value has now increased to 76.5%, and the penalty for being female has decreased to $2555, which is still large but not as large as before. However, even if this penalty, the coefficient of Female in this last equation, is considered “small,” is it convincing evidence against the argument for gender discrimination?
Regression Analysis -- continued We believe the answer is “no.” We have used variations in job grades to reduce the penalty for being female. But the remaining question is then: Why are females predominantly in the low job grades? Perhaps this is the real source of gender discrimination. Perhaps management is not advancing the females as quickly as it should, which naturally results in lower salaries for females.