
March 14, 2025

Using Boosted Regression to Quantify Wage Gaps Across Genders: A Primer

Author(s): Brian Kriegler

While this blog focuses on gender wage disparities between men and women, the methods described herein could be extended to non-binary, transgender, and other gender-diverse individuals.

Introduction

Boosted regression, also known as boosting or generalized boosted models, is a statistical data mining tool that has proven highly effective in modeling an outcome variable as a function of a set of predictor variables. This non-parametric, data-adaptive technique allows the practitioner to uncover both linear and nonlinear relationships within data.

Furthermore, a series of boosted regression model diagnostics aids in quantifying (i) the importance of a given predictor variable, (ii) the shape of the relationship between the outcome variable and each predictor variable (e.g., linear, stepwise, or piecewise), and (iii) the extent to which the predictor variables interact with one another.

In this blog post, we discuss the application of boosted regression as a means for evaluating wage gaps across genders. Actual data from an anonymized case study are used to demonstrate how to interpret boosted regression output.

Boosted regression modeling entails an iterative process in which the model grows little by little. Boosted regression models can be run using statistical software such as R or Stata. Textbooks covering boosted regression include, but are not limited to, “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman (2001) and “Statistical Learning from a Regression Perspective” by Richard A. Berk (2008).

The steps described below allow the data to identify the relationship of each predictor variable with the outcome variable, capture potential interactions, and reveal which predictor variables are most important. Here’s how it works:

  1. Start with a simple guess

    • The model makes an initial prediction, like a rough estimate.
    • This first guess is often very basic and not very accurate.
  2. Calculate the initial differences between the actual values and each prediction

    • These initial differences are commonly referred to as “initial errors” or “initial residuals.”
  3. Train a small model to fix the initial residuals

    • A new small model (usually a decision tree) is trained to focus on the errors from the first guess.
    • This small model is evaluated to see how much it reduces the residuals.
  4. Repeat the process

    • Another small model is added, again focusing on the remaining residuals.
    • With each new step, the model updates the predictions.
    • A learning rate (also known as the shrinkage rate) is applied at each step to control how much influence each new small model has on the final prediction. Practitioners typically set the learning rate to be between 0 and 10 percent.
  5. Combine all the small models

    • The final prediction is made by combining multiple small models, each of which provides an update (i.e., boost) to the predictions from the previous step.
    • Each one contributes only a little, but together they create a strong, accurate model.
    • The learning rate prevents individual models from having too much impact, ensuring gradual improvements and reducing the risk of overfitting.
  6. Repeat the process a large number of times

    • Practitioners typically will set the number of decision trees to be between 1,000 and 5,000.
    • Because the number of iterations ultimately used is selected afterward (Steps 7 and 8), the main cost of adding decision trees is longer computational run time.
  7. Measure the cumulative error after each iteration

    • One common technique is “cross-validation.”
    • The cumulative error is computed (i) across all observations in the original dataset and (ii) across various slices of the original dataset.
  8. Identify a sensible number of iterations

    • This is the number of iterations yielding the lowest cumulative error in Step 7.
    • Initially, the cumulative error will trend downward while the model is still growing and improving.
    • Eventually, the cumulative error will change direction and trend upward. A boosted model with “too many” iterations is overly specific to the original data.
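The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the case study: it assumes a single predictor, uses one-split decision trees ("stumps") as the small models, runs on synthetic data, and measures error on a simple hold-out slice instead of full cross-validation. The function names (`fit_stump`, `boost`) and the settings (150 trees, a 10 percent learning rate) are hypothetical choices made to keep the example short and fast.

```python
import random

def fit_stump(xs, residuals):
    """Fit a one-split decision tree (a "stump") to the current residuals."""
    overall_mean = sum(residuals) / len(residuals)
    best = (float("inf"), None, overall_mean, overall_mean)
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    if threshold is None:  # no valid split: fall back to a constant
        return lambda x: overall_mean
    return lambda x: left_mean if x <= threshold else right_mean

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def boost(train_x, train_y, valid_x, valid_y, n_trees, learning_rate):
    start = sum(train_y) / len(train_y)  # Step 1: simple initial guess
    train_pred = [start] * len(train_x)
    valid_pred = [start] * len(valid_x)
    valid_errors = []
    for _ in range(n_trees):  # Steps 4 and 6: repeat many times
        # Step 2: residuals between actual values and current predictions
        residuals = [y - p for y, p in zip(train_y, train_pred)]
        # Step 3: train a small model on those residuals
        stump = fit_stump(train_x, residuals)
        # Step 5: add a shrunken copy of the small model to the running prediction
        train_pred = [p + learning_rate * stump(x) for p, x in zip(train_pred, train_x)]
        valid_pred = [p + learning_rate * stump(x) for p, x in zip(valid_pred, valid_x)]
        # Step 7 (simplified): error on held-out data after this iteration
        valid_errors.append(mean_squared_error(valid_y, valid_pred))
    # Step 8: iteration count with the lowest held-out error
    best_n = valid_errors.index(min(valid_errors)) + 1
    return valid_errors, best_n

# Synthetic data, for illustration only
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(120)]
ys = [50 + 10 * x + random.gauss(0, 5) for x in xs]
valid_errors, best_n = boost(xs[:80], ys[:80], xs[80:], ys[80:],
                             n_trees=150, learning_rate=0.1)
print(best_n, round(valid_errors[best_n - 1], 1))
```

In practice, a dedicated package handles multiple predictors, deeper trees, and cross-validation; the sketch is only meant to show how the residuals, the learning rate, and the choice of iteration count fit together.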

How Boosted Regression Can be Used in Gender Wage Gap Analysis

In an analysis of employees’ earnings, boosted regression can be used to model wages as a function of job attributes along with gender.

A boosted regression model can be informative in a number of respects. For example:

  • It can model the annual earnings among executives at a firm as a function of the available predictor variables in the data
  • It can quantify the difference in earnings across genders
  • It can compare earnings across subdivisions of the data, e.g., by geographic region and gender, year and gender, etc.

An Example Involving Executive Pay

Consider a dataset that includes the following pieces of information about executives at a company that has offices scattered across the country:

    • Annual earnings
    • Calendar year
    • Location
    • Productivity
    • Gender

In the case study below, boosted regression reveals a substantial gender wage gap between men and women among executives after accounting for differences across productivity, geography, and annual adjustments.1

How Well Did Boosted Regression Fit the Data?

Once the boosted regression model is constructed, one analytical task is to assess how well the model fits the data. This entails (i) calculating each predicted (i.e., estimated) outcome in the dataset, and (ii) comparing the predicted outcomes to the corresponding actual outcomes. The graph below shows that predicted earnings track actual earnings among executives at this company.2
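One simple way to summarize this comparison is the R-squared from a simple linear regression of predicted on actual outcomes, which equals the squared correlation between the two. The sketch below is illustrative only: the function name `r_squared` is hypothetical, and the numbers are made up rather than taken from the case study.

```python
def r_squared(actual, predicted):
    """R-squared from a simple linear regression of predicted on actual.

    With a single regressor, this equals the squared Pearson correlation.
    """
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_a * var_p)

# Hypothetical values, for illustration only
actual = [100, 150, 200, 250, 300]
predicted = [120, 140, 210, 230, 310]
fit = r_squared(actual, predicted)
print(round(fit, 3))
```

A value near 1 means the predictions line up closely with the actual outcomes; a value near 0 means they barely move together.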

Earnings as a Function of Gender and Productivity

The boosted regression model diagnostics reveal that earnings increase as productivity improves. The graph below suggests that for a given level of productivity, the average wage gap between men and women in this example ranges from $23,000 to $38,000.

Earnings Across Genders at Each Office Location

The next graph shows the average difference in earnings across genders at each of the six office locations, holding productivity and calendar year constant. On average, the wage gap between men and women in this example is between $20,000 and $42,000.
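One common way to produce this kind of comparison is a partial-dependence style calculation: set gender to the same value for every record, leave the other predictors as observed, average the model's predictions, and repeat for the other gender. The sketch below assumes that approach; it is not necessarily the exact procedure behind the graphs. The `toy_predict` function is a hypothetical stand-in for a fitted boosted model, and the records and units are invented.

```python
def average_prediction(predict, rows, gender):
    """Average prediction with every row's gender set to `gender`,
    leaving the other predictors as observed."""
    return sum(predict({**row, "gender": gender}) for row in rows) / len(rows)

def toy_predict(row):
    # Hypothetical stand-in for a fitted model (arbitrary units)
    return 10 + 2 * row["productivity"] + (5 if row["gender"] == "M" else 0)

# Invented records, for illustration only
rows = [
    {"productivity": 1, "gender": "F"},
    {"productivity": 3, "gender": "M"},
    {"productivity": 5, "gender": "F"},
]

gap = (average_prediction(toy_predict, rows, "M")
       - average_prediction(toy_predict, rows, "F"))
print(gap)
```

To get a gap for each office location or year, the same calculation can be run on the subset of records for that location or year.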

Earnings Across Genders Year Over Year

The graphs by gender and year reveal that earnings increased from 2018 to 2022, followed by slightly lower earnings in 2023 and 2024. On average, the wage gap between men and women in this example is between $25,000 and $30,000 year over year.

How Influential is Each Predictor Variable?

Next, we examine the relative influence of each predictor variable in the boosted model. For a given number of iterations, the importance of a given predictor variable is measured based on how much the inclusion of that variable improves the boosted model’s performance. This is expressed as a percentage, where the total importance across all variables adds up to 100.
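The normalization step can be sketched as follows. This assumes that each variable's raw contribution (the total error reduction credited to splits on that variable) has already been tallied. The function name, variable names, and numbers are hypothetical and are not the case study's values.

```python
def relative_influence(error_reduction):
    """Rescale raw error reductions so they sum to 100."""
    total = sum(error_reduction.values())
    return {name: 100 * value / total for name, value in error_reduction.items()}

# Hypothetical raw error reductions per predictor, for illustration only
raw = {"x1": 6.0, "x2": 3.0, "x3": 1.0}
influence = relative_influence(raw)
print(influence)
```

Because the values are rescaled to sum to 100, each one is read relative to the other predictors in the same model rather than as an absolute measure of effect size.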

In this case, productivity is the most influential variable, accounting for 80% of the total improvement in model fit. The second most influential variable, geographic location, contributes approximately 15%, followed by calendar year at 4%. Together, these three variables explain over 99% of the total influence.

Although gender accounts for less than one percent of the model’s error reduction, the previously discussed graphs suggest that the wage gap between men and women amounts to tens of thousands of dollars. How much of the observed difference in earnings across genders is due to chance? Is this wage gap statistically significant? In a future blog post, we will explore a methodology for answering these questions and revisit our case study.

Conclusion

Boosted regression offers a data-adaptive tool for analyzing an outcome variable as a function of a given set of predictor variables. This algorithmic technique can be applied to gender wage gap analyses, providing detailed insights into the factors that drive wage disparities. By modeling wages as a function of various job attributes along with gender, we can uncover complex relationships and quantify the impact of different predictors.

Frequently Asked Questions

What is boosted regression?

Boosted regression is an iterative process that enhances a model by correcting errors through a series of smaller models. This approach has proven to be effective at providing a representative depiction of the data.

How can boosted regression help in analyzing the gender wage gap?

Boosted regression can be used to model wages as a function of job attributes along with gender. This approach helps quantify the relationship between wages and gender, as well as the interaction between job attributes and gender.

How does boosted regression quantify the importance of different predictors?

Boosted regression quantifies the relative importance of each predictor based on the percent reduction in error. A predictor with a relatively high percent reduction in error is considered to have a greater impact on the accuracy of the model.

Can boosted regression be applied to other types of wage gap analyses?

Boosted regression is indeed versatile and can be effectively used to analyze wage disparities across various demographics and job attributes. For example, the method could be used to compare wages across races and/or age brackets.

References

1 In this instance, the boosted regression model was constructed using the “gbm” library in R. The total number of iterations was set to 2,000, and a learning rate of 1 percent was applied. Subsequently, the cross-validation technique described in Steps 7 and 8 suggested that the cumulative error was at a minimum after 790 iterations.

2 The R-squared value from a simple linear regression of predicted earnings (generated using boosted regression) against actual earnings is approximately 80%.
