Brian Kriegler

Managing Director

Education

Doctorate in Statistics, University of California, Los Angeles

Master's in Statistics, University of California, Los Angeles

Bachelor's in Mathematics/Economics, Claremont McKenna College

Work Experience

Econ One, August 2008 - Present

University of Pennsylvania, 2007 - 2008

University of California, Los Angeles, 2007 - 2008

Independent Statistical Consultant, 2004 - 2008

RAND Statistics Group, 2006

Lockheed Martin Missiles and Space, 2001 - 2003

Testimony Experience

U.S. District Court

State Court

Arbitration

Private Mediation

Services

Class Certification

Damages

Labor and Employment

Wage and Hour

Damages Analysis

Class Certification

March 14, 2025

Using Boosted Regression to Quantify Wage Gaps Across Genders: A Primer

Author(s): Brian Kriegler

While this blog focuses on gender wage disparities between men and women, the methods described herein could be extended to non-binary, transgender, and other gender-diverse individuals.

Introduction

Boosted regression, also known as boosting or generalized boosted models, is a statistical data mining tool that has proven highly effective in modeling an outcome variable as a function of a set of predictor variables. This non-parametric, data-adaptive technique allows the practitioner to uncover both linear and nonlinear relationships within data.

Furthermore, a series of boosted regression model diagnostics aid in quantifying (i) the importance of a given predictor variable, (ii) the relationship between the outcome variable and each predictor variable (e.g., linear, stepwise, piecewise, etc.), and (iii) the extent to which the predictor variables interact with one another.

In this blog post, we discuss the application of boosted regression as a means for evaluating wage gaps across genders. Actual data from an anonymized case study are used to demonstrate how to interpret boosted regression output.

Boosted regression modeling entails an iterative process in which the model grows little by little. Such models can be fit using statistical software such as R or Stata. Textbooks covering boosted regression include, but are not limited to, “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman (2001) and “Statistical Learning from a Regression Perspective” by Richard A. Berk (2008).

The steps described below allow the data to identify the relationship of each predictor variable with the outcome variable, capture potential interactions, and reveal which predictor variables are most important. Here’s how it works:

Start with a simple guess

- The model makes an initial prediction, like a rough estimate.
- This first guess is often quite basic and not very accurate.

Calculate the initial differences between the actual values and each prediction

- These initial differences are commonly referred to as “initial errors” or “initial residuals.”
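The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not the case study's code: the earnings figures are invented, and using the overall mean as the initial guess is an assumption (a common default for squared-error boosting).

```python
# Hypothetical annual earnings in thousands of dollars (not from the case study).
actual = [120.0, 150.0, 90.0, 200.0, 140.0]

# Step 1: a simple initial guess. Assumption: use the overall mean.
initial_guess = sum(actual) / len(actual)

# Step 2: initial residuals = actual value minus the initial prediction.
initial_residuals = [y - initial_guess for y in actual]

print(initial_guess)      # 140.0
print(initial_residuals)  # [-20.0, 10.0, -50.0, 60.0, 0.0]
```

Every later step works on residuals like these rather than on the raw earnings.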

Train a small model to fix the initial residuals

- A new small model (usually a decision tree) is trained to focus on the errors from the first guess.
- This small model is evaluated to see if it helps correct additional residuals.

Repeat the process

- Another small model is added, again focusing on the remaining residuals.
- With each new step, the model updates the predictions.
- A learning rate (also known as the shrinkage rate) is applied at each step to control how much influence each new small model has on the final prediction. Practitioners typically set the learning rate to be between 0 and 10 percent.

Combine all the small models

- The final prediction is made by combining multiple small models, each of which provides an update (i.e., boost) after the previous step.
- Each one contributes only a little, but together they create a strong, accurate model.
- The learning rate prevents individual models from having too much impact, ensuring gradual improvements and reducing the risk of overfitting.
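The loop described so far can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used by any particular package: each "small model" is a one-split decision tree (a stump) on a single predictor, the loss is squared error, and the data, the function names (`fit_stump`, `predict_stump`, `boost`), and the parameter values are all hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # every split leaves both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=200, learning_rate=0.1):
    """Fit stumps to the current residuals and add each one, shrunk by the learning rate."""
    initial_guess = sum(y) / len(y)
    predictions = [initial_guess] * len(y)
    stumps, training_mse = [], []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
        training_mse.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))
    return initial_guess, stumps, training_mse


# Hypothetical data: a productivity score and earnings in thousands of dollars.
productivity = [1, 2, 3, 4, 5, 6, 7, 8]
earnings = [100, 105, 118, 130, 150, 155, 170, 190]

initial_guess, stumps, mse = boost(productivity, earnings)
print(round(mse[0], 1), round(mse[-1], 1))  # training error after the first and last tree
```

In this sketch the training error never increases from one iteration to the next, which is why training error alone cannot indicate when to stop.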

Repeat the process a large number of times

- Practitioners typically will set the number of decision trees to be between 1,000 and 5,000.
- Because a sensible number of iterations is identified afterward (see the final step), the main cost of adding decision trees is more computational run time.

Measure the cumulative error after each iteration

- One common technique is “cross-validation.”
- The cumulative error is computed (i) across all observations in the original dataset and (ii) across various slices of the original dataset.

Identify a sensible number of iterations

- This is the number of iterations yielding the lowest cumulative error measured in the previous step (Step 7).
- Initially, the cumulative error will trend downward; during this phase the model is still growing and improving.
- Eventually, the cumulative error will change direction and trend upward. A boosted model with “too many” iterations is overly specific to the original data (i.e., it overfits).
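The last two steps amount to averaging an error curve across slices of the data and taking its minimum. A minimal sketch, assuming the per-fold error curves have already been computed; the numbers below are invented purely to show the downward-then-upward shape and are far shorter than a realistic run of thousands of iterations.

```python
# Hypothetical cross-validated errors: one list per fold (slice of the data),
# one entry per boosting iteration. All values are made up for illustration.
fold_errors = [
    [9.0, 6.0, 4.5, 4.0, 4.2, 4.6],
    [8.5, 6.2, 4.8, 4.1, 4.0, 4.4],
    [9.2, 5.8, 4.4, 3.9, 4.3, 4.9],
]

n_iterations = len(fold_errors[0])

# Step 7: average the error across folds at each iteration.
cv_error = [
    sum(fold[i] for fold in fold_errors) / len(fold_errors)
    for i in range(n_iterations)
]

# Step 8: pick the iteration with the lowest averaged error (1-based count).
best_iteration = min(range(n_iterations), key=lambda i: cv_error[i]) + 1

print(best_iteration)  # 4
```

Iterations beyond the selected one are simply not used when making predictions.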

How Boosted Regression Can Be Used in Gender Wage Gap Analysis

In an analysis of employees’ earnings, boosted regression can be used to model wages as a function of job attributes along with gender.

A boosted regression model can be informative in a number of respects. For example:

- It can model the annual earnings among executives at a firm as a function of the predictor variables available in the data.
- It can quantify the difference in earnings across genders.
- It can compare earnings across subdivisions of the data, e.g., by geographic region and gender, year and gender, etc.

An Example Involving Executive Pay

Consider a dataset that includes the following pieces of information about executives at a company that has offices scattered across the country:

- Annual earnings
- Calendar year
- Location
- Productivity
- Gender

In the case study below, boosted regression reveals a substantial gender wage gap between men and women among executives after accounting for differences across productivity, geography, and annual adjustments.¹

How Well Did Boosted Regression Fit the Data?

Once the boosted regression model is constructed, one analytical task is to assess how well the model fits the data. This entails (i) calculating each predicted (i.e., estimated) outcome in the dataset, and (ii) comparing the predicted outcomes to the corresponding actual outcomes. The graph below shows that predicted earnings track actual earnings among executives at this company.²
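One way to summarize this comparison numerically is the R-squared from a simple linear regression of predicted on actual values, which equals the squared correlation between the two. The sketch below uses invented numbers (not the case study's data), and the function name `r_squared` is hypothetical.

```python
def r_squared(predicted, actual):
    """R-squared of a simple linear regression of predicted on actual
    (equivalently, the squared Pearson correlation)."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov ** 2 / (var_p * var_a)


# Hypothetical earnings in thousands of dollars.
actual = [100.0, 120.0, 140.0, 160.0, 180.0]
predicted = [110.0, 115.0, 145.0, 150.0, 180.0]

r2 = r_squared(predicted, actual)
print(round(r2, 3))
```

A value near 1 means the predictions line up closely with the actual outcomes; a value near 0 means they barely track them.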

Earnings as a Function of Gender and Productivity

The boosted regression model diagnostics reveal that earnings increase as productivity improves. The graph below suggests that for a given level of productivity, the average wage gap between men and women in this example ranges from $23,000 to $38,000.

Earnings Across Genders at Each Office Location

The next graph shows the average difference in earnings across genders at each of the six office locations, holding productivity and calendar year constant. On average, the wage gap between men and women in this example is between $20,000 and $42,000.
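A common way to compare genders while holding the other predictors constant is to predict every record twice, once with gender set to each value, and average the difference. The source does not say exactly how its graphs were computed, so the sketch below is an assumption. `predict_earnings` is a hypothetical stand-in for a fitted model's prediction function, and its coefficients and the records are invented.

```python
def predict_earnings(productivity, location, gender):
    """Hypothetical stand-in for a fitted model's prediction (invented coefficients)."""
    location_premium = {"A": 0.0, "B": 10_000.0}[location]
    gender_premium = 25_000.0 + 100.0 * productivity if gender == "M" else 0.0
    return 100_000.0 + 1_000.0 * productivity + location_premium + gender_premium


# Hypothetical executives: (productivity score, office location, recorded gender).
executives = [(50, "A", "M"), (70, "B", "F"), (90, "A", "F"), (60, "B", "M")]


def average_prediction(rows, gender):
    """Average prediction with every row's gender set to `gender`; other attributes unchanged."""
    preds = [predict_earnings(p, loc, gender) for p, loc, _ in rows]
    return sum(preds) / len(preds)


def gap_for(rows):
    """Average predicted difference between men and women for the given rows."""
    return average_prediction(rows, "M") - average_prediction(rows, "F")


overall_gap = gap_for(executives)
gap_by_location = {
    loc: gap_for([r for r in executives if r[1] == loc])
    for loc in sorted({r[1] for r in executives})
}

print(overall_gap)      # 31750.0
print(gap_by_location)  # {'A': 32000.0, 'B': 31500.0}
```

Because both sets of predictions use the same productivity and location values, the difference isolates what the model attributes to gender.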

Earnings Across Genders Year Over Year

The graphs by gender and year reveal that earnings increased from 2018 to 2022, followed by slightly lower earnings in 2023 and 2024. On average, the wage gap between men and women in this example is between $25,000 and $30,000 year over year.

How Influential Is Each Predictor Variable?

Next, we examine the relative influence of each predictor variable in the boosted model. For a given number of iterations, the importance of a given predictor variable is measured based on how much the inclusion of that variable improves the boosted model’s performance. This is expressed as a percentage, where the total importance across all variables adds up to 100.
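The bookkeeping behind such percentages can be sketched as follows: each split in each small tree is credited with the error reduction it achieves, the credits are summed by variable, and the totals are rescaled to add up to 100. The split records below are invented for illustration and are unrelated to the case study's results.

```python
# Hypothetical record of splits across all trees: (variable used, error reduction achieved).
splits = [
    ("productivity", 90.0),
    ("location", 30.0),
    ("productivity", 30.0),
    ("year", 20.0),
    ("location", 20.0),
    ("gender", 10.0),
]

# Sum the error reduction credited to each variable.
error_reduction = {}
for variable, improvement in splits:
    error_reduction[variable] = error_reduction.get(variable, 0.0) + improvement

# Rescale so that the influences add up to 100.
total = sum(error_reduction.values())
relative_influence = {
    variable: 100.0 * value / total for variable, value in error_reduction.items()
}

# Report from most to least influential.
for variable, pct in sorted(relative_influence.items(), key=lambda kv: -kv[1]):
    print(f"{variable}: {pct:.1f}%")
```

Note that a low share of error reduction does not by itself imply a small dollar effect; influence measures contribution to fit, not the size of a gap.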

In this case, productivity is the most influential variable, accounting for 80% of the total improvement in model fit. The second most influential variable, geographic location, contributes approximately 15%, followed by calendar year at 4%. Together, these three variables account for over 99% of the total influence.

Although gender accounts for less than one percent of the model’s error reduction, the previously discussed graphs suggest that the wage gap between men and women amounts to tens of thousands of dollars. How much of the observed difference in earnings across genders is due to chance? Is this wage gap statistically significant? In a future blog post, we will explore a methodology for answering these questions and revisit our case study.

Conclusion

Boosted regression offers a data-adaptive tool for analyzing an outcome variable as a function of a given set of predictor variables. This algorithmic technique can be applied to gender wage gap analyses, providing detailed insights into the factors that drive wage disparities. By modeling wages as a function of various job attributes along with gender, we can uncover complex relationships and quantify the impact of different predictors.

Frequently Asked Questions

What is boosted regression?

Boosted regression is an iterative process that enhances a model by correcting errors through a series of smaller models. This approach has proven to be effective at providing a representative depiction of the data.

How can boosted regression help in analyzing the gender wage gap?

Boosted regression can be used to model wages as a function of job attributes along with gender. This approach helps quantify the relationship between wages and gender, as well as the interaction between job attributes and gender.

How does boosted regression quantify the importance of different predictors?

Boosted regression quantifies the relative importance of each predictor based on the percent reduction in error. A predictor with a relatively high percent reduction in error is considered to have a greater impact on the accuracy of the model.

Can boosted regression be applied to other types of wage gap analyses?

Boosted regression is indeed versatile and can be effectively used to analyze wage disparities across various demographics and job attributes. For example, the method could be used to compare wages across races and/or age brackets.

References

¹ In this instance, the boosted regression model was constructed using the “gbm” library in R. The total number of iterations was set to 2,000, and a learning rate of 1 percent was applied. Subsequently, the cross-validation technique described in Steps 7 and 8 (measuring the cumulative error and identifying a sensible number of iterations) suggested that the cumulative error was at a minimum after 790 iterations.

² The R-squared value from a simple linear regression of predicted earnings (generated using boosted regression) against actual earnings is approximately 80%.
