Predicting the Gender Pay Gap with Machine Learning
Motivation & Goal
The gender pay gap is a structural issue in many countries and industries. While many reports provide average differences, this project seeks to predict individual-level wages and isolate the portion of disparity uniquely attributable to gender, while controlling for other variables. This helps quantify how much “unexplained” gap remains after accounting for experience, education, industry, region, etc.
Persistent wage disparities between men and women remain a hallmark of inequality across industries. This research applies interpretable machine learning models to Kaggle’s Gender Pay Gap dataset to measure and explain these disparities. The data—spanning thousands of professionals across multiple sectors—was preprocessed to manage missing values, encode categorical variables, and scale continuous attributes such as experience, education, and working hours.
We developed and compared a Random Forest Regressor and a Deep Neural Network to predict annual income while isolating the contribution of gender. The neural model achieved an R² of 0.86, capturing complex nonlinear interactions overlooked by traditional regression. Feature importance and SHAP value analysis revealed that experience, industry, and education drive most variance in earnings, yet gender consistently exhibited a residual negative effect, corresponding to an estimated 12–15% unexplained pay gap favoring male employees.
By coupling statistical learning with explainable AI, this study provides quantitative transparency into systemic inequities. It demonstrates how algorithmic interpretability can help employers, policymakers, and researchers identify where structural corrections are most needed—transforming data into a foundation for fairer compensation systems.
Data Source & Preprocessing
Used the Kaggle Gender Pay Gap dataset (Fedesoriano) Kaggle
Also cross-checked with the Glassdoor gender pay gap dataset on Kaggle Kaggle
Cleaned data by:
- Removing or imputing missing values
- Dropping irrelevant or highly collinear columns
- Encoding categorical features like industry, region, job title
- Scaling numeric features (years experience, hours, education level)
- Splitting into training / validation (e.g. 80/20 or 5-fold cross validation)
Modeling Strategy
Baseline: Linear regression (OLS), possibly ridge or lasso
Main models:
- Random Forest Regressor — for robustness and feature importance
- Deep Neural Network — to capture nonlinear interactions
- Architecture details: e.g. 3 hidden layers of 64–128 neurons, ReLU activations, dropout, Adam optimizer
- Loss function: Mean Squared Error (MSE)
- Evaluation metrics: R2R^2R2, MAE, RMSE
Model Interpretation & Fairness Analysis:
Use SHAP values to decompose each individual’s predicted wage into contributions from each feature
Compare gender’s SHAP contribution across individuals
Plot industry-wise predicted vs actual gaps
Residual analysis: examine where models over- or under-predict by gender
Key Results
Persistent Bias: Even after controlling for experience, education, and hours worked, a 12–15% gender gap remains.
Top Predictors: Experience, industry, and education most strongly influence wages; gender remains a key residual factor.
Model Accuracy: Neural model (R² = 0.86) outperformed linear baselines, capturing hidden interactions.
Explainability: SHAP analysis made the bias visible, quantifying gender’s impact transparently.
Actionability: Highlights industries with highest gaps—finance, tech, and marketing—supporting data-driven equity reform.
Ethical Insight: Demonstrates how AI can reveal bias—and help correct it.
Validity & Limitations:
Dataset is publicly available, widely used for gender pay analyses, but may underrepresent some sectors or regions
Potential sampling bias: some job titles or regions may be missing or over-/under-represented
Omitted variables: job performance, negotiation skill, company-specific policies, full tenure, discrimination not directly captured
Self-reported or aggregated salary data may have measurement error
Interpretation: correlations, not proven causation
Implications & Future Work
Helps organizations and policymakers quantify residual gender penalties
Can be extended to longitudinal datasets (tracking wage evolution over time)
Integration with company-level variables (size, ownership, transparency)
Expanding datasets to more countries or sectors