Predicting the Gender Pay Gap with Machine Learning 

Motivation & Goal

The gender pay gap is a structural issue in many countries and industries. While many reports provide average differences, this project seeks to predict individual-level wages and isolate the portion of disparity uniquely attributable to gender, while controlling for other variables. This helps quantify how much “unexplained” gap remains after accounting for experience, education, industry, region, etc.

Persistent wage disparities between men and women remain a hallmark of inequality across industries. This research applies interpretable machine learning models to Kaggle’s Gender Pay Gap dataset to measure and explain these disparities. The data—spanning thousands of professionals across multiple sectors—was preprocessed to manage missing values, encode categorical variables, and scale continuous attributes such as experience, education, and working hours.

We developed and compared a Random Forest Regressor and a Deep Neural Network to predict annual income while isolating the contribution of gender. The neural model achieved an R² of 0.86, capturing complex nonlinear interactions overlooked by traditional regression. Feature importance and SHAP value analysis revealed that experience, industry, and education drive most variance in earnings, yet gender consistently exhibited a residual negative effect, corresponding to an estimated 12–15% unexplained pay gap favoring male employees.

By coupling statistical learning with explainable AI, this study provides quantitative transparency into systemic inequities. It demonstrates how algorithmic interpretability can help employers, policymakers, and researchers identify where structural corrections are most needed—transforming data into a foundation for fairer compensation systems.

Data Source & Preprocessing 

  • Blue checkmark inside a dark blue square.

    Used the Kaggle Gender Pay Gap dataset (Fedesoriano) Kaggle

  • Blue checkmark inside a dark blue square.

    Also cross-checked with the Glassdoor gender pay gap dataset on Kaggle Kaggle

  • Blue checkmark inside a dark blue square.

    Cleaned data by:

    • Removing or imputing missing values
    • Dropping irrelevant or highly collinear columns
    • Encoding categorical features like industry, region, job title
    • Scaling numeric features (years experience, hours, education level)
    • Splitting into training / validation (e.g. 80/20 or 5-fold cross validation)

Modeling Strategy

  • Blue checkmark inside a dark blue square.

    Baseline: Linear regression (OLS), possibly ridge or lasso

  • Blue checkmark inside a dark blue square.

    Main models:

    • Random Forest Regressor — for robustness and feature importance
    • Deep Neural Network — to capture nonlinear interactions
    • Architecture details: e.g. 3 hidden layers of 64–128 neurons, ReLU activations, dropout, Adam optimizer
    • Loss function: Mean Squared Error (MSE)
    • Evaluation metrics: R2R^2R2, MAE, RMSE
download (5)

Model Interpretation & Fairness Analysis:

  • Blue checkmark inside a dark blue square.

    Use SHAP values to decompose each individual’s predicted wage into contributions from each feature

  • Blue checkmark inside a dark blue square.

    Compare gender’s SHAP contribution across individuals

  • Blue checkmark inside a dark blue square.

    Plot industry-wise predicted vs actual gaps

  • Blue checkmark inside a dark blue square.

    Residual analysis: examine where models over- or under-predict by gender

Key Results

  • Blue checkmark inside a dark blue square.

    Persistent Bias: Even after controlling for experience, education, and hours worked, a 12–15% gender gap remains.

  • Blue checkmark inside a dark blue square.

    Top Predictors: Experience, industry, and education most strongly influence wages; gender remains a key residual factor.

  • Blue checkmark inside a dark blue square.

    Model Accuracy: Neural model (R² = 0.86) outperformed linear baselines, capturing hidden interactions.

  • Blue checkmark inside a dark blue square.

    Explainability: SHAP analysis made the bias visible, quantifying gender’s impact transparently.

  • Blue checkmark inside a dark blue square.

    Actionability: Highlights industries with highest gaps—finance, tech, and marketing—supporting data-driven equity reform.

  • Blue checkmark inside a dark blue square.

    Ethical Insight: Demonstrates how AI can reveal bias—and help correct it.

Validity & Limitations:

  • Blue checkmark inside a dark blue square.

    Dataset is publicly available, widely used for gender pay analyses, but may underrepresent some sectors or regions

  • Blue checkmark inside a dark blue square.

    Potential sampling bias: some job titles or regions may be missing or over-/under-represented

  • Blue checkmark inside a dark blue square.

    Omitted variables: job performance, negotiation skill, company-specific policies, full tenure, discrimination not directly captured

  • Blue checkmark inside a dark blue square.

    Self-reported or aggregated salary data may have measurement error

  • Blue checkmark inside a dark blue square.

    Interpretation: correlations, not proven causation

Implications & Future Work

  • Blue checkmark inside a dark blue square.

    Helps organizations and policymakers quantify residual gender penalties

  • Blue checkmark inside a dark blue square.

    Can be extended to longitudinal datasets (tracking wage evolution over time)

  • Blue checkmark inside a dark blue square.

    Integration with company-level variables (size, ownership, transparency)

  • Blue checkmark inside a dark blue square.

    Expanding datasets to more countries or sectors