Description
The evaluation module currently provides 21 metrics across rating accuracy (RMSE, MAE, R-squared, etc.), ranking quality (Precision@k, Recall@k, NDCG@k, MAP, etc.), and beyond-accuracy dimensions (diversity, novelty, serendipity, coverage). However, it has no metrics for fairness or popularity bias -- two increasingly important evaluation dimensions in recommendation systems research.
Fairness in recommendations is a growing concern with regulatory implications (e.g., the EU AI Act requires fairness assessments for AI systems). Popularity bias is one of the most well-studied biases in recommender systems, where algorithms disproportionately recommend already-popular items at the expense of niche/long-tail content.
I'd like to contribute 10 new metrics across these two categories:
Popularity Bias Metrics (5):
| Metric | Description | Reference |
|---|---|---|
| Average Recommendation Popularity (ARP) | Average popularity of recommended items | Abdollahpouri et al., FLAIRS 2019 |
| Average Percentage of Long Tail Items (APLT) | Fraction of long-tail items in recommendations | Abdollahpouri et al., FLAIRS 2019 |
| Average Coverage of Long Tail Items (ACLT) | Coverage of long-tail catalog | Abdollahpouri et al., FLAIRS 2019 |
| Popularity Lift | Popularity amplification ratio (recommendations vs. training) | Abdollahpouri, PhD Thesis 2020 |
| Gini Index | Inequality of item recommendation frequency | Standard inequality measure |
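To make the intent concrete, here is a minimal pandas sketch of two of these metrics. Function names, column defaults, and the decision to compute Gini only over items that appear in the recommendations are all illustrative assumptions; the final versions would follow the conventions in `python_evaluation.py`.

```python
import numpy as np
import pandas as pd

def avg_recommendation_popularity(train, reco, col_user="userID", col_item="itemID"):
    """ARP sketch: per-user mean training popularity of recommended items,
    averaged over users (hypothetical signature)."""
    pop = train[col_item].value_counts()  # item -> number of training interactions
    scored = reco.assign(pop=reco[col_item].map(pop).fillna(0))
    return scored.groupby(col_user)["pop"].mean().mean()

def gini_index(reco, col_item="itemID"):
    """Gini sketch over item recommendation frequency: 0 = all recommended
    items appear equally often. Items never recommended are excluded here;
    the real metric may instead count them with frequency 0."""
    x = np.sort(reco[col_item].value_counts().to_numpy(dtype=float))
    n = x.size
    return float(((2 * np.arange(1, n + 1) - n - 1) * x).sum() / (n * x.sum()))
```

The Gini computation uses the standard sorted-array formula, which is O(n log n) and needs no pairwise differences.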
Fairness Metrics (5):
| Metric | Description | Reference |
|---|---|---|
| Group Metric Disparity | Meta-metric: any existing metric's gap across user groups | Li et al., WWW 2021 |
| Demographic Parity | Equal recommendation rates across groups | Burke et al., FAccT 2018 |
| Equal Opportunity Difference | Recall@k gap across groups | Adapted from Hardt et al., NeurIPS 2016 |
| Calibration Error | KL divergence between user preferences and recommendation distribution | Steck, RecSys 2018 |
| Exposure Fairness | Gini of exposure across item providers | Singh and Joachims, KDD 2018 |
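The meta-metric idea can be sketched in a few lines. The signature below is hypothetical; the actual function would reuse the existing metric decorators and column constants, and `metric_fn` could be any metric taking `(rating_true, rating_pred)` positionally.

```python
import pandas as pd

def group_metric_disparity(rating_true, rating_pred, user_groups, metric_fn,
                           col_user="userID", col_group="group"):
    """Sketch: apply any existing metric per user group and return the
    max-min gap. `user_groups` maps each user to a group label."""
    scores = {}
    for group, users in user_groups.groupby(col_group)[col_user]:
        true_g = rating_true[rating_true[col_user].isin(users)]
        pred_g = rating_pred[rating_pred[col_user].isin(users)]
        scores[group] = metric_fn(true_g, pred_g)
    return max(scores.values()) - min(scores.values())
```

A gap of 0 means the metric is identical across groups; the same helper works for rating metrics (MAE) and ranking metrics (Recall@k), which is what makes it a meta-metric.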
Expected behavior with the suggested feature
- A new file `recommenders/evaluation/python_evaluation_fairness.py` with Python/pandas implementations of all 10 metrics, following the existing coding conventions (decorators, docstrings with citations, constants from `recommenders.utils.constants`)
- A new `SparkFairnessEvaluation` class in `spark_evaluation.py` with PySpark versions of the popularity bias metrics (following the `SparkDiversityEvaluation` pattern)
- Unit tests with boundary-condition checks (e.g., Gini = 0 for a uniform distribution, calibration error = 0 when the preference distribution matches the recommendation distribution)
- An example notebook `examples/03_evaluate/fairness_and_bias_evaluation.ipynb` demonstrating the metrics by comparing a biased vs. a fair recommender
- Zero new dependencies -- uses only numpy, pandas, and sklearn (already required)
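The calibration boundary check could look like the plain-assert pytest tests below. `calibration_error` here is only a stand-in to make the test self-contained, not the eventual implementation (which would aggregate per user, per Steck 2018).

```python
import numpy as np

def calibration_error(p, q):
    """KL(p || q) between a preference distribution p and a recommendation
    distribution q (stand-in for the real metric under test)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # 0 * log(0) terms contribute nothing by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def test_calibration_error_zero_when_distributions_match():
    p = [0.5, 0.3, 0.2]
    assert abs(calibration_error(p, p)) < 1e-12

def test_calibration_error_positive_when_skewed():
    assert calibration_error([0.5, 0.5], [0.9, 0.1]) > 0.0
```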
Proposed implementation details:

| File | Action |
|---|---|
| `recommenders/utils/constants.py` | Modify: add `DEFAULT_GROUP_COL`, `DEFAULT_PROVIDER_COL`, `DEFAULT_LONG_TAIL_THRESHOLD` |
| `recommenders/evaluation/python_evaluation_fairness.py` | Create: 10 metric functions + decorators + helpers |
| `recommenders/evaluation/spark_evaluation.py` | Modify: add `SparkFairnessEvaluation` class |
| `tests/unit/recommenders/evaluation/conftest.py` | Modify: add `fairness_data` fixture |
| `tests/unit/recommenders/evaluation/test_python_evaluation_fairness.py` | Create: 22 unit tests with boundary checks |
| `tests/unit/recommenders/evaluation/test_spark_evaluation_fairness.py` | Create: Spark parity tests |
| `examples/03_evaluate/fairness_and_bias_evaluation.ipynb` | Create: example notebook |
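The constants change is small; something like the following, where the default values are suggestions open to review, not part of the proposal itself:

```python
# Proposed additions to recommenders/utils/constants.py.
# Column names and the threshold value are illustrative defaults.
DEFAULT_GROUP_COL = "group"        # sensitive-attribute / user-group column
DEFAULT_PROVIDER_COL = "provider"  # item-provider column for exposure fairness
DEFAULT_LONG_TAIL_THRESHOLD = 0.8  # e.g., "head" = top 20% of items by popularity
```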
Willingness to contribute
Other Comments
References:
- H. Abdollahpouri, R. Burke, B. Mobasher. "Managing Popularity Bias in Recommender Systems with Personalized Re-ranking." FLAIRS 2019
- H. Steck. "Calibrated Recommendations." RecSys 2018
- A. Singh, T. Joachims. "Fairness of Exposure in Rankings." KDD 2018
- R. Burke, N. Sonboli, A. Ordonez-Gauger. "Balanced Neighborhoods for Multi-Sided Fairness in Recommendation." FAccT 2018
- Y. Li et al. "User-oriented Fairness in Recommendation." WWW 2021
- M. Hardt, E. Price, N. Srebro. "Equality of Opportunity in Supervised Learning." NeurIPS 2016
I have a working implementation ready with 22 passing tests and 0 regressions against the existing test suite. Happy to share early for feedback before opening the PR.