Description
The evaluation module currently provides 21 metrics across rating accuracy (RMSE, MAE, R-squared, etc.), ranking quality (Precision@k, Recall@k, NDCG@k, MAP, etc.), and beyond-accuracy dimensions (diversity, novelty, serendipity, coverage). However, it has no metrics for fairness or popularity bias -- two increasingly important evaluation dimensions in recommendation systems research.
Fairness in recommendations is a growing concern with regulatory implications (e.g., the EU AI Act requires fairness assessments for AI systems). Popularity bias is one of the most well-studied biases in recommender systems, where algorithms disproportionately recommend already-popular items at the expense of niche/long-tail content.
I'd like to contribute 10 new metrics across these two categories:
Popularity Bias Metrics (5):
| Metric | Description | Reference |
|---|---|---|
| Average Recommendation Popularity (ARP) | Average popularity of recommended items | Abdollahpouri et al., FLAIRS 2019 |
| Average Percentage of Long Tail Items (APLT) | Fraction of long-tail items in recommendations | Abdollahpouri et al., FLAIRS 2019 |
| Average Coverage of Long Tail Items (ACLT) | Coverage of long-tail catalog | Abdollahpouri et al., FLAIRS 2019 |
| Popularity Lift | Popularity amplification ratio (recommendations vs. training) | Abdollahpouri, PhD Thesis 2020 |
| Gini Index | Inequality of item recommendation frequency | Standard inequality measure |
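To make the intent concrete, here is a minimal pandas sketch of two of these metrics. Function names, column defaults, and the decision to compute Gini only over items that appear in the recommendations are all illustrative assumptions; the final versions would follow the conventions in `python_evaluation.py`.

```python
import numpy as np
import pandas as pd

def avg_recommendation_popularity(train, reco, col_user="userID", col_item="itemID"):
    """ARP sketch: per-user mean training popularity of recommended items,
    averaged over users (hypothetical signature)."""
    pop = train[col_item].value_counts()  # item -> number of training interactions
    scored = reco.assign(pop=reco[col_item].map(pop).fillna(0))
    return scored.groupby(col_user)["pop"].mean().mean()

def gini_index(reco, col_item="itemID"):
    """Gini sketch over item recommendation frequency: 0 = all recommended
    items appear equally often. Items never recommended are excluded here;
    the real metric may instead count them with frequency 0."""
    x = np.sort(reco[col_item].value_counts().to_numpy(dtype=float))
    n = x.size
    return float(((2 * np.arange(1, n + 1) - n - 1) * x).sum() / (n * x.sum()))
```

The Gini computation uses the standard sorted-array formula, which is O(n log n) and needs no pairwise differences.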
Fairness Metrics (5):
| Metric | Description | Reference |
|---|---|---|
| Group Metric Disparity | Meta-metric: any existing metric's gap across user groups | Li et al., WWW 2021 |
| Demographic Parity | Equal recommendation rates across groups | Burke et al., FAccT 2018 |
| Equal Opportunity Difference | Recall@k gap across groups | Adapted from Hardt et al., NeurIPS 2016 |
| Calibration Error | KL divergence between user preferences and recommendation distribution | Steck, RecSys 2018 |
| Exposure Fairness | Gini of exposure across item providers | Singh and Joachims, KDD 2018 |
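The meta-metric idea can be sketched in a few lines. The signature below is hypothetical; the actual function would reuse the existing metric decorators and column constants, and `metric_fn` could be any metric taking `(rating_true, rating_pred)` positionally.

```python
import pandas as pd

def group_metric_disparity(rating_true, rating_pred, user_groups, metric_fn,
                           col_user="userID", col_group="group"):
    """Sketch: apply any existing metric per user group and return the
    max-min gap. `user_groups` maps each user to a group label."""
    scores = {}
    for group, users in user_groups.groupby(col_group)[col_user]:
        true_g = rating_true[rating_true[col_user].isin(users)]
        pred_g = rating_pred[rating_pred[col_user].isin(users)]
        scores[group] = metric_fn(true_g, pred_g)
    return max(scores.values()) - min(scores.values())
```

A gap of 0 means the metric is identical across groups; the same helper works for rating metrics (MAE) and ranking metrics (Recall@k), which is what makes it a meta-metric.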
Expected behavior with the suggested feature
- A new file `recommenders/evaluation/python_evaluation_fairness.py` with Python/pandas implementations of all 10 metrics, following the existing coding conventions (decorators, docstrings with citations, constants from `recommenders.utils.constants`)
- A new `SparkFairnessEvaluation` class in `spark_evaluation.py` with PySpark versions of the popularity bias metrics (following the `SparkDiversityEvaluation` pattern)
- Unit tests with boundary-condition checks (e.g., Gini = 0 for a uniform distribution, calibration error = 0 when the preference distribution matches the recommendation distribution)
- An example notebook `examples/03_evaluate/fairness_and_bias_evaluation.ipynb` demonstrating the metrics by comparing a biased vs. a fair recommender
- Zero new dependencies -- uses only numpy, pandas, and sklearn (already required)
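The calibration boundary check could look like the plain-assert pytest tests below. `calibration_error` here is only a stand-in to make the test self-contained, not the eventual implementation (which would aggregate per user, per Steck 2018).

```python
import numpy as np

def calibration_error(p, q):
    """KL(p || q) between a preference distribution p and a recommendation
    distribution q (stand-in for the real metric under test)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # 0 * log(0) terms contribute nothing by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def test_calibration_error_zero_when_distributions_match():
    p = [0.5, 0.3, 0.2]
    assert abs(calibration_error(p, p)) < 1e-12

def test_calibration_error_positive_when_skewed():
    assert calibration_error([0.5, 0.5], [0.9, 0.1]) > 0.0
```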
Proposed implementation details:

| File | Action |
|---|---|
| `recommenders/utils/constants.py` | Modify: add `DEFAULT_GROUP_COL`, `DEFAULT_PROVIDER_COL`, `DEFAULT_LONG_TAIL_THRESHOLD` |
| `recommenders/evaluation/python_evaluation_fairness.py` | Create: 10 metric functions + decorators + helpers |
| `recommenders/evaluation/spark_evaluation.py` | Modify: add `SparkFairnessEvaluation` class |
| `tests/unit/recommenders/evaluation/conftest.py` | Modify: add `fairness_data` fixture |
| `tests/unit/recommenders/evaluation/test_python_evaluation_fairness.py` | Create: 22 unit tests with boundary checks |
| `tests/unit/recommenders/evaluation/test_spark_evaluation_fairness.py` | Create: Spark parity tests |
| `examples/03_evaluate/fairness_and_bias_evaluation.ipynb` | Create: example notebook |
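The constants change is small; something like the following, where the default values are suggestions open to review, not part of the proposal itself:

```python
# Proposed additions to recommenders/utils/constants.py.
# Column names and the threshold value are illustrative defaults.
DEFAULT_GROUP_COL = "group"        # sensitive-attribute / user-group column
DEFAULT_PROVIDER_COL = "provider"  # item-provider column for exposure fairness
DEFAULT_LONG_TAIL_THRESHOLD = 0.8  # e.g., "head" = top 20% of items by popularity
```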
Willingness to contribute
Other Comments
References:
- H. Abdollahpouri, R. Burke, B. Mobasher. "Managing Popularity Bias in Recommender Systems with Personalized Re-ranking." FLAIRS 2019
- H. Steck. "Calibrated Recommendations." RecSys 2018
- A. Singh, T. Joachims. "Fairness of Exposure in Rankings." KDD 2018
- R. Burke, N. Sonboli, A. Ordonez-Gauger. "Balanced Neighborhoods for Multi-Sided Fairness in Recommendation." FAccT 2018
- Y. Li et al. "User-oriented Fairness in Recommendation." WWW 2021
- M. Hardt, E. Price, N. Srebro. "Equality of Opportunity in Supervised Learning." NeurIPS 2016
I have a working implementation ready with 22 passing tests and 0 regressions against the existing test suite. Happy to share early for feedback before opening the PR.