Beyond Logit Adjustment: A Residual Decomposition Framework for Long-Tailed Reranking
Long-tailed classification methods like logit adjustment apply a fixed per-class offset to correct frequency bias. But the correction needed between two classes varies across inputs — a fixed offset that helps one image may hurt another.
REPAIR decomposes the Bayes-optimal residual correction into:
- A classwise component (learned per-class offset with empirical Bayes shrinkage)
- A pairwise component (input-dependent correction based on shortlist competition)
$$
r_y(x, S) = \underbrace{g_y(x)}_{\text{base}} + \underbrace{a_y}_{\text{classwise}} + \underbrace{\frac{1}{k-1} \sum_{j \in S \setminus \{y\}} \theta^\top \phi(x, y, j)}_{\text{pairwise}}
$$
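As a rough illustration of how the three terms combine, the sketch below scores a shortlist with NumPy. The feature map `toy_phi` (base-score gap plus a constant) is a placeholder assumption, not the pairwise features used in the repository, and `repair_scores` is a hypothetical helper rather than the API in `repair/core.py`. All numbers are invented.

```python
import numpy as np

def toy_phi(g, y, j):
    """Hypothetical pairwise features for candidate y against competitor j."""
    return np.array([g[y] - g[j], 1.0])

def repair_scores(g, a, theta, shortlist):
    """Score each shortlist class as base + classwise + mean pairwise term.

    g: base scores for all classes (1-D array)
    a: per-class offsets (1-D array)
    theta: pairwise weight vector
    shortlist: list of k candidate class indices (k >= 2)
    """
    k = len(shortlist)
    scores = {}
    for y in shortlist:
        # Sum theta^T phi(x, y, j) over the other k - 1 shortlist members.
        pairwise = sum(theta @ toy_phi(g, y, j) for j in shortlist if j != y)
        scores[y] = g[y] + a[y] + pairwise / (k - 1)
    return scores

g = np.array([2.0, 1.5, 0.3, -1.0])   # base scores (toy values)
a = np.array([-0.5, 0.4, 0.0, 0.0])   # classwise offsets (toy values)
theta = np.array([0.1, 0.0])          # pairwise weights (toy values)

scores = repair_scores(g, a, theta, shortlist=[0, 1, 2])
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy case the base model prefers class 0, while the corrected scores prefer class 1. Only shortlist members are rescored, so the cost per input grows with k rather than with the total number of classes.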
The theory predicts when pairwise correction helps (threshold crossings between class pairs) and when it doesn't (class-separable settings). Experiments confirm gains arise precisely where the theory predicts.
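A toy check of the threshold-crossing argument, using two invented inputs that share the label pair (A, B): the first needs an offset above 2.0 to be corrected to B, the second needs an offset below 1.0 to stay at A. The sketch sweeps a grid of fixed offsets and records the best accuracy any of them reaches. The setup is an illustrative assumption, not an experiment from the paper.

```python
import numpy as np

# (base-score gap g_A - g_B, true label) for two inputs with the same label pair.
examples = [(2.0, "B"), (1.0, "A")]

def predict(gap, offset):
    """Predict B when the fixed offset added to B exceeds the base gap."""
    return "B" if offset > gap else "A"

offsets = np.linspace(-5.0, 5.0, 1001)
accuracy = [
    np.mean([predict(gap, c) == label for gap, label in examples])
    for c in offsets
]
best_fixed = max(accuracy)
print(best_fixed)
```

No offset on the grid gets both inputs right, because the two required thresholds cross. A correction that depends on the input can place each one on the correct side.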
Key idea: When the same label pair (e.g., leopard vs. cat) requires different corrections across inputs, no fixed offset can be optimal for all. REPAIR adds a lightweight pairwise term that adapts to the competition structure on the shortlist. Pairwise features
Training: Conditional log-likelihood on a calibration set, L-BFGS with L2 regularization. The base model is never modified.
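The training recipe can be sketched with SciPy as follows. This is a minimal sketch under stated assumptions, not the repository's `fit()`: the calibration data is synthetic, the two pairwise features (score gap and its absolute value) are hypothetical, the regularization weight `lam` is arbitrary, and gradients are left to SciPy's finite differences for brevity.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n, C, k, d = 200, 6, 3, 2   # inputs, classes, shortlist size, feature dim

# Synthetic calibration set standing in for pre-extracted logits.
G = rng.normal(size=(n, C))                      # frozen base scores g_y(x)
shortlists = np.argsort(-G, axis=1)[:, :k]       # top-k classes per input
y_pos = rng.integers(0, k, size=n)               # position of the true label

Gs = np.take_along_axis(G, shortlists, axis=1)   # (n, k) base scores on shortlist
diff = Gs[:, :, None] - Gs[:, None, :]           # diff[i, y, j] = g_y - g_j
feats = np.stack([diff, np.abs(diff)], axis=-1)  # hypothetical phi, (n, k, k, d)
Phi = feats.sum(axis=2) / (k - 1)                # mean over j != y (j == y is zero)

def objective(params, lam=1e-2):
    """L2-regularized negative conditional log-likelihood over the shortlist."""
    a, theta = params[:C], params[C:]
    scores = Gs + a[shortlists] + Phi @ theta    # base + classwise + pairwise
    log_prob = scores - logsumexp(scores, axis=1, keepdims=True)
    nll = -log_prob[np.arange(n), y_pos].mean()
    return nll + lam * (a @ a + theta @ theta)

x0 = np.zeros(C + d)
result = minimize(objective, x0, method="L-BFGS-B")
a_hat, theta_hat = result.x[:C], result.x[C:]
print(result.fun, a_hat, theta_hat)
```

The base scores `G` enter only as fixed inputs, so the only fitted quantities are the `C` offsets and the `d` pairwise weights.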
Theoretical contributions:
| Dataset | Base | LogitAdj | Classwise | REPAIR | Rare Base | Rare REPAIR |
|---|---|---|---|---|---|---|
| iNaturalist (8142 cls) | 47.0 | 49.0 | 48.8 | 49.0 | 37.5 | 42.2 |
| ImageNet-LT (1000 cls) | 54.4 | 61.5 | 61.8 | 62.0 | 47.4 | 60.8 |
| Places-LT (365 cls) | 44.7 | 44.5 | 45.1 | 45.5 | 46.3 | 48.2 |
| GMDB (508 diseases) | 64.0 | 70.6 | 67.4 | 72.3 | 62.9 | 79.6 |
| RareBench (508, OOD) | 59.4 | 62.5 | 66.2 | 85.6 | 41.2 | 81.2 |
- Near-class-separable regime (vision benchmarks): REPAIR provides small but consistent gains over Classwise (+0.2–0.4 Hit@1); the classwise correction already captures most of the recoverable gap.
- Non-class-separable regime (rare disease): REPAIR provides large gains. On GMDB, +4.9 Hit@1 and +12.7 Rare Hit@1 over Classwise. On RareBench (OOD text-only), +19.4 Hit@1 over Classwise (bootstrap 95% CI: [+9.4, +37.5], P=1.000).
- Quintile analysis: Gains of REPAIR over Classwise increase monotonically from Q1 (low threshold dispersion) to Q5 (high), confirming the theory's predictions.
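For readers unfamiliar with the bootstrap interval quoted above, here is a minimal sketch of a paired percentile bootstrap on the Hit@1 difference between two methods. It is not the code in `experiments/bootstrap_ci.py`; the helper name, the resample count, and the toy correctness vectors are assumptions. The returned win fraction is one plausible reading of the reported P, which the README does not define.

```python
import numpy as np

def paired_bootstrap_ci(correct_a, correct_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile CI for Hit@1(b) - Hit@1(a), in points, resampling test cases.

    correct_a, correct_b: 0/1 arrays marking whether each method's top
    prediction was right on the same test cases (paired).
    """
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(correct_a)
    # Each row is one resample of test-case indices, drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    deltas = 100.0 * (correct_b[idx].mean(axis=1) - correct_a[idx].mean(axis=1))
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    win_fraction = float((deltas > 0).mean())
    return lo, hi, win_fraction

# Toy data (invented): 32 test cases, method b fixes six of method a's errors.
classwise_correct = np.array([1] * 20 + [0] * 12)
repair_correct = np.array([1] * 26 + [0] * 6)

lo, hi, win_fraction = paired_bootstrap_ci(classwise_correct, repair_correct)
print(lo, hi, win_fraction)
```

Resampling the same indices for both methods keeps the comparison paired, which usually gives a tighter interval than bootstrapping each method separately. With a small test set the interval can still be wide.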
| Dataset | Method | k=5 | k=10 | k=20 | k=50 |
|---|---|---|---|---|---|
| GMDB | Classwise | 77.4 | 67.4 | 56.3 | 46.0 |
| GMDB | REPAIR | 81.5 | 72.3 | 60.5 | 49.3 |
| RareBench | Classwise | 71.7 | 66.2 | 67.7 | 58.7 |
| RareBench | REPAIR | 79.3 | 85.6 | 79.0 | 73.5 |
REPAIR outperforms Classwise at every shortlist size tested (k = 5 to 50) on both GMDB and RareBench.
pip install -r requirements.txt
# numpy, scipy, matplotlib
# PyTorch required for GMDB/RareBench (classifier weight loading)

REPAIR/
├── repair/ # Core algorithm
│ ├── core.py # fit(), compute_phi(), apply_scores(), evaluate()
│ └── shrinkage.py # Empirical Bayes shrinkage for classwise offsets
├── experiments/ # Experiment scripts (one per table/figure)
│ ├── run_main_table.py # Table 1: Main results (5 datasets, 5-seed)
│ ├── ablation.py # Component ablation (CW-only, PW-only, REPAIR)
│ ├── k_sensitivity.py # Table 6: Sensitivity to shortlist size k
│ ├── unconditional.py # Table 7: Unconditional results
│ ├── bootstrap_ci.py # Table 9: RareBench bootstrap 95% CI
│ ├── synthetic.py # Figure 2: Synthetic validation
│ └── contradictory.py # Contradictory pair analysis
├── figures/ # Figure generation
│ ├── plot_style.py # Shared matplotlib style
│ ├── gen_validation.py # Figure 2 (synthetic)
│ ├── gen_quintile.py # Figure 3 (quintile analysis)
│ ├── gen_ablation.py # Figure 7 (ablation bars)
│ └── gen_k_sensitivity.py # k-sensitivity plots
└── requirements.txt
Set REPAIR_DATA_ROOT to the directory containing pre-extracted logits:
export REPAIR_DATA_ROOT=/path/to/data

Expected layout:
$REPAIR_DATA_ROOT/
├── results/imagenet_lt/imagenet_lt_logits.npz
├── results/inat/logits.npz
├── results/backbone/resnet152_places_lt_logits.npz
├── results/rarebench/rarebench_qwen_logits.npz
├── results/gmdb_classifier_norms.npy
├── qwen-vl-inference/results/inference_output/v2_kway_08b_final_{cal,test}.logits.npz
├── qwen-vl-finetune/qwenvl/data/v2_manifest.csv
├── qwen-vl-finetune/qwenvl/data/v2_disease_cards.json
└── data/imagenet_lt/ImageNet_LT_train.txt
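Before launching experiments it can help to confirm that the data root matches this layout. The following is a small hypothetical checker, not a script shipped with the repository. It assumes `{cal,test}` denotes two separate files, as in shell brace expansion, and falls back to the current directory when `REPAIR_DATA_ROOT` is unset.

```python
import os
from pathlib import Path

QWEN_INFER = "qwen-vl-inference/results/inference_output"
QWEN_DATA = "qwen-vl-finetune/qwenvl/data"

# Relative paths copied from the expected layout; {cal,test} expanded by hand.
EXPECTED = [
    "results/imagenet_lt/imagenet_lt_logits.npz",
    "results/inat/logits.npz",
    "results/backbone/resnet152_places_lt_logits.npz",
    "results/rarebench/rarebench_qwen_logits.npz",
    "results/gmdb_classifier_norms.npy",
    f"{QWEN_INFER}/v2_kway_08b_final_cal.logits.npz",
    f"{QWEN_INFER}/v2_kway_08b_final_test.logits.npz",
    f"{QWEN_DATA}/v2_manifest.csv",
    f"{QWEN_DATA}/v2_disease_cards.json",
    "data/imagenet_lt/ImageNet_LT_train.txt",
]

def missing_files(root):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]

root = os.environ.get("REPAIR_DATA_ROOT", ".")
for rel in missing_files(root):
    print("missing:", rel)
```

An empty output means every listed file was found. Individual experiments may need only a subset of these files.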
All commands run from the repository root:
# Table 1: Main results
python -m experiments.run_main_table
# Component ablation
python -m experiments.ablation
# Table 6: k-sensitivity
python -m experiments.k_sensitivity
# Table 7: Unconditional results (requires k_sensitivity output)
python -m experiments.unconditional
# Table 9: RareBench bootstrap CI
python -m experiments.bootstrap_ci
# Synthetic experiments + figures
python -m experiments.synthetic
python -m figures.gen_validation
# Quintile analysis figures
python -m figures.gen_quintile
# Ablation bar charts (requires ablation output)
python -m figures.gen_ablation
# Contradictory pair analysis
python -m experiments.contradictory inat
python -m experiments.contradictory imagenet
python -m experiments.contradictory places
python -m experiments.contradictory gmdb
python -m experiments.contradictory rarebench

If you find this work useful, please cite our paper.
@misc{wang2026logitadjustmentresidualdecomposition,
title={Beyond Logit Adjustment: A Residual Decomposition Framework for Long-Tailed Reranking},
author={Zhanliang Wang and Hongzhuo Chen and Quan Minh Nguyen and Mian Umair Ahsan and Kai Wang},
year={2026},
eprint={2604.01506},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2604.01506},
}

This project is released under the MIT License.