checks: add retrieval quality checks #2451
Conversation
Code Review
This pull request introduces a comprehensive set of retrieval quality metrics (Recall@K, Precision@K, HitRate@K, MRR, NDCG@K, and InfAP) along with corresponding unit tests. Feedback suggests refining the `_as_sequence` helper to handle `None` values correctly, changing the Precision@K calculation to use the standard denominator of `k`, and renaming the `InfAP` metric to `AveragePrecision` to align with information-retrieval terminology.
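The suggested Precision@K fix can be sketched as follows. This is a minimal standalone illustration, not the PR's actual code; the function name and signature are hypothetical:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Precision@K with the standard denominator: relevant hits among the
    top-k retrieved documents, divided by k (not by how many were retrieved)."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k

# 2 of the top 3 retrieved documents are relevant -> 2/3
score = precision_at_k(["d1", "d2", "d3"], {"d1", "d3"}, k=3)
```

Using `k` as the denominator means a system that retrieves fewer than `k` documents is penalized for the missing slots, which is the conventional IR definition.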
@Check.register("inf_ap")
class InfAP[InputType, OutputType, TraceType: Trace](  # pyright: ignore[reportMissingTypeArgument]
The metric implemented here is standard Average Precision (AP). In IR literature, Inferred Average Precision (InfAP) refers to a specific estimator designed for incomplete relevance judgments (where some documents are unjudged). Since this implementation assumes strict exact-ID matching against a provided set (complete judgment), it should be renamed to AveragePrecision to avoid confusion with the specialized InfAP metric.
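For reference, the standard Average Precision computation under complete relevance judgments can be sketched like this (an illustrative standalone sketch; the name and signature are not the PR's API):

```python
def average_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Standard AP: the mean of Precision@i taken over the ranks i at which
    a relevant document appears, normalized by the total number of relevant docs."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # Precision@rank at this hit
    return precision_sum / len(relevant)

# Hits at ranks 1 and 3 out of 2 relevant docs: (1/1 + 2/3) / 2 = 5/6
score = average_precision(["a", "b", "c", "d"], {"a", "c"})
```

Note that every document is assumed to be judged (relevant or not), which is exactly the condition under which this formula is plain AP rather than the InfAP estimator.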
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Closes #2445
Summary
- Adds `RecallAtK`, `PrecisionAtK`, `HitRateAtK`, `MRR`, `NDCGAtK`, and `InfAP` checks
- Each check is configurable via a threshold, JSONPath keys for relevant/retrieved IDs, and `k` where applicable
- Checks are registered under `giskard.checks`
- Unit tests cover duplicate retrieved IDs, missing keys, and registry validation
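As one more illustration of the rank-based metrics listed above, MRR reduces to the reciprocal rank of the first relevant hit. This is a hedged sketch of the standard formula, not the PR's implementation:

```python
def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant document; 0.0 if none is retrieved.
    MRR is the mean of this value over a set of queries."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# First relevant document appears at rank 2 -> 1/2
score = reciprocal_rank(["x", "a", "b"], {"a"})
```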
Scope
This PR implements the strict exact-ID matching strategy first. Cosine similarity, LLM-judged relevance, and
documentation updates can be added in follow-up PRs.
Testing
- `uv run -m pytest -q libs/giskard-checks/tests/builtin/test_retrieval.py`
- `uv run -m pytest -q libs/giskard-checks/tests/builtin`
- `uv run ruff check ...`
- `uv run basedpyright ...`