
CDE utilization fix#44

Open
lionisakis wants to merge 15 commits into terrierteam:main from lionisakis:main

Conversation

@lionisakis

Pull Request Summary

This pull request adds masking and selection support to the numpy- and torch-based retrievers in pyterrier_dr.flex, and fixes several compatibility and usability issues in the CDE bi-encoder.

Retriever Enhancements

  • NumpyRetriever
    Adds an optional mask parameter (including factory methods) to support per-document score weighting or filtering during brute-force retrieval. The mask is applied at scoring time, enabling flexible document selection or downweighting.

  • TorchRetriever
    Adds an optional index_select parameter (including factory methods) to restrict retrieval to a specified subset of document indices, handled efficiently on the target device.

CDE Bi-Encoder Fixes and Improvements

  • Fixes compatibility issues with the sentence encoder by correcting progress bar handling, batching, and second-stage retrieval.
  • Standardizes progress display in encode_context, encode_queries, and encode_docs via the verbose flag.
  • Restores the original prompt behavior recommended by the CDE authors: removes injected prompts and prepends the original prompt text directly to inputs, aligning with the reference implementation.
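The restored prompt behavior can be sketched as follows: the prompt string is prepended directly to each input text rather than injected through the model's prompt mechanism. The "search_documents: " prompt appears in the diff below; the "search_query: " prompt and the helper name are assumptions for illustration.

```python
# Minimal sketch of prepending the CDE prompt text directly to inputs.
# "search_documents: " comes from the diff in this PR; "search_query: "
# and the helper name are hypothetical.
def with_prompt(texts, prompt="search_documents: "):
    return [prompt + t for t in texts]

docs = with_prompt(["Dense retrieval encodes text into vectors."])
queries = with_prompt(["what is dense retrieval"], prompt="search_query: ")
```

Prepending plain strings keeps the encoder input identical to what the CDE reference implementation produces, independent of any prompt-injection support in the underlying sentence encoder.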

Documentation and Refactoring

  • Improves docstrings and parameter documentation, particularly for the new mask and index_select options.
  • Applies minor refactoring and style cleanups to improve readability and maintainability.

Overall, these changes improve retrieval flexibility, fix CDE encoder correctness issues, and align prompt handling with the original CDE design.

@cmacdonald
Collaborator

What's the use case for masking?

@lionisakis
Author

One use case is when you want to remove a specific set of document IDs without re-indexing the entire dataset.

@cmacdonald
Collaborator

Why not just remove them from the retrieved set as a postfilter in the pipeline?

@lionisakis
Author

For example, if you want a top-k of 200 results, a post-filter does not guarantee that you actually return 200 items: the system first retrieves 200 results from the mixed set, and the post-filter then reduces that list to fewer than 200. If the mask is applied during retrieval, only the allowed documents are retrieved in the first place, which guarantees a full top-k. Otherwise, evaluating metrics at cutoff 200 can be unfair. Increasing the initial top-k to, say, 300 only mitigates the problem rather than eliminating it, and it gets worse the larger the desired top-k and the more documents you want to exclude.
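The argument above can be made concrete with a small numpy sketch. The numbers are deliberately adversarial toys (not the pyterrier_dr API): the 300 documents to exclude happen to be the 300 best-scoring ones.

```python
import numpy as np

# Toy illustration: 1000 documents, cutoff k=200, and the 300
# best-scoring documents are the ones we want to exclude.
scores = np.arange(1000, dtype=np.float32)   # doc i has score i
excluded = np.arange(700, 1000)              # exclude the 300 top docs
k = 200

# Post-filter: retrieve the top-k first, then drop excluded docs.
top = np.argsort(-scores)[:k]                # docs 999..800
excluded_set = set(excluded.tolist())
post_filtered = [d for d in top if d not in excluded_set]

# Mask first: excluded docs can never enter the top-k.
masked = scores.copy()
masked[excluded] = -np.inf
pre_masked = np.argsort(-masked)[:k]         # docs 699..500, a full k

print(len(post_filtered), len(pre_masked))   # prints: 0 200
```

In the worst case the post-filtered list is empty, while masking before retrieval always fills all 200 slots with allowed documents.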

if len(texts) == 0:
    return np.empty(shape=(0, 0))

# print("texts", texts, type(texts))
Collaborator

dead code

["search_documents: " + t for t in texts],
dataset_embeddings=self.cache.context(),
show_progress=show_progress,
show_progress_bar=show_progress,
Collaborator

why show_progress -> show_progress_bar

# Load numpy-backed vectors (memory-mapped)
dvecs, _ = self.payload(return_docnos=False)

# Important: frombuffer avoids an extra copy
Collaborator

Isn't this just formatting? What's the change here?

np.stack(inp['query_vec'])
).to(self.torch_vecs)

tv = (
Collaborator

I think tv needs an explanatory comment.

@cmacdonald
Collaborator

I think this should have been two separate PRs, no?

@cmacdonald
Collaborator

Further to my review, can we have some test cases that show that the masking is working as expected?
