Index select
Removed show_progress parameter from model.encode calls.
Added an optional mask parameter to the _np_retriever function for enhanced retrieval capabilities.
Add optional mask parameter to scoring method for document weighting.
What's the use case for masking?

One use case is when you prefer not to re-index your entire dataset, such as when you need to remove a specific set of document IDs.

Why not just remove them from the retrieved set as a post-filter in the pipeline?

For example, if you want to keep a top-k of 200 results, a post-filter does not guarantee that you will actually return 200 items. The system first retrieves 200 results from the mixed set, and the post-filter may then reduce that list to fewer than 200. If the mask is applied during retrieval, only the allowed documents are candidates in the first place, which guarantees a full top-k result as long as at least k documents are allowed. Post-filtering can therefore be unfair when evaluating metrics at top-k 200. Even if you increase the first-stage top-k to 300, the issue still exists, and it can get worse as the top-k grows. How severe it is depends on how many documents you want to exclude.
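The difference between the two approaches can be shown with a small NumPy sketch. This is not the PR's implementation: the function names `topk_postfilter` and `topk_masked` are hypothetical, and it assumes the mask is a boolean array where excluded documents are scored as negative infinity (the PR also mentions weighting, which is not shown here).

```python
import numpy as np

def topk_postfilter(scores, banned, k):
    """Take the top-k by score first, then drop banned document indices."""
    order = np.argsort(-scores)[:k]
    return [int(i) for i in order if int(i) not in banned]

def topk_masked(scores, mask, k):
    """Exclude masked-out documents at scoring time, then take the top-k."""
    # Assumption: mask is boolean; False means "never retrieve this document".
    masked = np.where(mask, scores, -np.inf)
    k = min(k, int(mask.sum()))  # cannot return more than the allowed documents
    order = np.argsort(-masked)[:k]
    return [int(i) for i in order]

# Toy scores for six documents; documents 0 and 2 must be excluded.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
banned = {0, 2}
mask = np.ones(len(scores), dtype=bool)
mask[list(banned)] = False

post = topk_postfilter(scores, banned, k=3)   # only one result survives
masked = topk_masked(scores, mask, k=3)       # a full top-3 of allowed docs
print(post, masked)
```

With k=3, the post-filter returns a single document because two of the three retrieved documents are banned, while the masked version still returns three allowed documents.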
if len(texts) == 0:
    return np.empty(shape=(0, 0))

# print("texts", texts, type(texts))
["search_documents: " + t for t in texts],
dataset_embeddings=self.cache.context(),
show_progress=show_progress,
show_progress_bar=show_progress,
Why show_progress -> show_progress_bar?
# Load numpy-backed vectors (memory-mapped)
dvecs, _ = self.payload(return_docnos=False)

# Important: frombuffer avoids an extra copy
Isn't this just formatting? What's the change here?
np.stack(inp['query_vec'])
).to(self.torch_vecs)

tv = (
I think tv needs an explanatory comment.

I think this should have been two separate PRs, no?

Further to my review, can we have some test cases that show that the masking is working as expected?
Pull Request Summary
This pull request adds masking and selection support to the numpy- and torch-based retrievers in pyterrier_dr.flex, and fixes several compatibility and usability issues in the CDE bi-encoder.

Retriever Enhancements

NumpyRetriever
Adds an optional mask parameter (including factory methods) to support per-document score weighting or filtering during brute-force retrieval. The mask is applied at scoring time, enabling flexible document selection or downweighting.

TorchRetriever
Adds an optional index_select parameter (including factory methods) to restrict retrieval to a specified subset of document indices, handled efficiently on the target device.

CDE Bi-Encoder Fixes and Improvements
encode_context, encode_queries, and encode_docs via the verbose flag.

Documentation and Refactoring
mask and index_select options.

Overall, these changes improve retrieval flexibility, fix CDE encoder correctness issues, and align prompt handling with the original CDE design.
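
As a rough illustration of the index_select idea described for TorchRetriever, the sketch below restricts brute-force scoring to a subset of document rows and maps the results back to the original document indices. It is written with NumPy fancy indexing for brevity rather than torch, and `retrieve_subset` is a hypothetical helper, not a function from pyterrier_dr; a torch version would presumably select the rows on the target device instead.

```python
import numpy as np

def retrieve_subset(query_vec, doc_vecs, index_select, k):
    """Score only the documents listed in index_select and return the top-k.

    Returns (global document indices, scores), both ordered by descending score.
    """
    sub = doc_vecs[index_select]            # keep only the allowed rows
    scores = sub @ query_vec                # dot-product similarity
    k = min(k, len(index_select))           # cannot return more than the subset
    local = np.argsort(-scores)[:k]         # positions within the subset
    return index_select[local], scores[local]

# Four toy document vectors; document 0 is the best match but is not selected.
doc_vecs = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.7, 0.7],
                     [0.9, 0.1]])
query_vec = np.array([1.0, 0.0])
index_select = np.array([1, 2, 3])

ids, top_scores = retrieve_subset(query_vec, doc_vecs, index_select, k=2)
print(ids, top_scores)
```

The key step is translating subset-local positions back to global document indices, so that downstream components still see the original document identifiers.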