
CDE utilization fix#44

Open
lionisakis wants to merge 15 commits into terrierteam:main from lionisakis:main

Conversation

@lionisakis

Pull Request Summary

This pull request adds masking and selection support to the numpy- and torch-based retrievers in pyterrier_dr.flex, and fixes several compatibility and usability issues in the CDE bi-encoder.

Retriever Enhancements

  • NumpyRetriever
    Adds an optional mask parameter (including factory methods) to support per-document score weighting or filtering during brute-force retrieval. The mask is applied at scoring time, enabling flexible document selection or downweighting.

  • TorchRetriever
    Adds an optional index_select parameter (including factory methods) to restrict retrieval to a specified subset of document indices, handled efficiently on the target device.

CDE Bi-Encoder Fixes and Improvements

  • Fixes compatibility issues with the sentence encoder by correcting progress bar handling, batching, and second-stage retrieval.
  • Standardizes progress display in encode_context, encode_queries, and encode_docs via the verbose flag.
  • Restores the original prompt behavior recommended by the CDE authors: removes injected prompts and prepends the original prompt text directly to inputs, aligning with the reference implementation.
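The restored prompt behavior can be sketched as follows: the prompt string is prepended directly to each input text rather than injected through the model's prompt mechanism. The "search_documents: " prompt appears in the diff below; the "search_query: " prompt and the helper name are assumptions for illustration.

```python
# Minimal sketch of prepending the CDE prompt text directly to inputs.
# "search_documents: " comes from the diff in this PR; "search_query: "
# and the helper name are hypothetical.
def with_prompt(texts, prompt="search_documents: "):
    return [prompt + t for t in texts]

docs = with_prompt(["Dense retrieval encodes text into vectors."])
queries = with_prompt(["what is dense retrieval"], prompt="search_query: ")
```

Prepending plain strings keeps the encoder input identical to what the CDE reference implementation produces, independent of any prompt-injection support in the underlying sentence encoder.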

Documentation and Refactoring

  • Improves docstrings and parameter documentation, particularly for the new mask and index_select options.
  • Applies minor refactoring and style cleanups to improve readability and maintainability.

Overall, these changes improve retrieval flexibility, fix CDE encoder correctness issues, and align prompt handling with the original CDE design.

@cmacdonald
Collaborator

What's the use case for masking?

@lionisakis
Author

One use case is when you want to remove a specific set of document IDs without re-indexing the entire dataset.

@cmacdonald
Collaborator

Why not just remove them from the retrieved set as a postfilter in the pipeline?

@lionisakis
Author

For example, if you want a top-k of 200 results, a post-filter does not guarantee that you actually return 200 items: the system first retrieves 200 results from the mixed set, and the post-filter then reduces that list to fewer than 200. If the mask is applied during retrieval, only the allowed documents are retrieved in the first place, which guarantees a full top-k. Otherwise, evaluating metrics at cutoff 200 can be unfair. Increasing the initial top-k to, say, 300 only mitigates the problem rather than eliminating it, and it gets worse the larger the desired top-k and the more documents you want to exclude.
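The argument above can be made concrete with a small numpy sketch. The numbers are deliberately adversarial toys (not the pyterrier_dr API): the 300 documents to exclude happen to be the 300 best-scoring ones.

```python
import numpy as np

# Toy illustration: 1000 documents, cutoff k=200, and the 300
# best-scoring documents are the ones we want to exclude.
scores = np.arange(1000, dtype=np.float32)   # doc i has score i
excluded = np.arange(700, 1000)              # exclude the 300 top docs
k = 200

# Post-filter: retrieve the top-k first, then drop excluded docs.
top = np.argsort(-scores)[:k]                # docs 999..800
excluded_set = set(excluded.tolist())
post_filtered = [d for d in top if d not in excluded_set]

# Mask first: excluded docs can never enter the top-k.
masked = scores.copy()
masked[excluded] = -np.inf
pre_masked = np.argsort(-masked)[:k]         # docs 699..500, a full k

print(len(post_filtered), len(pre_masked))   # prints: 0 200
```

In the worst case the post-filtered list is empty, while masking before retrieval always fills all 200 slots with allowed documents.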

if len(texts) == 0:
    return np.empty(shape=(0, 0))

# print("texts", texts, type(texts))
Collaborator

dead code

["search_documents: " + t for t in texts],
dataset_embeddings=self.cache.context(),
show_progress=show_progress,
show_progress_bar=show_progress,
Collaborator

why show_progress -> show_progress_bar

# Load numpy-backed vectors (memory-mapped)
dvecs, _ = self.payload(return_docnos=False)

# Important: frombuffer avoids an extra copy
Collaborator

Isn't this just formatting? What's the change here?

np.stack(inp['query_vec'])
).to(self.torch_vecs)

tv = (
Collaborator

I think tv needs an explanatory comment.

@cmacdonald
Collaborator

I think this should have been two separate PRs, no?

@cmacdonald
Collaborator

Further to my review, can we have some test cases that show that the masking is working as expected?
