Skip to content

Ann-bench: fix unsafe lazy blobs#828

Merged
rapids-bot[bot] merged 7 commits intobranch-25.06from
fix-ann-bench-unsafe-lazy-blobs
Apr 25, 2025
Merged

Ann-bench: fix unsafe lazy blobs#828
rapids-bot[bot] merged 7 commits intobranch-25.06from
fix-ann-bench-unsafe-lazy-blobs

Conversation

@achirkin
Copy link
Copy Markdown
Contributor

@achirkin achirkin commented Apr 17, 2025

The ann-bench dataset uses lazy-loading blobs to move data between storage and host and device memory.
The data may be moved between memory spaces at the moment some properties/pointers are requested.

In the search throughput mode, this sometimes causes a problem: two concurrent benchmark threads access the same property and concurrently modify the state of the blobs, which leads to various segfaults.

The fix is to guard the critical sections with a mutex lock. There shouldn't be any impact on benchmark QPS results.
Only one method, dataset->dim() is accessed within the benchmark loop. To avoid locking the mutex in this method, this PR changes the way dim() is evaluated; it's cached in dim_ variable while still maintaining the behavior of loading it either from the query set or the base set depending on what is available/accessed first.

@achirkin achirkin added bug Something isn't working non-breaking Introduces a non-breaking change labels Apr 17, 2025
@achirkin achirkin self-assigned this Apr 17, 2025
@achirkin achirkin requested a review from a team as a code owner April 17, 2025 13:33
@github-actions github-actions Bot added the cpp label Apr 17, 2025
@achirkin achirkin moved this to In Progress in Unstructured Data Processing Apr 17, 2025
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.00%. Comparing base (b801218) to head (2b43309).
Report is 4 commits behind head on branch-25.06.

Additional details and impacted files
@@              Coverage Diff              @@
##           branch-25.06     #828   +/-   ##
=============================================
  Coverage         84.00%   84.00%           
=============================================
  Files                18       18           
  Lines               125      125           
=============================================
  Hits                105      105           
  Misses               20       20           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tarang-jain
Copy link
Copy Markdown
Contributor

Thanks for the update. It is a smart idea that we can put locks on the dataset layer (dataset.hpp) and still ensure thread safety in helpers in the blob layer (blob.hpp). IIRC currently the blob header is not used anywhere directly apart from dataset.hpp, so I do not have any concerns with this change.

Copy link
Copy Markdown
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Artem, thank you for this fix! Overall it is fine, but there is one issue.

Comment thread cpp/bench/ann/src/common/dataset.hpp
@achirkin achirkin requested a review from tfeher April 22, 2025 09:26
Copy link
Copy Markdown
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Artem for the updates. This works, but do we need all the added complexity?

When I asked to cache dim, I was only thinking of to store it in a variable outside the benchmark loop (like it is done for query_set).

@achirkin
Copy link
Copy Markdown
Contributor Author

I actually thought this is a little bit of a simplification :) two variables base_set_accessed_, query_set_accessed_ are replaced with a single dim_, and a helper function in place of plain assignments clearly states what is it doing for the program logic.

Copy link
Copy Markdown
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually thought this is a little bit of a simplification

Ok, I accept your point above. Still I believe the helper function is doing more than what it should.

Comment thread cpp/bench/ann/src/common/dataset.hpp Outdated
@achirkin achirkin requested a review from tfeher April 25, 2025 08:18
Copy link
Copy Markdown
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Artem for the updates, LGTM!

@tfeher
Copy link
Copy Markdown
Contributor

tfeher commented Apr 25, 2025

/merge

@rapids-bot rapids-bot Bot merged commit 24da9aa into branch-25.06 Apr 25, 2025
67 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing Apr 25, 2025
jamxia155 pushed a commit to jamxia155/cuvs that referenced this pull request Apr 25, 2025
The ann-bench dataset uses lazy-loading blobs to move data between storage and host and device memory.
The data may be moved between memory spaces at the moment some properties/pointers are requested.

In the search throughput mode, this sometimes causes a problem: two concurrent benchmark threads access the same property and concurrently modify the state of the blobs, which leads to various segfaults.

The fix is to guard the critical sections with a mutex lock. There shouldn't be any impact on benchmark QPS results.
Only one method, `dataset->dim()` is accessed within the benchmark loop. To avoid locking the mutex in this method, this PR changes the way `dim()` is evaluated; it's cached in `dim_` variable while still maintaining the behavior of loading it either from the query set or the base set depending on what is available/accessed first.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tarang Jain (https://github.com/tarang-jain)
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#828
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cpp non-breaking Introduces a non-breaking change

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants