
Prevent duplicate requests to the same URLs#2067

Merged
thomas-zahner merged 4 commits into lycheeverse:master from thomas-zahner:prevent-duplicate-requests
Mar 3, 2026

Conversation

@thomas-zahner
Member

This is done by locking a Mutex for each Uri.
Previously, duplicate Uris were sometimes checked redundantly, depending on when the other duplicates were cached.

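The per-URI locking idea can be sketched with a small registry that hands out one shared mutex per URI. This is a minimal, hypothetical sketch using std's blocking `Mutex` for brevity (the PR itself uses tokio's async `Mutex`; `UriLocks` and `mutex_for` are illustrative names, not lychee's actual API):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hands out one shared mutex per URI: two tasks checking the same URI
/// receive the same Arc and therefore serialize on the same lock.
#[derive(Default)]
struct UriLocks {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl UriLocks {
    /// Return the mutex associated with `uri`, creating it on first use.
    fn mutex_for(&self, uri: &str) -> Arc<Mutex<()>> {
        self.locks
            .lock()
            .unwrap()
            .entry(uri.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone()
    }
}

fn main() {
    let registry = UriLocks::default();
    let a = registry.mutex_for("https://example.com");
    let b = registry.mutex_for("https://example.com");
    let c = registry.mutex_for("https://example.org");
    // Same URI yields the same lock; a different URI yields a new one.
    assert!(Arc::ptr_eq(&a, &b));
    assert!(!Arc::ptr_eq(&a, &c));
    println!("per-URI locks deduplicate");
}
```

Whichever task holds the guard for a URI is the only one performing the request for it; duplicates block on `lock()` and, once woken, find the result already cached.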
@thomas-zahner thomas-zahner requested review from katrinafyi and mre March 2, 2026 10:32
@thomas-zahner thomas-zahner mentioned this pull request Mar 2, 2026
This prevents backoff and rate limiting cache hits
Member

@katrinafyi katrinafyi left a comment


Looks like it should work. I noticed this could happen while looking at this code for #2035 but I thought that the benefit was small.

This is because there is already the top-level cache in check.rs, which prevents most duplicated requests where the second request starts after the first has finished. This duplication only happens when the same URI is picked up simultaneously by 2 tasks and they're both in-progress at the same time.

But in that case, adding a mutex just blocks the second task until the first completes and won't increase overall throughput - the blocked task still counts towards max_concurrency. It does slightly reduce the number of requests, but host concurrency already applies and already keeps it reasonable.

To increase throughput, I think we'd need something higher up at the check.rs level. It should keep track of URLs in-progress and, if duplicates are seen, it should divert them to a side-channel and wait using something like https://docs.rs/tokio/latest/tokio/sync/struct.SetOnce.html#method.wait

But anyway.... the PR is fine :)
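For reference, the side-channel idea could look roughly like the following, with std's blocking `OnceLock` standing in for tokio's async `SetOnce` (`InFlight` and its method are hypothetical names, and the real version would live at the check.rs level): the first task to see a URL runs the request inside `get_or_init`, and concurrent duplicates block on the same slot until the result is set.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};
use std::thread;

type Status = String;

/// Hypothetical sketch: each URL gets a write-once slot. The first
/// caller's closure runs; concurrent duplicates block in get_or_init
/// until the result is stored, then reuse it.
#[derive(Default)]
struct InFlight {
    slots: Mutex<HashMap<String, Arc<OnceLock<Status>>>>,
}

impl InFlight {
    fn check(&self, url: &str, request: impl FnOnce() -> Status) -> Status {
        let slot = self
            .slots
            .lock()
            .unwrap()
            .entry(url.to_string())
            .or_insert_with(|| Arc::new(OnceLock::new()))
            .clone();
        // OnceLock guarantees the closure runs at most once; all other
        // callers wait for, then clone, the stored value.
        slot.get_or_init(request).clone()
    }
}

fn main() {
    let inflight = Arc::new(InFlight::default());
    let requests = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let (inflight, requests) = (inflight.clone(), requests.clone());
            thread::spawn(move || {
                inflight.check("https://example.com", || {
                    requests.fetch_add(1, Ordering::SeqCst);
                    "200 OK".to_string()
                })
            })
        })
        .collect();
    for handle in handles {
        assert_eq!(handle.join().unwrap(), "200 OK");
    }
    // Four tasks raced on the same URL, but the request ran exactly once.
    assert_eq!(requests.load(Ordering::SeqCst), 1);
    println!("duplicates diverted to the side channel");
}
```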

Comment on lines 153 to 157
let _permit = self.acquire_semaphore().await;
let uri_mutex = self.acquire_uri_mutex(&uri);
let _uri_guard = uri_mutex.lock().await;

if let Some(cached) = self.get_cached_status(&uri, needs_body) {
Member


The first cache check could be done even earlier, before the semaphore acquire since it doesn't need to be limited by host concurrency.

Also, acquire_uri_mutex is named similarly to acquire_semaphore but they do different things - acquire_semaphore also takes the lock, while acquire_uri_mutex doesn't. Could acquire_uri_mutex be changed to just return the lock guard?

Member Author

@thomas-zahner thomas-zahner Mar 2, 2026


The first cache check could be done even earlier

Ah true, see 78634dd

Could acquire_uri_mutex be changed to just return the lock guard?

Yeah, it would be nice to call the function in a single line, but unfortunately I didn't get that to work.

Oh right, the name might be weird. What about 0c6bece?
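The ordering being discussed - early cache check, per-URI lock, then a re-check - can be sketched synchronously (a hypothetical simplification: lychee's checker is async and also holds a host-concurrency semaphore permit, elided here; the names are illustrative):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct Checker {
    cache: Mutex<HashMap<String, String>>,
    uri_locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl Checker {
    fn check(&self, uri: &str, request: impl FnOnce() -> String) -> String {
        // 1. Early cache check: hits return before touching any lock.
        if let Some(status) = self.cache.lock().unwrap().get(uri).cloned() {
            return status;
        }
        // 2. Take this URI's lock so in-flight duplicates serialize.
        let lock = self
            .uri_locks
            .lock()
            .unwrap()
            .entry(uri.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone();
        let _guard = lock.lock().unwrap();
        // 3. Re-check: a duplicate may have cached the status while we waited.
        if let Some(status) = self.cache.lock().unwrap().get(uri).cloned() {
            return status;
        }
        // 4. We are the first: perform the request and cache the result.
        let status = request();
        self.cache
            .lock()
            .unwrap()
            .insert(uri.to_string(), status.clone());
        status
    }
}

fn main() {
    let checker = Checker::default();
    let mut requests = 0;
    let first = checker.check("https://example.com", || {
        requests += 1;
        "200 OK".to_string()
    });
    let second = checker.check("https://example.com", || {
        requests += 1;
        "200 OK".to_string()
    });
    assert_eq!(first, "200 OK");
    assert_eq!(second, "200 OK");
    // The second call was served from cache: only one request went out.
    assert_eq!(requests, 1);
    println!("cached after first request");
}
```

The re-check in step 3 is what makes the lock pay off: while a task waited on the per-URI lock, the duplicate holding it may have completed and cached the status, so the waiter returns from cache instead of re-requesting.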

Member


Cache check looks good now.

Is there a reason to prefer a function that returns the mutex rather than a function that returns the lock guard? (To make the lifetimes work, I think you'd need to use lock_owned)

Member Author


Oooh, that's awesome! So yeah, the reason for it was that I couldn't get it to work because I was using lock instead of lock_owned. See 86c7f2a

@thomas-zahner
Member Author

Looks like it should work. I noticed this could happen while looking at this code for #2035 but I thought that the benefit was small.

Ah cool. Yeah, you are totally right. But at the same time I think it's not that unlikely to happen. The chance that URLs are duplicated, potentially across multiple files, is quite high. Especially if you consider that the Host struct applies not to all URLs but only to the specific host/subdomain.

To increase throughput, I think we'd need something higher up at the check.rs level.

Yeah, thanks for the idea. True, it does not increase throughput at all. But it should save resources and might make things a bit faster, especially when encountering rate limiting.

@thomas-zahner thomas-zahner force-pushed the prevent-duplicate-requests branch from 0c6bece to 86c7f2a Compare March 3, 2026 09:51
Member

@katrinafyi katrinafyi left a comment


Thanks for the changes!

@thomas-zahner
Member Author

Thank you for reviewing and helping me with the lock_owned trick!

@thomas-zahner thomas-zahner merged commit a3591de into lycheeverse:master Mar 3, 2026
7 checks passed
@mre mre mentioned this pull request Feb 25, 2026
@katrinafyi katrinafyi mentioned this pull request Mar 16, 2026