Prevent duplicate requests to the same URLs#2067
thomas-zahner merged 4 commits into lycheeverse:master
Conversation
This is done by locking a Mutex for each Uri. Previously, duplicate Uris were sometimes checked more than once, depending on when the other duplicates were cached.
This also prevents backoff and rate limiting caused by the duplicate requests.
katrinafyi left a comment
Looks like it should work. I noticed this could happen while looking at this code for #2035, but I thought the benefit of fixing it was small.
This is because there is already the top-level cache in check.rs, which prevents most duplicated requests where the second request starts after the first has finished. Duplication only happens when the same URI is picked up simultaneously by two tasks and both are in progress at the same time.
But in that case, adding a mutex just blocks the second task until the first completes and won't increase overall throughput: the blocked task still counts towards max_concurrency. It does slightly reduce the number of requests, but host concurrency limits already keep that number reasonable.
To increase throughput, I think we'd need something higher up, at the check.rs level. It should keep track of in-progress URLs and, if duplicates are seen, divert them to a side-channel where they wait using something like https://docs.rs/tokio/latest/tokio/sync/struct.SetOnce.html#method.wait
But anyway.... the PR is fine :)
```rust
let _permit = self.acquire_semaphore().await;
let uri_mutex = self.acquire_uri_mutex(&uri);
let _uri_guard = uri_mutex.lock().await;

if let Some(cached) = self.get_cached_status(&uri, needs_body) {
```
The first cache check could be done even earlier, before the semaphore acquire, since it doesn't need to be limited by host concurrency.
Also, acquire_uri_mutex is named similarly to acquire_semaphore but does something different: the semaphore call also acquires the permit, while the mutex call returns the mutex without locking it. Could acquire_uri_mutex be changed to just return the lock guard?
Cache check looks good now.
Is there a reason to prefer a function that returns the mutex rather than one that returns the lock guard? (To make the lifetimes work, I think you'd need to use lock_owned.)
Oooh, that's awesome! So yeah, the reason for it was that I couldn't get it to work because I tried using lock instead of lock_owned. See 86c7f2a
Ah cool. Yeah, you are totally right. But at the same time I think it's not that unlikely to happen. The chance that URLs are duplicated, potentially across multiple files, is quite high, especially if you consider that the
Yeah, thanks for the idea. True, it does not increase throughput at all. But it should save resources and might make it a bit faster, especially when encountering rate limiting.
Thanks to @katrinafyi for the trick with lock_owned()
0c6bece to 86c7f2a
Thank you for reviewing and helping me with the