
Cache for deduplicating top-level in-progress requests#2107

Draft
katrinafyi wants to merge 24 commits into lycheeverse:master from rina-forks:in-flight-url-cache

Conversation

@katrinafyi
Member

@katrinafyi katrinafyi commented Mar 27, 2026

This improves the top-level cache by teaching it to understand "in-progress" requests and abstracts it into a generic Cache struct. See the docs for the new types: https://in-flight-url-cache.lychee-docs-katrinafyi.pages.dev/lychee_lib/cache/

This means that when duplicate URLs are collected from sources (e.g., due to different positions in the file), only one actual request is made to that URL. Additionally, repeated requests can be sent to a side-queue to wait for the initial request to finish, freeing up the main worker tasks for real link checks. (This is an improvement over the Host cache, which I commented on here: #2067 (review))

This is old work which I had started but never finished. I'm just posting it now to remind myself to work on it and to gather opinions :) Comments are appreciated.

todo:

  • fix prints and stuff
  • clean up wording
  • think about storing a richer status in the in-memory cache, to avoid OK (Cached) messages without --cache
  • think about the role of the hostpool cache. I think it can be simplified a lot, because after this change the only duplicates it must handle are requests to different fragments of the same URL
  • clean up handler tasks into separate functions
  • change into_completed_entries to not panic

maybe later:

  • think about applying the cache to more Arc<Mutex> places, e.g. the fragment cache
  • think about caching based on the checked HTTP URL after remaps, rather than before remaps

related work:

katrinafyi and others added 24 commits March 4, 2026 20:39
This reverts commit 6522758.
This reverts commit 9f86155.
This reverts commit 2fdac22.
After source tracking, adding a Request to a HashSet is ineffective
because each request contains a unique location. As an added
side-effect, removing the deduplication means URLs will be sent to
the stream in file order, which is nice.
- some idea of cache key hierarchy. maybe stored within the cache?
- the computation is allowed to fail with network error. or maybe we just
  store the error in the cache too.
- or maybe this cache should *only* store requests with fragments.
  that sounds okay