Conversation
katrinafyi
left a comment
There was a problem hiding this comment.
Would the DashSet lead to significant non-determinism? If a failing URL is in multiple input files, it would only be reported in one file. The user would fix that, re-run lychee, and then see a new failure for the same URL in a different file. This seems bad.
Another approach would be to change the counter tracking so a cached entry does not increment the other counters (at https://github.com/lycheeverse/lychee/blob/fe604ba9adffa0e59554c6d65641cb300127f0d0/lychee-bin/src/formatters/stats/response.rs )
|
@katrinafyi By reading your comment, I realized that I went completely in the wrong direction. The ask was not to reintroduce the DashSet, but to count unique URLs. Thanks for pointing me in the right direction. I now changed the PR to do exactly that. See 75d9f9f So now we have: $ cargo run -- fixtures/double-count/
[...]
🔍 6 Total (in 4ms) 🔗 3 Unique ✅ 6 OK 🚫 0 Errors@afalhambra-hivemq Would you have time to try out and check if this matches your expectations? |
|
Ooh, the number of unique urls is nice. |
|
I wonder if it's possible to use the Status::Cached enum variant to determine whether to increment or not. It would change the metric from "unique" to "non-cached". I just think it would be nice to avoid another hashset because we already have caches at several layers. |
I agree it would be nice to avoid another HashSet. I tried displaying $ cargo run -- fixtures/double-count/
Compiling lychee v0.23.0 (/home/cklein/elastisys/lychee/lychee-bin)
Finished `dev` profile [unoptimized + debuginfo] target(s) in 2.30s
Running `target/debug/lychee fixtures/double-count/`
6/6 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
🔍 6 Total (in 4ms) 🔗 3 Unique ♻️ 0 Cached ✅ 6 OK 🚫 0 Errors
$ cargo run -- https://elastisys.io/ https://elastisys.io/welkin/
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.27s
Running `target/debug/lychee 'https://elastisys.io/' 'https://elastisys.io/welkin/'`
216/216 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links [WARN] lychee detected 3 redirects. You might want to consider replacing redirecting URLs with the resolved URLs. Run lychee in verbose mode (-v/--verbose) to see details about the redirections.
🔍 216 Total (in 8s 582ms) 🔗 180 Unique ♻️ 24 Cached ✅ 213 OK 🚫 0 Errors 🔀 3 RedirectsNotice that I see a few options moving forward:
I'm leaning towards 2 now and 3 later. |
|
If the goal was to avoid the hash set, displaying both non-unique and cached counts still keeps the hash set around. As a first guess, the different count could be due to multiple simultaneous requests to the same URL. Those get served by the second-layer cache (#2067) rather than the top-level cache, and the second later cache doesn't return Status::Cached for its cache hits. |
Thanks for the explanation. I'm getting the feeling that "unique links" and "non-cached links" have different semantics. "unique links" measures the structure of the website and is "ground truth" which could also be measured by other tools. In contrast, "non-cached links" feels more like a metric measuring lychee's behavior. As far as I understood, the ask in the original issue was about unique links, in which case I see the added HashSet as beneficial and necessary. I propose we add unique links with HashSet for semantic consistency. What do you think? |
|
I don't have strong feelings. It could be said that total links (including duplicates) is also a true value independent of lychee. Non-cached links could be useful as a measure of how much work lychee actually did. Unique links is probably more intuitive to understand. I'll leave it to you and others to decide, I don't have a stake here aside from not liking the hash set (but I can deal with it). |
Just to clarify here. My ask was exactly that, to have a unique URL count. Otherwise, a total URL count jump from 4K to 113K, like we noticed, sounds very misleading. If that's ok, I can give it a try and provide you with some feedback. Thanks for looking into this! |
|
Tested locally against our documentation site (~489 HTML pages, same site referenced in #2072). Build: Command: lychee --insecure -c ./links-check.toml './build/site/**/latest/**/*.html' \
'./build/site/hivemq-edge/**/*.html' \
'./build/site/hivemq-platform-operator/**/*.html' \
'./build/site/hivemq-operator/**/*.html' \
'./build/site/hivemq-cloud/**/*.html' \
'./static/llms.txt' './static/llms-full.txt'Results: The new |
Fixes #2072
@mre Is this what we want? Notice that de-duplication happens over
Request.uriso thespan,elementandattributeof URIs from different sources are lost.