feat: skip fragment checking for unsupported MIME types #1744
|
Woooh, that was tedious, getting all the types and methods right, what is returning an option to use
Finally all green! I am sure there are shorter, clearer, or better-performing ways to do these content type and file ending checks. But I am tired now 😅. |
376ed54 to
08b0ec2
|
Simplifying things. What I just tried was:

    if self.include_fragments
        && status.is_success()
        && method == Method::GET
        && response.url().fragment().is_some_and(|x| !x.is_empty())
        && let Some(content_type) = response
            .headers()
            .get(CONTENT_TYPE)
            .and_then(|x| x.to_str().ok())
    {
        if content_type.starts_with("text/html") {
            self.check_html_fragment(status, response, FileType::Html)
                .await
        } else if content_type.starts_with("text/markdown")
            || (content_type.starts_with("text/plain")
                && std::path::Path::new(response.url().path())
                    .extension()
                    .is_some_and(|x| x.eq_ignore_ascii_case("md")))
        {
            self.check_html_fragment(status, response, FileType::Markdown)
                .await
        } else {
            status
        }
    } else {
        status
    }

But that throws a So with Rust 1.88.0+ and edition 2024, the above should work, with the |
83630cb to
34a933c
|
Another approach to avoid the repetitive

    if self.include_fragments
        && status.is_success()
        && method == Method::GET
        && response.url().fragment().is_some_and(|x| !x.is_empty())
    {
        if let Some(content_type) = response
            .headers()
            .get(CONTENT_TYPE)
            .and_then(|x| x.to_str().ok())
        {
            if content_type.starts_with("text/html") {
                return self.check_html_fragment(status, response, FileType::Html)
                    .await
            } else if content_type.starts_with("text/markdown")
                || (content_type.starts_with("text/plain")
                    && std::path::Path::new(response.url().path())
                        .extension()
                        .is_some_and(|x| x.eq_ignore_ascii_case("md")))
            {
                return self.check_html_fragment(status, response, FileType::Markdown)
                    .await
            }
        }
    }
    status

And next Rust release, |
mre
left a comment
Added one comment. Apart from that, looks fine.
|
From my side, we could merge as is and aim for a follow-up to introduce |
f91f075 to
7b1c3e9
|
Okay, looks good, more readable. And the two match blocks are potentially expandable:
The current structure allows both. |
|
Btw, the local file checker still passes binary data to the fragment checker, if a fragment is given. If I am not mistaken this means it reads the whole file content into RAM, which can be problematic when a large number of files, or large local files are checked, probably one of the reasons it fails at some point, when memory is full: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/checker/file.rs#L218C14-L234

But I have no good idea how to prevent that, since of course we have no response header to check. Maybe there is a way to abort this

EDIT: Though: https://doc.rust-lang.org/std/fs/fn.read_to_string.html
Sounds like it errors out early already.

EDIT2: Nah, the code looks more like it really reads everything into memory first, and tries to interpret it as UTF-8 afterwards. Not optimal. |
The remote URL/website checker currently passes all URLs with fragments to the fragment checker as HTML document, even if it is a different or unsupported MIME type. This can cause false fragment checking for Markdown documents, failures for other MIME types, especially binaries, and unnecessary traffic for large downloads, which are always finished completely, if the fragment checker is invoked.

This commit checks the Content-Type header of the response:
- Only if it is `text/html`, it is passed to the fragment checker as HTML type.
- Only if it is `text/markdown`, or `text/plain` and URL path ends on `.md`, it is passed to the fragment checker as Markdown type.
- In all other cases, the fragment checker is skipped and the HTTP status is returned.

To invoke the fragment checker with a variable document type, a new `FileType` argument is added to the `check_html_fragment()` function.

The fragment checker test and fixture are adjusted to match the expected result: checking a binary file via remote URL with fragment is now expected to succeed, since its Content-Type header does not invoke the fragment checker anymore.

Signed-off-by: MichaIng <[email protected]>
7b1c3e9 to
a68c3ad
|
Conflict solved. |
Co-authored-by: MichaIng <[email protected]>
We could use libmagic to check the file type before reading. |
Let-chains are stable already. We could bump the MSRV (minimum supported Rust version) for this. Let me know if you'd like to change it to that or if you're happy. From my side, it's fine either way. |
Let me first verify whether really all data is loaded into RAM before the failure. If so, I would try to make this a topic among Rust devs. Maybe there is a common practice with standard crates already. I do not really believe that no one ever had the idea of a simple native way to error out early in such case, instead of reading any size of file fully into RAM, before recognizing on the first bits that it cannot be processed anyway.

I was also a bit confused since we use the version and standard which should support it. But when I tried this, I still got the error that it is unstable: https://github.com/lycheeverse/lychee/actions/runs/15986698169/job/45092399397

I mean the related PR has been merged only 5 days ago: rust-lang/rust#143214 |
|
    use std::fs::File;
    use std::io::{BufRead, BufReader};

    fn read_text_file_early_fail(path: &str) -> Result<String, Box<dyn std::error::Error>> {
        let file = File::open(path)?;
        let reader = BufReader::new(file);
        let mut content = String::new();
        for line in reader.lines() {
            content.push_str(&line?); // This will fail early on invalid UTF-8
            content.push('\n');
        }
        Ok(content)
    }

But that might be slower for actual UTF-8 files, so we'd have to benchmark that. We could probably make it an extension trait on |
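A variant of the same idea, as a rough sketch only: instead of reading line by line, probe just the first chunk of the file and give up if it is clearly not UTF-8. The names `looks_like_utf8` and `is_utf8_prefix` and the probe size are made up for illustration and are not part of lychee. A multi-byte character cut off at the end of the probe is treated as fine, since the probe may end mid-character.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical helper (not lychee API): reads at most `probe_len` bytes and
/// reports whether they could be the start of a UTF-8 text file.
fn looks_like_utf8(path: &Path, probe_len: usize) -> io::Result<bool> {
    let mut buf = Vec::with_capacity(probe_len);
    // `take` caps the read, so a huge binary file is never loaded fully.
    File::open(path)?
        .take(probe_len as u64)
        .read_to_end(&mut buf)?;
    Ok(is_utf8_prefix(&buf))
}

/// Hypothetical helper: true if `bytes` is valid UTF-8, or valid up to a
/// multi-byte sequence that is cut off at the very end of the buffer.
fn is_utf8_prefix(bytes: &[u8]) -> bool {
    match std::str::from_utf8(bytes) {
        Ok(_) => true,
        // `error_len() == None` means the input ended in the middle of a
        // character, which is expected when the probe truncates the file.
        Err(e) => e.error_len().is_none(),
    }
}
```

This only rejects files whose first bytes are invalid; a file that turns binary later would still be read in full afterwards, so it is a cheap pre-check and not a replacement for proper validation.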
|
Probably it is not worth tackling then. I mean, fragments used in URLs for non-text data are pretty rare anyway. For HTTP requests it was just easy enough, as we do have the headers anyway. |
|
Good job! |

Fixes: #1737
The remote URL/website checker currently passes all URLs with fragments to the fragment checker as HTML document, even if it is a different or unsupported MIME type. This can cause false fragment checking for Markdown documents, failures for other MIME types, especially binaries, and unnecessary traffic for large downloads, which are always finished completely, if the fragment checker is invoked.

This commit checks the Content-Type header of the response:
- Only if it is `text/html`, it is passed to the fragment checker as HTML type.
- Only if it is `text/markdown`, or `text/plain` and URL path ends on `.md`, it is passed to the fragment checker as Markdown type.
- In all other cases, the fragment checker is skipped and the HTTP status is returned.

To invoke the fragment checker with a variable document type, a new `FileType` argument is added to the `check_html_fragment()` function.

The fragment checker test and fixture are adjusted to match the expected result: checking a binary file via remote URL with fragment is now expected to succeed, since its Content-Type header does not invoke the fragment checker anymore.
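
The Content-Type rules described above can be sketched as a small standalone function. This is only an illustration under assumptions: `fragment_file_type` is a hypothetical helper name, and `FileType` is redefined locally so the snippet compiles on its own; the actual change works on the HTTP response inside the checker, as in the snippets in the conversation.

```rust
use std::path::Path;

/// Stand-in for the `FileType` used in the PR, redefined here (assumption)
/// so the sketch is self-contained.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum FileType {
    Html,
    Markdown,
}

/// Hypothetical helper: decides whether a response should be handed to the
/// fragment checker, and as which document type. `None` means "skip the
/// fragment checker and return the HTTP status".
fn fragment_file_type(content_type: &str, url_path: &str) -> Option<FileType> {
    // Case-insensitive check for a `.md` file extension in the URL path.
    let has_md_extension = Path::new(url_path)
        .extension()
        .is_some_and(|ext| ext.eq_ignore_ascii_case("md"));

    if content_type.starts_with("text/html") {
        Some(FileType::Html)
    } else if content_type.starts_with("text/markdown")
        || (content_type.starts_with("text/plain") && has_md_extension)
    {
        Some(FileType::Markdown)
    } else {
        None
    }
}
```

With this sketch, `text/html; charset=utf-8` maps to `FileType::Html`, `text/plain` for a path ending in `.md` maps to `FileType::Markdown`, and something like `application/zip` yields `None`. The prefix match is case-sensitive, as in the snippets from the conversation.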