Skip to content

Conversation

@MichaIng
Copy link
Member

@MichaIng MichaIng commented Jun 25, 2025

Fixes: #1737

The remote URL/website checker currently passes all URLs with fragments to the fragment checker as HTML document, even if it is a different or unsupported MIME type. This can cause false fragment checking for Markdown documents, failures for other MIME types, especially binaries, and unnecessary traffic for large downloads, which are always finished completely, if the fragment checker is invoked.

This commit checks the Content-Type header of the response:

  • Only if it is text/html, it is passed to the fragment checker as HTML type.
  • Only if it is text/markdown, of text/plain and URL path ends on .md, it is passed to the fragment checker as Markdown type.
  • In all other cases, the fragment checker is skipped and the HTTP status is returned.

To invoke the fragment checker with a variable document type, a new FileType argument is added to the check_html_fragment() function.

The fragment checker test and fixture are adjusted to match the expected result: checking a binary file via remote URL with fragment is now expected to succeed, since its Content-Type header does not invoke the fragment checker anymore.

@MichaIng MichaIng added the enhancement New feature or request label Jun 25, 2025
@MichaIng
Copy link
Member Author

MichaIng commented Jun 25, 2025

Woooh, that was tedious, getting all the types and methods right, what is returning an option to use is_some_and(), what a result to use is_ok_and(), what is a string, or what needs to be converted to_str(), which however returns a result. And then clippy/rustfmt is suggesting sometimes iteratively changing results, like one suggestion results in another suggestion, which results in a format change suggestion:

  • ends_with(".rs")
  • => extension().map_or(false, |ext| ext.eq_ignore_ascii_case("rs")) reasonable suggestion, but ...
  • => extension().is_some_and(|ext| ext.eq_ignore_ascii_case("rs")) which I remember from elsewhere will be suggested 😄

Finally all green!

image

I am sure there are shorter/clearer/better performing ways for thees content type and file ending checks. But I am tired now 😅.

@MichaIng MichaIng force-pushed the content-type-based-fragment-checking branch 2 times, most recently from 376ed54 to 08b0ec2 Compare July 1, 2025 00:29
@MichaIng
Copy link
Member Author

MichaIng commented Jul 1, 2025

Simplifying things. What I just tried was:

                if self.include_fragments
                    && status.is_success()
                    && method == Method::GET
                    && response.url().fragment().is_some_and(|x| !x.is_empty())
                    && let Some(content_type) = response
                        .headers()
                        .get(CONTENT_TYPE)
                        .and_then(|x| x.to_str().ok())
                {
                    if content_type.starts_with("text/html") {
                        self.check_html_fragment(status, response, FileType::Html)
                            .await
                    } else if content_type.starts_with("text/markdown")
                        || (content_type.starts_with("text/plain")
                            && std::path::Path::new(response.url().path())
                                .extension()
                                .is_some_and(|x| x.eq_ignore_ascii_case("md")))
                    {
                        self.check_html_fragment(status, response, FileType::Markdown)
                            .await
                    } else {
                        status
                    }
                } else {
                    status
                }

But that throws a 'let' expressions in this position are unstable error, which however was somewhat solved upstream an hour ago (see last edit timestamp, what a coincidence): rust-lang/rust#53667

So with Rust 1.88.0+ and edition 2024, the above should work, with the let assignment being part of the parent if block. Until then however we need to use either one more if block (hence another else status repetition), or repeating the Some() validation for both cases, what I will do now.

@MichaIng MichaIng force-pushed the content-type-based-fragment-checking branch 2 times, most recently from 83630cb to 34a933c Compare July 1, 2025 00:58
@MichaIng
Copy link
Member Author

MichaIng commented Jul 1, 2025

Another approach to avoid the repetitive status and the doubled Some():

                if self.include_fragments
                    && status.is_success()
                    && method == Method::GET
                    && response.url().fragment().is_some_and(|x| !x.is_empty())
                {
                    if let Some(content_type) = response
                        .headers()
                        .get(CONTENT_TYPE)
                        .and_then(|x| x.to_str().ok())
                    {
                        if content_type.starts_with("text/html") {
                            return self.check_html_fragment(status, response, FileType::Html)
                                .await
                        } else if content_type.starts_with("text/markdown")
                            || (content_type.starts_with("text/plain")
                                && std::path::Path::new(response.url().path())
                                    .extension()
                                    .is_some_and(|x| x.eq_ignore_ascii_case("md")))
                        {
                            return self.check_html_fragment(status, response, FileType::Markdown)
                                .await
                        }
                    }
                }
                status

And next Rust release, && if let is not assumed unstable anymore, we can platten it by merging first and second if. Of course also when keeping the match without early returns.

Copy link
Member

@mre mre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one comment. Apart from that, looks fine.

@mre
Copy link
Member

mre commented Jul 3, 2025

From my side, we could merge as is and aim for a follow-up to introduce From<& Response>. Sounds good?

@MichaIng MichaIng force-pushed the content-type-based-fragment-checking branch 4 times, most recently from f91f075 to 7b1c3e9 Compare July 3, 2025 21:23
@MichaIng
Copy link
Member Author

MichaIng commented Jul 3, 2025

Okay, looks good, better readable. And the two match blocks are potentially expandable:

  • We might want to add support for more content types.
  • We might want to add support for more file extensions in case of text/plain content type.

The current structure allows both.

@MichaIng
Copy link
Member Author

MichaIng commented Jul 4, 2025

Btw, the local file checker still passes binary data to the fragment checker, if a fragment is given. If I am not mistaken this means it reads the whole file content into RAM, which can be problematic when a large number of files, or large local files are checked, probably one of the reasons it fails at some point, when memory is full: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/checker/file.rs#L218C14-L234

But I have no good idea how to prevent that, since of course we have no response header to check. Maybe there is a way to abort this read_to_string early, if it is no plain text file: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/utils/fragment_checker.rs#L24
But a topic for a different PR 😉.

EDIT: Though: https://doc.rust-lang.org/std/fs/fn.read_to_string.html

If the contents of the file are not valid UTF-8, then an error will also be returned.

Sounds like it errors out early already.

EDIT2: Nah, the code looks more like it really reads everything into memory first, and tries to interpret it as UTF-8 afterwards. Not optimal.

The remote URL/website checker currently passes all URLs with fragments to the fragment checker as HTML document, even if it is a different or unsupported MIME type. This can cause false fragment checking for Markdown documents, failures for other MIME types, especially binaries, and unnecessary traffic for large downloads, which are always finished completely, if the fragment checker is invoked.

This commit checks the Content-Type header of the response:
- Only if it is `text/html`, it is passed to the fragment checker as HTML type.
- Only if it is `text/markdown`, of `text/plain` and URL path ends on `.md`, it is passed to the fragment checker as Markdown type.
- In all other cases, the fragment checker is skipped and the HTTP status is returned.

To invoke the fragment checker with a variable document type, a new `FileType` argument is added to the `check_html_fragment()` function.

The fragment checker test and fixture are adjusted to match the expected result: checking a binary file via remote URL with fragment is now expected to succeed, since its Content-Type header does not invoke the fragment checker anymore.

Signed-off-by: MichaIng <[email protected]>
@MichaIng MichaIng force-pushed the content-type-based-fragment-checking branch from 7b1c3e9 to a68c3ad Compare July 4, 2025 14:43
@MichaIng
Copy link
Member Author

MichaIng commented Jul 4, 2025

Conflict solved.

@mre
Copy link
Member

mre commented Jul 4, 2025

EDIT2: Nah, the code looks more like it really reads everything into memory first, and tries to interpret it as UTF-8 afterwards. Not optimal.

We could use libmagic to check the file type before reading.
I found this to be an up-to-date crate: https://github.com/marirs/filemagic-rs

@mre
Copy link
Member

mre commented Jul 4, 2025

And next Rust release, && if let is not assumed unstable anymore, we can platten it by merging first and second if. Of course also when keeping the match without early returns.

Let-chains are stable already. We could bump the MSRV (minimum supported Rust version) for this. Let me know if you'd like to change it to that or if you're happy. From my side, it's fine either way.

@MichaIng
Copy link
Member Author

MichaIng commented Jul 4, 2025

We could use libmagic to check the file type before reading.

Let me first verify whether really all data is loaded into RAM before the failure. If so, I would try to make this a topic among Rust devs. Maybe there is a common practice with standard crates already. I do not really believe that no one ever had the idea of a simple native way to error out early in such case, instead of reading any size of file fully into RAM, before recognizing on the first bits that it cannot be processed anyway.

Let-chains are stable already.

I was also a bit confused since we use the version and standard which should support it. But when I tried this, I still got the error that it is unstable: https://github.com/lycheeverse/lychee/actions/runs/15986698169/job/45092399397

I mean the related PR has been merged only 5 days ago: rust-lang/rust#143214
Maybe it was stable already, but the related error has been removed just now, for Rust 1.90.0.

@mre
Copy link
Member

mre commented Jul 5, 2025

std::fs::read_to_string() only has to make a single allocation and reading the whole file at once is often faster than multiple smaller reads due to fewer system calls and better I/O patterns. There is https://github.com/rusticstuff/simdutf8, but I think it's overkill for what we need.
We can just use a buffer:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn read_text_file_early_fail(path: &str) -> Result<String, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let reader = BufReader::new(file);
    
    let mut content = String::new();
    for line in reader.lines() {
        content.push_str(&line?); // This will fail early on invalid UTF-8
        content.push('\n');
    }
    Ok(content)
}

But that might be slower for actual UTF-8 files, so we'd have to benchmark that. We could probably make it an extension trait on Path to make it more ergonomic.

@MichaIng
Copy link
Member Author

MichaIng commented Jul 5, 2025

Probably it is not worth it to tackle it then. I mean fragments used in URLs for non-text data is pretty rare anyway. For HTTP requests it was just easy enough as we do have the headers anyway.

@mre mre merged commit 92a9bca into lycheeverse:master Jul 6, 2025
6 checks passed
@mre
Copy link
Member

mre commented Jul 6, 2025

Good job!

@mre mre mentioned this pull request Jul 6, 2025
@MichaIng MichaIng deleted the content-type-based-fragment-checking branch July 6, 2025 16:52
@mre mre mentioned this pull request Aug 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Check fragments for remote URLs only for certain MIME types

2 participants