feat: skip fragment checking for unsupported MIME types #1744

MichaIng · 2025-06-25T21:15:47Z

The remote URL/website checker currently passes all URLs with fragments to the fragment checker as HTML document, even if it is a different or unsupported MIME type. This can cause false fragment checking for Markdown documents, failures for other MIME types, especially binaries, and unnecessary traffic for large downloads, which are always finished completely, if the fragment checker is invoked.

This commit checks the Content-Type header of the response:

Only if it is text/html, it is passed to the fragment checker as HTML type.
Only if it is text/markdown, of text/plain and URL path ends on .md, it is passed to the fragment checker as Markdown type.
In all other cases, the fragment checker is skipped and the HTTP status is returned.

To invoke the fragment checker with a variable document type, a new FileType argument is added to the check_html_fragment() function.

The fragment checker test and fixture are adjusted to match the expected result: checking a binary file via remote URL with fragment is now expected to succeed, since its Content-Type header does not invoke the fragment checker anymore.

MichaIng · 2025-06-25T21:32:21Z

Woooh, that was tedious, getting all the types and methods right, what is returning an option to use is_some_and(), what a result to use is_ok_and(), what is a string, or what needs to be converted to_str(), which however returns a result. And then clippy/rustfmt is suggesting sometimes iteratively changing results, like one suggestion results in another suggestion, which results in a format change suggestion:

ends_with(".rs")
=> extension().map_or(false, |ext| ext.eq_ignore_ascii_case("rs")) reasonable suggestion, but ...
=> extension().is_some_and(|ext| ext.eq_ignore_ascii_case("rs")) which I remember from elsewhere will be suggested 😄

Finally all green!

I am sure there are shorter/clearer/better performing ways for thees content type and file ending checks. But I am tired now 😅.

MichaIng · 2025-07-01T00:49:28Z

Simplifying things. What I just tried was:

                if self.include_fragments
                    && status.is_success()
                    && method == Method::GET
                    && response.url().fragment().is_some_and(|x| !x.is_empty())
                    && let Some(content_type) = response
                        .headers()
                        .get(CONTENT_TYPE)
                        .and_then(|x| x.to_str().ok())
                {
                    if content_type.starts_with("text/html") {
                        self.check_html_fragment(status, response, FileType::Html)
                            .await
                    } else if content_type.starts_with("text/markdown")
                        || (content_type.starts_with("text/plain")
                            && std::path::Path::new(response.url().path())
                                .extension()
                                .is_some_and(|x| x.eq_ignore_ascii_case("md")))
                    {
                        self.check_html_fragment(status, response, FileType::Markdown)
                            .await
                    } else {
                        status
                    }
                } else {
                    status
                }

But that throws a 'let' expressions in this position are unstable error, which however was somewhat solved upstream an hour ago (see last edit timestamp, what a coincidence): rust-lang/rust#53667

So with Rust 1.88.0+ and edition 2024, the above should work, with the let assignment being part of the parent if block. Until then however we need to use either one more if block (hence another else status repetition), or repeating the Some() validation for both cases, what I will do now.

MichaIng · 2025-07-01T01:31:27Z

Another approach to avoid the repetitive status and the doubled Some():

                if self.include_fragments
                    && status.is_success()
                    && method == Method::GET
                    && response.url().fragment().is_some_and(|x| !x.is_empty())
                {
                    if let Some(content_type) = response
                        .headers()
                        .get(CONTENT_TYPE)
                        .and_then(|x| x.to_str().ok())
                    {
                        if content_type.starts_with("text/html") {
                            return self.check_html_fragment(status, response, FileType::Html)
                                .await
                        } else if content_type.starts_with("text/markdown")
                            || (content_type.starts_with("text/plain")
                                && std::path::Path::new(response.url().path())
                                    .extension()
                                    .is_some_and(|x| x.eq_ignore_ascii_case("md")))
                        {
                            return self.check_html_fragment(status, response, FileType::Markdown)
                                .await
                        }
                    }
                }
                status

And next Rust release, && if let is not assumed unstable anymore, we can platten it by merging first and second if. Of course also when keeping the match without early returns.

lychee-lib/src/checker/website.rs

mre

Added one comment. Apart from that, looks fine.

mre · 2025-07-03T18:29:38Z

From my side, we could merge as is and aim for a follow-up to introduce From<& Response>. Sounds good?

MichaIng · 2025-07-03T21:30:49Z

Okay, looks good, better readable. And the two match blocks are potentially expandable:

We might want to add support for more content types.
We might want to add support for more file extensions in case of text/plain content type.

The current structure allows both.

fixtures/fragments/file1.md

MichaIng · 2025-07-04T14:28:28Z

Btw, the local file checker still passes binary data to the fragment checker, if a fragment is given. If I am not mistaken this means it reads the whole file content into RAM, which can be problematic when a large number of files, or large local files are checked, probably one of the reasons it fails at some point, when memory is full: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/checker/file.rs#L218C14-L234

But I have no good idea how to prevent that, since of course we have no response header to check. Maybe there is a way to abort this read_to_string early, if it is no plain text file: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/utils/fragment_checker.rs#L24
But a topic for a different PR 😉.

EDIT: Though: https://doc.rust-lang.org/std/fs/fn.read_to_string.html

If the contents of the file are not valid UTF-8, then an error will also be returned.

Sounds like it errors out early already.

EDIT2: Nah, the code looks more like it really reads everything into memory first, and tries to interpret it as UTF-8 afterwards. Not optimal.

The remote URL/website checker currently passes all URLs with fragments to the fragment checker as HTML document, even if it is a different or unsupported MIME type. This can cause false fragment checking for Markdown documents, failures for other MIME types, especially binaries, and unnecessary traffic for large downloads, which are always finished completely, if the fragment checker is invoked. This commit checks the Content-Type header of the response: - Only if it is `text/html`, it is passed to the fragment checker as HTML type. - Only if it is `text/markdown`, of `text/plain` and URL path ends on `.md`, it is passed to the fragment checker as Markdown type. - In all other cases, the fragment checker is skipped and the HTTP status is returned. To invoke the fragment checker with a variable document type, a new `FileType` argument is added to the `check_html_fragment()` function. The fragment checker test and fixture are adjusted to match the expected result: checking a binary file via remote URL with fragment is now expected to succeed, since its Content-Type header does not invoke the fragment checker anymore. Signed-off-by: MichaIng <[email protected]>

MichaIng · 2025-07-04T14:45:05Z

Conflict solved.

Co-authored-by: MichaIng <[email protected]>

mre · 2025-07-04T21:24:48Z

EDIT2: Nah, the code looks more like it really reads everything into memory first, and tries to interpret it as UTF-8 afterwards. Not optimal.

We could use libmagic to check the file type before reading.
I found this to be an up-to-date crate: https://github.com/marirs/filemagic-rs

mre · 2025-07-04T21:28:25Z

And next Rust release, && if let is not assumed unstable anymore, we can platten it by merging first and second if. Of course also when keeping the match without early returns.

Let-chains are stable already. We could bump the MSRV (minimum supported Rust version) for this. Let me know if you'd like to change it to that or if you're happy. From my side, it's fine either way.

MichaIng · 2025-07-04T23:03:29Z

We could use libmagic to check the file type before reading.

Let me first verify whether really all data is loaded into RAM before the failure. If so, I would try to make this a topic among Rust devs. Maybe there is a common practice with standard crates already. I do not really believe that no one ever had the idea of a simple native way to error out early in such case, instead of reading any size of file fully into RAM, before recognizing on the first bits that it cannot be processed anyway.

Let-chains are stable already.

I was also a bit confused since we use the version and standard which should support it. But when I tried this, I still got the error that it is unstable: https://github.com/lycheeverse/lychee/actions/runs/15986698169/job/45092399397

I mean the related PR has been merged only 5 days ago: rust-lang/rust#143214
Maybe it was stable already, but the related error has been removed just now, for Rust 1.90.0.

mre · 2025-07-05T06:54:13Z

std::fs::read_to_string() only has to make a single allocation and reading the whole file at once is often faster than multiple smaller reads due to fewer system calls and better I/O patterns. There is https://github.com/rusticstuff/simdutf8, but I think it's overkill for what we need.
We can just use a buffer:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn read_text_file_early_fail(path: &str) -> Result<String, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let reader = BufReader::new(file);
    
    let mut content = String::new();
    for line in reader.lines() {
        content.push_str(&line?); // This will fail early on invalid UTF-8
        content.push('\n');
    }
    Ok(content)
}

But that might be slower for actual UTF-8 files, so we'd have to benchmark that. We could probably make it an extension trait on Path to make it more ergonomic.

MichaIng · 2025-07-05T15:25:09Z

Probably it is not worth it to tackle it then. I mean fragments used in URLs for non-text data is pretty rare anyway. For HTTP requests it was just easy enough as we do have the headers anyway.

mre · 2025-07-06T08:46:20Z

Good job!

MichaIng added the enhancement New feature or request label Jun 25, 2025

MichaIng force-pushed the content-type-based-fragment-checking branch 2 times, most recently from 376ed54 to 08b0ec2 Compare July 1, 2025 00:29

MichaIng force-pushed the content-type-based-fragment-checking branch 2 times, most recently from 83630cb to 34a933c Compare July 1, 2025 00:58

mre reviewed Jul 3, 2025

View reviewed changes

lychee-lib/src/checker/website.rs Outdated Show resolved Hide resolved

mre reviewed Jul 3, 2025

View reviewed changes

MichaIng force-pushed the content-type-based-fragment-checking branch 4 times, most recently from f91f075 to 7b1c3e9 Compare July 3, 2025 21:23

mre reviewed Jul 4, 2025

View reviewed changes

fixtures/fragments/file1.md Outdated Show resolved Hide resolved

MichaIng force-pushed the content-type-based-fragment-checking branch from 7b1c3e9 to a68c3ad Compare July 4, 2025 14:43

Update fixtures/fragments/file1.md

1beba54

Co-authored-by: MichaIng <[email protected]>

mre approved these changes Jul 6, 2025

View reviewed changes

mre merged commit 92a9bca into lycheeverse:master Jul 6, 2025
6 checks passed

mre mentioned this pull request Jul 6, 2025

chore: release v0.20.0 #1760

Closed

MichaIng deleted the content-type-based-fragment-checking branch July 6, 2025 16:52

mre mentioned this pull request Aug 19, 2025

chore: release v0.20.0 #1808

Merged

Uh oh!

feat: skip fragment checking for unsupported MIME types #1744

feat: skip fragment checking for unsupported MIME types #1744

Uh oh!

Conversation

MichaIng commented Jun 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MichaIng commented Jun 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MichaIng commented Jul 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MichaIng commented Jul 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

mre left a comment

Choose a reason for hiding this comment

Uh oh!

mre commented Jul 3, 2025

Uh oh!

MichaIng commented Jul 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

MichaIng commented Jul 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MichaIng commented Jul 4, 2025

Uh oh!

mre commented Jul 4, 2025

Uh oh!

mre commented Jul 4, 2025

Uh oh!

MichaIng commented Jul 4, 2025

Uh oh!

mre commented Jul 5, 2025

Uh oh!

MichaIng commented Jul 5, 2025

Uh oh!

Uh oh!

mre commented Jul 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MichaIng commented Jun 25, 2025 •

edited

Loading

MichaIng commented Jun 25, 2025 •

edited

Loading

MichaIng commented Jul 1, 2025 •

edited

Loading

MichaIng commented Jul 1, 2025 •

edited

Loading

MichaIng commented Jul 3, 2025 •

edited

Loading

MichaIng commented Jul 4, 2025 •

edited

Loading