
Validate safetensors data offsets#3364

Open
MillaFleurs wants to merge 2 commits intoml-explore:mainfrom
MillaFleurs:PR3363-safetensors-data_offsets-validation-fix

Conversation

@MillaFleurs
Contributor

Proposed changes

#3363 Fix

The SafeTensors loader reads data_offsets from JSON metadata but does not validate the entry count, ordering, or consistency with the declared tensor shape and dtype. An attacker can declare a large shape (e.g., 1000×1000 float32 = 4 MB) while specifying data_offsets that cover only 4 bytes of actual data. When the tensor is evaluated, the loader reads far beyond the provided data, producing out-of-bounds memory access.
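The mismatch described above can be sketched as a simple consistency check. This is a hypothetical Python sketch (the helper name `validate_entry` and the dtype table are illustrative, not the actual MLX loader code): the byte span declared in `data_offsets` must equal the size implied by `shape` and `dtype`.

```python
# Hypothetical sketch of the missing consistency check, not MLX source.
DTYPE_SIZES = {"F32": 4, "F16": 2, "I64": 8, "U8": 1}  # illustrative subset

def validate_entry(name, info):
    """Reject a header entry whose data_offsets disagree with shape * dtype."""
    begin, end = info["data_offsets"]
    n_elements = 1
    for dim in info["shape"]:
        n_elements *= dim
    expected = n_elements * DTYPE_SIZES[info["dtype"]]
    if end < begin or (end - begin) != expected:
        raise ValueError(
            f"[load_safetensors] Invalid data offsets for tensor {name}: "
            f"declared {end - begin} bytes, shape/dtype imply {expected}"
        )

# The attack from the issue: a 1000x1000 F32 tensor (4,000,000 bytes)
# backed by a declared data range of only 4 bytes.
malicious = {"dtype": "F32", "shape": [1000, 1000], "data_offsets": [0, 4]}
```

With a check like this, the contradiction is rejected at load time instead of surfacing later as an out-of-bounds read.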

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Collaborator

@zcbenz zcbenz left a comment


If the malicious file's purpose is to trick the program into reading more data than is actually present, couldn't it simply provide a fake data_offsets together with a fake shape, which would bypass the check here?

@MillaFleurs
Contributor Author

Good observation. You're describing a scenario where all three metadata fields (shape, dtype, data_offsets) are internally consistent but the file doesn't actually contain enough data to back them. That's a valid concern, but it's a different class of issue from what this PR addresses.

I think you're right that this should be a separate PR!

This PR fixes the case where data_offsets and shape × dtype size disagree — an attacker declares a 4-byte data range but a 1000x1000 shape. Without this check, the loader silently constructs a 4MB tensor backed by 4 bytes of data. The consistency check catches this contradiction at load time, which is exactly what the Rust reference implementation also enforces via its TensorInvalidInfo error (https://docs.rs/safetensors/latest/safetensors/tensor/enum.SafeTensorError.html).

The scenario you describe is consistent metadata exceeding the actual file size and would require an additional file-size boundary check (which the Rust reference implementation also does via MetadataIncompleteBuffer).
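That follow-up check could look something like this hypothetical sketch (the helper name `validate_against_buffer` is illustrative, not actual MLX code): even when shape, dtype, and data_offsets are internally consistent, the declared offsets must still fit inside the bytes that actually follow the JSON header.

```python
# Hypothetical sketch of the follow-up file-size boundary check, mirroring
# what the Rust reference implementation reports as MetadataIncompleteBuffer.
def validate_against_buffer(header, data_size):
    """Reject metadata whose offsets extend past the actual tensor data.

    `header` maps tensor names to their metadata dicts; `data_size` is the
    number of bytes that actually follow the JSON header in the file.
    """
    for name, info in header.items():
        begin, end = info["data_offsets"]
        if begin > end or end > data_size:
            raise ValueError(
                f"[load_safetensors] data_offsets for tensor {name} "
                f"exceed available data ({end} > {data_size} bytes)"
            )
```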

I agree we should add that as a follow-up improvement in a new PR.

Importantly, in that scenario the read() call will fail at eval time with an I/O error rather than silently reading garbage, so the impact is lower than the silent OOB this PR prevents.

Collaborator

@zcbenz zcbenz left a comment


Can you rebase to remove unrelated commits, and remove the docs and python test?

@zcbenz zcbenz changed the title PR3363 safetensors data offsets validation fix Validate safetensors data offsets Apr 4, 2026
