Validate safetensors data offsets #3364
Conversation
zcbenz
left a comment
If the malicious file's purpose is to trick the program to read more data than actual, it can simply provide a fake data_offsets together with a fake shape which would bypass the check here?
Good observation. You're describing a scenario where all three metadata fields (shape, dtype, data_offsets) are internally consistent but the file doesn't actually contain enough data to back them. That's a valid concern, but it's a different class of issue from what this PR addresses, so I agree it should be a separate PR.

This PR fixes the case where data_offsets and shape × dtype disagree: an attacker declares a 4-byte data range but a 1000×1000 shape. Without this check, the loader silently constructs a 4 MB tensor backed by 4 bytes of data. The consistency check catches this contradiction at load time, which is exactly what the Rust reference implementation also enforces via SafeTensorError::TensorInvalidInfo (https://docs.rs/safetensors/latest/safetensors/tensor/enum.SafeTensorError.html).

The scenario you describe (consistent metadata exceeding the actual file size) would require an additional file-size boundary check, which the Rust reference implementation also performs via MetadataIncompleteBuffer. I agree we should add that as a follow-up improvement in a new PR. Importantly, in that scenario the read() call will fail at eval time with an I/O error rather than silently reading garbage, so the impact is lower than the silent out-of-bounds read this PR prevents.
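To illustrate the distinction, here is a minimal sketch of the two checks discussed above. This is hypothetical illustration code, not the actual loader implementation; the `validate_entry` helper, the `DTYPE_SIZES` table, and the metadata layout are assumptions based on the safetensors JSON format.

```python
import math

# Assumed subset of safetensors dtype sizes (bytes per element).
DTYPE_SIZES = {"F32": 4, "F16": 2, "I64": 8, "U8": 1}

def validate_entry(name, info, data_len):
    """Validate one tensor's metadata entry: internal consistency
    (this PR) and the file-size boundary (proposed follow-up)."""
    begin, end = info["data_offsets"]
    # Ordering of the offsets themselves.
    if not (0 <= begin <= end):
        raise ValueError(f"{name}: invalid data_offsets ordering")
    # Consistency check (this PR): the declared byte span must equal
    # the number of elements times the dtype size.
    expected = math.prod(info["shape"]) * DTYPE_SIZES[info["dtype"]]
    if end - begin != expected:
        raise ValueError(
            f"{name}: data_offsets cover {end - begin} bytes, "
            f"but shape/dtype require {expected}"
        )
    # File-size boundary check (follow-up): even consistent offsets
    # must fit inside the payload actually present in the file.
    if end > data_len:
        raise ValueError(f"{name}: data_offsets exceed payload size {data_len}")
```

With this helper, the attack from the PR description (`shape = [1000, 1000]`, `dtype = F32`, `data_offsets = [0, 4]`) fails the consistency check, while a file whose consistent metadata overruns the payload fails the boundary check instead.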
zcbenz
left a comment
Can you rebase to remove unrelated commits, and remove the docs and python test?
Proposed changes
Fixes #3363
The SafeTensors loader reads `data_offsets` from JSON metadata but does not validate the entry count, ordering, or consistency with the declared tensor shape and dtype. An attacker can declare a large shape (e.g., 1000×1000 float32 = 4 MB) while specifying `data_offsets` that cover only 4 bytes of actual data. When the tensor is evaluated, the loader reads far beyond the provided data, producing out-of-bounds memory access.

Checklist
Put an `x` in the boxes that apply.
- I have run `pre-commit run --all-files` to format my code / installed pre-commit prior to committing changes
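To make the vulnerability in the proposed changes concrete, here is a hypothetical malicious header entry. The field names follow the safetensors JSON format; the tensor name `weight` is illustrative.

```python
# Hypothetical malicious safetensors header entry: the shape declares a
# 1000x1000 float32 tensor, but data_offsets cover only 4 bytes of payload.
header = {
    "weight": {
        "dtype": "F32",
        "shape": [1000, 1000],
        "data_offsets": [0, 4],
    }
}

info = header["weight"]
declared_bytes = info["shape"][0] * info["shape"][1] * 4  # float32 = 4 bytes/elem
offset_bytes = info["data_offsets"][1] - info["data_offsets"][0]
# 4,000,000 declared bytes vs 4 actual bytes: the contradiction an
# unvalidated loader misses, causing it to read far past the payload.
print(declared_bytes, offset_bytes)
```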