Add CL_TYPE_AI_MODEL and associated file type magic signatures #1476
This is just preliminary support for identifying an assortment of different AI model files.

So far, this detects the following types:
- GGML GGUF (.gguf)
- ONNX AI (.onnx)
- TensorFlow Lite (.tflite)

Additional types to consider:
- SafeTensors (.safetensors)
- TensorFlow (.pb, .ckpt, .tfrecords)
- Keras (.keras)
- pickle (.pkl)
- numpy (.npy, .npz)
- coreml (.coreml)
- PyTorch (.pt, .pth, .bin, .mar, .pte, .pt2, .ptl)

Outside of being able to differentiate by file type, the scanner will treat CL_TYPE_AI_MODEL the same as CL_TYPE_BINARY_DATA. We're not adding parsers to further process these files, for now.
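For readers unfamiliar with magic-based detection, here is a minimal Python sketch of the general idea. It is not the signature set added in this PR and does not use ClamAV's signature format. It assumes two publicly documented markers: GGUF files begin with the ASCII bytes `GGUF`, and TensorFlow Lite FlatBuffers carry the file identifier `TFL3` at byte offset 4. ONNX is left out because it is a protobuf serialization without a fixed leading magic. The names `SIGNATURES` and `sniff_model_type` are hypothetical.

```python
# Illustrative only: (offset, magic bytes, label) tuples for two formats
# with well-known fixed markers. Not the signatures from this PR.
SIGNATURES = [
    (0, b"GGUF", "gguf"),    # GGUF: magic at the very start of the file
    (4, b"TFL3", "tflite"),  # TFLite: FlatBuffers file identifier at offset 4
]


def sniff_model_type(header: bytes):
    """Return a label if the leading bytes match a known marker, else None."""
    for offset, magic, label in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None
```

Only the first 8 bytes of a file are needed for these two checks, so a caller could pass something like `open(path, "rb").read(8)`.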
This list may be useful... https://github.com/trailofbits/ml-file-formats
@Sanesecurity Thanks, that's really helpful! Unfortunately, I found it's really hard to identify TensorFlow files because they mostly appear to be numpy arrays. Same with SafeTensors. The list I was given also treats numpy as a format for AI models, though I'm super skeptical about calling all numpy files "AI models". Numpy is useful for data analysis in general. E.g. I use it for massaging pytest results into pretty graphs. If anyone has tips on recognizing these file types without using file extensions, I'd appreciate it.
Thanks @Sanesecurity, I've seen those, but unfortunately the format they describe isn't very useful for identifying them. E.g. "TENSOR_NAME" and "NEXT_TENSOR_NAME" in the documentation are just placeholders for any name. The only strings that appear regularly are "dtype", "shape", and "data_offsets", which are generic to numpy arrays.
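One structural heuristic that does not depend on tensor names, sketched here in Python as an assumption rather than anything this PR implements: the published SafeTensors layout starts with an 8-byte little-endian unsigned integer giving the length of a UTF-8 JSON header, and every entry in that header other than the optional `__metadata__` key is an object carrying "dtype", "shape", and "data_offsets". The function name `looks_like_safetensors` and the `max_header` cap are hypothetical, and a check like this can still produce false positives on other files that happen to share the layout.

```python
import json
import struct

REQUIRED_KEYS = {"dtype", "shape", "data_offsets"}


def looks_like_safetensors(data: bytes, max_header: int = 100_000_000) -> bool:
    """Heuristic check: length-prefixed JSON header describing tensors."""
    if len(data) < 8:
        return False
    # First 8 bytes: little-endian u64 length of the JSON header.
    (header_len,) = struct.unpack("<Q", data[:8])
    if header_len == 0 or header_len > max_header or 8 + header_len > len(data):
        return False
    try:
        header = json.loads(data[8:8 + header_len].decode("utf-8"))
    except ValueError:  # covers both bad UTF-8 and bad JSON
        return False
    if not isinstance(header, dict):
        return False
    # Ignore the optional free-form metadata entry.
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    return bool(tensors) and all(
        isinstance(v, dict) and REQUIRED_KEYS <= v.keys()
        for v in tensors.values()
    )
```

A `.npy` file fails this check early, because its leading `\x93NUMPY` bytes decode to an implausibly large header length.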