Add CL_TYPE_AI_MODEL and associated file type magic signatures #1476

Merged: val-ms merged 1 commit into Cisco-Talos:main from val-ms:CLAM-2741-ai-model-file-type-detection on Mar 28, 2025

val-ms commented Mar 27, 2025

This is just preliminary support for identifying an assortment of different AI model files.

So far, this detects the following types:

  • GGML GGUF (.gguf)
  • ONNX AI (.onnx)
  • TensorFlow Lite (.tflite)

Additional types to consider:

  • SafeTensors (.safetensors)
  • TensorFlow (.pb, .ckpt, .tfrecords)
  • Keras (.keras)
  • pickle (.pkl)
  • numpy (.npy, .npz)
  • coreml (.coreml)
  • PyTorch (.pt, .pth, .bin, .mar, .pte, .pt2, .ptl)

Outside of being able to differentiate by file type, the scanner will treat CL_TYPE_AI_MODEL the same as CL_TYPE_BINARY_DATA. We're not adding parsers to further process these files, for now.
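
As a rough illustration of what magic-based identification means for two of the types listed above, here is a minimal Python sketch. It is not the ClamAV signature set from this PR: the `sniff_ai_model` helper is hypothetical, and the byte patterns are assumptions taken from the public format descriptions (GGUF files begin with the ASCII bytes `GGUF`, and TensorFlow Lite FlatBuffers carry the file identifier `TFL3` at offset 4). ONNX is left out because it is a protobuf serialization with no fixed leading magic, so it needs a looser pattern than a simple prefix check.

```python
from typing import Optional

# Assumed byte patterns from the public format descriptions,
# not copied from the ClamAV signatures added in this PR.
GGUF_MAGIC = b"GGUF"      # expected at offset 0
TFLITE_IDENT = b"TFL3"    # FlatBuffers file identifier, expected at offset 4


def sniff_ai_model(header: bytes) -> Optional[str]:
    """Hypothetical helper: guess a model format from the first bytes of a file."""
    if header[:4] == GGUF_MAGIC:
        return "gguf"
    if header[4:8] == TFLITE_IDENT:
        return "tflite"
    return None  # unknown; a scanner would fall back to generic binary data
```

Returning `None` for everything else mirrors the behaviour described above, where an unrecognized file is simply handled as generic binary data.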

val-ms force-pushed the CLAM-2741-ai-model-file-type-detection branch from ab08a5a to 8a77214 on March 27, 2025 18:30

Sanesecurity commented

This list may be useful:

https://github.com/trailofbits/ml-file-formats

val-ms commented Mar 27, 2025

@Sanesecurity Thanks, that's really helpful!

Unfortunately, I found it's really hard to identify TensorFlow files because they mostly appear to be numpy arrays. The same goes for SafeTensors. The list I was given also treats numpy as a format for AI models, though I'm super skeptical about calling all numpy files "AI models". Numpy is useful for data analysis in general. E.g. I use it for massaging pytest results into pretty graphs.

If anyone has tips on recognizing these file types without using file extensions, I'd appreciate it.
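
For context on the numpy concern, a standalone `.npy` file is easy to recognize by content, which is part of why labelling it an "AI model" is questionable: the check below says only that the file is a serialized array, not what the array is for. This is a minimal sketch, assuming the `\x93NUMPY` magic string followed by two version bytes described in the NumPy format documentation. The `looks_like_npy` helper is hypothetical and is not part of this PR.

```python
# Assumption: .npy files start with the 6-byte magic b"\x93NUMPY",
# followed by one major and one minor version byte.
NPY_MAGIC = b"\x93NUMPY"


def looks_like_npy(header: bytes) -> bool:
    """Hypothetical helper: True if the bytes start like a .npy array file."""
    return len(header) >= 8 and header[:6] == NPY_MAGIC
```

A `.npz` file would not match this check, since it is a zip archive of `.npy` members and starts with the usual zip signature instead.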

val-ms commented Mar 27, 2025

Thanks @Sanesecurity, I've seen those, but unfortunately the format they describe isn't very useful for identifying the files. E.g. "TENSOR_NAME" and "NEXT_TENSOR_NAME" in the documentation are just placeholders for any name. The only strings that appear regularly are "dtype", "shape", and "data_offsets", which are generic to numpy arrays.
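
One possible direction, offered only as a sketch: since SafeTensors has no fixed magic string, a structural check could stand in for one. The code below assumes the layout given in the published safetensors format description (an 8-byte little-endian header length, then a JSON object whose tensor entries each carry "dtype", "shape", and "data_offsets"). The `looks_like_safetensors` helper and the `MAX_HEADER_LEN` cap are hypothetical, and this kind of check needs parsing logic rather than a byte-pattern signature, so it is not something this PR implements.

```python
import json
import struct

REQUIRED_KEYS = {"dtype", "shape", "data_offsets"}
MAX_HEADER_LEN = 100 * 1024 * 1024  # arbitrary sanity cap (assumption)


def looks_like_safetensors(data: bytes) -> bool:
    """Hypothetical structural check based on the assumed safetensors layout."""
    if len(data) < 8:
        return False
    # First 8 bytes: unsigned little-endian length of the JSON header.
    (header_len,) = struct.unpack("<Q", data[:8])
    if header_len < 2 or header_len > MAX_HEADER_LEN or 8 + header_len > len(data):
        return False
    try:
        header = json.loads(data[8:8 + header_len].decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return False
    if not isinstance(header, dict):
        return False
    # "__metadata__" is an optional free-form entry, not a tensor description.
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    return bool(tensors) and all(
        isinstance(v, dict) and REQUIRED_KEYS.issubset(v) for v in tensors.values()
    )
```

Because the tensor names are arbitrary, the check keys off the shape of the header rather than any particular string, which is also why it could produce false positives on other length-prefixed JSON data.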

val-ms merged commit a80db1b into Cisco-Talos:main on Mar 28, 2025
23 of 24 checks passed
val-ms deleted the CLAM-2741-ai-model-file-type-detection branch on March 28, 2025 18:32