Skip to content

FIPS-mode support Re: md5 -> sha2-256 caching, FP signatures, legacy CVD dsig verification, and related improvements#1532

Merged
val-ms merged 34 commits intoCisco-Talos:mainfrom
val-ms:CLAM-1859-sha256-cache
Aug 19, 2025
Merged

FIPS-mode support Re: md5 -> sha2-256 caching, FP signatures, legacy CVD dsig verification, and related improvements#1532
val-ms merged 34 commits intoCisco-Talos:mainfrom
val-ms:CLAM-1859-sha256-cache

Conversation

@val-ms
Copy link
Contributor

@val-ms val-ms commented Jul 2, 2025

This PR migrates ClamAV away from dependence on MD5 so that we can properly operate in FIPS environment. But it includes numerous other improvements.

This PR touches on a lot of modules and interfaces. Please review each commit messages. They are a little bit more detailed than the following. You may also find it easier to review each commit diff one at a time. They build on each other but are largely independent.

The key changes:

  1. Migrate the clean cache from MD5 to SHA2-256.

    This task had major ripple effects because the MD5 we calculate to use for the clean cache is also made available to callbacks. The consequence is that I had to make sure we only calcualte MD5 (or any hash) when required. And then offer new callbacks that let the caller get whichever hash they want. This tied into other necessary improvements to the callbacks, and the cl_scan*() interfaces.

    This snowballed - but I am very happy with the end product.

  2. Disable the legacy MD5-based CVD digital signature verification method when OpenSSL has FIPS-mode enabled, or when ClamAV is configured to enforce FIPS limits. This was rather simple.

  3. Disable loading MD5 and SHA1 false positive (FP) signatures or using those hashes to check for FPs. This was also simple.

At a high level, this PR adds the following:

  • Add new option to calculate and record additional hash types when the "generate metadata JSON" feature is enabled:

    • libclamav option: CL_SCAN_GENERAL_STORE_EXTRA_HASHES
    • clamscan option: --json-store-extra-hashes (default off)
    • clamd.conf option: JsonStoreExtraHashes (default 'no')
  • Renamed the sigtool option --sha256 to --sha2-256. The original option is still functional, but is deprecated.

  • Record unique object-id for each layer scanned. This can now be found in the scan metadata JSON and may be queried for each scan layer from the new scan callbacks functions (below).

  • For the "generate metadata JSON" feature:

    • The file hash is now stored as "sha2-256" instead of "FileMD5".
      • If you enable the "extra hashes" option, then it will also record "md5" and "sha1".
    • Each object scanned now has a unique "Object ID".
    • We now also save URIs for HTML files.
      • In the 1.5.0-beta, you would find that a scan of an HTML within another file like a ZIP would not record URIs. As URI recording is new in 1.5, this does not break behavior with any ClamAV release.
  • Signature matches with names that start with "Weak." will no longer alert. They will instead be tracked internally and will be present in scan metadata JSON. This is a step towards enabling alerting signatures (Strong indicators or PUA indicators) to depend on prior Weak indicator matches in the current layer or in child layers.

  • libclamav: Added a new engine option to toggle temp directory recursion.

    Temp directory recursion is the idea that each object scanned in ClamAV's recursive extract/scan process would get a new temp subdirectory, mimicing the nesting structure of the file. It was introduced in ClamAV 0.103 and is enabled whenever --leave-temps / LeaveTemporaryFiles is enabled. This PR makes it that an application can separately enable it if they wish. For ClamScan and ClamD, it will remain tied to --leave-temps / LeaveTemporaryFiles. The new engine option can be enabled with:

    cl_engine_set_num(engine, CL_ENGINE_TMPDIR_RECURSION, 1);
  • ClamBC: fix crashes on startup related to recursion stack initialization and cleanup.
    Resolves: Execute clambc -f 3986187.cbc return Segmentation fault (core dumped) #1484

  • libclamav: Add cl_scanfile_ex(), cl_scanmap_ex(), and cl_scandesc_ex() functions that provide the following additional parameters:

    • hash_hint: (Optional) A NULL terminated string of the file hash so that libclamav does not need to calculate it.
    • [out] hash_out: (Optional) A NULL terminated string of the file hash. The caller is responsible for freeing the string.
    • hash_alg: The hashing algorithm used for either hash_hint or hash_out. Supported algorithms are "md5", "sha1", "sha2-256". If not specified, the default is "sha2-256".
    • file_type_hint: (Optional) A NULL terminated string of the file type hint. E.g. "pe", "elf", "zip", etc. You may also use ClamAV type names such as "CL_TYPE_PE". ClamAV will ignore the hint if it is not familiar with the specified type. See also: https://docs.clamav.net/appendix/FileTypes.html#file-types
    • file_type_out: (Optional) A NULL terminated string of the file type of the top layer as determined by ClamAV. Will take the form of the standard ClamAV file type format. E.g. "CL_TYPE_PE". See also: https://docs.clamav.net/appendix/FileTypes.html#file-types
  • ClamScan: Add hash & file-type in/out CLI options. We will NOT be adding this for ClamDScan, as we don't have a mechanism
    in the ClamD socket API to receive scan options or a way for ClamD to include scan metadata in the response:

    • --hash-hint: The file hash so that libclamav does not need to calculate it. The type of hash must match the --hash-alg.
    • --log-hash: Print the file hash after each file scanned. The type of hash printed will match the --hash-alg.
    • --hash-alg: The hashing algorithm used for either --hash-hint or --log-hash. Supported algorithms are 'md5', 'sha1', 'sha2-256'. If not specified, the default is 'sha2-256'.
    • --file-type-hint: The file type hint so that libclamav can optimize scanning. E.g. 'pe', 'elf', 'zip', etc. You may also use ClamAV type names such as 'CL_TYPE_PE'. ClamAV will ignore the hint if it is not familiar with the specified type. See also: https://docs.clamav.net/appendix/FileTypes.html#file-types
    • --log-file-type: Print the file type after each file scanned.
  • Fix double-extraction of OOXML-based office documents.

  • Always run callbacks for embedded files. That is, files found within other files through signature matches instead of by parsing the files will now be processed the same way and so they will trigger application callbacks (e.g. "pre-scan", "post-scan", etc.).

  • libclamav: Added new scan-layer callback API functions:

      cl_engine_set_scan_callback(engine, &pre_hash_callback, CL_SCAN_CALLBACK_PRE_HASH);
      cl_engine_set_scan_callback(engine, &pre_scan_callback, CL_SCAN_CALLBACK_PRE_SCAN);
      cl_engine_set_scan_callback(engine, &post_scan_callback, CL_SCAN_CALLBACK_POST_SCAN);
      cl_engine_set_scan_callback(engine, &alert_callback, CL_SCAN_CALLBACK_ALERT);
      cl_engine_set_scan_callback(engine, &file_type_callback, CL_SCAN_CALLBACK_FILE_TYPE);

    Each callback may alter scan behavior using the following return codes:

    • CL_BREAK: Scan aborted by callback (the rest of the scan is skipped). This does not mark the file as clean or infected, it just skips the rest of the scan.

    • CL_SUCCESS / CL_CLEAN: File scan will continue. This is different than CL_VERIFIED because it does not affect prior or future alerts. Return CL_VERIFIED instead if you want to remove prior alerts for this layer and skip the rest of the scan for this layer.

    • CL_VIRUS: This means you don't trust the file. A new alert will be added. For CL_SCAN_CALLBACK_ALERT: Means you agree with the alert (no extra alert needed).

    • CL_VERIFIED: Layer explicitly trusted by the callback and previous alerts removed FOR THIS layer. You might want to do this if you trust the hash or verified a digital signature. The rest of the scan will be skipped FOR THIS layer. For contained files, this does NOT mean that the parent or adjacent layers are trusted.

    Each callback is given a pointer to the current scan layer from which they can get previous layers, can get the the layer's fmap, and then various attributes of the layer and of the fmap. To make this possible, this I introduced a handful of new APIs to
    query scan-layer details and fmap details:

      cl_error_t cl_fmap_set_name(cl_fmap_t *map, const char *name);
      cl_error_t cl_fmap_get_name(cl_fmap_t *map, const char **name_out);
      cl_error_t cl_fmap_set_path(cl_fmap_t *map, const char *path);
      cl_error_t cl_fmap_get_path(cl_fmap_t *map, const char **path_out, size_t *offset_out, size_t *len_out);
      cl_error_t cl_fmap_get_fd(const cl_fmap_t *map, int *fd_out, size_t *offset_out, size_t *len_out);
      cl_error_t cl_fmap_get_size(const cl_fmap_t *map, size_t *size_out);
      cl_error_t cl_fmap_set_hash(const cl_fmap_t *map, const char *hash_alg, char hash);
      cl_error_t cl_fmap_have_hash(const cl_fmap_t *map, const char *hash_alg, bool *have_hash_out);
      cl_error_t cl_fmap_will_need_hash_later(const cl_fmap_t *map, const char *hash_alg);
      cl_error_t cl_fmap_get_hash(const cl_fmap_t *map, const char *hash_alg, const char **hash_out);
      cl_error_t cl_fmap_get_data(const cl_fmap_t *map, size_t offset, size_t len, const uint8_t **data_out, size_t *data_len_out);
      cl_error_t cl_scan_layer_get_fmap(cl_scan_layer_t *layer, cl_fmap_t **fmap_out);
      cl_error_t cl_scan_layer_get_parent_layer(cl_scan_layer_t *layer, cl_scan_layer_t **parent_layer_out);
      cl_error_t cl_scan_layer_get_type(cl_scan_layer_t *layer, const char **type_out);
      cl_error_t cl_scan_layer_get_recursion_level(cl_scan_layer_t *layer, uint32_t *recursion_level_out);
      cl_error_t cl_scan_layer_get_object_id(cl_scan_layer_t *layer, uint64_t *object_id_out);
      cl_error_t cl_scan_layer_get_last_alert(cl_scan_layer_t *layer, const char **alert_name_out);
      cl_error_t cl_scan_layer_get_attributes(cl_scan_layer_t *layer, uint32_t *attributes_out);

    This deprecates, but does not immediately remove, the existing scan callbacks:

      void cl_engine_set_clcb_pre_cache(struct cl_engine *engine, clcb_pre_cache callback);
      void cl_engine_set_clcb_file_inspection(struct cl_engine *engine, clcb_file_inspection callback);
      void cl_engine_set_clcb_pre_scan(struct cl_engine *engine, clcb_pre_scan callback);
      void cl_engine_set_clcb_post_scan(struct cl_engine *engine, clcb_post_scan callback);
      void cl_engine_set_clcb_virus_found(struct cl_engine *engine, clcb_virus_found callback);
      void cl_engine_set_clcb_hash(struct cl_engine *engine, clcb_hash callback);

    I have also added an interactive test program to demonstrate the callbacks. See: examples/ex_scan_callbacks.c

  • ClamScan & libclamav: Improved the precision of tje bytes-scanned and bytes-read counters. The ClamScan scan summary will now report exact counts in "GiB", "MiB", "KiB", or "B" as appropriate. Previously, it always report "MB".

  • libclamav: Added extended hashing functions with a "flags" parameter that allows the caller to choose if they want to bypass FIPS:

      cl_error_t cl_hash_data_ex(const char *alg, const uint8_t *data, size_t data_len, uint8_t **hash, size_t *hash_len, uint32_t flags);
      cl_error_t cl_hash_init_ex(const char *alg, uint32_t flags, cl_hash_ctx_t **ctx_out);
      cl_error_t cl_update_hash_ex(cl_hash_ctx_t *ctx, const uint8_t *data, size_t length);
      cl_error_t cl_finish_hash_ex(cl_hash_ctx_t *ctx, uint8_t **hash, size_t *hash_len, uint32_t flags);
      void cl_hash_destroy(void *ctx);
      cl_error_t cl_hash_file_fd_ex(const char *alg, int fd, size_t offset, size_t length, uint8_t **hash, size_t *hash_len, uint32_t flags);
  • Freshclam, ClamD, ClamScan, and Sigtool: Added an option to enable FIPS-like limits disabling MD5 and SHA1 from being used for verifying digital sigantures or for being used for false-positive (FP) checks.

    For freshclam.conf and clamd.conf set this config option:

    FIPSCryptoHashLimits yes
    

    For clamscan and sigtool use this commandline option:

    --fips-limits
    

    For libclamav: Enable FIPS-limits for a ClamAV engine like this:

    cl_engine_set_num(engine, CL_ENGINE_FIPS_LIMITS, 1);

CLAM-255
CLAM-1412
CLAM-1433
CLAM-1583
CLAM-1858
CLAM-1859
CLAM-1860
CLAM-2485
CLAM-2626
CLAM-2796

Resolves: #564

@val-ms val-ms force-pushed the CLAM-1859-sha256-cache branch from 846f128 to c3ada9e Compare July 2, 2025 16:23
@val-ms val-ms changed the base branch from CLAM-1859-sha256-cache to main July 2, 2025 16:58
@val-ms val-ms force-pushed the CLAM-1859-sha256-cache branch from c3ada9e to 392095b Compare July 2, 2025 18:01
@val-ms val-ms force-pushed the CLAM-1859-sha256-cache branch 2 times, most recently from a8a308e to dccb188 Compare July 2, 2025 21:16
TheRaynMan
TheRaynMan previously approved these changes Jul 2, 2025
Copy link
Contributor

@TheRaynMan TheRaynMan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished full review together, and we did amendments and suggestions live.

@val-ms
Copy link
Contributor Author

val-ms commented Jul 3, 2025

It looks like I accidentally broke support for openssl 1.1 and also libjson 13.1. Our Jenkins AlmaLinux 8 build fails with:

../libclamav/libclamav.so.12.1.0: undefined reference to `EVP_MD_free'
../libclamav/libclamav.so.12.1.0: undefined reference to `EVP_MD_fetch'
../libclamav/libclamav.so.12.1.0: undefined reference to `json_object_new_uint64'

If we want to maintain support for those, will have to #ifdef around them. Easy to do for OpenSSL with OPENSSL_VERSION_MAJOR. Will have to look up how to do it for libjson-c.

@val-ms val-ms force-pushed the CLAM-1859-sha256-cache branch 3 times, most recently from fc296fa to 98456e7 Compare July 24, 2025 15:22
@val-ms val-ms force-pushed the CLAM-1859-sha256-cache branch 2 times, most recently from 71d026c to 8307327 Compare July 28, 2025 15:17
@val-ms val-ms force-pushed the CLAM-1859-sha256-cache branch from 8307327 to fda858e Compare July 29, 2025 04:39
Copy link

@csbg-draynor csbg-draynor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my consistency recommendations from review were resolved.

TheRaynMan
TheRaynMan previously approved these changes Jul 29, 2025
@rasundri
Copy link

rasundri commented Aug 6, 2025

@val-ms My Observations while doing the testing.

  • When --hash-hint is used, scan debug output should show that the hash calculation is being skipped. The current debug logs do not indicate that hash calculation is skipped when --hash-hint is provided.
  • When provided an invalid value to --hash-algo like "--hash-alg md5,sha256,sharavi ", the final output shows it too.
  • JSON Metadata has file-type detections as "Viruses" when the file-type callback is set to a CL_VIRUS, then CL_BREAK to abort scan, and CL_VERIFIED to trust all the layers.
{
  "Magic":"CLAMJSONv0",
  "FileName":"test.txt",
  "FilePath":"../../../../../PR-1532/test-files/test.txt",
  "FileSize":110,
  "ObjectID":0,
  "Viruses":[
    "Detected.By.Callback.FileType"
  ]
}
Metadata JSON Return Code: CL_SUCCESS (0)

Data scanned: 0 B
Hash:         e3f667e3dbc5ee4c7599b27994d13783cf3afa1565ec768ff1edf230c06868fa
File Type:    CL_TYPE_TEXT_ASCII
Verdict:      CL_VERDICT_NOTHING_FOUND
Return Code:  CL_SUCCESS (0)

Copy link
Contributor

@rsundriyal rsundriyal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add the change to debug logs for --hash-hint use. Apart from it, looks great.

@val-ms

@val-ms val-ms force-pushed the CLAM-1859-sha256-cache branch from 2538726 to 3f1cdca Compare August 11, 2025 18:05
val-ms added 12 commits August 14, 2025 22:40
Add test to ensure CVD loading fails with --fips-limits flag unless you
provide the .cvd.sign file.

Also provide .sign files in freshclam tests because they'll be needed to
pass if building in a fips-enabled environment.
Scripted inputs may be used for automated tests.

Added automated tests for the example program to verify correct behavior
using different callback return codes and also using the new scan layer and
fmap API's.

Fixed a bug in ClamAV's evidence module (recording strong, PUA, and
weak indicators for each layer). Rust HashMaps are unordered so the
feature to get the last alert would return a random alert and not
specifically the last one. Switching to IndexMap resolves this, and
allows us to maintain insertion-order for iterating keys even when
removing a key.
Add string message to `cl_strerror()` and in example program for
cl_error_t::CL_BREAK enum variant.

Also fix uninitialized memory use issue in example program.
It is a shortcoming of existing scan APIs that it is not possible
to return an error without masking a verdict.
We presently work around this limitation by counting up detections at
the end and then overriding the error code with `CL_VIRUS`, if necessary.

The `cl_scanfile_ex()`, `cl_scandesc_ex()`, and `cl_scanmap_ex()` functions
should provide the scan verdict separately from the error code.

This introduces a new enum for recording and reporting a verdict:
`cl_verdict_t` with options:

- `CL_VERDICT_NOTHING_FOUND`
- `CL_VERDICT_TRUSTED`
- `CL_VERDICT_STRONG_INDICATOR`
- `CL_VERDICT_POTENTIALLY_UNWANTED`

Notably, the newer scan APIs may set the verdict to `CL_VERDICT_TRUSTED`
if there is a (hash-based) FP signature for a file, or in the cause where
Authenticode or similar certificate-based verification was performed, or
in the case where an application scan callback returned `CL_VERIFIED`.

CLAM-763
CLAM-865
Improvements to the ex_scan_callbacks.c program:
- Print the verdict enum variant names to be more explicit.
- Add the file_props callback (aka metadata JSON) with --gen-json option.
- Add a --debug option.
- Use '-' in option names instead of '_' to be consistent with other programs.
- Add option to disable allmatch, which I named --one-match. :)

Tests: Add ex_scan_callbacks test where --allmatch is disabled.
Verify that CL_VIRUS is returned when a match occurs.
I found a few bugs and inconsistencies from this test and went and fixed
them, and improved the clamav.h function comments as well.
Largely this resulted in cleanup in `cli_magic_scan()` to make sure we
don't accidentally overwrite the return code.
But it also meant making sure that callback functions which are supposed
to trust a file actually clear the evidence/verdict and don't return
CL_VIRUS.
Both enum variants are 0, so it's a no-op.

`cl_error_t` should be used to determine if we keep going or stop, whether
that's because there was a detection and we're not in allmatch mode, or if
because of an error.

`cl_verdict_t` should be used to determine or say the verdict (clean is a
verdict, though I feel 'nothing found' is more accurate).
Also minor fixes to sys.rs and clamav.h formatting
If the outermost layer is trusted (e.g. using an FP signature), the verdict passed
back by the `cl_scan*_ex()` functions should be CL_VERDICT_TRUSTED.
To ensure this, and other correct verdicts, I moved the logic setting the verdict
to occur when adding indicators, or trusting a given layer. Then at the end of a
scan, it will set the output verdict parameter to the top level verdict.

This commit also:

* Fixes a bug in the `ex_scan_callbacks` program where a crash would happen when
  a hash was retrieved for an inner layer, but isn't for the container.

* Added debug logs whenever a hash is calculated or set, printing the hash type
  and hash string.

* When a layer is trusted, in addition to removing evidence for that layer, it
  will also index the metadata JSON (if that feature is enabled) and will rename
  any "Viruses" to "IgnoredAlerts", and rename "ContainedIndicators" to
  "IgnoredContainedIndicators".

* Fixed an issue where setting the hash algorithm with extra characters, like
  setting to "sha256789" would ignore the extra characters, and report the hash
  type as the same. It will now fail if the string length differs from the known
  hash algorithm.
The current name "Viruses" is incorrect both because not all malware
are viruses, but also because ClamAV may be used to classify other
data, not just to search for malware indicators.

Renaming to "Alerts" is more consistent with other language such as
the options "--alert-exceeds-max", etc.
It will match the new "IgnoredAlerts", "ContainedAlerts", and
"IgnoredContainedAlerts" JSON key names.
We presently record Alerts as an array of signature names.
Instead, it should be an object with properties of its own.

We should record alerting indicators and weak indicators in a single
"Indicators", likely with the same structure as the "Alerts" objects.

When an alerting indicator is ignored (e.g. ignored by callback or if
the file is trusted by an FP signature), we can remove it from the
"Alerts" array, and for the "Indicators" array, add a "Ignored" key with
a string value that explains why it was ignored.

This eliminates the need to track and propagate the additional
"WeakIndicators" and "IgnoredAlerts" arrays.
@val-ms val-ms dismissed stale reviews from rsundriyal and TheRaynMan via b34ea5e August 15, 2025 02:42
@val-ms val-ms force-pushed the CLAM-1859-sha256-cache branch from 6c6b77c to b34ea5e Compare August 15, 2025 02:42
@val-ms val-ms requested a review from TheRaynMan August 15, 2025 02:42
TheRaynMan
TheRaynMan previously approved these changes Aug 15, 2025
Copy link
Contributor

@TheRaynMan TheRaynMan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still okay after rebase.

val-ms added 2 commits August 18, 2025 12:27
For OpenSSL 1, `EVP_get_digestbyname()` will fail with "sha2-*" algorithm names.
Must use "sha256", etc.

I made a shim that does the conversion, and I made an improvement to ignore case
when converting alg names to our hash type enumeration.

Other fixes for a few warnings.
@val-ms val-ms force-pushed the CLAM-1859-sha256-cache branch from b11cbf1 to 5314973 Compare August 18, 2025 16:27
@val-ms
Copy link
Contributor Author

val-ms commented Aug 18, 2025

Added openssl 1.x compatibility fix for debian:11 when using distro-provided dependencies. Force-push is to fix commit signing because apparently my macbook's signing isn't properly configured.

@val-ms val-ms merged commit 7e24507 into Cisco-Talos:main Aug 19, 2025
22 of 24 checks passed
@val-ms val-ms deleted the CLAM-1859-sha256-cache branch August 19, 2025 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Execute clambc -f 3986187.cbc return Segmentation fault (core dumped) ClamAV is unusable on FIPS-enabled Linux systems due to MD5 use

5 participants