diff --git a/docs/assets/images/hash_code_id_field.png b/docs/assets/images/hash_code_id_field.png
new file mode 100644
index 00000000000..af767a68493
Binary files /dev/null and b/docs/assets/images/hash_code_id_field.png differ
diff --git a/docs/assets/images/hash_code_status_column.png b/docs/assets/images/hash_code_status_column.png
new file mode 100644
index 00000000000..c2e4a06c3be
Binary files /dev/null and b/docs/assets/images/hash_code_status_column.png differ
diff --git a/docs/content/en/working_with_findings/finding_deduplication/deduplication_algorithms.md b/docs/content/en/working_with_findings/finding_deduplication/deduplication_algorithms.md
new file mode 100644
index 00000000000..f0efc473081
--- /dev/null
+++ b/docs/content/en/working_with_findings/finding_deduplication/deduplication_algorithms.md
@@ -0,0 +1,67 @@
+---
+title: "Deduplication Algorithms"
+description: "How DefectDojo identifies duplicates: Unique ID, Hash Code, Unique ID or Hash Code, Legacy"
+weight: 3
+---
+
+## Overview
+
+DefectDojo supports four deduplication algorithms that can be selected per parser (test type):
+
+- **Unique ID From Tool**: Uses the scanner-provided unique identifier.
+- **Hash Code**: Uses a configured set of fields to compute a hash.
+- **Unique ID From Tool or Hash Code**: Prefers the tool’s unique ID; falls back to the hash when no matching unique ID is found.
+- **Legacy**: Historical algorithm with multiple conditions; only available in the Open Source version.
+
+Algorithm selection per parser is controlled by `DEDUPLICATION_ALGORITHM_PER_PARSER` (see the [OS tuning page](deduplication_tuning_os) for configuration details).
+
+## How endpoints are assessed per algorithm
+
+Endpoints can influence deduplication in different ways depending on the algorithm and configuration.
+
+### Unique ID From Tool
+
+- Deduplication uses `unique_id_from_tool` (or `vuln_id_from_tool`).
+- **Endpoints are ignored** for duplicate matching.
+- A finding’s hash may still be calculated for other features, but it does not affect deduplication under this algorithm.
+
+### Hash Code
+
+- Deduplication uses a hash computed from the fields specified by `HASHCODE_FIELDS_PER_SCANNER` for the given parser.
+- The hash also includes the fields from `HASH_CODE_FIELDS_ALWAYS` (see the Service field section below; a conceptual hashing sketch appears after the algorithm subsections).
+- Endpoints can affect deduplication in two ways:
+  - If the scanner’s hash fields include `endpoints`, they are part of the hash and must match accordingly.
+  - If the scanner’s hash fields do not include `endpoints`, optional endpoint-based matching can be enabled via `DEDUPE_ALGO_ENDPOINT_FIELDS` (OS setting). When configured:
+    - Set it to an empty list `[]` to ignore endpoints entirely.
+    - Set it to a list of endpoint attributes (e.g. `["host", "port"]`). If at least one endpoint pair between the two findings matches on all listed attributes, deduplication can occur.
+
+### Unique ID From Tool or Hash Code
+
+- Intended flow:
+  1) Try to deduplicate using the tool’s unique ID (endpoints are ignored on this path).
+  2) If there is no match by unique ID, fall back to the Hash Code path.
+- When falling back to hash code, endpoint behavior is identical to the Hash Code algorithm.
+
+### Legacy (OS only)
+
+- Deduplication considers multiple attributes, including endpoints.
+- Behavior differs for static vs dynamic findings:
+  - **Static findings**: The new finding must contain all endpoints of the original. Extra endpoints on the new finding are allowed.
+  - **Dynamic findings**: Endpoints must strictly match (commonly by host and port); differing endpoints prevent deduplication.
+- If there are no endpoints and both `file_path` and `line` are empty, deduplication typically does not occur.
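+
+To make the Hash Code path concrete, the sketch below shows how a fingerprint could be derived from a configured field list plus the always-included fields. The field names, separator, and use of SHA-256 are illustrative assumptions, not DefectDojo’s actual hashing code; the point is that two findings differing only in an always-included field such as `service` get different hashes.
+
+```python
+# Conceptual sketch only -- not DefectDojo's implementation.
+import hashlib
+
+
+def conceptual_hash(finding: dict, hash_fields: list[str], always_fields: list[str]) -> str:
+    # Concatenate the configured per-scanner fields plus the always-included fields.
+    values = [str(finding.get(field, "")) for field in hash_fields + always_fields]
+    return hashlib.sha256("|".join(values).encode("utf-8")).hexdigest()
+
+
+finding_a = {"title": "SQL Injection", "severity": "High", "service": "api"}
+finding_b = {"title": "SQL Injection", "severity": "High", "service": "web"}
+# Different "service" values -> different hashes, so no hash-based dedupe.
+print(conceptual_hash(finding_a, ["title", "severity"], ["service"]))
+print(conceptual_hash(finding_b, ["title", "severity"], ["service"]))
+```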
+
+## Background processing
+
+- Dedupe is triggered on import/reimport and during certain updates; it runs in the background via Celery.
+
+## Service field and its impact
+
+- By default, `HASH_CODE_FIELDS_ALWAYS = ["service"]`, meaning the `service` associated with a finding is appended to the hash for all scanners.
+- Practical implications:
+  - Two otherwise identical findings with different `service` values will produce different hashes and will not deduplicate under hash-based paths.
+  - During import/reimport, the `Service` field entered in the UI can override the parser-provided service. Changing it can change the hash and therefore affect deduplication outcomes.
+  - If you want service to have no impact on deduplication, configure `HASH_CODE_FIELDS_ALWAYS` accordingly (see the OS tuning page). Removing `service` from the always-included list stops it from affecting hashes.
+
+See also: the [Open Source tuning guide](deduplication_tuning_os) for configuration details and examples.
+
+
diff --git a/docs/content/en/working_with_findings/finding_deduplication/deduplication_tuning_os.md b/docs/content/en/working_with_findings/finding_deduplication/deduplication_tuning_os.md
new file mode 100644
index 00000000000..162b683d4c0
--- /dev/null
+++ b/docs/content/en/working_with_findings/finding_deduplication/deduplication_tuning_os.md
@@ -0,0 +1,147 @@
+---
+title: "Deduplication Tuning (Open Source)"
+description: "Configure deduplication in DefectDojo Open Source: algorithms, hash fields, endpoints, and service"
+weight: 5
+---
+
+This page explains how to tune deduplication in the Open Source (OS) edition of DefectDojo. For a visual, feature-rich tuning UI, see the Pro documentation. The OS edition uses settings files and environment variables.
+
+See also: [Configuration](../../open_source/installation/configuration) for details on environment variables and `local_settings.py` overrides.
+
+## What you can configure
+
+- **Algorithm per parser**: Choose one of Unique ID From Tool, Hash Code, Unique ID From Tool or Hash Code, or Legacy (OS only).
+- **Hash fields per scanner**: Decide which fields contribute to the hash for each parser.
+- **Allow null CWE**: Control whether a missing/zero CWE is acceptable when hashing.
+- **Endpoint consideration**: Optionally use endpoints for deduplication when they’re not part of the hash.
+- **Always-included fields**: Add fields (e.g., `service`) to all hashes regardless of per-scanner settings.
+
+## Key settings (defaults shown)
+
+All defaults are defined in `dojo/settings/settings.dist.py`. Override them via environment variables or `local_settings.py`.
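+
+As an illustration, a minimal `local_settings.py` override might look like the sketch below. The parser name and field list are examples only, and the sketch assumes the default layout in which `local_settings.py` is included after `settings.dist.py`, so the default dictionaries are already in scope.
+
+```python
+# local_settings.py -- hedged sketch of overriding dedupe defaults.
+# Assumes the defaults from settings.dist.py are already loaded.
+
+# Use hash-based dedupe for one parser (example parser name).
+DEDUPLICATION_ALGORITHM_PER_PARSER["ZAP Scan"] = "hash_code"
+
+# Use a custom field list for that parser's hash.
+HASHCODE_FIELDS_PER_SCANNER["ZAP Scan"] = ["title", "cwe", "severity"]
+
+# Stop the service field from influencing hashes for all scanners.
+HASH_CODE_FIELDS_ALWAYS = []
+```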
+
+### Algorithm per parser
+
+- Setting: `DEDUPLICATION_ALGORITHM_PER_PARSER`
+- Values per parser: one of `unique_id_from_tool`, `hash_code`, `unique_id_from_tool_or_hash_code`, `legacy`.
+- Example (env variable JSON string):
+
+```bash
+DD_DEDUPLICATION_ALGORITHM_PER_PARSER='{"Trivy Scan": "hash_code", "Veracode Scan": "unique_id_from_tool_or_hash_code"}'
+```
+
+### Hash fields per scanner
+
+- Setting: `HASHCODE_FIELDS_PER_SCANNER`
+- Example defaults in OS:
+
+```python
+# dojo/settings/settings.dist.py, lines 1318-1321
+    "Trivy Operator Scan": ["title", "severity", "vulnerability_ids", "description"],
+    "Trivy Scan": ["title", "severity", "vulnerability_ids", "cwe", "description"],
+    "TFSec Scan": ["severity", "vuln_id_from_tool", "file_path", "line"],
+    "Snyk Scan": ["vuln_id_from_tool", "file_path", "component_name", "component_version"],
+```
+
+- Override example (env variable JSON string):
+
+```bash
+DD_HASHCODE_FIELDS_PER_SCANNER='{"ZAP Scan":["title","cwe","severity"],"Trivy Scan":["title","severity","vulnerability_ids","description"]}'
+```
+
+### Allow null CWE per scanner
+
+- Setting: `HASHCODE_ALLOWS_NULL_CWE`
+- Controls per parser whether a null/zero CWE is acceptable in hashing. If set to False and the finding has `cwe = 0`, the hash falls back to the legacy computation for that finding.
+
+### Always-included fields in hash
+
+- Setting: `HASH_CODE_FIELDS_ALWAYS`
+- Default: `["service"]`
+- Impact: Appended to the hash for every scanner. Removing `service` here stops it from affecting hashes across the board.
+
+```python
+# dojo/settings/settings.dist.py, lines 1464-1466
+# Adding fields to the hash_code calculation regardless of the previous settings
+HASH_CODE_FIELDS_ALWAYS = ["service"]
+```
+
+### Optional endpoint-based dedupe
+
+- Setting: `DEDUPE_ALGO_ENDPOINT_FIELDS`
+- Default: `["host", "path"]`
+- Purpose: If endpoints are not part of the hash fields, you can still require a minimal endpoint match to deduplicate. If the list is empty `[]`, endpoints are ignored on the dedupe path.
+
+```python
+# dojo/settings/settings.dist.py, lines 1491-1499
+# Allows to deduplicate with endpoints if endpoints is not included in the hashcode.
+# Possible values are: scheme, host, port, path, query, fragment, userinfo, and user.
+# If a finding has more than one endpoint, only one endpoint pair must match to mark the finding as duplicate.
+DEDUPE_ALGO_ENDPOINT_FIELDS = ["host", "path"]
+```
+
+## Endpoints: how to tune
+
+Endpoints can affect deduplication via two mechanisms:
+
+1) Include `endpoints` in `HASHCODE_FIELDS_PER_SCANNER` for a parser. Then endpoints are part of the hash and must match exactly according to the parser’s hashing rules.
+2) If endpoints are not in the hash fields, use `DEDUPE_ALGO_ENDPOINT_FIELDS` to specify the attributes to compare (see the sketch at the end of this section). Examples:
+   - `[]`: endpoints are ignored for dedupe.
+   - `["host"]`: findings dedupe if any endpoint pair matches by host.
+   - `["host", "port"]`: findings dedupe if any endpoint pair matches by host AND port.
+
+Notes:
+
+- For the Legacy algorithm, static vs dynamic findings have different endpoint matching rules (see the algorithms page). The `DEDUPE_ALGO_ENDPOINT_FIELDS` setting applies to the hash-code path, not the Legacy algorithm’s intrinsic logic.
+- For `unique_id_from_tool` (ID-based) matching, endpoints are ignored for the dedupe decision.
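+
+The sketch below illustrates the matching rule described above: two findings can deduplicate on the hash-code path if at least one endpoint pair matches on every attribute listed in `DEDUPE_ALGO_ENDPOINT_FIELDS`. The function and the sample endpoints are illustrative assumptions, not DefectDojo’s actual code.
+
+```python
+# Illustrative sketch only -- not DefectDojo's implementation.
+def endpoints_match(endpoints_a: list[dict], endpoints_b: list[dict], fields: list[str]) -> bool:
+    if not fields:
+        return True  # empty list: endpoints are ignored on the dedupe path
+    # At least one endpoint pair must agree on every listed attribute.
+    return any(
+        all(ep_a.get(field) == ep_b.get(field) for field in fields)
+        for ep_a in endpoints_a
+        for ep_b in endpoints_b
+    )
+
+
+a = [{"host": "app.example.com", "port": 443, "path": "/login"}]
+b = [{"host": "app.example.com", "port": 443, "path": "/admin"}]
+print(endpoints_match(a, b, ["host", "port"]))  # True: host and port match
+print(endpoints_match(a, b, ["host", "path"]))  # False: paths differ
+```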
+
+## Service field: dedupe and reimport
+
+- With the default `HASH_CODE_FIELDS_ALWAYS = ["service"]`, the `service` field is appended to the hash. Two otherwise equal findings with different `service` values will not dedupe on hash-based paths.
+- During import via UI/API, the `Service` input can override the parser-provided service. Changing it changes the hash and can alter dedupe behavior and reimport matching.
+- If you want dedupe to be independent of service, remove `service` from `HASH_CODE_FIELDS_ALWAYS` or leave the `Service` field empty during import.
+
+## After changing deduplication settings
+
+- Changes to dedupe configuration (e.g., `HASHCODE_FIELDS_PER_SCANNER`, `HASH_CODE_FIELDS_ALWAYS`, `DEDUPLICATION_ALGORITHM_PER_PARSER`) are not automatically applied retroactively. To re-evaluate existing findings, run the management command below.
+
+Run it inside the uwsgi container. Example (hash codes only, no dedupe):
+
+```bash
+docker compose exec uwsgi /bin/bash -c "python manage.py dedupe --hash_code_only"
+```
+
+Help/usage:
+
+```text
+options:
+  --parser PARSER      List of parsers for which hash_code needs recomputing
+                       (defaults to all parsers)
+  --hash_code_only     Only compute hash codes
+  --dedupe_only        Only run deduplication
+  --dedupe_sync        Run dedupe in the foreground, default false
+```
+
+If you submit dedupe to Celery (without `--dedupe_sync`), allow time for the tasks to complete before evaluating results.
+
+## Where to configure
+
+- Prefer environment variables in deployments. For local development or advanced overrides, use `local_settings.py`.
+- See `configuration.md` for details on how to set environment variables and configure local overrides.
+
+### Troubleshooting
+
+To help troubleshoot deduplication, use the following tools:
+
+- Observe the log output in the `dojo.specific-loggers.deduplication` category. This is a dedicated logger, independent of the module-level loggers, that outputs details about the deduplication process and the settings in effect when findings are processed (a sketch for enabling verbose output appears at the end of this page).
+- Observe the `unique_id_from_tool` and `hash_code` values by hovering over the `ID` field or `Status` column:
+
+![Unique ID from Tool and Hash Code on the View Finding page](images/hash_code_id_field.png)
+
+![Unique ID from Tool and Hash Code on the Finding List Status Column](images/hash_code_status_column.png)
+
+## Related documentation
+
+- [Deduplication Algorithms](deduplication_algorithms): conceptual overview and endpoint behavior.
+- [Avoiding duplicates via reimport](avoiding_duplicates_via_reimport).
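+
+If the deduplication logger is not already verbose in your deployment, a `local_settings.py` tweak along the lines of the sketch below could raise its level. It assumes the standard `LOGGING` dictionary from `settings.dist.py` is in scope and that a `console` handler exists; adjust the handler names to your deployment.
+
+```python
+# local_settings.py -- hedged sketch: verbose output for the dedupe logger.
+# Assumes the LOGGING dict from settings.dist.py is already loaded and a
+# "console" handler is configured.
+LOGGING.setdefault("loggers", {})["dojo.specific-loggers.deduplication"] = {
+    "handlers": ["console"],
+    "level": "DEBUG",
+    "propagate": False,
+}
+```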
+ + diff --git a/unittests/test_deduplication_logic.py b/unittests/test_deduplication_logic.py index 82ecfb177dd..1be76d911ce 100644 --- a/unittests/test_deduplication_logic.py +++ b/unittests/test_deduplication_logic.py @@ -819,6 +819,45 @@ def test_identical_different_endpoints_unique_id(self): # expect duplicate, as endpoints shouldn't affect dedupe and hash_code due to unique_id self.assert_finding(finding_new, not_pk=124, duplicate=True, duplicate_finding_id=124, hash_code=finding_124.hash_code) + def test_identical_endpoints_unique_id(self): + # create identical copy and add the same endpoint to both original and new + finding_124 = Finding.objects.get(id=124) + ep_o = Endpoint(product=finding_124.test.engagement.product, finding=finding_124, host="samehost.com", protocol="https") + ep_o.save() + finding_124.endpoints.add(ep_o) + finding_124.save(dedupe_option=False) + + finding_new, finding_124 = self.copy_and_reset_finding(find_id=124) + finding_new.save(dedupe_option=False) + ep_n = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="samehost.com", protocol="https") + ep_n.save() + finding_new.endpoints.add(ep_n) + finding_new.save() + + # expect duplicate: unique_id match dominates regardless of identical endpoints + self.assert_finding(finding_new, not_pk=124, duplicate=True, duplicate_finding_id=124, hash_code=finding_124.hash_code) + + def test_extra_endpoints_unique_id(self): + # add endpoints to original and more endpoints to new + finding_124 = Finding.objects.get(id=124) + ep1 = Endpoint(product=finding_124.test.engagement.product, finding=finding_124, host="base1.com", protocol="https") + ep1.save() + finding_124.endpoints.add(ep1) + finding_124.save(dedupe_option=False) + + finding_new, finding_124 = self.copy_and_reset_finding(find_id=124) + finding_new.save(dedupe_option=False) + ep2 = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="base1.com", protocol="https") + ep2.save() + ep3 = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="extra.com", protocol="https") + ep3.save() + finding_new.endpoints.add(ep2) + finding_new.endpoints.add(ep3) + finding_new.save() + + # expect duplicate: unique_id match regardless of extra endpoints + self.assert_finding(finding_new, not_pk=124, duplicate=True, duplicate_finding_id=124, hash_code=finding_124.hash_code) + # algo unique_id_or_hash_code Veracode scan def test_identical_unique_id_or_hash_code(self): @@ -829,6 +868,66 @@ def test_identical_unique_id_or_hash_code(self): # expect duplicate as uid matches self.assert_finding(finding_new, not_pk=224, duplicate=True, duplicate_finding_id=224, hash_code=finding_224.hash_code) + def test_identical_endpoints_unique_id_or_hash_code(self): + # add identical endpoints to original and new; uid match should dedupe + finding_224 = Finding.objects.get(id=224) + ep_o = Endpoint(product=finding_224.test.engagement.product, finding=finding_224, host="endpoint.same.com", protocol="https") + ep_o.save() + finding_224.endpoints.add(ep_o) + finding_224.save(dedupe_option=False) + + finding_new, finding_224 = self.copy_and_reset_finding(find_id=224) + finding_new.save(dedupe_option=False) + ep_n = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="endpoint.same.com", protocol="https") + ep_n.save() + finding_new.endpoints.add(ep_n) + finding_new.save() + + self.assert_finding(finding_new, not_pk=224, duplicate=True, duplicate_finding_id=224, hash_code=finding_224.hash_code) + + 
def test_extra_endpoints_unique_id_or_hash_code(self): + # add endpoint to original; add original + extra endpoint to new; uid match should dedupe + finding_224 = Finding.objects.get(id=224) + ep_o = Endpoint(product=finding_224.test.engagement.product, finding=finding_224, host="endpoint.base.com", protocol="https") + ep_o.save() + finding_224.endpoints.add(ep_o) + finding_224.save(dedupe_option=False) + + finding_new, finding_224 = self.copy_and_reset_finding(find_id=224) + finding_new.save(dedupe_option=False) + ep_n1 = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="endpoint.base.com", protocol="https") + ep_n1.save() + ep_n2 = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="endpoint.extra.com", protocol="https") + ep_n2.save() + finding_new.endpoints.add(ep_n1) + finding_new.endpoints.add(ep_n2) + finding_new.save() + + self.assert_finding(finding_new, not_pk=224, duplicate=True, duplicate_finding_id=224, hash_code=finding_224.hash_code) + + def test_intersect_endpoints_unique_id_or_hash_code(self): + # original has two endpoints; new has one overlapping and one different; uid match should dedupe + finding_224 = Finding.objects.get(id=224) + ep_o1 = Endpoint(product=finding_224.test.engagement.product, finding=finding_224, host="ep1.com", protocol="https") + ep_o1.save() + ep_o2 = Endpoint(product=finding_224.test.engagement.product, finding=finding_224, host="ep2.com", protocol="https") + ep_o2.save() + finding_224.endpoints.add(ep_o1) + finding_224.endpoints.add(ep_o2) + finding_224.save(dedupe_option=False) + + finding_new, finding_224 = self.copy_and_reset_finding(find_id=224) + finding_new.save(dedupe_option=False) + ep_n1 = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="ep2.com", protocol="https") + ep_n1.save() + ep_n2 = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="ep3.com", protocol="https") + ep_n2.save() + finding_new.endpoints.add(ep_n1) + finding_new.endpoints.add(ep_n2) + finding_new.save() + + self.assert_finding(finding_new, not_pk=224, duplicate=True, duplicate_finding_id=224, hash_code=finding_224.hash_code) + # existing BUG? 
finding gets matched on hash code, while there is also an existing finding with matching unique_id_from_tool def test_identical_unique_id_or_hash_code_bug(self): # create identical copy @@ -858,6 +957,92 @@ def test_different_unique_id_unique_id_or_hash_code(self): # expect duplicate, uid mismatch, but same hash_code self.assert_finding(finding_new, not_pk=224, duplicate=False, not_hash_code=finding_224.hash_code) + def test_uid_mismatch_hash_match_identical_endpoints_unique_id_or_hash_code(self): + # Force UID mismatch but ensure hash matches; set identical endpoints on both + dedupe_algo_endpoint_fields = settings.DEDUPE_ALGO_ENDPOINT_FIELDS + settings.DEDUPE_ALGO_ENDPOINT_FIELDS = ["host", "port"] + + finding_224 = Finding.objects.get(id=224) + # add endpoint to original + ep_o = Endpoint(product=finding_224.test.engagement.product, finding=finding_224, host="same.com", protocol="https") + ep_o.save() + finding_224.endpoints.add(ep_o) + finding_224.save(dedupe_option=False) + + # create new with same title/desc to keep hash same, different uid + finding_new, finding_224 = self.copy_and_reset_finding(find_id=224) + finding_new.unique_id_from_tool = "DIFF-UID" + finding_new.save(dedupe_option=False) + ep_n = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="same.com", protocol="https") + ep_n.save() + finding_new.endpoints.add(ep_n) + finding_new.save() + + # expect duplicate via hash path despite UID mismatch and identical endpoints + self.assert_finding(finding_new, not_pk=224, duplicate=True, duplicate_finding_id=finding_224.id, hash_code=finding_224.hash_code) + + # reset + settings.DEDUPE_ALGO_ENDPOINT_FIELDS = dedupe_algo_endpoint_fields + + def test_uid_mismatch_hash_match_extra_endpoints_unique_id_or_hash_code(self): + # Force UID mismatch but ensure hash matches; new has extra endpoints + dedupe_algo_endpoint_fields = settings.DEDUPE_ALGO_ENDPOINT_FIELDS + settings.DEDUPE_ALGO_ENDPOINT_FIELDS = ["host", "port"] + + finding_224 = Finding.objects.get(id=224) + ep_o = Endpoint(product=finding_224.test.engagement.product, finding=finding_224, host="base.com", protocol="https") + ep_o.save() + finding_224.endpoints.add(ep_o) + finding_224.save(dedupe_option=False) + + finding_new, finding_224 = self.copy_and_reset_finding(find_id=224) + finding_new.unique_id_from_tool = "DIFF-UID" + finding_new.save(dedupe_option=False) + ep_n1 = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="base.com", protocol="https") + ep_n1.save() + ep_n2 = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="extra.com", protocol="https") + ep_n2.save() + finding_new.endpoints.add(ep_n1) + finding_new.endpoints.add(ep_n2) + finding_new.save() + + # expect duplicate via hash path despite UID mismatch and extra endpoints + self.assert_finding(finding_new, not_pk=224, duplicate=True, duplicate_finding_id=finding_224.id, hash_code=finding_224.hash_code) + + # reset + settings.DEDUPE_ALGO_ENDPOINT_FIELDS = dedupe_algo_endpoint_fields + + def test_uid_mismatch_hash_match_intersect_endpoints_unique_id_or_hash_code(self): + # Force UID mismatch but ensure hash matches; endpoints partially overlap + dedupe_algo_endpoint_fields = settings.DEDUPE_ALGO_ENDPOINT_FIELDS + settings.DEDUPE_ALGO_ENDPOINT_FIELDS = ["host", "port"] + + finding_224 = Finding.objects.get(id=224) + ep_o1 = Endpoint(product=finding_224.test.engagement.product, finding=finding_224, host="ep1.com", protocol="https") + ep_o1.save() + ep_o2 = 
Endpoint(product=finding_224.test.engagement.product, finding=finding_224, host="ep2.com", protocol="https") + ep_o2.save() + finding_224.endpoints.add(ep_o1) + finding_224.endpoints.add(ep_o2) + finding_224.save(dedupe_option=False) + + finding_new, finding_224 = self.copy_and_reset_finding(find_id=224) + finding_new.unique_id_from_tool = "DIFF-UID" + finding_new.save(dedupe_option=False) + ep_n1 = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="ep2.com", protocol="https") + ep_n1.save() + ep_n2 = Endpoint(product=finding_new.test.engagement.product, finding=finding_new, host="ep3.com", protocol="https") + ep_n2.save() + finding_new.endpoints.add(ep_n1) + finding_new.endpoints.add(ep_n2) + finding_new.save() + + # expect duplicate via hash path despite UID mismatch and intersecting endpoints + self.assert_finding(finding_new, not_pk=224, duplicate=True, duplicate_finding_id=finding_224.id, hash_code=finding_224.hash_code) + + # reset + settings.DEDUPE_ALGO_ENDPOINT_FIELDS = dedupe_algo_endpoint_fields + def test_identical_ordering_unique_id_or_hash_code(self): # create identical copy finding_new, finding_225 = self.copy_and_reset_finding(find_id=225)