
Commit 8a9a3da

Add tests and documentation for deduplication algorithms (#13464)
* deduplication logic: add missing tests
* deduplication logic: add docs
1 parent e172143 commit 8a9a3da

File tree: 5 files changed, +399 -0 lines changed

Two binary files (images) added: 62.8 KB and 54.5 KB.
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
---
title: "Deduplication Algorithms"
description: "How DefectDojo identifies duplicates: Unique ID, Hash Code, Unique ID or Hash Code, Legacy"
weight: 3
---

## Overview

DefectDojo supports four deduplication algorithms that can be selected per parser (test type):

- **Unique ID From Tool**: Uses the scanner-provided unique identifier.
- **Hash Code**: Uses a configured set of fields to compute a hash.
- **Unique ID From Tool or Hash Code**: Prefers the tool’s unique ID and falls back to the hash when no matching unique ID is found.
- **Legacy**: Historical algorithm with multiple conditions; only available in the Open Source version.

Algorithm selection per parser is controlled by `DEDUPLICATION_ALGORITHM_PER_PARSER` (see the [OS tuning page](deduplication_tuning_os) for configuration details).

## How endpoints are assessed per algorithm

Endpoints can influence deduplication in different ways depending on the algorithm and configuration.

### Unique ID From Tool

- Deduplication uses `unique_id_from_tool` (or `vuln_id_from_tool`).
- **Endpoints are ignored** for duplicate matching.
- A finding’s hash may still be calculated for other features, but it does not affect deduplication under this algorithm.

### Hash Code

- Deduplication uses a hash computed from the fields specified by `HASHCODE_FIELDS_PER_SCANNER` for the given parser.
- The hash also includes fields from `HASH_CODE_FIELDS_ALWAYS` (see the Service field section below).
- Endpoints can affect deduplication in two ways:
  - If the scanner’s hash fields include `endpoints`, they are part of the hash and must match accordingly.
  - If the scanner’s hash fields do not include `endpoints`, optional endpoint-based matching can be enabled via `DEDUPE_ALGO_ENDPOINT_FIELDS` (OS setting). When configured:
    - Set it to an empty list `[]` to ignore endpoints entirely.
    - Set it to a list of endpoint attributes (e.g. `["host", "port"]`). If at least one endpoint pair between the two findings matches on all listed attributes, deduplication can occur.
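
To make the hash mechanics concrete, here is a minimal sketch of field-based hashing. It is illustrative only: the field names, joining scheme, and hash function are assumptions, not DefectDojo’s actual implementation.

```python
import hashlib

# Illustrative per-parser hash fields, mirroring HASHCODE_FIELDS_PER_SCANNER,
# plus the always-included fields from HASH_CODE_FIELDS_ALWAYS (assumed shapes).
hash_fields = ["title", "severity", "description"]
always_fields = ["service"]

def compute_hash_code(finding: dict) -> str:
    # Join the configured field values in a fixed order and hash the result.
    values = [str(finding.get(field, "")) for field in hash_fields + always_fields]
    return hashlib.sha256("|".join(values).encode("utf-8")).hexdigest()

finding_a = {"title": "XSS", "severity": "High", "description": "Reflected", "service": "api"}
finding_b = {"title": "XSS", "severity": "High", "description": "Reflected", "service": "web"}

# Findings that differ only in `service` hash differently, so they
# would not deduplicate on a hash-based path.
print(compute_hash_code(finding_a) == compute_hash_code(finding_b))  # False
```
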
### Unique ID From Tool or Hash Code

- Intended flow:
  1) Try to deduplicate using the tool’s unique ID (endpoints ignored on this path).
  2) If no match by unique ID, fall back to the Hash Code path.
- When falling back to hash code, endpoint behavior is identical to the Hash Code algorithm.
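
As a rough sketch of that decision order (a hypothetical helper, not the actual code path):

```python
from typing import Optional

def find_duplicate(new: dict, candidates: list[dict]) -> Optional[dict]:
    # 1) Prefer the tool's unique ID; endpoints play no role on this path.
    uid = new.get("unique_id_from_tool")
    if uid:
        for candidate in candidates:
            if candidate.get("unique_id_from_tool") == uid:
                return candidate
    # 2) No match by unique ID: fall back to hash-code comparison.
    hash_code = new.get("hash_code")
    if hash_code:
        for candidate in candidates:
            if candidate.get("hash_code") == hash_code:
                return candidate
    return None
```
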
### Legacy (OS only)

- Deduplication considers multiple attributes, including endpoints.
- Behavior differs for static vs. dynamic findings:
  - **Static findings**: The new finding must contain all endpoints of the original. Extra endpoints on the new finding are allowed.
  - **Dynamic findings**: Endpoints must strictly match (commonly by host and port); differing endpoints prevent deduplication.
- If there are no endpoints and both `file_path` and `line` are empty, deduplication typically does not occur.
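
The endpoint rules can be illustrated with plain set operations (hypothetical data, assuming endpoints compare as host/port pairs):

```python
# Endpoints as (host, port) pairs; purely illustrative.
original_endpoints = {("example.com", 443), ("example.com", 8443)}
new_endpoints = {("example.com", 443), ("example.com", 8443), ("example.com", 80)}

# Static findings: the new finding must contain all endpoints of the
# original; extra endpoints on the new finding are allowed.
static_match = original_endpoints.issubset(new_endpoints)  # True

# Dynamic findings: the endpoint sets must match strictly, so the
# extra endpoint on the new finding prevents deduplication.
dynamic_match = original_endpoints == new_endpoints  # False
```
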
## Background processing

- Dedupe is triggered on import/reimport and during certain updates; it runs via Celery in the background.

## Service field and its impact

- By default, `HASH_CODE_FIELDS_ALWAYS = ["service"]`, meaning the `service` associated with a finding is appended to the hash for all scanners.
- Practical implications:
  - Two otherwise identical findings with different `service` values will produce different hashes and will not deduplicate under hash-based paths.
  - During import/reimport, the `Service` field entered in the UI can override the parser-provided service. Changing it can change the hash and therefore affect deduplication outcomes.
  - If you want service to have no impact on deduplication, configure `HASH_CODE_FIELDS_ALWAYS` accordingly (see the OS tuning page). Removing `service` from the always-included list stops it from affecting hashes.

See also: the [Open Source tuning guide](deduplication_tuning_os) for configuration details and examples.

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
---
title: "Deduplication Tuning (Open Source)"
description: "Configure deduplication in DefectDojo Open Source: algorithms, hash fields, endpoints, and service"
weight: 5
---

This page explains how to tune deduplication in the Open Source (OS) edition of DefectDojo. For a visual, feature-rich tuning UI, see the Pro documentation. The OS edition uses settings files and environment variables.

See also: [Configuration](../../open_source/installation/configuration) for details on environment variables and `local_settings.py` overrides.

## What you can configure

- **Algorithm per parser**: Choose one of Unique ID From Tool, Hash Code, Unique ID From Tool or Hash Code, or Legacy (OS only).
- **Hash fields per scanner**: Decide which fields contribute to the hash for each parser.
- **Allow null CWE**: Control whether a missing/zero CWE is acceptable when hashing.
- **Endpoint consideration**: Optionally use endpoints for deduplication when they’re not part of the hash.
- **Always-included fields**: Add fields (e.g., `service`) to all hashes regardless of per-scanner settings.

## Key settings (defaults shown)

All defaults are defined in `dojo/settings/settings.dist.py`. Override them via environment variables or `local_settings.py`.

### Algorithm per parser

- Setting: `DEDUPLICATION_ALGORITHM_PER_PARSER`
- Values per parser: one of `unique_id_from_tool`, `hash_code`, `unique_id_from_tool_or_hash_code`, `legacy`.
- Example (env variable JSON string):

```bash
DD_DEDUPLICATION_ALGORITHM_PER_PARSER='{"Trivy Scan": "hash_code", "Veracode Scan": "unique_id_from_tool_or_hash_code"}'
```
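
The same override can also be expressed in `local_settings.py`; a minimal sketch, assuming `local_settings.py` is loaded after the defaults and the keys match the parsers’ scan type names exactly:

```python
# local_settings.py -- adjust per-parser algorithms (sketch)
DEDUPLICATION_ALGORITHM_PER_PARSER["Trivy Scan"] = "hash_code"
DEDUPLICATION_ALGORITHM_PER_PARSER["Veracode Scan"] = "unique_id_from_tool_or_hash_code"
```
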
### Hash fields per scanner

- Setting: `HASHCODE_FIELDS_PER_SCANNER`
- Example defaults in OS (from `dojo/settings/settings.dist.py`, lines 1318-1321):

```python
"Trivy Operator Scan": ["title", "severity", "vulnerability_ids", "description"],
"Trivy Scan": ["title", "severity", "vulnerability_ids", "cwe", "description"],
"TFSec Scan": ["severity", "vuln_id_from_tool", "file_path", "line"],
"Snyk Scan": ["vuln_id_from_tool", "file_path", "component_name", "component_version"],
```

- Override example (env variable JSON string):

```bash
DD_HASHCODE_FIELDS_PER_SCANNER='{"ZAP Scan":["title","cwe","severity"],"Trivy Scan":["title","severity","vulnerability_ids","description"]}'
```

### Allow null CWE per scanner

- Setting: `HASHCODE_ALLOWS_NULL_CWE`
- Controls per parser whether a null/zero CWE is acceptable in hashing. If set to `False` and the finding has `cwe = 0`, the hash falls back to the legacy computation for that finding.

### Always-included fields in hash

- Setting: `HASH_CODE_FIELDS_ALWAYS`
- Default: `["service"]`
- Impact: Appended to the hash for every scanner. Removing `service` here stops it from affecting hashes across the board.

```python
# dojo/settings/settings.dist.py, lines 1464-1466
# Adding fields to the hash_code calculation regardless of the previous settings
HASH_CODE_FIELDS_ALWAYS = ["service"]
```

### Optional endpoint-based dedupe

- Setting: `DEDUPE_ALGO_ENDPOINT_FIELDS`
- Default: `["host", "path"]`
- Purpose: If endpoints are not part of the hash fields, you can still require a minimal endpoint match to deduplicate. If the list is empty (`[]`), endpoints are ignored on the dedupe path.

```python
# dojo/settings/settings.dist.py, lines 1491-1499
# Allows to deduplicate with endpoints if endpoints is not included in the hashcode.
# Possible values are: scheme, host, port, path, query, fragment, userinfo, and user.
# If a finding has more than one endpoint, only one endpoint pair must match to mark the finding as duplicate.
DEDUPE_ALGO_ENDPOINT_FIELDS = ["host", "path"]
```

## Endpoints: how to tune

Endpoints can affect deduplication via two mechanisms:

1) Include `endpoints` in `HASHCODE_FIELDS_PER_SCANNER` for a parser. Then endpoints are part of the hash and must match exactly according to the parser’s hashing rules.
2) If endpoints are not in the hash fields, use `DEDUPE_ALGO_ENDPOINT_FIELDS` to specify the attributes to compare (see the sketch after this list). Examples:
   - `[]`: endpoints are ignored for dedupe.
   - `["host"]`: findings dedupe if any endpoint pair matches by host.
   - `["host", "port"]`: findings dedupe if any endpoint pair matches by host AND port.
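
A minimal sketch of the “any endpoint pair matches on all listed attributes” rule; the helper and data are hypothetical, for illustration only:

```python
from itertools import product

def endpoints_match(eps_a: list[dict], eps_b: list[dict], fields: list[str]) -> bool:
    # One endpoint pair matching on all configured attributes is enough
    # to consider the two findings endpoint-equivalent.
    return any(
        all(a.get(field) == b.get(field) for field in fields)
        for a, b in product(eps_a, eps_b)
    )

a = [{"host": "example.com", "port": 443, "path": "login"}]
b = [{"host": "example.com", "port": 8443, "path": "login"}]

print(endpoints_match(a, b, ["host"]))          # True: hosts agree
print(endpoints_match(a, b, ["host", "port"]))  # False: ports differ
```
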
Notes:

- For the Legacy algorithm, static vs. dynamic findings have different endpoint matching rules (see the algorithms page). The `DEDUPE_ALGO_ENDPOINT_FIELDS` setting applies to the hash-code path, not to the Legacy algorithm’s intrinsic logic.
- For `unique_id_from_tool` (ID-based) matching, endpoints are ignored for the dedupe decision.

## Service field: dedupe and reimport

- With the default `HASH_CODE_FIELDS_ALWAYS = ["service"]`, the `service` field is appended to the hash. Two otherwise equal findings with different `service` values will not dedupe on hash-based paths.
- During import via UI/API, the `Service` input can override the parser-provided service. Changing it changes the hash and can alter dedupe behavior and reimport matching.
- If you want dedupe to be independent of service, remove `service` from `HASH_CODE_FIELDS_ALWAYS` or leave the `Service` field empty during import.
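
For example, a minimal `local_settings.py` override (a sketch; remember to recompute hash codes afterwards, see the next section):

```python
# local_settings.py -- stop `service` from influencing hash codes (sketch)
HASH_CODE_FIELDS_ALWAYS = []
```
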
## After changing deduplication settings

- Changes to dedupe configuration (e.g., `HASHCODE_FIELDS_PER_SCANNER`, `HASH_CODE_FIELDS_ALWAYS`, `DEDUPLICATION_ALGORITHM_PER_PARSER`) are not applied retroactively. To re-evaluate existing findings, run the management command below.

Run it inside the uwsgi container. Example (recompute hash codes only, no dedupe):

```bash
docker compose exec uwsgi /bin/bash -c "python manage.py dedupe --hash_code_only"
```

Help/usage:

```
options:
  --parser PARSER   List of parsers for which hash_code needs recomputing
                    (defaults to all parsers)
  --hash_code_only  Only compute hash codes
  --dedupe_only     Only run deduplication
  --dedupe_sync     Run dedupe in the foreground, default false
```
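
A typical sequence after changing the configuration for one parser might look like this; a sketch, assuming the parser name passed to `--parser` matches the scan type name exactly:

```bash
# Recompute hash codes for a single parser, then deduplicate synchronously
docker compose exec uwsgi /bin/bash -c "python manage.py dedupe --parser 'Trivy Scan' --hash_code_only"
docker compose exec uwsgi /bin/bash -c "python manage.py dedupe --dedupe_only --dedupe_sync"
```
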
If you submit dedupe to Celery (without `--dedupe_sync`), allow time for the tasks to complete before evaluating results.

## Where to configure

- Prefer environment variables in deployments. For local development or advanced overrides, use `local_settings.py`.
- See the [Configuration](../../open_source/installation/configuration) page for details on how to set environment variables and configure local overrides.

## Troubleshooting

To troubleshoot deduplication, use the following tools:

- Observe the log output in the `dojo.specific-loggers.deduplication` category. This class-independent logger outputs details about the deduplication process and settings while findings are processed.
- Observe the `unique_id_from_tool` and `hash_code` values by hovering over the `ID` field or `Status` column:

![Unique ID from Tool and Hash Code on the View Finding page](images/hash_code_id_field.png)

![Unique ID from Tool and Hash Code on the Finding List Status Column](images/hash_code_status_column.png)

## Related documentation

- [Deduplication Algorithms](deduplication_algorithms): conceptual overview and endpoint behavior.
- [Avoiding duplicates via reimport](avoiding_duplicates_via_reimport).
