
Commit 8a9a3da

Add tests and documentation for deduplication algorithms (#13464)
* deduplication logic: add missing tests
* deduplication logic: add docs
1 parent e172143 commit 8a9a3da

File tree: 5 files changed, +399 -0 lines changed

Two binary files (images) added: 62.8 KB and 54.5 KB.
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
---
title: "Deduplication Algorithms"
description: "How DefectDojo identifies duplicates: Unique ID, Hash Code, Unique ID or Hash Code, Legacy"
weight: 3
---

## Overview

DefectDojo supports four deduplication algorithms that can be selected per parser (test type):

- **Unique ID From Tool**: Uses the scanner-provided unique identifier.
- **Hash Code**: Uses a configured set of fields to compute a hash.
- **Unique ID From Tool or Hash Code**: Prefers the tool’s unique ID and falls back to the hash when no matching unique ID is found.
- **Legacy**: Historical algorithm with multiple conditions; only available in the Open Source version.

Algorithm selection per parser is controlled by `DEDUPLICATION_ALGORITHM_PER_PARSER` (see the [OS tuning page](deduplication_tuning_os) for configuration details).

## How endpoints are assessed per algorithm

Endpoints can influence deduplication in different ways depending on the algorithm and configuration.

### Unique ID From Tool

- Deduplication uses `unique_id_from_tool` (or `vuln_id_from_tool`).
- **Endpoints are ignored** for duplicate matching.
- A finding’s hash may still be calculated for other features, but it does not affect deduplication under this algorithm.

### Hash Code

- Deduplication uses a hash computed from the fields specified by `HASHCODE_FIELDS_PER_SCANNER` for the given parser.
- The hash also includes fields from `HASH_CODE_FIELDS_ALWAYS` (see the Service field section below).
- Endpoints can affect deduplication in two ways:
  - If the scanner’s hash fields include `endpoints`, they are part of the hash and must match accordingly.
  - If the scanner’s hash fields do not include `endpoints`, optional endpoint-based matching can be enabled via `DEDUPE_ALGO_ENDPOINT_FIELDS` (OS setting). When configured:
    - Set it to an empty list `[]` to ignore endpoints entirely.
    - Set it to a list of endpoint attributes (e.g. `["host", "port"]`). If at least one endpoint pair between the two findings matches on all listed attributes, deduplication can occur.
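
To make the hash mechanics concrete, here is a minimal sketch of field-based hashing. It is illustrative only: the field names, joining scheme, and hash function are assumptions, not DefectDojo’s actual implementation.

```python
import hashlib

# Illustrative per-parser hash fields, mirroring HASHCODE_FIELDS_PER_SCANNER,
# plus the always-included fields from HASH_CODE_FIELDS_ALWAYS (assumed shapes).
hash_fields = ["title", "severity", "description"]
always_fields = ["service"]

def compute_hash_code(finding: dict) -> str:
    # Join the configured field values in a fixed order and hash the result.
    values = [str(finding.get(field, "")) for field in hash_fields + always_fields]
    return hashlib.sha256("|".join(values).encode("utf-8")).hexdigest()

finding_a = {"title": "XSS", "severity": "High", "description": "Reflected", "service": "api"}
finding_b = {"title": "XSS", "severity": "High", "description": "Reflected", "service": "web"}

# Findings that differ only in `service` hash differently, so they
# would not deduplicate on a hash-based path.
print(compute_hash_code(finding_a) == compute_hash_code(finding_b))  # False
```
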
### Unique ID From Tool or Hash Code

- Intended flow:
  1) Try to deduplicate using the tool’s unique ID (endpoints ignored on this path).
  2) If no match by unique ID, fall back to the Hash Code path.
- When falling back to hash code, endpoint behavior is identical to the Hash Code algorithm.
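
As a rough sketch of that decision order (a hypothetical helper, not the actual code path):

```python
from typing import Optional

def find_duplicate(new: dict, candidates: list[dict]) -> Optional[dict]:
    # 1) Prefer the tool's unique ID; endpoints play no role on this path.
    uid = new.get("unique_id_from_tool")
    if uid:
        for candidate in candidates:
            if candidate.get("unique_id_from_tool") == uid:
                return candidate
    # 2) No match by unique ID: fall back to hash-code comparison.
    hash_code = new.get("hash_code")
    if hash_code:
        for candidate in candidates:
            if candidate.get("hash_code") == hash_code:
                return candidate
    return None
```
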
### Legacy (OS only)

- Deduplication considers multiple attributes, including endpoints.
- Behavior differs for static vs. dynamic findings:
  - **Static findings**: The new finding must contain all endpoints of the original. Extra endpoints on the new finding are allowed.
  - **Dynamic findings**: Endpoints must strictly match (commonly by host and port); differing endpoints prevent deduplication.
- If there are no endpoints and both `file_path` and `line` are empty, deduplication typically does not occur.
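
The endpoint rules can be illustrated with plain set operations (hypothetical data, assuming endpoints compare as host/port pairs):

```python
# Endpoints as (host, port) pairs; purely illustrative.
original_endpoints = {("example.com", 443), ("example.com", 8443)}
new_endpoints = {("example.com", 443), ("example.com", 8443), ("example.com", 80)}

# Static findings: the new finding must contain all endpoints of the
# original; extra endpoints on the new finding are allowed.
static_match = original_endpoints.issubset(new_endpoints)  # True

# Dynamic findings: the endpoint sets must match strictly, so the
# extra endpoint on the new finding prevents deduplication.
dynamic_match = original_endpoints == new_endpoints  # False
```
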
## Background processing

- Dedupe is triggered on import/reimport and during certain updates; it runs via Celery in the background.

## Service field and its impact

- By default, `HASH_CODE_FIELDS_ALWAYS = ["service"]`, meaning the `service` associated with a finding is appended to the hash for all scanners.
- Practical implications:
  - Two otherwise identical findings with different `service` values will produce different hashes and will not deduplicate under hash-based paths.
  - During import/reimport, the `Service` field entered in the UI can override the parser-provided service. Changing it can change the hash and therefore affect deduplication outcomes.
  - If you want service to have no impact on deduplication, configure `HASH_CODE_FIELDS_ALWAYS` accordingly (see the OS tuning page). Removing `service` from the always-included list stops it from affecting hashes.

See also: the [Open Source tuning guide](deduplication_tuning_os) for configuration details and examples.

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
---
title: "Deduplication Tuning (Open Source)"
description: "Configure deduplication in DefectDojo Open Source: algorithms, hash fields, endpoints, and service"
weight: 5
---

This page explains how to tune deduplication in the Open Source (OS) edition of DefectDojo. For a visual, feature-rich tuning UI, see the Pro documentation. The OS edition uses settings files and environment variables.

See also: [Configuration](../../open_source/installation/configuration) for details on environment variables and `local_settings.py` overrides.

## What you can configure

- **Algorithm per parser**: Choose one of Unique ID From Tool, Hash Code, Unique ID From Tool or Hash Code, or Legacy (OS only).
- **Hash fields per scanner**: Decide which fields contribute to the hash for each parser.
- **Allow null CWE**: Control whether a missing/zero CWE is acceptable when hashing.
- **Endpoint consideration**: Optionally use endpoints for deduplication when they’re not part of the hash.
- **Always-included fields**: Add fields (e.g., `service`) to all hashes regardless of per-scanner settings.

## Key settings (defaults shown)

All defaults are defined in `dojo/settings/settings.dist.py`. Override them via environment variables or `local_settings.py`.

### Algorithm per parser

- Setting: `DEDUPLICATION_ALGORITHM_PER_PARSER`
- Values per parser: one of `unique_id_from_tool`, `hash_code`, `unique_id_from_tool_or_hash_code`, `legacy`.
- Example (env variable JSON string):

```bash
DD_DEDUPLICATION_ALGORITHM_PER_PARSER='{"Trivy Scan": "hash_code", "Veracode Scan": "unique_id_from_tool_or_hash_code"}'
```
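
The same override can also be expressed in `local_settings.py`; a minimal sketch, assuming `local_settings.py` is loaded after the defaults and the keys match the parsers’ scan type names exactly:

```python
# local_settings.py -- adjust per-parser algorithms (sketch)
DEDUPLICATION_ALGORITHM_PER_PARSER["Trivy Scan"] = "hash_code"
DEDUPLICATION_ALGORITHM_PER_PARSER["Veracode Scan"] = "unique_id_from_tool_or_hash_code"
```
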
### Hash fields per scanner

- Setting: `HASHCODE_FIELDS_PER_SCANNER`
- Example defaults in OS (from `dojo/settings/settings.dist.py`, lines 1318-1321):

```python
"Trivy Operator Scan": ["title", "severity", "vulnerability_ids", "description"],
"Trivy Scan": ["title", "severity", "vulnerability_ids", "cwe", "description"],
"TFSec Scan": ["severity", "vuln_id_from_tool", "file_path", "line"],
"Snyk Scan": ["vuln_id_from_tool", "file_path", "component_name", "component_version"],
```

- Override example (env variable JSON string):

```bash
DD_HASHCODE_FIELDS_PER_SCANNER='{"ZAP Scan":["title","cwe","severity"],"Trivy Scan":["title","severity","vulnerability_ids","description"]}'
```

### Allow null CWE per scanner

- Setting: `HASHCODE_ALLOWS_NULL_CWE`
- Controls per parser whether a null/zero CWE is acceptable in hashing. If set to `False` and the finding has `cwe = 0`, the hash falls back to the legacy computation for that finding.

### Always-included fields in hash

- Setting: `HASH_CODE_FIELDS_ALWAYS`
- Default: `["service"]`
- Impact: Appended to the hash for every scanner. Removing `service` here stops it from affecting hashes across the board.

```python
# dojo/settings/settings.dist.py, lines 1464-1466
# Adding fields to the hash_code calculation regardless of the previous settings
HASH_CODE_FIELDS_ALWAYS = ["service"]
```

### Optional endpoint-based dedupe

- Setting: `DEDUPE_ALGO_ENDPOINT_FIELDS`
- Default: `["host", "path"]`
- Purpose: If endpoints are not part of the hash fields, you can still require a minimal endpoint match to deduplicate. If the list is empty (`[]`), endpoints are ignored on the dedupe path.

```python
# dojo/settings/settings.dist.py, lines 1491-1499
# Allows to deduplicate with endpoints if endpoints is not included in the hashcode.
# Possible values are: scheme, host, port, path, query, fragment, userinfo, and user.
# If a finding has more than one endpoint, only one endpoint pair must match to mark the finding as duplicate.
DEDUPE_ALGO_ENDPOINT_FIELDS = ["host", "path"]
```

## Endpoints: how to tune

Endpoints can affect deduplication via two mechanisms:

1) Include `endpoints` in `HASHCODE_FIELDS_PER_SCANNER` for a parser. Then endpoints are part of the hash and must match exactly according to the parser’s hashing rules.
2) If endpoints are not in the hash fields, use `DEDUPE_ALGO_ENDPOINT_FIELDS` to specify the attributes to compare (see the sketch after this list). Examples:
   - `[]`: endpoints are ignored for dedupe.
   - `["host"]`: findings dedupe if any endpoint pair matches by host.
   - `["host", "port"]`: findings dedupe if any endpoint pair matches by host AND port.
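
A minimal sketch of the “any endpoint pair matches on all listed attributes” rule; the helper and data are hypothetical, for illustration only:

```python
from itertools import product

def endpoints_match(eps_a: list[dict], eps_b: list[dict], fields: list[str]) -> bool:
    # One endpoint pair matching on all configured attributes is enough
    # to consider the two findings endpoint-equivalent.
    return any(
        all(a.get(field) == b.get(field) for field in fields)
        for a, b in product(eps_a, eps_b)
    )

a = [{"host": "example.com", "port": 443, "path": "login"}]
b = [{"host": "example.com", "port": 8443, "path": "login"}]

print(endpoints_match(a, b, ["host"]))          # True: hosts agree
print(endpoints_match(a, b, ["host", "port"]))  # False: ports differ
```
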
Notes:

- For the Legacy algorithm, static vs. dynamic findings have different endpoint matching rules (see the algorithms page). The `DEDUPE_ALGO_ENDPOINT_FIELDS` setting applies to the hash-code path, not to the Legacy algorithm’s intrinsic logic.
- For `unique_id_from_tool` (ID-based) matching, endpoints are ignored for the dedupe decision.

## Service field: dedupe and reimport

- With the default `HASH_CODE_FIELDS_ALWAYS = ["service"]`, the `service` field is appended to the hash. Two otherwise equal findings with different `service` values will not dedupe on hash-based paths.
- During import via UI/API, the `Service` input can override the parser-provided service. Changing it changes the hash and can alter dedupe behavior and reimport matching.
- If you want dedupe to be independent of service, remove `service` from `HASH_CODE_FIELDS_ALWAYS` or leave the `Service` field empty during import.
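
For example, a minimal `local_settings.py` override (a sketch; remember to recompute hash codes afterwards, see the next section):

```python
# local_settings.py -- stop `service` from influencing hash codes (sketch)
HASH_CODE_FIELDS_ALWAYS = []
```
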
## After changing deduplication settings

- Changes to dedupe configuration (e.g., `HASHCODE_FIELDS_PER_SCANNER`, `HASH_CODE_FIELDS_ALWAYS`, `DEDUPLICATION_ALGORITHM_PER_PARSER`) are not applied retroactively. To re-evaluate existing findings, run the management command below.

Run it inside the uwsgi container. Example (recompute hash codes only, no dedupe):

```bash
docker compose exec uwsgi /bin/bash -c "python manage.py dedupe --hash_code_only"
```

Help/usage:

```
options:
  --parser PARSER   List of parsers for which hash_code needs recomputing
                    (defaults to all parsers)
  --hash_code_only  Only compute hash codes
  --dedupe_only     Only run deduplication
  --dedupe_sync     Run dedupe in the foreground, default false
```
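
A typical sequence after changing the configuration for one parser might look like this; a sketch, assuming the parser name passed to `--parser` matches the scan type name exactly:

```bash
# Recompute hash codes for a single parser, then deduplicate synchronously
docker compose exec uwsgi /bin/bash -c "python manage.py dedupe --parser 'Trivy Scan' --hash_code_only"
docker compose exec uwsgi /bin/bash -c "python manage.py dedupe --dedupe_only --dedupe_sync"
```
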
If you submit dedupe to Celery (without `--dedupe_sync`), allow time for the tasks to complete before evaluating results.

## Where to configure

- Prefer environment variables in deployments. For local development or advanced overrides, use `local_settings.py`.
- See the [Configuration](../../open_source/installation/configuration) page for details on how to set environment variables and configure local overrides.

## Troubleshooting

To troubleshoot deduplication, use the following tools:

- Observe the log output in the `dojo.specific-loggers.deduplication` category. This class-independent logger outputs details about the deduplication process and settings while findings are processed.
- Observe the `unique_id_from_tool` and `hash_code` values by hovering over the `ID` field or `Status` column:

![Unique ID from Tool and Hash Code on the View Finding page](images/hash_code_id_field.png)

![Unique ID from Tool and Hash Code on the Finding List Status Column](images/hash_code_status_column.png)

## Related documentation

- [Deduplication Algorithms](deduplication_algorithms): conceptual overview and endpoint behavior.
- [Avoiding duplicates via reimport](avoiding_duplicates_via_reimport).
