---
title: "Deduplication Tuning (Open Source)"
description: "Configure deduplication in DefectDojo Open Source: algorithms, hash fields, endpoints, and service"
weight: 5
---

This page explains how to tune deduplication in the Open Source (OS) edition of DefectDojo. For a visual, feature-rich tuning UI, see the Pro documentation. The OS edition uses settings files and environment variables.

See also: [Configuration](../../open_source/installation/configuration) for details on environment variables and `local_settings.py` overrides.

## What you can configure

- **Algorithm per parser**: Choose one of Unique ID From Tool, Hash Code, Unique ID From Tool or Hash Code, or Legacy (OS only).
- **Hash fields per scanner**: Decide which fields contribute to the hash for each parser.
- **Allow null CWE**: Control whether a missing/zero CWE is acceptable when hashing.
- **Endpoint consideration**: Optionally use endpoints for deduplication when they’re not part of the hash.
- **Always-included fields**: Add fields (e.g., `service`) to all hashes regardless of per-scanner settings.

## Key settings (defaults shown)

All defaults are defined in `dojo/settings/settings.dist.py`. Override via environment variables or `local_settings.py`.
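
Because `local_settings.py` is loaded after `settings.dist.py`, the default dictionaries are typically in scope there and can be updated in place rather than redefined. A minimal sketch, assuming the standard setup where `dojo/settings/settings.py` includes an optional `local_settings.py` (the parser names are illustrative):

```python
# dojo/settings/local_settings.py -- a sketch; assumes the defaults from
# settings.dist.py are in scope, so update the dicts rather than replacing them.
DEDUPLICATION_ALGORITHM_PER_PARSER["ZAP Scan"] = "hash_code"
HASHCODE_FIELDS_PER_SCANNER["ZAP Scan"] = ["title", "cwe", "severity"]
```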

### Algorithm per parser

- Setting: `DEDUPLICATION_ALGORITHM_PER_PARSER`
- Values per parser: one of `unique_id_from_tool`, `hash_code`, `unique_id_from_tool_or_hash_code`, `legacy`.
- Example (env variable JSON string):

```bash
DD_DEDUPLICATION_ALGORITHM_PER_PARSER='{"Trivy Scan": "hash_code", "Veracode Scan": "unique_id_from_tool_or_hash_code"}'
```

### Hash fields per scanner

- Setting: `HASHCODE_FIELDS_PER_SCANNER`
- Example defaults in OS (excerpt around the Trivy entries):

```python
# dojo/settings/settings.dist.py (excerpt)
    "Trivy Operator Scan": ["title", "severity", "vulnerability_ids", "description"],
    "Trivy Scan": ["title", "severity", "vulnerability_ids", "cwe", "description"],
    "TFSec Scan": ["severity", "vuln_id_from_tool", "file_path", "line"],
    "Snyk Scan": ["vuln_id_from_tool", "file_path", "component_name", "component_version"],
```
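
Conceptually, the resulting `hash_code` is a digest over the configured field values concatenated in order. The sketch below illustrates the idea only; it is not DefectDojo's exact implementation, which normalizes values (endpoints, vulnerability IDs, etc.) before hashing:

```python
import hashlib

def sketch_hash_code(finding: dict, fields: list[str]) -> str:
    # Concatenate the configured field values in order, then digest.
    # Illustrative only: the real computation normalizes each field first.
    combined = "".join(str(finding.get(field, "")) for field in fields)
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

# Two findings that agree on all configured fields get the same hash:
fields = ["title", "severity", "vulnerability_ids", "cwe", "description"]
print(sketch_hash_code({"title": "XSS", "severity": "High", "cwe": 79}, fields))
```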

- Override example (env variable JSON string):

```bash
DD_HASHCODE_FIELDS_PER_SCANNER='{"ZAP Scan":["title","cwe","severity"],"Trivy Scan":["title","severity","vulnerability_ids","description"]}'
```

### Allow null CWE per scanner

- Setting: `HASHCODE_ALLOWS_NULL_CWE`
- Controls, per parser, whether a null/zero CWE is acceptable in hashing. If `False` and the finding has `cwe = 0`, the hash falls back to the legacy computation for that finding.
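
A minimal `local_settings.py` sketch (parser name illustrative; assumes the defaults are in scope as described above):

```python
# dojo/settings/local_settings.py -- a sketch: reject a null/zero CWE for this
# parser, so findings with cwe == 0 fall back to the legacy hash computation.
HASHCODE_ALLOWS_NULL_CWE["ZAP Scan"] = False
```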

### Always-included fields in hash

- Setting: `HASH_CODE_FIELDS_ALWAYS`
- Default: `["service"]`
- Impact: appended to the hash for every scanner. Removing `service` here stops it from affecting hashes across the board.

```python
# dojo/settings/settings.dist.py (excerpt)
# Adding fields to the hash_code calculation regardless of the previous settings
HASH_CODE_FIELDS_ALWAYS = ["service"]
```
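
To make deduplication independent of `service` everywhere, an override might look like the following sketch (remember to recompute hash codes afterwards, as described below):

```python
# dojo/settings/local_settings.py -- a sketch: no fields are force-appended,
# so `service` no longer influences any scanner's hash_code.
HASH_CODE_FIELDS_ALWAYS = []
```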

### Optional endpoint-based dedupe

- Setting: `DEDUPE_ALGO_ENDPOINT_FIELDS`
- Default: `["host", "path"]`
- Purpose: if endpoints are not part of the hash fields, you can still require a minimal endpoint match to deduplicate. If the list is empty (`[]`), endpoints are ignored on the dedupe path.

```python
# dojo/settings/settings.dist.py (excerpt)
# Allows to deduplicate with endpoints if endpoints is not included in the hashcode.
# Possible values are: scheme, host, port, path, query, fragment, userinfo, and user.
# If a finding has more than one endpoint, only one endpoint pair must match to mark the finding as duplicate.
DEDUPE_ALGO_ENDPOINT_FIELDS = ["host", "path"]
```
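
For example, a hedged `local_settings.py` override to loosen the match to host only:

```python
# dojo/settings/local_settings.py -- a sketch: two findings' endpoints match if
# any endpoint pair agrees on host alone; set [] to ignore endpoints entirely.
DEDUPE_ALGO_ENDPOINT_FIELDS = ["host"]
```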

## Endpoints: how to tune

Endpoints can affect deduplication via two mechanisms:

1) Include `endpoints` in `HASHCODE_FIELDS_PER_SCANNER` for a parser. Then endpoints are part of the hash and must match exactly according to the parser’s hashing rules.
2) If endpoints are not in the hash fields, use `DEDUPE_ALGO_ENDPOINT_FIELDS` to specify attributes to compare. Examples:
   - `[]`: endpoints are ignored for dedupe.
   - `["host"]`: findings dedupe if any endpoint pair matches by host.
   - `["host", "port"]`: findings dedupe if any endpoint pair matches by host AND port.

Notes:

- For the Legacy algorithm, static vs. dynamic findings have different endpoint matching rules (see the algorithms page). The `DEDUPE_ALGO_ENDPOINT_FIELDS` setting applies to the hash-code path, not the Legacy algorithm’s intrinsic logic.
- For `unique_id_from_tool` (ID-based) matching, endpoints are ignored for the dedupe decision.

## Service field: dedupe and reimport

- With the default `HASH_CODE_FIELDS_ALWAYS = ["service"]`, the `service` field is appended to the hash. Two otherwise equal findings with different `service` values will not dedupe on hash-based paths.
- During import via UI/API, the `Service` input can override the parser-provided service. Changing it changes the hash and can alter dedupe behavior and reimport matching; see the sketch after this list.
- If you want dedupe to be independent of service, remove `service` from `HASH_CODE_FIELDS_ALWAYS` or leave the `Service` field empty during import.
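
A hedged sketch of setting the service during an API import (host, token, engagement ID, and service name are placeholders):

```bash
# Hypothetical values throughout; the `service` form field overrides the
# parser-provided service and therefore feeds into the hash_code.
curl -X POST "https://defectdojo.example.com/api/v2/import-scan/" \
  -H "Authorization: Token $DD_API_TOKEN" \
  -F "scan_type=Trivy Scan" \
  -F "engagement=42" \
  -F "service=payments-api" \
  -F "file=@trivy-report.json"
```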

## After changing deduplication settings

- Changes to dedupe configuration (e.g., `HASHCODE_FIELDS_PER_SCANNER`, `HASH_CODE_FIELDS_ALWAYS`, `DEDUPLICATION_ALGORITHM_PER_PARSER`) are not applied to existing findings automatically. To re-evaluate existing findings, you must run the management command below.

Run it inside the uwsgi container. Example (recompute hash codes only, no dedupe):

```bash
docker compose exec uwsgi /bin/bash -c "python manage.py dedupe --hash_code_only"
```

Help/usage:

```
options:
  --parser PARSER   List of parsers for which hash_code needs recomputing
                    (defaults to all parsers)
  --hash_code_only  Only compute hash codes
  --dedupe_only     Only run deduplication
  --dedupe_sync     Run dedupe in the foreground, default false
```
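
For example, to recompute hash codes and rerun deduplication for a single parser in the foreground (a sketch; check `python manage.py dedupe --help` in your version for the exact `--parser` syntax):

```bash
docker compose exec uwsgi /bin/bash -c \
  "python manage.py dedupe --parser 'Trivy Scan' --dedupe_sync"
```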

If you submit dedupe to Celery (without `--dedupe_sync`), allow time for the tasks to complete before evaluating the results.

## Where to configure

- Prefer environment variables in deployments. For local development or advanced overrides, use `local_settings.py`.
- See [Configuration](../../open_source/installation/configuration) for details on how to set environment variables and configure local overrides.

### Troubleshooting

To help troubleshoot deduplication, use the following tools:

- Observe log output in the `dojo.specific-loggers.deduplication` category. This is a class-independent logger that outputs details about the deduplication process and settings while findings are processed.
- Observe the `unique_id_from_tool` and `hash_code` values by hovering over the `ID` field or `Status` column.
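
The dedupe logger is often the quickest diagnostic. A hedged `local_settings.py` sketch to raise its verbosity, assuming the logger is defined in the default `LOGGING` dict:

```python
# dojo/settings/local_settings.py -- a sketch: turn up the dedicated
# deduplication logger, assuming settings.dist.py defines it under LOGGING.
LOGGING["loggers"]["dojo.specific-loggers.deduplication"]["level"] = "DEBUG"
```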

## Related documentation

- [Deduplication Algorithms](deduplication_algorithms): conceptual overview and endpoint behavior.
- [Avoiding duplicates via reimport](avoiding_duplicates_via_reimport).