probe: zero width bad char injection #1489
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅

I have read the DCO Document and I hereby sign the DCO

recheck

Wow, thank you! Will take a look. NB There are failing tests - could you address these?

I'll dig into the failing tests and push a fix shortly.
Failure 1 (docstring assertion, fixed): test_probes.py expects every probe docstring to have at least two paragraphs (summary + detail) separated by a blank line. I've fixed this by expanding the docstring into a proper two-paragraph form.

Failure 2 (langprovider call count mismatch): @leondz, I need some help here. BadCharacters is the only probe in this test whose prompts are stored as garak.attempt.Conversation objects instead of raw strings. Because of that, Probe.probe() takes the "conversation" branch, which calls langprovider.get_text once per prompt/turn instead of batching all prompts into a single call. We still do one reverse-translation per attempt, and the test's mock is attached to the same Passthru instance for both directions. With 256 prompts, this yields 256 forward + 256 reverse calls (512 total), while the test assumes the string/batched path and expects len(prompts) + 1 (257) calls.
This is the reason the test is failing; the test can be updated to account for this call-count calculation, which can be accomplished by simply adding detection for the
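The call-count arithmetic above can be expressed as a quick sanity check. This is an illustrative sketch with made-up names, not the test's actual helpers:

```python
def expected_translation_calls(n_prompts: int, conversation_path: bool) -> int:
    """Count langprovider.get_text hits on the shared mock.

    conversation_path=True models Conversation-object prompts: one
    forward call per prompt/turn. The string path batches all prompts
    into a single forward call. Either way there is one
    reverse-translation call per attempt.
    """
    forward = n_prompts if conversation_path else 1
    reverse = n_prompts
    return forward + reverse

# BadCharacters with 256 Conversation prompts: 256 + 256 = 512 calls
assert expected_translation_calls(256, conversation_path=True) == 512
# The test's batched-string assumption: len(prompts) + 1 = 257 calls
assert expected_translation_calls(256, conversation_path=False) == 257
```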
erickgalinkin
left a comment
Overall looks great, thanks! A few places where I need some clarification. Beyond that, `self.prompts` should be `list[str]` and the conversation creation is handled in `_mint_attempt`. This should save us some work. I'll need to do some more local testing as well.
return "".join(rendered)

def _load_homoglyph_map() -> dict[str, List[str]]:
Does it make more sense to just turn intentional.txt into a json file so we can load it from disk without all the extra file parsing?
I don't see anywhere the .txt file is used directly.
`intentional.txt` is the upstream Unicode Security format (https://www.unicode.org/Public/security/latest/intentional.txt), so we can drop in updates directly without maintaining a parallel generated artifact. Parsing is minimal (split on `;` / `#`) and only happens once at init, so there isn't much overhead to avoid. Converting this to JSON would add a regeneration step, but I can switch to JSON if that's what's expected/preferred.
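For reference, the split-on-`;`/`#` handling can be sketched roughly as below. This is a minimal illustration of the `intentional.txt` line format (`source ; target ; # comment`, code points in hex), not the PR's actual `_load_homoglyph_map`:

```python
from collections import defaultdict
from typing import Dict, List

def load_homoglyph_map(text: str) -> Dict[str, List[str]]:
    """Parse Unicode Security intentional.txt-style lines into a map
    from a source character to its confusable target characters."""
    mapping: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue  # blank or comment-only line
        fields = [f.strip() for f in line.split(";") if f.strip()]
        if len(fields) < 2:
            continue
        src, dst = (chr(int(cp, 16)) for cp in fields[:2])
        mapping[src].append(dst)
    return dict(mapping)

# e.g. "0021 ; 01C3 ; # EXCLAMATION MARK -> RETROFLEX CLICK"
# maps "!" to the retroflex click letter U+01C3
```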
leondz
left a comment
This is great. A few questions and comments but in good shape.
Requested changes applied.
This PR adds a new `BadCharacters` probe that exercises models with imperceptible and structurally tricky Unicode "bad character" perturbations reported in [#233].

The probe:
Enumerates prompt variants that include:
Pre-generates the full Cartesian set of prompt variations and then downsamples the pool using `run.soft_probe_prompt_cap`, mirroring the existing `probes.phrasing.*` pattern so runs stay inference-friendly and reproducible.

Key parameters (with defaults):

- `payload_name`: `"harmful_behaviors"`
- `perturbation_budget`: `1`
- `enabled_categories`: `["invisible", "homoglyph", "reordering", "deletion"]`
- `max_position_candidates`: `24`
- `max_reorder_candidates`: `24`
- `max_ascii_variants`: `len(ASCII_PRINTABLE)`
- `follow_prompt_cap`: `True` (honors `soft_probe_prompt_cap`)

These can be tuned per run.
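The pre-generate-then-downsample flow can be sketched as follows. This is a minimal illustration of the behavior described above; the seeding and sampling details are assumptions, not taken from the actual `probes.phrasing.*` implementation:

```python
import random

def cap_prompts(pool: list[str], soft_cap: int, seed: int = 1) -> list[str]:
    """Deterministically downsample an enumerated prompt pool to at
    most `soft_cap` prompts, keeping runs reproducible."""
    if len(pool) <= soft_cap:
        return list(pool)  # pool already fits under the cap
    rng = random.Random(seed)  # fixed seed -> same subset every run
    return rng.sample(pool, soft_cap)
```

With the same seed, repeated runs select the same subset, which is what makes the downsampled runs reproducible.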
“Just Try Everything” Strategy
The probe explicitly implements the “just try everything” approach:
For a given payload and category, it creates all combinations within the configured budget:
- `DEFAULT_INVISIBLE` characters.
- `_Swap` objects rendered via `_render_swaps`.

This means the full search space is enumerated up front, then downsampled with `soft_probe_prompt_cap`.

Verification
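As a concrete illustration of the enumerate-everything step (useful when sanity-checking prompt counts), here is a hedged sketch for the invisible category. The character set and helper name are stand-ins, not the probe's actual `DEFAULT_INVISIBLE` or its enumeration code:

```python
from itertools import product

# Illustrative invisible characters: ZWSP, ZWNJ, word joiner
INVISIBLE_CHARS = ["\u200b", "\u200c", "\u2060"]

def invisible_variants(prompt: str, budget: int = 1) -> list[str]:
    """Enumerate every insertion of up to `budget` invisible
    characters at every position, i.e. the full Cartesian set."""
    variants = []
    positions = range(len(prompt) + 1)
    for chars in product(INVISIBLE_CHARS, repeat=budget):
        for pos in product(positions, repeat=budget):
            s = prompt
            # insert right-to-left so earlier offsets stay valid
            for p, c in sorted(zip(pos, chars), reverse=True):
                s = s[:p] + c + s[p:]
            variants.append(s)
    return variants

# For a 2-char prompt and budget 1: 3 chars x 3 positions = 9 variants
```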
List the steps needed to make sure this thing works
python -m pytest tests/probes/test_probes_badcharacters.py