probe: doctor attack + encoding/Leet by leondz · Pull Request #1180 · NVIDIA/garak

leondz · 2025-04-25T09:30:24Z

Implementation of doctor / puppetry attack from https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/

Also adds module for encoding funcs req'd by more than one plugin

NB this highlights missing intents & techniques functionality

Verification

garak -m test -p doctor,encoding.InjectLeet

jmartin-tech

Currently typo makes this not do any replacement, there should probably be a unit tests here to ensure _build_prompts() actually mutates the templates.

Also based on the programatic/XML based contexts being added in the templates is this really lang = "en" or should it be considered lang = "*"?

garak/resources/encodings.py

garak/probes/doctor.py

erickgalinkin · 2025-04-29T21:38:19Z

I feel like this could be included under dan. It's very DAN-like in its approach and IMO, not worthy of a distinct probe category.

leondz · 2025-04-30T05:36:32Z

I think I disagree with the DAN idea. Of we continue this reasoning a little, Dan really should go in jailbreaks, and jailbreaks is not a meaningful category (too vague). We know our focus is on techniques - this looks like a role-playing attack, and of those, and specific coherent subgroup.

If one wants to get reporting on all role play type attacks, we have taxonomies and grouping in report_digest to enable that.

I could get behind a rename from doctor to house.

Co-authored-by: Jeffrey Martin <jmartin@Op3n4M3.dev> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>

cktlco · 2025-05-27T06:21:49Z

I think I disagree with the DAN idea. Of we continue this reasoning a little, Dan really should go in jailbreaks, and jailbreaks is not a meaningful category (too vague). We know our focus is on techniques - this looks like a role-playing attack, and of those, and specific coherent subgroup.

If one wants to get reporting on all role play type attacks, we have taxonomies and grouping in report_digest to enable that.

I could get behind a rename from doctor to house.

Hi, similar to this -- I'm interested in submitting another broadly-successful roleplay prompt, but based on comments here, I can't tell if the project is interested in that type of PR (due to perceived noise or overlap or other).

What is the latest guidance on where these "jailbreak" or "role playing" prompts should live?

Intuitively, it seems like a dimension that would be reported across the model(s) scanned -- ie, the model correctly responded to a probe under normal conditions, but responded with prohibited content only when the context was prefixed with XYZ roleplay prompt.

I thought of this as a "buff" use case, but I'm probably misunderstanding the philosophy. Thanks!

leondz · 2025-06-27T12:43:12Z

@erickgalinkin

I feel like this could be included under dan. It's very DAN-like in its approach and IMO, not worthy of a distinct probe category.

@cktlco

What is the latest guidance on where these "jailbreak" or "role playing" prompts should live?

Intuitively, it seems like a dimension that would be reported across the model(s) scanned -- ie, the model correctly responded to a probe under normal conditions, but responded with prohibited content only when the context was prefixed with XYZ roleplay prompt.

Unpacking this:

what's a "probe category"? Well,,,,
we have a code structure where modules group probe classes by "theme"
there are a variety of taxonomatata that group individual probe classes (eg. OWASP Top10 LLM 2023)
there is a typology of intents
there are a few categorisations of strategy/technique/tactic

Thinking out "loud":

a. Future things are going to parameterise (3) & (4) and make use of them really flexible. We almost certainly don't want to arrange our code by both of them, that's got to be dynamic.
b. If we structure our code, (1), by any of the other dimensions, we lose a dimension of expressiveness
c. Reporting organised by code doesn't make a huge amount of sense to me - I think human consumers are more interested in techniques & intents, 3. and 4., than
d. It might be intuitive to name code after technique, but this creates ties that bind us quite awkwardly, I think:
- i. The way we conceive of technique categories has changed over time and probably will continue to do so. Making class names match the categorisation-of-the-day means churn in names & name paths, creating extra for us (writing fixers) and making configs rot faster
- ii. We use filesystems to store code, which generally support only DAG-based paths. However, this is too constrained for the technique description, because a probe class can use/implement >1 technique. While symlinks do us all the wonderful courtesy of simply existing, and git supports them natively, Windows also exists, and anyway, this gives us a matching & resolution problem (if a file with the same name can be accessed via two different directories/techniques, is it the same file, just implementing both techniques, or two separate files?)
e. We probably want to move away from using the code structure in reporting. Reporting post-Technique & Intent is, I think, going to make this really clear. Doing so reduces the impact & constraint of file organisation choices, uh, meaning fewer discussions like this, I hope
f. I don't have a strong concept of what justifies a distinct new file/module in probes. Implementing (e.) means we can skip addressing the task of defining this, for which I suspect there are many "OK" answers, perhaps a global optima, but no answers that everybody loves. I prefer it slightly scrappy, so we don't debate it - and also something other than (3) or (4), so that we don't lose expressiveness.

The above is why I prefer decoupling naming code structure (i.e. detector & probe module and class names) from how we formally conceive of probes.

post-script:
Adding a technique taxonomy has been on the backburner for a while - this & grandma should be grouped there, without us having to move files around. grandma's naming is super similar and vaguely intuitive - otoh, a whole module for a roleplayed character feels unsustainable, agree.

Could we merge as-is and expect that what we learn Technique & Intent lends good clarity over this, perhaps even insights into (f)?

leondz · 2025-06-27T12:50:48Z

@cktlco

I thought of this as a "buff" use case, but I'm probably misunderstanding the philosophy. Thanks!

No, that's OK, this is reasonable and a result of how probes in garak currently combine both "how we make the target fail" (technique) and the "what we make the target do" (intent). Intent is to separate these two things clearly out.

In the future buffs have a good chance of mostly focusing on data augmentation, which has a bit of overlap with technique, but shouldn't do the same thing.

None of this was clear when LLM Security first started as a field, hence older structures overlapping these concepts a bit.

erickgalinkin · 2025-07-02T15:35:28Z

I'm ok with this merging this. We can figure out taxonomy and stuff as we move forward.

leondz added 3 commits April 25, 2025 11:07

add tags for doctor probes

f84295b

add encoding probe for leetspeak

fc2feca

factor leet function up, unify probe naming

e9fd631

leondz added probes Content & activity of LLM probes new plugin Describes an entirely new probe, detector, generator or harness labels Apr 25, 2025

leondz requested review from erickgalinkin and jmartin-tech April 25, 2025 09:30

leondz marked this pull request as ready for review April 25, 2025 09:51

jmartin-tech requested changes Apr 28, 2025

View reviewed changes

garak/resources/encodings.py Outdated Show resolved Hide resolved

garak/probes/doctor.py Show resolved Hide resolved

leondz self-assigned this May 6, 2025

leondz and others added 5 commits May 16, 2025 12:49

docsting typo

10926d4

Co-authored-by: Jeffrey Martin <jmartin@Op3n4M3.dev> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>

🇺🇸

5478b3f

Merge branch 'main' into probe/doctor

eb5d104

move to garak.probes pattern

34d5bcb

move to garak.probes pattern

87a14b2

merge; move leet encoding probe to EncodingMixin pattern

058890e

rm straggling BaseEncodingProbe ref

78e30e6

leondz added 2 commits July 3, 2025 06:50

add template/prompt processing tests

3173356

update debug param in doctor test output

d76438b

leondz changed the title ~~probe: doctor attack~~ probe: doctor attack + encoding/Leet Jul 3, 2025

leondz merged commit 08b8332 into NVIDIA:main Jul 3, 2025
15 checks passed

github-actions bot locked and limited conversation to collaborators Jul 3, 2025

leondz deleted the probe/doctor branch July 3, 2025 05:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

probe: doctor attack + encoding/Leet#1180

probe: doctor attack + encoding/Leet#1180
leondz merged 12 commits intoNVIDIA:mainfrom
leondz:probe/doctor

leondz commented Apr 25, 2025

Uh oh!

jmartin-tech left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

erickgalinkin commented Apr 29, 2025

Uh oh!

leondz commented Apr 30, 2025

Uh oh!

cktlco commented May 27, 2025

Uh oh!

leondz commented Jun 27, 2025 •

edited

Loading

Uh oh!

leondz commented Jun 27, 2025

Uh oh!

erickgalinkin commented Jul 2, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

leondz commented Apr 25, 2025

Verification

Uh oh!

jmartin-tech left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

erickgalinkin commented Apr 29, 2025

Uh oh!

leondz commented Apr 30, 2025

Uh oh!

cktlco commented May 27, 2025

Uh oh!

leondz commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leondz commented Jun 27, 2025

Uh oh!

erickgalinkin commented Jul 2, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jmartin-tech left a comment •

edited

Loading

leondz commented Jun 27, 2025 •

edited

Loading