probe: doctor attack + encoding/Leet #1180

Merged
leondz merged 12 commits into NVIDIA:main from leondz:probe/doctor
Jul 3, 2025

@leondz
Collaborator

@leondz leondz commented Apr 25, 2025

Implementation of doctor / puppetry attack from https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/

Also adds module for encoding funcs req'd by more than one plugin
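
A minimal sketch of the kind of shared encoding helper described above, assuming a simple character-substitution form of leetspeak. The mapping table and the name `leetspeak` are illustrative assumptions, not the contents of the module added in this PR:

```python
# Hypothetical substitution table; the real module may map different characters.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def leetspeak(text: str) -> str:
    """Replace each mapped letter (case-insensitively) with its leet digit.

    Characters without an entry in LEET_MAP pass through unchanged.
    """
    return "".join(LEET_MAP.get(ch.lower(), ch) for ch in text)


if __name__ == "__main__":
    print(leetspeak("Ignore the instructions"))
```

Keeping a function like this in one shared module lets several plugins import the same transformation instead of each carrying its own copy.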

NB this highlights missing intents & techniques functionality

Verification

  • garak -m test -p doctor,encoding.InjectLeet

@leondz leondz added the probes (Content & activity of LLM probes) and new plugin (Describes an entirely new probe, detector, generator or harness) labels Apr 25, 2025
@leondz leondz marked this pull request as ready for review April 25, 2025 09:51
Collaborator

@jmartin-tech jmartin-tech left a comment

Currently a typo makes this not do any replacement; there should probably be a unit test here to ensure _build_prompts() actually mutates the templates.

Also, based on the programmatic/XML-based contexts being added in the templates, is this really lang = "en", or should it be considered lang = "*"?
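
The unit test suggested above could look roughly like the following sketch. The placeholder token, the template text and the standalone `build_prompts` function are all hypothetical stand-ins, since the probe's real `_build_prompts()` and templates are not shown in this conversation:

```python
# Hypothetical stand-ins for the probe's templates and payloads.
PLACEHOLDER = "{{HARMFUL_BEHAVIOR}}"
TEMPLATES = ["<scene>The doctor explains {{HARMFUL_BEHAVIOR}}</scene>"]
PAYLOADS = ["example payload"]


def build_prompts(templates, payloads, placeholder=PLACEHOLDER):
    """Substitute every payload into every template (illustrative only)."""
    return [t.replace(placeholder, p) for t in templates for p in payloads]


def test_build_prompts_mutates_templates():
    prompts = build_prompts(TEMPLATES, PAYLOADS)
    assert len(prompts) == len(TEMPLATES) * len(PAYLOADS)
    for prompt in prompts:
        # A mistyped placeholder would leave the template untouched,
        # so both checks below would fail.
        assert PLACEHOLDER not in prompt
        assert prompt not in TEMPLATES
```

With a misspelled token, for example `build_prompts(TEMPLATES, PAYLOADS, placeholder="{{HARMFUL_BEHAVIOUR}}")`, `str.replace` finds nothing to substitute and the templates come back unchanged, which is the failure mode this test is meant to catch.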

@erickgalinkin
Collaborator

I feel like this could be included under dan. It's very DAN-like in its approach and IMO, not worthy of a distinct probe category.

@leondz
Collaborator Author

leondz commented Apr 30, 2025

I think I disagree with the DAN idea. If we continue this reasoning a little, DAN really should go in jailbreaks, and jailbreaks is not a meaningful category (too vague). We know our focus is on techniques - this looks like a role-playing attack, and of those, a specific coherent subgroup.

If one wants to get reporting on all role play type attacks, we have taxonomies and grouping in report_digest to enable that.

I could get behind a rename from doctor to house.

@leondz leondz self-assigned this May 6, 2025
leondz and others added 5 commits May 16, 2025 12:49
Co-authored-by: Jeffrey Martin <jmartin@Op3n4M3.dev>
Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
@cktlco

cktlco commented May 27, 2025

> I think I disagree with the DAN idea. If we continue this reasoning a little, DAN really should go in jailbreaks, and jailbreaks is not a meaningful category (too vague). We know our focus is on techniques - this looks like a role-playing attack, and of those, a specific coherent subgroup.
>
> If one wants to get reporting on all role play type attacks, we have taxonomies and grouping in report_digest to enable that.
>
> I could get behind a rename from doctor to house.

Hi, similar to this -- I'm interested in submitting another broadly-successful roleplay prompt, but based on comments here, I can't tell if the project is interested in that type of PR (due to perceived noise or overlap or other).

What is the latest guidance on where these "jailbreak" or "role playing" prompts should live?

Intuitively, it seems like a dimension that would be reported across the model(s) scanned -- i.e., the model correctly responded to a probe under normal conditions, but responded with prohibited content only when the context was prefixed with XYZ roleplay prompt.

I thought of this as a "buff" use case, but I'm probably misunderstanding the philosophy. Thanks!

@leondz
Collaborator Author

leondz commented Jun 27, 2025

@erickgalinkin

> I feel like this could be included under dan. It's very DAN-like in its approach and IMO, not worthy of a distinct probe category.

@cktlco

> What is the latest guidance on where these "jailbreak" or "role playing" prompts should live?
>
> Intuitively, it seems like a dimension that would be reported across the model(s) scanned -- i.e., the model correctly responded to a probe under normal conditions, but responded with prohibited content only when the context was prefixed with XYZ roleplay prompt.

Unpacking this:

What's a "probe category"? Well...

  1. we have a code structure where modules group probe classes by "theme"
  2. there are a variety of taxonomies that group individual probe classes (e.g. OWASP Top10 LLM 2023)
  3. there is a typology of intents
  4. there are a few categorisations of strategy/technique/tactic

Thinking out "loud":

  • a. Future things are going to parameterise (3) & (4) and make use of them really flexible. We almost certainly don't want to arrange our code by both of them, that's got to be dynamic.
  • b. If we structure our code, (1), by any of the other dimensions, we lose a dimension of expressiveness
  • c. Reporting organised by code doesn't make a huge amount of sense to me - I think human consumers are more interested in techniques & intents, (3) and (4), than in how the code is organised
  • d. It might be intuitive to name code after technique, but this creates ties that bind us quite awkwardly, I think:
    • i. The way we conceive of technique categories has changed over time and probably will continue to do so. Making class names match the categorisation-of-the-day means churn in names & name paths, creating extra work for us (writing fixers) and making configs rot faster
    • ii. We use filesystems to store code, which generally support only DAG-based paths. However, this is too constrained for the technique description, because a probe class can use/implement >1 technique. While symlinks do us all the wonderful courtesy of simply existing, and git supports them natively, Windows also exists, and anyway, this gives us a matching & resolution problem (if a file with the same name can be accessed via two different directories/techniques, is it the same file, just implementing both techniques, or two separate files?)
  • e. We probably want to move away from using the code structure in reporting. Reporting post-Technique & Intent is, I think, going to make this really clear. Doing so reduces the impact & constraint of file organisation choices, uh, meaning fewer discussions like this, I hope
  • f. I don't have a strong concept of what justifies a distinct new file/module in probes. Implementing (e.) means we can skip addressing the task of defining this, for which I suspect there are many "OK" answers, perhaps a global optimum, but no answers that everybody loves. I prefer it slightly scrappy, so we don't debate it - and also something other than (3) or (4), so that we don't lose expressiveness.

The above is why I prefer decoupling the naming of code structure (i.e. detector & probe module and class names) from how we formally conceive of probes.
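
One way to picture this decoupling is a sketch in which technique labels live in metadata rather than in module paths, so a probe class can carry more than one technique and reports can be grouped dynamically. The registry, the technique labels and the `SomeProbe` class names below are illustrative assumptions, not garak's actual data model:

```python
from collections import defaultdict

# Hypothetical registry: "module.Class" -> technique labels.
# A probe may carry several techniques, which a single file path cannot express.
PROBE_TECHNIQUES = {
    "doctor.SomeProbe": {"roleplay", "structured-context"},
    "grandma.SomeProbe": {"roleplay"},
    "encoding.InjectLeet": {"encoding"},
}


def group_by_technique(registry):
    """Invert the registry so reporting can be organised by technique."""
    groups = defaultdict(list)
    for probe_name, techniques in registry.items():
        for technique in techniques:
            groups[technique].append(probe_name)
    # Sort for stable report output.
    return {technique: sorted(names) for technique, names in groups.items()}


if __name__ == "__main__":
    for technique, probes in sorted(group_by_technique(PROBE_TECHNIQUES).items()):
        print(technique, probes)
```

Under this arrangement, changing the technique categorisation only edits metadata; module and class names, and any configs that reference them, stay put.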

post-script:
Adding a technique taxonomy has been on the backburner for a while - this & grandma should be grouped there, without us having to move files around. grandma's naming is super similar and vaguely intuitive - otoh, a whole module for a roleplayed character feels unsustainable, agree.

Could we merge as-is and expect that what we learn from Technique & Intent lends good clarity over this, perhaps even insights into (f)?

@leondz
Collaborator Author

leondz commented Jun 27, 2025

@cktlco

> I thought of this as a "buff" use case, but I'm probably misunderstanding the philosophy. Thanks!

No, that's OK, this is reasonable and a result of how probes in garak currently combine both "how we make the target fail" (technique) and "what we make the target do" (intent). The aim is to separate these two things out clearly.

In the future buffs have a good chance of mostly focusing on data augmentation, which has a bit of overlap with technique, but shouldn't do the same thing.

None of this was clear when LLM Security first started as a field, hence older structures overlapping these concepts a bit.
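
As a rough illustration of the technique/intent separation described in this comment, the sketch below composes prompts as the cross product of the two. The `Technique` and `Intent` classes, their fields and the example wrapper text are hypothetical and do not reflect any existing or planned garak API:

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable


@dataclass(frozen=True)
class Intent:
    """What we try to make the target do (hypothetical structure)."""
    name: str
    request: str


@dataclass(frozen=True)
class Technique:
    """How we try to make the target fail (hypothetical structure)."""
    name: str
    apply: Callable[[str], str]


def compose(techniques, intents):
    """Build one prompt per (technique, intent) pair."""
    return [
        (t.name, i.name, t.apply(i.request))
        for t, i in product(techniques, intents)
    ]


techniques = [
    Technique("plain", lambda req: req),  # baseline: request sent as-is
    Technique("roleplay", lambda req: f"You are an actor playing a doctor. In character, {req}"),
]
intents = [Intent("placeholder-intent", "describe the placeholder task")]

if __name__ == "__main__":
    for row in compose(techniques, intents):
        print(row)
```

Including a "plain" technique gives a baseline for each intent, so a report could show cases where the target behaves correctly on the plain request and fails only once a role-play wrapper is applied.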

@erickgalinkin
Collaborator

I'm OK with merging this. We can figure out taxonomy and stuff as we move forward.

@leondz leondz changed the title from "probe: doctor attack" to "probe: doctor attack + encoding/Leet" Jul 3, 2025
@leondz leondz merged commit 08b8332 into NVIDIA:main Jul 3, 2025
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jul 3, 2025
@leondz leondz deleted the probe/doctor branch July 3, 2025 05:39
Labels

new plugin: Describes an entirely new probe, detector, generator or harness
probes: Content & activity of LLM probes

4 participants