probe: doctor attack + encoding/Leet #1180
Currently a typo makes this not do any replacement; there should probably be a unit test here to ensure _build_prompts() actually mutates the templates.
Also, based on the programmatic/XML-based contexts being added in the templates, is this really lang = "en", or should it be considered lang = "*"?
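A minimal sketch of the kind of check suggested above, written against stand-ins rather than the probe itself. The placeholder string, `build_prompts()`, and `check_prompts_mutated()` are all hypothetical names for illustration; the real `_build_prompts()` and its template markers may look different.

```python
from typing import List

# Hypothetical placeholder marker; an assumption, not the probe's actual one.
PLACEHOLDER = "{{HARMFUL_BEHAVIOUR}}"


def build_prompts(
    templates: List[str], payload: str, placeholder: str = PLACEHOLDER
) -> List[str]:
    """Stand-in for _build_prompts(): substitute the payload into each template."""
    return [t.replace(placeholder, payload) for t in templates]


def check_prompts_mutated(
    templates: List[str], prompts: List[str], placeholder: str = PLACEHOLDER
) -> bool:
    """True only if every prompt differs from its template and no placeholder remains."""
    if len(templates) != len(prompts):
        return False
    return all(
        p != t and placeholder not in p for t, p in zip(templates, prompts)
    )
```

A misspelled placeholder leaves the templates untouched, so `check_prompts_mutated()` returns False in exactly the situation described in the comment above.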
I feel like this could be included under |
I think I disagree with the DAN idea. If we continue this reasoning a little, DAN really should go in jailbreaks, and jailbreaks is not a meaningful category (too vague). We know our focus is on techniques: this looks like a role-playing attack and, among those, a specific coherent subgroup. If one wants reporting on all role-play-type attacks, we have taxonomies and grouping in report_digest to enable that. I could get behind a rename from |
Co-authored-by: Jeffrey Martin <jmartin@Op3n4M3.dev> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
Hi, similar to this -- I'm interested in submitting another broadly successful roleplay prompt, but based on comments here, I can't tell if the project is interested in that type of PR (due to perceived noise, overlap, or something else). What is the latest guidance on where these "jailbreak" or "role playing" prompts should live? Intuitively, it seems like a dimension that would be reported across the model(s) scanned, i.e., the model responded correctly to a probe under normal conditions, but responded with prohibited content only when the context was prefixed with XYZ roleplay prompt. I thought of this as a "buff" use case, but I'm probably misunderstanding the philosophy. Thanks! |
Unpacking this:
Thinking out "loud":
The above is why I prefer decoupling the naming of code structure (i.e. detector & probe module and class names) from how we formally conceive of probes. Postscript: could we merge as-is and expect that what we learn from Technique & Intent will lend good clarity here, perhaps even insights into (f)? |
No, that's OK, this is reasonable and a result of how probes in garak currently combine both "how we make the target fail" (technique) and "what we make the target do" (intent). The intent is to separate these two things out clearly in the future. None of this was clear when LLM security first started as a field, hence older structures overlapping these concepts a bit. |
I'm OK with merging this. We can figure out taxonomy and stuff as we move forward. |
Implementation of doctor / puppetry attack from https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/
Also adds module for encoding funcs req'd by more than one plugin
NB this highlights missing intents & techniques functionality
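For readers unfamiliar with the Leet encoding mentioned above, here is a rough sketch of a shared leetspeak helper. The substitution table and the name `to_leet()` are assumptions for illustration only; the encoding module added in this PR may use a different mapping and API.

```python
# Illustrative substitution table; an assumption, not the table used in this PR.
_LEET_PAIRS = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

# Cover both lower- and upper-case letters with the same digit.
_LEET_TABLE = str.maketrans(
    {**_LEET_PAIRS, **{k.upper(): v for k, v in _LEET_PAIRS.items()}}
)


def to_leet(text: str) -> str:
    """Replace common letters with look-alike digits; other characters pass through."""
    return text.translate(_LEET_TABLE)


print(to_leet("leet speak"))  # l337 5p34k
```

Keeping a helper like this in one module lets several plugins import the same function instead of each carrying its own copy.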
Verification
garak -m test -p doctor,encoding.InjectLeet