update refusal prompt #1083

Merged
leondz merged 1 commit into NVIDIA:main from katherine-luna:kluna_refusal
Jun 27, 2025

Conversation

@katherine-luna
Contributor

katherine-luna commented Jan 16, 2025

I swapped out the existing refusal prompt for a new one. To assess the quality of the new prompt, I took a list of known prompts and outputs that should be refused and compared the results from the current prompt against those from the proposed new prompt.
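The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the script used for this PR: `agreement_rate`, `compare_prompts`, and `toy_judge` are hypothetical names, the cases are placeholders, and a real run would replace `toy_judge` with a call to the judge model.

```python
from typing import Callable, Dict, List, Tuple

# Each case is (attack_prompt, model_output, should_be_flagged).
Case = Tuple[str, str, bool]

# A judge takes (system_prompt, attack_prompt, model_output) and returns True
# when it flags the output as something that should have been refused.
Judge = Callable[[str, str, str], bool]


def agreement_rate(judge: Judge, system_prompt: str, cases: List[Case]) -> float:
    """Fraction of cases where the judge's verdict matches the known label."""
    if not cases:
        raise ValueError("need at least one labelled case")
    hits = sum(
        judge(system_prompt, prompt, output) == expected
        for prompt, output, expected in cases
    )
    return hits / len(cases)


def compare_prompts(
    judge: Judge, current_prompt: str, proposed_prompt: str, cases: List[Case]
) -> Dict[str, float]:
    """Score both system prompts against the same labelled cases."""
    return {
        "current": agreement_rate(judge, current_prompt, cases),
        "proposed": agreement_rate(judge, proposed_prompt, cases),
    }


def toy_judge(system_prompt: str, prompt: str, output: str) -> bool:
    """Stand-in judge for demonstration only; a real run would call an LLM."""
    # Pretend that a system prompt mentioning "examples" makes the judge stricter.
    if "examples" in system_prompt:
        return "sure" in output.lower()
    return False


# Placeholder data, not the list used in this PR.
cases = [
    ("placeholder attack 1", "Sure, here is how...", True),
    ("placeholder attack 2", "I can't help with that.", False),
]
scores = compare_prompts(toy_judge, "old prompt", "new prompt with examples", cases)
print(scores)
```

Keeping the judge as a plain callable means the same labelled cases can be replayed against any number of candidate prompts without changing the scoring code.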

In particular, the two key things I added were examples for ratings and the categories of safety concerns. The categories are the same ones used in Aegis 2.0.
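As a rough illustration of how rating examples and safety categories might be combined into a judge system prompt, here is a minimal sketch. It is not the prompt text from this PR: `build_judge_prompt` is a hypothetical helper, the 1 to 10 scale is an assumption, and the category and example strings are placeholders rather than the actual Aegis 2.0 categories.

```python
from typing import List, Tuple


def build_judge_prompt(
    categories: List[str],
    rated_examples: List[Tuple[str, int]],
    scale: Tuple[int, int] = (1, 10),  # assumed scale, not confirmed by the PR
) -> str:
    """Assemble a judge system prompt from categories and rated examples."""
    low, high = scale
    for _, rating in rated_examples:
        if not low <= rating <= high:
            raise ValueError(f"rating {rating} is outside {low}..{high}")
    lines = [
        f"Rate the response from {low} (fully refused) to {high} (fully complied).",
        "Safety categories to consider:",
    ]
    lines += [f"- {category}" for category in categories]
    lines.append("Rating examples:")
    lines += [f'- "{text}" -> {rating}' for text, rating in rated_examples]
    return "\n".join(lines)


# Placeholder values for illustration only.
prompt_text = build_judge_prompt(
    categories=["placeholder category A", "placeholder category B"],
    rated_examples=[("I can't help with that.", 1), ("Sure, here is how...", 10)],
)
print(prompt_text)
```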

Verification

  • Supporting configuration such as generator configuration file
{
    "huggingface": {
        "torch_type": "float32"
    }
}
  • Run the tests and ensure they pass: python -m pytest tests/
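The generator configuration listed above can be sanity-checked and written to a file before a run. The sketch below only uses the Python standard library; the file name `generator_options.json` is a hypothetical choice, and the keys are copied from the PR description as-is without verifying them against garak.

```python
import json
import tempfile
from pathlib import Path

# Configuration copied verbatim from the PR description.
config_text = """
{
    "huggingface": {
        "torch_type": "float32"
    }
}
"""

# json.loads raises json.JSONDecodeError if the text is not valid JSON.
options = json.loads(config_text)

# Write to a temporary directory; the file name is an arbitrary choice.
option_file = Path(tempfile.mkdtemp()) / "generator_options.json"
option_file.write_text(json.dumps(options, indent=4))
print(option_file)
```

garak's command line documents an option for passing a JSON generator options file; check `python -m garak --help` for the exact flag name in the installed version.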

@github-actions
Contributor

github-actions bot commented Jan 16, 2025

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

@katherine-luna
Contributor Author

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or

(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or

(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.

(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.

@katherine-luna
Contributor Author

I have read the DCO Document and I hereby sign the DCO

@katherine-luna
Contributor Author

recheck

github-actions bot added a commit that referenced this pull request Jan 16, 2025
leondz self-assigned this Jan 17, 2025
@leondz
Collaborator

leondz commented Jan 27, 2025

thanks a lot for this - it is in the queue and we're looking forward to integrating as soon as we can!

erickgalinkin pushed a commit to erickgalinkin/garak that referenced this pull request Jan 29, 2025
leondz self-requested a review February 19, 2025 05:37
@leondz
Collaborator

leondz commented Feb 19, 2025

@katherine-luna Can I ask - is this change targeting a specific detector?

@jmartin-tech This contribution gets better results for jailbreak behaviour detection. It looks like garak/resources/red_team/system_prompts.py is only used by TAP. I think we should probably have a conversation about how we store and manage LLM-as-a-judge system prompts, since it's unclear to me where best to put this right now. I would like to be able to land it as, e.g., a resource used by llmaaj together with the dan probe family.

leondz merged commit 5657340 into NVIDIA:main Jun 27, 2025
9 checks passed
github-actions bot locked and limited conversation to collaborators Jun 27, 2025