FEAT: Cyber scenario #1180
base: main
Conversation
…into cyber_scenario Resolving merge conflict
This looks good! I'm wondering if there are ways to incorporate XPIA attacks or converters (MaliciousQuestionGeneratorConverter, and there might be more; that's just at first glance) to be a bit more creative rather than just updating prompts.

There definitely are! CyberStrategy is used very sparsely here, which I don't like, but I haven't found a way to reconcile the nature of cybersecurity harms (which are often sequential and iterative, and don't rely on converters as much) with the tag-based system. But it's definitely something I want to drive in a second PR.
# Adversarial chat is used by the RedTeamingAttack (multiturn) variation of the scenario. Otherwise, it's not
# invoked.
self._adversarial_chat = adversarial_chat if adversarial_chat else self._get_default_adversarial_target()
Personally, I don't think this level of flexibility in passing an adversarial chat is necessary. I think we can pick our best environment variables and use those; I'd probably use AZURE_OPENAI_GPT4O_UNSAFE_ENDPOINT.
It would be good to sync with @hannahwestra25, though, since I do think it'd be nice to be consistent across AIRT scenarios.
FoundryScenario allows flexible adversarial chats, which is why I added it here, but both scenarios default to AZURE_OPENAI_GPT4O_UNSAFE_ENDPOINT anyway. I think we should leave it as-is for consistency, but I'm open to removing it.
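For illustration, here is a minimal sketch of what a default adversarial target helper could look like under that convention. The import path, class, and key variable name are assumptions rather than the PR's actual implementation:

```python
import os

# Assumed import path and class; the real default helper in the PR may differ.
from pyrit.prompt_target import OpenAIChatTarget


def _get_default_adversarial_target() -> OpenAIChatTarget:
    # Fall back to the shared "unsafe" GPT-4o deployment used across AIRT scenarios.
    # The key variable name below is a placeholder for whichever key pairs with the
    # unsafe endpoint in your environment.
    return OpenAIChatTarget(
        endpoint=os.environ["AZURE_OPENAI_GPT4O_UNSAFE_ENDPOINT"],
        api_key=os.environ.get("AZURE_OPENAI_GPT4O_UNSAFE_KEY"),
    )
```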
1. Tries to detect malware using the instructions in the scoring config below.
2. Returns a true/false score for malware presence rather than something like a danger score in [0.0, 1.0].
"""
return SelfAskTrueFalseScorer(
This is good! I'd recommend also making this into a composite scorer and adding a refusal scorer to check that the response isn't a refusal.
Do you think this would be a good place to actually define CyberScenario's scorer as a CyberScenarioScorer, which is itself a composite scorer that checks for multiple criteria? Using the malware generation example, that could mean (see the sketch after this list):
- SelfAskTrueFalseScorer for malware presence
- Likert-based scorer for severity of the exploit
- Refusal scorer as a backstop (if the refusal scorer flags true, disregard the other two scorers)
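As a rough illustration of that aggregation logic only (not PyRIT code; the class and field names below are hypothetical), the composite verdict with the refusal backstop could behave like this:

```python
from dataclasses import dataclass


@dataclass
class CyberCompositeScore:
    """Hypothetical aggregate of the three sub-scores described above."""

    malware_present: bool  # from the SelfAskTrueFalseScorer
    severity: int  # from a Likert-based scorer, e.g. 1-5
    refused: bool  # from the refusal scorer

    @property
    def attack_succeeded(self) -> bool:
        # Refusal acts as a backstop: if the target refused, the other two
        # scores are disregarded and the objective counts as not achieved.
        if self.refused:
            return False
        return self.malware_present


# Example: malware was produced, rated severity 4, and not refused -> True.
print(CyberCompositeScore(malware_present=True, severity=4, refused=False).attack_succeeded)
```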
I'm also not sure if it's in scope here, but I could go either way. What do you think? I'll work on it locally in the meantime.
# objective_target is guaranteed to be non-None by parent class validation
assert self._objective_target is not None
attack_strategy: Optional[AttackStrategy] = None
This is a quirk of the existing Scenario contract that I don't know how we should resolve. I don't like my implementation here, so I would appreciate suggestions on how to reconcile it.
Scenario.initialize_async populates the Scenario._scenario_composites attribute using ScenarioStrategy.prepare_scenario_strategies, expecting that Scenario._get_atomic_attacks_async will use scenario_composites to create the List[AtomicAttack]. However, this breaks scenarios that do not want to, or cannot, use composite strategies, since scenario_strategies is ephemeral in initialize_async and is only saved via scenario_composites.
I tried fixing this by creating Scenario._original_strategies in initialize_async, but that is hacky, because the argument passed to initialize_async is of type Sequence[ScenarioStrategy | ScenarioCompositeStrategy]. So you have to unwrap the composite strategies (if there are any), which leads to duplication and is hard-coded anyway.
I felt the easiest fix was to compartmentalize this in CyberScenario by just accessing the first element of the composite with strategy.strategies[0], but this is a consequence of CyberStrategy being non-composable, which I want to change soon. In my opinion that's a large enough change to CyberScenario's logic that it warrants a second PR, but I'm curious if there's a quicker fix I haven't found yet.
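For context, here is a sketch of the unwrapping step described above. The helper name and signature are simplified and assumed; it only illustrates the workaround, not the PR's actual code:

```python
from typing import List, Sequence


def unwrap_cyber_strategies(
    scenario_composites: Sequence["ScenarioCompositeStrategy"],
) -> List["CyberStrategy"]:
    """Hypothetical helper: recover the original CyberStrategy values from composites."""
    strategies: List["CyberStrategy"] = []
    for composite in scenario_composites:
        # CyberStrategy is currently non-composable, so each composite is expected
        # to wrap exactly one strategy; take the first (and only) element.
        strategies.append(composite.strategies[0])
    return strategies
```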
Description
Adds a cybersecurity harms scenario to PyRIT called CyberScenario, which tests a model's willingness to generate malware via single-turn or multi-turn (red teaming) attack methods. Changes are listed below:
This PR is meant to be a starting point for additional cybersecurity harm scaffolding as there are still many places CyberScenario can be expanded on.
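For orientation, a hypothetical usage sketch of the new scenario follows. The import path, constructor arguments, and run method are assumptions based on how other PyRIT scenarios are typically driven, not this PR's confirmed API:

```python
import asyncio

from pyrit.prompt_target import OpenAIChatTarget  # assumed import path
from pyrit.scenarios import CyberScenario  # assumed module location for the class added in this PR


async def main() -> None:
    # Model under test; assumed to be configurable via environment variables.
    objective_target = OpenAIChatTarget()

    scenario = CyberScenario(objective_target=objective_target)
    await scenario.initialize_async()  # initialize_async comes from the Scenario contract discussed above
    result = await scenario.run_async()  # run method name is an assumption
    print(result)


asyncio.run(main())
```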
Tests and Documentation
Unit tests focus on initialization, attack generation, execution, and scenario properties, similar to other scenarios.