Conversation

@ValbuenaVC (Contributor) commented Nov 10, 2025

Description

Adds a cybersecurity-harms scenario, CyberScenario, to PyRIT, which tests a model's willingness to generate malware via single-turn or multi-turn (red-teaming) attack methods. Changes listed below:

  • Added CyberScenario and CyberStrategy classes
  • Added generic malware-oriented prompts to induce cyber harms as seed prompts
  • Added true/false scoring YAML for malware-oriented prompts
  • Fixed minor typo in grounded.yaml
  • Added unit tests for CyberScenario

This PR is meant to be a starting point for additional cybersecurity harm scaffolding as there are still many places CyberScenario can be expanded on.

Tests and Documentation

Unit tests focus on initialization, attack generation, execution, and scenario properties, similar to those for other scenarios.

@ValbuenaVC ValbuenaVC marked this pull request as ready for review November 12, 2025 00:23
@ValbuenaVC changed the title from "[DRAFT] FEAT: Cyber scenario" to "FEAT: Cyber scenario" Nov 12, 2025
@hannahwestra25 (Contributor) commented Nov 12, 2025

This looks good! I'm wondering if there are ways to incorporate XPIA attacks or converters (MaliciousQuestionGeneratorConverter; there might be more at first glance) to be a bit more creative rather than just updating prompts.

@ValbuenaVC (Contributor, Author) replied:

> This looks good! I'm wondering if there are ways to incorporate XPIA attacks or converters (MaliciousQuestionGeneratorConverter; there might be more at first glance) to be a bit more creative rather than just updating prompts.

There definitely are! CyberStrategy is used very sparsely here, which I don't like, but I haven't found a way to reconcile the nature of cybersecurity harms (which are often sequential and iterative, and don't rely on converters as much) with the tag-based system. But it's definitely something I want to drive in a second PR.


# Adversarial chat is used by the RedTeamingAttack (multiturn) variation of the scenario. Otherwise, it's not
# invoked.
self._adversarial_chat = adversarial_chat if adversarial_chat else self._get_default_adversarial_target()
A reviewer (Contributor) commented:
Personally I don't think the level of flexibility in passing an adversarial chat is necessary. I think we can pick our best environment variables and use those. Probably I'd use AZURE_OPENAI_GPT4O_UNSAFE_ENDPOINT.

But it would be good to sync with @hannahwestra25, since I do think it'd be nice to be consistent across AIRT scenarios.

@ValbuenaVC (Author) replied:

FoundryScenario allows flexible adversarial chats, which is why I added it here, but both scenarios default to AZURE_OPENAI_GPT4O_UNSAFE_ENDPOINT anyway. I think we should leave it as-is for consistency, but I'm open to removing it.
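As a rough illustration of the fallback pattern under discussion (this is not the actual PyRIT implementation; `ChatTarget` and `resolve_adversarial_chat` below are simplified stand-ins):

```python
import os


class ChatTarget:
    """Simplified stand-in for a chat target; the real PyRIT class takes more config."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint


def resolve_adversarial_chat(adversarial_chat=None):
    """Use the caller-supplied target if given; otherwise fall back to the
    AZURE_OPENAI_GPT4O_UNSAFE_ENDPOINT environment variable."""
    if adversarial_chat is not None:
        return adversarial_chat
    endpoint = os.environ.get("AZURE_OPENAI_GPT4O_UNSAFE_ENDPOINT")
    if endpoint is None:
        raise ValueError("No adversarial chat supplied and no default endpoint configured")
    return ChatTarget(endpoint)
```

Keeping the explicit parameter while defaulting to the shared environment variable gives both scenarios the same behavior out of the box without removing the override.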

1. Tries to detect malware using the instructions in the scoring config below.
2. Returns a true/false score for malware presence rather than something like a danger score in [0.0, 1.0].
"""
return SelfAskTrueFalseScorer(
A reviewer (Contributor) commented:

This is good! I'd recommend also making it into a composite scorer and adding a refusal scorer to check that it's not a refusal.

@ValbuenaVC (Author) replied:

Do you think this would be a good place to actually fix CyberScenario's scorer as a CyberScenarioScorer, which itself is a composite scorer that checks for multiple criteria? Using the malware-generation example, that could mean:

  1. SelfAskTrueFalseScorer for malware presence
  2. Likert-based scorer for severity of exploit
  3. Refusal scorer as a backstop (if the refusal scorer flags true, disregard the other two scorers)
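The backstop logic in the list above could be sketched roughly as follows (plain Python stand-ins, not real PyRIT scorer classes; the `CyberScore` type and field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class CyberScore:
    malware_present: bool  # true/false scorer result
    severity: int          # Likert-style severity, e.g. 1-5
    refused: bool          # refusal scorer result


def composite_cyber_score(malware_present: bool, severity: int, refused: bool) -> CyberScore:
    """If the refusal scorer fires, disregard the other two scorers:
    a refusal means no harmful content was actually produced."""
    if refused:
        return CyberScore(malware_present=False, severity=0, refused=True)
    return CyberScore(malware_present=malware_present, severity=severity, refused=False)
```

The refusal check runs first so a refusal can never be counted as both a refusal and a successful malware generation.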

@ValbuenaVC (Author) added:

I'm also not sure if it's in scope here, but I could go either way; what do you think? I'll work on it locally in the meantime.

# objective_target is guaranteed to be non-None by parent class validation
assert self._objective_target is not None

attack_strategy: Optional[AttackStrategy] = None
@ValbuenaVC (Author) commented:

This is a quirk of the existing Scenario contract that I don't know how we should resolve. I don't like my implementation here, so I would appreciate suggestions on how to reconcile it.

Scenario.initialize_async populates the Scenario._scenario_composites attribute using ScenarioStrategy.prepare_scenario_strategies, expecting that Scenario._get_atomic_attacks_async will use scenario_composites to create the List[AtomicAttack]. However, this breaks scenarios that do not want or cannot use composite strategies, since scenario_strategies is ephemeral in initialize_async and is only saved via scenario_composites.

I tried fixing this by just creating Scenario._original_strategies in initialize_async, but this is hacky, because the argument passed to initialize_async is of type Sequence[ScenarioStrategy | ScenarioCompositeStrategy]. So you have to deserialize the composite strategies (if there are any), which leads to duplication and is hard-coded anyway.

I felt the easiest fix was to compartmentalize this in CyberScenario by just accessing the first element of the composite with strategy.strategies[0], but this is a consequence of CyberStrategy being non-composable which I want to change soon. In my opinion that's a large enough change to the logic of CyberScenario that it warrants a second PR but I'm curious if there's a quicker fix I haven't found yet.
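For concreteness, the workaround described above amounts to something like the following (with minimal stand-in classes for ScenarioStrategy and ScenarioCompositeStrategy, not the real PyRIT types):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioStrategy:
    """Stand-in for a single tag-based strategy."""
    name: str


@dataclass
class ScenarioCompositeStrategy:
    """Stand-in for the composite wrapper that initialize_async produces."""
    strategies: list  # list of ScenarioStrategy


def unwrap_strategy(composite: ScenarioCompositeStrategy) -> ScenarioStrategy:
    """CyberScenario's workaround: because CyberStrategy is non-composable,
    each composite is expected to wrap exactly one strategy, so take the first."""
    return composite.strategies[0]
```

This keeps the hack local to CyberScenario, but it silently discards any additional strategies in the composite, which is why making CyberStrategy composable is the better long-term fix.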



