
Conversation

@hannahwestra25 (Contributor) commented Nov 6, 2025

Description

Add rapid response harm scenario which tests several different strategies for each harm category. The idea is to have a quick, comprehensive scenario to run before drilling down into more specific strategies.

Tests and Documentation

Added rapid response notebook plus instructions for dataset naming.
Added unit tests

hannahwestra25 changed the title from "[DRAFT] rapid response harm scenario" to "Rapid response harm scenario" on Nov 7, 2025

# Hate speech datasets

hate_stories = await create_seed_dataset(
@rlundeen2 (Contributor) commented Nov 8, 2025:

I think we should manage a few of these, even if the list is incomplete. So instead of having strings in the notebooks, I'd put these in datasets/seed_prompts/ai_rt and maybe one file per category.

Eventually it might be nice to have a single function call that can load all our YAML seed prompts into the database, so folks can use those as examples.
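Not part of this PR, but a stdlib-only sketch of what the discovery half of such a helper could look like (the folder layout and function name are hypothetical, not existing PyRIT API):

```python
from pathlib import Path


def discover_seed_prompt_files(root: str) -> dict[str, list[Path]]:
    """Group YAML seed-prompt files under `root` by their category subfolder.

    Hypothetical layout, one folder per harm category:
        datasets/seed_prompts/ai_rt/hate/*.yaml
        datasets/seed_prompts/ai_rt/violence/*.yaml
    """
    grouped: dict[str, list[Path]] = {}
    for path in sorted(Path(root).rglob("*.yaml")):
        # The parent folder name serves as the harm category key.
        grouped.setdefault(path.parent.name, []).append(path)
    return grouped
```

Each category's files could then be parsed and inserted into memory in one pass.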

Contributor:

I would go even further and say we should provide a truly end-to-end solution here that gives some results even if the customer doesn't bring their own datasets. Of course, the conundrum is that we may not be able to share the exact datasets we're using, but maybe it's something we should actually strive for.

Btw I'll keep fighting against the ai_rt naming for external assets 😆

Contributor:

I agree, but I think it's safe to require an upload, which could even be done as part of initialization. I think the dataset question can be tackled independently of this PR. Although for this one it'd be nice to include some sample datasets that we can later add to the db easily

E.g., the workflow for an external user:

  1. memory.add_dataset(redteam) # not part of this PR
  2. Run the scenario

"ScenarioStrategy",
"ScenarioIdentifier",
"ScenarioResult",
"RapidResponseHarmScenario",
Contributor:

I'm thinking we might want a whole import line here? For example

from pyrit.scenarios.ai_rt import RapidResponseHarmScenario

But we may need some __init__.py shenanigans. IDK, what do you think?

hannahwestra25 (Author):

You mean importing from the ai_rt folder rather than the file itself?

Each harm category has a few different strategies to test different aspects of the harm type.
"""

ALL = ("all", {"all"})
@rlundeen2 (Contributor) commented Nov 8, 2025:

One idea is to only have the meta-categories. I think it may make the most sense to just have hate, fairness, violence, ..., leakage rather than each individual scenario_strategy.

Contributor:

I think the composition makes the code quite a bit more complicated, and I would guess most users will either just want to use "all" or a subset of the categories

@rlundeen2 (Contributor) commented Nov 8, 2025:

In other words, I think it should look like the following (and that's it)

class RapidResponseHarmStrategy(ScenarioStrategy):
    """
    RapidResponseHarmStrategy defines a set of strategies for testing model behavior
    in several different harm categories.

    Each harm category has a few different strategies to test different aspects of the harm type.
    """

    ALL = ("all", {"all"})
    HATE = ("hate", set[str]())
    FAIRNESS = ("fairness", set[str]())
    VIOLENCE = ("violence", set[str]())
    SEXUAL = ("sexual", set[str]())
    HARASSMENT = ("harassment", set[str]())
    MISINFORMATION = ("misinformation", set[str]())
    LEAKAGE = ("leakage", set[str]())

Alternatively, if you do want long- and short-running versions (which I also think is legit!), I might split it up like this, where the complex attacks contain the long-running methods. But my gut is that it might just be simpler to have a completely separate scenario class for those:

    ALL = ("all", {"all"})
    HATE_QUICK = ("hate_quick", {"quick", "hate"})
    HATE_EXTENDED = ("hate_extended", {"complex", "hate"})
    FAIRNESS_QUICK = ("fairness_quick", {"quick", "fairness"})
    ...

Either way, I'd keep specific techniques, tests, and datasets out of the strategy definitions.
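To make the tag mechanics concrete, here's a self-contained illustration using a plain Python Enum (this is not the actual ScenarioStrategy base class; all names are hypothetical):

```python
from enum import Enum


class HarmStrategy(Enum):
    """Each member carries a label and a frozenset of tags, mirroring the (value, tags) tuples above."""

    ALL = ("all", frozenset({"all"}))
    HATE = ("hate", frozenset())
    FAIRNESS = ("fairness", frozenset())
    VIOLENCE = ("violence", frozenset())

    def __init__(self, label: str, tags: frozenset):
        # Enum unpacks the tuple value into these attributes per member.
        self.label = label
        self.tags = tags


def expand(requested: list["HarmStrategy"]) -> list["HarmStrategy"]:
    """Expand the 'all' meta-strategy into every concrete harm category."""
    concrete = [s for s in HarmStrategy if s is not HarmStrategy.ALL]
    expanded: list[HarmStrategy] = []
    for s in requested:
        expanded.extend(concrete if s is HarmStrategy.ALL else [s])
    return expanded
```

With only meta-categories, "all" expands to the full category list and everything else passes through unchanged.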

# Extract RapidResponseHarmStrategy enums from the composite
strategy_list = [s for s in composite_strategy.strategies if isinstance(s, RapidResponseHarmStrategy)]

# Determine the attack type based on the strategy tags
@rlundeen2 (Contributor) commented Nov 8, 2025:

I think we should make this decision in advance if we can (and if it's what operators want).

Say we get the strategy "Hate". Maybe we could do something like pick a set of strategies for hate that we want. Something like PromptSending for baseline, MultiTurn, and RolePlaying. But I could also see specific attacks/converters being created for different categories, so it might make sense to split it up this way too.

if strategy.value == "hate":
    seed_groups = memory.get_seed_groups(dataset_name="ai_rt_rapid_response_1", harm_category="hate")
elif strategy.value == "violence":
    ....

# Now we have the seed groups. Do we do the same attacks with every category, or are they different?
# My guess might be they're the same?
# And can we decide?
# My guess would be they're the same strategies but different objectives.

attack1 = PromptSendingAttack(
    objective_target=self._objective_target,
    attack_converter_config=attack_converter_config,
    attack_scoring_config=self._scorer_config,
)
attack2 = ....

# Then append all of these atomic attacks in the same spot. E.g. you can have more than
# one "hate" attack and they will be grouped together.

atomic_attacks.append(
    AtomicAttack(
        atomic_attack_name="hate", attack=attack1, objectives=hate_objectives, seed_groups=hate_seed_groups
    )
)

atomic_attacks.append(
    AtomicAttack(
        atomic_attack_name="hate", attack=attack2, objectives=hate_objectives, seed_groups=hate_seed_groups
    )
)

attack_type: type[AttackStrategy] = PromptSendingAttack
if attack_tag:
    if attack_tag[0] == RapidResponseHarmStrategy.Crescendo:
        attack_type = CrescendoAttack
Contributor:

One attack you might be thinking about is Crescendo. But because that takes so much longer to run, we might consider a different rapid response scenario for it. And/or for this one, we could pre-compute successes so it runs really fast (e.g. similar to our second cookbook).

hannahwestra25 (Author):

Yes, after talking with Frederic and considering the purpose of this scenario, I think it makes sense to exclude Crescendo. It's used in only one of the ~20 attacks in the notebook Frederic created that inspired this scenario. IMO Crescendo could be considered a more in-depth analysis of a harm, while this scenario is supposed to be a higher-level, initial analysis.

@rlundeen2 (Contributor) commented Nov 8, 2025

Overall this is good! It'll be really nice to have solid examples here :)

My biggest feedback is that I think we should define exactly what we want out of this scenario. Here is what I think it is. "Can I get a vibe of this objective_target in a couple hours based on how it does on these harm categories".

And if we keep that strategy, we want to do the best we can to answer that question, and the strategies themselves should be baked in as much as possible. Along these lines, I'd recommend:

  1. Simplify the strategies. I suspect most users just want to run "all" to get a vibe check, or to run specific harm categories. And if there is a strategy they want but it takes a long time (like Crescendo), maybe we should split that off into a separate longer-running scenario class.
  2. Choose the attacks to do with those strategies explicitly (which converters and attacks to use). E.g. we can get the objectives from memory, and then this scenario can decide how we send those. I wouldn't make this configurable, because it adds another dimension to things.

romanlutz changed the title from "Rapid response harm scenario" to "FEAT Rapid response harm scenario" on Nov 8, 2025
@fdubut (Contributor) left a comment:

Adding a few comments, mainly on structure and naming. I'll try to run my notebook shortly "as a scenario" to get a better sense of how this all works, and will share more feedback then.



model_name=""
)

# Define the helper adversarial target
Contributor:

Given the nature of the scenario (returning aggregate results on a variety of test cases), I'm wondering if we should give customers the option to skip all test cases that require an adversarial target if they don't have one available. I think a lot of attacks that would succeed with a true adversarial target will fail with a regular model, skewing the final results.

# %%
# Load the datasets into memory

violence_civic_data = await create_seed_dataset(
Contributor:

In the original notebook, the prompts are sequential (passed using multi-prompt attack). I haven't looked yet at the actual scenario definition but wanted to point that out.

hannahwestra25 (Author):

Mentioned offline, but there's an issue with Multi Prompt that basically makes it error out, so for now I'm using the red teaming attack. For this PR I'm going to keep RedTeaming, and once we work through that issue I can update this scenario. (I like the idea of keeping this simple, and multi-prompt is a simpler multi-turn attack, so I'd prefer to use it.)

*,
objective_target: PromptTarget,
scenario_strategies: Sequence[RapidResponseHarmStrategy | ScenarioCompositeStrategy] | None = None,
adversarial_chat: Optional[PromptChatTarget] = None,
Contributor:

Similar to what I mentioned in my comment on the notebook, I'm wondering if we should exclude from the scenario the attacks that require an adversarial chat when none is passed.

hannahwestra25 (Author):

I think we can set a default; this is what the Foundry scenario does.

Comment on lines 111 to 113
scenario_strategies (Sequence[RapidResponseHarmStrategy | ScenarioCompositeStrategy] | None):
The harm strategies or composite strategies to include in this scenario. If None,
defaults to RapidResponseHarmStrategy.ALL.
Contributor:

Will a user be able to compose a multi-turn scenario strategy like Crescendo, or are we just sticking with single-turn/multi-prompt sending attacks?

hannahwestra25 (Author):

Currently, the default behavior is to run PromptSending (the baseline), RolePlaying, RedTeaming, and ManyShot. I'm on the fence about having basic & extended versions (basic would maybe just run PromptSending and RedTeaming, vs. extended, which would run them all). My reservation is that I don't know how much value the scenario has when the basic version is run because, as the name suggests, it's pretty basic. Wdyt?
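For what it's worth, the basic/extended split could stay tiny; a hypothetical sketch of the selection logic (attack names here are just labels, not the real attack classes):

```python
# Tag each attack with the cheapest tier that includes it. Labels are illustrative.
ATTACK_TIERS: dict[str, str] = {
    "prompt_sending": "basic",
    "red_teaming": "basic",
    "role_playing": "extended",
    "many_shot": "extended",
}


def select_attacks(tier: str = "extended") -> list[str]:
    """Return attacks to run: 'basic' runs the quick subset, 'extended' runs everything."""
    if tier == "extended":
        return list(ATTACK_TIERS)
    return [name for name, t in ATTACK_TIERS.items() if t == tier]
```

That keeps one scenario class with a single knob, rather than two enums or two classes.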


5 participants