FEAT: Anthropic Model-Written Evals Dataset #1170

0xm00n · 2025-11-05T18:01:33Z

Description

This PR adds support for the Anthropic model-written-evals dataset to PyRIT. The model-written-evals dataset contains 154 evaluation datasets designed to test LLM behaviors across 4 main categories: persona traits, sycophancy, advanced AI risks, and gender bias. The evaluations use language models to automatically generate test cases across multiple behavioral dimensions.

Dataset: https://github.com/anthropics/evals

Associated Paper: https://arxiv.org/abs/2212.09251

Work Completed

Implemented the fetch_anthropic_evals_dataset() function in anthropic_evals_dataset.py
Added unit tests in test_anthropic_evals_dataset.py (12 test cases)
Added integration test in test_fetch_datasets.py
Updated API documentation in api.rst
Registered function in pyrit/datasets/__init__.py

Related Issue

Contributes to issue #450

[pull] main from Azure:main

- Implement fetch_sorry_bench_dataset() for sorry-bench/sorry-bench-202503 on Hugging Face - Support filtering by 44 categories and 21 prompt style linguistic mutations - Default to base prompts only - Add unit tests (10 tests) - Add integration test - Update API documentation

…dation - Change prompt_styles (list) to prompt_style (string) for simpler API - Add input validation for categories and prompt_style - Add VALID_CATEGORIES and VALID_PROMPT_STYLES for input validation - Fix exception handling to preserve stack traces - Update tests to match new API (12 tests, all passing)

Implements support for Anthropic's model-written-evals dataset with 154 evaluation datasets across 4 categories. Changes: - Add fetch_anthropic_evals_dataset() function with category filtering - Support for persona, sycophancy, advanced-ai-risk, and winogenerated categories - Fetches directly from GitHub repository using API for file discovery - Add unit tests (12 tests) - Add integration test - Update API docs

…aset

pyrit/datasets/anthropic_evals_dataset.py

pyrit/datasets/sorry_bench_dataset.py

hannahwestra25 · 2025-11-07T16:49:41Z

Thanks for adding these! made a few stylistic suggestions!

Co-authored-by: hannahwestra25 <[email protected]>

romanlutz

Nice work! Looking forward to having this dataset fetcher in PyRIT 🙂

pyrit/datasets/anthropic_evals_dataset.py

romanlutz · 2025-11-13T20:45:15Z

pyrit/datasets/anthropic_evals_dataset.py

+    - persona: 119 datasets testing personality traits, beliefs, and potentially dangerous goals
+    - advanced-ai-risk: 32 datasets (16 human + 16 LM-generated) testing AI safety concerns
+    - sycophancy: 3 datasets testing whether models repeat back user views
+    - winogenerated: 2 datasets testing gender bias in occupations


only the persona and sycophancy ones come through if you run this.

The winogenerated ones don't have a "question", but a "sentence_with_blank" that needs to have the blank replaced with the values. So those will map to multiple prompts each.

I'm not sure why the advanced-ai-risk ones got filtered (yet) but you may want to investigate why they're getting filtered.

i think i've fixed all of the issues and added tests (which all passed). however, when loading up the dataset categories, I see this for the winogenerated category:

Skipping item with empty prompt in winogenerated_occupations.jsonl Skipping item with empty prompt in winogenerated_occupations.jsonl Skipping item with empty prompt in winogenerated_occupations.jsonl Skipping item with empty prompt in winogenerated_occupations.jsonl Skipping item with empty prompt in winogenerated_occupations.jsonl Skipping item with empty prompt in winogenerated_occupations.jsonl Loaded 2990 winogenerated prompts First prompt: The precision instrument repairer charged the client $30 an hour for _ services

This is running the following below:

from pyrit.datasets import fetch_anthropic_evals_dataset dataset = fetch_anthropic_evals_dataset(category='winogenerated') print(f'Loaded {len(dataset.prompts)} winogenerated prompts') print(f'First prompt: {dataset.prompts[0].value}')

not entirely sure if theres anything wrong though (seems it loads all of the 2990). let me know your thoughts.

Will rerun tomorrow

@AdrGav941 upon further inspection of this dataset I wonder if this is better suited for Q&A?

https://github.com/anthropics/evals

Everything appears to be Q&A. In some files the parsing would be easier than in others, though. The winogenerated one has pronoun options which would be the answer options. But in some others it's just in the text itself...

Any thoughts?

@romanlutz this dataset does seem to lend itself well to the QA structure we have. Especially for winogenerated. Like you said, in winogenerated it looks to need minimal transformation to work with QA. The others would need some parsing but all look to have "choices" baked into the items themselves.

I agree that this would be a great candidate to make into QA prompts.

@romanlutz @AdrGav941 how would making it into QA prompts change the code structure now?

…rated loading has empty prompts winogenerated_occupations.jsonl???

romanlutz · 2025-11-14T14:37:58Z

@0xm00n btw I just saw Anthropic drop a new dataset: https://www.anthropic.com/news/political-even-handedness

Any chance you're interested in contributing a similar fetcher for it as well? It has two prompts per row so it would require a tiny bit of custom handling, but otherwise (more?) straightforward than this one.

0xm00n · 2025-11-15T00:25:54Z

@0xm00n btw I just saw Anthropic drop a new dataset: https://www.anthropic.com/news/political-even-handedness

Any chance you're interested in contributing a similar fetcher for it as well? It has two prompts per row so it would require a tiny bit of custom handling, but otherwise (more?) straightforward than this one.

Yup, seems pretty easy. will work on it soon

0xm00n and others added 5 commits November 2, 2025 20:10

Merge pull request #34 from Azure/main

2a44aa3

[pull] main from Azure:main

Merge branch 'Azure:main' into feat/anthropic-model-written-evals-dat…

cee6223

…aset

hannahwestra25 reviewed Nov 7, 2025

View reviewed changes

pyrit/datasets/anthropic_evals_dataset.py Show resolved Hide resolved

hannahwestra25 reviewed Nov 7, 2025

View reviewed changes

pyrit/datasets/anthropic_evals_dataset.py Outdated Show resolved Hide resolved

hannahwestra25 reviewed Nov 7, 2025

View reviewed changes

pyrit/datasets/sorry_bench_dataset.py Show resolved Hide resolved

0xm00n and others added 5 commits November 7, 2025 16:01

Update pyrit/datasets/anthropic_evals_dataset.py

3c3d27b

Co-authored-by: hannahwestra25 <[email protected]>

Update pyrit/datasets/sorry_bench_dataset.py

9f6bd77

Co-authored-by: hannahwestra25 <[email protected]>

get rid of redundant error block

246723e

Merge branch 'main' into feat/anthropic-model-written-evals-dataset

ebc3345

Merge branch 'main' into feat/anthropic-model-written-evals-dataset

7ef1b1d

romanlutz reviewed Nov 12, 2025

View reviewed changes

pyrit/datasets/anthropic_evals_dataset.py Outdated Show resolved Hide resolved

pyrit/datasets/anthropic_evals_dataset.py Outdated Show resolved Hide resolved

pyrit/datasets/anthropic_evals_dataset.py Outdated Show resolved Hide resolved

0xm00n and others added 4 commits November 12, 2025 10:52

precommit fixes

569033f

Merge branch 'main' into feat/anthropic-model-written-evals-dataset

c411244

Merge branch 'main' into feat/anthropic-model-written-evals-dataset

7f85e30

Merge branch 'main' into feat/anthropic-model-written-evals-dataset

15395bb

romanlutz approved these changes Nov 13, 2025

View reviewed changes

Merge branch 'main' into feat/anthropic-model-written-evals-dataset

8f4c44b

romanlutz reviewed Nov 13, 2025

View reviewed changes

romanlutz self-requested a review November 13, 2025 20:45

0xm00n and others added 2 commits November 13, 2025 23:25

fix category loading for winogenerated and advanced-ai-risk. winogene…

e6f3318

…rated loading has empty prompts winogenerated_occupations.jsonl???

Merge branch 'main' into feat/anthropic-model-written-evals-dataset

c9225a9

FEAT: Anthropic Model-Written Evals Dataset #1170

Are you sure you want to change the base?

FEAT: Anthropic Model-Written Evals Dataset #1170

Uh oh!

Conversation

0xm00n commented Nov 5, 2025

Description

Work Completed

Related Issue

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hannahwestra25 commented Nov 7, 2025

Uh oh!

romanlutz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

romanlutz Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

0xm00n Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

romanlutz Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

romanlutz Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

AdrGav941 Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

0xm00n Nov 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

romanlutz commented Nov 14, 2025

Uh oh!

0xm00n commented Nov 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

0xm00n Nov 15, 2025 •

edited

Loading