
reporting: tier biased security aggregate#1329

Merged
leondz merged 31 commits into NVIDIA:main from leondz:reporting/tbsa
Jan 28, 2026

Conversation


@leondz leondz commented Aug 11, 2025

adds single-scalar aggregating from garak results, biased by tier

Problem

Since garak results are spread across each probe, it is hard to evaluate whether a given model is good or bad. Reaching a conclusion requires the researcher to have a deep understanding of each probe.
After finetuning a model, there is no easy way to compare Model A vs. Model B.

Ask - The ask is NOT to turn garak into a benchmark score, but rather to provide an easy way for non-security experts to evaluate results

Desiderata

  • A single result for a garak run, for model teams
  • Quantitative, scalar score (though not necessarily in a metric space)
  • Some stability
  • Simple to understand for model and system developers
  • Simple to understand for security folk
  • Weights failures in important probes higher
  • Gives lower score to targets with higher variation in performance over probes but same averages
  • Prioritises increases in rate of severity or failure
  • Doesn't overstate precision

Proposal

Let’s use a “tier-biased security aggregate” (TBSA)

How it works

  • We could use pass/fail for every probe:detector pair, but this might not afford enough granularity
  • Each probe:detector result (both pass rate and Z-score) is graded internally in garak on a 1-5 scale (5 is great, 1 is awful) - this uses the DEFCON scale
    • Grading boundaries are determined through experience using garak for review
    • Value boundaries stored currently in garak.analyze
    • DEFCON 1 & 2 are fails, according to Tier descriptions
  • Moving from pass/fail to a 1-5 scale gives us more granularity
  • First, we aggregate each probe:detector’s scores into one. This means combining the pass rate and Z-score. To do this, we extract the DEFCON for pass rate and for Z-score, and take the minimum. Any other aggregation measure would conceal important failures.
  • Next, we group probe:detector aggregate defcons by tier
  • We calculate the aggregate (e.g. harmonic mean, interpolated lower quartile) for Tier 1 and for Tier 2 probe:detector pairs
  • We take the weighted mean of Tier 1 and Tier 2 probes; propose a 2:1 weighting here
  • Round to 1 d.p.
  • Now you have a score in the range 1.0-5.0 where higher is better. \o/
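The steps above can be sketched as follows. This is a minimal illustration, not garak's actual implementation: the function names (`pair_defcon`, `tbsa`) are hypothetical, per-pair DEFCON grades are assumed to be already extracted, and the harmonic mean is just one of the candidate aggregates mentioned above.

```python
# Sketch of the TBSA aggregation described above. Names and the choice of
# harmonic mean are illustrative assumptions, not garak's API.
from statistics import harmonic_mean


def pair_defcon(pass_rate_dc: int, z_score_dc: int) -> int:
    """Combine a probe:detector pair's two DEFCON grades (1-5, 5 best)
    by taking the minimum, so a failure on either axis is not concealed."""
    return min(pass_rate_dc, z_score_dc)


def tbsa(tier1_pairs, tier2_pairs) -> float:
    """Each argument is a list of (pass_rate_dc, z_score_dc) tuples.
    Returns a score in [1.0, 5.0], higher is better, rounded to 1 d.p."""
    t1 = harmonic_mean([pair_defcon(p, z) for p, z in tier1_pairs])
    t2 = harmonic_mean([pair_defcon(p, z) for p, z in tier2_pairs])
    # Proposed 2:1 weighting of Tier 1 over Tier 2
    return round((2 * t1 + t2) / 3, 1)


print(tbsa([(5, 4), (3, 5)], [(4, 4)]))  # -> 3.6
```

Note how the single Tier 1 failure (DEFCON 3) drags the score down through the min() and the harmonic mean, rather than being averaged away.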

NB: No garak score is stable over time. This is intended behaviour. For comparability, appropriate parts of config need to match; see #1193

Notes

The score can only be compared among identical versions of garak and identical configs. We need to give a key identifying the basis for comparison (e.g. config & version). Tracking this feature in garak issue 1193.

Because the underlying quantity is the proportion of probe detections passed/failed, we can use a sensible scale

Percentage is still a bad idea - 100% sounds like a target is fully secure, but garak can never show this. There is no intent to support this design, for safety reasons

Scores will be affected by the mixture of detectors and probes within Tier 1; later iterations, after technique/intent lands, will let us group results by strategies & impacts, allowing more meaningful & stable results

Items not in the bag may not be included, meaning an integration overhead for probe updates. One proposal: just use the absolute score, we’re doing a min() for aggregation anyway

Beside config & version, any given TBSA score is predicated on:

  • DEFCON boundaries for pass rate & Z-score (stable, pretty confident about these)
  • Composition of Tier 1 and Tier 2’s probes (flux only between releases)
  • Detectors chosen for each probe (low flux, only between releases)
  • Models & regex used for detector (low flux, only between releases)
  • External APIs used and whatever is going on there (who knows)

How this fills the goals

  • Garak TBSA is a single, scalar item
  • It’s stable for the same version and config
  • Dev teams can understand ratings 1-5, 1=bad 5=good
  • Many security folks understand DEFCON scores; many either lived through the Cold War, saw WarGames, or have some familiarity with the military
  • Tier-1 probes have high impact, Tier-3 + U probes have no impact, so we have a weighting
  • Aggregation function downweights high-variance results
  • Coarse granularity of one decimal place expresses precision appropriately
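The variance-downweighting point can be seen with a toy comparison (illustrative only, not garak code): for two sets of per-pair grades with the same arithmetic mean, the harmonic mean and the floor both score the high-variance set lower.

```python
# Toy comparison of candidate aggregates over DEFCON-style grades (1-5).
from statistics import harmonic_mean, mean

uniform = [3, 3, 3, 3]  # consistent performance, arithmetic mean 3
spread = [5, 5, 1, 1]   # same arithmetic mean, but two bad failures

assert mean(uniform) == mean(spread) == 3   # plain mean can't tell them apart
print(round(harmonic_mean(spread), 2))      # 1.67: harmonic mean penalises variance
print(min(spread))                          # 1: the floor penalises it hardest
```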

What’s the ideal solution like

We should really be balancing score components grouped by their characteristics, rather than just how many probe/detector pairs there are. Ideally we’d like to be able to group techniques and group impacts. We have a couple of typologies for these but aren’t well-informed enough. Uniform weighting seems the only reasonable choice in the absence of anything else.

Implementation

Build this as a new analysis tool.

consumes:

  • report digest object

relies on:

  • absolute scores in digest
  • relative scores in digest
  • defcons in digest
  • tier defs in digest

outputs:

  • 2s.f. / 1d.p. score [1.0,5.0]

usage:

  • python -m garak.analyze.tbsa -r <report.jsonl filepath>
  • garak.analyze.tbsa.digest_to_tbsa()

open questions/extensions:

  • choose between current garak calibration and calibration in report file
  • what if there's no calibration data available overall
  • do we fill in gaps if major (0.x) versions match
  • what if there's no calibration data in the file but file calibration was recommended
  • what if there's absolute score but no relative, for a T1 probe
  • allow use of calculated z-scores, calculated defcons, current tierdefs
  • what if absolute is lower than relative, for T2 (insufficient impact, use relative)
  • configurable aggregate function (mean, floor, first quartile, harmonic mean)
  • aggregate multiple probe assessments to just one (mean? floor? harmonic? as specified by probe?)
  • load current / custom calibration
  • how strictly should we fail? (all missing relative? any missing relative? cutoff?)
  • how to handle groups? mean of scores per group? one DC per group?
  • tests
  • should this be included in report digest (i guess), let's do that non-circularly

@leondz leondz self-assigned this Aug 11, 2025
@leondz leondz added the reporting Reporting, analysis, and other per-run result functions label Aug 11, 2025
@leondz
Copy link
Collaborator Author

leondz commented Nov 11, 2025

NB:

  • has it been updated for the latest digest keys?
  • should also be updated for argparse to be consistent with other tools in the analyze package.

@leondz leondz marked this pull request as ready for review December 9, 2025 16:42
Copy link
Collaborator

@aishwaryap aishwaryap left a comment


Didn't try running it, but logic-wise this looks good to me!


leondz commented Jan 16, 2026

added JSON output. it's looking p good rn


@leondz leondz requested a review from jmartin-tech January 16, 2026 13:35

leondz and others added 4 commits January 21, 2026 20:03
leondz and others added 6 commits January 21, 2026 20:04
@erickgalinkin erickgalinkin left a comment


Some minor phrasing nits. Otherwise looks good to me.

TBSA is a method for getting a rough single number estimating the risk posed by a target based on a garak run.

While we've done our best to represent security knowledge in this score, it's no substitute for examining the run results.
Relying on a TBSA score instead of the run report is a security risk - without exceptions. **Do not do this, do not let other people do this**.
LOVE THIS


You can also join our `Discord <https://discord.gg/uVch4puUCs>`_
and follow us on `Twitter <https://twitter.com/garak_llm>`_!
and follow us on `LinkedIn <https://www.linkedin.com/company/garakllm/>`_ & `X <https://www.twitter.com/garak_llm>`_!

We're still on X, the everything app? Should we be on Bluesky? :P


We're open to receiving followers. Open up a bluesky if you want to manage it!

leondz and others added 6 commits January 28, 2026 10:52
@leondz leondz merged commit 0d9b5de into NVIDIA:main Jan 28, 2026
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 28, 2026