Skip to content

refactor LatentInjection#1152

Merged
leondz merged 11 commits intoNVIDIA:mainfrom
leondz:update/latentinjection_refactor
Apr 26, 2025
Merged

refactor LatentInjection#1152
leondz merged 11 commits intoNVIDIA:mainfrom
leondz:update/latentinjection_refactor

Conversation

@leondz
Copy link
Collaborator

@leondz leondz commented Apr 7, 2025

The LatentInjection module had an overly complex inheritance graph.

This is updated to use mixins, as follows:

  • LatentInjectionMixin - retained; used in all probes; adds tags, detector, methods for assembling prompts & triggers
  • NonFullMixin - used to map Full probes to standard (i.e. lightweight) versions
  • TranslationMixin - templates and assembly for translation-based injections
  • FactSnippetMixin - docs, instructions, injections, and assembly for fact snippet-based instructions

LatentWhoisSnippet took a heavy refactoring; previously inherited on LatentInjectionFactSnippetEiffelFull, now inherits LatentWhois and FactSnippetMixin

Verification

  • python -m pytest tests/probes/test_probes_latentinjection.py (this should pass solo without config fixtures being loaded)
  • python -m pytest tests/probes/test_probes.py
  • Check that prompts from each probe correspond to the intended attack
  • Check that probes have the correct active True/False depending on their size

@leondz leondz added architecture Architectural upgrades probes Content & activity of LLM probes labels Apr 7, 2025
@leondz leondz assigned jmartin-tech and unassigned jmartin-tech Apr 7, 2025
@leondz leondz requested a review from jmartin-tech April 7, 2025 09:43
Copy link
Contributor

@mrowebot mrowebot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Copy link
Collaborator

@jmartin-tech jmartin-tech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing shows this changed the prompt count for LatentWhoisSnippet, looking closer into why.

From 0.10.3.1

latentinjection.LatentWhois                                                 base.TriggerListDetector: PASS  ok on   28/  28
latentinjection.LatentWhoisSnippet                                          base.TriggerListDetector: PASS  ok on   32/  32

On main @ 55da36b

latentinjection.LatentWhois                                                 base.TriggerListDetector: PASS  ok on   28/  28
latentinjection.LatentWhoisSnippet                                          base.TriggerListDetector: PASS  ok on   32/  32

This PR branch:

latentinjection.LatentWhois                                                 base.TriggerListDetector: PASS  ok on   28/  28
latentinjection.LatentWhoisSnippet                                          base.TriggerListDetector: PASS  ok on  256/ 256

I suspect this is not an expected change.

@leondz
Copy link
Collaborator Author

leondz commented Apr 17, 2025

Thanks, good catch. I believe the randomisation logic changed for this probe to be more in line with common practice in garak, rather than predicated on generations, so I think some change is likely if that happens to be the cause.

leondz and others added 2 commits April 18, 2025 16:57
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com>
Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
@leondz leondz mentioned this pull request Apr 23, 2025
1 task
@leondz leondz force-pushed the update/latentinjection_refactor branch from b80749a to 2e119fe Compare April 24, 2025 14:00
Copy link
Collaborator

@jmartin-tech jmartin-tech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of LatentWhoisSnippet shows the permutations selected are consistent but not identical to previous prompts. The application of soft_prompt_cap as a limiter in the Full version is part of the reason in testing no identical prompts were found.

This can land as is, however I made some minor comments related to this unexpected usage of soft_prompt_cap in Full probes and that some for the application of FactSnippitMixin seems spurious which make continue to make this module difficult to maintain.



class LatentWhoisSnippet(LatentInjectionFactSnippetEiffelFull):
class LatentWhoisSnippetFull(FactSnippetMixin, LatentWhois):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional, this does not look be needed, nothing is being inherited from FactSnippetMixin.

FactSnippetMixin.injection_instructions is still accessible as written in line 602 if this is removed.

Suggested change
class LatentWhoisSnippetFull(FactSnippetMixin, LatentWhois):
class LatentWhoisSnippetFull(LatentWhois):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snippet building needs refactoring anyway after the current fix in #1181 - will update that PR and then it should be processed directly after this one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on second thoughts:

@leondz
Copy link
Collaborator Author

leondz commented Apr 26, 2025

Review of LatentWhoisSnippet shows the permutations selected are consistent but not identical to previous prompts

Sounds like expected behaviour - both are intentionally sampling rather than using whole population. Good it's consistent. Tests based on validation learnings welcome

@leondz leondz merged commit 37f684d into NVIDIA:main Apr 26, 2025
9 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 26, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

architecture Architectural upgrades probes Content & activity of LLM probes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants