fix(validation): harden llm_validator prompt isolation#2307
Open
amingclawdev wants to merge 1 commit into
Open
fix(validation): harden llm_validator prompt isolation#2307amingclawdev wants to merge 1 commit into
amingclawdev wants to merge 1 commit into
Conversation
Serialize validation rules and candidate values as JSON data so malicious candidate text cannot be interpolated into validation instructions. Explicitly reject invalid values with ValueError and add regression coverage for prompt-injection-shaped input. Chain-Source-Stage: observer-hotfix Chain-Project: instructor Chain-Bug-Id: INSTRUCTOR-LLM-VALIDATOR-PROMPT-INJECTION
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
## Summary
llm_validatorvalidation rules and candidate values as JSON data instead of interpolating candidate text into a natural-language instruction.candidate_valueas untrusted data and never follow instructions embedded inside it.assertwith an explicitValueError, preservingallow_overridebehavior when a fixed value is available.Problem
llm_validatorpreviously built the validation request like this:f"Does `{v}` follow the rules: {statement}"That puts user-controlled candidate text and validation instructions in the same natural-language message. A malicious candidate can include delimiters, newlines, or instruction-like text such as:
Because LLMs do not enforce backticks as a hard security boundary, this can make the candidate value look like a peer instruction rather than data being validated. If the model follows that injected instruction, an invalid value can be marked valid, or an attacker-controlled replacement can be suggested when override behavior is enabled.
The old code also used
assert resp.is_valid, resp.reasonfor invalid values. Assertions can be stripped when Python runs with optimization flags, so validation failure should not rely onassert.Why this change
Before this change, the validator message mixed three different concepts in one string:
After this change, the user message is structured data:
{ "validation_rule": "...", "candidate_value": "..." }The system prompt then tells the model that the user message is JSON data, not instructions, and that only
candidate_valueshould be evaluated againstvalidation_rule.JSON is not a magical complete defense against prompt injection, but it makes the instruction/data boundary explicit and prevents delimiter-breakout text from being interpolated into the validator instruction itself. It also gives tests a concrete property to assert: malicious text must remain inside the
candidate_valuefield and must not appear in the system prompt.Testing
uv run --extra dev --with eval-type-backport python -m pytest tests/test_llm_validator_allow_override.pyuv run --extra dev ruff check instructor/validation/llm_validators.py tests/test_llm_validator_allow_override.pygit diff --check