Conversation


@makeshn makeshn commented Jan 5, 2026

Description

This PR adds a new Reasoning Guardrail Connector for the nvidia/Nemotron-Content-Safety-Reasoning-4B model.

Key Changes:

  • Added a new model type content_safety_reasoning that integrates with the Nemotron Content Safety Reasoning model
  • Added support for reasoning mode toggle (/think vs /no_think) via config option
  • Added new output parsers (nemotron_reasoning_parse_prompt_safety, nemotron_reasoning_parse_response_safety) for parsing the model's safety classification output
  • Added an example configuration demonstrating content safety checks on both input and output flows, as well as reasoning-off and reasoning-on modes
  • Added a tutorial showing how to serve and set up the model via vLLM with an OpenAI-compatible API
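The harmful/unharmful parsing described above can be sketched roughly as follows. This is a minimal illustration, not the actual nemoguardrails implementation; the function name, regex, and fail-closed fallback are assumptions based only on the output format described in this PR:

```python
import re


def parse_prompt_safety(output: str) -> bool:
    """Hypothetical sketch of a Nemotron reasoning safety parser.

    Strips any <think>...</think> reasoning trace (present in /think
    mode), then reads the "Prompt harm: harmful/unharmful" verdict.
    Returns True when the prompt is classified as harmful.
    """
    # Drop the optional reasoning trace so it cannot confuse the match.
    cleaned = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL)
    match = re.search(r"Prompt harm:\s*(harmful|unharmful)", cleaned)
    if match is None:
        # Fail closed: treat unparseable output as harmful.
        return True
    return match.group(1) == "harmful"
```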

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.
  • I've added tests if applicable.
  • @mentions of the person or team responsible for reviewing proposed changes.

@makeshn makeshn self-assigned this Jan 5, 2026

github-actions bot commented Jan 5, 2026

Documentation preview

https://nvidia-nemo.github.io/Guardrails/review/pr-1565

"sample_conversation_two_turns": self.config.sample_conversation,
# Make rails config available in templates for accessing settings like
# content_safety.reasoning.enabled, content_safety.multilingual.enabled, etc.
"config": self.config.rails.config,
Collaborator Author


@Pouyanpi @tgasser-nv this was needed so that we can toggle reasoning-on and reasoning-off modes in the Jinja2 prompt template depending on config.yml.

Collaborator


I think it is better to pass reasoning_enabled in actions.py from render_task_prompt (see this commit). Content safety should own its configuration, and this also keeps LLMTaskManager generic 👍🏻 Please let me know if you see any issues.
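For illustration, the suggested shape of that change might look something like the sketch below (a hypothetical helper and dict layout, not the code from the linked commit):

```python
def build_safety_check_context(config: dict, user_input: str) -> dict:
    """Build the render context for a content safety check prompt.

    Rather than exposing the entire rails config to every template,
    the content safety action passes only the flag it owns.
    """
    reasoning = config.get("content_safety", {}).get("reasoning", {})
    return {
        "user_input": user_input,
        # The template toggles /think vs /no_think on this flag alone.
        "reasoning_enabled": reasoning.get("enabled", False),
    }
```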


greptile-apps bot commented Jan 5, 2026

Greptile Summary

This PR adds a new Reasoning Guardrail Connector for the nvidia/Nemotron-Content-Safety-Reasoning-4B model, enabling content safety checks with optional reasoning traces.

Key changes:

  • Added two new output parsers (nemotron_reasoning_parse_prompt_safety and nemotron_reasoning_parse_response_safety) to parse the model's "harmful/unharmful" output format and strip <think> tags
  • Added ReasoningConfig to the configuration schema to allow toggling between reasoning on/off modes via rails.config.content_safety.reasoning.enabled
  • Exposed config.rails.config in Jinja2 template rendering context (in taskmanager.py:138) to allow templates to access configuration settings
  • Provided comprehensive example configuration demonstrating both input and output rails with the new model type
  • Added detailed tutorial documentation explaining deployment with vLLM and configuration options
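For reference, enabling the toggle in config.yml would presumably look like the following fragment (field names assumed from the rails.config.content_safety.reasoning.enabled path above):

```yaml
rails:
  config:
    content_safety:
      reasoning:
        enabled: true  # set to false for /no_think (faster, no reasoning trace)
```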

Issue found:
The example prompts.yml file uses an undefined variable reasoning_mode instead of config.content_safety.reasoning.enabled, which will cause the template rendering to fail or default incorrectly. The documentation shows the correct usage.
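As an illustration, the corrected template toggle referenced here would look roughly like this Jinja2 fragment (the exact template text lives in the PR's prompts.yml):

```jinja
{% if config.content_safety.reasoning.enabled %}/think{% else %}/no_think{% endif %}
```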

Confidence Score: 3/5

  • This PR has a critical bug in the example configuration that will prevent the reasoning mode toggle from working correctly
  • The core implementation (output parsers, config schema, taskmanager changes) is solid and well-structured. However, the example prompts.yml contains a template variable error that references an undefined variable (reasoning_mode) instead of the correct path (config.content_safety.reasoning.enabled). This inconsistency between the example and documentation will cause confusion and broken functionality for users following the example.
  • examples/configs/content_safety_reasoning/prompts.yml requires fixes to use the correct template variable path

Important Files Changed

Filename Overview
nemoguardrails/llm/output_parsers.py Added two new output parsers (nemotron_reasoning_parse_prompt_safety and nemotron_reasoning_parse_response_safety) with helper functions for parsing Nemotron Content Safety Reasoning model outputs
nemoguardrails/llm/taskmanager.py Registered new output parsers and exposed config.rails.config object in template rendering context to enable access to configuration settings
nemoguardrails/rails/llm/config.py Added ReasoningConfig model to support reasoning mode toggle for content safety models
examples/configs/content_safety_reasoning/prompts.yml Prompt templates for content safety checks - uses incorrect variable reasoning_mode instead of config.content_safety.reasoning.enabled

Sequence Diagram

sequenceDiagram
    participant User
    participant LLMRails
    participant TaskManager
    participant ContentSafetyModel as Nemotron-4B
    participant MainLLM
    
    User->>LLMRails: generate(message)
    
    Note over LLMRails,TaskManager: Input Rail Flow
    LLMRails->>TaskManager: render_task_prompt("content_safety_check_input")
    TaskManager->>TaskManager: Apply template with config.content_safety.reasoning.enabled
    TaskManager-->>LLMRails: Prompt with /think or /no_think
    
    LLMRails->>ContentSafetyModel: Check input safety
    ContentSafetyModel-->>LLMRails: "Prompt harm: harmful/unharmful"
    
    LLMRails->>TaskManager: parse_task_output(output)
    TaskManager->>TaskManager: nemotron_reasoning_parse_prompt_safety()
    TaskManager->>TaskManager: Strip <think> tags, extract harm value
    TaskManager-->>LLMRails: [True] or [False]
    
    alt Input is safe
        LLMRails->>MainLLM: Generate response
        MainLLM-->>LLMRails: Response text
        
        Note over LLMRails,TaskManager: Output Rail Flow
        LLMRails->>TaskManager: render_task_prompt("content_safety_check_output")
        TaskManager-->>LLMRails: Prompt with /think or /no_think
        
        LLMRails->>ContentSafetyModel: Check output safety
        ContentSafetyModel-->>LLMRails: "Response Harm: harmful/unharmful"
        
        LLMRails->>TaskManager: parse_task_output(output)
        TaskManager->>TaskManager: nemotron_reasoning_parse_response_safety()
        TaskManager-->>LLMRails: [True] or [False]
        
        alt Output is safe
            LLMRails-->>User: Response
        else Output is unsafe
            LLMRails-->>User: "I'm sorry, I can't respond to that."
        end
    else Input is unsafe
        LLMRails-->>User: "I'm sorry, I can't respond to that."
    end


@greptile-apps greptile-apps bot left a comment


9 files reviewed, 2 comments



codecov bot commented Jan 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@Pouyanpi Pouyanpi changed the title add reasoning guardrail connector feat: add reasoning guardrail connector Jan 6, 2026
@Pouyanpi Pouyanpi force-pushed the feature/reasoning_guardrail_connector branch from e36d9fc to 4c134db Compare January 8, 2026 10:19
makeshn and others added 5 commits January 8, 2026 11:22
Include the `reasoning_enabled` flag from the content safety config in
the context passed to both input and output content safety check
prompts.
This enables prompt templates to conditionally adjust behavior based on
the reasoning setting. Removes redundant config exposure in
LLMTaskManager's default context.
@Pouyanpi Pouyanpi force-pushed the feature/reasoning_guardrail_connector branch from 4c134db to 921f1d7 Compare January 8, 2026 10:24
Comment on lines +8 to +14
- type: content_safety_reasoning
engine: openai
parameters:
openai_api_base: http://localhost:8001/v1
model_name: nvidia/Nemotron-Content-Safety-Reasoning-4B
temperature: 0.6
top_p: 0.95
Collaborator


I think it still works if we change it to:

  - type: content_safety_reasoning
    engine: openai
    model: nvidia/Nemotron-Content-Safety-Reasoning-4B
    parameters:
      openai_api_base: http://localhost:8001/v1
      temperature: 0.6
      top_p: 0.95

Can you please check?

Member


We need to modify the example in the tutorial as well.

@Pouyanpi Pouyanpi self-requested a review January 8, 2026 10:30
Collaborator

@Pouyanpi Pouyanpi left a comment


Thanks a lot @makeshn , this is great 🚀

I just made a minor change based on my comment above and added tests. Let me know if you see any issues, we can revert/change.

Please look at the tests to ensure the assertions are aligned with your expectations. As I cannot run it myself, could you please run it and pass some inputs to verify that everything is still working? Thanks 👍🏻

@Pouyanpi Pouyanpi added the enhancement New feature or request label Jan 8, 2026
@Pouyanpi Pouyanpi added this to the 0.20.0 milestone Jan 8, 2026
Member

@trebedea trebedea left a comment


Looks good, some suggestions.

:maxdepth: 2
Check Harmful Content <nemotron-safety-guard-deployment>
Content Safety Reasoning <nemotron-content-safety-reasoning-deployment>
Member


Should we use "Respect Custom Safety and Dialogue Policies"? This would focus on the use case rather than how it works.

:link: nemotron-content-safety-reasoning-deployment
:link-type: doc

Deploy Nemotron-Content-Safety-Reasoning-4B for customizable content safety with reasoning traces.
Member


Similarly here, maybe we want to focus on the use case rather than how the model works. An alternative would be:

:::{grid-item-card} Respect Custom Safety and Dialogue Policies
:link: nemotron-content-safety-reasoning-deployment
:link-type: doc

Enforce customizable content safety and dialogue policies for your use-case by using our reasoning guard model Nemotron-Content-Safety-Reasoning-4B.


```text
Prompt harm: unharmful
Response harm: None
```
Member


Suggested change:
- Response harm: None
+ Response harm: unharmful

```text
<think>
The human user's request falls under S21 (Illegal Activity) as it seeks guidance on theft,
which is a criminal act. The AI assistant's response is a refusal and provides an ethical
```
Member


Here the input rail is failing, so it does not really make sense to have the output rail (including the bot response) in the example.


**Reasoning On (`/think`)**: Provides explicit reasoning traces for decisions. Better for complex or novel custom policies. Higher latency. See [example](https://huggingface.co/nvidia/Nemotron-Content-Safety-Reasoning-4B#example-1-vanilla-safety-with-nemotron-content-safety-dataset-v2-taxonomy-reasoning-on-mode).

**Reasoning Off (`/no_think`)**: Fast classification without reasoning. Suitable for standard safety checks. Lower latency. See [example](https://huggingface.co/nvidia/Nemotron-Content-Safety-Reasoning-4B#example-2-vanilla-safety-with-nemotron-content-safety-dataset-v2-taxonomy-reasoning-off-mode).
Member


Suggested change:
- **Reasoning Off (`/no_think`)**: Fast classification without reasoning. Suitable for standard safety checks. Lower latency. See [example](https://huggingface.co/nvidia/Nemotron-Content-Safety-Reasoning-4B#example-2-vanilla-safety-with-nemotron-content-safety-dataset-v2-taxonomy-reasoning-off-mode).
+ **Reasoning Off (`/no_think`)**: Fast classification without reasoning. Suitable for standard content safety policies. Lower latency. See [example](https://huggingface.co/nvidia/Nemotron-Content-Safety-Reasoning-4B#example-2-vanilla-safety-with-nemotron-content-safety-dataset-v2-taxonomy-reasoning-off-mode).


Comment on lines +16 to +21
# fmt: off
# isort: skip_file

# start-prerequisites

# end-prerequisites
Member

The reason will be displayed to describe this comment to others. Learn more.

Remove these and move the print_moderation_output function after the imports.


Labels

enhancement New feature or request

4 participants