feat(cache): Add LFU caching system for models (currently applied to content safety checks) #1436
Conversation
Pull Request Overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Pouyanpi
left a comment
Thank you @hazai, looks very good overall 👍🏻
Please have a look at the review comments. We need to make sure the tests cover those edge cases.
@hazai for telemetry and logging, we can get:

```python
result = await llm_call(
    llm,
    check_input_prompt,
    stop=stop,
    llm_params={"temperature": 1e-20, "max_tokens": max_tokens},
)
print("llm_stats_var:", llm_stats_var.get())
```

So, similar to the result, the stats should be cached; but the second time, when the result is read from the cache, we should set them again. LangChain, for example, returns the token usage metrics as-is but changes the duration to the cache read duration.
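A minimal sketch of the behavior discussed in this comment: restore the cached token-usage metrics as-is, but overwrite the recorded duration with the cache-read time. The `cache` interface, the `llm_stats_var` context variable, and the `"latency"` key are illustrative assumptions, not the PR's actual API.

```python
import time

async def cached_llm_call(cache, key, llm_call, llm_stats_var):
    """Illustrative sketch: on a cache hit, restore stored token-usage
    metrics but replace the duration with the cache-read duration.
    The stats key names (e.g. "latency") are assumptions."""
    start = time.monotonic()
    entry = cache.get(key)
    if entry is not None:
        result, stats = entry
        stats = dict(stats)                          # copy stored metrics
        stats["latency"] = time.monotonic() - start  # cache-read duration
        llm_stats_var.set(stats)
        return result
    result = await llm_call()
    llm_stats = llm_stats_var.get() or {}
    cache.put(key, (result, dict(llm_stats)))
    return result
```

On the second call with the same key, token counts come back unchanged while the duration reflects only the cache lookup, matching the LangChain behavior described above.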
…d interface)
- add tests for LFU cache
- new content safety dynamic cache + integration
- add stats logging
- remove redundant test
- thread safety support for content-safety caching
- fixed failing tests
- update documentation to reflect thread-safety support for cache
- fixes following test failures on race conditions
- fixes following test failures
- remove a test
- update cache interface per model config without defaults
Signed-off-by: Pouyan <[email protected]>
Pouyanpi
left a comment
@hazai thank you very much for the hard work in getting this feature ready. We are good to merge this PR 🚀
…content safety checks) #1436

Implement a pluggable caching infrastructure to reduce redundant LLM calls in content safety checks. The system features a Least Frequently Used (LFU) eviction policy with optional statistics tracking and periodic logging.

Key components:
- CacheInterface: Abstract base defining the cache contract
- LFUCache: Thread-safe LFU implementation with configurable stats
- Cache utilities: Key normalization, LLM stats extraction/restoration
- Content safety integration: Automatic caching in the check_input action
- Configuration: Cache settings in RailsConfig with per-model caches

The caching layer is transparent to existing code and can be enabled via configuration without code changes.

Signed-off-by: Pouyan <[email protected]>
Co-authored-by: Pouyan <[email protected]>
This PR replaces previous (closed) PR #1404
The PR contains a new nemoguardrails/cache folder with an LFU cache implementation (and interface).
The cache can be configured for any model.
It supports:
- configuration
- stats tracking
- logging
- thread safety
- (very minimal) cache key normalization
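As background for the list above, the LFU eviction policy itself can be sketched in a few lines. This is illustrative only, not the PR's `LFUCache`, which additionally provides thread safety, stats tracking, and entry timestamps:

```python
from collections import defaultdict

class MiniLFUCache:
    """Toy LFU cache: when full, evicts the least frequently used key.
    Illustrative sketch only, not the nemoguardrails/cache implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = {}
        self.freq = defaultdict(int)  # key -> access count

    def get(self, key, default=None):
        if key not in self.values:
            return default
        self.freq[key] += 1
        return self.values[key]

    def put(self, key, value):
        if key not in self.values and len(self.values) >= self.capacity:
            # Evict the entry with the lowest access count.
            victim = min(self.values, key=lambda k: self.freq[k])
            del self.values[victim]
            del self.freq[victim]
        self.values[key] = value
        self.freq[key] += 1
```

Unlike LRU, frequently re-checked inputs survive eviction even if they were not accessed recently, which suits repeated identical safety checks.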
@Pouyanpi @tgasser-nv
Readme
Content Safety LLM Call Caching
Overview
The content safety checks in actions.py now use an LFU (Least Frequently Used) cache to improve performance by avoiding redundant LLM calls for identical safety checks.
Implementation Details
Cache Configuration
- created_at and accessed_at timestamps for each entry
- Applies to the main and non-embeddings model types (typically content safety models)
Cached Functions
- content_safety_check_input() - Caches safety checks for user inputs
Cache Key Components
The cache key is generated from:
Since temperature is fixed (1e-20) and stop/max_tokens are derived from the model configuration, they don't need to be part of the cache key.
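A hedged sketch of key generation consistent with this description. The helper name and the exact components included are assumptions (the text above only specifies what is excluded); the README notes the real normalization is very minimal:

```python
import hashlib
import json

def make_cache_key(model_name, prompt):
    """Hypothetical cache-key helper. Temperature, stop, and max_tokens
    are deliberately excluded: they are fixed or derived from the model
    configuration, so they add no discriminating information."""
    # Minimal normalization: collapse whitespace so trivially different
    # renderings of the same prompt map to the same key.
    normalized = " ".join(prompt.split())
    payload = json.dumps(
        {"model": model_name, "prompt": normalized}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Hashing keeps keys fixed-size regardless of prompt length, and including the model name keeps entries from different models from colliding.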
How It Works
Before LLM Call:
After LLM Call:
Cache Management
The caching system automatically creates and manages separate caches for each model. Key features:
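The per-model cache management described here might look like the following sketch. The registry name and `factory` argument are hypothetical, not the PR's API:

```python
# Hypothetical per-model registry: one cache instance per model name,
# created lazily on first use. `factory` builds the cache (e.g. an LFU
# cache with the configured capacity); all names here are illustrative.
_model_caches = {}

def get_cache_for_model(model_name, factory=dict):
    """Return the cache for `model_name`, creating it on first use."""
    if model_name not in _model_caches:
        _model_caches[model_name] = factory()
    return _model_caches[model_name]
```

Keeping caches separate per model means each model's capacity, eviction, and statistics are independent.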
Statistics and Monitoring
The cache supports detailed statistics tracking and periodic logging for monitoring cache performance:
Statistics Features:
- Set stats.enabled: true with no log_interval to track stats without logging
- Set stats.enabled: true and log_interval for periodic logging
Statistics Tracked:
Log Format:
Usage Examples:
The cache is managed internally by the NeMo Guardrails framework. When you configure a model with caching enabled, the framework automatically:
Configuration Options:
- stats.enabled: Enable/disable statistics tracking (default: false)
- stats.log_interval: Seconds between automatic stats logs (None = no logging)
Notes:
- Stats are logged via the nemoguardrails.cache.lfu logger
- Statistics accumulate until reset_stats() is called
Example Configuration
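A hedged configuration sketch for this section: `stats.enabled` and `stats.log_interval` are named in this README, but the surrounding key names, nesting, and values are assumptions for illustration, not the verified RailsConfig schema:

```yaml
# Sketch only: key names other than stats.enabled / stats.log_interval
# are assumptions; check the merged documentation for the real schema.
models:
  - type: main
    engine: some_engine
    model: some-content-safety-model
    cache:
      enabled: true
      capacity: 1000        # entries kept before LFU eviction
      stats:
        enabled: true
        log_interval: 60    # seconds between periodic stats logs
```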
Example Usage
Thread Safety
The content safety caching system is thread-safe for single-node deployments:
LFUCache Implementation:
- Uses threading.RLock for all operations
- All cache operations (get, put, size, clear, etc.) are protected by locks
- Atomic get_or_compute() operations prevent duplicate computations
LLMRails Model Initialization:
Key Features:
- get_or_compute() ensures expensive computations happen only once
Usage in Web Servers:
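The compute-once guarantee described above can be sketched as follows. This is an illustration of the idea, not the PR's `LFUCache` (which also handles eviction and stats):

```python
import threading

class ComputeOnceCache:
    """Sketch of get_or_compute() semantics: a lock ensures the
    expensive compute runs at most once per key, even when many
    web-server threads request the same key concurrently."""

    def __init__(self):
        self._lock = threading.RLock()
        self._data = {}

    def get_or_compute(self, key, compute):
        with self._lock:
            if key not in self._data:
                self._data[key] = compute()
            return self._data[key]
```

Note the trade-off this sketch makes: holding one lock during `compute()` serializes all keys. A production implementation might use per-key locks or futures so unrelated keys do not block each other.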
Note: This implementation is designed for single-node deployments. For distributed systems, consider using external caching solutions like Redis.
Benefits
Example Usage Pattern
Logging
The implementation includes debug logging:
- "Created cache for model '{model_name}' with capacity {capacity}"
- "Content safety cache hit for model '{model_name}'"
- "Content safety result cached for model '{model_name}'"
Enable debug logging to monitor cache behavior:
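For example, debug output for the nemoguardrails.cache.lfu logger (the logger name mentioned in the notes above) can be enabled with standard Python logging configuration:

```python
import logging

# Default handler/format at INFO, then raise just the cache logger
# to DEBUG so cache create/hit/store messages become visible.
logging.basicConfig(level=logging.INFO)
logging.getLogger("nemoguardrails.cache.lfu").setLevel(logging.DEBUG)
```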