feat(cache): Add LFU caching system for models (currently applied to content safety checks) #1436
Conversation
Pull Request Overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Pouyanpi
left a comment
Thank you @hazai, looks very good overall 👍🏻
Please have a look at the review comments. We need to make sure the tests cover those edge cases.
@hazai for telemetry and logging, we can get:

```python
result = await llm_call(
    llm,
    check_input_prompt,
    stop=stop,
    llm_params={"temperature": 1e-20, "max_tokens": max_tokens},
)
print("llm_stats_var:", llm_stats_var.get())
```

So, similar to the result, the stats should be cached; but the second time, when the result is read from the cache, we should set them again. LangChain, for example, returns the token usage metrics as-is but changes the duration to the cache read duration.
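A minimal sketch of the behavior discussed in this comment: restore the cached token-usage metrics as-is, but overwrite the recorded duration with the cache-read time. The `cache` interface, the `llm_stats_var` context variable, and the `"latency"` key are illustrative assumptions, not the PR's actual API.

```python
import time

async def cached_llm_call(cache, key, llm_call, llm_stats_var):
    """Illustrative sketch: on a cache hit, restore stored token-usage
    metrics but replace the duration with the cache-read duration.
    The stats key names (e.g. "latency") are assumptions."""
    start = time.monotonic()
    entry = cache.get(key)
    if entry is not None:
        result, stats = entry
        stats = dict(stats)                          # copy stored metrics
        stats["latency"] = time.monotonic() - start  # cache-read duration
        llm_stats_var.set(stats)
        return result
    result = await llm_call()
    llm_stats = llm_stats_var.get() or {}
    cache.put(key, (result, dict(llm_stats)))
    return result
```

On the second call with the same key, token counts come back unchanged while the duration reflects only the cache lookup, matching the LangChain behavior described above.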
…d interface)
- add tests for LFU cache
- new content safety dynamic cache + integration
- add stats logging
- remove redundant test
- thread safety support for content-safety caching
- fixed failing tests
- update documentation to reflect thread-safety support for cache
- fixes following test failures on race conditions
- fixes following test failures
- remove a test
- update cache interface per model config without defaults
Signed-off-by: Pouyan <[email protected]>
Pouyanpi
left a comment
@hazai thank you very much for the hard work in getting this feature ready. We are good to merge this PR 🚀
…content safety checks) #1436

Implement a pluggable caching infrastructure to reduce redundant LLM calls in content safety checks. The system features a Least Frequently Used (LFU) eviction policy with optional statistics tracking and periodic logging.

Key components:
- CacheInterface: Abstract base defining the cache contract
- LFUCache: Thread-safe LFU implementation with configurable stats
- Cache utilities: Key normalization, LLM stats extraction/restoration
- Content safety integration: Automatic caching in the check_input action
- Configuration: Cache settings in RailsConfig with per-model caches

The caching layer is transparent to existing code and can be enabled via configuration without code changes.

Signed-off-by: Pouyan <[email protected]>
Co-authored-by: Pouyan <[email protected]>
This PR replaces previous (closed) PR #1404
The PR contains a new nemoguardrails/cache folder with an LFU cache implementation (and interface).
The cache can be configured for any model.
It supports:
- configuration
- stats tracking
- logging
- thread safety
- (very minimal) cache key normalization
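As background for the list above, the LFU eviction policy itself can be sketched in a few lines. This is illustrative only, not the PR's `LFUCache`, which additionally provides thread safety, stats tracking, and entry timestamps:

```python
from collections import defaultdict

class MiniLFUCache:
    """Toy LFU cache: when full, evicts the least frequently used key.
    Illustrative sketch only, not the nemoguardrails/cache implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = {}
        self.freq = defaultdict(int)  # key -> access count

    def get(self, key, default=None):
        if key not in self.values:
            return default
        self.freq[key] += 1
        return self.values[key]

    def put(self, key, value):
        if key not in self.values and len(self.values) >= self.capacity:
            # Evict the entry with the lowest access count.
            victim = min(self.values, key=lambda k: self.freq[k])
            del self.values[victim]
            del self.freq[victim]
        self.values[key] = value
        self.freq[key] += 1
```

Unlike LRU, frequently re-checked inputs survive eviction even if they were not accessed recently, which suits repeated identical safety checks.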
@Pouyanpi @tgasser-nv
Readme
Content Safety LLM Call Caching
Overview
The content safety checks in actions.py now use an LFU (Least Frequently Used) cache to improve performance by avoiding redundant LLM calls for identical safety checks.
Implementation Details
Cache Configuration
- created_at and accessed_at timestamps for each entry
- Applies to the main and non-embeddings model types (typically content safety models)
Cached Functions
- content_safety_check_input() - Caches safety checks for user inputs
Cache Key Components
The cache key is generated from:
Since temperature is fixed (1e-20) and stop/max_tokens are derived from the model configuration, they don't need to be part of the cache key.
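A hedged sketch of key generation consistent with this description. The helper name and the exact components included are assumptions (the text above only specifies what is excluded); the README notes the real normalization is very minimal:

```python
import hashlib
import json

def make_cache_key(model_name, prompt):
    """Hypothetical cache-key helper. Temperature, stop, and max_tokens
    are deliberately excluded: they are fixed or derived from the model
    configuration, so they add no discriminating information."""
    # Minimal normalization: collapse whitespace so trivially different
    # renderings of the same prompt map to the same key.
    normalized = " ".join(prompt.split())
    payload = json.dumps(
        {"model": model_name, "prompt": normalized}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Hashing keeps keys fixed-size regardless of prompt length, and including the model name keeps entries from different models from colliding.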
How It Works
Before LLM Call:
After LLM Call:
Cache Management
The caching system automatically creates and manages separate caches for each model. Key features:
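The per-model cache management described here might look like the following sketch. The registry name and `factory` argument are hypothetical, not the PR's API:

```python
# Hypothetical per-model registry: one cache instance per model name,
# created lazily on first use. `factory` builds the cache (e.g. an LFU
# cache with the configured capacity); all names here are illustrative.
_model_caches = {}

def get_cache_for_model(model_name, factory=dict):
    """Return the cache for `model_name`, creating it on first use."""
    if model_name not in _model_caches:
        _model_caches[model_name] = factory()
    return _model_caches[model_name]
```

Keeping caches separate per model means each model's capacity, eviction, and statistics are independent.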
Statistics and Monitoring
The cache supports detailed statistics tracking and periodic logging for monitoring cache performance:
Statistics Features:
- Set stats.enabled: true with no log_interval to track stats without logging
- Set stats.enabled: true and log_interval for periodic logging
Statistics Tracked:
Log Format:
Usage Examples:
The cache is managed internally by the NeMo Guardrails framework. When you configure a model with caching enabled, the framework automatically:
Configuration Options:
- stats.enabled: Enable/disable statistics tracking (default: false)
- stats.log_interval: Seconds between automatic stats logs (None = no logging)
Notes:
- Stats are logged via the nemoguardrails.cache.lfu logger
- Statistics accumulate until reset_stats() is called
Example Configuration
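A hedged configuration sketch for this section: `stats.enabled` and `stats.log_interval` are named in this README, but the surrounding key names, nesting, and values are assumptions for illustration, not the verified RailsConfig schema:

```yaml
# Sketch only: key names other than stats.enabled / stats.log_interval
# are assumptions; check the merged documentation for the real schema.
models:
  - type: main
    engine: some_engine
    model: some-content-safety-model
    cache:
      enabled: true
      capacity: 1000        # entries kept before LFU eviction
      stats:
        enabled: true
        log_interval: 60    # seconds between periodic stats logs
```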
Example Usage
Thread Safety
The content safety caching system is thread-safe for single-node deployments:
LFUCache Implementation:
- Uses threading.RLock for all operations
- All cache operations (get, put, size, clear, etc.) are protected by locks
- Atomic get_or_compute() operations prevent duplicate computations
LLMRails Model Initialization:
Key Features:
- get_or_compute() ensures expensive computations happen only once
Usage in Web Servers:
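The compute-once guarantee described above can be sketched as follows. This is an illustration of the idea, not the PR's `LFUCache` (which also handles eviction and stats):

```python
import threading

class ComputeOnceCache:
    """Sketch of get_or_compute() semantics: a lock ensures the
    expensive compute runs at most once per key, even when many
    web-server threads request the same key concurrently."""

    def __init__(self):
        self._lock = threading.RLock()
        self._data = {}

    def get_or_compute(self, key, compute):
        with self._lock:
            if key not in self._data:
                self._data[key] = compute()
            return self._data[key]
```

Note the trade-off this sketch makes: holding one lock during `compute()` serializes all keys. A production implementation might use per-key locks or futures so unrelated keys do not block each other.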
Note: This implementation is designed for single-node deployments. For distributed systems, consider using external caching solutions like Redis.
Benefits
Example Usage Pattern
Logging
The implementation includes debug logging:
- "Created cache for model '{model_name}' with capacity {capacity}"
- "Content safety cache hit for model '{model_name}'"
- "Content safety result cached for model '{model_name}'"
Enable debug logging to monitor cache behavior:
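For example, debug output for the nemoguardrails.cache.lfu logger (the logger name mentioned in the notes above) can be enabled with standard Python logging configuration:

```python
import logging

# Default handler/format at INFO, then raise just the cache logger
# to DEBUG so cache create/hit/store messages become visible.
logging.basicConfig(level=logging.INFO)
logging.getLogger("nemoguardrails.cache.lfu").setLevel(logging.DEBUG)
```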