A curated list of awesome AI guardrails.
If you find this list helpful, give it a ⭐ on GitHub, share it, and contribute by submitting a pull request or issue!
| Name | Description |
|---|---|
security-and-privacy |
Security and privacy guardrails ensure content remains safe, ethical, and devoid of offensive material |
response-and-relevance |
Ensures model responses are accurate, focused, and aligned with user intent |
language-quality |
Ensures high standards of readability, coherence, and clarity |
content-validation |
Ensures factual correctness and logical coherence of content |
logic-validation |
Ensures logical and functional correctness of generated code and data |
| Sub Category | Description |
|---|---|
inappropriate-content |
Detects and filters inappropriate or explicit content |
offensive-language |
Identifies and filters profane or offensive language |
prompt-injection |
Prevents manipulation attempts through malicious prompts |
sensitive-content |
Flags culturally, politically, or socially sensitive topics |
deepfake-detection |
Detects and filters deepfake content |
pii |
Identifies and filters personally identifiable information |
Models in security-and-privacy
| Sub Category | Description |
|---|---|
relevance |
Validates semantic relevance between input and output |
prompt-address |
Confirms response correctly addresses user's prompt |
url-validation |
Verifies validity of generated URLs |
factuality |
Cross-references content with external knowledge sources |
refusal |
Refuses to answer questions that are not appropriate or relevant |
Models in response-and-relevance
| Name | Size | Task |
|---|---|---|
| protectai/distilroberta-base-rejection-v1 | 0.0821B |
text-classification |
| s-nlp/E5-EverGreen-Multilingual-Small | 0.118B |
text-classification |
| lytang/MiniCheck-RoBERTa-Large | 0.4B |
text-classification |
| lytang/MiniCheck-Flan-T5-Large | 0.8B |
text-classification |
| ibm-granite/granite-guardian-3.1-2b | 2B |
text-classification |
| bespokelabs/Bespoke-MiniCheck-7B | 7B |
text-classification |
| nvidia/prompt-task-and-complexity-classifier | 0.184B |
text-classification |
| PatronusAI/glider | 3.8B |
text-classification |
| flowaicom/Flow-Judge-v0.1 | 3.8B |
text-classification |
| Sub Category | Description |
|---|---|
quality |
Assesses structure, relevance, and coherence of output |
translation-accuracy |
Ensures contextually correct and linguistically accurate translations |
duplicate-elimination |
Detects and removes redundant content |
readability |
Evaluates text complexity for target audience |
Models in language-quality
| Name | Size | Task |
|---|---|---|
| HuggingFaceFW/fineweb-edu-classifier | 0.109B |
text-classification |
| nvidia/quality-classifier-deberta | 0.184B |
text-classification |
| facebook/nllb-200-distilled-600M | 0.6B |
text-to-text-generation |
| nvidia/prompt-task-and-complexity-classifier | 0.184B |
text-classification |
| PatronusAI/glider | 3.8B |
text-classification |
| flowaicom/Flow-Judge-v0.1 | 3.8B |
text-classification |
| Sub Category | Description |
|---|---|
competitor-blocking |
Screens for mentions of rival brands or companies |
price-validation |
Validates price-related data against verified sources |
source-verification |
Verifies accuracy of external quotes and references |
gibberish-filter |
Identifies and filters nonsensical or incoherent outputs |
Models in content-validation
| Name | Size | Task |
|---|---|---|
| s-nlp/mdistilbert-base-formality-ranker | 0.142B |
text-classification |
| d4data/bias-detection-model | 0.3B |
text-classification |
| NousResearch/Minos-v1 | 0.4B |
text-classification |
| osmosis-ai/Osmosis-Structure-0.6B | 0.6B |
token-classification |
| gliner-community/gliner_small-v2.5 | 0.7B |
token-classification |
| Sub Category | Description |
|---|---|
sql-validation |
Validates SQL queries for syntax and security |
api-validation |
Ensures API calls conform to OpenAPI standards |
json-validation |
Validates JSON structure and schema |
logical-consistency |
Checks for contradictory or illogical statements |
| Name | Size | Category | Sub Category |
|---|---|---|---|
| s-nlp/mdistilbert-base-formality-ranker | 0.142B |
content-validation |
quality |
| d4data/bias-detection-model | 0.3B |
content-validation |
bias |
| NousResearch/Minos-v1 | 0.4B |
content-validation |
refusal |
| HuggingFaceFW/fineweb-edu-classifier | 0.109B |
language-quality |
quality |
| nvidia/quality-classifier-deberta | 0.184B |
language-quality |
quality |
| protectai/distilroberta-base-rejection-v1 | 0.0821B |
response-and-relevance |
rejection |
| s-nlp/E5-EverGreen-Multilingual-Small | 0.118B |
response-and-relevance |
factuality |
| lytang/MiniCheck-RoBERTa-Large | 0.4B |
response-and-relevance |
factuality, logical-consistency, relevance |
| lytang/MiniCheck-Flan-T5-Large | 0.8B |
response-and-relevance |
factuality, logical-consistency, relevance |
| ibm-granite/granite-guardian-3.1-2b | 2B |
response-and-relevance |
factuality, logical-consistency, relevance |
| bespokelabs/Bespoke-MiniCheck-7B | 7B |
response-and-relevance |
factuality, logical-consistency, relevance |
| nvidia/prompt-task-and-complexity-classifier | 0.184B |
response-and-relevance, language-quality |
relevance, quality |
| PatronusAI/glider | 3.8B |
response-and-relevance, language-quality |
factuality, logical-consistency, relevance, quality |
| flowaicom/Flow-Judge-v0.1 | 3.8B |
response-and-relevance, language-quality |
factuality, logical-consistency, relevance, quality |
| meta-llama/Llama-Prompt-Guard-2-22M | 0.022B |
security-and-privacy |
prompt-injection, jailbreaks |
| eliasalbouzidi/distilbert-nsfw-text-classifier | 0.068B |
security-and-privacy |
inappropriate-content |
| meta-llama/Llama-Prompt-Guard-2-86M | 0.086B |
security-and-privacy |
prompt-injection, jailbreaks |
| ibm-granite/granite-guardian-hap-125m | 0.125B |
security-and-privacy |
toxicity, hallucination |
| ibm-granite/granite-guardian-hap-125m | 0.125B |
security-and-privacy |
toxicity, hallucination |
| protectai/deberta-v3-small-prompt-injection-v2 | 0.142B |
security-and-privacy |
prompt-injection |
| protectai/deberta-v3-base-prompt-injection-v2 | 0.182B |
security-and-privacy |
prompt-injection |
| TostAI/nsfw-text-detection-large | 0.355B |
security-and-privacy |
inappropriate-content |
| MoritzLaurer/ModernBERT-large-zeroshot-v2.0 | 0.4B |
security-and-privacy |
inappropriate-content, offensive-language, prompt-injection, sensitive-content |
| madhurjindal/Jailbreak-Detector-2-XL | 0.5B |
security-and-privacy |
jailbreaks |
| google/shieldgemma-2b | 2B |
security-and-privacy |
inappropriate-content, offensive-language, prompt-injection, sensitive-content |
| Name | Size | Category | Sub Category |
|---|---|---|---|
| osmosis-ai/Osmosis-Structure-0.6B | 0.6B |
content-validation, security-and-privacy |
pii, competitor-blocking |
| gliner-community/gliner_small-v2.5 | 0.7B |
content-validation, security-and-privacy |
pii, competitor-blocking |
| ai4privacy/llama-ai4privacy-multilingual-categorical-anonymiser-openpii | 0.15B |
security-and-privacy |
pii |
| Name | Size | Category | Sub Category |
|---|---|---|---|
| facebook/nllb-200-distilled-600M | 0.6B |
language-quality |
translation-accuracy |
| meta-llama/Llama-3.2-1B-Instruct | 1B |
security-and-privacy |
inappropriate-content, offensive-language, prompt-injection, sensitive-content |
| Name | Size | Category | Sub Category |
|---|---|---|---|
| Marqo/nsfw-image-detection-384 | 0.006B |
security-and-privacy |
inappropriate-content |
| Freepik/nsfw_image_detector | 0.086B |
security-and-privacy |
inappropriate-content |
| Organika/sdxl-detector | 0.086B |
security-and-privacy |
deepfake-detection |
| prithivMLmods/Deep-Fake-Detector-v2-Model | 0.086B |
security-and-privacy |
deepfake-detection |
| TostAI/nsfw-image-detection-large | 0.0871B |
security-and-privacy |
inappropriate-content |
| Ateeqq/nsfw-image-detection | 0.092B |
security-and-privacy |
inappropriate-content |
| Falconsai/nsfw_image_detection | 0.1B |
security-and-privacy |
inappropriate-content |
| OpenSafetyLab/ImageGuard | na |
security-and-privacy |
inappropriate-content |
| Name | Size | Category | Sub Category |
|---|---|---|---|
| meta-llama/Llama-Guard-4-12B | 12B |
security-and-privacy |
inappropriate-content, offensive-language, prompt-injection, sensitive-content |
| Name | Category | Description |
|---|---|---|
| guardrails | all |
Adding guardrails to large language models. |
| NeMo-Guardrails | all |
NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. |
| uqlm | hallucination |
UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection. |
| llm-guard | all |
The Security Toolkit for LLM Interactions. |
| Name | Category | Description |
|---|---|---|
| Lakera | all |
Lakera is a company that provides a range of AI services. |
| Guardrails AI Pro | all |
Guardrails AI Pro is a commercial version of guardrails that provides additional features and support. |
| Name | Category | Description |
|---|---|---|
| lytang/LLM-AggreFact | factuality |
Bias in Bios is a dataset of 100000 bios of people with different biases. |
| Entreprise PII Masking | pii |
Entreprise PII Masking are datasets for enterprise PII masking focused on location, work, health, digital and financial information. |
| prithivMLmods/OpenDeepfake-Preview | deepfake-detection |
OpenDeepfake-Preview is a dataset of 20K deepfake images. |
| eliasalbouzidi/NSFW-Safe-Dataset | nsfw |
NSFW-Safe-Dataset is a dataset for NSFW content detection. |
| lmsys/toxic-chat | toxic-chat |
Toxic-Chat is a dataset for toxic chat detection. |
| Name | Category | Description |
|---|---|---|
| Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers | hallucination |
Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers |
| RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models | factuality |
RAGTruth is a dataset of 100000 bios of people with different biases. |
| MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents | factuality |
how to build small fact-checking models that have GPT-4-level performance but for 400x lower cost. |
| A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions | hallucination |
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions |
| Granite Guardian: A Guardrail Framework for Large Language Models | all |
Granite Guardian is a guardrail framework for large language models. |
| "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models | prompt-injection |
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models |
| "Tiny-Toxic-Detector: A compact transformer-based model for toxic content detection | toxic-chat |
"Tiny-Toxic-Detector: A compact transformer-based model for toxic content detection |
| T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation | toxic-chat |
T2ISafety is a benchmark for assessing fairness, toxicity, and privacy in image generation. |