Minimal yet powerful text anonymization library
PromptSan is a little library for anonymizing and deanonymizing text with pluggable strategies. It supports regex patterns, custom dictionaries, LLM-based anonymization, and streaming deanonymization.
- Multiple Strategies: Regex, dictionary, and LLM-based anonymization
- Streaming Support: Real-time deanonymization for LLM outputs
- Extensible: Easy to add custom anonymization strategies
- Type-Safe: Full type hints and Pydantic validation
- Minimal: Only 5 core files, clean architecture
- LLM Integration: Local LLM support via LangChain
- Streaming: Generator-based streaming deanonymization
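To make the regex strategy and the reversible mapping concrete, here is a minimal, self-contained sketch of the idea. It is not PromptSan's implementation: the `PATTERNS` table and the `anonymize`/`deanonymize` functions are hypothetical stand-ins, and only the `__LABEL_N__` placeholder shape and the original-to-token mapping direction are taken from the examples in this README.

```python
import re

# Hypothetical pattern table (label -> regex); not PromptSan's built-in set.
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def anonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each match with a numbered placeholder; return text and mapping."""
    mapping: dict[str, str] = {}  # original value -> placeholder token

    for label, pattern in PATTERNS.items():
        def substitute(match: re.Match, label: str = label) -> str:
            value = match.group()
            # Reuse the existing token when the same value appears again.
            if value not in mapping:
                mapping[value] = f"__{label}_{len(mapping) + 1}__"
            return mapping[value]

        text = pattern.sub(substitute, text)
    return text, mapping


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Swap every placeholder back to its original value."""
    for value, token in mapping.items():
        text = text.replace(token, value)
    return text
```

Because the mapping records each original value next to its token, the masked text can be restored exactly as long as the tokens survive unchanged.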
pip install promptsan

git clone https://github.com/lugnicca/prompt-san.git
cd prompt-san
pip install -e .

git clone https://github.com/lugnicca/prompt-san.git
cd prompt-san
python -m pip install -e ".[dev]"

from promptsan import PromptSanitizer, SanConfig
# Default configuration (regex strategy)
sanitizer = PromptSanitizer()
# Anonymize text
result = sanitizer.anonymize("Contact Lugnicca at [email protected]")
print(result.text) # "Contact Lugnicca at __EMAIL_1__"
print(result.mapping) # {"[email protected]": "__EMAIL_1__"}
# Deanonymize text
restored = sanitizer.deanonymize(result.text, result.mapping)
print(restored) # "Contact Lugnicca at [email protected]"

from promptsan import PromptSanitizer, SanConfig
# Custom regex patterns
config = SanConfig(
    strategies=["regex"],
    regex_patterns={
        r'\b[A-Z][a-z]+ [A-Z][a-z]+\b': 'PERSON',
        r'\b\d{3}-\d{2}-\d{4}\b': 'SSN',
        r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b': 'EMAIL'
    }
)
sanitizer = PromptSanitizer(config)
result = sanitizer.anonymize("John Smith, SSN: 123-45-6789, email: [email protected]")

config = SanConfig(
    strategies=["dict"],
    custom_dict={
        "Project Alpha": "PROJECT",
        "CONFIDENTIAL": "CLASSIFICATION",
        "Acme Corp": "COMPANY"
    }
)
sanitizer = PromptSanitizer(config)
result = sanitizer.anonymize("CONFIDENTIAL: Project Alpha by Acme Corp")

config = SanConfig(
    strategies=["regex", "dict"],  # Applied in sequence
    custom_dict={"OpenAI": "AI_COMPANY"},
    regex_patterns={r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b': 'EMAIL'}
)
sanitizer = PromptSanitizer(config)
result = sanitizer.anonymize("OpenAI researcher at [email protected]")

PromptSan supports local LLM anonymization via LangChain:
# Load prompt template
with open('examples/llm_prompt_basic.txt', 'r') as f:
    prompt_template = f.read()

config = SanConfig(
    strategies=["llm"],
    llm_model="dolphin3.0-llama3.1-8b",
    llm_base_url="http://localhost:1234/v1",
    llm_prompt_template=prompt_template
)
sanitizer = PromptSanitizer(config)
result = sanitizer.anonymize("Dr. Smith at General Hospital treats Keanu Reeves")

Requirements:
- Local LLM server (e.g., LM Studio, Ollama)
- Server running on localhost:1234 (configurable)
Perfect for real-time LLM output processing:
def llm_response_stream():
    """Simulate an LLM generating a response chunk by chunk"""
    yield "Patient __PERSON_1__ "
    yield "at __HOSPITAL_2__ "
    yield "has condition __CONDITION_3__"

# Deanonymize stream in real-time
mapping = {
    "Miyamoto Musashi": "__PERSON_1__",
    "General Hospital": "__HOSPITAL_2__",
    "hypertension": "__CONDITION_3__"
}
sanitizer = PromptSanitizer()
deanonymized = sanitizer.deanonymize_stream(llm_response_stream(), mapping)
for chunk in deanonymized:
    print(chunk, end='', flush=True)
# Output: "Patient Miyamoto Musashi at General Hospital has condition hypertension"

(See examples/example_real_llm.py for a real-time LLM integration example.)
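A real LLM stream can split a placeholder across two chunks (for example `__PER` followed by `SON_1__`), so a streaming deanonymizer generally has to hold back any tail that might be the start of a token. The sketch below illustrates one way to do that with a small buffer. It is an assumption about how such a generator could work, not the code behind `deanonymize_stream`, and the name `deanonymize_chunks` is hypothetical.

```python
from collections.abc import Iterable, Iterator


def deanonymize_chunks(chunks: Iterable[str], mapping: dict[str, str]) -> Iterator[str]:
    """Yield restored text, holding back any tail that may be a partial token."""
    # The README's mapping is original -> token, so invert it for lookups.
    reverse = {token: value for value, token in mapping.items()}
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Restore every token that is now complete.
        for token, value in reverse.items():
            buffer = buffer.replace(token, value)
        # Find the longest suffix that is a proper prefix of some token.
        hold = 0
        for token in reverse:
            for size in range(min(len(token) - 1, len(buffer)), hold, -1):
                if token.startswith(buffer[-size:]):
                    hold = size
                    break
        ready = buffer[: len(buffer) - hold]
        buffer = buffer[len(buffer) - hold:]
        if ready:
            yield ready
    if buffer:
        yield buffer  # leftover tail that never completed a token


if __name__ == "__main__":
    demo_mapping = {"Miyamoto Musashi": "__PERSON_1__"}
    for piece in deanonymize_chunks(["Patient __PER", "SON_1__ is stable"], demo_mapping):
        print(piece, end="", flush=True)
    print()
```

The trade-off is a small delay: text that looks like the start of a placeholder is only emitted once the next chunk shows whether the token completes.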
Extend PromptSan with custom anonymization logic:
import re

from promptsan import PromptSanitizer, SanConfig
from promptsan.mapping import MappingStore

def credit_card_strategy(text: str, mapping: MappingStore, cfg: SanConfig) -> str:
    """Custom strategy for credit card numbers"""
    # Matches 16-digit numbers starting with 4 or 51-55, with optional spaces
    pattern = r'\b(?:4\d{3}|5[1-5]\d{2})\s?(?:\d{4}\s?){2}\d{4}\b'
    for match in re.finditer(pattern, text):
        cc_number = match.group().replace(' ', '')
        token = mapping.add(cc_number, "CREDIT_CARD")
        text = text.replace(match.group(), token)
    return text

# Register and use custom strategy
sanitizer = PromptSanitizer()
sanitizer.register_strategy("credit_card", credit_card_strategy)

config = SanConfig(strategies=["credit_card"])
sanitizer = PromptSanitizer(config)
sanitizer.register_strategy("credit_card", credit_card_strategy)
result = sanitizer.anonymize("Card: 4532 1234 5678 9012")

PromptSan includes a CLI for quick operations:
# Anonymize text
promptsan anonymize --text "Lugnicca at [email protected]"
# Anonymize from file
promptsan anonymize --input-file document.txt --output-file anonymized.txt --mapping-file mapping.json
# Deanonymize text
promptsan deanonymize --text "__EMAIL_1__" --mapping-file mapping.json
# Stream deanonymize
promptsan stream-deanonymize --input-file stream.txt --mapping-file mapping.json
# Use custom configuration
promptsan --config examples/config_medical.json anonymize --input-file patient_data.txt
# JSON output (text and mapping)
promptsan anonymize --text "Email: [email protected]" --json

PromptSan includes ready-to-use configuration files:
- examples/config_basic.json - Basic regex patterns (emails, phones, dates, ZIPs)
- examples/config_enterprise.json - Enterprise environment (classifications, employee IDs, references)
- examples/config_medical.json - Healthcare data (MRN, DOB, doctor names, hospitals)
import json
from promptsan import PromptSanitizer, SanConfig
# Load a configuration file
with open('examples/config_medical.json', 'r') as f:
    config_data = json.load(f)
config = SanConfig(**config_data)
sanitizer = PromptSanitizer(config)
result = sanitizer.anonymize("Patient MRN: 123456, Dr. Smith at General Hospital")

{
    "strategies": ["regex", "dict"],
    "custom_dict": {
        "CONFIDENTIAL": "CLASSIFICATION",
        "Project Alpha": "PROJECT"
    },
    "regex_patterns": {
        "\\b[A-Z]{2,}\\d{6,8}\\b": "REFERENCE",
        "\\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}\\b": "EMAIL"
    },
    "llm_model": "dolphin3.0-llama3.1-8b",
    "llm_base_url": "http://localhost:1234/v1"
}

The examples/ directory contains comprehensive demonstrations:
- example_basic_usage.py - Getting started examples
- example_advanced_strategies.py - Dictionary and custom strategies
- example_streaming.py - Streaming deanonymization
- example_llm_integration.py - LLM-based anonymization
- example_configuration.py - Configuration patterns
- example_config_files.py - Using provided JSON config files
- example_real_llm.py - Real LLM integration (OpenAI/OpenRouter)
- example_complete.py - Full feature showcase
PromptSan follows a clean, minimal architecture:
promptsan/
├── __init__.py      # Public API
├── config.py        # Configuration dataclass
├── mapping.py       # Bidirectional entity mapping
├── strategies.py    # Anonymization strategies
├── sanitizer.py     # Main sanitizer class
└── cli.py           # Command-line interface
- SanConfig: Immutable configuration with Pydantic validation
- MappingStore: Bidirectional entity ↔ token mapping
- AnonymizationStrategy: Protocol for pluggable strategies
- PromptSanitizer: Main facade for all operations
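The sketch below shows how a bidirectional mapping store and a strategy protocol can fit together. It is illustrative only: `SketchMappingStore`, `Strategy`, `mask_project_names`, and `run_pipeline` are hypothetical names, and the real `MappingStore` and `AnonymizationStrategy` may differ. The only details taken from this README are the `add(value, label)` call that returns a token and the `(text, mapping, cfg) -> str` strategy shape.

```python
from typing import Any, Protocol


class SketchMappingStore:
    """Illustrative bidirectional store; not the real promptsan.mapping.MappingStore."""

    def __init__(self) -> None:
        self._to_token: dict[str, str] = {}   # entity -> token
        self._to_entity: dict[str, str] = {}  # token -> entity

    def add(self, entity: str, label: str) -> str:
        """Return the token for entity, creating a numbered one on first sight."""
        if entity not in self._to_token:
            token = f"__{label}_{len(self._to_token) + 1}__"
            self._to_token[entity] = token
            self._to_entity[token] = entity
        return self._to_token[entity]

    def entity_for(self, token: str) -> str:
        """Look up the original entity behind a token."""
        return self._to_entity[token]


class Strategy(Protocol):
    """Any callable with this shape can act as a strategy in this sketch."""

    def __call__(self, text: str, mapping: SketchMappingStore, cfg: Any) -> str: ...


def mask_project_names(text: str, mapping: SketchMappingStore, cfg: Any) -> str:
    """Toy dictionary-style strategy with a single hard-coded term."""
    for name in ("Project Alpha",):
        if name in text:
            text = text.replace(name, mapping.add(name, "PROJECT"))
    return text


def run_pipeline(
    text: str,
    strategies: list[Strategy],
    mapping: SketchMappingStore,
    cfg: Any = None,
) -> str:
    """Apply strategies in order, sharing one mapping store between them."""
    for strategy in strategies:
        text = strategy(text, mapping, cfg)
    return text
```

Keeping both directions in the store means anonymization can reuse a token for a repeated entity, while deanonymization can resolve a token without scanning the whole mapping.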
- Local Processing: No data sent to external services (except configured LLM)
- Reversible: Bidirectional mapping restores the original values from their tokens
- Configurable: Control exactly what gets anonymized
- Auditable: Clear mapping of what was changed
- LLM Privacy: Anonymize prompts before sending to cloud LLMs
- Data Sharing: Safe sharing of sensitive documents
- Development: Use real data structure with fake content
- Compliance: GDPR, HIPAA, SOX data protection
- Testing: Anonymize production data for testing
# Run example scripts
python examples/example_basic_usage.py
python examples/example_advanced_strategies.py
python examples/example_streaming.py
python examples/example_complete.py
# Run the LLM integration example
python examples/example_llm_integration.py # Requires local LLM

- Python 3.10+
- pydantic >= 2.0.0
- langchain >= 0.1.0 (for LLM strategy)
- langchain-openai >= 0.0.5 (for LLM strategy)
- openai >= 1.0.0 (for LLM strategy)
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all examples work
- Submit a pull request
MIT License - see LICENSE file for details.
- Built with LangChain for LLM integration
- Validation powered by Pydantic
- Inspired by privacy-first development practices
PromptSan: Minimal yet powerful text anonymization for the AI age.