
PromptSan

Minimal yet powerful text anonymization library

PromptSan is a lightweight library for anonymizing and deanonymizing text with pluggable strategies. It supports regex patterns, custom dictionaries, LLM-based anonymization, and streaming deanonymization.

Python 3.10+ · License: MIT

Features

  • Multiple Strategies: Regex, dictionary, and LLM-based anonymization
  • Streaming Support: Generator-based, real-time deanonymization of LLM outputs
  • Extensible: Easy to add custom anonymization strategies
  • Type-Safe: Full type hints and Pydantic validation
  • Minimal: Only 5 core files, clean architecture
  • LLM Integration: Local LLM support via LangChain

Installation

From PyPI (when published)

pip install promptsan

From Source

git clone https://github.com/lugnicca/prompt-san.git
cd prompt-san
pip install -e .

Development Installation

git clone https://github.com/lugnicca/prompt-san.git
cd prompt-san
python -m pip install -e ".[dev]"

Quick Start

Basic Usage

from promptsan import PromptSanitizer, SanConfig

# Default configuration (regex strategy)
sanitizer = PromptSanitizer()

# Anonymize text
result = sanitizer.anonymize("Contact Lugnicca at [email protected]")
print(result.text)     # "Contact Lugnicca at __EMAIL_1__"
print(result.mapping)  # {"[email protected]": "__EMAIL_1__"}

# Deanonymize text
restored = sanitizer.deanonymize(result.text, result.mapping)
print(restored)  # "Contact Lugnicca at [email protected]"

Custom Configuration

from promptsan import PromptSanitizer, SanConfig

# Custom regex patterns
config = SanConfig(
    strategies=["regex"],
    regex_patterns={
        r'\b[A-Z][a-z]+ [A-Z][a-z]+\b': 'PERSON',
        r'\b\d{3}-\d{2}-\d{4}\b': 'SSN',
        r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b': 'EMAIL'
    }
)

sanitizer = PromptSanitizer(config)
result = sanitizer.anonymize("John Smith, SSN: 123-45-6789, email: [email protected]")

Dictionary Strategy

config = SanConfig(
    strategies=["dict"],
    custom_dict={
        "Project Alpha": "PROJECT",
        "CONFIDENTIAL": "CLASSIFICATION",
        "Acme Corp": "COMPANY"
    }
)

sanitizer = PromptSanitizer(config)
result = sanitizer.anonymize("CONFIDENTIAL: Project Alpha by Acme Corp")

Combined Strategies

config = SanConfig(
    strategies=["regex", "dict"],  # Applied in sequence
    custom_dict={"OpenAI": "AI_COMPANY"},
    regex_patterns={r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b': 'EMAIL'}
)

sanitizer = PromptSanitizer(config)
result = sanitizer.anonymize("OpenAI researcher at [email protected]")

πŸ€– LLM Integration

PromptSan supports local LLM anonymization via LangChain:

# Load prompt template
with open('examples/llm_prompt_basic.txt', 'r') as f:
    prompt_template = f.read()

config = SanConfig(
    strategies=["llm"],
    llm_model="dolphin3.0-llama3.1-8b",
    llm_base_url="http://localhost:1234/v1",
    llm_prompt_template=prompt_template
)

sanitizer = PromptSanitizer(config)
result = sanitizer.anonymize("Dr. Smith at General Hospital treats Keanu Reeves")

Requirements:

  • Local LLM server (e.g., LM Studio, Ollama)
  • Server running on localhost:1234 (configurable)

🌊 Streaming Deanonymization

Perfect for real-time LLM output processing:

def llm_response_stream():
    """Simulate LLM generating response with tokens"""
    yield "Patient __PERSON_1__ "
    yield "at __HOSPITAL_2__ "
    yield "has condition __CONDITION_3__"

# Deanonymize stream in real-time
mapping = {
    "Miyamoto Musashi": "__PERSON_1__",
    "General Hospital": "__HOSPITAL_2__", 
    "hypertension": "__CONDITION_3__"
}

sanitizer = PromptSanitizer()
deanonymized = sanitizer.deanonymize_stream(llm_response_stream(), mapping)

for chunk in deanonymized:
    print(chunk, end='', flush=True)
# Output: "Patient Miyamoto Musashi at General Hospital has condition hypertension"

See examples/example_real_llm.py for a real-time LLM integration example.

πŸ”§ Custom Strategies

Extend PromptSan with custom anonymization logic:

from promptsan import PromptSanitizer, SanConfig
from promptsan.mapping import MappingStore
import re

def credit_card_strategy(text: str, mapping: MappingStore, cfg: SanConfig) -> str:
    """Custom strategy for credit card numbers"""
    pattern = r'\b(?:4\d{3}|5[1-5]\d{2})\s?(?:\d{4}\s?){2}\d{4}\b'
    
    for match in re.finditer(pattern, text):
        cc_number = match.group().replace(' ', '')
        token = mapping.add(cc_number, "CREDIT_CARD")
        text = text.replace(match.group(), token)
    
    return text

# Register the custom strategy and enable it via the config
config = SanConfig(strategies=["credit_card"])
sanitizer = PromptSanitizer(config)
sanitizer.register_strategy("credit_card", credit_card_strategy)

result = sanitizer.anonymize("Card: 4532 1234 5678 9012")

πŸ–₯️ Command Line Interface

PromptSan includes a CLI for quick operations:

# Anonymize text
promptsan anonymize --text "Lugnicca at [email protected]"

# Anonymize from file
promptsan anonymize --input-file document.txt --output-file anonymized.txt --mapping-file mapping.json

# Deanonymize text
promptsan deanonymize --text "__EMAIL_1__" --mapping-file mapping.json

# Stream deanonymize
promptsan stream-deanonymize --input-file stream.txt --mapping-file mapping.json

# Use custom configuration
promptsan --config examples/config_medical.json anonymize --input-file patient_data.txt

# JSON output (text and mapping)
promptsan anonymize --text "Email: [email protected]" --json

Configuration Files

PromptSan includes ready-to-use configuration files:

Provided Configurations

  • examples/config_basic.json - Basic regex patterns (emails, phones, dates, ZIPs)
  • examples/config_enterprise.json - Enterprise environment (classifications, employee IDs, references)
  • examples/config_medical.json - Healthcare data (MRN, DOB, doctor names, hospitals)

Using Configuration Files

import json
from promptsan import PromptSanitizer, SanConfig

# Load a configuration file
with open('examples/config_medical.json', 'r') as f:
    config_data = json.load(f)

config = SanConfig(**config_data)
sanitizer = PromptSanitizer(config)

result = sanitizer.anonymize("Patient MRN: 123456, Dr. Smith at General Hospital")

Custom Configuration

{
  "strategies": ["regex", "dict"],
  "custom_dict": {
    "CONFIDENTIAL": "CLASSIFICATION",
    "Project Alpha": "PROJECT"
  },
  "regex_patterns": {
    "\\b[A-Z]{2,}\\d{6,8}\\b": "REFERENCE",
    "\\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}\\b": "EMAIL"
  },
  "llm_model": "dolphin3.0-llama3.1-8b",
  "llm_base_url": "http://localhost:1234/v1"
}

πŸ“š Examples

The examples/ directory contains comprehensive demonstrations:

  • example_basic_usage.py - Getting started examples
  • example_advanced_strategies.py - Dictionary and custom strategies
  • example_streaming.py - Streaming deanonymization
  • example_llm_integration.py - LLM-based anonymization
  • example_configuration.py - Configuration patterns
  • example_config_files.py - Using provided JSON config files
  • example_real_llm.py - Real LLM integration (OpenAI/OpenRouter)
  • example_complete.py - Full feature showcase

πŸ—οΈ Architecture

PromptSan follows a clean, minimal architecture:

promptsan/
β”œβ”€β”€ __init__.py          # Public API
β”œβ”€β”€ config.py            # Configuration dataclass
β”œβ”€β”€ mapping.py           # Bidirectional entity mapping
β”œβ”€β”€ strategies.py        # Anonymization strategies
β”œβ”€β”€ sanitizer.py         # Main sanitizer class
└── cli.py              # Command-line interface

Core Concepts

  • SanConfig: Immutable configuration with Pydantic validation
  • MappingStore: Bidirectional entity ↔ token mapping
  • AnonymizationStrategy: Protocol for pluggable strategies (sketched below)
  • PromptSanitizer: Main facade for all operations
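
The strategy protocol itself is not spelled out in this README, but the custom credit-card strategy above suggests its shape: a callable that receives the text, the mapping store, and the configuration, and returns the transformed text. A minimal sketch of that shape follows; the Protocol below is an illustration inferred from that signature, not the literal definition in strategies.py:

from typing import Protocol

from promptsan import SanConfig
from promptsan.mapping import MappingStore

class AnonymizationStrategy(Protocol):
    """Illustrative protocol shape, inferred from the custom-strategy example above."""

    def __call__(self, text: str, mapping: MappingStore, cfg: SanConfig) -> str:
        """Replace sensitive entities in `text` with tokens recorded in `mapping`."""
        ...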

πŸ”’ Privacy & Security

  • Local Processing: No data sent to external services (except configured LLM)
  • Reversible: Perfect bidirectional mapping
  • Configurable: Control exactly what gets anonymized
  • Auditable: Clear mapping of what was changed

🎯 Use Cases

  • LLM Privacy: Anonymize prompts before sending to cloud LLMs (see the sketch after this list)
  • Data Sharing: Safe sharing of sensitive documents
  • Development: Use real data structure with fake content
  • Compliance: GDPR, HIPAA, SOX data protection
  • Testing: Anonymize production data for testing
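
For the LLM-privacy use case, the pieces above compose into a simple local pipeline: anonymize the prompt, send only the tokenized text to the cloud model, then deanonymize the reply. A hedged sketch follows; call_cloud_llm is a stand-in for whatever client you use, not part of PromptSan, and it assumes the default regex strategy recognizes email addresses as in the Quick Start example.

from promptsan import PromptSanitizer

sanitizer = PromptSanitizer()

def call_cloud_llm(prompt: str) -> str:
    # Placeholder for your real OpenAI/OpenRouter/etc. client call
    return f"Reply regarding {prompt}"

# 1. Anonymize locally before the prompt leaves your machine
result = sanitizer.anonymize("Summarize the complaint from [email protected]")

# 2. Only the tokenized text (e.g. containing __EMAIL_1__) is sent to the cloud
reply = call_cloud_llm(result.text)

# 3. Restore the real entities in the reply, locally
print(sanitizer.deanonymize(reply, result.mapping))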

πŸ§ͺ Testing

# Run example scripts
python examples/example_basic_usage.py
python examples/example_advanced_strategies.py
python examples/example_streaming.py
python examples/example_complete.py

# Run with custom configuration
python examples/example_llm_integration.py  # Requires local LLM
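
The Contributing section below asks that new functionality come with tests. This README does not describe a test suite, but a round-trip check written against the public API could look something like the following; this is a sketch assuming pytest is available via the "[dev]" extras, and the file name is illustrative:

# test_roundtrip.py (illustrative)
from promptsan import PromptSanitizer

def test_email_roundtrip():
    sanitizer = PromptSanitizer()
    original = "Contact Lugnicca at [email protected]"

    result = sanitizer.anonymize(original)

    # The email must be replaced by a token in the anonymized text
    assert "[email protected]" not in result.text

    # Deanonymization must restore the original text exactly
    assert sanitizer.deanonymize(result.text, result.mapping) == original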

πŸ“‹ Requirements

  • Python 3.10+
  • pydantic >= 2.0.0
  • langchain >= 0.1.0 (for LLM strategy)
  • langchain-openai >= 0.0.5 (for LLM strategy)
  • openai >= 1.0.0 (for LLM strategy)

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all examples work
  5. Submit a pull request

πŸ“„ License

MIT License - see LICENSE file for details.

πŸ™ Acknowledgments

  • Built with LangChain for LLM integration
  • Validation powered by Pydantic
  • Inspired by privacy-first development practices

PromptSan: Minimal yet powerful text anonymization for the AI age.
