Skip to content

Commit dd1623f

Browse files
authored
Add Azure OpenAI support for LangExtract recognizer (#1801)
* Add Azure OpenAI support for LangExtract recognizer * ruff * add redundnant tests to achieve 100% coverage * Add error handling and tests for Azure OpenAI provider initialization * Fix Microsoft Defender secret scanning false positives Replace test API keys with obviously fake placeholders: - test-api-key → PLACEHOLDER_NOT_A_REAL_KEY - test-key-123 → PLACEHOLDER_NOT_A_REAL_KEY - env-key → PLACEHOLDER_FROM_ENV - key → PLACEHOLDER_KEY These are unit test mock values, not real secrets. Using placeholder patterns that won't trigger security scanners while maintaining test validity. All 29 tests passing. * remove bandit * Add bandit tool to Microsoft Security DevOps workflow * Remove bandit from Microsoft Security DevOps workflow tools * Update AzureOpenAILangExtractRecognizer to use deployment name from environment variable * Refactor Azure OpenAI integration: remove legacy provider, update recognizer, and adjust tests * Improve error handling during Azure OpenAI provider registration by logging as error and raising exception * Refactor Azure OpenAI provider initialization logging for improved readability * Refactor test imports in Azure OpenAI recognizer tests for improved clarity and organization * Refactor Azure OpenAI provider tests to remove unnecessary variable assignments for improved clarity * Refactor langextract configuration: reorder supported entities and update entity mappings for consistency * Refactor Azure OpenAI integration: enhance documentation, improve endpoint validation, and streamline provider registration * Refactor Azure OpenAI LangExtract Recognizer: remove unused import and clean up code formatting * Refactor Azure OpenAI provider tests: update imports to use the correct module and remove obsolete test for langextract availability * Refactor Azure authentication handling: consolidate credential management into azure_auth_helper and update related recognizers and tests * Refactor Azure OpenAI recognizers: enhance module imports for registration, streamline model ID handling, and improve test coverage for credential selection * Refactor AHDS Surrogate operator: streamline error handling by mocking Azure credentials and client, and improve code readability * Refactor AHDS Recognizer tests: replace multiple credential mocks with a single get_azure_credential mock for improved clarity and maintainability
1 parent e606fee commit dd1623f

17 files changed

Lines changed: 1416 additions & 191 deletions

.github/workflows/defender-for-devops.yml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -50,7 +50,9 @@ jobs:
5050
env:
5151
GDN_CHECKOV_SKIPPATH: 'docs'
5252
with:
53-
tools: bandit, checkov, templateanalyzer, trivy
53+
# bandit removed due to tool bug preventing CI from passing
54+
# TODO: re-add bandit when fixed
55+
tools: checkov, templateanalyzer, trivy
5456

5557
- name: Upload results to Security tab
5658
uses: github/codeql-action/upload-sarif@v4

docs/samples/python/langextract/index.md

Lines changed: 216 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -25,23 +25,42 @@ For the default entity mappings and examples, see the [default configuration](ht
2525

2626
Presidio supports the following language model providers through LangExtract:
2727

28-
1. **Ollama** - Local language model deployment (open-source models like Gemma, Llama, etc.)
29-
2. **Azure OpenAI** - _Documentation coming soon_
28+
1. **Azure OpenAI** - Cloud-based Azure OpenAI Service (GPT-4o, GPT-4, GPT-3.5-turbo, etc.)
29+
2. **Ollama** - Local language model deployment (open-source models like Gemma, Llama, etc.)
30+
31+
## Choosing Between Azure OpenAI and Ollama
32+
33+
| Feature | Azure OpenAI | Ollama |
34+
|---------|--------------|--------|
35+
| **Deployment** | Cloud (Azure) | Local (on-premises) |
36+
| **Cost** | Pay-per-use (tokens) | Free (hardware required) |
37+
| **Models** | GPT-4o, GPT-4, GPT-3.5-turbo | Open-source (Gemma, Llama, etc.) |
38+
| **Privacy** | Microsoft Azure compliance | Complete data control |
39+
| **Setup** | Azure Portal + API key/Managed Identity | Docker/local installation |
40+
| **Authentication** | API Key or Managed Identity (RBAC) | None (local) |
41+
| **Best For** | Production, enterprise compliance | Development, on-premises requirements |
42+
43+
**Recommendations**:
44+
45+
- **Use Azure OpenAI** for production workloads requiring enterprise security, compliance (HIPAA, SOC 2, etc.), and managed infrastructure
46+
- **Use Ollama** for local development, testing, or when data must stay on-premises
3047

3148
## Language Model-based Recognizer Implementation
3249

3350
Presidio provides a hierarchy of recognizers for language model-based PII/PHI detection:
3451

3552
- **`LMRecognizer`**: Abstract base class for all language model recognizers (LLMs, SLMs, etc.)
3653
- **`LangExtractRecognizer`**: Abstract base class for LangExtract library integration (model-agnostic)
54+
- **`AzureOpenAILangExtractRecognizer`**: Concrete implementation for Azure OpenAI Service
55+
- [Implementation](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/third_party/azure_openai_langextract_recognizer.py)
3756
- **`OllamaLangExtractRecognizer`**: Concrete implementation for Ollama local language models
38-
- **`AzureOpenAILangExtractRecognizer`**: _Documentation coming soon_
39-
40-
[OllamaLangExtractRecognizer implementation](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/third_party/ollama_langextract_recognizer.py)
57+
- [Implementation](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/third_party/ollama_langextract_recognizer.py)
4158

4259
---
4360

44-
## Using Ollama (Local Models)
61+
## Using Azure OpenAI (Cloud Models)
62+
63+
Azure OpenAI provides cloud-based access to OpenAI models (GPT-4o, GPT-4, GPT-3.5-turbo) with enterprise security and compliance features.
4564

4665
### Prerequisites
4766

@@ -172,10 +191,197 @@ See the [configuration file](https://github.com/microsoft/presidio/blob/main/pre
172191

173192
## Using Azure OpenAI (Cloud Models)
174193

175-
_Documentation coming soon_
194+
Azure OpenAI provides cloud-based access to OpenAI models (GPT-4o, GPT-4, GPT-3.5-turbo) with enterprise security and compliance features.
176195

177-
---
196+
### Prerequisites
197+
198+
1. **Install Presidio with LangExtract support**:
199+
```sh
200+
pip install presidio-analyzer[langextract]
201+
```
202+
203+
This installs `langextract` with OpenAI support, including the OpenAI Python SDK and Azure Identity libraries.
204+
205+
2. **Azure Subscription**: Create one at [azure.microsoft.com](https://azure.microsoft.com)
206+
207+
3. **Azure OpenAI Resource**:
208+
- Create an Azure OpenAI resource in [Azure Portal](https://portal.azure.com)
209+
- Request access if needed (some regions require approval)
210+
- Deploy a model and note the **deployment name** you choose (e.g., "gpt-4", "my-gpt-deployment")
211+
212+
4. **Optional: Download config file** (only if customizing entities/prompts):
213+
214+
```sh
215+
# On macOS/Linux/PowerShell:
216+
wget https://raw.githubusercontent.com/microsoft/presidio/main/presidio-analyzer/presidio_analyzer/conf/langextract_config_azureopenai.yaml
217+
218+
# Or download manually from:
219+
# https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/langextract_config_azureopenai.yaml
220+
```
221+
222+
### Authentication Options
223+
224+
Azure OpenAI supports multiple authentication methods with flexible configuration:
225+
226+
#### Option 1: Direct Parameters (Recommended for Most Users)
227+
228+
Simplest approach - pass credentials and deployment name as parameters:
229+
230+
```python
231+
from presidio_analyzer import AnalyzerEngine
232+
from presidio_analyzer.predefined_recognizers import AzureOpenAILangExtractRecognizer
233+
234+
# Initialize with deployment name and credentials
235+
azure_openai = AzureOpenAILangExtractRecognizer(
236+
model_id="gpt-4", # Your Azure deployment name
237+
azure_endpoint="https://your-resource.openai.azure.com/",
238+
api_key="your-api-key-here",
239+
api_version="2024-02-15-preview" # Optional
240+
)
241+
242+
analyzer = AnalyzerEngine()
243+
analyzer.registry.add_recognizer(azure_openai)
244+
245+
results = analyzer.analyze(
246+
text="My email is [email protected] and my phone is 555-123-4567",
247+
language="en"
248+
)
249+
```
250+
251+
#### Option 2: Environment Variables
252+
253+
Use environment variables for credentials:
254+
255+
```python
256+
import os
257+
from presidio_analyzer import AnalyzerEngine
258+
from presidio_analyzer.predefined_recognizers import AzureOpenAILangExtractRecognizer
259+
260+
# Set environment variables
261+
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource.openai.azure.com/"
262+
os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key-here"
263+
264+
# Initialize with just deployment name
265+
azure_openai = AzureOpenAILangExtractRecognizer(
266+
model_id="gpt-4" # Your Azure deployment name
267+
)
268+
269+
analyzer = AnalyzerEngine()
270+
analyzer.registry.add_recognizer(azure_openai)
271+
272+
results = analyzer.analyze(
273+
text="My email is [email protected] and my phone is 555-123-4567",
274+
language="en"
275+
)
276+
```
178277

179-
## Choosing Between Ollama and Azure OpenAI
278+
#### Option 3: Managed Identity (Production)
180279

181-
_Comparison documentation coming soon_
280+
**More secure** - No API keys in code, uses Azure RBAC:
281+
282+
```python
283+
import os
284+
from presidio_analyzer import AnalyzerEngine
285+
from presidio_analyzer.predefined_recognizers import AzureOpenAILangExtractRecognizer
286+
287+
# Set endpoint (no API key = uses managed identity)
288+
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource.openai.azure.com/"
289+
290+
# Initialize without API key (uses managed identity)
291+
azure_openai = AzureOpenAILangExtractRecognizer(
292+
model_id="gpt-4" # Your Azure deployment name
293+
)
294+
295+
analyzer = AnalyzerEngine()
296+
analyzer.registry.add_recognizer(azure_openai)
297+
298+
results = analyzer.analyze(
299+
text="Patient John Smith has SSN 123-45-6789",
300+
language="en"
301+
)
302+
```
303+
304+
**Managed Identity Authentication Flow** (Production):
305+
306+
When `api_key` is not provided, the provider automatically uses `ChainedTokenCredential` which tries credentials in order:
307+
308+
1. **EnvironmentCredential** - Service principal from environment variables
309+
2. **WorkloadIdentityCredential** - Azure Kubernetes Service workload identity
310+
3. **ManagedIdentityCredential** - Azure VM/App Service managed identity
311+
312+
For local development, set `ENV=development` to use `DefaultAzureCredential` instead:
313+
314+
```python
315+
import os
316+
os.environ["ENV"] = "development"
317+
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource.openai.azure.com/"
318+
# AZURE_OPENAI_API_KEY not set - uses DefaultAzureCredential in development mode
319+
# (includes Azure CLI, VS Code, etc.)
320+
321+
azure_openai = AzureOpenAILangExtractRecognizer()
322+
```
323+
324+
**Setup Managed Identity**:
325+
326+
1. Enable managed identity on your Azure resource (VM, App Service, Container Instance, etc.)
327+
2. Grant the managed identity **"Cognitive Services OpenAI User"** role on your Azure OpenAI resource
328+
3. No API keys needed - authentication is automatic
329+
330+
See [Azure Managed Identity documentation](https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/overview) for details.
331+
332+
### Configuration File (Optional)
333+
334+
The configuration file is **optional** for basic usage. You only need it to customize:
335+
336+
- Supported entity types
337+
- Entity mappings (LangExtract → Presidio)
338+
- Prompts and examples
339+
- Detection parameters
340+
341+
For basic usage, just pass `model_id` as a parameter (see examples above).
342+
343+
**When you need a custom config:**
344+
345+
1. **Download** the default config:
346+
347+
```sh
348+
wget https://raw.githubusercontent.com/microsoft/presidio/main/presidio-analyzer/presidio_analyzer/conf/langextract_config_azureopenai.yaml
349+
```
350+
351+
2. **Customize** entities, prompts, or other settings in the file
352+
353+
3. **Use** the customized config:
354+
355+
```python
356+
recognizer = AzureOpenAILangExtractRecognizer(
357+
model_id="gpt-4", # Can override config's model_id
358+
config_path="./custom_config.yaml",
359+
azure_endpoint="...",
360+
api_key="..."
361+
)
362+
```
363+
364+
**Configuration Reference**:
365+
366+
The config file contains two main sections:
367+
368+
**`lm_recognizer` section** (LLM recognizer settings):
369+
370+
- `supported_entities`: List of PII/PHI entity types to detect
371+
- `labels_to_ignore`: Entity labels to skip during processing
372+
- `enable_generic_consolidation`: Whether to consolidate unknown entities to GENERIC_PII_ENTITY
373+
- `min_score`: Minimum confidence score threshold (0.0-1.0)
374+
375+
**`langextract` section** (LangExtract-specific settings):
376+
377+
- `model.model_id`: Azure OpenAI deployment name (e.g., "gpt-4o", "gpt-4", "gpt-35-turbo")
378+
- `model.temperature`: Model temperature for generation (null = use model default)
379+
- `prompt_file`: Path to custom prompt template file
380+
- `examples_file`: Path to few-shot examples file
381+
- `entity_mappings`: Map LangExtract entity classes to Presidio entity names
382+
383+
See the [full config file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/langextract_config_azureopenai.yaml) for details.
384+
385+
---
386+
387+
## Using Ollama (Local Models)

presidio-analyzer/README.md

Lines changed: 41 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -14,15 +14,53 @@ Named Entity Recognition and other types of logic to detect PII in unstructured
1414

1515
### Language Model-based PII/PHI Detection
1616

17-
Presidio analyzer supports language model-based PII/PHI detection (LLMs, SLMs) for flexible entity recognition. The current implementation uses [LangExtract](https://github.com/google/langextract) with Ollama for local model deployment.
17+
Presidio analyzer supports language model-based PII/PHI detection (LLMs, SLMs) for flexible entity recognition. The current implementation uses [LangExtract](https://github.com/google/langextract) with support for multiple providers:
18+
19+
- **Ollama** - Local model deployment for privacy-sensitive environments
20+
- **Azure OpenAI** - Cloud-based deployment with enterprise features
1821

1922
```bash
2023
pip install presidio-analyzer[langextract]
2124
```
2225

23-
**Note:** The Ollama recognizer does not validate server connectivity or model availability during initialization. Connection errors or missing models will be reported when `analyze()` is first called. Ensure Ollama is running and the required model is installed before analysis.
26+
#### Quick Usage
27+
28+
**Ollama** (local models):
29+
30+
```python
31+
from presidio_analyzer.predefined_recognizers import OllamaLangExtractRecognizer
32+
recognizer = OllamaLangExtractRecognizer() # Uses default config
33+
```
34+
35+
**Azure OpenAI** (cloud models):
36+
37+
```python
38+
from presidio_analyzer.predefined_recognizers import AzureOpenAILangExtractRecognizer
39+
40+
# Simple usage - pass everything as parameters
41+
recognizer = AzureOpenAILangExtractRecognizer(
42+
model_id="gpt-4", # Your Azure deployment name
43+
azure_endpoint="https://your-resource.openai.azure.com/",
44+
api_key="your-api-key"
45+
)
46+
47+
# Or use environment variables (AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY):
48+
recognizer = AzureOpenAILangExtractRecognizer(
49+
model_id="gpt-4" # Your Azure deployment name
50+
)
51+
52+
# Advanced: Customize entities/prompts with config file
53+
recognizer = AzureOpenAILangExtractRecognizer(
54+
model_id="gpt-4",
55+
config_path="./custom_config.yaml", # Optional: for custom entities/prompts
56+
azure_endpoint="https://your-resource.openai.azure.com/",
57+
api_key="your-api-key"
58+
)
59+
```
60+
61+
**Note:** LangExtract recognizers do not validate connectivity during initialization. Connection errors or missing models will be reported when `analyze()` is first called.
2462

25-
See the [Language Model-based PII/PHI Detection guide](https://microsoft.github.io/presidio/samples/python/langextract/) for setup and usage.
63+
See the [Language Model-based PII/PHI Detection guide](https://microsoft.github.io/presidio/samples/python/langextract/) for complete setup and usage instructions.
2664

2765
## Deploy Presidio analyzer to Azure
2866

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
# Azure OpenAI Configuration for LangExtract
2+
#
3+
# This config file is OPTIONAL for basic usage. You can pass model_id and credentials
4+
# as parameters instead of using this file.
5+
#
6+
# Use this file when you need to customize:
7+
# - Supported entities
8+
# - Entity mappings
9+
# - Prompts and examples
10+
# - Detection parameters
11+
#
12+
# IMPORTANT: The model_id below is a placeholder. You can:
13+
# 1. Pass model_id as a parameter: AzureOpenAILangExtractRecognizer(model_id="your-deployment-name")
14+
# 2. OR update model_id below to match your Azure OpenAI deployment name
15+
16+
lm_recognizer:
17+
supported_entities:
18+
- PERSON
19+
- EMAIL_ADDRESS
20+
- PHONE_NUMBER
21+
- US_SSN
22+
- LOCATION
23+
- ORGANIZATION
24+
- DATE_TIME
25+
- CREDIT_CARD
26+
- MEDICAL_LICENSE
27+
- IP_ADDRESS
28+
- URL
29+
- IBAN_CODE
30+
31+
labels_to_ignore:
32+
- payment_status
33+
- metadata
34+
- annotation
35+
36+
enable_generic_consolidation: true
37+
min_score: 0.5
38+
39+
langextract:
40+
prompt_file: "presidio-analyzer/presidio_analyzer/conf/langextract_prompts/default_pii_phi_prompt.j2"
41+
examples_file: "presidio-analyzer/presidio_analyzer/conf/langextract_prompts/default_pii_phi_examples.yaml"
42+
43+
model:
44+
# Azure OpenAI deployment name (e.g., "gpt-4", "gpt-4o", "my-gpt-deployment")
45+
# This is the deployment name from Azure Portal, NOT the model name
46+
# You can override this by passing model_id parameter to the recognizer
47+
model_id: "gpt-4o"
48+
temperature: null
49+
50+
entity_mappings:
51+
person: PERSON
52+
name: PERSON
53+
full_name: PERSON
54+
email: EMAIL_ADDRESS
55+
phone: PHONE_NUMBER
56+
ssn: US_SSN
57+
location: LOCATION
58+
address: LOCATION
59+
organization: ORGANIZATION
60+
date: DATE_TIME
61+
credit_card: CREDIT_CARD
62+
medical_record: MEDICAL_LICENSE
63+
medical_license: MEDICAL_LICENSE
64+
ip_address: IP_ADDRESS
65+
url: URL
66+
iban: IBAN_CODE

0 commit comments

Comments
 (0)