Remove poetry cache from docker images #1785

tamirkamara · 2025-11-16T13:24:22Z

Change Description

Poetry keeps a cache directory containing files that are not needed at runtime, so removing it reduces the size of the Docker images.
Analyzer was 2.15GB and is now 1.99GB (0.16GB smaller).
Anonymizer was 388MB and is now 375MB (13MB smaller).
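
For readers who want to see the idea in context, here is a minimal sketch of the pattern, not the actual Presidio Dockerfile. The base image, the `POETRY_VIRTUALENVS_IN_PROJECT` setting, and the install flags are assumptions made for illustration; only the `rm -rf $(poetry config cache-dir)` step comes from this PR.

```dockerfile
# Sketch only: base image, paths, and install flags are assumptions.
FROM python:3.12-slim

# By default Poetry creates virtualenvs under {cache-dir}/virtualenvs, so the
# environment is kept inside the project here; otherwise deleting the cache
# directory would delete the installed packages as well.
ENV POETRY_VIRTUALENVS_IN_PROJECT=true

WORKDIR /app
COPY pyproject.toml poetry.lock* /app/

# Install and clean up in the same RUN instruction. Image layers are additive,
# so a cache deleted in a later instruction would still take up space in the
# earlier layer.
RUN pip install --no-cache-dir poetry \
    && poetry install --no-root --only=main \
    && rm -rf "$(poetry config cache-dir)"
```

The cleanup only shrinks the image when it runs in the same layer that populated the cache, which is why the PR appends it to the existing install command.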

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required

SharonHart

Thanks, multistage build? :)
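
A multistage build, as suggested here, could look roughly like the sketch below. This is not what the PR implements; the stage name, base image, and `app.py` entry point are hypothetical and only illustrate the technique.

```dockerfile
# Sketch only: stage name, base image, and app.py are hypothetical.
FROM python:3.12-slim AS builder
# Create the virtualenv at /app/.venv so it can be copied as a single directory.
ENV POETRY_VIRTUALENVS_IN_PROJECT=true
WORKDIR /app
COPY pyproject.toml poetry.lock* /app/
RUN pip install --no-cache-dir poetry \
    && poetry install --no-root --only=main

# Final stage: same base image and same path, so the copied virtualenv stays valid.
FROM python:3.12-slim
WORKDIR /app
# Only the virtualenv crosses over; Poetry itself and its cache stay behind
# in the builder stage and never reach the final image.
COPY --from=builder /app/.venv /app/.venv
ENV PATH="/app/.venv/bin:$PATH"
COPY app.py /app/
CMD ["python", "app.py"]
```

With this layout no explicit cache cleanup is needed, at the cost of a more involved Dockerfile.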

omri374 · 2025-11-17T17:28:06Z

presidio-analyzer/Dockerfile

+    && rm -rf $(poetry config cache-dir)
+
 # install nlp models specified in NLP_CONF_FILE
 COPY ./install_nlp_models.py /app/


@tamirkamara the install_nlp_models.py step also runs a bunch of pip installs. Maybe a better solution would be to clear the cache after this step. Have you considered it?

tamirkamara

@omri374 we disable the pip cache within the Dockerfiles. For example: https://github.com/microsoft/presidio/blob/main/presidio-analyzer/Dockerfile#L6
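
As a rough illustration of what disabling the pip cache in a Dockerfile can look like (the linked line is not reproduced here, so treat the exact form as an assumption), pip reads the `PIP_NO_CACHE_DIR` environment variable as the equivalent of its `--no-cache-dir` option. The base image and the `requests` package below are placeholders.

```dockerfile
# Sketch only: base image and package name are placeholders.
FROM python:3.12-slim

# Environment-variable form of pip's --no-cache-dir option; it applies to
# every later pip invocation in the build, including pip calls made by scripts.
ENV PIP_NO_CACHE_DIR=1

# No wheel or HTTP cache is written, so nothing needs to be removed afterwards.
RUN pip install requests
```

With a setting like this in place, pip installs made by a later step leave no pip cache behind to clean up.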

* fix unit tests (microsoft#1778) * intial commit * Remove skip marker for spacy_nlp_engine fixture * Remove skip markers for stanza and transformers NLP engine fixtures * move poetry cache dir (microsoft#1784) * remove poetry cache from docker images (microsoft#1785) * rename dockerignore files (microsoft#1787) * Remove build-essential from the Analyzer docker image (microsoft#1789) * update docker * more ignores * Bump actions/checkout from 5 to 6 (microsoft#1793) Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v5...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix dev container permission issues (microsoft#1788) * Fix dev container permission issues by removing USER directive Removes USER directive from Dockerfile.dev files to fix permission denied errors when accessing bind-mounted workspaces. Dev containers now run as root, which is standard practice for local development environments and matches the original working configuration before PR microsoft#1759. 
Fixes microsoft#1782 * Fix dev container permission issues and Poetry 2.0 compatibility - Remove USER directive from Dockerfile.dev files to fix permission errors - Remove poetry shell commands (not available in Poetry 2.0) - Configure VS Code to use Poetry venv automatically Fixes microsoft#1782 --------- Co-authored-by: Omri Mendels <[email protected]> Co-authored-by: Sharon Hart <[email protected]> * CI coverage test (microsoft#1794) * Add coverage checks to CI and include pytest-cov in dependencies * Add coverage configuration to pyproject.toml files for all packages * Add coverage reporting to CI workflow with combined report generation * Enhance coverage reporting in CI by generating detailed summaries and status badges * Refactor coverage reporting in CI to streamline summary generation and improve output formatting * updating the final table * Refactor coverage report generation in CI to build message dynamically and improve formatting * Add diff-cover support and enhance coverage reporting in CI * Refactor coverage reporting in CI to simplify coverage data handling and enhance PR comment functionality * Enhance coverage reporting in CI by renaming coverage files for better merging and updating upload paths * Add coverage configuration for relative paths and enhance artifact fetching * Refactor coverage reporting in CI to use a relative coverage configuration file and simplify coverage file handling * Refactor CI coverage reporting to check for file changes and comment on PRs, removing combined coverage report steps * Refactor CI unit test coverage reporting for Python 3.12 to include HTML report and improve file change detection logic * Add test coverage trigger comment to AnalyzerEngine docstring * trigger tests * Refactor CI unit test coverage reporting for Python 3.12 to simplify conditions and improve coverage output handling * Update coverage command to use pyproject.toml for configuration * asd * Enhance coverage reporting in CI workflow to include 
detailed PR diff coverage and upload coverage metrics * Improve diff coverage percentage formatting in CI workflow * Fix diff coverage percentage calculation in CI workflow to handle missing values * trigger coverage change * Update coverage threshold to 80% in CI workflow * fix ruff * Add calculate_pii_density method to analyze PII density in text * Enhance coverage check in CI workflow with 80% threshold and detailed summary * Refactor CI test job to streamline coverage reporting and remove unnecessary permissions * Enhance CI coverage checks: enforce diff coverage on PRs and append summary * Set fetch-depth to 0 for actions/checkout to ensure full history is available * Enhance test coverage reporting: include branch coverage and missing lines in output * Enhance coverage check in CI: show uncovered lines in diff coverage report * Enhance coverage check in CI: include git diff options for more detailed output * Refine coverage check in CI: remove redundant git diff options from diff-cover command * Enhance coverage reporting in CI: update coverage command and add PR comment action * Enhance coverage check in CI: parameterize coverage threshold for consistency * Enhance coverage check in CI: dynamically set package name for coverage reporting * Refine coverage check in CI: remove minimum coverage threshold environment variable * Enhance CI configuration: add coverage path for component-specific coverage data * Fix coverage path in CI: update to use component path directly * Enhance CI configuration: add permissions for pull requests in test job * Refactor AnalyzerEngine: update class docstring and remove unused calculate_pii_density method * Update CI configuration: use environment variables for Python versions and primary Python in tests * Add coverage threshold environment variable to CI job * Update CI configuration: set Python versions directly in matrix and define coverage threshold * changing threshold to 90 * Update CI job permissions to allow write access 
for contents * Fix typo in coverage check message for clarity * remove duplicate component name * Enable credential persistence for checkout action in CI workflow * remove the if for PRS only to allow run on the default branch main. * Fix coverage job to use component path for SUBPROJECT_ID (microsoft#1798) * Language models integration (LangExtract) (microsoft#1775) * Add LangExtract recognizer for PII extraction - Introduced LangExtract recognizer to enhance PII detection capabilities. - Added configuration files for LangExtract prompts and examples. - Implemented LangExtractRecognizer class to handle PII extraction using LangExtract. - Created tests for LangExtract recognizer to ensure functionality and reliability. - Added a simple standalone test script for quick validation of LangExtract setup. - Updated pyproject.toml to include langextract as a dependency. * refine the docs * narrow support for oollama only * Refactor LangExtract tests to use Ollama; remove API key dependency * adding first draft of docker compose * Update model_id in tests to use 'gemma2:2b' instead of 'gemini-2.5-flash' * Refactor LangExtract documentation to focus on Ollama support; remove references to other LLM providers * Update README to remove Ollama setup instructions and clarify integration guide reference * Enhance Ollama installation script with progress messages and error handling; update model download method for better user feedback * auto ruff fixes * Enhance LangExtractRecognizer tests with real Ollama integration - Updated `langextract_recognizer_class` fixture to create a test-specific configuration for LangExtractRecognizer, enabling it for testing. - Refactored tests in `test_langextract_recognizer.py` to utilize the new configuration and validate the recognizer's behavior with real Ollama. - Removed mock-based tests for LangExtract and replaced them with integration tests that check the recognizer's functionality against a running Ollama instance. 
- Added tests to verify the recognizer's initialization, entity detection, and error handling when the Ollama server is unreachable. - Ensured that only requested entities are returned and that results include analysis explanations. * Remove unnecessary line breaks in LLM-based PII detection section of README * Add LangExtract LLM-based PII detection test and configuration * Improve Ollama availability check with setup attempt message * Increase wait time for services and update healthcheck parameters for Ollama service * Add Ollama setup for Analyzer tests and improve availability check * Set timeout for Ollama setup in Analyzer tests to 8 minutes * Enhance Ollama setup for Analyzer tests with improved installation and readiness checks * Update Ollama model references from gemma2:2b to llama3.2:1b across configuration and test files * Update model references from llama3.2:1b to gemma2:2b across configuration, scripts, and tests * Remove 'enabled' configuration from LangExtract settings in YAML files and update tests accordingly * Update Ollama service configuration: change port mapping and modify healthcheck command * Refactor logging in LangExtractRecognizer: reduce verbosity and improve clarity of extraction results * Update Ollama service configuration: modify port mapping and healthcheck command * Update CI workflow and tests: reduce sleep duration and add environment variable for LangExtract recognizer * Reduce sleep duration in CI workflow from 150 to 60 seconds * Update LangExtract model references from gemma2:2b to gemma3:1b and remove obsolete installation script * docs and prompt fixes * finalizing the pr * Update Ollama image to latest version and add LangExtract PII/PHI extraction examples and prompts * fix bad example * fix unit-tests * refactor: clean up .env file and simplify skip_engine logic in tests * chore: add a new line to .env file for better readability * chore: remove unnecessary blank line from .env file * revert .env * intial commit * 
Remove skip marker for spacy_nlp_engine fixture * Remove skip markers for stanza and transformers NLP engine fixtures * Remove Ollama recognizer test and update default recognizers configuration * Remove unused Ollama recognizer configuration and update prompt file references * Add end-to-end tests for API anonymization and redaction features - Implemented tests for the anonymization API in `test_api_anonymizer.py`, covering various scenarios including valid requests, empty inputs, malformed requests, and custom anonymizers. - Created integration tests in `test_api_e2e_integration_flows.py` to validate the analyze and anonymize workflow with PII detection. - Added tests for image redaction functionality in `test_api_image_redactor.py`, ensuring proper handling of image data and error responses. - Developed package-level tests in `test_package_e2e_integration_flows.py` to verify the functionality of the analyzer and anonymizer engines, including support for third-party recognizers. * Remove unused Ollama recognizer imports and related tests * Update requirements and improve Ollama recognizer availability checks in e2e tests * Fix formatting in requirements.txt for analyzer and anonymizer dependencies * Update Ollama model ID from gemma3:1b to gemma2:2b in configuration and tests * gemma2:2b * finalizing the pr * Remove unused ABC import from lm_recognizer.py * Fix indentation in docker-compose.yml for volumes section * Fix line break for clarity in adding_recognizers.md * Add timeout settings for Ollama recognizer and test cases * Refactor timeout comment for clarity in OllamaLangExtractRecognizer * Update Ollama model version and add configuration for LangExtract recognizer tests * Update examples_file path in configuration for Ollama recognizer * Remove timeout decorator from Ollama recognizer * Add rerun settings to unit and E2E tests for improved stability * Remove rerun settings from unit and E2E test commands for simplification * Set max-parallel to 2 for 
local build and E2E tests * Remove max-parallel setting from local build and E2E tests * move poetry cache dir * pr changes * code review changes * ruff check * remove unused json import in test_ollama_recognizer.py * refactor test names for clarity and consistency in test_ollama_recognizer.py * finalizing the PR * self code review fixes * Refactor LangExtractRecognizer to use yaml for configuration loading * ruff fixes * Update error messages in OllamaLangExtractRecognizer tests for clarity * CR comment addressed * Remove unused variables from Jinja2 prompt rendering in LangExtractRecognizer * exporting functionality to helpers enlarging composition * composition * Refactor entity mapper and langextract utilities for improved clarity and consistency * Refactor tests to use get_langextract_module for mocking LangExtract availability * Refactor langextract utilities to improve clarity and error handling; remove deprecated functions and update tests accordingly * Refactor LLM utilities by simplifying docstrings and consolidating imports for improved readability and maintainability * Update error message for missing Jinja2 installation to include poetry installation instructions * Add Ollama recognizer configuration and tests for YAML integration - Introduced `test_ollama_enabled_recognizers.yaml` to define recognizers including OllamaLangExtractRecognizer. - Enhanced `test_package_e2e_integration_flows.py` with a test to validate loading of Ollama recognizer from YAML configuration. - Updated `OllamaLangExtractRecognizer` to support configuration path and language parameters. - Improved handling of relative paths for configuration files. 
* Refactor docstrings in OllamaLangExtractRecognizer for improved clarity and formatting * Enhance OllamaLangExtractRecognizer initialization docstring to clarify kwargs usage * pr comments * Refactor OllamaLangExtractRecognizer to streamline config path handling and remove redundant comments * Refactor Ollama recognizer test to improve clarity and enhance entity detection validation * Refactor tests for Ollama recognizer and LMRecognizer to improve exception handling and configuration validation * Update config path for Ollama recognizer in test configuration * Update config paths for Ollama recognizer and add test configuration for LangExtract * Remove test configuration for Ollama LangExtract * Remove test configuration for Ollama LangExtract recognizer * Fix formatting in resolve_config_path function for improved readability * Enable UsLangExtractRecognizer and update its config path * change all configs to use gemma3:1b * Disable Ollama LangExtract recognizer and update its configuration path * Update langextract configuration paths to use absolute paths for prompt and examples files * Remove test script for Ollama recognizer configuration loading * Refactor config loading in examples and prompt loaders to use resolve_config_path; update logging level in LMRecognizer; add langextract availability check in OllamaLangExtractRecognizer. * Refactor parameter description in load_yaml_examples and clean up imports in prompt_loader * Update langextract paths to use repo-root-relative paths in tests and prompt loader * Enhance documentation for Ollama setup and improve __init__.py imports for clarity and maintainability * code review changes * pr comments & align to main --------- Co-authored-by: Tamir Kamara <[email protected]> Co-authored-by: Sharon Hart <[email protected]> * Coverage data has been included in the documentation. 
(microsoft#1799) * Add code coverage requirements and update component download table in documentation * Remove Presidio CLI from the downloads and coverage table in the documentation * Fix Redoc API Docs script Inclusion (microsoft#1796) * Bug fix: Remove **kwargs from recognizer __init__ methods (microsoft#1800) * Remove unnecessary kwargs from recognizer initializations * Remove unnecessary kwargs from recognizer initializations --------- Co-authored-by: Sharon Hart <[email protected]> * Add Azure OpenAI support for LangExtract recognizer (microsoft#1801) * Add Azure OpenAI support for LangExtract recognizer * ruff * add redundnant tests to achieve 100% coverage * Add error handling and tests for Azure OpenAI provider initialization * Fix Microsoft Defender secret scanning false positives Replace test API keys with obviously fake placeholders: - test-api-key → PLACEHOLDER_NOT_A_REAL_KEY - test-key-123 → PLACEHOLDER_NOT_A_REAL_KEY - env-key → PLACEHOLDER_FROM_ENV - key → PLACEHOLDER_KEY These are unit test mock values, not real secrets. Using placeholder patterns that won't trigger security scanners while maintaining test validity. All 29 tests passing. 
* remove bandit * Add bandit tool to Microsoft Security DevOps workflow * Remove bandit from Microsoft Security DevOps workflow tools * Update AzureOpenAILangExtractRecognizer to use deployment name from environment variable * Refactor Azure OpenAI integration: remove legacy provider, update recognizer, and adjust tests * Improve error handling during Azure OpenAI provider registration by logging as error and raising exception * Refactor Azure OpenAI provider initialization logging for improved readability * Refactor test imports in Azure OpenAI recognizer tests for improved clarity and organization * Refactor Azure OpenAI provider tests to remove unnecessary variable assignments for improved clarity * Refactor langextract configuration: reorder supported entities and update entity mappings for consistency * Refactor Azure OpenAI integration: enhance documentation, improve endpoint validation, and streamline provider registration * Refactor Azure OpenAI LangExtract Recognizer: remove unused import and clean up code formatting * Refactor Azure OpenAI provider tests: update imports to use the correct module and remove obsolete test for langextract availability * Refactor Azure authentication handling: consolidate credential management into azure_auth_helper and update related recognizers and tests * Refactor Azure OpenAI recognizers: enhance module imports for registration, streamline model ID handling, and improve test coverage for credential selection * Refactor AHDS Surrogate operator: streamline error handling by mocking Azure credentials and client, and improve code readability * Refactor AHDS Recognizer tests: replace multiple credential mocks with a single get_azure_credential mock for improved clarity and maintainability * Add a validation layer for YAML based configuration (microsoft#1780) * fix: Improve Korean RRN regex pattern validation (microsoft#1807) * fix: Improve Korean RRN regex pattern validation - Use negative lookahead/lookbehind instead of word 
boundaries - Add gender digit validation ([1-4] for first digit of last 7 digits) * Fix: correct invalid gender digits to valid ones - Changed gender digits from 7 and 0 to valid values to 1~4 * fix: Update KR_RRN test scores to match actual recognizer output - Updated test cases "050912-2000019" and "0509122000019" scores from (1.0, 1.0) to match actual recognizer behavior * add: invalid RRN test cases Added more invalid RRN cases to enhance test coverage. --------- Co-authored-by: Omri Mendels <[email protected]> * enabled `OllamaLangExtractRecognizer` by default --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Ron Shakutai <[email protected]> Co-authored-by: Tamir Kamara <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Dor Lugasi-Gal <[email protected]> Co-authored-by: Omri Mendels <[email protected]> Co-authored-by: Sharon Hart <[email protected]> Co-authored-by: kim <[email protected]>

remove poetry cache from docker images

f6ef089

tamirkamara marked this pull request as ready for review November 16, 2025 13:31

tamirkamara requested a review from SharonHart November 16, 2025 15:36

RonShakutai approved these changes Nov 16, 2025

SharonHart approved these changes Nov 16, 2025

tamirkamara merged commit 2f09d34 into main Nov 17, 2025
36 checks passed

tamirkamara deleted the tamirkamara/remove-cache-in-docker branch November 17, 2025 09:22

omri374 reviewed Nov 17, 2025

tamirkamara replied Nov 19, 2025
