Conversation

@kyungsoo-datahub
Contributor

  • Thread-safe SecretRegistry with copy-on-write pattern
  • Logging layer masking filter
  • Bootstrap initialization for different contexts
  • CLI integration
  • Unit and integration tests
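The copy-on-write pattern mentioned in the summary could be sketched roughly like this (a minimal illustration; class and method names are assumptions, not the PR's actual API). Readers take an immutable snapshot without locking; writers build a new set under a lock and swap the reference:

```python
import threading


class SecretRegistry:
    """Sketch of a thread-safe registry using copy-on-write:
    readers see an immutable snapshot; writers replace it under a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._secrets = frozenset()  # immutable snapshot, read without locking

    def register(self, value: str) -> None:
        if not value:
            return
        with self._lock:
            # Copy-on-write: build a new frozenset and swap the reference
            self._secrets = self._secrets | {value}

    def get_secrets(self) -> frozenset:
        # No lock needed: reference assignment is atomic in CPython
        return self._secrets
```

Because readers hold a reference to an immutable snapshot, a concurrent `register` never mutates the set a reader is iterating.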

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Nov 3, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Nov 3, 2025
@codecov

codecov bot commented Nov 3, 2025

Codecov Report

❌ Patch coverage is 69.49153% with 162 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ta-ingestion/src/datahub/masking/masking_filter.py 65.57% 84 Missing ⚠️
...a-ingestion/src/datahub/masking/secret_registry.py 74.25% 26 Missing ⚠️
...etadata-ingestion/src/datahub/masking/bootstrap.py 70.00% 24 Missing ⚠️
...data-ingestion/src/datahub/configuration/common.py 67.30% 17 Missing ⚠️
metadata-ingestion/src/datahub/masking/__init__.py 0.00% 5 Missing ⚠️
...ata-ingestion/src/datahub/masking/logging_utils.py 73.68% 5 Missing ⚠️
...ta-ingestion/src/datahub/configuration/env_vars.py 50.00% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (69.49%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.


"""Ingest metadata into DataHub."""

# Initialize secret masking (before any logging)
initialize_secret_masking()
Contributor

Does this mean the masking only happens in the `datahub ingest` command?
Shouldn't we instead make it the default for all output when using the Python SDK?

Contributor Author

Some downstream libraries got stuck when I broadened the scope of this change. We can broaden the scope in the next phase.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Nov 6, 2025
Comment on lines 29 to 36
class ExecutionContext(Enum):
"""Execution context for DataHub ingestion."""

CLI = "cli"
UI_BACKEND = "ui_backend"
REMOTE_EXECUTOR = "remote"
SCHEDULED = "scheduled"
UNKNOWN = "unknown"
Contributor

I understand different contexts will have different sources of secrets. So far, the CLI masks env vars; the remote executor will mask secrets from the secret store, and so on.

Wondering if being aware of the context is the responsibility of the masking component, or whether the target component should instead be responsible for registering its secrets?

Basically, I would like to reduce the coupling of this masking component; ideally, it should depend only on logging.

Contributor Author

Thank you for the comment. I removed contexts and sources as we discussed.

uninstall_masking_filter,
)
from datahub.ingestion.masking.secret_registry import SecretRegistry

Contributor

We could have some mechanism (e.g. an env var) to disable masking.

Contributor Author

I added the flag to disable this.
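The opt-out flag discussed here might look something like the sketch below (the env var name and default are assumptions for illustration — the PR's actual flag name is not shown in this thread). Masking stays on by default; setting the variable disables it:

```python
import os


def is_masking_enabled() -> bool:
    """Masking is enabled by default; an env var opts out.
    DATAHUB_DISABLE_SECRET_MASKING is an illustrative name, not the PR's."""
    return os.environ.get("DATAHUB_DISABLE_SECRET_MASKING", "").lower() not in (
        "1",
        "true",
        "yes",
    )
```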


# Detect UI backend
if os.environ.get("DATAHUB_UI_INGESTION") == "1":
return ExecutionContext.UI_BACKEND
Contributor

UI_BACKEND
SCHEDULED

What are those contexts? Do both refer to ingestion from the managed ingestion UI?
Managed ingestion can be scheduled or not; is that relevant for the masking?
Ingestion can be triggered from the managed UI or the CLI; shouldn't we also cover the latter?

Contributor Author

Removed. Thanks.


# Detect remote executor
if os.environ.get("DATAHUB_EXECUTOR_ID"):
return ExecutionContext.REMOTE_EXECUTOR
Contributor

Is it relevant for the masking whether the executor is remote or embedded?

Contributor Author

I removed it following our discussion.

Args:
context: Execution context (auto-detected if None)
secret_sources: List of sources to load (auto-detected if None)
max_message_size: Maximum log message size before truncation
Contributor

Truncating logs... that's sort of a new feature; is it required?

Contributor Author (@kyungsoo-datahub, Nov 7, 2025)

After the revision, logs are truncated only when log masking is enabled. FYI, all debug logs are enabled in debug mode, and log masking is disabled in debug mode.

Comment on lines 195 to 213
# Disable HTTP debug output (prevent deadlock)
try:
import http.client

http.client.HTTPConnection.debuglevel = 0
except Exception:
pass

# Set HTTP-related loggers to INFO (not DEBUG)
for logger_name in [
"urllib3",
"urllib3.connectionpool",
"urllib3.util.retry",
"requests",
]:
try:
logging.getLogger(logger_name).setLevel(logging.INFO)
except Exception:
pass
Contributor

Do we lose the ability to get debug logs from these 3rd-party libs?

Contributor Author

After the revision, the 3rd-party libs' debug logs show up when log masking is disabled.

elif source == "datahub_secrets_store":
# TODO: Implement when backend API available
logger.debug("DataHub secrets store not yet implemented")
else:
Contributor

What about SecretStr values from configs? Are we covering them?

Contributor Author

Good point. I added them. Thanks.
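Covering config-defined secrets could work roughly as below: walk the config and collect every SecretStr value so it can be registered with the masking registry. This is a self-contained sketch — the `SecretStr` class here is a stand-in mimicking pydantic's real `SecretStr` (which masks its repr and exposes `get_secret_value()`), and `collect_secret_values` is a hypothetical helper, not the PR's actual code:

```python
class SecretStr:
    """Stand-in for pydantic's SecretStr so this sketch is self-contained."""

    def __init__(self, value: str):
        self._value = value

    def get_secret_value(self) -> str:
        return self._value

    def __repr__(self) -> str:
        return "SecretStr('**********')"


def collect_secret_values(config: dict) -> list:
    """Recursively walk a (possibly nested) config dict and collect
    SecretStr values for registration with the masking registry."""
    found = []
    for value in config.values():
        if isinstance(value, SecretStr):
            found.append(value.get_secret_value())
        elif isinstance(value, dict):
            found.extend(collect_secret_values(value))
    return found
```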

# Note: Never add variables containing secrets, passwords, keys, or tokens.
# Examples to NEVER add: AWS_ACCESS_KEY_ID, DATABASE_PASSWORD, API_KEY,
# SECRET_KEY. If unsure, don't add it.
_SYSTEM_ENV_VARS = {
Contributor

We may rename it with a more descriptive name, such as:

DATAHUB_MASKING_ENV_VARS_ALLOWED
DATAHUB_MASKING_ENV_VARS_SKIPPED
...

or something like that

Additionally, we may have a mechanism to incorporate values without requiring a release:

DATAHUB_MASKING_ENV_VARS_ALLOWED_PATTERN

so we can provide a regex pattern to skip additional env vars

Contributor Author

Thank you for the suggestion. Revised.
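The reviewer's suggestion of a release-free allowlist could be sketched as follows. The env var names follow the suggestion above but are illustrative; "allowed" here means the variable's value is exempt from masking:

```python
import os
import re


def is_env_var_allowed(name: str) -> bool:
    """Check an env var name against a comma-separated allowlist plus an
    optional regex, both read from the environment at runtime (illustrative
    names from the review discussion, not the PR's final API)."""
    allowed = os.environ.get("DATAHUB_MASKING_ENV_VARS_ALLOWED", "")
    if name in {v.strip() for v in allowed.split(",") if v.strip()}:
        return True
    pattern = os.environ.get("DATAHUB_MASKING_ENV_VARS_ALLOWED_PATTERN")
    # fullmatch so a pattern like "DATAHUB_.*" can't match mid-name
    return bool(pattern and re.fullmatch(pattern, name))
```

Reading the pattern at call time (rather than import time) is what lets operators add exemptions without a new release.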

Contributor

@sgomezvillamor left a comment

I left some comments.

My main concern is the context, because it will couple this masking component to e.g. the DataHub secret store and so on 🤔

@alwaysmeticulous

alwaysmeticulous bot commented Nov 7, 2025

✅ Meticulous spotted 0 visual differences across 1038 screens tested.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 482d3f5. This comment will update as new commits are pushed.

@codecov

codecov bot commented Nov 7, 2025

Bundle Report

Bundle size has no change ✅

Comment on lines 40 to 43
if PYDANTIC_VERSION_2:
from pydantic import SecretStr, model_validator
else:
from pydantic import SecretStr, root_validator
Contributor (@sgomezvillamor, Nov 7, 2025)

I recently merged #15057, so there should be no more pydantic v1 root_validator or validator.

We should stop introducing pydantic v1 compatibility code and just assume pydantic v2. If we find some component in the codebase still on pydantic v1, masking will simply only work with pydantic v2 at runtime. We need to move and push towards pydantic v2 only.

Contributor Author

Thank you for the info. After rebasing on master, the Pydantic v1 problem is gone. I removed it.

Comment on lines 33 to 38
try:
from datahub.masking.secret_registry import SecretRegistry

_MASKING_AVAILABLE = True
except ImportError:
_MASKING_AVAILABLE = False
Contributor

In which scenario would this import fail?
IMO the optional dependency is not necessary.

Contributor

As per my understanding:

  • there should be no code dependent on import availability
  • if we have some conditional code, it should depend on is_masking_enabled

Contributor Author

Makes sense. I removed this. Thanks.

@treff7es
Contributor

As we discussed last week, I wonder if this environment-variable-name regexp matching is a bit of overkill.
The main problem is to mask secrets that are defined as a Secret in the DataHub UI.
If we know these secrets in advance, then we should be able to set/mark them in the config as well.
Similarly, if you can reference an environment variable as ${MY_ENV_VARIABLE} in the config, you should also be able to set MY_SECRET=ENC_BASE64:[base64-encrypted-string-here], which would be treated as a secret automatically.

I feel like with that we could remove most of the inefficient parts of the code.

Wdyt?

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Nov 10, 2025
@kyungsoo-datahub
Contributor Author

As we discussed last week, I wonder if this environment-variable-name regexp matching is a bit of overkill. The main problem is to mask secrets that are defined as a Secret in the DataHub UI. If we know these secrets in advance, then we should be able to set/mark them in the config as well. Similarly, if you can reference an environment variable as ${MY_ENV_VARIABLE} in the config, you should also be able to set MY_SECRET=ENC_BASE64:[base64-encrypted-string-here], which would be treated as a secret automatically.

I feel like with that we could remove most of the inefficient part of the code.

Thank you for the review and comment.

We still need the regex-based logging filter. While SecretStr protects direct logging, it can't prevent leaks in:

  • Third-party library exceptions (e.g., a Snowflake error: "invalid password 'secret123'")
  • Tracebacks showing local variables
  • Library debug logging

The logging filter provides defense-in-depth by catching secrets regardless of how they leak into logs or exceptions.

I removed should_mask_env_var since we now register secrets only from the recipe file.
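The defense-in-depth filter described above might look roughly like the sketch below (a minimal illustration; the PR's actual `masking_filter.py` is more involved). It rewrites the formatted message of every record, so secrets leaked by third-party exceptions or debug logging are redacted regardless of origin:

```python
import logging


class MaskingFilter(logging.Filter):
    """Sketch of a filter that redacts registered secret values from every
    formatted log record, whichever library emitted it."""

    def __init__(self, secrets):
        super().__init__()
        self._secrets = secrets  # iterable of secret strings to redact

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()  # apply %-formatting first
        for secret in self._secrets:
            if secret and secret in msg:
                msg = msg.replace(secret, "********")
        # Store the redacted message and drop args so handlers
        # don't re-apply formatting with the original secret
        record.msg = msg
        record.args = None
        return True  # never suppress the record, only rewrite it
```

Note that filters attached to a logger do not apply to records propagated from child loggers, so in practice a filter like this would be attached to the handlers.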

- Add test for __init__.py imports to cover module exports
- Add tests for is_bootstrapped() and get_bootstrap_error() functions
- Add test for initialization with masking disabled
- Add test for exception hook masking failure handling

These tests improve coverage of edge cases and error paths in the masking framework.
@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Nov 12, 2025