Conversation

@kyungsoo-datahub
Contributor

  • Thread-safe SecretRegistry with copy-on-write pattern
  • Logging layer masking filter
  • Bootstrap initialization for different contexts
  • CLI integration
  • Unit and integration tests
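The copy-on-write pattern mentioned in the summary could be sketched roughly like this (a minimal illustration; class and method names are assumptions, not the PR's actual API). Readers take an immutable snapshot without locking; writers build a new set under a lock and swap the reference:

```python
import threading


class SecretRegistry:
    """Sketch of a thread-safe registry using copy-on-write:
    readers see an immutable snapshot; writers replace it under a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._secrets = frozenset()  # immutable snapshot, read without locking

    def register(self, value: str) -> None:
        if not value:
            return
        with self._lock:
            # Copy-on-write: build a new frozenset and swap the reference
            self._secrets = self._secrets | {value}

    def get_secrets(self) -> frozenset:
        # No lock needed: reference assignment is atomic in CPython
        return self._secrets
```

Because readers hold a reference to an immutable snapshot, a concurrent `register` never mutates the set a reader is iterating.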

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Nov 3, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Nov 3, 2025
@codecov

codecov bot commented Nov 3, 2025

Codecov Report

❌ Patch coverage is 69.49153% with 162 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ta-ingestion/src/datahub/masking/masking_filter.py 65.57% 84 Missing ⚠️
...a-ingestion/src/datahub/masking/secret_registry.py 74.25% 26 Missing ⚠️
...etadata-ingestion/src/datahub/masking/bootstrap.py 70.00% 24 Missing ⚠️
...data-ingestion/src/datahub/configuration/common.py 67.30% 17 Missing ⚠️
metadata-ingestion/src/datahub/masking/__init__.py 0.00% 5 Missing ⚠️
...ata-ingestion/src/datahub/masking/logging_utils.py 73.68% 5 Missing ⚠️
...ta-ingestion/src/datahub/configuration/env_vars.py 50.00% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (69.49%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.


"""Ingest metadata into DataHub."""

# Initialize secret masking (before any logging)
initialize_secret_masking()
Contributor

Does this mean the masking only happens in the `datahub ingest` command?
Shouldn't we instead make it the default for all output when using the Python SDK?

Contributor Author

Some downstream libraries got stuck when I broadened the scope of this change. We can broaden the scope in the next phase.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Nov 6, 2025
Comment on lines 29 to 36
class ExecutionContext(Enum):
"""Execution context for DataHub ingestion."""

CLI = "cli"
UI_BACKEND = "ui_backend"
REMOTE_EXECUTOR = "remote"
SCHEDULED = "scheduled"
UNKNOWN = "unknown"
Contributor

I understand different contexts will have different sources of secrets. So far, the CLI masks env vars; the remote executor will mask secrets from the secret store, and so on.

Wondering if being aware of the context is the responsibility of the masking component, or whether the target component should instead be responsible for registering its secrets?

Basically, I would like to reduce the coupling of this masking component; ideally, it should depend only on logging.

Contributor Author

Thank you for the comment. I removed contexts and sources as we discussed.

uninstall_masking_filter,
)
from datahub.ingestion.masking.secret_registry import SecretRegistry

Contributor

We could have some mechanism (e.g. an env var) to disable masking.

Contributor Author

I added the flag to disable this.
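The opt-out flag discussed here might look something like the sketch below (the env var name and default are assumptions for illustration — the PR's actual flag name is not shown in this thread). Masking stays on by default; setting the variable disables it:

```python
import os


def is_masking_enabled() -> bool:
    """Masking is enabled by default; an env var opts out.
    DATAHUB_DISABLE_SECRET_MASKING is an illustrative name, not the PR's."""
    return os.environ.get("DATAHUB_DISABLE_SECRET_MASKING", "").lower() not in (
        "1",
        "true",
        "yes",
    )
```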


# Detect UI backend
if os.environ.get("DATAHUB_UI_INGESTION") == "1":
return ExecutionContext.UI_BACKEND
Contributor

UI_BACKEND
SCHEDULED

What are those contexts? Do both refer to ingestion from the managed ingestion UI?
Managed ingestion can be scheduled or not; is that relevant for the masking?
Ingestion can be triggered from the managed UI or the CLI; shouldn't we also cover the latter?

Contributor Author

Removed. Thanks.


# Detect remote executor
if os.environ.get("DATAHUB_EXECUTOR_ID"):
return ExecutionContext.REMOTE_EXECUTOR
Contributor

Is it relevant for the masking whether the executor is remote or embedded?

Contributor Author

I removed it following our discussion.

Args:
context: Execution context (auto-detected if None)
secret_sources: List of sources to load (auto-detected if None)
max_message_size: Maximum log message size before truncation
Contributor

Truncating logs... that's sort of a new feature; is it required?

Contributor Author (@kyungsoo-datahub, Nov 7, 2025)

After the revision, logs are truncated only when log masking is enabled. FYI, all debug logs are enabled in debug mode, and log masking is disabled in debug mode.

Comment on lines 195 to 213
# Disable HTTP debug output (prevent deadlock)
try:
import http.client

http.client.HTTPConnection.debuglevel = 0
except Exception:
pass

# Set HTTP-related loggers to INFO (not DEBUG)
for logger_name in [
"urllib3",
"urllib3.connectionpool",
"urllib3.util.retry",
"requests",
]:
try:
logging.getLogger(logger_name).setLevel(logging.INFO)
except Exception:
pass
Contributor

Do we lose the ability to get debug logs from these 3rd-party libs?

Contributor Author

After the revision, the 3rd-party libs' debug logs show up when log masking is disabled.

elif source == "datahub_secrets_store":
# TODO: Implement when backend API available
logger.debug("DataHub secrets store not yet implemented")
else:
Contributor

What about SecretStr values from configs? Are we covering them?

Contributor Author

Good point. I added them. Thanks.
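Covering config-defined secrets could work roughly as below: walk the config and collect every SecretStr value so it can be registered with the masking registry. This is a self-contained sketch — the `SecretStr` class here is a stand-in mimicking pydantic's real `SecretStr` (which masks its repr and exposes `get_secret_value()`), and `collect_secret_values` is a hypothetical helper, not the PR's actual code:

```python
class SecretStr:
    """Stand-in for pydantic's SecretStr so this sketch is self-contained."""

    def __init__(self, value: str):
        self._value = value

    def get_secret_value(self) -> str:
        return self._value

    def __repr__(self) -> str:
        return "SecretStr('**********')"


def collect_secret_values(config: dict) -> list:
    """Recursively walk a (possibly nested) config dict and collect
    SecretStr values for registration with the masking registry."""
    found = []
    for value in config.values():
        if isinstance(value, SecretStr):
            found.append(value.get_secret_value())
        elif isinstance(value, dict):
            found.extend(collect_secret_values(value))
    return found
```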

# Note: Never add variables containing secrets, passwords, keys, or tokens.
# Examples to NEVER add: AWS_ACCESS_KEY_ID, DATABASE_PASSWORD, API_KEY,
# SECRET_KEY. If unsure, don't add it.
_SYSTEM_ENV_VARS = {
Contributor

We may rename it with a more descriptive name, such as:

DATAHUB_MASKING_ENV_VARS_ALLOWED
DATAHUB_MASKING_ENV_VARS_SKIPPED
...

or something like that

Additionally, we may have a mechanism to incorporate values without requiring a release:

DATAHUB_MASKING_ENV_VARS_ALLOWED_PATTERN

so we can provide a regex pattern to skip additional env vars

Contributor Author

Thank you for the suggestion. Revised.
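The reviewer's suggestion of a release-free allowlist could be sketched as follows. The env var names follow the suggestion above but are illustrative; "allowed" here means the variable's value is exempt from masking:

```python
import os
import re


def is_env_var_allowed(name: str) -> bool:
    """Check an env var name against a comma-separated allowlist plus an
    optional regex, both read from the environment at runtime (illustrative
    names from the review discussion, not the PR's final API)."""
    allowed = os.environ.get("DATAHUB_MASKING_ENV_VARS_ALLOWED", "")
    if name in {v.strip() for v in allowed.split(",") if v.strip()}:
        return True
    pattern = os.environ.get("DATAHUB_MASKING_ENV_VARS_ALLOWED_PATTERN")
    # fullmatch so a pattern like "DATAHUB_.*" can't match mid-name
    return bool(pattern and re.fullmatch(pattern, name))
```

Reading the pattern at call time (rather than import time) is what lets operators add exemptions without a new release.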

Contributor

@sgomezvillamor left a comment

I left some comments.

My main concern is the context, because it will couple this masking component to e.g. the DataHub secret store and so on 🤔

@alwaysmeticulous

alwaysmeticulous bot commented Nov 7, 2025

✅ Meticulous spotted 0 visual differences across 1038 screens tested.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 482d3f5. This comment will update as new commits are pushed.

@codecov

codecov bot commented Nov 7, 2025

Bundle Report

Bundle size has no change ✅

Comment on lines 40 to 43
if PYDANTIC_VERSION_2:
from pydantic import SecretStr, model_validator
else:
from pydantic import SecretStr, root_validator
Contributor (@sgomezvillamor, Nov 7, 2025)

I recently merged #15057, so there should be no more pydantic v1 root_validator or validator.

We should stop introducing pydantic v1 compatibility code and just assume pydantic v2. If we find some component in the codebase still on pydantic v1, masking will simply only work with pydantic v2 at runtime. We need to move and push towards pydantic v2 only.

Contributor Author

Thank you for the info. After rebasing on master, the Pydantic v1 problem is gone. I removed it.

Comment on lines 33 to 38
try:
from datahub.masking.secret_registry import SecretRegistry

_MASKING_AVAILABLE = True
except ImportError:
_MASKING_AVAILABLE = False
Contributor

In which scenario would this import fail?
IMO the optional dependency is not necessary.

Contributor

As per my understanding:

  • there should be no code dependent on import availability
  • if we have some conditional code, it should depend on is_masking_enabled

Contributor Author

Makes sense. I removed this. Thanks.

@treff7es
Contributor

As we discussed last week, I wonder if this environment-variable-name regexp matching is a bit of overkill.
The main problem is to mask secrets that are defined as a Secret in the DataHub UI.
If we know these secrets in advance, then we should be able to set/mark them in the config as well.
Similarly, if you can reference an environment variable as ${MY_ENV_VARIABLE} in the config, you should also be able to set MY_SECRET=ENC_BASE64:[base64-encrypted-string-here], which would be treated as a secret automatically.

I feel like with that we could remove most of the inefficient parts of the code.

Wdyt?

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Nov 10, 2025
@kyungsoo-datahub
Contributor Author

As we discussed last week, I wonder if this environment-variable-name regexp matching is a bit of overkill. The main problem is to mask secrets that are defined as a Secret in the DataHub UI. If we know these secrets in advance, then we should be able to set/mark them in the config as well. Similarly, if you can reference an environment variable as ${MY_ENV_VARIABLE} in the config, you should also be able to set MY_SECRET=ENC_BASE64:[base64-encrypted-string-here], which would be treated as a secret automatically.

I feel like with that we could remove most of the inefficient part of the code.

Thank you for the review and comment.

We still need the regex-based logging filter. While SecretStr protects direct logging, it can't prevent leaks in:

  • Third-party library exceptions (e.g., a Snowflake error: "invalid password 'secret123'")
  • Tracebacks showing local variables
  • Library debug logging

The logging filter provides defense-in-depth by catching secrets regardless of how they leak into logs or exceptions.

I removed should_mask_env_var since we now register secrets only from the recipe file.
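The defense-in-depth filter described above might look roughly like the sketch below (a minimal illustration; the PR's actual `masking_filter.py` is more involved). It rewrites the formatted message of every record, so secrets leaked by third-party exceptions or debug logging are redacted regardless of origin:

```python
import logging


class MaskingFilter(logging.Filter):
    """Sketch of a filter that redacts registered secret values from every
    formatted log record, whichever library emitted it."""

    def __init__(self, secrets):
        super().__init__()
        self._secrets = secrets  # iterable of secret strings to redact

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()  # apply %-formatting first
        for secret in self._secrets:
            if secret and secret in msg:
                msg = msg.replace(secret, "********")
        # Store the redacted message and drop args so handlers
        # don't re-apply formatting with the original secret
        record.msg = msg
        record.args = None
        return True  # never suppress the record, only rewrite it
```

Note that filters attached to a logger do not apply to records propagated from child loggers, so in practice a filter like this would be attached to the handlers.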

- Add test for __init__.py imports to cover module exports
- Add tests for is_bootstrapped() and get_bootstrap_error() functions
- Add test for initialization with masking disabled
- Add test for exception hook masking failure handling

These tests improve coverage of edge cases and error paths in the masking framework.
@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Nov 12, 2025