Populate the data platform metadata catalog #1355

@blarghmatey

User Story

  • As a data platform engineer, I want all of the system metadata collected so that we can improve data discovery and power data governance.

Description/Context

Now that we have OpenMetadata deployed, we need to populate it with metadata from all of the platform components. The data ingestion is managed with the OpenMetadata ingestion library (https://docs.open-metadata.org/latest/deployment/ingestion/external). The majority of the data sources can be handled with the connector workflows (https://docs.open-metadata.org/latest/connectors). Opening a connector's page and selecting the "Run The Connector Externally" link displays the YAML configuration details for that connector.

Acceptance Criteria

Metadata from the following systems is ingested and regularly updated in our deployment of OpenMetadata:

  • Trino (Starburst Galaxy)
  • dbt
  • Dagster
  • Redash
  • Superset
  • S3
  • Iceberg
  • Airbyte

Lineage information from the following systems is ingested and maintained in OpenMetadata:

  • Trino (Starburst Galaxy)
  • dbt

Profiling and quality information is collected from the following sources:

  • Trino
  • Iceberg

Plan/Design

For the majority of sources, we should be able to use the MetadataWorkflow object to manage ingestion through the out-of-the-box connectors (https://docs.open-metadata.org/latest/deployment/ingestion/external). More detailed or custom metadata ingestion will be implemented as custom Dagster assets. All execution will be managed via Dagster pipelines.
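
As a rough illustration of the MetadataWorkflow approach, the sketch below builds a workflow config as a Python dict that mirrors the YAML layout shown on the connector pages, then runs it. It is a minimal sketch, not a final implementation: the function names `build_trino_metadata_config` and `run_metadata_workflow` are hypothetical, the host names, service name, and token are placeholders, and the exact connection fields for Starburst Galaxy (authentication in particular) should be taken from the Trino connector page rather than from this example. It assumes the `openmetadata-ingestion` package with the Trino extra is installed wherever `run_metadata_workflow` is called.

```python
from typing import Any, Dict


def build_trino_metadata_config(
    service_name: str,
    trino_host_port: str,
    trino_username: str,
    om_host_port: str,
    om_jwt_token: str,
) -> Dict[str, Any]:
    """Build a workflow config dict mirroring the connector's YAML layout.

    All argument values are supplied by the caller; nothing here is specific
    to our deployment.
    """
    return {
        "source": {
            "type": "trino",
            "serviceName": service_name,
            "serviceConnection": {
                "config": {
                    "type": "Trino",
                    "hostPort": trino_host_port,
                    "username": trino_username,
                    # Assumption: auth settings are omitted here and must be
                    # added per the Trino connector documentation.
                }
            },
            "sourceConfig": {"config": {"type": "DatabaseMetadata"}},
        },
        "sink": {"type": "metadata-rest", "config": {}},
        "workflowConfig": {
            "openMetadataServerConfig": {
                "hostPort": om_host_port,
                "authProvider": "openmetadata",
                "securityConfig": {"jwtToken": om_jwt_token},
            }
        },
    }


def run_metadata_workflow(config: Dict[str, Any]) -> None:
    """Run one ingestion pass with the OpenMetadata ingestion library."""
    # Imported lazily so the config builder works without the library installed.
    from metadata.workflow.metadata import MetadataWorkflow

    workflow = MetadataWorkflow.create(config)
    try:
        workflow.execute()
        # Raise if the workflow recorded failures, so the caller sees the error.
        workflow.raise_from_status()
    finally:
        workflow.stop()
```

A function like `run_metadata_workflow` could then be called from a Dagster asset or op, with one config per source system, so that scheduling and retries stay in Dagster as described above.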
