Replies: 1 comment
-
|
What would be needed to implement this in bq #1755 Follow-up dbt-core would also be needed to implement to finish this proposal. |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
🏺 Background
dbt source freshnessanswers a critical question before you build: is my source data recent enough? But there's a sibling question it doesn't answer: how much data arrived? Or whether anything meaningful arrived at all.If a BigQuery retention policy deletes your date-sharded export tables, or your ingestion pipeline partially fails and loads 5 rows instead of 50,000, freshness passes,
dbt runsucceeds, and you ship empty or near-empty tables to production. Nobody finds out until a dashboard looks wrong or a stakeholder asks a question. I've hit this enough times, and seen it come up enough in Slack and in #3142 that I think it's worth a concrete proposal.People have built workarounds.
dbt-expectationsrow count tests catch it after materialization, but by then the bad data is already live. Customrun_query()pre-hooks work but they're per-project boilerplate outside the source contract. Orchestrator-level checks (Airflow sensors, etc.) move source validation outside dbt entirely. They all work, but the gap in the standard is that freshness covers timeliness and nothing covers sufficiency.🔍 The problem
Imagine a source that feeds into a staging model, which feeds into 15 downstream models. One morning, the source is empty. Freshness passes, a timestamp exists, there's just no data behind it.
dbt runsucceeds. Every model builds, every exit code is zero. The fact table has no rows. By the time someone notices, that empty table has been live for hours, dashboards have been serving it, and a reverse ETL sync may have pushed it downstream.Rebuilding 15 empty tables is cheap. What actually hurts is that bad data was live and being consumed. And when a post-materialization test eventually catches it, the team still has to figure out whether the transformation was wrong or the input was empty. A pre-build volume check would have answered that immediately.
On BigQuery specifically, three patterns cause this:
Retention deletes export tables. Date-sharded daily exports (
events_YYYYMMDD) get cleaned up by retention. Wildcard queries silently return zero rows. There's often noloaded_atfield to check, these are individual tables that either exist or don't.Partitions exist but are empty. A load job creates a partition but fails before writing rows. The partition's timestamp is recent enough to pass freshness, but the partition itself is empty.
Degraded ingestion. A pipeline partially fails, loads a handful of rows.
loaded_atupdates, freshness passes, data is garbage.💡 Proposal
Add a
volumeblock to source tables in the BigQuery adapter, and a newdbt source volumesubcommand to check it. Just asfreshnesslets you declare "this source should have recent data,"volumelets you declare "this source should have enough data."The simplest case
warn_belowanderror_belowwork like their freshness counterparts —warnlogs a warning,errorhalts the pipeline. Either can be used alone. Volume doesn't requireloaded_at_field— you can definevolumewithoutfreshnessand vice versa:Under the hood, the adapter queries
INFORMATION_SCHEMA.TABLE_STORAGEfortotal_rows— a metadata lookup, not aSELECT COUNT(*). No table scan, negligible cost.Date-sharded tables
This is the motivating use case and the reason this belongs in the BigQuery adapter rather than dbt-core. In dbt today, a source maps to a single physical table, there's no native concept of a source that resolves to events_*. On BigQuery, date-sharded tables are everywhere, and they're exactly where the worst volume failures happen.
A new table_pattern property triggers wildcard mode:
When the adapter sees
table_pattern, it queriesINFORMATION_SCHEMA.TABLESfor matching tables and checks row count per table against the thresholds. Any table that falls belowerror_below(including missing tables that resolve to zero rows) gets flagged:Partition-level checks
For partitioned tables, table-level row counts can be misleading, 10 million historical rows will pass any reasonable threshold even if today's partition is empty.
partition_fieldandpartition_rangelet you check recent partitions individually:This queries
INFORMATION_SCHEMA.PARTITIONSfor the 3 most recent partitions. Still metadata, still free.🏗️ Execution
New subcommand —
dbt source volume— separate fromdbt source freshness. We're not changing the behavior or contract of an existing command. This is additive.Selection follows the same syntax:
Results go into
sources.jsonalongside freshness results, so they flow into whatever artifact-based workflows you've already built. A futuredbt source checkthat runs both together is a natural next step, but that's a separate conversation.Related
dbt-expectations.expect_table_row_count_to_be_between— post-materialization complementdbt source freshness— covers timeliness; this would cover sufficiencyQuestions:
table_patternis introduced here for volume checks — should it be a property on the source table itself, usable beyond volume?Beta Was this translation helpful? Give feedback.
All reactions