Is there enough data? Source volume checks for BigQuery #1754

RRap0so · 2026-03-13T05:22:14Z


🏺 Background

dbt source freshness answers a critical question before you build: is my source data recent enough? But there's a sibling question it doesn't answer: how much data arrived? Or whether anything meaningful arrived at all.

If a BigQuery retention policy deletes your date-sharded export tables, or your ingestion pipeline partially fails and loads 5 rows instead of 50,000, freshness passes, dbt run succeeds, and you ship empty or near-empty tables to production. Nobody finds out until a dashboard looks wrong or a stakeholder asks a question. I've hit this enough times, and seen it come up often enough in Slack and in #3142, that I think it's worth a concrete proposal.

People have built workarounds. dbt-expectations row count tests catch it after materialization, but by then the bad data is already live. Custom run_query() pre-hooks work but they're per-project boilerplate outside the source contract. Orchestrator-level checks (Airflow sensors, etc.) move source validation outside dbt entirely. They all work, but the gap in the standard is that freshness covers timeliness and nothing covers sufficiency.

🔍 The problem

Imagine a source that feeds into a staging model, which feeds into 15 downstream models. One morning, the source is empty. Freshness passes: a timestamp exists, but there's no data behind it. dbt run succeeds. Every model builds, every exit code is zero. The fact table has no rows. By the time someone notices, that empty table has been live for hours, dashboards have been serving it, and a reverse ETL sync may have pushed it downstream.

Rebuilding 15 empty tables is cheap. What actually hurts is that bad data was live and being consumed. And when a post-materialization test eventually catches it, the team still has to figure out whether the transformation was wrong or the input was empty. A pre-build volume check would have answered that immediately.

On BigQuery specifically, three patterns cause this:

Retention deletes export tables. Date-sharded daily exports (events_YYYYMMDD) get cleaned up by retention. Wildcard queries silently return zero rows. There's often no loaded_at field to check, because these are individual tables that either exist or don't.

Partitions exist but are empty. A load job creates a partition but fails before writing rows. The partition's timestamp is recent enough to pass freshness, but the partition itself is empty.

Degraded ingestion. A pipeline partially fails and loads only a handful of rows. loaded_at updates and freshness passes, but the data is garbage.

💡 Proposal

Add a volume block to source tables in the BigQuery adapter, and a new dbt source volume subcommand to check it. Just as freshness lets you declare "this source should have recent data," volume lets you declare "this source should have enough data."

The simplest case

sources:
  - name: my_source
    tables:
      - name: my_table
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}

        volume:
          warn_below: 1000
          error_below: 100

warn_below and error_below work like their freshness counterparts — warn logs a warning, error halts the pipeline. Either can be used alone. Volume doesn't require loaded_at_field — you can define volume without freshness and vice versa:

tables:
  - name: dim_products
    volume:
      error_below: 1  # just make sure it's not empty
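The threshold semantics described above can be sketched in a few lines of Python. This is only an illustration, not adapter code: the function name evaluate_volume is hypothetical, and it assumes strict "below" semantics, so a row count equal to a threshold does not trip it.

```python
from typing import Optional


def evaluate_volume(row_count: int,
                    warn_below: Optional[int] = None,
                    error_below: Optional[int] = None) -> str:
    """Return "error", "warn", or "pass" for a source's row count.

    Assumption: thresholds are strict, so row_count == threshold passes.
    Either threshold may be omitted (None), mirroring the YAML above.
    """
    # Check the more severe threshold first so it wins when both apply.
    if error_below is not None and row_count < error_below:
        return "error"
    if warn_below is not None and row_count < warn_below:
        return "warn"
    return "pass"
```

Under these assumptions, error_below: 1 fails only when the table is empty, which is the "just make sure it's not empty" case.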

Under the hood, the adapter queries INFORMATION_SCHEMA.TABLE_STORAGE for total_rows — a metadata lookup, not a SELECT COUNT(*). No table scan, negligible cost.
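As a rough sketch of what that metadata lookup might look like, the helper below builds the SQL text. The function name table_storage_sql is hypothetical, and the project and region arguments are assumptions about how the adapter would locate the view. The @dataset and @table placeholders are meant to be bound as query parameters by whatever client runs the query.

```python
def table_storage_sql(project: str, region: str) -> str:
    """Build a row-count lookup against INFORMATION_SCHEMA.TABLE_STORAGE.

    Assumption: the view is addressed as `project`.`region-<region>`;
    @dataset and @table are query parameters bound by the caller.
    """
    return (
        "SELECT total_rows\n"
        f"FROM `{project}`.`region-{region}`.INFORMATION_SCHEMA.TABLE_STORAGE\n"
        "WHERE table_schema = @dataset\n"
        "  AND table_name = @table"
    )


if __name__ == "__main__":
    # Prints the query text only; nothing is sent to BigQuery.
    print(table_storage_sql("my-project", "us"))
```

The query reads one row of storage metadata instead of scanning the table itself.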

On accuracy: BigQuery's metadata is authoritative and near-real-time, and total_rows updates as data is committed. There can be brief lag after streaming inserts, but volume checks are designed to catch catastrophic failures (table is empty, retention wiped your data, ingestion loaded 5 rows), not to assert exact counts. If you need precise range assertions, that's still a post-materialization test. The two are complementary.

Date-sharded tables

This is the motivating use case and the reason this belongs in the BigQuery adapter rather than dbt-core. In dbt today, a source maps to a single physical table; there's no native concept of a source that resolves to events_*. On BigQuery, date-sharded tables are everywhere, and they're exactly where the worst volume failures happen.

A new table_pattern property triggers wildcard mode:

tables:
  - name: my_events
    table_pattern: "events_*"

    volume:
      warn_below: 5000
      error_below: 1  # a count can never fall below 0, so use 1 to flag empty shards

When the adapter sees table_pattern, it queries INFORMATION_SCHEMA.TABLES for matching tables and checks row count per table against the thresholds. Any table that falls below error_below (including missing tables that resolve to zero rows) gets flagged:

[volume] my_source.events_* (resolved 7 tables):
  events_20260306: 52,100 rows  PASS
  events_20260307: 49,803 rows  PASS
  events_20260308:      0 rows  ERROR
  events_20260309: 51,445 rows  PASS
  events_20260310: 48,200 rows  PASS
  events_20260311:      0 rows  ERROR
  events_20260312: 50,102 rows  PASS

FAILED: 2 tables below threshold.
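A minimal sketch of the per-shard check, assuming the row counts have already been fetched from metadata into a dict. The function name check_shards and the expected argument are hypothetical: the proposal doesn't say how the adapter would know which shards should exist, so the sketch takes that list from the caller and treats any absent shard as zero rows.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List, Tuple


def check_shards(row_counts: Dict[str, int],
                 table_pattern: str,
                 error_below: int,
                 expected: Iterable[str] = ()) -> List[Tuple[str, int, str]]:
    """Check every table matching table_pattern against error_below.

    row_counts maps table name -> total_rows taken from metadata.
    expected (an assumption, not part of the proposal) lists shard names
    that should exist; any that are absent are counted as 0 rows.
    """
    # Keep only tables whose names match the wildcard pattern.
    counts = {name: rows for name, rows in row_counts.items()
              if fnmatchcase(name, table_pattern)}
    # Missing shards resolve to zero rows so they can be flagged.
    for name in expected:
        counts.setdefault(name, 0)
    return [(name, rows, "ERROR" if rows < error_below else "PASS")
            for name, rows in sorted(counts.items())]
```

Sorting by name puts date-suffixed shards in chronological order, matching the report layout above.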

Partition-level checks

For partitioned tables, table-level row counts can be misleading: 10 million historical rows will pass any reasonable threshold even if today's partition is empty. partition_field and partition_range let you check recent partitions individually:

tables:
  - name: app_events
    volume:
      partition_field: event_date
      partition_range: 3
      warn_below: 10000
      error_below: 100

This queries INFORMATION_SCHEMA.PARTITIONS for the 3 most recent partitions. Still a metadata lookup, with no table scan.
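The selection of recent partitions could look roughly like the sketch below, operating on partition_id and total_rows values as returned by INFORMATION_SCHEMA.PARTITIONS. The function names are hypothetical, and the sketch assumes daily partition IDs in YYYYMMDD form, which sort chronologically as strings. It only considers partitions present in its input, so a partition that never got created would need separate handling.

```python
from typing import Dict, List, Tuple


def recent_partitions(partitions: Dict[str, int],
                      partition_range: int) -> List[Tuple[str, int]]:
    """Return the newest partition_range (partition_id, total_rows) pairs.

    Assumption: partition IDs are date strings such as "20260312".
    Special IDs like __NULL__ and __UNPARTITIONED__ are skipped.
    """
    dated = {pid: rows for pid, rows in partitions.items() if pid.isdigit()}
    newest = sorted(dated, reverse=True)[:partition_range]
    return [(pid, dated[pid]) for pid in newest]


def failing_partitions(partitions: Dict[str, int],
                       partition_range: int,
                       error_below: int) -> List[str]:
    """List recent partition IDs whose row count is below error_below."""
    return [pid for pid, rows in recent_partitions(partitions, partition_range)
            if rows < error_below]
```

Older partitions never enter the comparison, so a large historical row count cannot mask an empty recent partition.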

The design also leaves room for warn_above / error_above thresholds — for catching duplicate loads and retry storms — though that's not part of this proposal.

🏗️ Execution

New subcommand — dbt source volume — separate from dbt source freshness. We're not changing the behavior or contract of an existing command. This is additive.

dbt source freshness && dbt source volume && dbt build

Selection follows the same syntax:

dbt source volume --select source:my_source
dbt source volume --select source:my_source.my_table

Results go into sources.json alongside freshness results, so they flow into whatever artifact-based workflows you've already built. A future dbt source check that runs both together is a natural next step, but that's a separate conversation.

Does the three-mode approach (table / wildcard / partition) feel right, or is there a simpler abstraction that covers the same ground?
table_pattern is introduced here for volume checks — should it be a property on the source table itself, usable beyond volume?
For teams already working around this: what does your setup look like, and what's missing?

RRap0so (Author) · 2026-03-13T06:32:03Z

What would be needed to implement this in bq #1755

Follow-up work in dbt-core would also be needed to finish this proposal.
