Cache test database schema in CI to reduce compute time #37433

Draft
dannyroberts wants to merge 8 commits into master from dmr/cache-test-db-schema

@dannyroberts

Product Description

N/A — CI-only change, no user-facing effects.

Technical Summary

Uses the GitHub Actions cache to store a pg_dump of the test database schema. On a cache hit (expected on nearly every run that does not change migrations), CI restores the dump in seconds instead of running the full ~200s setup_databases() call through the Django test framework.

How it works:

  1. New actions/cache@v4 step caches artifacts/test_db.dump, keyed on hashFiles('**/migrations/**/*.py')
  2. Before tests: if dump file exists (cache hit), start postgres, restore the dump, skip schema creation
  3. After tests: if no dump existed (cache miss), dump the DB for next run's cache
  4. On restore failure: delete the dump and fall back to normal setup
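The steps above can be sketched as a pair of shell functions. This is a rough illustration rather than the script in this PR: the function names, the default `DB_DUMP` path, the "restored"/"fallback" outputs, and the explicit `createdb` step are all assumptions, and it omits waiting for postgres to accept connections and cleaning up a partially restored database.

```shell
#!/usr/bin/env bash
# Sketch only. DB_DUMP's default path and all function names are illustrative;
# the user and database names come from the pg_dump line in this PR.
DB_DUMP="${DB_DUMP:-/tmp/test_db.dump}"
PG_USER="commcarehq"
PG_DB="test_commcarehq"

# Run a command inside the postgres service container (no TTY).
pg_exec() {
    docker compose exec -T postgres "$@"
}

# Steps 2 and 4: restore the dump if the cache provided one.
# Prints "restored" or "fallback" on stdout; tool output goes to stderr.
restore_cached_schema() {
    if [ ! -f "$DB_DUMP" ]; then
        echo "fallback"            # cache miss: normal setup will run
        return 0
    fi
    if docker compose up -d postgres >&2 \
        && pg_exec createdb -U "$PG_USER" "$PG_DB" >&2 \
        && pg_exec pg_restore -U "$PG_USER" -d "$PG_DB" < "$DB_DUMP" >&2; then
        echo "restored"
    else
        rm -f "$DB_DUMP"           # step 4: drop the bad dump, fall back
        echo "fallback"
    fi
}

# Step 3: after the tests, write a dump only when the run started without one.
# $1 is "hit" or "miss".
save_schema_dump() {
    if [ "$1" = "hit" ]; then
        return 0
    fi
    pg_exec pg_dump -U "$PG_USER" -Fc "$PG_DB" > "$DB_DUMP" || rm -f "$DB_DUMP"
}
```

A caller would branch on the printed word, running the usual schema setup whenever the result is "fallback". Deleting the dump on failure means the post-test step treats the run as a cache miss and writes a fresh dump.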

Scope: the 3 non-sharded Python test shards (05, 6a, bf). The python-sharded-and-javascript shard is excluded because it uses a different DB config and runs as only 1 job.

Expected savings: ~600s total CI compute per run (3 shards × 200s) on cache hits, with no wall-clock impact.

Feature Flag

N/A

Safety Assurance

Safety story

  • CI-only change — no production code modified
  • Restore failure is handled gracefully: deletes the dump file and falls back to normal 200s schema setup
  • Dump failure is silently ignored — next run is just a cache miss
  • The cache key changes whenever any migration file changes, so a stale dump is not reused after a schema migration
  • The existing --reusedb=1 fast path in reusedb.py already handles the "DB already exists" case
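To see why a migration change invalidates the cache, here is a local approximation of a key derived from migration files. It is not GitHub's actual hashFiles algorithm, only the same idea: hash every .py file under a migrations directory, then hash the sorted list of results. The function name `migrations_key` is hypothetical, and the sketch assumes GNU find, sort, xargs, and sha256sum as found on Linux CI runners.

```shell
#!/usr/bin/env bash
# Approximates a key like hashFiles('**/migrations/**/*.py') locally.
# NOTE: this is NOT GitHub's hashFiles implementation, only an illustration.
# $1: repository root (defaults to the current directory).
migrations_key() {
    local root="${1:-.}"
    (
        cd "$root" \
            && find . -type f -path '*/migrations/*' -name '*.py' -print0 \
                | LC_ALL=C sort -z \
                | xargs -0 -r sha256sum \
                | sha256sum \
                | cut -d' ' -f1
    )
}
```

Editing a non-migration file leaves the key unchanged, while adding or editing any migration file produces a new key and therefore a cache miss on the next run.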

Automated test coverage

The CI tests themselves serve as validation — if the restored schema is wrong, tests will fail.

QA Plan

  1. First CI run: cache miss, all shards do normal setup, dump is cached
  2. Push a no-op commit: cache hit, verify tests still pass and --durations output shows no ~200s setup time
  3. Verify python-sharded-and-javascript shard is unaffected

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

🤖 Generated with Claude Code

Use GitHub Actions cache to store a pg_dump of the test database schema.
On cache hits, restore the dump (~seconds) instead of running the full
200s setup_databases() call, saving ~600s of total CI compute per run
across the 3 non-sharded Python test shards.

Cache key is based on migration file hashes, so it invalidates whenever
the schema changes. On restore failure, falls back to normal setup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dannyroberts added the "DON'T REVIEW YET" label (This PR is not ready for review. Commits may be rebased or force-pushed. Don't waste your time.) on Feb 24, 2026
docker compose exec -T postgres pg_dump -U commcarehq -Fc test_commcarehq > "$DB_DUMP" 2>/dev/null || rm -f "$DB_DUMP"
fi

exit $TEST_EXIT
Member Author

I don't love how much boilerplate we need in order to do this, but first I just want to see if it'll even save any time at all. If it does, then I'll think about how to make this cleaner / more concise

dannyroberts and others added 7 commits February 24, 2026 09:44
The first run failed to cache because docker compose exec couldn't
reach the postgres container after the web container exited. Ensure
postgres is explicitly running before the dump, and show errors
instead of suppressing them for easier debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
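A minimal sketch of the fix described in this commit, under stated assumptions: the helper names are hypothetical, and the `pg_isready` polling loop is an addition for illustration that the commit message does not mention. The key points from the commit are that postgres is started explicitly before the dump and that stderr is no longer discarded.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start postgres (it may have stopped along with the web
# container) and wait until it accepts connections.
# $1: maximum number of one-second retries (default 30).
ensure_postgres_running() {
    local retries="${1:-30}"
    local i
    docker compose up -d postgres || return 1
    for ((i = 0; i < retries; i++)); do
        if docker compose exec -T postgres pg_isready -U commcarehq -q; then
            return 0
        fi
        sleep 1
    done
    echo "postgres did not become ready after ${retries} attempts" >&2
    return 1
}

# Dump the test database to the file given as $1.
# stderr is left attached so failures show up in the CI log.
dump_schema() {
    local dump_file="$1"
    ensure_postgres_running || return 1
    docker compose exec -T postgres pg_dump -U commcarehq -Fc test_commcarehq > "$dump_file" \
        || { rm -f "$dump_file"; return 1; }
}
```

Removing the partial file on failure keeps a truncated dump from being saved to the cache.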
The Docker container changes artifacts/ ownership to cchq:cchq via
docker/run.sh, making it unwritable by the host runner user. Use
sudo chown to reclaim ownership before writing the dump file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All 4 matrix jobs were using the same artifact name 'test-artifacts',
causing 409 Conflict errors. This also prevented the Post Cache step
from running (it skips when the job fails), so the DB dump was never
cached. Include the matrix values in the artifact name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The artifacts/ directory has its ownership changed to cchq:cchq by the
Docker container, causing permission denied when writing the dump. It
also caused artifact upload conflicts since all shards share the same
artifact name. Use .test-db-cache/ instead, which the container never
touches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Docker container changes ownership of the repo checkout directory,
making it impossible to create new files/dirs in it from the host.
Use /tmp which is always writable and outside the container's reach.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the SQL database is restored from cache, sql_databases_ok()
returns True and the reusedb fast path skips ALL database setup
including CouchDB. But CouchDB is ephemeral in CI and needs setup.

Add FORCE_DB_SETUP env var that bypasses the fast path, forcing
setup_databases() to run with keepdb=True. This is fast for SQL
(tables already exist from restore) while still setting up CouchDB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
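The decision described in this commit can be modelled as a small truth table. This is a toy model in shell, not the Python code in reusedb.py: the function name, its arguments, and the printed labels are hypothetical, and the third branch (SQL databases missing) is an assumption about what normal setup does.

```shell
#!/usr/bin/env bash
# Toy model of the setup decision; the real logic lives in Python.
# $1: "ok" if the SQL databases already exist (e.g. restored from cache)
# $2: value of FORCE_DB_SETUP, empty if unset
choose_db_setup() {
    local sql_ok="$1"
    local force="${2:-}"
    if [ "$sql_ok" = "ok" ] && [ -z "$force" ]; then
        # Previous behaviour: the fast path skips everything, CouchDB included.
        echo "fast-path: skip all database setup"
    elif [ "$sql_ok" = "ok" ]; then
        # New behaviour: SQL tables are kept, CouchDB still gets set up.
        echo "setup_databases(keepdb=True)"
    else
        # Assumed: without existing SQL databases, normal setup runs.
        echo "setup_databases: full setup"
    fi
}
```

For a restored dump, `choose_db_setup ok 1` takes the middle branch, which is the case this commit adds.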