Problem
There is no unified way to track all dataset versions across GCS and Hugging Face. Version information is scattered — GCS uses blob metadata, HF uses git tags — and there's no single source of truth that maps a semver string to the specific GCS generation numbers and HF commit SHAs for that release. This makes rollback discovery difficult and prevents consumers from programmatically querying the current data version.
Solution
Introduce a version registry (version_manifest.json) that:
- Lives on both GCS and HF as a single file containing all version entries
- Maps each semver version to its GCS generation numbers and HF commit SHA
- Provides a current pointer to the latest deployed version
- Supports rollback by treating it as a new release with special_operation="roll-back" metadata
- Exposes a public consumer API (get_data_version(), get_data_manifest()) that fetches the registry from HF without credentials
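As a rough illustration of how a single registry file can answer "which GCS generations and HF commit belong to this version?", here is a minimal sketch. The JSON key names inside hf and gcs (repo, commit, bucket, generations), the sample values, and the resolve_version helper are all assumptions made for this example; they are not taken from gcs_version.py.

```python
import json

# Hypothetical registry contents; every value below is a made-up placeholder.
SAMPLE_REGISTRY = json.loads("""
{
  "current": "1.1.0",
  "versions": [
    {
      "version": "1.0.0",
      "created_at": "2024-01-01T00:00:00Z",
      "hf": {"repo": "example-org/example-data", "commit": "aaa111"},
      "gcs": {"bucket": "example-bucket", "generations": {"data.h5": 1001}}
    },
    {
      "version": "1.1.0",
      "created_at": "2024-02-01T00:00:00Z",
      "hf": {"repo": "example-org/example-data", "commit": "bbb222"},
      "gcs": {"bucket": "example-bucket", "generations": {"data.h5": 1002}}
    }
  ]
}
""")


def resolve_version(registry, version=None):
    """Return the entry for `version`, or for the current pointer when omitted.

    Hypothetical helper, not part of the package's API.
    """
    wanted = registry["current"] if version is None else version
    for entry in registry["versions"]:
        if entry["version"] == wanted:
            return entry
    raise KeyError(f"version {wanted!r} not found in registry")
```

Calling resolve_version(SAMPLE_REGISTRY) follows the current pointer, while passing an explicit semver string looks up an older release, which is the lookup that rollback discovery relies on.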
Key design decisions
- Single registry file: all versions live in one version_manifest.json, not per-version blobs
- Backend separation: upload_manifest() orchestrates, delegating to _upload_registry_to_gcs() and _upload_registry_to_hf()
- Rollback-as-release: rolling back copies old data to new GCS generations and HF commits, then publishes a new version entry
- Consumer API at package level: from policyengine_us_data import get_data_version, get_data_manifest
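The rollback-as-release decision can be sketched as a pure function over the registry. This is only an illustration under stated assumptions: record_rollback is a hypothetical name, the dict layout (repo, commit, bucket, generations keys) is assumed, and the actual copying of data to new GCS generations and a new HF commit is taken to have happened already, with the resulting identifiers passed in.

```python
import copy
from datetime import datetime, timezone


def record_rollback(registry, target_version, new_version, hf_commit, gcs_generations):
    """Return a new registry with a roll-back entry appended (hypothetical helper).

    `hf_commit` and `gcs_generations` are assumed to identify the freshly
    re-uploaded copies of the old data, not the original artifacts.
    """
    target = next(
        (v for v in registry["versions"] if v["version"] == target_version), None
    )
    if target is None:
        raise ValueError(f"cannot roll back to unknown version {target_version!r}")

    updated = copy.deepcopy(registry)  # leave the caller's registry untouched
    updated["versions"].append(
        {
            "version": new_version,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "hf": {"repo": target["hf"]["repo"], "commit": hf_commit},
            "gcs": {
                "bucket": target["gcs"]["bucket"],
                "generations": dict(gcs_generations),
            },
            "special_operation": "roll-back",
            "roll_back_version": target_version,
        }
    )
    updated["current"] = new_version  # the rollback becomes the latest release
    return updated
```

Because the rollback is recorded as an ordinary entry, the history stays append-only: earlier versions are never rewritten, and the current pointer always refers to the most recently published entry.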
Type hierarchy
VersionRegistry
├── current: str
└── versions: list[VersionManifest]
    ├── version: str
    ├── created_at: str
    ├── hf: HFVersionInfo (repo + commit SHA)
    ├── gcs: GCSVersionInfo (bucket + generation map)
    ├── special_operation: str?
    └── roll_back_version: str?
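One way the hierarchy above could be expressed is with dataclasses. The class names and top-level fields follow the tree; the field names inside HFVersionInfo and GCSVersionInfo (repo, commit, bucket, generations), the use of dataclasses, and the to_json method are assumptions for this sketch and may differ from the actual module.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class HFVersionInfo:
    repo: str    # assumed field name for the HF repository id
    commit: str  # assumed field name for the commit SHA


@dataclass
class GCSVersionInfo:
    bucket: str
    # Assumed shape: blob name -> GCS generation number.
    generations: dict[str, int] = field(default_factory=dict)


@dataclass
class VersionManifest:
    version: str
    created_at: str
    hf: HFVersionInfo
    gcs: GCSVersionInfo
    special_operation: Optional[str] = None   # e.g. "roll-back"
    roll_back_version: Optional[str] = None   # version being restored, if any


@dataclass
class VersionRegistry:
    current: str
    versions: list[VersionManifest] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the whole registry, nested entries included."""
        return json.dumps(asdict(self), indent=2)
```

The two optional fields default to None, so an ordinary release only needs to supply the version, timestamp, and backend identifiers.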
Files
policyengine_us_data/utils/gcs_version.py — core module (types, registry I/O, query functions, rollback, consumer API)
policyengine_us_data/utils/data_upload.py — modified to return generations/commits and build manifests
policyengine_us_data/__init__.py — exports get_data_version and get_data_manifest
policyengine_us_data/tests/test_gcs_version.py — 39 tests covering all functionality