Summary
Pipeline run metadata and diagnostics are currently stored on a Modal volume (pipeline-artifacts) under /pipeline/runs/{run_id}/. If the volume is deleted, all run history is lost. Diagnostics are also uploaded to the main data repo (policyengine/policyengine-us-data) under calibration/runs/{run_id}/diagnostics/, but only final diagnostics — not intermediate build artifacts.
Goal
Publish full run records (metadata + diagnostics + intermediate artifacts) to a dedicated HF model repo (PolicyEngine/policyengine-us-data-pipeline). Keep all existing uploads to the main data repo unchanged.
What gets archived
Run metadata & diagnostics (mirrored on every write)
{run_id}/meta.json
{run_id}/diagnostics/unified_diagnostics.csv
{run_id}/diagnostics/calibration_log.csv
{run_id}/diagnostics/unified_run_config.json
{run_id}/diagnostics/national_* variants
{run_id}/diagnostics/validation_results.csv
{run_id}/diagnostics/national_validation.txt
Intermediate build artifacts (Step 1 — not shipped elsewhere)
acs_2022.h5, irs_puf_2015.h5, puf_2024.h5
extended_cps_2024.h5, stratified_extended_cps_2024.h5
build_log.txt, calibration_log_legacy.csv, uprating_factors.csv
Package metadata (Step 2)
calibration_package_meta.json
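The per-run file layout above implies a staging step: copy each run's artifacts into a single {run_id}/ tree so they can be pushed in one batched upload. A minimal sketch of that staging logic; the function name `stage_run_artifacts` and the trimmed file list are illustrative, not the actual helper:

```python
import shutil
from pathlib import Path

# Subset of the diagnostics files archived per run (illustrative).
DIAGNOSTIC_FILES = [
    "unified_diagnostics.csv",
    "calibration_log.csv",
    "unified_run_config.json",
]


def stage_run_artifacts(run_dir: Path, staging_root: Path, run_id: str) -> list[Path]:
    """Copy meta.json and diagnostics into staging_root/{run_id}/, skipping absent files."""
    dest = staging_root / run_id
    (dest / "diagnostics").mkdir(parents=True, exist_ok=True)
    staged = []
    meta = run_dir / "meta.json"
    if meta.exists():
        staged.append(Path(shutil.copy2(meta, dest / "meta.json")))
    for name in DIAGNOSTIC_FILES:
        src = run_dir / "diagnostics" / name
        if src.exists():
            staged.append(Path(shutil.copy2(src, dest / "diagnostics" / name)))
    return staged
```

Staging everything first and uploading the tree in one call keeps the number of HF commits per run small, which is the point of batching the upload.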
Implementation
- New _batched_hf_upload() shared helper in data_upload.py (deduplicates staging + pipeline upload logic)
- New upload_to_pipeline_repo() thin wrapper for the pipeline archival repo
- New _mirror_to_pipeline_repo() in pipeline.py, a non-fatal subprocess wrapper with timeout and env-var data passing
- New _archive_artifacts(), a unified artifact archival helper
- write_run_meta() gains a mirror param (set to False in error handlers to prevent hangs)
- All archival is non-fatal — pipeline never fails due to archival issues
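The non-fatal subprocess pattern can be sketched as follows. This is a hedged illustration, not the real _mirror_to_pipeline_repo(): the inline `-c` script stands in for the actual upload code, and the env-var names are assumptions.

```python
import os
import subprocess
import sys


def mirror_to_pipeline_repo(run_id: str, payload: str, timeout_s: int = 120) -> bool:
    """Run the mirror upload out-of-process; never raise, never block forever."""
    # Data is passed via environment variables rather than argv.
    env = dict(os.environ, RUN_ID=run_id, MIRROR_PAYLOAD=payload)
    try:
        subprocess.run(
            # Stand-in for the real upload script (hypothetical).
            [sys.executable, "-c", "import os; assert os.environ['RUN_ID']"],
            env=env,
            timeout=timeout_s,
            check=True,
        )
        return True
    except (subprocess.SubprocessError, OSError):
        # Non-fatal by design: the caller logs and continues, so an
        # archival failure can never fail the pipeline run itself.
        return False
```

Running the upload in a subprocess with a timeout is what lets write_run_meta(mirror=False) avoid hangs in error handlers: the parent process is never blocked on a wedged HF upload.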
Prerequisites
- Create PolicyEngine/policyengine-us-data-pipeline model repo on HuggingFace (one-time manual step)