Fix h5 files by saving calibration geography artifact, and model fit resume function#708
Conversation
Force-pushed 65a8134 to 1d25092
juaristi22 left a comment
Approved. The new saved-geography flow fixes the real issue here: calibration now uses the same geography assignment when building the matrix and when producing calibrated H5s, instead of trying to regenerate it later from (n_records, n_clones, seed).
One note: the legacy fallback is no longer backward compatible for older artifacts that only contain weights and the dataset. If neither geography_assignment.npz nor stacked_blocks.npy is present, publish/worker now fail instead of rebuilding via the old regeneration path. I’m fine with that tradeoff, but it should be documented explicitly as a compatibility break.
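The fallback order described in this review can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the function name `load_geography_assignment` and the `.npz` array key `assignment` are assumptions; only the artifact filenames, `reconstruct_geography_from_blocks`, and the fail-instead-of-regenerate behavior come from the PR.

```python
import numpy as np
from pathlib import Path


def reconstruct_geography_from_blocks(blocks: np.ndarray) -> np.ndarray:
    # Stand-in for the real legacy helper named in the PR; the actual
    # reconstruction logic is not shown here.
    return blocks.reshape(-1)


def load_geography_assignment(artifact_dir: Path) -> np.ndarray:
    """Load the saved geography, with the fallback order discussed above:
    geography_assignment.npz -> legacy stacked_blocks.npy -> hard failure
    (no silent regeneration from (n_records, n_clones, seed))."""
    npz_path = artifact_dir / "geography_assignment.npz"
    if npz_path.exists():
        with np.load(npz_path) as data:
            return data["assignment"]  # array key is an assumption
    blocks_path = artifact_dir / "stacked_blocks.npy"
    if blocks_path.exists():
        return reconstruct_geography_from_blocks(np.load(blocks_path))
    raise FileNotFoundError(
        f"No geography artifact in {artifact_dir}; the old regeneration "
        "path was removed, so pre-artifact outputs are not supported."
    )
```

This makes the compatibility break explicit at load time rather than producing a silently mismatched geography.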
Paused, waiting on #711
Force-pushed 6acc3eb to 5eb5eb0
Calibration now persists geography_assignment.npz alongside weights so that downstream publish and worker steps use the exact same geography instead of regenerating it randomly. Adds --resume-from and --checkpoint-output flags to unified_calibration for continuing fits from a saved checkpoint or warm-starting from weights. Also gitignores *.csv.gz to prevent accidental commits of cached ORG data. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
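The new CLI surface from this commit might be wired up roughly as below. Only the flag names `--resume-from` and `--checkpoint-output` and their purpose (checkpoint resume vs. warm-start from weights) come from the PR; the parser structure, types, and help text are assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Minimal sketch of the new unified_calibration flags; defaults and
    # help wording here are assumptions, not the PR's actual code.
    p = argparse.ArgumentParser(prog="unified_calibration")
    p.add_argument(
        "--resume-from", type=Path, default=None,
        help="full checkpoint to continue a fit from, or .npy weights "
             "to warm-start from",
    )
    p.add_argument(
        "--checkpoint-output", type=Path, default=None,
        help="path to periodically write fit checkpoints to",
    )
    return p
```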
Force-pushed 5eb5eb0 to 5ca1241
…nfig Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Fix calibration crash on string constraint variables (ssn_card_type) by falling back from float32 cast when values are non-numeric. Impute ITIN status for undocumented (code-0) persons: select tax units with code-0 earners via weighted random sampling targeting 4.4M ITIN returns (IRS NTA), then mark all code-0 members of those units. Updates has_tin = (ssn_card_type != 0) | has_itin_number so ITIN holders correctly qualify for ODC ($500 credit). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
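The sampling scheme this commit describes could look something like the sketch below. The function name `impute_itin_status`, the code-0 convention, the 4.4M target, and the `has_tin` update rule come from the PR; everything else (per-unit weight handling, the greedy cumulative-weight stopping rule, array layout) is a simplifying assumption.

```python
import numpy as np


def impute_itin_status(ssn_card_type, tax_unit_id, weight,
                       target_returns=4.4e6, seed=0):
    """Hypothetical sketch of the ITIN imputation described above.

    Samples tax units containing code-0 earners (probability proportional
    to weight, without replacement) until the weighted count of selected
    units reaches the ~4.4M ITIN-return target, then marks every code-0
    member of a selected unit as an ITIN holder.
    """
    ssn_card_type = np.asarray(ssn_card_type)
    tax_unit_id = np.asarray(tax_unit_id)
    weight = np.asarray(weight, dtype=float)

    rng = np.random.default_rng(seed)
    code0 = ssn_card_type == 0
    units = np.unique(tax_unit_id[code0])
    # One weight per candidate unit (first member's weight, a simplification)
    unit_w = np.array([weight[tax_unit_id == u][0] for u in units])
    order = rng.choice(len(units), size=len(units), replace=False,
                       p=unit_w / unit_w.sum())
    # Keep drawing units until their cumulative weight covers the target
    cum = np.cumsum(unit_w[order])
    n_keep = min(int(np.searchsorted(cum, target_returns)) + 1, len(units))
    selected = units[order[:n_keep]]

    has_itin_number = code0 & np.isin(tax_unit_id, selected)
    # ITIN holders now count as having a TIN, qualifying them for ODC
    has_tin = (ssn_card_type != 0) | has_itin_number
    return has_itin_number, has_tin
```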
Summary

Continues the `fine-agi-brackets` branch after #695 (which added fine AGI bracket targets from SOI stubs 9/10 and Table 1.4, re-enabled `income_tax_positive`, and added `net_worth` and district SNAP targets). This PR adds two infrastructure improvements to the calibration pipeline, plus two related fixes:

- **Saved geography artifact.** Downstream steps previously called `assign_random_geography()` to generate a fresh geography assignment, meaning the H5 files were built with a different geography than the weights were optimized against. Calibration now saves `geography_assignment.npz` alongside weights, and all downstream steps load it instead of regenerating. Backward compatibility with legacy `stacked_blocks.npy` is preserved via `reconstruct_geography_from_blocks`.
- **Checkpoint resume.** Adds `--resume-from` and `--checkpoint-output` flags to `unified_calibration.py`. Full checkpoint resume restores L0 gate state and continues epoch numbering; warm-start from `.npy` weights is also supported. Hyperparameter compatibility is validated on resume.
- **ITIN imputation.** Undocumented (code-0) persons previously had `has_tin = False`, disqualifying them from ODC ($500 credit). A new `impute_itin_status()` function selects tax units with code-0 earners via weighted random sampling targeting 4.4M ITIN returns (IRS NTA benchmark), then marks all code-0 members of selected units as ITIN holders. Updates `has_tin = (ssn_card_type != 0) | has_itin_number`.
- **String constraint fix.** Calibration crashed on string constraint variables such as `ssn_card_type` when casting to float32. Both sites now try float32 first and fall back to keeping raw values.

Also:

- `--national-only` flag to `publish_local_area.py` for building just the national `US.h5`
- gitignores `*.csv.gz` to prevent accidental commit of cached ORG data
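The float32 fallback for string constraint variables can be sketched in a few lines. This is a minimal illustration of the try-then-fall-back pattern the PR describes, not the actual code; the helper name `to_constraint_array` is hypothetical.

```python
import numpy as np


def to_constraint_array(values):
    """Cast constraint values to float32 where possible, but keep raw
    values (e.g. string categories like ssn_card_type) when the cast
    would fail on non-numeric data."""
    try:
        return np.asarray(values, dtype=np.float32)
    except (ValueError, TypeError):
        # Non-numeric values: preserve them as-is instead of crashing
        return np.asarray(values)
```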