[codex] Align SPM thresholds and stabilize clone-half priors by MaxGhenis · Pull Request #702 · PolicyEngine/policyengine-us-data

MaxGhenis · 2026-04-08T13:50:54Z

What changed

align data-side SPM threshold helpers with the same reference-threshold and equivalence-scale logic used in policyengine-us
make CD geographic adjustments tenure-specific instead of applying a renter-style adjustment to every cloned SPM unit
replace the stale hardcoded housing target with a year-specific Census CPS ASEC SPM_CAPHOUSESUB benchmark for spm_unit_capped_housing_subsidy
add a separate HUD USER benchmark path for modeled housing_assistance, so Census SPM capped subsidy and HUD spending/assisted-household counts are no longer mixed together
stop stage-2 QRF from imputing spm_unit_spm_threshold for the PUF clone half
rebuild clone-half spm_unit_spm_threshold deterministically from the donor half's geography and the current threshold formula
replace the additive +1 sparse-reweighting prior with deterministic near-zero priors for zero-weight clone households, while keeping donor-half priors close to their survey weights
add regression tests for future-year threshold reconstruction, tenure-specific CD GEOADJ values, clone-half threshold rebuilding, and sparse-prior initialization
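The sparse-prior change can be pictured with a small sketch. This is not the code in this PR: `initial_sparse_priors`, the `epsilon` value, and the reading of the old behavior as `weight + 1` are assumptions made for illustration only.

```python
import numpy as np


def initial_sparse_priors(survey_weights, is_clone, epsilon=1e-6):
    """Return starting weights for a sparse reweighting optimizer.

    Hypothetical helper: donor households start at their survey weight,
    while zero-weight clone households start at a small fixed epsilon.
    """
    weights = np.asarray(survey_weights, dtype=float)
    clone = np.asarray(is_clone, dtype=bool)
    priors = weights.copy()
    # Only clone rows that carry no survey weight get the near-zero prior.
    zero_weight_clone = clone & (weights <= 0)
    priors[zero_weight_clone] = epsilon
    return priors


# Two donor households followed by their two zero-weight clones.
weights = np.array([1200.0, 850.0, 0.0, 0.0])
is_clone = np.array([False, False, True, True])

priors = initial_sparse_priors(weights, is_clone)
additive = weights + 1.0  # assumed shape of the old additive +1 prior
```

Under the assumed additive prior each clone starts at weight 1, the same order of magnitude as a lightly weighted real household. With the near-zero prior the clones start effectively absent, and the result is deterministic because no random draw is involved.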

Why

The data pipeline had three distinct SPM issues:

threshold reconstruction in policyengine-us-data had drifted from the model-side logic in policyengine-us
local-area recalculation reused a renter-style GEOADJ across owners and renters
clone-half enhanced CPS generation gave zero-weight synthetic households a meaningful starting prior and let stage-2 QRF learn spm_unit_spm_threshold, even though thresholds should be derived from donor geography and unit composition rather than predicted statistically
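To make the tenure issue concrete, here is a minimal sketch of a tenure-specific geographic adjustment. It assumes the usual SPM form, in which the housing share of the threshold is scaled by the local-to-national median rent ratio and the remainder is left unadjusted. The `Tenure` enum, the `geoadj` function, the housing shares, and the rent figures are illustrative placeholders, not values or names from either repository.

```python
from enum import Enum


class Tenure(Enum):
    OWNER_WITH_MORTGAGE = "owner_with_mortgage"
    OWNER_WITHOUT_MORTGAGE = "owner_without_mortgage"
    RENTER = "renter"


# Placeholder housing shares of the threshold, for illustration only.
HOUSING_SHARE = {
    Tenure.OWNER_WITH_MORTGAGE: 0.43,
    Tenure.OWNER_WITHOUT_MORTGAGE: 0.32,
    Tenure.RENTER: 0.44,
}


def geoadj(local_median_rent, national_median_rent, tenure):
    """Scale only the housing portion of the threshold by the rent ratio."""
    share = HOUSING_SHARE[tenure]
    rent_ratio = local_median_rent / national_median_rent
    return share * rent_ratio + (1.0 - share)


# One district with rents 50% above the national median (made-up numbers).
district_geoadj = {tenure: geoadj(1800.0, 1200.0, tenure) for tenure in Tenure}
```

In this toy district the renter adjustment is larger than the adjustment for owners without a mortgage, so reusing the renter value for every unit would overstate those owners' thresholds.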

The housing benchmark cleanup is a separate but related piece of concept hygiene:

spm_unit_capped_housing_subsidy is a Census SPM concept and should be benchmarked to CPS ASEC SPM_CAPHOUSESUB
housing_assistance is a HUD program/spending concept and should be benchmarked separately to HUD USER assisted-household counts and spending totals

Impact

CPS-derived thresholds and local-area cloned thresholds now use the same future-year reference threshold path as policyengine-us
CD-based local-area outputs no longer reuse one GEOADJ across owners and renters
national housing calibration for the SPM capped subsidy now uses the Census concept instead of a stale mixed HUD/Census hardcoded value
validation output now reports the Census capped-subsidy benchmark and HUD USER housing-assistance benchmark as separate rows
clone-half enhanced CPS generation now starts zero-weight synthetic households near zero in the sparse optimizer instead of around weight 1
clone-half SPM thresholds are rebuilt from donor geography instead of being stage-2 QRF outputs

Root cause

policyengine-us-data had:

a separate threshold forecast path from policyengine-us
one CD GEOADJ per district built with renter assumptions and reused for all tenures
a housing benchmark path that mixed Census SPM capped subsidy and HUD spending concepts
clone-half enhanced CPS generation that treated zero-weight synthetic households like ordinary weighted donors during sparse-prior initialization and allowed spm_unit_spm_threshold into the CPS-only QRF output set
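A deterministic rebuild of a clone's threshold might look roughly like the sketch below. It assumes the threshold is the product of a tenure- and year-specific reference threshold, an equivalence scale, and a geographic adjustment, and it uses the three-parameter scale as published for the SPM. The function names and the dollar figure are hypothetical, and the actual logic in policyengine-us may differ in detail.

```python
def spm_equivalence_scale(adults, children):
    """Three-parameter scale, normalized so 2 adults + 2 children equals 1."""
    if children == 0 and adults <= 2:
        raw = adults ** 0.5
    elif adults == 1:
        # Single parent: the first child counts 0.8, later children 0.5 each.
        raw = (1 + 0.8 + 0.5 * (children - 1)) ** 0.7
    else:
        raw = (adults + 0.5 * children) ** 0.7
    reference = (2 + 0.5 * 2) ** 0.7
    return raw / reference


def rebuild_threshold(reference_threshold, adults, children, geoadj):
    """Derive the threshold from composition and geography, with no model fit."""
    return reference_threshold * spm_equivalence_scale(adults, children) * geoadj


# Illustrative inputs: a clone unit relocated to the donor's geography.
threshold = rebuild_threshold(30_000.0, adults=2, children=2, geoadj=1.22)
```

Because every input is observed or looked up, two runs give the same threshold for the same unit, which is not guaranteed when the value comes out of a QRF draw.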

Validation

uv run pytest -q tests/unit/test_extended_cps.py
uv run pytest -q tests/unit/calibration/test_calibration_puf_impute.py
uv run pytest -q policyengine_us_data/tests/test_local_area_calibration/test_spm_thresholds.py
uv run pytest -q tests/integration/test_enhanced_cps.py -k 'household_count or poverty_rate_reasonable'
git diff --check

MaxGhenis · 2026-04-08T14:48:52Z

Temporarily closing to retrigger GitHub Actions on the latest head; reopening immediately.


baogorek

Exciting that this could help us with poverty:

  Tenure-specific geoadj: if you're applying renter housing shares to owner households, their SPM thresholds are wrong, which means their poverty status is wrong, which means calibration is optimizing toward incorrect poverty targets. Fixing this makes the targets mean what they're supposed to mean.

Watch for my comments about defining NYC with CDs, as I consider this a regression compared to what was merged in with #671.

My robot seems to think this is a high risk / high reward PR. Finding out why the integration test failed might give us some more insight into that.

baogorek · 2026-04-09T20:18:57Z

policyengine_us_data/calibration/publish_local_area.py

+    "KINGS_COUNTY_NY",
+}
+
+NYC_CDS = [


Heads up! PR #702 switches to filtering by NYC_CDS (congressional districts) + NYC_COUNTIES (county name strings). This would be a regression from the block-based approach used currently in main, going back to the CD-level geography that the block assignment work was meant to improve.

Honestly, any time I see "CDs", I get nervous.
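For context on why block-level geography is finer than CDs, a hedged sketch: a 15-digit census block GEOID begins with the 5-digit state-plus-county FIPS code, so NYC membership can be read off the block itself. The helper below is hypothetical and is not the block-based filter in main; only the five borough county FIPS codes are facts.

```python
# FIPS codes for the five NYC counties: Bronx, Kings, New York, Queens, Richmond.
NYC_COUNTY_FIPS = {"36005", "36047", "36061", "36081", "36085"}


def block_in_nyc(block_geoid: str) -> bool:
    """Check NYC membership from the county prefix of a 15-digit block GEOID."""
    return len(block_geoid) == 15 and block_geoid[:5] in NYC_COUNTY_FIPS


manhattan_block = "360610001001000"  # made-up block in New York County
nassau_block = "360590001001000"     # made-up block in Nassau County
in_city = [block_in_nyc(manhattan_block), block_in_nyc(nassau_block)]
```

A congressional district can straddle the city line, so a CD-level filter cannot make this distinction for households near the boundary.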

MaxGhenis added 3 commits April 7, 2026 09:48

Align SPM threshold recalculation

baf2a24

Split housing benchmarks by Census and HUD concepts

418d73c

Stabilize clone-half SPM thresholds and weight priors

c71e53a

MaxGhenis mentioned this pull request Apr 8, 2026

[codex] Align SPM thresholds and stabilize clone-half priors #696

Closed

MaxGhenis added 5 commits April 8, 2026 09:52

Format clone-half threshold fixes

84e4259

Format SPM benchmark and calibration files

289bbc7

Add changelog fragment for clone-half SPM fixes

9b72228

Retrigger PR checks

1151faa

Retrigger PR checks after cancel

9c01bc7

MaxGhenis closed this Apr 8, 2026

MaxGhenis reopened this Apr 8, 2026

MaxGhenis marked this pull request as ready for review April 8, 2026 14:50

MaxGhenis marked this pull request as draft April 8, 2026 14:50

MaxGhenis added 8 commits April 8, 2026 10:53

Merge upstream main into codex/spm-threshold-align-data

893d6ad

Fix Towncrier fragment layout

0caed79

Tighten clone prior and tenure validation

98a4e47

Fix local-area calibration imports and takeup args

aed202a

Harden housing target test stub

e2ae174

Trigger PR checks

f5750e3

Merge remote-tracking branch 'upstream/main' into codex/spm-threshold…

2bce19a

…-align-data # Conflicts: # policyengine_us_data/datasets/cps/extended_cps.py # tests/unit/test_extended_cps.py

Use valid year in housing target test

17e77b3

baogorek reviewed Apr 9, 2026

MaxGhenis wants to merge 16 commits into main from codex/spm-threshold-align-data


2 participants
