Summary
Use tax liability aggregates from the original PUF aggregate records as additional calibration targets when generating synthetic disaggregated records. This would prune implausible records via L0 regularization.
Motivation
The four aggregate PUF records contain not just income variables but also computed tax liabilities (E05800, E06200, E06500, E09600, E10300, etc.). These are per-return averages, like every other variable on those records, so multiplying by the record weight gives the aggregate total. Currently the disaggregation only calibrates to income totals. We could additionally require that the synthetic records, when scored through PolicyEngine, reproduce the known tax liability totals.
This is valuable because:
- A record with $500M AGI but an implausible deduction structure would produce the wrong tax liability
- Tax liabilities encode the joint plausibility of the entire return, not just individual variables
- L0 regularization naturally zeros out records that can't contribute to matching both income AND tax targets
Approach
- Overgenerate synthetic records (e.g., 200-400 per bucket instead of 20-40)
- Score each through PolicyEngine to compute tax liabilities
- Set up a reweighting problem: find non-negative weights that minimize the distance to:
  - Income component totals (E00200, P23250, E00650, etc.)
  - Tax liability totals (E05800, E09600, etc.)
  - Count targets from GDB
- L0 regularization zeros out records that don't help match both sets of targets
- Output the surviving records with their optimized weights
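The steps above can be illustrated with a toy example. This is a minimal sketch, not the proposed implementation: the record values, targets, penalty size, and function names are all hypothetical, and an exhaustive search over small integer weights stands in for whatever gradient-based L0 optimizer would actually be used. It only shows how adding a tax target plus a per-record penalty drops a record whose tax liability is inconsistent with its income.

```python
from itertools import product

# Hypothetical per-record values, as they might come back from scoring.
# Columns per unit of weight: (returns represented, income, tax liability).
RECORDS = {
    "a": (1.0, 100.0, 30.0),
    "b": (1.0, 300.0, 90.0),
    "c": (1.0, 200.0, 10.0),  # tax far too low for its income
}
TARGETS = (10.0, 2000.0, 600.0)  # count, income total, tax total (made up)
L0_PENALTY = 0.01  # assumed cost charged for every record that is kept


def objective(weights, records, targets, l0_penalty):
    """Sum of squared relative target errors plus a per-record L0 penalty."""
    fit = 0.0
    for j, target in enumerate(targets):
        estimate = sum(w * row[j] for w, row in zip(weights, records))
        fit += ((estimate - target) / target) ** 2
    nonzero = sum(1 for w in weights if w > 0)
    return fit + l0_penalty * nonzero


def best_weights(records, targets, l0_penalty, max_weight=10):
    """Exhaustively search integer weights; only viable for toy inputs."""
    grid = product(range(max_weight + 1), repeat=len(records))
    return min(grid, key=lambda w: objective(w, records, targets, l0_penalty))


rows = list(RECORDS.values())
weights = best_weights(rows, TARGETS, L0_PENALTY)
survivors = {name: w for name, w in zip(RECORDS, weights) if w > 0}
print(survivors)  # {'a': 5, 'b': 5}
```

Record "c" can help match the income and count targets, but any positive weight on it pulls the tax total off target, so the search assigns it zero weight. With income targets alone it would not be excluded so cleanly.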
Tax year considerations
The raw PUF is 2015 but gets uprated to 2021. The tax liability values also get uprated. We'd score the uprated synthetic records under 2021 tax law (which PolicyEngine supports) and compare to the uprated tax liability totals.
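As a rough illustration of the comparison described here, the sketch below turns a per-return average into an uprated aggregate target and measures how far a scored total falls from it. The single multiplicative uprating factor, the function names, and every number are placeholders chosen for the example; the actual uprating procedure may differ by variable.

```python
# All numbers are placeholders, not real PUF or uprating values.
UPRATING_FACTOR = 1.25  # hypothetical 2015 -> 2021 growth in tax liability


def uprated_total(per_return_average, n_returns, factor):
    """Convert a per-return average into an uprated aggregate total."""
    return per_return_average * n_returns * factor


def relative_gap(scored_total, target_total):
    """Signed relative difference between the scored total and the target."""
    return (scored_total - target_total) / target_total


target = uprated_total(per_return_average=40_000.0, n_returns=1_000,
                       factor=UPRATING_FACTOR)
scored = 48_000_000.0  # hypothetical weighted sum of scored 2021 liabilities
gap = relative_gap(scored, target)
print(f"target={target:,.0f} gap={gap:.1%}")
```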
Integration point
This would replace the current Dirichlet + donor scaling approach in disaggregate_puf.py. The overgeneration step is fast; PolicyEngine scoring is the bottleneck, though scoring roughly 1,000 records is feasible.
Dependencies
- microcalibrate or custom L0 reweighting (already used in enhanced CPS)
References
- L0 regularization: already used in enhanced_cps.py for sparse reweighting
- Tax variables in aggregate records: E05800, E06200, E06500, E09600, E10300, E10700, E10900