Summary
Improve top-tail income representation in the enhanced CPS through two complementary approaches:
Phase 1: Include PUF aggregate records (this PR)
The IRS PUF contains 4 aggregate records (MARS=0) that bundle ultra-high-income filers to protect anonymity. The PUF pipeline has been dropping them (puf = puf[puf.MARS != 0]), which discards $140B+ in weighted AGI, mostly in the $10M+ bracket.
Changes:
- Assign demographics to aggregate records (filing status, age, gender) instead of filtering them out
- Inject high-income PUF records (AGI > $1M) directly into the ExtendedCPS dataset, giving the reweighter actual high-income observations
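As a rough illustration of the first change, the sketch below keeps the MARS=0 rows and fills in demographics instead of filtering them out. It is not the code in this PR: the function name, the "age" and "gender" column names, and the placeholder assignment rules (joint filing status, a random age range, a random gender) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def assign_demographics_sketch(puf: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Hypothetical stand-in for assign_aggregate_demographics()."""
    rng = np.random.default_rng(seed)
    puf = puf.copy()

    # Aggregate records are the ones previously removed by puf[puf.MARS != 0].
    is_aggregate = puf["MARS"] == 0
    n = int(is_aggregate.sum())

    # Placeholder rules (assumptions, not the PR's actual logic).
    puf.loc[is_aggregate, "MARS"] = 2  # treat as married filing jointly
    puf.loc[is_aggregate, "age"] = rng.integers(40, 85, size=n)
    puf.loc[is_aggregate, "gender"] = rng.choice(["male", "female"], size=n)
    return puf
```

The point of the sketch is only that every input row survives; the real assignment rules would need to be read from the PR diff.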
Phase 2: Forbes 400 synthetic records (future)
Add Forbes 400 records with wealth-to-income imputation for the extreme top tail.
Problem
The CPS has catastrophic under-representation at the top of the income distribution:
- $5M-$10M AGI bracket: -98.5% calibration error
- $10M+ AGI bracket: -95.1% calibration error
This means millionaire/billionaire tax scoring is unreliable, and calibration weights get distorted trying to compensate.
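For readers unfamiliar with the metric, the percentages above read naturally as relative errors against the SOI target. The PR text does not define the formula, so the definition below is an assumption, and the function name is hypothetical.

```python
def relative_error(estimate: float, target: float) -> float:
    """Signed relative error; negative values mean the estimate is too low."""
    if target == 0:
        raise ValueError("target must be non-zero")
    return (estimate - target) / target
```

Under this reading, a -98.5% error means the dataset reproduces only about 1.5% of the targeted amount in that bracket.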
Key data findings
The 4 aggregate records contain:
- ~1,233 total weighted filers
- $140.3B weighted AGI ($152.9B in $10M+ bracket alone)
- Massive capital gains ($86.7B), dividends ($20.4B), partnership income ($11.2B)
- Each has XTOT=1 (single filer, not multiple bundled) with weights of 140-465
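Totals like these can be reproduced with a weighted sum over the aggregate rows. The sketch below is a minimal version under assumed column names ("weight" and "agi" are hypothetical; the actual PUF field names and any weight scaling should be taken from the pipeline).

```python
import pandas as pd


def weighted_summary(
    records: pd.DataFrame,
    weight_col: str = "weight",  # hypothetical column name
    value_col: str = "agi",      # hypothetical column name
) -> dict:
    """Return the weighted record count and weighted total of one variable."""
    weights = records[weight_col]
    return {
        "weighted_filers": float(weights.sum()),
        "weighted_total": float((weights * records[value_col]).sum()),
    }
```

Calling it on the rows selected by puf[puf.MARS == 0], once per income variable, would give figures comparable to the list above.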
Approach
- assign_aggregate_demographics() assigns MARS, age, and gender to MARS=0 records
- _inject_high_income_puf_records() appends PUF records with AGI > $1M to ExtendedCPS
- The reweighting optimizer then adjusts weights to match SOI targets
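The injection step can be pictured as a filter followed by an append. The sketch below assumes both datasets are pandas DataFrames with a shared "agi" column, which is a simplification: the real ExtendedCPS storage format, the function signature, and the "source" tag are not described here and are hypothetical.

```python
import pandas as pd


def inject_high_income_sketch(
    extended_cps: pd.DataFrame,
    puf: pd.DataFrame,
    agi_col: str = "agi",          # hypothetical column name
    threshold: float = 1_000_000,  # AGI > $1M, as stated above
) -> pd.DataFrame:
    """Hypothetical stand-in for _inject_high_income_puf_records()."""
    # Keep only PUF records strictly above the threshold.
    high_income = puf[puf[agi_col] > threshold].copy()

    # Tag the origin of each row so injected records stay identifiable.
    high_income["source"] = "puf_high_income"
    base = extended_cps.copy()
    base["source"] = "extended_cps"

    return pd.concat([base, high_income], ignore_index=True)
```

The appended rows keep their original weights in this sketch; adjusting them toward the SOI targets is left to the reweighting optimizer.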
Verification needed