
Add Python wheel splitter for host/device separation#3855

Open
stellaraccident wants to merge 4 commits into develop from users/stellaraccident/wheel-splitter

Conversation

stellaraccident (Contributor) commented Mar 7, 2026

Summary

Adds a `split_python_wheels` tool that post-processes fat PyTorch ROCm wheels into a host wheel (device code stripped) and per-arch device wheels containing `.kpack` archives and arch-specific database files. Also generates PEP 817 variant wheels for wheelnext-aware installers.

Tested against torch-2.10.0+rocm7.1 (5.1 GB compressed): host wheel shrinks to 431 MB, with 16 device wheels generated. A user installing for a single GPU downloads 87% less for lightweight targets (gfx1100: 672 MB), 67% less for the heaviest (gfx942: 2.0 GB).

Full split report: https://gist.github.com/stellaraccident/105d58437d903218c476594c2ba33cce

Catalog of changes

Please review per-commit. This PR will be landed as a merge commit to preserve history.

Commit 1: Add fast-path fatbin detection and fix PHDR in-place write regression

  • `elf/surgery.py`: Add `ElfSurgery.has_fatbin_section()` — lightweight `.hip_fatbin` section scan reading only ELF headers (a few KB vs loading multi-GB binaries)
  • `coff/surgery.py`: Add `CoffSurgery.has_fatbin_section()` — equivalent for PE/COFF `.hip_fat` sections
  • `kpack_transform.py`: `is_fat_binary()` now dispatches to fast-path methods instead of loading full binaries
  • `tests/elf/test_surgery.py`: Move inline imports to top-level; add docstring noting that spare PHDR slots (tested by `_build_synthetic_elf`) are commonly the result of prior patchelf transformations
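For context, the header-only scan described above can be sketched as follows. This is a hypothetical minimal version (function name and details assumed), not the actual `ElfSurgery.has_fatbin_section()` implementation:

```python
import struct

def has_fatbin_section(file_path, section_name=b".hip_fatbin"):
    """Header-only scan: reads the ELF header, section header table, and
    .shstrtab; never loads section payloads, so it stays cheap on
    multi-GB binaries.  Sketch only; details are assumptions."""
    with open(file_path, "rb") as f:
        ehdr = f.read(64)  # ELF64 header is 64 bytes
        if len(ehdr) < 64 or ehdr[:4] != b"\x7fELF" or ehdr[4] != 2:
            return False  # not a 64-bit ELF
        e_shoff = struct.unpack_from("<Q", ehdr, 40)[0]
        e_shentsize = struct.unpack_from("<H", ehdr, 58)[0]
        e_shnum = struct.unpack_from("<H", ehdr, 60)[0]
        e_shstrndx = struct.unpack_from("<H", ehdr, 62)[0]
        if e_shoff == 0 or e_shnum == 0 or e_shstrndx >= e_shnum:
            return False
        f.seek(e_shoff)
        shtab = f.read(e_shentsize * e_shnum)
        # Locate .shstrtab via sh_offset (+24) and sh_size (+32)
        base = e_shstrndx * e_shentsize
        str_off = struct.unpack_from("<Q", shtab, base + 24)[0]
        str_size = struct.unpack_from("<Q", shtab, base + 32)[0]
        f.seek(str_off)
        strtab = f.read(str_size)
        # Compare each section's NUL-terminated name against the target
        for i in range(e_shnum):
            name_idx = struct.unpack_from("<I", shtab, i * e_shentsize)[0]
            end = strtab.find(b"\x00", name_idx)
            if strtab[name_idx:end] == section_name:
                return True
    return False
```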

Commit 2: Add database handlers for wheel splitting with arch-aware file detection

  • `database_handlers.py`: Five handlers for arch-specific file detection:
    • RocBLAS/HipBLASLt/HipSparseLt: Tensile library files by gfx pattern in filename
  • AOTriton: Kernel image directories — maps family names (`gfx11xx` → `gfx11`, `gfx120x` → `gfx12_0`) to rocm-bootstrap bundle keys. Coded to match the actual arrangement in the torch wheel (an earlier version followed an abstract spec)
    • MIOpen: Tuning databases with concatenated arch+CU filenames (e.g., `gfx90878` = gfx908 + 78 CUs). Uses explicit arch-ID regex. This handler is interim — we plan to adapt the MIOpen database format to be more amenable to splitting soon
  • Removed catch-all exception handlers from all `detect()` methods; added `_relative_path()` base class method with explicit `ValueError` when path is not under `prefix_root` (contract enforcement vs silent fallthrough)
  • `tests/test_database_handlers.py`: Updated outside-prefix tests to assert explicit error message
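To illustrate the concatenated-filename problem for MIOpen, here is a hypothetical version of the explicit arch-ID regex; the arch list, helper name, and return shape are assumptions, not the PR's actual code:

```python
import re

# Hypothetical: known arch IDs listed explicitly, longest first, so the
# trailing CU count can be split off unambiguously (gfx90878 -> gfx908, 78).
KNOWN_ARCHES = ["gfx942", "gfx90a", "gfx908", "gfx906", "gfx1100", "gfx1030"]
_ARCH_CU_RE = re.compile(
    r"^(%s)(\d+)" % "|".join(sorted(KNOWN_ARCHES, key=len, reverse=True))
)

def parse_miopen_db_name(stem):
    """Split a MIOpen database filename stem into (arch, CU count)."""
    m = _ARCH_CU_RE.match(stem)
    if not m:
        return None
    return m.group(1), int(m.group(2))
```

A naive `gfx\d+` pattern would swallow the CU count into the arch ID, which is why an explicit arch list is needed here.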

Commit 3: Add Python wheel splitter for host/device separation

  • `wheel_splitter.py`: Core `WheelSplitter` class:
    • Scans fat binaries via fast-path `is_fat_binary()` (cross-platform ELF/COFF)
    • Extracts device kernels into per-arch `.kpack` archives with zstd compression
    • Transforms host binaries (HIPF→HIPK magic rewrite, pointer redirection, zero-page)
    • Rewrites host METADATA with extras-based `Requires-Dist` per target architecture
    • Hierarchy-aware bundling: collapses xnack variants into bundle keys, generates correct dependency chains (e.g., `gfx1100` extra depends on both `gfx1100` and `gfx11` device wheels)
    • PEP 817 variant wheel generation (`--generate-variant-wheel`): adds `variant_properties` markers and `variant.json` with AMD variant provider metadata
    • `rocm_bootstrap` guarded as optional import
    • Replaced loose tuples with `TransformArgs` dataclass
    • Added cross-reference comments between METADATA generation and variant marker parsing
  • `tools/split_python_wheels.py`: CLI with `--wheel-type` presets, `--databases` for handler selection, `--compression`, parallel workers (`-j`), `--generate-variant-wheel`, `--variant-label`
  • `tests/test_wheel_splitter.py`: Comprehensive test suite including variant wheel tests
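For illustration, the extras-based `Requires-Dist` rewrite with hierarchy-aware dependency chains could be sketched roughly as below. The device-wheel naming scheme and helper are assumptions, not the PR's actual implementation:

```python
# Hypothetical sketch: each target arch becomes an extra on the host wheel
# that pulls in its own device wheel plus any shared bundle-key wheel
# (e.g. the gfx1100 extra depends on both gfx1100 and gfx11 device wheels).
def requires_dist_lines(package, version, arch_to_bundles):
    lines = []
    for arch, bundles in sorted(arch_to_bundles.items()):
        for dep in [arch] + list(bundles):
            lines.append(
                f'Requires-Dist: {package}-device-{dep}=={version}; '
                f'extra == "{arch}"'
            )
        lines.append(f"Provides-Extra: {arch}")
    return lines
```

With this shape, `pip install 'torch[gfx1100]'` would resolve both the arch-specific and the shared-bundle device wheels.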

Commit 4: Format C++ kpack runtime ISA target matching code

  • Style-only: comment reflow to 80-col, include reordering (gtest before project headers), clang-format on test assertions and lambda formatting

What gets split

  1. ELF fat binary device code (3.8 GB) → extracted into per-arch `.kpack` archives; host binaries rewritten with kpack load references
  2. Kernel database files (6.2 GB) — rocBLAS, hipBLASLt, hipSPARSELt Tensile libraries, AOTriton images, MIOpen tuning DBs — relocated to per-arch device wheels

Test plan

  • `python -m pytest tests/ -x -q` — 456 tests pass, 2 skipped
  • End-to-end wheel→wheel split of real torch wheel with variant generation
  • Verify host wheel has no arch-specific content remaining
  • Verify device wheels contain correct kpack + database files per arch

🤖 Generated with Claude Code

stellaraccident force-pushed the users/stellaraccident/wheel-splitter branch from 48f9ab0 to a5f3834 on March 11, 2026 at 01:34
stellaraccident and others added 4 commits March 10, 2026 18:41
ELF surgery:
- Add ElfSurgery.has_fatbin_section() static method for lightweight
  .hip_fatbin detection by scanning section headers only
- Fix PHDR in-place write: when spare PT_NULL slots exist (common after
  patchelf), write new segment in-place instead of failing. Prior code
  only handled the resize-and-shift path.
- Move test helpers' inline imports to top-level in test_surgery.py

COFF surgery:
- Add CoffSurgery.has_fatbin_section() static method for .hip_fat
  section detection (8-char PE section name limit)

kpack_transform:
- Add is_fat_binary() cross-format dispatcher using the new fast-path
  static methods instead of loading full binary
- Remove elf_has_hip_fatbin() from wheel_splitter (now uses is_fat_binary)

binutils/ccob_parser:
- Remove catch-all exception handlers that silently swallowed errors
- Let subprocess and parsing failures propagate naturally
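The in-place PHDR fix relies on finding a spare slot in the program header table. A minimal sketch of that scan (function name and framing assumed; the real code operates inside ElfSurgery):

```python
import struct

def find_spare_phdr_slot(data, e_phoff, e_phentsize, e_phnum):
    """Scan the program header table for a PT_NULL entry (p_type == 0),
    which patchelf commonly leaves behind.  A spare slot lets a new
    segment be written in place instead of resizing and shifting the
    whole (possibly multi-GB) file.  Hypothetical sketch."""
    PT_NULL = 0
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        p_type = struct.unpack_from("<I", data, off)[0]
        if p_type == PT_NULL:
            return i  # index of the reusable slot
    return None  # no spare slot; fall back to resize-and-shift
```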

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New plugin architecture for detecting architecture-specific kernel database
files during wheel splitting. Each handler knows its library's file layout
and extracts the GPU architecture from filenames/paths.

Handlers:
- rocBLAS: lib/rocblas/library/*_gfx*.{co,hsaco,dat}
- hipBLASLt: lib/hipblaslt/library/*_gfx*.{co,hsaco,dat}
- hipSPARSELt: lib/hipsparselt/library/*_gfx*.{co,hsaco,dat}
- AOTriton: lib/aotriton.images/amd-gfx*/... with bundle key mapping
  (gfx11xx->gfx11, gfx120x->gfx12_0). Updated to match actual torch
  wheel layout rather than abstract spec.
- MIOpen: share/miopen/db/gfx*.{db.txt,fdb.txt,model} with dedicated
  regex for concatenated arch+CU filenames (e.g., gfx90878 = gfx908 +
  78 CUs). Interim handler — MIOpen database format will be adapted for
  better splitting support soon.

Base class enforces path validation contract: detect() raises ValueError
with clear message if path is not under prefix_root, rather than letting
incidental errors from relative_to() propagate.

Wheel-type presets map preset names (e.g., "torch-fat") to the relevant
set of handlers.
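A minimal sketch of the path-validation contract described above (class shape and message wording are assumptions):

```python
from pathlib import Path

class DatabaseHandler:
    """Hypothetical base-class sketch: detect() implementations call
    _relative_path(), which raises a clear ValueError instead of letting
    an incidental error from Path.relative_to() propagate unexplained."""

    def __init__(self, prefix_root):
        self.prefix_root = Path(prefix_root)

    def _relative_path(self, path):
        path = Path(path)
        try:
            return path.relative_to(self.prefix_root)
        except ValueError:
            raise ValueError(
                f"{path} is not under prefix root {self.prefix_root}; "
                "handlers must only be given paths inside the wheel prefix"
            ) from None
```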

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Core wheel splitting implementation that takes a fat torch wheel and
produces:
- Host wheel: shared libraries with device code zeroed out, plus a
  kpack overlay directory containing the kpack manifest
- Device wheels: per-architecture kernel databases and kpack bundles
- Variant wheel (optional): PEP 817 variant metadata for wheelnext-aware
  installers

Key components:
- WheelSplitter class orchestrating the full split pipeline
- Fat binary scanning via is_fat_binary() fast path
- Parallel kpack_offload_binary() transforms with ProcessPoolExecutor
- Database handler integration for arch-specific file routing
- RECORD/METADATA rewriting for valid wheel output
- TransformArgs dataclass for clean parallel dispatch

CLI tool (split_python_wheels.py) with ModuleNotFoundError guard for
the rocm-bootstrap dependency.
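The parallel dispatch can be sketched as below. `TransformArgs` here is a guessed minimal shape (the real dataclass carries more fields), and `transform_one` stands in for the actual `kpack_offload_binary()` transform:

```python
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformArgs:
    """Hypothetical per-binary work item; a frozen dataclass is picklable
    and self-describing, unlike the loose tuples it replaced."""
    binary_path: str
    output_path: str
    arches: tuple

def transform_one(args: TransformArgs):
    # Stand-in for the real kpack_offload_binary() transform; just
    # reports which binary was processed and for how many arches.
    return (args.binary_path, len(args.arches))

def run_parallel(items, jobs=4):
    # Each fat binary is transformed in its own worker process (-j jobs).
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(transform_one, items))
```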

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Formatting-only changes to isa_target_match.h and its test file.
No functional changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
stellaraccident force-pushed the users/stellaraccident/wheel-splitter branch from a5f3834 to d260d7f on March 11, 2026 at 01:44
stellaraccident marked this pull request as ready for review on March 11, 2026 at 01:45

@nunnikri I would recommend a commit-by-commit review. I have already done a line-by-line review of the whole and one set of fixes. The wheel splitter is the most intricate part and, as such things go, is a lot of code (but fairly straight-line).

if ehdr[4] != 2: # ELFCLASS64
return False

e_shoff = struct.unpack_from("<Q", ehdr, 40)[0]

        # All offsets and sizes below follow the System V ABI for ELF64.
        # e_shoff - file offset of the Section Header Table (SHT); at offset 0x28 (40) in the ELF header
        # e_shentsize - size in bytes of each section header entry (usually 64)
        # e_shnum - number of section header entries
        # e_shstrndx - index of the section header string table within the SHT
        e_shoff = struct.unpack_from("<Q", ehdr, 40)[0]

Returns False for non-ELF files (wrong magic, too small, 32-bit).
Raises on I/O errors or corrupt ELF headers.
"""
with open(file_path, "rb") as f:

    file_size = os.path.getsize(file_path)
    if file_size < ELF64_EHDR_SIZE:
        return False

    with open(file_path, "rb") as f:

if e_shoff == 0 or e_shnum == 0 or e_shstrndx >= e_shnum:
return False

# Read section header table

        # Bounds check to avoid any file overrun
        total_sht_size = e_shentsize * e_shnum
        if (e_shoff > file_size or
            total_sht_size > file_size or
            (e_shoff + total_sht_size) > file_size):
            return False

        # Read section header table

strtab_entry = e_shstrndx * e_shentsize
sh_offset = struct.unpack_from("<Q", shtab, strtab_entry + 24)[0]
sh_size = struct.unpack_from("<Q", shtab, strtab_entry + 32)[0]
f.seek(sh_offset)

        # Bounds check .shstrtab region
        if (sh_offset > file_size or
            sh_size > file_size or
            sh_offset + sh_size > file_size):
            return False

        # Read .shstrtab contents
        f.seek(sh_offset)

nunnikri (Contributor) left a comment:

Overall looks good to me. Suggested some checks to make the code more robust.

)

binary_path.unlink()
shutil.move(str(temp_output), str(binary_path))

Since this is dealing with big files, would it be better to use os.replace, as it gives more atomicity?
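For reference, the suggested pattern writes to a temp file in the same directory and then calls `os.replace()`, a single atomic rename on the same filesystem, so a crash mid-write never leaves a truncated binary behind. Helper name and shape are illustrative:

```python
import os
import tempfile

def atomic_overwrite(binary_path, data):
    """Illustrative sketch of the reviewer's suggestion: stage the new
    contents next to the target, then atomically rename over it."""
    dir_name = os.path.dirname(os.path.abspath(binary_path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)  # same filesystem as target
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, binary_path)  # atomic; replaces existing file
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)  # clean up the staged file on failure
        raise
```

Unlike `unlink()` followed by `shutil.move()`, there is no window where the destination path is missing or half-written.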

