
Add Python wheel splitter for host/device separation#3855

Open
stellaraccident wants to merge 4 commits into develop from users/stellaraccident/wheel-splitter

Conversation

stellaraccident (Contributor) commented Mar 7, 2026

Summary

Adds a `split_python_wheels` tool that post-processes fat PyTorch ROCm wheels into a host wheel (device code stripped) and per-arch device wheels containing `.kpack` archives and arch-specific database files. Also generates PEP 817 variant wheels for wheelnext-aware installers.

Tested against torch-2.10.0+rocm7.1 (5.1 GB compressed): host wheel shrinks to 431 MB, with 16 device wheels generated. A user installing for a single GPU downloads 87% less for lightweight targets (gfx1100: 672 MB), 67% less for the heaviest (gfx942: 2.0 GB).

Full split report: https://gist.github.com/stellaraccident/105d58437d903218c476594c2ba33cce

Catalog of changes

Please review per-commit. This PR will be landed as a merge commit to preserve history.

Commit 1: Add fast-path fatbin detection and fix PHDR in-place write regression

  • `elf/surgery.py`: Add `ElfSurgery.has_fatbin_section()` — lightweight `.hip_fatbin` section scan reading only ELF headers (a few KB vs loading multi-GB binaries)
  • `coff/surgery.py`: Add `CoffSurgery.has_fatbin_section()` — equivalent for PE/COFF `.hip_fat` sections
  • `kpack_transform.py`: `is_fat_binary()` now dispatches to fast-path methods instead of loading full binaries
  • `tests/elf/test_surgery.py`: Move inline imports to top-level; add docstring noting that spare PHDR slots (tested by `_build_synthetic_elf`) are commonly the result of prior patchelf transformations
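For context, the header-only scan described above can be sketched as follows. This is a hypothetical minimal version (function name and details assumed), not the actual `ElfSurgery.has_fatbin_section()` implementation:

```python
import struct

def has_fatbin_section(file_path, section_name=b".hip_fatbin"):
    """Header-only scan: reads the ELF header, section header table, and
    .shstrtab; never loads section payloads, so it stays cheap on
    multi-GB binaries.  Sketch only; details are assumptions."""
    with open(file_path, "rb") as f:
        ehdr = f.read(64)  # ELF64 header is 64 bytes
        if len(ehdr) < 64 or ehdr[:4] != b"\x7fELF" or ehdr[4] != 2:
            return False  # not a 64-bit ELF
        e_shoff = struct.unpack_from("<Q", ehdr, 40)[0]
        e_shentsize = struct.unpack_from("<H", ehdr, 58)[0]
        e_shnum = struct.unpack_from("<H", ehdr, 60)[0]
        e_shstrndx = struct.unpack_from("<H", ehdr, 62)[0]
        if e_shoff == 0 or e_shnum == 0 or e_shstrndx >= e_shnum:
            return False
        f.seek(e_shoff)
        shtab = f.read(e_shentsize * e_shnum)
        # Locate .shstrtab via sh_offset (+24) and sh_size (+32)
        base = e_shstrndx * e_shentsize
        str_off = struct.unpack_from("<Q", shtab, base + 24)[0]
        str_size = struct.unpack_from("<Q", shtab, base + 32)[0]
        f.seek(str_off)
        strtab = f.read(str_size)
        # Compare each section's NUL-terminated name against the target
        for i in range(e_shnum):
            name_idx = struct.unpack_from("<I", shtab, i * e_shentsize)[0]
            end = strtab.find(b"\x00", name_idx)
            if strtab[name_idx:end] == section_name:
                return True
    return False
```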

Commit 2: Add database handlers for wheel splitting with arch-aware file detection

  • `database_handlers.py`: Five handlers for arch-specific file detection:
    • RocBLAS/HipBLASLt/HipSparseLt: Tensile library files by gfx pattern in filename
  • AOTriton: Kernel image directories — maps family names (`gfx11xx` → `gfx11`, `gfx120x` → `gfx12_0`) to rocm-bootstrap bundle keys. Coded to match the actual arrangement in the torch wheel (an earlier version followed an abstract spec)
    • MIOpen: Tuning databases with concatenated arch+CU filenames (e.g., `gfx90878` = gfx908 + 78 CUs). Uses explicit arch-ID regex. This handler is interim — we plan to adapt the MIOpen database format to be more amenable to splitting soon
  • Removed catch-all exception handlers from all `detect()` methods; added `_relative_path()` base class method with explicit `ValueError` when path is not under `prefix_root` (contract enforcement vs silent fallthrough)
  • `tests/test_database_handlers.py`: Updated outside-prefix tests to assert explicit error message
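To illustrate the concatenated-filename problem for MIOpen, here is a hypothetical version of the explicit arch-ID regex; the arch list, helper name, and return shape are assumptions, not the PR's actual code:

```python
import re

# Hypothetical: known arch IDs listed explicitly, longest first, so the
# trailing CU count can be split off unambiguously (gfx90878 -> gfx908, 78).
KNOWN_ARCHES = ["gfx942", "gfx90a", "gfx908", "gfx906", "gfx1100", "gfx1030"]
_ARCH_CU_RE = re.compile(
    r"^(%s)(\d+)" % "|".join(sorted(KNOWN_ARCHES, key=len, reverse=True))
)

def parse_miopen_db_name(stem):
    """Split a MIOpen database filename stem into (arch, CU count)."""
    m = _ARCH_CU_RE.match(stem)
    if not m:
        return None
    return m.group(1), int(m.group(2))
```

A naive `gfx\d+` pattern would swallow the CU count into the arch ID, which is why an explicit arch list is needed here.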

Commit 3: Add Python wheel splitter for host/device separation

  • `wheel_splitter.py`: Core `WheelSplitter` class:
    • Scans fat binaries via fast-path `is_fat_binary()` (cross-platform ELF/COFF)
    • Extracts device kernels into per-arch `.kpack` archives with zstd compression
    • Transforms host binaries (HIPF→HIPK magic rewrite, pointer redirection, zero-page)
    • Rewrites host METADATA with extras-based `Requires-Dist` per target architecture
    • Hierarchy-aware bundling: collapses xnack variants into bundle keys, generates correct dependency chains (e.g., `gfx1100` extra depends on both `gfx1100` and `gfx11` device wheels)
    • PEP 817 variant wheel generation (`--generate-variant-wheel`): adds `variant_properties` markers and `variant.json` with AMD variant provider metadata
    • `rocm_bootstrap` guarded as optional import
    • Replaced loose tuples with `TransformArgs` dataclass
    • Added cross-reference comments between METADATA generation and variant marker parsing
  • `tools/split_python_wheels.py`: CLI with `--wheel-type` presets, `--databases` for handler selection, `--compression`, parallel workers (`-j`), `--generate-variant-wheel`, `--variant-label`
  • `tests/test_wheel_splitter.py`: Comprehensive test suite including variant wheel tests
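For illustration, the extras-based `Requires-Dist` rewrite with hierarchy-aware dependency chains could be sketched roughly as below. The device-wheel naming scheme and helper are assumptions, not the PR's actual implementation:

```python
# Hypothetical sketch: each target arch becomes an extra on the host wheel
# that pulls in its own device wheel plus any shared bundle-key wheel
# (e.g. the gfx1100 extra depends on both gfx1100 and gfx11 device wheels).
def requires_dist_lines(package, version, arch_to_bundles):
    lines = []
    for arch, bundles in sorted(arch_to_bundles.items()):
        for dep in [arch] + list(bundles):
            lines.append(
                f'Requires-Dist: {package}-device-{dep}=={version}; '
                f'extra == "{arch}"'
            )
        lines.append(f"Provides-Extra: {arch}")
    return lines
```

With this shape, `pip install 'torch[gfx1100]'` would resolve both the arch-specific and the shared-bundle device wheels.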

Commit 4: Format C++ kpack runtime ISA target matching code

  • Style-only: comment reflow to 80-col, include reordering (gtest before project headers), clang-format on test assertions and lambda formatting

What gets split

  1. ELF fat binary device code (3.8 GB) → extracted into per-arch `.kpack` archives; host binaries rewritten with kpack load references
  2. Kernel database files (6.2 GB) — rocBLAS, hipBLASLt, hipSPARSELt Tensile libraries, AOTriton images, MIOpen tuning DBs — relocated to per-arch device wheels

Test plan

  • `python -m pytest tests/ -x -q` — 456 tests pass, 2 skipped
  • End-to-end wheel→wheel split of real torch wheel with variant generation
  • Verify host wheel has no arch-specific content remaining
  • Verify device wheels contain correct kpack + database files per arch

🤖 Generated with Claude Code

stellaraccident force-pushed the users/stellaraccident/wheel-splitter branch from 48f9ab0 to a5f3834 on March 11, 2026 at 01:34
stellaraccident and others added 4 commits March 10, 2026 18:41
ELF surgery:
- Add ElfSurgery.has_fatbin_section() static method for lightweight
  .hip_fatbin detection by scanning section headers only
- Fix PHDR in-place write: when spare PT_NULL slots exist (common after
  patchelf), write new segment in-place instead of failing. Prior code
  only handled the resize-and-shift path.
- Move test helpers' inline imports to top-level in test_surgery.py

COFF surgery:
- Add CoffSurgery.has_fatbin_section() static method for .hip_fat
  section detection (8-char PE section name limit)

kpack_transform:
- Add is_fat_binary() cross-format dispatcher using the new fast-path
  static methods instead of loading full binary
- Remove elf_has_hip_fatbin() from wheel_splitter (now uses is_fat_binary)

binutils/ccob_parser:
- Remove catch-all exception handlers that silently swallowed errors
- Let subprocess and parsing failures propagate naturally
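The in-place PHDR fix relies on finding a spare slot in the program header table. A minimal sketch of that scan (function name and framing assumed; the real code operates inside ElfSurgery):

```python
import struct

def find_spare_phdr_slot(data, e_phoff, e_phentsize, e_phnum):
    """Scan the program header table for a PT_NULL entry (p_type == 0),
    which patchelf commonly leaves behind.  A spare slot lets a new
    segment be written in place instead of resizing and shifting the
    whole (possibly multi-GB) file.  Hypothetical sketch."""
    PT_NULL = 0
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        p_type = struct.unpack_from("<I", data, off)[0]
        if p_type == PT_NULL:
            return i  # index of the reusable slot
    return None  # no spare slot; fall back to resize-and-shift
```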

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New plugin architecture for detecting architecture-specific kernel database
files during wheel splitting. Each handler knows its library's file layout
and extracts the GPU architecture from filenames/paths.

Handlers:
- rocBLAS: lib/rocblas/library/*_gfx*.{co,hsaco,dat}
- hipBLASLt: lib/hipblaslt/library/*_gfx*.{co,hsaco,dat}
- hipSPARSELt: lib/hipsparselt/library/*_gfx*.{co,hsaco,dat}
- AOTriton: lib/aotriton.images/amd-gfx*/... with bundle key mapping
  (gfx11xx->gfx11, gfx120x->gfx12_0). Updated to match actual torch
  wheel layout rather than abstract spec.
- MIOpen: share/miopen/db/gfx*.{db.txt,fdb.txt,model} with dedicated
  regex for concatenated arch+CU filenames (e.g., gfx90878 = gfx908 +
  78 CUs). Interim handler — MIOpen database format will be adapted for
  better splitting support soon.

Base class enforces path validation contract: detect() raises ValueError
with clear message if path is not under prefix_root, rather than letting
incidental errors from relative_to() propagate.

Wheel-type presets map preset names (e.g., "torch-fat") to the relevant
set of handlers.
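A minimal sketch of the path-validation contract described above (class shape and message wording are assumptions):

```python
from pathlib import Path

class DatabaseHandler:
    """Hypothetical base-class sketch: detect() implementations call
    _relative_path(), which raises a clear ValueError instead of letting
    an incidental error from Path.relative_to() propagate unexplained."""

    def __init__(self, prefix_root):
        self.prefix_root = Path(prefix_root)

    def _relative_path(self, path):
        path = Path(path)
        try:
            return path.relative_to(self.prefix_root)
        except ValueError:
            raise ValueError(
                f"{path} is not under prefix root {self.prefix_root}; "
                "handlers must only be given paths inside the wheel prefix"
            ) from None
```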

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Core wheel splitting implementation that takes a fat torch wheel and
produces:
- Host wheel: shared libraries with device code zeroed out, plus a
  kpack overlay directory containing the kpack manifest
- Device wheels: per-architecture kernel databases and kpack bundles
- Variant wheel (optional): PEP 817 variant metadata for wheelnext-aware
  installers

Key components:
- WheelSplitter class orchestrating the full split pipeline
- Fat binary scanning via is_fat_binary() fast path
- Parallel kpack_offload_binary() transforms with ProcessPoolExecutor
- Database handler integration for arch-specific file routing
- RECORD/METADATA rewriting for valid wheel output
- TransformArgs dataclass for clean parallel dispatch

CLI tool (split_python_wheels.py) with ModuleNotFoundError guard for
the rocm-bootstrap dependency.
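The parallel dispatch can be sketched as below. `TransformArgs` here is a guessed minimal shape (the real dataclass carries more fields), and `transform_one` stands in for the actual `kpack_offload_binary()` transform:

```python
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformArgs:
    """Hypothetical per-binary work item; a frozen dataclass is picklable
    and self-describing, unlike the loose tuples it replaced."""
    binary_path: str
    output_path: str
    arches: tuple

def transform_one(args: TransformArgs):
    # Stand-in for the real kpack_offload_binary() transform; just
    # reports which binary was processed and for how many arches.
    return (args.binary_path, len(args.arches))

def run_parallel(items, jobs=4):
    # Each fat binary is transformed in its own worker process (-j jobs).
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(transform_one, items))
```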

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Formatting-only changes to isa_target_match.h and its test file.
No functional changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
stellaraccident force-pushed the users/stellaraccident/wheel-splitter branch from a5f3834 to d260d7f on March 11, 2026 at 01:44
stellaraccident marked this pull request as ready for review on March 11, 2026 at 01:45

@nunnikri I would recommend a commit-by-commit review. I have already done a line-by-line review of the whole and one set of fixes. The wheel splitter is the most intricate part and, as such things go, is a lot of code (but fairly straight-line).

if ehdr[4] != 2: # ELFCLASS64
return False

e_shoff = struct.unpack_from("<Q", ehdr, 40)[0]

        # All offsets and sizes below follow the System V ABI for ELF64.
        # e_shoff - file offset of the Section Header Table (SHT); at offset 0x28 (40) in the ELF header
        # e_shentsize - size in bytes of each section header entry (usually 64)
        # e_shnum - number of section header entries
        # e_shstrndx - index of the section header string table within the SHT
        e_shoff = struct.unpack_from("<Q", ehdr, 40)[0]

Returns False for non-ELF files (wrong magic, too small, 32-bit).
Raises on I/O errors or corrupt ELF headers.
"""
with open(file_path, "rb") as f:

    file_size = os.path.getsize(file_path)
    if file_size < ELF64_EHDR_SIZE:
        return False

    with open(file_path, "rb") as f:

if e_shoff == 0 or e_shnum == 0 or e_shstrndx >= e_shnum:
return False

# Read section header table

        # Bounds check to avoid any file overrun
        total_sht_size = e_shentsize * e_shnum
        if (e_shoff > file_size or
            total_sht_size > file_size or
            (e_shoff + total_sht_size) > file_size):
            return False

        # Read section header table

strtab_entry = e_shstrndx * e_shentsize
sh_offset = struct.unpack_from("<Q", shtab, strtab_entry + 24)[0]
sh_size = struct.unpack_from("<Q", shtab, strtab_entry + 32)[0]
f.seek(sh_offset)

        # Bounds check .shstrtab region
        if (sh_offset > file_size or
            sh_size > file_size or
            sh_offset + sh_size > file_size):
            return False

        # Read .shstrtab contents
        f.seek(sh_offset)

nunnikri (Contributor) left a comment:

Overall looks good to me. Suggested some checks to make the code more robust.

)

binary_path.unlink()
shutil.move(str(temp_output), str(binary_path))

Since this is dealing with big files, would it be better to use os.replace, as it gives more atomicity?
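For reference, the suggested pattern writes to a temp file in the same directory and then calls `os.replace()`, a single atomic rename on the same filesystem, so a crash mid-write never leaves a truncated binary behind. Helper name and shape are illustrative:

```python
import os
import tempfile

def atomic_overwrite(binary_path, data):
    """Illustrative sketch of the reviewer's suggestion: stage the new
    contents next to the target, then atomically rename over it."""
    dir_name = os.path.dirname(os.path.abspath(binary_path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)  # same filesystem as target
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, binary_path)  # atomic; replaces existing file
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)  # clean up the staged file on failure
        raise
```

Unlike `unlink()` followed by `shutil.move()`, there is no window where the destination path is missing or half-written.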

