Add Python wheel splitter for host/device separation#3855
stellaraccident wants to merge 4 commits into develop
Force-pushed from 48f9ab0 to a5f3834.
ELF surgery:
- Add ElfSurgery.has_fatbin_section() static method for lightweight
.hip_fatbin detection by scanning section headers only
- Fix PHDR in-place write: when spare PT_NULL slots exist (common after
patchelf), write new segment in-place instead of failing. Prior code only
handled the resize-and-shift path.
- Move test helpers' inline imports to top-level in test_surgery.py
COFF surgery:
- Add CoffSurgery.has_fatbin_section() static method for .hip_fat section
detection (8-char PE section name limit)
kpack_transform:
- Add is_fat_binary() cross-format dispatcher using the new fast-path
static methods instead of loading full binary
- Remove elf_has_hip_fatbin() from wheel_splitter (now uses is_fat_binary)
binutils/ccob_parser:
- Remove catch-all exception handlers that silently swallowed errors
- Let subprocess and parsing failures propagate naturally
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New plugin architecture for detecting architecture-specific kernel database
files during wheel splitting. Each handler knows its library's file layout
and extracts the GPU architecture from filenames/paths.
Handlers:
- rocBLAS: lib/rocblas/library/*_gfx*.{co,hsaco,dat}
- hipBLASLt: lib/hipblaslt/library/*_gfx*.{co,hsaco,dat}
- hipSPARSELt: lib/hipsparselt/library/*_gfx*.{co,hsaco,dat}
- AOTriton: lib/aotriton.images/amd-gfx*/... with bundle key mapping
(gfx11xx->gfx11, gfx120x->gfx12_0). Updated to match actual torch
wheel layout rather than abstract spec.
- MIOpen: share/miopen/db/gfx*.{db.txt,fdb.txt,model} with dedicated
regex for concatenated arch+CU filenames (e.g., gfx90878 = gfx908 +
78 CUs). Interim handler — MIOpen database format will be adapted for
better splitting support soon.
Base class enforces path validation contract: detect() raises ValueError
with clear message if path is not under prefix_root, rather than letting
incidental errors from relative_to() propagate.
Wheel-type presets map preset names (e.g., "torch-fat") to the relevant
set of handlers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
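The handler contract described in this commit message might look roughly like the sketch below. It is not the PR's code: the class names, the `detect_relative()` hook, and the regex are hypothetical. The regex assumes gfx9 targets have a 3-character suffix (gfx908, gfx90a, gfx942), gfx10 and later have 4 characters, and the trailing CU count is a run of hex digits.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption (not from the PR): gfx9 targets are "gfx" + 3 chars, gfx10+ are
# "gfx" + 4 chars, and the CU count is the hex-digit run before the first dot.
_MIOPEN_ARCH_RE = re.compile(
    r"^(gfx(?:9[0-9a-f]{2}|1[0-9]{2}[0-9a-f]))([0-9a-f]*)\."
)


class DatabaseHandler:
    """Hypothetical base class that enforces the prefix_root contract."""

    def detect(self, path: Path, prefix_root: Path) -> Optional[str]:
        try:
            rel = path.relative_to(prefix_root)
        except ValueError:
            # Raise a clear message instead of relative_to()'s incidental one.
            raise ValueError(
                f"{path} is not under prefix root {prefix_root}"
            ) from None
        return self.detect_relative(rel)

    def detect_relative(self, rel: Path) -> Optional[str]:
        raise NotImplementedError


class MIOpenHandler(DatabaseHandler):
    """Hypothetical handler for share/miopen/db/gfx*.{db.txt,fdb.txt,model}."""

    def detect_relative(self, rel: Path) -> Optional[str]:
        if rel.parts[:3] != ("share", "miopen", "db"):
            return None
        match = _MIOPEN_ARCH_RE.match(rel.name)
        return match.group(1) if match else None
```

With these assumptions, `gfx90878.db.txt` resolves to `gfx908`, and a path outside `prefix_root` fails fast with a `ValueError` rather than being silently skipped.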
Core wheel splitting implementation that takes a fat torch wheel and
produces:
- Host wheel: shared libraries with device code zeroed out, plus a kpack
overlay directory containing the kpack manifest
- Device wheels: per-architecture kernel databases and kpack bundles
- Variant wheel (optional): PEP 817 variant metadata for wheelnext-aware
installers
Key components:
- WheelSplitter class orchestrating the full split pipeline
- Fat binary scanning via is_fat_binary() fast path
- Parallel kpack_offload_binary() transforms with ProcessPoolExecutor
- Database handler integration for arch-specific file routing
- RECORD/METADATA rewriting for valid wheel output
- TransformArgs dataclass for clean parallel dispatch
CLI tool (split_python_wheels.py) with ModuleNotFoundError guard for the
rocm-bootstrap dependency.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
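RECORD rewriting matters because every file whose bytes change (for example a shared library with device code zeroed out) needs a fresh hash and size in the wheel's RECORD file. The sketch below shows one way to compute a RECORD row in the standard wheel format; `record_row` is a hypothetical helper name, not a function from this PR.

```python
import base64
import hashlib
from typing import Tuple


def record_row(arcname: str, data: bytes) -> Tuple[str, str, str]:
    """Build one RECORD row: (path, "sha256=<digest>", size).

    The digest is URL-safe base64 with trailing "=" padding removed,
    as the wheel format requires.
    """
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return arcname, f"sha256={encoded}", str(len(data))
```

Rows should be written with the `csv` module so that paths containing commas are quoted correctly. The RECORD file's own row carries an empty hash and size.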
Formatting-only changes to isa_target_match.h and its test file. No
functional changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from a5f3834 to d260d7f.
@nunnikri I would recommend a commit-by-commit review. I have already done a line-by-line review of the whole and one set of fixes. The wheel splitter is the most intricate part and, as such things go, is a lot of code (but fairly straight-line).
    if ehdr[4] != 2:  # ELFCLASS64
        return False

    e_shoff = struct.unpack_from("<Q", ehdr, 40)[0]

    # All offsets and sizes below follow the System V ABI for ELF64.
    # e_shoff     - File offset of the Section Header Table (SHT); stored 0x28 (40) bytes into the ELF header.
    # e_shentsize - Size in bytes of each section header entry. Usually 64.
    # e_shnum     - Number of section headers.
    # e_shstrndx  - Index of the section header string table within the SHT.
    e_shoff = struct.unpack_from("<Q", ehdr, 40)[0]
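The offsets in the comment above can be checked against the ELF64 header layout: `e_shoff` is an 8-byte field at offset 40, followed later by three consecutive 2-byte fields at offsets 58, 60, and 62. Here is a minimal sketch of reading all four; the function name is hypothetical, and the little-endian check on `ehdr[5]` is an added assumption that does not appear in the diff.

```python
import struct
from typing import Optional, Tuple

ELF64_EHDR_SIZE = 64  # size of the ELF64 file header in bytes


def read_section_table_fields(ehdr: bytes) -> Optional[Tuple[int, int, int, int]]:
    """Return (e_shoff, e_shentsize, e_shnum, e_shstrndx), or None if the
    buffer is not a little-endian ELF64 header."""
    if len(ehdr) < ELF64_EHDR_SIZE or ehdr[:4] != b"\x7fELF":
        return None
    if ehdr[4] != 2:  # EI_CLASS: 2 == ELFCLASS64
        return None
    if ehdr[5] != 1:  # EI_DATA: 1 == ELFDATA2LSB (assumption: LE only)
        return None
    e_shoff = struct.unpack_from("<Q", ehdr, 40)[0]
    # e_shentsize, e_shnum, e_shstrndx are adjacent 16-bit fields.
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", ehdr, 58)
    return e_shoff, e_shentsize, e_shnum, e_shstrndx
```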
        Returns False for non-ELF files (wrong magic, too small, 32-bit).
        Raises on I/O errors or corrupt ELF headers.
        """
        with open(file_path, "rb") as f:

        file_size = os.path.getsize(file_path)
        if file_size < ELF64_EHDR_SIZE:
            return False
        with open(file_path, "rb") as f:
    if e_shoff == 0 or e_shnum == 0 or e_shstrndx >= e_shnum:
        return False

    # Read section header table

    # Bounds check to avoid any file overrun
    total_sht_size = e_shentsize * e_shnum
    if (e_shoff > file_size or
            total_sht_size > file_size or
            (e_shoff + total_sht_size) > file_size):
        return False
    # Read section header table
    strtab_entry = e_shstrndx * e_shentsize
    sh_offset = struct.unpack_from("<Q", shtab, strtab_entry + 24)[0]
    sh_size = struct.unpack_from("<Q", shtab, strtab_entry + 32)[0]
    f.seek(sh_offset)

    # Bounds check .shstrtab region
    if (sh_offset > file_size or
            sh_size > file_size or
            sh_offset + sh_size > file_size):
        return False
    # Read .shstrtab contents
    f.seek(sh_offset)
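Putting the bounds checks suggested in this review together, a section-name scan could look like the sketch below. It is illustrative, not the PR's `has_fatbin_section()`: `has_section` and `_in_file` are hypothetical names, and it returns False on out-of-range offsets (as the review suggests) rather than raising on corrupt headers as the existing docstring states.

```python
import os
import struct

ELF64_EHDR_SIZE = 64
ELF64_SHDR_SIZE = 64


def _in_file(offset: int, size: int, file_size: int) -> bool:
    """True if [offset, offset + size) lies entirely within the file."""
    return offset <= file_size and size <= file_size - offset


def has_section(file_path: str, wanted: bytes) -> bool:
    """Scan ELF64 section headers for a section named `wanted`."""
    file_size = os.path.getsize(file_path)
    if file_size < ELF64_EHDR_SIZE:
        return False
    with open(file_path, "rb") as f:
        ehdr = f.read(ELF64_EHDR_SIZE)
        if ehdr[:4] != b"\x7fELF" or ehdr[4] != 2:  # magic, ELFCLASS64
            return False
        e_shoff = struct.unpack_from("<Q", ehdr, 40)[0]
        e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", ehdr, 58)
        if e_shoff == 0 or e_shnum == 0 or e_shstrndx >= e_shnum:
            return False
        if e_shentsize < ELF64_SHDR_SIZE:
            return False
        # Bounds check the section header table before reading it.
        if not _in_file(e_shoff, e_shentsize * e_shnum, file_size):
            return False
        f.seek(e_shoff)
        shtab = f.read(e_shentsize * e_shnum)

        # Locate and bounds check the .shstrtab region.
        strtab_entry = e_shstrndx * e_shentsize
        sh_offset, sh_size = struct.unpack_from("<QQ", shtab, strtab_entry + 24)
        if not _in_file(sh_offset, sh_size, file_size):
            return False
        f.seek(sh_offset)
        shstrtab = f.read(sh_size)

    for i in range(e_shnum):
        sh_name = struct.unpack_from("<I", shtab, i * e_shentsize)[0]
        if sh_name >= len(shstrtab):
            continue  # name offset points outside the string table
        end = shstrtab.find(b"\0", sh_name)
        if end == -1:
            end = len(shstrtab)
        if shstrtab[sh_name:end] == wanted:
            return True
    return False
```

Only the 64-byte header, the section header table, and the string table are read, so the cost stays small even for multi-gigabyte shared libraries.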
nunnikri left a comment
Overall looks good to me. Suggested some checks to make the code more robust.
    )

    binary_path.unlink()
    shutil.move(str(temp_output), str(binary_path))

Since this deals with big files, would it be better to use os.replace, which gives more atomicity?
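For reference, a replace based on `os.replace` could look like the sketch below. `replace_atomically` is a hypothetical helper, not code from the PR. It assumes the temporary file can be created in the same directory as the target, since `os.replace` fails across filesystems whereas `shutil.move` falls back to copying.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def replace_atomically(binary_path: Path, new_contents: bytes) -> None:
    """Swap in new contents without a window where binary_path is missing."""
    # Create the temp file next to the target so both are on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=binary_path.parent, prefix=binary_path.name + "."
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_contents)
        # Atomic on POSIX when source and destination share a filesystem;
        # it also overwrites an existing destination on Windows.
        os.replace(tmp_name, binary_path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

Unlike the `unlink()` followed by `shutil.move()` sequence, an interruption here leaves either the old file or the new one in place, never neither.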
Summary
Adds a split_python_wheels tool that post-processes fat PyTorch ROCm wheels into a host wheel (device code stripped) and per-arch device wheels containing .kpack archives and arch-specific database files. Also generates PEP 817 variant wheels for wheelnext-aware installers.

Tested against torch-2.10.0+rocm7.1 (5.1 GB compressed): host wheel shrinks to 431 MB, with 16 device wheels generated. A user installing for a single GPU downloads 87% less for lightweight targets (gfx1100: 672 MB), 67% less for the heaviest (gfx942: 2.0 GB).

Full split report: https://gist.github.com/stellaraccident/105d58437d903218c476594c2ba33cce
Catalog of changes
Please review per-commit. This PR will be landed as a merge commit to preserve history.
Commit 1: Add fast-path fatbin detection and fix PHDR in-place write regression
Commit 2: Add database handlers for wheel splitting with arch-aware file detection
Commit 3: Add Python wheel splitter for host/device separation
Commit 4: Format C++ kpack runtime ISA target matching code
What gets split
Test plan
🤖 Generated with Claude Code