Yaakov/phase 2 b a #443
Closed · +2,025 −4

Conversation
This pass lowers GPU dialect operations to LLVM dialect with Vortex-specific RISC-V intrinsics. It converts operations like gpu.thread_id and gpu.block_id to RISC-V CSR (Control and Status Register) reads via inline assembly.
Key features:
- Lower gpu.thread_id to a CSR read from VX_CSR_THREAD_ID (0xCC0)
- Lower gpu.block_id to a CSR read from VX_CSR_WARP_ID (0xCC1, placeholder)
- Uses LLVM inline assembly with the proper RISC-V csrr instruction format
- Includes pattern rewriting infrastructure for extensibility
This enables the HIP-to-Vortex compilation flow by providing the critical lowering step from platform-independent GPU operations to Vortex hardware intrinsics.
Replace CSR-based implementation with proper TLS variable access for
thread and block IDs. The Vortex spawn framework stores threadIdx and
blockIdx as __thread (TLS) variables, not CSRs.
Changes:
- Add helper functions to create and access TLS dim3_t globals
- Update ThreadIdOpLowering to access threadIdx TLS variable
- Update BlockIdOpLowering to access blockIdx TLS variable
- Support all 3 dimensions (x, y, z) via GEP into dim3_t struct
- Update tests to verify TLS access instead of CSR inline assembly
Technical details:
- threadIdx/blockIdx are external TLS globals of type {i32, i32, i32}
- Generated LLVM IR uses llvm.mlir.addressof + getelementptr + load
- Links against vx_spawn.c which provides the actual TLS definitions
- Matches Vortex spawn framework architecture from vx_spawn.h
Fixes: TODO at line 53 (3D thread ID calculation)
Fixes: TODO at line 92 (blockIdx TLS access)
Enable CUDA/HIP kernel syntax parsing without requiring a CUDA toolkit installation. This is essential for Phase 2A work, where we parse HIP code to extract GPU dialect IR using Polygeist.
Changes:
- Add POLYGEIST_ENABLE_CUDA_SYNTAX_ONLY CMake option
- Add POLYGEIST_CUDA_FRONTEND_ONLY internal flag
- Skip the CUDA execution engine when in syntax-only mode
- Separate POLYGEIST_ENABLE_CUDA (syntax) from POLYGEIST_CUDA_FULL (runtime)
- Add a stub for createGpuSerializeToCubinPass when the CUDA toolkit is unavailable
Technical details:
- Syntax-only mode enables Clang's CUDA frontend without linking CUDA libs
- The execution engine requires an actual CUDA toolkit (not needed for IR generation)
- Driver and cgeist conditionally compile CUDA runtime code with POLYGEIST_CUDA_FULL
- Allows building on systems without CUDA while still parsing .hip/.cu files
This enables the HIP→GPU dialect→Vortex compilation pipeline without requiring an NVIDIA CUDA toolkit installation.
Implement test-driven development (TDD) for Developer A thread model operations.
Changes:
- Add BlockDimOpLowering pattern for gpu.block_dim operations
- Add GridDimOpLowering pattern for gpu.grid_dim operations
- Support all 3 dimensions (x, y, z) via TLS global access
- Create comprehensive FileCheck test suite in gpu_to_vortex_thread_model.mlir
Implementation details:
- blockDim and gridDim accessed as regular globals (not thread-local)
- Reuse createDim3TLSAccess helper for consistency
- Tests verify GEP indices (0=x, 1=y, 2=z) and load operations
TODOs:
- gpu.barrier implementation (commented out, needs UnrealizedConversionCastOp fix)
- gpu.launch_func (Developer A responsibility per work distribution)
Tests pass for blockDim/gridDim operations. Barrier will be completed in the next commit after resolving type conversion issues.
Add BarrierOpLowering pattern that converts GPU dialect barrier operations to Vortex hardware barrier calls:
- Allocate a unique barrier ID for each gpu.barrier operation
- Calculate total threads from blockDim.x * blockDim.y * blockDim.z
- Generate vx_barrier(bar_id, num_threads) calls
- Declare the external vx_barrier function as needed
Update FileCheck tests to verify:
- blockDim/gridDim lowering with correct GEP indices
- Barrier lowering with thread count calculation
- Multiple barriers receive distinct barrier IDs
…ves and remove duplicate kernels
Preprocessing steps before GPU-to-Vortex conversion:
1. Consolidate polygeist.alternatives - keep only the first variant (32 threads)
2. Remove duplicate kernel functions - eliminate auto-tuning variants
This eliminates the need for an external Python filter script and ensures clean single-variant IR for downstream processing.
Verified: 6 variants → 1 kernel function + 1 launch call
…GPUToVortex
Preprocessing (before pattern application):
- Consolidate polygeist.alternatives → keep first variant (32 threads)
- Remove duplicate kernel functions from auto-tuning
GPU-to-Vortex lowering:
- gpu.thread_id → llvm.call @vx_get_threadIdx() + GEP + load
- gpu.block_id → llvm.call @vx_get_blockIdx() + GEP + load
- gpu.block_dim → llvm.call @vx_get_blockDim() + GEP + load
- gpu.grid_dim → llvm.call @vx_get_gridDim() + GEP + load
- gpu.barrier → llvm.call @vx_barrier(bar_id, num_threads)
Accessor functions are declared within gpu.module for symbol visibility. Unrealized casts (index↔i32) will be resolved by subsequent passes.
Verified: alternatives removed, 1 kernel, TLS accessors working
…PUToVortex
Add metadata extraction pattern that analyzes gpu.launch_func operations and attaches kernel argument information as MLIR attributes. This metadata is required for the host-side kernel launch lowering (the remaining 30% of the work).
Implementation details:
- Extracts kernel name, argument count, types, and offsets
- Assumes RV32 ABI: all arguments (pointers and scalars) are 4 bytes
- Calculates total argument struct size for vx_upload_bytes()
- Attaches metadata as a vortex.kernel_metadata string attribute
- Runs as a separate greedy rewrite pass after the main conversion
This pattern completes the metadata extraction phase of Developer A's kernel launch infrastructure work. The metadata will be used by the LaunchFuncOpLowering pattern (to be implemented) to generate the vx_upload_bytes() call with the correct struct layout.
Example output:
gpu.launch_func @module::@kernel args(...) {vortex.kernel_metadata = "Kernel: kernel\nNum args: 3\n..."}
Also adds basic_hip_kernel.hip test file for verification.
Implement PrintfOpLowering pattern that transforms standard printf calls to Vortex-specific vx_printf calls with core ID injection.
Transformation:
- Matches: llvm.call @printf(format, args...)
- Inserts: vx_core_id() call to get the current core ID
- Replaces with: llvm.call @vx_printf(format, cid, args...)
This matches the Vortex kernel API, where printf requires the core ID as the first argument after the format string: vx_printf(format, cid, ...)
Function declarations (vx_core_id, vx_printf) are added to gpu.module to ensure proper symbol resolution during lowering.
Test results: 21/22 kernels successfully lower (printf_kernel verified)
- Add metadata_output_dir option to pass configuration
- Generate .meta.json files with kernel argument metadata
- Generate _args.h C headers for type-safe kernel arguments
- Track argument names, types, sizes, offsets, and pointer status
- Support RV32 architecture (4-byte pointers)
This enables the runtime to understand kernel argument layouts without compile-time coupling between host and device code.
The vx_barrier(barrier_id, num_warps) signature expects a warp count, not a thread count. Changed the barrier lowering to call vx_num_warps() instead of computing blockDim.x * blockDim.y * blockDim.z.
Two fixes to the GPUToVortex lowering pass:
1. Barrier lowering: Declare vx_barrier and vx_num_warps functions inside gpu.module instead of the top-level module, so they are visible to kernel code during compilation.
2. Printf lowering: Remove the incorrect core_id insertion. vx_printf has the same signature as standard printf (no core_id parameter). Previously the pass was corrupting printf arguments.
This pass generates the Vortex-specific entry point after gpu-to-llvm lowering:
- main() function that reads args from VX_CSR_MSCRATCH (0x340) via inline asm
- kernel_body(void* args) wrapper that unpacks arguments from offset 24
- vx_spawn_threads declaration for thread dispatch
The args struct layout is:
- bytes 0-11: grid_dim[3]
- bytes 12-23: block_dim[3]
- bytes 24+: user kernel arguments
Usage: polygeist-opt --generate-vortex-main (run after --gpu-to-llvm)
LLVM 18+ does not accept struct-wrapped result types for inline assembly with a single output. Changed from:

    call { i32 } asm sideeffect "csrr $0, 0x340", "=r"()
    extractvalue { i32 } %1, 0

To:

    call i32 asm sideeffect "csrr $0, 0x340", "=r"()

This fixes compilation errors when generating Vortex binaries with the upgraded llvm-vortex (LLVM 18.1.7).
…match
- GenerateVortexMain.cpp: Derive arg0/arg1 from the block_dim[0] header instead of user args. Polygeist adds these as leading scalars during GPU lowering.
- ConvertGPUToVortex.cpp: Skip the first 2 leading scalar args in metadata generation, since they come from the block_dim header, not user args. Emit metadata only for user-provided arguments.
This fixes the argument marshaling mismatch where the kernel expected 6 transformed arguments but the host was passing 4 HIP-style arguments.
Polygeist reorders kernel arguments (scalars before pointers) when lowering to GPU dialect. This commit extracts the original argument order from the host wrapper function and includes it in the kernel metadata JSON. The mapping allows the host stub generator to create launchers that accept arguments in original order (matching hipLaunchKernelGGL) and reorder them when packing for the device kernel. Also fixes extractBaseKernelName to handle kernel names that contain "_kernel" (e.g., vecadd_kernel_kernel94... now correctly extracts vecadd_kernel).