Conversation
@ymweiss ymweiss commented Dec 10, 2025

No description provided.

ymweiss added 16 commits November 16, 2025 09:17

This pass lowers GPU dialect operations to LLVM dialect with Vortex-specific
RISC-V intrinsics. It converts operations like gpu.thread_id and gpu.block_id
to RISC-V CSR (Control and Status Register) reads via inline assembly.

Key features:
- Lower gpu.thread_id to CSR read from VX_CSR_THREAD_ID (0xCC0)
- Lower gpu.block_id to CSR read from VX_CSR_WARP_ID (0xCC1, placeholder)
- Uses LLVM inline assembly with proper RISC-V csrr instruction format
- Includes pattern rewriting infrastructure for extensibility

This enables the HIP-to-Vortex compilation flow by providing the critical
lowering step from platform-independent GPU operations to Vortex hardware
intrinsics.

Replace CSR-based implementation with proper TLS variable access for
thread and block IDs. The Vortex spawn framework stores threadIdx and
blockIdx as __thread (TLS) variables, not CSRs.

Changes:
- Add helper functions to create and access TLS dim3_t globals
- Update ThreadIdOpLowering to access threadIdx TLS variable
- Update BlockIdOpLowering to access blockIdx TLS variable
- Support all 3 dimensions (x, y, z) via GEP into dim3_t struct
- Update tests to verify TLS access instead of CSR inline assembly

Technical details:
- threadIdx/blockIdx are external TLS globals of type {i32, i32, i32}
- Generated LLVM IR uses llvm.mlir.addressof + getelementptr + load
- Links against vx_spawn.c which provides the actual TLS definitions
- Matches Vortex spawn framework architecture from vx_spawn.h

Fixes: TODO at line 53 (3D thread ID calculation)
Fixes: TODO at line 92 (blockIdx TLS access)
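The {i32, i32, i32} dim3_t layout described above determines both the GEP index and the byte offset used for each dimension. A minimal sketch of that lookup (illustrative helper names, not the actual pass API):

```python
# Sketch of the dim3_t field lookup used when lowering gpu.thread_id /
# gpu.block_id to a TLS load. Assumes the {i32, i32, i32} layout above;
# names are illustrative, not the pass's real data structures.
DIM3_FIELDS = {"x": 0, "y": 1, "z": 2}

def dim3_gep_index(dim: str) -> int:
    """GEP index into the dim3_t struct for dimension x/y/z."""
    return DIM3_FIELDS[dim]

def dim3_byte_offset(dim: str) -> int:
    """Byte offset of the field, given 4-byte i32 members."""
    return 4 * DIM3_FIELDS[dim]
```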

Enable CUDA/HIP kernel syntax parsing without requiring CUDA toolkit
installation. This is essential for Phase 2A work where we parse HIP
code to extract GPU dialect IR using Polygeist.

Changes:
- Add POLYGEIST_ENABLE_CUDA_SYNTAX_ONLY CMake option
- Add POLYGEIST_CUDA_FRONTEND_ONLY internal flag
- Skip CUDA execution engine when in syntax-only mode
- Separate POLYGEIST_ENABLE_CUDA (syntax) from POLYGEIST_CUDA_FULL (runtime)
- Add stub for createGpuSerializeToCubinPass when CUDA toolkit unavailable

Technical details:
- Syntax-only mode enables Clang's CUDA frontend without linking CUDA libs
- Execution engine requires actual CUDA toolkit (not needed for IR generation)
- Driver and cgeist conditionally compile CUDA runtime code with POLYGEIST_CUDA_FULL
- Allows building on systems without CUDA while still parsing .hip/.cu files

This enables the HIP→GPU dialect→Vortex compilation pipeline without
requiring NVIDIA CUDA toolkit installation.

Implement Developer A's thread model operations using test-driven development:

Changes:
- Add BlockDimOpLowering pattern for gpu.block_dim operations
- Add GridDimOpLowering pattern for gpu.grid_dim operations
- Support all 3 dimensions (x, y, z) via TLS global access
- Create comprehensive FileCheck test suite in gpu_to_vortex_thread_model.mlir

Implementation details:
- blockDim and gridDim accessed as regular globals (not thread-local)
- Reuse createDim3TLSAccess helper for consistency
- Tests verify GEP indices (0=x, 1=y, 2=z) and load operations

TODOs:
- gpu.barrier implementation (commented out, needs UnrealizedConversionCastOp fix)
- gpu.launch_func (Developer A responsibility per work distribution)

Tests pass for blockDim/gridDim operations. Barrier will be completed
in next commit after resolving type conversion issues.

Add BarrierOpLowering pattern that converts GPU dialect barrier operations to
Vortex hardware barrier calls:
- Allocate unique barrier IDs for each gpu.barrier operation
- Calculate total threads from blockDim.x * blockDim.y * blockDim.z
- Generate vx_barrier(bar_id, num_threads) calls
- Declare external vx_barrier function as needed

Update FileCheck tests to verify:
- blockDim/gridDim lowering with correct GEP indices
- Barrier lowering with thread count calculation
- Multiple barriers receive distinct barrier IDs
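The barrier bookkeeping described above can be sketched as follows. This is an illustrative model, not the pass's actual code: each gpu.barrier draws a fresh ID from a counter, and the participant count is the product of the block dimensions. (Note that a later commit in this PR replaces the thread count with a warp count.)

```python
import itertools

# Sketch of barrier lowering bookkeeping: each gpu.barrier gets a
# distinct ID, and the second argument is blockDim.x * blockDim.y *
# blockDim.z. Illustrative only.
_barrier_ids = itertools.count()

def lower_barrier(block_dim):
    """Return the vx_barrier call text for one gpu.barrier op."""
    bar_id = next(_barrier_ids)
    num_threads = block_dim[0] * block_dim[1] * block_dim[2]
    return f"vx_barrier({bar_id}, {num_threads})"
```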

…ves and remove duplicate kernels

Preprocessing steps before GPU-to-Vortex conversion:
1. Consolidate polygeist.alternatives - keep only first variant (32 threads)
2. Remove duplicate kernel functions - eliminate auto-tuning variants

This eliminates need for external Python filter script and ensures clean
single-variant IR for downstream processing.

Verified: 6 variants → 1 kernel function + 1 launch call
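The two preprocessing steps can be sketched as below (hypothetical helpers; the real pass operates on MLIR operations, not strings):

```python
# Sketch of the preprocessing before GPU-to-Vortex conversion:
# keep only the first polygeist.alternatives variant, then drop
# duplicate kernel functions. Illustrative only.
def consolidate(variants):
    """Keep the first auto-tuning variant, discard the rest."""
    return variants[:1]

def dedup_kernels(kernels):
    """Drop duplicate kernel functions, preserving first occurrence."""
    seen, out = set(), []
    for k in kernels:
        if k not in seen:
            seen.add(k)
            out.append(k)
    return out
```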

…GPUToVortex

Preprocessing (before pattern application):
- Consolidate polygeist.alternatives → keep first variant (32 threads)
- Remove duplicate kernel functions from auto-tuning

GPU-to-Vortex lowering:
- gpu.thread_id → llvm.call @vx_get_threadIdx() + GEP + load
- gpu.block_id → llvm.call @vx_get_blockIdx() + GEP + load
- gpu.block_dim → llvm.call @vx_get_blockDim() + GEP + load
- gpu.grid_dim → llvm.call @vx_get_gridDim() + GEP + load
- gpu.barrier → llvm.call @vx_barrier(bar_id, num_threads)

Accessor functions declared within gpu.module for symbol visibility.
Unrealized casts (index↔i32) will be resolved by subsequent passes.

Verified: alternatives removed, 1 kernel, TLS accessors working

…PUToVortex

Add metadata extraction pattern that analyzes gpu.launch_func operations and
attaches kernel argument information as MLIR attributes. This metadata is
required for the host-side kernel launch lowering (30% remaining work).

Implementation details:
- Extracts kernel name, argument count, types, and offsets
- Assumes RV32 ABI: all arguments (pointers and scalars) are 4 bytes
- Calculates total argument struct size for vx_upload_bytes()
- Attaches metadata as vortex.kernel_metadata string attribute
- Runs as separate greedy rewrite pass after main conversion

This pattern completes the metadata extraction phase of Developer A's
kernel launch infrastructure work. The metadata will be used by the
LaunchFuncOpLowering pattern (to be implemented) to generate the
vx_upload_bytes() call with correct struct layout.

Example output:
  gpu.launch_func @module::@kernel args(...)
    {vortex.kernel_metadata = "Kernel: kernel\nNum args: 3\n..."}

Also adds basic_hip_kernel.hip test file for verification.
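The RV32 layout calculation described above (every argument, pointer or scalar, is 4 bytes) can be sketched as:

```python
# Sketch of the RV32 argument-layout calculation: offsets are assigned
# in order at 4 bytes each, and the total size is what would be passed
# to vx_upload_bytes(). Illustrative, not the pass's data structures.
RV32_ARG_SIZE = 4

def layout_args(arg_types):
    """Return per-argument byte offsets and total struct size."""
    offsets = [i * RV32_ARG_SIZE for i in range(len(arg_types))]
    total_size = len(arg_types) * RV32_ARG_SIZE
    return offsets, total_size
```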

Implement PrintfOpLowering pattern that transforms standard printf calls
to Vortex-specific vx_printf calls with core ID injection.

Transformation:
- Matches: llvm.call @printf(format, args...)
- Inserts: vx_core_id() call to get current core ID
- Replaces with: llvm.call @vx_printf(format, cid, args...)

This matches the Vortex kernel API where printf requires core ID as
first argument after format string: vx_printf(format, cid, ...)

Function declarations (vx_core_id, vx_printf) are added to gpu.module
to ensure proper symbol resolution during lowering.

Test results: 21/22 kernels successfully lower (printf_kernel verified)

- Add metadata_output_dir option to pass configuration
- Generate .meta.json files with kernel argument metadata
- Generate _args.h C headers for type-safe kernel arguments
- Track argument names, types, sizes, offsets, and pointer status
- Support RV32 architecture (4-byte pointers)

This enables the runtime to understand kernel argument layouts
without compile-time coupling between host and device code.
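A minimal sketch of the .meta.json emission described above, assuming the RV32 4-byte layout; the field names here are illustrative and may not match the generated schema exactly:

```python
import json

# Sketch of .meta.json generation: one record per kernel argument with
# name, type, size, offset, and pointer flag. Field names illustrative.
def kernel_metadata(kernel, args):
    """args: list of (name, type_str, is_pointer) tuples."""
    records, offset = [], 0
    for name, ty, is_ptr in args:
        records.append({"name": name, "type": ty, "size": 4,
                        "offset": offset, "is_pointer": is_ptr})
        offset += 4  # RV32: pointers and scalars are both 4 bytes
    return json.dumps({"kernel": kernel, "args": records})
```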

The vx_barrier(barrier_id, num_warps) signature expects warp count,
not thread count. Changed barrier lowering to call vx_num_warps()
instead of computing blockDim.x * blockDim.y * blockDim.z.

Two fixes to the GPUToVortex lowering pass:

1. Barrier lowering: Declare vx_barrier and vx_num_warps functions
   inside gpu.module instead of top-level module, so they are visible
   to kernel code during compilation.

2. Printf lowering: Remove incorrect core_id insertion. vx_printf has
   the same signature as standard printf (no core_id parameter).
   Previously the pass was corrupting printf arguments.

This pass generates the Vortex-specific entry point after gpu-to-llvm lowering:

- main() function that reads args from VX_CSR_MSCRATCH (0x340) via inline asm
- kernel_body(void* args) wrapper that unpacks arguments from offset 24
- vx_spawn_threads declaration for thread dispatch

The args struct layout is:
- bytes 0-11: grid_dim[3]
- bytes 12-23: block_dim[3]
- bytes 24+: user kernel arguments

Usage: polygeist-opt --generate-vortex-main (run after --gpu-to-llvm)
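The args struct layout above can be illustrated with a small host-side decoder. This is a sketch of the layout only, not the generated main()/kernel_body code:

```python
import struct

# Sketch of the args-struct layout read by the generated entry point:
# bytes 0-11 hold grid_dim[3], bytes 12-23 hold block_dim[3], and user
# kernel arguments start at byte 24. Little-endian u32, per RV32.
USER_ARGS_OFFSET = 24

def unpack_header(buf: bytes):
    """Split an args buffer into (grid_dim, block_dim, user_args)."""
    grid_dim = struct.unpack_from("<3I", buf, 0)
    block_dim = struct.unpack_from("<3I", buf, 12)
    return grid_dim, block_dim, buf[USER_ARGS_OFFSET:]
```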

LLVM 18+ does not accept struct-wrapped result types for inline
assembly with a single output. Changed from:
  call { i32 } asm sideeffect "csrr $0, 0x340", "=r"()
  extractvalue { i32 } %1, 0

To:
  call i32 asm sideeffect "csrr $0, 0x340", "=r"()

This fixes compilation errors when generating Vortex binaries with
the upgraded llvm-vortex (LLVM 18.1.7).

…match

- GenerateVortexMain.cpp: Derive arg0/arg1 from block_dim[0] header instead
  of user args. Polygeist adds these as leading scalars during GPU lowering.

- ConvertGPUToVortex.cpp: Skip first 2 leading scalar args in metadata
  generation since they come from block_dim header, not user args.
  Emit metadata only for user-provided arguments.

This fixes the argument marshaling mismatch where the kernel expected
6 transformed arguments but host was passing 4 HIP-style arguments.

Polygeist reorders kernel arguments (scalars before pointers) when lowering
to GPU dialect. This commit extracts the original argument order from the
host wrapper function and includes it in the kernel metadata JSON.

The mapping allows the host stub generator to create launchers that accept
arguments in original order (matching hipLaunchKernelGGL) and reorder them
when packing for the device kernel.
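The scalars-before-pointers permutation can be sketched as follows (hypothetical helper; the real mapping is carried in the metadata JSON):

```python
# Sketch of the reorder mapping: Polygeist places scalars before
# pointers, so compute a permutation from the original (HIP-style)
# host argument order into the device kernel's order. Illustrative.
def device_order(args):
    """args: list of (name, is_pointer) in original host order.
    Returns indices into the original list, scalars first."""
    scalars = [i for i, (_, is_ptr) in enumerate(args) if not is_ptr]
    pointers = [i for i, (_, is_ptr) in enumerate(args) if is_ptr]
    return scalars + pointers
```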

Also fixes extractBaseKernelName to handle kernel names that contain "_kernel"
(e.g., vecadd_kernel_kernel94... now correctly extracts vecadd_kernel).
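The fixed name extraction can be sketched by stripping only the last "_kernel<suffix>" mangling, so base names that themselves contain "_kernel" survive. A sketch of the idea, not the pass's exact code:

```python
# Sketch of the fixed extractBaseKernelName behavior: strip only the
# LAST "_kernel<suffix>" component, so a base name like "vecadd_kernel"
# is preserved. Illustrative only.
def extract_base_kernel_name(mangled: str) -> str:
    idx = mangled.rfind("_kernel")
    return mangled[:idx] if idx > 0 else mangled
```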
@ymweiss ymweiss closed this Dec 10, 2025