[AMDGPU] Implement llvm.sponentry #176357
Conversation
In some of our use cases, the GPU runtime stores some data at the top of the stack. It figures out where it's safe to store it by using the PAL metadata generated by the backend, which includes the total stack size. However, the metadata does not include the space reserved at the bottom of the stack for the trap handler when CWSR is enabled in dynamic VGPR mode. This space is reserved dynamically based on whether or not the code is running on the compute queue. Therefore, the runtime needs a way to take that into account.

Add a new intrinsic, `llvm.amdgcn.get.stack.base`, which returns the offset of the "actual" stack, skipping over any reserved areas. This allows us to keep this computation in one place rather than duplicate it between the backend and the runtime.

The implementation uses a pseudo that is expanded to the same code sequence as that used in the prolog to set up the stack in the first place. The intrinsic can be called from arbitrary code, so we can't be sure that the value computed in the prolog is still in FP. There's some potential for optimization here but it's left as an exercise for the future since this is pretty much guaranteed to be called only on the cold path.
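At the IR level, a minimal usage sketch (mirroring the tests added in this patch; the function name is illustrative):

declare i32 @llvm.amdgcn.get.stack.base()

define amdgpu_cs i32 @report_stack_base(i32 %val) {
  ; An alloca forces the function to set up its own stack; the returned
  ; offset is only meaningful in functions that do so.
  %local = alloca i32, addrspace(5)
  store volatile i32 %val, ptr addrspace(5) %local
  ; 0 when nothing is reserved, otherwise the size of the scratch area
  ; reserved for CWSR in dynamic VGPR mode.
  %base = call i32 @llvm.amdgcn.get.stack.base()
  ret i32 %base
}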
✅ With the latest revision this PR passed the C/C++ code formatter.
@llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-backend-amdgpu

Author: Diana Picus (rovka)

Changes: (same as the PR description above)

Full diff: https://github.com/llvm/llvm-project/pull/176357.diff

8 Files Affected:
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index a8eba9ed126b7..66bd5b0c44b1e 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3799,6 +3799,12 @@ def int_amdgcn_cooperative_atomic_store_16x8B : AMDGPUCooperativeAtomicStore<llv
def int_amdgcn_cooperative_atomic_load_8x16B : AMDGPUCooperativeAtomicLoad<llvm_v4i32_ty>;
def int_amdgcn_cooperative_atomic_store_8x16B : AMDGPUCooperativeAtomicStore<llvm_v4i32_ty>;
+// Return the offset for the actual base of the stack, skipping over any
+// reserved areas (e.g. the area reserved for saving the dynamic VGPRs when CWSR
+// is active). The returned value only makes sense in functions that set up
+// their own stack.
+def int_amdgcn_get_stack_base : PureIntrinsic<[llvm_i32_ty]>;
+
//===----------------------------------------------------------------------===//
// Special Intrinsics for backend internal use only. No frontend
// should emit calls to these.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 7470fecd3c03f..888f801f950ef 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4882,6 +4882,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
if (Subtarget.hasSALUFloatInsts() && isSALUMapping(MI))
return getDefaultMappingSOP(MI);
return getDefaultMappingVOP(MI);
+ case Intrinsic::amdgcn_get_stack_base:
case Intrinsic::amdgcn_kernarg_segment_ptr:
case Intrinsic::amdgcn_s_getpc:
case Intrinsic::amdgcn_groupstaticsize:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
index 58a9b5511f2d0..c85c8f566bef9 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
@@ -411,6 +411,7 @@ def : AlwaysUniform<int_amdgcn_s_getpc>;
def : AlwaysUniform<int_amdgcn_s_getreg>;
def : AlwaysUniform<int_amdgcn_s_memrealtime>;
def : AlwaysUniform<int_amdgcn_s_memtime>;
+def : AlwaysUniform<int_amdgcn_get_stack_base>;
def AMDGPUImageDMaskIntrinsicTable : GenericTable {
let FilterClass = "AMDGPUImageDMaskIntrinsic";
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index ec3e720ef8887..03c3ec5f0168b 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -724,14 +724,7 @@ void SIFrameLowering::emitEntryFunctionPrologue(MachineFunction &MF,
FrameInfo.getMaxAlign());
MFI->setScratchReservedForDynamicVGPRs(VGPRSize);
- BuildMI(MBB, I, DL, TII->get(AMDGPU::S_GETREG_B32), FPReg)
- .addImm(AMDGPU::Hwreg::HwregEncoding::encode(
- AMDGPU::Hwreg::ID_HW_ID2, AMDGPU::Hwreg::OFFSET_ME_ID, 2));
- // The MicroEngine ID is 0 for the graphics queue, and 1 or 2 for compute
- // (3 is unused, so we ignore it). Unfortunately, S_GETREG doesn't set
- // SCC, so we need to check for 0 manually.
- BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CMP_LG_U32)).addImm(0).addReg(FPReg);
- BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CMOVK_I32), FPReg).addImm(VGPRSize);
+ BuildMI(MBB, I, DL, TII->get(AMDGPU::GET_STACK_BASE), FPReg);
if (requiresStackPointerReference(MF)) {
Register SPReg = MFI->getStackPtrOffsetReg();
assert(SPReg != AMDGPU::SP_REG);
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 057f4adcafd62..c9f111f0d9d86 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2537,7 +2537,7 @@ bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
}
break;
- case AMDGPU::V_MAX_BF16_PSEUDO_e64:
+ case AMDGPU::V_MAX_BF16_PSEUDO_e64: {
assert(ST.hasBF16PackedInsts());
MI.setDesc(get(AMDGPU::V_PK_MAX_NUM_BF16));
MI.addOperand(MachineOperand::CreateImm(0)); // op_sel
@@ -2550,6 +2550,38 @@ bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
break;
}
+ case AMDGPU::GET_STACK_BASE:
+ // The stack starts at offset 0 unless we need to reserve some space at the
+ // bottom.
+ if (ST.getFrameLowering()->mayReserveScratchForCWSR(*MBB.getParent())) {
+ // When CWSR is used in dynamic VGPR mode, the trap handler needs to save
+ // some of the VGPRs. The size of the required scratch space has already
+ // been computed by prolog epilog insertion.
+ const SIMachineFunctionInfo *MFI =
+ MBB.getParent()->getInfo<SIMachineFunctionInfo>();
+ unsigned VGPRSize = MFI->getScratchReservedForDynamicVGPRs();
+ Register DestReg = MI.getOperand(0).getReg();
+ BuildMI(MBB, MI, DL, get(AMDGPU::S_GETREG_B32), DestReg)
+ .addImm(AMDGPU::Hwreg::HwregEncoding::encode(
+ AMDGPU::Hwreg::ID_HW_ID2, AMDGPU::Hwreg::OFFSET_ME_ID, 2));
+ // The MicroEngine ID is 0 for the graphics queue, and 1 or 2 for compute
+ // (3 is unused, so we ignore it). Unfortunately, S_GETREG doesn't set
+ // SCC, so we need to check for 0 manually.
+ BuildMI(MBB, MI, DL, get(AMDGPU::S_CMP_LG_U32)).addImm(0).addReg(DestReg);
+ MI.setDesc(get(AMDGPU::S_CMOVK_I32));
+ MI.addOperand(MachineOperand::CreateImm(VGPRSize));
+ // Change the implicit-def of SCC to an explicit use (but first remove
+ // the dead flag if present).
+ MI.getOperand(MI.getNumExplicitOperands()).setIsDead(false);
+ MI.getOperand(MI.getNumExplicitOperands()).setIsUse();
+ } else {
+ MI.setDesc(get(AMDGPU::S_MOV_B32));
+ MI.addOperand(MachineOperand::CreateImm(0));
+ MI.removeOperand(MI.getNumExplicitOperands()); // Drop implicit def of SCC.
+ }
+ break;
+ }
+
return true;
}
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index e06bc912113a8..83685b630075e 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -926,6 +926,7 @@ multiclass si_cs_chain_tc_dvgpr_patterns<
defm : si_cs_chain_tc_dvgpr_patterns<i32>; // On GFX12, dVGPR mode is wave32-only.
+let Defs = [SCC] in {
def ADJCALLSTACKUP : SPseudoInstSI<
(outs), (ins i32imm:$amt0, i32imm:$amt1),
[(callseq_start timm:$amt0, timm:$amt1)],
@@ -935,7 +936,6 @@ def ADJCALLSTACKUP : SPseudoInstSI<
let hasSideEffects = 1;
let usesCustomInserter = 1;
let SchedRW = [WriteSALU];
- let Defs = [SCC];
}
def ADJCALLSTACKDOWN : SPseudoInstSI<
@@ -946,9 +946,16 @@ def ADJCALLSTACKDOWN : SPseudoInstSI<
let hasSideEffects = 1;
let usesCustomInserter = 1;
let SchedRW = [WriteSALU];
- let Defs = [SCC];
}
+// Get the offset of the base of the stack, skipping any reserved areas.
+def GET_STACK_BASE : SPseudoInstSI<(outs SGPR_32:$dst), (ins),
+ [(set SGPR_32:$dst, (int_amdgcn_get_stack_base))]> {
+ let hasSideEffects = 0;
+ let SchedRW = [WriteSALU];
+}
+} // End Defs = [SCC]
+
let Defs = [M0, EXEC, SCC],
UseNamedOperandTable = 1 in {
diff --git a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll
index 9ff670bee0f89..e77d88255acc6 100644
--- a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll
+++ b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll
@@ -199,6 +199,13 @@ define void @s_memrealtime(ptr addrspace(1) inreg %out) {
ret void
}
+; CHECK-LABEL: for function 'get_stack_base':
+; CHECK: ALL VALUES UNIFORM
+define amdgpu_cs void @get_stack_base(ptr addrspace(1) inreg %out) {
+ %v = call i32 @llvm.amdgcn.get.stack.base()
+ store i32 %v, ptr addrspace(1) %out
+ ret void
+}
declare i32 @llvm.amdgcn.workitem.id.x() #0
declare i32 @llvm.amdgcn.readfirstlane(i32) #0
@@ -216,6 +223,7 @@ declare i32 @llvm.amdgcn.cluster.workgroup.max.id.x()
declare i32 @llvm.amdgcn.cluster.workgroup.max.id.y()
declare i32 @llvm.amdgcn.cluster.workgroup.max.id.z()
declare i32 @llvm.amdgcn.cluster.workgroup.max.flat.id()
+declare i32 @llvm.amdgcn.get.stack.base()
attributes #0 = { nounwind readnone }
attributes #1 = { nounwind readnone convergent }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.get.stack.base.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.get.stack.base.ll
new file mode 100644
index 0000000000000..eace3f778515a
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.get.stack.base.ll
@@ -0,0 +1,101 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=-real-true16 < %s | FileCheck %s
+; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck %s
+; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx1200 -mattr=-real-true16 < %s | FileCheck %s
+
+; Test that the llvm.amdgcn.get.stack.base intrinsic returns the correct value:
+; - for functions that need to reserve space for CWSR, it should return the offset
+; past the reserved area (i.e. the offset of the first spill or local variables)
+; - for functions that don't reserve any space, it should return 0
+
+define amdgpu_cs i32 @stack_base_cs_dvgpr_16(i32 %val) #0 {
+; CHECK-LABEL: stack_base_cs_dvgpr_16:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_getreg_b32 s33, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_getreg_b32 s0, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_cmp_lg_u32 0, s33
+; CHECK-NEXT: s_cmovk_i32 s33, 0x1c0
+; CHECK-NEXT: s_cmp_lg_u32 0, s0
+; CHECK-NEXT: scratch_store_b32 off, v0, s33 scope:SCOPE_SYS
+; CHECK-NEXT: s_wait_storecnt 0x0
+; CHECK-NEXT: s_cmovk_i32 s0, 0x1c0
+; CHECK-NEXT: ; return to shader part epilog
+ %local = alloca i32, addrspace(5)
+ store volatile i32 %val, ptr addrspace(5) %local
+ %stack.base = call i32 @llvm.amdgcn.get.stack.base()
+ ret i32 %stack.base
+}
+
+define amdgpu_cs i32 @stack_base_cs_dvgpr_32(i32 %val) #1 {
+; CHECK-LABEL: stack_base_cs_dvgpr_32:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_getreg_b32 s33, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_getreg_b32 s0, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_cmp_lg_u32 0, s33
+; CHECK-NEXT: s_cmovk_i32 s33, 0x380
+; CHECK-NEXT: s_cmp_lg_u32 0, s0
+; CHECK-NEXT: scratch_store_b32 off, v0, s33 scope:SCOPE_SYS
+; CHECK-NEXT: s_wait_storecnt 0x0
+; CHECK-NEXT: s_cmovk_i32 s0, 0x380
+; CHECK-NEXT: ; return to shader part epilog
+ %local = alloca i32, addrspace(5)
+ store volatile i32 %val, ptr addrspace(5) %local
+ %stack.base = call i32 @llvm.amdgcn.get.stack.base()
+ ret i32 %stack.base
+}
+
+define amdgpu_cs i32 @stack_base_cs_no_dvgpr(i32 %val) #2 {
+; CHECK-LABEL: stack_base_cs_no_dvgpr:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_mov_b32 s0, 0
+; CHECK-NEXT: scratch_store_b32 off, v0, off scope:SCOPE_SYS
+; CHECK-NEXT: s_wait_storecnt 0x0
+; CHECK-NEXT: ; return to shader part epilog
+ %local = alloca i32, addrspace(5)
+ store volatile i32 %val, ptr addrspace(5) %local
+ %stack.base = call i32 @llvm.amdgcn.get.stack.base()
+ ret i32 %stack.base
+}
+
+define amdgpu_cs i32 @stack_base_cs_dvgpr_control_flow(i32 %val) #0 {
+; CHECK-LABEL: stack_base_cs_dvgpr_control_flow:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_getreg_b32 s33, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_mov_b32 s0, exec_lo
+; CHECK-NEXT: s_cmp_lg_u32 0, s33
+; CHECK-NEXT: s_cmovk_i32 s33, 0x1c0
+; CHECK-NEXT: scratch_store_b32 off, v0, s33 scope:SCOPE_SYS
+; CHECK-NEXT: s_wait_storecnt 0x0
+; CHECK-NEXT: v_cmpx_gt_i32_e32 0x43, v0
+; CHECK-NEXT: ; %bb.1: ; %if.then
+; CHECK-NEXT: s_getreg_b32 s1, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT: s_cmp_lg_u32 0, s1
+; CHECK-NEXT: s_cmovk_i32 s1, 0x1c0
+; CHECK-NEXT: v_mov_b32_e32 v0, s1
+; CHECK-NEXT: ; %bb.2: ; %if.end
+; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; CHECK-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT: v_readfirstlane_b32 s0, v0
+; CHECK-NEXT: s_wait_alu depctr_va_sdst(0)
+; CHECK-NEXT: ; return to shader part epilog
+entry:
+ %local = alloca i32, addrspace(5)
+ store volatile i32 %val, ptr addrspace(5) %local
+ %which = icmp slt i32 %val, 67
+ br i1 %which, label %if.then, label %if.end
+
+if.then:
+ %stack.base = call i32 @llvm.amdgcn.get.stack.base()
+ br label %if.end
+
+if.end:
+ %ret = phi i32 [ %stack.base, %if.then ], [ %val, %entry ]
+ ret i32 %ret
+}
+
+
+attributes #0 = { nounwind "amdgpu-dynamic-vgpr-block-size"="16" }
+attributes #1 = { nounwind "amdgpu-dynamic-vgpr-block-size"="32" }
+attributes #2 = { nounwind "amdgpu-dynamic-vgpr-block-size"="0" }
arsenm
left a comment
There are generic intrinsics for this that should be implemented instead of adding a new one
nikic
left a comment
In case you're not aware, llvm.stackaddress was recently added (https://llvm.org/docs/LangRef.html#llvm-stackaddress-intrinsic). I'm not sure whether or not it exactly matches the semantics you want.
Oh, thanks!
arsenm
left a comment
In a follow-up, can you make sure computeKnownBits knows about the alignment and that the top bits must be 0?
// For everything else, create a dummy stack object.
EVT VT = getPointerTy(DAG.getDataLayout(), AMDGPUAS::PRIVATE_ADDRESS);
int FI = MF.getFrameInfo().CreateFixedObject(1, 0, /*IsImmutable=*/false);
return DAG.getFrameIndex(FI, VT);
Suggested change:
-  return DAG.getFrameIndex(FI, VT);
+  return DAG.getFrameIndex(FI, Op.getValueType());
You should almost never need to use getPointerTy; usually the correct type is implied by the original operation. This also avoids asserting if someone uses the wrong address space for the call.
Ack, thanks!
def AMDGPUsponentry : SDNode<
  "ISD::SPONENTRY", SDTypeProfile <1, 0, [SDTCisPtrTy<0>]>
>;
Should just move this to the generic code
Ok, done! I wasn't sure if that was welcome; most targets can probably get away with the same trick as AArch64.
// Get the offset of the base of the stack, skipping any reserved areas.
def GET_STACK_BASE : SPseudoInstSI<(outs SGPR_32:$dst), (ins),
  [(set p5:$dst, (AMDGPUsponentry))]> {
  let FixedSize = 0;
This does not expand to 0 bytes. I think it's at least 12?
That's actually a boolean. But I realized now that it doesn't mean exactly what I thought it meant, so I just put the worst case size instead.
}

; CHECK: ScratchSize: 16
Can you test what happens if you use this with the wrong addrspace / p0? It should not crash
I added a check for the address space in the Verifier, similar to what we have for allocas.
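Roughly, the distinction that check draws at the IR level (a sketch, assuming it mirrors the alloca rule and keys off the datalayout's alloca address space; the .p5/.p0 suffixes follow the usual overloaded-pointer mangling):

target datalayout = "A5"

define void @ok() {
  ; Accepted: the result is in the alloca address space (5 on AMDGPU).
  %p = call ptr addrspace(5) @llvm.sponentry.p5()
  ret void
}

define void @bad() {
  ; Expected to be rejected by the new check: the result is not in the
  ; alloca address space.
  %q = call ptr @llvm.sponentry.p0()
  ret void
}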
// FIXME: The imported pattern checks for i32 instead of p5; if we fix
// that we can remove this cast.
const LLT S32 = LLT::scalar(32);
If you add an explicit p5 does it work? I thought this was a solved problem
I'm not sure where to add the explicit p5. Did you mean in the TableGen definition?
The pattern in SIInstructions.td is already written with p5 for the output, but the generated code in AMDGPUGenGlobalISel.inc looks like this (note the GILLT_s32):
/* 2023706 */ // Label 168: @2023706
/* 2023706 */ GIM_Try, /*On fail goto*//*Label 29065*/ GIMT_Encode4(2023731), // Rule ID 4295 //
/* 2023711 */ GIM_RootCheckType, /*Op*/0, /*Type*/GILLT_s32,
/* 2023714 */ GIM_RootCheckRegBankForClass, /*Op*/0, /*RC*/GIMT_Encode2(AMDGPU::SGPR_32RegClassID),
/* 2023718 */ // (sponentry:{ *:[i32] }) => (GET_STACK_BASE:{ *:[i32] }:{ *:[i1] })
/* 2023718 */ GIR_MutateOpcode, /*InsnID*/0, /*RecycleInsnID*/0, /*Opcode*/GIMT_Encode2(AMDGPU::GET_STACK_BASE),
/* 2023723 */ GIR_AddImplicitDef, /*InsnID*/0, GIMT_Encode2(AMDGPU::SCC), GIMT_Encode2(static_cast<unsigned>(RegState::Dead)),
/* 2023729 */ GIR_RootConstrainSelectedInstOperands,
/* 2023730 */ // GIR_Coverage, 4295,
/* 2023730 */ GIR_Done,
Did you have some workaround in mind?
For reference, this is what I was trying to select without the cast: LLVM ERROR: cannot select: %2:sreg_32(p5) = G_AMDGPU_SPONENTRY (in function: sponentry_cs_dvgpr_16).
@@ -0,0 +1,64 @@
; RUN: not opt -mtriple=amdgcn -mcpu=gfx1250 -passes=verify -disable-output <%s 2>&1 | FileCheck %s
Verifier tests should use llvm-as, not opt. Also, you don't need -mcpu or the triple; this can go off the datalayout.
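A sketch of the suggested test shape (the exact flags are illustrative and the diagnostic text is omitted, since it depends on the wording chosen in the Verifier):

; RUN: not llvm-as -disable-output < %s 2>&1 | FileCheck %s

; No triple or -mcpu: the alloca address space is taken from the datalayout
; ("A5" puts allocas in address space 5).
target datalayout = "A5"

define void @wrong_addrspace() {
  ; A CHECK line matching the verifier's diagnostic would go here.
  %p = call ptr @llvm.sponentry.p0()
  ret void
}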
arsenm
left a comment
lgtm with test cleanup
Maybe update https://llvm.org/docs/LangRef.html#id420 as well?
In some of our use cases, the GPU runtime stores some data at the top of
the stack. It figures out where it's safe to store it by using the PAL
metadata generated by the backend, which includes the total stack size.
However, the metadata does not include the space reserved at the bottom
of the stack for the trap handler when CWSR is enabled in dynamic VGPR
mode. This space is reserved dynamically based on whether or not the
code is running on the compute queue. Therefore, the runtime needs a way
to take that into account.
Add support for llvm.sponentry, which should return the base of the stack,
skipping over any reserved areas. This allows us to keep this computation in
one place rather than duplicate it between the backend and the runtime.
The implementation for functions that set up their own stack uses a pseudo
that is expanded to the same code sequence as that used in the prolog to
set up the stack in the first place.
In callable functions, we generate a fixed stack object and use that instead,
similar to the Arm/AArch64 approach. This wastes some stack space but that's
not a problem for now because we're not planning to use this in callable
functions yet.
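For reference, a minimal IR-level sketch of the final form (function name illustrative; the result is a pointer in the alloca address space, hence the .p5 overload on AMDGPU):

declare ptr addrspace(5) @llvm.sponentry.p5()

define amdgpu_cs i32 @sponentry_offset(i32 %val) {
  ; Force the function to set up its own stack.
  %local = alloca i32, addrspace(5)
  store volatile i32 %val, ptr addrspace(5) %local
  ; Base of the usable stack, past any area reserved for CWSR in dVGPR mode.
  %base = call ptr addrspace(5) @llvm.sponentry.p5()
  %off = ptrtoint ptr addrspace(5) %base to i32
  ret i32 %off
}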