[AMDGPU] Implement llvm.sponentry #176357
Conversation
In some of our use cases, the GPU runtime stores some data at the top of the stack. It figures out where it's safe to store it by using the PAL metadata generated by the backend, which includes the total stack size. However, the metadata does not include the space reserved at the bottom of the stack for the trap handler when CWSR is enabled in dynamic VGPR mode. This space is reserved dynamically based on whether or not the code is running on the compute queue. Therefore, the runtime needs a way to take that into account.

Add a new intrinsic, `llvm.amdgcn.get.stack.base`, which returns the offset of the "actual" stack, skipping over any reserved areas. This allows us to keep this computation in one place rather than duplicate it between the backend and the runtime.

The implementation uses a pseudo that is expanded to the same code sequence as that used in the prolog to set up the stack in the first place. The intrinsic can be called from arbitrary code, so we can't be sure that the value computed in the prolog is still in FP. There's some potential for optimization here but it's left as an exercise for the future since this is pretty much guaranteed to be called only on the cold path.
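At the IR level, a minimal usage sketch (mirroring the tests added in this patch; the function name is illustrative):

declare i32 @llvm.amdgcn.get.stack.base()

define amdgpu_cs i32 @report_stack_base(i32 %val) {
  ; An alloca forces the function to set up its own stack; the returned
  ; offset is only meaningful in functions that do so.
  %local = alloca i32, addrspace(5)
  store volatile i32 %val, ptr addrspace(5) %local
  ; 0 when nothing is reserved, otherwise the size of the scratch area
  ; reserved for CWSR in dynamic VGPR mode.
  %base = call i32 @llvm.amdgcn.get.stack.base()
  ret i32 %base
}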
✅ With the latest revision this PR passed the C/C++ code formatter.
@llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-backend-amdgpu

Author: Diana Picus (rovka)

Changes: (same as the PR description above)

Full diff: https://github.com/llvm/llvm-project/pull/176357.diff

8 Files Affected:
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index a8eba9ed126b7..66bd5b0c44b1e 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3799,6 +3799,12 @@ def int_amdgcn_cooperative_atomic_store_16x8B : AMDGPUCooperativeAtomicStore<llv
def int_amdgcn_cooperative_atomic_load_8x16B : AMDGPUCooperativeAtomicLoad<llvm_v4i32_ty>;
def int_amdgcn_cooperative_atomic_store_8x16B : AMDGPUCooperativeAtomicStore<llvm_v4i32_ty>;
+// Return the offset for the actual base of the stack, skipping over any
+// reserved areas (e.g. the area reserved for saving the dynamic VGPRs when CWSR
+// is active). The returned value only makes sense in functions that set up
+// their own stack.
+def int_amdgcn_get_stack_base : PureIntrinsic<[llvm_i32_ty]>;
+
//===----------------------------------------------------------------------===//
// Special Intrinsics for backend internal use only. No frontend
// should emit calls to these.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 7470fecd3c03f..888f801f950ef 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4882,6 +4882,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
if (Subtarget.hasSALUFloatInsts() && isSALUMapping(MI))
return getDefaultMappingSOP(MI);
return getDefaultMappingVOP(MI);
+ case Intrinsic::amdgcn_get_stack_base:
case Intrinsic::amdgcn_kernarg_segment_ptr:
case Intrinsic::amdgcn_s_getpc:
case Intrinsic::amdgcn_groupstaticsize:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
index 58a9b5511f2d0..c85c8f566bef9 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
@@ -411,6 +411,7 @@ def : AlwaysUniform<int_amdgcn_s_getpc>;
def : AlwaysUniform<int_amdgcn_s_getreg>;
def : AlwaysUniform<int_amdgcn_s_memrealtime>;
def : AlwaysUniform<int_amdgcn_s_memtime>;
+def : AlwaysUniform<int_amdgcn_get_stack_base>;
def AMDGPUImageDMaskIntrinsicTable : GenericTable {
let FilterClass = "AMDGPUImageDMaskIntrinsic";
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index ec3e720ef8887..03c3ec5f0168b 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -724,14 +724,7 @@ void SIFrameLowering::emitEntryFunctionPrologue(MachineFunction &MF,
FrameInfo.getMaxAlign());
MFI->setScratchReservedForDynamicVGPRs(VGPRSize);
- BuildMI(MBB, I, DL, TII->get(AMDGPU::S_GETREG_B32), FPReg)
- .addImm(AMDGPU::Hwreg::HwregEncoding::encode(
- AMDGPU::Hwreg::ID_HW_ID2, AMDGPU::Hwreg::OFFSET_ME_ID, 2));
- // The MicroEngine ID is 0 for the graphics queue, and 1 or 2 for compute
- // (3 is unused, so we ignore it). Unfortunately, S_GETREG doesn't set
- // SCC, so we need to check for 0 manually.
- BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CMP_LG_U32)).addImm(0).addReg(FPReg);
- BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CMOVK_I32), FPReg).addImm(VGPRSize);
+ BuildMI(MBB, I, DL, TII->get(AMDGPU::GET_STACK_BASE), FPReg);
if (requiresStackPointerReference(MF)) {
Register SPReg = MFI->getStackPtrOffsetReg();
assert(SPReg != AMDGPU::SP_REG);
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 057f4adcafd62..c9f111f0d9d86 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2537,7 +2537,7 @@ bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
}
break;
- case AMDGPU::V_MAX_BF16_PSEUDO_e64:
+ case AMDGPU::V_MAX_BF16_PSEUDO_e64: {
assert(ST.hasBF16PackedInsts());
MI.setDesc(get(AMDGPU::V_PK_MAX_NUM_BF16));
MI.addOperand(MachineOperand::CreateImm(0)); // op_sel
@@ -2550,6 +2550,38 @@ bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
break;
}
+ case AMDGPU::GET_STACK_BASE:
+ // The stack starts at offset 0 unless we need to reserve some space at the
+ // bottom.
+ if (ST.getFrameLowering()->mayReserveScratchForCWSR(*MBB.getParent())) {
+ // When CWSR is used in dynamic VGPR mode, the trap handler needs to save
+ // some of the VGPRs. The size of the required scratch space has already
+ // been computed by prolog epilog insertion.
+ const SIMachineFunctionInfo *MFI =
+ MBB.getParent()->getInfo<SIMachineFunctionInfo>();
+ unsigned VGPRSize = MFI->getScratchReservedForDynamicVGPRs();
+ Register DestReg = MI.getOperand(0).getReg();
+ BuildMI(MBB, MI, DL, get(AMDGPU::S_GETREG_B32), DestReg)
+ .addImm(AMDGPU::Hwreg::HwregEncoding::encode(
+ AMDGPU::Hwreg::ID_HW_ID2, AMDGPU::Hwreg::OFFSET_ME_ID, 2));
+ // The MicroEngine ID is 0 for the graphics queue, and 1 or 2 for compute
+ // (3 is unused, so we ignore it). Unfortunately, S_GETREG doesn't set
+ // SCC, so we need to check for 0 manually.
+ BuildMI(MBB, MI, DL, get(AMDGPU::S_CMP_LG_U32)).addImm(0).addReg(DestReg);
+ MI.setDesc(get(AMDGPU::S_CMOVK_I32));
+ MI.addOperand(MachineOperand::CreateImm(VGPRSize));
+ // Change the implicit-def of SCC to an explicit use (but first remove
+ // the dead flag if present).
+ MI.getOperand(MI.getNumExplicitOperands()).setIsDead(false);
+ MI.getOperand(MI.getNumExplicitOperands()).setIsUse();
+ } else {
+ MI.setDesc(get(AMDGPU::S_MOV_B32));
+ MI.addOperand(MachineOperand::CreateImm(0));
+ MI.removeOperand(MI.getNumExplicitOperands()); // Drop implicit def of SCC.
+ }
+ break;
+ }
+
return true;
}
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index e06bc912113a8..83685b630075e 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -926,6 +926,7 @@ multiclass si_cs_chain_tc_dvgpr_patterns<
defm : si_cs_chain_tc_dvgpr_patterns<i32>; // On GFX12, dVGPR mode is wave32-only.
+let Defs = [SCC] in {
def ADJCALLSTACKUP : SPseudoInstSI<
(outs), (ins i32imm:$amt0, i32imm:$amt1),
[(callseq_start timm:$amt0, timm:$amt1)],
@@ -935,7 +936,6 @@ def ADJCALLSTACKUP : SPseudoInstSI<
let hasSideEffects = 1;
let usesCustomInserter = 1;
let SchedRW = [WriteSALU];
- let Defs = [SCC];
}
def ADJCALLSTACKDOWN : SPseudoInstSI<
@@ -946,9 +946,16 @@ def ADJCALLSTACKDOWN : SPseudoInstSI<
let hasSideEffects = 1;
let usesCustomInserter = 1;
let SchedRW = [WriteSALU];
- let Defs = [SCC];
}
+// Get the offset of the base of the stack, skipping any reserved areas.
+def GET_STACK_BASE : SPseudoInstSI<(outs SGPR_32:$dst), (ins),
+ [(set SGPR_32:$dst, (int_amdgcn_get_stack_base))]> {
+ let hasSideEffects = 0;
+ let SchedRW = [WriteSALU];
+}
+} // End Defs = [SCC]
+
let Defs = [M0, EXEC, SCC],
UseNamedOperandTable = 1 in {
diff --git a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll
index 9ff670bee0f89..e77d88255acc6 100644
--- a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll
+++ b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll
@@ -199,6 +199,13 @@ define void @s_memrealtime(ptr addrspace(1) inreg %out) {
ret void
}
+; CHECK-LABEL: for function 'get_stack_base':
+; CHECK: ALL VALUES UNIFORM
+define amdgpu_cs void @get_stack_base(ptr addrspace(1) inreg %out) {
+ %v = call i32 @llvm.amdgcn.get.stack.base()
+ store i32 %v, ptr addrspace(1) %out
+ ret void
+}
declare i32 @llvm.amdgcn.workitem.id.x() #0
declare i32 @llvm.amdgcn.readfirstlane(i32) #0
@@ -216,6 +223,7 @@ declare i32 @llvm.amdgcn.cluster.workgroup.max.id.x()
declare i32 @llvm.amdgcn.cluster.workgroup.max.id.y()
declare i32 @llvm.amdgcn.cluster.workgroup.max.id.z()
declare i32 @llvm.amdgcn.cluster.workgroup.max.flat.id()
+declare i32 @llvm.amdgcn.get.stack.base()
attributes #0 = { nounwind readnone }
attributes #1 = { nounwind readnone convergent }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.get.stack.base.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.get.stack.base.ll
new file mode 100644
index 0000000000000..eace3f778515a
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.get.stack.base.ll
@@ -0,0 +1,101 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=-real-true16 < %s | FileCheck %s
+; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck %s
+; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx1200 -mattr=-real-true16 < %s | FileCheck %s
+
+; Test that the llvm.amdgcn.get.stack.base intrinsic returns the correct value:
+; - for functions that need to reserve space for CWSR, it should return the offset
+; past the reserved area (i.e. the offset of the first spill or local variables)
+; - for functions that don't reserve any space, it should return 0
+
+define amdgpu_cs i32 @stack_base_cs_dvgpr_16(i32 %val) #0 {
+; CHECK-LABEL: stack_base_cs_dvgpr_16:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_getreg_b32 s33, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_getreg_b32 s0, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_cmp_lg_u32 0, s33
+; CHECK-NEXT: s_cmovk_i32 s33, 0x1c0
+; CHECK-NEXT: s_cmp_lg_u32 0, s0
+; CHECK-NEXT: scratch_store_b32 off, v0, s33 scope:SCOPE_SYS
+; CHECK-NEXT: s_wait_storecnt 0x0
+; CHECK-NEXT: s_cmovk_i32 s0, 0x1c0
+; CHECK-NEXT: ; return to shader part epilog
+ %local = alloca i32, addrspace(5)
+ store volatile i32 %val, ptr addrspace(5) %local
+ %stack.base = call i32 @llvm.amdgcn.get.stack.base()
+ ret i32 %stack.base
+}
+
+define amdgpu_cs i32 @stack_base_cs_dvgpr_32(i32 %val) #1 {
+; CHECK-LABEL: stack_base_cs_dvgpr_32:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_getreg_b32 s33, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_getreg_b32 s0, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_cmp_lg_u32 0, s33
+; CHECK-NEXT: s_cmovk_i32 s33, 0x380
+; CHECK-NEXT: s_cmp_lg_u32 0, s0
+; CHECK-NEXT: scratch_store_b32 off, v0, s33 scope:SCOPE_SYS
+; CHECK-NEXT: s_wait_storecnt 0x0
+; CHECK-NEXT: s_cmovk_i32 s0, 0x380
+; CHECK-NEXT: ; return to shader part epilog
+ %local = alloca i32, addrspace(5)
+ store volatile i32 %val, ptr addrspace(5) %local
+ %stack.base = call i32 @llvm.amdgcn.get.stack.base()
+ ret i32 %stack.base
+}
+
+define amdgpu_cs i32 @stack_base_cs_no_dvgpr(i32 %val) #2 {
+; CHECK-LABEL: stack_base_cs_no_dvgpr:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_mov_b32 s0, 0
+; CHECK-NEXT: scratch_store_b32 off, v0, off scope:SCOPE_SYS
+; CHECK-NEXT: s_wait_storecnt 0x0
+; CHECK-NEXT: ; return to shader part epilog
+ %local = alloca i32, addrspace(5)
+ store volatile i32 %val, ptr addrspace(5) %local
+ %stack.base = call i32 @llvm.amdgcn.get.stack.base()
+ ret i32 %stack.base
+}
+
+define amdgpu_cs i32 @stack_base_cs_dvgpr_control_flow(i32 %val) #0 {
+; CHECK-LABEL: stack_base_cs_dvgpr_control_flow:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_getreg_b32 s33, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_mov_b32 s0, exec_lo
+; CHECK-NEXT: s_cmp_lg_u32 0, s33
+; CHECK-NEXT: s_cmovk_i32 s33, 0x1c0
+; CHECK-NEXT: scratch_store_b32 off, v0, s33 scope:SCOPE_SYS
+; CHECK-NEXT: s_wait_storecnt 0x0
+; CHECK-NEXT: v_cmpx_gt_i32_e32 0x43, v0
+; CHECK-NEXT: ; %bb.1: ; %if.then
+; CHECK-NEXT: s_getreg_b32 s1, hwreg(HW_REG_WAVE_HW_ID2, 8, 2)
+; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT: s_cmp_lg_u32 0, s1
+; CHECK-NEXT: s_cmovk_i32 s1, 0x1c0
+; CHECK-NEXT: v_mov_b32_e32 v0, s1
+; CHECK-NEXT: ; %bb.2: ; %if.end
+; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; CHECK-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT: v_readfirstlane_b32 s0, v0
+; CHECK-NEXT: s_wait_alu depctr_va_sdst(0)
+; CHECK-NEXT: ; return to shader part epilog
+entry:
+ %local = alloca i32, addrspace(5)
+ store volatile i32 %val, ptr addrspace(5) %local
+ %which = icmp slt i32 %val, 67
+ br i1 %which, label %if.then, label %if.end
+
+if.then:
+ %stack.base = call i32 @llvm.amdgcn.get.stack.base()
+ br label %if.end
+
+if.end:
+ %ret = phi i32 [ %stack.base, %if.then ], [ %val, %entry ]
+ ret i32 %ret
+}
+
+
+attributes #0 = { nounwind "amdgpu-dynamic-vgpr-block-size"="16" }
+attributes #1 = { nounwind "amdgpu-dynamic-vgpr-block-size"="32" }
+attributes #2 = { nounwind "amdgpu-dynamic-vgpr-block-size"="0" }
arsenm
left a comment
There are generic intrinsics for this that should be implemented instead of adding a new one
nikic
left a comment
In case you're not aware, llvm.stackaddress was recently added (https://llvm.org/docs/LangRef.html#llvm-stackaddress-intrinsic). I'm not sure whether or not it exactly matches the semantics you want.
Oh, thanks!
arsenm
left a comment
In a follow-up, can you make sure computeKnownBits knows about the alignment and that the top bits must be 0?
// For everything else, create a dummy stack object.
EVT VT = getPointerTy(DAG.getDataLayout(), AMDGPUAS::PRIVATE_ADDRESS);
int FI = MF.getFrameInfo().CreateFixedObject(1, 0, /*IsImmutable=*/false);
return DAG.getFrameIndex(FI, VT);
Suggested change:
-  return DAG.getFrameIndex(FI, VT);
+  return DAG.getFrameIndex(FI, Op.getValueType());
You should almost never need to use getPointerTy; usually the correct type is implied by the original operation. This also avoids asserting if someone uses the wrong address space for the call.
Ack, thanks!
def AMDGPUsponentry : SDNode<
  "ISD::SPONENTRY", SDTypeProfile <1, 0, [SDTCisPtrTy<0>]>
>;
Should just move this to the generic code
Ok, done! I wasn't sure if that was welcome; most targets can probably get away with the same trick as AArch64.
// Get the offset of the base of the stack, skipping any reserved areas.
def GET_STACK_BASE : SPseudoInstSI<(outs SGPR_32:$dst), (ins),
  [(set p5:$dst, (AMDGPUsponentry))]> {
  let FixedSize = 0;
This does not expand to 0 bytes. I think it's at least 12?
That's actually a boolean. But I realized now that it doesn't mean exactly what I thought it meant, so I just put the worst case size instead.
}

; CHECK: ScratchSize: 16
Can you test what happens if you use this with the wrong addrspace / p0? It should not crash
I added a check for the address space in the Verifier, similar to what we have for allocas.
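Roughly, the distinction that check draws at the IR level (a sketch, assuming it mirrors the alloca rule and keys off the datalayout's alloca address space; the .p5/.p0 suffixes follow the usual overloaded-pointer mangling):

target datalayout = "A5"

define void @ok() {
  ; Accepted: the result is in the alloca address space (5 on AMDGPU).
  %p = call ptr addrspace(5) @llvm.sponentry.p5()
  ret void
}

define void @bad() {
  ; Expected to be rejected by the new check: the result is not in the
  ; alloca address space.
  %q = call ptr @llvm.sponentry.p0()
  ret void
}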
// FIXME: The imported pattern checks for i32 instead of p5; if we fix
// that we can remove this cast.
const LLT S32 = LLT::scalar(32);
If you add an explicit p5 does it work? I thought this was a solved problem
I'm not sure where to add the explicit p5. Did you mean in the TableGen definition?
The pattern in SIInstructions.td is already written with p5 for the output, but the generated code in AMDGPUGenGlobalISel.inc looks like this (note the GILLT_s32):
/* 2023706 */ // Label 168: @2023706
/* 2023706 */ GIM_Try, /*On fail goto*//*Label 29065*/ GIMT_Encode4(2023731), // Rule ID 4295 //
/* 2023711 */ GIM_RootCheckType, /*Op*/0, /*Type*/GILLT_s32,
/* 2023714 */ GIM_RootCheckRegBankForClass, /*Op*/0, /*RC*/GIMT_Encode2(AMDGPU::SGPR_32RegClassID),
/* 2023718 */ // (sponentry:{ *:[i32] }) => (GET_STACK_BASE:{ *:[i32] }:{ *:[i1] })
/* 2023718 */ GIR_MutateOpcode, /*InsnID*/0, /*RecycleInsnID*/0, /*Opcode*/GIMT_Encode2(AMDGPU::GET_STACK_BASE),
/* 2023723 */ GIR_AddImplicitDef, /*InsnID*/0, GIMT_Encode2(AMDGPU::SCC), GIMT_Encode2(static_cast<unsigned>(RegState::Dead)),
/* 2023729 */ GIR_RootConstrainSelectedInstOperands,
/* 2023730 */ // GIR_Coverage, 4295,
/* 2023730 */ GIR_Done,
Did you have some workaround in mind?
For reference, this is what I was trying to select without the cast: LLVM ERROR: cannot select: %2:sreg_32(p5) = G_AMDGPU_SPONENTRY (in function: sponentry_cs_dvgpr_16).
@@ -0,0 +1,64 @@
; RUN: not opt -mtriple=amdgcn -mcpu=gfx1250 -passes=verify -disable-output <%s 2>&1 | FileCheck %s
Verifier tests should use llvm-as, not opt. Also, you don't need -mcpu or the triple; this can go off the datalayout.
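A sketch of the suggested test shape (the exact flags are illustrative and the diagnostic text is omitted, since it depends on the wording chosen in the Verifier):

; RUN: not llvm-as -disable-output < %s 2>&1 | FileCheck %s

; No triple or -mcpu: the alloca address space is taken from the datalayout
; ("A5" puts allocas in address space 5).
target datalayout = "A5"

define void @wrong_addrspace() {
  ; A CHECK line matching the verifier's diagnostic would go here.
  %p = call ptr @llvm.sponentry.p0()
  ret void
}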
arsenm
left a comment
lgtm with test cleanup
Maybe update https://llvm.org/docs/LangRef.html#id420 as well?
In some of our use cases, the GPU runtime stores some data at the top of
the stack. It figures out where it's safe to store it by using the PAL
metadata generated by the backend, which includes the total stack size.
However, the metadata does not include the space reserved at the bottom
of the stack for the trap handler when CWSR is enabled in dynamic VGPR
mode. This space is reserved dynamically based on whether or not the
code is running on the compute queue. Therefore, the runtime needs a way
to take that into account.
Add support for llvm.sponentry, which should return the base of the stack,
skipping over any reserved areas. This allows us to keep this computation in
one place rather than duplicate it between the backend and the runtime.
The implementation for functions that set up their own stack uses a pseudo
that is expanded to the same code sequence as that used in the prolog to
set up the stack in the first place.
In callable functions, we generate a fixed stack object and use that instead,
similar to the Arm/AArch64 approach. This wastes some stack space but that's
not a problem for now because we're not planning to use this in callable
functions yet.
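For reference, a minimal IR-level sketch of the final form (function name illustrative; the result is a pointer in the alloca address space, hence the .p5 overload on AMDGPU):

declare ptr addrspace(5) @llvm.sponentry.p5()

define amdgpu_cs i32 @sponentry_offset(i32 %val) {
  ; Force the function to set up its own stack.
  %local = alloca i32, addrspace(5)
  store volatile i32 %val, ptr addrspace(5) %local
  ; Base of the usable stack, past any area reserved for CWSR in dVGPR mode.
  %base = call ptr addrspace(5) @llvm.sponentry.p5()
  %off = ptrtoint ptr addrspace(5) %base to i32
  ret i32 %off
}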