@phambinhfin

Hypothesis: a cuBLAS[Lt] GEMM computes
D := α * (A @ B) + β * C.
With matrix-bias fusion we set β = 1, so the GEMM reads C. I’ve seen cases where C held garbage/stale values, and when the values in A and B are small, β*C dominates the result → huge values / NaNs. For example, in this dump:

buffer=A elems=16777216 sample=1024 NaN=0 Inf=0 min=-2.75 max=3.5
buffer=B elems=524288 sample=1024 NaN=0 Inf=0 min=-0.125 max=0.109375
buffer=C elems=33554432 sample=1024 NaN=0 Inf=0 min=-2.37932e+38 max=-2.37932e+38
buffer=D(out) elems=33554432 sample=1024 NaN=0 Inf=0 min=-2.37932e+38 max=-2.37932e+38
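
The hypothesis can be illustrated with a small CPU reference. This is only a sketch in plain `float` (not FP8), and it is not XLA or cuBLASLt code: `GemmReference` and `GemmThenAddBias` are hypothetical names introduced here. It assumes the usual BLAS convention that C is not read when β = 0, and it contrasts the fused epilogue (β = 1, C is read) with a GEMM at β = 0 followed by a separate elementwise add.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major reference for D = alpha * (A @ B) + beta * C.
// A is m x k, B is k x n, C and D are m x n.
// C is only read when beta != 0 (assumed BLAS-style convention).
std::vector<float> GemmReference(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 const std::vector<float>& c,
                                 std::size_t m, std::size_t k, std::size_t n,
                                 float alpha, float beta) {
  std::vector<float> d(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        acc += a[i * k + p] * b[p * n + j];
      }
      float out = alpha * acc;
      if (beta != 0.0f) {
        out += beta * c[i * n + j];  // stale values in C enter D here
      }
      d[i * n + j] = out;
    }
  }
  return d;
}

// Hypothetical unfused path: GEMM with beta = 0 (stale_c is never read),
// then the bias is applied as a separate elementwise add.
std::vector<float> GemmThenAddBias(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   const std::vector<float>& stale_c,
                                   const std::vector<float>& bias,
                                   std::size_t m, std::size_t k,
                                   std::size_t n) {
  std::vector<float> d = GemmReference(a, b, stale_c, m, k, n, 1.0f, 0.0f);
  for (std::size_t i = 0; i < d.size(); ++i) {
    d[i] += bias[i];
  }
  return d;
}
```

With a 1x2 A of {1, 2}, a 2x1 B of {0.5, 0.25}, and a stale C of -2.37932e+38, the β = 1 call returns a value dominated by C, while `GemmThenAddBias` with a bias of 0.5 returns 1.5 regardless of what C holds.
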

  1. FP8 GEMM: skip matrix-bias→C fusion

    • For FP8 cublasLt matmuls, we no longer fuse the matrix bias as the
      GEMM’s “C” operand (β=1). We keep β=0 so the GEMM does not read C at
      all, and apply the bias as a separate Add right after the GEMM.
    • Effect: eliminates the “garbage-in-C → garbage-in-D” failure mode
      without otherwise changing numerics.
  2. Safer non-contracting dim selection

    • Initialize non_contracting_dim = -1, select it explicitly, and
      CHECK that it was found before use. This prevents accidental use of
      an uninitialized variable if a future refactor ever violates the
      single-(non)contracting-dim invariant.
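
The sentinel-and-check pattern from item 2 might look roughly like the sketch below. It is not the code from this PR: `FindSingleNonContractingDim` and its parameters are hypothetical, and a `std::logic_error` stands in for the `CHECK` macro so the snippet needs only the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns the one dimension in [0, rank) that is
// neither a batch nor a contracting dimension. Throws if there is no such
// dimension or more than one (stand-in for a CHECK failure).
std::int64_t FindSingleNonContractingDim(
    std::int64_t rank, const std::vector<std::int64_t>& batch_dims,
    const std::vector<std::int64_t>& contracting_dims) {
  auto contains = [](const std::vector<std::int64_t>& dims, std::int64_t d) {
    return std::find(dims.begin(), dims.end(), d) != dims.end();
  };

  std::int64_t non_contracting_dim = -1;  // sentinel: "not found yet"
  for (std::int64_t d = 0; d < rank; ++d) {
    if (contains(batch_dims, d) || contains(contracting_dims, d)) {
      continue;
    }
    if (non_contracting_dim != -1) {
      throw std::logic_error("more than one non-contracting dimension");
    }
    non_contracting_dim = d;
  }

  // Verify the sentinel was overwritten before the caller uses the value.
  if (non_contracting_dim == -1) {
    throw std::logic_error("no non-contracting dimension found");
  }
  return non_contracting_dim;
}
```

For a rank-3 operand with batch dimension 0 and contracting dimension 2, the helper returns 1. If every dimension is a batch or contracting dimension, it fails loudly instead of returning an unset value.
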

@phambinhfin phambinhfin self-assigned this Oct 17, 2025
@i-chaochen i-chaochen requested a review from ScXfjiang October 17, 2025 13:07
@ScXfjiang

Instead of disabling this fusion, we need to figure out why it fails on gfx950.

@phambinhfin
Author

> Instead of disabling this fusion, we need to figure out why it fails on gfx950.

I think if we can find other GPUs that support FP8, we can check whether this issue also happens there, not only on gfx950.

@ScXfjiang

> > Instead of disabling this fusion, we need to figure out why it fails on gfx950.
>
> I think if we can find other GPUs that support FP8, we can check whether this issue also happens there, not only on gfx950.

It's OCP FP8 that we currently care about, and OCP FP8 is only supported on gfx950 and gfx1201.

@phambinhfin
Author

Can you test again? I just covered more cases to prevent fusion.

@i-chaochen
Collaborator

Hi @phambinhfin, I think we can close this PR since #416 is merged?
