
Conversation

@phambinhfin

This PR fixes FP8 training issues on the gfx950 (MI350) architecture, where ROCm's FP8 implementation doesn't properly handle conversion operations. The current ROCm FP8 support skips conversion handling, which causes NaNs during training. This fix makes FP8 operations fall back to FP16 until XLA supports the FP8 convert, improving training stability and avoiding the NaN issue; see the linked ticket.
(The NaN issue does not happen on MI300 because MI300 handles the FP8 dot through FP16.)
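To make the intended behavior concrete, here is a minimal, self-contained C++ sketch of the fallback policy described above. This is not the PR's actual diff: the names (`PickGemmComputeType`, `GemmComputeType`) and the plain-string architecture check are assumptions for illustration only; the real change lives in XLA's ROCm FP8 handling.

```cpp
#include <iostream>
#include <string>

// Illustrative sketch only -- not the PR's diff. It models the policy from
// the description: on gfx950 (MI350), FP8 GEMMs currently produce NaNs when
// combined with quantization, so force the FP16 fallback until XLA supports
// the FP8 convert properly. All names here are hypothetical.
enum class GemmComputeType { kFp8, kFp16 };

GemmComputeType PickGemmComputeType(const std::string& gfx_arch) {
  if (gfx_arch == "gfx950") {
    // Temporary workaround: fall back to FP16 to avoid training NaNs.
    return GemmComputeType::kFp16;
  }
  // Other architectures (e.g., MI300/gfx942, which already handles the FP8
  // dot through FP16 and does not hit the NaN issue) keep their usual path.
  return GemmComputeType::kFp8;
}

int main() {
  for (const std::string arch : {"gfx942", "gfx950"}) {
    std::cout << arch << " -> "
              << (PickGemmComputeType(arch) == GemmComputeType::kFp16
                      ? "FP16 fallback"
                      : "FP8 GEMM")
              << "\n";
  }
}
```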

Collaborator

@i-chaochen left a comment


If these are gfx-specific changes, please check for the related gfx architectures instead of skipping in general.

@ScXfjiang

XLA is supposed to support such a scenario; it should not fall back to an FP16 GEMM.

@yeandy

yeandy commented Oct 6, 2025

Hi @phambinhfin, to echo @ScXfjiang's question, I'd like to understand under what conditions the FP8 rewriter would not be able to rewrite a dot with FP8 inputs into a cublasLt custom call on MI35X. And similarly, for MI300, under what conditions would it not rewrite a dot with nanoo_fp8 inputs into the custom GEMM?

In a properly installed ROCm environment, should nanoo_fp8/fp8 GEMMs get used?

@ScXfjiang

> Hi @phambinhfin, to echo @ScXfjiang's question, I'd like to understand under what conditions the FP8 rewriter would not be able to rewrite a dot with FP8 inputs into a cublasLt custom call on MI35X. And similarly, for MI300, under what conditions would it not rewrite a dot with nanoo_fp8 inputs into the custom GEMM?
>
> In a properly installed ROCm environment, should nanoo_fp8/fp8 GEMMs get used?

There are multiple factors that decide whether a hipBLASLt FP8 GEMM custom call can be generated, e.g., the data types, the ROCm version, and whether a specific pattern is matched.

You can check the main logic here:
https://github.com/openxla/xla/blob/e80be278d2b01f6b1b92102785b1f74ad10dfc92/xla/service/gpu/transforms/gemm_rewriter.cc#L1078

But if you only care about the results, you can enable this log:
https://github.com/openxla/xla/blob/e80be278d2b01f6b1b92102785b1f74ad10dfc92/xla/service/gpu/transforms/gemm_rewriter.cc#L1429
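(In case it helps others check this: VLOG output in XLA goes through the TSL logging setup, so, assuming a default build, it can usually be surfaced with the standard environment variables, e.g. `TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_VMODULE=gemm_rewriter=1` set before running the workload; the right verbosity number is whatever level the linked VLOG line uses.)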

    Adds a temporary workaround to disable FP8 GEMM operations on the
    gfx950 (MI355X) architecture, because FP8 operations combined with
    quantization produce NaN issues. While the root cause is being
    investigated, they are temporarily forced to fall back to FP16.