farzad-openai commented Oct 31, 2025

This PR improves the throughput of the mxfp upcast and downcast operations. I included a commit from @jongsoo-openai (original PR here) and added the improvements listed below on top of it. The PR is functionally a no-op, which is verified by the tests in python/triton_kernels/tests/test_mxfp.py.

Upcast improvements:

  • Added native packed e2m1 → fp16 conversion (for Blackwell+); see the decoding sketch after this list.
  • Added tensor descriptors to utilize TMA for reading the input mxfp value tensor and writing the output.
    • Note that this required padding the innermost dimension of I/O tensors that do not meet the tensor-descriptor requirements, and unpadding the output afterwards.
  • Tuned tile dimensions and num_warps.
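
For reference, here is a minimal host-side sketch of the packed e2m1 → fp16 value mapping. The names, the lookup-table approach, and the low-nibble-first ordering are illustrative assumptions, not taken from the kernel; on Blackwell+ the kernel uses the native packed conversion instead of a table.

```python
import torch

# The eight non-negative e2m1 magnitudes (1 sign, 2 exponent, 1 mantissa bit).
_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Nibbles 0-7 are positive, 8-15 carry the sign bit.
_E2M1_LUT = torch.tensor(
    _E2M1_MAGNITUDES + [-m for m in _E2M1_MAGNITUDES], dtype=torch.float16
)

def unpack_e2m1_to_fp16(packed: torch.Tensor) -> torch.Tensor:
    """packed: uint8 tensor holding two e2m1 codes per byte (low nibble first, by assumption)."""
    lo = (packed & 0x0F).long()
    hi = ((packed >> 4) & 0x0F).long()
    # Decode both nibbles via the lookup table and interleave so the output
    # preserves the original element order.
    out = torch.stack([_E2M1_LUT[lo], _E2M1_LUT[hi]], dim=-1)
    return out.reshape(*packed.shape[:-1], packed.shape[-1] * 2)
```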

Downcast improvements:

  • Enabled grouped stores of mxfp4 value tensors instead of byte-level stores; see the packing sketch after this list.
  • Tuned the tile dimensions as well as num_warps.
  • Unlike the upcast path, tensor descriptors unfortunately did not yield a consistent performance improvement here.
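
To make the grouped-store idea concrete, here is a minimal host-side sketch. The function name, the low-nibble-first ordering, and the 4-byte group width are illustrative assumptions; the actual kernel performs the packing and the wide stores in Triton.

```python
import torch

def pack_e2m1_grouped(codes: torch.Tensor) -> torch.Tensor:
    """codes: uint8 tensor of 4-bit e2m1 codes (0..15); last dim divisible by 8."""
    lo = codes[..., 0::2]                    # even-indexed codes -> low nibbles (order is an assumption)
    hi = codes[..., 1::2]                    # odd-indexed codes  -> high nibbles
    packed = (lo | (hi << 4)).contiguous()   # two e2m1 codes per output byte
    # Reinterpret every run of 4 packed bytes as one 32-bit word so the result
    # can be written with a single 4-byte store per group instead of four
    # byte-level stores.
    return packed.view(torch.int32)
```

The reinterpretation at the end is only there to show the payoff: once neighboring packed bytes are kept together, they can be written with one wider store rather than several byte-sized ones.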

I left further performance tuning as a TODO for a subsequent PR.

Performance comparison (BW, in GB/s)

Done via python/triton_kernels/tests/test_mxfp.py.

Before -- GB200

MXFP8 (e4m3fn):
   M     N  quant_dtype            quant_bw_bfloat16    quant_bw_float16    dequant_bw_bfloat16    dequant_bw_float16
----  ----  -------------------  -------------------  ------------------  ---------------------  --------------------
1024  8192  torch.float8_e4m3fn              1985.94             2053.35                2154.61               2347.56
4096  8192  torch.float8_e4m3fn              3479.79             3518.71                3243.02               3753.85

MXFP4 (e2m1):
   M     N  quant_dtype      quant_bw_bfloat16    quant_bw_float16    dequant_bw_bfloat16    dequant_bw_float16
----  ----  -------------  -------------------  ------------------  ---------------------  --------------------
1024  8192  torch.uint8                808.089             815.124                647.589               713.9
4096  8192  torch.uint8               1045.23             1041.91                 811.089               888.624

After -- GB200

MXFP8 (e4m3fn):
   M     N  quant_dtype            quant_bw_bfloat16    quant_bw_float16    dequant_bw_bfloat16    dequant_bw_float16
----  ----  -------------------  -------------------  ------------------  ---------------------  --------------------
1024  8192  torch.float8_e4m3fn              2259.86             2404.99                2119.76               2361.66
4096  8192  torch.float8_e4m3fn              4106.69             4268.29                4038.16               4059

MXFP4 (e2m1):
   M     N  quant_dtype      quant_bw_bfloat16    quant_bw_float16    dequant_bw_bfloat16    dequant_bw_float16
----  ----  -------------  -------------------  ------------------  ---------------------  --------------------
1024  8192  torch.uint8                1334.75             1332.03                1424.7                1397.36
4096  8192  torch.uint8                2027.41             2028.98                2097.15               2275.56

Before -- H100

MXFP8 (e4m3fn):
   M     N  quant_dtype            quant_bw_bfloat16    quant_bw_float16    dequant_bw_bfloat16    dequant_bw_float16
----  ----  -------------------  -------------------  ------------------  ---------------------  --------------------
1024  8192  torch.float8_e4m3fn              1250.29             1244.35                1595.2                1588.75
4096  8192  torch.float8_e4m3fn              1805.81             1799.62                2080.51               2118.34

MXFP4 (e2m1):
   M     N  quant_dtype      quant_bw_bfloat16    quant_bw_float16    dequant_bw_bfloat16    dequant_bw_float16
----  ----  -------------  -------------------  ------------------  ---------------------  --------------------
1024  8192  torch.uint8                418.493             416.102                572.367               627.739
4096  8192  torch.uint8                489.531             490.08                 687.861               758.08

After -- H100

MXFP8 (e4m3fn):
   M     N  quant_dtype            quant_bw_bfloat16    quant_bw_float16    dequant_bw_bfloat16    dequant_bw_float16
----  ----  -------------------  -------------------  ------------------  ---------------------  --------------------
1024  8192  torch.float8_e4m3fn              1604.96             1624.86                1732.23               1751.52
4096  8192  torch.float8_e4m3fn              2347.56             2337.09                2386.74               2292.8

MXFP4 (e2m1):
   M     N  quant_dtype      quant_bw_bfloat16    quant_bw_float16    dequant_bw_bfloat16    dequant_bw_float16
----  ----  -------------  -------------------  ------------------  ---------------------  --------------------
1024  8192  torch.uint8                731.429             745.575                892.861               917.871
4096  8192  torch.uint8                882.343             894.995               1102.37               1165.08

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because test_mxfp.py already has coverage.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

farzad-openai changed the title from "Mxfp conversions speedup" to "[MXFP] mxfp conversions speedup" on Nov 1, 2025
farzad-openai force-pushed the mxfp_conversions_speedup branch from 49d6715 to e5855e7 on November 4, 2025