Performance: BlockTile 256x128 optimizations enable 1500+ TF FP8 by sazczmh · Pull Request #81 · deepseek-ai/DeepGEMM

sazczmh · 2025-04-08T10:03:34Z

By reusing the accumulator registers of the Tensor Cores to implement a 256x128 BlockTile structure, this approach significantly increases data reuse, reduces the demand on L2 cache and HBM memory accesses, and increases the SM's computational frequency, ultimately achieving FP8 performance exceeding 1,500 TFLOPS.
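
The data-reuse claim above can be illustrated with a rough arithmetic-intensity estimate. This is a minimal sketch and not code from this PR: the function name `tile_arithmetic_intensity` is hypothetical, and the model assumes that each K-step of a BMxBN tile loads one BM-long column of A and one BN-long row of B at 1 byte per FP8 element, ignoring scaling factors, output writes, and cache hits.

```python
def tile_arithmetic_intensity(block_m, block_n, bytes_per_element=1):
    """Estimate FLOPs per byte loaded for one K-step of a BMxBN output tile.

    Assumption (simplified model): per K-step, the tile loads block_m elements
    of A and block_n elements of B, and performs one multiply and one add for
    every element of the block_m x block_n accumulator.
    """
    flops = 2 * block_m * block_n
    bytes_loaded = (block_m + block_n) * bytes_per_element
    return flops / bytes_loaded


# Tile shapes taken from the benchmark table in this PR.
base = tile_arithmetic_intensity(128, 160)
opti = tile_arithmetic_intensity(256, 128)

print(f"128x160: {base:.1f} FLOPs/byte")
print(f"256x128: {opti:.1f} FLOPs/byte")
print(f"ratio:   {opti / base:.2f}x")
```

Under these assumptions the 256x128 tile performs about 1.2x as many FLOPs per byte loaded as the 128x160 tile, which is consistent with the reduced pressure on L2 and HBM described above. The measured end-to-end speedups are smaller, since this model leaves out everything except operand loads.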

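| M | N | K | Base BMxBN | Base throughput | Optimized BMxBN | Optimized throughput | Speedup |
|---|---|---|---|---|---|---|---|
| 4096 | 24576 | 1536 | 128x160 | 1162 TFLOPS | 256x128 | 1204 TFLOPS | 3.61% |
| 4096 | 32768 | 512 | 128x160 | 801 TFLOPS | 256x128 | 777 TFLOPS | -3.00% |
| 4096 | 7168 | 16384 | 128x160 | 1451 TFLOPS | 256x128 | 1500 TFLOPS | 3.38% |
| 4096 | 4096 | 7168 | 128x160 | 1304 TFLOPS | 256x128 | 1377 TFLOPS | 5.60% |
| 4096 | 7168 | 2048 | 128x160 | 1185 TFLOPS | 256x128 | 1159 TFLOPS | -2.19% |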

Tested on H800-SXM with CUDA 12.8.1.
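
The speedup column in the table above can be recomputed from the two throughput columns. The following is a small sketch for checking the figures; the variable names are illustrative, and the numbers are copied from the table.

```python
# (M, N, K, base TFLOPS with 128x160, optimized TFLOPS with 256x128)
results = [
    (4096, 24576, 1536, 1162, 1204),
    (4096, 32768, 512, 801, 777),
    (4096, 7168, 16384, 1451, 1500),
    (4096, 4096, 7168, 1304, 1377),
    (4096, 7168, 2048, 1185, 1159),
]

# Speedup is the relative change in throughput, expressed in percent.
speedups = [round((opti / base - 1) * 100, 2) for _, _, _, base, opti in results]

for (m, n, k, base, opti), speedup in zip(results, speedups):
    print(f"M={m} N={n} K={k}: {base} -> {opti} TFLOPS ({speedup:+.2f}%)")
```

Three of the five shapes improve and two regress slightly, so the 256x128 tile is not uniformly faster than 128x160 on these shapes.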

Performance: BlockTile 256x128 optimizations enable 1500+ TFLOPS FP8 …

97575bf

…performance on the H800-SXM platform

sazczmh added the perf label Apr 8, 2025

sazczmh self-assigned this Apr 8, 2025

LyricZhao added 2 commits April 9, 2025 09:32

Remove unused x256 WGMMA

ce65d5e

Clean up config heuristics

48a5f07

LyricZhao force-pushed the blocktile-256x128 branch from 1eeb98a to 48a5f07 Compare April 9, 2025 02:01

LyricZhao added 2 commits April 9, 2025 10:11

Larger block N candidates

a6524d4

Refactor M repetition with loops

4c0cc29

soundOfDestiny approved these changes Apr 9, 2025

LyricZhao added 2 commits April 9, 2025 10:59

Fix indent

bdca8b0

Fix indent x2

5a80e4b

LyricZhao requested a review from zheanxu April 9, 2025 03:10

LyricZhao added 2 commits April 9, 2025 11:14

Update README

a9967bc

Update README

989c9e3

LyricZhao merged commit fed3e4d into main Apr 9, 2025

zheanxu approved these changes Apr 9, 2025

LyricZhao deleted the blocktile-256x128 branch April 11, 2025 03:35

YouJiacheng mentioned this pull request Jan 12, 2026

New Record: Fuse linear layer + ReLU + square in MLP block (-2.8s) KellerJordan/modded-nanogpt#197

Merged

LyricZhao merged 9 commits into main from blocktile-256x128