[CUDA] Tune ops per buffer based on device #2761

Merged

awni merged 3 commits into ml-explore:main from awni:tune_ops_per_buffer on Nov 16, 2025

Conversation

@awni (Member) commented Nov 14, 2025

We need a more sophisticated policy to set ops per buffer based on the device. This is a start to that.

For inference on a B200, increasing it helps a lot at very little memory cost.

```
mlx_lm.benchmark --model meta-llama/Meta-Llama-3.1-8B --p 128 -g 128 -b 1 -n 4
```

Pre: generation_tps=244.456, peak_memory=16.166
Post: generation_tps=283.073, peak_memory=16.224

For training a 0.6B model it's a double win: faster and less RAM 💪

|      | Toks/sec | Mem (GB) |
|------|----------|----------|
| Pre  | 60631    | 54.51    |
| Post | 64078    | 51.63    |
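The kind of device-based policy described above could be sketched roughly as follows. This is a hypothetical illustration, not MLX's actual implementation: the device names, the `ops_per_buffer` function, and all numeric values are assumptions made up for the sketch.

```python
# Hypothetical sketch of a per-device ops-per-buffer policy.
# Device names and all values here are illustrative only; they are
# not the numbers MLX actually uses.
DEFAULT_OPS_PER_BUFFER = 10

TUNED_OPS_PER_BUFFER = {
    # Newer datacenter GPUs can tolerate more ops per buffer, trading
    # a small amount of memory for throughput (as in the B200 numbers
    # above).
    "NVIDIA B200": 100,
    "NVIDIA H100": 20,
}

def ops_per_buffer(device_name: str) -> int:
    """Return the ops-per-buffer setting for a device, falling back
    to a conservative default for devices that have not been tuned."""
    return TUNED_OPS_PER_BUFFER.get(device_name, DEFAULT_OPS_PER_BUFFER)
```

A lookup-with-fallback like this matches the comment later in the thread: devices that haven't been tuned yet simply get the conservative default until someone benchmarks them.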

@awni changed the title from "Tune ops per buffer based on device" to "[CUDA] Tune ops per buffer based on device" on Nov 14, 2025
@awni force-pushed the tune_ops_per_buffer branch from 011a737 to 43d2f55 on November 15, 2025 00:25
@angeloskath (Member) left a comment


Beautiful 🚀

@awni force-pushed the tune_ops_per_buffer branch from 43d2f55 to e2694be on November 15, 2025 04:21

@awni (Member, Author) commented Nov 16, 2025

This will probably need more tuning in the future, especially for devices I didn't add yet. But for now I think it's good to merge.

@awni merged commit aad49f9 into ml-explore:main on Nov 16, 2025
9 checks passed
@awni deleted the tune_ops_per_buffer branch on November 22, 2025 05:17