[Hardware][AMD] Improve OAM device ID + llama4 Maverick MOE tuning #16263
vllm/platforms/rocm.py
Outdated
From our past experiments, we found amdsmi_get_gpu_asic_info()["market_name"] to be more reliable across a set of different MI Instinct machines. Hence, we should stick to the previous implementation for distinguishing AMD GPU names.
Also, if we make this change, we might need to update the names of the existing tuned configs for AMD GPUs.
@divakar-amd for our SKU it returns MI300X-O, which is not really usable. Can you help dump the output of both board info and asic info so we can see which one is better?
I am wondering if we can just introduce another mapping, or a set of rules, to map to a generic name?
For example:
MI300X-O => MI300X
MI300X => MI300X
There are some SKUs we cannot talk about that would break if you do that.
@xw285cornell @houseroad Let's retain the name as it is for your SKU (i.e. MI300X-O) and stick to amdsmi_get_gpu_asic_info. We'll push another config for MI300X too, so we'll have both MI300X-O and MI300X.
@divakar-amd can we get on the Slack channel and discuss a solution? I don't really have a strong opinion on asic vs board, but duplicating the config file seems really ugly.
Agree, it’s ugly if they are just duplicated, and may fail on another type of similar ASIC
houseroad
left a comment
Thanks for the improvement!
benchmarks/kernels/benchmark_moe.py
Outdated
I'm wondering about the old approach of blindly setting the device guard: is there any problem with it?
Yeah, with ROCR_VISIBLE_DEVICES set, we can only see 1 device, so the device guard will try to use deviceX (X >= 1) and this will fail.
This seems like a good potential fix, and it can be used to remove the dependency on the env variable RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1.
Line 114 in e1a2c69:
ENV RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
However, it comes with a caveat - this would require the users to be mindful of HIP_VISIBLE_DEVICES vs ROCR_VISIBLE_DEVICES; between the two, HIP_VISIBLE_DEVICES is more commonly used.
For example: if the HIP_VISIBLE_DEVICES is set in the env, this PR would throw the following error:
RuntimeError: HIP_VISIBLE_DEVICES contains more devices than ROCR_VISIBLE_DEVICES
This feels more like a Ray problem: it probably shouldn't set ROCR_VISIBLE_DEVICES, or it should set ROCR_VISIBLE_DEVICES based on HIP_VISIBLE_DEVICES. There are people who don't use Docker and install from source, and they will hit this problem.
I could force HIP_VISIBLE_DEVICES to be the same as ROCR_VISIBLE_DEVICES.
Maybe we should handle HIP_VISIBLE_DEVICES as well?
Yes, let's add a guard that avoids any mismatch between HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES.
It's not super clear to me how to add the guard: the check happens at import time, when the Ray worker starts. So I just deleted the HIP_VISIBLE_DEVICES env var. I don't think Ray will handle it anyway (it'll always use 8 GPUs regardless of HIP_VISIBLE_DEVICES). Let me know what you think.
vllm/platforms/rocm.py
Outdated
@shajrawi @divakar-amd better?
The map approach is much cleaner. :-)
LGTM!
If you want to include these in the map too:
"0x74a0": "MI300A",
"0x74a1": "MI300X",
"0x74b5": "MI300X", // MI300X VF
"0x74a5": "MI325X",
"0x74b9": "MI325X", // MI325X VF
"0x74a9": "MI300X-HF",
"0x74bd": "MI300X-HF",
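One way to read the map approach discussed above is a lookup keyed by device ID, falling back to the reported market name when the ID is unknown. The following is only a minimal sketch under stated assumptions: `resolve_device_name` is a hypothetical helper, and the `"device_id"` key on the info dictionary is an assumption for illustration (only `"market_name"` is mentioned in this thread). The table entries are copied from the comment above.

```python
from typing import Dict, Mapping

# Device-ID-to-name table; entries copied from the review comment above.
DEVICE_ID_TO_NAME: Dict[str, str] = {
    "0x74a0": "MI300A",
    "0x74a1": "MI300X",
    "0x74b5": "MI300X",  # MI300X VF
    "0x74a5": "MI325X",
    "0x74b9": "MI325X",  # MI325X VF
    "0x74a9": "MI300X-HF",
    "0x74bd": "MI300X-HF",
}


def resolve_device_name(asic_info: Mapping[str, str]) -> str:
    """Hypothetical helper: map a device ID to a generic name.

    `asic_info` stands in for a dictionary of ASIC fields. The
    "device_id" key is an assumption made for this sketch; the
    "market_name" fallback is the field mentioned in the thread.
    """
    device_id = str(asic_info.get("device_id", "")).lower()
    if device_id in DEVICE_ID_TO_NAME:
        return DEVICE_ID_TO_NAME[device_id]
    # Unknown ID: keep whatever name the driver reported.
    return asic_info["market_name"]
```

With this shape, an SKU that reports an abbreviated market name would still resolve to a generic name as long as its device ID is in the table, which avoids duplicating tuned config files per SKU name.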
vllm/platforms/rocm.py
Outdated
lint?
benchmarks/kernels/benchmark_moe.py
Outdated
- Can we add a log message?
- Also, let's use the value of HIP_VISIBLE_DEVICES to set ROCR_VISIBLE_DEVICES. This would allow us to expose the number of GPUs for tuning correctly, e.g. if you only want to tune over 4 GPUs.
Something like:
logger.warning(
    "Removing HIP_VISIBLE_DEVICES. Using ROCR_VISIBLE_DEVICES "
    "for GPU visibility for Ray."
)
val = os.environ["HIP_VISIBLE_DEVICES"]
os.environ["ROCR_VISIBLE_DEVICES"] = val
del os.environ["HIP_VISIBLE_DEVICES"]
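The suggestion above reads `os.environ["HIP_VISIBLE_DEVICES"]` directly, which would raise a `KeyError` when the variable is unset. A slightly more defensive version might look like the sketch below. The function name `hand_off_hip_visible_devices` is hypothetical, and taking the environment as a parameter is a choice made here only so the logic can be exercised on a plain dictionary.

```python
import logging
from typing import MutableMapping

logger = logging.getLogger(__name__)


def hand_off_hip_visible_devices(environ: MutableMapping[str, str]) -> None:
    """Hypothetical helper: move HIP_VISIBLE_DEVICES into ROCR_VISIBLE_DEVICES.

    Does nothing when HIP_VISIBLE_DEVICES is not set.
    """
    val = environ.pop("HIP_VISIBLE_DEVICES", None)
    if val is None:
        return
    logger.warning(
        "Removing HIP_VISIBLE_DEVICES. Using ROCR_VISIBLE_DEVICES "
        "for GPU visibility for Ray."
    )
    # Carry the user's device selection over so that, for example,
    # tuning over only 4 GPUs is still respected.
    environ["ROCR_VISIBLE_DEVICES"] = val
```

In a script this could be called as `hand_off_hip_visible_devices(os.environ)` before Ray is initialized; whether that is early enough depends on when the visibility check runs, which this sketch does not address.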
sounds good!
@divakar-amd fixed, if you want to take a look again
houseroad
left a comment
Looks good to me. Will temporarily put this on hold until the internal ROCm upgrade is done. Sorry about the inconvenience.
benchmarks/kernels/benchmark_moe.py
Outdated
(nit) accessibility . -> accessibility.
@xw285cornell could you rebase again? We should be good to merge this PR :-)
sounds good, let me rebase
done, rebased :)
houseroad
left a comment
Looks good. Could you address the lint?
Signed-off-by: Lu Fang <[email protected]>
houseroad
left a comment
Anyway, I linted it :-)
oh sorry, didn't notice the lint error, thanks!
…llm-project#16263) Signed-off-by: Lu Fang <[email protected]> Co-authored-by: Lu Fang <[email protected]>
This PR improves the device name handling and adds tuning files for llama4 Maverick.
For OAM, amdsmi_get_gpu_asic_info returns MI300X-O, which is an abbreviation. Change to amdsmi_get_gpu_board_info, which seems to be a more reliable source of naming. Need to confirm with AMD whether that applies to all MI300 SKUs.
When running the benchmark_moe script, it fails with an invalid device ordinal error. This is because Ray sets up ROCR_VISIBLE_DEVICES correctly, so each subprocess can only see one device. So we first check whether ROCR_VISIBLE_DEVICES is set; if so, we skip the torch.cuda.device context manager.
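The conditional device guard described above can be sketched roughly as follows. This is not the code from the PR: `maybe_device_guard` is a hypothetical helper, and the guard factory is passed in (rather than importing torch) only so the sketch runs without a GPU. In the benchmark script the factory would presumably be `torch.cuda.device`.

```python
import contextlib
import os
from typing import Callable, ContextManager, Mapping, Optional


def maybe_device_guard(
    device_id: int,
    make_guard: Callable[[int], ContextManager],
    environ: Optional[Mapping[str, str]] = None,
) -> ContextManager:
    """Hypothetical helper: skip the device guard under ROCR_VISIBLE_DEVICES.

    When ROCR_VISIBLE_DEVICES is set, the process is assumed to see only
    the device chosen for it, so selecting `device_id` explicitly could
    point at an ordinal that does not exist in this process.
    """
    env = os.environ if environ is None else environ
    if env.get("ROCR_VISIBLE_DEVICES"):
        # The visible device is already restricted: use a no-op context.
        return contextlib.nullcontext()
    return make_guard(device_id)


# Intended usage (assumes torch is installed; not executed here):
#   with maybe_device_guard(gpu_id, torch.cuda.device):
#       run_tuning()
```

Passing the factory as an argument keeps the decision logic separate from torch, which also makes it easy to check both branches with a stand-in context manager.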
Finally, add the missing tuning file for the 128-expert llama4 Maverick model.
Improving #16114