
Conversation

@wangxunx (Contributor) commented Oct 24, 2025

Summary

This PR fixes a cross-entropy loss failure when fine-tuning models with small vocab sizes, such as Mistral v0.3 7B, on AMD MI300 GPUs.

Context / Motivation

  • AMD GPUs with the CDNA3 architecture have different hardware resources than NVIDIA GPUs, so the default settings of the cross_entropy_loss kernel produce illegal launch parameters and crash training with: RuntimeError: Triton Error [HIP]: Code: 1, Message: invalid argument (a back-of-the-envelope sketch of why follows this list).
  • For models with small vocab sizes, such as Mistral, we can halve num_warps to reduce resource usage while leaving the returned logits unchanged.
  • This PR reduces num_warps to reasonable values so that Mistral SFT runs on AMD MI300, keeping the RETURN_LOGITS logic unchanged.
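
A plausible mechanism (my reading, not something stated in this PR): Triton sizes a thread block as num_warps × warp_size, and CDNA hardware uses 64-wide wavefronts where NVIDIA uses 32-wide warps, so the same num_warps requests twice as many threads per block on AMD and can exceed the per-block limit. The 1024-thread limit and the num_warps value below are illustrative assumptions:

```python
# Back-of-the-envelope sketch of the launch failure. The 1024-thread
# block limit and the wavefront/warp sizes are assumptions about
# MI300 (CDNA3) and typical NVIDIA parts, not values taken from this PR.
MAX_THREADS_PER_BLOCK = 1024

def threads_per_block(num_warps: int, warp_size: int) -> int:
    # Triton launches num_warps warps (wavefronts on AMD) per block.
    return num_warps * warp_size

num_warps = 32  # e.g. a choice tuned on NVIDIA hardware
print(threads_per_block(num_warps, warp_size=32))       # 1024 on NVIDIA: OK
print(threads_per_block(num_warps, warp_size=64))       # 2048 on CDNA3: exceeds limit
print(threads_per_block(num_warps // 2, warp_size=64))  # 1024 after halving: OK
```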

Changes

  • Halve num_warps in the single-chunk case on AMD CDNA architectures (a hedged sketch of the shape of this change follows).
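
A minimal sketch of what such a change might look like; the helper is_hip_cdna, the function pick_num_warps, and the baseline heuristic are hypothetical, since the actual diff is not shown in this conversation. Checking torch.version.hip is a common way to detect a ROCm build:

```python
import torch

def is_hip_cdna() -> bool:
    # torch.version.hip is None on CUDA builds and a version string on ROCm.
    # Treating any ROCm build as CDNA is a simplification; the real change
    # may inspect the arch name (e.g. gfx942 for MI300) instead.
    return torch.version.hip is not None

def pick_num_warps(vocab_size: int) -> int:
    # Hypothetical baseline heuristic: more warps for larger vocabs.
    num_warps = 32 if vocab_size >= 64 * 1024 else 16
    if is_hip_cdna():
        # CDNA wavefronts are 64 threads wide (vs. 32-wide NVIDIA warps),
        # so the same num_warps requests twice as many threads per block.
        # Halving keeps the launch within hardware limits.
        num_warps = max(num_warps // 2, 1)
    return num_warps
```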

Testing

  • Qwen3-14B / Mistral-v0.3-7B / Llama-3.2-1B and -3B SFT on MI300
  • Qwen3-14B / Mistral-v0.3-7B / Llama-3.2-1B and -3B SFT on RTX 4090

@danielhanchen (Contributor)

Ok thanks!

@danielhanchen merged commit fe9210d into unslothai:main on Oct 27, 2025