[Bug] Recurring matrix dimensions mismatch issue during GRPO training on 2 Nvidia A100s through GCP. #3518

```
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in method matmul of type object at 0x77cd34ddba20>(*(GradTrackingTensor(lvl=1, value=
    FakeTensor(..., device='cuda:0', size=(1, s17, s6), dtype=torch.bfloat16,
               requires_grad=True)
), GradTrackingTensor(lvl=1, value=
    FakeTensor(..., device='cuda:0', size=(2880, 201088), dtype=torch.bfloat16)
)), **{}): got RuntimeError('a and b must have same reduction dim, but got [s17, s6] X [2880, 201088].')
```
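
For anyone reading the trace: the message says the last dimension of the first operand (`s6`, a symbolic size whose concrete value is not shown) has to equal the first dimension of the second operand (`2880`). Below is a minimal sketch of that shape rule in plain Python, with no PyTorch dependency. The helper `matmul_output_shape` and the concrete sizes `7` and `4096` are hypothetical and chosen only for illustration; only the `(2880, 201088)` shape comes from the trace.

```python
def matmul_output_shape(a_shape, b_shape):
    """Return the result shape of `a @ b` for an N-D `a` (N >= 2) and a 2-D `b`.

    Raises ValueError when the reduction dims differ, mirroring the wording
    of the error in the trace. Hypothetical helper, not a PyTorch API.
    """
    if len(a_shape) < 2 or len(b_shape) != 2:
        raise ValueError("this sketch only handles an N-D `a` and a 2-D `b`")
    if a_shape[-1] != b_shape[0]:
        raise ValueError(
            "a and b must have same reduction dim, "
            f"but got {list(a_shape[-2:])} X {list(b_shape)}."
        )
    # Leading dims of `a` are kept; the reduction dim is replaced by b's columns.
    return tuple(a_shape[:-1]) + (b_shape[1],)


# Second operand shape, taken from the trace.
WEIGHT_SHAPE = (2880, 201088)

# Assumed sequence length 7 and a matching last dim of 2880: shapes line up.
ok = matmul_output_shape((1, 7, 2880), WEIGHT_SHAPE)

# Assumed last dim of 4096: reproduces the same kind of mismatch as the trace.
try:
    matmul_output_shape((1, 7, 4096), WEIGHT_SHAPE)
except ValueError as err:
    message = str(err)
```

This only illustrates what the error is checking. It does not say why `s6` differs from `2880` in the actual training run.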




Environment: 2 Nvidia 80G A100s on a single GCP VM, accessed over SSH through VS Code.
