[Bug] Recurring matrix dimension mismatch during GRPO training on 2 NVIDIA A100s on GCP #3518

@prakritishetty

Description

```
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in method matmul of type object at 0x77cd34ddba20>(*(GradTrackingTensor(lvl=1, value=
FakeTensor(..., device='cuda:0', size=(1, s17, s6), dtype=torch.bfloat16,
requires_grad=True)
), GradTrackingTensor(lvl=1, value=
FakeTensor(..., device='cuda:0', size=(2880, 201088), dtype=torch.bfloat16)
)), **{}): got RuntimeError('a and b must have same reduction dim, but got [s17, s6] X [2880, 201088].')
```

Environment: 2 NVIDIA A100 80 GB GPUs on a single GCP VM, accessed over SSH through VS Code.
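
For anyone reading the traceback: the check that fails is the usual matmul rule that the last dimension of the left operand must equal the first dimension of the right operand. Here that means the symbolic size `s6` would have to equal 2880. Below is a minimal pure-Python sketch of that rule, not PyTorch's actual implementation. The helper name `matmul_result_shape` and the concrete left-hand sizes are made up for illustration; only the `(2880, 201088)` shape comes from the traceback.

```python
def matmul_result_shape(a_shape, b_shape):
    """Return the shape of a @ b for an N-D `a` (N >= 2) and a 2-D `b`.

    Hypothetical helper that mirrors the reduction-dimension rule only.
    """
    if len(a_shape) < 2 or len(b_shape) != 2:
        raise ValueError("this sketch only handles (..., m, k) @ (k, n)")
    k_a, k_b = a_shape[-1], b_shape[0]
    if k_a != k_b:
        raise ValueError(
            f"reduction dims differ: {k_a} (last dim of a) vs {k_b} (first dim of b)"
        )
    # Leading dims of `a` are kept; the reduction dim is replaced by n.
    return tuple(a_shape[:-1]) + (b_shape[1],)


# Right-hand shape taken from the traceback; the left-hand sizes are
# made-up stand-ins for the symbolic dims s17 and s6.
WEIGHT_SHAPE = (2880, 201088)

print(matmul_result_shape((1, 5, 2880), WEIGHT_SHAPE))  # compatible case

try:
    matmul_result_shape((1, 5, 201088), WEIGHT_SHAPE)  # mismatched last dim
except ValueError as err:
    print(err)
```

If the left operand's last dimension turns out to be something other than 2880 at the failing call, that would suggest the wrong tensor (or a transposed weight) is reaching the matmul, but that is only a guess from the shapes shown above.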
