[BUG] ZeRO-3 inference/eval dataloader bypasses SamplerDataset wrapper #140

@seth-hazelnutt

Describe the bug

When running validation, test, or prediction with DeepSpeed ZeRO-3, the non-training datasets bypass the SamplerDataset wrapper, so their dataloaders can lack the .datasets attribute and the sampler behavior that the training dataloader path provides. This appears to break or destabilize ZeRO-3 inference/eval flows in the current OpenFold3-preview codebase.

To reproduce

  1. Configure a non-training dataset (validation, test, or prediction) and run it under ZeRO-3.
  2. Use the existing DeepSpeed ZeRO-3 config and the DataModule setup path for non-training modes.
  3. Observe failures or instability in sampler/dataloader behavior before or during evaluator execution.
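
A quick way to confirm the symptom is to check whether the dataset behind each dataloader exposes .datasets. The snippet below is only a sketch: missing_datasets_attr is a hypothetical helper, and the two loaders are plain stand-in objects rather than real DataModule output. It relies only on the loader having a .dataset attribute, as torch.utils.data.DataLoader does.

```python
from types import SimpleNamespace


def missing_datasets_attr(loader):
    """Return True if the dataset behind `loader` has no `.datasets` attribute."""
    return not hasattr(loader.dataset, "datasets")


# Stand-ins for real dataloaders (hypothetical, for illustration only).
# The training-like loader holds a wrapped dataset that exposes `.datasets`.
train_like = SimpleNamespace(dataset=SimpleNamespace(datasets=["ds_a"]))
# The eval-like loader holds a bare dataset with no wrapper around it.
eval_like = SimpleNamespace(dataset=["sample_0", "sample_1"])

print(missing_datasets_attr(train_like))  # False
print(missing_datasets_attr(eval_like))   # True
```

In a real run, the same check could be applied to the validation/test/prediction dataloaders after DataModule.setup().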

Expected behavior

Validation/test/prediction dataloaders should mirror the training dataset wrapper behavior, including the .datasets attribute, and preserve sampler semantics so that ZeRO-3 runs without data-module errors.

Notes from TCE fork

Our fork fixed this by:

  • wrapping validation/test/prediction datasets in SamplerDataset during DataModule.setup().
  • setting epoch_len=len(dataset) on the wrapper so the eval sampler length equals the dataset size and every sample is visited once.
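
Roughly, the two changes look like the sketch below. This is not the actual patch: SamplerDatasetSketch is a hypothetical stand-in for SamplerDataset (the real constructor signature may differ), wrap_eval_dataset is an illustrative helper name, and the dataset is placeholder data.

```python
class SamplerDatasetSketch:
    """Hypothetical stand-in for the SamplerDataset wrapper; not the real class."""

    def __init__(self, datasets, epoch_len):
        self.datasets = list(datasets)  # attribute that downstream code expects
        self.epoch_len = epoch_len
        # Flat (dataset index, item index) lookup in original order, no resampling.
        self._index = [
            (d, i) for d, ds in enumerate(self.datasets) for i in range(len(ds))
        ]

    def __len__(self):
        return self.epoch_len

    def __getitem__(self, idx):
        d, i = self._index[idx]
        return self.datasets[d][i]


def wrap_eval_dataset(dataset):
    """Wrap one eval dataset so it exposes `.datasets` and keeps its length."""
    return SamplerDatasetSketch([dataset], epoch_len=len(dataset))


val_dataset = ["sample_0", "sample_1", "sample_2"]  # placeholder data
wrapped = wrap_eval_dataset(val_dataset)

print(len(wrapped))                  # same length as the underlying dataset
print(hasattr(wrapped, "datasets"))  # wrapper exposes the expected attribute
```

The point of epoch_len=len(dataset) is that eval iterates each sample exactly once in order, rather than drawing a training-style epoch of arbitrary length.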

If useful, I can provide the exact patch/commit references from the fork.
