@Sachin-0001

This PR adds an optional return_file_name parameter to the JSON dataset loader.

When enabled, a new file_name column is added containing the source file name
for each row. Default behavior is unchanged.

Changes:

  • Add return_file_name to JsonConfig
  • Append file name during JSON table generation
  • Add tests covering the default and enabled behavior, and verifying that other functionality is not affected
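
To illustrate the intended behavior, here is a minimal standalone sketch using only the standard library. It is not the loader code from this PR: `read_json_lines` is a hypothetical helper, and storing the base name (rather than the full path) in `file_name` is an assumption. With the PR applied, the option would presumably be passed as a keyword to `load_dataset("json", ...)`.

```python
import json
import os
import tempfile


def read_json_lines(paths, return_file_name=False):
    """Yield one dict per JSON Lines record, optionally tagged with its source file.

    Hypothetical helper that only mimics the behavior described in this PR.
    """
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                row = json.loads(line)
                if return_file_name:
                    # Assumption: the column holds the base name, not the full path.
                    row["file_name"] = os.path.basename(path)
                yield row


with tempfile.TemporaryDirectory() as tmp:
    # Write two small shards to read back.
    paths = []
    for name, values in [("shard-0.jsonl", [1, 2]), ("shard-1.jsonl", [3])]:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as f:
            for v in values:
                f.write(json.dumps({"x": v}) + "\n")
        paths.append(path)

    # Default: rows are unchanged, no extra column.
    default_rows = list(read_json_lines(paths))
    # Enabled: every row carries the name of the file it came from.
    tagged_rows = list(read_json_lines(paths, return_file_name=True))
```

With the flag off, the rows contain only the original `x` field; with it on, each row also has a `file_name` entry naming its shard.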

Motivation:
This helps resume training from checkpoints by identifying already-consumed data shards.
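
As a rough sketch of that use case, assuming a checkpoint records the `file_name` values of shards that were fully consumed, the remaining shards could be selected on resume as follows. `remaining_shards` and the file paths are hypothetical and not part of this PR.

```python
import os


def remaining_shards(data_files, consumed_file_names):
    """Return the data files whose base name has not been recorded as consumed."""
    consumed = set(consumed_file_names)
    return [p for p in data_files if os.path.basename(p) not in consumed]


# Hypothetical shard list and checkpoint contents.
data_files = ["data/shard-0.jsonl", "data/shard-1.jsonl", "data/shard-2.jsonl"]
consumed_file_names = ["shard-0.jsonl", "shard-1.jsonl"]

# Only the shards not yet seen would be passed to the loader on resume.
to_load = remaining_shards(data_files, consumed_file_names)
```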

Fixes #5806

Development

Successfully merging this pull request may close these issues.

Return the name of the currently loaded file in the load_dataset function.