Skip to content

Duplicate data_files when named <split>/<split>.parquet #6272

@lhoestq

Description

@lhoestq

e.g. with u23429/stock_1_minute_ticker

In [1]: from datasets import *

In [2]: b = load_dataset_builder("u23429/stock_1_minute_ticker")
Downloading readme: 100%|██████████████████████████| 627/627 [00:00<00:00, 246kB/s]

In [3]: b.config.data_files
Out[3]: 
{NamedSplit('train'): ['hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/train/train.parquet',
  'hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/train/train.parquet'],
 NamedSplit('validation'): ['hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/validation/validation.parquet',
  'hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/validation/validation.parquet'],
 NamedSplit('test'): ['hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/test/test.parquet',
  'hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/test/test.parquet']}

This bug issue is present in the current datasets 2.14.5 and also on main even after #6244 cc @mariosasko

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions