Very slow data loading on large dataset #546

@agemagician

Description

I made a simple Python script to check the nlp library's loading speed on 1.1 TB of textual data.
It has been 8 hours and it is still on the loading step.
It works when the text dataset is small (about 1 GB), but it doesn't scale.
It also uses only a single thread during the data loading step.

import glob
import random

import nlp

# Collect all text files and shuffle them before loading.
train_files = glob.glob("xxx/*.txt", recursive=True)
random.shuffle(train_files)

print(train_files)

dataset = nlp.load_dataset('text',
                           data_files=train_files,
                           name="customDataset",
                           version="1.0.0",
                           cache_dir="xxx/nlp")

Is there something that I am missing?
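
For reference, one workaround I am considering is sharding the file list and loading each shard in a separate process. This is only a rough sketch: I am assuming that nlp.concatenate_datasets can merge the per-shard datasets, that load_dataset is safe to call from worker processes sharing one cache_dir, and that the returned datasets can be sent back across process boundaries. The shard count of 8 is a hypothetical value.

import glob
import random
from multiprocessing import Pool

import nlp


def load_shard(shard_files):
    # Each worker process builds a dataset from its own subset of files.
    # split="train" returns a Dataset rather than a DatasetDict.
    return nlp.load_dataset("text", data_files=shard_files,
                            cache_dir="xxx/nlp", split="train")


if __name__ == "__main__":
    train_files = glob.glob("xxx/*.txt", recursive=True)
    random.shuffle(train_files)

    num_shards = 8  # hypothetical value; tune to the number of cores
    shards = [train_files[i::num_shards] for i in range(num_shards)]

    with Pool(num_shards) as pool:
        shard_datasets = pool.map(load_shard, shards)

    # Assumption: concatenate_datasets can merge shards with identical features.
    dataset = nlp.concatenate_datasets(shard_datasets)

Even if this works, it only papers over the problem: the per-file loading itself still seems far too slow for a 1.1 TB corpus.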
