Very slow data loading on large dataset #546

@agemagician

Description

I made a simple Python script to check the nlp library's loading speed on 1.1 TB of textual data.
It has been 8 hours and it is still on the loading step.
It works when the text dataset is small (about 1 GB), but it doesn't scale.
It also uses only a single thread during the data loading step.

import glob
import random

import nlp

# Collect all text files and shuffle them before loading.
train_files = glob.glob("xxx/*.txt", recursive=True)
random.shuffle(train_files)

print(train_files)

dataset = nlp.load_dataset('text',
                           data_files=train_files,
                           name="customDataset",
                           version="1.0.0",
                           cache_dir="xxx/nlp")

Is there something that I am missing?
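
For reference, one workaround I am considering is sharding the file list and loading each shard in a separate process. This is only a rough sketch: I am assuming that nlp.concatenate_datasets can merge the per-shard datasets, that load_dataset is safe to call from worker processes sharing one cache_dir, and that the returned datasets can be sent back across process boundaries. The shard count of 8 is a hypothetical value.

import glob
import random
from multiprocessing import Pool

import nlp


def load_shard(shard_files):
    # Each worker process builds a dataset from its own subset of files.
    # split="train" returns a Dataset rather than a DatasetDict.
    return nlp.load_dataset("text", data_files=shard_files,
                            cache_dir="xxx/nlp", split="train")


if __name__ == "__main__":
    train_files = glob.glob("xxx/*.txt", recursive=True)
    random.shuffle(train_files)

    num_shards = 8  # hypothetical value; tune to the number of cores
    shards = [train_files[i::num_shards] for i in range(num_shards)]

    with Pool(num_shards) as pool:
        shard_datasets = pool.map(load_shard, shards)

    # Assumption: concatenate_datasets can merge shards with identical features.
    dataset = nlp.concatenate_datasets(shard_datasets)

Even if this works, it only papers over the problem: the per-file loading itself still seems far too slow for a 1.1 TB corpus.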
