I made a simple Python script to check the NLP library's loading speed; it loads 1.1 TB of textual data.
It has been 8 hours and it is still on the loading step.
It works when the text dataset is small, around 1 GB, but it doesn't scale.
It also uses only a single thread during the data loading step.
import glob
import random

import nlp

# Gather all text files and shuffle their order.
train_files = glob.glob("xxx/*.txt", recursive=True)
random.shuffle(train_files)
print(train_files)

# Load everything with the generic 'text' loader.
dataset = nlp.load_dataset('text',
                           data_files=train_files,
                           name="customDataset",
                           version="1.0.0",
                           cache_dir="xxx/nlp")
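In case it helps frame the question, here is a sketch of the kind of workaround I had in mind (untested): shard the file list and load each shard in its own process, then concatenate the pieces, since a single load_dataset call appears to stay on one thread. It assumes nlp.concatenate_datasets is available in the installed version, that the 'train' split is what the text loader returns for plain data_files, and that parallel calls can safely share a cache_dir.

import glob
import random
from multiprocessing import Pool

import nlp

def load_shard(files):
    # Hypothetical helper: each worker loads its own subset of the text files.
    # Returning the Dataset requires it to be picklable back to the parent.
    return nlp.load_dataset('text', data_files=files, cache_dir="xxx/nlp")['train']

if __name__ == "__main__":
    train_files = glob.glob("xxx/*.txt", recursive=True)
    random.shuffle(train_files)
    num_workers = 8  # assumption: tune to the machine
    shards = [train_files[i::num_workers] for i in range(num_workers)]
    with Pool(num_workers) as pool:
        parts = pool.map(load_shard, shards)
    # Assumes nlp.concatenate_datasets exists in this version of the library.
    dataset = nlp.concatenate_datasets(parts)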
Is there something that I am missing?