As discussed on Slack, `to_json` appears to be quite slow on huge datasets like OSCAR.
I implemented sharded saving, which is much faster, but the tqdm bars all overwrite each other, so it's hard to make sense of the progress. If possible, it would be ideal for this multi-proc support to be implemented internally in `to_json` via a `num_proc` argument. I guess `num_proc` would be the number of shards?
I think the user will need to use this feature wisely, since too many processes writing to, say, a regular spinning hard drive is likely to be slower than a single process.
I'm not sure whether the user or `datasets` should be responsible for concatenating the shards at the end; either way works for my needs.
The code I was using:
from multiprocessing import cpu_count, Process

[...]

filtered_dataset = concat_dataset.map(filter_short_documents, batched=True, batch_size=256, num_proc=cpu_count())

DATASET_NAME = "oscar"
SHARDS = 10

def process_shard(idx):
    # Each process takes one contiguous slice of the dataset and writes it to its own file
    print(f"Sharding {idx}")
    ds_shard = filtered_dataset.shard(SHARDS, idx, contiguous=True)
    # ds_shard = ds_shard.shuffle() # remove contiguous=True above if shuffling
    print(f"Saving {DATASET_NAME}-{idx}.jsonl")
    ds_shard.to_json(f"{DATASET_NAME}-{idx}.jsonl", orient="records", lines=True, force_ascii=False)

# Start one process per shard, then wait for all of them to finish
processes = [Process(target=process_shard, args=(idx,)) for idx in range(SHARDS)]
for p in processes:
    p.start()
for p in processes:
    p.join()
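
If concatenation ends up being left to the user, a minimal sketch of that step could look like the following. The `concat_shards` helper is hypothetical and not part of `datasets`; it assumes the shards are JSON Lines files like the ones written above, and it only uses the standard library.

```python
import shutil


def concat_shards(shard_paths, out_path):
    """Append each JSON Lines shard to out_path, in the order given (hypothetical helper)."""
    with open(out_path, "wb") as out:
        for shard_path in shard_paths:
            with open(shard_path, "rb") as shard:
                # Stream the shard in chunks so huge files never have to fit in memory
                shutil.copyfileobj(shard, out)
                # If a non-empty shard lacks a trailing newline, add one so that
                # its last record does not get glued to the next shard's first record
                if shard.tell() > 0:
                    shard.seek(-1, 2)
                    if shard.read(1) != b"\n":
                        out.write(b"\n")
```

With the names from the code above, this would be called as `concat_shards([f"{DATASET_NAME}-{idx}.jsonl" for idx in range(SHARDS)], f"{DATASET_NAME}.jsonl")` after all processes have been joined.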
Thank you!
@lhoestq