[`to_json`] add multi-proc sharding support

As discussed on Slack, it appears that `to_json` is quite slow on huge datasets like OSCAR.

I implemented sharded saving, which is much, much faster, but the tqdm bars all overwrite each other, so it's hard to make sense of the progress. If possible, this multi-proc support would ideally be implemented internally in `to_json` via a `num_proc` argument. I guess `num_proc` would be the number of shards?

I think the user will need to use this feature wisely, since too many processes writing to, say, a regular hard disk are likely to be slower than one process.

I'm not sure whether the user or `datasets` should be responsible for concatenating the shards at the end; either way works for my needs.
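If concatenation is left to the user, it could look roughly like the sketch below. The helper name `concatenate_shards` is hypothetical and not part of `datasets`; it assumes the shards are JSON Lines files and that appending them in the given order is the desired result. Since I'm not certain every shard ends with a newline, the sketch adds one where it is missing.

```python
import os
import shutil
from pathlib import Path


def concatenate_shards(shard_paths, output_path):
    """Append each shard file to output_path, in the order given.

    Hypothetical helper, not part of `datasets`.
    """
    with open(output_path, "wb") as out:
        for shard_path in shard_paths:
            with open(shard_path, "rb") as shard:
                # Stream in chunks so a huge shard never has to fit in memory.
                shutil.copyfileobj(shard, out)
                # If a non-empty shard lacks a trailing newline, add one so the
                # next shard's first record does not fuse with this one's last.
                if shard.tell() > 0:
                    shard.seek(-1, os.SEEK_END)
                    if shard.read(1) != b"\n":
                        out.write(b"\n")
    return Path(output_path)


# Example usage (file names are illustrative):
# concatenate_shards(["oscar-0.jsonl", "oscar-1.jsonl"], "oscar.jsonl")
```

Passing the shard paths in index order keeps the original row order when the shards are contiguous.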

The code I was using:

```
from multiprocessing import Process, cpu_count

[...]

filtered_dataset = concat_dataset.map(filter_short_documents, batched=True, batch_size=256, num_proc=cpu_count())

DATASET_NAME = "oscar"
SHARDS = 10

def process_shard(idx):
    # Write one contiguous shard of the dataset to its own JSON Lines file.
    print(f"Sharding {idx}")
    ds_shard = filtered_dataset.shard(SHARDS, idx, contiguous=True)
    # ds_shard = ds_shard.shuffle() # remove contiguous=True above if shuffling
    path = f"{DATASET_NAME}-{idx}.jsonl"
    print(f"Saving {path}")
    ds_shard.to_json(path, orient="records", lines=True, force_ascii=False)

# One process per shard; each process writes its own file.
processes = [Process(target=process_shard, args=(idx,)) for idx in range(SHARDS)]
for p in processes:
    p.start()

for p in processes:
    p.join()
```
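For reference, here is a rough sketch of what `contiguous=True` implies for the row ranges, which is why concatenating the shards in index order should restore the original order. The function `contiguous_shard_bounds` is hypothetical, and the way the remainder is distributed (the first `num_rows % num_shards` shards get one extra row) is an assumption about `shard`, not something verified here.

```python
def contiguous_shard_bounds(num_rows, num_shards, index):
    """Return the half-open row range [start, end) covered by shard `index`.

    Hypothetical helper; the remainder handling is an assumption.
    """
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    div, mod = divmod(num_rows, num_shards)
    # Assumed policy: the first `mod` shards each take one extra row.
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return start, end


# 25 rows split into 10 shards: five shards of 3 rows, then five of 2 rows.
bounds = [contiguous_shard_bounds(25, 10, idx) for idx in range(10)]
```

Each range starts where the previous one ends, so no row is dropped or duplicated across shards.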

Thank you!

@lhoestq 
