
[to_json] add multi-proc sharding support #2663

Description

@stas00

As discussed on Slack, it appears that to_json is quite slow on huge datasets like OSCAR.

I implemented sharded saving, which is much, much faster, but the tqdm bars all overwrite each other, making the progress hard to follow. So, if possible, it would be ideal to implement this multi-proc support internally in to_json via a num_proc argument. I guess num_proc would also be the number of shards?
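
For illustration, the call might then look something like this (note that num_proc is the proposed argument here, not something to_json accepts today):

# hypothetical usage of the proposed num_proc argument
filtered_dataset.to_json("oscar.jsonl", orient="records", lines=True, force_ascii=False, num_proc=10)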

I think users will need to apply this feature wisely, since too many processes writing to, say, a conventional spinning hard drive is likely to be slower than a single process.

I'm not sure whether the user should be responsible for concatenating the shards at the end, or whether datasets should do it; either way works for my needs (a minimal concatenation sketch follows the code below).

The code I was using:

from multiprocessing import cpu_count, Process

[...]

filtered_dataset = concat_dataset.map(filter_short_documents, batched=True, batch_size=256, num_proc=cpu_count())

DATASET_NAME = "oscar"
SHARDS = 10
# write one contiguous shard of the dataset to its own JSONL file
def process_shard(idx):
    print(f"Sharding {idx}")
    ds_shard = filtered_dataset.shard(SHARDS, idx, contiguous=True)
    # ds_shard = ds_shard.shuffle() # remove contiguous=True above if shuffling
    print(f"Saving {DATASET_NAME}-{idx}.jsonl")
    ds_shard.to_json(f"{DATASET_NAME}-{idx}.jsonl", orient="records", lines=True, force_ascii=False)

# spawn one process per shard and save all shards in parallel
processes = [Process(target=process_shard, args=(idx,)) for idx in range(SHARDS)]
for p in processes:
    p.start()

for p in processes:
    p.join()
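
For completeness, a minimal sketch of the concatenation step if it stays on the user side (reusing DATASET_NAME and SHARDS from above; the merged filename is just an example):

# merge the per-shard JSONL files into one, preserving shard order
with open(f"{DATASET_NAME}.jsonl", "w") as merged:
    for idx in range(SHARDS):
        with open(f"{DATASET_NAME}-{idx}.jsonl") as shard:
            for line in shard:
                merged.write(line)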

Thank you!
