Conversation

@lhoestq (Member) commented Jun 9, 2022

Currently `.shard()` and `.select()` always create an indices mapping. However, if the requested data is contiguous, it is much more efficient to simply slice the Arrow table instead of building an indices mapping. In particular:

  • the shard/select operation will be much faster
  • reading speed will be much faster in the resulting dataset, since it won't have to do a lookup step in the indices mapping
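
The contrast between the two read paths can be sketched with a plain-Python stand-in for the Arrow table (illustrative only, not the actual Arrow code):

```python
# Plain-Python stand-in for the Arrow table, to contrast the two read paths.
table = list(range(1_000_000))

# With an indices mapping: one lookup (indirection) per row read.
indices_mapping = list(range(250_000))
via_mapping = [table[i] for i in indices_mapping]

# Contiguous selection: a single slice, no per-row lookups.
via_slice = table[:250_000]

assert via_mapping == via_slice
```

The slice is a single bulk operation, which is why skipping the mapping makes both the selection and all subsequent reads faster.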

Since `.shard()` is also used by `.map()` with `num_proc>1`, this also significantly improves the reading speed of multiprocessed `.map()` operations.

Here is an example of speed-up:

>>> import io
>>> import numpy as np
>>> from datasets import Dataset
>>> ds = Dataset.from_dict({"a": np.random.rand(10_000_000)})
>>> shard = ds.shard(num_shards=4, index=0, contiguous=True)  # this calls `.select(range(2_500_000))`
>>> buf = io.BytesIO()
>>> %time shard.to_json(buf)
Creating json from Arrow format: 100%|██████████████████| 100/100 [00:00<00:00, 376.17ba/s]
CPU times: user 258 ms, sys: 9.06 ms, total: 267 ms
Wall time: 266 ms

while previously it was

Creating json from Arrow format: 100%|███████████████████| 100/100 [00:03<00:00, 29.41ba/s]
CPU times: user 3.33 s, sys: 69.1 ms, total: 3.39 s
Wall time: 3.4 s

In this simple case the speed-up is x10, but @sayakpaul experienced a x100 speed-up on their data when exporting to JSON.

Implementation details

I mostly improved `.select()`: it now checks whether the input corresponds to a contiguous chunk of data, and if so it slices the main Arrow table (or the indices mapping table if one exists). To determine whether the input indices are contiguous it handles two cases:

  • if `indices` is a `range`, it checks that `start >= 0` and `step == 1`
  • otherwise, in the general case, it iterates over the indices: if they are all contiguous we're good, otherwise we have to build an indices mapping

Having to iterate over the indices doesn't cause performance issues IMO because:

  • either they are contiguous, and in this case the cost of iterating over the indices is much lower than the cost of creating an indices mapping
  • or they are not contiguous, and then iteration generally stops quickly, as soon as it encounters the first index that breaks contiguity
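
A minimal standalone sketch of that check (illustrative names, not the actual library code):

```python
import itertools
from typing import Iterable, Optional, Tuple

def contiguous_span(indices: Iterable[int]) -> Optional[Tuple[int, int]]:
    """Return (start, length) if `indices` is a contiguous ascending run
    starting at a non-negative index, else None (an indices mapping is needed)."""
    # Fast path: a step-1 range with non-negative start is contiguous by construction.
    if isinstance(indices, range):
        if indices.step == 1 and indices.start >= 0:
            return indices.start, len(indices)
        return None
    # General case: compare against a counter; stops at the first mismatch.
    it = iter(indices)
    try:
        start = next(it)
    except StopIteration:
        return 0, 0  # empty selection -> empty contiguous slice
    if start < 0:
        return None
    length = 1
    for expected, i in zip(itertools.count(start + 1), it):
        if i != expected:
            return None
        length += 1
    return start, length
```

For example, `contiguous_span([3, 4, 5])` finds a contiguous run, while `[0, 2, 3]` bails out at the first gap.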

@lhoestq lhoestq requested a review from albertvillanova June 9, 2022 13:45
@HuggingFaceDocBuilderDev commented Jun 9, 2022

The documentation is not available anymore as the PR was closed or merged.

@sayakpaul (Member) commented:

I thought I'd just mention the benefits I got. Here's the code that @lhoestq provided:

import os
from datasets import Dataset, load_dataset
from tqdm.auto import tqdm

ds = load_dataset("squad", split="train")
os.makedirs("tmp", exist_ok=True)

num_shards = 5
size = len(ds) // num_shards
for index in tqdm(range(num_shards)):
    # slice the Arrow table directly instead of building an indices mapping
    shard = Dataset(ds.data.slice(size * index, size), fingerprint=f"{ds._fingerprint}_{index}")
    shard.to_json(f"tmp/data_{index}.jsonl")

It took 1.64s. Previously the code was:

num_shards = 5
for index in tqdm(range(num_shards)):
    shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.to_json(f"tmp/data_{index}.jsonl")
    # upload_to_gcs(f"tmp/data_{index}.jsonl")

That took 2min 31s.

I ran it on my humble MacBook Pro.

@albertvillanova (Member) left a comment:

Thanks, good performance gain!!!

Just some comments/questions below.

Comment on lines 3043 to 3044
except StopIteration:
    return self._select_contiguous(0, 0, new_fingerprint=new_fingerprint)

Naive question: which use case is this for?

@lhoestq (Member, Author) replied Jun 14, 2022:
It's in case `indices` is an empty iterable; let me add a comment.

@albertvillanova (Member) left a comment:

See additional comment on performance.

Comment on lines 3041 to 3047
try:
    start = next(iter(indices))
except StopIteration:
    return self._select_contiguous(0, 0, new_fingerprint=new_fingerprint)
if start >= 0:
    counter_from_start = itertools.count(start=start)
    if all(i == j for i, j in zip(indices, counter_from_start)):

Also note this implementation has an overhead for `np.array` and `pd.Series`, compared to a regular Python list:

In [34]: def check_with_counter(indices):
    ...:     start = next(iter(indices))
    ...:     counter_from_start = itertools.count(start=start)
    ...:     if all(i == j for i, j in zip(indices, counter_from_start)):
    ...:         return True
    ...:     else:
    ...:         return False

In [81]: lis = list(range(10_000_000))

In [82]: %timeit check_with_counter(lis)
657 ms ± 5.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [83]: arr = np.array(range(10_000_000))

In [84]: %timeit check_with_counter(arr)
2.43 s ± 37.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [85]: ser = pd.Series(list(range(10_000_000)))

In [86]: %timeit check_with_counter(ser)
1.35 s ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
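
One way to avoid that overhead for array inputs (an illustrative alternative using NumPy, not part of this PR) is to vectorize the check instead of iterating element by element:

```python
import numpy as np

def check_contiguous_np(arr: np.ndarray) -> bool:
    """Vectorized contiguity check for a 1-D integer array:
    contiguous iff the start is non-negative and every step between
    consecutive indices is exactly 1."""
    if arr.size == 0:
        return True
    return bool(arr[0] >= 0 and (np.diff(arr) == 1).all())
```

This trades the early-exit behavior of the counter-based scan for a single vectorized pass, which is typically much faster on large NumPy inputs.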

@lhoestq (Member, Author) replied:

Thanks for the info! In the docstring of `select` we should maybe encourage users to pass a `range` instead of a list/array/series in such cases anyway.
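
To make the difference concrete, a small hypothetical benchmark (illustrative names, not library code) showing why a `range` is cheaper to validate than a materialized list:

```python
import itertools
import timeit

n = 1_000_000
as_range = range(n)
as_list = list(as_range)

def check(indices):
    # `range` fast path: O(1) attribute checks; any other iterable needs an O(n) scan.
    if isinstance(indices, range):
        return indices.step == 1 and indices.start >= 0
    it = iter(indices)
    start = next(it, 0)
    return start >= 0 and all(i == j for i, j in zip(it, itertools.count(start + 1)))

print("range:", timeit.timeit(lambda: check(as_range), number=100))
print("list: ", timeit.timeit(lambda: check(as_list), number=100))
```

The `range` variant never touches individual elements, so its cost is independent of the dataset size.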

@lhoestq (Member, Author) commented Jun 14, 2022

I addressed your comments, @albertvillanova. Let me know what you think :)

@albertvillanova (Member) left a comment:

Thanks, good job.

@lhoestq lhoestq merged commit 5994036 into master Jun 14, 2022
@lhoestq lhoestq deleted the optimize-contiguous-shard-and-select branch June 14, 2022 15:54