support LargeListArray in pyarrow #4800
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Hi, thanks for working on this! Can you run …
Hi, I have fixed the code quality error and added a test.
It seems that CI fails due to a lack of memory when allocating a large array, while the test passes locally.
Also, the current implementation of the NumPy-to-PyArrow conversion creates a lot of copies, which is not ideal for large arrays. We can improve performance significantly if we rewrite this part (`src/datasets/features/features.py`, lines 1322 to 1323 at 83f695c) as:

```python
values = pa.array(arr.ravel(), type=type)
```
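The copy savings come from NumPy semantics: `ravel()` returns a view when the array is contiguous, while `flatten()` always copies. A quick sketch of the difference:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# ravel() returns a view when the data is C-contiguous: no copy is made
view = arr.ravel()
assert np.shares_memory(view, arr)

# flatten() always allocates a new buffer
copy = arr.flatten()
assert not np.shares_memory(copy, arr)
```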
@XWwwwww Feel free to ignore #4800 (comment) and revert the changes you've made to address it. Without copying the array, this would be possible:

```python
arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
])
dset = Dataset.from_dict({"data": [arr]})
arr[0][0] = 100  # this change would be reflected in dset's PyArrow table -> a breaking change and probably unexpected by the user
```
Oh, that makes sense.
The tests passed on Ubuntu but failed on Windows.
@mariosasko Hi, do you have any clue about this failure on Windows?
Perhaps we can skip the added test on Windows then. Not sure if this can help, but the ERR tool available on Windows outputs the following for the returned error code:
What's the proper way to skip the added test on Windows?
@mariosasko Hi, any idea about this? :)
Hi again! We want to skip the test on Windows but not on Linux. You can use this decorator to do so:

```python
@pytest.mark.skipif(
    os.name == "nt" and (os.getenv("CIRCLECI") == "true" or os.getenv("GITHUB_ACTIONS") == "true"),
    reason="The Windows CI runner does not have enough RAM to run this test",
)
@pytest.mark.parametrize(...)
def test_large_array_xd_with_np(...):
    ...
```
CI on Windows is still stuck :(
@mariosasko Hi, could you please take a look at this issue?
@mariosasko Hi, all checks have passed, and we are finally ready to merge this PR :)
@lhoestq @albertvillanova Perhaps other maintainers can take a look and merge this PR :)
lhoestq left a comment:
Thanks for fixing this! I left a few comments.
```python
    values = pa.ListArray.from_arrays(offsets, values)
else:
    offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int64())
    values = pa.LargeListArray.from_arrays(offsets, values)
```
Have you tried using pa.chunked_array instead of pa.LargeListArray? (i.e., chunking the input NumPy array into small arrays to get a list of pa.ListArrays that you concatenate with pa.chunked_array.)
In the rest of the codebase we don't support LargeListArray, so it could lead to issues when doing type inference or type casting.
Do you mean that it will be ok to return a pa.ChunkedArray?
I think so but I haven't tested
I implemented this idea:
```python
MAX_CHUNK_SIZE = (1 << 31) - 1  # int32 offset limit of a regular ListArray
num_offset_per_chunk = MAX_CHUNK_SIZE // step_offsets
chunk_len = num_offset_per_chunk * step_offsets
num_chunks = math.ceil(n_offsets / num_offset_per_chunk)
chunked_arr = np.resize(arr, (math.ceil(arr.flatten().shape[0] / chunk_len), chunk_len))
values = []
for i in range(num_chunks):
    chunk_values = pa.array(chunked_arr[i : i + 1, :].flatten(), type=type)
    start = i * num_offset_per_chunk
    end = min(start + num_offset_per_chunk, n_offsets)
    chunk_offsets = pa.array(np.arange(end - start + 1) * step_offsets)
    chunk_values = pa.ListArray.from_arrays(chunk_offsets, chunk_values)
    values.append(chunk_values)
values = pa.chunked_array(values)
```

However, I still suggest using pa.LargeListArray for two reasons.
1. It would be more complex to handle the case where `step_offsets >= (1 << 31)`, and my current implementation would fail on it.
2. I tested the speed of building a PyArrow list array from a NumPy array of shape (50000, 50000), and found that implementing it with LargeListArray brings a roughly 20x speedup over my current chunked implementation.
Oh yeah, indeed. I'm not sure it would even work if the shape were (1, 50000, 50000), since I believe a ListArray can't have a ChunkedArray as values.
Let's go for LargeListArray then, though we'll have to update a few things in table.py. That can be handled later, I think (let me know if that's something you'd like to help with!).
Considering that @mariosasko may be too busy these days, shall we merge this PR and support this important feature? @lhoestq
we're discussing this internally and doing some tests to make sure - will keep you posted ;)
From internal discussions, this code would fail because the ArrowWriter can't start writing large lists to disk if the schema has been determined to contain regular lists:

```python
import numpy as np
import datasets

a = np.zeros((5000000, 768), np.uint8)
datasets.Dataset.from_dict({"id": range(2)}).map(
    lambda x: {"a": a if x["id"] else a[:1]}, writer_batch_size=1, new_fingerprint="foo"
)
# ArrowInvalid: Array of type large_list<item: uint8> too large to convert to list<item: uint8>
```

This means that whenever someone runs map with arrays of varying lengths, and at some point an array doesn't fit into a regular ListArray, this will fail.
In this case we expect to let the user specify in advance that the large list type must be used.
Therefore I think we need to add a parameter to Sequence to differentiate between regular lists and large lists. Maybe `Sequence(..., large=True)`?
This way we can also always verify

```python
schema == Features.from_arrow_schema(schema).arrow_schema
```

I hope that makes sense ^^'
Got it! Could you please provide some instructions on how to support this, e.g., which functions I have to update?
I think the key is to add the `large` parameter to Sequence and update the functions you modified in this PR to use `pa.list_()` if `large` is False, and `pa.large_list()` otherwise.
The same issue comes from pyarrow. Is there a solution for this?

Generating train split: 0 examples [01:22, ? examples/s]
When will this feature be added to the newest version?
LargeListArray support is not ready yet; there is one remaining change:
Gents, any movement on this? Converting a large list of dicts to Datasets is a nightmare and takes all available RAM. Is there any other alternative? Thanks.
Arrow large_list is supported since datasets 2.21.0. See: https://github.com/huggingface/datasets/releases/tag/2.21.0
Loading a large NumPy array currently raises the error above because the type of the offsets is int32, and pyarrow supports LargeListArray for this case.