Conversation

@Jiaxin-Wen
Contributor

@Jiaxin-Wen Jiaxin-Wen commented Aug 8, 2022

import numpy as np
import datasets
a = np.zeros((5000000, 768))
res = datasets.Dataset.from_dict({'embedding': a})

'''
  File "/home/wenjiaxin/anaconda3/envs/data/lib/python3.8/site-packages/datasets/arrow_writer.py", line 178, in __arrow_array__
    out = numpy_to_pyarrow_listarray(data)
  File "/home/wenjiaxin/anaconda3/envs/data/lib/python3.8/site-packages/datasets/features/features.py", line 1173, in numpy_to_pyarrow_listarray
    offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int32())
  File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 2147483904 not in range: -2147483648 to 2147483647
'''

Loading a large NumPy array currently raises the error above because the offsets are created with int32 type, which overflows once the flattened array length exceeds 2**31 - 1.
PyArrow already supports LargeListArray (with int64 offsets) for this case.
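To make the overflow concrete: for the (5000000, 768) array above, the last list offset is 5000000 * 768 = 3840000000, well past the int32 maximum, and the value reported in the traceback is itself a row boundary. A quick check of the arithmetic (shapes taken from the snippet above):

```python
import numpy as np

n_rows, n_cols = 5_000_000, 768
last_offset = n_rows * n_cols          # final list offset for a (n_rows, n_cols) array
int32_max = np.iinfo(np.int32).max     # 2_147_483_647

print(last_offset)                     # 3840000000
print(last_offset > int32_max)         # True -> int32 offsets overflow
print(2147483904 % n_cols == 0)        # True -> the traceback value is a row boundary
```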


@mariosasko
Collaborator

Hi, thanks for working on this! Can you run make style at the repo root to fix the code quality error in CI and add a test?

@Jiaxin-Wen
Contributor Author

Hi, I have fixed the code quality error and added a test.

@Jiaxin-Wen
Contributor Author

It seems that CI fails due to a lack of memory when allocating the large array, while the test passes locally.

@mariosasko
Collaborator

Also, the current implementation of the NumPy-to-PyArrow conversion creates a lot of copies, which is not ideal for large arrays.

We can improve performance significantly if we rewrite this part:

arr = np.array(arr)
values = pa.array(arr.flatten(), type=type)

as

values = pa.array(arr.ravel(), type=type)
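The reason ravel helps: flatten always allocates a new array, while ravel returns a view when the input is already contiguous, so no copy is made. A quick check with plain NumPy (nothing datasets-specific):

```python
import numpy as np

arr = np.zeros((1000, 768))
flat_copy = arr.flatten()  # always allocates a new array
flat_view = arr.ravel()    # a view when arr is C-contiguous, so no copy

print(np.shares_memory(arr, flat_copy))  # False
print(np.shares_memory(arr, flat_view))  # True
```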

@mariosasko
Collaborator

@XWwwwww Feel free to ignore #4800 (comment) and revert the changes you've made to address it.

Without copying the array, this would be possible:

arr = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

dset = Dataset.from_dict({"data": [arr]})

arr[0][0] = 100 # this change would be reflected in dset's PyArrow table -> a breaking change and also probably unexpected by the user 
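The aliasing hazard described above is visible with NumPy alone: a zero-copy view keeps pointing at the caller's buffer, so later mutations show through, whereas a copy stays independent. A minimal sketch:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])
view = arr.ravel()    # zero-copy view of arr's buffer
safe = arr.flatten()  # independent copy

arr[0][0] = 100
print(view[0])  # 100 -> data built on the view would silently change
print(safe[0])  # 1   -> the copy is unaffected
```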

@Jiaxin-Wen
Contributor Author


Oh, that makes sense.

@Jiaxin-Wen
Contributor Author

The tests pass on Ubuntu but fail on Windows.

@Jiaxin-Wen
Contributor Author

@mariosasko Hi, do you have any clue about this failure on Windows?

@mariosasko
Collaborator

Perhaps we can skip the added test on Windows then.

Not sure if this can help, but the ERR tool available on Windows outputs the following for the returned error code -1073741819:

# for decimal -1073741819 / hex 0xc0000005
  ISCSI_ERR_SETUP_NETWORK_NODE                                   iscsilog.h
# Failed to setup initiator portal. Error status is given in
# the dump data.
  STATUS_ACCESS_VIOLATION                                        ntstatus.h
# The instruction at 0x%p referenced memory at 0x%p. The
# memory could not be %s.
  USBD_STATUS_DEV_NOT_RESPONDING                                 usb.h
# as an HRESULT: Severity: FAILURE (1), FACILITY_NONE (0x0), Code 0x5
# for decimal 5 / hex 0x5
  WINBIO_FP_TOO_FAST                                             winbio_err.h
# Move your finger more slowly on the fingerprint reader.
# as an HRESULT: Severity: FAILURE (1), FACILITY_NULL (0x0), Code 0x5
  ERROR_ACCESS_DENIED                                            winerror.h
# Access is denied.
# 5 matches found for "-1073741819"

@Jiaxin-Wen
Contributor Author

What's the proper way to skip the added test on Windows?
I tried if platform.system() == 'Linux', but the CI test seems to get stuck.

@Jiaxin-Wen
Contributor Author

@mariosasko Hi, any idea about this? :)

@mariosasko
Collaborator

Hi again! We want to skip the test on Windows but not on Linux. You can use this decorator to do so:

@pytest.mark.skipif(os.name == "nt" and (os.getenv("CIRCLECI") == "true" or os.getenv("GITHUB_ACTIONS") == "true"), reason="The Windows CI runner does not have enough RAM to run this test")
@pytest.mark.parametrize(...)
def test_large_array_xd_with_np(...):
    ...

@Jiaxin-Wen
Contributor Author


CI on Windows is still stuck :(

@Jiaxin-Wen
Contributor Author

@mariosasko Hi, could you please take a look at this issue?

@Jiaxin-Wen
Contributor Author

@mariosasko Hi, all checks have passed, and we are finally ready to merge this PR :)

@Jiaxin-Wen
Contributor Author

@lhoestq @albertvillanova Perhaps other maintainers can take a look and merge this PR :)

Member

@lhoestq lhoestq left a comment

Thanks for fixing this! I left a few comments.

    values = pa.ListArray.from_arrays(offsets, values)
else:
    offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int64())
    values = pa.LargeListArray.from_arrays(offsets, values)
Member

Have you tried using pa.chunked_array instead of pa.LargeListArray? (i.e. chunking the input NumPy array into small arrays to get a list of pa.ListArrays that you concatenate with pa.chunked_array)

In the rest of the code base we don't support LargeListArray, so it could lead to issues when doing type inference or type casting.

Contributor Author

Do you mean that it would be OK to return a pa.ChunkedArray?

Member

I think so, but I haven't tested it.

Contributor Author

I implemented this idea:

    MAX_CHUNK_SIZE = (1 << 31) - 1  # int32 offset limit; note that `1 << 31 - 1` would parse as 1 << 30
    num_offset_per_chunk = MAX_CHUNK_SIZE // step_offsets
    chunk_len = num_offset_per_chunk * step_offsets
    num_chunks = math.ceil(n_offsets / num_offset_per_chunk)
    chunked_arr = np.resize(arr, (math.ceil(arr.size / chunk_len), chunk_len))
    values = []
    for i in range(num_chunks):
        chunk_values = pa.array(chunked_arr[i].ravel(), type=type)
        start = i * num_offset_per_chunk
        end = min(start + num_offset_per_chunk, n_offsets)
        chunk_offsets = pa.array(np.arange(end - start + 1) * step_offsets, type=pa.int32())
        values.append(pa.ListArray.from_arrays(chunk_offsets, chunk_values))
    values = pa.chunked_array(values)  # the chunks are already list arrays, so no value-type argument here

However, I still suggest using pa.LargeListArray, for two reasons:

  1. It would be more complex to handle the case where step_offsets >= (1 << 31), and my current implementation would fail on it.

  2. I tested the speed of building a PyArrow list array from a NumPy array of shape (50000, 50000) and found that the LargeListArray implementation is about 20x faster than this chunked implementation.

Member

Oh yeah, indeed. I'm not sure it would even work if the shape was (1, 50000, 50000), since I believe a ListArray can't have a ChunkedArray as values.

Let's go for LargeListArray then. We'll have to update a few things in table.py, but that can be handled later I think (let me know if that's something you'd like to help with!)

Contributor Author

Considering that @mariosasko may be too busy these days, shall we merge this PR and support this important feature? @lhoestq

Member

We're discussing this internally and doing some tests to make sure; will keep you posted ;)

Member

@lhoestq lhoestq Oct 13, 2022

From internal discussions, this code would fail because the ArrowWriter can't start writing large lists to disk if the schema has been determined to contain regular lists:

import numpy as np
import datasets
a = np.zeros((5000000, 768), np.uint8)

datasets.Dataset.from_dict({"id": range(2)}).map(lambda x: {"a": a if x["id"] else a[:1]}, writer_batch_size=1, new_fingerprint="foo")
# ArrowInvalid: Array of type large_list<item: uint8> too large to convert to list<item: uint8>

This means that whenever someone runs map with arrays of varying lengths and at some point an array doesn't fit into a regular ListArray, this will fail.

In this case we expect to let the user specify in advance that the large_list type must be used.
Therefore I think we need to add a parameter to Sequence to differentiate between regular lists and large lists. Maybe Sequence(..., large=True)?
And this way we can also always verify that

schema == Features.from_arrow_schema(schema).arrow_schema

I hope that makes sense ^^'

Contributor Author

Got it! Could you please provide some instructions on how to support this, e.g., which functions I have to update?

Member

I think the key is to add the large parameter to Sequence and to update the functions you modified in this PR to use pa.list_() if large is False and pa.large_list() otherwise.

@one-matrix

The same issue comes from pyarrow. Is there a solution for this?
Parquet file: 50 GB
datasets version: 2.14.4
pyarrow: 12.0.1

Generating train split: 0 examples [01:22, ? examples/s]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/datasets/builder.py", line 1925, in _prepare_split_single
    for _, table in generator:
  File "/opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py", line 79, in _generate_tables
    for batch_idx, record_batch in enumerate(
  File "pyarrow/_parquet.pyx", line 1315, in iter_batches
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: List index overflow.

@lkfo415579

When will this feature be added to a release?

@lhoestq
Member

lhoestq commented May 30, 2024

LargeListArray support is not ready yet, there is one remaining change:

I think the key is to add the large parameter to Sequence and update the functions you modified in this PR to use pa.list_() if large is False, and pa.large_list otherwise

@thusinh1969

Gents, any movement on this? Converting a large list of dicts to a Dataset is a nightmare and takes all the RAM possible. Is there any other alternative?

Thanks,
Steve

@albertvillanova
Member

Arrow large_list is supported since datasets 2.21.0. See: https://github.com/huggingface/datasets/releases/tag/2.21.0
