Conversation

@Jiaxin-Wen
Contributor

@Jiaxin-Wen Jiaxin-Wen commented Aug 8, 2022

import numpy as np
import datasets
a = np.zeros((5000000, 768))
res = datasets.Dataset.from_dict({'embedding': a})

'''
  File "/home/wenjiaxin/anaconda3/envs/data/lib/python3.8/site-packages/datasets/arrow_writer.py", line 178, in __arrow_array__
    out = numpy_to_pyarrow_listarray(data)
  File "/home/wenjiaxin/anaconda3/envs/data/lib/python3.8/site-packages/datasets/features/features.py", line 1173, in numpy_to_pyarrow_listarray
    offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int32())
  File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 2147483904 not in range: -2147483648 to 2147483647
'''

Loading a large NumPy array currently raises the error above because the offsets are created with int32 type, which overflows once the flattened array length exceeds 2**31 - 1.
PyArrow already supports LargeListArray (with int64 offsets) for this case.
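To make the overflow concrete: for the (5000000, 768) array above, the last list offset is 5000000 * 768 = 3840000000, well past the int32 maximum, and the value reported in the traceback is itself a row boundary. A quick check of the arithmetic (shapes taken from the snippet above):

```python
import numpy as np

n_rows, n_cols = 5_000_000, 768
last_offset = n_rows * n_cols          # final list offset for a (n_rows, n_cols) array
int32_max = np.iinfo(np.int32).max     # 2_147_483_647

print(last_offset)                     # 3840000000
print(last_offset > int32_max)         # True -> int32 offsets overflow
print(2147483904 % n_cols == 0)        # True -> the traceback value is a row boundary
```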


@mariosasko
Collaborator

Hi, thanks for working on this! Can you run make style at the repo root to fix the code quality error in CI and add a test?

@Jiaxin-Wen
Contributor Author

Hi, I have fixed the code quality error and added a test.

@Jiaxin-Wen
Contributor Author

It seems that CI fails due to a lack of memory when allocating the large array, while the test passes locally.

@mariosasko
Collaborator

Also, the current implementation of the NumPy-to-PyArrow conversion creates a lot of copies, which is not ideal for large arrays.

We can improve performance significantly if we rewrite this part:

arr = np.array(arr)
values = pa.array(arr.flatten(), type=type)

as

values = pa.array(arr.ravel(), type=type)
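The reason ravel helps: flatten always allocates a new array, while ravel returns a view when the input is already contiguous, so no copy is made. A quick check with plain NumPy (nothing datasets-specific):

```python
import numpy as np

arr = np.zeros((1000, 768))
flat_copy = arr.flatten()  # always allocates a new array
flat_view = arr.ravel()    # a view when arr is C-contiguous, so no copy

print(np.shares_memory(arr, flat_copy))  # False
print(np.shares_memory(arr, flat_view))  # True
```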

@mariosasko
Collaborator

@XWwwwww Feel free to ignore #4800 (comment) and revert the changes you've made to address it.

Without copying the array, this would be possible:

arr = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

dset = Dataset.from_dict({"data": [arr]})

arr[0][0] = 100 # this change would be reflected in dset's PyArrow table -> a breaking change and also probably unexpected by the user 
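The aliasing hazard described above is visible with NumPy alone: a zero-copy view keeps pointing at the caller's buffer, so later mutations show through, whereas a copy stays independent. A minimal sketch:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])
view = arr.ravel()    # zero-copy view of arr's buffer
safe = arr.flatten()  # independent copy

arr[0][0] = 100
print(view[0])  # 100 -> data built on the view would silently change
print(safe[0])  # 1   -> the copy is unaffected
```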

@Jiaxin-Wen
Contributor Author


Oh, that makes sense.

@Jiaxin-Wen
Contributor Author

The tests pass on Ubuntu but fail on Windows.

@Jiaxin-Wen
Contributor Author

@mariosasko Hi, do you have any clue about this failure on Windows?

@mariosasko
Collaborator

Perhaps we can skip the added test on Windows then.

Not sure if this can help, but the ERR tool available on Windows outputs the following for the returned error code -1073741819:

# for decimal -1073741819 / hex 0xc0000005
  ISCSI_ERR_SETUP_NETWORK_NODE                                   iscsilog.h
# Failed to setup initiator portal. Error status is given in
# the dump data.
  STATUS_ACCESS_VIOLATION                                        ntstatus.h
# The instruction at 0x%p referenced memory at 0x%p. The
# memory could not be %s.
  USBD_STATUS_DEV_NOT_RESPONDING                                 usb.h
# as an HRESULT: Severity: FAILURE (1), FACILITY_NONE (0x0), Code 0x5
# for decimal 5 / hex 0x5
  WINBIO_FP_TOO_FAST                                             winbio_err.h
# Move your finger more slowly on the fingerprint reader.
# as an HRESULT: Severity: FAILURE (1), FACILITY_NULL (0x0), Code 0x5
  ERROR_ACCESS_DENIED                                            winerror.h
# Access is denied.
# 5 matches found for "-1073741819"

@Jiaxin-Wen
Contributor Author

What's the proper way to skip the added test on Windows?
I tried if platform.system() == 'Linux', but the CI test seems to get stuck.

@Jiaxin-Wen
Contributor Author

@mariosasko Hi, any idea about this? :)

@mariosasko
Collaborator

Hi again! We want to skip the test on Windows but not on Linux. You can use this decorator to do so:

@pytest.mark.skipif(os.name == "nt" and (os.getenv("CIRCLECI") == "true" or os.getenv("GITHUB_ACTIONS") == "true"), reason="The Windows CI runner does not have enough RAM to run this test")
@pytest.mark.parametrize(...)
def test_large_array_xd_with_np(...):
    ...

@Jiaxin-Wen
Contributor Author


CI on Windows is still stuck :(

@Jiaxin-Wen
Contributor Author

@mariosasko Hi, could you please take a look at this issue?

@Jiaxin-Wen
Contributor Author

@mariosasko Hi, all checks have passed, and we are finally ready to merge this PR :)

@Jiaxin-Wen
Contributor Author

@lhoestq @albertvillanova Perhaps other maintainers can take a look and merge this PR :)

Member

@lhoestq lhoestq left a comment

Thanks for fixing this! I left a few comments.

    values = pa.ListArray.from_arrays(offsets, values)
else:
    offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int64())
    values = pa.LargeListArray.from_arrays(offsets, values)
Member

Have you tried using pa.chunked_array instead of pa.LargeListArray? (i.e. chunking the input NumPy array into small arrays to get a list of pa.ListArrays that you concatenate with pa.chunked_array)

In the rest of the code base we don't support LargeListArray, so it could lead to issues when doing type inference or type casting.

Contributor Author

Do you mean that it would be OK to return a pa.ChunkedArray?

Member

I think so, but I haven't tested it.

Contributor Author

I implemented this idea:

    MAX_CHUNK_SIZE = (1 << 31) - 1  # int32 offset limit; note that `1 << 31 - 1` would parse as 1 << 30
    num_offset_per_chunk = MAX_CHUNK_SIZE // step_offsets
    chunk_len = num_offset_per_chunk * step_offsets
    num_chunks = math.ceil(n_offsets / num_offset_per_chunk)
    chunked_arr = np.resize(arr, (math.ceil(arr.size / chunk_len), chunk_len))
    values = []
    for i in range(num_chunks):
        chunk_values = pa.array(chunked_arr[i].ravel(), type=type)
        start = i * num_offset_per_chunk
        end = min(start + num_offset_per_chunk, n_offsets)
        chunk_offsets = pa.array(np.arange(end - start + 1) * step_offsets, type=pa.int32())
        values.append(pa.ListArray.from_arrays(chunk_offsets, chunk_values))
    values = pa.chunked_array(values)  # the chunks are already list arrays, so no value-type argument here

However, I still suggest using pa.LargeListArray, for two reasons:

  1. It would be more complex to handle the case where step_offsets >= (1 << 31), and my current implementation would fail on it.

  2. I tested the speed of building a PyArrow list array from a NumPy array of shape (50000, 50000) and found that the LargeListArray implementation is about 20x faster than this chunked implementation.

Member

Oh yeah, indeed. I'm not sure it would even work if the shape was (1, 50000, 50000), since I believe a ListArray can't have a ChunkedArray as values.

Let's go for LargeListArray then. We'll have to update a few things in table.py, but that can be handled later I think (let me know if that's something you'd like to help with!)

Contributor Author

Considering that @mariosasko may be too busy these days, shall we merge this PR and support this important feature? @lhoestq

Member

We're discussing this internally and doing some tests to make sure; will keep you posted ;)

Member

@lhoestq lhoestq Oct 13, 2022

From internal discussions, this code would fail because the ArrowWriter can't start writing large lists to disk if the schema has been determined to contain regular lists:

import numpy as np
import datasets
a = np.zeros((5000000, 768), np.uint8)

datasets.Dataset.from_dict({"id": range(2)}).map(lambda x: {"a": a if x["id"] else a[:1]}, writer_batch_size=1, new_fingerprint="foo")
# ArrowInvalid: Array of type large_list<item: uint8> too large to convert to list<item: uint8>

This means that whenever someone runs map with arrays of varying lengths and at some point an array doesn't fit into a regular ListArray, this will fail.

In this case we expect to let the user specify in advance that the large_list type must be used.
Therefore I think we need to add a parameter to Sequence to differentiate between regular lists and large lists. Maybe Sequence(..., large=True)?
And this way we can also always verify that

schema == Features.from_arrow_schema(schema).arrow_schema

I hope that makes sense ^^'

Contributor Author

Got it! Could you please provide some instructions on how to support this, e.g., which functions I have to update?

Member

I think the key is to add the large parameter to Sequence and to update the functions you modified in this PR to use pa.list_() if large is False and pa.large_list() otherwise.

@one-matrix

The same issue comes from pyarrow. Is there a solution for this?
Parquet file: 50 GB
datasets version: 2.14.4
pyarrow: 12.0.1

Generating train split: 0 examples [01:22, ? examples/s]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/datasets/builder.py", line 1925, in _prepare_split_single
    for _, table in generator:
  File "/opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py", line 79, in _generate_tables
    for batch_idx, record_batch in enumerate(
  File "pyarrow/_parquet.pyx", line 1315, in iter_batches
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: List index overflow.

@lkfo415579

When will this feature be added to a release?

@lhoestq
Member

lhoestq commented May 30, 2024

LargeListArray support is not ready yet, there is one remaining change:

I think the key is to add the large parameter to Sequence and update the functions you modified in this PR to use pa.list_() if large is False, and pa.large_list otherwise

@thusinh1969

Gents, any movement on this? Converting a large list of dicts to a Dataset is a nightmare and takes all the RAM possible. Is there any other alternative?

Thanks,
Steve

@albertvillanova
Member

Arrow large_list is supported since datasets 2.21.0. See: https://github.com/huggingface/datasets/releases/tag/2.21.0
