Dataset cannot convert too large dictionary #5632

@MaraLac

Describe the bug

Hello everyone!

I tried to build a new dataset with the command "dict_valid = datasets.Dataset.from_dict({'input_values': values_array})".
However, my dataset is very large (~400 GB) and it seems that datasets cannot handle it.

Indeed, I can create the dataset up to a certain dictionary size, beyond which I get the error "OverflowError: Python int too large to convert to C long".

Do you know how to solve this problem?
Unfortunately I cannot provide reproducible code because I cannot share such a large file, but you can find the code below (it is a test on only part of the validation data, ~10 GB, and the error already occurs).

Thank you!

Steps to reproduce the bug

import h5py
import numpy as np
import datasets

SAVE_DIR = './data/'
features = h5py.File(SAVE_DIR + 'features.hdf5', 'r')

valid_data = features["validation"]["data/features"]

# Load every entry into memory as float32 and round to 5 decimal places
v_array_values = [np.float32(item[()]) for item in valid_data.values()]
for i in range(len(v_array_values)):
    v_array_values[i] = v_array_values[i].round(decimals=5)

# Fails with "OverflowError: Python int too large to convert to C long" once the dictionary grows large enough
dict_valid = datasets.Dataset.from_dict({'input_values': v_array_values})
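
A possible workaround, not from the original report but a sketch assuming a newer datasets release that provides Dataset.from_generator (it is not available in the 2.3.2 listed below), is to stream the HDF5 entries one at a time instead of materializing the whole dictionary in memory:

import h5py
import numpy as np
import datasets

SAVE_DIR = './data/'

def gen():
    # Yield one example at a time so the full dataset never sits in a single Python dict;
    # Dataset.from_generator is assumed to be available (newer than the datasets 2.3.2 reported below)
    with h5py.File(SAVE_DIR + 'features.hdf5', 'r') as features:
        for item in features["validation"]["data/features"].values():
            yield {'input_values': np.float32(item[()]).round(decimals=5)}

dict_valid = datasets.Dataset.from_generator(gen)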

Expected behavior

The code is expected to produce a Hugging Face dataset.

Environment info

python: 3.8.15
numpy: 1.22.3
datasets: 2.3.2
pyarrow: 8.0.0
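
Since the reported datasets version (2.3.2) predates Dataset.from_generator, another sketch, under the assumption that the overflow comes from converting one oversized dictionary rather than from the total data size, is to convert the data in smaller shards and merge them with datasets.concatenate_datasets; SHARD_SIZE is an illustrative value and the shards still live in memory:

import h5py
import numpy as np
import datasets

SAVE_DIR = './data/'
SHARD_SIZE = 1000  # hypothetical number of examples per shard; tune to available memory

features = h5py.File(SAVE_DIR + 'features.hdf5', 'r')
items = list(features["validation"]["data/features"].values())

shards = []
for start in range(0, len(items), SHARD_SIZE):
    chunk = [np.float32(item[()]).round(decimals=5) for item in items[start:start + SHARD_SIZE]]
    # Convert each shard separately so no single from_dict call sees the whole data
    shards.append(datasets.Dataset.from_dict({'input_values': chunk}))

dict_valid = datasets.concatenate_datasets(shards)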
