Dataset cannot convert too large dictionary #5632

@MaraLac

Describe the bug

Hello everyone!

I tried to build a new dataset with the command "dict_valid = datasets.Dataset.from_dict({'input_values': values_array})".
However, my dataset is very large (~400 GB) and it seems that datasets cannot handle it.

Indeed, I can create the dataset up to a certain dictionary size, beyond which I get the error "OverflowError: Python int too large to convert to C long".

Do you know how to solve this problem?
Unfortunately I cannot provide reproducible code because I cannot share such a large file, but you can find the code below (it is a test on only part of the validation data, ~10 GB, and the error already occurs).

Thank you!

Steps to reproduce the bug

import h5py
import numpy as np
import datasets

SAVE_DIR = './data/'
features = h5py.File(SAVE_DIR + 'features.hdf5', 'r')

valid_data = features["validation"]["data/features"]

# Load every entry into memory as float32 and round to 5 decimal places
v_array_values = [np.float32(item[()]) for item in valid_data.values()]
for i in range(len(v_array_values)):
    v_array_values[i] = v_array_values[i].round(decimals=5)

# Fails with "OverflowError: Python int too large to convert to C long" once the dictionary grows large enough
dict_valid = datasets.Dataset.from_dict({'input_values': v_array_values})
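
A possible workaround, not from the original report but a sketch assuming a newer datasets release that provides Dataset.from_generator (it is not available in the 2.3.2 listed below), is to stream the HDF5 entries one at a time instead of materializing the whole dictionary in memory:

import h5py
import numpy as np
import datasets

SAVE_DIR = './data/'

def gen():
    # Yield one example at a time so the full dataset never sits in a single Python dict;
    # Dataset.from_generator is assumed to be available (newer than the datasets 2.3.2 reported below)
    with h5py.File(SAVE_DIR + 'features.hdf5', 'r') as features:
        for item in features["validation"]["data/features"].values():
            yield {'input_values': np.float32(item[()]).round(decimals=5)}

dict_valid = datasets.Dataset.from_generator(gen)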

Expected behavior

The code is expected to produce a Hugging Face dataset.

Environment info

python: 3.8.15
numpy: 1.22.3
datasets: 2.3.2
pyarrow: 8.0.0
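
Since the reported datasets version (2.3.2) predates Dataset.from_generator, another sketch, under the assumption that the overflow comes from converting one oversized dictionary rather than from the total data size, is to convert the data in smaller shards and merge them with datasets.concatenate_datasets; SHARD_SIZE is an illustrative value and the shards still live in memory:

import h5py
import numpy as np
import datasets

SAVE_DIR = './data/'
SHARD_SIZE = 1000  # hypothetical number of examples per shard; tune to available memory

features = h5py.File(SAVE_DIR + 'features.hdf5', 'r')
items = list(features["validation"]["data/features"].values())

shards = []
for start in range(0, len(items), SHARD_SIZE):
    chunk = [np.float32(item[()]).round(decimals=5) for item in items[start:start + SHARD_SIZE]]
    # Convert each shard separately so no single from_dict call sees the whole data
    shards.append(datasets.Dataset.from_dict({'input_values': chunk}))

dict_valid = datasets.concatenate_datasets(shards)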
