Description
Describe the bug
Hello everyone!
I tried to build a new dataset with the command "dict_valid = datasets.Dataset.from_dict({'input_values': values_array})".
However, my dataset is very large (~400 GB) and it seems that datasets cannot handle it.
I can create the dataset as long as my dictionary stays below a certain size; beyond that, I get the error "OverflowError: Python int too large to convert to C long".
Do you know how to solve this problem?
Unfortunately I cannot provide reproducible code because I cannot share such a large file, but you can find the code below (it is a test on only part of the validation data, ~10 GB, and the error already occurs there).
Thank you!
Steps to reproduce the bug
import datasets
import h5py
import numpy as np

SAVE_DIR = './data/'
features = h5py.File(SAVE_DIR + 'features.hdf5', 'r')
valid_data = features["validation"]["data/features"]
# Load every HDF5 dataset in the group into memory as a float32 array
v_array_values = [np.float32(item[()]) for item in valid_data.values()]
# Round each array to 5 decimals
for i in range(len(v_array_values)):
    v_array_values[i] = v_array_values[i].round(decimals=5)
dict_valid = datasets.Dataset.from_dict({'input_values': v_array_values})
Expected behavior
The code is expected to return a Hugging Face dataset.
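One workaround that might be worth trying, though I have not verified that it avoids the error, is to build the dataset in smaller pieces instead of passing the whole list to from_dict at once. The sketch below only shows the chunking step; the helper name `iter_chunks` and the chunk size of 1000 are hypothetical and not taken from the code above.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `chunk_size` elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Hypothetical usage with the variables from the reproduction code
# (untested on the real data, chunk size chosen arbitrarily):
#
#   parts = [
#       datasets.Dataset.from_dict({'input_values': list(chunk)})
#       for chunk in iter_chunks(v_array_values, 1000)
#   ]
#   dict_valid = datasets.concatenate_datasets(parts)
```

Each chunk would become its own small dataset, and datasets.concatenate_datasets would then combine them. Whether this stays below the size at which the OverflowError appears is an assumption that would need to be checked.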
Environment info
python: 3.8.15
numpy: 1.22.3
datasets: 2.3.2
pyarrow: 8.0.0