I'm training a Swedish Wav2vec2 model on a Linux GPU and having issues that the huggingface cached dataset folder is completely filling up my disk space (I'm training on a dataset of around 500 gb).
The cache folder is 500gb (and now my disk space is full).
Is there a way to toggle caching or set the caching to be stored on a different device (I have another drive with 4 tb that could hold the caching files).
This might not technically be a bug, but I was unsure and I felt that the bug was the closest one.
Traceback (most recent call last):
File "/home/birger/miniconda3/envs/wav2vec2/lib/python3.7/site-packages/multiprocess/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/birger/miniconda3/envs/wav2vec2/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 186, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/birger/miniconda3/envs/wav2vec2/lib/python3.7/site-packages/datasets/fingerprint.py", line 397, in wrapper
out = func(self, *args, **kwargs)
File "/home/birger/miniconda3/envs/wav2vec2/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1983, in _map_single
writer.finalize()
File "/home/birger/miniconda3/envs/wav2vec2/lib/python3.7/site-packages/datasets/arrow_writer.py", line 418, in finalize
self.pa_writer.close()
File "pyarrow/ipc.pxi", line 402, in pyarrow.lib._CRecordBatchWriter.close
File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
OSError: [Errno 28] Error writing bytes to file. Detail: [errno 28] No space left on device
"""
The above exception was the direct cause of the following exception:
I'm training a Swedish Wav2vec2 model on a Linux GPU and having issues that the huggingface cached dataset folder is completely filling up my disk space (I'm training on a dataset of around 500 gb).
The cache folder is 500gb (and now my disk space is full).
Is there a way to toggle caching or set the caching to be stored on a different device (I have another drive with 4 tb that could hold the caching files).
This might not technically be a bug, but I was unsure and I felt that the bug was the closest one.
Traceback (most recent call last):
File "/home/birger/miniconda3/envs/wav2vec2/lib/python3.7/site-packages/multiprocess/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/birger/miniconda3/envs/wav2vec2/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 186, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/birger/miniconda3/envs/wav2vec2/lib/python3.7/site-packages/datasets/fingerprint.py", line 397, in wrapper
out = func(self, *args, **kwargs)
File "/home/birger/miniconda3/envs/wav2vec2/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1983, in _map_single
writer.finalize()
File "/home/birger/miniconda3/envs/wav2vec2/lib/python3.7/site-packages/datasets/arrow_writer.py", line 418, in finalize
self.pa_writer.close()
File "pyarrow/ipc.pxi", line 402, in pyarrow.lib._CRecordBatchWriter.close
File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
OSError: [Errno 28] Error writing bytes to file. Detail: [errno 28] No space left on device
"""
The above exception was the direct cause of the following exception: