Description
Is your feature request related to a problem? Please describe.
I am working on a project where I would like to test different preprocessing methods for my ML data, so I work a lot with revisions and compare them. Currently, I was not able to make this work with the revision keyword: even though the revision was different, the data was not redownloaded and some cached data was read instead, until I set download_mode="force_redownload".
Of course, I may have done something wrong or missed a setting somewhere!
Describe the solution you'd like
The solution would allow me to easily work with revisions:
- create a new dataset (by combining things, different preprocessing, ...) and push it under a new revision (e.g. v1.2.3), maybe like this:

  dataset_audio.push_to_hub('kenfus/xy', revision='v1.0.2')

- then, load exactly that revision as follows:

  dataset = load_dataset(
      'kenfus/xy', revision='v1.0.2',
  )

  This downloads the new revision instead of loading a different one from the cache, and all future map, filter, etc. operations are computed on this dataset, not loaded from a cache produced by a different revision.

- if I rerun the script, the caching should be smart enough at every step not to reuse a map result computed on a different revision (see the sketch after this list).
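For the last point, a stopgap that I think works today is to mix the revision into the cache fingerprint yourself. This is only a minimal sketch, assuming the existing new_fingerprint parameter of Dataset.map (DatasetDict.map does not seem to expose it, hence the per-split loop); prepare is a hypothetical preprocessing function:

from datasets import load_dataset

DATA_VERSION = "v1.0.2"  # hypothetical revision tag

dataset = load_dataset("kenfus/xy", revision=DATA_VERSION)

def prepare(example):
    # hypothetical preprocessing step
    return example

# Mixing the revision into the fingerprint means a rerun with a different
# DATA_VERSION can never hit this revision's map cache.
for split, ds in dataset.items():
    dataset[split] = ds.map(
        prepare,
        new_fingerprint=f"prepare-{DATA_VERSION}-{split}",
    )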
Describe alternatives you've considered
I created my own caching and put download_mode="force_redownload" and load_from_cache_file=False everywhere.
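For completeness, here is the shape of that workaround as a minimal sketch, with one cache directory per revision so two revisions can never share a cache (cache_dir is an existing load_dataset parameter; the path variables are assumptions mirroring my script below):

import os

from datasets import load_dataset

DATA_FOLDER = "data"     # assumption: local cache root
DATA_VERSION = "v1.0.2"  # hypothetical revision tag

dataset = load_dataset(
    "kenfus/xy",
    revision=DATA_VERSION,
    download_mode="force_redownload",
    cache_dir=os.path.join(DATA_FOLDER, DATA_VERSION),  # one cache per revision
)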
Additional context
Thanks a lot for your great work! Creating NLP datasets and training a model with them is really easy and straightforward with huggingface.
This is the data loading in my script:
import os

from datasets import load_dataset, load_from_disk

# DATA_FOLDER, DATA_VERSION, PATH_TO_DATASET, CHECK_DATASET, prepare_dataset
# and check_dimensions are defined earlier in the script.

## CREATE PATHS
prepared_dataset_path = os.path.join(
    DATA_FOLDER, str(DATA_VERSION), "prepared_dataset"
)
os.makedirs(os.path.join(DATA_FOLDER, str(DATA_VERSION)), exist_ok=True)

## LOAD DATASET
if os.path.exists(prepared_dataset_path):
    print("Loading prepared dataset from disk...")
    dataset_prepared = load_from_disk(prepared_dataset_path)
else:
    print("Loading dataset from HuggingFace Datasets...")
    dataset = load_dataset(
        PATH_TO_DATASET, revision=DATA_VERSION, download_mode="force_redownload"
    )
    print("Preparing dataset...")
    dataset_prepared = dataset.map(
        prepare_dataset,
        remove_columns=["audio", "transcription"],
        num_proc=os.cpu_count(),
        load_from_cache_file=False,
    )
    dataset_prepared.save_to_disk(prepared_dataset_path)
    del dataset
if CHECK_DATASET:
    ## CHECK DATASET
    dataset_prepared = dataset_prepared.map(
        check_dimensions, num_proc=os.cpu_count(), load_from_cache_file=False
    )
    dataset_filtered = dataset_prepared.filter(
        lambda example: not example["incorrect_dimension"],
        load_from_cache_file=False,
    )
    # load_dataset returns a DatasetDict, so iterate each split's Dataset to
    # print the paths of the examples with incorrect dimensions.
    for split in dataset_prepared:
        for example in dataset_prepared[split].filter(
            lambda example: example["incorrect_dimension"], load_from_cache_file=False
        ):
            print(example["path"])
    n_incorrect = sum(
        len(dataset_prepared[split]) - len(dataset_filtered[split])
        for split in dataset_prepared
    )
    print(f"Number of examples with incorrect dimension: {n_incorrect}")
    print("Number of examples train: ", len(dataset_filtered["train"]))
    print("Number of examples test: ", len(dataset_filtered["test"]))