
[Feature Request] Dataset versioning #6484

@kenfus

Description

Is your feature request related to a problem? Please describe.
I am working on a project where I would like to test different preprocessing methods for my ML data, so I want to work a lot with revisions and compare them. Currently I was not able to make this work with the revision keyword: even though the revision was different, the data was not redownloaded and some cached data was read instead, until I set download_mode="force_redownload".
Of course, I may have done something wrong or missed a setting somewhere!
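
For reference, a minimal sketch of what I am seeing (using my kenfus/xy repo; the revision tags here are placeholders):

    from datasets import load_dataset

    # Both calls end up serving the same cached data for me, even though
    # the requested revisions differ:
    ds_old = load_dataset('kenfus/xy', revision='v1.0.1')
    ds_new = load_dataset('kenfus/xy', revision='v1.0.2')

    # Only forcing a redownload actually picks up the new revision:
    ds_new = load_dataset(
        'kenfus/xy', revision='v1.0.2', download_mode='force_redownload'
    )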

Describe the solution you'd like
The solution would allow me to easily work with revisions:

  • create a new dataset (by combining sources, applying different preprocessing, etc.) and give it a new revision (e.g. v1.2.3), maybe like this:
    dataset_audio.push_to_hub('kenfus/xy', revision='v1.0.2')

  • then, get the current revision as follows:

    dataset = load_dataset(
        'kenfus/xy', revision='v1.0.2',
    )

This downloads the new version rather than loading a different revision from the cache, and all future map, filter, etc. operations are run on this dataset instead of being loaded from a cache produced by a different revision.

  • if I rerun the script, the caching should be smart enough at every step not to reuse a map or filter result computed on a different revision (see the sketch below).
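
Put together, the workflow I have in mind would look roughly like this (a sketch of the desired behaviour, not of any existing API; dataset_audio and prepare_dataset are from my script below, and the revision tags are placeholders):

    from datasets import load_dataset

    # Push a newly prepared dataset under an explicit revision tag:
    dataset_audio.push_to_hub('kenfus/xy', revision='v1.0.2')

    # Later, load exactly that revision; the download cache should be
    # keyed on the revision:
    dataset = load_dataset('kenfus/xy', revision='v1.0.2')

    # Subsequent map/filter results should also be cached per revision,
    # so rerunning after switching tags must not reuse fingerprints
    # computed for another revision:
    dataset = dataset.map(prepare_dataset)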

Describe alternatives you've considered
I created my own caching and put download_mode="force_redownload" and load_from_cache_file=False everywhere.

Additional context
Thanks a lot for your great work! Creating NLP datasets and training a model with them is really easy and straightforward with Hugging Face.

This is the data-loading part of my script:

    import os

    from datasets import load_dataset, load_from_disk

    # DATA_FOLDER, DATA_VERSION, PATH_TO_DATASET, CHECK_DATASET,
    # prepare_dataset and check_dimensions are defined earlier in the script.

    ## CREATE PATHS
    prepared_dataset_path = os.path.join(
        DATA_FOLDER, str(DATA_VERSION), "prepared_dataset"
    )
    os.makedirs(os.path.join(DATA_FOLDER, str(DATA_VERSION)), exist_ok=True)

    ## LOAD DATASET
    if os.path.exists(prepared_dataset_path):
        print("Loading prepared dataset from disk...")
        dataset_prepared = load_from_disk(prepared_dataset_path)
    else:
        print("Loading dataset from HuggingFace Datasets...")
        # Without force_redownload, a previously cached revision is returned
        # even though a different revision is requested.
        dataset = load_dataset(
            PATH_TO_DATASET, revision=DATA_VERSION, download_mode="force_redownload"
        )

        print("Preparing dataset...")
        dataset_prepared = dataset.map(
            prepare_dataset,
            remove_columns=["audio", "transcription"],
            num_proc=os.cpu_count(),
            load_from_cache_file=False,
        )
        dataset_prepared.save_to_disk(prepared_dataset_path)
        del dataset

    if CHECK_DATASET:
        ## CHECK DATASET
        dataset_prepared = dataset_prepared.map(
            check_dimensions, num_proc=os.cpu_count(), load_from_cache_file=False
        )
        dataset_filtered = dataset_prepared.filter(
            lambda example: not example["incorrect_dimension"],
            load_from_cache_file=False,
        )

        # dataset_prepared is a DatasetDict, so iterate over each split to
        # print the paths of the offending examples:
        for split in dataset_prepared:
            for example in dataset_prepared[split].filter(
                lambda example: example["incorrect_dimension"],
                load_from_cache_file=False,
            ):
                print(example["path"])

        # len() on a DatasetDict counts splits, not rows, so count per split:
        num_incorrect = sum(
            len(dataset_prepared[split]) - len(dataset_filtered[split])
            for split in dataset_prepared
        )
        print(f"Number of examples with incorrect dimension: {num_incorrect}")

        print("Number of examples train: ", len(dataset_filtered["train"]))
        print("Number of examples test: ", len(dataset_filtered["test"]))
