
[Feature Request] Dataset versioning #6484

@kenfus

Description

Is your feature request related to a problem? Please describe.
I am working on a project where I would like to test different preprocessing methods for my ML data, so I want to work a lot with revisions and compare them. Currently I was not able to make this work with the revision keyword: even though the revision was different, the data was not redownloaded and some cached data was read instead, until I set download_mode="force_redownload".
Of course, I may have done something wrong or missed a setting somewhere!
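
For reference, a minimal sketch of what I am seeing (using my kenfus/xy repo; the revision tags here are placeholders):

    from datasets import load_dataset

    # Both calls end up serving the same cached data for me, even though
    # the requested revisions differ:
    ds_old = load_dataset('kenfus/xy', revision='v1.0.1')
    ds_new = load_dataset('kenfus/xy', revision='v1.0.2')

    # Only forcing a redownload actually picks up the new revision:
    ds_new = load_dataset(
        'kenfus/xy', revision='v1.0.2', download_mode='force_redownload'
    )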

Describe the solution you'd like
The solution would allow me to easily work with revisions:

  • create a new dataset (by combining sources, applying different preprocessing, etc.) and give it a new revision (e.g. v1.2.3), maybe like this:
    dataset_audio.push_to_hub('kenfus/xy', revision='v1.0.2')

  • then, get the current revision as follows:

    dataset = load_dataset(
        'kenfus/xy', revision='v1.0.2',
    )

This downloads the new version rather than loading a different revision from the cache, and all future map, filter, etc. operations are run on this dataset instead of being loaded from a cache produced by a different revision.

  • if I rerun the script, the caching should be smart enough at every step not to reuse a map or filter result computed on a different revision (see the sketch below).
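
Put together, the workflow I have in mind would look roughly like this (a sketch of the desired behaviour, not of any existing API; dataset_audio and prepare_dataset are from my script below, and the revision tags are placeholders):

    from datasets import load_dataset

    # Push a newly prepared dataset under an explicit revision tag:
    dataset_audio.push_to_hub('kenfus/xy', revision='v1.0.2')

    # Later, load exactly that revision; the download cache should be
    # keyed on the revision:
    dataset = load_dataset('kenfus/xy', revision='v1.0.2')

    # Subsequent map/filter results should also be cached per revision,
    # so rerunning after switching tags must not reuse fingerprints
    # computed for another revision:
    dataset = dataset.map(prepare_dataset)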

Describe alternatives you've considered
I created my own caching and put download_mode="force_redownload" and load_from_cache_file=False everywhere.

Additional context
Thanks a lot for your great work! Creating NLP datasets and training a model with them is really easy and straightforward with Hugging Face.

This is the data-loading part of my script:

    import os

    from datasets import load_dataset, load_from_disk

    # DATA_FOLDER, DATA_VERSION, PATH_TO_DATASET, CHECK_DATASET,
    # prepare_dataset and check_dimensions are defined earlier in the script.

    ## CREATE PATHS
    prepared_dataset_path = os.path.join(
        DATA_FOLDER, str(DATA_VERSION), "prepared_dataset"
    )
    os.makedirs(os.path.join(DATA_FOLDER, str(DATA_VERSION)), exist_ok=True)

    ## LOAD DATASET
    if os.path.exists(prepared_dataset_path):
        print("Loading prepared dataset from disk...")
        dataset_prepared = load_from_disk(prepared_dataset_path)
    else:
        print("Loading dataset from HuggingFace Datasets...")
        # Without force_redownload, a previously cached revision is returned
        # even though a different revision is requested.
        dataset = load_dataset(
            PATH_TO_DATASET, revision=DATA_VERSION, download_mode="force_redownload"
        )

        print("Preparing dataset...")
        dataset_prepared = dataset.map(
            prepare_dataset,
            remove_columns=["audio", "transcription"],
            num_proc=os.cpu_count(),
            load_from_cache_file=False,
        )
        dataset_prepared.save_to_disk(prepared_dataset_path)
        del dataset

    if CHECK_DATASET:
        ## CHECK DATASET
        dataset_prepared = dataset_prepared.map(
            check_dimensions, num_proc=os.cpu_count(), load_from_cache_file=False
        )
        dataset_filtered = dataset_prepared.filter(
            lambda example: not example["incorrect_dimension"],
            load_from_cache_file=False,
        )

        # dataset_prepared is a DatasetDict, so iterate over each split to
        # print the paths of the offending examples:
        for split in dataset_prepared:
            for example in dataset_prepared[split].filter(
                lambda example: example["incorrect_dimension"],
                load_from_cache_file=False,
            ):
                print(example["path"])

        # len() on a DatasetDict counts splits, not rows, so count per split:
        num_incorrect = sum(
            len(dataset_prepared[split]) - len(dataset_filtered[split])
            for split in dataset_prepared
        )
        print(f"Number of examples with incorrect dimension: {num_incorrect}")

        print("Number of examples train: ", len(dataset_filtered["train"]))
        print("Number of examples test: ", len(dataset_filtered["test"]))
