
Allow opposite of remove_columns on Dataset and DatasetDict #5468

@hollance

Feature request

In this blog post https://huggingface.co/blog/audio-datasets, I noticed the following code:

COLUMNS_TO_KEEP = ["text", "audio"]
all_columns = gigaspeech["train"].column_names
columns_to_remove = set(all_columns) - set(COLUMNS_TO_KEEP)

gigaspeech = gigaspeech.remove_columns(columns_to_remove)

This pattern comes up often when you only need a few of a dataset's columns. It would be more convenient (and less error-prone) if you could just write:

gigaspeech = gigaspeech.keep_columns(["text", "audio"])

Internally, keep_columns could still call remove_columns, but it expresses more clearly what the user's intent is.
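To make the idea concrete, here is a minimal sketch of how such a wrapper might look as a free function. The function `keep_columns` and its error handling are hypothetical, not part of the library; the sketch relies only on the `column_names` attribute and the `remove_columns` method used in the snippet above, and it assumes that every split of a DatasetDict has the same columns.

```python
from typing import Iterable, List


def keep_columns(dataset, columns_to_keep: Iterable[str]):
    """Hypothetical helper: drop every column that is not in columns_to_keep.

    `dataset` is any object exposing `column_names` and `remove_columns`.
    """
    keep: List[str] = list(columns_to_keep)

    names = dataset.column_names
    if isinstance(names, dict):
        # Dict case (split name -> list of columns), as on a DatasetDict.
        # Assumption: all splits share the same columns.
        all_columns: List[str] = []
        for split_columns in names.values():
            for column in split_columns:
                if column not in all_columns:
                    all_columns.append(column)
    else:
        all_columns = list(names)

    # Fail early on a typo instead of silently dropping everything.
    missing = [column for column in keep if column not in all_columns]
    if missing:
        raise ValueError(f"Columns not found in the dataset: {missing}")

    # Preserve the original column order in the list passed on.
    columns_to_remove = [column for column in all_columns if column not in keep]
    return dataset.remove_columns(columns_to_remove)
```

With this helper, the blog post's four lines would reduce to `gigaspeech = keep_columns(gigaspeech, ["text", "audio"])`. Raising on unknown column names is a design choice of this sketch; it catches misspelled names that the set-difference version would silently ignore.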

Motivation

Less code to write for the user of the dataset.
