Main classes

DatasetInfo

[[autodoc]] datasets.DatasetInfo

Dataset

The base class [Dataset] implements a Dataset backed by an Apache Arrow table.

[[autodoc]] datasets.Dataset
    - add_column
    - add_item
    - from_file
    - from_buffer
    - from_pandas
    - from_dict
    - from_generator
    - data
    - cache_files
    - num_columns
    - num_rows
    - column_names
    - shape
    - unique
    - flatten
    - cast
    - cast_column
    - remove_columns
    - rename_column
    - rename_columns
    - select_columns
    - class_encode_column
    - __len__
    - __iter__
    - iter
    - formatted_as
    - set_format
    - set_transform
    - reset_format
    - with_format
    - with_transform
    - __getitem__
    - cleanup_cache_files
    - map
    - filter
    - select
    - sort
    - shuffle
    - train_test_split
    - shard
    - to_tf_dataset
    - push_to_hub
    - save_to_disk
    - load_from_disk
    - flatten_indices
    - to_csv
    - to_pandas
    - to_dict
    - to_json
    - to_parquet
    - to_sql
    - add_faiss_index
    - add_faiss_index_from_external_arrays
    - save_faiss_index
    - load_faiss_index
    - add_elasticsearch_index
    - load_elasticsearch_index
    - list_indexes
    - get_index
    - drop_index
    - search
    - search_batch
    - get_nearest_examples
    - get_nearest_examples_batch
    - info
    - split
    - builder_name
    - citation
    - config_name
    - dataset_size
    - description
    - download_checksums
    - download_size
    - features
    - homepage
    - license
    - size_in_bytes
    - supervised_keys
    - version
    - from_csv
    - from_json
    - from_parquet
    - from_text
    - from_sql
    - prepare_for_task
    - align_labels_with_mapping

[[autodoc]] datasets.concatenate_datasets

[[autodoc]] datasets.interleave_datasets

[[autodoc]] datasets.distributed.split_dataset_by_node

[[autodoc]] datasets.enable_caching

[[autodoc]] datasets.disable_caching

[[autodoc]] datasets.is_caching_enabled

DatasetDict

A dictionary with split names as keys (for example, 'train' and 'test') and Dataset objects as values. It also provides dataset transform methods, such as map and filter, that process all the splits at once.

[[autodoc]] datasets.DatasetDict
    - data
    - cache_files
    - num_columns
    - num_rows
    - column_names
    - shape
    - unique
    - cleanup_cache_files
    - map
    - filter
    - sort
    - shuffle
    - set_format
    - reset_format
    - formatted_as
    - with_format
    - with_transform
    - flatten
    - cast
    - cast_column
    - remove_columns
    - rename_column
    - rename_columns
    - select_columns
    - class_encode_column
    - push_to_hub
    - save_to_disk
    - load_from_disk
    - from_csv
    - from_json
    - from_parquet
    - from_text
    - prepare_for_task

IterableDataset

The base class [IterableDataset] implements an iterable Dataset backed by Python generators.

[[autodoc]] datasets.IterableDataset
    - from_generator
    - remove_columns
    - select_columns
    - cast_column
    - cast
    - __iter__
    - iter
    - map
    - rename_column
    - filter
    - shuffle
    - skip
    - take
    - info
    - split
    - builder_name
    - citation
    - config_name
    - dataset_size
    - description
    - download_checksums
    - download_size
    - features
    - homepage
    - license
    - size_in_bytes
    - supervised_keys
    - version

IterableDatasetDict

A dictionary with split names as keys (for example, 'train' and 'test') and IterableDataset objects as values.

[[autodoc]] datasets.IterableDatasetDict
    - map
    - filter
    - shuffle
    - with_format
    - cast
    - cast_column
    - remove_columns
    - rename_column
    - rename_columns
    - select_columns

Features

[[autodoc]] datasets.Features

[[autodoc]] datasets.Sequence

[[autodoc]] datasets.ClassLabel

[[autodoc]] datasets.Value

[[autodoc]] datasets.Translation

[[autodoc]] datasets.TranslationVariableLanguages

[[autodoc]] datasets.Array2D

[[autodoc]] datasets.Array3D

[[autodoc]] datasets.Array4D

[[autodoc]] datasets.Array5D

[[autodoc]] datasets.Audio

[[autodoc]] datasets.Image

MetricInfo

[[autodoc]] datasets.MetricInfo

Metric

The base class [Metric] implements a Metric backed by one or several [Dataset] objects.

[[autodoc]] datasets.Metric

Filesystems

[[autodoc]] datasets.filesystems.S3FileSystem

[[autodoc]] datasets.filesystems.extract_path_from_uri

[[autodoc]] datasets.filesystems.is_remote_filesystem

Fingerprint

[[autodoc]] datasets.fingerprint.Hasher