38 changes: 27 additions & 11 deletions docs/source/features.rst
@@ -1,15 +1,20 @@
Dataset features
================

:class:`datasets.Features` defines the internal structure of a dataset. Features are used to specify the underlying serialization format but also contain high-level information regarding the fields, e.g. column names, types, and conversion methods from names to integer values for a class label field.
:class:`datasets.Features` defines the internal structure of a dataset. Features are used to specify the underlying
serialization format but also contain high-level information regarding the fields, e.g. column names, types, and
conversion methods from class label strings to integer values for a :class:`datasets.ClassLabel` field.

A brief summary of how to use this class:

- :class:`datasets.Features` should only be called once and instantiated with a ``dict[str, FieldType]``, where keys are your desired column names, and values are the type of that column.
- :class:`datasets.Features` should only be called once and instantiated with a ``dict[str, FieldType]``, where keys are
  your desired column names, and values are the type of that column.

`FieldType` can be one of a few possibilities:
``FieldType`` can be one of a few possibilities:

- a :class:`datasets.Value` feature specifies a single typed value, e.g. ``int64`` or ``string``. The dtypes supported
  are as follows:

- a :class:`datasets.Value` feature specifies a single typed value, e.g. ``int64`` or ``string``. The dtypes supported are as follows:
    - null
    - bool
    - int8
@@ -30,15 +35,26 @@ A brief summary of how to use this class:
    - string
    - large_string

- a python :obj:`dict` specifies that the field is a nested field containing a mapping of sub-fields to sub-fields features. It's possible to have nested fields of nested fields in an arbitrary manner.

- a python :obj:`list` or a :class:`datasets.Sequence` specifies that the field contains a list of objects. The python :obj:`list` or :class:`datasets.Sequence` should be provided with a single sub-feature as an example of the feature type hosted in this list. A Python :obj:`list` is the simplest to define and write, while :class:`datasets.Sequence` provides a few more specific behaviors, like the possibility to specify a fixed length for the list (slightly more efficient).
- a python :obj:`dict` specifies that the field is a nested field containing a mapping of sub-fields to sub-fields
  features. It's possible to have nested fields of nested fields in an arbitrary manner.

.. note::
- a python :obj:`list` or a :class:`datasets.Sequence` specifies that the field contains a list of objects. The python
  :obj:`list` or :class:`datasets.Sequence` should be provided with a single sub-feature as an example of the feature
  type hosted in this list. A Python :obj:`list` is the simplest to define and write, while :class:`datasets.Sequence`
  provides a few more specific behaviors, like the possibility to specify a fixed length for the list (slightly more
  efficient).

    A :class:`datasets.Sequence` with an internal dictionary feature will be automatically converted into a dictionary of lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be unwanted in some cases. If you don't want this behavior, you can use a python :obj:`list` instead of the :class:`datasets.Sequence`.
.. note::

- a :class:`datasets.ClassLabel` feature specifies a field with a predefined set of classes which can have labels associated with them and will be stored as integers in the dataset. This field will be stored and retrieved as an integer value, and two conversion methods, :func:`datasets.ClassLabel.str2int` and :func:`datasets.ClassLabel.int2str`, can be used to convert from the label names to the associated integer values and vice versa.
    A :class:`datasets.Sequence` with an internal dictionary feature will be automatically converted into a dictionary of
    lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be
    unwanted in some cases. If you don't want this behavior, you can use a python :obj:`list` instead of the
    :class:`datasets.Sequence`.

- finally, two features are specific to Machine Translation: :class:`datasets.Translation` and :class:`datasets.TranslationVariableLanguages`. We refer to the package reference for more details on these features.
- a :class:`datasets.ClassLabel` feature specifies a field with a predefined set of classes which can have labels
  associated with them and will be stored as integers in the dataset. This field will be stored and retrieved as an
  integer value, and two conversion methods, :func:`datasets.ClassLabel.str2int` and
  :func:`datasets.ClassLabel.int2str`, can be used to convert from the label names to the associated integer values and
  vice versa.

- finally, two features are specific to Machine Translation: :class:`datasets.Translation` and
  :class:`datasets.TranslationVariableLanguages`. We refer to the :ref:`package reference <package_reference_features>`
  for more details on these features.
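
To make two of the behaviors documented above concrete, here is a minimal plain-Python sketch of (1) the name/integer mapping that a class label field exposes through ``str2int`` and ``int2str``, and (2) the conversion of a sequence of dictionaries into a dictionary of lists. ``LabelMapper`` and ``list_of_dicts_to_dict_of_lists`` are hypothetical names invented for this illustration. They are not part of ``datasets`` and only approximate, under simplifying assumptions (flat rows, identical keys in every row), what :class:`datasets.ClassLabel` and :class:`datasets.Sequence` do.

```python
from typing import Any, Dict, List


class LabelMapper:
    """Hypothetical stand-in for the name <-> integer mapping of a class label field."""

    def __init__(self, names: List[str]) -> None:
        self.names = list(names)
        # The integer id of a label is simply its position in the list of names.
        self._str2int = {name: idx for idx, name in enumerate(self.names)}

    def str2int(self, name: str) -> int:
        return self._str2int[name]

    def int2str(self, idx: int) -> str:
        return self.names[idx]


def list_of_dicts_to_dict_of_lists(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn [{'a': 1, 'b': 'x'}, ...] into {'a': [1, ...], 'b': ['x', ...]}.

    Assumes every row has the same keys as the first row.
    """
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


labels = LabelMapper(["negative", "positive"])
encoded = labels.str2int("positive")  # stored form: an integer
decoded = labels.int2str(encoded)     # retrieved form: the label name

answers = [{"text": "Paris", "start": 0}, {"text": "Lyon", "start": 42}]
columnar = list_of_dicts_to_dict_of_lists(answers)
print(encoded, decoded, columnar)
```

The second helper shows why the note recommends a python :obj:`list` when the row-wise layout should be kept: after the conversion, each sub-field becomes its own list rather than each element being its own dictionary.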
1 change: 1 addition & 0 deletions docs/source/package_reference/main_classes.rst
@@ -59,6 +59,7 @@ It also has dataset transform methods like map or filter, to process all the spl
from_csv, from_json, from_text,
prepare_for_task, align_labels_with_mapping

.. _package_reference_features:

``Features``
~~~~~~~~~~~~~~~~~~~~~
87 changes: 77 additions & 10 deletions src/datasets/features.py
@@ -884,15 +884,15 @@ def encode_nested_example(schema, obj):


def generate_from_dict(obj: Any):
    """Regenerate the nested feature object from a serialized dict.
    """Regenerate the nested feature object from a deserialized dict.
    We use the '_type' fields to get the dataclass name to load.

    generate_from_dict is the recursive helper for Features.from_dict, and allows for a convenient constructor syntax
    to define features from json dictionaries. This function is used in particular when deserializing
    a DatasetInfo that was dumped to a json dictionary. This acts as an analogue to
    Features.from_arrow_schema and handles the recursive field-by-field instantiation, but doesn't require any
    mapping to/from pyarrow, except for the fact that it takes advantage of the mapping of pyarrow primitive dtypes
    that Value() automatically performs.
    to define features from deserialized JSON dictionaries. This function is used in particular when deserializing
    a :class:`DatasetInfo` that was dumped to a JSON object. This acts as an analogue to
    :meth:`Features.from_arrow_schema` and handles the recursive field-by-field instantiation, but doesn't require any
    mapping to/from pyarrow, except for the fact that it takes advantage of the mapping of pyarrow primitive dtypes
    that :class:`Value` automatically performs.
    """
    # Nested structures: we allow dict, list/tuples, sequences
    if isinstance(obj, list):
@@ -942,23 +942,79 @@ def generate_from_arrow_type(pa_type: pa.DataType) -> FeatureType:
class Features(dict):
    @property
    def type(self):
        """
        Features field types.

        Returns:
            :obj:`pyarrow.DataType`
        """
        return get_nested_type(self)

    @classmethod
    def from_arrow_schema(cls, pa_schema: pa.Schema) -> "Features":
        """
        Construct Features from Arrow Schema.

        Args:
            pa_schema (:obj:`pyarrow.Schema`): Arrow Schema.

        Returns:
            :class:`Features`
        """
        obj = {field.name: generate_from_arrow_type(field.type) for field in pa_schema}
        return cls(**obj)

    @classmethod
    def from_dict(cls, dic) -> "Features":
        """
        Construct Features from dict.

        Regenerate the nested feature object from a deserialized dict.
        We use the '_type' key to infer the dataclass name of the feature FieldType.

        It allows for a convenient constructor syntax
        to define features from deserialized JSON dictionaries. This function is used in particular when deserializing
        a :class:`DatasetInfo` that was dumped to a JSON object. This acts as an analogue to
        :meth:`Features.from_arrow_schema` and handles the recursive field-by-field instantiation, but doesn't require
        any mapping to/from pyarrow, except for the fact that it takes advantage of the mapping of pyarrow primitive
        dtypes that :class:`Value` automatically performs.

        Args:
            dic (:obj:`dict[str, Any]`): Python dictionary.

        Returns:
            :class:`Features`

        Examples:
            >>> Features.from_dict({'_type': {'dtype': 'string', 'id': None, '_type': 'Value'}})
            {'_type': Value(dtype='string', id=None)}
        """
        obj = generate_from_dict(dic)
        return cls(**obj)

    def encode_example(self, example):
        """
        Encode example into a format for Arrow.

        Args:
            example (:obj:`dict[str, Any]`): Data in a Dataset row.

        Returns:
            :obj:`dict[str, Any]`
        """
        example = cast_to_python_objects(example)
        return encode_nested_example(self, example)

    def encode_batch(self, batch):
        """
        Encode batch into a format for Arrow.

        Args:
            batch (:obj:`dict[str, list[Any]]`): Data in a Dataset batch.

        Returns:
            :obj:`dict[str, list[Any]]`
        """
        encoded_batch = {}
        if set(batch) != set(self):
            raise ValueError("Column mismatch between batch {} and features {}".format(set(batch), set(self)))
@@ -968,16 +1024,28 @@ def encode_batch(self, batch):
        return encoded_batch

    def copy(self) -> "Features":
        """
        Make a deep copy of Features.

        Returns:
            :class:`Features`
        """
        return copy.deepcopy(self)

    def reorder_fields_as(self, other: "Features") -> "Features":
        """
        The order of the fields is important since it matters for the underlying arrow data.
        This method is used to re-order your features to match the field order of other features.
        Reorder Features fields to match the field order of other Features.

        The order of the fields is important since it matters for the underlying arrow data.
        Re-ordering the fields makes the underlying arrow data types match.

        Example::
        Args:
            other (:class:`Features`): The other Features to align with.

        Returns:
            :class:`Features`

        Examples:

            >>> from datasets import Features, Sequence, Value
            >>> # let's say we have two features with a different order of nested fields (for a and b for example)
@@ -988,7 +1056,6 @@ def reorder_fields_as(self, other: "Features") -> "Features":
            >>> f1.reorder_fields_as(f2)
            {'root': Sequence(feature={'b': Value(dtype='string', id=None), 'a': Value(dtype='string', id=None)}, length=-1, id=None)}
            >>> assert f1.reorder_fields_as(f2).type == f2.type

        """

        def recursive_reorder(source, target, stack=""):
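
The idea behind ``reorder_fields_as`` can be sketched on plain nested dicts and lists, without any ``datasets`` or pyarrow types. The helper below, ``reorder_like``, is a hypothetical name written for this illustration; it is not the library's ``recursive_reorder`` (whose body is collapsed in this diff) and it assumes that both structures have the same keys at every level and that leaves can be returned unchanged.

```python
from typing import Any


def reorder_like(source: Any, target: Any, path: str = "") -> Any:
    """Return a copy of `source` whose nested dict keys follow the order used in `target`."""
    where = path or "<root>"
    if isinstance(target, dict):
        if not isinstance(source, dict):
            raise ValueError(f"Type mismatch at {where}: expected a dict")
        if set(source) != set(target):
            raise ValueError(f"Keys mismatch at {where}: {set(source)} vs {set(target)}")
        # Iterate over the target's keys so the result adopts the target's order.
        return {key: reorder_like(source[key], target[key], f"{path}.{key}") for key in target}
    if isinstance(target, list):
        if not isinstance(source, list) or len(source) != len(target):
            raise ValueError(f"List mismatch at {where}")
        return [reorder_like(s, t, f"{path}[{i}]") for i, (s, t) in enumerate(zip(source, target))]
    # Leaves (for example dtype names) are returned unchanged.
    return source


# Two schemas that differ only in the order of the nested fields "a" and "b".
f1 = {"root": {"a": "string", "b": "string"}}
f2 = {"root": {"b": "string", "a": "string"}}
reordered = reorder_like(f1, f2)
print(list(reordered["root"]))
```

Because Python dicts preserve insertion order, building the result by iterating over the target's keys is enough to align the field order, which mirrors why the docstring stresses that field order matters for the underlying arrow data.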