diff --git a/docs/source/features.rst b/docs/source/features.rst
index 23a90ffe4ff..28db1079127 100644
--- a/docs/source/features.rst
+++ b/docs/source/features.rst
@@ -1,15 +1,20 @@
 Dataset features
 ================
 
-:class:`datasets.Features` defines the internal structure of a dataset. Features are used to specify the underlying serialization format but also contain high-level information regarding the fields, e.g. column names, types, and conversion methods from names to integer values for a class label field.
+:class:`datasets.Features` defines the internal structure of a dataset. Features are used to specify the underlying
+serialization format but also contain high-level information regarding the fields, e.g. column names, types, and
+conversion methods from class label strings to integer values for a :class:`datasets.ClassLabel` field.
 
 A brief summary of how to use this class:
 
-- :class:`datasets.Features` should be only called once and instantiated with a ``dict[str, FieldType]``, where keys are your desired column names, and values are the type of that column.
+- :class:`datasets.Features` should be only called once and instantiated with a ``dict[str, FieldType]``, where keys are
+  your desired column names, and values are the type of that column.
 
-`FieldType` can be one of a few possibilities:
+``FieldType`` can be one of a few possibilities:
+
+- a :class:`datasets.Value` feature specifies a single typed value, e.g. ``int64`` or ``string``. The dtypes supported
+  are as follows:
 
-- a :class:`datasets.Value` feature specifies a single typed value, e.g. ``int64`` or ``string``. The dtypes supported are as follows:
     - null
     - bool
     - int8
@@ -30,15 +35,26 @@ A brief summary of how to use this class:
     - string
     - large_string
 
-- a python :obj:`dict` specifies that the field is a nested field containing a mapping of sub-fields to sub-fields features. It's possible to have nested fields of nested fields in an arbitrary manner.
-
-- a python :obj:`list` or a :class:`datasets.Sequence` specifies that the field contains a list of objects. The python :obj:`list` or :class:`datasets.Sequence` should be provided with a single sub-feature as an example of the feature type hosted in this list. Python :obj:`list` are simplest to define and write while :class:`datasets.Sequence` provide a few more specific behaviors like the possibility to specify a fixed length for the list (slightly more efficient).
+- a python :obj:`dict` specifies that the field is a nested field containing a mapping of sub-fields to sub-fields
+  features. It's possible to have nested fields of nested fields in an arbitrary manner.
 
-.. note::
+- a python :obj:`list` or a :class:`datasets.Sequence` specifies that the field contains a list of objects. The python
+  :obj:`list` or :class:`datasets.Sequence` should be provided with a single sub-feature as an example of the feature
+  type hosted in this list. Python :obj:`list` are simplest to define and write while :class:`datasets.Sequence` provide
+  a few more specific behaviors like the possibility to specify a fixed length for the list (slightly more efficient).
 
-    A :class:`datasets.Sequence` with a internal dictionary feature will be automatically converted into a dictionary of lists. This behavior is implemented to have a compatilbity layer with the TensorFlow Datasets library but may be un-wanted in some cases. If you don't want this behavior, you can use a python :obj:`list` instead of the :class:`datasets.Sequence`.
+  .. note::
 
-- a :class:`datasets.ClassLabel` feature specifies a field with a predefined set of classes which can have labels associated to them and will be stored as integers in the dataset. This field will be stored and retrieved as an integer value and two conversion methods, :func:`datasets.ClassLabel.str2int` and :func:`datasets.ClassLabel.int2str` can be used to convert from the label names to the associate integer value and vice-versa.
+    A :class:`datasets.Sequence` with an internal dictionary feature will be automatically converted into a dictionary of
+    lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be
+    unwanted in some cases. If you don't want this behavior, you can use a python :obj:`list` instead of the
+    :class:`datasets.Sequence`.
 
-- finally, two features are specific to Machine Translation: :class:`datasets.Translation` and :class:`datasets.TranslationVariableLanguages`. We refer to the package reference for more details on these features.
+- a :class:`datasets.ClassLabel` feature specifies a field with a predefined set of classes which can have labels
+  associated to them and will be stored as integers in the dataset. This field will be stored and retrieved as an
+  integer value and two conversion methods, :func:`datasets.ClassLabel.str2int` and :func:`datasets.ClassLabel.int2str`
+  can be used to convert from the label names to the associated integer value and vice-versa.
 
+- finally, two features are specific to Machine Translation: :class:`datasets.Translation` and
+  :class:`datasets.TranslationVariableLanguages`. We refer to the :ref:`package reference <package_reference_features>`
+  for more details on these features.
diff --git a/docs/source/package_reference/main_classes.rst b/docs/source/package_reference/main_classes.rst
index c3e34ee0395..09810d5ab2d 100644
--- a/docs/source/package_reference/main_classes.rst
+++ b/docs/source/package_reference/main_classes.rst
@@ -59,6 +59,7 @@ It also has dataset transform methods like map or filter, to process all the spl
         from_csv, from_json, from_text, prepare_for_task, align_labels_with_mapping
 
+.. _package_reference_features:
 
 ``Features``
 ~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/src/datasets/features.py b/src/datasets/features.py
index e62e56436e1..f877e808d7a 100644
--- a/src/datasets/features.py
+++ b/src/datasets/features.py
@@ -884,15 +884,15 @@ def encode_nested_example(schema, obj):
 
 
 def generate_from_dict(obj: Any):
-    """Regenerate the nested feature object from a serialized dict.
+    """Regenerate the nested feature object from a deserialized dict.
     We use the '_type' fields to get the dataclass name to load.
 
     generate_from_dict is the recursive helper for Features.from_dict, and allows for a convenient constructor syntax
-    to define features from json dictionaries. This function is used in particular when deserializing
-    a DatasetInfo that was dumped to a json dictionary. This acts as an analogue to
-    Features.from_arrow_schema and handles the recursive field-by-field instantiation, but doesn't require any
-    mapping to/from pyarrow, except for the fact that it takes advantage of the mapping of pyarrow primitive dtypes
-    that Value() automatically performs.
+    to define features from deserialized JSON dictionaries. This function is used in particular when deserializing
+    a :class:`DatasetInfo` that was dumped to a JSON object. This acts as an analogue to
+    :meth:`Features.from_arrow_schema` and handles the recursive field-by-field instantiation, but doesn't require any
+    mapping to/from pyarrow, except for the fact that it takes advantage of the mapping of pyarrow primitive dtypes
+    that :class:`Value` automatically performs.
     """
     # Nested structures: we allow dict, list/tuples, sequences
     if isinstance(obj, list):
@@ -942,23 +942,79 @@ def generate_from_arrow_type(pa_type: pa.DataType) -> FeatureType:
 class Features(dict):
     @property
     def type(self):
+        """
+        Features field types.
+
+        Returns:
+            :obj:`pyarrow.DataType`
+        """
         return get_nested_type(self)
 
     @classmethod
     def from_arrow_schema(cls, pa_schema: pa.Schema) -> "Features":
+        """
+        Construct Features from Arrow Schema.
+
+        Args:
+            pa_schema (:obj:`pyarrow.Schema`): Arrow Schema.
+
+        Returns:
+            :class:`Features`
+        """
         obj = {field.name: generate_from_arrow_type(field.type) for field in pa_schema}
         return cls(**obj)
 
     @classmethod
     def from_dict(cls, dic) -> "Features":
+        """
+        Construct Features from dict.
+
+        Regenerate the nested feature object from a deserialized dict.
+        We use the '_type' key to infer the dataclass name of the feature FieldType.
+
+        It allows for a convenient constructor syntax
+        to define features from deserialized JSON dictionaries. This function is used in particular when deserializing
+        a :class:`DatasetInfo` that was dumped to a JSON object. This acts as an analogue to
+        :meth:`Features.from_arrow_schema` and handles the recursive field-by-field instantiation, but doesn't require
+        any mapping to/from pyarrow, except for the fact that it takes advantage of the mapping of pyarrow primitive
+        dtypes that :class:`Value` automatically performs.
+
+        Args:
+            dic (:obj:`dict[str, Any]`): Python dictionary.
+
+        Returns:
+            :class:`Features`
+
+        Examples:
+            >>> Features.from_dict({'_type': {'dtype': 'string', 'id': None, '_type': 'Value'}})
+            {'_type': Value(dtype='string', id=None)}
+        """
         obj = generate_from_dict(dic)
         return cls(**obj)
 
     def encode_example(self, example):
+        """
+        Encode example into a format for Arrow.
+
+        Args:
+            example (:obj:`dict[str, Any]`): Data in a Dataset row.
+
+        Returns:
+            :obj:`dict[str, Any]`
+        """
         example = cast_to_python_objects(example)
         return encode_nested_example(self, example)
 
     def encode_batch(self, batch):
+        """
+        Encode batch into a format for Arrow.
+
+        Args:
+            batch (:obj:`dict[str, list[Any]]`): Data in a Dataset batch.
+
+        Returns:
+            :obj:`dict[str, list[Any]]`
+        """
         encoded_batch = {}
         if set(batch) != set(self):
             raise ValueError("Column mismatch between batch {} and features {}".format(set(batch), set(self)))
@@ -968,16 +1024,28 @@ def encode_batch(self, batch):
         return encoded_batch
 
     def copy(self) -> "Features":
+        """
+        Make a deep copy of Features.
+
+        Returns:
+            :class:`Features`
+        """
         return copy.deepcopy(self)
 
     def reorder_fields_as(self, other: "Features") -> "Features":
         """
-        The order of the fields is important since it matters for the underlying arrow data.
-        This method is used to re-order your features to match the fields orders of other features.
+        Reorder Features fields to match the field order of other Features.
 
+        The order of the fields is important since it matters for the underlying arrow data.
         Re-ordering the fields allows to make the underlying arrow data type match.
 
-        Example::
+        Args:
+            other (:class:`Features`): The other Features to align with.
+
+        Returns:
+            :class:`Features`
+
+        Examples:
 
             >>> from datasets import Features, Sequence, Value
             >>> # let's say we have to features with a different order of nested fields (for a and b for example)
@@ -988,7 +1056,6 @@ def reorder_fields_as(self, other: "Features") -> "Features":
             >>> f1.reorder_fields_as(f2)
             {'root': Sequence(feature={'b': Value(dtype='string', id=None), 'a': Value(dtype='string', id=None)}, length=-1, id=None)}
             >>> assert f1.reorder_fields_as(f2).type == f2.type
-
         """
 
         def recursive_reorder(source, target, stack=""):