yaml error using push_to_hub with generated README.md #6112

@kevintee

Describe the bug

When I construct a dataset with the following features:

features = Features(
    {
        "pixel_values": Array3D(dtype="float64", shape=(3, 224, 224)),
        "input_ids": Sequence(feature=Value(dtype="int64")),
        "attention_mask": Sequence(Value(dtype="int64")),
        "tokens": Sequence(Value(dtype="string")),
        "bbox": Array2D(dtype="int64", shape=(512, 4)),
    }
)

and run push_to_hub, the individual *.parquet files are pushed, but uploading the auto-generated README.md fails with the following error:

Traceback (most recent call last):
  File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 261, in hf_raise_for_status
    response.raise_for_status()
  File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/datasets/looppayments/multitask_document_classification_dataset/commit/main

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/kevintee/loop-payments/ml/src/ml/data_scripts/build_document_classification_training_data.py", line 297, in <module>
    build_dataset()
  File "/Users/kevintee/loop-payments/ml/src/ml/data_scripts/build_document_classification_training_data.py", line 290, in build_dataset
    push_to_hub(dataset, "multitask_document_classification_dataset")
  File "/Users/kevintee/loop-payments/ml/src/ml/data_scripts/build_document_classification_training_data.py", line 135, in push_to_hub
    dataset.push_to_hub(f"looppayments/{dataset_name}", private=True)
  File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5577, in push_to_hub
    HfApi(endpoint=config.HF_ENDPOINT).upload_file(
  File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 828, in _inner
    return fn(self, *args, **kwargs)
  File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3221, in upload_file
    commit_info = self.create_commit(
  File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 828, in _inner
    return fn(self, *args, **kwargs)
  File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2728, in create_commit
    hf_raise_for_status(commit_resp, endpoint_name="commit")
  File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 299, in hf_raise_for_status
    raise BadRequestError(message, response=response) from e
huggingface_hub.utils._errors.BadRequestError:  (Request ID: Root=1-64ca9c3d-2d2bbef354e102482a9a168e;bc00371c-8549-4859-9f41-43ff140ad36e)

Bad request for commit endpoint:
Invalid YAML in README.md: unknown tag !<tag:yaml.org,2002:python/tuple> (10:9)

  7 |         - 3
  8 |         - 224
  9 |         - 224
 10 |         dtype: float64
--------------^
 11 |   - name: input_ids
 12 |     sequence: int64

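My guess is that the auto-generated YAML in README.md cannot be parsed by the Hub: judging by the error message, the tuple used for `shape` is written with a Python-specific `python/tuple` tag that the parser does not recognize.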
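
To illustrate how such a tag can end up in YAML output, here is a minimal sketch using PyYAML directly. It assumes the metadata is serialized with a dumper that does not convert tuples; the `metadata` dict is a hypothetical stand-in for the generated README header, not the actual code path inside `datasets`.

```python
import yaml  # PyYAML

# Hypothetical metadata fragment; the tuple `shape` mirrors
# Array3D(dtype="float64", shape=(3, 224, 224)) from the features above.
metadata = {"name": "pixel_values", "shape": (3, 224, 224), "dtype": "float64"}

# The default Dumper marks tuples with a Python-specific YAML tag.
tagged = yaml.dump(metadata)
print(tagged)

# A loader restricted to standard YAML types rejects that tag.
try:
    yaml.safe_load(tagged)
    rejected = False
except yaml.YAMLError:
    rejected = True
print("rejected by safe_load:", rejected)

# safe_dump writes the tuple as a plain YAML sequence instead,
# which loads back as a list.
plain = yaml.safe_dump(metadata)
round_tripped = yaml.safe_load(plain)
print(plain)
```

If this is indeed the cause, writing `shape` as a plain list (or using a safe dumper) before the README is generated should produce YAML that a standard parser accepts.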

Steps to reproduce the bug

The description contains most of what's needed to reproduce the issue, but I've added a shortened code snippet:

from datasets import Array2D, Array3D, ClassLabel, Dataset, Features, Sequence, Value
from PIL import Image
from transformers import AutoProcessor

features = Features(
    {
        "pixel_values": Array3D(dtype="float64", shape=(3, 224, 224)),
        "input_ids": Sequence(feature=Value(dtype="int64")),
        "attention_mask": Sequence(Value(dtype="int64")),
        "tokens": Sequence(Value(dtype="string")),
        "bbox": Array2D(dtype="int64", shape=(512, 4)),
    }
)

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

def preprocess_dataset(rows):
    # Get images
    images = [
        Image.open(png_filename).convert("RGB") for png_filename in rows["png_filename"]
    ]

    encoding = processor(
        images,
        rows["tokens"],
        boxes=rows["bbox"],
        truncation=True,
        padding="max_length",
    )
    encoding["tokens"] = rows["tokens"]
    return encoding

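# `dataset` is an existing Dataset with "png_filename", "tokens", and "bbox" columns (construction omitted).
dataset = dataset.map(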
    preprocess_dataset,
    batched=True,
    batch_size=5,
    features=features,
)

Expected behavior

Using datasets==2.11.0, I'm able to successfully push_to_hub with no issues, but with datasets==2.14.2, I run into the above error.

Environment info

  • datasets version: 2.14.2
  • Platform: macOS-12.5-arm64-arm-64bit
  • Python version: 3.10.12
  • Huggingface_hub version: 0.16.4
  • PyArrow version: 12.0.1
  • Pandas version: 1.5.3
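
For anyone comparing against a working setup such as datasets==2.11.0, one way to collect comparable version information is sketched below. The helper name `installed_versions` is hypothetical, and the package list is simply the one from the environment info above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["datasets", "huggingface_hub", "pyarrow", "pandas"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```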
