Closed
Description
Describe the bug
When I construct a dataset with the following features:
features = Features(
    {
        "pixel_values": Array3D(dtype="float64", shape=(3, 224, 224)),
        "input_ids": Sequence(feature=Value(dtype="int64")),
        "attention_mask": Sequence(Value(dtype="int64")),
        "tokens": Sequence(Value(dtype="string")),
        "bbox": Array2D(dtype="int64", shape=(512, 4)),
    }
)
and run push_to_hub, the individual *.parquet files are pushed, but when trying to upload the auto-generated README, I run into the following error:
Traceback (most recent call last):
File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 261, in hf_raise_for_status
response.raise_for_status()
File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/datasets/looppayments/multitask_document_classification_dataset/commit/main
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/kevintee/loop-payments/ml/src/ml/data_scripts/build_document_classification_training_data.py", line 297, in <module>
build_dataset()
File "/Users/kevintee/loop-payments/ml/src/ml/data_scripts/build_document_classification_training_data.py", line 290, in build_dataset
push_to_hub(dataset, "multitask_document_classification_dataset")
File "/Users/kevintee/loop-payments/ml/src/ml/data_scripts/build_document_classification_training_data.py", line 135, in push_to_hub
dataset.push_to_hub(f"looppayments/{dataset_name}", private=True)
File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5577, in push_to_hub
HfApi(endpoint=config.HF_ENDPOINT).upload_file(
File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 828, in _inner
return fn(self, *args, **kwargs)
File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3221, in upload_file
commit_info = self.create_commit(
File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 828, in _inner
return fn(self, *args, **kwargs)
File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2728, in create_commit
hf_raise_for_status(commit_resp, endpoint_name="commit")
File "/Users/kevintee/.pyenv/versions/dev2/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 299, in hf_raise_for_status
raise BadRequestError(message, response=response) from e
huggingface_hub.utils._errors.BadRequestError: (Request ID: Root=1-64ca9c3d-2d2bbef354e102482a9a168e;bc00371c-8549-4859-9f41-43ff140ad36e)
Bad request for commit endpoint:
Invalid YAML in README.md: unknown tag !<tag:yaml.org,2002:python/tuple> (10:9)
7 | - 3
8 | - 224
9 | - 224
10 | dtype: float64
--------------^
11 | - name: input_ids
12 | sequence: int64
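My guess is that the auto-generated YAML cannot be parsed: the error points at a python/tuple tag, which suggests the shape tuples of the array features are being serialized in a Python-specific form.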
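To illustrate the guess above, here is a minimal sketch of one way such metadata could be made safe for a strict YAML parser, assuming the root cause is tuple-valued fields (such as the array shapes) in the metadata dictionary. The helper `tuples_to_lists` and the `feature_info` dictionary, including its key names, are hypothetical and only illustrate the idea. They are not part of `datasets` and not necessarily how the library handles this.

```python
from typing import Any


def tuples_to_lists(obj: Any) -> Any:
    """Recursively replace tuples with lists, returning a new structure."""
    if isinstance(obj, (tuple, list)):
        return [tuples_to_lists(item) for item in obj]
    if isinstance(obj, dict):
        return {key: tuples_to_lists(value) for key, value in obj.items()}
    return obj


# Hypothetical metadata entry resembling the `pixel_values` feature;
# the key names are assumptions made for illustration only.
feature_info = {
    "name": "pixel_values",
    "array3d": {"shape": (3, 224, 224), "dtype": "float64"},
}

normalized = tuples_to_lists(feature_info)
print(normalized["array3d"]["shape"])
```

Plain lists map onto ordinary YAML sequences, so a dumper would have no reason to emit a language-specific tag for them. The original dictionary is left unchanged.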
Steps to reproduce the bug
The description contains most of what's needed to reproduce the issue, but I've added a shortened code snippet:
from datasets import Array2D, Array3D, ClassLabel, Dataset, Features, Sequence, Value
from PIL import Image
from transformers import AutoProcessor

features = Features(
    {
        "pixel_values": Array3D(dtype="float64", shape=(3, 224, 224)),
        "input_ids": Sequence(feature=Value(dtype="int64")),
        "attention_mask": Sequence(Value(dtype="int64")),
        "tokens": Sequence(Value(dtype="string")),
        "bbox": Array2D(dtype="int64", shape=(512, 4)),
    }
)

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)


def preprocess_dataset(rows):
    # Get images
    images = [
        Image.open(png_filename).convert("RGB") for png_filename in rows["png_filename"]
    ]
    encoding = processor(
        images,
        rows["tokens"],
        boxes=rows["bbox"],
        truncation=True,
        padding="max_length",
    )
    encoding["tokens"] = rows["tokens"]
    return encoding


# `dataset` is an existing Dataset with `png_filename`, `tokens`, and `bbox`
# columns; its construction is omitted from this shortened snippet.
dataset = dataset.map(
    preprocess_dataset,
    batched=True,
    batch_size=5,
    features=features,
)
Expected behavior
Using datasets==2.11.0, I'm able to successfully push_to_hub with no issues, but with datasets==2.14.2, I run into the above error.
Environment info
- datasets version: 2.14.2
- Platform: macOS-12.5-arm64-arm-64bit
- Python version: 3.10.12
- Huggingface_hub version: 0.16.4
- PyArrow version: 12.0.1
- Pandas version: 1.5.3