Closed
Description
Describe the bug
When calling Dataset.from_parquet(path_or_paths, columns=[...]) with a subset of the file's columns, loading fails with a variant of the following error:
ValueError: Couldn't cast
a: int64
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 273
to
{'a': Value(dtype='int64', id=None), 'b': Value(dtype='int64', id=None)}
because column names don't match
The above exception was the direct cause of the following exception:
Looks to be triggered by
datasets/src/datasets/table.py, lines 2285 to 2286 in c02a447:
if sorted(table.column_names) != sorted(features):
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
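To make the suspected cause concrete, here is a minimal sketch of the name comparison quoted above, using plain Python lists and dicts in place of the real pyarrow table and Features objects. The helper names check_column_names and project_features are hypothetical and not part of datasets. The idea that restricting the features to the requested columns would resolve the mismatch is an assumption, not a confirmed fix.

```python
# Hypothetical stand-ins: plain lists/dicts instead of a pyarrow.Table
# and a datasets.Features object.
def check_column_names(table_column_names, features):
    """Mimic the name comparison quoted above; raise if the names differ."""
    if sorted(table_column_names) != sorted(features):
        raise ValueError(
            f"Couldn't cast\n{table_column_names}\nto\n{features}\n"
            "because column names don't match"
        )


def project_features(features, columns):
    """Assumed fix: keep only the feature entries for the requested columns."""
    if columns is None:
        return dict(features)
    return {name: dtype for name, dtype in features.items() if name in columns}


# Features covering every column in the file vs. the table read with columns=["a"].
full_features = {"a": "int64", "b": "int64"}
loaded_columns = ["a"]

# Comparing the reduced table against the full features reproduces the mismatch.
try:
    check_column_names(loaded_columns, full_features)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# With the features restricted to the requested columns, the check passes.
projected = project_features(full_features, loaded_columns)
check_column_names(loaded_columns, projected)
```

In this sketch the first call raises because the target features still list b, which matches the error message above where the table has only a but the cast target has both a and b.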
Steps to reproduce the bug
import pandas as pd
from datasets import Dataset
pd.DataFrame([{"a": 1, "b": 2}]).to_parquet("test.pq")
Dataset.from_parquet("test.pq", columns=["a"])
Expected behavior
A subset of columns should be loaded without error
Environment info
- datasets version: 2.14.4
- Platform: Linux-5.10.0-23-cloud-amd64-x86_64-with-glibc2.2.5
- Python version: 3.8.16
- Huggingface_hub version: 0.16.4
- PyArrow version: 12.0.1
- Pandas version: 2.0.3