Dataset.from_parquet cannot load subset of columns #6149

@dwyatte

Describe the bug

When calling Dataset.from_parquet(path_or_paths, columns=[...]) with a subset of the file's columns, loading fails with a variant of the following error:

ValueError: Couldn't cast
a: int64
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 273
to
{'a': Value(dtype='int64', id=None), 'b': Value(dtype='int64', id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

This looks to be triggered by the column-name check in

datasets/src/datasets/table.py

Lines 2285 to 2286 in c02a447

if sorted(table.column_names) != sorted(features):
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
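
To illustrate why this check rejects a column subset, here is a minimal sketch in plain Python. It assumes, based on the error message above, that the features are inferred from the full file schema (a and b) while the loaded table only contains the requested column (a). The helper check_column_names and the variable names are hypothetical stand-ins, not code from the library.

```python
def check_column_names(table_columns, feature_names):
    """Hypothetical stand-in for the name check quoted above."""
    if sorted(table_columns) != sorted(feature_names):
        raise ValueError(
            f"Couldn't cast {table_columns} to {feature_names} "
            "because column names don't match"
        )


# Assumption: features come from the whole file, the table from columns=["a"].
file_features = ["a", "b"]
loaded_columns = ["a"]

try:
    check_column_names(loaded_columns, file_features)
    mismatch = False
except ValueError:
    mismatch = True

# Restricting the expected names to the requested columns makes the check pass.
selected_features = [name for name in file_features if name in loaded_columns]
check_column_names(loaded_columns, selected_features)
```

If that assumption holds, one possible direction for a fix would be to narrow the inferred features to the requested columns before the cast.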

Steps to reproduce the bug

import pandas as pd
from datasets import Dataset


pd.DataFrame([{"a": 1, "b": 2}]).to_parquet("test.pq")
Dataset.from_parquet("test.pq", columns=["a"])

Expected behavior

A subset of columns should be loaded without error.

Environment info

  • datasets version: 2.14.4
  • Platform: Linux-5.10.0-23-cloud-amd64-x86_64-with-glibc2.2.5
  • Python version: 3.8.16
  • Huggingface_hub version: 0.16.4
  • PyArrow version: 12.0.1
  • Pandas version: 2.0.3
