Json loader fails if user-specified features don't match the json data fields order #2366

@lhoestq

If you load a JSON dataset with user-specified features:

dataset = load_dataset("json", data_files=data_files, features=features)

Then loading fails whenever the order of the features differs from the order of the fields in the JSON data:

[...]
~/Desktop/hf/datasets/src/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
     94             if self.config.schema:
     95                 # Cast allows str <-> int/float, while parse_option explicit_schema does NOT
---> 96                 pa_table = pa_table.cast(self.config.schema)
     97             yield i, pa_table
[...]
ValueError: Target schema's field names are not matching the table's field names: ['tokens', 'ner_tags'], ['ner_tags', 'tokens']

This happens because cast expects the table's columns to appear in the same order as the fields of self.config.schema, so the columns must be reordered to match the schema before calling cast.

One way to fix this would be to replace the cast with the following, where schema stands for self.config.schema:

# reorder the arrays if necessary + cast to schema
# we can't simply use .cast here because we may need to change the order of the columns
pa_table = pa.Table.from_arrays([pa_table[name] for name in schema.names], schema=schema)
