Bug with filtered indices #5112

@albertvillanova

Description

Describe the bug

As reported by @partiallytyped (and by @Muennighoff):

There is an issue with the indices of a filtered dataset.

Steps to reproduce the bug

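from datasets import Dataset

# Keep only the even numbers; the assertion fails if any odd number remains.
ds = Dataset.from_dict({"num": [0, 1, 2, 3]})
ds = ds.filter(lambda num: num % 2 == 0, input_columns="num", batch_size=2)
assert all(item["num"] % 2 == 0 for item in ds)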

Expected results

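The indices of the filtered dataset should correspond only to the examples that pass the filter: in the original report, those with "language" equal to "english"; in the snippet above, those with an even "num".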
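To make the expectation concrete, here is a minimal pure-Python sketch of the indices a batched filter should keep. It does not use `datasets` and is not the library's implementation; `reference_filter_indices` is a hypothetical helper written only for this illustration.

```python
from typing import Callable, List, Sequence


def reference_filter_indices(
    values: Sequence[int],
    predicate: Callable[[int], bool],
    batch_size: int,
) -> List[int]:
    """Return the global indices of the values that satisfy the predicate.

    Hypothetical reference helper: it walks the data in batches and converts
    each position inside a batch back to a global index.
    """
    kept: List[int] = []
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        for offset, value in enumerate(batch):
            if predicate(value):
                # Global index = start of the batch + position inside the batch.
                kept.append(start + offset)
    return kept


nums = [0, 1, 2, 3]
kept = reference_filter_indices(nums, lambda num: num % 2 == 0, batch_size=2)
print(kept)  # [0, 2]
```

Comparing the indices held by the filtered dataset against a reference like this one would show which unexpected examples are being selected.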

Actual results

Indices of examples with other languages are included in the filtered dataset's indices.

Preliminary investigation

It seems to be a bug introduced by:

Metadata

Labels

bug (Something isn't working)
