Closed
Labels
dataset bug: A bug in a dataset script provided in the library
Description
Describe the bug
All the supposedly non-English subcollections for the X-CSQA task in the xcsr benchmark dataset seem to be copies of the English subcollection rather than translations. This contradicts the data description:
we automatically translate the original CSQA and CODAH datasets, which only have English versions, to 15 other languages, forming development and test sets for studying X-CSR
Steps to reproduce the bug
import datasets
# let's say you want to load the french X-CSQA subcollection
french = datasets.load_dataset("xcsr", "X-CSQA-fr")
# for good measure, let's load english too
english = datasets.load_dataset("xcsr", "X-CSQA-en")
# let's inspect
"".join(english['test'][0]['question']['stem'])
# output: 'The people wanted to stop the parade, so what did they set up to thwart it?'
"".join(french['test'][0]['question']['stem'])
# output: 'The people wanted to stop the parade, so what did they set up to thwart it?'
# what? Why are they both in english?
# I've checked this for validation and train splits too, across many datapoints. It's all the same english dataset
# maybe i need to look better?
french['test'].unique('lang')
# output: ['en']
# no, it's all english
Expected results
Accessing a subcollection in language X should return a subcollection containing samples in language X.
Actual results
Accessing a subcollection in language X returns a subcollection containing samples in English.
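For anyone who wants to check the other language configs the same way, here is a minimal sketch of the comparison done by hand above. `compare_subcollections` is a hypothetical helper, not part of `datasets`. It assumes that rows are aligned by index across configs and that each row has the `lang` and `question.stem` fields seen in the reproduction steps. The toy rows below stand in for the real splits.

```python
def compare_subcollections(reference, candidate, expected_lang):
    """Compare two splits row by row.

    Returns a tuple of:
      - the fraction of aligned rows whose question stems are identical
      - the set of `lang` values in `candidate` other than `expected_lang`
    """
    pairs = list(zip(reference, candidate))
    identical = sum(
        1
        for ref, cand in pairs
        if ref["question"]["stem"] == cand["question"]["stem"]
    )
    fraction = identical / len(pairs) if pairs else 0.0
    unexpected = {row["lang"] for row in candidate} - {expected_lang}
    return fraction, unexpected


# Toy stand-ins for english['test'] and french['test'] (assumed row layout).
english_rows = [{"lang": "en", "question": {"stem": "What did they set up?"}}]
french_rows = [{"lang": "en", "question": {"stem": "What did they set up?"}}]

fraction, unexpected = compare_subcollections(english_rows, french_rows, "fr")
print(fraction, unexpected)  # 1.0 {'en'}
```

On the real data this would be called as `compare_subcollections(english['test'], french['test'], 'fr')`. A fraction near 1.0 together with a non-empty set of unexpected languages matches the behaviour described in this report.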
Environment info
- datasets version: 2.5.1
- Platform: macOS-10.15.7-x86_64-i386-64bit
- Python version: 3.8.13
- PyArrow version: 9.0.0
- Pandas version: 1.4.3