xcsr: X-CSQA simply uses english for all alleged non-english data #5017

@thesofakillers

Describe the bug

All the alleged non-English subcollections for the X-CSQA task in the xcsr benchmark dataset seem to be copies of the English subcollection rather than translations. This contradicts the data description:

"we automatically translate the original CSQA and CODAH datasets, which only have English versions, to 15 other languages, forming development and test sets for studying X-CSR"

Steps to reproduce the bug

import datasets

# let's say you want to load the french X-CSQA subcollection
french = datasets.load_dataset("xcsr", "X-CSQA-fr")
# for good measure, let's load english too
english = datasets.load_dataset("xcsr", "X-CSQA-en")

# let's inspect
"".join(english['test'][0]['question']['stem'])
# output: 'The people wanted to stop the parade, so what did they set up to thwart it?'
"".join(french['test'][0]['question']['stem'])
# output: 'The people wanted to stop the parade, so what did they set up to thwart it?'
# what? Why are they both in english?

# I've checked this for validation and train splits too, across many datapoints. It's all the same english dataset
# maybe i need to look better?
french['test'].unique('lang')
# output: ['en']
# no, it's all english
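
The spot checks above can be turned into a single number. The following is a minimal sketch, not part of the original report: `identical_stem_fraction` is a hypothetical helper that assumes both splits are aligned by index and that each example exposes `question.stem` as shown in the snippet above.

```python
def identical_stem_fraction(split_a, split_b):
    """Fraction of index-aligned examples whose question stems are identical.

    Hypothetical helper: assumes both splits are iterables of dicts with a
    nested ["question"]["stem"] string, in the same order.
    """
    pairs = list(zip(split_a, split_b))
    if not pairs:
        return 0.0
    same = sum(
        1 for a, b in pairs
        if a["question"]["stem"] == b["question"]["stem"]
    )
    return same / len(pairs)


# Usage with the objects loaded above (not executed here):
# identical_stem_fraction(english["test"], french["test"])
```

A value close to 1.0 would indicate that the two subcollections contain the same text, while genuinely translated data should give a value close to 0.0.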

Expected results

Accessing a subcollection in language X should return a subcollection containing samples in language X.

Actual results

Accessing a subcollection in language X returns a subcollection containing samples in English.
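
The mismatch between expected and actual results can also be expressed as a small check on the `lang` column. This is only a sketch: `unexpected_langs` is a hypothetical helper, and it assumes that the suffix of the config name (for example `fr` in `X-CSQA-fr`) is the value the `lang` field should hold.

```python
def unexpected_langs(config_name, langs):
    """Return the language codes in `langs` that differ from the config suffix.

    Hypothetical helper: assumes the expected language code is whatever
    follows the last "-" in the config name, e.g. "fr" for "X-CSQA-fr".
    """
    expected = config_name.rsplit("-", 1)[-1]
    return sorted(set(langs) - {expected})


# Usage with the dataset loaded above (not executed here):
# unexpected_langs("X-CSQA-fr", french["test"].unique("lang"))
```

An empty list would mean every sample is tagged with the requested language. With the `['en']` output reported above, the helper would return `['en']` for the French config.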

Environment info

  • datasets version: 2.5.1
  • Platform: macOS-10.15.7-x86_64-i386-64bit
  • Python version: 3.8.13
  • PyArrow version: 9.0.0
  • Pandas version: 1.4.3

Metadata

Labels

dataset bug: A bug in a dataset script provided in the library
