Running the following code
import nlp
ds = nlp.load_dataset("trivia_qa", "rc", split="validation[:1%]") # this might take 2.3 min to download but it's cached afterwards...
ds.map(lambda x: x, load_from_cache_file=False)
triggers a ArrowInvalid: Could not convert TagMe with type str: converting to null type error.
On the other hand if we remove a certain column of trivia_qa which seems responsible for the bug, it works:
import nlp
ds = nlp.load_dataset("trivia_qa", "rc", split="validation[:1%]") # this might take 2.3 min to download but it's cached afterwards...
ds.map(lambda x: x, remove_columns=["entity_pages"], load_from_cache_file=False)
. Seems quite hard to debug what's going on here... @lhoestq @thomwolf - do you have a good first guess what the problem could be?
Note BTW: I think this could be a good test to check that the datasets work correctly: Take a tiny portion of the dataset and check that it can be written correctly.