
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe' #4210

@loretoparisi

Description


System Info

- `transformers` version: 4.18.0
- Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.10.0+cu111 (True)
- Tensorflow version (GPU?): 2.8.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from datasets import load_dataset,Features,Value,ClassLabel

class_names = ["cmn","deu","rus","fra","eng","jpn","spa","ita","kor","vie","nld","epo","por","tur","heb","hun","ell","ind","ara","arz","fin","bul","yue","swe","ukr","bel","que","ces","swh","nno","wuu","nob","zsm","est","kat","pol","lat","urd","sqi","isl","fry","afr","ron","fao","san","bre","tat","yid","uig","uzb","srp","qya","dan","pes","slk","eus","cycl","acm","tgl","lvs","kaz","hye","hin","lit","ben","cat","bos","hrv","tha","orv","cha","mon","lzh","scn","gle","mkd","slv","frm","glg","vol","ain","jbo","tok","ina","nds","mal","tlh","roh","ltz","oss","ido","gla","mlt","sco","ast","jav","oci","ile","ota","xal","tel","sjn","nov","khm","tpi","ang","aze","tgk","tuk","chv","hsb","dsb","bod","sme","cym","mri","ksh","kmr","ewe","kab","ber","tpw","udm","lld","pms","lad","grn","mlg","xho","pnb","grc","hat","lao","npi","cor","nah","avk","mar","guj","pan","kir","myv","prg","sux","crs","ckt","bak","zlm","hil","cbk","chr","nav","lkt","enm","arq","lin","abk","pcd","rom","gsw","tam","zul","awa","wln","amh","bar","hbo","mhr","bho","mrj","ckb","osx","pfl","mgm","sna","mah","hau","kan","nog","sin","glv","dng","kal","liv","vro","apc","jdt","fur","che","haw","yor","crh","pdc","ppl","kin","shs","mnw","tet","sah","kum","ngt","nya","pus","hif","mya","moh","wol","tir","ton","lzz","oar","lug","brx","non","mww","hak","nlv","ngu","bua","aym","vec","ibo","tkl","bam","kha","ceb","lou","fuc","smo","gag","lfn","arg","umb","tyv","kjh","oji","cyo","urh","kzj","pam","srd","lmo","swg","mdf","gil","snd","tso","sot","zza","tsn","pau","som","egl","ady","asm","ori","dtp","cho","max","kam","niu","sag","ilo","kaa","fuv","nch","hoc","iba","gbm","sun","war","mvv","pap","ary","kxi","csb","pag","cos","rif","kek","krc","aii","ban","ssw","tvl","mfe","tah","bvy","bcl","hnj","nau","nst","afb","quc","min","tmw","mad","bjn","mai","cjy","got","hsn","gan","tzl","dws","ldn","afh","sgs","krl","vep","rue","tly","mic","ext","izh","sma","jam","cmo","mwl","kpv","koi","bis","ike","run","evn","ryu","mnc","aoz","otk","kas","aln","akl","yua","shy","fkv","gos","fij","thv","zgh","gcf","cay","xmf","tig","div","lij","rap","hrx","cpi","tts","gaa","tmr","iii","ltg","bzt","syc","emx","gom","chg","osp","stq","frr","fro","nys","toi","new","phn","jpa","rel","drt","chn","pli","laa","bal","hdn","hax","mik","ajp","xqa","pal","crk","mni","lut","ayl","ood","sdh","ofs","nus","kiu","diq","qxq","alt","bfz","klj","mus","srn","guc","lim","zea","shi","mnr","bom","sat","szl"]
features = Features({ 'label': ClassLabel(names=class_names), 'text': Value('string')})
num_labels = features['label'].num_classes
data_files = { "train": "train.csv", "test": "test.csv" }
sentences = load_dataset("loretoparisi/tatoeba-sentences",
                         data_files=data_files,
                         delimiter='\t',
                         column_names=['label', 'text'],
                         features=features)

ERROR:

ClassLabel(num_classes=403, names=['cmn', 'deu', 'rus', 'fra', 'eng', 'jpn', 'spa', 'ita', 'kor', 'vie', 'nld', 'epo', 'por', 'tur', 'heb', 'hun', 'ell', 'ind', 'ara', 'arz', 'fin', 'bul', 'yue', 'swe', 'ukr', 'bel', 'que', 'ces', 'swh', 'nno', 'wuu', 'nob', 'zsm', 'est', 'kat', 'pol', 'lat', 'urd', 'sqi', 'isl', 'fry', 'afr', 'ron', 'fao', 'san', 'bre', 'tat', 'yid', 'uig', 'uzb', 'srp', 'qya', 'dan', 'pes', 'slk', 'eus', 'cycl', 'acm', 'tgl', 'lvs', 'kaz', 'hye', 'hin', 'lit', 'ben', 'cat', 'bos', 'hrv', 'tha', 'orv', 'cha', 'mon', 'lzh', 'scn', 'gle', 'mkd', 'slv', 'frm', 'glg', 'vol', 'ain', 'jbo', 'tok', 'ina', 'nds', 'mal', 'tlh', 'roh', 'ltz', 'oss', 'ido', 'gla', 'mlt', 'sco', 'ast', 'jav', 'oci', 'ile', 'ota', 'xal', 'tel', 'sjn', 'nov', 'khm', 'tpi', 'ang', 'aze', 'tgk', 'tuk', 'chv', 'hsb', 'dsb', 'bod', 'sme', 'cym', 'mri', 'ksh', 'kmr', 'ewe', 'kab', 'ber', 'tpw', 'udm', 'lld', 'pms', 'lad', 'grn', 'mlg', 'xho', 'pnb', 'grc', 'hat', 'lao', 'npi', 'cor', 'nah', 'avk', 'mar', 'guj', 'pan', 'kir', 'myv', 'prg', 'sux', 'crs', 'ckt', 'bak', 'zlm', 'hil', 'cbk', 'chr', 'nav', 'lkt', 'enm', 'arq', 'lin', 'abk', 'pcd', 'rom', 'gsw', 'tam', 'zul', 'awa', 'wln', 'amh', 'bar', 'hbo', 'mhr', 'bho', 'mrj', 'ckb', 'osx', 'pfl', 'mgm', 'sna', 'mah', 'hau', 'kan', 'nog', 'sin', 'glv', 'dng', 'kal', 'liv', 'vro', 'apc', 'jdt', 'fur', 'che', 'haw', 'yor', 'crh', 'pdc', 'ppl', 'kin', 'shs', 'mnw', 'tet', 'sah', 'kum', 'ngt', 'nya', 'pus', 'hif', 'mya', 'moh', 'wol', 'tir', 'ton', 'lzz', 'oar', 'lug', 'brx', 'non', 'mww', 'hak', 'nlv', 'ngu', 'bua', 'aym', 'vec', 'ibo', 'tkl', 'bam', 'kha', 'ceb', 'lou', 'fuc', 'smo', 'gag', 'lfn', 'arg', 'umb', 'tyv', 'kjh', 'oji', 'cyo', 'urh', 'kzj', 'pam', 'srd', 'lmo', 'swg', 'mdf', 'gil', 'snd', 'tso', 'sot', 'zza', 'tsn', 'pau', 'som', 'egl', 'ady', 'asm', 'ori', 'dtp', 'cho', 'max', 'kam', 'niu', 'sag', 'ilo', 'kaa', 'fuv', 'nch', 'hoc', 'iba', 'gbm', 'sun', 'war', 'mvv', 'pap', 'ary', 'kxi', 'csb', 'pag', 'cos', 'rif', 'kek', 'krc', 'aii', 'ban', 'ssw', 'tvl', 'mfe', 'tah', 'bvy', 'bcl', 'hnj', 'nau', 'nst', 'afb', 'quc', 'min', 'tmw', 'mad', 'bjn', 'mai', 'cjy', 'got', 'hsn', 'gan', 'tzl', 'dws', 'ldn', 'afh', 'sgs', 'krl', 'vep', 'rue', 'tly', 'mic', 'ext', 'izh', 'sma', 'jam', 'cmo', 'mwl', 'kpv', 'koi', 'bis', 'ike', 'run', 'evn', 'ryu', 'mnc', 'aoz', 'otk', 'kas', 'aln', 'akl', 'yua', 'shy', 'fkv', 'gos', 'fij', 'thv', 'zgh', 'gcf', 'cay', 'xmf', 'tig', 'div', 'lij', 'rap', 'hrx', 'cpi', 'tts', 'gaa', 'tmr', 'iii', 'ltg', 'bzt', 'syc', 'emx', 'gom', 'chg', 'osp', 'stq', 'frr', 'fro', 'nys', 'toi', 'new', 'phn', 'jpa', 'rel', 'drt', 'chn', 'pli', 'laa', 'bal', 'hdn', 'hax', 'mik', 'ajp', 'xqa', 'pal', 'crk', 'mni', 'lut', 'ayl', 'ood', 'sdh', 'ofs', 'nus', 'kiu', 'diq', 'qxq', 'alt', 'bfz', 'klj', 'mus', 'srn', 'guc', 'lim', 'zea', 'shi', 'mnr', 'bom', 'sat', 'szl'], id=None)
Value(dtype='string', id=None)
Using custom data configuration loretoparisi--tatoeba-sentences-7b2c5e991f398f39
Downloading and preparing dataset csv/loretoparisi--tatoeba-sentences to /root/.cache/huggingface/datasets/csv/loretoparisi--tatoeba-sentences-7b2c5e991f398f39/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...
Downloading data files: 100%
2/2 [00:18<00:00, 8.06s/it]
Downloading data: 100%
391M/391M [00:13<00:00, 35.3MB/s]
Downloading data: 100%
92.4M/92.4M [00:02<00:00, 36.5MB/s]
Failed to read file '/root/.cache/huggingface/datasets/downloads/933132df9905194ea9faeb30cabca8c49318795612f6495fcb941a290191dd5d' with error <class 'ValueError'>: invalid literal for int() with base 10: 'cmn'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
15 frames
/usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

ValueError: invalid literal for int() with base 10: 'cmn'
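The root cause appears to be that the CSV builder hands the ClassLabel's integer storage type to pandas, which then tries to parse the string language codes as integers. Here is a minimal sketch of that hypothesis, reproducing the same pandas error without datasets (treating the label column as int64 is my assumption about what the builder does):

import io
import pandas as pd

tsv = "ces\tNechci vědět, co je tam uvnitř.\n"
# assumption: a ClassLabel column is requested from pandas as int64
pd.read_csv(io.StringIO(tsv),
            sep='\t',
            names=['label', 'text'],
            dtype={'label': 'int64'})
# -> ValueError: invalid literal for int() with base 10: 'ces'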

When loading without the features argument, the files load without errors:

sentences = load_dataset("loretoparisi/tatoeba-sentences",
                             data_files=data_files,
                             delimiter='\t', 
                             column_names=['label', 'text']
                         )

but the label column then has the wrong feature type (a plain string Value instead of the ClassLabel):

sentences['train'].features
{'label': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None)}
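A possible workaround (only a sketch, not verified on this dataset) is to load the labels as plain strings and convert them afterwards with ClassLabel.str2int via map, passing the target features so the schema is updated:

from datasets import load_dataset, Features, Value, ClassLabel

features = Features({'label': ClassLabel(names=class_names), 'text': Value('string')})
sentences = load_dataset("loretoparisi/tatoeba-sentences",
                         data_files=data_files,
                         delimiter='\t',
                         column_names=['label', 'text'])
# encode the string codes into integer ids, keeping the explicit class_names order
sentences = sentences.map(lambda ex: {'label': features['label'].str2int(ex['label'])},
                          features=features)

Alternatively, class_encode_column("label") builds a ClassLabel from the values found in the data, though the resulting id order may differ from class_names above.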

The dataset used is https://huggingface.co/datasets/loretoparisi/tatoeba-sentences

The dataset format is:

ces	Nechci vědět, co je tam uvnitř.
ces	Kdo o tom chce slyšet?
deu	Tom sagte, er fühle sich nicht wohl.
ber	Mel-iyi-d anida-t tura ?
hun	Gondom lesz rá rögtön.
ber	Mel-iyi-d anida-tt tura ?
deu	Ich will dich nicht reden hören.
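For reference, the intended mapping between these ISO 639-3 codes and integer label ids is the one given by the ClassLabel feature itself, e.g. (the exact indices follow the class_names order above):

label = features['label']
label.str2int('ces')   # -> 27
label.int2str(2)       # -> 'rus'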

Expected behavior

The train and test files should load correctly, with the label column typed as the provided ClassLabel feature.
