Improved compatibility with Robust04 #59
- Skip empty documents without aborting the process
Thanks for this, Andrea. Perhaps we can rename the parameter to Also, could you add a test case?
Hi @cmacdonald, I assume that the Vaswani collection has no empty documents, so a test on it would be trivial. Can I use the collection 'irds:disks45/nocr/trec-robust-2004' in the test? Kind regards,
Agreed, it won't.
No, it is not available on GitHub, as it needs a license. Try something like this, replacing
indexer.index([next(iter) for i in range(200)])
with
docs = [next(iter) for i in range(200)]
docs.insert(100, {'docno': 'empty', 'text' : ''}) # truly empty
docs.insert(105, {'docno': 'empty', 'text' : ' '}) # whitespace only
factory = indexer.index(docs)
self.assertEqual(200, len(factory)) # check that empty docs are indeed ignored
Done,
I fixed various things so that the test cases no longer give Python errors. It's now throwing
Hi,
I am working with ColBERT and have run into some issues when indexing Robust04.
I noticed that the code crashes if there are empty documents, so I propose letting users decide the behaviour with the optional parameter allow_empty_doc.
Additionally, the Robust04 collection uses "body" rather than "text".
I hope my pull request helps. I remain available for further changes and clarifications.
Greetings,
Andrea
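
For illustration, the behaviour described in this message could be sketched roughly as follows. This is not the code from the pull request: the function name `iter_nonempty_docs` and the parameters `text_field` and `skip_empty` are hypothetical, and the exact semantics of `allow_empty_doc` are not stated in this conversation. The sketch only shows one way to skip empty or whitespace-only documents, rather than aborting, while reading the text from a configurable field such as "body".

```python
from typing import Dict, Iterable, Iterator


def iter_nonempty_docs(
    docs: Iterable[Dict[str, str]],
    text_field: str = "text",   # hypothetical: Robust04 would use "body"
    skip_empty: bool = True,    # hypothetical flag, not the PR's parameter
) -> Iterator[Dict[str, str]]:
    """Yield only documents whose text field has non-whitespace content."""
    for doc in docs:
        text = doc.get(text_field) or ""
        if text.strip():
            yield doc
        elif not skip_empty:
            # Abort on the first empty document instead of skipping it.
            raise ValueError(f"empty document: {doc.get('docno')!r}")
        # Otherwise the empty document is silently skipped.


docs = [
    {"docno": "d1", "body": "some text"},
    {"docno": "d2", "body": ""},     # truly empty
    {"docno": "d3", "body": "   "},  # whitespace only
    {"docno": "d4", "body": "more text"},
]
kept = list(iter_nonempty_docs(docs, text_field="body"))
```

Because the filter is a generator, it can wrap a document iterator before the iterator is handed to an indexer, so the collection is never held in memory as a whole.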