Improved compatibility with Robust04 #59

Open
andreabac3 wants to merge 7 commits into terrierteam:main from andreabac3:main


andreabac3 commented Feb 10, 2023

Hi,
I am working with ColBERT and ran into some issues when indexing Robust04.
I noticed that the code crashes if there are empty documents, so I propose letting users decide the behaviour via an optional parameter, allow_empty_doc.
Additionally, the Robust04 collection uses "body" rather than "text" as the document field.
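
A minimal sketch of the two behaviours described above, assuming documents arrive as dictionaries from an iterator. The helper name `prepare_docs` and its `text_field` and `skip_empty` parameters are hypothetical and are not taken from this pull request; they only illustrate skipping empty documents and reading the content from "body" instead of "text".

```python
from typing import Dict, Iterable, Iterator


def prepare_docs(
    docs: Iterable[Dict[str, str]],
    text_field: str = "text",
    skip_empty: bool = True,
) -> Iterator[Dict[str, str]]:
    """Yield documents with their content under 'text', optionally skipping empty ones.

    Hypothetical helper for illustration; not the code from this pull request.
    """
    for doc in docs:
        content = doc.get(text_field, "")
        if not content.strip():
            if skip_empty:
                continue  # drop empty or whitespace-only documents
            raise ValueError(f"Document {doc.get('docno')!r} has no content")
        # Copy every other field and expose the content under the 'text' key.
        out = {k: v for k, v in doc.items() if k != text_field}
        out["text"] = content
        yield out


# Robust04-style records keep their content in "body" (toy data, not real documents).
robust_like = [
    {"docno": "d1", "body": "some passage"},
    {"docno": "d2", "body": "   "},
    {"docno": "d3", "body": "another passage"},
]
kept = list(prepare_docs(robust_like, text_field="body"))
```

With `skip_empty=False` the same input raises a `ValueError` at the whitespace-only document, which mirrors the "abort" behaviour that the optional parameter is meant to make avoidable.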

I hope my pull request helps. I remain available for further changes and clarifications.

Greetings,
Andrea

- Skip of empty documents without aborting the process
@cmacdonald
Collaborator

Thanks for this, Andrea. Perhaps we can rename the parameter to skip_empty_docs?

Also, could you add a test case?

@andreabac3
Author

Hi @cmacdonald,
Done.

I assume that the Vaswani collection has no empty documents, and therefore the test is trivial.

Can I put the following collection 'irds:disks45/nocr/trec-robust-2004' into the test?

Kind regards,
Andrea

@cmacdonald
Collaborator

cmacdonald commented Feb 10, 2023

Does Vaswani have any empty documents?

Agreed, it won't.

> Can I put the following collection 'irds:disks45/nocr/trec-robust-2004' into the test?

No, it is not available on GitHub, as it needs a license.

Try something like this:

indexer.index([next(iter) for i in range(200)])

->

docs = [next(iter) for i in range(200)]
docs.insert(100, {'docno': 'empty', 'text' : ''}) # truly empty
docs.insert(105, {'docno': 'empty', 'text' : ' '}) # whitespace only
factory = indexer.index(docs)
self.assertEqual(200, len(factory)) # check that empty docs are indeed ignored
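
The suggested test can be tried in isolation with a stand-in indexer. In the sketch below, `FakeIndexer` and its `index` method are hypothetical substitutes for the real indexer, and the synthetic documents replace the Vaswani corpus iterator; only the insert-and-count pattern comes from the comment above.

```python
import unittest


class FakeIndexer:
    """Hypothetical stand-in: 'indexes' documents by collecting the non-empty ones."""

    def index(self, docs):
        return [d for d in docs if d["text"].strip()]


class TestSkipEmptyDocs(unittest.TestCase):
    def test_empty_docs_are_ignored(self):
        # Synthetic corpus standing in for the real dataset iterator.
        corpus = iter({"docno": str(i), "text": f"document {i}"} for i in range(1000))
        docs = [next(corpus) for _ in range(200)]
        docs.insert(100, {"docno": "empty", "text": ""})   # truly empty
        docs.insert(105, {"docno": "empty", "text": " "})  # whitespace only
        indexed = FakeIndexer().index(docs)
        # 202 documents go in, but only the 200 non-empty ones should be indexed.
        self.assertEqual(200, len(indexed))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSkipEmptyDocs)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Inserting the empty records in the middle of the batch, rather than at the end, also checks that indexing continues past them instead of stopping at the first empty document.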

@andreabac3
Author

Done,
thank you for your support! :)

@cmacdonald
Collaborator

I fixed various things so that the test cases no longer raise Python errors. It's now failing with "Error: Process completed with exit code 143." I'll try to look into this later.
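
For context on that failure, exit codes above 128 conventionally mean the process was terminated by signal number `code - 128` in POSIX shells, so 143 corresponds to signal 15. The short sketch below decodes it; the helper name `describe_exit_code` is hypothetical, and the result says nothing about why the CI runner sent the signal.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a shell-style exit code to a signal name when it is above 128."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exit status {code}"


name = describe_exit_code(143)
```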

