I have a question about your training corpus.
The News Crawl corpora that you use are both shuffled and de-duplicated. However, other models like BERT and RoBERTa are pre-trained on non-shuffled corpora in which documents are separated by an empty line. With this unshuffled form, when you create pre-training instances, you end up with contiguous sentences in segment A and segment B. But in your case, the segments will contain non-contiguous sentences, right?
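To make the distinction concrete, here is a minimal sketch of what I mean. It is not the actual preprocessing of BERT, RoBERTa, or this repository: the function names are hypothetical, it assumes one sentence per line with a blank line between documents, it assumes the shuffling happens at sentence level, and it pairs single sentences instead of packing segments up to a maximum length.

```python
import random


def read_documents(text):
    """Split a corpus into documents on blank lines (one sentence per line)."""
    docs, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the current document.
            docs.append(current)
            current = []
    if current:
        docs.append(current)
    return docs


def contiguous_pairs(docs):
    """Unshuffled case: segment B is the sentence that follows segment A
    inside the same document, so pairs never cross a document boundary."""
    pairs = []
    for doc in docs:
        pairs.extend(zip(doc, doc[1:]))
    return pairs


def shuffled_pairs(docs, seed=0):
    """Shuffled case (assumed sentence-level shuffle): document boundaries
    are lost, so adjacent lines are usually unrelated sentences."""
    sentences = [s for doc in docs for s in doc]
    random.Random(seed).shuffle(sentences)
    return list(zip(sentences, sentences[1:]))


if __name__ == "__main__":
    toy_corpus = "a1\na2\na3\n\nb1\nb2\n"
    docs = read_documents(toy_corpus)
    print(contiguous_pairs(docs))  # pairs stay within document "a" or "b"
    print(shuffled_pairs(docs))    # pairs may mix sentences from "a" and "b"
```

In the first case every (A, B) pair is real running text from a single document; in the second, most pairs are just two arbitrary sentences placed next to each other.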
So my question is: what is your opinion on having non-contiguous sentences in the segments? Does it hurt MLM performance or performance on downstream tasks?