Pretraining with News Crawls by WMT 19 #31

I have a query regarding your training corpus.

The News Crawl corpora that you use are both shuffled and de-duplicated. However, other models such as BERT and RoBERTa use a non-shuffled corpus in which documents are separated by an empty line. With this un-shuffled form, creating pre-training instances gives you contiguous sentences in segment A and segment B. In your case, the segments will contain non-contiguous sentences, right?

So my question is: what is your opinion on having non-contiguous sentences in the segments? Does it hurt MLM performance or performance on downstream tasks?
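
To make the distinction concrete, here is a minimal sketch of what I mean. It is deliberately simplified: each segment is a single sentence, whereas real pipelines typically pack several sentences into each segment up to a maximum length. The function names (`read_documents`, `contiguous_pairs`, `shuffled_pairs`) and the toy corpus are hypothetical and are not taken from any particular codebase.

```python
import random


def read_documents(lines):
    """Group lines into documents; an empty line marks a document boundary."""
    docs, current = [], []
    for line in lines:
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            docs.append(current)
            current = []
    if current:
        docs.append(current)
    return docs


def contiguous_pairs(docs):
    """Un-shuffled case: segment B is the sentence that follows segment A
    within the same document, so pairs never cross a document boundary."""
    return [(doc[i], doc[i + 1]) for doc in docs for i in range(len(doc) - 1)]


def shuffled_pairs(docs, seed=0):
    """Shuffled case (assumed sentence-level shuffling): document structure
    is lost, so neighbouring sentences are generally unrelated."""
    sentences = [s for doc in docs for s in doc]
    random.Random(seed).shuffle(sentences)
    return [(sentences[i], sentences[i + 1]) for i in range(len(sentences) - 1)]


# Toy corpus: two documents separated by an empty line.
corpus = ["A1", "A2", "A3", "", "B1", "B2"]
docs = read_documents(corpus)

print(contiguous_pairs(docs))  # pairs stay inside one document
print(shuffled_pairs(docs))    # pairs may mix sentences from different documents
```

In the first case every (A, B) pair is coherent text from one article; in the second, the pairing is effectively arbitrary, which is the situation I am asking about.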

