Closed
Labels: enhancement (New feature or request)
Description
Ondrej asks Ebrahim to propose several solutions to the problem of indicating which files are the input (source) and which are the reference.
The exact problem is this:
- a given index specifies a set of documents
- each document comes in multiple versions (language variants, modalities, …)
- a single document can come in several modalities that allow for different uses; see e.g. https://github.com/ELITR/elitr-testset/tree/master/documents/wmt18-newstest-sample-read and the related issue there, elitr-testset#11 ("Add wmt18-newstest-sample-read to indices").
We need an approach to indicate which is the source and which is the reference document so that it works across all indices and all documents.
Imagine that you want an index for en2cs MT and you want it to include:
- wmt18-newstest-sample-read (28 documents of the same kind)
- confidential/amalach-sample-interview (1 document as of now)
These two collections of documents use different suffixes, for a good reason.
My proposal: interleave the index with ‘suffix pair specifiers’:
# index for MT from EN to CS
# files from one document collection:
# SRC->REF: *.en.OSt -> *.cs.OSt
wmt18-newstest-sample-read
# files from another document collection
# SRC->REF: *.en -> *.cs
confidential/amalach-sample-interview
Every entry that follows a line with a ‘suffix pair specifier’ (SRC->REF) will be interpreted according to that specifier, until the next specifier appears.
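To make the proposal concrete, here is a minimal sketch of how such an interleaved index could be read. The function name `parse_index`, the regular expression, and the decision to raise an error for an entry that precedes any specifier are assumptions made for illustration; none of them exists in elitr-testset.

```python
import re

# Matches lines such as "# SRC->REF: *.en.OSt -> *.cs.OSt".
# Assumption: each suffix is written as "*" followed by the literal suffix.
SPEC_RE = re.compile(r"^#\s*SRC->REF:\s*\*(\S+)\s*->\s*\*(\S+)\s*$")


def parse_index(text):
    """Return (collection, src_suffix, ref_suffix) for each index entry."""
    entries = []
    current = None  # suffix pair currently in effect
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = SPEC_RE.match(line)
        if match:
            # A new specifier replaces the previous one.
            current = (match.group(1), match.group(2))
        elif line.startswith("#"):
            continue  # ordinary comment, ignored
        elif current is None:
            # Assumption: an entry without a preceding specifier is an error.
            raise ValueError(f"no SRC->REF specifier before entry: {line}")
        else:
            entries.append((line, *current))
    return entries


index_text = """\
# index for MT from EN to CS
# files from one document collection:
# SRC->REF: *.en.OSt -> *.cs.OSt
wmt18-newstest-sample-read
# files from another document collection
# SRC->REF: *.en -> *.cs
confidential/amalach-sample-interview
"""
entries = parse_index(index_text)
```

With the example index above, `entries` pairs `wmt18-newstest-sample-read` with `.en.OSt` and `.cs.OSt`, and `confidential/amalach-sample-interview` with `.en` and `.cs`. Because the specifiers are ordinary `#` comments, a reader that ignores comments would still see a plain list of collections.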
If we want more file types to be considered (quite likely), we could have more suffix specifiers, not just a pair:
# An index for MT evaluation with a metric that further focuses on some 'dictionary' scoring:
# SRC: *.en.OSt
# REF: *.cs.OSt
# REFDICT: *.cs.dictionary
wmt18-newstest-sample-read
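The multi-specifier variant could be read along the same lines. The sketch below is only an illustration: `parse_role_index` is a hypothetical name, role names are assumed to be upper-case words, and the rule that a specifier appearing after an entry starts a fresh group of roles is an assumption the proposal does not spell out.

```python
import re

# Matches lines such as "# REFDICT: *.cs.dictionary".
# Assumption: role names consist of upper-case letters only.
ROLE_RE = re.compile(r"^#\s*([A-Z]+):\s*\*(\S+)\s*$")


def parse_role_index(text):
    """Return (collection, {role: suffix}) for each index entry."""
    entries = []
    roles = {}
    seen_entry = False  # True once an entry has used the current roles
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ROLE_RE.match(line)
        if match:
            if seen_entry:
                # Assumption: a specifier after an entry starts a new group.
                roles = {}
                seen_entry = False
            roles[match.group(1)] = match.group(2)
        elif line.startswith("#"):
            continue  # ordinary comment, ignored
        else:
            entries.append((line, dict(roles)))
            seen_entry = True
    return entries


role_index_text = """\
# An index for MT evaluation with a metric that further focuses on some 'dictionary' scoring:
# SRC: *.en.OSt
# REF: *.cs.OSt
# REFDICT: *.cs.dictionary
wmt18-newstest-sample-read
"""
role_entries = parse_role_index(role_index_text)
```

Here `role_entries` holds a single entry for `wmt18-newstest-sample-read` with the roles `SRC`, `REF`, and `REFDICT`. Whether roles should instead be inherited across groups is a design question left open by the proposal.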