suffix specifiers in indices

Ondrej asks Ebrahim to propose several solutions to the problem of indicating which files are the input (source) and which are the reference.

The exact problem is this:
- a given index specifies a set of documents
- each document comes in multiple versions (language variants, modalities, …)
- a single document can come in several modalities that allow for different uses, see e.g. https://github.com/ELITR/elitr-testset/tree/master/documents/wmt18-newstest-sample-read and the related issue 11 there (https://github.com/ELITR/elitr-testset/issues/11).

We need an approach to indicate which is the source and which is the reference document so that it works across all indices and all documents.

Imagine that you want an index for en2cs MT and you want it to include:
- wmt18-newstest-sample-read (28 documents of the same kind)
- confidential/amalach-sample-interview (1 document as of now)
These two document collections use different suffixes, for a good reason.

My proposal: Interleave the index with ‘suffix pair specifiers’:
```
# index for MT from EN to CS

# files from one document collection:
# SRC->REF: *.en.OSt -> *.cs.OSt
wmt18-newstest-sample-read

# files from another document collection
# SRC->REF: *.en -> *.cs
confidential/amalach-sample-interview
```

Every entry that follows a ‘suffix pair specifier’ line (SRC->REF) will be interpreted according to that specifier, until the next specifier line appears.
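
To make the intended semantics concrete, here is a minimal Python sketch of how such an index could be read. It is only an illustration, not an existing tool in the repository: the function name `parse_pair_index` and the regular expression are hypothetical, and it assumes that a specifier stays in effect until the next one and that an entry with no preceding specifier is an error.

```python
import re

# Hypothetical pattern for a line such as "# SRC->REF: *.en.OSt -> *.cs.OSt".
PAIR_RE = re.compile(r"^#\s*SRC->REF:\s*(\S+)\s*->\s*(\S+)\s*$")


def parse_pair_index(text):
    """Return a list of (collection, src_pattern, ref_pattern) tuples."""
    current = None  # the specifier currently in effect, if any
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = PAIR_RE.match(line)
        if match:
            # A new specifier replaces the previous one (assumption).
            current = (match.group(1), match.group(2))
            continue
        if line.startswith("#"):
            continue  # ordinary comment, ignored
        if current is None:
            # Assumption: an entry without any specifier is rejected.
            raise ValueError(f"no SRC->REF specifier before entry: {line}")
        entries.append((line, current[0], current[1]))
    return entries


EXAMPLE_INDEX = """\
# index for MT from EN to CS

# files from one document collection:
# SRC->REF: *.en.OSt -> *.cs.OSt
wmt18-newstest-sample-read

# files from another document collection
# SRC->REF: *.en -> *.cs
confidential/amalach-sample-interview
"""

for collection, src, ref in parse_pair_index(EXAMPLE_INDEX):
    print(collection, src, ref)
```

Because the specifiers live in comment lines, tools that already skip `#` lines would keep reading the index as a plain list of collections.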

If we want more file types to be considered (quite likely), we could have more suffix specifiers, not just a pair:
```
# An index for MT evaluation with a metric that further focuses on some 'dictionary' scoring:
# SRC: *.en.OSt
# REF: *.cs.OSt
# REFDICT: *.cs.dictionary
wmt18-newstest-sample-read
```
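
The multi-specifier variant could be read along the same lines. The sketch below is again hypothetical (`parse_role_index` and the role pattern are made up for illustration). It assumes that role names are upper-case identifiers, that consecutive specifier lines accumulate into one set, and that a specifier line appearing after an entry starts a fresh set; the proposal above does not settle that last point.

```python
import re

# Hypothetical pattern for lines such as "# REFDICT: *.cs.dictionary".
# Role names are assumed to be upper-case identifiers.
ROLE_RE = re.compile(r"^#\s*([A-Z][A-Z0-9_]*):\s*(\S+)\s*$")


def parse_role_index(text):
    """Return a list of (collection, {role: pattern}) tuples."""
    roles = {}
    collecting = True  # True while specifier lines extend the current set
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ROLE_RE.match(line)
        if match:
            if not collecting:
                # Assumption: a specifier after an entry starts a new set.
                roles = {}
                collecting = True
            roles[match.group(1)] = match.group(2)
            continue
        if line.startswith("#"):
            continue  # ordinary comment, ignored
        if not roles:
            raise ValueError(f"no suffix specifier before entry: {line}")
        entries.append((line, dict(roles)))
        collecting = False
    return entries


EXAMPLE_ROLE_INDEX = """\
# An index for MT evaluation with a metric that further focuses on some 'dictionary' scoring:
# SRC: *.en.OSt
# REF: *.cs.OSt
# REFDICT: *.cs.dictionary
wmt18-newstest-sample-read
"""

for collection, role_map in parse_role_index(EXAMPLE_ROLE_INDEX):
    print(collection, role_map)
```

A consumer that only needs source and reference would look up `SRC` and `REF` and ignore any other roles, so adding a new file type would not break existing uses.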

suffix specifiers in indices #19
