suffix specifiers in indices #19

@obo

Ondrej asks Ebrahim to propose several solutions to the problem of indicating which files are the input (source) and which are the reference.

The exact problem is this:

We need an approach to indicate which is the source and which is the reference document so that it works across all indices and all documents.

Imagine that you want an index for en2cs MT and you want it to include:

  • wmt18-newstest-sample-read (28 documents of the same kind)
  • confidential/amalach-sample-interview (1 document as of now)

These two collections of documents use different suffixes, for good reason.

My proposal: interleave the index with ‘suffix pair specifiers’:

# index for MT from EN to CS

# files from one document collection:
# SRC->REF: *.en.OSt -> *.cs.OSt
wmt18-newstest-sample-read

# files from another document collection
# SRC->REF: *.en -> *.cs
confidential/amalach-sample-interview

Every collection listed after a ‘suffix pair specifier’ line (SRC->REF) is interpreted according to that specifier, until the next specifier line replaces it.
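To make the intended behaviour concrete, here is a minimal sketch of how such an interleaved index could be read. The function name `parse_index`, the regular expression, and the decision to raise an error when a collection appears before any specifier are all assumptions for illustration, not an agreed format or an existing implementation.

```python
import re

# Assumed shape of a specifier line: "# SRC->REF: *.en.OSt -> *.cs.OSt"
SPEC_RE = re.compile(r"^#\s*SRC->REF:\s*\*(\S+)\s*->\s*\*(\S+)\s*$")


def parse_index(text):
    """Return a list of (collection, src_suffix, ref_suffix) tuples."""
    entries = []
    current = None  # the suffix pair currently in force
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = SPEC_RE.match(line)
        if match:
            # A new specifier replaces the previous one.
            current = (match.group(1), match.group(2))
            continue
        if line.startswith("#"):
            continue  # ordinary comment, ignored
        if current is None:
            # Assumption: a collection without a preceding specifier is an error.
            raise ValueError(f"no suffix pair specifier before {line!r}")
        entries.append((line, current[0], current[1]))
    return entries


INDEX = """\
# index for MT from EN to CS

# files from one document collection:
# SRC->REF: *.en.OSt -> *.cs.OSt
wmt18-newstest-sample-read

# files from another document collection
# SRC->REF: *.en -> *.cs
confidential/amalach-sample-interview
"""

entries = parse_index(INDEX)
for collection, src, ref in entries:
    print(collection, src, ref)
```

Because specifiers are written as comments, a tool that only understands plain index lines and skips `#` lines would still read the collection names unchanged.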

If more file types need to be considered (quite likely), the index could carry one named suffix specifier per file type rather than a single pair:

# An index for MT evaluation with a metric that further focuses on some 'dictionary' scoring:
# SRC: *.en.OSt
# REF: *.cs.OSt
# REFDICT: *.cs.dictionary
wmt18-newstest-sample-read
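A sketch of how the named-specifier variant might be parsed follows. Everything here is hypothetical: the names `parse_roles_index` and `paths_for`, the rule that role names are upper-case letters, the rule that a specifier appearing after a collection line starts a fresh group, and the idea that a file path is simply a document base name plus the suffix.

```python
import re

# Assumed shape of a named specifier line: "# REFDICT: *.cs.dictionary"
ROLE_RE = re.compile(r"^#\s*([A-Z]+):\s*\*(\S+)\s*$")


def parse_roles_index(text):
    """Return a list of (collection, {role: suffix}) tuples."""
    entries = []
    roles = {}
    after_collection = False  # True once a collection used the current group
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ROLE_RE.match(line)
        if match:
            if after_collection:
                # Assumption: specifiers after a collection start a new group.
                roles = {}
                after_collection = False
            roles[match.group(1)] = match.group(2)
            continue
        if line.startswith("#"):
            continue  # ordinary comment, ignored
        entries.append((line, dict(roles)))
        after_collection = True
    return entries


def paths_for(base, roles):
    """Map each role to base + suffix (hypothetical file layout)."""
    return {role: base + suffix for role, suffix in roles.items()}


INDEX = """\
# An index for MT evaluation with a metric that further focuses on some 'dictionary' scoring:
# SRC: *.en.OSt
# REF: *.cs.OSt
# REFDICT: *.cs.dictionary
wmt18-newstest-sample-read
"""

entries = parse_roles_index(INDEX)
collection, roles = entries[0]
print(collection, roles)
print(paths_for("doc01", roles))  # "doc01" is a made-up document name
```

Under this reading, the pair form `SRC->REF` is just the special case of a group with exactly two roles.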

Labels

enhancement (New feature or request)
