Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@ build
__pycache__
.history-mahmoudi
SLTev.egg-info
elitr-testset/
18 changes: 18 additions & 0 deletions README.md
@@ -194,8 +194,26 @@ Demo example:
```
ASReval -i sample-data/sample.en.en.asrt sample-data/sample.en.OSt sample-data/sample.en.OStt -f asrt ost ostt
```
#### Parsing index files
See `SLTev/index_parser.py` for a detailed description. Structure of the index file:
```
# SRC -> *.<EXTENSION>
# REF -> *.<EXTENSION>
# ALIGN -> *.<EXTENSION>
PATH_TO_DIRECTORY
PATH_TO_ANOTHER_DIRECTORY_WITH_SAME_EXTENSIONS

# SRC -> *.<EXTENSION>
# REF -> *.<EXTENSION>
PATH_TO_DIRECTORY_WITH_DIFFERENT_EXTENSIONS
```

The `SRC` and `REF` annotations are mandatory. A new `SRC` annotation clears all annotations set before it.

Usage:
```
SLTIndexParser path_to_index_file path_to_dataset
```
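The annotation rules above can be illustrated with a minimal, self-contained sketch. It is not the code in `SLTev/index_parser.py`: the helper name `group_directories`, the directory names, and the extensions `*.OSt`, `*.ref`, `*.de.ref`, and `*.align` are placeholders chosen for illustration. It only shows which annotations are in effect for each directory line.

```python
from typing import Dict, Iterable, List, Tuple


def group_directories(lines: Iterable[str]) -> List[Tuple[str, Dict[str, str]]]:
    """Pair each directory line with the annotations active at that point."""
    active: Dict[str, str] = {}
    result: List[Tuple[str, Dict[str, str]]] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("#"):
            if "->" in line:
                # "# NAME -> *.ext" has four whitespace-separated fields.
                _, name, _, pattern = line.split()
                if name == "SRC":
                    active = {}  # a new SRC clears the earlier annotations
                active[name] = pattern
        elif line:
            if "SRC" not in active or "REF" not in active:
                raise ValueError(f"{line}: SRC and REF must both be specified")
            result.append((line, dict(active)))
    return result


# Hypothetical index contents; all names below are placeholders.
index = """\
# SRC -> *.OSt
# REF -> *.ref
# ALIGN -> *.align
dir_a
dir_b

# SRC -> *.OSt
# REF -> *.de.ref
dir_c
"""

for directory, annotations in group_directories(index.splitlines()):
    print(directory, annotations)
```

In this sketch `dir_a` and `dir_b` share all three annotations, while `dir_c` sees only `SRC` and `REF`, because the second `SRC` line dropped `ALIGN`.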
#### Notes
1. *.asrt and *.slt files have timestamps; *.mt and *.asr files do not.
2. The ``MTeval``, ``SLTeval``, and ``ASReval`` commands do not require files to follow the naming templates; the ``-f`` parameter specifies how each file is used.
66 changes: 66 additions & 0 deletions SLTev/index_parser.py
@@ -0,0 +1,66 @@
"""
Read an index file with meta-annotations (SRC, REF, ALIGNMENT, ...).

Meta-annotation format:
# NAME -> *.<EXTENSION>

parseIndexFile yields one dict per source file, mapping each annotation name to the path of the matching file.
When invoked on the command line, the module prints the list of dicts as JSON.
Multiple directories can share the same meta-annotations; they stay in effect until the next SRC line.
A SRC line resets the meta-annotations.
SRC and REF are mandatory annotations.

Example:

# SRC -> *.<EXTENSION>
# REF -> *.<EXTENSION>
PATH_TO_DIRECTORY
PATH_TO_ANOTHER_DIRECTORY_WITH_SAME_EXTENSIONS

# SRC -> *.<EXTENSION>
# REF -> *.<EXTENSION>
PATH_TO_DIRECTORY_WITH_DIFFERENT_EXTENSIONS
"""

import os
import sys
import glob
import re
import json


def parseIndexFile(indexFilePath, testsetPath):
    """Yield one dict per source file, mapping annotation names to real paths."""
    fileExtensions = {}  # Annotation name -> glob pattern, e.g. {"SRC": "*.<EXTENSION>"}
    with open(indexFilePath) as indexFile:
        for line in indexFile:
            line = line.rstrip()
            if line.startswith("#"):
                if "->" in line:
                    # split() without arguments tolerates repeated spaces and tabs
                    fields = line.split()
                    if len(fields) != 4:
                        raise Exception(f"{line} -- expected '# NAME -> *.<EXTENSION>'")
                    _, fileType, _, extension = fields
                    if not extension.startswith("*"):
                        raise Exception(f"{line} -- extension must start with a *")
                    if fileType == "SRC":
                        # A new SRC annotation starts a fresh set of annotations
                        fileExtensions = {}
                    fileExtensions[fileType] = extension
            elif len(line) > 0:
                if "SRC" not in fileExtensions or "REF" not in fileExtensions:
                    raise Exception(f"{line} -- SRC or REF not specified")
                sourceExtension = fileExtensions["SRC"]
                sources = glob.glob(f"{testsetPath}/{line}/{sourceExtension}")

                # Each source file was returned by glob, so it exists; verify that all other requested files exist
                for source in sources:
                    evalEntry = {}
                    for name, extension in fileExtensions.items():
                        # Escape the suffix so "." and other regex metacharacters match literally
                        matchingFileName = re.sub(re.escape(sourceExtension[1:]) + "$", "", source) + extension[1:]
                        if not os.path.exists(matchingFileName):
                            raise Exception(f"{name} {extension} -- {matchingFileName} does not exist")
                        evalEntry[name] = os.path.realpath(matchingFileName)
                    yield evalEntry


def main():
    paths = list(parseIndexFile(sys.argv[1], sys.argv[2]))
    print(json.dumps(paths))


if __name__ == "__main__":
    main()
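
The parser derives every non-source path by replacing the SRC suffix of a globbed source file with the suffix of another annotation. The sketch below isolates that step with plain string slicing instead of a regular expression. The helper name `sibling_path` and the file names `talk.OSt` and `talk.ref` are hypothetical and not part of this change.

```python
import os
import tempfile


def sibling_path(source: str, source_pattern: str, other_pattern: str) -> str:
    """Swap the SRC suffix of `source` for another annotation's suffix."""
    source_suffix = source_pattern[1:]  # "*.OSt" -> ".OSt"
    other_suffix = other_pattern[1:]    # "*.ref" -> ".ref"
    if not source.endswith(source_suffix):
        raise ValueError(f"{source} does not end with {source_suffix}")
    # Slice off the source suffix, then append the other suffix.
    return source[: len(source) - len(source_suffix)] + other_suffix


# Build a throwaway directory with one source file and its reference.
with tempfile.TemporaryDirectory() as root:
    for name in ("talk.OSt", "talk.ref"):
        open(os.path.join(root, name), "w").close()
    source = os.path.join(root, "talk.OSt")
    reference = sibling_path(source, "*.OSt", "*.ref")
    print(os.path.basename(reference), os.path.exists(reference))
```

Slicing avoids regex escaping altogether, because the suffix is compared as a literal string.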
1 change: 1 addition & 0 deletions setup.py
@@ -44,6 +44,7 @@
"SLTeval = SLTev.SLTeval:main_point",
"ASReval = SLTev.ASReval:main_point",
"MTeval = SLTev.MTeval:main_point",
"SLTIndexParser = SLTev.index_parser:main"
],
},
python_requires=">=3.6",