A library for parsing the giella-cg format produced by hfst-tokenize.
giellatekno/fst-analysis-parser does the same job, but because that parser
had some issues, this project is a rebuild that tries to work around them,
using that parser as a reference.
A binary (gcgp) that reads giella-cg on stdin and writes it to stdout in a converted format.
Usage: gcgp [OPTIONS] <--json|--text>
Options:
-v, --verbose Be verbose
-p, --parallel Parse in parallel
--json Output json. Each sentence is output as one object, on one line. Line breaks separate sentence objects
--text Convert to text. Empty lines between sentences
-h, --help Print help
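
The two output modes described above can be consumed with a few lines of code. The following is a minimal sketch in Python, not part of this project: the function names `iter_json_sentences` and `iter_text_sentences` are hypothetical, and the only assumptions made about the output are the ones stated in the option descriptions (one JSON object per line for --json, empty lines between sentences for --text). The fields inside each JSON object are not documented here, so the sketch does not inspect them.

```python
import json
from typing import Iterable, Iterator, List


def iter_json_sentences(lines: Iterable[str]) -> Iterator[object]:
    """Yield one decoded JSON value per non-blank line (--json output)."""
    for line in lines:
        line = line.strip()
        if line:
            # Each line is assumed to hold exactly one sentence object.
            yield json.loads(line)


def iter_text_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into sentences, splitting on empty lines (--text output)."""
    current: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            # An empty line closes the sentence collected so far.
            yield current
            current = []
    if current:
        yield current
```

Either function accepts any iterable of lines, so the output of gcgp can be piped into a script and passed in as `sys.stdin`.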
Parsing the analysed corpus sme took ~30s, using ~20GB RAM.
Parsing the same corpus in parallel took ~6s, using ~20GB RAM.