giellatekno/giellacgparser

giellacgparser

A library for parsing the giella-cg format produced by hfst-tokenize.
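To give a rough idea of the parsing task, here is a minimal Python sketch. It is not this library's implementation or API. It assumes the usual constraint-grammar stream layout, where a cohort starts with a line like `"<wordform>"` and each reading is an indented line of the form `"lemma" TAG TAG ...`. The function name `parse_cohorts` and the shape of the returned dictionaries are hypothetical.

```python
import re

# Assumed layout (not taken from this README): a cohort starts with a line
# like "<wordform>", and each reading is an indented line "lemma" TAG TAG ...
WORDFORM_RE = re.compile(r'^"<(.*)>"\s*$')
READING_RE = re.compile(r'^\s+"([^"]*)"\s*(.*)$')


def parse_cohorts(text):
    """Return a list of {"wordform", "readings"} dicts (hypothetical shape)."""
    cohorts = []
    for line in text.splitlines():
        wordform = WORDFORM_RE.match(line)
        if wordform:
            # A new cohort begins; its readings follow on indented lines.
            cohorts.append({"wordform": wordform.group(1), "readings": []})
            continue
        reading = READING_RE.match(line)
        if reading and cohorts:
            lemma, tags = reading.group(1), reading.group(2).split()
            cohorts[-1]["readings"].append({"lemma": lemma, "tags": tags})
        # Any other line (for example blanks between tokens) is ignored here.
    return cohorts
```

The sketch ignores weights, sub-readings and sentence boundaries, which a real parser has to handle.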

giellatekno/fst-analysis-parser serves the same purpose, but that parser had some issues. This project is a rewrite that tries to work around them, using the older parser as a reference.

gcgp

A binary (program) that reads giella-cg from stdin and writes it to stdout in a converted format.

Usage: gcgp [OPTIONS] <--json|--text>

Options:
  -v, --verbose   Be verbose
  -p, --parallel  Parse in parallel
      --json      Output JSON. Each sentence is output as one object on one line; line breaks separate sentence objects
      --text      Convert to text. Empty lines between sentences
  -h, --help      Print help
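
Since `--json` writes one object per line, the output can be read line by line by any JSON Lines reader. The Python sketch below shows one way to do that. The function name `read_sentences` is hypothetical, and the sketch makes no assumption about which keys the sentence objects contain.

```python
import json


def read_sentences(stream):
    """Yield one parsed JSON object per non-empty line of `stream`."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines rather than failing on them
            yield json.loads(line)
```

When the output of `gcgp --json` is piped into a script, `read_sentences(sys.stdin)` would yield the sentences one at a time, so the whole output never has to be held in memory at once.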

current progress

Parsed the analysed sme corpus in ~30 s, using ~20 GB of RAM.

Parsed the same corpus in parallel in ~6 s, using ~20 GB of RAM.
