
GNOME Catalan #15

@jorgtied


I think there might be something wrong with one file from the GNOME corpus. These links are from "legacy" OPUS, but I think the problem might be the same when obtaining the file with a more current method.

The file is the Catalan (ca) monolingual plain text file from the GNOME corpus:

https://opus.nlpl.eu/legacy/download.php?f=GNOME/v1/mono/ca.txt.gz

According to the stats on the website, these are the expected stats for the file:

    language  files  tokens  sentences
    ca        2,071  6.4M    0.9M

However, the downloaded file "ca.txt.gz" has far fewer tokens and sentences:

    zcat GNOME_v1_mono_ca.txt.gz | wc
    1422 13808 87751

In contrast, the corresponding ca.tok.gz (from https://opus.nlpl.eu/legacy/download.php?f=GNOME/v1/mono/ca.tok.gz) is a much larger file which actually has the expected number of lines:

    zcat GNOME_v1_mono_ca.tok.gz | wc
    668727 6386997 33416861

Could you check whether the ca.txt.gz is wrong?
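For anyone who wants to repeat the check without `zcat`, here is a minimal Python sketch of the same count. The helper name `gz_wc` is hypothetical and not part of any OPUS tooling, and it assumes the two files have already been downloaded under the local names used above. Words are split on ASCII whitespace, so the token figure may differ slightly from `wc` in some locales.

```python
import gzip


def gz_wc(path):
    """Return (lines, words, bytes) for a gzip file, roughly like `zcat path | wc`.

    Hypothetical helper for illustration only.
    """
    lines = words = size = 0
    with gzip.open(path, "rb") as fh:
        for raw in fh:
            lines += raw.count(b"\n")   # wc counts newline characters
            words += len(raw.split())   # split on ASCII whitespace
            size += len(raw)            # uncompressed byte count
    return lines, words, size


# Example use (assumes both files are in the current directory):
# for name in ("GNOME_v1_mono_ca.txt.gz", "GNOME_v1_mono_ca.tok.gz"):
#     print(name, *gz_wc(name))
```

If the counts for ca.txt.gz come out far below the 6.4M tokens listed on the website while ca.tok.gz is close to them, that would reproduce the discrepancy described in this report.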
