Understand relationship between perplexity and other metrics #8

@gonzalobenegas

Description

In NLP, held-out perplexity has been found to correlate strongly with downstream task performance, and it has been the main quantity studied in scaling laws. In genomics, several issues might distort the relationship between perplexity and downstream task performance:

  • Eukaryotic genomes contain a large portion of repetitive elements (~50% in humans, ~85% in maize) which are highly predictable but often not under functional constraint
  • Beyond repetitive elements, genomes contain a large amount of non-repetitive content which is not under functional constraint (e.g. pseudogenes)
  • Different genomes are not sampled IID; they are highly correlated through phylogenetic structure, so it is not possible to obtain entirely clean train and test splits.
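
One way to probe the first two points is to report perplexity separately for repetitive and non-repetitive positions rather than as a single genome-wide number. The sketch below is illustrative only: it assumes the sequence is soft-masked (lowercase marks annotated repeats, a common FASTA convention) and that per-position natural-log probabilities of the true base are already available from some model. The function name `stratified_perplexity` and its inputs are hypothetical, not part of any existing codebase.

```python
import math


def stratified_perplexity(sequence, log_probs):
    """Perplexity over repeat, non-repeat, and all positions.

    sequence:  soft-masked DNA string; lowercase is ASSUMED to mark repeats.
    log_probs: natural-log probability assigned to the true base at each
               position (ASSUMED to come from the model being evaluated).
    """
    if len(sequence) != len(log_probs):
        raise ValueError("sequence and log_probs must have the same length")

    sums = {"repeat": 0.0, "non_repeat": 0.0, "all": 0.0}
    counts = {"repeat": 0, "non_repeat": 0, "all": 0}

    for base, lp in zip(sequence, log_probs):
        group = "repeat" if base.islower() else "non_repeat"
        for key in (group, "all"):
            sums[key] += lp
            counts[key] += 1

    # Perplexity = exp(mean negative log-likelihood); skip empty groups.
    return {k: math.exp(-sums[k] / counts[k]) for k in sums if counts[k] > 0}
```

If the genome-wide value is dominated by the repeat group, a gap between the `"all"` and `"non_repeat"` entries would be one sign that overall perplexity is tracking repeat predictability rather than functional sequence.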

It is also worth understanding the relationship between generally available metrics (e.g. correlation with allele frequency) and rarer but important evals such as prediction of known causal variants.
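
A simple starting point for comparing metrics is a rank correlation across a set of models or checkpoints: one value per checkpoint for each metric, then Spearman's rho between any two metrics. The sketch below is a minimal standard-library version under that assumption; the function names are hypothetical, and a library routine such as `scipy.stats.spearmanr` could be used instead.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length inputs with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

Because lower perplexity is better, a useful proxy would show a strongly negative rho against a higher-is-better downstream metric. With only a handful of checkpoints the estimate is noisy, so it is a rough screen rather than a conclusion.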

Hypothesis or Goal

TODO

Links

Results
