Understand relationship between perplexity and other metrics

## Description

In NLP, held-out perplexity has been found to correlate strongly with downstream task performance, and it is the main quantity studied in scaling laws. In genomics, several issues might distort the relationship between perplexity and downstream task performance:
- Eukaryotic genomes contain a large proportion of repetitive elements (~50% in humans, ~85% in maize), which are highly predictable but often not under functional constraint.
- Beyond repetitive elements, genomes contain a large amount of non-repetitive sequence that is not under functional constraint (e.g. pseudogenes).
- Different genomes are not sampled IID; they are correlated according to phylogenetic structure, so fully clean train and test splits are not attainable.
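One way to probe the first two points is to report perplexity separately per annotation category rather than as a single genome-wide number. The sketch below assumes per-token natural-log probabilities from a model and a parallel list of category labels are already available; the function names and the labels `"repeat"` and `"other"` are hypothetical and only for illustration.

```python
import math
from collections import defaultdict


def perplexity(log_probs):
    """Perplexity = exp(mean negative log-likelihood), natural log assumed."""
    if not log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(log_probs) / len(log_probs))


def perplexity_by_category(log_probs, categories):
    """Group per-token log-probs by annotation label and score each group.

    `categories` is a hypothetical per-token annotation (e.g. repeat vs. other);
    how it is produced is outside the scope of this sketch.
    """
    if len(log_probs) != len(categories):
        raise ValueError("log_probs and categories must have the same length")
    grouped = defaultdict(list)
    for lp, cat in zip(log_probs, categories):
        grouped[cat].append(lp)
    return {cat: perplexity(lps) for cat, lps in grouped.items()}


# Toy illustration (placeholder values, not measurements): highly predictable
# repeat tokens pull the pooled perplexity well below the non-repeat perplexity.
toy_log_probs = [math.log(0.9)] * 4 + [math.log(0.25)] * 4
toy_categories = ["repeat"] * 4 + ["other"] * 4
per_category = perplexity_by_category(toy_log_probs, toy_categories)
pooled = perplexity(toy_log_probs)
```

If the stratified numbers diverge from the pooled number across checkpoints, that would suggest the pooled perplexity is being driven by sequence that carries little functional signal.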

It is also worth understanding the relationship between generally available metrics (e.g. correlation with allele frequency) and rarer but important evals, such as prediction of known causal variants.
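A simple starting point for comparing metrics is a rank correlation across model checkpoints. The sketch below is a self-contained Spearman correlation (Pearson correlation on average ranks); the metric lists are placeholder values for illustration only, not results, and the variable names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values, one entry per checkpoint (NOT real measurements).
held_out_perplexity = [3.9, 3.6, 3.4, 3.3]
allele_freq_metric = [0.10, 0.14, 0.13, 0.18]
rho = spearman(held_out_perplexity, allele_freq_metric)
```

The same function can be applied to any pair of per-checkpoint metrics, for example a generally available metric against a causal-variant eval. With only a handful of checkpoints the estimate is noisy, so it is best treated as a rough indicator.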

## Hypothesis or Goal

TODO

### Links

## Results


