Open
Description
In NLP, held-out perplexity has been found to correlate strongly with downstream task performance, and it has been the main quantity studied in scaling laws. In genomics, several issues might distort the relationship between perplexity and downstream task performance:
- Eukaryotic genomes contain a large proportion of repetitive elements (~50% in humans, ~85% in maize), which are highly predictable but often not under functional constraint.
- Beyond repetitive elements, genomes contain a large amount of non-repetitive content that is not under functional constraint (e.g. pseudogenes).
- Genomes are not sampled IID; they are correlated according to phylogenetic structure, so fully clean train and test splits are not attainable.
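One way to probe the first point is to report held-out perplexity separately for repetitive and non-repetitive positions. The sketch below is only an illustration under stated assumptions: the sequence is soft-masked (lowercase bases mark repeats, as in common RepeatMasker-style output), and `log_probs` holds the model's natural-log probability of the true base at each position. The function names `perplexity` and `stratified_perplexity` are hypothetical, not part of any existing codebase.

```python
import math


def perplexity(log_probs):
    """Perplexity = exp(mean negative log-likelihood); NaN if no positions."""
    if not log_probs:
        return float("nan")
    return math.exp(-sum(log_probs) / len(log_probs))


def stratified_perplexity(seq, log_probs):
    """Split per-base log-probs by soft-mask status and report each perplexity.

    seq       : soft-masked sequence; lowercase = repeat (assumed convention)
    log_probs : natural-log probability of the true base at each position
    """
    if len(seq) != len(log_probs):
        raise ValueError("seq and log_probs must have the same length")
    repeat = [lp for base, lp in zip(seq, log_probs) if base.islower()]
    nonrepeat = [lp for base, lp in zip(seq, log_probs) if not base.islower()]
    return {
        "all": perplexity(list(log_probs)),
        "repeat": perplexity(repeat),
        "nonrepeat": perplexity(nonrepeat),
    }


# Toy example: the two repeat bases are predicted confidently,
# the two non-repeat bases are predicted at chance (p = 0.25).
toy = stratified_perplexity(
    "ACgt", [math.log(0.25), math.log(0.25), math.log(0.9), math.log(0.9)]
)
```

In this toy case the aggregate perplexity sits well below the non-repeat value, which is the kind of gap that could mask poor performance on constrained sequence.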
It is also worth understanding the relationship between generally available metrics (e.g. correlation with allele frequency) and rarer but important evals such as prediction of known causal variants.
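As a rough sketch of the allele-frequency metric, one could compute a Spearman rank correlation between a per-variant model score and the observed allele frequency. Everything here is an assumption for illustration: the score is taken to be something like a log-likelihood ratio of the alternate versus reference allele, the input values are made up, and the helper names are hypothetical. The rank correlation is implemented directly to keep the example dependency-free.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values: hypothetical variant scores and allele frequencies.
variant_scores = [-3.0, -1.0, -0.5, 0.2]
allele_freqs = [0.001, 0.01, 0.2, 0.4]
rho = spearman(variant_scores, allele_freqs)
```

The same function could be applied to a small set of known causal variants to check whether the two evals rank models consistently.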
Hypothesis or Goal
TODO
Links
Results