Closed
Description
Instead of using 5' UTR and 3' UTR annotations, use a distance-based heuristic to try to find these regions, similar to SpeciesLM:
> We trained distinct models for the 1000 nucleotides 5′ of the start codon (5′ region) and the 300 nucleotides 3′ of the stop codon (3′ region) of all annotated coding sequences (Fig. 1D). The 5′ region typically contains the 5′ UTR and the promoter of the gene [29]. The 3′ region typically contains the 3′ UTR [30].
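A minimal sketch of the distance-based heuristic, assuming each CDS is given as a 0-based, half-open genomic span running from the first base of the start codon to the last base of the stop codon. The function name `flanking_windows`, the `Interval` type, and the default window sizes (512 bp upstream, 256 bp downstream, taken from the experiments in this issue) are illustrative and not taken from the training code. It is strand-aware and clamps windows to the chromosome bounds.

```python
from typing import NamedTuple, Optional, Tuple


class Interval(NamedTuple):
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def flanking_windows(
    cds_start: int,
    cds_end: int,
    strand: str,
    upstream: int = 512,     # hypothetical default: bp 5' of the start codon
    downstream: int = 256,   # hypothetical default: bp 3' of the stop codon
    chrom_len: Optional[int] = None,
) -> Tuple[Interval, Interval]:
    """Return (upstream_window, downstream_window) in genomic coordinates.

    "Upstream" and "downstream" are relative to the direction of
    transcription, so the two windows swap sides on the minus strand.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")

    def clamp(start: int, end: int) -> Interval:
        # Keep the window inside [0, chrom_len) when the length is known.
        start = max(start, 0)
        if chrom_len is not None:
            end = min(end, chrom_len)
        return Interval(start, max(start, end))

    if strand == "+":
        up = clamp(cds_start - upstream, cds_start)
        down = clamp(cds_end, cds_end + downstream)
    else:
        # On the minus strand the start codon sits at the high-coordinate end.
        up = clamp(cds_end, cds_end + upstream)
        down = clamp(cds_start - downstream, cds_start)
    return up, down


# Example: a plus-strand CDS spanning [1000, 2000)
up, down = flanking_windows(1000, 2000, "+")
print(up, down)
```

This sketch deliberately ignores annotated UTR boundaries and introns within the flanks, which is the point of the heuristic: the windows are defined purely by distance from the CDS.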
Some statistics in the human genome:
Hypothesis or Goal
The closer a region is to the CDS, the more conserved it tends to be, so training on these regions could improve the performance of gLMs. 3' UTRs in particular are where we really need to improve (see data curation #43).
Links
Training code
Analysis code
Wandb: 1, 2, 3, 4
Results
VEP task: TraitGym Mendelian v2 (same as #43)
Downstream of CDS
- The best model is the one trained on 256bp downstream of CDS. This window does not cover all variants (see Causal variant statistics for 3' UTR #49) but tends to be more conserved than 3' UTRs overall, suggesting that a smaller, higher-quality dataset might be better than a larger, lower-quality one.
Upstream of CDS
- 512bp upstream of CDS does not outperform 256bp around TSS (our "promoter" recipe) on either promoter or 5' UTR variants. Not interested in exploring this further since the promoter and 5' UTR are already working well.
