Experiment: alternative datasets based on distance from CDS #53

@gonzalobenegas

Description

Instead of using 5' UTR and 3' UTR annotations, use a distance-based heuristic to try to find these regions, similar to SpeciesLM:

We trained distinct models for the 1000 nucleotides 5′ of the start codon (5′ region) and the 300 nucleotides 3′ of the stop codon (3′ region) of all annotated coding sequences (Fig. 1D). The 5′ region typically contains the 5′ UTR and the promoter of the gene [29]. The 3′ region typically contains the 3′ UTR [30].
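
A minimal sketch of the distance-based heuristic, assuming CDS coordinates are given as 0-based half-open intervals spanning the start codon through the stop codon. The function name `flank_windows` and its parameters are hypothetical and are not taken from the linked training code. The defaults (1000 and 300) follow the SpeciesLM quote above; other window sizes can be passed in.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end)


def flank_windows(
    cds_start: int,
    cds_end: int,
    strand: str,
    upstream: int = 1000,
    downstream: int = 300,
    chrom_size: Optional[int] = None,
) -> Tuple[Interval, Interval]:
    """Return (five_prime_window, three_prime_window) in genomic coordinates.

    Hypothetical helper: cds_start/cds_end are assumed to be 0-based,
    half-open genomic coordinates of the full CDS on the given strand.
    """

    def clip(pos: int) -> int:
        # Keep positions inside the chromosome (upper bound only if known).
        pos = max(0, pos)
        return pos if chrom_size is None else min(pos, chrom_size)

    if strand == "+":
        # The start codon is at the low-coordinate end of the CDS.
        five_prime = (clip(cds_start - upstream), cds_start)
        three_prime = (cds_end, clip(cds_end + downstream))
    elif strand == "-":
        # On the minus strand, "upstream" means higher genomic coordinates.
        five_prime = (cds_end, clip(cds_end + upstream))
        three_prime = (clip(cds_start - downstream), cds_start)
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return five_prime, three_prime


five, three = flank_windows(5000, 6000, "-", upstream=512, downstream=256)
print(five, three)
```

The windows are returned in genomic coordinates, so sequences extracted for minus-strand genes would still need to be reverse-complemented. Windows from neighboring genes may also overlap each other or adjacent coding sequence; this sketch does not handle that.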

Some statistics in the human genome:

Image

Hypothesis or Goal

Regions closer to the CDS tend to be more conserved, so training on them could improve the performance of gLMs. 3' UTRs in particular are where we most need to improve data curation (#43).

Links

Training code
Analysis code
Wandb: 1, 2, 3, 4

Results

VEP task: TraitGym Mendelian v2 (same as #43)

Downstream of CDS

  • The best model is the one trained on 256bp downstream of CDS. This window does not cover all variants (see "Causal variant statistics for 3' UTR", #49) but tends to be more conserved than 3' UTRs overall, suggesting that a smaller, higher-quality dataset might be better than a larger, lower-quality one.
Image

Upstream of CDS

  • 512bp upstream of CDS does not outperform 256bp around TSS (our "promoter" recipe) on either promoter or 5' UTR variants. Not exploring this further, since the promoter and 5' UTR are already working well.
Image
