Experiment: alternative datasets based on distance from CDS

## Description

Instead of using 5' UTR and 3' UTR annotations, use a distance-based heuristic to try to find these regions, similar to [SpeciesLM](https://link.springer.com/article/10.1186/s13059-024-03221-x):

> We trained distinct models for the 1000 nucleotides 5′ of the start codon (5′ region) and the 300 nucleotides 3′ of the stop codon (3′ region) of all annotated coding sequences (Fig. 1D). The 5′ region typically contains the 5′ UTR and the promoter of the gene [29]. The 3′ region typically contains the 3′ UTR [30].
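
The distance-based heuristic can be sketched as a strand-aware window computation. This is only an illustration, not the code used in the linked training script: the function name `cds_flanks`, the `Interval` type, and the 0-based half-open coordinate convention are assumptions, and the default window sizes (512 and 256) are taken from the run names in this issue.

```python
from typing import NamedTuple, Optional, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    strand: str


def cds_flanks(
    chrom: str,
    cds_start: int,
    cds_end: int,
    strand: str,
    upstream: int = 512,
    downstream: int = 256,
    chrom_size: Optional[int] = None,
) -> Tuple[Interval, Interval]:
    """Return (upstream, downstream) windows flanking a CDS.

    cds_start/cds_end span the whole CDS of a transcript in genomic
    coordinates (hypothetical input; 0-based, half-open).
    "Upstream" means 5' of the start codon in transcript orientation,
    so the two windows swap sides on the minus strand.
    """
    if strand not in {"+", "-"}:
        raise ValueError(f"unexpected strand: {strand!r}")

    def clip(start: int, end: int) -> Interval:
        # Keep the window inside the chromosome.
        start = max(start, 0)
        if chrom_size is not None:
            end = min(end, chrom_size)
        return Interval(chrom, start, max(end, start), strand)

    def left(n: int) -> Interval:
        # Window at lower genomic coordinates than the CDS.
        return clip(cds_start - n, cds_start)

    def right(n: int) -> Interval:
        # Window at higher genomic coordinates than the CDS.
        return clip(cds_end, cds_end + n)

    if strand == "+":
        return left(upstream), right(downstream)
    return right(upstream), left(downstream)


plus_up, plus_down = cds_flanks("chr1", 1000, 2000, "+")
minus_up, minus_down = cds_flanks("chr1", 1000, 2000, "-")
```

Note that these windows are purely genomic: they ignore introns within the UTRs and may overlap neighboring genes, which is part of what makes this a heuristic rather than an annotation.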

Some statistics in the human genome:

<img width="1141" height="563" alt="Image" src="https://github.com/user-attachments/assets/45190637-ec7b-4166-8fdb-921bae18e652" />

## Hypothesis or Goal

The closer a region is to the CDS, the more conserved it tends to be, so training on CDS-proximal sequence could improve the performance of gLMs. 3' UTRs in particular are where we most need to improve data curation (#43).

## Links

[Training code](https://github.com/marin-community/marin/blob/73dbb49e319927869a9b16684545b62fde9b6906/experiments/dna/exp53_distance_from_cds.py)
[Analysis code](https://github.com/Open-Athena/bolinas-dna/tree/main/snakemake/analysis/evals_v1)
Wandb: [1](https://wandb.ai/gonzalobenegas/marin/runs/exp53-three_prime_utr_baseline-r01-5d901d?nw=nwusergonzalobenegas), [2](https://wandb.ai/gonzalobenegas/marin/runs/exp53-downstream_of_cds_512-r01-6f73b5?nw=nwusergonzalobenegas), [3](https://wandb.ai/gonzalobenegas/marin/runs/exp53-downstream_of_cds_256-r01-34c9cc?nw=nwusergonzalobenegas), [4](https://wandb.ai/gonzalobenegas/marin/runs/exp53-upstream_of_cds_512-r01-104557?nw=nwusergonzalobenegas)

## Results

VEP task: TraitGym Mendelian v2 (same as https://github.com/Open-Athena/bolinas-dna/issues/43)

### Downstream of CDS

- The best model is the one trained on 256bp downstream of CDS. This window does not cover all variants (#49) but tends to be more conserved than 3' UTRs overall, suggesting that a smaller, higher-quality dataset might be better than a larger, lower-quality one.
<img width="626" height="386" alt="Image" src="https://github.com/user-attachments/assets/c7b88c80-565d-4e4a-9adf-0440f16a3b2f" />

### Upstream of CDS

- 512bp upstream of CDS does not outperform 256bp around TSS (our "promoter" recipe) on either promoter or 5' UTR variants. We are not exploring this further, since the promoter and 5' UTR are already working well.
<img width="910" height="386" alt="Image" src="https://github.com/user-attachments/assets/3162ccb4-8a99-4f43-b146-67475f5d1662" />
