Experiment: promoters YOLO run #21

@gonzalobenegas

Description

Train a modestly sized model (~1B parameters) with reasonable defaults on an animal promoters dataset.

Hypothesis or Goal

See how far the model can get on downstream tasks, and identify its major strengths and weaknesses to guide future experiments.

Links

Training code
Analysis code
Wandb

Results

Variant effect prediction

Setup:

  • Classification of causal variants for human Mendelian and complex traits (TraitGym).
    • (I'm also working on larger versions of these datasets, as part of the TraitGym paper revision).
  • Only odd chromosomes (validation set).
  • Two types of variants that roughly overlap the promoter regions the model was trained on: promoter and 5' UTR variants.
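A common zero-shot scoring scheme for benchmarks like this is the log-likelihood ratio between the alternate and reference alleles under the model. The sketch below illustrates the idea with a toy stand-in for the model's per-position nucleotide distribution; `model_probs`, the sequences, and the probabilities are all illustrative assumptions, not the experiment's actual pipeline.

```python
import math

def model_probs(sequence: str, pos: int) -> dict:
    """Toy stand-in for a promoter language model's nucleotide
    distribution at `pos`. A real model would condition on the
    surrounding sequence context; here we just slightly favor the
    nucleotide already present, as a conserved site would be."""
    probs = {nt: 0.1 for nt in "ACGT"}
    probs[sequence[pos]] = 0.7
    return probs

def llr_score(sequence: str, pos: int, ref: str, alt: str) -> float:
    """Zero-shot variant score: log P(alt | context) - log P(ref | context).
    More negative scores suggest the model disfavors the alt allele."""
    assert sequence[pos] == ref, "reference allele must match the sequence"
    probs = model_probs(sequence, pos)
    return math.log(probs[alt]) - math.log(probs[ref])

score = llr_score("ACGTACGT", 3, "T", "C")
print(round(score, 3))  # -1.946
```

Causal variants are then expected to concentrate among the most negative scores, so the LLR can be fed directly into a classification metric such as AUPRC.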

Results:

  • Typically matches the performance of Evo 2 (40B).
    • An exception is complex/promoter; understanding these variants may require modeling other regions of the genome.
  • Still generally behind GPN-Star, most clearly in complex traits.
  • Performance seems to plateau while pre-training loss continues to go down.

Follow up ideas

  • Intuitively, we should be able to obtain better performance by specializing the model (trained on all animals) to a closer evolutionary neighborhood of human (e.g., primates or mammals). We know this is particularly important for complex-trait variants. It would be interesting to try different approaches, such as fine-tuning only on primates vs. fine-tuning on all animals with an increased proportion of primates.
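The "increased proportion of primates" variant of this idea can be sketched as sampling fine-tuning examples from clade-specific pools with explicit mixture weights rather than proportionally to pool size. Pool names, weights, and sequences below are illustrative assumptions, not the experiment's actual configuration.

```python
import random

def make_sampler(pools: dict, weights: dict, seed: int = 0):
    """Return a sampler that first picks a clade according to `weights`,
    then picks an example uniformly from that clade's pool."""
    rng = random.Random(seed)
    clades = list(pools)
    w = [weights[c] for c in clades]

    def sample():
        clade = rng.choices(clades, weights=w, k=1)[0]
        return clade, rng.choice(pools[clade])

    return sample

# Illustrative pools; a real run would stream sequences from disk.
pools = {
    "primates": ["seq_p1", "seq_p2"],
    "other_mammals": ["seq_m1", "seq_m2"],
    "other_animals": ["seq_a1", "seq_a2"],
}

# Spike primates to 60% of the mixture, regardless of pool sizes.
sampler = make_sampler(pools, {"primates": 0.6,
                               "other_mammals": 0.3,
                               "other_animals": 0.1})
draws = [sampler()[0] for _ in range(1000)]
print(draws.count("primates") / len(draws))  # ≈ 0.6
```

The same effect can be achieved in a standard training loop with a weighted sampler over per-example clade labels; the key design choice is decoupling the mixture proportions from the raw data proportions.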

  • Beyond zero-shot tasks, could we find or develop a good transfer-learning benchmark for promoter models? The tasks in the PromoterAI paper look interesting.
