Description
Train a model of a modest size (~1B) with reasonable defaults on an animal promoters dataset.
Hypothesis or Goal
See how far you can get in downstream tasks and understand major strengths and weaknesses to guide future experiments.
Links
Training code
Analysis code
Wandb
Results
Variant effect prediction
Setup:
- Classification of causal variants for human Mendelian and complex traits (TraitGym).
- (I'm also working on larger versions of these datasets, as part of the TraitGym paper revision).
- Only odd chromosomes (validation set).
- Two types of variants roughly overlapping promoter regions the model was trained on: promoters and 5' UTRs.
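A common way to score variants zero-shot with a masked language model is a log-likelihood ratio between the alternate and reference alleles at the variant position. The issue does not specify the exact scoring rule used here, so the following is only a minimal sketch; the `llr_score` helper and the probability inputs are illustrative, not taken from the training or analysis code:

```python
import math

# Hypothetical zero-shot variant effect score: log-likelihood ratio of the
# alternate vs. reference allele under the model's predicted distribution
# at the (masked) variant position. More negative = reference preferred,
# a common proxy for deleteriousness.
NUCLEOTIDES = "ACGT"

def llr_score(probs, ref, alt):
    """probs: model probabilities over A, C, G, T at the variant position."""
    p = dict(zip(NUCLEOTIDES, probs))
    return math.log(p[alt]) - math.log(p[ref])

# Example: the model strongly prefers the reference allele A over the alt C,
# so the score is negative.
score = llr_score([0.7, 0.1, 0.1, 0.1], ref="A", alt="C")
```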
Results:
- Typically matches the performance of Evo 2 (40B).
- An exception is complex/promoter; understanding these variants perhaps benefits from modeling other regions of the genome beyond promoters.
- Still generally behind GPN-Star, most clearly in complex traits.
- Performance seems to plateau while pre-training loss continues to go down.
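TraitGym evaluates classification with AUPRC; a minimal pure-Python sketch of the evaluation loop described above, restricting to odd chromosomes and computing average precision. The helper names and the `chr1`-style chromosome naming are assumptions, not from the analysis code:

```python
def is_odd_chromosome(chrom):
    # Assumed naming convention "chr1" ... "chr22"; odd autosomes
    # form the validation set as described above.
    num = chrom.removeprefix("chr")
    return num.isdigit() and int(num) % 2 == 1

def average_precision(labels, scores):
    """AUPRC as average precision over positives, ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
    tp, ap = 0, 0.0
    n_pos = sum(labels)
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            tp += 1
            ap += tp / rank
    return ap / n_pos

# Keep only odd-chromosome variants, then score (toy data).
variants = [("chr1", 1, 0.9), ("chr2", 0, 0.8), ("chr3", 1, 0.7), ("chr5", 0, 0.1)]
kept = [(y, s) for c, y, s in variants if is_odd_chromosome(c)]
auprc = average_precision([y for y, _ in kept], [s for _, s in kept])
```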
Follow-up ideas
- Intuitively, we should be able to obtain better performance by specializing the model (trained on all animals) to a closer evolutionary neighborhood of human (e.g. primates or mammals). We know this is particularly important for complex-trait variants. It would be interesting to try different approaches, such as finetuning just on primates vs. finetuning on all animals while increasing the proportion of primates.
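The idea of upweighting primates during finetuning could be sketched as a clade-weighted sampler: each training example is drawn from a clade with probability proportional to its weight. The clade names and weights below are illustrative assumptions, not a proposed configuration:

```python
import random

# Hypothetical clade-weighted sampler for finetuning: upweight primates while
# keeping other animals in the mix. Weights are illustrative only.
CLADE_WEIGHTS = {"primates": 0.5, "mammals_non_primate": 0.3, "other_animals": 0.2}

def sample_clade(weights, rng=random):
    """Draw one clade name with probability proportional to its weight."""
    clades = list(weights)
    return rng.choices(clades, weights=[weights[c] for c in clades], k=1)[0]

# Each finetuning example would then be sampled from the chosen clade's genomes.
rng = random.Random(0)
draws = [sample_clade(CLADE_WEIGHTS, rng) for _ in range(100)]
```

Sweeping the primate weight between these two extremes (1.0 = primates only, vs. the pre-training mixture) would cover both finetuning strategies mentioned above.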
- Beyond zero-shot tasks, could we find or develop a good transfer-learning benchmark for promoter models? The tasks in the PromoterAI paper look interesting.