Open-Athena/bolinas-dna

Bolinas

Open development of genomic language models — data, modeling, and evaluation.

Experiments

Development is driven by experiments tracked as GitHub issues.

Baseline Runs

| Experiment | Status |
|---|---|
| #21 Promoters YOLO run | Closed - matches Evo 2 on promoter VEP but still behind GPN-Star |
| #22 mRNA + promoters YOLO run | Closed - combined model consumed by coding regions; poor on promoter variants |
| #27 CDS YOLO run | Closed - matches Evo 2 on missense variants but falls behind GPN-Star |
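
The baseline results above are reported on variant effect prediction (VEP). This README does not say how variants are scored, so the following is only a minimal sketch of one common zero-shot approach: the log-likelihood ratio between the alternate and reference base at the variant position. The function name `llr_score` and the probabilities are hypothetical and are not taken from this repository.

```python
import math


def llr_score(base_probs: dict[str, float], ref: str, alt: str) -> float:
    """Log-likelihood ratio log p(alt) - log p(ref) at a single position.

    Negative values mean the model finds the alternate base less likely
    than the reference base.
    """
    return math.log(base_probs[alt]) - math.log(base_probs[ref])


# Hypothetical model output: predicted probability of each base at the
# variant position (made-up numbers, for illustration only).
position_probs = {"A": 0.70, "C": 0.10, "G": 0.15, "T": 0.05}

score = llr_score(position_probs, ref="A", alt="T")
print(f"LLR(A>T) = {score:.3f}")
```

For a causal model, the analogous score would compare the log-likelihood of the whole alternate sequence with that of the reference sequence rather than a single masked position.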

Data

Genomic Regions

| Experiment | Status |
|---|---|
| #41 Promoters from mRNA vs. ncRNA | Closed - adding ncRNA promoters shows no significant difference in VEP performance |
| #13 Mixing different genomic regions | Closed - balanced mixing gives balanced performance; proportional mixing dominated by CDS |
| #53 Alternative datasets based on distance from CDS | Closed - distance-based heuristic (a la SpeciesLM) instead of UTR annotations |
| #9 Repeat downweighting | Closed - downweighting repetitive elements improves VEP and stabilizes training |
| #42 Promoter radius | Closed - smaller radius performs better; expanding to ±2kb degrades performance |
| #43 Mixing 5 different regions | Closed - CDS, promoters, and 5' UTR learn well; 3' UTR and ncRNA show limited improvement |
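
To make the balanced versus proportional distinction in #13 concrete, here is a minimal sketch of how sampling weights over regions could be derived under each strategy. The helper `mixing_weights` and the region sizes are hypothetical; the actual mixing code and dataset sizes are not shown in this README.

```python
def mixing_weights(region_sizes: dict[str, int], strategy: str) -> dict[str, float]:
    """Return per-region sampling probabilities that sum to 1."""
    if strategy == "balanced":
        # Every region is sampled equally often, regardless of its size.
        return {name: 1 / len(region_sizes) for name in region_sizes}
    if strategy == "proportional":
        # Each region is sampled in proportion to how much data it has.
        total = sum(region_sizes.values())
        return {name: size / total for name, size in region_sizes.items()}
    raise ValueError(f"unknown strategy: {strategy!r}")


# Hypothetical region sizes in base pairs (made-up numbers).
sizes = {"cds": 8_000_000, "promoters": 1_500_000, "5_prime_utr": 500_000}

balanced = mixing_weights(sizes, "balanced")
proportional = mixing_weights(sizes, "proportional")
print(balanced)
print(proportional)
```

With sizes like these, proportional mixing draws most batches from the largest region, which is consistent with the "dominated by CDS" observation, while balanced mixing gives each region the same share.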

Evolutionary Timescales

| Experiment | Status |
|---|---|
| #55 Promoters from different evolutionary timescales | Closed - mammals-trained model reaches good VEP performance fastest |
| #58 CDS from different evolutionary timescales | Closed - longer timescales (animals) perform better for missense variants |
| #59 Downstream regions from different evolutionary timescales | Closed - mammals trains fastest but all timescales converge with sufficient training |

Modeling

Training Objectives

| Experiment | Status |
|---|---|
| #3 Different training objectives | Closed - CLM appears to do better than MLM at initial steps |

Architecture

| Experiment | Status |
|---|---|
| #37 Context size | Closed - 256bp and 512bp contexts perform similarly on VEP |
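
As an illustration of what a 256bp or 512bp context means when scoring a variant, the sketch below extracts a fixed-size window around a position and clamps it at the sequence ends. The helper `centered_window` is hypothetical and is not the repository's implementation.

```python
def centered_window(seq: str, pos: int, context_size: int) -> tuple[str, int]:
    """Return a window of up to `context_size` bases around `pos`.

    Also returns the offset of `pos` inside the window. Near the sequence
    ends the window is shifted inward so it keeps its full length whenever
    the sequence is long enough.
    """
    half = context_size // 2
    start = max(0, pos - half)
    end = min(len(seq), start + context_size)
    start = max(0, end - context_size)  # shift left if we hit the right end
    return seq[start:end], pos - start


# Toy 1,200bp sequence with a variant position in the middle.
sequence = "ACGT" * 300
window_256, offset_256 = centered_window(sequence, pos=600, context_size=256)
window_512, offset_512 = centered_window(sequence, pos=600, context_size=512)
print(len(window_256), offset_256, len(window_512), offset_512)
```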

Scaling

| Experiment | Status |
|---|---|
| #57 Scaling on a mixture dataset | Open |

Evaluation

| Experiment | Status |
|---|---|
| #8 Understand relationship between perplexity and other metrics | Open |
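
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities; the input values are illustrative and do not come from any Bolinas model.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(-mean(log p(token))) using natural logarithms."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every nucleotide (uniform guess).
uniform_log_probs = [math.log(0.25)] * 100
ppl = perplexity(uniform_log_probs)
print(f"perplexity = {ppl:.2f}")
```

Assuming single-nucleotide tokens, a uniform guess over A, C, G, and T gives a perplexity of 4, so that value is a natural reference point when reading perplexity curves.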

Installation

uv sync

Development

# Install dev dependencies and pre-commit hooks
uv sync --group dev
uv run pre-commit install

# Run quality checks (ruff format/lint, snakefmt)
uv run pre-commit run

# Run tests
uv run pytest

Project Structure

  • src/bolinas/ - Main Python package
    • data/ - Genomic data structures (GenomicSet, etc.)
    • evals/ - Evaluation utilities (inference, metrics, plotting)
  • snakemake/ - Snakemake workflows
    • training_dataset/ - Creates genomic training datasets from NCBI RefSeq genomes
    • evals/ - Downloads and processes evaluation datasets
    • analysis/evals_v1/ - Evaluates trained models on variant effect prediction tasks
  • tests/ - Test suite
