Open development of genomic language models — data, modeling, and evaluation.
Development is driven by experiments tracked as GitHub issues.
### YOLO runs

| Experiment | Status |
|---|---|
| #21 Promoters YOLO run | Closed - matches Evo 2 on promoter VEP but still behind GPN-Star |
| #22 mRNA + promoters YOLO run | Closed - combined model consumed by coding regions; poor on promoter variants |
| #27 CDS YOLO run | Closed - matches Evo 2 on missense variants but falls behind GPN-Star |
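The VEP comparisons above score a variant by how much less likely the model finds the alternate allele than the reference allele. A minimal sketch of that log-likelihood-ratio scoring (the function name and probabilities are illustrative, not the repo's actual API):

```python
import math

def vep_score(p_ref: float, p_alt: float) -> float:
    """Log-likelihood ratio of alt vs. ref allele at the variant position.

    More negative means the model finds the alternate allele less
    plausible, i.e. the variant is predicted to be more deleterious.
    """
    return math.log(p_alt) - math.log(p_ref)

# A benign variant: the model assigns similar probability to both alleles.
benign = vep_score(p_ref=0.40, p_alt=0.35)

# A deleterious variant: the alternate allele is far less likely.
deleterious = vep_score(p_ref=0.60, p_alt=0.01)

assert deleterious < benign < 0
```

Benchmarks such as the promoter and missense comparisons typically rank variants by this score and report a correlation or AUC against curated labels.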
### Training data

| Experiment | Status |
|---|---|
| #41 Promoters from mRNA vs. ncRNA | Closed - adding ncRNA promoters shows no significant difference in VEP performance |
| #13 Mixing different genomic regions | Closed - balanced mixing gives balanced performance; proportional mixing dominated by CDS |
| #53 Alternative datasets based on distance from CDS | Closed - used a distance-based heuristic (à la SpeciesLM) instead of UTR annotations |
| #9 Repeat downweighting | Closed - downweighting repetitive elements improves VEP and stabilizes training |
| #42 Promoter radius | Closed - smaller radius performs better; expanding to ±2kb degrades performance |
| #43 Mixing 5 different regions | Closed - CDS, promoters, and 5' UTR learn well; 3' UTR and ncRNA show limited improvement |
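The balanced-vs-proportional distinction in #13 comes down to how sampling weights are assigned across region datasets. A hypothetical sketch (region names and sizes are made up for illustration, not the actual dataset):

```python
# Illustrative region sizes in base pairs; CDS dominates by volume.
region_sizes = {"CDS": 40_000_000, "promoter": 5_000_000, "5utr": 2_000_000}

def proportional_weights(sizes: dict) -> dict:
    """Sample each region in proportion to its total size."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}

def balanced_weights(sizes: dict) -> dict:
    """Sample every region equally, regardless of size."""
    return {name: 1 / len(sizes) for name in sizes}

w_prop = proportional_weights(region_sizes)  # CDS takes the large majority
w_bal = balanced_weights(region_sizes)       # each region gets an equal share
```

Under proportional weighting the largest region (CDS here) dominates the training mixture, which matches the reported outcome; balanced weighting trades some CDS performance for more even coverage.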
### Evolutionary timescales

| Experiment | Status |
|---|---|
| #55 Promoters from different evolutionary timescales | Closed - mammals-trained model reaches good VEP performance fastest |
| #58 CDS from different evolutionary timescales | Closed - longer timescales (animals) perform better for missense variants |
| #59 Downstream regions from different evolutionary timescales | Closed - mammals trains fastest but all timescales converge with sufficient training |
### Training objectives

| Experiment | Status |
|---|---|
| #3 Different training objectives | Closed - CLM appears to do better than MLM at initial steps |
### Context size

| Experiment | Status |
|---|---|
| #37 Context size | Closed - 256bp and 512bp contexts perform similarly on VEP |
### Scaling

| Experiment | Status |
|---|---|
| #57 Scaling on a mixture dataset | Open |
### Analysis

| Experiment | Status |
|---|---|
| #8 Understand relationship between perplexity and other metrics | Open |
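As background for #8: perplexity is the exponential of the mean per-token cross-entropy, so for a 4-letter DNA alphabet a model with no signal sits at perplexity 4. A small self-contained sketch of the computation:

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that is fully uncertain over {A, C, G, T} assigns p = 0.25
# to every token, giving perplexity exactly 4.
uniform_nll = -math.log(0.25)
assert abs(perplexity([uniform_nll] * 10) - 4.0) < 1e-9
```

Whether lower perplexity translates into better VEP or other downstream metrics is exactly the open question this experiment tracks.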
## Development

```sh
# Install dependencies
uv sync

# Install dev dependencies and pre-commit hooks
uv sync --group dev
uv run pre-commit install

# Run quality checks (ruff format/lint, snakefmt)
uv run pre-commit run

# Run tests
uv run pytest
```

## Repository structure

- `src/bolinas/` - Main Python package
  - `data/` - Genomic data structures (`GenomicSet`, etc.)
  - `evals/` - Evaluation utilities (inference, metrics, plotting)
- `snakemake/` - Snakemake workflows
  - `training_dataset/` - Creates genomic training datasets from NCBI RefSeq genomes
  - `evals/` - Downloads and processes evaluation datasets
  - `analysis/evals_v1/` - Evaluates trained models on variant effect prediction tasks
- `tests/` - Test suite
