STRdust

Tandem repeat genotyper for long reads, returning the repeat length and sequence in VCF format.

Usage

Installation

For most users, the preferred option is to download a ready-to-use binary for your system from the releases and place it in a directory on your $PATH.
You may have to make the file executable with chmod +x STRdust.
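
As a minimal sketch of the step above, the helper below makes a downloaded binary executable and moves it into a directory on your $PATH. The function name install_binary and the directory ~/.local/bin are illustrative assumptions, not part of STRdust.

```shell
# Hypothetical helper: make a downloaded binary executable and place it in a
# directory that is (or will be) on $PATH.
install_binary() {
    src="$1"          # path to the downloaded binary
    dest_dir="$2"     # target directory, e.g. "$HOME/.local/bin"
    mkdir -p "$dest_dir" || return 1
    chmod +x "$src" || return 1
    mv "$src" "$dest_dir/" || return 1
    # Prepend the directory to PATH for the current shell if it is missing.
    case ":$PATH:" in
        *":$dest_dir:"*) ;;
        *) PATH="$dest_dir:$PATH"; export PATH ;;
    esac
}

# Example (assumes ./STRdust was downloaded from the releases):
# install_binary ./STRdust "$HOME/.local/bin" && STRdust --version
```

The PATH change only lasts for the current shell session; to make it permanent, add the directory to your shell's startup file.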

Alternatively, you can build the tool from source using cargo:

git clone https://github.com/wdecoster/STRdust.git
cd STRdust
cargo build --release

Quick start examples

STRdust -r chr7:154654404-154654432 reference.fa sample.cram > sample.vcf
STRdust --pathogenic reference.fa sample.cram | bgzip > sample.vcf.gz
STRdust -R targets.bed --haploid chrX,chrY reference.fa male_sample.cram | bgzip > repeats.vcf.gz

The 'test_data' directory contains a small example dataset to test the tool:

STRdust -r chr7:154654404-154654432 test_data/chr7.fa.gz test_data/small-test-phased.bam > small-test-phased.vcf
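
To run the quick start commands over several samples, a loop along the following lines may help. It is a sketch only: the function name genotype_dir and the output naming are assumptions, while --pathogenic and --sample are the options listed under "All arguments". Keep in mind that --pathogenic currently targets GRCh38 only.

```shell
# Hypothetical helper: genotype every CRAM file in a directory at the STRchive
# pathogenic loci, writing one uncompressed VCF per sample next to the input.
genotype_dir() {
    reference="$1"    # reference FASTA used for alignment
    dir="$2"          # directory containing *.cram files
    for cram in "$dir"/*.cram; do
        [ -e "$cram" ] || continue    # no match: the glob stays literal
        sample=$(basename "$cram" .cram)
        STRdust --pathogenic --sample "$sample" "$reference" "$cram" \
            > "$dir/$sample.vcf" || return 1
    done
}

# Example:
# genotype_dir reference.fa /path/to/crams
```

The loop stops at the first failing sample, so a partially written VCF is easy to spot.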

All arguments

    STRdust [OPTIONS] <FASTA> <BAM>

ARGS:
    <FASTA>    reference genome used for alignment
    <BAM>      bam/cram file to call STRs in (local path or URL)

SPECIFY ONE OF:
    -r, --region <REGION>              region string to genotype expansion in (format: chr:start-end)
    -R, --region-file <REGION_FILE>    Bed file with region(s) to genotype expansion(s) in
        --pathogenic                   Genotype the pathogenic STRs from STRchive

OPTIONS:
    -m, --minlen <MINLEN>              minimal length of insertion/deletion operation [default: 5]
    -s, --support <SUPPORT>            minimal number of supporting reads per haplotype [default: 3]
    -t, --threads <THREADS>            Number of parallel threads to use [default: 1]
        --sample <SAMPLE>              Sample name to use in VCF header, if not provided, the bam file name is used
        --somatic                      Print information on somatic variability
        --unphased                     Reads are not phased, will use hierarchical clustering to phase expansions
        --consensus-reads              Maximum number of reads to use to build the consensus sequence [default: 20]
        --find-outliers                Identify poorly supported outlier expansions (only with --unphased)
        --haploid <HAPLOID>            comma-separated list of haploid (sex) chromosomes
    -h, --help                         Print help information
    -V, --version                      Print version information
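
The region string for -r follows the chr:start-end format shown above. As a rough sketch, a string can be checked before launching a long run. The helper name is_valid_region is hypothetical, and the rule that start must not exceed end is an assumption rather than documented STRdust behaviour.

```shell
# Hypothetical check: succeed if $1 looks like chr:start-end.
is_valid_region() {
    region="$1"
    contig="${region%:*}"     # text before the last ':'
    coords="${region##*:}"    # text after the last ':'
    # Require a ':' separator and a non-empty contig name.
    [ "$contig" != "$region" ] && [ -n "$contig" ] || return 1
    # Require a '-' between the two coordinates.
    case "$coords" in *-*) ;; *) return 1 ;; esac
    start="${coords%%-*}"
    end="${coords#*-}"
    # Both coordinates must be non-empty strings of digits.
    case "$start" in ''|*[!0-9]*) return 1 ;; esac
    case "$end" in ''|*[!0-9]*) return 1 ;; esac
    # Assumption: start must not be larger than end.
    [ "$start" -le "$end" ]
}

# Example:
# is_valid_region chr7:154654404-154654432 && echo "region looks fine"
```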

Notes

Lowering the number of consensus reads may lead to less accurate alternative allele sequences (the reads are selected randomly), but may greatly improve speed. Note that in the case of somatic length variation, a small number of randomly selected reads may introduce bias and not be representative of the true repeat length.
Genotyping known pathogenic repeats with the --pathogenic flag will return a VCF with the pathogenic STRs from STRchive, but currently only for the GRCh38 reference.

Output format

STRdust produces one VCF file per sample. The consensus sequence is in the ALT field, with the sequences from each read in the SEQS INFO field (when running with --somatic). The FRB FORMAT field is the full repeat length of each of the two alleles, in nucleotides. The RB FORMAT field is the difference between each allele's length and the reference length. The SC FORMAT field is a measure of the accuracy of the consensus sequence compared to the overlap graph from the individual reads, which can be influenced by sequencing errors or somatic variation.
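
The relation between RB and FRB can be worked through with a little arithmetic. In the two example records further down, the reference length implied by RB and FRB equals END minus POS. That is an observation from those records, not a documented guarantee, and the helper name relative_length is hypothetical.

```shell
# Allele length relative to the reference, assuming the reference repeat
# length equals END - POS (an observation from the example records only).
relative_length() {
    pos="$1"    # POS column
    end="$2"    # END value from the INFO column
    frb="$3"    # full repeat length of one allele (FRB)
    echo $(( $frb - ($end - $pos) ))
}

# First example record: POS=1435798, END=1435818, FRB=21,16
relative_length 1435798 1435818 21    # prints 1
relative_length 1435798 1435818 16    # prints -4
```

These values match the RB field (1,-4) of that record.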

Example output:

(header cropped for brevity)
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the repeat interval">
##INFO=<ID=STDEV,Number=2,Type=Integer,Description="Standard deviation of the repeat length">
##INFO=<ID=SEQS,Number=1,Type=String,Description="Sequences supporting the two alleles">
##INFO=<ID=OUTLIERS,Number=1,Type=String,Description="Outlier sequences much longer than the alleles">
##INFO=<ID=CLUSTERFAILURE,Number=0,Type=Flag,Description="If unphased input failed to cluster in two haplotype">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=RB,Number=2,Type=Integer,Description="Repeat length of the two alleles in bases relative to reference">
##FORMAT=<ID=FRB,Number=2,Type=Integer,Description="Full repeat length of the two alleles in bases">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
##FORMAT=<ID=SUP,Number=2,Type=Integer,Description="Read support per allele">
##FORMAT=<ID=SC,Number=2,Type=Integer,Description="Consensus score per allele">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00271.hg38
chr1    1435798 .       TGGCGCGGAGCGGCGCGGAGCG  GCTGGCGCGGAGCGGCGCGGA,GCGGGCGCGCGCAGGA  .       .       END=1435818;STDEV=1,2     GT:RB:FRB:SUP:SC        1|2:1,-4:21,16:18,6:63,41
chr1    57367044        .       AAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAATAAAT     AAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAATAAA,AAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAATAAA        .       .       END=57367125;STDEV=3,0 GT:RB:FRB:SUP:SC 1|2:-9,-34:72,47:17,12:216,141
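
To pull a single FORMAT value out of records like the ones above, a small awk wrapper is often enough. This is a sketch, not a replacement for a proper VCF parser: the function name vcf_format_value is hypothetical, it only reads the first sample column, and it assumes no column contains spaces.

```shell
# Print CHROM, POS and the value of one FORMAT key (e.g. FRB) for the first
# sample of every record of an uncompressed VCF read from stdin.
vcf_format_value() {
    awk -v key="$1" '
        /^#/ { next }    # skip header lines
        {
            # Column 9 holds the FORMAT keys, column 10 the sample values.
            n = split($9, keys, ":")
            split($10, vals, ":")
            for (i = 1; i <= n; i++)
                if (keys[i] == key) print $1, $2, vals[i]
        }'
}

# Examples:
# vcf_format_value FRB < sample.vcf
# gzip -dc sample.vcf.gz | vcf_format_value RB
```

For the first example record, asking for FRB prints chr1 1435798 21,16.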

Development

Getting Started for Contributors

Prerequisites

Rust toolchain (install via rustup)
Git

First-Time Setup

Clone the repository:

git clone https://github.com/wdecoster/STRdust.git
cd STRdust

Install development tools and git hooks:

make setup        # Installs rustfmt, clippy, cargo-audit, cargo-outdated
make install-hooks # Installs pre-commit and pre-push hooks

The git hooks will automatically run quality checks before commits and pushes, catching issues early.

Development Workflow

Quick checks before committing:

make pre-commit   # Runs fmt, clippy, and tests

Full CI simulation before pushing:

make ci           # Runs fmt-check, clippy, and tests (same as CI)

Other useful commands:

make fmt          # Format code
make fmt-check    # Check formatting without modifying files
make clippy       # Run linter
make test         # Run tests
make build        # Build release binary
make build-musl   # Build static MUSL binary
make docs         # Generate and open documentation
make help         # Show all available targets

Testing

STRdust includes comprehensive tests, including specific tests for the --pathogenic functionality. Run tests with:

cargo test
# or
make test

For network-dependent tests (testing STRchive download functionality):

TEST_PATHOGENIC_NETWORK=1 cargo test

See PATHOGENIC_TESTING.md for detailed information about the pathogenic flag testing.

Code Quality Standards

Formatting

Code must be formatted with rustfmt using the project's configuration (.rustfmt.toml):

Max line width: 100 characters
Edition: 2021
Field init shorthand enabled

The pre-commit hook automatically runs formatting checks.

Linting

The project uses cargo clippy for linting with warnings treated as errors. Some clippy warnings are configured to be allowed in Cargo.toml:

too_many_arguments: Allowed because bioinformatics functions often require many parameters for configuration

Run clippy with:

cargo clippy --all-targets --all-features -- -D warnings
# or
make clippy

Security

Security audits run automatically in CI. To run them locally:

make audit        # Run cargo-audit for vulnerability scanning
make outdated     # Check for outdated dependencies

Dependency Management

This project uses Dependabot to automatically keep dependencies up to date. Dependabot is configured to:

Check for Cargo dependency updates weekly on Mondays
Check for GitHub Actions updates weekly
Automatically create pull requests for dependency updates
Group minor and patch updates together for easier review
Auto-merge patch updates after tests pass
Require manual review for major version updates

The Dependabot configuration can be found in .github/dependabot.yml.

Continuous Integration

The project uses GitHub Actions for CI/CD:

Test workflow: Runs on all pushes and pull requests
  - Checks formatting (cargo fmt --check)
  - Runs clippy with -D warnings
  - Runs full test suite
  - Separate job for MUSL static binary build and test
  - Uses cargo caching for faster builds
Security workflow: Runs weekly and on every push/PR
  - Security audit with cargo-audit
  - Outdated dependency checks
  - License and security policy enforcement with cargo-deny
Dependabot workflow: Automatically tests and merges safe dependency updates
Publish workflow: Creates releases for Linux and macOS when tags are pushed

All CI checks can be simulated locally with make ci before pushing.

CITATION

If you use this tool, please consider citing our publication.
