Skip to content

RagnarGrootKoerkamp/sassy

Repository files navigation

crates.io docs.rs PyPI

Sassy: SIMD-accelerated Approximate String Matching

Sassy is a library and tool for searching short strings in texts, a problem that goes by many names:

  • approximate string matching,
  • pattern matching,
  • fuzzy searching.

The motivating application is searching short (length 20 to 100) DNA sequences in a human genome or e.g. in a set of reads. Sassy generally works well for patterns/queries up to length 1000, and supports both ASCII and DNA.

Highlights:

  • Sassy uses bitpacking and SIMD (both AVX2 and NEON supported). Its main novelty is tiling these in the text direction.
  • Support for overhang alignments where the pattern extends beyond the text.
  • Support for (case-insensitive) ASCII, DNA (ACGT), and IUPAC (=ACGT+NYR...) alphabets.
  • Rust library (cargo add sassy), binary (cargo install sassy), Python bindings (pip install sassy-rs), and C bindings (see below).

See the paper below, and corresponding evals in evals/.

Rick Beeloo and Ragnar Groot Koerkamp.
Sassy: Searching Short DNA Strings in the 2020s.
bioRxiv, July 2025. https://doi.org/10.1101/2025.07.22.666207.

Usage

Sassy can be used via the CLI, or as Rust, Python, or C library.

0. Rust library

The library can be used to search for ASCII or DNA strings. A larger example can be found in src/lib.rs.

// cargo add sassy
use sassy::{Searcher, Match, profiles::{Dna}, Strand};

let pattern = b"ATCG";
let text = b"AAAATTGAAA";
let k = 1;

let mut searcher = Searcher::<Dna>::new_fwd();
let matches = searcher.search(pattern, &text, k);

assert_eq!(matches.len(), 1);

assert_eq!(matches[0].text_start, 3);
assert_eq!(matches[0].text_end, 7);
assert_eq!(matches[0].cost, 1);
assert_eq!(matches[0].strand, Strand::Fwd);
assert_eq!(matches[0].cigar.to_string(), "2=1X1=");

1. Command-line interface (CLI)

Build and install using cargo

cargo install sassy

The CLI can be used via:

  1. sassy grep: to show nicely coloured output.
  2. sassy search: to write a .tsv of matching locations.
  3. sassy filter: to write a .fasta/.fastq of (non)-matching records.
  4. sassy crispr: to search for CRISPR guides.

grep, search, and filter all take the same arguments, and are implemented by forwarding to grep. Thus, they can all be combined via e.g.

sassy grep -p ACGTCAAACCTA -k 3 --matches matches.tsv --output filtered.fastq reads.fastq.gz

1.1: Grep for a pattern

Search a pattern ATGAGCA in text.fasta with ≤1 edit:

sassy search --pattern ATGAGCA -k 1 text.fasta

or search all records of a fasta file with --pattern-fasta <fasta-file> instead of --pattern.

The grep output is coloured:

  • green shows matching characters,
  • orange shows mismatches,
  • red shows deleted characters (in pattern but not in text),
  • blue shows inserted characters (in text but not in pattern). screenshot of sassy grep output

1.2: TSV output for matches

sassy search -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa > matches.tsv
# or
sassy search -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa --matches matches.tsv
# or
sassy grep   -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa --matches matches.tsv

gives .tsv output like this:

pat_id	text_id	cost	strand	start	end	match_region	cigar
pattern	AC_000001.1__1_1	0	+	6	48	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_35	0	+	897	939	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_49	1	+	866	908	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCGCGCG	37=1X4=
pattern	AC_000001.1__1_64	0	-	1267	1309	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_67	0	+	600	642	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_68	0	-	1826	1868	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_78	3	-	4381	4425	GTACAGAAACGAGCGGATGGAAAATAGTAGTGAGCGGCCTCGCG	23=1X1I10=1I8=
pattern	AC_000001.1__1_92	0	-	6554	6596	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_94	0	-	6413	6455	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_115	2	+	2091	2131	GTACAGAAACGAGCATGGAAAGAGTAGTGAGCGCCTCGCG	14=2D26=
pattern	AC_000001.1__1_118	0	-	3062	3104	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_123	0	+	1416	1458	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_127	0	+	27	69	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=

1.3: Filter matching records

sassy filter -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq > filtered.fq
# or
sassy filter -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq -o filtered.fq
# or
sassy grep   -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq -o filtered.fq

Writes a file containing only matching records. Use --invert to only write non-matching records.

1.4: CRISPR off-target search

Search for one or more guides in guides.txt:

sassy crispr --threads 8 --guide guides.txt --k 5 --max-n-frac 0.1 --output hits.tsv hg38.fasta

Allows <= k edits in the sgRNA, and the PAM (the last 3 characters of each guide) has to match exactly, unless --allow-pam-edits is given.

Output of the crispr command is a tab-delimited file with one row per hit, e.g.:

guide                    text_id  cost  strand  start     end       match_region             cigar
GAGTCCGAGCAGAAGAAGAANGG  chr21    5     +       5024135   5024154   GAGGCCACAGAGAAGAGGG      3=1X2=1D1=1D3=1D5=1D4=
GAGTCCGAGCAGAAGAAGAANGG  chr21    3     +       21087337  21087359  gagaccgaggagaagaaaaagg   3=1X5=1X7=1D5=
GAGTCCGAGCAGAAGAAGAANGG  chr21    3     -       9701297   9701320   GACTCGAGCATGAAGAAGAAAGG  2=1X1=1D6=1I12=
GAGTCCGAGCAGAAGAAGAANGG  chr21    5     -       46396975  46396998  CAGTCCCAGCAGACGACGGACGG  1X5=1X6=1X2=1X1=1X4=

The start and end are 0-based open-ended (i.e. 0-based inclusive of the start, but exclusive of the end), and start is always less than end (regardless of the strand). The match_region reported will be the sequence from the target file when strand is +, or the reverse complement of the sequence from the target file when strand is -, so that it matches the guide sequence. The cigar is always oriented to read left-to-right with the provided guide and match_region sequences.

Note that this searches for approximate occurrences of the guide sequence itself, and not for reverse-complement binding sites. If binding sites are to be found, please reverse-complement the input or output manually.

2. Python bindings

PyPI wheels can be installed with:

pip install sassy-rs 
import sassy

pattern = b"ACTG"
text    = b"ACGGCTACGCAGCATCATCAGCAT"

searcher = sassy.Searcher("dna") # ascii / dna / iupac
matches  = searcher.search(pattern, text, k=1)

for m in matches:
    print(m)

See python/README.md for more details.

3. C library

See c/README.md for details. Quick example:

#include "sassy.h"

int main() {
    const char* pattern = "ACTG";
    const char* text    = "ACGGCTACGCAGCATCATCAGCAT";

    // DNA alphabet, with reverse complement, without overhang.
    sassy_SearcherType* searcher = sassy_searcher("dna", true, NAN);
    sassy_Match* out_matches = NULL;
    size_t n_matches = search(searcher,
                              pattern, strlen(pattern),
                              text, strlen(text),
                              1, // k=1
                              &out_matches);

    sassy_matches_free(out_matches, n_matches);
    sassy_searcher_free(searcher);
}

About

Fast approximate string searching

Topics

Resources

License

Stars

Watchers

Forks

Contributors 4

  •  
  •  
  •  
  •