@behroozazarkhalili behroozazarkhalili commented Dec 31, 2025

Summary

This PR adds support for loading FASTQ files directly with load_dataset().

FASTQ is an extension of FASTA that includes quality scores for each base, widely used for storing output from high-throughput sequencing instruments.

Key Features

  • Zero external dependencies - Pure Python parser based on readfq.py by Heng Li
  • Quality score support - Preserves per-base quality scores as ASCII-encoded strings
  • Streaming support - Generator-based parsing for memory efficiency with large NGS files
  • Compression support - Automatic detection of gzip, bzip2, and xz compressed files
  • Large sequence support - Uses large_string for both sequence and quality columns
  • Parquet-safe batching - Dual-threshold batching (batch_size + max_batch_bytes) prevents page size errors
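The dual-threshold idea in the last bullet can be illustrated with a minimal sketch. It flushes a batch when either the record count or the accumulated byte size reaches its limit. The function name `batch_records` and the default values are assumptions for illustration, not the PR's actual implementation.

```python
from typing import Dict, Iterable, Iterator, List


def batch_records(
    records: Iterable[Dict[str, str]],
    batch_size: int = 10_000,               # assumed default, for illustration only
    max_batch_bytes: int = 256 * 1024 * 1024,  # assumed default, for illustration only
) -> Iterator[List[Dict[str, str]]]:
    """Group records into batches bounded by both row count and byte size."""
    batch: List[Dict[str, str]] = []
    batch_bytes = 0
    for record in records:
        # Approximate the record size by the UTF-8 length of its string fields.
        record_bytes = sum(len(value.encode("utf-8")) for value in record.values())
        # Flush first if adding this record would exceed the byte budget.
        if batch and batch_bytes + record_bytes > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += record_bytes
        # Flush once the row-count threshold is reached.
        if len(batch) >= batch_size:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        yield batch
```

In this sketch a single record larger than `max_batch_bytes` still ends up in a batch of its own, so very long reads are never dropped.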

Columns

Column       Type          Description
id           string        Sequence identifier (first word after @)
description  string        Full description line (everything after id)
sequence     large_string  The nucleotide sequence
quality      large_string  ASCII-encoded quality scores (Phred+33 by default)
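To show how one FASTQ record could map onto these four columns, here is a small generator in the spirit of readfq.py. It is a sketch, not the parser added by this PR: the name `parse_fastq` is hypothetical, the id is assumed to end at the first space, and truncated records are silently dropped.

```python
import io
from typing import Dict, Iterable, Iterator


def parse_fastq(handle: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per record; supports multi-line sequence and quality."""
    lines = iter(handle)
    for line in lines:
        if not line.startswith("@"):
            continue  # skip blank lines or stray text between records
        header = line.rstrip("\r\n")[1:]
        seq_id, _, description = header.partition(" ")

        # Sequence lines run until the '+' separator line.
        seq_parts = []
        saw_separator = False
        for line in lines:
            if line.startswith("+"):
                saw_separator = True
                break
            seq_parts.append(line.rstrip("\r\n"))
        if not saw_separator:
            return  # truncated record at end of input
        sequence = "".join(seq_parts)

        # Quality is read by length, because a quality line may start with '@'.
        qual_parts = []
        qual_len = 0
        for line in lines:
            text = line.rstrip("\r\n")
            qual_parts.append(text)
            qual_len += len(text)
            if qual_len >= len(sequence):
                break
        if qual_len < len(sequence):
            return  # truncated quality string
        yield {
            "id": seq_id,
            "description": description,
            "sequence": sequence,
            "quality": "".join(qual_parts),
        }


if __name__ == "__main__":
    sample = io.StringIO("@r1 first read\nACGT\n+\nIIII\n")
    for record in parse_fastq(sample):
        print(record)
```

Reading the quality string by length rather than by looking for the next `@` is what makes multi-line records safe, since `@` is itself a valid Phred+33 quality character.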

Supported Extensions

.fq, .fastq (and compressed variants: .fq.gz, .fastq.gz, .fq.bz2, .fq.xz)
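One way to detect compression automatically is to inspect the first bytes of the file instead of trusting the extension. The sketch below does that with the standard library only; `open_maybe_compressed` is a hypothetical helper, and the PR may instead rely on file extensions or on the library's existing download and extraction machinery.

```python
import bz2
import gzip
import lzma
from typing import TextIO

# Magic-byte prefixes for the three supported compression formats.
_MAGIC_OPENERS = (
    (b"\x1f\x8b", gzip.open),       # gzip
    (b"BZh", bz2.open),             # bzip2
    (b"\xfd7zXZ\x00", lzma.open),   # xz
)


def open_maybe_compressed(path: str) -> TextIO:
    """Open a plain or compressed text file, choosing the opener by magic bytes."""
    with open(path, "rb") as raw:
        head = raw.read(6)
    for magic, opener in _MAGIC_OPENERS:
        if head.startswith(magic):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

The returned handle is a line iterator in text mode, so it can be passed directly to a generator-based parser without decompressing the whole file first.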

Usage

from datasets import load_dataset

# Load FASTQ file
dataset = load_dataset("fastq", data_files="reads.fastq")

# Load gzipped file
dataset = load_dataset("fastq", data_files="reads.fq.gz")

# Filter columns
dataset = load_dataset("fastq", data_files="reads.fq", columns=["sequence", "quality"])

Quality Score Format

Quality scores use Sanger/Illumina 1.8+ encoding (Phred+33):

  • ASCII character ! (33) = quality 0
  • ASCII character I (73) = quality 40
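Because the quality column is kept as a raw string, converting it to numeric scores is left to the user. A minimal sketch of the Phred+33 conversion follows; both function names are illustrative and not part of the PR.

```python
from typing import List


def decode_phred33(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII-encoded quality string to integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


if __name__ == "__main__":
    print(decode_phred33("!I"))  # [0, 40]
```

A score of 40 therefore corresponds to an estimated error probability of about 1 in 10,000 for that base.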

Testing

  • 22 comprehensive tests covering basic loading, multi-line sequences, compression, batching, schema types, and edge cases
  • All tests passing
  • Linting clean

cc: @georgia-hf

Add support for loading FASTQ files directly with load_dataset().

FASTQ is a text-based format for storing nucleotide sequences together
with their quality scores, widely used for high-throughput sequencing.

Key features:
- Zero external dependencies using pure Python parser based on readfq.py
- Streaming support via generator-based parsing for large NGS files
- Compression support for gzip, bzip2, and xz formats
- Large sequence support using large_string Arrow type
- Dual-threshold batching (batch_size + max_batch_bytes) for Parquet safety

Columns: id, description, sequence, quality
Extensions: .fq, .fastq

FastqConfig was missing the __post_init__ method that calls
super().__post_init__(). This is required to inherit BuilderConfig's
validation for:
- Invalid config name characters (InvalidConfigName)
- data_files type validation (ValueError)

This aligns with the pattern used in ArrowConfig, MmcifFolderConfig,
FastaConfig, and other packaged module configs.
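The delegation pattern can be shown in isolation with stand-in dataclasses. Everything below is hypothetical: `BaseConfigSketch` only imitates the kind of checks described above and is not the real BuilderConfig, and the `batch_size` field and its check are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseConfigSketch:
    """Stand-in for a base builder config; NOT the real BuilderConfig."""

    name: str = "default"
    data_files: Optional[dict] = None

    def __post_init__(self):
        # Imitates name and data_files validation in the base class.
        if set('<>:/\\|?*') & set(self.name):
            raise ValueError(f"Invalid config name: {self.name!r}")
        if self.data_files is not None and not isinstance(self.data_files, dict):
            raise ValueError("data_files must be a dict or None")


@dataclass
class FastqConfigSketch(BaseConfigSketch):
    batch_size: int = 10_000  # hypothetical field for illustration

    def __post_init__(self):
        # Delegate first so the base-class checks still run.
        super().__post_init__()
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")
```

With the explicit `super().__post_init__()` call, the base checks keep running even after the subclass adds post-init logic of its own.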

Also includes minor style formatting from ruff.