Add lightweight FASTQ file format support #7924

behroozazarkhalili · 2025-12-31T19:46:42Z

Summary

This PR adds support for loading FASTQ files directly with load_dataset().

FASTQ is an extension of FASTA that includes quality scores for each base, widely used for storing output from high-throughput sequencing instruments.

Key Features

Zero external dependencies - Pure Python parser based on readfq.py by Heng Li
Quality score support - Preserves per-base quality scores as ASCII-encoded strings
Streaming support - Generator-based parsing for memory efficiency with large NGS files
Compression support - Automatic detection of gzip, bzip2, and xz compressed files
Large sequence support - Uses large_string for both sequence and quality columns
Parquet-safe batching - Dual-threshold batching (batch_size + max_batch_bytes) prevents page size errors
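
The dual-threshold batching idea can be illustrated with a short sketch. This is not the PR's implementation: the function name `batch_records`, the default values, and the way record size is estimated (length of the sequence plus quality strings) are all assumptions made for illustration. Only the parameter names `batch_size` and `max_batch_bytes` come from the description above.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


def batch_records(
    records: Iterable[Record],
    batch_size: int = 1000,                    # assumed default, not from the PR
    max_batch_bytes: int = 64 * 1024 * 1024,   # assumed default, not from the PR
) -> Iterator[List[Record]]:
    """Yield lists of records, flushing on whichever threshold is hit first."""
    batch: List[Record] = []
    batch_bytes = 0
    for record in records:
        # Approximate the payload by the length of the two large columns.
        record_bytes = len(record["sequence"]) + len(record["quality"])
        # Flush first if adding this record would exceed the byte cap.
        if batch and batch_bytes + record_bytes > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += record_bytes
        # Flush once the row-count cap is reached.
        if len(batch) >= batch_size:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        yield batch
```

With this shape, many short reads are grouped by row count, while a handful of very long reads are split early by the byte cap. A single record larger than `max_batch_bytes` still ends up in a batch of its own.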

Columns

Column	Type	Description
`id`	string	Sequence identifier (first word after `@`)
`description`	string	Full description line (everything after id)
`sequence`	large_string	The nucleotide sequence
`quality`	large_string	ASCII-encoded quality scores (Phred+33 by default)
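
To make the column layout concrete, here is a minimal generator-based parser sketch that yields one dictionary per record with the four fields listed above. It is written in the spirit of readfq.py but is not the PR's parser: the name `parse_fastq` is hypothetical, and the sketch assumes well-formed input in which the quality string has the same length as the sequence.

```python
from typing import Dict, Iterable, Iterator


def parse_fastq(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield {'id', 'description', 'sequence', 'quality'} for each FASTQ record."""
    it = iter(lines)
    for line in it:
        line = line.rstrip("\r\n")
        if not line.startswith("@"):
            continue  # skip blank lines or stray content between records
        # The id is the first word after '@'; the rest is the description.
        parts = line[1:].split(None, 1)
        record_id = parts[0] if parts else ""
        description = parts[1] if len(parts) > 1 else ""
        # Sequence lines run until the '+' separator (multi-line is allowed).
        seq_parts = []
        for line in it:
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            return  # truncated record: input ended before a '+' line
        sequence = "".join(seq_parts)
        # Read quality lines by length, because '@' and '+' are valid
        # quality characters and cannot be used as record delimiters here.
        qual_parts, qual_len = [], 0
        while qual_len < len(sequence):
            line = next(it, None)
            if line is None:
                return  # truncated record: not enough quality characters
            line = line.rstrip("\r\n")
            qual_parts.append(line)
            qual_len += len(line)
        yield {
            "id": record_id,
            "description": description,
            "sequence": sequence,
            "quality": "".join(qual_parts),
        }
```

Because the function is a generator over any iterable of lines, it can be fed an open file handle and never needs to hold the whole file in memory.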

Supported Extensions

.fq, .fastq (and compressed variants: .fq.gz, .fastq.gz, .fq.bz2, .fq.xz)
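
As an illustration of how compressed variants can be detected without relying on the file extension, the following sketch checks the leading magic bytes and picks a standard-library opener. This is a stand-alone example, not necessarily how the PR or the `datasets` library handles compression; the helper name `open_text` is hypothetical.

```python
import bz2
import gzip
import lzma
from typing import IO

# Leading magic bytes for each supported compression format.
_MAGIC = (
    (b"\x1f\x8b", gzip.open),       # gzip
    (b"BZh", bz2.open),             # bzip2
    (b"\xfd7zXZ\x00", lzma.open),   # xz
)


def open_text(path: str, encoding: str = "utf-8") -> IO[str]:
    """Open a possibly compressed file in text mode, sniffing its format."""
    with open(path, "rb") as raw:
        head = raw.read(6)  # the longest signature above is 6 bytes
    for magic, opener in _MAGIC:
        if head.startswith(magic):
            return opener(path, "rt", encoding=encoding)
    # No known signature: treat the file as plain text.
    return open(path, "r", encoding=encoding)
```

Sniffing the content rather than the suffix means a gzipped file named `reads.fastq` would still be read correctly.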

Usage

from datasets import load_dataset

# Load FASTQ file
dataset = load_dataset("fastq", data_files="reads.fastq")

# Load gzipped file
dataset = load_dataset("fastq", data_files="reads.fq.gz")

# Filter columns
dataset = load_dataset("fastq", data_files="reads.fq", columns=["sequence", "quality"])

Quality Score Format

Quality scores use Sanger/Illumina 1.8+ encoding (Phred+33):

ASCII character ! (33) = quality 0
ASCII character I (73) = quality 40
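
Since the `quality` column keeps the raw ASCII string, numeric scores have to be decoded by the user. A minimal sketch, assuming Phred+33 encoding as described above (the function names are illustrative and not part of the PR):

```python
from typing import List


def decode_phred33(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII-encoded quality string to integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score: int) -> float:
    """Convert a Phred score Q to the base-call error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


scores = decode_phred33("!I")  # [0, 40]
```

A score of 40 therefore corresponds to an estimated error probability of 1 in 10,000 for that base.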

Testing

22 comprehensive tests covering basic loading, multi-line sequences, compression, batching, schema types, and edge cases
All tests passing
Linting clean

References

Follows the pattern established in #7923 (feat(fasta): add lightweight FASTA file format support)
Parser based on: https://github.com/lh3/readfq
Addresses feedback from #7851 (Add fasta support)

cc: @georgia-hf

Add support for loading FASTQ files directly with load_dataset(). FASTQ is a text-based format for storing nucleotide sequences together with their quality scores, widely used for high-throughput sequencing.

Key features:
- Zero external dependencies using pure Python parser based on readfq.py
- Streaming support via generator-based parsing for large NGS files
- Compression support for gzip, bzip2, and xz formats
- Large sequence support using large_string Arrow type
- Dual-threshold batching (batch_size + max_batch_bytes) for Parquet safety

Columns: id, description, sequence, quality
Extensions: .fq, .fastq

FastqConfig was missing the __post_init__ method that calls super().__post_init__(). This is required to inherit BuilderConfig's validation for:
- Invalid config name characters (InvalidConfigName)
- data_files type validation (ValueError)

This aligns with the pattern used in ArrowConfig, MmcifFolderConfig, FastaConfig, and other packaged module configs. Also includes minor style formatting from ruff.

This was referenced Dec 31, 2025

feat: Add mmCIF file support for macromolecular structures #7925

Open

Add lightweight PDB (Protein Data Bank) file support #7926

Open

Proposal: Protein 3D Structure Visualization for Dataset Viewer #7930

Open
