Support custom S3 endpoints for DataFusion object store #3120

@rgauny

Description

Bug Report: DataFusion object store registration fails for custom S3 endpoints

Confirmation

  • I've re-read the relevant sections of the documentation.
  • I've searched existing issues and discussions to avoid duplicates.
  • I've reviewed or skimmed the source code (or examples) to confirm the behavior is not by design.
  • I've tested this issue using a recent version - 1.221.0 (the latest release)

Expected Behavior

When using ParquetDataCatalog with the S3 protocol and a custom endpoint (e.g., Minio), DataFusion should successfully register the object store and be able to read parquet files.

Based on concepts/data.md lines 584-675, S3 catalog integration should work with:

from nautilus_trader.persistence.catalog.parquet import ParquetDataCatalog

catalog = ParquetDataCatalog(
    path="s3://bucket-name/catalog",
    fs_protocol='s3',
    fs_storage_options={
        'endpoint_url': 'http://localhost:9000',  # Minio
        'key': 'minioadmin',
        'secret': 'minioadmin',
    },
)
bars = catalog.bars()  # Should load data successfully

Actual Behavior

ParquetDataCatalog fails to load data from S3-compatible storage (Minio) with the following error:

DataFusion error: Internal("No suitable object store found for s3://trademan-data/catalog/data/bar/NVDA.XNAS-1-MINUTE-LAST-EXTERNAL/2023-10-27T13-30-00-000000000Z_2025-10-24T19-59-00-000000000Z.parquet")

Root cause: In crates/persistence/src/backend/catalog.rs line 946, the object store is registered with a bucket-specific URL (s3://bucket-name) instead of the generic URL (s3://) that DataFusion expects.

Steps to Reproduce the Problem

  1. Set up Minio (Docker Compose or standalone):

    docker run -p 9000:9000 -p 9001:9001 \
      -e "MINIO_ROOT_USER=minioadmin" \
      -e "MINIO_ROOT_PASSWORD=minioadmin" \
      quay.io/minio/minio server /data --console-address ":9001"
  2. Create a test script (test_minio_catalog.py):

    import os
    from nautilus_trader.persistence.catalog.parquet import ParquetDataCatalog
    
    os.environ['MINIO_ENDPOINT'] = 'http://localhost:9000'
    os.environ['MINIO_ACCESS_KEY'] = 'minioadmin'
    os.environ['MINIO_SECRET_KEY'] = 'minioadmin'
    
    catalog = ParquetDataCatalog(
        path="s3://trademan-data/catalog",
        fs_protocol='s3',
        fs_storage_options={
            'key': 'minioadmin',
            'secret': 'minioadmin',
            'endpoint_url': 'http://localhost:9000',
            'endpoint': 'http://localhost:9000',
            'region': 'us-east-1',
            'allow_http': 'true',
        },
    )
    
    # This fails with "No suitable object store found"
    bars = catalog.bars()
  3. Observe that the catalog query fails with "No suitable object store found"

Working Proof of Concept:

Using Python DataFusion directly works when the object store is registered with the generic URL:

import datafusion

ctx = datafusion.SessionContext()
s3 = datafusion.object_store.AmazonS3(
    bucket_name="trademan-data",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    endpoint="http://localhost:9000",
    allow_http=True,
)

# ✅ WORKS: Generic URL pattern
ctx.register_object_store("s3://", s3)  # DataFusion expects this
ctx.register_parquet("test", "s3://trademan-data/catalog/data/bar/.../file.parquet")

# ❌ FAILS: Bucket-specific URL pattern (what Nautilus does)
ctx.register_object_store("s3://trademan-data", s3)  # This is what Nautilus does
ctx.register_parquet("test", "s3://trademan-data/catalog/data/bar/.../file.parquet")

Code Snippets or Logs

Location of Bug

File: crates/persistence/src/backend/catalog.rs lines 940-949

// Register the object store with the session for remote URIs
if self.is_remote_uri() {
    let url = url::Url::parse(&self.original_uri)?;
    let host = url
        .host_str()
        .ok_or_else(|| anyhow::anyhow!("Remote URI missing host/bucket name"))?;
    let base_url = url::Url::parse(&format!("{}://{}", url.scheme(), host))?;  // ❌ BUG: Creates "s3://bucket-name"
    self.session
        .register_object_store(&base_url, self.object_store.clone());
}

Proposed Fix

Line 946 should be changed from:

let base_url = url::Url::parse(&format!("{}://{}", url.scheme(), host))?;  // Creates "s3://trademan-data"

To:

let base_url = url::Url::parse("s3://")?;  // Generic URL that DataFusion expects
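
With this change applied, the registration block quoted above would read roughly as follows. This is a sketch against the snippet shown earlier, not a verified patch against the current source; if the catalog also supports other remote schemes (e.g., gs://), those would presumably need the same generic-scheme registration.

// Register the object store with the session for remote URIs
if self.is_remote_uri() {
    // Register under the generic scheme URL ("s3://") so DataFusion can
    // resolve any s3://bucket/path URI against this store
    let base_url = url::Url::parse("s3://")?;
    self.session
        .register_object_store(&base_url, self.object_store.clone());
}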

Why This is a Bug

  1. DataFusion's registration pattern: DataFusion requires object stores to be registered with the generic s3:// URL pattern. When you register with "s3://", DataFusion can resolve any s3://bucket/path/file.parquet URL.

  2. Current implementation: NautilusTrader registers the store with a bucket-specific URL (s3://bucket-name), which DataFusion cannot resolve because it is looking for the generic pattern.

  3. Impact scope: This affects ANY custom S3 endpoint:

    • Minio (what we're using)
    • DigitalOcean Spaces
    • Backblaze B2
    • Linode Object Storage
    • Even real AWS S3 (where the problem is less noticeable)
  4. Evidence from testing: The included test-df.py demonstrates:

    • Python DataFusion registration with "s3://" works ✅ (reads 194,968 rows)
    • Python DataFusion registration with "s3://trademan-data" fails ❌

Complete Evidence Script

The test script (test-df.py) demonstrates the bug:

import datafusion

# FAILS: Nautilus Rust pattern (bucket-specific URL)
s3 = datafusion.object_store.AmazonS3(
    bucket_name="trademan-data",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    endpoint="http://localhost:9000",
    allow_http=True,
)
ctx = datafusion.SessionContext()
ctx.register_object_store("s3://trademan-data", s3)  # ❌ FAILS
ctx.register_parquet("test", "s3://trademan-data/...")  # Error: "No suitable object store found"

# WORKS: Correct pattern (generic URL)
ctx2 = datafusion.SessionContext()
ctx2.register_object_store("s3://", s3)  # ✅ WORKS
ctx2.register_parquet("test2", "s3://trademan-data/...")  # Success: reads 194,968 rows

Specifications

  • OS platform: macOS 24.5.0 (Darwin)
  • Python version: 3.13.4
  • nautilus_trader version: 1.221.0

Additional Context

This bug prevents the use of custom S3-compatible storage for the ParquetDataCatalog feature, which is a core NautilusTrader capability according to the documentation.

The bug exists in the Rust backend where DataFusion object store registration occurs. The issue affects both:

  1. Custom S3 endpoints (Minio, DigitalOcean, etc.) - completely broken
  2. Real AWS S3 - may appear to work due to AWS defaults, but uses an incorrect registration pattern

References:

  • Official docs: concepts/data.md lines 584-675
  • Bug location: crates/persistence/src/backend/catalog.rs lines 940-949
  • Proof: Python DataFusion test showing correct vs incorrect patterns

Metadata

Labels: enhancement (New feature or request)
Status: Backlog