Description
Bug Report: DataFusion object store registration fails for custom S3 endpoints
Confirmation
- I've re-read the relevant sections of the documentation.
- I've searched existing issues and discussions to avoid duplicates.
- I've reviewed or skimmed the source code (or examples) to confirm the behavior is not by design.
- I've tested this issue using a recent development wheel - Using version 1.221.0 (latest release)
Expected Behavior
When using ParquetDataCatalog with S3 protocol and custom endpoints (e.g., Minio), DataFusion should successfully register the object store and be able to read parquet files.
Based on concepts/data.md lines 584-675, S3 catalog integration should work with:
from nautilus_trader.persistence.catalog.parquet import ParquetDataCatalog
catalog = ParquetDataCatalog(
    path="s3://bucket-name/catalog",
    fs_protocol='s3',
    fs_storage_options={
        'endpoint_url': 'http://localhost:9000',  # Minio
        'key': 'minioadmin',
        'secret': 'minioadmin',
    },
)
bars = catalog.bars()  # Should load data successfully
Actual Behavior
ParquetDataCatalog fails to load data from S3-compatible storage (Minio) with error:
DataFusion error: Internal("No suitable object store found for s3://trademan-data/catalog/data/bar/NVDA.XNAS-1-MINUTE-LAST-EXTERNAL/2023-10-27T13-30-00-000000000Z_2025-10-24T19-59-00-000000000Z.parquet")
Root cause: In crates/persistence/src/backend/catalog.rs line 946, the object store is registered with a bucket-specific URL (s3://bucket-name) instead of the generic URL (s3://) that DataFusion expects.
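To make the two candidate registration URLs concrete, the sketch below mirrors in Python the derivation described above: given the catalog URI, it builds both the bucket-specific form (scheme plus host) and the scheme-only form. The helper name `registration_urls` is hypothetical and is not part of NautilusTrader or DataFusion. It only illustrates the string handling and makes no claim about how DataFusion looks up a registered store.

```python
from urllib.parse import urlsplit


def registration_urls(uri: str) -> tuple[str, str]:
    """Return (bucket_specific, scheme_only) base URLs for a remote catalog URI.

    Hypothetical helper, for illustration only.
    """
    parts = urlsplit(uri)
    if not parts.scheme or not parts.hostname:
        raise ValueError("Remote URI missing scheme or host/bucket name")
    bucket_specific = f"{parts.scheme}://{parts.hostname}"  # e.g. "s3://bucket-name"
    scheme_only = f"{parts.scheme}://"  # e.g. "s3://"
    return bucket_specific, scheme_only


print(registration_urls("s3://trademan-data/catalog"))
# ('s3://trademan-data', 's3://')
```

Taking the scheme from the URI instead of hard-coding it avoids assuming that every remote catalog uses `s3`. Whether other schemes reach this code path is an assumption here.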
Steps to Reproduce the Problem
- Set up Minio (Docker Compose or standalone):
docker run -p 9000:9000 -p 9001:9001 \
  -e "MINIO_ROOT_USER=minioadmin" \
  -e "MINIO_ROOT_PASSWORD=minioadmin" \
  quay.io/minio/minio server /data --console-address ":9001"
- Create a test script (test_minio_catalog.py):
import os
from nautilus_trader.persistence.catalog.parquet import ParquetDataCatalog

os.environ['MINIO_ENDPOINT'] = 'http://localhost:9000'
os.environ['MINIO_ACCESS_KEY'] = 'minioadmin'
os.environ['MINIO_SECRET_KEY'] = 'minioadmin'

catalog = ParquetDataCatalog(
    path="s3://trademan-data/catalog",
    fs_protocol='s3',
    fs_storage_options={
        'key': 'minioadmin',
        'secret': 'minioadmin',
        'endpoint_url': 'http://localhost:9000',
        'endpoint': 'http://localhost:9000',
        'region': 'us-east-1',
        'allow_http': 'true',
    },
)

# This fails with "No suitable object store found"
bars = catalog.bars()
- Observe that the catalog query fails with "No suitable object store found".
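The reproduction script sets `MINIO_*` environment variables but then passes hard-coded values to the catalog. As a minimal sketch, assuming those same variable names, a small helper could build `fs_storage_options` from the environment so the script and the Minio container stay in sync. The name `minio_storage_options` is hypothetical. The defaults are the values used in this report, and only the three keys from the documented example are included.

```python
import os


def minio_storage_options(env=None):
    """Build fs_storage_options from MINIO_* variables (hypothetical helper).

    Falls back to the local Minio defaults used in this report.
    """
    env = os.environ if env is None else env
    return {
        "endpoint_url": env.get("MINIO_ENDPOINT", "http://localhost:9000"),
        "key": env.get("MINIO_ACCESS_KEY", "minioadmin"),
        "secret": env.get("MINIO_SECRET_KEY", "minioadmin"),
    }


options = minio_storage_options()
print(sorted(options))
```

The resulting dict can be passed as `fs_storage_options=options` in the script above. It does not change the failure being reported.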
Working Proof of Concept:
Using Python DataFusion directly works when registering with generic URL:
import datafusion

s3 = datafusion.object_store.AmazonS3(
    bucket_name="trademan-data",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    endpoint="http://localhost:9000",
    allow_http=True,
)

# ✅ WORKS: Generic URL pattern
ctx = datafusion.SessionContext()
ctx.register_object_store("s3://", s3)  # DataFusion expects this
ctx.register_parquet("test", "s3://trademan-data/catalog/data/bar/.../file.parquet")

# ❌ FAILS: Bucket-specific URL pattern (what Nautilus does)
# A fresh context is used so the registration above does not mask the failure.
ctx2 = datafusion.SessionContext()
ctx2.register_object_store("s3://trademan-data", s3)  # This is what Nautilus does
ctx2.register_parquet("test", "s3://trademan-data/catalog/data/bar/.../file.parquet")
Code Snippets or Logs
Location of Bug
File: crates/persistence/src/backend/catalog.rs lines 940-949
// Register the object store with the session for remote URIs
if self.is_remote_uri() {
    let url = url::Url::parse(&self.original_uri)?;
    let host = url
        .host_str()
        .ok_or_else(|| anyhow::anyhow!("Remote URI missing host/bucket name"))?;
    let base_url = url::Url::parse(&format!("{}://{}", url.scheme(), host))?; // ❌ BUG: Creates "s3://bucket-name"
    self.session
        .register_object_store(&base_url, self.object_store.clone());
}
Proposed Fix
Line 946 should be changed from:
let base_url = url::Url::parse(&format!("{}://{}", url.scheme(), host))?; // Creates "s3://trademan-data"
To:
let base_url = url::Url::parse("s3://")?; // Generic URL that DataFusion expects
Why This is a Bug
- DataFusion's registration pattern: DataFusion requires object stores to be registered with the generic s3:// URL pattern. When you register with "s3://", DataFusion can resolve any s3://bucket/path/file.parquet URL.
- Current implementation: NautilusTrader registers with a bucket-specific URL (s3://bucket-name), which DataFusion cannot resolve because it's looking for the generic pattern.
- Impact scope: This affects ANY custom S3 endpoint:
  - Minio (what we're using)
  - DigitalOcean Spaces
  - Backblaze B2
  - Linode Object Storage
  - Even real AWS S3 (though less noticeable)
- Evidence from testing: The included test-df.py demonstrates:
  - Python DataFusion registration with "s3://" works ✅ (reads 194,968 rows)
  - Python DataFusion registration with "s3://trademan-data" fails ❌
Complete Evidence Script
Test (test-df.py) demonstrates the bug:
import datafusion

# FAILS: Nautilus Rust pattern (bucket-specific URL)
s3 = datafusion.object_store.AmazonS3(
    bucket_name="trademan-data",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    endpoint="http://localhost:9000",
    allow_http=True,
)
ctx = datafusion.SessionContext()
ctx.register_object_store("s3://trademan-data", s3)  # ❌ FAILS
ctx.register_parquet("test", "s3://trademan-data/...")  # Error: "No suitable object store found"

# WORKS: Correct pattern (generic URL)
ctx2 = datafusion.SessionContext()
ctx2.register_object_store("s3://", s3)  # ✅ WORKS
ctx2.register_parquet("test2", "s3://trademan-data/...")  # Success: reads 194,968 rows
Specifications
- OS platform: macOS (Darwin 24.5.0)
- Python version: 3.13.4
- nautilus_trader version: 1.221.0
Additional Context
This bug prevents the use of custom S3-compatible storage for the ParquetDataCatalog feature, which is a core NautilusTrader capability according to the documentation.
The bug exists in the Rust backend where DataFusion object store registration occurs. The issue affects both:
- Custom S3 endpoints (Minio, DigitalOcean, etc.) - completely broken
- Real AWS S3 - may appear to work due to AWS defaults, but uses incorrect registration pattern
References:
- Official docs: concepts/data.md lines 584-675
- Bug location: crates/persistence/src/backend/catalog.rs lines 940-949
- Proof: Python DataFusion test showing correct vs incorrect patterns