Status: Closed
Labels: bug (Something isn't working)
Description
Describe the bug
An outer LIMIT appears to be able to affect the inner LIMIT of a subquery.
To Reproduce
Run the following python script to create test data (parquet_files.zip):
```python
import os
from datetime import datetime, timedelta

import polars as pl

# Start date
base_date = datetime(1970, 1, 1)

# Create directory structure if it doesn't exist
os.makedirs('parquet_files', exist_ok=True)

# Generate 100 files
for i in range(100):
    # Calculate the date for this partition
    current_date = base_date + timedelta(days=i)
    partition_path = f'parquet_files/day={current_date.strftime("%Y-%m-%d")}'

    # Create partition directory
    os.makedirs(partition_path, exist_ok=True)

    # Create DataFrame with a single row
    df = pl.DataFrame({'duration': [1.0]})

    # Write to parquet file
    df.write_parquet(f'{partition_path}/file_{i}.parquet')
```

Now in datafusion-cli (datafusion-cli 43.0.0 for me) run:
```sql
with selection as (
    select *
    from 'parquet_files/*'
    limit 1
)
select 1 as foo
from selection
order by duration
limit 1000;
```

I get:
```
+-----+
| foo |
+-----+
| 1   |
| 1   |
+-----+
2 row(s) fetched.
```
Which is wrong! It should only ever return 1 row.
This is an MRE of a problem I found in our production stack. In real-world tests it is not always 2x the rows; the multiplier varies, and it seems to depend on the number of partitions chosen for execution. With `SET datafusion.execution.target_partitions = 1;` the problem goes away. It also goes away without the outer `limit 1000`.
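The partition-dependent row counts are consistent with the inner `limit 1` being applied once per execution partition rather than once globally. This is only a hypothesis about the mechanism, not a confirmed diagnosis; a minimal Python sketch of that suspected execution model (all names here are illustrative, not DataFusion internals):

```python
def run_subquery(rows, inner_limit, target_partitions):
    """Simulate 'select * from t limit N' executed across partitions,
    with the suspected bug: the limit is applied per partition and the
    results are concatenated without a final global limit."""
    # Split rows across partitions (round-robin, like a repartition).
    partitions = [rows[i::target_partitions] for i in range(target_partitions)]
    # Each partition applies the limit locally; no global limit follows.
    return [r for part in partitions for r in part[:inner_limit]]

rows = list(range(100))  # 100 single-row files, as in the script above

# With one partition the local limit is also the global limit: 1 row.
print(len(run_subquery(rows, inner_limit=1, target_partitions=1)))  # 1

# With two partitions, each emits one row: 2 rows, matching the MRE output.
print(len(run_subquery(rows, inner_limit=1, target_partitions=2)))  # 2
```

This would also explain why `target_partitions = 1` makes the problem disappear: with a single partition, the per-partition limit coincides with the global one.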
Expected behavior
No response
Additional context
No response