Skip to content

Limits are not applied correctly #14406

@adriangb

Description

@adriangb

Describe the bug

Outer limits seem to be able to impact the inner limits of a subquery

To Reproduce

Run the following python script to create test data (parquet_files.zip):

import os
from datetime import datetime, timedelta

import polars as pl

# Start date
base_date = datetime(1970, 1, 1)

# Create directory structure if it doesn't exist
os.makedirs('parquet_files', exist_ok=True)

# Generate 100 files
for i in range(100):
    # Calculate the date for this partition
    current_date = base_date + timedelta(days=i)
    partition_path = f'parquet_files/day={current_date.strftime("%Y-%m-%d")}'

    # Create partition directory
    os.makedirs(partition_path, exist_ok=True)

    # Create DataFrame with single row
    df = pl.DataFrame({'duration': [1.0]})

    # Write to parquet file
    df.write_parquet(f'{partition_path}/file_{i}.parquet')

Now in datafusion-cli (datafusion-cli 43.0.0 for me) run:

with selection as (
    select *
    from 'parquet_files/*'
    limit 1
)
select 1 as foo
from selection
order by duration
limit 1000;

I get:

+-----+
| foo |
+-----+
| 1   |
| 1   |
+-----+
2 row(s) fetched.

Which is wrong! It should only ever return 1 row.

This is an MRE of a problem I found in our production stack. In real world tests it's not 2x the rows, it can be varying numbers, it seems to depend on the number of partitions chosen to execute with. Setting SET datafusion.execution.target_partitions = 1; the problem goes away. Also without the outer limit 1000 the problem goes away.

Expected behavior

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions