Support reverse parquet scan and fast parquet order inversion at row group level #18817

zhuqi-lucas · 2025-11-19T12:12:08Z

Which issue does this PR close?

Overview

This PR implements reverse scanning for Parquet files to optimize ORDER BY ... DESC LIMIT N queries on sorted data. When DataFusion detects that reversing the scan order would eliminate the need for a separate sort operation, it can now directly read Parquet files in reverse order.

Implementation Note: This PR implements Part 1 of the vision outlined in #17172 (Order Inversion at the DataFusion level).

Current implementation:

✅ Row-group-level reversal (this PR)
✅ Memory bounded by row group size (typically ~128MB)
✅ Significant performance improvements for common use cases

Future improvements (requires arrow-rs changes):

Page-level reverse decoding (would reduce memory to ~1MB per page)
In-place array flipping (would eliminate take kernel overhead)
See issue Fast parquet order inversion #17172 for full details

These enhancements would further optimize memory usage and latency, but the current implementation already provides substantial benefits for most workloads.

Rationale for this change

Motivation

Currently, queries like SELECT * FROM table ORDER BY sorted_column DESC LIMIT 100 require DataFusion to:

Read the entire file in forward order
Sort/reverse all results in memory
Apply the limit

For files that are already sorted in ascending order, this is inefficient. With this optimization, DataFusion can:

Read row groups in reverse order
Reverse individual batches progressively
Stream results directly without a separate sort operation
Stop reading early when the limit is reached (single-partition case)

Performance Benefits:

Eliminates memory-intensive sort operations for large files
Enables streaming for reverse-order queries with limits
Reduces query latency significantly for sorted data
Reduces I/O by stopping early when limit is satisfied (single-partition)

Scope and Limitations

This optimization applies to:

✅ Single-partition scans (most common case for sorted Parquet files) - Full optimization: sort completely eliminated
✅ Multi-partition scans - Partial optimization: each partition scans in reverse, but SortPreservingMerge is still required
✅ Queries with ORDER BY ... DESC on pre-sorted columns
✅ Queries with LIMIT clauses (most beneficial for single-partition)

This optimization does NOT apply to:

❌ Unsorted files - No benefit from reverse scanning
❌ Complex sort expressions that don't match file ordering

Single-partition vs Multi-partition:

Scenario	Optimization Effect	Why
Single-partition	Sort operation completely eliminated	No merge needed, direct reverse streaming with limit pushdown
Multi-partition	Per-partition sorts eliminated, but merge still required	Each partition reads in reverse (eliminating per-partition sorts), but `SortPreservingMergeExec` is needed to combine streams. Limit cannot be pushed to individual partitions.

Performance comparison:

Single-partition: ORDER BY DESC LIMIT N → Direct reverse scan with limit pushed down to DataSource
Multi-partition: ORDER BY DESC LIMIT N → Reverse scan per partition + LocalLimitExec + SortPreservingMergeExec

While multi-partition scans still require a merge operation, they benefit significantly from:

Elimination of per-partition sort operations
Parallel reverse scanning across partitions
Reduced data volume entering the merge stage via LocalLimitExec

Configuration

This optimization is enabled by default but can be controlled via:

SQL:

SET datafusion.execution.parquet.enable_reverse_scan = true/false;

Rust API:

let ctx = SessionContext::new()
    .with_config(
        SessionConfig::new()
            .with_parquet_reverse_scan(false)  // Disable optimization
    );

When to disable:

If your Parquet files have very large row groups (> 256MB) and memory is constrained (row group buffering required for correctness)
For debugging or performance comparison purposes
If you observe unexpected behavior (please report as a bug!)

Default: Enabled (true)

Implementation Details

Architecture

The implementation consists of four main components:

1. ParquetSource API (`source.rs`)

Added reverse_scan: bool field to ParquetSource
Added with_reverse_scan() and reverse_scan() methods
The flag is propagated through the file scan configuration

2. ParquetOpener (`opener.rs`)

Added reverse_scan: bool field
Row Group Reversal: Before building the stream, row group indices are reversed: row_group_indexes.reverse()
Stream Selection: Based on reverse_scan flag, creates either:
- Normal stream: RecordBatchStreamAdapter
- Reverse stream: ReversedParquetStream with row-group-level buffering

3. ReversedParquetStream (`opener.rs`)

A custom stream implementation that performs two-stage reversal with optional limit support:

Stage 1 - Row Reversal: Reverse rows within each batch using Arrow's take kernel

let indices = UInt32Array::from_iter_values((0..num_rows as u32).rev());
take(column, &indices, None)

Stage 2 - Batch Reversal: Reverse the order of batches within each row group

reversed_batches.into_iter().rev()

Key Properties:

Bounded Memory: Buffers at most one row group at a time (typically ~128MB with default Parquet writer settings)
Progressive Output: Outputs reversed batches immediately after each row group completes
Limit Support: Unified implementation handles both limited and unlimited scans
- With limit: Stops processing when limit is reached, avoiding unnecessary I/O
- Without limit: Processes entire file in reverse order
Metrics: Tracks row_groups_reversed, batches_reversed, and reverse_time

4. Physical Optimizer (`reverse_order.rs`)

New ReverseOrder optimization rule
Detects patterns where reversing the input satisfies sort requirements:
- SortExec with reversible input ordering
- GlobalLimitExec -> SortExec patterns (most beneficial case)
Uses TreeNodeRewriter to push reverse flag down to ParquetSource
Single-partition check: Only pushes limit to single-partition DataSourceExec to avoid correctness issues with multi-partition scans
Preserves correctness by checking:
- Input ordering compatibility
- Required input ordering constraints
- Ordering preservation properties

Why Row-Group-Level Buffering?

Row group buffering is necessary for correctness:

Parquet Structure: Files are organized into independent row groups (typically ~128MB with default settings)
Batch Boundaries: The parquet reader's batches may not align with row group boundaries
Correct Ordering: We must ensure complete row groups are reversed to maintain semantic correctness

This is the minimal buffering granularity that ensures correct results while still being compatible with arrow-rs's existing parquet reader architecture.

Memory Characteristics:

Maximum memory: Size of largest row group (typically ~128MB with default Parquet writer settings)
Not O(file_size), but O(row_group_size)
Acceptable trade-off for elimination of full-file sort operation

Why this is necessary:

Parquet batches don't align with row group boundaries
Must buffer complete row groups to ensure correct ordering
This is the minimal buffering granularity for correctness

Future Optimization: Page-level reverse scanning in arrow-rs could further reduce memory usage and improve latency by eliminating row-group buffering entirely.

What changes are included in this PR?

Core Implementation:
- ParquetSource: Added reverse scan flag and methods
- ParquetOpener: Row group reversal and stream creation logic
- ReversedParquetStream: Unified stream implementation with optional limit support
Physical Optimization:
- ReverseOrder: New optimizer rule for detecting and applying reverse scan optimization
- Pattern matching for SortExec and GlobalLimitExec -> SortExec
- Single-partition validation to ensure optimization is beneficial
Configuration:
- Added enable_reverse_scan config option (default: true)
- SQL and Rust API support
Metrics:
- row_groups_reversed: Count of reversed row groups
- batches_reversed: Count of reversed batches
- reverse_time: Time spent reversing data

Are these changes tested?

Yes, comprehensive tests added:

Unit Tests (opener.rs):

Single batch reversal
Multiple batch reversal
Multiple row group handling
Limit enforcement
Null value handling
ParquetSource flag propagation

Integration Tests (reverse_order.rs):

Sort removal optimization
Limit + Sort pattern optimization
Multi-partition handling (partial optimization with merge)
Nested sort patterns
Edge cases (empty exec, multiple columns, etc.)

SQL Logic Tests (.slt files):

End-to-end query validation
Single-partition reverse scan with multiple row groups
Multi-partition reverse scan with file reversal
Uneven partition handling
Performance comparisons
Correctness verification across various scenarios

Are there any user-facing changes?

New Configuration Option:

datafusion.execution.parquet.enable_reverse_scan (default: true)

Behavioral Changes:

Queries with ORDER BY ... DESC LIMIT N on sorted single-partition Parquet files will automatically use reverse scanning when beneficial
Multi-partition queries benefit from per-partition reverse scanning, though merge is still required
No changes to query results - only performance improvements
New metrics available in query execution metrics

Breaking Changes:

None. This is a purely additive optimization that maintains backward compatibility.

alamb · 2025-11-30T14:36:16Z

run benchmarks

alamb · 2025-11-30T14:36:57Z

run benchmarks

(BTW I am testing out some new scripts: #18115 (comment))

@zhuqi-lucas / @xudong963 let me know if you would like to be added to the whitelist

alamb · 2025-11-30T14:51:08Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing reverse_parquet (86e027c) to 7925735 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-11-30T15:50:31Z

🤖: Benchmark completed

Details

Comparing HEAD and reverse_parquet
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ reverse_parquet ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2758.70 ms │      2664.09 ms │     no change │
│ QQuery 1     │  1306.02 ms │      1225.21 ms │ +1.07x faster │
│ QQuery 2     │  2444.91 ms │      2378.03 ms │     no change │
│ QQuery 3     │  1161.97 ms │      1147.66 ms │     no change │
│ QQuery 4     │  2368.99 ms │      2382.54 ms │     no change │
│ QQuery 5     │ 28439.65 ms │     28840.09 ms │     no change │
│ QQuery 6     │  4264.74 ms │      4118.34 ms │     no change │
│ QQuery 7     │  3897.64 ms │      3818.70 ms │     no change │
└──────────────┴─────────────┴─────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary              ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 46642.64ms │
│ Total Time (reverse_parquet)   │ 46574.65ms │
│ Average Time (HEAD)            │  5830.33ms │
│ Average Time (reverse_parquet) │  5821.83ms │
│ Queries Faster                 │          1 │
│ Queries Slower                 │          0 │
│ Queries with No Change         │          7 │
│ Queries with Failure           │          0 │
└────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ reverse_parquet ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.68 ms │         2.59 ms │     no change │
│ QQuery 1     │    50.49 ms │        48.59 ms │     no change │
│ QQuery 2     │   137.13 ms │       136.98 ms │     no change │
│ QQuery 3     │   161.78 ms │       160.98 ms │     no change │
│ QQuery 4     │  1148.92 ms │      1101.08 ms │     no change │
│ QQuery 5     │  1613.34 ms │      1563.22 ms │     no change │
│ QQuery 6     │     2.22 ms │         2.29 ms │     no change │
│ QQuery 7     │    57.11 ms │        55.10 ms │     no change │
│ QQuery 8     │  1538.73 ms │      1473.96 ms │     no change │
│ QQuery 9     │  1996.44 ms │      1854.46 ms │ +1.08x faster │
│ QQuery 10    │   404.03 ms │       363.43 ms │ +1.11x faster │
│ QQuery 11    │   454.60 ms │       417.14 ms │ +1.09x faster │
│ QQuery 12    │  1447.41 ms │      1387.72 ms │     no change │
│ QQuery 13    │  2271.75 ms │      2141.22 ms │ +1.06x faster │
│ QQuery 14    │  1339.13 ms │      1294.92 ms │     no change │
│ QQuery 15    │  1333.78 ms │      1286.78 ms │     no change │
│ QQuery 16    │  2790.01 ms │      2753.47 ms │     no change │
│ QQuery 17    │  2759.16 ms │      2730.38 ms │     no change │
│ QQuery 18    │  5591.07 ms │      5169.15 ms │ +1.08x faster │
│ QQuery 19    │   126.87 ms │       125.69 ms │     no change │
│ QQuery 20    │  2026.21 ms │      2014.34 ms │     no change │
│ QQuery 21    │  2376.95 ms │      2331.82 ms │     no change │
│ QQuery 22    │  4126.05 ms │      3989.54 ms │     no change │
│ QQuery 23    │ 29485.23 ms │     13017.00 ms │ +2.27x faster │
│ QQuery 24    │   242.02 ms │       220.08 ms │ +1.10x faster │
│ QQuery 25    │   492.25 ms │       486.30 ms │     no change │
│ QQuery 26    │   234.05 ms │       218.64 ms │ +1.07x faster │
│ QQuery 27    │  2939.61 ms │      2848.30 ms │     no change │
│ QQuery 28    │ 24541.56 ms │     23573.96 ms │     no change │
│ QQuery 29    │   949.79 ms │       999.04 ms │  1.05x slower │
│ QQuery 30    │  1406.83 ms │      1355.84 ms │     no change │
│ QQuery 31    │  1454.02 ms │      1426.96 ms │     no change │
│ QQuery 32    │  4805.92 ms │      4840.46 ms │     no change │
│ QQuery 33    │  6107.78 ms │      6571.56 ms │  1.08x slower │
│ QQuery 34    │  6539.51 ms │      6943.50 ms │  1.06x slower │
│ QQuery 35    │  2050.16 ms │      1944.12 ms │ +1.05x faster │
│ QQuery 36    │   123.55 ms │       120.80 ms │     no change │
│ QQuery 37    │    56.58 ms │        52.08 ms │ +1.09x faster │
│ QQuery 38    │   124.23 ms │       122.64 ms │     no change │
│ QQuery 39    │   203.84 ms │       202.28 ms │     no change │
│ QQuery 40    │    42.85 ms │        43.45 ms │     no change │
│ QQuery 41    │    40.69 ms │        40.84 ms │     no change │
│ QQuery 42    │    34.34 ms │        32.52 ms │ +1.06x faster │
└──────────────┴─────────────┴─────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary              ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 115630.69ms │
│ Total Time (reverse_parquet)   │  97465.24ms │
│ Average Time (HEAD)            │   2689.09ms │
│ Average Time (reverse_parquet) │   2266.63ms │
│ Queries Faster                 │          11 │
│ Queries Slower                 │           3 │
│ Queries with No Change         │          29 │
│ Queries with Failure           │           0 │
└────────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ reverse_parquet ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 127.92 ms │       134.33 ms │ 1.05x slower │
│ QQuery 2     │  27.88 ms │        27.86 ms │    no change │
│ QQuery 3     │  32.81 ms │        39.13 ms │ 1.19x slower │
│ QQuery 4     │  28.06 ms │        29.43 ms │    no change │
│ QQuery 5     │  85.52 ms │        88.58 ms │    no change │
│ QQuery 6     │  19.37 ms │        19.72 ms │    no change │
│ QQuery 7     │ 223.00 ms │       222.02 ms │    no change │
│ QQuery 8     │  33.75 ms │        34.75 ms │    no change │
│ QQuery 9     │ 103.59 ms │       102.77 ms │    no change │
│ QQuery 10    │  63.12 ms │        63.66 ms │    no change │
│ QQuery 11    │  17.38 ms │        17.69 ms │    no change │
│ QQuery 12    │  51.73 ms │        50.86 ms │    no change │
│ QQuery 13    │  46.02 ms │        46.16 ms │    no change │
│ QQuery 14    │  13.63 ms │        14.07 ms │    no change │
│ QQuery 15    │  24.20 ms │        25.26 ms │    no change │
│ QQuery 16    │  25.66 ms │        25.67 ms │    no change │
│ QQuery 17    │ 148.71 ms │       153.50 ms │    no change │
│ QQuery 18    │ 277.18 ms │       288.04 ms │    no change │
│ QQuery 19    │  37.84 ms │        38.06 ms │    no change │
│ QQuery 20    │  48.59 ms │        50.59 ms │    no change │
│ QQuery 21    │ 312.91 ms │       334.82 ms │ 1.07x slower │
│ QQuery 22    │  17.59 ms │        18.25 ms │    no change │
└──────────────┴───────────┴─────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary              ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 1766.46ms │
│ Total Time (reverse_parquet)   │ 1825.20ms │
│ Average Time (HEAD)            │   80.29ms │
│ Average Time (reverse_parquet) │   82.96ms │
│ Queries Faster                 │         0 │
│ Queries Slower                 │         3 │
│ Queries with No Change         │        19 │
│ Queries with Failure           │         0 │
└────────────────────────────────┴───────────┘

alamb · 2025-11-30T20:25:07Z

run benchmark clickbench_partitioned

alamb · 2025-11-30T20:28:30Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing reverse_parquet (86e027c) to 7925735 diff using: clickbench_partitioned
Results will be posted here when complete

alamb · 2025-11-30T21:00:11Z

🤖: Benchmark completed

Details

Comparing HEAD and reverse_parquet
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ reverse_parquet ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.48 ms │         2.22 ms │ +1.12x faster │
│ QQuery 1     │    50.16 ms │        50.08 ms │     no change │
│ QQuery 2     │   133.40 ms │       139.58 ms │     no change │
│ QQuery 3     │   164.66 ms │       161.46 ms │     no change │
│ QQuery 4     │  1091.17 ms │      1078.76 ms │     no change │
│ QQuery 5     │  1486.75 ms │      1513.08 ms │     no change │
│ QQuery 6     │     2.30 ms │         2.31 ms │     no change │
│ QQuery 7     │    55.40 ms │        55.05 ms │     no change │
│ QQuery 8     │  1457.64 ms │      1431.16 ms │     no change │
│ QQuery 9     │  1843.33 ms │      1846.58 ms │     no change │
│ QQuery 10    │   393.34 ms │       368.55 ms │ +1.07x faster │
│ QQuery 11    │   443.33 ms │       416.22 ms │ +1.07x faster │
│ QQuery 12    │  1387.03 ms │      1381.09 ms │     no change │
│ QQuery 13    │  2134.87 ms │      2155.86 ms │     no change │
│ QQuery 14    │  1277.23 ms │      1259.77 ms │     no change │
│ QQuery 15    │  1244.84 ms │      1247.28 ms │     no change │
│ QQuery 16    │  2718.87 ms │      2727.31 ms │     no change │
│ QQuery 17    │  2702.24 ms │      2694.43 ms │     no change │
│ QQuery 18    │  5553.93 ms │      5044.15 ms │ +1.10x faster │
│ QQuery 19    │   128.84 ms │       122.91 ms │     no change │
│ QQuery 20    │  2091.73 ms │      1995.54 ms │     no change │
│ QQuery 21    │  2357.94 ms │      2310.57 ms │     no change │
│ QQuery 22    │  4018.00 ms │      3973.11 ms │     no change │
│ QQuery 23    │ 17232.00 ms │     12942.16 ms │ +1.33x faster │
│ QQuery 24    │   229.18 ms │       211.51 ms │ +1.08x faster │
│ QQuery 25    │   501.10 ms │       473.70 ms │ +1.06x faster │
│ QQuery 26    │   238.12 ms │       214.18 ms │ +1.11x faster │
│ QQuery 27    │  2964.51 ms │      2751.22 ms │ +1.08x faster │
│ QQuery 28    │ 23628.96 ms │     23294.09 ms │     no change │
│ QQuery 29    │   978.17 ms │       985.62 ms │     no change │
│ QQuery 30    │  1351.92 ms │      1335.56 ms │     no change │
│ QQuery 31    │  1409.06 ms │      1351.65 ms │     no change │
│ QQuery 32    │  5463.16 ms │      5079.21 ms │ +1.08x faster │
│ QQuery 33    │  6250.86 ms │      5939.28 ms │     no change │
│ QQuery 34    │  6613.53 ms │      6093.67 ms │ +1.09x faster │
│ QQuery 35    │  1952.58 ms │      1962.50 ms │     no change │
│ QQuery 36    │   121.34 ms │       123.10 ms │     no change │
│ QQuery 37    │    51.52 ms │        54.41 ms │  1.06x slower │
│ QQuery 38    │   119.42 ms │       119.16 ms │     no change │
│ QQuery 39    │   197.37 ms │       204.51 ms │     no change │
│ QQuery 40    │    41.48 ms │        42.44 ms │     no change │
│ QQuery 41    │    39.95 ms │        39.71 ms │     no change │
│ QQuery 42    │    32.96 ms │        33.32 ms │     no change │
└──────────────┴─────────────┴─────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary              ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 102156.66ms │
│ Total Time (reverse_parquet)   │  95228.06ms │
│ Average Time (HEAD)            │   2375.74ms │
│ Average Time (reverse_parquet) │   2214.61ms │
│ Queries Faster                 │          11 │
│ Queries Slower                 │           1 │
│ Queries with No Change         │          31 │
│ Queries with Failure           │           0 │
└────────────────────────────────┴─────────────┘

zhuqi-lucas · 2025-12-01T02:28:30Z

run benchmarks

(BTW I am testing out some new scripts: #18115 (comment))

@zhuqi-lucas / @xudong963 let me know if you would like to be added to the whitelist

Amazing @alamb , pretty cool! Appreciate to be added, thanks!

xudong963 · 2025-12-01T04:37:29Z

I'll review another round today

xudong963

I think the PR as the initial version is very close

xudong963 · 2025-12-01T08:18:53Z

datafusion/datasource-parquet/src/opener.rs

+/// - **Latency**: First batch available after reading complete first (reversed) row group
+/// - **Throughput**: Near-native speed with ~5-10% overhead for reversal operations
+/// - **Memory**: O(row_group_size), not O(file_size)
+///   TODO should we support max cache size to limit memory usage further? But if we exceed the cache size we can't reverse properly, so we need to fall back to normal reading?


Could you please open a issue to track and descrip the todo?

Sure @xudong963 , created the follow-up issue:
#19048

xudong963 · 2025-12-01T08:19:33Z

datafusion/datasource-parquet/src/opener.rs

+/// ## Performance Characteristics
+///
+/// - **Latency**: First batch available after reading complete first (reversed) row group
+/// - **Throughput**: Near-native speed with ~5-10% overhead for reversal operations


How did you get the "~5-10% overhead"?

This is from my estimation, i can add create a benchmark for this operation as follow-up.

xudong963 · 2025-12-01T08:20:10Z

datafusion/datasource-parquet/src/opener.rs

+///
+/// - **Latency**: First batch available after reading complete first (reversed) row group
+/// - **Throughput**: Near-native speed with ~5-10% overhead for reversal operations
+/// - **Memory**: O(row_group_size), not O(file_size)


I'm a bit confused about the line, what does it mean?

It means, the memory max size is max row_group memory size, not the file_size.

xudong963 · 2025-12-01T08:42:06Z

datafusion/datasource/src/file_scan_config.rs

+/// Requested: [number ASC]
+/// Reversed:  [number ASC, letter DESC]  ✓ Prefix match!
+/// ```
+fn is_reverse_ordering(


I'm wondering if we could leverage the ordering equivalence to do the check not by string match.

Addressed this in latest PR, thanks!

xudong963 · 2025-12-01T08:58:59Z

datafusion/datasource-parquet/src/opener.rs

+
+                    // Check if current row group is complete
+                    if this.current_rg_rows_read >= this.current_rg_total_rows {
+                        this.finalize_current_row_group();


When finalize_current_row_group() encounters an error (e.g., during reverse_batch), it sets self.pending_error and returns early. However, it is called inside the loop in poll_next. Here does NOT check if pending_error was set.

Addressed this in latest PR, thanks!

xudong963 · 2025-12-01T09:04:17Z

datafusion/datasource-parquet/src/opener.rs

+        return Ok(batch);
+    }
+
+    let indices = UInt32Array::from_iter_values((0..num_rows as u32).rev());


The UInt32Array is allocated for every batch, not sure if it could be a potential cost, if it is, a cache may be good

Good point @xudong963 , i added the cache in latest PR.

zhuqi-lucas · 2025-12-02T04:02:18Z

I think the PR as the initial version is very close

Thank you @xudong963 for review, i will try to address it. And we also need benchmark for this PR to compare the performance with original dynamic topk.

adriangb · 2025-12-02T07:38:37Z

Please don't hold this up for me (I'm on vacation until Thursday) but I would love to review this, it's very exciting work 🚀

zhuqi-lucas · 2025-12-02T08:06:47Z

Please don't hold this up for me (I'm on vacation until Thursday) but I would love to review this, it's very exciting work 🚀

Thank you @adriangb , i am working on the benchmark first.

zhuqi-lucas · 2025-12-02T09:58:14Z

#19042 (comment)

@alamb @xudong963 @2010YOUY01 @adriangb
I got some real data benchmark result for this PR comparing to main branch, it's 30X faster, so above benchmark, i can add more benchmark cases as follow-up.

adriangb

Very cool work!

My main concern is https://github.com/apache/datafusion/pull/18817/files#r2582092389.

My intuition is that this would be end in a better result if we split it into smaller pieces of work:

Establish the high level API for sort pushdown and the optimizer rule. Only re-arrange files and return Inexact.
Implement scan reversal for Parquet.
Refactor ownership of ordering information from FileScanConfig into PartitionedFile and FileGroup.
Any followups to reduce or cap memory use, etc.

adriangb · 2025-12-02T16:56:31Z

datafusion/common/src/config.rs

+
+        /// Enable sort pushdown optimization for sorted Parquet files.
+        /// Currently, this optimization only has reverse order support.
+        /// When a query requires ordering that can be satisfied by reversing
+        /// the file's natural ordering, row groups and batches are read in
+        /// reverse order to eliminate sort operations.
+        /// Note: This buffers one row group at a time (typically ~128MB).
+        /// Default: true


Can this support non-reverse cases where the query order is the same as the file order? i.e. can we eliminate sorts in that simpler case as well? Or does that already happen / was already implemented?

adriangb · 2025-12-02T16:57:22Z

datafusion/datasource-parquet/src/source.rs

+    pub fn with_reverse_scan(mut self, reverse_scan: bool) -> Self {
+        self.reverse_scan = reverse_scan;
+        self
+    }


Are these used from production code (the optimizers, etc.) or only from our tests?

The similar code already in our production for very long time, and it works well.

adriangb · 2025-12-02T16:59:14Z

datafusion/datasource-parquet/src/source.rs

+        // Note: We ignore the specific `order` parameter here because the decision
+        // about whether we can reverse is made at the FileScanConfig level.
+        // This method creates a reversed version of the current ParquetSource,
+        // and the FileScanConfig will reverse both the file list and the declared ordering.


Just to clarify: since FileScanConfig is called first we are assuming that if we were called the intention is to reverse the order? I would rather verify this / decouple these two things. That is I'd rather parse the sort order here and do whatever we need to do internally to satisfy it (or respond None if we can't) instead of relying on FileScanConfig to do it for us.

adriangb · 2025-12-02T16:59:46Z

datafusion/datasource-parquet/src/source.rs

+        // - batch_size
+        // - table_parquet_options
+        // - all other settings
+        let new_source = self.clone().with_reverse_scan(true);


If this is the only place with_reverse_scan is used I suggest we set the private field directly to avoid polluting the public API.

adriangb · 2025-12-02T17:00:06Z

datafusion/datasource-parquet/src/source.rs

+        let schema = Arc::new(Schema::empty());
+        let source = ParquetSource::new(schema);
+
+        assert!(!source.reverse_scan());


We could just access the private field here since it's in the same module.

adriangb · 2025-12-02T17:02:15Z

datafusion/datasource/src/file.rs

+    fn try_pushdown_sort(
+        &self,
+        _order: &[PhysicalSortExpr],
+    ) -> Result<Option<Arc<dyn FileSource>>> {


I think it might be nice to make this an enum instead of an option. Specifically:

pub enum SortOrderPushdownResult<T> { Exact { inner: T }, Inexact { inner: T }, Unsupported, }

The idea being that Inexact means "I changed myself to better satisfy the join order, but cannot guarantee it's perfectly sorted (e.g. only re-arranged files and row groups based on stats but not the actual data stream).

This matters because e.g. Inexact would already be a huge win for TopK + dynamic filter pushdown.

This is great idea!

adriangb · 2025-12-02T17:06:43Z

datafusion/datasource/src/file_scan_config.rs

+        &self,
+        order: &[PhysicalSortExpr],
+    ) -> Result<Option<Arc<dyn DataSource>>> {
+        let current_ordering = match self.output_ordering.first() {


Nit for a bigger picture plan: I think we should re-arrange things such that the ordering of each file is recorded in PartitionedFile and any known ordering of groups in FileGroup. Then FileScanConfig should calculate it's output ordering from that instead of being given one. And if sort pushdown is requested then FileScanConfig can:

Try to re-arrange the order by reversing groups, by re-creating groups entirely using stats. This is enough for Inexact in https://github.com/apache/datafusion/pull/18817/files#r2582092389.

Try to push down the preferred order to the FileSource which determines if the order of reading a given file can be reversed. Maybe it needs a reference to the files since it has to handle row group order i.e. Parquet specific stuff? But if it is able to reverse the order of the scans then the whole result becomes Exact.

adriangb · 2025-12-02T17:09:13Z

datafusion/physical-plan/src/execution_plan.rs

+    fn try_pushdown_sort(
+        &self,
+        _order: &[PhysicalSortExpr],
+    ) -> Result<Option<Arc<dyn ExecutionPlan>>> {


I similarly think there should be the option to return a new plan that is optimized for the order but doesn't satisfy it enough to remove the sort (https://github.com/apache/datafusion/pull/18817/files#r2582092389)

zhuqi-lucas · 2025-12-03T03:12:35Z

Very cool work!

My main concern is https://github.com/apache/datafusion/pull/18817/files#r2582092389.

My intuition is that this would be end in a better result if we split it into smaller pieces of work:

Establish the high level API for sort pushdown and the optimizer rule. Only re-arrange files and return Inexact.

Implement scan reversal for Parquet.

Refactor ownership of ordering information from FileScanConfig into PartitionedFile and FileGroup.

Any followups to reduce or cap memory use, etc.

Thank you @adriangb for review, great idea, i will try to address this today.

zhuqi-lucas · 2025-12-03T05:38:05Z

@adriangb Created the sub-task, i will do the first step in another PR:
#19059

zhuqi-lucas · 2025-12-03T09:50:45Z

cc @2010YOUY01 @adriangb @alamb @xudong963
I got the performance result for row group reverse with topk dynamic pushdown, PR, and the PR is ready for review:

     Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench clickbench --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data --sorted-by EventTime --sort-order ASC -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_19059/data_sorted_clickbench.json`
Running benchmarks with the following options: RunOpt { query: None, pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet", queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/issue_19059/data_sorted_clickbench.json"), sorted_by: Some("EventTime"), sort_order: "ASC" }
⚠️  Forcing target_partitions=1 to preserve sort order
⚠️  (Because we want to get the pure performance benefit of sorted data to compare)
📊 Session config target_partitions: 1
Registering table with sort order: EventTime ASC
Executing: CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet' WITH ORDER ("EventTime" ASC)
Q0: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true
SELECT * FROM hits ORDER BY "EventTime" DESC limit 10;

Query 0 iteration 0 took 21.5 ms and returned 10 rows
Query 0 iteration 1 took 12.7 ms and returned 10 rows
Query 0 iteration 2 took 11.1 ms and returned 10 rows
Query 0 iteration 3 took 10.6 ms and returned 10 rows
Query 0 iteration 4 took 9.9 ms and returned 10 rows
Query 0 avg time: 13.17 ms
+ set +x
Done

So it's very close to reverse parquet implementation:

The main branch result is:
300ms

The reverse parquet implementation this PR is:
9.8ms

The reverse row group with dynamic topk is #19064:
13ms

alamb · 2025-12-03T21:11:27Z

I plan to try and re-review this in detail tomorrow

alamb · 2025-12-03T21:14:51Z

Amazing @alamb , pretty cool! Appreciate to be added, thanks!

Added in alamb/datafusion-benchmarking@c9f25c4

zhuqi-lucas · 2025-12-04T02:57:05Z

I plan to try and re-review this in detail tomorrow

Amazing @alamb , pretty cool! Appreciate to be added, thanks!

Added in alamb/datafusion-benchmarking@c9f25c4

Thank you @alamb!

github-actions bot added optimizer Optimizer rules datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt) physical-expr Changes to the physical-expr crates labels Nov 19, 2025

reverse parquet draft version

e8588b1

zhuqi-lucas force-pushed the reverse_parquet branch from 3c23790 to e8588b1 Compare November 20, 2025 06:10

zhuqi-lucas and others added 8 commits November 20, 2025 21:34

Support limit pushdown for reverse scan

2cf2e31

Support row group level cache

2f73a4a

Merge branch 'main' into reverse_parquet

f546a7f

fix

07ef9a2

fmt

c123a37

add more test

46cfd89

Add metrics for reverse scan row groups.

99e50de

fix

dbcf598

github-actions bot added the core Core DataFusion crate label Nov 21, 2025

zhuqi-lucas and others added 2 commits November 21, 2025 21:26

Merge branch 'main' into reverse_parquet

12f74d6

optimize code

2cfd73e

zhuqi-lucas mentioned this pull request Nov 21, 2025

Fast parquet order inversion #17172

Open

zhuqi-lucas and others added 3 commits November 21, 2025 23:38

Merge branch 'main' into reverse_parquet

3479672

Add more comments

98fac46

fmt

92b6487

zhuqi-lucas changed the title ~~Draft Reverse parquet~~ Support reverse Parquet Scan and fast parquet order inversion at row group level Nov 21, 2025

zhuqi-lucas changed the title ~~Support reverse Parquet Scan and fast parquet order inversion at row group level~~ Support reverse parquet scan and fast parquet order inversion at row group level Nov 21, 2025

zhuqi-lucas and others added 2 commits November 22, 2025 00:12

Merge branch 'main' into reverse_parquet

475bf3d

add enable/disable option

8ccca52

github-actions bot added common Related to common crate execution Related to the execution crate proto Related to proto crate labels Nov 22, 2025

zhuqi-lucas and others added 2 commits November 22, 2025 10:49

Merge branch 'main' into reverse_parquet

5d63557

fix slt

a2b44a5

remove datasource from optimizer

86e027c

xudong963 reviewed Dec 1, 2025

View reviewed changes

alamb mentioned this pull request Dec 1, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-12-01 #19016

Open

34 tasks

adriangb mentioned this pull request Dec 1, 2025

Row group limit pruning #18868

Open

zhuqi-lucas mentioned this pull request Dec 2, 2025

Add sorted data benchmark. #19042

Open

Merge branch 'main' into reverse_parquet

da4586b

zhuqi-lucas mentioned this pull request Dec 2, 2025

Follow up improvement for reverse parquet when row group size is huge, so it may cause OOM. #19048

Open

adriangb requested changes Dec 2, 2025

View reviewed changes

address new comments

354a82b

zhuqi-lucas mentioned this pull request Dec 3, 2025

Phase 1 Establish the high level API for sort pushdown and the optimizer rule. Only re-arrange files and row groups and return Inexact. #19059

Open

zhuqi-lucas mentioned this pull request Dec 3, 2025

Establish the high level API for sort pushdown and the optimizer rule and support reverse files and row groups #19064

Open

Support reverse parquet scan and fast parquet order inversion at row group level #18817

Are you sure you want to change the base?

Support reverse parquet scan and fast parquet order inversion at row group level #18817

Conversation

zhuqi-lucas commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Overview

Rationale for this change

Motivation

Scope and Limitations

Configuration

Implementation Details

Architecture

1. ParquetSource API (source.rs)

2. ParquetOpener (opener.rs)

3. ReversedParquetStream (opener.rs)

4. Physical Optimizer (reverse_order.rs)

Why Row-Group-Level Buffering?

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

alamb commented Nov 30, 2025

Uh oh!

alamb commented Nov 30, 2025

Uh oh!

alamb commented Nov 30, 2025

Uh oh!

alamb commented Nov 30, 2025

Uh oh!

alamb commented Nov 30, 2025

Uh oh!

alamb commented Nov 30, 2025

Uh oh!

alamb commented Nov 30, 2025

Uh oh!

zhuqi-lucas commented Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xudong963 commented Dec 1, 2025

Uh oh!

xudong963 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhuqi-lucas commented Dec 2, 2025

Uh oh!

adriangb commented Dec 2, 2025

Uh oh!

zhuqi-lucas commented Dec 2, 2025

Uh oh!

zhuqi-lucas commented Dec 2, 2025

Uh oh!

adriangb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

zhuqi-lucas commented Nov 19, 2025 •

edited

Loading

1. ParquetSource API (`source.rs`)

2. ParquetOpener (`opener.rs`)

3. ReversedParquetStream (`opener.rs`)

4. Physical Optimizer (`reverse_order.rs`)

zhuqi-lucas commented Dec 1, 2025 •

edited

Loading

adriangb left a comment •

edited

Loading

zhuqi-lucas commented Dec 3, 2025 •

edited

Loading

zhuqi-lucas commented Dec 4, 2025 •

edited

Loading