Skip to content

Conversation

@zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Nov 19, 2025

Which issue does this PR close?

Closes #17172

Overview

This PR implements reverse scanning for Parquet files to optimize ORDER BY ... DESC LIMIT N queries on sorted data. When DataFusion detects that reversing the scan order would eliminate the need for a separate sort operation, it can now directly read Parquet files in reverse order.

Implementation Note: This PR implements Part 1 of the vision outlined in #17172 (Order Inversion at the DataFusion level).

Current implementation:

  • ✅ Row-group-level reversal (this PR)
  • ✅ Memory bounded by row group size (typically ~128MB)
  • ✅ Significant performance improvements for common use cases

Future improvements (requires arrow-rs changes):

  • Page-level reverse decoding (would reduce memory to ~1MB per page)
  • In-place array flipping (would eliminate take kernel overhead)
  • See issue Fast parquet order inversion #17172 for full details

These enhancements would further optimize memory usage and latency, but the current implementation already provides substantial benefits for most workloads.

Rationale for this change

Motivation

Currently, queries like SELECT * FROM table ORDER BY sorted_column DESC LIMIT 100 require DataFusion to:

  1. Read the entire file in forward order
  2. Sort/reverse all results in memory
  3. Apply the limit

For files that are already sorted in ascending order, this is inefficient. With this optimization, DataFusion can:

  1. Read row groups in reverse order
  2. Reverse individual batches progressively
  3. Stream results directly without a separate sort operation
  4. Stop reading early when the limit is reached (single-partition case)

Performance Benefits:

  • Eliminates memory-intensive sort operations for large files
  • Enables streaming for reverse-order queries with limits
  • Reduces query latency significantly for sorted data
  • Reduces I/O by stopping early when limit is satisfied (single-partition)

Scope and Limitations

This optimization applies to:

  • ✅ Single-partition scans (most common case for sorted Parquet files) - Full optimization: sort completely eliminated
  • ✅ Multi-partition scans - Partial optimization: each partition scans in reverse, but SortPreservingMerge is still required
  • ✅ Queries with ORDER BY ... DESC on pre-sorted columns
  • ✅ Queries with LIMIT clauses (most beneficial for single-partition)

This optimization does NOT apply to:

  • ❌ Unsorted files - No benefit from reverse scanning
  • ❌ Complex sort expressions that don't match file ordering

Single-partition vs Multi-partition:

Scenario Optimization Effect Why
Single-partition Sort operation completely eliminated No merge needed, direct reverse streaming with limit pushdown
Multi-partition Per-partition sorts eliminated, but merge still required Each partition reads in reverse (eliminating per-partition sorts), but SortPreservingMergeExec is needed to combine streams. Limit cannot be pushed to individual partitions.

Performance comparison:

  • Single-partition: ORDER BY DESC LIMIT N → Direct reverse scan with limit pushed down to DataSource
  • Multi-partition: ORDER BY DESC LIMIT N → Reverse scan per partition + LocalLimitExec + SortPreservingMergeExec

While multi-partition scans still require a merge operation, they benefit significantly from:

  • Elimination of per-partition sort operations
  • Parallel reverse scanning across partitions
  • Reduced data volume entering the merge stage via LocalLimitExec

Configuration

This optimization is enabled by default but can be controlled via:

SQL:

SET datafusion.execution.parquet.enable_reverse_scan = true/false;

Rust API:

let ctx = SessionContext::new()
    .with_config(
        SessionConfig::new()
            .with_parquet_reverse_scan(false)  // Disable optimization
    );

When to disable:

  • If your Parquet files have very large row groups (> 256MB) and memory is constrained (row group buffering required for correctness)
  • For debugging or performance comparison purposes
  • If you observe unexpected behavior (please report as a bug!)

Default: Enabled (true)

Implementation Details

Architecture

The implementation consists of four main components:

1. ParquetSource API (source.rs)

  • Added reverse_scan: bool field to ParquetSource
  • Added with_reverse_scan() and reverse_scan() methods
  • The flag is propagated through the file scan configuration

2. ParquetOpener (opener.rs)

  • Added reverse_scan: bool field
  • Row Group Reversal: Before building the stream, row group indices are reversed: row_group_indexes.reverse()
  • Stream Selection: Based on reverse_scan flag, creates either:
    • Normal stream: RecordBatchStreamAdapter
    • Reverse stream: ReversedParquetStream with row-group-level buffering

3. ReversedParquetStream (opener.rs)

A custom stream implementation that performs two-stage reversal with optional limit support:

Stage 1 - Row Reversal: Reverse rows within each batch using Arrow's take kernel

let indices = UInt32Array::from_iter_values((0..num_rows as u32).rev());
take(column, &indices, None)

Stage 2 - Batch Reversal: Reverse the order of batches within each row group

reversed_batches.into_iter().rev()

Key Properties:

  • Bounded Memory: Buffers at most one row group at a time (typically ~128MB with default Parquet writer settings)
  • Progressive Output: Outputs reversed batches immediately after each row group completes
  • Limit Support: Unified implementation handles both limited and unlimited scans
    • With limit: Stops processing when limit is reached, avoiding unnecessary I/O
    • Without limit: Processes entire file in reverse order
  • Metrics: Tracks row_groups_reversed, batches_reversed, and reverse_time

4. Physical Optimizer (reverse_order.rs)

  • New ReverseOrder optimization rule
  • Detects patterns where reversing the input satisfies sort requirements:
    • SortExec with reversible input ordering
    • GlobalLimitExec -> SortExec patterns (most beneficial case)
  • Uses TreeNodeRewriter to push reverse flag down to ParquetSource
  • Single-partition check: Only pushes limit to single-partition DataSourceExec to avoid correctness issues with multi-partition scans
  • Preserves correctness by checking:
    • Input ordering compatibility
    • Required input ordering constraints
    • Ordering preservation properties

Why Row-Group-Level Buffering?

Row group buffering is necessary for correctness:

  1. Parquet Structure: Files are organized into independent row groups (typically ~128MB with default settings)
  2. Batch Boundaries: The parquet reader's batches may not align with row group boundaries
  3. Correct Ordering: We must ensure complete row groups are reversed to maintain semantic correctness

This is the minimal buffering granularity that ensures correct results while still being compatible with arrow-rs's existing parquet reader architecture.

Memory Characteristics:

  • Maximum memory: Size of largest row group (typically ~128MB with default Parquet writer settings)
  • Not O(file_size), but O(row_group_size)
  • Acceptable trade-off for elimination of full-file sort operation

Why this is necessary:

  • Parquet batches don't align with row group boundaries
  • Must buffer complete row groups to ensure correct ordering
  • This is the minimal buffering granularity for correctness

Future Optimization: Page-level reverse scanning in arrow-rs could further reduce memory usage and improve latency by eliminating row-group buffering entirely.

What changes are included in this PR?

  1. Core Implementation:

    • ParquetSource: Added reverse scan flag and methods
    • ParquetOpener: Row group reversal and stream creation logic
    • ReversedParquetStream: Unified stream implementation with optional limit support
  2. Physical Optimization:

    • ReverseOrder: New optimizer rule for detecting and applying reverse scan optimization
    • Pattern matching for SortExec and GlobalLimitExec -> SortExec
    • Single-partition validation to ensure optimization is beneficial
  3. Configuration:

    • Added enable_reverse_scan config option (default: true)
    • SQL and Rust API support
  4. Metrics:

    • row_groups_reversed: Count of reversed row groups
    • batches_reversed: Count of reversed batches
    • reverse_time: Time spent reversing data

Are these changes tested?

Yes, comprehensive tests added:

Unit Tests (opener.rs):

  • Single batch reversal
  • Multiple batch reversal
  • Multiple row group handling
  • Limit enforcement
  • Null value handling
  • ParquetSource flag propagation

Integration Tests (reverse_order.rs):

  • Sort removal optimization
  • Limit + Sort pattern optimization
  • Multi-partition handling (partial optimization with merge)
  • Nested sort patterns
  • Edge cases (empty exec, multiple columns, etc.)

SQL Logic Tests (.slt files):

  • End-to-end query validation
  • Single-partition reverse scan with multiple row groups
  • Multi-partition reverse scan with file reversal
  • Uneven partition handling
  • Performance comparisons
  • Correctness verification across various scenarios

Are there any user-facing changes?

New Configuration Option:

  • datafusion.execution.parquet.enable_reverse_scan (default: true)

Behavioral Changes:

  • Queries with ORDER BY ... DESC LIMIT N on sorted single-partition Parquet files will automatically use reverse scanning when beneficial
  • Multi-partition queries benefit from per-partition reverse scanning, though merge is still required
  • No changes to query results - only performance improvements
  • New metrics available in query execution metrics

Breaking Changes:

  • None. This is a purely additive optimization that maintains backward compatibility.

@github-actions github-actions bot added optimizer Optimizer rules datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt) physical-expr Changes to the physical-expr crates labels Nov 19, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Nov 21, 2025
@zhuqi-lucas zhuqi-lucas changed the title Draft Reverse parquet Support reverse Parquet Scan and fast parquet order inversion at row group level Nov 21, 2025
@zhuqi-lucas zhuqi-lucas changed the title Support reverse Parquet Scan and fast parquet order inversion at row group level Support reverse parquet scan and fast parquet order inversion at row group level Nov 21, 2025
@github-actions github-actions bot added common Related to common crate execution Related to the execution crate proto Related to proto crate labels Nov 22, 2025
@alamb
Copy link
Contributor

alamb commented Nov 30, 2025

run benchmarks

@alamb
Copy link
Contributor

alamb commented Nov 30, 2025

run benchmarks

(BTW I am testing out some new scripts: #18115 (comment))

@zhuqi-lucas / @xudong963 let me know if you would like to be added to the whitelist

@alamb
Copy link
Contributor

alamb commented Nov 30, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing reverse_parquet (86e027c) to 7925735 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Nov 30, 2025

🤖: Benchmark completed

Details

Comparing HEAD and reverse_parquet
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ reverse_parquet ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2758.70 ms │      2664.09 ms │     no change │
│ QQuery 1     │  1306.02 ms │      1225.21 ms │ +1.07x faster │
│ QQuery 2     │  2444.91 ms │      2378.03 ms │     no change │
│ QQuery 3     │  1161.97 ms │      1147.66 ms │     no change │
│ QQuery 4     │  2368.99 ms │      2382.54 ms │     no change │
│ QQuery 5     │ 28439.65 ms │     28840.09 ms │     no change │
│ QQuery 6     │  4264.74 ms │      4118.34 ms │     no change │
│ QQuery 7     │  3897.64 ms │      3818.70 ms │     no change │
└──────────────┴─────────────┴─────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary              ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 46642.64ms │
│ Total Time (reverse_parquet)   │ 46574.65ms │
│ Average Time (HEAD)            │  5830.33ms │
│ Average Time (reverse_parquet) │  5821.83ms │
│ Queries Faster                 │          1 │
│ Queries Slower                 │          0 │
│ Queries with No Change         │          7 │
│ Queries with Failure           │          0 │
└────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ reverse_parquet ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.68 ms │         2.59 ms │     no change │
│ QQuery 1     │    50.49 ms │        48.59 ms │     no change │
│ QQuery 2     │   137.13 ms │       136.98 ms │     no change │
│ QQuery 3     │   161.78 ms │       160.98 ms │     no change │
│ QQuery 4     │  1148.92 ms │      1101.08 ms │     no change │
│ QQuery 5     │  1613.34 ms │      1563.22 ms │     no change │
│ QQuery 6     │     2.22 ms │         2.29 ms │     no change │
│ QQuery 7     │    57.11 ms │        55.10 ms │     no change │
│ QQuery 8     │  1538.73 ms │      1473.96 ms │     no change │
│ QQuery 9     │  1996.44 ms │      1854.46 ms │ +1.08x faster │
│ QQuery 10    │   404.03 ms │       363.43 ms │ +1.11x faster │
│ QQuery 11    │   454.60 ms │       417.14 ms │ +1.09x faster │
│ QQuery 12    │  1447.41 ms │      1387.72 ms │     no change │
│ QQuery 13    │  2271.75 ms │      2141.22 ms │ +1.06x faster │
│ QQuery 14    │  1339.13 ms │      1294.92 ms │     no change │
│ QQuery 15    │  1333.78 ms │      1286.78 ms │     no change │
│ QQuery 16    │  2790.01 ms │      2753.47 ms │     no change │
│ QQuery 17    │  2759.16 ms │      2730.38 ms │     no change │
│ QQuery 18    │  5591.07 ms │      5169.15 ms │ +1.08x faster │
│ QQuery 19    │   126.87 ms │       125.69 ms │     no change │
│ QQuery 20    │  2026.21 ms │      2014.34 ms │     no change │
│ QQuery 21    │  2376.95 ms │      2331.82 ms │     no change │
│ QQuery 22    │  4126.05 ms │      3989.54 ms │     no change │
│ QQuery 23    │ 29485.23 ms │     13017.00 ms │ +2.27x faster │
│ QQuery 24    │   242.02 ms │       220.08 ms │ +1.10x faster │
│ QQuery 25    │   492.25 ms │       486.30 ms │     no change │
│ QQuery 26    │   234.05 ms │       218.64 ms │ +1.07x faster │
│ QQuery 27    │  2939.61 ms │      2848.30 ms │     no change │
│ QQuery 28    │ 24541.56 ms │     23573.96 ms │     no change │
│ QQuery 29    │   949.79 ms │       999.04 ms │  1.05x slower │
│ QQuery 30    │  1406.83 ms │      1355.84 ms │     no change │
│ QQuery 31    │  1454.02 ms │      1426.96 ms │     no change │
│ QQuery 32    │  4805.92 ms │      4840.46 ms │     no change │
│ QQuery 33    │  6107.78 ms │      6571.56 ms │  1.08x slower │
│ QQuery 34    │  6539.51 ms │      6943.50 ms │  1.06x slower │
│ QQuery 35    │  2050.16 ms │      1944.12 ms │ +1.05x faster │
│ QQuery 36    │   123.55 ms │       120.80 ms │     no change │
│ QQuery 37    │    56.58 ms │        52.08 ms │ +1.09x faster │
│ QQuery 38    │   124.23 ms │       122.64 ms │     no change │
│ QQuery 39    │   203.84 ms │       202.28 ms │     no change │
│ QQuery 40    │    42.85 ms │        43.45 ms │     no change │
│ QQuery 41    │    40.69 ms │        40.84 ms │     no change │
│ QQuery 42    │    34.34 ms │        32.52 ms │ +1.06x faster │
└──────────────┴─────────────┴─────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary              ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 115630.69ms │
│ Total Time (reverse_parquet)   │  97465.24ms │
│ Average Time (HEAD)            │   2689.09ms │
│ Average Time (reverse_parquet) │   2266.63ms │
│ Queries Faster                 │          11 │
│ Queries Slower                 │           3 │
│ Queries with No Change         │          29 │
│ Queries with Failure           │           0 │
└────────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ reverse_parquet ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 127.92 ms │       134.33 ms │ 1.05x slower │
│ QQuery 2     │  27.88 ms │        27.86 ms │    no change │
│ QQuery 3     │  32.81 ms │        39.13 ms │ 1.19x slower │
│ QQuery 4     │  28.06 ms │        29.43 ms │    no change │
│ QQuery 5     │  85.52 ms │        88.58 ms │    no change │
│ QQuery 6     │  19.37 ms │        19.72 ms │    no change │
│ QQuery 7     │ 223.00 ms │       222.02 ms │    no change │
│ QQuery 8     │  33.75 ms │        34.75 ms │    no change │
│ QQuery 9     │ 103.59 ms │       102.77 ms │    no change │
│ QQuery 10    │  63.12 ms │        63.66 ms │    no change │
│ QQuery 11    │  17.38 ms │        17.69 ms │    no change │
│ QQuery 12    │  51.73 ms │        50.86 ms │    no change │
│ QQuery 13    │  46.02 ms │        46.16 ms │    no change │
│ QQuery 14    │  13.63 ms │        14.07 ms │    no change │
│ QQuery 15    │  24.20 ms │        25.26 ms │    no change │
│ QQuery 16    │  25.66 ms │        25.67 ms │    no change │
│ QQuery 17    │ 148.71 ms │       153.50 ms │    no change │
│ QQuery 18    │ 277.18 ms │       288.04 ms │    no change │
│ QQuery 19    │  37.84 ms │        38.06 ms │    no change │
│ QQuery 20    │  48.59 ms │        50.59 ms │    no change │
│ QQuery 21    │ 312.91 ms │       334.82 ms │ 1.07x slower │
│ QQuery 22    │  17.59 ms │        18.25 ms │    no change │
└──────────────┴───────────┴─────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary              ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 1766.46ms │
│ Total Time (reverse_parquet)   │ 1825.20ms │
│ Average Time (HEAD)            │   80.29ms │
│ Average Time (reverse_parquet) │   82.96ms │
│ Queries Faster                 │         0 │
│ Queries Slower                 │         3 │
│ Queries with No Change         │        19 │
│ Queries with Failure           │         0 │
└────────────────────────────────┴───────────┘

@alamb
Copy link
Contributor

alamb commented Nov 30, 2025

run benchmark clickbench_partitioned

@alamb
Copy link
Contributor

alamb commented Nov 30, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing reverse_parquet (86e027c) to 7925735 diff using: clickbench_partitioned
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Nov 30, 2025

🤖: Benchmark completed

Details

Comparing HEAD and reverse_parquet
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ reverse_parquet ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.48 ms │         2.22 ms │ +1.12x faster │
│ QQuery 1     │    50.16 ms │        50.08 ms │     no change │
│ QQuery 2     │   133.40 ms │       139.58 ms │     no change │
│ QQuery 3     │   164.66 ms │       161.46 ms │     no change │
│ QQuery 4     │  1091.17 ms │      1078.76 ms │     no change │
│ QQuery 5     │  1486.75 ms │      1513.08 ms │     no change │
│ QQuery 6     │     2.30 ms │         2.31 ms │     no change │
│ QQuery 7     │    55.40 ms │        55.05 ms │     no change │
│ QQuery 8     │  1457.64 ms │      1431.16 ms │     no change │
│ QQuery 9     │  1843.33 ms │      1846.58 ms │     no change │
│ QQuery 10    │   393.34 ms │       368.55 ms │ +1.07x faster │
│ QQuery 11    │   443.33 ms │       416.22 ms │ +1.07x faster │
│ QQuery 12    │  1387.03 ms │      1381.09 ms │     no change │
│ QQuery 13    │  2134.87 ms │      2155.86 ms │     no change │
│ QQuery 14    │  1277.23 ms │      1259.77 ms │     no change │
│ QQuery 15    │  1244.84 ms │      1247.28 ms │     no change │
│ QQuery 16    │  2718.87 ms │      2727.31 ms │     no change │
│ QQuery 17    │  2702.24 ms │      2694.43 ms │     no change │
│ QQuery 18    │  5553.93 ms │      5044.15 ms │ +1.10x faster │
│ QQuery 19    │   128.84 ms │       122.91 ms │     no change │
│ QQuery 20    │  2091.73 ms │      1995.54 ms │     no change │
│ QQuery 21    │  2357.94 ms │      2310.57 ms │     no change │
│ QQuery 22    │  4018.00 ms │      3973.11 ms │     no change │
│ QQuery 23    │ 17232.00 ms │     12942.16 ms │ +1.33x faster │
│ QQuery 24    │   229.18 ms │       211.51 ms │ +1.08x faster │
│ QQuery 25    │   501.10 ms │       473.70 ms │ +1.06x faster │
│ QQuery 26    │   238.12 ms │       214.18 ms │ +1.11x faster │
│ QQuery 27    │  2964.51 ms │      2751.22 ms │ +1.08x faster │
│ QQuery 28    │ 23628.96 ms │     23294.09 ms │     no change │
│ QQuery 29    │   978.17 ms │       985.62 ms │     no change │
│ QQuery 30    │  1351.92 ms │      1335.56 ms │     no change │
│ QQuery 31    │  1409.06 ms │      1351.65 ms │     no change │
│ QQuery 32    │  5463.16 ms │      5079.21 ms │ +1.08x faster │
│ QQuery 33    │  6250.86 ms │      5939.28 ms │     no change │
│ QQuery 34    │  6613.53 ms │      6093.67 ms │ +1.09x faster │
│ QQuery 35    │  1952.58 ms │      1962.50 ms │     no change │
│ QQuery 36    │   121.34 ms │       123.10 ms │     no change │
│ QQuery 37    │    51.52 ms │        54.41 ms │  1.06x slower │
│ QQuery 38    │   119.42 ms │       119.16 ms │     no change │
│ QQuery 39    │   197.37 ms │       204.51 ms │     no change │
│ QQuery 40    │    41.48 ms │        42.44 ms │     no change │
│ QQuery 41    │    39.95 ms │        39.71 ms │     no change │
│ QQuery 42    │    32.96 ms │        33.32 ms │     no change │
└──────────────┴─────────────┴─────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary              ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 102156.66ms │
│ Total Time (reverse_parquet)   │  95228.06ms │
│ Average Time (HEAD)            │   2375.74ms │
│ Average Time (reverse_parquet) │   2214.61ms │
│ Queries Faster                 │          11 │
│ Queries Slower                 │           1 │
│ Queries with No Change         │          31 │
│ Queries with Failure           │           0 │
└────────────────────────────────┴─────────────┘

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Dec 1, 2025

run benchmarks

(BTW I am testing out some new scripts: #18115 (comment))

@zhuqi-lucas / @xudong963 let me know if you would like to be added to the whitelist

Amazing @alamb , pretty cool! Appreciate to be added, thanks!

@xudong963
Copy link
Member

I'll review another round today

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the PR as the initial version is very close

/// - **Latency**: First batch available after reading complete first (reversed) row group
/// - **Throughput**: Near-native speed with ~5-10% overhead for reversal operations
/// - **Memory**: O(row_group_size), not O(file_size)
/// TODO should we support max cache size to limit memory usage further? But if we exceed the cache size we can't reverse properly, so we need to fall back to normal reading?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please open a issue to track and descrip the todo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure @xudong963 , created the follow-up issue:
#19048

/// ## Performance Characteristics
///
/// - **Latency**: First batch available after reading complete first (reversed) row group
/// - **Throughput**: Near-native speed with ~5-10% overhead for reversal operations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you get the "~5-10% overhead"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is from my estimation, i can add create a benchmark for this operation as follow-up.

///
/// - **Latency**: First batch available after reading complete first (reversed) row group
/// - **Throughput**: Near-native speed with ~5-10% overhead for reversal operations
/// - **Memory**: O(row_group_size), not O(file_size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about the line, what does it mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means, the memory max size is max row_group memory size, not the file_size.

/// Requested: [number ASC]
/// Reversed: [number ASC, letter DESC] ✓ Prefix match!
/// ```
fn is_reverse_ordering(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we could leverage the ordering equivalence to do the check not by string match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this in latest PR, thanks!


// Check if current row group is complete
if this.current_rg_rows_read >= this.current_rg_total_rows {
this.finalize_current_row_group();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When finalize_current_row_group() encounters an error (e.g., during reverse_batch), it sets self.pending_error and returns early. However, it is called inside the loop in poll_next. Here does NOT check if pending_error was set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this in latest PR, thanks!

return Ok(batch);
}

let indices = UInt32Array::from_iter_values((0..num_rows as u32).rev());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UInt32Array is allocated for every batch, not sure if it could be a potential cost, if it is, a cache may be good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @xudong963 , i added the cache in latest PR.

@zhuqi-lucas
Copy link
Contributor Author

I think the PR as the initial version is very close

Thank you @xudong963 for review, i will try to address it. And we also need benchmark for this PR to compare the performance with original dynamic topk.

@adriangb
Copy link
Contributor

adriangb commented Dec 2, 2025

Please don't hold this up for me (I'm on vacation until Thursday) but I would love to review this, it's very exciting work 🚀

@zhuqi-lucas
Copy link
Contributor Author

Please don't hold this up for me (I'm on vacation until Thursday) but I would love to review this, it's very exciting work 🚀

Thank you @adriangb , i am working on the benchmark first.

@zhuqi-lucas
Copy link
Contributor Author

#19042 (comment)

@alamb @xudong963 @2010YOUY01 @adriangb
I got some real data benchmark result for this PR comparing to main branch, it's 30X faster, so above benchmark, i can add more benchmark cases as follow-up.

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool work!

My main concern is https://github.com/apache/datafusion/pull/18817/files#r2582092389.

My intuition is that this would be end in a better result if we split it into smaller pieces of work:

  1. Establish the high level API for sort pushdown and the optimizer rule. Only re-arrange files and return Inexact.
  2. Implement scan reversal for Parquet.
  3. Refactor ownership of ordering information from FileScanConfig into PartitionedFile and FileGroup.
  4. Any followups to reduce or cap memory use, etc.

Comment on lines +834 to +841

/// Enable sort pushdown optimization for sorted Parquet files.
/// Currently, this optimization only has reverse order support.
/// When a query requires ordering that can be satisfied by reversing
/// the file's natural ordering, row groups and batches are read in
/// reverse order to eliminate sort operations.
/// Note: This buffers one row group at a time (typically ~128MB).
/// Default: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this support non-reverse cases where the query order is the same as the file order? i.e. can we eliminate sorts in that simpler case as well? Or does that already happen / was already implemented?

Comment on lines +494 to +497
pub fn with_reverse_scan(mut self, reverse_scan: bool) -> Self {
self.reverse_scan = reverse_scan;
self
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these used from production code (the optimizers, etc.) or only from our tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The similar code already in our production for very long time, and it works well.

Comment on lines +814 to +817
// Note: We ignore the specific `order` parameter here because the decision
// about whether we can reverse is made at the FileScanConfig level.
// This method creates a reversed version of the current ParquetSource,
// and the FileScanConfig will reverse both the file list and the declared ordering.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify: since FileScanConfig is called first we are assuming that if we were called the intention is to reverse the order? I would rather verify this / decouple these two things. That is I'd rather parse the sort order here and do whatever we need to do internally to satisfy it (or respond None if we can't) instead of relying on FileScanConfig to do it for us.

// - batch_size
// - table_parquet_options
// - all other settings
let new_source = self.clone().with_reverse_scan(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the only place with_reverse_scan is used I suggest we set the private field directly to avoid polluting the public API.

let schema = Arc::new(Schema::empty());
let source = ParquetSource::new(schema);

assert!(!source.reverse_scan());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just access the private field here since it's in the same module.

Comment on lines +149 to +152
fn try_pushdown_sort(
&self,
_order: &[PhysicalSortExpr],
) -> Result<Option<Arc<dyn FileSource>>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be nice to make this an enum instead of an option. Specifically:

pub enum SortOrderPushdownResult<T> {
    Exact { inner: T },
    Inexact { inner: T },
    Unsupported,
}

The idea being that Inexact means "I changed myself to better satisfy the join order, but cannot guarantee it's perfectly sorted (e.g. only re-arranged files and row groups based on stats but not the actual data stream).

This matters because e.g. Inexact would already be a huge win for TopK + dynamic filter pushdown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great idea!

&self,
order: &[PhysicalSortExpr],
) -> Result<Option<Arc<dyn DataSource>>> {
let current_ordering = match self.output_ordering.first() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit for a bigger picture plan: I think we should re-arrange things such that the ordering of each file is recorded in PartitionedFile and any known ordering of groups in FileGroup. Then FileScanConfig should calculate it's output ordering from that instead of being given one. And if sort pushdown is requested then FileScanConfig can:

  1. Try to re-arrange the order by reversing groups, by re-creating groups entirely using stats. This is enough for Inexact in https://github.com/apache/datafusion/pull/18817/files#r2582092389.
  2. Try to push down the preferred order to the FileSource which determines if the order of reading a given file can be reversed. Maybe it needs a reference to the files since it has to handle row group order i.e. Parquet specific stuff? But if it is able to reverse the order of the scans then the whole result becomes Exact.

fn try_pushdown_sort(
&self,
_order: &[PhysicalSortExpr],
) -> Result<Option<Arc<dyn ExecutionPlan>>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I similarly think there should be the option to return a new plan that is optimized for the order but doesn't satisfy it enough to remove the sort (https://github.com/apache/datafusion/pull/18817/files#r2582092389)

@zhuqi-lucas
Copy link
Contributor Author

Very cool work!

My main concern is https://github.com/apache/datafusion/pull/18817/files#r2582092389.

My intuition is that this would be end in a better result if we split it into smaller pieces of work:

  1. Establish the high level API for sort pushdown and the optimizer rule. Only re-arrange files and return Inexact.
  2. Implement scan reversal for Parquet.
  3. Refactor ownership of ordering information from FileScanConfig into PartitionedFile and FileGroup.
  4. Any followups to reduce or cap memory use, etc.

Thank you @adriangb for review, great idea, i will try to address this today.

@zhuqi-lucas
Copy link
Contributor Author

@adriangb Created the sub-task, i will do the first step in another PR:
#19059

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Dec 3, 2025

cc @2010YOUY01 @adriangb @alamb @xudong963
I got the performance result for row group reverse with topk dynamic pushdown, PR, and the PR is ready for review:

     Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench clickbench --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data --sorted-by EventTime --sort-order ASC -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_19059/data_sorted_clickbench.json`
Running benchmarks with the following options: RunOpt { query: None, pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet", queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/issue_19059/data_sorted_clickbench.json"), sorted_by: Some("EventTime"), sort_order: "ASC" }
⚠️  Forcing target_partitions=1 to preserve sort order
⚠️  (Because we want to get the pure performance benefit of sorted data to compare)
📊 Session config target_partitions: 1
Registering table with sort order: EventTime ASC
Executing: CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet' WITH ORDER ("EventTime" ASC)
Q0: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true
SELECT * FROM hits ORDER BY "EventTime" DESC limit 10;

Query 0 iteration 0 took 21.5 ms and returned 10 rows
Query 0 iteration 1 took 12.7 ms and returned 10 rows
Query 0 iteration 2 took 11.1 ms and returned 10 rows
Query 0 iteration 3 took 10.6 ms and returned 10 rows
Query 0 iteration 4 took 9.9 ms and returned 10 rows
Query 0 avg time: 13.17 ms
+ set +x
Done

So it's very close to reverse parquet implementation:

The main branch result is:
300ms

The reverse parquet implementation this PR is:
9.8ms

The reverse row group with dynamic topk is #19064:
13ms

@alamb
Copy link
Contributor

alamb commented Dec 3, 2025

I plan to try and re-review this in detail tomorrow

@alamb
Copy link
Contributor

alamb commented Dec 3, 2025

Amazing @alamb , pretty cool! Appreciate to be added, thanks!

Added in alamb/datafusion-benchmarking@c9f25c4

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Dec 4, 2025

I plan to try and re-review this in detail tomorrow

Amazing @alamb , pretty cool! Appreciate to be added, thanks!

Added in alamb/datafusion-benchmarking@c9f25c4

Thank you @alamb!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation execution Related to the execution crate optimizer Optimizer rules physical-plan Changes to the physical-plan crate proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fast parquet order inversion

5 participants