feat: late materialization of vectors in filtered vector search #5205
Conversation
```rust
/// materialization to avoid fetching vector data for rows that will be filtered out.
/// If the filter selects more rows than this threshold, we do a single scan with vectors
/// to avoid the random access overhead of the take operation.
pub const LATE_MATERIALIZE_SELECTIVITY_THRESHOLD: f64 = 0.005; // 0.5%
```
I think to be conservative we are gonna want to set this to the appropriate value for cloud storage. This will be too low for some kinds of storage, but should be an improvement over current behavior in all cases (which wouldn't be the case if we went too high).
I'm not sure if 0.005 is right, I think the real number may be lower.
also related to this - this mechanism is very blunt. There are ways we can do much better by estimating row widths and incorporating the math on the actual IOs we perform, but I'm not aware we have an approach for that today (essentially a cost model).
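For the record, a toy sketch of the direction such a cost model could take (everything here is a hypothetical placeholder, not anything Lance implements today — constants would need per-backend calibration):

```rust
/// Toy cost model comparing a full scan against scan-filter-then-take.
/// Object storage has high per-request latency; local NVMe does not.
struct IoProfile {
    request_latency_s: f64,  // cost of one random read request
    throughput_bytes_s: f64, // sequential read throughput
}

fn full_scan_cost(io: &IoProfile, total_rows: usize, row_width: usize) -> f64 {
    (total_rows * row_width) as f64 / io.throughput_bytes_s
}

fn late_mat_cost(
    io: &IoProfile,
    total_rows: usize,
    filter_col_width: usize, // estimated width of the filter columns
    vector_width: usize,     // estimated width of the vector column
    matched_rows: usize,     // rows surviving the filter
) -> f64 {
    // Scan the cheap filter columns, then pay roughly one random read per match.
    let scan = (total_rows * filter_col_width) as f64 / io.throughput_bytes_s;
    let take = matched_rows as f64
        * (io.request_latency_s + vector_width as f64 / io.throughput_bytes_s);
    scan + take
}
```

Under a model like that, late materialization wins whenever `late_mat_cost < full_scan_cost`, which recovers the selectivity threshold as the break-even point for a given storage medium.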
on further consideration, I don't think we want this at all. We should just always late materialize here. The reason is that takes and scans are both going to do lots of small object storage requests and at the limit they become pretty indistinguishable. That is why we see convergence in the results across even high selectivities:
```
Selectivity      Rows    Late Mat   Full Scan   Winner     Speedup
------------------------------------------------------------------
  0.01%           100     0.0372s     3.8490s   late_mat   103.55x
  0.05%           500     0.0659s     3.8437s   late_mat    58.36x
  0.10%         1,000     0.1098s     3.7994s   late_mat    34.61x
  0.20%         2,000     0.1987s     3.8193s   late_mat    19.22x
  0.50%         5,000     0.4633s     3.8555s   late_mat     8.32x
  1.00%        10,000     0.7899s     3.8544s   late_mat     4.88x
  2.00%        20,000     1.3052s     3.9145s   late_mat     3.00x
  5.00%        50,000     2.0625s     4.0943s   late_mat     1.99x
 10.00%       100,000     2.5498s     4.4041s   late_mat     1.73x
 20.00%       200,000     4.2165s     5.2398s   late_mat     1.24x
 40.00%       400,000     5.9357s     7.1053s   late_mat     1.20x
 80.00%       800,000     8.1393s    10.8025s   late_mat     1.33x
```
I'll rip this out and simplify the patch a bit.
Pull Request Overview
This PR implements late materialization for filtered vector searches to reduce I/O overhead when filters are highly selective. When executing KNN searches on unindexed data with selective filters, the new approach first scans filter columns to check selectivity, then either uses late materialization (collect row IDs, then fetch vectors) for selective filters or falls back to a full scan for non-selective ones.
Key changes:
- Added adaptive late materialization with configurable selectivity threshold (default 0.5%)
- Implemented early termination when filter selectivity exceeds threshold to avoid collecting all row IDs
- Added stats collection for late materialization strategy decisions
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| rust/lance/src/dataset/scanner.rs | Implements adaptive late materialization logic, adds threshold configuration, updates KNN search paths to use late materialization when appropriate, and adds comprehensive test coverage |
| rust/lance/src/dataset.rs | Removes extraneous None parameters in function calls (unrelated cleanup) |
I have got a script here: https://gist.github.com/wkalt/a40d824f3770dfc46f01bcbbb722804b that analyzes various selectivity thresholds for the storage you point it at. It uses an unindexed table with a scalar column and a 1024d vector column, with 1M rows. I have some local modifications that expose the tuning parameter through the Python SDK for testing; it's unclear to me whether we will actually want those. The queries here are KNN vector searches with a scalar filter.

This is the output it reports on my minio (NVMe-backed over LAN): [output elided]

Here is what it produces on my local NVMe: [output elided]

In both of the instances above, late materialization wins every time, and the speedup trend reverses at the end. I am thinking we probably have some inefficiencies in the full scan path that account for this. I still need to test on slow S3 storage; I was hoping to replicate that with minio but I think it's too fast.
I also have this script, which motivated the investigation: https://gist.github.com/wkalt/aa62e5fc4cb1b90cdaa3c9ffce7099a9

On current lancedb, it outputs this for me: [output elided]

On lancedb patched with this lance change, it produces this: [output elided]

The partially indexed and unindexed cases now use this optimization, and all plans are fast.
```rust
/// # Errors
///
/// Returns an error if the threshold is not finite or is outside the range [0.0, 1.0].
pub fn late_materialize_selectivity_threshold(&mut self, threshold: f64) -> Result<&mut Self> {
```
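For context, a minimal sketch of how a caller might use this from the Rust scanner (the dataset path and filter are made up for illustration; the builder methods are the public lance scanner API plus the new method above):

```rust
use arrow_array::Float32Array;
use lance::dataset::Dataset;

async fn filtered_knn() -> lance::Result<()> {
    let dataset = Dataset::open("/tmp/example.lance").await?; // hypothetical path
    let query = Float32Array::from(vec![0.1_f32; 1024]);
    let mut scanner = dataset.scan();
    scanner
        .filter("category = 'widgets'")? // cheap scalar filter column
        .nearest("vector", &query, 10)?  // KNN over the expensive vector column
        // Give up on late materialization once >1% of rows match the filter.
        .late_materialize_selectivity_threshold(0.01)?;
    let _stream = scanner.try_into_stream().await?;
    Ok(())
}
```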
Should we expose this in the python SDK? I'm not sure if we tend to mirror exactly or not. Similar question for lancedb SDK.
I think it's useful to expose and I have done it locally for testing, but it's also an implementation detail that will likely change as our optimizer becomes more sophisticated.
wjones127
left a comment
Main concern is we should validate filtering on nested columns doesn't break.
rust/lance/src/dataset/scanner.rs (Outdated)
```rust
    vector_column: &str,
    skip_recheck: bool,
) -> Result<Arc<dyn ExecutionPlan>> {
    let mut filter_columns_set: BTreeSet<String> = BTreeSet::new();
```
Column names feel a bit fragile. It might be nice to collect field ids. Does this work if we are filtering on nested columns?
I feel like FilterPlan should have a method fn full_projection(&self) -> Projection or something like that to make this easy.
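Something like the following, perhaps — a sketch assuming the FilterPlan fields and Projection helpers that appear elsewhere in this diff (a dataset parameter is added here because building a Projection seems to need one):

```rust
use std::collections::BTreeSet;

impl FilterPlan {
    /// Hypothetical helper: gather every column referenced by the refine and
    /// full filter expressions into one projection.
    pub fn full_projection(&self, dataset: &Dataset) -> Result<Projection> {
        let mut cols: BTreeSet<String> = BTreeSet::new();
        if let Some(refine_expr) = self.refine_expr.as_ref() {
            cols.extend(Planner::column_names_in_expr(refine_expr));
        }
        if let Some(full_expr) = self.full_expr.as_ref() {
            cols.extend(Planner::column_names_in_expr(full_expr));
        }
        let cols: Vec<String> = cols.into_iter().collect();
        dataset
            .empty_projection()
            .union_columns(&cols, OnMissing::Error)
    }
}
```

That would collapse the repeated dedup-and-union logic in this patch into one call site; tracking field ids instead of names (per the comment above) would also make it robust for nested columns.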
Eventually, I think we want to move this logic within FilteredReadExec. I don't think we want to have create_plan() actually be running part of the plan in the end.
Here's how we handled late materialization in the scan with V1 files:
Codecov Report

❌ Patch coverage is … (percentage elided). Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #5205      +/-   ##
==========================================
+ Coverage   81.98%   82.07%    +0.09%
==========================================
  Files         341      342        +1
  Lines      141199   141709      +510
  Branches   141199   141709      +510
==========================================
+ Hits       115758   116311      +553
+ Misses      21628    21539       -89
- Partials     3813     3859       +46
```

View full report in Codecov by Sentry.
westonpace
left a comment
I like the idea of moving late materialization into the filtered read. It simplifies the scanner logic and it makes sense. This node reads from a source, with a filter. Late materialization is part of that.
Minor nit: this approach makes a lot of sense for small selectivity thresholds. If the selectivity threshold is large (or the dataset is very large) then it seems like we could do a lot of work before we start scanning expensive columns, because we need to read in at least N * s rows (where N is total rows and s is the selectivity threshold). Since we can't return data to the user until we have all columns, this means the time to first batch increases.
What do you think of this alternative approach? It might be more complicated, and potentially prone to filter bias in the beginning / end of the data.
- Read the first 10K rows (the 10K is counted pre-filter here, and we could pick a different number)
- Estimate the selectivity by calculating # of matched rows / 10K
- Compare estimated selectivity with selectivity threshold to determine if we scan or take
The 10K is just some heuristic. Larger values will take more time before we start reading expensive columns, smaller values will be more prone to estimation error and less able to handle very small selectivity thresholds.
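A compact sketch of that heuristic (all names here are hypothetical), to make the trade-off concrete:

```rust
enum Strategy {
    LateMaterialize, // scan cheap columns, then take expensive ones by row id
    FullScan,        // scan all columns in one pass
}

/// Estimate selectivity from a pre-filter prefix of the table and pick a plan.
/// `sample_size` (e.g. 10_000) trades startup latency against estimation error,
/// and the estimate is biased if matches cluster at either end of the table.
fn choose_strategy(sample_matches: usize, sample_size: usize, threshold: f64) -> Strategy {
    let estimated_selectivity = sample_matches as f64 / sample_size as f64;
    if estimated_selectivity < threshold {
        Strategy::LateMaterialize
    } else {
        Strategy::FullScan
    }
}
```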
```python
def late_materialize_selectivity_threshold(
    self, threshold: float
) -> ScannerBuilder:
    """
    Set the selectivity threshold for late materialization in filtered KNN searches.

    When a filter is present in a KNN search, Lance first executes it to measure
    selectivity. If the filter selects fewer than this percentage of rows, Lance
    uses late materialization (scan scalars first, then take vectors for filtered
    rows only). If the filter selects this percentage or more rows, Lance does a
    single scan with both filter and vector columns to avoid the random access
    overhead of the take operation.

    The optimal value depends on your storage medium:

    - **Object storage (S3, GCS, Azure)**: Use a low threshold like 0.005 (0.5%)
      since random access is very expensive
    - **Local SSD**: Can use a higher threshold like 0.05 (5%) since random
      access is cheaper
    - **NVMe**: Can use even higher thresholds like 0.1 (10%)

    The default is 0.005 (0.5%), which is conservative for object storage.

    Parameters
    ----------
    threshold : float
        The selectivity threshold as a fraction (e.g., 0.005 for 0.5%)

    Returns
    -------
    ScannerBuilder
        Returns self for method chaining
    """
    if not isinstance(threshold, (int, float)):
        raise TypeError(
            f"late_materialize_selectivity_threshold must be a number, "
            f"got {type(threshold)}"
        )
    if not (0.0 <= threshold <= 1.0):
        raise ValueError(
            f"late_materialize_selectivity_threshold must be between 0.0 and "
            f"1.0 (inclusive), got {threshold}"
        )
    self._late_materialize_selectivity_threshold = float(threshold)
    return self
```
Can we make this a property of the ObjectStore? We already have fields like io_parallelism, max_iop_size, and block_size which describe how an object store should be interacted with. I think this could be added to the list. Then the defaults will work for most users. Users with a custom storage scenario will need to create a custom object store.
This way, we hide the parameter from most users, and get a better default.
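A sketch of the shape this could take (the struct and numbers below are hypothetical; io_parallelism, max_iop_size, and block_size are the existing fields referenced above):

```rust
/// Hypothetical per-store tuning profile: the existing knobs plus a
/// storage-appropriate late-materialization threshold.
pub struct StoreProfile {
    pub io_parallelism: usize,
    pub max_iop_size: u64,
    pub block_size: usize,
    /// Filter selectivity below which taking expensive columns beats scanning.
    pub late_materialize_selectivity_threshold: f64,
}

/// Illustrative defaults only; real values would come from benchmarks.
pub fn default_threshold_for_scheme(scheme: &str) -> f64 {
    match scheme {
        "s3" | "gs" | "az" => 0.005, // object storage: random reads are costly
        "file" => 0.05,              // local disk: random reads are cheaper
        _ => 0.005,
    }
}
```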
that seems like a great idea, thanks.
```rust
/// Adaptive late materialization for filtered vector scans.
///
/// When a filter is present, this method:
/// 1. Scans with scalar columns first to check selectivity
/// 2. If selective (< threshold): uses late materialization (collect row IDs, then take vectors)
/// 3. If not selective (>= threshold): does a full scan with both filter and vector columns
///
/// This avoids expensive random access for non-selective filters while benefiting
/// from late materialization for selective ones.
async fn adaptive_column_scan(
    &self,
    filter_plan: &FilterPlan,
    frags: Option<Arc<Vec<Fragment>>>,
    take_column: &str,
    skip_recheck: bool,
) -> Result<Arc<dyn ExecutionPlan>> {
    // FilteredRead doesn't support v1/legacy files, so fall back for legacy datasets
    if self.dataset.is_legacy_storage() {
        let mut filter_cols: BTreeSet<String> = BTreeSet::new();
        if let Some(refine_expr) = filter_plan.refine_expr.as_ref() {
            filter_cols.extend(Planner::column_names_in_expr(refine_expr));
        }
        if let Some(full_expr) = filter_plan.full_expr.as_ref() {
            filter_cols.extend(Planner::column_names_in_expr(full_expr));
        }
        let mut filter_cols: Vec<String> = filter_cols.into_iter().collect();
        filter_cols.sort();
        return self
            .full_scan_with_filter(filter_plan, frags, take_column, filter_cols, skip_recheck)
            .await;
    }

    // Build full projection (filter columns + vector column)
    let mut filter_cols: BTreeSet<String> = BTreeSet::new();
    if let Some(refine_expr) = filter_plan.refine_expr.as_ref() {
        filter_cols.extend(Planner::column_names_in_expr(refine_expr));
    }
    if let Some(full_expr) = filter_plan.full_expr.as_ref() {
        filter_cols.extend(Planner::column_names_in_expr(full_expr));
    }
    let mut filter_cols: Vec<String> = filter_cols.into_iter().collect();
    filter_cols.sort();
    let mut full_projection = self
        .dataset
        .empty_projection()
        .with_row_id()
        .union_columns(&filter_cols, OnMissing::Error)?
        .union_column(take_column, OnMissing::Error)?;
    full_projection.with_row_addr = self.projection_plan.physical_projection.with_row_addr;

    // Use new_filtered_read but add adaptive config
    let threshold = self
        .late_materialize_selectivity_threshold
        .unwrap_or(LATE_MATERIALIZE_SELECTIVITY_THRESHOLD);

    let total_count = self.get_total_row_count(frags.as_ref());

    let mut plan = self
        .new_filtered_read(
            filter_plan,
            full_projection,
            /*make_deletions_null=*/ false,
            frags.clone(),
            /*scan_range=*/ None,
        )
        .await?;

    // Unwrap FilteredReadExec to add adaptive config
    if let Some(filtered_exec) = plan.as_any().downcast_ref::<FilteredReadExec>() {
        let mut opts = filtered_exec.options().clone();
        opts.adaptive_expensive_column = Some(AdaptiveColumnConfig {
            expensive_column: take_column.to_string(),
            threshold,
            total_row_count: total_count,
        });

        plan = Arc::new(FilteredReadExec::try_new(
            filtered_exec.dataset().clone(),
            opts,
            filtered_exec.index_input().cloned(),
        )?);
    }

    // Apply refine filter if needed
    if !skip_recheck {
        if let Some(refine_expr) = &filter_plan.refine_expr {
            plan = Arc::new(LanceFilterExec::try_new(refine_expr.clone(), plan)?);
        }
    }

    Ok(plan)
}
```
Do we need the adaptive column scan both here and in filtered read? I wasn't sure if Will's comment had been addressed yet and, if it has, whether we still need something in the scanner.
oh, sorry. I think I started moving it into filtered_read and stopped halfway through. I think it's better in there.
```rust
fn obtain_adaptive_stream(
    &self,
    partition: usize,
    context: Arc<TaskContext>,
) -> SendableRecordBatchStream {
```
This is a very big method (admittedly, pot calling kettle black scenario here), can we break it up?
```rust
// Concatenate all cheap batches into one
use arrow_select::concat::concat_batches;
let cheap_schema = cheap_batches[0].schema();
let cheap_batch_combined =
    concat_batches(&cheap_schema, &cheap_batches).map_err(|e| Error::Arrow {
        message: format!("Failed to concatenate cheap batches: {}", e),
        location: location!(),
    })?;

// Collect all expensive batches
let mut expensive_batches = Vec::new();
while let Some(batch_result) = expensive_stream.next().await {
    let batch = batch_result.map_err(Error::from)?;
    expensive_batches.push(batch);
}
```
Is this materializing the entire read into memory? We will need an approach that can work iteratively.
Also, there is always the dumb easy alternative (what we do today) which is just pick one of take/scan and always use it for expensive columns. I'm not actually sure the two should have different performance characteristics anymore.
Nevermind, I kind of confused myself. There is still some possibility here, but it isn't as simple as I described. We can talk it through offline but I'm not sure we need to worry about it.
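On the iterative question above, a minimal sketch of a bounded-memory alternative (names hypothetical): instead of collecting every batch, pair each cheap batch with a take of just its matching expensive rows and emit the combined result as it completes.

```rust
use arrow_array::RecordBatch;
use futures::{Stream, TryStreamExt};

/// For each cheap (filter-column) batch, fetch only the matching expensive
/// (vector) rows and yield the combined batch immediately. Peak memory is
/// bounded by the batch size rather than the whole filtered result set.
fn zip_incrementally<S, F, Fut, E>(
    cheap_stream: S,
    fetch_expensive: F, // e.g. wraps a take() of vector rows by row id
) -> impl Stream<Item = Result<RecordBatch, E>>
where
    S: Stream<Item = Result<RecordBatch, E>>,
    F: FnMut(RecordBatch) -> Fut,
    Fut: std::future::Future<Output = Result<RecordBatch, E>>,
{
    cheap_stream.and_then(fetch_expensive)
}
```

A real version would probably want something like `try_buffered` so several takes can be in flight while the cheap scan continues.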
@westonpace regarding the risk of reading a lot - yeah, I should do some better math to actually quantify what it'll look like. I think in general, the cases where this approach becomes expensive are cases where the scanning approach is also expensive. But there are various thresholds to consider, and also a question of whether the filter results are too expensive to materialize for the super-large table case.

I also think the time to first batch in the previous setup is very dependent on where the first matching record sits in the table. If you are lucky you get one immediately and scanning is a win, but if you are not lucky then the approach here could win out even if you need to read a lot of (cheaper) rows.

For your heuristic, I think the main concern is what you say -- it's really biased toward the first 10K rows of the table. I think that could be a serious limitation. Another thing I need to examine is how things work when the filter is very large/complicated.
Another idea that occurs to me for limiting the downside, once we have a better analysis: we could limit this to filter expressions with only one or two columns, or put some heuristics in place to abort early if the expected size of the materialized X% is assessed to be too large. My guess is the majority of filters are on small data types and small values, and don't include very many conditions.
KNN search is performed when a vector index is not present. When a table is partially covered by a vector index, we perform a union of an ANN search over the indexed data, and a KNN search over the unindexed data. If the table is completely unindexed it is just a KNN search on the data.
Prior to this commit, when we would execute the KNN portion of a filtered vector search, we would perform a scan of all columns and remove results that did not match the filter. For large vectors, this amounts to a lot of overfetch from storage.
When filters are selective, it is more efficient to read the filter column (typically much smaller than the vector), apply the filter, and then select matching vectors by row ID.
This patch implements that strategy as well as an adaptive mechanism for deciding when to apply it. There is a new configuration concept in the scanner for specifying the filter selectivity at which it will be cheaper to do a scan. We will compute a target rowcount based on that threshold and scan the filter column for matches. If we encounter more matches than the target, we will give up and switch to a scan.
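To make the mechanism concrete, here is a compact sketch of the early-termination check described above (names hypothetical; the real implementation lives in the scanner and filtered-read changes in this PR):

```rust
/// Scan the cheap filter column, counting matches per batch. If matches exceed
/// the target implied by the selectivity threshold, abandon late
/// materialization and fall back to a single scan that includes the vectors.
fn should_late_materialize(
    match_counts_per_batch: impl IntoIterator<Item = usize>,
    total_rows: usize,
    threshold: f64, // e.g. 0.005 => give up past 0.5% selectivity
) -> bool {
    let target = (total_rows as f64 * threshold).ceil() as usize;
    let mut matched = 0usize;
    for count in match_counts_per_batch {
        matched += count;
        if matched > target {
            return false; // too many matches: one combined scan is cheaper
        }
    }
    true // selective enough: take vectors by row id
}
```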