Speed up TermQuery by gf2121 · Pull Request #14709 · apache/lucene

gf2121 · 2025-05-24T11:21:17Z

This PR tries to speed up TermQuery by using the new Scorer#nextDocsAndScores API.
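To illustrate the idea behind batch scoring, here is a minimal self-contained sketch. It does not use Lucene's classes: `DocScoreBuffer`, `SimpleTermScorer`, the toy scoring function, and the `nextDocsAndScores(int upTo, int batchSize, DocScoreBuffer buffer)` signature are all hypothetical stand-ins, and the real Scorer#nextDocsAndScores may look different. The only point is that the scorer fills a buffer of doc IDs and scores in one call, so the collecting loop does not need one call per document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical buffer of (doc, score) pairs; not Lucene's class. */
final class DocScoreBuffer {
  int[] docs = new int[0];
  float[] scores = new float[0];
  int size;

  /** Makes sure both arrays can hold at least minSize entries. */
  void growNoCopy(int minSize) {
    if (docs.length < minSize) {
      docs = new int[minSize];
      scores = new float[minSize];
    }
  }
}

/** Hypothetical term scorer over in-memory postings (doc IDs and term frequencies). */
final class SimpleTermScorer {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;
  private final int[] postingDocs;
  private final int[] freqs;
  private int pos = 0;

  SimpleTermScorer(int[] postingDocs, int[] freqs) {
    this.postingDocs = postingDocs;
    this.freqs = freqs;
  }

  /** Doc the scorer is positioned on, or NO_MORE_DOCS when exhausted. */
  int docID() {
    return pos < postingDocs.length ? postingDocs[pos] : NO_MORE_DOCS;
  }

  /** Toy scoring function (an assumption): saturates as the frequency grows. */
  private static float score(int freq) {
    return freq / (freq + 1.0f);
  }

  /** Fills the buffer with up to batchSize matches whose doc ID is below upTo. */
  void nextDocsAndScores(int upTo, int batchSize, DocScoreBuffer buffer) {
    buffer.growNoCopy(batchSize);
    int n = 0;
    while (n < batchSize && pos < postingDocs.length && postingDocs[pos] < upTo) {
      buffer.docs[n] = postingDocs[pos];
      buffer.scores[n] = score(freqs[pos]);
      n++;
      pos++;
    }
    buffer.size = n;
  }
}

final class BatchScoringDemo {
  /** Drains the scorer batch by batch and returns "doc:score" strings. */
  static List<String> collectAll(SimpleTermScorer scorer, int batchSize) {
    List<String> hits = new ArrayList<>();
    DocScoreBuffer buffer = new DocScoreBuffer();
    while (scorer.docID() != SimpleTermScorer.NO_MORE_DOCS) {
      scorer.nextDocsAndScores(SimpleTermScorer.NO_MORE_DOCS, batchSize, buffer);
      for (int i = 0; i < buffer.size; i++) {
        hits.add(buffer.docs[i] + ":" + buffer.scores[i]);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    SimpleTermScorer scorer =
        new SimpleTermScorer(new int[] {1, 4, 7, 9, 12}, new int[] {1, 3, 1, 2, 1});
    System.out.println(collectAll(scorer, 2));
  }
}
```

In this sketch the batch size is a caller-chosen parameter (it must be positive); a real implementation would more likely follow the block size of the postings format.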

TopN

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                   TermMonthSort     3577.64      (3.6%)     3491.60      (5.4%)   -2.4% ( -10% -    6%) 0.290
                          OrMany        9.72      (4.0%)        9.56      (4.2%)   -1.6% (  -9% -    6%) 0.424
                AndMedOrHighHigh       51.22      (3.4%)       50.57      (4.7%)   -1.3% (  -9% -    7%) 0.538
                 FilteredPrefix3      647.69      (1.6%)      640.85      (4.0%)   -1.1% (  -6% -    4%) 0.488
                          IntSet      634.61      (2.2%)      628.14      (3.1%)   -1.0% (  -6% -    4%) 0.447
                     CountPhrase        3.98      (3.0%)        3.94      (3.1%)   -0.9% (  -6% -    5%) 0.570
               CombinedOrHighMed       91.42      (1.5%)       90.67      (2.7%)   -0.8% (  -4% -    3%) 0.448
                       CountTerm    10500.02      (9.4%)    10419.36      (8.7%)   -0.8% ( -17% -   19%) 0.865
              CombinedOrHighHigh       12.61      (1.4%)       12.52      (3.6%)   -0.7% (  -5% -    4%) 0.586
               FilteredOrHighMed      163.12      (3.4%)      162.07      (3.6%)   -0.6% (  -7% -    6%) 0.713
               TermDayOfYearSort      328.81      (0.9%)      327.11      (1.7%)   -0.5% (  -3% -    2%) 0.447
                          Fuzzy1      113.73      (2.6%)      113.32      (2.4%)   -0.4% (  -5% -    4%) 0.771
             CountFilteredIntNRQ       21.75      (1.8%)       21.68      (1.7%)   -0.3% (  -3% -    3%) 0.717
          CountFilteredOrHighMed       50.21      (3.8%)       50.05      (3.8%)   -0.3% (  -7% -    7%) 0.868
             CombinedAndHighHigh       10.29      (1.6%)       10.27      (3.0%)   -0.3% (  -4% -    4%) 0.823
         CountFilteredOrHighHigh       42.70      (3.3%)       42.59      (3.5%)   -0.3% (  -6% -    6%) 0.884
                  FilteredIntNRQ       37.85      (0.5%)       37.76      (0.8%)   -0.2% (  -1% -    1%) 0.488
              CombinedAndHighMed       81.00      (1.7%)       80.84      (2.3%)   -0.2% (  -4% -    3%) 0.852
                  FilteredOrMany        7.31      (1.9%)        7.30      (1.8%)   -0.1% (  -3% -    3%) 0.881
             CountFilteredOrMany       13.60      (2.3%)       13.58      (2.2%)   -0.1% (  -4% -    4%) 0.910
                          Fuzzy2      126.14      (2.5%)      126.00      (3.2%)   -0.1% (  -5% -    5%) 0.938
                          IntNRQ       77.45      (0.4%)       77.44      (0.4%)   -0.0% (   0% -    0%) 0.959
             FilteredAndHighHigh       27.40      (3.1%)       27.40      (4.2%)   -0.0% (  -7% -    7%) 0.999
                         Prefix3       73.55      (1.4%)       73.59      (1.6%)    0.1% (  -2% -    3%) 0.936
                       And3Terms      525.88      (3.5%)      526.24      (3.7%)    0.1% (  -6% -    7%) 0.969
             FilteredOrStopWords       25.33      (3.2%)       25.35      (1.8%)    0.1% (  -4% -    5%) 0.938
                    SloppyPhrase        1.10      (1.3%)        1.10      (1.4%)    0.2% (  -2% -    2%) 0.813
              Or2Terms2StopWords      385.17      (1.3%)      385.85      (2.3%)    0.2% (  -3% -    3%) 0.852
                CountAndHighHigh       83.88      (1.4%)       84.06      (1.0%)    0.2% (  -2% -    2%) 0.717
                FilteredOr3Terms       86.29      (3.5%)       86.49      (2.4%)    0.2% (  -5% -    6%) 0.876
            FilteredAndStopWords       16.87      (4.9%)       16.92      (5.1%)    0.2% (  -9% -   10%) 0.922
                        Wildcard      112.39      (1.3%)      112.67      (1.2%)    0.2% (  -2% -    2%) 0.698
                 AndHighOrMedMed       44.53      (1.1%)       44.67      (1.6%)    0.3% (  -2% -    2%) 0.639
                         Respell       83.94      (0.9%)       84.23      (1.4%)    0.3% (  -1% -    2%) 0.562
                   TermTitleSort      145.80      (1.1%)      146.31      (2.0%)    0.4% (  -2% -    3%) 0.658
      FilteredOr2Terms2StopWords      196.81      (1.7%)      197.59      (2.1%)    0.4% (  -3% -    4%) 0.687
                        SpanNear        6.22      (0.8%)        6.26      (1.0%)    0.5% (  -1% -    2%) 0.243
             CountFilteredPhrase       90.57      (4.0%)       91.05      (2.9%)    0.5% (  -6% -    7%) 0.763
     FilteredAnd2Terms2StopWords      459.38      (2.4%)      462.00      (2.8%)    0.6% (  -4% -    5%) 0.660
                 CountOrHighHigh       83.30      (3.8%)       83.78      (2.1%)    0.6% (  -5% -    6%) 0.704
                      OrHighRare      948.62      (4.7%)      954.83      (2.1%)    0.7% (  -5% -    7%) 0.719
                 DismaxOrHighMed       96.87      (2.6%)       97.53      (4.1%)    0.7% (  -5% -    7%) 0.688
                      AndHighMed      116.14      (3.6%)      117.01      (4.0%)    0.7% (  -6% -    8%) 0.694
                     CountOrMany       11.71      (3.3%)       11.81      (2.2%)    0.8% (  -4% -    6%) 0.565
                          Phrase       12.48      (3.3%)       12.61      (2.5%)    1.0% (  -4% -    6%) 0.502
                     OrStopWords       37.89      (5.4%)       38.27      (7.4%)    1.0% ( -11% -   14%) 0.758
                IntervalsOrdered        2.19      (1.9%)        2.21      (1.4%)    1.0% (  -2% -    4%) 0.239
                       OrHighMed      195.95      (2.5%)      198.05      (4.7%)    1.1% (  -6% -    8%) 0.573
              FilteredOrHighHigh       28.37      (3.2%)       28.71      (1.6%)    1.2% (  -3% -    6%) 0.352
              FilteredAndHighMed      112.41      (3.2%)      113.89      (3.8%)    1.3% (  -5% -    8%) 0.452
                  FilteredPhrase       17.13      (4.4%)       17.36      (2.8%)    1.3% (  -5% -    8%) 0.469
                    FilteredTerm      128.12      (4.5%)      129.88      (4.8%)    1.4% (  -7% -   11%) 0.553
               FilteredAnd3Terms      118.53      (3.4%)      120.22      (3.4%)    1.4% (  -5% -    8%) 0.405
                DismaxOrHighHigh       93.55      (3.4%)       94.92      (4.4%)    1.5% (  -6% -    9%) 0.457
                  CountOrHighMed      159.93      (5.6%)      162.51      (5.6%)    1.6% (  -9% -   13%) 0.567
                 CountAndHighMed      133.67      (5.4%)      135.90      (5.4%)    1.7% (  -8% -   13%) 0.535
             And2Terms2StopWords       39.19      (4.3%)       39.94      (6.2%)    1.9% (  -8% -   12%) 0.470
                    CombinedTerm       25.29      (1.3%)       25.78      (2.0%)    1.9% (  -1% -    5%) 0.022
                      OrHighHigh       25.46      (4.2%)       25.96      (7.7%)    2.0% (  -9% -   14%) 0.519
                        Or3Terms      127.74      (5.3%)      130.56      (4.4%)    2.2% (  -7% -   12%) 0.365
                    AndStopWords       37.41      (5.7%)       38.25      (6.7%)    2.2% (  -9% -   15%) 0.473
                      TermDTSort      369.80      (6.3%)      379.16      (8.2%)    2.5% ( -11% -   18%) 0.490
                     AndHighHigh       82.62      (4.4%)       85.31      (3.3%)    3.3% (  -4% -   11%) 0.093
                      DismaxTerm      896.35      (3.7%)     1210.67      (8.1%)   35.1% (  22% -   48%) 0.000
                            Term      981.81      (3.7%)     1344.47      (6.9%)   36.9% (  25% -   49%) 0.000

Exhaustive

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
              FilteredAndHighMed      204.99      (1.6%)      201.23      (1.8%)   -1.8% (  -5% -    1%) 0.035
                          IntNRQ        8.18      (6.0%)        8.05      (5.2%)   -1.6% ( -12% -   10%) 0.556
                   TermMonthSort     3497.23      (5.2%)     3441.48      (1.6%)   -1.6% (  -8% -    5%) 0.409
                CountAndHighHigh       94.33      (2.7%)       92.87      (3.0%)   -1.5% (  -7% -    4%) 0.284
                       And3Terms      584.28      (3.2%)      575.55      (3.7%)   -1.5% (  -8% -    5%) 0.383
                          OrMany        0.97      (5.6%)        0.96      (9.2%)   -1.3% ( -15% -   14%) 0.733
                     CountOrMany       11.97      (4.6%)       11.83      (4.3%)   -1.2% (  -9% -    8%) 0.597
               FilteredAnd3Terms      139.54      (2.7%)      138.19      (2.3%)   -1.0% (  -5% -    4%) 0.441
                 CountOrHighHigh       95.76      (3.0%)       94.86      (3.4%)   -0.9% (  -7% -    5%) 0.556
                 CountAndHighMed      138.99      (4.2%)      137.74      (2.6%)   -0.9% (  -7% -    6%) 0.606
                AndMedOrHighHigh       54.89      (1.9%)       54.44      (2.3%)   -0.8% (  -4% -    3%) 0.441
                      TermDTSort      226.01      (1.4%)      224.45      (1.1%)   -0.7% (  -3% -    1%) 0.278
               TermDayOfYearSort      284.64      (0.7%)      282.69      (1.2%)   -0.7% (  -2% -    1%) 0.180
                  CountOrHighMed      171.81      (2.8%)      170.78      (1.9%)   -0.6% (  -5% -    4%) 0.620
                          IntSet      597.04      (2.5%)      593.54      (2.1%)   -0.6% (  -5% -    4%) 0.617
             CountFilteredOrMany       13.33      (1.5%)       13.26      (1.5%)   -0.5% (  -3% -    2%) 0.479
             FilteredAndHighHigh       34.56      (1.8%)       34.39      (1.9%)   -0.5% (  -4% -    3%) 0.596
         CountFilteredOrHighHigh       38.70      (1.5%)       38.53      (1.7%)   -0.5% (  -3% -    2%) 0.567
                     AndHighHigh       10.80      (0.7%)       10.76      (1.1%)   -0.4% (  -2% -    1%) 0.377
                   TermTitleSort      169.21      (2.5%)      168.52      (1.4%)   -0.4% (  -4% -    3%) 0.692
                         Respell       82.28      (1.9%)       81.95      (2.0%)   -0.4% (  -4% -    3%) 0.682
                        Wildcard       37.88      (4.5%)       37.81      (4.2%)   -0.2% (  -8% -    8%) 0.926
      FilteredOr2Terms2StopWords        9.94      (2.1%)        9.93      (1.5%)   -0.1% (  -3% -    3%) 0.874
          CountFilteredOrHighMed       53.57      (0.9%)       53.52      (1.2%)   -0.1% (  -2% -    1%) 0.851
             CountFilteredPhrase       91.72      (1.5%)       91.69      (2.0%)   -0.0% (  -3% -    3%) 0.976
                         Prefix3        5.93      (3.2%)        5.93      (3.0%)   -0.0% (  -6% -    6%) 0.990
     FilteredAnd2Terms2StopWords      357.95      (2.1%)      357.99      (1.1%)    0.0% (  -3% -    3%) 0.988
                    AndStopWords       13.39      (1.2%)       13.39      (1.4%)    0.0% (  -2% -    2%) 0.962
                    CombinedTerm       24.93      (1.5%)       24.94      (1.6%)    0.0% (  -2% -    3%) 0.955
                          Phrase        3.84      (2.8%)        3.84      (5.0%)    0.1% (  -7% -    8%) 0.978
                  FilteredIntNRQ       16.94      (0.7%)       16.95      (1.4%)    0.1% (  -1% -    2%) 0.879
                FilteredOr3Terms       21.83      (3.5%)       21.86      (2.6%)    0.1% (  -5% -    6%) 0.935
                 FilteredPrefix3        8.43      (1.4%)        8.44      (1.5%)    0.2% (  -2% -    3%) 0.830
               FilteredOrHighMed       21.71      (0.4%)       21.76      (0.6%)    0.2% (   0% -    1%) 0.387
              FilteredOrHighHigh       16.06      (1.7%)       16.10      (1.5%)    0.3% (  -2% -    3%) 0.749
                      AndHighMed       75.44      (2.1%)       75.64      (2.6%)    0.3% (  -4% -    5%) 0.826
             And2Terms2StopWords      297.11      (1.7%)      297.98      (1.5%)    0.3% (  -2% -    3%) 0.710
                 AndHighOrMedMed       43.16      (1.6%)       43.31      (2.0%)    0.3% (  -3% -    3%) 0.702
             CountFilteredIntNRQ       19.77      (0.4%)       19.85      (0.7%)    0.4% (   0% -    1%) 0.170
              CombinedAndHighMed       61.79      (1.7%)       62.07      (1.5%)    0.4% (  -2% -    3%) 0.583
            FilteredAndStopWords       18.77      (1.4%)       18.85      (1.5%)    0.4% (  -2% -    3%) 0.541
                 DismaxOrHighMed       11.22      (4.2%)       11.27      (4.9%)    0.4% (  -8% -    9%) 0.843
                          Fuzzy2       80.82      (3.2%)       81.20      (2.9%)    0.5% (  -5% -    6%) 0.759
                    FilteredTerm       38.29      (2.0%)       38.48      (1.7%)    0.5% (  -3% -    4%) 0.589
              Or2Terms2StopWords        2.49      (5.2%)        2.50     (11.0%)    0.5% ( -14% -   17%) 0.904
             CombinedAndHighHigh       16.43      (1.7%)       16.52      (1.4%)    0.6% (  -2% -    3%) 0.456
                       CountTerm    10806.25      (6.3%)    10880.88      (8.0%)    0.7% ( -12% -   16%) 0.848
                  FilteredPhrase       82.93      (1.6%)       83.52      (2.0%)    0.7% (  -2% -    4%) 0.426
                        SpanNear       34.73      (4.0%)       34.99      (2.3%)    0.8% (  -5% -    7%) 0.643
                    SloppyPhrase       25.49      (6.8%)       25.70      (4.9%)    0.8% ( -10% -   13%) 0.790
                DismaxOrHighHigh        4.21      (3.7%)        4.25      (4.4%)    0.8% (  -6% -    9%) 0.677
                     CountPhrase        6.48      (2.1%)        6.54      (2.1%)    0.9% (  -3% -    5%) 0.397
                      OrHighRare        4.33      (1.4%)        4.37      (1.2%)    0.9% (  -1% -    3%) 0.163
             FilteredOrStopWords        8.37      (1.3%)        8.46      (1.5%)    1.1% (  -1% -    3%) 0.126
                      DismaxTerm       48.75      (5.9%)       49.34      (3.4%)    1.2% (  -7% -   11%) 0.613
                     OrStopWords        3.33      (4.4%)        3.37     (11.3%)    1.3% ( -13% -   17%) 0.762
                  FilteredOrMany        1.86      (2.1%)        1.88      (1.6%)    1.3% (  -2% -    5%) 0.158
                      OrHighHigh        8.31      (3.8%)        8.43     (11.6%)    1.5% ( -13% -   17%) 0.728
                          Fuzzy1       47.01      (1.5%)       47.72      (5.7%)    1.5% (  -5% -    8%) 0.462
                       OrHighMed        6.75      (4.0%)        6.86     (11.9%)    1.6% ( -13% -   18%) 0.714
                IntervalsOrdered       31.01      (4.7%)       31.59      (3.0%)    1.9% (  -5% -   10%) 0.341
                        Or3Terms       24.10      (0.7%)       24.75      (9.6%)    2.7% (  -7% -   13%) 0.434
               CombinedOrHighMed        6.98      (5.0%)        7.23      (3.9%)    3.6% (  -5% -   13%) 0.116
              CombinedOrHighHigh        1.53      (5.4%)        1.59      (3.7%)    3.9% (  -4% -   13%) 0.093
                            Term       73.22      (1.7%)      119.72      (4.4%)   63.5% (  56% -   70%) 0.000
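As a reading aid for the tables above, the "Pct diff" column appears to be the relative change in mean QPS between the baseline and the modified version. That interpretation is an assumption, but it matches the reported numbers for the Term and DismaxTerm rows. The `PctDiff` helper below is hypothetical and not part of the benchmark tooling.

```java
import java.util.Locale;

final class PctDiff {
  /** Relative change of the candidate QPS versus the baseline QPS, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  /** Formats a percentage with one decimal, independent of the default locale. */
  static String format(double pct) {
    return String.format(Locale.ROOT, "%.1f%%", pct);
  }

  public static void main(String[] args) {
    // QPS values copied from the Term and DismaxTerm rows of the two tables.
    System.out.println("TopN Term:        " + format(pctDiff(981.81, 1344.47)));  // 36.9%
    System.out.println("TopN DismaxTerm:  " + format(pctDiff(896.35, 1210.67)));  // 35.1%
    System.out.println("Exhaustive Term:  " + format(pctDiff(73.22, 119.72)));    // 63.5%
  }
}
```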

github-actions · 2025-05-24T11:22:11Z

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.

jpountz

Nice speedup! Term queries are fast, though a term query on "the" is one of the slowest queries in the Tantivy benchmark, so it's nice to get it optimized.

jpountz · 2025-05-25T19:07:42Z

lucene/core/src/java/org/apache/lucene/search/BatchScoreBulkScorer.java

+  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
+    if (collector.competitiveIterator() != null) {
+      return new Weight.DefaultBulkScorer(scorer).score(collector, acceptDocs, min, max);
+    }


I wonder if this should be an implementation detail of DefaultBulkScorer instead of a different class, doing something like:

    if (scoreMode == TOP_SCORES && competitiveIterator == null) {
      // new optimization
    } else {
      // existing DefaultBulkScorer code
    }

Thanks for the feedback! I moved the implementation into DefaultBulkScorer.

if (scoreMode == TOP_SCORES && competitiveIterator == null)

As the description shows, exhaustive execution gets optimized as well, so I use scoreMode.needsScores instead.
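The two conditions discussed in this thread can be contrasted with a small sketch. Everything in it is hypothetical: the `ScoreMode` enum is a simplified stand-in rather than Lucene's class, and `topScoresOnly` and `needsScoresVariant` are illustrative names, not Lucene methods. In both variants a non-null competitive iterator falls back to the default path, mirroring the check in the diff above.

```java
final class BulkScorerDispatch {
  /** Simplified stand-in for a score mode; not Lucene's ScoreMode. */
  enum ScoreMode {
    COMPLETE(true),
    COMPLETE_NO_SCORES(false),
    TOP_SCORES(true);

    private final boolean needsScores;

    ScoreMode(boolean needsScores) {
      this.needsScores = needsScores;
    }

    boolean needsScores() {
      return needsScores;
    }
  }

  enum Path { BATCH, DEFAULT }

  /** Condition from the review suggestion: batch path only for top-scores collection. */
  static Path topScoresOnly(ScoreMode mode, boolean hasCompetitiveIterator) {
    return mode == ScoreMode.TOP_SCORES && !hasCompetitiveIterator ? Path.BATCH : Path.DEFAULT;
  }

  /** Condition from the reply: batch path whenever scores are needed. */
  static Path needsScoresVariant(ScoreMode mode, boolean hasCompetitiveIterator) {
    return mode.needsScores() && !hasCompetitiveIterator ? Path.BATCH : Path.DEFAULT;
  }

  public static void main(String[] args) {
    for (ScoreMode mode : ScoreMode.values()) {
      System.out.println(mode
          + " topScoresOnly=" + topScoresOnly(mode, false)
          + " needsScoresVariant=" + needsScoresVariant(mode, false));
    }
  }
}
```

The only case where the two conditions disagree in this sketch is exhaustive scoring (`COMPLETE`), which is the case the Exhaustive benchmark table covers.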

jpountz

Sorry about my last suggestion. I had missed that DefaultBulkScorer does not yet have a way to know whether scores are needed, so I think I like your previous approach a bit better, since it keeps DefaultBulkScorer clean.

jpountz · 2025-05-26T12:35:47Z

lucene/core/src/java/org/apache/lucene/search/TermScorer.java


+    if (impactsDisi != null) {
+      impactsDisi.ensureCompetitive();
+    }


I wonder if we should rather put it at the beginning of the for loop below. For instance, imagine that the first block of docs returned only has deleted docs: the loop will then fetch a new block. It would be good to check whether that block is competitive before fetching it as well.

Oh, nice catch!
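The scenario from this review comment can be reproduced with a small self-contained model. It is only a sketch: `Block`, `CompetitiveLoopDemo`, and the `checkInsideLoop` flag are hypothetical, and `ensureCompetitive()` here is a stand-in that is assumed to skip blocks whose maximum score is below the minimum competitive score, which may not match ImpactsDISI#ensureCompetitive in every detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class CompetitiveLoopDemo {
  /** Hypothetical block of postings with an upper bound on the scores it can produce. */
  static final class Block {
    final int[] docs;
    final float maxScore;

    Block(int[] docs, float maxScore) {
      this.docs = docs;
      this.maxScore = maxScore;
    }
  }

  private final Block[] blocks;
  private final float minCompetitiveScore;
  private int blockIndex = 0;
  int decodedBlocks = 0; // number of blocks that were actually decoded

  CompetitiveLoopDemo(Block[] blocks, float minCompetitiveScore) {
    this.blocks = blocks;
    this.minCompetitiveScore = minCompetitiveScore;
  }

  /** Stand-in for ensureCompetitive: skips blocks that cannot produce a competitive hit. */
  private void ensureCompetitive() {
    while (blockIndex < blocks.length && blocks[blockIndex].maxScore < minCompetitiveScore) {
      blockIndex++;
    }
  }

  /** Returns the next non-empty batch of live docs, or an empty list when exhausted. */
  List<Integer> nextBatch(IntPredicate liveDocs, boolean checkInsideLoop) {
    List<Integer> batch = new ArrayList<>();
    if (!checkInsideLoop) {
      ensureCompetitive(); // checked once, before the loop
    }
    while (batch.isEmpty()) {
      if (checkInsideLoop) {
        ensureCompetitive(); // checked before every block fetch
      }
      if (blockIndex >= blocks.length) {
        break;
      }
      Block block = blocks[blockIndex++];
      decodedBlocks++;
      for (int doc : block.docs) {
        if (liveDocs.test(doc)) {
          batch.add(doc);
        }
      }
    }
    return batch;
  }

  /** Three blocks: competitive but fully deleted, non-competitive, competitive. */
  static CompetitiveLoopDemo example() {
    Block[] blocks = {
      new Block(new int[] {0, 1}, 5f),
      new Block(new int[] {2, 3}, 1f),
      new Block(new int[] {4, 5}, 6f)
    };
    return new CompetitiveLoopDemo(blocks, 3f);
  }

  public static void main(String[] args) {
    IntPredicate liveDocs = doc -> doc >= 2; // docs 0 and 1 are deleted
    System.out.println("check before loop: " + example().nextBatch(liveDocs, false));
    System.out.println("check inside loop: " + example().nextBatch(liveDocs, true));
  }
}
```

With the check placed only before the loop, the model decodes the non-competitive second block after the first block turns out to be fully deleted; with the check inside the loop, that block is skipped.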

// nightly-benchmarks-results-changed //

* main: (32 commits)
  update os.makedirs with pathlib mkdir (apache#14710)
  Optimize AbstractKnnVectorQuery#createBitSet with intoBitset (apache#14674)
  Implement #docIDRunEnd() on PostingsEnum. (apache#14693)
  Speed up TermQuery (apache#14709)
  Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. (apache#14701)
  Fix WindowsFS test failure seen on Policeman Jenkins (apache#14706)
  Use a temporary repository location to download certain ecj versions ("drops") (apache#14703)
  Add assumption to ignore occasional test failures due to disconnected graphs (apache#14696)
  Return MatchNoDocsQuery when IndexOrDocValuesQuery::rewrite does not match (apache#14700)
  Minor access modifier adjustment to a couple of lucene90 backward compat types (apache#14695)
  Speed up exhaustive evaluation. (apache#14679)
  Specify and test that IOContext is immutable (apache#14686)
  deps(java): bump org.gradle.toolchains.foojay-resolver-convention (apache#14691)
  deps(java): bump org.eclipse.jgit:org.eclipse.jgit (apache#14692)
  Clean up how the test framework creates asserting scorables. (apache#14452)
  Make competitive iterators more robust. (apache#14532)
  Remove DISIDocIdStream. (apache#14550)
  Implement AssertingPostingsEnum#intoBitSet. (apache#14675)
  Fix patience knn queries to work with seeded knn queries (apache#14688)
  Added toString() method to BytesRefBuilder (apache#14676)
  ...

jpountz · 2025-05-31T14:11:57Z

This change yielded a good speedup on the nightly benchmarks, so I pushed an annotation: https://benchmarks.mikemccandless.com/Term.html

Speed up term query

7f0ed2a

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking May 24, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking May 24, 2025

github-actions bot added the module:core/search label May 24, 2025

gf2121 added 3 commits May 25, 2025 00:24

iter

487bbc8

iter

6ad87d1

iter

b55336a

tidy

ab6773d

jpountz reviewed May 25, 2025

feedback iter

8ec9930

fix

8b25eb3

gf2121 mentioned this pull request May 26, 2025

Move HitQueue in TopScoreDocCollector to a LongHeap #14714

Merged

jpountz reviewed May 26, 2025

gf2121 added 3 commits May 26, 2025 21:09

Revert "fix"

2152fd1

This reverts commit 8b25eb3.

Revert "feedback iter"

bd4ccbb

This reverts commit 8ec9930.

move into loop

0fafc36

gf2121 and others added 2 commits May 26, 2025 21:16

CHANGES

1d84b7f

Merge branch 'main' into opt_term_query

381034a

github-actions bot added this to the 10.3.0 milestone May 26, 2025

jpountz approved these changes May 26, 2025

gf2121 merged commit 12c3041 into apache:main May 26, 2025
7 checks passed

github-project-automation bot moved this from Open to Merged in OpenSearch Lucene & Core Performance Tracking May 26, 2025

asf-gitbox-commits pushed a commit that referenced this pull request May 26, 2025

Speed up TermQuery (#14709)

adbf9a9

// nightly-benchmarks-results-changed //

RamakrishnaChilaka mentioned this pull request Aug 30, 2025

Adding 3-ary LongHeap to speed up collectors like TopDoc*Collectors #15140

Merged

hossman mentioned this pull request Sep 26, 2025

Change in behavior using SimpleCollector+TopScoreDocCollector between 10.2 and 10.3 when scoreMode==COMPLETE #15239

Open
