
feat: Add index_of_first_not_null and index_of_last_not_null Expr and Series methods#22880

Draft
alexander-beedie wants to merge 1 commit into pola-rs:main from alexander-beedie:first-last-not-null

@alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented May 22, 2025

New Expr/Series Methods

  • .index_of_first_not_null(): returns the index of the first value that is not null.
  • .index_of_last_not_null(): returns the index of the last value that is not null.

The first can be constructed from existing expressions as .is_not_null().arg_max(), but that has an edge case: if all values are null it returns 0 instead of None. The second doesn't have an obvious or clean construction (arguably the first isn't that obvious either). Dedicated functionality is faster, clearer, and correct in all cases.
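To illustrate the edge case, here is a minimal pure-Python sketch of the intended semantics. It models a Series as a plain list with None for nulls; the helper `argmax_of_not_null` is a hypothetical stand-in that mimics how an arg-max over a boolean mask behaves, and none of this is the PR's actual implementation.

```python
from typing import Optional, Sequence


def argmax_of_not_null(values: Sequence[object]) -> int:
    """Mimic `.is_not_null().arg_max()`: index of the first maximal mask entry."""
    mask = [v is not None for v in values]
    # max() returns the first maximal element, so an all-False mask yields index 0.
    return max(range(len(mask)), key=mask.__getitem__)


def index_of_first_not_null(values: Sequence[object]) -> Optional[int]:
    """Index of the first non-null value, or None if every value is null."""
    return next((i for i, v in enumerate(values) if v is not None), None)


def index_of_last_not_null(values: Sequence[object]) -> Optional[int]:
    """Index of the last non-null value, or None if every value is null."""
    for i in range(len(values) - 1, -1, -1):
        if values[i] is not None:
            return i
    return None
```

With an all-null input such as `[None, None]`, the arg-max construction reports index 0 even though that slot is null, while the dedicated functions return None.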

🚀 Core Performance Optimisations

Moved to a separate PR (#22897) while we decide on the API here.


🕐 Timings (vs Pandas)

Setup

Create some 10,000,000 element Series with different characteristics...
Note: equivalent Pandas methods are first_valid_index [1] and last_valid_index [2].

import polars as pl

# all null
s1 = pl.Series([None], dtype=pl.Int64).extend_constant(None, 9_999_999)

# all not-null
s2 = pl.Series(range(10_000_000), dtype=pl.Int64)

# only the first value is null
s3 = pl.Series([None], dtype=pl.Int64).extend_constant(0, 9_999_999)

# only the last value is null
s4 = pl.Series(range(9_999_999), dtype=pl.Int64).extend_constant(None, 1)

# only non-null value is halfway through
s5 = (
    pl.Series([None], dtype=pl.Int64)
      .extend_constant(None, 4_999_998)
      .extend_constant(1, 1)
      .extend_constant(None, 5_000_000)
)

# Timing loop: run in IPython/Jupyter, since %timeit is an IPython magic
for s in (s1, s2, s3, s4, s5):
    ps = s.to_pandas()

    %timeit ps.first_valid_index()
    %timeit s.index_of_first_not_null()
    %timeit ps.last_valid_index()
    %timeit s.index_of_last_not_null()

Results

The first four series hit dedicated fast paths that execute in nanoseconds where Pandas takes milliseconds; only the final Series (s5) requires any real work, and there we are still roughly 65x to 170x faster on 10,000,000 elements.

| series | operation      | pandas (ns) | polars (ns) | speedup |
|--------|----------------|------------:|------------:|--------:|
| s1     | first not null |   1,380,000 |        45.5 | 30,330x |
| s1     | last not null  |   4,090,000 |        46.8 | 87,393x |
| s2     | first not null |   2,650,000 |        44.9 | 59,020x |
| s2     | last not null  |   2,960,000 |        52.2 | 56,705x |
| s3     | first not null |   1,290,000 |        55.9 | 23,077x |
| s3     | last not null  |   4,450,000 |        60.9 | 73,071x |
| s4     | first not null |   1,340,000 |        55.8 | 24,014x |
| s4     | last not null  |   4,160,000 |        61.1 | 68,085x |
| s5     | first not null |   1,400,000 |      21,600 |     65x |
| s5     | last not null  |   4,320,000 |      25,400 |    170x |

Test machine: Apple Silicon M3 Max.
Results normalised to nanoseconds; Pandas was reporting in ms, and Polars in μs or ns.
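The PR text does not show how the fast paths work, so the following is only a guess at the general shape, written in pure Python over a list. It assumes the null count is already known (passed in here as the hypothetical `null_count` argument) so that the all-null and no-null cases need no scan at all.

```python
from typing import Optional, Sequence


def first_not_null_with_fast_paths(
    values: Sequence[object], null_count: int
) -> Optional[int]:
    """Find the first non-null index, short-circuiting on cheap metadata."""
    n = len(values)
    if n == 0 or null_count == n:
        # Empty or all-null (like s1): nothing to find, no scan needed.
        return None
    if null_count == 0:
        # No nulls at all (like s2): the first element is the answer.
        return 0
    if values[0] is not None:
        # Leading element is valid (like s4): answer without scanning.
        return 0
    # General case (like s3 and s5): scan until the first valid slot.
    return next(i for i, v in enumerate(values) if v is not None)
```

Under this sketch only an input shaped like s5, where the single valid value sits mid-way, pays for a long scan, which is consistent with the timings above.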

Unit Tests

Added lots of new test coverage, both for the new functions and for index_of and arg_max, since they have new fast-paths. Includes a parametric test (which I ran for 10s of thousands of iterations before committing, just to be on the safe side ;)
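The parametric test itself is not shown in this description. As a rough illustration of the idea, the sketch below generates random lists with a random null density and checks invariants that any correct first/last not-null index must satisfy. It uses the standard-library `random` module rather than whatever framework the PR uses, and the naive reference functions are hypothetical helpers.

```python
import random
from typing import List, Optional


def naive_first_not_null(values: List[Optional[int]]) -> Optional[int]:
    """Reference implementation: linear scan from the front."""
    for i, v in enumerate(values):
        if v is not None:
            return i
    return None


def naive_last_not_null(values: List[Optional[int]]) -> Optional[int]:
    """Reference implementation: scan the reversed list, then map the index back."""
    idx = naive_first_not_null(values[::-1])
    return None if idx is None else len(values) - 1 - idx


def check_invariants(values: List[Optional[int]]) -> None:
    first = naive_first_not_null(values)
    last = naive_last_not_null(values)
    if all(v is None for v in values):
        # All-null (or empty) input must yield None, never index 0.
        assert first is None and last is None
    else:
        assert values[first] is not None
        assert all(v is None for v in values[:first])
        assert values[last] is not None
        assert all(v is None for v in values[last + 1:])
        assert first <= last


def run_parametric_check(iterations: int = 1000, seed: int = 0) -> int:
    """Run the invariant check on randomly generated inputs; return the count."""
    rng = random.Random(seed)
    for _ in range(iterations):
        n = rng.randint(0, 20)
        p_null = rng.random()
        values = [None if rng.random() < p_null else rng.randint(-5, 5) for _ in range(n)]
        check_invariants(values)
    return iterations
```

In a real test suite the same invariants would be asserted against the library methods instead of the naive helpers.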

Footnotes

  1. pandas.DataFrame.first_valid_index

  2. pandas.DataFrame.last_valid_index

@alexander-beedie alexander-beedie changed the title from "feat: add index_of_first_not_null and index_of_last_not_null Expr and Series methods, faster low-level null search" to "feat: Add index_of_first_not_null and index_of_last_not_null Expr and Series methods" May 22, 2025
@github-actions github-actions Bot added the enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars), and rust (Related to Rust Polars) labels and removed the title needs formatting label May 22, 2025
@alexander-beedie alexander-beedie added the performance Performance issues or improvements label May 22, 2025
@codecov

codecov Bot commented May 22, 2025

Codecov Report

Attention: Patch coverage is 86.89655% with 19 lines in your changes missing coverage. Please review.

Project coverage is 80.66%. Comparing base (5993d95) to head (8b54ff0).

| Files with missing lines                              | Patch % | Lines         |
|-------------------------------------------------------|--------:|---------------|
| crates/polars-ops/src/series/ops/index_of.rs          |  86.36% | 12 Missing ⚠️ |
| crates/polars-plan/src/dsl/function_expr/mod.rs       |  50.00% | 4 Missing ⚠️  |
| .../polars-python/src/lazyframe/visitor/expr_nodes.rs |   0.00% | 2 Missing ⚠️  |
| ...ates/polars-plan/src/dsl/function_expr/index_of.rs |  94.73% | 1 Missing ⚠️  |
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #22880   +/-   ##
=======================================
  Coverage   80.66%   80.66%           
=======================================
  Files        1672     1672           
  Lines      221997   222085   +88     
  Branches     2798     2798           
=======================================
+ Hits       179065   179151   +86     
- Misses      42265    42267    +2     
  Partials      667      667           


@orlp
Member

orlp commented May 22, 2025

I think these functions are far too specific in their current form. I think there are multiple use-cases one could have:

  1. Find the first/last non-null value. As I mentioned here I'm open to adding an ignore_nulls parameter to first and last, as this is a really common use-case. Currently you have to write pl.col.x.filter(pl.col.x.is_not_null()).first() and similarly for last, which isn't as efficient. Once this exists we can also add an optimization pass for this.

  2. Find the index of the first/last non-null value. I think this should not have a dedicated function for nulls but instead be a function on arbitrary boolean masks as it's much less common and in general we would like to push users away from index-based solutions (as they're not streaming friendly).
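As a concrete reading of the first use-case, here is a pure-Python sketch of what an ignore_nulls parameter on first and last could mean. The parameter is the proposal from the list above, not a confirmed API, and the list-based functions are hypothetical stand-ins for the expression methods.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def first(values: Sequence[Optional[T]], *, ignore_nulls: bool = False) -> Optional[T]:
    """First value; with ignore_nulls=True, the first value that is not None."""
    if ignore_nulls:
        return next((v for v in values if v is not None), None)
    return values[0] if values else None


def last(values: Sequence[Optional[T]], *, ignore_nulls: bool = False) -> Optional[T]:
    """Last value; with ignore_nulls=True, the last value that is not None."""
    if ignore_nulls:
        return next((v for v in reversed(values) if v is not None), None)
    return values[-1] if values else None
```

This returns the value rather than its position, which is what keeps it independent of row indices.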

So, generalizing to arbitrary boolean masks, we have the same two questions again:

  1. Find the first/last value using an arbitrary boolean mask. Currently this has to be written pl.col.x.filter(cond).first() and pl.col.x.filter(cond).last(). I'm not strongly opposed to Expr.first_where and Expr.last_where, which would take a boolean mask of the same length as the expression and select the first/last value where the mask is true, with an optimization pass to turn filter().first()/filter().last() into this.

  2. Find the index of the first/last true value in an arbitrary boolean mask. We already have an efficient invocation of this for the first, namely .index_of(True), but not for the last true value. I would be open to adding a default keyword-only argument index_of(val, *, last = False) which, when set to True, gives the index of the last matching element. Then we could add an optimization pass which turns pl.col.x.arg_true().first() and pl.col.x.arg_true().last() into index_of calls.
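The two mask-based proposals above can be sketched in pure Python as follows. The `last` keyword and the `first_where` / `last_where` names come from the suggestions in this comment; they are proposals, not confirmed APIs, and the list-based signatures are assumptions made for illustration.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def index_of(values: Sequence[object], val: object, *, last: bool = False) -> Optional[int]:
    """Index of the first (or, with last=True, the last) element equal to val."""
    n = len(values)
    indices = range(n - 1, -1, -1) if last else range(n)
    return next((i for i in indices if values[i] == val), None)


def first_where(values: Sequence[T], mask: Sequence[bool]) -> Optional[T]:
    """First value whose mask entry is true; the mask must match values in length."""
    if len(values) != len(mask):
        raise ValueError("mask must have the same length as values")
    return next((v for v, m in zip(values, mask) if m), None)


def last_where(values: Sequence[T], mask: Sequence[bool]) -> Optional[T]:
    """Last value whose mask entry is true; the mask must match values in length."""
    if len(values) != len(mask):
        raise ValueError("mask must have the same length as values")
    return next((v for v, m in zip(reversed(values), reversed(mask)) if m), None)
```

Under this reading, the index of the last non-null value is `index_of([v is not None for v in values], True, last=True)`, which covers the PR's original use-case without a null-specific method.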

@alexander-beedie
Collaborator Author

alexander-beedie commented May 23, 2025

> I think this should not have a dedicated function for nulls but instead be a function on arbitrary boolean masks as it's much less common

Looking at the large amount of Pandas code we are hoping to translate, I can assure you that it is actually quite common, hence needing some equivalent/clear functionality on our (Polars) side.

But yes, having two separate dedicated/specialised methods with long-winded names like this can be improved! Maybe a companion to index_of like index_of_not, adding a new "strategy" param to both that could default to "first" but also allow "last"? Or just a boolean "last" param that defaults to False. Would map conceptually to is_null and is_not_null, but be more generic 🤔

Part of the problem for making the API clean/clear is that there is no single token we can pass in that just means "not null" (like... lit(None).not(), hah), and setting up boolean masks can (sometimes) be clunky. Anyway, I'll think about the comments and see what works and looks clean (an "ignore_nulls" param for first/last would be a good related addition - can look at that too) 👍

In the meantime I have broken out the performance optimisations into a separate PR so that this one can be more API-focused and not hold up the speedups.

@alexander-beedie alexander-beedie removed the performance Performance issues or improvements label May 24, 2025
@alexander-beedie alexander-beedie marked this pull request as draft August 4, 2025 07:58
@ritchie46 ritchie46 force-pushed the main branch 3 times, most recently from ddf5907 to d0914d4 Compare September 27, 2025 11:06
@ritchie46 ritchie46 force-pushed the main branch 3 times, most recently from 90ceb7b to e9fce55 Compare October 26, 2025 16:01
