feat: Add `index_of_first_not_null` and `index_of_last_not_null` Expr and Series methods by alexander-beedie · Pull Request #22880 · pola-rs/polars

alexander-beedie · 2025-05-22T10:54:05Z

New Expr/Series Methods

.index_of_first_not_null(): returns the index of the first value that is not null.
.index_of_last_not_null(): returns the index of the last value that is not null.

The first can be constructed from existing expressions as .is_not_null().arg_max(), but that has an edge case: if all values are null it returns 0 instead of None. The second doesn't have an obvious or clean construction (arguably the first isn't that obvious either). We can get faster performance (and be clearer, and correct in all cases) with dedicated functionality.

🚀 Core Performance Optimisations

Moved to a separate PR (#22897) while we decide on the API here.

🕐 Timings (vs Pandas)

Setup

Create some 10,000,000 element Series with different characteristics...
Note: equivalent Pandas methods are first_valid_index¹ and last_valid_index².

import polars as pl

# all null
s1 = pl.Series([None], dtype=pl.Int64).extend_constant(None, 9_999_999)

# all not-null
s2 = pl.Series(range(10_000_000), dtype=pl.Int64)

# only the first value is null
s3 = pl.Series([None], dtype=pl.Int64).extend_constant(0, 9_999_999)

# only the last value is null
s4 = pl.Series(range(9_999_999), dtype=pl.Int64).extend_constant(None, 1)

# only non-null value is halfway through
s5 = (
    pl.Series([None], dtype=pl.Int64)
      .extend_constant(None, 4_999_998)
      .extend_constant(1, 1)
      .extend_constant(None, 5_000_000)
)

# run in IPython/Jupyter (uses the %timeit magic)
for s in (s1, s2, s3, s4, s5):
    ps = s.to_pandas()

    %timeit ps.first_valid_index()
    %timeit s.index_of_first_not_null()
    %timeit ps.last_valid_index()
    %timeit s.index_of_last_not_null()

Results

You can see that for the first four series we have really good fast-paths, executing in nanoseconds where Pandas takes milliseconds; only the final Series requires any real work, and there we are still 65x to 170x faster on 10,000,000 elements.

series	operation	pandas (ns)	polars (ns)	speedup
s1	first not null	1,380,000	45.5	30,330x
s1	last not null	4,090,000	46.8	87,393x
s2	first not null	2,650,000	44.9	59,020x
s2	last not null	2,960,000	52.2	56,705x
s3	first not null	1,290,000	55.9	23,077x
s3	last not null	4,450,000	60.9	73,071x
s4	first not null	1,340,000	55.8	24,014x
s4	last not null	4,160,000	61.1	68,085x
s5	first not null	1,400,000	21,600	65x
s5	last not null	4,320,000	25,400	170x

Test machine: Apple Silicon M3 Max.
Results normalised to nanoseconds; Pandas was reporting in ms, and Polars μs,ns.
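
As a sanity check on the normalisation, the speedup column is simply the Pandas time divided by the Polars time once both are in nanoseconds. The sketch below recomputes the two s5 rows; the millisecond and microsecond inputs are back-derived from the table values (an assumption, not the raw %timeit output), and `to_ns` and `speedup` are illustrative helper names.

```python
# Conversion factors to nanoseconds for the units %timeit reported.
NS_PER_UNIT = {"ns": 1, "us": 1_000, "ms": 1_000_000}


def to_ns(value: float, unit: str) -> float:
    # Normalise a timing to nanoseconds.
    return value * NS_PER_UNIT[unit]


def speedup(pandas_ns: float, polars_ns: float) -> int:
    # Ratio of the two timings, rounded to the nearest whole number.
    return round(pandas_ns / polars_ns)


# s5 rows: 1,400,000 ns vs 21,600 ns and 4,320,000 ns vs 25,400 ns.
s5_first = speedup(to_ns(1.40, "ms"), to_ns(21.6, "us"))
s5_last = speedup(to_ns(4.32, "ms"), to_ns(25.4, "us"))
```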

Unit Tests

Added lots of new test coverage, both for the new functions and for index_of and arg_max, since they have new fast-paths. Includes a parametric test (which I ran for 10s of thousands of iterations before committing, just to be on the safe side ;)

codecov · 2025-05-22T11:15:52Z

Codecov Report

Attention: Patch coverage is 86.89655% with 19 lines in your changes missing coverage. Please review.

Project coverage is 80.66%. Comparing base (5993d95) to head (8b54ff0).

Files with missing lines	Patch %	Lines
crates/polars-ops/src/series/ops/index_of.rs	86.36%	12 Missing ⚠️
crates/polars-plan/src/dsl/function_expr/mod.rs	50.00%	4 Missing ⚠️
.../polars-python/src/lazyframe/visitor/expr_nodes.rs	0.00%	2 Missing ⚠️
...ates/polars-plan/src/dsl/function_expr/index_of.rs	94.73%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #22880   +/-   ##
=======================================
  Coverage   80.66%   80.66%           
=======================================
  Files        1672     1672           
  Lines      221997   222085   +88     
  Branches     2798     2798           
=======================================
+ Hits       179065   179151   +86     
- Misses      42265    42267    +2     
  Partials      667      667


orlp · 2025-05-22T14:45:07Z

I think these functions are far too specific in their current form. I think there are multiple use-cases one could have:

Find the first/last non-null value. As I mentioned here I'm open to adding an ignore_nulls parameter to first and last, as this is a really common use-case. Currently you have to write pl.col.x.filter(pl.col.x.is_not_null()).first() and similarly for last, which isn't as efficient. Once this exists we can also add an optimization pass for this.
Find the index of the first/last non-null value. I think this should not have a dedicated function for nulls but instead be a function on arbitrary boolean masks as it's much less common and in general we would like to push users away from index-based solutions (as they're not streaming friendly).

So, generalizing to arbitrary boolean masks, we have the same two questions again:

Find the first/last value using an arbitrary boolean mask. Currently this has to be written pl.col.x.filter(cond).first() and pl.col.x.filter(cond).last(). I'm not strongly opposed to Expr.first_where and Expr.last_where, which take a boolean mask of the same length as the expression and select the first/last value, with an optimization pass to turn filter().first()/filter().last() into this.
Find the index of the first/last true value in an arbitrary boolean mask. We already have an efficient invocation of this for the first, namely .index_of(True), but not for the last true value. I would be open to adding a default keyword-only argument index_of(val, *, last=False) which, when specified to be True, gives the index of the last matching element. Then we could add an optimization pass which turns pl.col.x.arg_true().first() and pl.col.x.arg_true().last() into index_of calls.

alexander-beedie · 2025-05-23T06:46:50Z

I think this should not have a dedicated function for nulls but instead be a function on arbitrary boolean masks as it's much less common

Looking at the large amount of Pandas code we are hoping to translate, I can assure you that it is actually quite common, hence needing some equivalent/clear functionality on our (Polars) side.

But yes, having two separate dedicated/specialised methods with long-winded names like this can be improved! Maybe a companion to index_of like index_of_not, adding a new "strategy" param to both that could default to "first" but also allow "last"? Or just a boolean "last" param that defaults to False. Would map conceptually to is_null and is_not_null, but be more generic 🤔
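
To pin down the semantics being discussed, here is a plain-Python reference sketch of an `index_of` / `index_of_not` pair with a boolean `last` flag. These signatures are hypothetical, taken only from the suggestions in this thread; they are not an existing or agreed Polars API and say nothing about how it would be implemented.

```python
from typing import Any, Optional, Sequence


def index_of(values: Sequence[Any], val: Any, *, last: bool = False) -> Optional[int]:
    # Scan from the front by default, or from the back when last=True.
    positions = range(len(values) - 1, -1, -1) if last else range(len(values))
    for i in positions:
        if values[i] == val:
            return i
    return None  # no match


def index_of_not(values: Sequence[Any], val: Any, *, last: bool = False) -> Optional[int]:
    # Same scan, but return the first/last position that does NOT match val.
    positions = range(len(values) - 1, -1, -1) if last else range(len(values))
    for i in positions:
        if values[i] != val:
            return i
    return None  # every element matched val


data = [None, 5, None, 7, None]
first_not_null = index_of_not(data, None)
last_not_null = index_of_not(data, None, last=True)
all_null = index_of_not([None, None], None)
```

With this shape, `index_of_not(values, None)` plays the role of "index of first not-null" and returns None rather than 0 when every value is null.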

Part of the problem for making the API clean/clear is that there is no single token we can pass in that just means "not null" (like... lit(None).not(), hah), and setting up boolean masks can (sometimes) be clunky. Anyway, I'll think about the comments and see what works and looks clean (an "ignore_nulls" param for first/last would be a good related addition - can look at that too) 👍

In the meantime have broken out the performance optimisations into a separate PR so that this one can be more API focused and not hold-up the speedups.


alexander-beedie requested review from MarcoGorelli, c-peters, orlp, reswqa, ritchie46 and wence- as code owners May 22, 2025 10:54

github-actions Bot added the title needs formatting label May 22, 2025

alexander-beedie changed the title ~~feat: add index_of_first_not_null and index_of_last_not_null Expr and Series methods, faster low-level null search~~ feat: Add index_of_first_not_null and index_of_last_not_null Expr and Series methods May 22, 2025

github-actions Bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels May 22, 2025

alexander-beedie added the performance Performance issues or improvements label May 22, 2025

alexander-beedie force-pushed the first-last-not-null branch from af4a6d4 to 4ed6604 Compare May 22, 2025 11:17

alexander-beedie mentioned this pull request May 23, 2025

perf: Optimise low-level null scans and arg_max for bools (when chunked) #22897

Merged

alexander-beedie force-pushed the first-last-not-null branch from 4ed6604 to a3eb176 Compare May 23, 2025 16:41

feat: add index_of_first_not_null and index_of_last_not_null Expr…

8b54ff0

… and Series methods

alexander-beedie force-pushed the first-last-not-null branch from a3eb176 to 8b54ff0 Compare May 23, 2025 17:19

alexander-beedie removed the performance Performance issues or improvements label May 24, 2025

alexander-beedie marked this pull request as draft August 4, 2025 07:58

ritchie46 force-pushed the main branch 3 times, most recently from ddf5907 to d0914d4 Compare September 27, 2025 11:06

ritchie46 force-pushed the main branch 3 times, most recently from 90ceb7b to e9fce55 Compare October 26, 2025 16:01

alexander-beedie wants to merge 1 commit into pola-rs:main from alexander-beedie:first-last-not-null
