feat: Add index_of_first_not_null and index_of_last_not_null Expr and Series methods #22880
alexander-beedie wants to merge 1 commit into pola-rs:main
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #22880   +/-   ##
=======================================
  Coverage   80.66%   80.66%
=======================================
  Files        1672     1672
  Lines      221997   222085    +88
  Branches     2798     2798
=======================================
+ Hits       179065   179151    +86
- Misses      42265    42267     +2
  Partials      667      667
I think these functions are far too specific in their current form. I think there are multiple use-cases one could have:
So, generalizing to arbitrary boolean masks, we have the same two questions again:
Looking at the large amount of Pandas code we are hoping to translate, I can assure you that it is actually quite common, hence the need for some equivalent, clear functionality on our (Polars) side. But yes, having two separate dedicated/specialised methods with long-winded names like this can be improved! Maybe a companion to
Part of the problem for making the API clean/clear is that there is no single token we can pass in that just means "not null" (like...
In the meantime I have broken out the performance optimisations into a separate PR so that this one can be more API-focused and not hold up the speedups.
New Expr/Series Methods
.index_of_first_not_null(): returns the index of the first value that is not null.
.index_of_last_not_null(): returns the index of the last value that is not null.
The first can be constructed from existing expressions as .is_not_null().arg_max(), but that has an edge case (if all values are null it returns 0 instead of None). The second doesn't have an obvious or clean construction (arguably the first isn't that obvious either). With dedicated functionality we can get faster performance, and be clearer and correct in all cases.

🚀 Core Performance Optimisations
Moved to a separate PR (#22897) while we decide on the API here.

🕐 Timings (vs Pandas)
Setup
Create some 10,000,000 element Series with different characteristics...
Note: equivalent Pandas methods are first_valid_index [1] and last_valid_index [2].

Results
You can see that for the first four series we have really good fast-paths, executing in nanoseconds where Pandas takes milliseconds; only the final Series requires any real work, and there we are still ~100x faster on 10,000,000 elements.
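For readers less familiar with the Pandas baseline: first_valid_index and last_valid_index return index labels rather than positions, and return None when there is no valid value. A small sketch of that behaviour follows; the sample data is illustrative only and is not the benchmark data used for the timings.

```python
import pandas as pd

# Default RangeIndex: labels coincide with positions.
s = pd.Series([None, None, 3.0, None, 5.0, None])
first = s.first_valid_index()  # label of the first non-NA value
last = s.last_valid_index()    # label of the last non-NA value

# No valid value at all: Pandas returns None.
all_na = pd.Series([None, None], dtype="float64")
none_found = all_na.first_valid_index()

# With a non-default index the result is a label, not a position.
labelled = pd.Series([None, 1.0, None], index=["a", "b", "c"])
label = labelled.first_valid_index()

print(first, last, none_found, label)  # 2 4 None b
```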
Test machine: Apple Silicon M3 Max.
Results are normalised to nanoseconds; Pandas reported in ms, and Polars in μs and ns.

Unit Tests
Added lots of new test coverage, both for the new functions and for index_of and arg_max, since they have new fast-paths. Includes a parametric test (which I ran for tens of thousands of iterations before committing, just to be on the safe side ;)

Footnotes
[1] pandas.DataFrame.first_valid_index
[2] pandas.DataFrame.last_valid_index