Skip to content

fix: Support nested dtypes from Python in search_sorted()#22633

Draft
itamarst wants to merge 26 commits intopola-rs:mainfrom
itamarst:21100-search_sorted-nesting
Draft

fix: Support nested dtypes from Python in search_sorted()#22633
itamarst wants to merge 26 commits intopola-rs:mainfrom
itamarst:21100-search_sorted-nesting

Conversation

@itamarst
Copy link
Copy Markdown
Contributor

@itamarst itamarst commented May 6, 2025

Fixes #21100

Also:

  • Make search_sorted() from Python work with nested dtypes, keeping in mind some ambiguities in the API.

@github-actions github-actions Bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels May 6, 2025
@itamarst
Copy link
Copy Markdown
Contributor Author

itamarst commented May 6, 2025

Discovered a bug, possibly related to what @coastalwhite wanted fixed in previous iteration of PR:

>>> pl.Series([[1], [2], [3]]).search_sorted([2])
1
>>> pl.Series([[1], [2], [3]]).search_sorted([3])
2
>>> pl.Series([[1], [2], [3]]).search_sorted([2, 3])
2

That last value ought to have returned None, there is no list [2, 3] in the series.

Update: Never mind, forgot this wasn't exact search, it's "where would value go to be inserted in correct sorted order".

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2025

Codecov Report

Attention: Patch coverage is 96.82540% with 2 lines in your changes missing coverage. Please review.

Project coverage is 80.40%. Comparing base (718d5e5) to head (198d54b).

Files with missing lines Patch % Lines
crates/polars-ops/src/series/ops/search_sorted.rs 87.50% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #22633      +/-   ##
==========================================
+ Coverage   80.35%   80.40%   +0.04%     
==========================================
  Files        1682     1682              
  Lines      223237   223289      +52     
  Branches     2804     2804              
==========================================
+ Hits       179389   179532     +143     
+ Misses      43179    43089      -90     
+ Partials      669      668       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@itamarst
Copy link
Copy Markdown
Contributor Author

itamarst commented May 7, 2025

@coastalwhite I've ported the code, and now looking at addressing your comments.

  • You brought up pl.Series([1, 2, 3]).search_sorted([2, 3]) as a problem, but actually that is fine as far as signature goes. Signature says list[NonNestedLiteral] -> Series, and that's what happens.
  • pl.Series([[1], [2], [3]).search_sorted([2, 3]) does actually have the wrong signature, since it returns an integer index.

The obvious solution to get consistent signatures is to always interpret lists as literals. But that is not backwards compatible. pl.Series([1, 2, 3]).search_sorted([2, 3]) will stop returning a pl.Series and instead raise an error.

Some options

Option 1. What's in the initial PR

  1. Existing supported types are backwards compatible in that they take a list of values as if it were a pl.Series i.e. multi-search, and return a pl.Series.
  2. Newly supported types (list and array) only accept pl.Series for multi-search, and a list is interpreted as a pl.List or pl.Array.

In Polars 2, where backwards compatibility could be dropped, one would then change the function so that lists are always interpreted as literals, and you always have to pass pl.Series for multi-search for all dtypes.

The problem with this is inconsistent type signatures. I do not see any way to fix this.

The benefit is fully backwards compatible, and nicer usage for pl.List/pl.Array dtypes.

This may be best demonstrated with some examples:

# For non-nested types, both lists and Series result in Series (multi-search):
>>> pl.Series([17, 5]).search_sorted([17, 2, 5])
shape: (3,)
Series: '' [u32]
[
        2
        0
        0
]
>>> pl.Series([17, 5]).search_sorted(pl.Series([17, 2, 5]))
shape: (3,)
Series: '' [u32]
[
        2
        0
        0
]
>>> pl.Series([17, 5]).search_sorted(5)
0
# For a nested type, a list results in an int, but a Series results in a Series (multi-search):
>>> pl.Series([[17], [5]]).search_sorted([5])
0
>>> pl.Series([[17], [5]]).search_sorted(pl.Series([5]))
shape: (1,)
Series: '' [u32]
[
        0
]

Option 2. Lists mean multi-search for pl.List and pl.Array dtypes

In this case search_sorted([1, 2]) always means multi-search. This is very annoying from a Python user perspective when using pl.List/pl.Array dtypes, since it's not DWIM.

Option 3. ???

I get the impression you were suggesting something else in #21266 (review) but I don't understand what, unfortunately.

@coastalwhite Can you maybe provide a Python example of both the signature, and what inputs/outputs you expect for primitive vs nested types given a list and given a Series?

@itamarst itamarst marked this pull request as ready for review May 7, 2025 15:38
@coastalwhite
Copy link
Copy Markdown
Collaborator

I think this should get a similar treatment to .is_in in such a way that we deprecate the usage of pl.Series([1, 2, 3]).search_sorted([2, 3]) while also supporting nested types properly. For .is_in() this was done by changing the signature and adding an implicit .implode() in type coercion if the levels of nesting mismatched between the two arguments.

For search_sorted, we should not change the signature, but we should add an implicit .explode() in type coercion if the righthand side is 1 nesting level higher than the left side and throw a deprecation warning. Look at

let left_nl = type_left.nesting_level();
for the implicit .implode() on .is_in().

If you do this and remove the list_as_series=True, pl.Series([1, 2, 3]).search_sorted([2, 3]) will still work but throw a deprecation warning. pl.Series([1, 2, 3]).search_sorted(pl.Series([2, 3])) will still work as before and you will be able to pass pl.Series([[1], [2, 3]]).search_sorted([2, 3]) correctly.

@itamarst
Copy link
Copy Markdown
Contributor Author

itamarst commented May 8, 2025

So I was already doing list_as_series=False. Adding a deprecation could be as easy as adding a Python deprecation on the Python side for the explicit casting of list to Series that the code is currently doing. Whereas there's a lot of Rust code in type_coercion/is_in.rs.

Is there a reason to do it on the Rust side?

@coastalwhite
Copy link
Copy Markdown
Collaborator

So I was already doing list_as_series=False. Adding a deprecation could be as easy as adding a Python deprecation on the Python side for the explicit casting of list to Series that the code is currently doing. Whereas there's a lot of Rust code in type_coercion/is_in.rs.

Is there a reason to do it on the Rust side?

You cannot do this on the Python side as you do not know the type on the python side. You cannot do this Python trick for arbitrary nested datatypes and not for arbitrary expressions.

@itamarst
Copy link
Copy Markdown
Contributor Author

itamarst commented May 8, 2025

OK, Rust it is.

@itamarst
Copy link
Copy Markdown
Contributor Author

itamarst commented May 8, 2025

To what extent could the existing code in type_coercion/is_in.rs be replaced by using CastingRules::FirstArgLossless? Not the nested level checks, but the last part, where it has a bunch of casting logic? or does hitting this code path preclude reusing the rest of the casting logic?

@coastalwhite
Copy link
Copy Markdown
Collaborator

To what extent could the existing code in type_coercion/is_in.rs be replaced by using CastingRules::FirstArgLossless? Not the nested level checks, but the last part, where it has a bunch of casting logic? or does hitting this code path preclude reusing the rest of the casting logic?

Well, it is similar but not the same (is_in casts to the list type not the direct type). I looked at merging the two when I wrote the .implode() thing, but I ultimately decided it would be more complexity than benefit.

@itamarst itamarst marked this pull request as draft May 13, 2025 15:26
@itamarst
Copy link
Copy Markdown
Contributor Author

Next I will be merging this forward and addressing the review comments.

@itamarst itamarst marked this pull request as ready for review May 28, 2025 18:51
@itamarst
Copy link
Copy Markdown
Contributor Author

@coastalwhite ready for review again, hopefully.

@itamarst itamarst requested a review from coastalwhite May 28, 2025 18:57
@itamarst
Copy link
Copy Markdown
Contributor Author

@coastalwhite merged forward, hopefully ready for review again.

@itamarst
Copy link
Copy Markdown
Contributor Author

Can't figure out how to get rid of the "changes requested" bit in the UI 😞

@itamarst itamarst marked this pull request as draft July 7, 2025 19:27
@itamarst
Copy link
Copy Markdown
Contributor Author

itamarst commented Jul 7, 2025

Can simplify this once #23458 is merged, so turning back into draft. Also might want to get categorical working while I'm at it.

@itamarst
Copy link
Copy Markdown
Contributor Author

itamarst commented Jul 8, 2025

And after further thought I think switching search_sorted to FirstArgLossLess is a mistake, so will remove that part.

@ritchie46 ritchie46 force-pushed the main branch 3 times, most recently from ddf5907 to d0914d4 Compare September 27, 2025 11:06
@ritchie46 ritchie46 force-pushed the main branch 3 times, most recently from 90ceb7b to e9fce55 Compare October 26, 2025 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fix Bug fix python Related to Python Polars rust Related to Rust Polars

Projects

None yet

Development

Successfully merging this pull request may close these issues.

index_of() with Series of length 1 and dtype Array doesn't work

3 participants