
Support Pandas 3 #1009

Merged
ehsantn merged 518 commits into main from ehsan/pd_3_rc2
Feb 17, 2026

Conversation

@ehsantn ehsantn (Collaborator) commented Jan 22, 2026

Changes included in this PR

As title. Major changes include:

  • Datetime/timedelta arrays default to microsecond precision instead of nanosecond in Pandas 3. Made the nullable datetime array the default type for Series/DataFrame datetime data to normalize units.
  • Several new string data types cause comparison mismatch issues in testing.
  • The pandas comparison functions are now stricter about different NA sentinels (np.nan vs None), so we had to make many changes in our tests.
  • Setting values of a Pandas object inplace indirectly no longer works (e.g. df[df.A > 3]["B"] = 3).
  • Many API removals and changes.
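The chained-assignment change above can be illustrated with a minimal sketch (hypothetical data, not from this PR): under Pandas 3's Copy-on-Write behavior, `df[df.A > 3]["B"] = 3` writes to an intermediate copy and leaves `df` unchanged, while a single `.loc` call still works.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 5], "B": [0, 0, 0]})

# Chained assignment like df[df.A > 3]["B"] = 3 writes to an intermediate
# copy under Copy-on-Write, so the original DataFrame is not modified.
# Writing through a single .loc call modifies df directly:
df.loc[df.A > 3, "B"] = 3
print(df["B"].tolist())  # [0, 0, 3]
```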

Testing strategy

Existing unit tests.

User facing changes

None.

Checklist

  • Pipelines passed before requesting review. To run CI you must include [run CI] in your commit message.
  • I am familiar with the Contributing Guide
  • I have installed and run pre-commit hooks.

@ehsantn ehsantn changed the title from "Support Pandas 3rc2" to "Support Pandas 3" on Jan 22, 2026

codecov bot commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 65.33333% with 104 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.45%. Comparing base (c33fbb5) to head (668318f).
⚠️ Report is 209 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1009      +/-   ##
==========================================
+ Coverage   66.68%   68.45%   +1.77%     
==========================================
  Files         186      195       +9     
  Lines       66795    68055    +1260     
  Branches     9507     9705     +198     
==========================================
+ Hits        44543    46589    +2046     
+ Misses      19572    18603     -969     
- Partials     2680     2863     +183     

Copilot AI (Contributor) left a comment
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@DrTodd13 DrTodd13 (Collaborator) left a comment

Thanks!

arr_type = bodo.types.boolean_array_type

if arr_type == types.Array(types.NPDatetime("us"), 1, "C"):
# Make sure datetime64 arrays are ns
Collaborator
How is this making sure they are ns?

Collaborator Author

DatetimeArrayType normalizes the unit to nanosecond during unboxing (through Arrow).

arr_type = bodo.types.DatetimeArrayType(None)

# Make sure timedelta64 arrays are ns
if isinstance(arr_type, types.Array) and isinstance(
Collaborator

elif?

Collaborator Author

This is a specific case to timedelta.


# We make all Series data arrays contiguous during unboxing to avoid type errors
# see test_df_query_stringliteral_expr
if isinstance(arr_type, types.Array):
Collaborator

Maybe do one check for isinstance(arr_type, types.Array) and, inside that, the additional checks above with an else for this current category? That would eliminate 3 isinstance checks.

Collaborator Author

Done. Refactored this section to have only one types.Array type check.

normalize=False,
name=None,
closed=None,
inclusive="both",
Collaborator

So, Pandas 3 only once this is merged?

Collaborator Author

Yes; we can't practically keep many versions of different APIs. These minor differences shouldn't matter much anyway.
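
As an example of such an API difference, the `closed=None` parameter shown in the diff was replaced by `inclusive` in pd.date_range. A sketch, assuming pandas ≥ 2.0 (where `closed` has been removed):

```python
import pandas as pd

# `inclusive` controls whether the range endpoints are included,
# replacing the removed `closed` parameter.
both = pd.date_range("2026-01-01", "2026-01-05", inclusive="both")
neither = pd.date_range("2026-01-01", "2026-01-05", inclusive="neither")
print(len(both), len(neither))  # 5 3
```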

not isinstance(on_data_type, types.Array)
or on_data_type.dtype != bodo.types.datetime64ns
):
) and not on_data_type == bodo.types.DatetimeArrayType(None):
Collaborator

!= instead of not ==?

Collaborator Author

Done.

n_int64 = bodo.hiframes.datetime_timedelta_ext.cast_numpy_timedelta_to_int(dt64)
return pd.Timedelta(n_int64)
def convert_numpy_timedelta64_to_pd_timedelta(td64): # pragma: no cover
return td64
Collaborator

Only needs conversion in jitted code?

Collaborator Author

Not called in non-jitted code. This is just a placeholder.

Collaborator

Should throw an exception then?

Collaborator Author

Probably. We have a lot of these in the code base; not worth going through them right now, I think.

func_text += " out_arr[i] = ts." + field + "\n"
else:
func_text += f" out_arr[i] = arr[i].{field}\n"
call_parans = "()" if field == "weekday" else ""
Collaborator

params?

Collaborator Author

Done.

func_text += " min_val = bodo.libs.array_ops.array_op_min(arr)\n"
func_text += " max_val = bodo.libs.array_ops.array_op_max(arr)\n"
if dtype == bodo.types.datetime64ns:
if dtype == bodo.types.datetime64ns or isinstance(
Collaborator

one isinstance with two possible targets?

Collaborator Author

There is a line of code in between.

if isinstance(values, (pa.Array, pa.ChunkedArray)) and (
pa.types.is_string(values.type) or _is_string_view(values.type)
# Bodo change: allow dictionary-encoded string arrays
# or (
Collaborator

Remove dead code?

Collaborator Author

Keeping the disabled code around for documentation and context to help with later upgrades.

@@ -239,6 +246,7 @@ nccl = ">=2.18"
numba = ">=0.60,<0.62.0"
pyarrow = "21.0.*"
libarrow = "21.0.*"
pandas = ">=2.2.0"
Collaborator

What are the implications of 2.2 here?

Contributor

I think it's for compatibility with older pyarrow dependency in the GPU env?

Collaborator Author

This is just moved here in this PR.

@DrTodd13 DrTodd13 requested a review from Copilot February 16, 2026 18:48
Copilot AI (Contributor) left a comment
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@scott-routledge2 scott-routledge2 (Contributor) left a comment

Thanks @ehsantn LGTM!

assert generated_ctes == 1


@pytest.mark.jit_dependency
Contributor

Why does this test require JIT?

Collaborator Author

It has apply in it. I don't know how it worked before.



@pytest.mark.parametrize("filter", ["IS_NULL", "IS_NOT_NULL", "IS_IN"])
# TODO: fix Pandas 3 issues with IS_NULL and IS_NOT_NULL
Contributor

Open a followup issue?

Collaborator Author

Done.

def test_pq_read_types(fname, datapath, memory_leak_check):
def test_impl(fname):
return pd.read_parquet(fname)
return pd.read_parquet(fname, dtype_backend="pyarrow")
Contributor

Do we need to update our docs/examples to reflect parameter changes, like requiring dtype_backend="pyarrow" in read_csv/read_parquet calls?

Collaborator Author

We don't require this parameter. This is just for testing to make sure data types match and we don't run into unnecessary issues.

@@ -1641,10 +1767,12 @@ def _test_equal(
reset_index,
)
elif py_out is pd.NaT:
assert py_out is bodo_out
# TODO: return pd.NaT for pd.to_datetime(None) and pd.to_timedelta(ts_val)
Contributor

Followup issue?

Collaborator Author

Done.


@ehsantn ehsantn merged commit 5768abf into main Feb 17, 2026
35 of 50 checks passed
@ehsantn ehsantn deleted the ehsan/pd_3_rc2 branch February 17, 2026 16:29