Convert to BodoDataFrame/BodoSeries on fallback#855

Merged
scott-routledge2 merged 18 commits into main from scott/make_bodo_on_fallback on Oct 6, 2025
Contributor

@scott-routledge2 scott-routledge2 commented Sep 30, 2025

Changes included in this PR

Return BodoSeries and BodoDataFrame objects after an unsupported function falls back to Pandas. Also adds extra error checking to from_pandas.

Testing strategy

User facing changes

Better error messages in from_pandas. Fallback methods now return BodoDataFrame/BodoSeries objects.

Checklist

  • Pipelines passed before requesting review. To run CI you must include [run CI] in your commit message.
  • I am familiar with the Contributing Guide
  • I have installed + ran pre-commit hooks.


# Convert objects to Bodo before returning them to the user.
if FallbackContext.is_top_level():
    return convert_to_bodo(py_res)
Contributor Author

@scott-routledge2 scott-routledge2 Oct 1, 2025

My observation was that Pandas methods call a lot of internal functions we do not support (example: xs, copy), so we can keep the DataFrame as Pandas for the internal calls and only convert when returning back to the user.

This is also currently hiding a small bug in a lot of the tests that I haven't figured out yet, but seemed minor to me:

import pandas as pd
import bodo.pandas as bd  # assumes Bodo DataFrames is importable as bodo.pandas

df = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]}, index=[1, 2, 3])
bdf = bd.from_pandas(df)

bdf1 = bdf.rename_axis("index123")
bdf2 = bdf1.copy()
print("bodo result: ", bdf2.index.name)

pdf1 = df.rename_axis("index123")
pdf2 = pdf1.copy()
print("pandas result: ", pdf2.index.name)

Pandas result: "index123", Bodo result: None (index name doesn't propagate in some places)
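
For readers following along, the "convert only at the top level" idea from the comment above can be sketched with a nesting counter. This is a minimal sketch, not the PR's implementation: FallbackDepth, convert_stub, and run_fallback are hypothetical names standing in for FallbackContext, convert_to_bodo, and the fallback wrapper, and plain lists stand in for DataFrames so the example runs without Bodo.

```python
class FallbackDepth:
    """Hypothetical stand-in for FallbackContext: counts nested fallback calls."""

    _depth = 0

    def __enter__(self):
        FallbackDepth._depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        FallbackDepth._depth -= 1
        return False  # never swallow exceptions

    @classmethod
    def is_top_level(cls):
        # Depth 1 means only the outermost fallback call is active.
        return cls._depth == 1


def convert_stub(obj):
    """Placeholder for convert_to_bodo: tags the object instead of converting it."""
    return ("converted", obj)


def run_fallback(func, *args):
    """Run func unconverted; convert the result only at the outermost call."""
    with FallbackDepth():
        result = func(*args)
        if FallbackDepth.is_top_level():
            return convert_stub(result)
        return result


def inner(data):
    return list(data)


def outer(data):
    # Nested call: depth is 2 here, so the result is left unconverted.
    tmp = run_fallback(inner, data)
    assert isinstance(tmp, list)
    return tmp


out = run_fallback(outer, [1, 2, 3])
print(out)
```

Under these assumptions, internal helpers keep operating on the unconverted object, and only the value handed back to the caller is wrapped.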


codecov bot commented Oct 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.98%. Comparing base (c33fbb5) to head (8f9e8f2).
⚠️ Report is 66 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #855      +/-   ##
==========================================
+ Coverage   66.68%   68.98%   +2.30%     
==========================================
  Files         186      191       +5     
  Lines       66795    67217     +422     
  Branches     9507     9531      +24     
==========================================
+ Hits        44543    46373    +1830     
+ Misses      19572    18021    -1551     
- Partials     2680     2823     +143     

"""Tests that slicing returns the correct value and does not trigger data fetch unnecessarily"""
lazy_manager, pandas_manager = single_pandas_managers

if pandas_manager == SingleArrayManager:
Contributor Author

This seems to be an existing issue that was exposed by this PR now that more conversion between Pandas and Bodo is happening. I can open a follow-up to investigate, but in my opinion it is not a high priority since BlockManager is the default and ArrayManager will be removed in Pandas 3.0.

Collaborator

@DrTodd13 DrTodd13 left a comment

Thanks Scott. Looks pretty good.

for c in df.columns:
    if isinstance(df[c], pd.DataFrame):
        raise BodoLibNotImplementedException(
            f"from_pandas(): Duplicate column name: '{c}'."
Collaborator

How does df[c] ever become itself a dataframe and why is that labelled as a duplicate column?

Contributor Author

When column names are duplicated, df[c] returns a DataFrame containing every column named c instead of a Series.
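
A small illustration of that Pandas behavior, using a made-up frame with two columns named "A" (the column names and values are only for demonstration):

```python
import pandas as pd

# Two columns share the label "A"; "B" is unique.
df = pd.DataFrame([[1, 2, 3]], columns=["A", "B", "A"])

dup = df["A"]   # selects both "A" columns, so the result is a DataFrame
uniq = df["B"]  # a unique label yields a Series

print(type(dup).__name__, list(dup.columns))
print(type(uniq).__name__)
```

This is why the isinstance(df[c], pd.DataFrame) check in the diff works as a duplicate-name detector.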

@scott-routledge2 scott-routledge2 marked this pull request as ready for review October 3, 2025 19:28
Collaborator

@ehsantn ehsantn left a comment

assert sub.returncode == 0


@pytest.mark.skip("TODO: Fix flakey test on CI.")
Collaborator

Let's open an issue and put it on the on-call board so we don't forget.

)
new_columns = []
for c in df.columns:
    if isinstance(df[c], pd.DataFrame):
Collaborator

Using df.columns.has_duplicates is simpler and more reliable. columns is an Index, which is sort of a set and should have this info internally I think.
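
As a rough sketch of that suggestion (not the code that was merged), the check could use Index.has_duplicates and Index.duplicated, both standard Pandas APIs. The helper name duplicate_column_names is hypothetical.

```python
import pandas as pd


def duplicate_column_names(df):
    """Return the column labels that appear more than once (hypothetical helper)."""
    cols = df.columns
    if not cols.has_duplicates:
        return []
    # duplicated() marks every repeat after the first occurrence of a label.
    return list(cols[cols.duplicated()].unique())


dup_df = pd.DataFrame([[1, 2, 3]], columns=["A", "B", "A"])
ok_df = pd.DataFrame({"A": [1], "B": [2]})

print(duplicate_column_names(dup_df))
print(duplicate_column_names(ok_df))
```

Compared with the per-column isinstance check, this avoids materializing df[c] for every column and still lets the error message name the offending labels.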

@scott-routledge2 scott-routledge2 merged commit 8f2c217 into main Oct 6, 2025
26 checks passed
@scott-routledge2 scott-routledge2 deleted the scott/make_bodo_on_fallback branch October 6, 2025 13:19

3 participants