Convert to BodoDataFrame/BodoSeries on fallback#855
Conversation
|
|
||
| # Convert objects to Bodo before returning them to the user. | ||
| if FallbackContext.is_top_level(): | ||
| return convert_to_bodo(py_res) |
There was a problem hiding this comment.
My observation was that Pandas methods call a lot of internal functions we do not support (example: xs, copy), so we can keep the DataFrame as Pandas for the internal calls and only convert when returning back to the user.
This is also currently hiding a small bug in a lot of the tests that I haven't figured out yet, but seemed minor to me:
df = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]}, index = [1,2,3])
bdf = bd.from_pandas(df)
bdf1 = bdf.rename_axis("index123")
bdf2 = bdf1.copy()
print("bodo result: ", bdf2.index.name)
pdf1 = df.rename_axis("index123")
pdf2 = pdf1.copy()
print("pandas result: ", pdf2.index.name)Pandas result: "index123", Bodo result: None (index name doesn't propagate in some places)
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #855 +/- ##
==========================================
+ Coverage 66.68% 68.98% +2.30%
==========================================
Files 186 191 +5
Lines 66795 67217 +422
Branches 9507 9531 +24
==========================================
+ Hits 44543 46373 +1830
+ Misses 19572 18021 -1551
- Partials 2680 2823 +143 |
| """Tests that slicing returns the correct value and does not trigger data fetch unnecessarily""" | ||
| lazy_manager, pandas_manager = single_pandas_managers | ||
|
|
||
| if pandas_manager == SingleArrayManager: |
There was a problem hiding this comment.
These seem to be an existing issues that was exposed by this PR now that there is more conversion between pandas and bodo happening. I can open a followup to investigate but in my opinion it is not as big of a priority since BlockManager is the default and ArrayManager will be removed in Pandas 3.0
DrTodd13
left a comment
There was a problem hiding this comment.
Thanks Scott. Looks pretty good.
bodo/pandas/base.py
Outdated
| for c in df.columns: | ||
| if isinstance(df[c], pd.DataFrame): | ||
| raise BodoLibNotImplementedException( | ||
| f"from_pandas(): Duplicate column name: '{c}'." |
There was a problem hiding this comment.
How does df[c] ever become itself a dataframe and why is that labelled as a duplicate column?
There was a problem hiding this comment.
df[c] returns a dataframe with all columns named "c" in the case of duplicates
ehsantn
left a comment
There was a problem hiding this comment.
Thanks @scott-routledge2!
| assert sub.returncode == 0 | ||
|
|
||
|
|
||
| @pytest.mark.skip("TODO: Fix flakey test on CI.") |
There was a problem hiding this comment.
Let's open an issue and put on oncall board not to forget.
bodo/pandas/base.py
Outdated
| ) | ||
| new_columns = [] | ||
| for c in df.columns: | ||
| if isinstance(df[c], pd.DataFrame): |
There was a problem hiding this comment.
Using df.columns.has_duplicates is simpler and more reliable. columns is an Index, which is sort of a set and should have this info internally I think.
Changes included in this PR
Generate BodoSeries and BodoDataFrames after running an unsupported function in Pandas. Also adds extra error checking to from_pandas
Testing strategy
User facing changes
Better error messages in from_pandas. Result of fallback methods returns BodoDataFrames/Series
Checklist
[run CI]in your commit message.