Refactor Guides Docs #882

Merged
scott-routledge2 merged 21 commits into main from scott/refactor_docs_guides
Oct 20, 2025

@scott-routledge2 (Contributor) commented Oct 16, 2025

Changes included in this PR

Adds Bodo DataFrames Guide to docs/guides page and notebook to examples/#Tutorials

Refactors guides to be in docs/guides folders. Moves all JIT guides to a subdirectory of guides, fixes links etc.

The important changes are in docs/docs/guides/dataframes/dataframes_intro.md and mkdocs.yaml for the updated side bar.

Testing strategy

Ran notebook in pixi env

User facing changes

Docs

Checklist

  • Pipelines passed before requesting review. To run CI you must include [run CI] in your commit message.
  • I am familiar with the Contributing Guide
  • I have installed and run pre-commit hooks.

@DrTodd13 (Collaborator)

I wonder if we should mention the ability to execute an expensive plan resulting in little data that we know will be reused later?

bodo.pandas {#bodopandas}
===========

Bodo.pandas is an optimized and distributed dataframe library that is a
Contributor Author

I think this page was unreachable. Is there any info here we want to add to another page/section?

Contributor Author

The #'s in the link were leading to the wrong page.

@scott-routledge2 (Contributor Author)

> I wonder if we should mention the ability to execute an expensive plan resulting in little data that we know will be reused later?

Is this referring to CTE? I think that would be nice to have. Do you have a good example?

@scott-routledge2 marked this pull request as ready for review October 17, 2025 18:11
@DrTodd13 (Collaborator)

> > I wonder if we should mention the ability to execute an expensive plan resulting in little data that we know will be reused later?

> Is this referring to CTE? I think that would be nice to have. Do you have a good example?

We can detect and optimize CTEs within a single query. The problem is repeated computation across queries, where our CTE optimization cannot help. That is the reason persist exists in Dask. So, something like this:

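import bodo.pandas as bd

df = bd.read_parquet("bigdata.parquet")
# "other.parquet" is a placeholder for the second table being joined.
other = bd.read_parquet("other.parquet")

# Expensive transformation (e.g. multi-join + groupby)
expensive = df.merge(other, on="key").groupby("category").agg({"value": "sum"})

# Run the plan once now so that later queries reuse the result.
expensive.execute_plan()

# Without execute_plan above, this would trigger execution of expensive.
print(expensive + 7)

# Without execute_plan above, this would also trigger a second execution of expensive.
# some_other_computation is a placeholder for any downstream function.
result = some_other_computation(expensive[expensive["value"] > 1000])
print(result)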
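
The effect of materializing once can be sketched in plain Python, without Bodo installed. This is only a toy model of the idea described above, not how Bodo implements execute_plan: the `LazyPlan` class, its `collect` method, the `executions` counter, and `expensive_sum` are all hypothetical names introduced for illustration.

```python
class LazyPlan:
    """Toy stand-in for a lazily evaluated plan (hypothetical, not Bodo's code)."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument callable producing the result
        self._cached = None
        self._materialized = False
        self.executions = 0          # how many times the plan actually ran

    def _run(self):
        self.executions += 1
        return self._compute()

    def execute_plan(self):
        """Run the plan once and keep the result for later consumers."""
        if not self._materialized:
            self._cached = self._run()
            self._materialized = True
        return self

    def collect(self):
        """Return the result, re-running the plan unless it was materialized."""
        if self._materialized:
            return self._cached
        return self._run()


def expensive_sum():
    # Placeholder for a costly multi-join + groupby.
    return sum(range(1_000))


# Two consumers without materialization: the plan runs twice.
plan = LazyPlan(expensive_sum)
plan.collect()
plan.collect()
runs_without = plan.executions

# Materialize first: both consumers reuse the stored result.
plan = LazyPlan(expensive_sum)
plan.execute_plan()
first = plan.collect() + 7
second = plan.collect() > 1000
runs_with = plan.executions
```

In this sketch the counter ends at 2 without materialization and at 1 with it, which is the saving the comment above is pointing at when the same expensive result feeds several later queries.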

@ehsantn (Collaborator) left a comment

Thanks @scott-routledge2! This is a big improvement. See minor comments below.

@DrTodd13 (Collaborator) left a comment

Thanks for the detailed work. Looks good.


`pd.read_parquet` and `pd.read_iceberg` are lazy APIs, meaning that no actual data is read until needed in a subsequent operation.

You can also create BodoDataFrames from a Pandas DataFrame using the `from_pandas` function, which is useful when working with third party libraries that return Pandas DataFrames.
Collaborator

I tend to do bodo.pandas.DataFrame(pandas_dataframe) for conversion. Do we want to mention this option?

Contributor Author

I thought bodo.pandas.DataFrame wasn't stable, but if it is just a wrapper around from_pandas, then maybe we should. @ehsantn, thoughts?

Contributor Author

It might be simpler to just use bodo.pandas.DataFrame in some of the examples that construct dataframes, so I'll include it.

Collaborator

It just calls from_pandas so whichever looks better to you should be fine.
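
For context on the guide text quoted at the top of this thread, the difference between a lazy read and wrapping data that is already in memory can be sketched in plain Python. This is a toy illustration under stated assumptions, not Bodo's API: `LazySource`, `from_rows`, `load_table`, and `read_calls` are hypothetical names, and `from_rows` is only analogous in spirit to from_pandas.

```python
class LazySource:
    """Toy lazy reader: records a loader and calls it only on first access."""

    def __init__(self, loader):
        self._loader = loader   # zero-argument callable that returns rows
        self._rows = None
        self.loaded = False

    def rows(self):
        # The read happens here, the first time data is actually needed.
        if not self.loaded:
            self._rows = self._loader()
            self.loaded = True
        return self._rows


def from_rows(rows):
    """Wrap rows that are already in memory (hypothetical helper)."""
    return LazySource(lambda: list(rows))


read_calls = []


def load_table():
    # Placeholder for an expensive file read.
    read_calls.append("read")
    return [{"key": 1, "value": 10}, {"key": 2, "value": 2000}]


table = LazySource(load_table)        # nothing is read yet
calls_before = len(read_calls)
large = [r for r in table.rows() if r["value"] > 1000]   # triggers the read
calls_after = len(read_calls)

in_memory = from_rows([{"key": 3, "value": 5}])
```

Creating `table` does not call the loader; the read happens only when the filter asks for rows, which mirrors the "no actual data is read until needed" wording in the guide.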

@scott-routledge2 merged commit dd2dcbb into main Oct 20, 2025
13 checks passed
@scott-routledge2 deleted the scott/refactor_docs_guides branch October 20, 2025 21:26