Skip to content

Speed up TPCH.#856

Merged
DrTodd13 merged 21 commits intomainfrom
todd/tpch_testing
Oct 3, 2025
Merged

Speed up TPCH.#856
DrTodd13 merged 21 commits intomainfrom
todd/tpch_testing

Conversation

@DrTodd13
Copy link
Collaborator

Changes included in this PR

Traverse right-side of cross-product so CTE can find duplicate computations.
Run to_datetime as cfunc over a map instead of Python.

Testing strategy

run_ci

User facing changes

None

Checklist

  • Pipelines passed before requesting review. To run CI you must include [run CI] in your commit message.
  • I am familiar with the Contributing Guide
  • I have installed + ran pre-commit hooks.

@codecov
Copy link

codecov bot commented Sep 30, 2025

Codecov Report

❌ Patch coverage is 30.43478% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.96%. Comparing base (c33fbb5) to head (03ad320).
⚠️ Report is 64 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #856      +/-   ##
==========================================
+ Coverage   66.68%   68.96%   +2.27%     
==========================================
  Files         186      191       +5     
  Lines       66795    67170     +375     
  Branches     9507     9523      +16     
==========================================
+ Hits        44543    46321    +1778     
+ Misses      19572    18027    -1545     
- Partials     2680     2822     +142     

@DrTodd13
Copy link
Collaborator Author

DrTodd13 commented Oct 1, 2025

Is there a better way to flag the tests that have to_datetime in them as now requiring JIT?

Copy link
Contributor

@IsaacWarren IsaacWarren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DrTodd13

Comment on lines 506 to 512
# Declare function to be compiled to run to_datetime over series.
func = "def bodo_to_datetime(x):\n"
# Embed format string as constant in function.
func += f" return pd.to_datetime(x, format='{in_kwargs['format']}')\n"
# Create the function from string.
to_datetime_func = bodo_spawn_exec(func, {"pd": pd}, {}, __name__)
return arg.map(to_datetime_func)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to use a c++ kernel for this maybe using Arrow instead of the compiler? Not saying it needs done now but maybe a comment for a followup

@scott-routledge2
Copy link
Contributor

Is there a better way to flag the tests that have to_datetime in them as now requiring JIT?

Not that I can think of? Maybe we could change the test command to run each test module as a separate session? (I think runtests.py does this. That way would could tell the first function to import JIT in each file at least.

Copy link
Contributor

@scott-routledge2 scott-routledge2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DrTodd13, LGTM.

# Embed format string as constant in function.
func += f" return pd.to_datetime(x, format='{in_kwargs['format']}')\n"
# Create the function from string.
to_datetime_func = bodo_spawn_exec(func, {"pd": pd}, {}, __name__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this needs bodo.jit(func, cache=True) here?

@DrTodd13 DrTodd13 merged commit 2d66959 into main Oct 3, 2025
25 of 26 checks passed
@DrTodd13 DrTodd13 deleted the todd/tpch_testing branch October 3, 2025 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants