Fixes for TPCH scripts #986

Merged
scott-routledge2 merged 16 commits into main from scott/tpch_init_results
Dec 19, 2025

@scott-routledge2 scott-routledge2 commented Dec 19, 2025

Changes included in this PR

  • Fix typos in Dask queries and use LocalCluster instead of threads
  • Refactor EMR script to run each query as a separate step
  • Add a warmup run to Dask/Spark cluster execution to match Bodo
  • Update env.yml with final package versions
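
As an illustration of the "each query as a separate step" item above, here is a minimal sketch of how per-query EMR step definitions could be built. It is not the PR's actual script: the function name `build_query_steps`, the S3 URIs, and the `--query`/`--data` flags are hypothetical. Only the step dictionary layout (`Name`, `ActionOnFailure`, `HadoopJarStep` with `command-runner.jar`) follows the format that boto3's EMR client accepts.

```python
def build_query_steps(query_nums, script_uri, dataset_uri):
    """Return one EMR step definition per TPC-H query number.

    script_uri and dataset_uri are hypothetical S3 locations, and the
    --query/--data flags belong to an assumed driver script.
    """
    steps = []
    for q in query_nums:
        steps.append(
            {
                "Name": f"tpch-q{q:02d}",
                # CONTINUE lets the remaining steps run even if this query fails.
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        script_uri,
                        "--query", str(q),
                        "--data", dataset_uri,
                    ],
                },
            }
        )
    return steps
```

The resulting list could then be passed as the `Steps` argument of `add_job_flow_steps` on a boto3 EMR client. Keeping one step per query means each query's status and duration show up separately in the cluster's step list.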

Testing strategy

User facing changes

Checklist

  • Pipelines passed before requesting review. To run CI you must include [run CI] in your commit message.
  • I am familiar with the Contributing Guide
  • I have installed + ran pre-commit hooks.


def run_queries(query_nums, dataset_path, scale_factor) -> None:
    total_start = time.time()
    with Client():  # Use default LocalCluster settings

Using the distributed scheduler in local mode (better performance and generally recommended):
https://dask-local.readthedocs.io/en/latest/setup/single-distributed.html
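
To make the comment above concrete, here is a rough sketch of a timing loop that runs under the distributed scheduler in local mode, with an untimed warmup run before each timed run. It is a simplified stand-in for the PR's `run_queries`, not a copy of it: the `queries` mapping, the `use_distributed` and `warmup` parameters, and the fallback branch are assumptions added for illustration.

```python
import time
from contextlib import nullcontext


def run_queries(queries, use_distributed=True, warmup=True):
    """Time each query; `queries` maps a label to a zero-argument callable."""
    if use_distributed:
        # Imported lazily so the fallback path works without dask.distributed.
        from dask.distributed import Client

        # Client() with no arguments starts a LocalCluster on this machine and
        # becomes the default scheduler for Dask computations run inside it.
        scheduler = Client()
    else:
        scheduler = nullcontext()  # leave Dask's default scheduler in place

    timings = {}
    with scheduler:
        for label, query in queries.items():
            if warmup:
                query()  # untimed warmup run
            start = time.perf_counter()
            query()
            timings[label] = time.perf_counter() - start
    return timings
```

When `use_distributed=True`, the call should sit under an `if __name__ == "__main__":` guard in a script, since the local cluster starts worker processes.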

@scott-routledge2 scott-routledge2 marked this pull request as ready for review December 19, 2025 17:34
  | dask | 2025.11.0 |
  | dask-cloudprovider | 2025.9.0 |
- | PySpark | 3.5.2 |
+ | PySpark | 3.5.5 |

This version matches EMR 7.9.0 as well as the conda version used in the single-node/local configuration.

@DrTodd13 DrTodd13 left a comment

thanks!

@scott-routledge2 scott-routledge2 merged commit 352c6f6 into main Dec 19, 2025
15 of 16 checks passed
@scott-routledge2 scott-routledge2 deleted the scott/tpch_init_results branch December 19, 2025 19:14